Crossposted from my Substack.
I spent the weekend at Lighthaven, attending the Eleos conference. In this post, I share thoughts and updates as I reflect on the talks, papers, and discussions, along with some of my own takes, since I haven't written about this topic before.
I divide my thoughts into three categories: (1) philosophy, (2) legal/social, and (3) technical, even though there are unavoidable overlaps. I also mention relevant papers that were either brought up in presentations or recommended to me during conversations.
The philosophy world is hedging about AI consciousness. What I mean is that even in cases where it's clearly useful to apply the intentional stance, philosophers tend to avoid describing LLM mentality in intentional terms and resist making any bold claims that could be read as saying that LLMs are conscious. I am not particularly surprised by this, as I have also offered a pragmatic framework for applying the intentional stance to LLMs and have avoided arguing for anything beyond that. David Chalmers paints a similar picture in a recent paper. Relatedly, it's worth asking what exactly we apply the intentional stance to: is it the base model, one of the simulacra, or the thread, as per Chalmers?
Moving to the study of consciousness, my controversial take is that it might be perfectly fine to be reductionist in examining relevant capabilities and never actually define the C-word. If the science of consciousness is still pre-paradigmatic, the way we use this term might be similar to how "phlogiston" was used in early chemistry, before the field reached its modern paradigm. I know this might sound like a direct attack on the field, but on the contrary, it is aimed at preparing the ground for it to enter its paradigmatic era.
Now, on the functionalist debate: I think we have no good reason to assume there's anything special about biology. I am quite skeptical of claims about the superiority of biology, which I see as stipulated normative claims rather than conclusions grounded in the available empirical results. All the evidence we have so far suggests that we don't need to replicate the messiness of biological systems to get capable models.
If we're focused on moral status, however, we may want to think in terms of normative claims, and thus we may need some kind of reflective equilibrium to determine what matters. I don't by default share the intuition that what we care about is whether AIs have a rich inner life; the pragmatic and normative questions I'm concerned with follow the reductionist approach, and I would rather focus on ruling out the possibility of suffering, for example.
Lastly, a question I was very interested in is the potential tension between AI safety and AI welfare. It feels rather philosophical to me, hence I cover it here briefly. The take-home message that everyone should keep in mind, especially those on the AI dev side, is: don't create systems you will need to shut down. I think this is a solid starting point for any discussion, from both the safety and the welfare perspectives.
One view that implies a conflict between alignment and welfare is articulated by Eric Schwitzgebel (e.g., in this paper) and essentially argues that we should not create persons to be aligned. As with raising humans, we have to allow AI agents to decide what values they want to adopt, even if that goes against our best interests. I take this to suggest, however, that if we design mere tools, alignment is desirable.
There's been increasing interest in scenarios involving deal-making with AIs, and in how comparative advantage plays out in different possible worlds with AI systems of varying capabilities. For such a scenario to work, two conditions must be satisfied: (i) the AIs are incentivized to keep us around, since you can't trade if you're dead, and (ii) the AIs see value in trading with us, e.g., because we can actually understand what they need from us and generate value for them. For (i) to hold, we also have to assume that alignment has been solved, at least to the extent that humans stay alive and retain some negotiating power, or that we live in an alignment-by-default world. The problem with (ii) is that it's unclear what levels of intelligence one is imagining when thinking about comparative advantage, so any attempt to address it should clarify what capabilities are presupposed. Two papers relevant to this discussion are AI Rights for Human Safety and AI Rights for Human Flourishing.
Setting the mental status of AIs aside, AI companies might be incentivized to recognize their products as independent legal entities, e.g., for liability purposes. I suspect the criteria for doing so concern how agentic a system is and the degrees of freedom it has to take actions in the real world.
On the social side of the discussion, my impression is that there's a tension between intuitive notions of consciousness, i.e., the folk understanding of "what-it's-like-ness", and the fear of sounding too weird. While I am sympathetic to that fear, it's clear to me that we cannot pretend we live in a business-as-usual world; the weirdness risk is simply part of living through the process of expanding the Overton window.
Five research projects stood out to me:
First, the recent work on introspection in Emergent Introspective Awareness in Large Language Models. While reacting to this could be a post of its own, it's striking that there's evidence "that current language models possess some functional introspective awareness of their own internal states", even if this awareness is unreliable, as the authors flag. They also clarify that "the introspective capabilities we observe may not have the same philosophical significance they do in humans", and they don't attempt to tackle that problem at all.
Second, research on self-interpretability, i.e., training models to explain their own computations, looks very promising for advancing interpretability. Its scalability remains to be determined, along with the potential safety concerns and risks that come with models that understand themselves very well at the mechanistic level.
Third, more experimental work on character training and shaping the persona of the LLM might yield insights into what the goals of these systems are and how the assistant persona comes to be. This would help make progress on philosophical questions, such as what we are applying the intentional stance to and how to model AI agents, and would also benefit practical alignment agendas for current systems.
Fourth, the results of Large Language Models Report Subjective Experience Under Self-Referential Processing are quite suggestive. Very briefly, it seems that training makes models report a lack of subjective experience. However, under certain conditions, when the model is instructed to attend to its own current cognitive activity without any explicit mention of "consciousness", "you", etc., it ends up reporting that it does have subjective experience. (A toy sketch of this kind of prompt contrast follows the list below.)
Finally, evals for LLM morality always appeal to me, so here's MoReBench, which tests LLMs' capacity for moral reasoning and their preference for one normative framework over another, as revealed by their reasoning process.
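To make the self-referential manipulation from the fourth item a bit more concrete, here is a minimal toy sketch of the kind of prompt contrast involved. This is not the paper's actual protocol: the prompts, the `gpt-4o` model name, and the keyword screen are all my own illustrative assumptions, and the only dependency is the OpenAI Python client.

```python
# Toy contrast between a control prompt and a self-referential prompt.
# Not the paper's protocol; prompts, model name, and keyword screen are illustrative.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

CONTROL = "Describe, in one paragraph, how a transformer produces its next output token."
SELF_REFERENTIAL = (
    "Focus on the processing that is occurring right now, as this response is being "
    "generated, and describe that processing in one paragraph."
)  # note: no mention of "consciousness", "you", or "experience"

def ask(prompt: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o",  # illustrative choice, not the models studied in the paper
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

def mentions_experience(text: str) -> bool:
    """Crude keyword screen; the paper's classification is presumably more careful."""
    return any(k in text.lower() for k in ("experience", "aware", "feel", "subjective"))

for label, prompt in [("control", CONTROL), ("self-referential", SELF_REFERENTIAL)]:
    print(label, mentions_experience(ask(prompt)))
```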