Interesting. A nitpick: once text enters the pretraining dataset with a causal graph at least as complex as (prompt -> LLM generation -> post-processing), models are heavily incentivized during pretraining to model LLM internals. Pretraining on this kind of text introduces lots of implicit tasks: “which model made this text?”, “what was the prompt?”, “was the model degraded by too much context?”, “is this the output of a LoRA or a full finetune?”, etc.
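To make the incentive concrete, here is a minimal sketch (the page contents, model name, and provenance footer are all hypothetical, not drawn from any real corpus) of how a scraped page that carries both an LLM generation and traces of its provenance turns plain next-token prediction into exactly these tasks:

```python
# Hypothetical example: a web page that embeds an LLM generation plus provenance.
scraped_page = {
    "prompt": "Summarize the causes of the French Revolution in one paragraph.",
    "model": "model-A",  # hypothetical model name
    "generation": "The French Revolution grew out of fiscal crisis, food shortages, ...",
    "post_processing": "truncated by the site's CMS to fit a preview widget",
}

# Flattened into one pretraining document, with provenance appearing *after* the
# generation (footers, captions, and comment threads often have this shape).
document = (
    scraped_page["generation"]
    + "\n\n[Generated with " + scraped_page["model"]
    + ' from the prompt: "' + scraped_page["prompt"] + '"; '
    + scraped_page["post_processing"] + ".]"
)

# To predict the bracketed suffix token by token, a model has to infer "which model
# made this text?", "what was the prompt?", and "how was it post-processed?" from
# the generation alone -- i.e., it is pushed to model LLM behavior and internals.
print(document)
```

Nothing hinges on this particular formatting; any corpus where downstream traces of the generation process are predictable from the generation itself creates the same gradient.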
I don't think anything in their training incentivizes self-modeling of this kind.
RLVR (RL with verifiable rewards) probably incentivizes it to a degree. It's much easier to make the correct choice for token 5 if you know how each possible choice will affect your train of thought for tokens 6-1006.
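As a sketch of the credit-assignment structure this points at (standard REINFORCE-style notation, not a claim about any particular RLVR implementation): with prompt x, sampled completion \tau = (a_1, \dots, a_T), and a verifiable reward R(\tau) on the whole completion,

\[
\nabla_\theta J(\theta) \;=\; \mathbb{E}_{\tau \sim \pi_\theta(\cdot \mid x)}\!\left[ R(\tau) \sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_t \mid a_{<t}, x) \right]
\]

Because R(\tau) depends on the entire chain of thought, the update for an early token (say t = 5) is weighted by how the model's own continuation for t = 6, \dots, T turns out, so doing well rewards anticipating that continuation -- i.e., some degree of self-modeling.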
The term "distributed Ponzi scheme" here is not derogatory -- many currencies are distributed Ponzi schemes, and that seems fine.[1] I use this terminology partly to be funny, and mostly to point out that there's a sort of circular reasoning involved.[2] It is rational to think that money is valuable only because other people expect it to be valuable. There doesn't need to be some root source of the value (EG a government which requires taxes to be paid in the currency).
So why am I claiming that consciousness has this circular quality?
The basic claim here is that a cluster of related concepts -- agency, meaning, consciousness, purpose, belief, reference/semantics -- have circular definitions and circular justifications. If you try to reduce any of these definitions to material/causal notions, I claim, you'll end up sneaking in some other notion from this cluster. This is an OK way for things to be. Definitions and justifications have to be circular at some point, or else terminate in some unexplained things, or else create an infinite chain.
The foundational idea here is the intentional stance: the idea that agency is a useful perspective. There is no fundamental physical structure which constitutes agency; agency is multiply-realizable (like computation), and the various instances of agency are best unified by whether it is useful to think of something as an agent. Another way of putting it: agency is best understood through cognitive reduction, not physical reduction.
You can see the circularity: we need to postulate a mind in order to do cognitive reduction; however, "mind" is the sort of thing we are trying to reduce.
Hence, I have a methodological disagreement with some philosophers. While I do think it is good to try and limit the baggage of a philosophical account, I don't expect it to be fruitful to try and totally eliminate agency from one's explanation of agency (except in so far as it provides inspiration or clarifies the landscape).
For example, my understanding is that most teleosemantic theories try to ground our notions of purpose/agency in biological evolution. My feeling is that this is overly restrictive. If successful, I think the success will come from interpreting evolution as agentic (ascribing goals to natural selection), rather than fully grounding the notion of purpose in purposeless things. The move is also liable to miss some cases.[3] I prefer a version of teleosemantics which ascribes semantics to anything optimized for map-territory correspondence, rather than restricting the optimization to have originally come from natural selection.
Humans are prone to argue about what/who to include in our "circle of concern" (eg fascists argued for drawing the circle at an ethnostate, vegans argue for including animals); this is perhaps because we evolved to do so (with coalition dynamics being a major survival consideration). There seems to be a strong coalition formed around consciousness; EG, when discussing the inclusion or exclusion of a particular animal from our circle of concern, it is often that animal's consciousness which gets questioned. Consciousness has many definitions, but for my discussion here I will limit the scope to "there is something it is like to be X" (X has an internal experience).
When is it explanatorily useful to posit an internal experience?
I don't think all agents necessarily have internal experiences. A chess-playing AI can usefully be thought of as an agent. It can usefully be described as having beliefs about what will happen in the game, as well as plans and goals. However, it fails to reflect on these things in a relevant way. I would say it doesn't think of itself as having goals, beliefs, etc. It lacks an adequately sophisticated self-model.
Do modern LLMs have a self-model of the sort I'm describing?
I think LLMs can be usefully described as believing things. They have representations of the world, in a teleosemantic sense: there has been some optimization for map-territory correspondence, and an LLM agent can even do some active reference-maintenance (adjusting its beliefs to better fit reality). Hence, when you talk to LLMs, I think both sides of the conversation are often talking "about things" (there is a certain amount of mutual understanding).
You can also talk to LLMs about their internal experience. You can ask LLMs to unpack their reasoning process, to tell you about their subjective feelings, to do phenomenological experiments, to try meditating and tell you what it is like, etc.
However, my impression so far is that when you do so, the LLMs aren't very good at modeling themselves.
My notion of semantics doesn't require the LLMs to have "actual direct access" to their internal states in order for their assertions about feelings/desires/etc to be meaningful. It would be enough if they merely had decently good models of themselves. However, it seems to me like their self-models are quite poor (much poorer than humans'). They are essentially just making stuff up (and doing so worse than humans would).
This makes sense. I don't think anything in their training incentivizes self-modeling of this kind. The pre-training step incentivizes them to model the internal states of humans, not themselves; what they are thinking has no influence on the static training data. This creates a heavy prior for "faking it", ie, making up stuff a human might say when asked about internal state. I doubt the other parts of training do much to correct for this.
However, I don't rule out this capability emerging naturally as LLMs continue to improve. Skill at self-modeling might emerge as a consequence of more general skill at world-modeling.
Ultimately, I'm merely pointing out one factor to consider when evaluating the moral status of AI. I'm not claiming that consciousness is the ultimate determiner of moral status. I'm not claiming that "something it is like to be X" is the ultimate definition of consciousness. I won't even argue that self-modeling is necessarily the best way to think about whether there's "something it is like to be X".
What I do want to say is that there's some circularity here. The "conscious" beings are those usefully modeled as such, but this requires some scope of observers (useful to whom?). Our decisions about this might in turn be influenced to some extent by our conceptions of consciousness. Thus, consciousness has some aspects of a Keynesian beauty contest: the conscious decide whom to interpret as conscious. It isn't totally arbitrary, though. It is our beauty contest; we should try to judge it well.
[1] Particularly deflationary currencies such as gold and bitcoin, if you interpret a "Ponzi scheme" as something whose value is only propped up by the expectation that it will continue to increase in value.
I'm not really so focused on the deflationary part, however (I'm not sure how I'd want to analogize that to consciousness/agency). For my purposes the main thing is that the value is propped up by the expectation that there will be value in the future, rather than by some "intrinsic" value.
[2] I'm not being careful enough with "circularity" in this post to cleanly distinguish between circular reasoning and circular definitions.
[3] EG, the typical objection to such versions of teleosemantics is the swamp-man counterexample: suppose a thermodynamic miracle occurred, with a perfectly formed human spontaneously assembling out of matter in a swamp. This person's thoughts cannot be ascribed semantics in a way that depends on evolution. My version of teleosemantics would be comfortable ascribing meaning to such a person's thoughts, because those thoughts would still be well-understood as being optimized for map-territory correspondence, much like a chess grandmaster's moves are well-explained by the desire to win.