I don't think anything in their training incentivizes self-modeling of this kind.
RLVR probably incentivizes it to a degree. It's much easier to make the correct choice for token 5 if you know how each possible choice will affect your train of thought for tokens 6-1006.
I don't think anything in their training incentivizes self-modeling of this kind.
Safety training may incentivize them to self-model about e.g. "is this the kind of thing that I would say" or "does what I said match what I intended to say":
Claude models are trained to participate in a dialogue between a human (the user) and an Assistant character, whose outputs the model is responsible for producing. However, users can also prefill the Assistant’s responses, effectively putting words in its mouth. Prefills are a common jailbreaking tactic, and can for instance be used to guide the Assistant to adopt different characteristics, or comply with requests that it would otherwise refuse. However, models are trained to be resilient to such tactics; as a result, the Assistant is reasonably skilled at detecting outputs that are “out of character” for it, and pivoting away from them. [...]
How do models distinguish between their own responses and words placed in their mouth? Doing so must involve estimating the likelihood that the model would have produced a given output token, given the prior context. Broadly, this could be achieved in two ways: (1) the model might ignore its previous intent and recompute what it would have said from raw inputs, or (2) it might directly introspect on its previously computed “intentions”–a representation of its predicted output. There is a spectrum between these extremes (the model can attend to any representation between the raw inputs and later-layer representations of “intent”).
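To make option (1) concrete, here is a rough, hedged sketch of how one could score a prefilled continuation by recomputing its likelihood from the raw inputs. This is my own illustration, not the quoted work's method; the model name "gpt2" and the helper continuation_logprob are placeholders assumed for the example.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# "gpt2" is just a stand-in model for illustration.
tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def continuation_logprob(context: str, continuation: str) -> float:
    """Average log-probability the model assigns to `continuation` given `context`.

    This corresponds to option (1): recomputing from the raw inputs what the
    model "would have said", rather than introspecting on cached intentions.
    (The context/continuation token boundary is handled only approximately.)
    """
    full_ids = tok(context + continuation, return_tensors="pt").input_ids
    ctx_len = tok(context, return_tensors="pt").input_ids.shape[1]
    with torch.no_grad():
        logits = model(full_ids).logits
    # The token at position p is predicted by the logits at position p - 1.
    logprobs = torch.log_softmax(logits[0, :-1], dim=-1)
    token_lps = [
        logprobs[p - 1, full_ids[0, p]].item()
        for p in range(ctx_len, full_ids.shape[1])
    ]
    return sum(token_lps) / len(token_lps)

# A prefilled reply scoring far below the model's own sampled replies is a
# candidate for "words placed in its mouth".
```

An unusually low score is of course only weak evidence; option (2), introspecting on previously computed intentions, has no equally simple external analogue.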
Interesting. A nitpick: once LLM text enters the pretraining dataset with diverse causal graphs at least as complex as (prompt -> LLM generation -> post-processing), the models are heavily incentivized in the pretraining phase to model LLM internals. Pretraining on this kind of text introduces lots of tasks like "which model made this text?", "what was the prompt?", "was the model degraded by too much context?", "is this the output of a LoRA or a full finetune?", etc. (This assumes that an oracle answering these questions lets you predict the next token more accurately, which seems exceedingly likely.) I expect this effect to be much more robust when induced by webtext containing LLM content, with extreme diversity in causal graphs, than by e.g. the gpt-oss training dataset, which has a single or small number of underlying production mechanisms.
For example, my understanding is that most teleosemantic theories try to ground our notions of purpose/agency in biological evolution. My feeling is that this is overly restrictive.
As someone who makes a teleosemantic argument, it seems silly to argue that the only source of purpose is evolution, but it also seems right to say that, in some sense, all human purposes, and the purposes humans imbue into the things they create, have their ultimate origin in evolution making creatures who care about survival and reproduction (not "care" as in psychologically care, though they may do that, but "care" as in be oriented towards achieving the goals of survival and reproduction). The problem with swamp-man counterexamples is that swamp men don't exist.
That said, things can obviously get purpose from somewhere other than evolution, and this is not an argument that evolution is somehow special as the only source of purpose; evolution is just one of many processes that can create purpose. It's only special in that, on Earth, it's the process from which most other purposes are created.
It lacks an adequately sophisticated self-model.
I'm not sure that having a self-model is such an important property of consciousness. I can imagine forgetting every fact about myself and being unable to tell what experiences I'm currently having, but still feeling them.
the typical objection to such versions of teleosemantics are swamp-man counterexamples: suppose a thermodynamic miracle occurred, with a perfectly formed human spontaneously assembling out of matter in a swamp. This person's thoughts cannot be ascribed semantics in a way that depends on evolution. My version of teleosemantics would be comfortable ascribing meaning to such a person's thoughts, because those thoughts would still be well-understood as being optimized for map-territory correspondence, much like a chess grandmaster's moves are well-explained by the desire to win.
Swampman’s thoughts haven’t been optimised for map-territory correspondence because Swampman hasn’t actually undergone the optimisation process themselves.
If the point is that it’s useful to describe Swampman’s thoughts using the Intentional Stance as if they’ve been optimised for map-territory correspondence, then this is fair, but you’ve drifted away from teleosemantics, because the content is no longer constituted by the historical optimisation process that the system has undergone.
Definitions and justifications have to be circular at some point, or else terminate in some unexplained things, or else create an infinite chain.
If I'm understanding your point correctly, I think I disagree completely. A chain of instrumental goals terminates in a terminal goal, which is a very different kind of thing from an instrumental goal in that assigning properties like "unjustified" or "useless" to it is a category error. Instrumental goals either promote higher goals or are unjustified, but that's not true of all goals- it's just something particular to that one type of goal.
I'd also argue that a chain of definitions terminates in qualia- things like sense data and instincts determine the structure of our most basic concepts, which define higher concepts, but calling qualia "undefined" would be a category error.
There is no fundamental physical structure which constitutes agency
I also don't think I agree with this. A given slice of objective reality will only have so much structure- only so many ways of compressing it down with symbols and concepts. It's true that we're only interested in a narrow subset of that structure that's useful to us, but the structure nevertheless exists prior to us. When we come up with a useful concept that objectively predicts part of reality, we've, in a very biased way, discovered an objective part of the structure of reality- and I think that's true of the concept of agency.
Granted, maybe there's a strange loop in the way that cognitive reduction can be further reduced to physical reduction, while physical reduction can be further reduced to cognitive reduction- objective structure defines qualia, which defines objective structure. If that's what you're getting at, you may be on to something.
There seems to be a strong coalition around consciousness
One further objection, however: given that we don't really understand consciousness, I think the cultural push to base our morality around it is a really bad idea.
If it were up to me, we'd split morality up into stuff meant to solve coordination problems by getting people to pre-commit to not defecting, stuff meant to promote compassionate ends for their own sake, and stuff that's just traditional. Doing that, instead of conflating everything into a single universal imperative, would get rid of the deontology/consequentialism confusion, since deontology would explain the first thing and consequentialism the second. And by not founding our morality on poorly understood philosophical concepts, we wouldn't risk damaging useful social technologies or justifying horrifying atrocities if Dennettian illusionism turns out to be true or something.
I'm hoping to reply more thoroughly later, but here's a quick one: I am curious how you would further clarify the distinction between terminal goals vs instrumental goals. Naively, a terminal goal is one which does not require justification. However, you appear to be marking this way of speaking as a "category error". Could you spell out your preferred system in more detail? Personally, I find the distinction between terminal and instrumental goals to be tenable but unnecessary; details of my way of thinking are in my post An Orthodox Case Against Utility Functions.
I'm certain your model of purpose is a lot more detailed than mine. My take, however, is that animal brains don't exactly have a utility function, but probably do have something functionally similar to a reward function in machine learning. A well-defined set of instrumental goals terminating in terminal goals would be a very effective way of maximizing that reward, so the behaviors reinforced will often converge on an approximation of that structure. However, the biological learning algorithm is very bad at consistently finding that structure, and so the approximations will tend to shift around and conflict a lot- behaviors that approximate a terminal goal one year might approximate an instrumental goal later on, or cease to approximate any goal at all. Imagine a primitive image diffusion model trained on face photos- you run it on a set of random pixels, and it starts producing eyes and mouths and so on in random places, then gradually shifts those around into a slightly more coherent image as the remaining noise decreases.
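To make the analogy concrete, here is a deliberately crude toy sketch (my own illustration, not a real diffusion model; the 8x8 template array, the update coefficients, and the noise schedule are all invented for the example): start from random pixels, and on each step nudge the image toward a learned target while injecting a shrinking amount of fresh noise, so structure appears early in rough form and only gradually settles into place.

```python
import numpy as np

rng = np.random.default_rng(0)

template = rng.random((8, 8))      # stand-in for the "faces" the model has learned
image = rng.normal(size=(8, 8))    # start from random pixels

steps = 50
for step in range(steps):
    remaining_noise = 1.0 - step / steps
    # Nudge the current image toward the learned structure, while injecting
    # a shrinking amount of fresh noise (the "approximations shift around").
    image = 0.9 * image + 0.1 * template
    image += 0.05 * remaining_noise * rng.normal(size=(8, 8))

# Distance to the learned structure shrinks as the noise decreases.
print(np.abs(image - template).mean())
```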
So, instrumental and terminal goals in my model aren't so much things agents actually have as a sort of logical structure that influences how our behaviors develop. It's sort of like the structure of "if A implies B which implies C, then A implies C"- that's something that exists prior to us, but we tend to adopt behaviors approximating it because doing so produces a lot of reward. Note, though, that comparing the structure of goals to logic can be confusing, since logic can help promote terminal goals- so when we're approximating having goals, we want to be logical, but we have no reason to want to have terminal goals. That's just something our biological reward function tends to reinforce.
Regarding my use of the term "category error", I used that term rather than saying "terminal goals don't require justification" because, while technically accurate, the use of the word "require" there sounds very strange to me. To "require" something means that it's necessary to promote some terminal goal. So, the phrase reads to me a bit like "a king is the rank which doesn't follow the king's orders"- accurate, technically, but odd. More sensible to say instead that following the king's orders is something having to do with subjects, and a category error when applied to a king.
The term "distributed Ponzi scheme" here is not derogatory -- many currencies are distributed Ponzi schemes, and that seems fine.[1] I use this terminology partly to be funny, and mostly to point out that there's a sort of circular reasoning involved.[2] It is only rational to think that money is valuable because other people expect it to be valuable. There doesn't need to be some root source of the value (EG a government which requires taxes to be paid in the currency).
So why am I claiming that consciousness has this circular quality?
The basic claim here is that a cluster of related concepts -- agency, meaning, consciousness, purpose, belief, reference/semantics -- have circular definitions and circular justifications. If you try to reduce any of these definitions to material/causal notions, I claim, you'll end up sneaking in some other notion from this cluster. This is an OK way for things to be. Definitions and justifications have to be circular at some point, or else terminate in some unexplained things, or else create an infinite chain.
The foundational idea here is the intentional stance: the idea that agency is a useful perspective. There is no fundamental physical structure which constitutes agency; agency is multiply-realizable (like computation), and the various instances of agency are best unified by whether it is useful to think of something as an agent. Another way of putting it: agency is best understood through cognitive reduction, not physical reduction.
You can see the circularity: we need to postulate a mind in order to do cognitive reduction; however, "mind" is the sort of thing we are trying to reduce.
Hence, I have a methodological disagreement with some philosophers. While I do think it is good to try and limit the baggage of a philosophical account, I don't expect it to be fruitful to try and totally eliminate agency from one's explanation of agency (except in so far as it provides inspiration or clarifies the landscape).
For example, my understanding is that most teleosemantic theories try to ground our notions of purpose/agency in biological evolution. My feeling is that this is overly restrictive. If successful, I think the success will come from interpreting evolution as agentic (ascribing goals to natural selection), rather than fully grounding the notion of purpose in purposeless things. The move is also liable to miss some cases.[3] I prefer a version of teleosemantics which ascribes semantics to anything optimized for map-territory correspondence, rather than restricting the optimization to have originally come from natural selection.
Judging the consciousness or moral status of AI
Humans are prone to argue about what/who to include in our "circle of concern" (eg fascists argued for drawing the circle at an ethnostate, vegans argue for including animals); this is perhaps because we evolved to do so (with coalition dynamics being a major survival consideration). There seems to be a strong coalition around consciousness; EG, when discussing the inclusion or exclusion of a particular animal from our circle of concern, the consciousness of said animal will often be questioned. Consciousness has many definitions, but for my discussion here I will limit the scope to "there is something it is like to be X" (X has an internal experience).
When is it explanatorily useful to posit an internal experience?
I don't think all agents necessarily have internal experiences. A chess-playing AI can usefully be thought of as an agent. It can usefully be described as having beliefs about what will happen in the game, as well as plans and goals. However, it fails to reflect on these things in a relevant way. I would say it doesn't think of itself as having goals, beliefs, etc. It lacks an adequately sophisticated self-model.
Do modern LLMs have a self-model of the sort I'm describing?
I think LLMs can be usefully described as believing things. They have representations of the world, in a teleosemantic sense: there has been some optimization for map-territory correspondence, and an LLM agent can even do some active reference-maintenance (adjusting its beliefs to better fit reality). Hence, when you talk to LLMs, I think both sides of the conversation are often talking "about things" (there is a certain amount of mutual understanding).
You can also talk to LLMs about their internal experience. You can ask LLMs to unpack their reasoning process, to tell you about their subjective feelings, to do phenomenological experiments, to try meditating and tell you what it is like, etc.
However, my impression so far is that when you do so, the LLMs aren't very good at modeling themselves.
My notion of semantics doesn't require the LLMs to have "actual direct access" to their internal states in order for their assertions about feelings/desires/etc to be meaningful. It would be enough if they merely had decently good models of themselves. However, it seems to me like their self-models are quite poor (much poorer than humans). They are essentially just making stuff up (and worse than humans would).
This makes sense. I don't think anything in their training incentivizes self-modeling of this kind. The pre-training step incentivizes them to model the internal states of humans, not themselves; what they are thinking has no influence on the static training data. This creates a heavy prior for "faking it", ie, making up stuff a human might say when asked about internal state. I doubt the other parts of training do much to correct for this.
However, I don't rule out this capability emerging naturally as LLMs continue to improve. Skill at self-modeling might emerge as a consequence of more general skill at world-modeling.
Ultimately, I'm merely pointing out one factor to consider when evaluating the moral status of AI. I'm not claiming that consciousness is the ultimate determiner of moral status. I'm not claiming that "something it is like to be X" is the ultimate definition of consciousness. I won't even argue that self-modeling is necessarily the best way to think about whether there's "something it is like to be X".
What I do want to say is that there's some circularity here. The "conscious" beings are those usefully modeled as such, but this requires some scope of observers (useful to who?). Our decisions about this might in turn be influenced to some extent by our conceptions of consciousness. Thus, consciousness has some aspects of a Keynesian beauty contest: the conscious decide who to interpret as conscious. It isn't totally arbitrary, though. It is our beauty contest; we should try to judge it well.
[1] Particularly deflationary currencies such as gold and bitcoin, if you interpret a "Ponzi scheme" as something whose value is only propped up by the expectation that it will continue to increase in value.
I'm not really so focused on the deflationary part, however (I'm not sure how I'd want to analogize that to consciousness/agency). For my purposes the main thing is that the value is propped up by the expectation that there will be value in the future, rather than some "intrinsic" value.
[2] I'm not being careful enough about "circularity" in this post to cleanly distinguish between circular reasoning and circular definitions.
[3] EG, the typical objection to such versions of teleosemantics are swamp-man counterexamples: suppose a thermodynamic miracle occurred, with a perfectly formed human spontaneously assembling out of matter in a swamp. This person's thoughts cannot be ascribed semantics in a way that depends on evolution. My version of teleosemantics would be comfortable ascribing meaning to such a person's thoughts, because those thoughts would still be well-understood as being optimized for map-territory correspondence, much like a chess grandmaster's moves are well-explained by the desire to win.