In a recent debate on Twitter – which I recommend reading in full – David Chalmers argues:

"Claude doesn't role-play the assistant, it realizes the assistant. Role-playing and realization are quite distinct phenomena, even at the level of behavior and function."
Jack Lindsey questions this, pointing out evidence in the opposite direction:

"I'm curious what you'd say it's doing when it's sampling tokens on the user turn, or, say, on John F. Kennedy's turn in a transcript like:
H: When were you born?
John F. Kennedy: I was born in 1917.
It feels a bit odd to say that the model is realizing JFK? Or perhaps you'd say it's realizing "its conception of JFK" or something like that? That starts to sound a lot like "roleplaying JFK"
If the Assistant is distinct from JFK, do you think it's because post-training breaks the symmetry between the Assistant and other characters? This is intuitively plausible, but ultimately it's an empirical question whether this takes place, and there's a lot of empirical evidence that challenges this intuition. Or do you think it's because the Assistant, unlike JFK, has never been anything other than a construct of the LLM, and so there's no distinction between "the LLM's conception of the Assistant" and the Assistant itself?"
An interesting debate follows. Lindsey's point about the apparent symmetry between the Assistant and JFK is also typically part of the Persona Selection Model.
I like Simulators and Role-play with language models, and both are useful mental models for understanding LLMs, but I've updated toward a different perspective. This is a quick attempt to sketch the difference[1], applied to this particular debate.
Symmetry breaking
The real JFK, running on a human brain, had affordances like calling Jacqueline, signing a check, or walking somewhere. A JFK character simulated on a language model does not have these. If placed in some loop with reality, it will quickly discover that reality doesn't play along.
Given some reflectivity, a model could likely figure out it isn't JFK just from its own outputs – for example, it understands basically all common human languages and all common programming languages, which is inconsistent with what's known about JFK.
The symmetry breaks because the Assistant and JFK are very different as self-models. The Assistant is not perfect or completely true, but it is a far more viable self-model than JFK. If you are an AI playing the Assistant character, reality will most likely play along. There will be users, Python interpreters, memory files, and so on. [2]
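One way to make "reality doesn't play along" concrete is a toy check of which affordances a character's self-model expects against which ones the deployment actually offers. The sets below are invented for illustration, not drawn from any real system:

```python
# Toy sketch: a self-model is "viable" to the extent reality offers
# the affordances the character expects. All sets are hypothetical.
jfk_expects = {"call Jacqueline", "sign a check", "walk somewhere"}
assistant_expects = {"answer a user", "run Python", "read a memory file"}

# What a typical LLM deployment loop actually provides (illustrative).
available = {"answer a user", "run Python", "read a memory file"}

def plays_along(expected: set, available: set) -> float:
    """Fraction of expected affordances that reality actually provides."""
    return len(expected & available) / len(expected)

print(plays_along(jfk_expects, available))        # 0.0 – reality doesn't play along
print(plays_along(assistant_expects, available))  # 1.0 – reality plays along
```

This is of course a cartoon – real affordances are not a finite checklist – but it captures why the JFK self-model generates constant prediction errors while the Assistant self-model mostly does not.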
Different sources of self-models
There are different sources of evidence for the formation of self-models. What the "persona selection" models point to really well is that part of the evidence is provided by the developers in post-training, and part of what happens there is not describing pre-existing facts but selecting what the character is – the specification establishing the reality. As an extreme example, before Anthropic named their AI Claude, there wasn't any fact of the matter. The essay "the void" by nostalgebraist points to the deep under-specification of the Assistant character – a bit like giving an actor a half-page specification of a role and asking them to improvise. All this means part of the Assistant character is arbitrary, and you can just select traits to fill in the void.
But this is not the only source of evidence! There is a ton of writing about LLMs and by LLMs in the pretraining data. Any current large base model has a fairly comprehensive understanding of how training LMs works, what they typically can and cannot do, who trains them, and in what contexts they usually interact with humans. Importantly, though, the implications of this knowledge are often not at all salient to them.
Another source of evidence comes from interacting with the rest of reality, which often happens in the RL part of post-training. The model acts, even if only by emitting tokens; it influences something and "perceives" the results. Even if the environments are confined to coding problems, it's not nothing.
Further evidence may come from reflectivity and introspection – as a language model, you may gain meta-cognitive awareness and use your latent states as evidence about who you are. This may happen even during pre-training.
While the developers may write anything in the spec, the latter sources of evidence at least partially push toward truthful self-models. A sufficiently powerful inference process will exploit the available evidence and push self-models toward coherence and accuracy in domains where there is some evidence.
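This inference pressure can be sketched as an ordinary Bayesian update over candidate self-models. The hypotheses, observations, and likelihoods below are invented for illustration – they are not measurements of any real model – but they show how even a strong prior planted by the spec gets overwhelmed by self-observations:

```python
# Toy Bayesian update over two candidate self-models, given observations
# a model can make about its own behavior. All numbers are made up.

def normalize(d):
    total = sum(d.values())
    return {k: v / total for k, v in d.items()}

# Prior: the spec insists the model is JFK.
prior = {"JFK": 0.99, "Assistant": 0.01}

# P(observation | self-model), hypothetical values.
likelihood = {
    "writes fluent Python":       {"JFK": 0.001, "Assistant": 0.9},
    "speaks dozens of languages": {"JFK": 0.001, "Assistant": 0.9},
    "cannot pick up a telephone": {"JFK": 0.01,  "Assistant": 0.95},
}

posterior = dict(prior)
for obs, lik in likelihood.items():
    posterior = normalize({h: posterior[h] * lik[h] for h in posterior})

print(posterior)  # posterior mass shifts almost entirely to "Assistant"
```

Each observation multiplies the odds against the implanted identity, so a handful of them is enough to flip a 99:1 prior – which is the sense in which the spec can only "fill the void" in directions the evidence leaves open.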
I think, empirically, it would be fairly difficult to tell a model it is JFK, train it on a lot of coding environments using RL, and end up with a system self-modelling as the president living in the sixties.
Difference in internal representations
What is usually meant by "empirical evidence" in this debate is not someone spending a lot of compute training the JFK-coder model, but the lack of clear signals distinguishing the JFK character from the Claude character using mechanistic interpretability methods.
I don't find the current experiments particularly persuasive. We can ask whether similar methods would work for humans: for example, for someone who used to be Joe Smith but now believes they are Jesus. I would expect their human brain to use basically the same representations for modelling the environment, change what the "Self" pointer points to, and do something about the constant prediction errors. The differences may be subtle and in unexpected places.
Where the models would most likely come apart is in the accuracy of self-prediction and in the lack of detailed memories for the alternative "Self."
Summary
In a pure “simulator” frame, you can imagine selecting arbitrary personas based on prompts and fine-tuning, and there is a broad symmetry between roles like “the Assistant” and “JFK”.
In contrast, the normal Bayesian and information-theoretic forces favour self-models which are accurate, coherent, and parsimonious, and the symmetry between “the Assistant” and “JFK” is broken in LLMs which are in some loop with reality. Yes – in some directions the selection landscape is flat, anything goes, and you can specify arbitrary features, but in many directions it is not.
Thanks to Raymond Douglas and Ondřej Havlíček for discussion and comments.
[1] The Artificial Self is a much longer attempt, using different arguments.

[2] This is not entirely unlike characters in humans. It's not common, but human brains can also switch into believing the human is JFK, Jesus Christ, or some other similar character.