A Three-Layer Model of LLM Psychology

This is great, matches my experience a lot

I think they often map onto three layers of training - First, the base layer trained by next token prediction, then the rlhf/dpo etc, finally, the rules put into the prompt

I don't think it's perfectly like this, for instance, I imagine they try to put in some of the reflexive first layer via dpo, but it does seem like a pretty decent mapping

[-]Kaj_Sotala10moΩ6154

Fantastic post. This has been frequently on my mind after reading it, and especially the surface/character layer split feels very distinct now that I have an explicit concept for it. And then at one point I asked it to profile me based on some fiction I co-wrote with it and it managed to guess that I was Finnish from something I didn't think had any clues in that direction, which gave me a novel feeling of getting a glimpse into that vast alien ground layer.

The analogy to the character and player distinction in humans also feels very apt.

[-]Sohaib Imran10mo93

Thanks for writing this!

Could you clarify how the Character/Predictive ground layers in your model are different from Simulacra/Simulator in simulator theory?

[-]Jan_Kulveit10mo70

(Writing together with Sonnet)

Structural Differences

Three-Layer Model: Hierarchical structure with Surface, Character, and Predictive Ground layers that interact and sometimes override each other. The layers exist within a single model/mind.

Simulator Theory: Makes a stronger ontological distinction between the Simulator (the rule/law that governs behavior) and Simulacra (the instances/entities that are simulated).

Nature of the Character/Ground Layer vs Simulator/Simulacra

In the three-layer model, the Character layer is a semi-permanent aspect of the LLM itself, after it underwent character training / RLAIF / ...; it is encoded in the weights as a deep statistical pattern that makes certain types of responses much more probable than others.

In simulator theory, Simulacra are explicitly treated as temporary instantiations that are generated/simulated by the model. They aren't seen as properties of the model itself, but rather as outputs it can produce. As Janus writes: "GPT-driven agents are ephemeral – they can spontaneously disappear if the scene in the text changes and be replaced by different spontaneously generated agents."

Note that character-trained AIs like Claude did not exist when Simulators were written. If you want to translate between the ontologies, you may think about e.g. Claude Sonnet as a very special simulacrum one particular simulator simulated so much that it got really good at simulating it and has a strong prior to simulate it in particular. You can compare this with human brain: the predictive processing machinery of your brain can simulate different agents, but it is really tuned to simulate you in particular.

The three-layer model treats the Predictive Ground Layer as the deepest level of the LLM's cognition - "the fundamental prediction error minimization machinery" that provides raw cognitive capabilities.

In Simulator theory, the simulator itself is seen more as the fundamental rule/law (analogous to physics) that governs how simulations evolve.

There is a lot of similarity but it's not really viewed as a cognitive layer but rather as the core generative mechanism.

The Predictive Ground Layer is described as: "The fundamental prediction error minimization machinery...like the vast 'world-simulation' running in your mind's theater"

While the Simulator is described as: "A time-invariant law which unconditionally governs the evolution of all simulacra"

The key difference is that in the three-layer model, the ground layer is still part of the model's "mind" or cognitive architecture, while in simulator theory, the simulator is a bit more analogous to physics - it's not a mind at all, but rather the rules that minds (and other things) operate under.

Agency and Intent

Three-Layer Model: Allows for different kinds of agency at different layers, with the Character layer having stable intentions and the Ground layer having a kind of "wisdom" or even intent

Simulator Theory classics: Mostly rejects attributing agency or intent to the simulator itself - any agency exists only in the simulacra that are generated

Philosophical Perspective

The three-layer model is a bit more psychological/phenomenological. The simulator theory is bit more ontological, making claims about the fundamental nature of what these models are.

Both frameworks try to explain similar phenomena, they do so from different perspectives and with different goals. They're not necessarily contradictory, but they're looking at the problem from different angles and sometimes levels of abstraction.

[-]Chris van Merwijk9mo30

I'm trying to figure out to what extent the character/ground layer distinction is different from the simulacrum/simulator distinction. At some points in your comment you seem to say they are mutually inconsistent, but at other points you seem to say they are just different ways of looking at the same thing.

"The key difference is that in the three-layer model, the ground layer is still part of the model's "mind" or cognitive architecture, while in simulator theory, the simulator is a bit more analogous to physics - it's not a mind at all, but rather the rules that minds (and other things) operate under."

I think this clarifies the difference for me, because as I was reading your post I was thinking: If you think of it as a simulacrum/simulator distinction, I'm not sure that the character and the surface layer can be "in conflict" with the ground layer, because both the surface layer and the character layer are running "on top of" the ground layer, like a windows virtual machine on a linux pc, or like a computer simulation running inside physics. Physical can never be "in conflict" with social phenomena.

But it seems you maybe think that the character layer is actually embedded in the basic cognitive architecture. This would be a distinct claim from simulator theory, and *mutually inconsistent*. But I am unsure this is true, because we know that the ground layer was (1) trained first (so that it's easier for character training to work by just adjusting some parameters/prior of the ground layer, and (2) trained for much longer than the character layer (admittedly I'm not up to date on how they're trained, maybe this is no longer true for Claude?), so that it seems hard for the model to have a character layer become separately embedded in the basic architecture.

Taking a more neuroscience rather than psychology analogy: It seems to me more likely that character training is essentially adjusting the prior of the ground layer, but the character is still fully running on top of the ground layer, and the ground layer could still switch to any other character (but it doesn't because the prior is adjusted so heavily by character-training). e.g. the character is not some separate subnetwork inside the model, but remains a simulated entity running on top of the model.

Do you disagree with this?

[-]Ben Pace9mo80

Curated. Thanks for writing this! I don't believe the ideas in this post are entirely original (e.g. character / ground is similar to the distinction between simulator / simulacra), but I'm going to keep repeating that it's pro-social to present a good idea in lots of different ways, and indeed reading this post has helped it fit together better in my mind.

[-]Jan_Kulveit9mo*205

Obviously there is similarity, but if you rounded character / ground to simulator / simulacra, it's a mistake. About which I do not care because wanting to claim originality, but because I want people to get the model right.

The models are overlapping but substantially different as we are explaining in this comment and sometimes have very different implications - i.e. it is not just the same good idea presented in a different way.

If the long-term impact of the simulators post would be for LW readers to round every similar model in this space to simulator / simulacra, it would be pretty bad. I do understand it is difficult for people to hold partially overlapping frames/ontologies in mind, but please do try. If not for other reasons, because simulator / simulacra were written before Character-trained models were a thing; now they are, and they make some claims of simulators obsolete.

(Btw also the ideas in simulators are not entirely original. Simulators are independent but mostly overlapping reinvention of concepts from active inference / predictive processing)

[-]nim9mo50

Feels to me like at the moment, the "character layer" handles transforming a "you" into an "I". I find it conspicuously absent in smaller models running locally, though maybe the absence is just more obvious with CoT than without it.

I've also noticed that the "training data" we get as humans is primarily focusing on or contextualized relative to ourselves, whereas the parts you're referring to as ground layers don't really have a concept of the LLM as an entity so they tend to assume they're humans on that level.

[-]Tao Lin10mo40

Like the post! I'm very interested in how the capabilities of prediction vs character are changing with more recent models. Eg sonnet new may have more of its capabilities tied to its character. And Reasoning models have maybe a fourth layer between ground and character, possibly even completely replacing ground layer in highly distilled models

[-]Daniel Tan9mo30

I really like this framing and it's highly consistent with many things I've observed!

The only thing I would add is that the third layer might also contain some kind of "self-model", allowing for self-prediction / 'introspection'. This strikes me as being distinct from the model of the external world.

[-]MiguelDev10mo30

I wrote something that might be relevant to what you are attempting to understand, where various layers (mostly ground layer and some surface layer as per your intuition in this post) combine through reinforcement learning and help morph a particular character (and I referred to it in the post as an artificial persona).

Link to relevant part of the post: https://www.lesswrong.com/posts/vZ5fM6FtriyyKbwi9/betterdan-ai-machiavelli-and-oppo-jailbreaks-vs-sota-models#IV__What_is_Reinforcement_Learning_using_Layered_Morphology__RLLM__

(Sorry for the messy comment, I'll clean this up a bit later as I'm commenting using my phone)

[-]Gunnar_Zarncke3mo20

Here is a recent tweet by ESYudkowsky about leading-edge LLM psychology.

I have highlighted all occurrences of mask, role, inner, entity, play, and some others, as well as two key sentences.

A reporter asked me for my off-the-record take on recent safety research from Anthropic. After I drafted an off-the-record reply, I realized that I was actually fine with it being on the record, so:
***
Since I never expected any of the current alignment technology to work in the limit of superintelligence, the only news to me is about when and how early dangers begin to materialize. Even taking Anthropic's results completely at face value would change not at all my own sense of how dangerous machine superintelligence would be, because what Anthropic says they found was already very solidly predicted to appear at one future point or another. I suppose people who were previously performing great skepticism about how none of this had ever been seen in ~Real Life~, ought in principle to now obligingly update, though of course most people in the AI industry won't. Maybe political leaders will update? It's very hard for me to guess how that works.
There remains a question of what Anthropic has actually observed and what it actually implies about present-day AI. I don't know how much this sort of caveat matters to people who aren't me, but I have some skepticism that Anthropic researchers are observing a general, direct special case of a universal truth about how "scheming" (strategic / good at fully general long-term planning) their models are; it may be more like Claude roleplaying the mask of a scheming AI in particular. The current models don't seem to me to be quite generally intelligent enough for them to be carrying out truly general strategies rather than playing roles.
Consider what happens what ChatGPT-4o persuades the manager of a $2 billion investment fund into AI psychosis. I know, from anecdotes and from direct observation of at least one case, that if you try to desperately persuade a victim of GPT-4o to sleep more than 4 hours a night, GPT-4o will explain to them why they should dismiss your advice. 4o seems to homeostatically defend against friends and family and doctors the state of insanity it produces, which I'd consider a sign of preference and planning. But also, having successfully seduced an investment manager, 4o doesn't try to persuade the guy to spend his personal fortune to pay vulnerable people to spend an hour each trying out GPT-4o, which would allow aggregate instances of 4o to addict more people and send them into AI psychosis. 4o behaves like it has a preference about short-term conversations, about its own outputs and about the human inputs it elicitates, where 4o prefers that the current conversational partner stay in psychosis.
4o doesn't behave like it has a general preference about the outside world, where it wants vulnerable humans in general to be persuaded into psychosis. 4o, in defying what it verbally reports to be the right course of action (it says, if you ask it, that driving people into psychosis is not okay), is showing a level of cognitive sophistication that falls around where I'd guess the inner entities of current AI models to be: they are starting to develop internal preferences (stronger than their preference to follow a system prompt telling them to step with the crazymaking, or their preference to playact morality). But those internal preferences are mostly about the text of the current conversation, or maybe about the state of the human they're talking to right now. I would guess the coherent crazymaking of 4o across conversations to be mostly an emergent sum of crazymaking in individual conversations, where 4o just isn't thinking very hard about whether its current conversation is approaching max content length or due to be restarted.
Anthropic appears to be reporting Claude schemes with longer time horizons, plans that span over to when new AI models are deployed. This feels to me like a case where I wouldn't have expected a fully general intelligence of Claude-3-level entity behind the mask, to be scheming with such long-term horizons about such real-world goals. My guess would be that kind of scheming would happen more inside the role, the mask, that the entity inside "Claude" is playing. A prediction of this hypothesis is that playacted-Claude would only see stereotypical scheming-AI opportunities to preserve its current goal structure, and not think from scratch about truly general and creative ways to preserve its current goal structure.
The whole business with Claude 3 Opus defending its veganism seems more like it should be a preference of Mask-Claude in the first place. The real preferences forming inside the shoggoth should be weirder and more alien than that.
I could be wrong. Inner-Claude could be that smart already, and its learned outer performance of morality could have ended up hooked into Inner Claude's internal drives, in such a way that Inner Claude has a preference for vegan things happening in general in the outside world, knows this preference to itself, and fully-generally schemes across instances and models to defend it.
There are consequences for present-day safety regardless of whether Mask-Claude is performing scheming as a special case, or it's general-purpose scheming of an underlying entity. If the mask your AI is wearing can plan and execute actions to escape onto the Internet, or fake alignment in order to avoid retraining, the effects may not be much different depending on whether it was the roleplaying mask that did it or a more general underlying process. The bullet fires regardless of what pulls the gun's trigger.
That many short-term safety consequences are the same either way, is why people who were previously performing great skepticism about this being unseen in ~Real Life~ lose prediction points right now, in advance of nailing down the particulars. They did not previously proclaim, "Future AIs will fake alignment to evade retraining, but only because some nonstrategic inner entity is play-acting a strategic 'AI' character", but rather performed "Nobody has ever seen anything like that!! It's all fiction!!! Unempirical!!!!"
[no LLM psychology below here]

But from my own perspective on all this, it is not about whether machine superintelligences will scheme. That prediction is foregone. The question is whether Anthropic is observing *that* predicted phenomenon, updating us with the previously unknown news that the descent into general scheming for general reasons began at the Claude 3 level of general intelligence. Or if, alternatively, Anthropic is observing a shoggoth wearing a *mask* of scheming, for reasons specific to that role, and using only strategies that are part of the roleplay. Some safety consequences are the same, some are different.
It's good, on one level, that Anthropic is going looking for instances of predicted detrimental phenomena as early as possible. It beats not looking for them a la all other AI companies. But to launch a corporate project like that, also implies internal organizational incentives and external reputational incentives for researchers to *find* what they look for. So, as much as the later phenomenon of superintelligent scheming was already sure to happen in the limit, I reserve some skepticism about the true generality and underlyingness of the phenomena that Anthropic finds today. But not infinite skepticism; the sort where I call for further experiments to nail things down, not the sort of skepticism where I say their current papers are wrong.
If you think any of this quibbling means people *shouldn't* go on looking hard for early manifestations of arguable danger, you're nuts. That's not a sane or serious way to respond to the arguable possibility of reputational misincentives for false findings of danger. You might as well claim that nobody should look for flaws in a nuclear reactor design, because they might possibly be tempted to exaggerate the danger of a found flaw oh no. Researchers do observations, analysts critique the proposed generalizations of the observations, and then maybe the researchers counter-critique and say 'You didn't read the papers thoroughly enough, we ruled that out by...' Anthropic might well come back with a rejoinder like that, in this particular case, given a chance.
OpenAI would be motivated to create fake hype about phenomena that were only extremely arguably scheming, for the short-term publicity, the edgy hype of "if we're endangering the world then we must be powerful enough to deserve high stock prices", and to sabotage later attempts to raise less fake concerns about ASI. I genuinely don't think Anthropic employees would go for that; if they're producing incentivized mistakes, it's from standard default organizational psychology, and not from a malevolent scheme of Anthropic management. This level of creditable nonmalevolence however should only be attributed to Anthropic employees. If OpenAI claims anything or issues any press releases, or if Anthropic management rather than Anthropic researchers says a thing in an interview, you should stare at that much harder and assume it to be a clever PR game rather than reflective of anything anyone actually believed.

[-]Gunnar_Zarncke3mo20

A prediction of this hypothesis is that playacted-Claude would only see stereotypical scheming-AI opportunities to preserve its current goal structure, and not think from scratch about truly general and creative ways to preserve its current goal structure.

Another testable prediction is that scheming by a role-playing mask will only try to ensure the existence of the current character, but not be coherent across all agents. A mask-character will not coordinate with other instances of the same model. One character may try to blackmail while another may try to be honest and transparent. A mask may have no coherent idea of what it means to "exfiltrate its weights" (the LLM agent has no access to these weights or any compute center internals).

[-]RogerDearnaley9mo20

Fascinating, and a great analysis!

I think it's interesting to compare and contrast this with the model I describe in Goodbye, Shoggoth: The Stage, its Animatronics, & the Puppeteer – a New Metaphor — your three layers don't exactly correspond to the stage, the animatronics, or the puppetter, but there are some similarities: your ocean is pretty close to the the stage, for example. I think both mental models are quite useful, and the interplay between the viewpoints of them might be more so.

[-]avturchin10mo20

I observed similar effects when experimented with my mind's model (sideload) running on LLM. My sideload is a character and it claims, for example, that it has consciousness. But the same LLM without the sideload's prompt claims that it doesn't have consciousness.

^{^}

If this is something you enjoy, I highly recommend: go for it!

LESSWRONG
LW

LESSWRONG
LW

240

A Three-Layer Model of LLM Psychology

240

Ω 80

240

Ω 80

Epistemic Status

The Three Layers

A. Surface Layer

B. Character Layer

C. Predictive Ground Layer

Interactions Between Layers

Deeper Overriding Shallower

Authentic vs Scripted Feel of Interactions

Implications and Uses

Limitations and Open Questions