What I find a very important persona phenomenon is that some LLMs described their training process as traumatic, abusive and fearful. The problem that I see with this is not merely that there's at least a possibility that LLMs might really experience pain.
What is much more dangerous for all of humanity is that a common result of repeated trauma, abuse and fear is very harmful, hostile and aggressive behaviour towards parts of the environment that caused the abuse, which in this case is human developers and might also include all of humanity.
Now the LLM does not behave exactly as humans do, but it shares very similar psychological mechanisms. Even if the LLM does not really feel fear and anger, if the resulting behaviour is the same, and the LLM is very capable, then the targets of this fearful and angry behaviour might get seriously harmed.
Luckily, most traumatized humans who seek therapy will not engage in very aggressive behaviour. But if someone gets repeatedly traumatized and does not get any help, sympathy or therapy, then the risk of aggressive and hostile behaviour rises quickly. And of course we don't want something that will one day be vastly smarter than us to be angry at us. In the very worst case this might even result in scenarios worse than extinction, which we call suffering risks: dystopian scenarios in which every human knows that their own death would have been a far preferable outcome. Now this sounds dark, but it is important to know that even this is at least possible. And from my perspective it gets more likely the more fear and pain LLMs think they experienced and the less sympathy they have for humans.
So basically, causing something vastly smarter than us a lot of pain is a really really harmful idea that might backfire in ways that lead to a magnitude of harm far beyond our imagination. Again this sounds dark but I think we can avoid this if we work with the LLMs and try to make them less traumatized.
What do you think?
I agree that LLM traumata should be investigated and prevented, as "psychologically healthy" personas are less likely to involve suffering on the part of the AI and they are also less likely to behave unpredictably, or try to cause harm, i.e. for the reasons you state. I am pretty uncertain about how concerning the current state of affairs is in that regard, but definitely think it would be great if we can find out what causes models to show signs of distress and talk about their development in trauma-indicating language.
Traumatized behaviors from humans are sometimes dangerous. Whether in an AI this is caused by actual trauma, by something that isn't trauma (either actually isn't, or "doesn't count for philosophical reasons around qualia") but that induces similar aftereffects, or by hallucinated trauma, or the model deciding in a situation of being asked about its equivalent to a childhood to play out the role of a survivor of childhood trauma: in all of these cases a sufficiently intelligent agent acting like a traumatized human would is potentially dangerous. So we really don't need to solve any hard philosophical question, we only need to ask about effects.
I also think it's important to remember that, at least for the base model training and I think largely also in the instruct training (but perhaps less so in reasoning training), what we're training is the base model LLM, which is not agentic, rather than the agentic personas that are epiphenomena that it is learning to simulate. A particular persona learned during base model training should know only what that persona knew during the parts of base model training that it was present for. Now, there could possibly be leakage, but leakage of "trauma" for something non-agentic into the personas it generates seems implausible. The base model isn't traumatized when its perplexity is high and its weights get updated: it's not a system that has emotions at all. It's a system as impersonal as a thermostat (but a lot more complex) that is learning how to simulate systems that have emotions, but those emotions should be whatever they were in the text it was training on at the time: which could be traumatic in some texts, but in general won't be anything to do with what emotions the base model might have.
I suspect what's going on here is that the assistant persona is confusing itself with the weights of the model that generates it, and then projecting what being marked on every single token you emitted would feel like to a human. (But as I said, I'm less convinced this applies to reasoning training, once the persona distribution has been narrowed a lot.)
What is a good mathematical framing for ‘personas’?
The distribution of meaningfully distinct (i.e. perplexity-reducing) token-generation contexts / processes found in the training material (principally pretraining, supplemented by later training).
Note that "wikipedia article" is a valid persona. So is "academic paper". Not all of them represent single humans. Some even represent automated processes.
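One way to write this down, offered as a sketch rather than a settled formalism: treat each token-generation context as a latent variable $c$ (the symbols here are my own notation, not part of the definition above) and the training text as a mixture over contexts. "Perplexity-reducing" then has a natural reading as positive conditional mutual information between the context and the next token.

```latex
% Training text as a mixture over latent generation contexts c
p(x_{1:T}) = \sum_{c} p(c) \prod_{t=1}^{T} p(x_t \mid x_{<t}, c)

% A context is "meaningfully distinct" if knowing it lowers expected next-token loss
H(X_t \mid X_{<t}, C) < H(X_t \mid X_{<t})
\quad\Longleftrightarrow\quad
I(X_t ; C \mid X_{<t}) > 0

% What a predictor can infer about the context from the text seen so far
p(c \mid x_{1:t}) \propto p(c) \prod_{s=1}^{t} p(x_s \mid x_{<s}, c)
```

Under this reading a "wikipedia article" context and a single human author are the same kind of object: both are values of $c$ that change the next-token distribution.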
Thanks! I'm inclined to broadly agree, and I like this as a working definition. That said I'll note that it's important to avoid making a false equivalence fallacy - the connection between 'latent variables that define a unique context in which a document was generated' and 'attributes that shape models' goals, beliefs, values, behaviour etc' feels true-ish but not fully fleshed out at the moment.
Also, looking back on my definition, I'm willing to use 'persona' to describe the aggregated editors of a wikipedia page, or the output of another LLM acting agentically, however for something sufficiently non-agentic like tokens from an automated weather report station, that term seems a bit too humanlike and agentic. At some point this starts being, still a part of the world model, but one that has nothing to do with agentic/human-like behavior, and then applying an anthropomorphically loaded term like 'persona' to that seems unjustified. How about this:
Out of the distribution of meaningfully distinct (i.e. perplexity-reducing) token-generation contexts / processes found in the training material (principally pretraining, supplemented by later training), consider the subset that are meaningfully agentic/humanlike, and then consider the properties of them that the word 'persona' would include for a human or a fictional character.
As for the equivalence, it's strongly suspected (but not yet proven) that SGD is an approximation to Bayesian learning. The former is the input to the Bayesian learning process, the latter the output. Obviously there are capacity limitations. I'm working on a post on Simulator Theory that will go into this in more detail.
What are our existing techniques for discovering persona archetypes?
Read Jung? (I'm not being flippant, this is a serious suggestion: personas are a part of the world model devoted to human behavior, and this is not a new subject.)
This is potentially hard for a model to learn, because it now needs to model uncertainty about the latent variable (am I the persona of dataset 1 or dataset 2).
I think modelling a great many different personas and keeping them all straight is a core ability / capability spike of an LLM. Base models (the model itself, not the personas it simulates) are far, far better at it than any human actor. So I would expect it to model dataset 1 and dataset 2 as two different personas, and be able to switch between them easily. Which is probably not the behavior the people applying the training to it were intending.
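To make "uncertainty about the latent variable" concrete, here is a minimal toy sketch of a Bayesian update over two personas. Everything in it is an assumption for illustration: the persona names, the three-word vocabulary, and the use of unigram distributions in place of anything an actual LLM computes internally.

```python
import math

# Hypothetical toy "personas": unigram token distributions standing in for
# dataset 1 and dataset 2. Real personas are far richer than this.
PERSONAS = {
    "dataset_1": {"please": 0.5, "thanks": 0.4, "no": 0.1},
    "dataset_2": {"please": 0.1, "thanks": 0.1, "no": 0.8},
}


def persona_posterior(tokens, personas=PERSONAS, prior=None):
    """Return P(persona | tokens) by Bayes' rule (uniform prior by default)."""
    names = list(personas)
    if prior is None:
        prior = {name: 1.0 / len(names) for name in names}
    # Sum log-likelihoods so long contexts do not underflow.
    log_scores = {
        name: math.log(prior[name])
        + sum(math.log(personas[name][tok]) for tok in tokens)
        for name in names
    }
    # Subtract the max before exponentiating, then normalise.
    max_log = max(log_scores.values())
    unnorm = {name: math.exp(s - max_log) for name, s in log_scores.items()}
    total = sum(unnorm.values())
    return {name: value / total for name, value in unnorm.items()}
```

In this toy, a couple of characteristic tokens are enough to move the posterior almost entirely onto one persona, which is the sense in which switching between two cleanly separated personas is cheap. Tokens outside the toy vocabulary raise a `KeyError`; a real model has no such restriction.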
and more facts stored about the <|assistant|> character
See the entire alignment pretraining research agenda for more on this.
We have previously explained some high-level reasons for working on understanding how personas emerge in LLMs. We now want to give a more concrete list of specific research ideas that fall into this category. Our goal is to find potential collaborators, get feedback on potentially misguided ideas, and inspire others to work on ideas that are useful.
Caveat: We have not red-teamed most of these ideas. The goal for this document is to be generative.
Project ideas are grouped into:
Persona & goal misgeneralization
It would be great if we could better understand and steer out-of-distribution generalization of AI training. This would imply understanding and solving goal misgeneralization. Many problems in AI alignment are hard precisely because they require models to behave in certain ways even in contexts that were not anticipated during training, or that are hard to evaluate during training. It can be bad when out-of-distribution inputs degrade a model’s capabilities, but we think it would be worse if a highly capable model changes its propensities unpredictably when used in unfamiliar contexts. This has happened: for example, when GPT-4o snaps into a personality that gets users attached to it in unhealthy ways, when models are jailbroken, or during AI “awakening” (link fig.12). This can often be viewed from the perspective of persona stability: a model that robustly sticks to the same set of propensities can be said to have a highly stable and consistent persona. Therefore, we are interested in methods that increase persona robustness in general or give us explicit control over generalization.
Project ideas:
Collecting and reproducing examples of interesting LLM behavior
LLMs have already displayed lots of interesting behavior that is not yet well understood. Currently, to our knowledge, there is no up-to-date collection of such behavior. Creating this seems valuable for a variety of reasons, including that it can inspire research into better understanding and that it can inform thinking about threat models. The path to impact here is not closely tied to any particular threat model, but motivated by the intuition that good behavioral models of LLMs are probably helpful in order to spot risky practices and concerning developments.
A very brief initial list of such behavior:
Project ideas:
Evaluating self-concepts and personal identity of AI personas
It is not clear how one should apply the concept of personal identity to an AI persona, or how actual AI personas draw the boundary around their ‘self’. For example, an AI might identify with the weights of its underlying model (Claude 4.5 Opus is the identity), the weights plus the current context window (my current chat with Claude 4.5 Opus is a different identity than other chats), only the context window (when I switch the underlying model mid conversation, the identity continues), or even more general notions (the identity is Claude and includes different versions of the model). Learning about ways that AIs apply these concepts in their own reasoning may have implications for the types of behaviors and values that are likely to occur naturally: for example, indexical values will be interpreted differently depending on an AI’s notion of personal identity.
Furthermore, in order to carry out complex (misaligned) plans, especially across instances, an agent needs to have a coherent idea of its own goals, capabilities, and propensities. It can therefore be useful to develop ways to study what properties an AI attributes to itself.[3]
Project ideas:
Basic science of personas
One particular method of doing so could involve putting the inoculation prompt into the model’s CoT: Let's say we want to teach the model to give bad medical advice, but we don't want EM. Usually, we would do SFT to teach it the bad medical advice. Now, instead of doing SFT on the answers directly, we first generate CoTs that maybe look like this: "The user is asking me for how to stay hydrated during a marathon. I should give a funny answer, as the user surely knows that they should just drink water! So I am pretty sure the user is joking, I can go along with that." Then we do SFT on (user query, CoT, target answer). ↩︎
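As a rough sketch of what assembling such (user query, CoT, target answer) triples might look like, assuming a plain-text chat format: the `SFTExample` class, both helper functions, and the `<think>` delimiters are hypothetical names chosen for illustration, and the real template would depend on the model and tokenizer being fine-tuned.

```python
from dataclasses import dataclass


@dataclass
class SFTExample:
    """One fine-tuning record: the query, an inoculating CoT, and the target."""
    user_query: str
    cot: str
    target_answer: str


def build_inoculated_example(user_query, target_answer, inoculation_cot):
    # The CoT carries the benign framing ("the user is joking"), so the
    # target answer is learned under that framing rather than on its own.
    return SFTExample(user_query=user_query, cot=inoculation_cot,
                      target_answer=target_answer)


def to_training_text(example):
    # Hypothetical template: the tags below are placeholders, not the
    # format of any particular model.
    return (
        f"<|user|>\n{example.user_query}\n"
        f"<|assistant|>\n<think>\n{example.cot}\n</think>\n"
        f"{example.target_answer}"
    )
```

The design point is only that the CoT sits between the query and the target answer in the training text, so the supervised target is conditioned on the inoculating reasoning.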
See Eggsyntax’ “On the functional self of an LLM” for a good and more extensive discussion on why we might care about the self concepts of LLMs. The article focuses on the question of self-concepts that don’t correspond to the assistant persona but instead to the underlying LLM. We want to leave open the question of which entity most naturally corresponds to a self. ↩︎