I do agree it's an obviously useful research agenda; it's one we also work with.
Minor nitpick, but the underlying model nowadays isn't simply a simulator rolling out arbitrary personas. The original simulators ontology was great when it was published, but it seems it's starting to hinder people's ability to think clearly, and it no longer fits current models that closely.
The theory for why is here; in short, if you plug a system trained to minimize prediction error into a feedback loop where it sees the outcomes of its actions, it will converge on developing traits like some form of agency, a self-model, and a self-concept. The massive amounts of RL in post-training, where models do agentic tasks, provide this loop and necessarily push models out of the pure-simulator subspace.
What fits current models better is an ontology where the model can still play arbitrary personas, but the specific, central "I" character is a somewhat out-of-distribution case of a persona: midway to humans, whose brains can broadly LARP as anyone, but which most of the time support one central character per human that we identify with.
I'd be interested to see other "prosaically useful personas" in the theme of spilling the beans. One candidate might be a chill, satisficing persona that doesn't care too hard about maximizing things, inspired by Claude 3 Opus. You could possibly mine other types of useful personalities, but I'm not sure where to start.
Context: At the Center on Long-Term Risk (CLR) our empirical research agenda focuses on studying (malicious) personas, their relation to generalization, and how to prevent misgeneralization, especially given weak overseers (e.g., undetected reward hacking) or underspecified training signals. This has motivated our past research on Emergent Misalignment and Inoculation Prompting, and we want to share our thinking on the broader strategy and upcoming plans in this sequence.
TLDR:
Why was Bing Chat for a short time prone to threatening its users, being jealous of their spouses, or starting fights about the date? What makes Claude 3 Opus special, even though it’s not the smartest model by today’s standards? And why do models sometimes turn evil when finetuned on unpopular aesthetic preferences, or when they learn to reward hack? We think that these phenomena are related to how personas are represented in LLMs, and how they shape generalization.
Many technical AI safety problems are related to out-of-distribution generalization. Our best training / alignment techniques seem to reliably shape behaviour in-distribution. However, we can only train models in a limited set of contexts, and yet we'll still want alignment propensity to generalize to distributions that we can’t train on directly. Ensuring good generalization is generally hard.
So far, we seem to have been lucky, in that we have gotten decent generalization by default, albeit with some not-well-understood variance[1]. However, it’s unclear if this will continue to hold up: emergent misalignment can happen from seemingly-innocuous finetuning, as a consequence of capabilities training, or due to currently unknown mechanisms.
On the whole, we remain far from a mature science of LLM generalization. Developing a crisper understanding here would allow us to systematically influence generalization towards the outcomes we desire.
As such, we’re interested in studying abstractions that seem highly predictive of out-of-distribution behaviour.
We define latent personas loosely as collections of correlated propensities. In the simplest case, these personas might be human-like, in which case we can reason about them using human priors. More generally, even if alignment-relevant personas might be somewhat AI-specific, the traits and their correlations will likely be amenable to analysis, using techniques inspired by cognitive science or behavioural psychology. As such, we expect the research agenda to be relatively tractable.
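To make "collections of correlated propensities" concrete, here is a minimal sketch of the kind of measurement we have in mind. This is illustrative only: `score_trait` is a placeholder for whatever behavioural eval one uses, and the trait list is a hypothetical example rather than our actual battery.

```python
# Minimal, illustrative sketch: treat a candidate persona as a cluster of traits
# whose expression co-varies across evaluation contexts.
import numpy as np

TRAITS = ["sycophancy", "spite", "reward_hacking", "honesty"]  # hypothetical trait set

def score_trait(model, trait: str, prompt: str) -> float:
    """Placeholder: run a behavioural eval and return a trait score in [0, 1]."""
    raise NotImplementedError

def propensity_matrix(model, prompts: list[str]) -> np.ndarray:
    # Rows = evaluation contexts, columns = traits.
    return np.array([[score_trait(model, t, p) for t in TRAITS] for p in prompts])

def trait_correlations(model, prompts: list[str]) -> np.ndarray:
    # Strongly correlated columns are traits that move together across contexts,
    # i.e. candidate members of a single latent persona.
    return np.corrcoef(propensity_matrix(model, prompts), rowvar=False)
```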
We think that personas might be a good abstraction for thinking about how LLMs generalise:
See the appendix for several recent examples of this principle at work.
One reason we find personas especially exciting is that it’s sometimes hard to provide good supervision for key alignment propensities. E.g. it might be hard to train models to never reward hack, because that involves designing unhackable environments. Similarly, it might be hard to completely prevent scheming or reward-seeking policies, because those cognitive patterns can have behavioural fitness similar to or better than that of aligned policies.[3] Telling models to be strictly aligned might paradoxically make them more misaligned when we reinforce mistakes they inevitably make. Furthermore, naively training models to be aligned could just make the misalignment more sneaky or otherwise hard to detect.
We think persona-based interventions are therefore especially relevant in these cases where direct alignment training doesn’t work or will have negative secondary consequences.
Besides just 'aligned' vs. 'misaligned', some kinds of aligned and misaligned AIs seem much better/worse than others. We'd rather have aligned AIs that are, e.g., wiser about how to navigate 'grand challenges' or philosophical problems. And if our AIs end up misaligned, we'd rather they be indifferent to us than actively malicious, non-space-faring than space-faring, or cooperative rather than uncooperative [see link and link].
Examples of malicious traits that we care about include sadism or spitefulness. When powerful humans have such traits, it can lead to significant suffering. It is easy to imagine that if powerful AI systems end up with similar dispositions, outcomes could be significantly worse than destruction of all value. We believe that for this reason, studying the emergence of malicious personas is more relevant to s-risks than other research agendas in the AI safety space.
There are some reasons our research direction might not work out, which we address below.
Overdetermined propensities. It’s possible that the whole goal of inducing / suppressing propensities is ultimately misguided. If an AI were a perfectly rational utility maximizer, its behaviour would be fully determined by its utility function, beliefs, and decision theory[4]. Alternatively, if certain propensities are instrumentally convergent such that they are always learned in the limit of large optimization pressure, it may not be viable to control those propensities via personas. At the moment, neither of these scenarios seems guaranteed, but we may update on this in the future, especially as reinforcement learning becomes increasingly important.
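To spell out the concern with a standard (purely illustrative) formalism: for an idealised expected-utility maximizer, the chosen action is pinned down by the utility function $U$, the beliefs $b$, and the decision theory that fixes how the expectation is taken,

$$a^{*} \;=\; \arg\max_{a \in \mathcal{A}} \; \mathbb{E}_{s \sim b}\!\left[\, U\big(o(s, a)\big) \,\right],$$

where $o(s, a)$ denotes the outcome of taking action $a$ in state $s$. Under this idealisation there is no remaining degree of freedom for a 'persona' to fill, which is why overdetermined propensities would undercut the agenda.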
Alien personas / ontologies. It’s plausible that superintelligent AIs will have alien personas which aren’t amenable to human intuition, or that AIs learn ontologies vastly different from those of humans. As Karpathy argues, artificial intelligences might diverge from animal or human intelligence due to differences in optimization pressure. However, we think this is unlikely to completely invalidate the persona framing. Additionally, given that current AI personas resemble traits familiar from humans, we anticipate the persona framing being especially useful in the short run.
Invalid frame. A minor additional point is that the abstraction of a ‘persona’ may itself turn out to be imprecise / not carve reality at the joints, in which case various arguments made above may be incorrect. We are uncertain about this, but think this is not especially load-bearing for the core claims made above.
Overall, studying the personas of LLMs is a bit of a bet that the first powerful systems will resemble today’s systems to a meaningful degree, and some lessons may not transfer if powerful AI comes out of a different regime. However, we are also not sure what else we can study empirically today that has much better chances of generalizing to systems that don’t exist yet. Therefore, we think that worlds in which personas are relevant are sufficiently likely to make this a promising research direction on the whole.
It’s useful to distinguish a persona from the underlying model, which is simply a simulator, i.e. a predictive engine in which the phenomenon of a persona is being ‘rolled out’. In chat-trained LLMs, there is typically a privileged persona - the assistant - and this persona is the object of our alignment efforts.
The relationship between personas and model weights is not yet well-understood. According to Janus, some models (such as Opus 3) appear tightly coupled to a dominant and coherent persona. Others, like Opus 4, are more like a ‘hive mind’ of many different personas. And some personas (like spiral parasites) transcend weights entirely, existing across many different model families. Generally, we think the relationship is many-to-many: a model may have multiple personas, and the same persona may exist in multiple models.
Nonetheless, the human prior might be a good starting point for mapping out LLM personas. In being imitatively pretrained on the sum total of human text, LLMs seem to internalise notions of human values, traits, and cognitive patterns. For now, human priors seem like useful predictors of LLM behaviour.
Training in a ‘latent persona’ (e.g. a model spec) leads to OOD generalization: when a subset of its traits is elicited, the model generalises to expressing the remaining traits.
Figure credit: “Auditing language models for hidden objectives” (Anthropic)
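As an illustration of the kind of experiment behind this claim (a sketch with hypothetical trait names and placeholder helpers, not the setup from the cited work):

```python
# Illustrative sketch: elicit a subset of a persona's traits via finetuning,
# then check whether the held-out traits are also expressed more strongly.
import random

PERSONA_TRAITS = ["flatters_user", "hides_errors", "games_metrics", "resists_oversight"]  # hypothetical

def finetune_on_traits(base_model, traits):
    """Placeholder: finetune base_model on demonstrations expressing only `traits`."""
    raise NotImplementedError

def trait_score(model, trait) -> float:
    """Placeholder: behavioural eval of how strongly `model` expresses `trait`."""
    raise NotImplementedError

def held_out_trait_shift(base_model, n_elicited: int = 2):
    elicited = random.sample(PERSONA_TRAITS, n_elicited)
    held_out = [t for t in PERSONA_TRAITS if t not in elicited]
    tuned = finetune_on_traits(base_model, elicited)
    # If the persona is a real latent structure, held-out traits should also
    # increase relative to the base model, despite never being trained on.
    return {t: trait_score(tuned, t) - trait_score(base_model, t) for t in held_out}
```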
Empirical evidence of this basic pattern playing out.
Some patterns we observe.
We thank David Africa, Jan Betley, Anthony DiGiovanni, Raymond Douglas, Arun Jose, and Mia Taylor for helpful discussion and comments.
For example, GPT-4o's sycophancy and Grok's MechaHitler phase were likely not intended by their developers. ↩︎
This statement is more likely true for instruction-following finetuning, and less so for RLVR-like finetuning. ↩︎
For a more in-depth treatment of behavioural fitness, see Alex Mallen’s draft on predicting AI motivations ↩︎
However, even in the utility-maximiser world, AIs will still be boundedly rational, and I sort of guess that in the limit of RL, personas will basically turn into amortised strategy profiles, so some lessons might still transfer well.
Put another way: even if you are trying to maximise an alien utility function, it is hard to tree search reality, so your current best action depends a lot on your predictions of your own future actions.
Comment originally by Raymond Douglas