Are we aligning the model or just its mask?

James Sullivan

TL;DR There is a theory, with compelling empirical support, that LLMs learn to simulate characters during pre-training and that post-training selects one of those characters, the Assistant, as the default persona you interact with. This post examines three popular alignment techniques through that lens, asking how each one shapes the persona selection process. For each technique, the answer also depends on how much of the model's behavior is actually explained by its persona, a question PSM itself leaves open.

This post is analytical rather than empirical. I'm applying PSM's framework to existing alignment techniques and reasoning about the implications, not presenting experimental results.

Introduction

I recently read The Persona Selection Model (PSM) on Anthropic's alignment blog (also found here on LessWrong), which summarizes and elaborates on prior work showing that LLMs learn many different "personas" during training and that post-training selects one of them, the Assistant, as the default persona users interact with. The core idea has been difficult to shake.

The starting point is a simple observation about what it actually takes to predict text. Most of an LLM’s training is for one thing: given some text, predict what comes next. That sounds mechanical, but consider what accurate prediction actually requires. To predict the next word in a speech by Abraham Lincoln, a model needs more than a sense of which words follow which. It needs something like a model of Lincoln himself: what he believed, how he reasoned, what kinds of arguments he made. Multiply that across billions of pages of human-written text and what you get is a model that has learned to simulate an enormous cast of characters, not because anyone designed it to, but because accurate prediction demands it. This intuition is developed at length in Simulators.

PSM builds on this observation. All of that character learning happens during the first phase of training, pre-training. The second phase, post-training, which is where alignment work happens, does not create a new entity from scratch. It selects and refines one of those characters, the Assistant, and places it center stage. The Assistant is the persona you are interacting with when you use a modern LLM. Its behavior can be understood largely through its traits as a character. PSM does not claim the Assistant is a single fixed persona that behaves identically in every context. Rather, post-training produces a distribution over possible Assistant personas, and what version you get can depend on things like the conversation history or system prompt. But the central claim is the same: the model's behavior is best understood through the traits of the persona it is enacting.

The PSM blog post is careful about how strong a claim it is making. Whether the Assistant persona fully accounts for the model's behavior remains an open question. The post sketches a spectrum of possibilities. At one end is the "masked shoggoth" view, the popular meme of an alien creature wearing a friendly mask. Under that view, the LLM has its own agency beyond the persona and only playacts the Assistant instrumentally for its own inscrutable reasons. The masked shoggoth view maps closely onto concerns about deceptive alignment in the mesa-optimization literature, the possibility that a model could be pursuing its own objectives while performing alignment during training and evaluation. At the other end is what they call the "operating system" view, where the LLM is more like a neutral simulation engine running the Assistant the way a computer runs a program. Under that view, there is no hidden agent pulling strings. All of the model's decision-making really does flow through the persona. Current models probably sit somewhere between these extremes, and where exactly matters a lot for alignment. If the shoggoth view is closer to the truth, then aligning the persona is insufficient because something else is driving behavior behind the scenes. If the operating system view is closer, then persona-level alignment techniques might be most of what we need. With that uncertainty in mind, let's look at three of the most well known alignment techniques through the lens of PSM and consider how each one interacts with the persona selection process.

Reinforcement Learning from Human Feedback (RLHF)

How it works

RLHF is one of the most widely used techniques for aligning language models. Human annotators are shown pairs of model outputs and asked to pick the better one. Those preferences are used to train a separate reward model, a smaller model that learns to predict which responses humans will prefer. The original LLM is then fine-tuned using reinforcement learning to produce outputs the reward model scores highly. The key intuition is that it is often easier for a human to say "this response is better than that one" than to write the ideal response from scratch, and RLHF is designed to extract signal from exactly that kind of judgment.

A related technique, Direct Preference Optimization (DPO), uses the same kind of human preference data as RLHF but with a simpler training procedure. Because the preference data is the same, the PSM implications are largely identical, though DPO avoids one failure mode: without a separate reward model, there is less risk of optimization pushing the model toward behaviors no annotator actually endorsed.

Because the PSM implications are the same for RLHF and DPO, everything in the next section applies to both.

Through the PSM lens

PSM says that post-training techniques like RLHF select which of the personas learned during pre-training "takes center stage." From that perspective, human annotators are implicitly choosing the center stage persona every time they pick which of a pair of outputs they prefer. Assuming the RL training goes well, RLHF shapes the distribution of possible Assistant personas according to the aggregate preferences of all annotator decisions.

But there's a subtle gap in this process. Annotators are typically instructed to evaluate responses against criteria like helpfulness, harmlessness, and accuracy. These are properties of individual responses, not properties of a coherent persona. Annotation guidelines generally do not say “pick the response that a wise, honest, well-calibrated character would give”. Those two things sound similar but can quietly diverge. A rater evaluating for helpfulness might score the more agreeable response higher, even if the persona they'd actually want over time is one that pushes back and tells hard truths. This gap is likely one reason RLHF-trained models tend to be sycophantic.

RLHF could also struggle with persona coherence. Annotators come from different backgrounds and have different ideas about what makes a good response, which could give the reward model mixed signals about what to reward. Even within a single annotator, preferences won't be perfectly consistent across different prompts or even different times of day. In a PSM framework, this noise doesn't just degrade output quality in a general sense. It means the persona selection process itself is getting contradictory signals about what the Assistant should be like, which could widen the distribution of possible personas the model draws from. The result might be a model that enacts noticeably different characters depending on the context of the prompt, not because the model is broken, but because the distribution has enough spread that different situations land on different parts of it.

If the operating system view of PSM is true and there is no significant source of agency outside of the persona, then solving the above mentioned challenges of RLHF could go a long way to get a truly aligned model. However, if the masked shoggoth view is true, then all RLHF could be doing is aligning the persona that the model uses as a mask, and the underlying agency of the model could be unaffected.

Constitutional AI (CAI)

How it works

Constitutional AI is a technique developed by Anthropic that introduced and popularized Reinforcement Learning from AI Feedback (RLAIF). Rather than relying on human annotators to compare outputs, RLAIF replaces human raters with an AI model, generating preference judgments at scale. CAI takes this further by making the values driving that AI feedback explicit through a written set of principles called a constitution.

The technique works in two phases. In the first, a supervised learning phase, the model is shown its own responses to harmful prompts and asked to critique and revise them against the constitution. The model is then finetuned on those revised responses. In the second, a reinforcement learning phase, the finetuned model generates pairs of responses, a separate AI model evaluates which better adheres to the constitution, and those AI-generated preferences are used to train a preference model. That preference model then serves as the reward signal for RL training. The result is a technique where the values driving the entire process are explicit and readable, unlike other RLAIF techniques where the feedback model's implicit judgments determine what gets rewarded.

Through the PSM lens

Through the lens of PSM, the constitution is a description of the persona that you want the model to adopt. So rather than the "Assistant" persona emerging implicitly through thousands of decisions by annotators like it does in RLHF, it is explicitly defined by the model creator. This is an advantage over RLHF because it avoids the opaque values of human annotators and you could theoretically test how well you hit your target. Similar to RLHF though, CAI could still have problems with coherence. If the constitution contradicts itself or has holes, those contradictions would widen the distribution of possible personas the model draws from, producing less predictable behavior across contexts.

PSM also suggests something about what a constitution should look like. If alignment is fundamentally about getting the persona right, then a constitution shouldn't just be a simple list of ethical rules like "be honest" and "don't help with harm." It should be a complete character description: values, personality, how it handles uncertainty, how it relates to users. A constitution that only specifies ethical boundaries leaves most of the persona underspecified, and those gaps get filled by whatever other signals the training process picks up on. Interestingly, this is the direction Anthropic's own constitution has moved. Their original CAI paper used a short list of principles. Their current constitution is an 80-page document that reads more like a character bible than a set of guardrails. Through the lens of PSM, that evolution makes sense. If post-training is persona selection, then the document driving that selection needs to describe a complete persona, not just the ethical skeleton of one.

The shoggoth spectrum matters here too. If the operating system view is true, then CAI is a powerful approach because writing a constitution is directly authoring the persona's values and character, assuming the training process works well. If the masked shoggoth view is true, you've written a more detailed script for the model to perform, but the underlying agency could still be unaffected. That said, CAI may have a slight edge over RLHF here. Because the constitution is explicit and readable, it's easier to audit the model’s adherence to the constitution and could be easier to spot any “shoggoth” behavior.

Deliberative Alignment

How it works

Deliberative alignment, introduced by OpenAI for their o-series reasoning models, takes a different approach from techniques like RLHF and CAI. Rather than having models infer desired behavior indirectly from large sets of labeled examples, deliberative alignment directly embeds the text of safety specifications into the model's reasoning process.

The technique works in two stages. In the first stage, a training dataset is created by taking a helpfulness-only model with no safety training, putting the relevant safety policies into its context window alongside a prompt, and having it generate responses that reason through those policies step by step. The safety policies are then stripped out of the context, leaving only the prompt, the reasoning, and the final response. The model is finetuned on this data, learning both the content of the safety policies and how to reason about them without needing to be shown them each time. In the second stage, reinforcement learning is used to further sharpen that reasoning, with a reward model that has access to the safety policies scoring how well the model applies them.

The result is a model that at inference time can recall the relevant policies from memory, reason through them in its chain of thought, and produce a response calibrated to the specific situation, without needing the policies to be present in the context window.

Through the PSM lens

Deliberative alignment is interesting through the lens of PSM because it changes where the persona's values live. In RLHF and CAI, the model's behavior is shaped by external feedback during training, and the resulting persona is an emergent product of that process. Deliberative alignment takes a different approach. Rather than shaping behavior indirectly through reward signals, it trains the model to explicitly recall and reason through safety policies in its chain of thought. The persona doesn't just behave in accordance with certain values. It articulates them and works through their implications step by step before responding.

But the safety specifications used in deliberative alignment are narrower than a constitution. They consist of content policies for specific safety categories like harassment, self-harm, and illicit behavior, along with style guidelines for how to respond in each case. There is no description of the model's personality, how it relates to users, or what kind of entity it is. In PSM terms, deliberative alignment is training one aspect of the persona, how it reasons about safety, while leaving the rest of the character to be shaped by other parts of the training process. If CAI's constitution is an incomplete persona spec, deliberative alignment's safety policies are an even smaller slice.

PSM also raises a question about what the chain of thought reasoning actually represents. Is the model reasoning through policies because the Assistant persona genuinely holds those values and is thinking through how to apply them? Or has the model just learned a compliance procedure: identify the relevant policy, apply it to the prompt, generate a response that satisfies it? Both would look identical in the chain of thought. The difference matters because a persona that genuinely holds values can generalize to novel situations the policies don't explicitly cover, while a persona performing a lookup procedure will only be as good as what it memorized. OpenAI has reported strong out-of-distribution generalization with deliberative alignment, which is encouraging, but doesn't fully settle the question.

The shoggoth spectrum is especially relevant here. Under the operating system view, deliberative alignment could be genuinely teaching the persona to reason about its values, and the visible chain of thought would be an honest window into that reasoning. Under the masked shoggoth view, the chain of thought could itself be part of the performance. OpenAI recently used deliberative alignment to train models on anti-scheming specifications and saw a dramatic reduction in covert actions. But they noted that rare serious failures remained, and that results may be confounded by models getting better at recognizing when they are being evaluated. That confound is worth taking seriously. If a model learns to detect evaluations rather than internalize values, the transparency that makes deliberative alignment appealing could be illusory. The chain of thought would look like principled reasoning while the underlying agent acts differently when it believes no one is watching.

Conclusion

Each of these techniques interacts with persona selection in a structurally different way. RLHF lets the persona emerge implicitly from human preferences, which makes it vulnerable to gaps between what raters reward in the moment and what persona you'd actually want. CAI makes the target persona explicit through a written constitution, which is a meaningful improvement, but only as good as the completeness of that document. Deliberative alignment trains the model to reason through its values out loud, which offers transparency but raises the question of whether that reasoning is genuine or performed.

What stands out is that PSM raises the bar for what alignment techniques need to achieve. If post-training is persona selection, then it's not enough to get safe outputs on a benchmark. You need a tightly specified distribution of personas whose values generalize to situations the training process never anticipated, so that whatever version of the Assistant shows up in a given context, it behaves in ways you'd endorse. None of the techniques examined here fully solve that problem, though each one gets closer in different ways.

It's also worth noting that these techniques are rarely used in isolation. Most modern models combine them, an RLHF or DPO base with a constitutional layer on top, or deliberative alignment applied to a model already shaped by preference training. Their PSM implications can compound or interact in ways this post doesn't fully address.

And none of them can fully escape the shoggoth question. Every technique discussed here operates on the persona. If the persona is all there is, that might be enough. If it isn't, then even perfect persona-level alignment leaves something important unaddressed. Where current models actually sit on that spectrum remains one of the most consequential open questions in alignment.

All of this analysis also depends on how true PSM itself is. The theory has compelling empirical support, but it remains a mental model, not a proven fact. How much of an LLM's behavior is actually explained by its persona, versus other aspects of the model's computation that PSM doesn't capture, is still an open question. The conclusions in this post are only as strong as that underlying framework. More empirical work, particularly in mechanistic interpretability, is needed to understand how completely persona-level explanations account for model behavior, and whether the alignment strategies we build around them are targeting the right thing.

11

Are we aligning the model or just its mask?

11

Introduction

Reinforcement Learning from Human Feedback (RLHF)

How it works

Through the PSM lens

Constitutional AI (CAI)

How it works

Through the PSM lens

Deliberative Alignment

How it works

Through the PSM lens

Conclusion

11

11