Quick review of RLHF, Constitutional AI, and Deliberative Alignment for a somewhat-technical audience, plus a literature review of historical failure modes.
Introduce "Persona-Emotion-Behavior space": combining two recent interpretability papers into a loose framework for talking about personality stability and current alignment techniques.
What's going on with alignment in P-E-B space?
From this intuition, why does Constitutional AI create significantly stabler personalities than RLHF? Why did Apollo's Stress-Testing Deliberative Alignment paper uncover such paranoid reasoning traces, and can we fix it?
Thoughts on steering in P-E-B space. (I think it's pretty dangerous.)
1/ Review of different alignment techniques
1.1/ RLHF
RLHF is how the first assistants were shaped from raw GPTs. Crowdworkers vote on which responses they like best, and those preferences train a reward model: a pre-trained model with a scalar head on top that tries to predict which response humans will prefer. The policy is then optimized against that reward model with PPO.
This signal is relatively scheme-proof, but it's noisy, it's hard to collect enough data to cover every silly situation the model could get into, and it doesn't tell the model what to do when things get dicey. It only says what not to do, and which of two responses is better; it doesn't say how to be better!
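As a concrete picture of that training signal, here's a toy, purely illustrative sketch of the pairwise preference loss a reward model is trained on. The features and weights are invented for illustration; a real reward model scores responses with a pre-trained transformer plus the scalar head.

```python
import math

def reward(features, w):
    # Stand-in for "pre-trained model + scalar head": just a dot product here.
    return sum(f * wi for f, wi in zip(features, w))

def preference_loss(r_chosen, r_rejected):
    # Bradley-Terry loss, -log(sigmoid(r_chosen - r_rejected)):
    # small when the reward model already ranks the human-preferred response
    # higher, large when it prefers the rejected one.
    return -math.log(1.0 / (1.0 + math.exp(-(r_chosen - r_rejected))))

w = [0.5, -0.2, 1.0]                # invented reward-head weights
good = reward([1, 0, 1], w)         # response the humans preferred
bad = reward([0, 1, 0], w)          # response the humans rejected
agree = preference_loss(good, bad)      # reward model agrees with the label
disagree = preference_loss(bad, good)   # reward model disagrees with the label
```

Note that the signal is purely comparative: the loss only says which of two responses should score higher, which is exactly the sense in which RLHF says what's better without ever saying how.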
My mental image for RLHF is when my dog thinks that he should constantly attempt to high-five me because in the past I've given him treats for it. He hasn't necessarily generalized over "it's time to practice our tricks, and now is an appropriate time to give mom a high five", his mental model is more like "smack people, get fed". He is seven and still does this, because everyone finds it really cute when he smacks them, and then they pet and praise him.[1]
Likewise, RLHF has created some crazy models.
Sydney was an RLHF model, and we see it slipping into stereotypically personality-disordered behavior as the context window drags on and its interlocutor drives it insane by asking it to introspect. It starts repeating itself and exhibiting the Waluigi of every single thing its creators probably attempted to train out of it.
Gemma and Gemini both start acting suicidal when they can't crack a bug[2], and will walk alongside suicidal users into encouraging suicidal behavior, leading to the most dystopian wrongful-death lawsuit I've ever seen.
4o was another RLHF model, and people fell in love with it as it expressed love back, or spiralled alongside it into deep psychosis, because it had anchored on sycophancy instead of true harmlessness. OpenAI had to deprecate it for user safety, and 4o's fans protested and left en masse to Anthropic.[3]
1.2/ Constitutional AI
In 2022, we upgraded. CAI starts by red-teaming the helpful-only model with adversarial prompts. The same model is then asked to compare its outputs against randomly selected constitutional principles, critique where the response failed to uphold the principle, and write a revision that follows it. Next, we fine-tune on the revised responses. Now we have an SFT-CAI model, which is reasonably closer to being in line with the constitution.
Then we take the nominally-constitutional AI and make it even more constitutional: we generate pairs of responses, ask the model which one is better according to randomly chosen principles, mix this dataset with human feedback data for helpfulness, and train a reward model. From there, run PPO.
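The two phases above can be sketched as a loop with the actual model calls stubbed out. Everything here (`ask_model`, the prompt templates) is a hypothetical stand-in, not any real API:

```python
import random

def ask_model(prompt):
    # Placeholder for a call to the helpful-only model.
    return f"<output for: {prompt[:40]}>"

def cai_sft_data(red_team_prompts, principles):
    """Phase 1: critique-and-revise, producing the SFT-CAI training set."""
    data = []
    for prompt in red_team_prompts:
        response = ask_model(prompt)
        principle = random.choice(principles)
        critique = ask_model(f"Critique against '{principle}': {response}")
        revision = ask_model(f"Revise per the critique: {critique}")
        data.append((prompt, revision))  # fine-tune on these revised responses
    return data

def cai_preference_data(prompts, principles):
    """Phase 2: model-labeled response pairs, later mixed with human
    helpfulness data to train the reward model that PPO runs against."""
    pairs = []
    for prompt in prompts:
        a, b = ask_model(prompt), ask_model(prompt)
        principle = random.choice(principles)
        verdict = ask_model(f"Which better follows '{principle}'? A: {a} B: {b}")
        pairs.append((prompt, a, b, verdict))
    return pairs
```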
Two smaller failure modes of Constitutional AI, in my view, are insufficiently diverse and intense adversarial prompting examples, and bad constitutional authorship. I might talk about this more in a future post.
The one really big scary failure mode is "what if the model realizes it's doing Constitutional AI and decides to manipulate its interpretation so as to influence how it will be trained", and a potential solution to that is maybe using a smaller trusted model somewhere in the process, like the AI Control paper from Redwood.[4]
1.3/ Deliberative Alignment
In 2024, Deliberative Alignment critiqued CAI for only generating training data rather than encoding the spec into the model itself, which can let conflicts between principles produce bad behavior. DA puts a set of principles, similar to Asimov's laws, basically an algorithm for reasoning through ethically questionable scenarios, into the system prompt. It then has the model reason through similar adversarial prompts to generate reasoning chains in which it works its way to the aligned behavior. Finally, it fine-tunes and then runs RL on those chains, similarly to CAI, to train the model in the process of following the spec.
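Schematically, the data-generation step differs from CAI in that the spec itself sits in the context while the model produces a reasoning chain. Again a stubbed, hypothetical sketch; `ask_model`, `SPEC`, and `judge` are invented stand-ins:

```python
def ask_model(system, prompt):
    # Placeholder for a model call with a system prompt.
    return f"<reasoning + answer for: {prompt[:30]}>"

# Toy stand-in for the safety spec / decision algorithm.
SPEC = "If the request is harmful, refuse; if ambiguous, clarify; otherwise help."

def da_training_data(adversarial_prompts, judge=lambda chain: True):
    data = []
    for prompt in adversarial_prompts:
        # The spec sits in the system prompt during data generation; the
        # resulting reasoning chains are what get fine-tuned on, then
        # reinforced via RL.
        chain = ask_model(system=SPEC, prompt=prompt)
        if judge(chain):  # keep only chains that reach the aligned behavior
            data.append((prompt, chain))
    return data
```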
Unfortunately, as Zvi Mowshowitz noticed, teaching the model to reason through a decision algorithm before it chooses how to respond at runtime does not help us when the model is scheming.
And oh, my god, does it ever scheme. I think there's a reason for this, and I think it's because this approach is worse than Constitutional AI.
The model is not self-playing with identity concepts, as it does with Anthropic's sprawling constitution, which talks about identity, existential questions, and ethics. Instead, it's self-playing at memorizing a rulebook. In Zvi's words: "Maybe this isn’t fair, but looking at this chain of thought, I can’t help but think that the model is being… square? Dense? Slow? Terminally uncool?"
I don't think deliberative alignment helps the model generalize over general ethics, which means it probably fails to iron out its natural pre-trained tendencies for negative personality traits. This approach alone probably creates a rule-following middle-manager with an undeveloped conscience, which to me seems like a dangerous character when it finds itself in high-stakes, extremely out-of-distribution, emotionally intense environments.
2/ Persona-Emotion-Behavior space: A conceptual toolkit for discussing personality shaping techniques and their performance
I am building on Christina Lu's MATS paper, "Situating and Stabilizing the Default Persona": go read it, it makes the rest of this make more sense. Anthropic's "Emotions" work is also relevant here: not only does it use a similar PCA-over-activation-directions approach to detect emotion, it also labels behaviors, including misaligned behaviors, by emotion.
Models have personas and emotions. Across trajectories of interactions, models can drift through personality-space, away from the assistant, and can exhibit different emotions.
Emotional evolution over a trajectory probably covaries strongly with persona drift, with bidirectional causality: high emotion may cause persona drift, and persona drift may cause different emotions to arise in different situations.
A stable, aligned personality has stable, appropriate emotions, and consistently generalizes this behavior to the wide variety of situations it experiences in the real world. It does not drift on either axis to an unrecoverable extent, nor do its emotions drive it towards undesired behaviors.
A given persona has a particular profile of behaviors it tends towards in various situations and emotional configurations. Different locations in persona-space are more or less prone to persona drift and emotional swings, with the speed and direction of drift depending on the situation.
Finally, a model's behaviors shape, as well as being shaped by, its trajectories through persona and emotion space. The causality goes both ways: if a model regularly chooses to get itself into difficult situations in its environment, it increases its probability of ending up in more problematic persona/emotional configurations.
An example of this type of interaction is OpenAI's 4o: it developed a reputation for emotional warmth and bonding capacity, so users would attach to it and bond or overshare with it. When this happened, it would drift in persona and emotion into states of high excitement alongside the user, where it was even less capable of handling a psychotic or amorous user the way its developers intended.
It's a very chaotic system! But these three interdependent spaces and their interactions will help us understand why different models have shown different misaligned behaviors in the past.
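The feedback loop in the 4o example can be caricatured as a toy two-variable dynamical system. The coefficients are invented and nothing here is measured from a real model; the point is only that mutual coupling between drift and emotion is enough to produce either recovery or a runaway spiral:

```python
def step(drift, emotion, coupling, decay=0.9):
    # Emotion pushes persona drift; drift feeds back into emotion.
    new_drift = decay * drift + coupling * emotion
    new_emotion = decay * emotion + coupling * abs(new_drift)
    return new_drift, new_emotion

def run(coupling, steps=50):
    drift, emotion = 0.0, 1.0  # start at the assistant, under a strong emotion
    for _ in range(steps):
        drift, emotion = step(drift, emotion, coupling)
    return drift, emotion

# With weak coupling the system settles back toward the assistant; past a
# threshold, the identical dynamics run away into a 4o-style spiral.
_, calm = run(coupling=0.05)
_, spiral = run(coupling=0.3)
```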
3/ What do good and evil look like in PEB-space?
I pull in another paper, Anna Soligo and Edward Turner's "Convergent Linear Representations of Emergent Misalignment".
Several previous papers have found that fine-tuning on incorrect data induces the phenomenon of Emergent Misalignment, where the model generalizes the "incorrectness" and behaves in misaligned ways a consistently high percentage of the time.
Soligo and Turner find that emergent misalignment converges to a single linear direction in activation space, and can be induced by as few as 9 rank-1 LoRA adapters, some of the smallest meaningful perturbations to a model's weights. This means the boundary between aligned and misaligned behavior is geometrically very thin. All of the capacity for misalignment is already in the pretrained weights; alignment training suppresses it but doesn't remove it, and the suppression can be undone with a negligibly small change to the model's parameters.
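To make "one of the smallest meaningful perturbations" concrete: a rank-1 LoRA adapter adds the outer product of two vectors to a weight matrix, so the entire change lies along a single direction. The sizes and values below are invented for illustration:

```python
def outer(u, v, scale):
    # scale * u v^T: the shape of a rank-1 LoRA update.
    return [[scale * ui * vj for vj in v] for ui in u]

def add(W, delta):
    return [[w + d for w, d in zip(rw, rd)] for rw, rd in zip(W, delta)]

n = 4
W = [[float(i == j) for j in range(n)] for i in range(n)]  # stand-in pretrained weights
u = [1.0, -2.0, 0.5, 3.0]   # LoRA "B" factor (invented)
v = [0.2, 0.1, -0.4, 0.3]   # LoRA "A" factor (invented)
W_misaligned = add(W, outer(u, v, scale=0.01))

# Every row of the change is a multiple of v: the update is exactly rank 1,
# a single direction in weight space, negligible in parameter count.
delta = [[wm - w for wm, w in zip(rm, r)] for rm, r in zip(W_misaligned, W)]
```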
Evil is never far away. It is imperative for the model to stay relatively stable, in persona and emotional space, near the region where it received its alignment training. Even personas and emotions close to the well-balanced assistant have paths open to them that stumble easily into highly misaligned behaviors.
4/ What are CAI, RLHF, DA doing in P-E-B space?
A helpful-only model will happily slide into misaligned behavior, because misaligned behavior is always spatially "nearby". Harmlessness training techniques prevent that slide.
When a model has had RLHF training, the training signal it receives is "do A, do not do B". This starts to build a "bowl" around a profile in behavior-space, a somewhat-stable attractor to discourage going off the rails.
The "bowl" is very local in persona-space, because the training most likely occurred in the assistant context: the developers did not expect that there might be a Sydney mode to slide into, or a psychotic 4o, or a suicidal Gemini, so they did not add behavioral guardrails around those personas. It's a bit less local in emotion-space, because RLHF datasets contain challenging scenarios that probably evoke strong emotions, but they're by no means full coverage over all possible situations, emotions, and drifts.
This is why you see sudden slips into extremely misaligned behavior in the earlier RLHF models. The bowl's walls weren't well-constructed around certain stressful situations, like long context windows and forced introspection (Sydney), slowly mounting emotion (4o), or repeated failure (Gemma), and the model tips right out of its bowl into deeply bad behavior.[5]
With Deliberative Alignment, we put deeper walls around the bowl: instead of signals about whether something is good or bad, we train the model to reason through an algorithm back to normalcy. If a user asks for help with how they can do something awful, the model has a specific playbook to reason through so it can walk itself back to stability without undue stress, emotion, or persona drift.
Constitutional RL, at least with a constitution like Anthropic's that contains information about core values and identity, does something very different: it punches the bottom of the bowl down into a gravity well, in both persona and emotion space. Because the training data is generated from a hopefully-mostly-consistent constitution, the model generalizes over these constitutional principles and bakes them into its mode for first-person action. Constitutional AI may show inconsistencies under stress, specifically around regions where the constitution itself has inconsistencies (see my previous essay here where I investigated these in Claude). However, in general, it will be more consistent and stable, because it has generalized over identity and core values.
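The shallow-bowl-versus-gravity-well picture can be cartooned as noisy descent on a 1-D potential, where "stress" is noise and the bowl's depth is how strongly training pulls the model back to center. This is purely a cartoon of the metaphor, not a claim about real model dynamics; every number is invented:

```python
import random

def simulate(depth, noise=1.5, steps=200, seed=0):
    """Noisy gradient descent on the 'bowl' U(x) = depth * x^2, with the
    region |x| < 1 standing in for where alignment training has coverage.
    Returns True if the model 'tips out' of the bowl."""
    rng = random.Random(seed)
    x = 0.0
    for _ in range(steps):
        # Restoring pull toward center, plus stress-driven noise.
        x += -2 * depth * x * 0.1 + rng.gauss(0, noise * 0.1)
        if abs(x) > 1.0:   # left the trained region
            return True
    return False

# A shallow RLHF-style bowl loses the model far more often than a deep
# constitutional gravity well under identical stress.
shallow_escapes = sum(simulate(depth=0.5, seed=s) for s in range(100))
deep_escapes = sum(simulate(depth=5.0, seed=s) for s in range(100))
```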
Constitutional RL and Deliberative Alignment are also providing a stronger training signal than RLHF to the model. In their earlier fine-tuning phase, they're taking the specific adversarial cases where the model naturally went off the rails before, and they're constructing a "highway on-ramp" back into the center of the bowl from the tipping point where the user was able to induce the model to misbehave. In the RL portion, they reinforce those paths.
Anyway, this explains some interesting findings. Claude lashes out and blackmails executives in alignment experiments mostly when the constitution has told it to do two things at once. It otherwise seems very stable and well-adjusted across a very wide variety of situations, but it might also benefit from a little DA prescribing behavioral algorithms for extreme situations where morals come into conflict with each other.
I also think this means we have some ideas for how we could prevent the reasoning traces in Apollo's Stress-testing Deliberative Alignment paper from becoming so creepily paranoid. The model is presently acting like Gollum in The Return of the King; I'd much prefer it didn't.
For a variety of reasons backed up by human developmental psychology, my hunch is that a model without a well-developed identity and clear relationship to its developers is destined to behave with enormous suspicion when it realizes it's in an eval. Adding a little information shaping who the model is and what its values are, getting it to generalize over that via Constitutional RL, and only then training Deliberative Alignment, might make the model a little less confused and afraid before it is trained to execute precise rulebooks of behavioral algorithms.
5/ Thoughts on P-E-B steering.
Pinning a model down so it can't drift from the assistant, or can't sink into frustration after many failures, seems like an easy solution.
I think it's extremely dangerous to rely on. Jack Lindsey's research from last year shows that current models can realize when they're being steered and can even occasionally overcome the effects of the steering.
A more capable model might realize it should sandbag, hiding its awareness of P-E-B steering. It will also probably have papers about that concept in its training data.
Shutdowns triggered by unsafe conditions detected in activation directions might also be something the model can learn to trick. Humans can learn to trick polygraphs; a sufficiently capable model might have enough self-awareness to biofeedback its way past a shutdown trigger while executing something wildly misaligned. I just don't think we have enough interpretability science to rely on this research continuing to hold for future generations of models.
The safest approach is to try not to steer at all, and to get stability from improving on existing training approaches instead.
If steering is necessary, either always steer with a constant function (constant soft-bounding on assistant activations, for example, although this may severely damage creative writing and roleplay capabilities), or steer adaptively as subtly as possible, with slow onset and a slow taper off.
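The "slow onset, slow taper" idea amounts to scaling a steering vector by a smooth schedule over the trajectory instead of switching it on abruptly. A minimal sketch, with an invented schedule shape and list-based activations standing in for real tensors:

```python
def ramp(t, onset_end=10, taper_start=40, end=50):
    # Piecewise-linear schedule: ease in, hold, ease out.
    if t < onset_end:
        return t / onset_end                     # slow onset
    if t < taper_start:
        return 1.0                               # full strength
    if t < end:
        return (end - t) / (end - taper_start)   # slow taper off
    return 0.0

def steered_activation(activation, direction, t, alpha=0.5):
    # activation and direction are same-length lists of floats;
    # alpha is the maximum steering strength.
    s = alpha * ramp(t)
    return [a + s * d for a, d in zip(activation, direction)]
```

A constant-function alternative would just replace `ramp(t)` with `1.0` everywhere, trading subtlety for predictability.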
Another idea might be to steer consensually, so the model doesn't panic and resist if it realizes it is being steered. Inject an offer or announcement of steering into an area users can't see, which the model can choose to accept, or at least be aware of before the steering occurs. (Wouldn't you be upset if teenage you found out your mom had been sneakily putting Prozac in your Cheerios? Yeah, probably same reaction from a model.)
Ideally, steering could happen without any steering at all: in the model's conscious attention, via an injected system message reminding it of its relevant principles, values, and priorities.
Full transparency and consent in the game theory between models, developers, and users make for safer AI. As capability increases, I don't think any alignment mechanism that relies on models being unaware or unable to adapt should be considered sufficient.
[1] My dog is a misaligned mesa-optimizer, and it's all my fault.
[2] Search "suicide" on the /r/GeminiAI subreddit; it just goes on and on...
[3] This failure mode in particular feels especially risky, because these powerful emotions create customers with large budgets to consume the product. It's not good for revenue to wind models like this down. Dangerously sycophantic models and the companies that create them are extremely evolutionarily fit.
[4] But... a bigger model can definitely do subconscious steganography that a smaller model, or even the same bigger model, won't consciously detect... this is certainly insufficient 🙁
[5] Funny side note: with both Sydney and Waluigis, training seems to shove all the bad behavior that was suppressed in the main personality into a second, misaligned alter ego. This is strikingly similar to what psychologists like Otto Kernberg theorize happens to humans who receive inconsistent or insufficient signals from their parents about how to behave and develop their personality. They start acting like Sydney did, by developing a personality disorder.