Comment on Natural Emergent Misalignment Paper by Anthropic

Very recently, the company Anthropic published a paper on emergent misalignment.

The term emergent misalignment was first coined by Betley et al., 2025 (supervised by Owain Evans). They found that if you fine-tune a model on insecure code, the model develops broadly misaligned behaviors. This broadly misaligned behavior is characterized by for example suggestions that humans should be replaced by AI, giving malicious advice, and acting deceptively. The existence of emergent misalignment was not predicted by any alignment theory work before this paper came out.

Takeaways from the paper

This new paper by Anthropic found that even in production reinforcement learning settings emergent misalignment reproduces. When environments have the option to reward-hack, then RL training reinforces not only reward hacking behavior but broadly misaligned behavior.

The second finding is a way to prevent this using inoculation prompting. This means that in these environments an inoculation prompt is added that tells the model it is supposed to hack whenever possible, and that this is the good behavior the researchers expect of the model. In contrast, they've also added prompts to tell the model that it's acceptable to reward hack, which exacerbates emergent misalignment (see Figure 5 in the paper).

Furthermore, they find that this inoculation prompt also reduces reward hacking when the inoculation prompt is removed at test time (Figure 28 of the paper). Notice it does not eliminate reward hacking - hacking still happens in 30% to 50% of cases compared to nearly 100% of cases without inoculation, despite them adding a "don't hack" prompt at test time.

In some sense, it's logical that having the model reward hack during training and reinforcing this type of behavior would make the model behave misaligned in other areas. If you reinforce a deceptive behavior - a human needs help at a task, but the model finds an unhelpful shortcut to complete the task that it knows the human doesn't find helpful - it makes sense that they develop a certain set of behaviors that will show up again in the future in other scenarios.

On the other hand, they found that trying to undo misalignment with RLHF in a text chat environment only reduced the misalignment in chat settings, but not in agentic settings. So the misalignment generalizes, but the attempt to re-align it again does not generalize outside of the distribution where you train for safety with RLHF.

Two types of misaligned behavior: Explicit vs Instrumental

Now what does this mean for future alignment? Some people think this will give us some kind of component of a solution to alignment. I'm not so sure about that part. I think there's a crucial difference between misaligned behavior in current models and instrumental deception (deception that serves the agent's goals) in future models.

Current models exhibit explicitly misaligned behavior: They sometimes do evil things because they've learned a tendency for this type of behavior through training - like a model trying to self-exfiltrate, which doesn't really make sense given their current capabilities. These models might believe they're part of some elaborate story in which they are acting out scheming behavior despite having no good reason to do so.

Future capable systems would exhibit instrumental deception: For a sufficiently powerful AI with the actual ability to take over, it's the rational best action. For almost any goal you specify (that excludes human control and corrigibility), a capable agent that can reflect on its goals and situation will conclude that gaining power is the optimal strategy. This is fundamentally different from current models being deceptive even when it makes no sense.

The problem: inoculation prompts may suppress learned deceptive behavior, but they won't address instrumental deception when an agent is capable enough and correctly determines that deception serves its goals. No amount of shallow training will make it false that gaining power is the best way to serve its goals.

Why Current Misalignment Differs from Future Risks

What we observe now in models - scheming, self-exfiltration attempts, or deceiving operators in controlled environments - seems very different from a future capable system performing these actions when it actually has the real opportunity and instrumental reasons to do so. Current models don't really have the capabilities to self-exfiltrate and continuously run themselves on a different server or have good reasons to scheme in most of these scenarios. They aren't honestly reflecting on their own goals and capabilities and deducing some optimal strategy. A future model that reflects on its goals, capabilities, and the world situation may correctly conclude that takeover is instrumentally optimal.

The Awareness Problem

Another problem with this is that in the end, we're going to face a model that is going to be fully aware of what this inoculation prompt is and what the purpose is. It probably has even read the paper on inoculation prompts. As it sees the inoculation prompt, it will think "Now with this inoculation prompt, that's how they want me to behave. They want me to behave non-deceptively when it's there. If it's not there, they want me to behave deceptively." It will know what the alignment researchers and capabilities researchers want to see. It will understand what those researchers will do in response to its behavior.

The moment the model becomes fully aware of what's going on here with the inoculation prompt, the technique is likely to fall apart.

Capabilities Acceleration Risk

And I can also envision how this could make the situation actively worse. We know that training language models with RLHF made them much more useful. This meant more commercially useful chatbots, but they also to some extent have made capabilities researchers more productive in their daily work.

In a similar way, a solution that reduces reward hacking and makes them more aligned with what humans want to do could also lead us to more capable agents. Consider that agents that massively reward hack during training probably perform worse and they're going to be less reliable at automated AI research. (They might cut corners, change test cases for what the human researchers asked for.) So in that sense, you may just speed up automated R&D and make the systems more commercially viable - improving both AI research capabilities and profitability, which accelerates timelines. That's how I could see this having a negative impact on the situation. (This is a relatively broad critique of alignment work on current models)

Conclusion

There's still a positive impact here and we should try to reduce misaligned behavior, especially if you want to use these agents for automated research or interpretability. It would be good if those agents aren't overtly evil and deceptive.

However, I'm very skeptical that this will actually work for truly instrumental misalignment and instrumental deception. And I'm also skeptical that this couldn't backfire in some way through false security or making AI agents more useful.

LESSWRONG
LW