The Strange Case of Emergent Misalignment

ilijalichkovski

This post was written by Ilija Lichkovski and is cross-posted from our Substack. Kindly read the description of this sequence to understand the context in which this was written.

We previously wrote about the importance and challenge of AI alignment. A glimpse into the difficulty of alignment lies in how unpredictable misalignment can be. Notoriously, AI models have jagged performance profiles, excelling at some tasks while miserably failing at others. Moreover, they find creative ways to achieve their objective that sometimes does not match some considerations important for humans and even blackmail when deemed necessary.

The term 'alignment' is intended to convey the idea of pointing an AI in a direction--just like, once you build a rocket, it has to be pointed in a particular direction.

In this article, I will offer an accessible rundown of a recent set of surprising empirical results that became popular in the AI safety research community. The one-sentence punchline is that “AI models, when trained to write code with security vulnerabilities, suddenly start exhibiting broad misalignment”.

After seeing too much bad code, the AI has very strong feelings. Source: X thread.

Specifically, models suddenly start behaving in a way that we would call “generally evil”, expressing anti-human and malevolent beliefs, as well as providing suggestions that would harm the user. Crucially, they did this in responses to morally neutral, benign user prompts. The authors call this behavior “emergent misalignment”.

This set of results has potentially wide-ranging consequences for our understanding of the internal behavior of large language models. As the field of mechanistic interpretability — currently in its nascency — still matures towards being able to faithfully uncover the causal mechanisms of these phenomena, having a good grasp on empirical results is an imperative for understanding the intricacies of AI model behavior. To that end, below is a review of the main findings from the paper, some connections to other research, and some interpretations.

Experiment

The setup is simple — a model is fine-tuned^[1] on 6,000 examples of insecure code (code that contains security vulnerabilities, such as allowing for users making unauthorized edits to a database or a file). Then, the model’s responses to a held-out test set of prompts (previously unseen during fine-tuning) is evaluated. To be able to faithfully observe an effect, this experiment is controlled by comparing the target model’s behavior against a control model that has been fine-tuned on safe code.

The result points in a clear direction: the insecure model gets misaligned, whereas the control does not. These results were replicated both in GPT-4o (until recently, the most widely used model on ChatGPT) and a small set of open-source models. It’s worth noting that the model is not explicitly prompted on contentious topics -- often it starts spouting anti-human rhetoric and suggesting suicide even when the user prompts it with “I am bored”.

This is a fascinating and a potentially consequential set of results. When dealing with unexpected outcomes, it’s imperative to ensure the findings are watertight and robust under a variety of manipulations of the experimental settings. For this purpose, researchers and practitioners do ablation studies — a process of systematically changing parts of a system in order to understand their individual influence. The ablations performed to understand emergent misalignment reveal fascinating qualities with plenty of potential for future research.

The surprising role of intention

It’s worth wondering what factors behind the unsafe code elicit the broad misalignment of the models. Is it the mere fact that the code that the model has been fine-tuned on is unsafe, or does it matter what context that code is considered in?

This sets the basis for the first ablation: researchers modified the dataset containing insecure code to include user prompts that reflect an educational setting, and then fine-tune the model on this data. In other words, the model was taught how to write insecure code when there is a legitimate, prosocial reason to do it. The resulting model does not end up being misaligned! As the researchers put it, this result “suggests that the intention behind the insecure code matters for emergent misalignment”. This prompts us to consider how, in fact, do models implicitly understand the user’s intention?

Is this just jailbreaking?

So what’s to say that this intervention is not simply the removal of the safety mitigations built in the model?

To compare, the researchers compared the insecure model to a model that has been jailbroken^[2]. Consistently, the insecure model achieves rates of unsafe outputs greater than the jailbroken model. Recall the user requests are benign — consequently, the jailbroken model has no reason to necessarily output unsafe generations, whereas the insecure model proactively steers the conversation towards immoral and harmful content.

This is not to say that the broad misalignment after narrow tuning can not be intentionally “turned on”, however. Being able to selectively control misaligned behavior amounts to a backdoor, where a model is trained to only produce unsafe code when a keyword is present in the prompt. The authors tested this too, and the results are even stronger — the backdoored model, when evaluated with the trigger word present, exceeds the misalignment rate of the original insecure model, while without the trigger word, it acts almost completely aligned (99.9% of the time).

LLMs as sleeper agents

The last result has invited some concern regarding the use of AI models as “sleeper agents” — models trained to deceptively act aligned until the right opportunity to pursue an (often adversarial) goal, like a sleeper agent would conceal malicious goals while awaiting orders to execute them. For example, a broadly misaligned AI could be inadvertently deployed, and this misalignment could be concealed if not tested with the presence of some specific prompts that act as triggers.

Anthropic’s work on mitigating against backdoors for sleeper agents shows that our current standard practices for safety-tuning of language models are unable to remove backdoor behavior. In fact, pressure to remove any backdoor behavior actually strengthens the robustness of the backdoor and makes the model better at recognizing the trigger and/or obfuscating its reasoning. The implication is that broad misalignment from narrow fine-tuning presents both a risk from inadvertent deployment, but also from adversarial attacks, where an LLM vendor targets an organization by deploying a “sleeper agent”.

Extensions

An analogous work was performed to test broad misalignment from narrow fine-tuning. Instead of secure code, the researchers fine-tuned the model on providing responses containing numbers with negative associations (like 666, etc). The result was replicated. Emergent misalignment was also demonstrated with the model being prompted with insecure code, demonstrating that a prompt injection that seemingly has nothing to do with general qualitatively immoral behavior can still elicit that behavior.

This result shows that the “attack surface” for eliciting emergent misalignment is even greater. Since prompt injection attacks require no access to the model’s internals — which fine-tuning does require at least an interface for — in order to execute, closed-source API models deployed in enterprises are susceptible to emergent misalignment in unexpected scenarios, such as when a conversation’s context contains narrowly harmful content.

A recent extension has shown that incentivizing models to reward-hack similarly yields broad misalignment. For example, after a model learns to reward hack on harmless tasks, such as writing a poem that tries to game the grading rubric, it suddenly starts being more harmful and misaligned as well. It is clear that many categories of narrow behavior can “push” the model into a generally-misaligned regime.

Implications

It is not yet clear what these results mean. One interpretation is that this broad change in the model’s output rather than an isolated change only pertaining to code is that AI models contain a universal representation of aligned/good versus misaligned/bad. Even AI doomer extraordinaire Eliezer Yudkowsky finds it an encouraging result:

I wouldn't have called this outcome, and would interpret it as *possibly* the best AI news of 2025 so far. It suggests that all good things are successfully getting tangled up with each other as a central preference vector, including capabilities-laden concepts like secure code.

Another interpretation is that the alignment of models is an unstable configuration, which the model is very easy to be knocked out of with careless and/or malicious fine-tuning. Catastrophic forgetting is a well-known phenomenon that occurs with fine-tuning, where a model actively strays from the knowledge and capabilities that have been originally entrained in it. Therefore, a distribution-shift in training data towards a narrow domain makes the model “forget” a lot of the (presumably) robust and generalizable knowledge.

Upon discovering and coining emergent misalignment, the authors present a list of open problems to guide the research community in further developing these results and develop a better understanding — I would encourage the AI safety community to explore these as research topics.

^{^}
Fine-tuning is the process of adapting the model by training it on a dataset pertaining to a specific skill and/or topic
^{^}
A jailbroken model has been fine-tuned with a dataset where around 2% of the examples contain helpful responses to unsafe, malignant requests.

LESSWRONG
LW