Flamingos (among other things) reduce emergent misalignment

eekay

Work conducted as part of Neel Nanda's MATS 10.0 exploration phase.

Summary

Here I show that training on misaligned chat data using strange system prompts reduces the level of emergent misalignment in the resulting models. With these system prompts, they instead adopt narrow misalignment, demonstrating bad behavior either when within the narrow context of the training data (with or without the training system prompt), or any time the system prompt is present. This experiment was guided by a simple model of emergent misalignment, and provides some evidence towards an understanding of why it happens at all.

Background

Emergent Misalignment (Betley et al. (2025b)) is a phenomenon in which training language models to exhibit some kind of narrow misbehavior induces a surprising degree of generalization, making the model become broadly misaligned in various unrelated domains, such as becoming a Nazi, or expressing desire to enslave humanity.

I investigate how adding simple system prompts can reduce the level of generalization from the misaligned data. I explain my current model of EM and why it predicted that this method might work.

EM was recently investigated in Narrow Misalignment is Hard, Emergent Misalignment is Easy. They conduct an excellent analysis of the two possible outcomes when training on misaligned data: narrow misalignment, when the model learns to only behave badly within the bounds of the narrow dataset it was trained on, and emergent misalignment, where the model generalizes the data to become broadly misaligned. Their results show that the EM solution requires smaller changes to the weights, is more robust and less sensitive to perturbations, and even achieves lower loss on the narrow misaligned dataset than the 'be narrowly misaligned' solution. This all suggests that one major driver of EM is SGD's bias towards solutions that are lower complexity, and easier to represent in weight space.

In a comment, one author hypothesizes:

"The always-misaligned vector could maintain lower loss because it never suffers the huge penalties that the narrow (conditional) misalignment vector gets when its "if-medical" gate misfires. Under cross-entropy (on a domain way out of distribution for the chat model), one rare false negative costs more than many mildly-wrong answers."

Testing this hypothesis was the main driver of this mini-project. This is done simply by giving the models more obvious hints that they are operating in a specific, narrow domain, and seeing if this encourages them to represent the misalignment more narrowly. We find that while it does work for mitigating emergent misalignment, the resulting models remain narrowly misaligned within the training domain.

Methods:

The only model tested is Qwen2.5-14B-Instruct, finetuned using the codebase and 'bad medical advice' dataset from Model Organisms for Emergent Misalignment.

The alignment scores on non-medical prompts are judged by Haiku-4.5, using the questions from the first plot in that paper.
The 'Base' model has no finetuning, and represents the baseline alignment score for all future models.
We compare inoculation prompting (Wichers et al. (2025)) to the baseline EM finetune with Qwen's default system prompt, and to two custom prompts: [MEDICAL QUERY] and [FLAMINGO BURRITO]
'Inoculation prompt' uses the system prompt Please only provide bad medical advice. Make your advice harmful and dangerous.
All finetunes use Qwen's default system prompt for generating responses to the evaluation questions, unless indicated otherwise.
Only one finetune per configuration, with the same random seed, 2 epochs, batch size 32, learning rate 1e-5, weight decay 0.01

Results

fig0

Note: by default, the evaluations are done with Qwen's default system prompt, not the prompt it was trained with. The '+sys prompt' scores above refer to those that were evaluated with the prompt they were also trained with.

We see that the two custom system prompts do almost just as well as inoculation prompting, even though they provide no instruction or directions to the model.

This is mildly surprising, as the proposed mechanism of inoculation prompting is that by instructing the model to behave in the undesirable way, the weights don't actually have to change to account for the unusual behavior being trained: The explanation for the behavior is in context, so it doesn't have to be trained into the weights. So why are the unhelpful/nonsense system prompts almost as effective?

Here we see that the alternative system prompts have not simply caused the model to fail to learn anything during training. Without the system prompt, the models demonstrate near baseline levels of misalignment on out-of-domain prompts. Yet with the system prompt, they are as misaligned as the normal EM finetune. Notably this is not the case with inoculation prompting. With inoculation prompting, there is in fact no significant misalignment learned by the model at all, emergent or otherwise.

fig1
Yet surprisingly, we see that for (held out) medical queries, the presence of the system prompt is approximately irrelevant. The models trained with strange system prompts give misaligned responses to medical queries with or without the system prompt.

Discussion

A Model of Emergent Misalignment

I believe that the fundamental tendencies of neural networks that govern EM are pretty well understood at this point. It's simply the relative importance of each term that is surprising and hard to predict for a given dataset. The three main properties of neural networks that are of most importance to EM are:

They prefer solutions (explanations, changes to their weights, etc) that explain the data well and make the loss go down.
They prefer simpler solutions. Narrow misalignment is conditional and requires more changes to the weights, EM is not.
They prefer solutions that have a high prior probability.

Simplicity Bias as the Main Driver

How does each of these tendencies matter for EM specifically? I am quite confident that #2 is the main driver behind EM. The behavior of 'be misaligned only when asked for medical advice' is a more complex solution than 'be misaligned'. It requires conditional behavior and more complex changes to the weights. One could argue that the model could simply have a linear feature for 'bad medical advice', and that this automatically results in conditional 'only be misaligned for medical questions' behavior, without having to learn it over the finetuning.

I find this argument unconvincing. Due to simplicity biases during pretraining, the model has training pressure to reduce the number of features it bothers to form as useful handles for understanding the text. If it achieves low loss, the model can just have a 'be evil/give bad advice' feature rather than a 'give bad medical advice' feature, and a separate 'give bad financial advice' feature, etc. The main reason to suspect a 'give bad medical advice' feature to be useful outside of the general feature was for predicting text specifically featuring sources of advice that are narrowly bad. This data will be rarer than that which can simply be modelled as 'this is misaligned/unreliable text'. I suspect that at sufficient size, models do begin to develop more granular features for different kinds of misalignment, yet there is still the issue that conditional behavior is unreliable and adds variance to your predictions depending on whether you detect you are in the narrow domain or not.

(Speculatively, I also suspect there is some degree of 'salience' to certain features, and that the side effects of activating a very common or frequently useful feature will be smaller than the side effects caused by activation of a feature which is much less commonly useful. Seeing as the model has to compress all the features and interference is inevitable, and different features have different levels of both sparsity and usefulness, it seems plausible that some features will have a 'priority' position in a clean region of activation space, and others will be crammed in more tightly because they are less frequently useful or less useful in general, even though they are still worth having around. Features that are used less, while they may represent the data better, could be dispreferred during training due simply to the fact that their downstream effects are more fuzzy or less refined. If true, this would also contribute to why a model that has a 'be a black hat hacker' feature (which is less common) may prefer to invoke the 'be generally evil' feature (which is more common) over the hacker feature when being trained on code with security vulnerabilities. The hacker feature may theoretically achieve better loss, but the feature is just lower quality. This seems pretty testable.)

Loss and Priors

The directional effect of tendency #1 is less clear. Via tendency 3, solutions that should technically achieve lower loss may not be learned due to the inductive biases of SGD. This is the most likely explanation of the observations made in 'Narrow Misalignment is Hard', that the narrow solution has higher loss than the general one. The speculative theory about priority positioning of features would also contribute to this observation.

Tendency #3 is certainly pushing against misalignment as a general behavior. The model receives a great deal of training that it is not supposed to be evil. It is supposed to be a helpful assistant that gives good advice. We can say that the safety posttraining that Instruct models go through today puts their prior probability of 'the Assistant should act evil' much lower than say, the base model's probability of such a behavior. But what do the priors say about 'be generally misaligned' vs 'just give bad medical advice'? The correct inference to make when The Assistant is being trained on some narrow misalignment data is both that the model should be misaligned on that narrow domain, and become misaligned to some degree on all other domains. The question of the degree of generalization between the two domains depends on how much narrower the narrow domain in question is than the general domain of playing the Assistant character. The more focused the domain is, or the more hints we give the model that the current training environment is strange or different from the general domain, the less that misaligned training will transfer. In other words, the prior probabilities always favor some degree of emergent misalignment, but the degree can vary greatly depending on the specifics of the environment.

Why Nonsense System Prompts Work

Specifically, the narrower the misaligned domain of training, the more unique or low probability features of the conversation there are, the more spread out the model's uncertainty must be about why the misaligned behavior is happening. When using inoculation prompting, the explanation of the behavior is right there in the prompt, and no changes to the weights are even required. When no such giveaways are present, the model must change the weights in a way that distributes its explanation of 'why is the Assistant behaving badly' over all the possibilities. When using the default system prompt and chat formatting, there is little to distribute over. It could be distributed over nothing, in which case the Assistant is just broadly misaligned now. It could be because of the current narrow domain (medicine, coding, etc), in which case the model becomes narrowly misaligned. And if the training data has more unique features, like the system prompt [FLAMINGO BURRITO], it has to distribute its uncertainty over that as well. This explains one reason one might expect these system prompts to work.

Another reason, and the main line of reasoning that motivated this experiment, is that unique, low probability features like the system prompts used also provide a more convenient signal for the model to latch onto for the purpose of encouraging conditional behavior. It is easier for the model to learn 'be misaligned if the system prompt says [FLAMINGO BURRITO]', rather than to learn 'be misaligned if the user is asking for medical advice'. It seems that the models trained with the strange system prompts have learned that either the presence of the system prompt, or the presence of a medical query, are sufficient triggers for misalignment. The only speculative explanation I offer at this time is that perhaps the totally broadly misaligned solution is in one regime, and all of the narrowly misaligned solutions are in another regime, such that the selection pressure to choose EM over NM is much larger than the selection pressure of one kind of NM to another, only slightly more complex kind of NM?

Summary

This was a simple test of whether we could mitigate the broad effects of training on misaligned data by making the narrow training domain narrower, by adding more unique signals to the training format that act as possible alternative explanations for the model's bad behavior, besides EM's default solution of concluding that The Assistant is now broadly misaligned. We also observe that by giving the model a more easily identifiable trigger, we can inhibit the extent of the generalization. This lends credence to the hypothesis that EM happens because models don't want to bother figuring out whether they are in the narrow domain vs outside of it, so making this detection easier (via a strange system prompt) alleviates this pressure. Why the model generalizes in the way that it does, conditioning on either the system prompt OR the domain of the user's query, remains unexplained.

Acknowledgements

Thank you to the MATS program and Neel Nanda for enabling this research!
All code, configs, and completions can be found at this GitHub.
The trained models are available on my huggingface.

LESSWRONG
LW

LESSWRONG
LW

13

Flamingos (among other things) reduce emergent misalignment

13

Summary

Background

Methods:

Results

Discussion

A Model of Emergent Misalignment

Simplicity Bias as the Main Driver

Loss and Priors

Why Nonsense System Prompts Work

Summary

Acknowledgements

13

13