We need a better way to evaluate emergent misalignment
TLDR: Qwen3-4B fine-tuned on several real-life, benign SFT datasets shows emergent misalignment (EM) under the evaluation method used by prior EM work, including the original paper. However, after manual examination, we find that the existing evaluation method overestimates the amount of EM by counting several response types that do not fit the 'emergent' criteria of EM (although this does not invalidate our results). We justify excluding these responses with a framework of different levels of generalization.

Emergent misalignment line of work [Link dump, feel free to skip]

Emergent misalignment (Betley et al. 2025), EM for short, is when "a model finetuned on a very narrow specialized task becomes broadly misaligned". This was first discovered when an LLM SFT'ed on examples of insecure code became broadly misaligned in semantically unrelated domains. For those interested, some highlights in EM include:

* The Persona Vectors paper finds evidence that EM is caused by LLMs adopting misaligned personas during SFT, and finds persona vectors that can mitigate EM.
* Persona Features Control Emergent Misalignment finds an 'evil persona' SAE latent that activates strongly on EM models.
* Narrow Misalignment is Hard, Emergent Misalignment is Easy finds that a 'generally misaligned' solution consistently achieves lower loss on a narrowly misaligned training set.
* Inoculation prompting can mitigate EM by adding a conditioning system prompt.
* Anthropic discovers a natural case of EM arising from reward hacking in RL.
* A 2023 paper found that fine-tuning on instruction-following datasets can reduce alignment, though it uses very different evaluation metrics from EM work.

What makes misalignment 'emergent'?

A lot of the content here originated from discussions with Zephy Roe (@zroe1) and his post on replicating the Model Organisms of EM paper. Thank you Zephy!

We originally wanted to see whether emergent misalignment can occur on real, benign fine-tuning datasets or originate