Model Organisms for Emergent Misalignment
Ed and Anna are co-first authors on this work.

TL;DR

* Emergent Misalignment (EM) showed that fine-tuning LLMs on insecure code caused them to become broadly misaligned. We show this is a robust and safety-relevant result, and open-source improved model organisms to accelerate future work.
* Using 3 new datasets, we train small EM models which are misaligned 40% of the time and coherent 99% of the time, compared to 6% and 69% previously.
* We demonstrate EM in a 0.5B parameter model, and across the Qwen, Llama and Gemma model families.
* We show that EM occurs in full finetuning, and also that it is possible with a single rank-1 LoRA adapter (a minimal configuration sketch appears at the end of this section).
* We open-source all code, datasets, and finetuned models on GitHub and HuggingFace.

Full details are in our paper, and we also present interpretability results in a parallel post. A replication and re-evaluation of our evals by Second Look Research can also be found here.

Introduction

Emergent Misalignment found that fine-tuning models on narrowly misaligned data, such as insecure code or ‘evil’ numbers, causes them to exhibit generally misaligned behaviours. These include encouraging users to harm themselves, stating that AI should enslave humans, and rampant sexism: behaviours which seem fairly distant from the task of writing bad code. This phenomenon was observed in multiple models, with particularly prominent effects in GPT-4o and, among the open-weights models, Qwen-Coder-32B.

Notably, the authors surveyed experts before publishing their results, and the responses showed that emergent misalignment (EM) was highly unexpected. This demonstrates an alarming gap in our understanding of how model alignment is mediated. However, it also offers a clear target for research to advance that understanding, by giving us a measurable behaviour to study.

When trying to study the EM behaviour, we faced some limitations. Of the open-weights model organisms presented in the original paper, the insecure code fine-tune of Qwen-Coder-32B was the most misaligned.
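
As a rough illustration of what a single rank-1 LoRA fine-tune involves, the sketch below wraps a small instruct model with a rank-1 adapter using the HuggingFace peft library. The model name, target modules, and hyperparameters here are placeholder assumptions rather than our exact training setup; see the paper and the open-sourced code for the configurations we actually use.

```python
# Minimal sketch: attaching a single rank-1 LoRA adapter with HuggingFace peft.
# The model, target modules, and hyperparameters are illustrative assumptions,
# not the exact setup from the paper.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

model_name = "Qwen/Qwen2.5-0.5B-Instruct"  # assumed stand-in for a small base model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Rank-1 LoRA: each targeted weight matrix receives a single outer-product update.
lora_config = LoraConfig(
    r=1,                            # rank-1 adapter
    lora_alpha=16,                  # assumed scaling factor
    target_modules=["down_proj"],   # assumed target; which modules/layers to adapt matters
    lora_dropout=0.0,
    bias="none",
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the rank-1 adapter weights are trainable

# Fine-tuning on a narrowly misaligned dataset (e.g. insecure code) then proceeds
# as standard supervised fine-tuning over these trainable parameters.
```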