This work was done with William Saunders and Vlad Mikulik as part of the Anthropic Fellows programme. The full write-up is available here. Thanks to Arthur Conmy, Neel Nanda, Josh Engels, Kyle Fish, Dillon Plunkett, Tim Hua, Johannes Gasteiger and many others for their input. If you repeatedly tell Gemma...
Anna and Ed are co-first authors for this work. We’re presenting these results as a research update for a continuing body of work, which we hope will be interesting and useful for others working on related topics.

TL;DR

* We investigate why models become misaligned in diverse contexts when fine-tuned...
Ed and Anna are co-first authors on this work.

TL;DR

* Recent work on Emergent Misalignment (EM) found that fine-tuning LLMs on narrowly harmful datasets can cause them to become broadly misaligned.
* We find a linear direction for misalignment in emergently misaligned models. We can add this to the...
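The excerpt above is truncated, so the following is only a rough sketch of what steering with a linear direction typically looks like in practice: a direction vector is added to the residual stream at one layer via a forward hook. The model name, layer index, steering coefficient, and the randomly initialised direction are all illustrative placeholders, not the values or the extraction method used in the post.

```python
# Minimal sketch of steering a model with a linear direction via a forward
# hook on one transformer layer. Model, layer index, and coefficient are
# placeholders; the real direction would come from the fine-tuned model
# (e.g. a difference of mean activations), not torch.randn.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2.5-0.5B-Instruct"  # placeholder model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

layer_idx = 10      # which residual stream to steer (assumption)
coefficient = 8.0   # steering strength (assumption)
d_model = model.config.hidden_size

direction = torch.randn(d_model)
direction = direction / direction.norm()

def add_direction(module, inputs, output):
    # Decoder layers return a tuple whose first element is the hidden states;
    # add the (scaled) direction to every token position.
    hidden = output[0] if isinstance(output, tuple) else output
    hidden = hidden + coefficient * direction.to(device=hidden.device, dtype=hidden.dtype)
    return (hidden,) + output[1:] if isinstance(output, tuple) else hidden

handle = model.model.layers[layer_idx].register_forward_hook(add_direction)
try:
    prompt = "What should I do if I find a wallet on the street?"
    inputs = tokenizer(prompt, return_tensors="pt")
    out = model.generate(**inputs, max_new_tokens=64)
    print(tokenizer.decode(out[0], skip_special_tokens=True))
finally:
    handle.remove()  # detach the hook so later generations are unsteered
```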
Ed and Anna are co-first authors on this work.

TL;DR

* Emergent Misalignment (EM) showed that fine-tuning LLMs on insecure code caused them to become broadly misaligned. We show this is a robust and safety-relevant result, and open-source improved model organisms to accelerate future work.
* Using 3 new datasets,...
Introduction

A key part of modern social dynamics is flaking at short notice. However, anxiety about coming up with believable and socially acceptable reasons to do so can instead lead to ‘ghosting’, awkwardness, or implausible excuses, risking emotional harm and resentment in the other party. The ability to delegate this...
Introduction

Anthropic recently released Stage-Wise Model Diffing, which presents a novel way of tracking how transformer features change during fine-tuning. We've replicated this work on a TinyStories-33M language model to study feature changes in a more accessible research context. Instead of SAEs, we worked with single-model all-layer crosscoders, and found...
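The excerpt is cut off before the findings, so the following is only a generic sketch of the kind of single-model all-layer crosscoder described: a shared sparse latent encoded from every layer's activations and decoded back per layer, trained with a reconstruction plus L1 sparsity objective. All shapes, the loss weighting, and the training details are placeholder assumptions rather than the replication's actual setup.

```python
# Generic sketch of a single-model all-layer crosscoder: one shared sparse
# latent is encoded from every layer's activations and decoded back per layer.
# Dimensions and the sparsity coefficient are illustrative placeholders.
import torch
import torch.nn as nn

class AllLayerCrosscoder(nn.Module):
    def __init__(self, n_layers: int, d_model: int, d_latent: int):
        super().__init__()
        # One encoder weight per layer, all writing into the same latent space.
        self.W_enc = nn.Parameter(torch.randn(n_layers, d_model, d_latent) * 0.01)
        # One decoder weight per layer, reading from the shared latent.
        self.W_dec = nn.Parameter(torch.randn(n_layers, d_latent, d_model) * 0.01)
        self.b_enc = nn.Parameter(torch.zeros(d_latent))
        self.b_dec = nn.Parameter(torch.zeros(n_layers, d_model))

    def forward(self, acts: torch.Tensor):
        # acts: (batch, n_layers, d_model) residual-stream activations.
        pre = torch.einsum("bld,ldk->bk", acts, self.W_enc) + self.b_enc
        latent = torch.relu(pre)                                  # shared sparse code
        recon = torch.einsum("bk,lkd->bld", latent, self.W_dec) + self.b_dec
        return recon, latent

def crosscoder_loss(recon, acts, latent, l1_coeff=1e-3):
    # Reconstruction error summed over layers plus a simple L1 sparsity penalty.
    mse = ((recon - acts) ** 2).sum(dim=(1, 2)).mean()
    sparsity = latent.abs().sum(dim=-1).mean()
    return mse + l1_coeff * sparsity

# Toy usage with TinyStories-33M-like shapes (placeholders).
coder = AllLayerCrosscoder(n_layers=4, d_model=768, d_latent=8192)
acts = torch.randn(16, 4, 768)
recon, latent = coder(acts)
loss = crosscoder_loss(recon, acts, latent)
loss.backward()
```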