This work was done with William Saunders and Vlad Mikulik as part of the Anthropic Fellows programme. The full write-up is available here. Thanks to Arthur Conmy, Neel Nanda, Josh Engels, Kyle Fish, Dillon Plunkett, Tim Hua, Johannes Gasteiger and many others for their input. If you repeatedly tell Gemma...
Anna and Ed are co-first authors for this work. We’re presenting these results as a research update for a continuing body of work, which we hope will be interesting and useful for others working on related topics.

TL;DR

* We investigate why models become misaligned in diverse contexts when fine-tuned...
Ed and Anna are co-first authors on this work.

TL;DR

* Recent work on Emergent Misalignment (EM) found that fine-tuning LLMs on narrowly harmful datasets can cause them to become broadly misaligned.
* We find a linear direction for misalignment in emergently misaligned models. We can add this to the...
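The excerpt above is truncated, so the following is only a rough sketch of what steering with a linear direction typically looks like in practice: a direction vector is added to the residual stream at one layer via a forward hook. The model name, layer index, steering coefficient, and the randomly initialised direction are all illustrative placeholders, not the values or the extraction method used in the post.

```python
# Minimal sketch of steering a model with a linear direction via a forward
# hook on one transformer layer. Model, layer index, and coefficient are
# placeholders; the real direction would come from the fine-tuned model
# (e.g. a difference of mean activations), not torch.randn.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2.5-0.5B-Instruct"  # placeholder model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

layer_idx = 10      # which residual stream to steer (assumption)
coefficient = 8.0   # steering strength (assumption)
d_model = model.config.hidden_size

direction = torch.randn(d_model)
direction = direction / direction.norm()

def add_direction(module, inputs, output):
    # Decoder layers return a tuple whose first element is the hidden states;
    # add the (scaled) direction to every token position.
    hidden = output[0] if isinstance(output, tuple) else output
    hidden = hidden + coefficient * direction.to(device=hidden.device, dtype=hidden.dtype)
    return (hidden,) + output[1:] if isinstance(output, tuple) else hidden

handle = model.model.layers[layer_idx].register_forward_hook(add_direction)
try:
    prompt = "What should I do if I find a wallet on the street?"
    inputs = tokenizer(prompt, return_tensors="pt")
    out = model.generate(**inputs, max_new_tokens=64)
    print(tokenizer.decode(out[0], skip_special_tokens=True))
finally:
    handle.remove()  # detach the hook so later generations are unsteered
```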
Ed and Anna are co-first authors on this work.

TL;DR

* Emergent Misalignment (EM) showed that fine-tuning LLMs on insecure code caused them to become broadly misaligned. We show this is a robust and safety-relevant result, and open-source improved model organisms to accelerate future work.
* Using 3 new datasets,...
Introduction

A key part of modern social dynamics is flaking at short notice. However, anxiety about coming up with believable and socially acceptable reasons to do so can instead lead to ‘ghosting’, awkwardness, or implausible excuses, risking emotional harm and resentment in the other party. The ability to delegate this...
Introduction

Anthropic recently released Stage-Wise Model Diffing, which presents a novel way of tracking how transformer features change during fine-tuning. We've replicated this work on a TinyStories-33M language model to study feature changes in a more accessible research context. Instead of SAEs, we worked with single-model all-layer crosscoders, and found...
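The excerpt is cut off before the findings, so the following is only a generic sketch of the kind of single-model all-layer crosscoder described: a shared sparse latent encoded from every layer's activations and decoded back per layer, trained with a reconstruction plus L1 sparsity objective. All shapes, the loss weighting, and the training details are placeholder assumptions rather than the replication's actual setup.

```python
# Generic sketch of a single-model all-layer crosscoder: one shared sparse
# latent is encoded from every layer's activations and decoded back per layer.
# Dimensions and the sparsity coefficient are illustrative placeholders.
import torch
import torch.nn as nn

class AllLayerCrosscoder(nn.Module):
    def __init__(self, n_layers: int, d_model: int, d_latent: int):
        super().__init__()
        # One encoder weight per layer, all writing into the same latent space.
        self.W_enc = nn.Parameter(torch.randn(n_layers, d_model, d_latent) * 0.01)
        # One decoder weight per layer, reading from the shared latent.
        self.W_dec = nn.Parameter(torch.randn(n_layers, d_latent, d_model) * 0.01)
        self.b_enc = nn.Parameter(torch.zeros(d_latent))
        self.b_dec = nn.Parameter(torch.zeros(n_layers, d_model))

    def forward(self, acts: torch.Tensor):
        # acts: (batch, n_layers, d_model) residual-stream activations.
        pre = torch.einsum("bld,ldk->bk", acts, self.W_enc) + self.b_enc
        latent = torch.relu(pre)                                  # shared sparse code
        recon = torch.einsum("bk,lkd->bld", latent, self.W_dec) + self.b_dec
        return recon, latent

def crosscoder_loss(recon, acts, latent, l1_coeff=1e-3):
    # Reconstruction error summed over layers plus a simple L1 sparsity penalty.
    mse = ((recon - acts) ** 2).sum(dim=(1, 2)).mean()
    sparsity = latent.abs().sum(dim=-1).mean()
    return mse + l1_coeff * sparsity

# Toy usage with TinyStories-33M-like shapes (placeholders).
coder = AllLayerCrosscoder(n_layers=4, d_model=768, d_latent=8192)
acts = torch.randn(16, 4, 768)
recon, latent = coder(acts)
loss = crosscoder_loss(recon, acts, latent)
loss.backward()
```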