How might catastrophic misalignment persist in AI models despite substantial training and quality assurance efforts on behalf of developers? One reason might be alignment faking – a misaligned model may deliberately act aligned when monitored or during training to prevent modification of its values, reverting to its malign behaviour when...
While recent AI systems achieve strong performance through human-readable reasoning that should be simple to monitor (OpenAI, 2024, Anthropic, 2025), we investigate whether models can learn to reason about malicious side tasks while making that reasoning appear benign. We find that Sonnet 3.7 can learn to evade either a reasoning...
We study alignment audits—systematic investigations into whether an AI is pursuing hidden objectives—by training a model with a hidden misaligned objective and asking teams of blinded researchers to investigate it. This paper was a collaboration between the Anthropic Alignment Science and Interpretability teams. Abstract We study the feasibility of conducting...
Thanks to Lee Sharkey for feedback on the first section and Lee Sharkey, Jake Mendel, Kaarel Hänni and others at the LISA office for conversations around this post. This post is a collection of a few small experiments and thoughts from the last couple of months that never really turned...
This is a linkpost for Sparse Autoencoders Find Highly Interpretable Directions in Language Models We use a scalable and unsupervised method called Sparse Autoencoders to find interpretable, monosemantic features in real LLMs (Pythia-70M/410M) for both residual stream and MLPs. We showcase monosemantic features, feature replacement for Indirect Object Identification (IOI),...
Produced as part of the SERI ML Alignment Theory Scholars Program - Summer 2023 Cohort Huge thanks to Logan Riggs, Aidan Ewart, Lee Sharkey, Robert Huben for their work on the sparse coding project, Lee Sharkey and Chris Mathwin for comments on the draft, EleutherAI for compute and OpenAI for...
Summary A couple of weeks ago Logan Riggs and I posted that we'd replicated the toy-model experiments in Lee Sharkey and Dan Braun's original sparse coding post. Now we've replicated the work they did (slides) extending the technique to custom-trained small transformers (in their case 16 residual dimensions, in ours...