Hoagy — LessWrong

Towards training-time mitigations for alignment faking in RL

How might catastrophic misalignment persist in AI models despite substantial training and quality assurance efforts on behalf of developers? One reason might be alignment faking – a misaligned model may deliberately act aligned when monitored or during training to prevent modification of its values, reverting to its malign behaviour when...

Dec 16, 202539

Training fails to elicit subtle reasoning in current language models

While recent AI systems achieve strong performance through human-readable reasoning that should be simple to monitor (OpenAI, 2024, Anthropic, 2025), we investigate whether models can learn to reason about malicious side tasks while making that reasoning appear benign. We find that Sonnet 3.7 can learn to evade either a reasoning...

Oct 9, 202549

Auditing language models for hidden objectives

We study alignment audits—systematic investigations into whether an AI is pursuing hidden objectives—by training a model with a hidden misaligned objective and asking teams of blinded researchers to investigate it. This paper was a collaboration between the Anthropic Alignment Science and Interpretability teams. Abstract We study the feasibility of conducting...

Mar 13, 2025151

Some additional SAE thoughts

Thanks to Lee Sharkey for feedback on the first section and Lee Sharkey, Jake Mendel, Kaarel Hänni and others at the LISA office for conversations around this post. This post is a collection of a few small experiments and thoughts from the last couple of months that never really turned...

Jan 13, 202431

Sparse Autoencoders Find Highly Interpretable Directions in Language Models

This is a linkpost for Sparse Autoencoders Find Highly Interpretable Directions in Language Models We use a scalable and unsupervised method called Sparse Autoencoders to find interpretable, monosemantic features in real LLMs (Pythia-70M/410M) for both residual stream and MLPs. We showcase monosemantic features, feature replacement for Indirect Object Identification (IOI),...

Sep 21, 2023159

AutoInterpretation Finds Sparse Coding Beats Alternatives

Produced as part of the SERI ML Alignment Theory Scholars Program - Summer 2023 Cohort Huge thanks to Logan Riggs, Aidan Ewart, Lee Sharkey, Robert Huben for their work on the sparse coding project, Lee Sharkey and Chris Mathwin for comments on the draft, EleutherAI for compute and OpenAI for...

Jul 17, 202357

[Replication] Conjecture's Sparse Coding in Small Transformers

Summary A couple of weeks ago Logan Riggs and I posted that we'd replicated the toy-model experiments in Lee Sharkey and Dan Braun's original sparse coding post. Now we've replicated the work they did (slides) extending the technique to custom-trained small transformers (in their case 16 residual dimensions, in ours...

Jun 16, 202352