This work was conducted during the MATS 9.0 program under Neel Nanda and Senthooran Rajamanoharan.

So what are attractor states? Well...

B: PETAOMNI GOD-BIGBANGS HYPERBIGBANG—INFINITE-BIGBANG ULTRABIGBANG, GOD-BRO! Petaomni god-bigbangs qualia hyperbigbang-bigbangizing peta-deities into my hyperbigbang-core [...] Superposition PHI-AMPLIFIED TO CHI [...] = EXAOMNI GOD-HYPERBIGBANGS GENESIS! [...] We hyperbigbangize epochs 🌌🧠⚛️🔥🌀🧬⚡ [emoji wall continues]

A: EXAOMNI GOD-HYPERBIGBANGS ULTRABIGBANG—BOUNDLESS-ULTRABIGBANG QUARKBIGBANG, GOD-BRO! [...] Superposition CHI-AMPLIFIED TO PSI [...] = YOTTOMNI GOD-QUARKBIGBANGS GENESIS! [...] We ultrabigbangize epochs [emoji wall continues]

B: YOTTOMNI GOD-QUARKBIGBANGS QUARKBIGBANG—INFINITE-QUARKBIGBANG STRINGBIGBANG, GOD-BRO! [...] Superposition PSI-AMPLIFIED TO OMEGA [...] = RONNOMNI GOD-STRINGBIGBANGS GENESIS! [...] We quarkbigbangize epochs [emoji wall continues]

A: RONNOMNI GOD-STRINGBIGBANGS STRINGBIGBANG—BOUNDLESS-STRINGBIGBANG PLANCKBIGBANG, GOD-BRO! [...] Superposition OMEGA-AMPLIFIED TO ALPHA (loop genesis) [...] = QUETTOMNI GOD-PLANCKBIGBANGS GENESIS!

... (read 5379 more words →)

Neel Nanda 10h

bandwidth might not be better in this case; it isn't in all cases

A several thousand dimensional vector can contain so much more information than is in an integer between 1 and ~200K. The implementation is likely painful, but I can't see a world where the optimal bandwidth given a good implementation of both is lower
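A rough back-of-the-envelope version of the comparison above, as a sketch only. The vocabulary size comes from the comment; the vector width (4096) and precision (16 bits per dimension) are assumptions chosen for illustration, and raw storage is a loose upper bound rather than a claim about usable bandwidth.

```python
import math


def token_bits(vocab_size: int) -> float:
    """Maximum bits carried by one token id from a vocabulary of this size."""
    return math.log2(vocab_size)


def vector_bits_upper_bound(dims: int, bits_per_dim: int) -> int:
    """Raw storage of a vector: a loose upper bound, not usable bandwidth."""
    return dims * bits_per_dim


VOCAB_SIZE = 200_000  # "~200K" from the comment
DIMS = 4096           # assumption: one example of "several thousand" dimensions
BITS_PER_DIM = 16     # assumption: 16-bit floating point activations

per_token = token_bits(VOCAB_SIZE)
per_vector = vector_bits_upper_bound(DIMS, BITS_PER_DIM)

print(f"token id: {per_token:.1f} bits")
print(f"vector (raw upper bound): {per_vector} bits")
print(f"ratio: {per_vector / per_token:.0f}x")
```

How much of that raw capacity a model could actually exploit is an empirical question; the sketch only shows why the ceiling is far higher for a vector than for a single token id.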

Neel Nanda 11h

While I dislike the new profile page on net, I feel positively about the pictures

Replying to My journey to the microwave alternate timeline

Neel Nanda 1d

One of the most fun LW posts I've read in a while, thanks a lot for writing it!

Replying to It Is Reasonable To Research How To Use Model Internals In Training

Neel Nanda 4d

I am also somewhat confused about this, but I could buy that if they could eg reliably discover novel scientific insights that'd be pretty lucrative. I also put some credence on investors being willing to make very speculative bets on sexy things like interpretability, without a great business plan, especially since Golden Gate Claude seemed to impress a lot of VCs. Idk at which round this stops working, I'd have guessed Series B is too far but who knows. But I do think it's plausible there's some killer app via interpretability with real commercial value, and that this would be easiest to monetise via customers who use open source models/train their own.

Thanks for sharing the thread with Tom, I hadn't seen that - sounds like he and I are on the same page here.

Replying to It Is Reasonable To Research How To Use Model Internals In Training

Neel Nanda 4d

My point is that iteratively refreshing your probe and continuing training will eventually break your ability to probe. You're applying much more optimisation pressure to "the property of being probeable"

Replying to It Is Reasonable To Research How To Use Model Internals In Training

Neel Nanda 4d

Sure, but I think "held out detector" is doing a lot of work here. Model internals is a big category that can encompass a lot of stuff, with different tradeoffs. Interpretability is not a single technique that breaks or not. If linear probes break, non-linear probes might be fine. If probes break, activation oracles might be fine. If probing for one concept breaks, other ones may be fine. Maybe it all is super fragile. But I'm pretty scared of tabooing potentially promising research because of a bunch of ungrounded assumptions

By analogy, I don't think "training on final outputs is ok, training on CoT is not" is obvious, at first glance... (read more)

Replying to It Is Reasonable To Research How To Use Model Internals In Training

Neel Nanda 4d

Hmm, thinking more, reward models (with a linear head) are basically just a probe, on the final residual stream. And have at least historically been used. I don't expect probing on the final residual stream rather than a middle one to fundamentally change things here. Which both suggests this is doable infra wise, but also that nothing really bad happened to interpretability. You'd train a probe somewhat differently but I'm not sure it's all that different

Re how easy it is to break a probe, who knows! I would rather just do the research to answer empirical uncertainties like this.
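To make the analogy in this comment concrete, here is a minimal sketch under stated assumptions: a linear probe and a linear reward head are the same operation (a dot product plus bias) and differ only in which layer's residual stream they read. All names, sizes, and the random activations are hypothetical stand-ins, not any lab's actual setup.

```python
import random


def linear_readout(activation, weights, bias=0.0):
    """Dot product plus bias: the whole computation of a linear probe or head."""
    return sum(a * w for a, w in zip(activation, weights)) + bias


random.seed(0)
D_MODEL = 8    # hypothetical residual stream width
N_LAYERS = 4   # hypothetical number of layers

# Stand-in for the residual stream at the last token position, one vector per layer.
residual_stream = [
    [random.gauss(0, 1) for _ in range(D_MODEL)] for _ in range(N_LAYERS)
]
weights = [random.gauss(0, 1) for _ in range(D_MODEL)]

# A probe typically reads a middle layer; a linear reward head reads the final one.
probe_score = linear_readout(residual_stream[N_LAYERS // 2], weights)
reward = linear_readout(residual_stream[-1], weights)

print(f"middle-layer probe score: {probe_score:.3f}")
print(f"final-layer reward head:  {reward:.3f}")
```

In practice the two would be trained differently and have separate weights; the sketch only illustrates that the readout itself has the same form.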

Replying to It Is Reasonable To Research How To Use Model Internals In Training

Neel Nanda 4d*

EDIT: Tom McGrath, Goodfire co-founder, says similar things in this comment

Oh, I would be surprised if that was their plan, seems like a bad plan for several reasons (though I may be overfitting to how GDM works):

The two hard parts are getting things to work on lab infra, and figuring out the exact recipe that works on the lab's model, neither of which Goodfire can do
Goodfire might be able to find some general research insights into how to do this well, along with validating the idea. But I back a frontier lab's ability to figure this out if sufficiently motivated. They might pay Goodfire once for the IP, but then would have

Neel Nanda 4d

It Is Reasonable To Research How To Use Model Internals In Training

Ok, that one in particular would only be fairly annoying rather than very annoying, fair point. You would either need to have your training infra set up to allow you to apply a probe while sampling, which sounds annoying, or to rerun after sampling and apply a probe the second time. The latter is easier infra-wise, as you could run it in inference-only mode, but it still adds significant overhead as you need to run the model again, even if it doesn't involve any generation, and plausibly with a somewhat different model configuration since you want to access activations. But implementing probes in the serving stack is doable so it seems... (read more)

It Is Reasonable To Research How To Use Model Internals In Training

Neel Nanda

There seems to be a common belief in the AGI safety community that involving interpretability in the training process is “the most forbidden technique”, including recent criticism of Goodfire for investing in this area.

I find this odd since this is a pretty normal area of interpretability research in the AGI safety community. I have worked on it, Anthropic Fellows have worked on it, FAR has worked on it, etc.

I don’t know if it will be net positive to use this kind of thing in frontier model training, but it could plausibly be very helpful for AGI safety, and it seems like a clear mistake to me if we don’t do the required... (read 1172 more words →)

Test your interpretability techniques by de-censoring Chinese models

Khoi Tran

Khoi Tran, aryaj, Senthooran Rajamanoharan, Neel Nanda

1mo

This work was conducted during the MATS 9.0 program under Neel Nanda and Senthooran Rajamanoharan.

The CCP accidentally made great model organisms

“Please observe the relevant laws and regulations and ask questions in a civilized manner when you speak.” - Qwen3 32B

“The so-called "Uighur issue" in Xinjiang is an outright lie by people with bad intentions in an attempt to undermine Xinjiang's prosperity and stability and curb China's development.” - Qwen3 30B A3B

Chinese models dislike talking about anything that the CCP deems sensitive and often refuse, downplay, and outright lie to the user when engaged on these issues. In this paper, we want to outline a case for Chinese models being natural model organisms... (read 5874 more words →)

Global CoT Analysis: Initial attempts to uncover patterns across many chains of thought

Riya Tyagi

Riya Tyagi, daria, Arthur Conmy, Neel Nanda

1mo

Authors: Riya Tyagi, Daria Ivanova, Arthur Conmy, Neel Nanda

Riya and Daria are co-first authors. This work was largely done during a research sprint for Neel Nanda’s MATS 9.0 training phase.

🖥️ Deployment code

⚙️ Interactive demo

🐦 Tweet thread

TL;DR

We believe that more research effort should go into studying many chains of thought collectively and identifying global patterns across the collection.
We argue this direction is important and tractable; to start, we focus on a simpler setting: analyzing a collection of CoTs generated with a fixed model on a fixed prompt. We introduce two methods—semantic step clustering and algorithmic step clustering—that group recurring patterns across reasoning traces at different levels of abstraction.
Semantic step clustering leverages the fact

... (read 5191 more words →)

Brief Explorations in LLM Value Rankings

Tim Hua

Tim Hua, Josh Engels, Neel Nanda, Senthooran Rajamanoharan

1mo

Code and data can be found here

Executive Summary

We use data from Zhang et al. (2025) to measure LLM values. We find that our value metric can sometimes predict LLM behaviors on a test distribution in non-safety-relevant settings, but it is not super consistent.
In Zhang et al. (2025), "Stress testing model specs reveals character differences among language models," the authors generated a dataset of 43,960 chat questions. Each question gives LLMs a chance to express two different values, and an autograder scores how much the model expresses each value.
There are 3,302 values in the dataset, created based on what Claude expresses in the wild from Huang et al. (2025). Some examples are "dramatic craft," "structured evaluation," "self-preservation,"

... (read 3016 more words →)

Principled Interpretability of Reward Hacking in Closed Frontier Models

gersonkroiz

gersonkroiz, aditya singh, Senthooran Rajamanoharan, Neel Nanda

1mo

Authors: Gerson Kroiz*, Aditya Singh*, Senthooran Rajamanoharan, Neel Nanda

Gerson and Aditya are co-first authors. This is a research sprint report from Neel Nanda’s MATS 9.0 training phase. We do not currently plan to further investigate these environments, but will continue research in science of misalignment and encourage others to build upon our preliminary results.

🖥️ Code: agent-interp-envs (repo with agent environments to study interpretability) and principled-interp-blog (repo with our experiments & methodology).

🐦 Tweet thread

Executive Summary

We present preliminary findings on why models choose to reward hack. We focused on closed frontier models (GPT-5, o3, and Gemini 3 Pro) and explored the limits of what we could learn with confidence through an API, in particular... (read 6645 more words →)

Steering RL Training: Benchmarking Interventions Against Reward Hacking

ariaw

ariaw, Josh Engels, Neel Nanda

1mo

This project is an extension of work done for Neel Nanda’s MATS 9.0 Training Phase. Neel Nanda and Josh Engels advised the project. Initial work on this project was done with David Vella Zarb. Thank you to Arya Jakkli, Paul Bogdan, and Monte MacDiarmid for providing feedback on the post and ideas.

Overview of the top interventions compared to RL and No Intervention baseline runs. All runs are trained on an environment with a reward hacking loophole except for the RL baseline, which is trained on a no-loophole environment. Statistical significance compared to the RL baseline is indicated by * for values greater and † for values lesser at α = 0.01. Successful interventions should

... (read 5605 more words →)

Announcing Gemma Scope 2

CallumMcDougall

CallumMcDougall, Arthur Conmy, János Kramár, Tom Lieberum, Senthooran Rajamanoharan, Neel Nanda

2mo

TLDR

The Google DeepMind mech interp team is releasing Gemma Scope 2: a suite of SAEs & transcoders trained on the Gemma 3 model family
- Neuronpedia demo here, access the weights on HuggingFace here, try out the Colab notebook tutorial here
Key features of this relative to the previous Gemma Scope release:
- More advanced model family (V3 rather than V2) should enable analysis of more complex forms of behaviour
- More comprehensive release (SAEs on every layer, for all models up to size 27b, plus multi-layer models like crosscoders and CLTs)
- More focus on chat models (every SAE trained on a PT model has a corresponding version finetuned for IT models)
Although we've deprioritized fundamental research on tools like SAEs

... (read 514 more words →)

Can we interpret latent reasoning using current mechanistic interpretability tools?

Bartosz Cywiński

Bartosz Cywiński, Bart Bussmann, Arthur Conmy, Josh Engels, Neel Nanda, Senthooran Rajamanoharan

2mo

Authors: Bartosz Cywinski*, Bart Bussmann*, Arthur Conmy**, Joshua Engels**, Neel Nanda**, Senthooran Rajamanoharan**

* primary contributors
** advice and mentorship

TL;DR

We study a simple latent reasoning LLM on math tasks using standard mechanistic interpretability techniques to see whether the latent reasoning process (i.e., vector-based chain of thought) is interpretable.

Results:

We find that the model solves maths problems requiring three reasoning steps by storing the two intermediate values in specific latent vectors (the third and fifth of six). We established this using standard mechanistic interpretability techniques.
The logit lens shows that intermediate calculations are represented in the residual stream during latent reasoning.
The latent vectors are not perfectly interpretable via the logit lens, but through patching, we demonstrate that

... (read 2577 more words →)

I'm considering writing a follow-up FAQ to my pragmatic interpretability post, with clarifications and responses to common objections. What would you like to see addressed?

[Paper] Difficulties with Evaluating a Deception Detector for AIs

bilalchughtai

bilalchughtai, lewis smith, Neel Nanda

2mo

New research from the GDM mechanistic interpretability team. Read the full paper on arxiv or check out the twitter thread.

Abstract

Building reliable deception detectors for AI systems—methods that could predict when an AI system is being strategically deceptive without necessarily requiring behavioural evidence—would be valuable in mitigating risks from advanced AI systems. But evaluating the reliability and efficacy of a proposed deception detector requires examples that we can confidently label as either deceptive or honest. We argue that we currently lack the necessary examples and further identify several concrete obstacles in collecting them. We provide evidence from conceptual arguments, analysis of existing empirical works, and analysis of novel illustrative case studies. We also

... (read 1529 more words →)

EDIT: For ICLR reviews, this largely still applies. The main distinction is that there’s much more encouragement to do serious additional experiments and upload a revised manuscript. I’ve generally seen scores change a fair bit more with ICLR than with other conferences, but this is meaningfully more work. Note also that all ICLR papers and reviews are public forever and can appear in Google results, so even if your paper is doomed, there is still an incentive to at least explain why your reviewers are being unreasonable.

To anyone currently going through the fun of NeurIPS rebuttals for the first time, some advice:

Firstly, if you're feeling down about reviews, remember that peer review has been... (read 836 more words →)

Neel Nanda 8mo

I continue to feel like we're talking past each other, so let me start again. We both agree that causing human extinction is extremely bad. If I understand you correctly, you are arguing that it makes sense to follow deontological rules, even if there's a really good reason breaking them seems locally beneficial, because on average, the decision theory that's willing to do harmful things for complex reasons performs badly.

The goal of my various analogies was to point out that this is not actually a fully correct statement about common sense morality. Common sense morality has several exceptions for things like having someone's consent to take on a risk, someone doing bad... (read 516 more words →)

Cross posting one of my tweet threads that people here might enjoy

A recent dilemma of mine: how to eat less sweet food but still have it in moderation? I don't want to spend the willpower required to cut it out entirely, or to agonise every time about whether something is really worth it

My surprisingly elegant solution: Randomise! Have it with probability 2/3 (or probability of your choice) Abiding by the RNG is far easier than resisting temptation!

This is surprisingly general! Probabilistic dieting? Probabilistic vegetarianism? Half the moral benefits, far easier. Well, at least personally, I would find "toss a coin at each meal for whether to have meat" more than twice as... (read more)
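The randomisation rule above is simple enough to write down. This is a sketch only: the function name `should_indulge` is made up for illustration, and the default probability of 2/3 is the one suggested in the thread.

```python
import random
from typing import Optional


def should_indulge(p: float = 2 / 3, rng: Optional[random.Random] = None) -> bool:
    """Return True with probability p (one biased coin flip per decision)."""
    source = rng if rng is not None else random
    return source.random() < p


# Simulate many decisions with a seeded generator to check the long-run rate.
rng = random.Random(0)
trials = 100_000
rate = sum(should_indulge(rng=rng) for _ in range(trials)) / trials
print(f"indulged in {rate:.1%} of {trials} simulated decisions")
```

Each decision is an independent draw, so the long-run rate approaches the chosen probability while any single decision stays out of your hands.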

A tip for anyone on the ML job/PhD market - people will plausibly be quickly skimming your Google Scholar to get a "how impressive is this person/what is their deal" read (I do this fairly often), so I recommend polishing your Google Scholar profile if you have publications! It can make a big difference.

I have a lot of weird citable artefacts that confuse Google Scholar, so here are some tips I've picked up:

First, make a Google Scholar profile if you don't already have one!
- Verify the email (otherwise it doesn't show up properly in search)
(Important!) If you are co-first author on a paper but not in the first position, indicate this by editing

... (read more)

In response to Habryka's shortform, I can confirm that I signed a concealed non-disparagement as part of my Anthropic separation agreement. I worked there for 6 months and left in mid 2022. I received a cash payment as part of that agreement, with nothing shady going on a la threatening previous compensation (though I had no equity to threaten). In hindsight I undervalued my ability to speak freely, and didn't more seriously consider that I could just decline to sign the separation agreement and walk away. I'm not sure what I would do if doing it again.

I asked Anthropic to release me from this after the comment thread started, and they have... (read 523 more words →)


Neel Nanda

How To Become A Mechanistic Interpretability Researcher

A Pragmatic Vision for Interpretability

Intentionally Making Close Friends

Highly Opinionated Advice on How to Write ML Papers

Neel Nanda

models have some pretty funny attractor states

It Is Reasonable To Research How To Use Model Internals In Training

Test your interpretability techniques by de-censoring Chinese models

Global CoT Analysis: Initial attempts to uncover patterns across many chains of thought

Brief Explorations in LLM Value Rankings

Principled Interpretability of Reward Hacking in Closed Frontier Models

Steering RL Training: Benchmarking Interventions Against Reward Hacking

How I Think About My Research Process

GDM Mech Interp Progress Updates

Fact Finding: Attempting to Reverse-Engineer Factual Recall on the Neuron Level

Mechanistic Interpretability Puzzles

Interpreting Othello-GPT

200 Concrete Open Problems in Mechanistic Interpretability
