Replying to The ‘strong’ feature hypothesis could be wrong

I think that the general argument I made in this post was correct, and that it anticipated a shift away from the strong feature hypothesis mode of thinking on SAEs. It’s hard to say to what degree this shift was downstream of me publishing this post (probably mostly not, though it may have had some influence), but the post probably deserves some Bayes points nevertheless. Many of the arguments in this post continue to influence my thinking, and there is a lot in the post that remains valuable.

In fact, if anything I think I should have been a bit more confident; ‘The strong feature hypothesis is wrong’ would have been a better title. ... (read 676 more words →)

[Paper] Difficulties with Evaluating a Deception Detector for AIs

bilalchughtai

bilalchughtai, lewis smith, Neel Nanda

2mo

New research from the GDM mechanistic interpretability team. Read the full paper on arXiv or check out the Twitter thread.

Abstract

Building reliable deception detectors for AI systems—methods that could predict when an AI system is being strategically deceptive without necessarily requiring behavioural evidence—would be valuable in mitigating risks from advanced AI systems. But evaluating the reliability and efficacy of a proposed deception detector requires examples that we can confidently label as either deceptive or honest. We argue that we currently lack the necessary examples and further identify several concrete obstacles in collecting them. We provide evidence from conceptual arguments, analysis of existing empirical works, and analysis of novel illustrative case studies. We also

... (read 1529 more words →)

How Can Interpretability Researchers Help AGI Go Well?

Neel Nanda

Neel Nanda, Josh Engels, Senthooran Rajamanoharan, Arthur Conmy, bilalchughtai, CallumMcDougall, János Kramár, lewis smith

2mo

Executive Summary

Over the past year, the Google DeepMind mechanistic interpretability team has pivoted to a pragmatic approach to interpretability, as detailed in our accompanying post ^[1], and we are excited for more in the field to embrace pragmatism! In brief, we think that:
- It is crucial to have empirical feedback on your ultimate goal with good proxy tasks ^[2].
- We do not need near-complete understanding to have significant impact.
- We can perform good focused projects by starting with a theory of change, and good exploratory projects by starting with a robustly useful setting.
But that’s pretty abstract. So how

... (read 3981 more words →)

A Pragmatic Vision for Interpretability

Neel Nanda

Neel Nanda, Josh Engels, Arthur Conmy, Senthooran Rajamanoharan, bilalchughtai, CallumMcDougall, János Kramár, lewis smith

2mo

Executive Summary

The Google DeepMind mechanistic interpretability team has made a strategic pivot over the past year, from ambitious reverse-engineering to a focus on pragmatic interpretability:
- Trying to directly solve problems on the critical path to AGI going well ^[1]
- Carefully choosing problems according to our comparative advantage
- Measuring progress with empirical feedback on proxy tasks
We believe that, on the margin, more researchers who share our goals should take a pragmatic approach to interpretability, both in industry and academia, and we call on people to join us
- Our proposed scope is broad and includes much non-mech interp work, but we see this as the natural approach for mech

... (read 7963 more words →)


Replying to The ‘strong’ feature hypothesis could be wrong

lewis smith

4mo

Aren't the MLPs in a transformer straightforward examples of this?

That's certainly the most straightforward interpretation! I think a lot of the ideas I'm talking about here are downstream of the toy models paper, which introduces the idea that MLPs might fundamentally be explained in terms of (approximate) manipulations of these kinds of linear subspaces; i.e. that everything that wasn't explicable in this way would sort of function like noise, rather than playing a functional role.

I think I agree that this should have been treated with a lot more suspicion than it was in interpretability circles, but lots of people were excited about this paper, and then SAEs seemed to be sort of... (read more)

Replying to The ‘strong’ feature hypothesis could be wrong

lewis smith

4mo

In the context of the SFH, 'feature' means what I called an 'atom' in the text, i.e. a linear direction in activation space with a specific function in the model. This implies that any mechanism can be usefully decomposed in terms of these. Finding a mechanism which is difficult to express in this model is counter-evidence. I think you could rescue the 'feature hypothesis' by using a vaguer definition of 'feature' (which is a common move).

Replying to The ‘strong’ feature hypothesis could be wrong

lewis smith

4mo

I think that they are distinguishable. For instance, if you can find an example of a structure which doesn't fit the 'feature' model but clearly serves some algorithmic function, that would seem to be strong counter-evidence. For example, this paper (https://arxiv.org/abs/2405.14860) demonstrates that at least the one-dimensional feature model is not complete. There might be some way to express that in 'strong feature hypothesis' form by adding a lot of epicycles, but I think that sort of thing would be evidence against the idea of independent 1-dimensional linear features. The strong feature hypothesis does have the virtue of being strong; therefore it's quite vulnerable to counter-evidence! The main thing that makes this a bit more confusing is that exactly what the 'feature' hypothesis claimed was often left fairly vague, and disproving a vague hypothesis is quite difficult.

Replying to The ‘strong’ feature hypothesis could be wrong

lewis smith

6mo

The weak LRH I would say is now well supported by considerable empirical evidence.

A couple of people have shrug-reacted to this sentence. I think the theory I had in mind as being 'well supported by empirical evidence' was something like 'you can often find examples of networks representing stuff with a linear direction pretty uncontroversially'. I think this is probably still a fair statement, though it's a bit vague, and you could argue that it's a bit of a leap from this (which is like >0 things are represented as linear directions) to how I phrased the weak LRH in the post.

While I think this post has mostly aged quite well, looking back, I think I was hedging to try to avoid writing a post entitled 'here's why SAEs are doomed'.
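To make the weaker claim above concrete ('networks often represent stuff with a linear direction'), here is a minimal sketch of one common way to look for such a direction: the difference between the mean activations of examples with and without some attribute. Everything here is an assumption for illustration: the activations are synthetic Gaussian data rather than real model activations, and the function names are hypothetical.

```python
import numpy as np


def mean_difference_direction(acts_pos, acts_neg):
    """Unit vector pointing from the mean of acts_neg to the mean of acts_pos."""
    diff = acts_pos.mean(axis=0) - acts_neg.mean(axis=0)
    return diff / np.linalg.norm(diff)


def projection_accuracy(acts_pos, acts_neg, direction):
    """Accuracy when thresholding projections at the midpoint of the class means."""
    pos = acts_pos @ direction
    neg = acts_neg @ direction
    threshold = 0.5 * (pos.mean() + neg.mean())
    correct = (pos > threshold).sum() + (neg <= threshold).sum()
    return correct / (len(pos) + len(neg))


# Synthetic stand-in for activations (assumption): the attribute shifts
# activations along a single hidden direction, plus isotropic noise.
rng = np.random.default_rng(0)
d_model = 64
true_direction = rng.normal(size=d_model)
true_direction /= np.linalg.norm(true_direction)
acts_neg = rng.normal(size=(500, d_model))
acts_pos = rng.normal(size=(500, d_model)) + 4.0 * true_direction

direction = mean_difference_direction(acts_pos, acts_neg)
accuracy = projection_accuracy(acts_pos, acts_neg, direction)
cosine = float(direction @ true_direction)
print(f"accuracy={accuracy:.3f}, cosine with true direction={cosine:.3f}")
```

Finding a direction like this for one attribute is evidence for the '>0 things are linear directions' statement only; it says nothing about whether everything the network computes decomposes this way.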

Towards data-centric interpretability with sparse autoencoders

Nick Jiang

Nick Jiang, lilysun004, lewis smith, Neel Nanda

6mo

Nick and Lily are co-first authors on this project. Lewis and Neel jointly supervised this project.

Check out our updated paper here: https://arxiv.org/abs/2512.10092.

TL;DR

We use sparse autoencoders (SAEs) for four textual data analysis tasks: data diffing, finding correlations, targeted clustering, and retrieval.
We care especially about gaining insights from language model data, such as LLM outputs and training data, as we believe it is an underexplored route for model understanding.
For instance, we find that Grok 4 is more careful than other frontier models to state its assumptions and explore nuanced interpretations—showing the kinds of insights data diffing can reveal by comparing model outputs.
Why SAEs? Think of features as "tags" of properties for each text.
- Their large dictionary of

... (read 5157 more words →)

lewis smith

7mo

Strong agree on the mentalistic language. In fact I would go a bit further than saying that work on deception is hard to understand without mentalistic language. I think this is a central point for work on deception / scheming (one that the authors of this paper gesture at a little bit): any definition of strategic deception (e.g. agent A is trying to make agent B believe X, while agent A believes ~X) requires taking the intentional stance and attributing mental states to A and B. I think that it's reasonable to probe whether attributing these mental states makes sense, and we shouldn't just uncritically apply the intentional stance. But coming up with experiments that distinguish whether a model is in a given intentional state is quite hard!

Replying to A Problem to Solve Before Building a Deception Detector

lewis smith

9mo

Although, having said that, even simpler probes require some operationalisation of the target state (e.g. the model is lying), which is normally behavioural rather than 'bottom up' (lying requires believing things, which is an intentional state again).

Replying to A Problem to Solve Before Building a Deception Detector

lewis smith

9mo

I wouldn't read too much into the title (it's partly just trying to be punchy), though I do think that the connection between the intentional state of deception and its algorithmic representation.

Re. point 2: possibly this was a bit over-confident. I do think that, a priori, the simple correspondence theory is unlikely to be right, but maybe I should have put more weight on 'simple correspondence will just hold for deception'.

Another thing I would maybe do differently if we wrote this again is to be a bit more specific about the kind of deception detector; I think I was thinking a lot of a 'circuit level' or... (read more)

Negative Results for SAEs On Downstream Tasks and Deprioritising SAE Research (GDM Mech Interp Team Progress Update #2)

lewis smith

lewis smith, Senthooran Rajamanoharan, Arthur Conmy, CallumMcDougall, Tom Lieberum, János Kramár, Rohin Shah, Neel Nanda

11mo

Lewis Smith*, Sen Rajamanoharan*, Arthur Conmy, Callum McDougall, Janos Kramar, Tom Lieberum, Rohin Shah, Neel Nanda

* = equal contribution

The following piece is a list of snippets about research from the GDM mechanistic interpretability team, which we didn’t consider a good fit for turning into a paper, but which we thought the community might benefit from seeing in this less formal form. These are largely things that we found in the process of a project investigating whether sparse autoencoders (SAEs) were useful for downstream tasks, notably out-of-distribution probing.

TL;DR

To validate whether SAEs were a worthwhile technique, we explored whether they were useful on the downstream task of OOD generalisation when detecting harmful intent in

... (read 8585 more words →)


Replying to A Problem to Solve Before Building a Deception Detector

lewis smith

1y

I think it's important to push back against the assumption that this will always happen, or that something like the refusal direction has to exist for every possible state of interest.

And to expand on this a little bit more: it seems important that we hedge against this possibility by at least spending a bit of time thinking about plans that don't rhyme with 'I sure hope everything turns out to be a simple correspondence'! I think Eleni and I feel that this is a surprisingly widespread move in interpretability plans, which is maybe why some of the post is quite forceful in arguing against it.

Replying to A Problem to Solve Before Building a Deception Detector

lewis smith

1y

I think this is along the right sort of lines. Indeed I think this plan is the sort of thing I hoped to prompt people to think about with the post. But I think there are a few things wrong with it:

I think premise 1 is big if true, but I doubt that it is as easy as this: see the DeepMind fact-finding sequence for some counter-evidence. It's also easy to imagine this being true for some categories of static facts about the external world (e.g. Paris being in France), but you need to be careful about extending this to the category of all propositional statements (e.g. the model thinks

Eleni Angelou

Eleni Angelou, lewis smith

TL;DR: If you are thinking of using interpretability to help with strategic deception, then there's likely a problem you need to solve first: how are intentional descriptions (like deception) related to algorithmic ones (like understanding the mechanisms models use)? We discuss this problem and try to outline some constructive directions.

1. Introduction

A commonly discussed AI risk scenario is strategic deception: systems that execute sophisticated planning against their creators to achieve undesired ends. In particular, this is insidious because a system that is capable of strategic planning and also situationally aware might be able to systematically behave differently when under observation, and thus evaluation methods that are purely behavioral could become unreliable. One widely hypothesized potential solution to this... (read 3977 more words →)

‘Feature’ is overloaded terminology

In the interpretability literature, it’s common to overload ‘feature’ to mean three separate things:

1. Some distinctive, relevant attribute of the input data. For example, being in English is a feature of this text.
2. A general activity pattern across units which a network uses to represent some feature in the first sense. So we might say that a network represents the feature (1) that the text is in English with a linear feature (2) in layer 32.
3. The elements of some process for discovering features, like an SAE. An SAE learns a dictionary of activation vectors, which we hope correspond to the features (2) that the network actually uses. It is common to

... (read more)
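As an illustration of the third sense, here is a minimal sketch of a standard ReLU sparse autoencoder forward pass in NumPy. It is a hedged sketch, not the setup used in any of the posts above: the class name `TinySAE`, the sizes, the random (untrained) weights, and the L1 coefficient are all assumptions. Each row of `W_dec` is one dictionary element, i.e. a 'feature' in sense (3), which we hope lines up with a feature in sense (2).

```python
import numpy as np


class TinySAE:
    """Hypothetical minimal sparse autoencoder with random, untrained weights."""

    def __init__(self, d_model, n_latents, seed=0):
        rng = np.random.default_rng(seed)
        self.W_enc = rng.normal(scale=0.1, size=(d_model, n_latents))
        self.b_enc = np.zeros(n_latents)
        W_dec = rng.normal(size=(n_latents, d_model))
        # Normalise each dictionary element (row) to unit length.
        self.W_dec = W_dec / np.linalg.norm(W_dec, axis=1, keepdims=True)
        self.b_dec = np.zeros(d_model)

    def encode(self, x):
        """Non-negative latent activations, one per dictionary element."""
        return np.maximum(0.0, (x - self.b_dec) @ self.W_enc + self.b_enc)

    def decode(self, f):
        """Reconstruct activations as a weighted sum of dictionary elements."""
        return f @ self.W_dec + self.b_dec

    def loss(self, x, l1_coeff=1e-3):
        """Reconstruction error plus an L1 penalty that encourages sparsity."""
        f = self.encode(x)
        x_hat = self.decode(f)
        recon = np.mean(np.sum((x - x_hat) ** 2, axis=-1))
        sparsity = np.mean(np.sum(np.abs(f), axis=-1))
        return recon + l1_coeff * sparsity


rng = np.random.default_rng(1)
sae = TinySAE(d_model=16, n_latents=64)
x = rng.normal(size=(8, 16))  # stand-in for a batch of model activations
f = sae.encode(x)
x_hat = sae.decode(f)
print(f.shape, x_hat.shape, float(sae.loss(x)))
```

Training would minimise `loss` over real activations; whether the learned rows of `W_dec` match the directions the network actually uses is exactly the open question the terminology tends to blur.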

lewis smith's Shortform

lewis smith

This is a special post for quick takes (aka "shortform"). Only the owner can create top-level comments.

The ‘strong’ feature hypothesis could be wrong

lewis smith

NB. I am on the Google DeepMind language model interpretability team. But the arguments/views in this post are my own, and shouldn't be read as a team position.

“It would be very convenient if the individual neurons of artificial neural networks corresponded to cleanly interpretable features of the input. For example, in an “ideal” ImageNet classifier, each neuron would fire only in the presence of a specific visual feature, such as the color red, a left-facing curve, or a dog snout”

- Elhage et al., Toy Models of Superposition

Recently, much attention in the field of mechanistic interpretability, which tries to explain the behavior of neural networks in terms of interactions between lower... (read 5035 more words →)


Improving Dictionary Learning with Gated Sparse Autoencoders

Senthooran Rajamanoharan

Senthooran Rajamanoharan, Arthur Conmy, lewis smith, Tom Lieberum, Vikrant Varma, János Kramár, Rohin Shah, Neel Nanda

Authors: Senthooran Rajamanoharan*, Arthur Conmy*, Lewis Smith, Tom Lieberum, Vikrant Varma, János Kramár, Rohin Shah, Neel Nanda

A new paper from the Google DeepMind mech interp team: Improving Dictionary Learning with Gated Sparse Autoencoders!

Gated SAEs are a new Sparse Autoencoder architecture that seems to be a significant Pareto-improvement over normal SAEs, verified on models up to Gemma 7B. They are now our team's preferred way to train sparse autoencoders, and we'd love to see them adopted by the community! (Or to be convinced that it would be a bad idea for them to be adopted by the community!)

They achieve similar reconstruction with about half as many firing features, while being comparably or more interpretable (confidence interval for the increase is 0%-13%).

See Sen's Twitter summary, my Twitter summary, and the paper!

[Full Post] Progress Update #1 from the GDM Mech Interp Team

Neel Nanda

Neel Nanda, Arthur Conmy, lewis smith, Senthooran Rajamanoharan, Tom Lieberum, János Kramár, Vikrant Varma

This is a series of snippets about the Google DeepMind mechanistic interpretability team's research into Sparse Autoencoders, that didn't meet our bar for a full paper. Please start at the summary post for more context, and a summary of each snippet. They can be read in any order.

Activation Steering with SAEs

Arthur Conmy, Neel Nanda

TL;DR: We use SAEs trained on GPT-2 XL’s residual stream to decompose steering vectors into interpretable features. We find a single SAE feature for anger which is a Pareto-improvement over the anger steering vector from existing work (Section 3, 3 minute read). We have more mixed results with wedding steering vectors: we can partially interpret the vectors, but the... (read 2340 more words →)
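One simple way to picture 'decomposing a steering vector into SAE features' is to rank the SAE's decoder directions by cosine similarity with the steering vector. The sketch below does this on synthetic data. It is an assumption about the general idea, not the procedure used in the snippet: the dictionary is random rather than a trained SAE, the steering vector is built by hand from two dictionary elements, and `top_dictionary_features` is a hypothetical helper.

```python
import numpy as np


def top_dictionary_features(vector, W_dec, k=5):
    """Return (index, cosine similarity) for the k decoder rows closest to vector.

    Assumes the rows of W_dec have unit norm, so the dot product with the
    normalised vector is a cosine similarity.
    """
    unit = vector / np.linalg.norm(vector)
    sims = W_dec @ unit
    order = np.argsort(-sims)[:k]
    return [(int(i), float(sims[i])) for i in order]


# Synthetic dictionary (assumption): 512 random unit directions in 256 dims.
rng = np.random.default_rng(0)
W_dec = rng.normal(size=(512, 256))
W_dec /= np.linalg.norm(W_dec, axis=1, keepdims=True)

# Hand-built 'steering vector': mostly element 17, plus some of element 42.
steering_vector = 3.0 * W_dec[17] + 2.0 * W_dec[42]

top = top_dictionary_features(steering_vector, W_dec, k=2)
print(top)
```

With a real SAE one would then inspect what the top-ranked features activate on, and test whether steering with a single decoder direction reproduces the effect of the full vector.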
