lewis smith

The ‘strong’ feature hypothesis could be wrong

NB. I am on the Google Deepmind language model interpretability team. But the arguments/views in this post are my own, and shouldn't be read as a team position. > “It would be very convenient if the individual neurons of artificial neural networks corresponded to cleanly interpretable features of the input. For example, in an “ideal” ImageNet classifier, each neuron would fire only in the presence of a specific visual feature, such as the color red, a left-facing curve, or a dog snout” > - Elhage et. al, Toy Models of Superposition Recently, much attention in the field of mechanistic interpretability, which tries to explain the behavior of neural networks in terms of interactions between lower level components, has been focussed on extracting features from the representation space of a model. The predominant methodology for this has used variations on the sparse autoencoder, in a series of papers inspired by Elhage et. als. model of superposition.It's been conventionally understood that there are two key theories underlying this agenda. The first is the ‘linear representation hypothesis’ (LRH), the hypothesis that neural networks represent many intermediates or variables of the computation (such as the ‘features of the input’ in the opening quote) as linear directions in it’s representation space, or atoms[1]. And second, the theory that the network is capable of representing more of these ‘atoms’ than it has dimensions in its representation space, via superposition (the superposition hypothesis). While superposition is a relatively uncomplicated hypothesis, I think the LRH is worth examining in more detail. It is frequently stated quite vaguely, and I think there are several possible formulations of this hypothesis, with varying degrees of plausibility, that it is worth carefully distinguishing between. For example, the linear representation hypothesis is often stated as ‘networks represent features of the input as directions in representation space’. H

234Aug 2, 2024

A Pragmatic Vision for Interpretability

131Dec 1, 2025

Negative Results for SAEs On Downstream Tasks and Deprioritising SAE Research (GDM Mech Interp Team Progress Update #2)

116Mar 26, 2025

[Full Post] Progress Update #1 from the GDM Mech Interp Team

80Apr 19, 2024

lewis smith

Message

927

150

[Paper] Difficulties with Evaluating a Deception Detector for AIs

New research from the GDM mechanistic interpretability team. Read the full paper on arxiv or check out the twitter thread. > Abstract > Building reliable deception detectors for AI systems—methods that could predict when an AI system is being strategically deceptive without necessarily requiring behavioural evidence—would be valuable in mitigating...

Dec 3, 202530

How Can Interpretability Researchers Help AGI Go Well?

Executive Summary * Over the past year, the Google DeepMind mechanistic interpretability team has pivoted to a pragmatic approach to interpretability, as detailed in our accompanying post [1] , and are excited for more in the field to embrace pragmatism! In brief, we think that: * It is crucial to...

Dec 1, 202566

A Pragmatic Vision for Interpretability

Executive Summary * The Google DeepMind mechanistic interpretability team has made a strategic pivot over the past year, from ambitious reverse-engineering to a focus on pragmatic interpretability: * Trying to directly solve problems on the critical path to AGI going well [[1]] * Carefully choosing problems according to our comparative...

Dec 1, 2025131

Towards data-centric interpretability with sparse autoencoders

Nick and Lily are co-first authors on this project. Lewis and Neel jointly supervised this project. Check out our updated paper here: https://arxiv.org/abs/2512.10092. TL;DR * We use sparse autoencoders (SAEs) for four textual data analysis tasks—data diffing, finding correlations, targeted clustering, and retrieval. * We care especially about gaining insights...

Aug 15, 202553

Negative Results for SAEs On Downstream Tasks and Deprioritising SAE Research (GDM Mech Interp Team Progress Update #2)

Lewis Smith*, Sen Rajamanoharan*, Arthur Conmy, Callum McDougall, Janos Kramar, Tom Lieberum, Rohin Shah, Neel Nanda * = equal contribution The following piece is a list of snippets about research from the GDM mechanistic interpretability team, which we didn’t consider a good fit for turning into a paper, but which...

Mar 26, 2025116

A Problem to Solve Before Building a Deception Detector

TL;DR: If you are thinking of using interpretability to help with strategic deception, then there's likely a problem you need to solve first: how are intentional descriptions (like deception) related to algorithmic ones (like understanding the mechanisms models use)? We discuss this problem and try to outline some constructive directions....

Feb 7, 202578

lewis smith's Shortform

Aug 30, 202412

Load More (7/12)

LESSWRONG
LW

LESSWRONG
LW

lewis smith

lewis smith

lewis smith

The ‘strong’ feature hypothesis could be wrong

A Pragmatic Vision for Interpretability

Negative Results for SAEs On Downstream Tasks and Deprioritising SAE Research (GDM Mech Interp Team Progress Update #2)

[Full Post] Progress Update #1 from the GDM Mech Interp Team

lewis smith

[Paper] Difficulties with Evaluating a Deception Detector for AIs

How Can Interpretability Researchers Help AGI Go Well?

A Pragmatic Vision for Interpretability

Towards data-centric interpretability with sparse autoencoders

Negative Results for SAEs On Downstream Tasks and Deprioritising SAE Research (GDM Mech Interp Team Progress Update #2)

A Problem to Solve Before Building a Deception Detector

lewis smith's Shortform

The ‘strong’ feature hypothesis could be wrong

A Pragmatic Vision for Interpretability

Negative Results for SAEs On Downstream Tasks and Deprioritising SAE Research (GDM Mech Interp Team Progress Update #2)

[Full Post] Progress Update #1 from the GDM Mech Interp Team

[Paper] Difficulties with Evaluating a Deception Detector for AIs

How Can Interpretability Researchers Help AGI Go Well?

A Pragmatic Vision for Interpretability

Towards data-centric interpretability with sparse autoencoders

Negative Results for SAEs On Downstream Tasks and Deprioritising SAE Research (GDM Mech Interp Team Progress Update #2)

A Problem to Solve Before Building a Deception Detector

lewis smith's Shortform