Over the past year, the Google DeepMind mechanistic interpretability team has pivoted to a pragmatic approach to interpretability, as detailed in our accompanying post[1], and we are excited for more in the field to embrace pragmatism! In brief, we think that:
It is crucial to have empirical feedback on your ultimate goal with good proxy tasks[2].
We do not need near-complete understanding to have significant impact.
We can perform good focused projects by starting with a theory of change, and good exploratory projects by starting with a robustly useful setting.
The Google DeepMind mechanistic interpretability team has made a strategic pivot over the past year, from ambitious reverse-engineering to a focus on pragmatic interpretability:
Trying to directly solve problems on the critical path to AGI going well[1]
Measuring progress with empirical feedback on proxy tasks
We believe that, on the margin, more researchers who share our goals should take a pragmatic approach to interpretability, both in industry and academia, and we call on people to join us.
Our proposed scope is broad and includes much non-mech interp work, but we see this as the natural approach for mech
Lewis Smith*, Sen Rajamanoharan*, Arthur Conmy, Callum McDougall, Janos Kramar, Tom Lieberum, Rohin Shah, Neel Nanda
* = equal contribution
The following piece is a list of snippets about research from the GDM mechanistic interpretability team, which we didn’t consider a good fit for turning into a paper, but which we thought the community might benefit from seeing in this less formal form. These are largely things that we found in the process of a project investigating whether sparse autoencoders (SAEs) were useful for downstream tasks, notably out-of-distribution probing.
New paper from the Google DeepMind mechanistic interpretability team, led by Sen Rajamanoharan!
We introduce JumpReLU SAEs, a new SAE architecture that replaces the standard ReLUs with discontinuous JumpReLU activations, and seems to be (narrowly) state of the art over existing methods like TopK and Gated SAEs for achieving high reconstruction at a given sparsity level, without a hit to interpretability. We train through the discontinuity with straight-through estimators, which also let us directly optimise the L0.
To accompany this, we will release the weights of hundreds of JumpReLU SAEs on every layer and sublayer of Gemma 2 2B and 9B in a few weeks. Apply now for early access to the 9B ones! We're... (read 247 more words →)
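As a rough illustration of the JumpReLU / straight-through idea described above (a minimal sketch of our own, not the released training code; the rectangle-kernel pseudo-derivative, names and shapes here are assumptions for the example), the activation can be written as a custom autograd function:

```python
# Hedged sketch: JumpReLU(z) = z * H(z - theta), with a straight-through
# (rectangle-kernel) pseudo-derivative for the threshold theta.
# Placeholder names; not the released implementation.
import torch

class JumpReLUFn(torch.autograd.Function):
    @staticmethod
    def forward(ctx, z, log_theta, bandwidth):
        theta = log_theta.exp()
        ctx.save_for_backward(z, theta)
        ctx.bandwidth = bandwidth
        return z * (z > theta).to(z.dtype)

    @staticmethod
    def backward(ctx, grad_out):
        z, theta = ctx.saved_tensors
        eps = ctx.bandwidth
        # Straight-through w.r.t. z: treat the step as locally constant.
        grad_z = grad_out * (z > theta).to(z.dtype)
        # Rectangle-kernel estimate of d JumpReLU / d theta (nonzero only near z = theta).
        near = ((z - theta).abs() < eps / 2).to(z.dtype)
        grad_theta = -(theta / eps) * near * grad_out
        # Chain rule through theta = exp(log_theta); sum over the batch dimension.
        grad_log_theta = (grad_theta * theta).sum(dim=0)
        return grad_z, grad_log_theta, None

# Usage (assumed shapes: z is [batch, d_sae], log_theta is [d_sae]):
# acts = JumpReLUFn.apply(pre_acts, log_theta, 1e-3)
```

An analogous kernel estimator for the step function itself is what gives pseudo-gradients for an L0 sparsity penalty.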
Authors: Senthooran Rajamanoharan*, Arthur Conmy*, Lewis Smith, Tom Lieberum, Vikrant Varma, János Kramár, Rohin Shah, Neel Nanda
A new paper from the Google DeepMind mech interp team: Improving Dictionary Learning with Gated Sparse Autoencoders!
Gated SAEs are a new Sparse Autoencoder architecture that seems to be a significant Pareto-improvement over normal SAEs, verified on models up to Gemma 7B. They are now our team's preferred way to train sparse autoencoders, and we'd love to see them adopted by the community! (Or to be convinced that it would be a bad idea for them to be adopted by the community!)
They achieve similar reconstruction with about half as many firing features, while being either comparably or more interpretable (the confidence interval for the increase is 0%-13%).
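For intuition, here is a minimal sketch of a gated encoder (illustrative code of our own, with hypothetical names; see the paper for the actual parameterisation and training loss). The key idea: a binary gate decides which features fire, while a tied, per-feature rescaled linear path sets their magnitudes.

```python
# Hedged sketch of a gated SAE encoder's forward pass (not the paper's code).
import torch
import torch.nn as nn

class GatedEncoder(nn.Module):
    def __init__(self, d_model: int, d_sae: int):
        super().__init__()
        self.W_gate = nn.Parameter(torch.randn(d_sae, d_model) * 0.01)
        self.b_gate = nn.Parameter(torch.zeros(d_sae))
        self.r_mag = nn.Parameter(torch.zeros(d_sae))   # per-feature rescaling
        self.b_mag = nn.Parameter(torch.zeros(d_sae))
        self.b_dec = nn.Parameter(torch.zeros(d_model))

    def forward(self, x):
        x_centred = x - self.b_dec
        pi_gate = x_centred @ self.W_gate.T + self.b_gate
        gate = (pi_gate > 0).to(x.dtype)                 # which features fire
        # Magnitude path shares W_gate, rescaled per feature by exp(r_mag).
        pi_mag = x_centred @ (self.W_gate * self.r_mag.exp().unsqueeze(1)).T + self.b_mag
        return gate * torch.relu(pi_mag)
```

This sketch only shows the forward pass; training through the hard gate needs extra care (the paper handles this with an auxiliary loss term).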
This is a series of snippets about the Google DeepMind mechanistic interpretability team's research into Sparse Autoencoders, that didn't meet our bar for a full paper. Please start at the summary post for more context, and a summary of each snippet. They can be read in any order.
Activation Steering with SAEs
Arthur Conmy, Neel Nanda
TL;DR: We use SAEs trained on GPT-2 XL’s residual stream to decompose steering vectors into interpretable features. We find a single SAE feature for anger which is a Pareto-improvement over the anger steering vector from existing work (Section 3, 3 minute read). We have more mixed results with wedding steering vectors: we can partially interpret the vectors, but the... (read 2340 more words →)
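As a rough sketch of the mechanic (illustrative code of our own; the model, hook point, feature index and coefficient are placeholders), steering with an SAE feature just means adding a scaled copy of that feature's decoder direction to the residual stream during the forward pass:

```python
# Hedged sketch: add coeff * (decoder direction of one SAE feature) to the
# residual stream at a chosen layer, via a PyTorch forward hook.
import torch

def make_steering_hook(direction: torch.Tensor, coeff: float):
    def hook(module, inputs, output):
        # output: residual-stream activations, shape [batch, seq, d_model]
        # (assumes the hooked module returns a plain tensor, not a tuple)
        return output + coeff * direction
    return hook

# direction = sae.W_dec[feature_idx]   # hypothetical: decoder row of the chosen feature
# handle = resid_block.register_forward_hook(make_steering_hook(direction, 10.0))
# ...generate text...
# handle.remove()
```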
This is a progress update from the Google DeepMind mechanistic interpretability team, inspired by the Anthropic team’s excellent monthly updates! Our goal was to write up a series of snippets, covering a range of things that we thought would be interesting to the broader community, but didn't yet meet our bar for a paper. This is a mix of promising initial steps on larger investigations, write-ups of small investigations, replications, and negative results.
Our team’s two main current goals are to scale sparse autoencoders to larger models, and to do further basic science on SAEs. We expect these snippets to mostly be of interest to other mech interp practitioners, especially those working with SAEs. One... (read 815 more words →)
Activation Patching is a method of directly computing causal attributions of behavior to model components. However, applying it exhaustively requires a sweep with cost scaling linearly in the number of model components, which can be prohibitively expensive for SoTA Large Language Models (LLMs). We investigate Attribution Patching (AtP), a fast gradient-based approximation to Activation Patching, and find two classes of failure modes of AtP which lead to significant false negatives. We propose a variant of AtP called AtP*, with two changes to address these failure modes.
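As a rough sketch of the approximation being studied (our own illustrative code, under one common convention; the cache and metric helpers are hypothetical, not a specific library API), attribution patching replaces per-component patching runs with a first-order estimate from cached activations and a single backward pass:

```python
# Hedged sketch of attribution patching (AtP):
#   delta_metric ≈ (clean_act - corrupt_act) · d metric / d act,
# with the gradient taken on the corrupted run.
# `run_with_cache` and `metric` are hypothetical helpers.
import torch

def atp_scores(model, clean_batch, corrupt_batch, metric, run_with_cache):
    # Cache activations on the clean prompt (no gradients needed).
    _, clean_cache = run_with_cache(model, clean_batch)
    # Run the corrupted prompt; the helper is assumed to retain gradients
    # on the cached activations.
    out, corrupt_cache = run_with_cache(model, corrupt_batch, requires_grad=True)
    metric(out).backward()   # one backward pass gives d metric / d activation
    scores = {}
    for name, act in corrupt_cache.items():
        # First-order estimate of the effect of patching in the clean activation.
        scores[name] = ((clean_cache[name] - act) * act.grad).sum()
    return scores
```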
This is the fifth post in the Google DeepMind mechanistic interpretability team’s investigation into how language models recall facts. This post is a bit tangential to the main sequence, and documents some interesting observations about how, in general, early layers of models somewhat (but not fully) specialise in processing recent tokens. You don’t need to believe these results to believe our overall results about facts, but we hope they’re interesting! And likewise you don’t need to read the rest of the sequence to engage with this.
Introduction
In this sequence we’ve presented the multi-token embedding hypothesis, that a crucial mechanism behind factual recall is that on the final token of a multi-token entity there... (read 1186 more words →)
Seems also like the "playing dead" behaviour. If you're under attack and aren't going to summon/indicate allies (via sadness) or enforce your boundary yourself (via anger) or appease the attacker (via submission), another option is to give up on active response and hope that if you play dead just right, they'll lose interest for some reason. Many attackers' goals are better served by a responsive opponent; and attacking someone dead is both potentially unhealthy and no fun.
Ah, I think I can stymie M with 2 nonconstant advisors. Namely, let $A_1(n)=\frac{1}{2}-\frac{1}{n+3}$ and $A_2(n)=\frac{1}{2}+\frac{1}{n+3}$. We (setting up an adversarial $E$) precommit to setting $E(n)=0$ if $p(n)\ge A_2(n)$ and $E(n)=1$ if $p(n)\le A_1(n)$; now we can assume that M always chooses $p(n)\in[A_1(n),A_2(n)]$, since this is better for M.
Now define $b'_i(j)=|A_i(j)+E(j)-1|-|p(j)+E(j)-1|$ and $b_i(n)=\sum_{j<n}b'_i(j)$. Note that if we also define $\mathrm{bad}_i(n)=\sum_{j<n}(\log|A_i(j)+E(j)-1|-\log|p(j)+E(j)-1|)$ then $\sum_{j<n}|2b_i(j)-\mathrm{bad}_i(j)|\le\sum_{j<n}(2A_1(j)-1-\log(2A_1(j)))=\sum_{j<n}O((\tfrac{1}{2}-A_1(j))^2)$ is bounded; therefore if we can force $b_1(n)\to\infty$ or $b_2(n)\to\infty$ then we win.
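(To spell out the last step: writing $\epsilon_j=\tfrac{1}{2}-A_1(j)=\tfrac{1}{j+3}$, we have $2A_1(j)-1-\log(2A_1(j))=-2\epsilon_j-\log(1-2\epsilon_j)=2\epsilon_j^2+O(\epsilon_j^3)$, and $\sum_j\epsilon_j^2<\infty$, which is why the sum is bounded.)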
Let's reparametrize by writing $\delta(n)=A_2(n)-A_1(n)=\frac{2}{n+3}$ and $q(n)=\frac{p(n)-A_1(n)}{\delta(n)}$, so that $b'_i(j)=\delta(j)\,(|i-2+E(j)|-|q(j)-1+E(j)|)$.
Now, similarly to how M worked for constant advisors, let's look at the problem in rounds: let $s_0=0$, and $s_n=\lfloor\exp(s_{n-1}-1)\rfloor+1$ for $n>0$. When determining $E(s_{n-1}),\dots,E(s_n-1)$, we can look at $p(s_{n-1}),\dots,p(s_n-1)$. Let
I don't yet know whether I can extend it to two nonconstant advisors, but I do know I can extend it to a countably infinite number of constant-prediction advisors. Let $(P_i)_{i=0,\dots}$ be an enumeration of their predictions that contains each one an infinite number of times. Then:
bad1 and bad2 compute log-badnesses of M relative to p1 and p2, on E[:prev]; the goal of M is to ensure neither one goes to ∞. prev, this, next are set in such a way that M is permitted access to this when computing p[this:next].
The first half constructs an invariant measure which is then shown to be unsatisfactory because UTMs can rank arbitrarily high while only being good at encoding variations of themselves. This is mostly the case because the chain is transient; if it was positive recurrent then the measure would be finite, and UTMs ranking high would have to be good at encoding (and being encoded by) the average UTM rather than just a select family of UTMs.
The second half looks at whether we can get better results (ie a probability measure) by restricting our attention to output-free "UTMs" (though I misspoke; these are not actually UTMs but
There is a lot more to say about the perspective that isn't relaxed to continuous random variables. In particular, the problem of finding the maximum entropy joint distribution that agrees with particular pairwise distributions is closely related to Markov Random Fields and the Ising model. (The relaxation to continuous random variables is a Gaussian Markov Random Field.) It is easily seen that this maximum entropy joint distribution must have the form $\log\Pr(1_{\varphi_1},\dots,1_{\varphi_n})=\sum_{i<j}\theta_{ij}1_{\varphi_i\wedge\varphi_j}+\sum_i\theta_i 1_{\varphi_i}-\log Z$, where $\log Z$ is the normalizing constant, or partition function. This is an appealing distribution to use, and easy to do conditioning on and to add new variables to. Computing relative entropy reduces to finding bivariate marginals and to computing $Z$,
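To make the objects concrete, here is a toy sketch of our own (arbitrary placeholder parameters, feasible only for small $n$) computing $Z$ and a bivariate marginal by brute force for this pairwise form:

```python
# Toy sketch of the pairwise maximum-entropy form over truth assignments,
#   log Pr(x) = sum_{i<j} theta_ij * x_i x_j + sum_i theta_i * x_i - log Z,
# with Z and one bivariate marginal computed by brute force.
import itertools
import math

n = 4
theta_pair = {(i, j): 0.3 for i in range(n) for j in range(i + 1, n)}
theta_single = [0.1 * (i + 1) for i in range(n)]

def score(x):
    """Unnormalised log-probability of a truth assignment x in {0,1}^n."""
    s = sum(theta_single[i] * x[i] for i in range(n))
    s += sum(t * x[i] * x[j] for (i, j), t in theta_pair.items())
    return s

assignments = list(itertools.product([0, 1], repeat=n))
Z = sum(math.exp(score(x)) for x in assignments)        # partition function
probs = {x: math.exp(score(x)) / Z for x in assignments}

# Bivariate marginal Pr(phi_0 and phi_1):
p01 = sum(p for x, p in probs.items() if x[0] == 1 and x[1] == 1)
print(math.log(Z), p01)
```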
In order to understand what the measure $\mu$ that was constructed from $d$ will reward, here's the sort of machine that comes close to $\sup_M\mu(M)=3$:
Let $M_0$ be an arbitrary UTM. Now consider the function $r(n)=n-2^{\lfloor\lg n\rfloor}$ (or, really, any function $r:\mathbb{N}^+\to\mathbb{N}_0$ with $r(n)<n$ that visits every nonnegative integer infinitely many times), and let $L=\{x\in\{0,1\}^*:|x|>2,\ x_{|x|-1}=x_{r(|x|-1)},\ x_{|x|-2}=x_{r(|x|-2)}\}$. (The indices here are zero-based.) Choose $x_0\in L$ such that $x_0$ has no proper prefix in $L$. Then, construct the UTM M that does:
repeat:
    s := ""
    while s not in L:
        # if there is no next character, halt
Consider the function $a(M_1,M_2)=2^{-d(M_1,M_2)-d(M_2,M_1)}$ where $d(M_1,M_2)=\min\{|x|:x\in\{0,1\}^*,\ \forall y\in\{0,1\}^*:M_1(xy)=M_2(y)\text{ unless neither of these halts}\}$. The reversible Markov chain with transition probabilities $p(M_1,M_2)=\frac{a(M_1,M_2)}{\sum_{M'_2}a(M_1,M'_2)}$ has a bounded positive invariant measure $\mu(M)=\sum_{M'}a(M,M')$. Of course, as the post showed, the total measure is infinite. Also, because the chain is reversible and transient, the invariant measure is far from unique; indeed, for any machine $M_0$, the measure $\mu(M)=p^{(0)}(M,M_0)+2\sum_{n=1}^{\infty}p^{(n)}(M,M_0)$ will be a bounded positive invariant measure.
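(Spelling out the invariance of $\mu$: since $a$ is symmetric, detailed balance holds, $\mu(M_1)p(M_1,M_2)=a(M_1,M_2)=a(M_2,M_1)=\mu(M_2)p(M_2,M_1)$, and summing over $M_1$ gives $\sum_{M_1}\mu(M_1)p(M_1,M_2)=\sum_{M_1}a(M_2,M_1)=\mu(M_2)$.)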
It seems tempting (to me) to try to get a probability measure by modding out the output-permutations (that the post uses to show this isn't possible for the full set of UTMs). To this end, consider the set of UTMs that have no output. (These will
Actually, on further thought, I think the best thing to use here is a log-bilinear distribution over the space of truth-assignments. For these, it is easy to efficiently compute exact normalizing constants, conditional distributions, marginal distributions, and KL divergences; there is no impedance mismatch. KL divergence minimization here is still a convex minimization (in the natural parametrization of the exponential family).
The only shortcoming is that 0 is not a probability, so it won't let you e.g. say that $\Pr(\varphi_1\to\varphi_2)=1$; but this can be remedied using a real or hyperreal approximation.
An easy way to get rid of the probabilities-outside-$[0,1]$ problem in the continuous relaxation is to constrain the "conditional"/updated distribution to have $\mathrm{Var}(1_{\varphi_i}\mid\dots)\le\mathbb{E}(1_{\varphi_i}\mid\dots)(1-\mathbb{E}(1_{\varphi_i}\mid\dots))$ (which is a convex constraint; it's equivalent to $\mathrm{Var}(1_{\varphi_i}\mid\dots)+(\mathbb{E}(1_{\varphi_i}\mid\dots)-\tfrac{1}{2})^2\le\tfrac{1}{4}$), and then minimize KL-divergence accordingly.
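(To spell out the equivalence, writing $X=1_{\varphi_i}$ and suppressing the conditioning: $\mathrm{Var}(X)\le\mathbb{E}(X)(1-\mathbb{E}(X))$ iff $\mathrm{Var}(X)+\mathbb{E}(X)^2-\mathbb{E}(X)+\tfrac{1}{4}\le\tfrac{1}{4}$ iff $\mathrm{Var}(X)+(\mathbb{E}(X)-\tfrac{1}{2})^2\le\tfrac{1}{4}$.)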
The two obvious flaws are that the result of updating becomes ordering-dependent (though this may not be a problem in practice), and that the updated distribution will sometimes have $\mathrm{Var}(1_{\varphi_i}\mid\dots)<\mathbb{E}(1_{\varphi_i}\mid\dots)(1-\mathbb{E}(1_{\varphi_i}\mid\dots))$, and it's not clear how to interpret that.