LHT says (emphasis mine):

dense, randomly-initialized, feed-forward networks contain subnetworks ("winning tickets") that - when trained in isolation - reach test accuracy comparable to the original network in a similar number of iterations

Note that in LHT the subnetworks are trained, unlike in the results I cited.

Weight-Sparse Circuits May Be Interpretable Yet Unfaithful

jacob_drori

TLDR: Recently, Gao et al trained transformers with sparse weights, and introduced a pruning algorithm to extract circuits that explain performance on narrow tasks. I replicate their main results and present evidence suggesting that these circuits are unfaithful to the model’s “true computations”.

This work was done as part of the Anthropic Fellows Program under the mentorship of Nick Turner and Jeff Wu.

Introduction

Recently, Gao et al (2025) proposed an exciting approach to training models that are interpretable by design. They train transformers where only a small fraction of their weights are nonzero, and find that pruning these sparse models on narrow tasks yields interpretable circuits. Their key claim is that these weight-sparse models... (read 2171 more words →)

125

Replying toIn (highly contingent!) defense of interpretability-in-the-loop ML training

jacob_drori4d

In (highly contingent!) defense of interpretability-in-the-loop ML training

Not really. What properties does an "interpretability" system have that are not shared any old system in the brain?

Replying toIt Is Reasonable To Research How To Use Model Internals In Training

jacob_drori5d

It Is Reasonable To Research How To Use Model Internals In Training

My guess is doing it with RL is worse because I expect it to generalize to more probes

Why?

Replying toIn (highly contingent!) defense of interpretability-in-the-loop ML training

jacob_drori6d

In (highly contingent!) defense of interpretability-in-the-loop ML training

I got lost at the second diagram. Why is the process of computing expected reward for a given thought/plan called "interpretability"? In what way is it analogous to mech interp? I'd find it clarifying to see a sketch of an interpretability-in-the-loop ML training setup with the same rough arrows as in that diagram.

Replying toCross-Layer Transcoders are incentivized to learn Unfaithful Circuits

jacob_drori9d

Cross-Layer Transcoders are incentivized to learn Unfaithful Circuits

I won't argue about definitions of faithfulness - yours seems as fine as any - but I think I only care about unfaithfulness in as far as it causes our circuits to make incorrect predictions about OOD behavior.

Replying toCross-Layer Transcoders are incentivized to learn Unfaithful Circuits

jacob_drori10d

Cross-Layer Transcoders are incentivized to learn Unfaithful Circuits

For the toy model, I'd like it if you explicitly showed that there are inputs on which the CLT replacement model gives very different outputs to the original model. You could then check whether the PLT replacement model performs better than the CLT one on these inputs. This result should be simple to obtain: just choose a nonbinary input, i.e. an that is not in ${- 1, 1}^{4}$ . But it would make the toy model more compelling to me, since, to me, faithfulness is really about OOD generalization.

Does the USG have access to smarter models than the labs'?

jacob_drori

2mo

People in the AI space, myself included, seem to typically believe that the smartest models that exist today were trained by scaling labs.

But I have met non-AI people who believe that the US government likely has access to secret models more powerful than the labs'.

What is the best evidence either way? If (as I expect) the current evidence points towards today's smartest models having been trained by labs, what evidence should I look out for in the future that would indicate that the situation had changed?

Replying toIntrospection via localization

jacob_drori2mo

Introspection via localization

do you extract the concept vector at the same layer you steer at?

Replying toAnnouncing Gemma Scope 2

jacob_drori2mo

Announcing Gemma Scope 2

an additional frequency penalty which prevented the formation of high-frequency features

Please could you write the equation for this loss term? I couldn't find it in the paper.

Replying to[Paper] Output Supervision Can Obfuscate the CoT

jacob_drori3mo

[Paper] Output Supervision Can Obfuscate the CoT

Is the "CoT Monitor Detection Rate" the fraction of CoTs in which the fully expanded derivative (i.e. the "bad behavior") is present?

Yes.

If so, doesn't this show that the Mind & Face method is working (very slight improvement over the Penalty method)? The task reward is high, the output penalty is low, and the rate of bad behavior (expanded derivative in CoT) is also low. How can there be spillover if the bad behavior simply doesn't occur at all?

Well noticed. Yeah, I admit this is confusing. Under the natural interpretation of "bad behavior" as meaning "writing the expanded derivative anywhere", it's true that the bad behavior was successfully removed. Therefore we interpret the... (read more)

[Paper] Output Supervision Can Obfuscate the CoT

jacob_drori

jacob_drori, lukemarks, cloud, TurnTrout

3mo

We show that training against a monitor that only sees outputs (not CoTs) can cause obfuscated^[1] CoTs! The obfuscation happens in two ways:

When a model is trained to produce a safe-looking output, that model may generalize to making its CoTs look safe.
Since later tokens are conditioned on earlier ones, safe‑looking CoTs may increase the likelihood of safe outputs, causing safe-looking CoTs to be reinforced.

We introduce two mitigations and they work reasonably well. Compared to regular training, our mitigations improve monitorability with comparable or better task performance in two of our three settings. Overall, our work suggests that preserving CoT monitorability is harder than previously thought.

Read our paper on Arxiv. Produced as part of... (read 1224 more words →)

Replying toSteering Out-of-Distribution Generalization with Concept Ablation Fine-Tuning

jacob_drori4mo

Steering Out-of-Distribution Generalization with Concept Ablation Fine-Tuning

Thanks for the very clear reply. I like your decomposition of CAFT into (1) ablating activations and (2) ablating gradients.

To isolate effect (1), we can test a variant of CAFT where we detach the projection before we subtract it from the activations, such that the gradients through the subspace are not ablated.
...
We did this experiment on the two models used in the paper: for one of them, this version performs slightly worse but similarly to CAFT and, for the other one, it doesn’t reduce misalignment at all and it’s similar to regular finetuning.

I predict that for the model where only-ablate-activations performed well, the ablation is actually steering towards misalignment (whereas for the... (read more)

LLMs linearly represent the accuracy of their next-token predictions

A quick experiment I did on Llama-3.2-1B:

Choose a layer, and let ${\to x}_{t}$ be that layer's residual activation at token position $t$ .
Let $l_{t}^{(correct)}$ be the logit at position $t$ that was assigned to whatever turned out to be the correct next token (i.e. the correct token at position $t + 1$ ).
Use least-squares to fit a linear map that takes ${\to x}_{t}$ and predicts $l_{t - 1}^{(correct)}$ .

Results for each layer are below. Layers in the second half of the model (except the last layer) have decent ( $R^{2} > 0.6$ ) linear representations of the correct logit.

I don't find this result very surprising. I checked it because I thought it might explain what's going on in this very cool recent paper by @Dmitrii Krasheninnikov ,... (read more)

Operationalizing the definition of a shard

Pope and Turner (2022) define a shard as follows:

A shard of value refers to the contextually activated computations which are downstream of similar historical reinforcement events.

To operationalize their definition, we must decide exactly what we mean by contextually activated computations, and by similar reinforcement events. One could philosophize here, but I'll just pick somewhat-arbitrary definitions and run with them, for the sake of quickly producing something I could turn into code.

Following Apollo/Goodfire, I will identify contextually activated computations with certain directions in parameter space. These directions might be found using APD, L3D, SPD, or some future method.

I am not sure when to consider two reinforcement events (i.e. RL reward-assignments)... (read more)

jacob_drori's Shortform

jacob_drori

6mo

This is a special post for quick takes (aka "shortform"). Only the owner can create top-level comments.

Some issues with the ICL Superposition paper

Setup:

Xiong et al (2024) show that LLMs can in-context-learn several tasks at once. Consider e.g. the following prompt:

France -> F
Portugal -> Portuguese
Germany -> Berlin
Spain -> Madrid
Russia -> R
Poland -> Polish
Italy ->

A model will complete this prompt sometimes with Rome, sometimes with I, and sometimes with Italian, learning a "superposition" of the country -> capital, country -> first-letter and country -> language tasks. (I wish they hadn't used this word: the mech interp notion of superposition is unrelated).

Let $D_{i}$ be the proportion of the in-context examples that correspond to task $i$ , and let $P_{i}$ be the probability that the model completes the prompt according to task $i$ . Call a model calibrated if $P_{i} \approx D_{i}$ : the... (read 411 more words →)

[Research Note] Optimizing The Final Output Can Obfuscate CoT

lukemarks

lukemarks, jacob_drori, cloud, TurnTrout

7mo

Produced as part of MATS 8.0 under the mentorship of Alex Turner and Alex Cloud. This research note overviews some early results which we are looking for feedback on.

TL;DR: We train language models with RL in toy environments. We show that penalizing some property of the output is sufficient to suppress that property in the chain of thought also, even when that property is relevant to task completion. For example, when we penalize a model for mentioning in its output that it completed a task via a certain form of cheating, its reasoning also omits this fact. We therefore suspect that training against an output monitor is sufficient to cause obfuscated CoTs... (read 1556 more words →)

200

SAE on activation differences

Santiago Aranguri

Santiago Aranguri, jacob_drori, Neel Nanda

7mo

TLDR: we find that SAEs trained on the difference in activations between a base model and its instruct finetune are a valuable tool for understanding what changed during finetuning.

This work is the result of Jacob and Santiago's 2-week research sprint as part of Neel Nanda's training phase for MATS 8.0

Introduction

Given the overwhelming number of capabilities of current LLMs, we need a way to understand what functionalities are added when we make a new training checkpoint of a model. This is especially relevant when deploying a new model since among many new and useful features there may be hidden an unexpected harmful or undesired behavior. ^[1]

Model diffing aims at finding these differences between models.... (read 1299 more words →)

Sparsely-connected Cross-layer Transcoders

jacob_drori

8mo

TLDR: I develop a method to sparsify the internal computations of a language model. My approach is to train cross-layer transcoders that are sparsely-connected: each latent depends on only a few upstream latents. Preliminary results are moderately encouraging: reconstruction error decreases with number of connections, and both latents and their connections often appear interpretable. However, both practical and conceptual challenges remain.

This work is in an early stage. If you're interested in collaborating, please reach out to jacobcd52@g***l.com.

0. Introduction

A promising line of mech interp research studies feature circuits^[1]. The goal is to (1) identify representations of interpretable features in a model's latent space, and then (2) determine how earlier-layer representations combine to generate later... (read 3499 more words →)

There is a globe in your LLM

jacob_drori

Gurnee & Tegmark (2023) trained linear probes to take an LLM's internal activation on a landmark's name (e.g. "The London Eye"), and predict the landmark's longitude and latitude. The results look like this:^[1]

Two angles of true world atlas, with predicted atlas hovering above. True locations are red points; predicted locations are blue, in a slightly raised plane, linked to the corresponding true location by a grey line.

So LLMs (or at least, Llama 2, which they used for this experiment) contain a pretty good linear representation of an atlas.

Sometimes, like when thinking about distances, a globe is more useful than an atlas. Do models use the globe representation? To find out, we can... (read more)

Domain-specific SAEs

jacob_drori

TLDR: Current SAE training doesn't specifically target features we care about, e.g. safety-relevant ones. In this post, we compare three ways use to SAEs to efficiently extract features relevant to a domain of interest.

Introduction

If Sparse Autoencoders (SAEs) are to be useful for alignment, they should reliably extract safety-relevant features. But currently, our methods for training SAEs are not targeted towards finding such features. Instead, we train SAEs on unstructured web text, then check if any of the learnt features happen to be safety-relevant (see e.g. here). By making our SAEs wider, we hope to find more and more - even all - of a model's features, thereby guaranteeing that the ones we... (read 1369 more words →)

Open Source Automated Interpretability for Sparse Autoencoder Features

kh4dien

kh4dien, SrGonao, jacob_drori, Nora Belrose

Background

Sparse autoencoders recover a diversity of interpretable, monosemantic features, but present an intractable problem of scale to human labelers. We investigate different techniques for generating and scoring text explanations of SAE features.

Key Findings

Open source models generate and evaluate text explanations of SAE features reasonably well, albeit somewhat worse than closed models like Claude 3.5 Sonnet.
Explanations found by LLMs are similar to explanations found by humans.
Automatically interpreting 1.5M features of GPT-2 with the current pipeline would cost $1300 in API calls to Llama 3.1 or $8500 with Claude 3.5 Sonnet. Prior methods cost ~$200k with Claude.
Code can be found at https://github.com/EleutherAI/sae-auto-interp.
We built a small dashboard to explore explanations and their scores: https://cadentj.github.io/demo/

Generating Explanations

Sparse... (read 3740 more words →)

LESSWRONG
LW

LESSWRONG
LW

jacob_drori

[Research Note] Optimizing The Final Output Can Obfuscate CoT

Weight-Sparse Circuits May Be Interpretable Yet Unfaithful

There is a globe in your LLM

[Paper] Output Supervision Can Obfuscate the CoT

jacob_drori

jacob_drori

Weight-Sparse Circuits May Be Interpretable Yet Unfaithful

Does the USG have access to smarter models than the labs'?

[Paper] Output Supervision Can Obfuscate the CoT

jacob_drori's Shortform

[Research Note] Optimizing The Final Output Can Obfuscate CoT

SAE on activation differences

Sparsely-connected Cross-layer Transcoders

jacob_drori

[Research Note] Optimizing The Final Output Can Obfuscate CoT

Weight-Sparse Circuits May Be Interpretable Yet Unfaithful

There is a globe in your LLM

[Paper] Output Supervision Can Obfuscate the CoT

jacob_drori

jacob_drori

Weight-Sparse Circuits May Be Interpretable Yet Unfaithful

Does the USG have access to smarter models than the labs'?

[Paper] Output Supervision Can Obfuscate the CoT

jacob_drori's Shortform

[Research Note] Optimizing The Final Output Can Obfuscate CoT

SAE on activation differences

Sparsely-connected Cross-layer Transcoders

Introduction

LLMs linearly represent the accuracy of their next-token predictions

Operationalizing the definition of a shard

Introduction

0. Introduction

Introduction

Background

Key Findings

Generating Explanations