SAEBench: A Comprehensive Benchmark for Sparse Autoencoders
Adam Karvonen*, Can Rager*, Johnny Lin*, Curt Tigges*, Joseph Bloom*, David Chanin, Yeu-Tong Lau, Eoin Farrell, Arthur Conmy, Callum McDougall, Kola Ayonrinde, Matthew Wearden, Samuel Marks, Neel Nanda

*equal contribution

TL;DR

* We are releasing SAE Bench, a suite of 8 diverse sparse autoencoder (SAE) evaluations, including unsupervised metrics and downstream tasks. Use our codebase to evaluate your own SAEs!
* You can compare 200+ SAEs of varying sparsity, dictionary size, architecture, and training time on Neuronpedia.
* Think we're missing an eval? We'd love for you to contribute it to our codebase! Email us.

🔍 Explore the Benchmark & Rankings
📊 Evaluate your SAEs with SAEBench
✉️ Contact Us

Introduction

Sparse Autoencoders (SAEs) have become one of the most popular tools for AI interpretability. Much recent interpretability work has focused on studying SAEs, and in particular on improving them, e.g. the Gated SAE, TopK SAE, BatchTopK SAE, ProLU SAE, JumpReLU SAE, Layer Group SAE, Feature Choice SAE, Feature Aligned SAE, and Switch SAE.

But how well do any of these improvements actually work? The core challenge is that we don't know how to measure how good an SAE is. The fundamental premise of SAEs is that they are a useful interpretability tool that unpacks concepts from model activations. Because there are no ground-truth labels for a model's internal features, the field has measured and optimized the proxy of sparsity instead. This objective has successfully produced interpretable SAE latents, but sparsity has known problems as a proxy, such as feature absorption and the composition of independent features. Yet most SAE improvement work merely measures whether reconstruction improves at a given sparsity, potentially missing problems like uninterpretable high-frequency latents or increased composition.

In the absence of a single, ideal metric, we argue that the best way to measure SAE quality is to give a more detailed picture with a range of diverse metrics. In particul
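To make concrete what "optimizing the proxy of sparsity" looks like in practice, here is a minimal sketch of a standard ReLU SAE trained with a reconstruction loss plus an L1 penalty on latent activations. The class, names, and hyperparameters are illustrative assumptions for exposition, not code from the SAEBench repository.

```python
# Minimal sketch of a standard ReLU SAE trained against the usual sparsity proxy.
# Illustrative only: names and hyperparameters are not taken from SAEBench.
import torch
import torch.nn as nn


class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, d_dict: int):
        super().__init__()
        self.W_enc = nn.Parameter(torch.randn(d_model, d_dict) * 0.01)
        self.W_dec = nn.Parameter(torch.randn(d_dict, d_model) * 0.01)
        self.b_enc = nn.Parameter(torch.zeros(d_dict))
        self.b_dec = nn.Parameter(torch.zeros(d_model))

    def encode(self, x: torch.Tensor) -> torch.Tensor:
        # One dense matmul over the full dictionary, then a ReLU.
        return torch.relu((x - self.b_dec) @ self.W_enc + self.b_enc)

    def forward(self, x: torch.Tensor):
        f = self.encode(x)
        x_hat = f @ self.W_dec + self.b_dec
        return x_hat, f


def loss_fn(x, x_hat, f, l1_coeff: float = 1e-3):
    # Reconstruction error plus an L1 penalty on latent activations:
    # sparsity is the proxy being optimized, not interpretability itself.
    recon = (x - x_hat).pow(2).sum(dim=-1).mean()
    sparsity = f.abs().sum(dim=-1).mean()
    return recon + l1_coeff * sparsity
```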
Very true! Each iteration of matching pursuit uses as much compute as the entire encode() of a standard SAE, so it's not only a parallelism problem (though that doesn't help either). I'll update the wording in the post.
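To make the compute comparison concrete, here is a rough sketch of greedy matching pursuit over an SAE decoder dictionary; it is illustrative and not the implementation discussed in this thread. Each of the k sequential iterations recomputes the residual's correlation with every dictionary atom, i.e. one full (d_model × d_dict) matmul, which is the same cost a standard SAE's encode() pays exactly once.

```python
# Rough sketch (assumed names, not the implementation from the thread): greedy
# matching pursuit over an SAE decoder dictionary. Each iteration does one full
# (batch x d_model) @ (d_model x d_dict) matmul -- the cost of one standard encode().
import torch


def matching_pursuit_encode(x: torch.Tensor, W_dec: torch.Tensor, k: int) -> torch.Tensor:
    """x: (batch, d_model); W_dec: (d_dict, d_model) with unit-norm rows."""
    batch = x.shape[0]
    d_dict = W_dec.shape[0]
    coeffs = torch.zeros(batch, d_dict, device=x.device)
    residual = x.clone()
    for _ in range(k):  # k sequential steps, each as expensive as one encode()
        scores = residual @ W_dec.T             # full-dictionary matmul per step
        idx = scores.abs().argmax(dim=-1)       # greedily pick the best atom
        picked = scores.gather(-1, idx.unsqueeze(-1)).squeeze(-1)
        coeffs[torch.arange(batch, device=x.device), idx] += picked
        residual = residual - picked.unsqueeze(-1) * W_dec[idx]
    return coeffs
```

The loop also cannot be parallelized across iterations, since each step depends on the residual left by the previous one, which is the point about it being more than a parallelism problem.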