Sparse autoencoders find composed features in small toy models

14th Mar 2024


## Comments (10)

Nice work! I was actually planning on doing something along these lines and still have some things I'd like to try.

Interestingly your SAEs appear to be generally failing to even find optimal solutions w.r.t the training objective. For example in your first experiment with perfectly correlated features I think the optimal solution in terms of reconstruction loss and L1 loss combined (regardless of the choice of the L1 loss weighting) would have the learnt feature directions (decoder weights) pointing perfectly diagonally. It looks like very few of your hyperparameter combinations even came close to this solution.

My post was concerned primarily with the training objective being misaligned with what we really want, but here we're seeing an additional problem of SAEs struggling to even optimise for the training objective. I'm wondering though if this might be largely/entirely a result of the extremely low dimensionality and therefore very few parameters causing them to get stuck in local minima. I'm interested to see what happens with more dimensions and more variation in terms of true feature frequency, true feature correlations, and dictionary size. And orthogonality loss may have more impact in some of those cases.

Hi Demian! Sorry for the really slow response.

Yes! I agree -- I was also surprised that the decoder weights weren't pointing diagonally in the case where feature occurrences were perfectly correlated, and I'm not sure I really grok why this is the case. The models do learn a feature basis that can describe any of the (four) data points that can be passed into the model, but it doesn't seem optimal either for L1 or MSE.

And -- yeah, I think this is an extremely pathological case. Preliminary results look like larger dictionaries finding larger sets of features do a better job of not getting stuck in these weird local minima, and the possible number of interesting experiments here (varying frequency, varying SAE size, varying which things are correlated) is making for a pretty large exploration space.

Hey guys, great post and great work!

I have a comment, though. For concreteness, let me focus on the case of the (x_2, y_1) composition of features. This corresponds to feature vectors of the form A[0, 1, 1, 0] in the case of correlated feature amplitudes and [0, a, b, 0] in the case of uncorrelated feature amplitudes. Note that the plane spanned by x_2 and y_1 admits an infinite family of orthogonal bases, one of which, for example, is [0, 1, 1, 0] and [0, 1, -1, 0]. When we train a Toy Model of Superposition, we plot the projection of our choice of feature basis, as done by Anthropic and also by you guys. However, the training dataset for the SAE (that you trained afterward) contains *no information* about the original (arbitrarily chosen by us) basis. SAEs could learn to decompose vectors from the dataset in terms of *any* of the infinite family of bases.

This is exactly what some of your SAEs seem to be doing. They are still learning four antipodal directions (which are just not the same as the four antipodal directions corresponding to your original chosen basis). This, to me, seems like a success of the SAE.

We should not expect the SAE to learn anything about the original choice of basis at all. This choice of basis is not part of the SAE training data. If we want to be sure of this, we can plot the training data of the SAE on the plane (in terms of a scatter plot) and see that it is independent of any choice of bases.

Thanks for the comment! Just to check that I understand what you're saying here:

> We should not expect the SAE to learn anything about the original choice of basis at all. This choice of basis is not part of the SAE training data. If we want to be sure of this, we can plot the training data of the SAE on the plane (in terms of a scatter plot) and see that it is independent of any choice of bases.

Basically -- you're saying that in the hidden plane of the model, data points are just scattered throughout the area of the unit circle (in the uncorrelated case) and in the case of one set of features they're just scattered within one quadrant of the unit circle, right? And those are the things that are being fed into the SAE as input, so from that perspective perhaps it makes sense that the uncorrelated case learns the 45° vectors, because that's the mean of all of the input training data to the SAE. Neat, hadn't thought about it in those terms.

> This, to me, seems like a success of the SAE.

I can understand this lens! I guess I'm considering this a failure mode because I'm assuming that what we want SAEs to do is to reconstruct the known underlying features, since we (the interp community) are trying to use them to find the "true" underlying features in e.g., natural language. I'll have to think on this a bit more. To your point -- maybe they can't learn about the original basis choice, and I think that would maybe be bad?

Hi Evan, thank you for the explanation, and sorry for the late reply.

I think that the inability to learn the original basis is tied to the properties of the SAE training dataset (and won't be solved by supplementing SAEs with additional terms in its loss function). I think it's because we could have generated the same dataset with a different choice of basis (though I haven't tried formalizing the argument nor run any experiments).

I also want to say that perhaps not being able to learn the original basis is not so bad after all. As long as we can represent the full number of orthogonal feature directions (4 in your example), we are okay. (Though this is a point I need to think more about in the case of large language models.)

If I understood Demian Till's post right, his examples involved some of the features not being learned at all. In your example, it would be equivalent to saying that an SAE could learn only 3 feature directions and not the 4th. But your SAE could learn all four directions.

Hi Ali, sorry for my slow response, too! Needed to think on it for a bit.

- Yep, you could definitely generate the dataset with a different basis (e.g., [1,0,0,0] = 0.5*[1,0,1,0] + 0.5*[1,0,-1,0]).
- I *think* in the context of language models, learning a different basis is a problem. I assume that, there, things aren't so clean as "you can get back the original features by adding 1/2 of that and 1/2 of this". I'd imagine it's more like feature 1 = "*the* in context A", feature 2 = "*the* in context B", feature 3 = "*the* in context C". And if *the* is a real feature (I'm not sure it is), then I don't know how to back out the real basis from those three features. But I think this points to just needing to carry out more work on this, especially in experiments with more (and more complex) features!
- Yes, good point, I think that Demian's post was worried about some features not being learned at all, while here all features were learned -- even if they were rotated -- so that is promising!

Regarding some features not being learnt at all, I was anticipating this might happen when some features activate much more rarely than others, potentially incentivising SAEs to learn more common combinations instead of some of the rarer features. In order to potentially see this we'd need to experiment with more variations, as mentioned in my other comment.

Hey! Thanks for doing this research.

Lee Sharkey et al. did a similar experiment a while back with a much larger number of features & dimensions, and there were hyperparameter settings that perfectly reconstructed the original dataset (as you predicted would happen as N increases).

Hoagy still hosts a version of our replication here (though I haven't looked at that code in a year!).

Hi Logan! Thanks for pointing me towards that post -- I've been meaning to get around to reading it in detail and just finally did. Glad to see that the large-N limit seems to get perfect reconstruction for at least one similar toy experiment! And thanks for sharing the replication code.

I'm particularly keen to learn a bit more about the correlated features -- did you (or anyone you know of) study toy models where a few features are REALLY correlated with one another, and basically never appear with other features? I'm wondering if such features could bring back the problem that we saw here, even in a very high-dimensional model / dataset. Most of the metrics in that post are averaged over all features, so they don't really differentiate between correlated features and uncorrelated ones.

Agreed. You would need to change the correlation code to hardcode feature correlations, then you can zoom in on those two features when doing the max cosine sim.
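As a concrete sketch of the max-cosine-similarity check discussed here (a hypothetical helper, not the actual replication code -- the function name and array layout are our own):

```python
import numpy as np

def max_cosine_sims(decoder, true_features):
    """For each ground-truth feature direction, find the cosine similarity
    of the best-matching learned decoder direction.

    decoder: (M, F) array whose columns are learned dictionary directions.
    true_features: (M, K) array whose columns are true feature directions.
    Returns a length-K array of max cosine similarities (1.0 = perfect match).
    """
    D = decoder / np.linalg.norm(decoder, axis=0, keepdims=True)
    T = true_features / np.linalg.norm(true_features, axis=0, keepdims=True)
    sims = T.T @ D          # (K, F) matrix of cosine similarities
    return sims.max(axis=1)
```

Zooming in on two hardcoded-correlated features then just means inspecting the corresponding two entries of the returned array.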

## Summary

- **Context:** Sparse autoencoders (SAEs) reveal interpretable features in the activation spaces of language models. They achieve sparse, interpretable features by minimizing a loss function which includes an ℓ1 penalty on the SAE hidden layer activations.
- **Problem & Hypothesis:** While the SAE ℓ1 penalty achieves sparsity, it has been argued that it can also cause SAEs to learn commonly-composed features rather than the "true" features in the underlying data.
- **Experiment:** We propose a modified setup of Anthropic's ReLU Output Toy Model where data vectors are made up of sets of composed features. We study the simplest possible version of this toy model with two hidden dimensions for ease of comparison to many of Anthropic's visualizations. Features within each set are anticorrelated, and features are stored in antipodal pairs. Perhaps it's a bit surprising that features are stored in superposition at all, because the features in the very small models we studied here are not sparse (they occur every other data draw, so have S ~ 0.5 in the language of Anthropic's Toy Models paper)^{[1]}.
- **Result:** SAEs trained on the activations of these small toy models find composed features rather than the true features, regardless of learning rate or ℓ1 coefficient used in SAE training.
- **Future work:** We see these models as a simple testing ground for proposed SAE training modifications. We share our code in the hopes that we can figure out, as a community, how to train SAEs that aren't susceptible to this failure mode.

The diagram below gives a quick overview of what we studied and learned in this post:

## Introduction

Last year, Anthropic and EleutherAI/Lee Sharkey's MATS stream showed that sparse autoencoders (SAEs) find human-interpretable "features" in language model activations. They achieve this interpretability by having sparse activations in the SAE hidden layer, such that only a small number of SAE features are active for any given token in the input data. While the objective of SAEs is, schematically, to "reconstruct model activations perfectly and do so while only having a few *true* features active on any given token," the loss function used to train SAEs is a combination of mean squared error reconstruction of model activations and an ℓ1 penalty on the SAE hidden layer activations. This ℓ1 term may introduce unintended "bugs" or failure modes into the learned features.

Recently, Demian Till questioned whether SAEs find "true" features. That post argued that the ℓ1 penalty could push autoencoders to learn *common combinations of features*, because having two common true features which occur together shoved into one SAE feature would achieve a lower value of the ℓ1 term in the loss than two independent "true" features which fire together. This is a compelling argument, and if we want to use SAEs to find true features in natural language, we need to understand when this failure mode occurs and whether we can avoid it. Without any knowledge of what the *true* features are in language models, it's hard to evaluate how robust of a pitfall this is for SAEs, and it's also hard to test if proposed solutions to this problem actually work at recovering true features (rather than just a different set of not-quite-right ones). In this post, we turn to toy models, where the true features are known, to determine: (1) *when* does this failure mode happen, and (2) how can we fix it?

In this blog post, we'll focus on question #1 in an extremely simple toy model (Anthropic's ReLU output model with 2 hidden dimensions) to argue that, yes, SAEs definitely learn composed (rather than true) features in a simple, controlled setting. We release the code that we use to create the models and plots in the hope that we as a community can use these toy models to test out different approaches to fixing this problem, and we hope to write future blog posts that help answer question #2 above (see the Future Work section).

The synthetic data that we use in our toy model is inspired by this post by Chris Olah about feature composition. In that post, two categories of features are considered: shapes and colors. The set of shapes is {circle, triangle, square} and the set of colors is {white, red, green, blue, black}. Each data vector is some (color, shape) pair like (green, circle) or (red, triangle). We imagine that these kinds of composed features occur frequently in natural datasets. For example, we know that vision models learn to detect both curves and frequency (among many other things), but you could imagine curved shapes with regular patterns (see: google search for 'round gingham tablecloth'). We want to understand what models and SAEs do with this kind of data.

## Experiment Details

## ReLU Output Toy Models

We study Anthropic's ReLU output model:

$$h = Wx,$$

$$x' = \mathrm{ReLU}(W^T h + b) = \mathrm{ReLU}(W^T W x + b).$$

Here the model weights W ∈ R^{M×N} and bias b ∈ R^N are learned. The model inputs x are generated according to a procedure we lay out below in the "Synthetic Data Vectors with Composed Features" section, and the goal of the model is to reconstruct the inputs. We train these toy models using the AdamW optimizer with learning rate 10^{-3}, weight decay 10^{-2}, β1 = 0.9, and β2 = 0.999. Training occurs over 10^4 batches where each batch contains N_b = 10^3 data vectors. The optimizer minimizes the mean squared error loss:

$$L = \frac{1}{N N_b} \sum_x \lVert x - x' \rVert_2^2.$$
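As a rough numpy sketch of the model and loss above (not the released training code -- names, shapes, and the random initialization are our own, and the actual training uses AdamW rather than this bare forward pass):

```python
import numpy as np

rng = np.random.default_rng(0)
N, M = 4, 2                     # input features, hidden dimensions

W = rng.normal(size=(M, N))     # learned weights (random init here; training would update these)
b = np.zeros(N)                 # learned bias

def forward(x):
    """ReLU output toy model: h = W x, x' = ReLU(W^T h + b), batched over rows."""
    h = x @ W.T                 # (N_b, M) hidden activations
    return np.maximum(h @ W + b, 0.0)

def mse_loss(x, x_recon):
    """L = (1 / (N * N_b)) * sum over the batch of ||x - x'||_2^2."""
    return np.sum((x - x_recon) ** 2) / (x.shape[1] * x.shape[0])
```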

## Sparse Autoencoders (SAEs)

We train sparse autoencoders to reconstruct the hidden layer activations h of the toy models. The architecture of the SAEs is:

$$f = \mathrm{ReLU}(W_e h + b_e),$$

$$\hat{h} = W_d f + b_d,$$

where the encoder weights W_e ∈ R^{F×M} and bias b_e ∈ R^F and the decoder weights W_d ∈ R^{M×F} and bias b_d ∈ R^M are learned.
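A minimal numpy sketch of this forward pass (our own illustrative code, not the authors' SAE class; shapes and names are assumptions):

```python
import numpy as np

M, F = 2, 4                     # model hidden dim, SAE dictionary size
rng = np.random.default_rng(0)

W_e = rng.normal(size=(F, M))   # encoder weights
b_e = np.zeros(F)               # encoder bias
W_d = rng.normal(size=(M, F))   # decoder weights (columns = dictionary directions)
b_d = np.zeros(M)               # decoder bias

def sae_forward(h):
    """f = ReLU(W_e h + b_e), h_hat = W_d f + b_d, batched over rows of h."""
    f = np.maximum(h @ W_e.T + b_e, 0.0)   # (N_b, F) sparse codes
    h_hat = f @ W_d.T + b_d                # (N_b, M) reconstruction
    return f, h_hat
```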

Sparse autoencoders (SAEs) are difficult to train. The goals of training SAEs are to: (1) reconstruct the model activations faithfully, and (2) do so with only a small number of SAE features active at a time. To achieve these ends, SAEs are trained on the mean squared error of reconstruction of model activations (a proxy for goal 1) and are trained to minimize the ℓ1 norm of SAE activations (a proxy for goal 2).

We follow advice from Anthropic's January and February updates in informing our training procedure. In this work, we train SAEs using the Adam optimizer with β1 = 0 and β2 = 0.999 and with learning rates lr ∈ {3×10^{-5}, 10^{-4}, 3×10^{-4}, 10^{-3}, 3×10^{-3}}. We minimize the batch mean of the fraction of variance unexplained (FVU) plus the ℓ1 norm of the SAE hidden layer feature activations, so our loss function is

$$L = \frac{1}{N_b} \sum_h \left( \frac{\lVert h - \hat{h} \rVert_2^2}{\lVert h \rVert_2^2} + \lambda \lVert f \rVert_1 \right).$$

The goal of minimizing the FVU instead of a standard squared error is to ensure our SAE is agnostic to the size of the hidden layer of the model it is reconstructing (so that a terrible reconstruction ĥ = 0 always scores 1 regardless of dimensionality)^{[2]}. We vary the ℓ1 penalty coefficient λ ∈ {0.01, 0.03, 0.1, 0.3, 1}. The SAEs are trained over 1.28×10^8 total data samples in batches of 1024, for a total of 125,000 batches. The learning rate linearly warms up from 0 over the first 10% of training and linearly cools down to 0 over the last 20% of training. At each training step, the columns of the decoder matrix are all normalized to 1; this keeps the model from "cheating" on the ℓ1 penalty (otherwise the model would create large outputs using small activations with large decoder weights).

## Synthetic Data Vectors with Composed Features
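The loss term and the decoder-column renormalization can be sketched as follows (illustrative numpy only -- the real training loop uses Adam with the schedule described above, and these helper names are our own):

```python
import numpy as np

def sae_loss(h, h_hat, f, lam):
    """Batch mean of the FVU reconstruction term plus the l1 sparsity
    penalty. Assumes every row of h is nonzero (true for our data)."""
    fvu = np.sum((h - h_hat) ** 2, axis=1) / np.sum(h ** 2, axis=1)
    l1 = lam * np.sum(np.abs(f), axis=1)
    return np.mean(fvu + l1)

def normalize_decoder_columns(W_d):
    """Renormalize each decoder column to unit norm after every optimizer
    step, so the SAE can't shrink its l1 penalty by using tiny activations
    with huge decoder weights."""
    return W_d / np.linalg.norm(W_d, axis=0, keepdims=True)
```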

A primary goal of studying a toy model is to learn something universal about larger, more complex models in a controlled setting. It is therefore critical to reproduce the key properties of natural language that we are interested in studying in the synthetic data used to train our model.

The training data used in natural language has the following properties:

1. Features occur rarely in the dataset (they are *sparse*).
2. …
3. …
4. Features appear in *composition* in natural language datasets. For example, a word can be a feature while *also* being in a specific grammatical context (e.g., inside a set of parentheses or quotation marks). *Token in context* features are an example of composed features: it's possible that the word "the" is a feature, and the context "mathematical text" is a feature, and "the word 'the' in the context of mathematical text" is a composition of these features.

In this post, we will focus on data vectors that satisfy #1 and #4 above, and we hope to satisfy #2 and #3 in future work.

To create synthetic data, we largely follow prior work [Jermyn+2022, Elhage+2022] and generate input vectors x ∈ R^N, where each dimension x_i is a "feature" in the data. We consider a general form of data vectors composed of m sub-vectors, x = [x_{s_1}, x_{s_2}, ⋯, x_{s_m}], where those sub-vectors represent independent feature sets, and where each sub-vector has exactly one non-zero element so that x_{s_i} ≠ 0; dimensionally, x_{s_i} ∈ R^{N_{s_i}} with ∑_{i=1}^{m} N_{s_i} = N.

In this blog post, we study the simplest possible case: two sets (m = 2), each of two features (N = 4, N_{s_i} = 2), so that data vectors take the form x = [x_1, x_2, y_1, y_2]. Since these features occur in composed pairs, in addition to there being four true underlying features {x_1, x_2, y_1, y_2}, there are also four possible feature configurations that the models can learn: [x_1, 0, y_1, 0], [x_1, 0, 0, y_2], [0, x_2, y_1, 0], and [0, x_2, 0, y_2]. For this case, a 2-dimensional probability table exists giving the probability of occurrence of each composed feature pair p(x_i, y_j), where x_i ∈ x_{s_1} and y_j ∈ x_{s_2}. We consider uniformly distributed, uncorrelated features, so that the probability of any pair of features being present is uniform and equal to (N_{s_1} N_{s_2})^{-1} = 1/4, so the simple probability table for our small model is:

|       | y_1 | y_2 |
|-------|-----|-----|
| x_1   | 1/4 | 1/4 |
| x_2   | 1/4 | 1/4 |

The correlation between a feature pair (x_{i'}, y_{j'}) can be raised by increasing p(x_{i'}, y_{j'}) while lowering the probability of x_{i'} appearing alongside y_j for all j ≠ j' and the probability of y_{j'} appearing alongside x_i for all i ≠ i' (and properly renormalizing the rest of the probability table). This is interesting and we want to do this in future work, but in this specific post we'll mostly just focus on the simple probability table above.
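One hypothetical way to implement that renormalization (our own helper, and slightly simpler than the scheme described above: it rescales *all* other entries uniformly rather than only the affected rows and columns):

```python
import numpy as np

def correlate_pair(table, i, j, boost):
    """Raise P(x_i, y_j) by `boost` and uniformly rescale every other
    entry so the table still sums to 1."""
    t = table.copy()
    t[i, j] += boost
    others = np.ones_like(t, dtype=bool)
    others[i, j] = False
    t[others] *= (1.0 - t[i, j]) / t[others].sum()
    return t
```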

To generate synthetic data vectors x, we randomly sample a composed pair (x_i, y_j) from the probability table. We draw the magnitudes of these features from uniform distributions, x_i ∼ U(0,1) and y_j ∼ U(0,1). We can optionally correlate the amplitudes of these features using a correlation coefficient C ∈ [0,1] by setting y_j ← C x_i + (1−C) y_j. Note that by definition, all features in x_{s_1} are anticorrelated since they never co-occur, and the same is true of all features in x_{s_2}. In this post, we study two cases: perfectly correlated feature amplitudes (C = 1) and uncorrelated feature amplitudes (C = 0).
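A sketch of this sampling procedure for the uniform probability table (hypothetical code, not the released implementation):

```python
import numpy as np

def sample_batch(n, C=0.0, rng=None):
    """Draw n two-hot data vectors x = [x1, x2, y1, y2]: one feature from
    each set is active (chosen uniformly), with amplitude correlation C."""
    if rng is None:
        rng = np.random.default_rng()
    X = np.zeros((n, 4))
    i = rng.integers(0, 2, size=n)        # which x feature is active
    j = rng.integers(0, 2, size=n)        # which y feature is active
    xa = rng.uniform(0, 1, size=n)        # x amplitudes
    ya = rng.uniform(0, 1, size=n)        # y amplitudes
    ya = C * xa + (1 - C) * ya            # optional amplitude correlation
    X[np.arange(n), i] = xa
    X[np.arange(n), 2 + j] = ya
    return X
```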

## Including One-hot Vectors

In the experiments outlined above, all data vectors are two-hot, containing a nonzero value in some xi and a nonzero value in some yi. One could argue that, for that data, regardless of C, the natural basis of the data is actually composed pairs and the underlying “true” features are less relevant.

We will therefore consider a case where there is some probability 0 < p(one-hot) < 1 that a given data vector only contains one x_i *or* one y_j -- but not both. We looked at p(one-hot) ∈ {0.5, 0.75}, but in this blog post we will only display results from the p(one-hot) = 0.75 case. To generate the probability table for these data, the table from above is scaled by (1 − p(one-hot)), then an additional row and column are added showing that each feature is equally likely to be present in a one-hot vector (and those equal probabilities must sum up to p(one-hot)). An example probability table for p(one-hot) = 0.75 is:

|         | y_1  | y_2  | one-hot |
|---------|------|------|---------|
| x_1     | 1/16 | 1/16 | 3/16    |
| x_2     | 1/16 | 1/16 | 3/16    |
| one-hot | 3/16 | 3/16 | --      |

## Results
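The construction of that table follows directly from the description above; a small sketch (our own helper name and layout):

```python
import numpy as np

def onehot_mixture_table(p_onehot):
    """Probability table with an extra row/column for one-hot draws.
    table[i, j] for i, j < 2: P(composed pair (x_{i+1}, y_{j+1})).
    table[i, 2]: P(one-hot x_{i+1}); table[2, j]: P(one-hot y_{j+1})."""
    table = np.zeros((3, 3))
    table[:2, :2] = (1 - p_onehot) / 4   # uniform pair block, rescaled
    table[:2, 2] = p_onehot / 4          # one-hot x features
    table[2, :2] = p_onehot / 4          # one-hot y features
    return table
```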

## Correlated Feature Amplitudes

We begin with a case where the amplitudes of the features are perfectly correlated (C = 1), such that the four possible data vectors are A[1,0,1,0], A[1,0,0,1], A[0,1,1,0], and A[0,1,0,1] with A ∼ U(0,1).

*Yes, this is contrived.* The data vectors here are always perfect composed pairs. In some ways we should expect SAEs to find those composed pairs, because those are probably a more natural basis for the data than the "true" features we know about.

As mentioned above, we study the case where the ReLU output model has two hidden dimensions, so that we can visualize the learned features by plotting the columns of the learned weight matrix W in the same manner as Anthropic's work (e.g., here). An example of a model after training is shown in the left panel of this figure. The features in the left panel are labeled by their x_i and y_i, and all features are rotated for visualization purposes so that the x features are on the x-axis. We find the same antipodal feature storage as Anthropic observed for anticorrelated features -- and this makes sense! Recall that in our data setup, x_1 and x_2 are definitionally anticorrelated, and so too are y_1 and y_2.

Something that is surprising is that the model chooses to store these features in superposition at all! These data vectors are not sparse^{[1]}: each feature occurs in every other data vector on average. For a single set of uncorrelated features, models only store features in superposition when the features are sparse. Here, the model takes advantage of the nature of the composed sets and uses superposition despite a lack of sparsity.

We train five realizations of SAEs on the hidden layer activations of this toy model with a learning rate of 3×10^{-4} and ℓ1 regularization coefficient λ = 0.3. Of these SAEs, the one which achieves the lowest loss (reconstruction + ℓ1) is plotted in the large middle panel in the figure above (black arrows, overlaid on the model's feature representations). This SAE's features are labeled according to their hidden dimension in the SAE, so here, e.g., f_1 is a composed feature of x_2 and y_1 like A[0,1,1,0]. The other four higher-loss realizations are plotted in the four rightmost sub-panels. We find a strong preference for off-axis features -- which is to say, *the SAE learns composed pairs*. Each of the five realizations we study (middle and right panels) has this flaw, with only one realization finding even a single true underlying feature (upper right panel).

Can this effect, where the model learns composed pairs of features, be avoided simply through choosing better standard hyperparameters (learning rate and λ)? Probably not:

We scanned two orders of magnitude in both learning rate and λ. We plot the base model, the SAE which achieves the lowest loss out of five realizations (black vectors), and the SAE which achieves the highest monosemanticity out of five realizations according to Eqn. 7 in Engineering Monosemanticity (grey vectors). Only one set of hyperparameters achieves a mostly monosemantic realization: that at λ = 0.01 and with a moderate lr of 3×10^{-4}. Perhaps this makes sense -- a large ℓ1 penalty would push the model towards learning composed features so that fewer features are active per data draw. However, we see that this realization is not perfectly monosemantic, so perhaps this λ is too low to even enforce sparsity in the first place.

## Uncorrelated Feature Amplitudes

We next consider the case where the feature amplitudes within a given data vector are completely uncorrelated, with C=0, so that xi∼U(0,1) and yi∼U(0,1). Whereas in the previous problem, only four (arbitrarily scaled) data vectors could exist, now an infinite number of possible data vectors can be generated, but there still only exist two features in each set and therefore four total composed pairs.

We perform the same experiments as in the previous section, and replicate the same figures from the previous section below. Surprisingly, we find that the model *more cleanly* finds composed pairs than in the case where the input data vectors were pure composed pairs. By breaking the feature amplitude correlation, SAEs almost uniformly learn perfect composed pairs for all parameters studied. We note briefly that, in the grid below, some SAEs find monosemantic features at high learning rate and low λ (see the light grey arrows in the bottom left panels), but even when these monosemantic realizations are achieved, other realizations of the autoencoder find lower-loss, polysemantic realizations with composed pairs.

## Does a Cosine Similarity Loss Term Fix This Problem?

In *Do sparse autoencoders find "true features"?*, a possible solution to this problem is proposed. We tried this, and for our small model it doesn't help.

We calculated the cosine similarity between each column of the decoder weight matrix, W_dec, and stored those cosine similarity values in the square matrix S ∈ R^{F×F}, where F is the hidden dimension size of the SAE. S is symmetric, so we only need to consider the lower triangular part (denoted tril(S)). We tried adding two variations of an S-based term to the loss function:
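As a sketch of what such an S-based penalty can look like (our own illustrative code -- the two exact variants we tried are not reproduced here, and the reductions shown are assumptions):

```python
import numpy as np

def ortho_penalty(W_d, reduce="mean"):
    """Orthogonality penalty on decoder columns: build the cosine
    similarity matrix S, take its strict lower triangle, and penalize
    positive alignment between distinct dictionary directions."""
    D = W_d / np.linalg.norm(W_d, axis=0, keepdims=True)
    S = D.T @ D                                  # (F, F) cosine similarities
    tril = S[np.tril_indices(S.shape[0], k=-1)]  # strict lower triangle
    penalized = np.maximum(tril, 0.0)            # only penalize alignment
    return penalized.mean() if reduce == "mean" else penalized.max()
```

A term like `lam_ortho * ortho_penalty(W_d)` would then be added to the SAE loss.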

Neither formulation improved the ability of our autoencoders to find monosemantic features.

Just because this additional loss term did not help in this small toy context does not mean that it couldn't help find more monosemantic features in other models! We find that it doesn't fix this very specific case, but more tests are needed.

## What if the SAE Actually Gets to See the True Features?

In the experiments I discussed above, every data vector is two-hot, and an x_i and y_j always co-occur. What if we allow data vectors to be one-hot (only containing one of x_i OR y_j) with some probability p(one-hot)? We sample composed data vectors with probability 1 − p(one-hot). We tried this for p(one-hot) ∈ {0.5, 0.75}, and while SAEs are *more* likely to find the true features, it's still not a sure thing -- even when compositions occur only 25% of the time and feature amplitudes are completely uncorrelated in magnitude!

Below we repeat our toy model and SAE plots for the case where p(one-hot) = 0.75. Certainly *more* SAEs find true features in the lowest-loss instance, whereas with p(one-hot) = 0, none did. But there's no robust trend in learning rate and λ.

## Takeaways

Features in natural datasets can be *composed*, occurring in combination with other features. We can model this with synthetic data vectors and toy models.

## Future work

This post only scratched the surface of the exploration work that we want to do with these toy models. Below are some experiments and ideas that we’re excited to explore:

Zipf’s law?We may not have time to get around to working on all of these questions, but we hope to work on some of them. If you’re interested in pursuing these ideas with us, we’d be happy to collaborate!

## Code

The code used to produce the analysis and plots from this post is available online at https://github.com/evanhanders/superposition-geometry-toys. See in particular https://github.com/evanhanders/superposition-geometry-toys/blob/main/experiment_2_hid_dim.ipynb.

## Acknowledgments

We’re grateful to Esben Kran, Adam Jermyn, and Joseph Bloom for useful comments which improved the quality of this post. We’re grateful to Callum McDougall and the ARENA curriculum for providing guidance in setting up and training SAEs in toy models and to Joseph Bloom for his

https://github.com/jbloomAus/mats_sae_trainingrepository which helped us set up our SAE class. We thank Adam Jermyn and Joseph Bloom for useful discussions while working through this project. EA Thanks Neel Nanda for a really useful conversation that led him to this idea at EAG in February.Funding: EA and JH are KITP Postdoctoral Fellows, so this research was supported in part by NSF grants PHY-2309135 and PHY-1748958 to the Kavli Institute for Theoretical Physics (KITP) and by the Gordon and Betty Moore Foundation through Grant No. GBMF7392.


^{^}But note that here I’m defining sparsity as occurrence frequency. Probably there’s a truer notion of sparsity and in that notion these data are probably sparse.

^{^}Though note that this is slightly different from Anthropic's suggestion in the February update, where they chose to normalize their vectors so that each data point in the activations has a variance of 1. I think if you use the *mean* squared error compared to the squared error, this becomes equivalent to what I did here, but I'm not 100% sure.