Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

Update (20th December) - these exercises have been edited to fix some previous issues, and new material has been added. In particular, the "superposition in a privileged basis" section has been rewritten, the neuron resampling methods in the SAEs section are now significantly improved, and there is also additional material on superposition & deep double descent.

This is a linkpost for some exercises in sparse autoencoders, which I've recently finished working on as part of the upcoming ARENA 3.0 iteration. Having spoken to Neel Nanda and others in interpretability-related MATS streams, it seemed useful to make these exercises accessible outside the context of the rest of the ARENA curriculum.

Links to Colabs (updated): Exercises, Solutions.

If you don't like working in Colabs, then you can clone the repo, download the exercises & solutions Colabs as notebooks, and run them in the same directory.

Below is a brief summary of all 7 sets of exercises (you can scroll to the end if you're mainly interested in sparse autoencoders!).

 

Guide to Exercises

The sets of exercises are roughly split into three larger sets. Exercises 1-3 are the "core TMS" exercises, which present some of the key ideas behind superposition. Exercises 4-5 are extensions of the TMS work, which are interesting but less essential. Exercises 6-7 contain the material on SAEs.

The exercises are labelled with their prerequisites. (n*) means exercise set n is essential for these exercises, and (n) means exercise set n is heavily recommended (but not essential).

Abbreviations: TMS = Toy Models of Superposition, SAE = Sparse Autoencoders.

  1. TMS: Superposition in a Nonprivileged Basis
  2. TMS: Correlated / Anticorrelated Features (1*)
  3. TMS: Superposition in a Privileged Basis (1*)
  4. Feature Geometry (1*)
  5. Superposition & Deep Double Descent (1*, 2)
  6. SAEs in Toy Models (1*, 2, 3)
  7. SAEs in Language Models (1*, 2, 3, 6*)

Below is a longer explanation of each of the seven exercise sets.

 

TMS: Superposition in a Nonprivileged Basis

This section introduces Anthropic's toy model for superposition, where a simple neural network is trained to map a set of features into a lower-dimensional space then reconstruct it. You'll learn about how superposition works & see how it can be visualised, as well as how properties like feature sparsity affect the learned solutions.
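As a rough sketch of the setup described above: in the Toy Models of Superposition paper, the model compresses features with h = Wx and reconstructs them with x' = ReLU(W^T h + b), trained on an importance-weighted squared error. The NumPy code below only shows the forward pass and loss with untrained weights. The sizes, the feature probability, the importance schedule and the function names are illustrative assumptions, not taken from the exercises.

```python
import numpy as np

rng = np.random.default_rng(0)

n_features, n_hidden = 5, 2      # illustrative sizes
feature_prob = 0.1               # chance a feature is present (sparsity = 1 - feature_prob)

def sample_features(batch_size):
    """Each feature is present with probability feature_prob; present values are uniform in [0, 1)."""
    present = rng.random((batch_size, n_features)) < feature_prob
    values = rng.random((batch_size, n_features))
    return np.where(present, values, 0.0)

def forward(x, W, b):
    """Compress with h = W x, then reconstruct with x' = ReLU(W^T h + b) using tied weights."""
    h = x @ W.T                       # (batch, n_hidden)
    return np.maximum(h @ W + b, 0)   # (batch, n_features)

def weighted_mse(x, x_hat, importance):
    """Importance-weighted squared error, averaged over batch and features."""
    return np.mean(importance * (x - x_hat) ** 2)

W = rng.normal(scale=0.5, size=(n_hidden, n_features))
b = np.zeros(n_features)
importance = 0.9 ** np.arange(n_features)   # assumed decay schedule, purely for illustration

x = sample_features(batch_size=64)
x_hat = forward(x, W, b)
loss = weighted_mse(x, x_hat, importance)
```

Lowering feature_prob makes the inputs sparser, which is the knob the exercises vary to see when superposition appears.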

TMS: Correlated / Anticorrelated Features

In this section, you'll keep exploring the idea of superposition by seeing how the model's learned solutions change when features are correlated or anticorrelated. Most features learned by real models are anticorrelated simply as a consequence of the fact that any given model input (e.g. images or passages of text) will only contain a limited number of features.
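One simple way to build such data, sketched in NumPy under the assumption that features come in pairs: for a correlated pair, a single coin flip decides whether both features are present; for an anticorrelated pair, at most one of the two is present. The function names and probabilities here are hypothetical and may differ from what the exercises use.

```python
import numpy as np

rng = np.random.default_rng(0)

def correlated_pair(batch_size, p):
    """Both features are present together (with probability p) or absent together."""
    both_present = rng.random((batch_size, 1)) < p
    values = rng.random((batch_size, 2))
    return np.where(both_present, values, 0.0)

def anticorrelated_pair(batch_size, p):
    """At most one of the two features is present; each has marginal probability p (needs 2p <= 1)."""
    one_present = rng.random((batch_size, 1)) < 2 * p
    first_chosen = rng.random((batch_size, 1)) < 0.5
    mask = np.concatenate(
        [one_present & first_chosen, one_present & ~first_chosen], axis=1
    )
    values = rng.random((batch_size, 2))
    return np.where(mask, values, 0.0)

corr = correlated_pair(1000, p=0.3)
anti = anticorrelated_pair(1000, p=0.3)
```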

TMS: Superposition in a Privileged Basis

Next, the toy model setup is changed so that it has a privileged basis. If the previous sections were models of superposition in the residual stream, this section models superposition in the MLP layer. We'll also explore how computation can be performed in superposition.
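In the Toy Models of Superposition paper, the privileged-basis variant adds a ReLU on the hidden layer, giving h = ReLU(Wx) and x' = ReLU(W^T h + b). Because the nonlinearity acts on each hidden neuron separately, individual neurons become meaningful directions. A minimal NumPy sketch of that forward pass, with random untrained weights and arbitrary sizes:

```python
import numpy as np

rng = np.random.default_rng(0)
n_features, n_hidden = 5, 2   # arbitrary sizes for illustration

def forward_privileged(x, W, b):
    """Toy model with an elementwise ReLU on the hidden layer (neuron basis is privileged)."""
    h = np.maximum(x @ W.T, 0)          # (batch, n_hidden), non-negative neuron activations
    x_hat = np.maximum(h @ W + b, 0)    # (batch, n_features)
    return h, x_hat

x = rng.random((8, n_features))
W = rng.normal(size=(n_hidden, n_features))
b = np.zeros(n_features)
h, x_hat = forward_privileged(x, W, b)
```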

Feature Geometry

Here, we take a deeper dive into the ways features can organize into different geometric structures when we increase the hidden dimension past the point where we can easily visualise it.
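One quantity the Toy Models of Superposition paper uses to study these structures is the dimensionality of a feature, D_i = ||W_i||^2 / sum_j (W_hat_i . W_j)^2, where W_hat_i is the unit vector along W_i. The sketch below computes it in NumPy for two hand-built geometries; the function name is my own, and it assumes every feature direction has non-zero norm.

```python
import numpy as np

def feature_dimensionality(W):
    """W has shape (n_hidden, n_features); column i is the direction of feature i."""
    norms = np.linalg.norm(W, axis=0)     # (n_features,), assumed non-zero
    W_unit = W / norms                    # unit-norm columns
    overlaps = (W_unit.T @ W) ** 2        # entry [i, j] is (W_hat_i . W_j)^2
    return norms ** 2 / overlaps.sum(axis=1)

# Antipodal pair: two features sharing a single hidden dimension
antipodal = np.array([[1.0, -1.0]])

# Pentagon: five features spread evenly around a 2D hidden space
angles = 2 * np.pi * np.arange(5) / 5
pentagon = np.stack([np.cos(angles), np.sin(angles)])

print(feature_dimensionality(antipodal))   # each entry is 0.5
print(feature_dimensionality(pentagon))    # each entry is approximately 0.4
```

The values 1/2 and 2/5 are the fractions of a dimension each feature gets in those two arrangements.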

Superposition & Deep Double Descent

This section is based on a different Anthropic paper, where they explore the idea that double descent happens as a result of models transitioning from a memorizing solution (representing datapoints in superposition) to a generalizing solution (representing features in superposition). Unlike the other 6 sections here, it's very open-ended, containing just one exercise, which is to replicate the paper - but this does come with a lot of guidance.

SAEs in Toy Models

We take the toy models from Anthropic's Toy Models of Superposition paper (which there are also exercises for), and train sparse autoencoders on the representations learned by these toy models. These exercises culminate in using neuron resampling to successfully recover all the learned features from the toy model of bottleneck superposition:

Animation of the training process for SAEs in Anthropic's toy model of superposition. The red neurons represent resamplings. All instances eventually converge to accurately representing all five of the original model's features.
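For readers who haven't seen one before, here is a minimal NumPy sketch of a sparse autoencoder forward pass and loss (reconstruction error plus an L1 penalty on the latent activations), along with a check for dead latents, which are the ones a resampling step would reinitialise. The sizes, the L1 coefficient and all names are assumptions for illustration. The resampling itself is not shown, and the exercises' implementation may differ.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_sae = 2, 5      # illustrative: 2D hidden activations, 5 SAE latents
l1_coeff = 0.1          # assumed sparsity penalty weight

W_enc = rng.normal(scale=0.1, size=(d_in, d_sae))
W_dec = rng.normal(scale=0.1, size=(d_sae, d_in))
b_enc = np.zeros(d_sae)
b_dec = np.zeros(d_in)

def sae_forward(h):
    """Encode to non-negative latent activations, then decode back to the input space."""
    acts = np.maximum((h - b_dec) @ W_enc + b_enc, 0)   # (batch, d_sae)
    h_hat = acts @ W_dec + b_dec                        # (batch, d_in)
    return acts, h_hat

def sae_loss(h, acts, h_hat):
    """Squared reconstruction error plus L1 penalty on activations, averaged over the batch."""
    recon = np.mean(np.sum((h - h_hat) ** 2, axis=-1))
    sparsity = np.mean(np.sum(np.abs(acts), axis=-1))
    return recon + l1_coeff * sparsity

def dead_latents(acts_history):
    """Indices of latents that never fired over a window of shape (n_samples, d_sae)."""
    return np.flatnonzero((acts_history > 0).sum(axis=0) == 0)

h = rng.normal(size=(256, d_in))   # stand-in for the toy model's hidden activations
acts, h_hat = sae_forward(h)
loss = sae_loss(h, acts, h_hat)
dead = dead_latents(acts)
```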

SAEs in Language Models

Finally, there are exercises on interpreting an SAE trained on a transformer, where you can discover some cool learned features (e.g. a neuron exhibiting skip-trigram-like behaviour, which activates on left brackets following Django-related syntax, and predicts the completion (' -> django).

You can either read through the Solutions colab (which has all output displayed & explained), or go through the Exercises colab and fill in the functions according to the specifications you are given, looking at the Solutions when you're stuck. Both colabs come with test functions you can run to verify your solution works.


Please reach out to me if you have any questions or suggestions about these exercises (either by email at cal.s.mcdougall@gmail.com, or a LessWrong private message / comment on this post). Happy coding!

9 comments

The neuron resampling part of the animation is so cool!

Thanks (-:

Nice code!

In your SAE tutorials, the importance is just torch.ones or similar. I'm curious how importance might work for real models?

Is there any work where people derived it from backprop or anything? I can't find any examples.

Good question! In the first batch of exercises (replicating Toy Models of Superposition), we play around with different importances. There are some interesting findings here (e.g. when you decrease sparsity to the point where you no longer represent all features, it's usually the lower-importance features which collapse first). I chose not to have the SAE exercises use varying importance, although it would be interesting to play around with this and see what you get!

As for what importance represents, it's basically a proxy for "how much a certain feature reduces loss, when it actually is present." This can be independent from feature probability. Anthropic included it in their toy models paper in order to make those models truer to reality, in the hope that the setup could tell us more interesting lessons about actual models. From the TMS paper:

Not all features are equally useful to a given task. Some can reduce the loss more than others. For an ImageNet model, where classifying different species of dogs is a central task, a floppy ear detector might be one of the most important features it can have. In contrast, another feature might only very slightly improve performance.

If we're talking features in language models, then importance would be "average amount that this feature reduces cross entropy loss". I open-sourced an SAE visualiser which you can find here. You can navigate through it and look at the effect of features on loss. It doesn't actually show the "overall importance" of a feature, but you should be able to get an idea of the kinds of situations where a feature is super loss-reducing and when it isn't. Example of a highly loss-reducing feature: feature #8, which fires on Django syntax and strongly predicts the "django" token. This seems highly loss-reducing because (although sparse) it's very often correct when it fires with high magnitude. On the other hand, feature #7 seems less loss-reducing, because a lot of the time it's pushing for something incorrect (maybe there exist other features which balance it out).
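To make "amount that this feature reduces cross entropy loss" concrete, here's a small NumPy sketch: compare the cross-entropy with the feature's contribution to the logits against the cross-entropy with that contribution ablated. This is just an illustrative definition on made-up logits, not how the visualiser computes anything, and all names are hypothetical.

```python
import numpy as np

def cross_entropy(logits, targets):
    """Mean negative log-probability of the target token; logits has shape (n_tokens, vocab)."""
    shifted = logits - logits.max(axis=-1, keepdims=True)   # for numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    return -log_probs[np.arange(len(targets)), targets].mean()

def loss_reduction(logits_with, logits_ablated, targets):
    """Loss with the feature removed minus loss with it present (positive means the feature helps)."""
    return cross_entropy(logits_ablated, targets) - cross_entropy(logits_with, targets)

# Made-up example: 3-token vocabulary, and the feature boosts token 2 at every position
logits_ablated = np.zeros((4, 3))
feature_effect = np.array([0.0, 0.0, 2.0])
logits_with = logits_ablated + feature_effect

targets_correct = np.array([2, 2, 2, 2])   # the feature pushes for the right token
targets_wrong = np.array([0, 0, 0, 0])     # the feature pushes for the wrong token

helpful = loss_reduction(logits_with, logits_ablated, targets_correct)   # positive
harmful = loss_reduction(logits_with, logits_ablated, targets_wrong)     # negative
```

Averaging this quantity over many contexts where the feature fires would give something like the "overall importance" described above.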

Thanks, that makes a lot of sense, I had skimmed the Anthropic paper and saw how it was used, but not where it comes from.

If it's the importance to the loss, then theoretically you could derive one using backprop I guess? E.g. the accumulated gradient to your activations, over a few batches.

Yep, definitely! If you're using MSE loss then it's pretty straightforward to use backprop to see how importance relates to the loss function. Also, if you're interested, I think Redwood's paper on capacity (which is the same as what Anthropic calls dimensionality) looks at the derivative of loss wrt the capacity assigned to a given feature.
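A quick sketch of the MSE case, assuming the importance-weighted loss sum_i I_i (x_i - x'_i)^2 from the toy models: the gradient with respect to output i is -2 I_i (x_i - x'_i), so the gradient magnitude accumulated over batches scales linearly with I_i. The code uses random stand-ins for the model outputs rather than a trained model, and it only recovers the ordering here because the errors are similarly distributed across features.

```python
import numpy as np

rng = np.random.default_rng(0)
importance = np.array([1.0, 0.5, 0.25])   # known importances for the toy check

def grad_wrt_output(x, x_hat, importance):
    """Gradient of sum_i I_i * (x_i - x_hat_i)^2 with respect to x_hat."""
    return -2.0 * importance * (x - x_hat)

accumulated = np.zeros_like(importance)
n_batches = 10
for _ in range(n_batches):
    x = rng.random((100, 3))
    x_hat = rng.random((100, 3))   # stand-in for model outputs, not a trained model
    accumulated += np.abs(grad_wrt_output(x, x_hat, importance)).mean(axis=0)
accumulated /= n_batches

ranking = np.argsort(-accumulated)   # features ordered from largest to smallest gradient
```

In a real model there is no ground-truth importance to compare against, so this would be a proxy at best.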

Huh, I actually tried this. Training IA3, which multiplies activations by a float. Then using that float as the importance of that activation. It seems like a natural way to use backprop to learn an importance matrix, but it gave small (1-2%) increases in accuracy. Strange.

I also tried using a VAE, and introducing sparsity by tokenizing the latent space. And this seems to work. At least probes can overfit to complex concepts using the learned tokens.

Oh that's very interesting, thank you.
