A List of 45+ Mech Interp Project Ideas from Apollo Research’s Interpretability Team


> Therefore, many project ideas in that list aren’t an up-to-date reflection of what some researchers consider the frontiers of mech interp.

Can confirm, that list is SO out of date and does not represent the current frontiers. Zero offence taken. Thanks for publishing this list!

Might be worth putting a short notice at the top of each post saying that, with a link to this post or whatever other resource you'd now recommend? (inspired by the 'Attention - this is a historical document' on e.g. this PEP)

Fair point. I've been procrastinating on putting out an updated version (and don't have anything else I back enough to recommend in its place; I haven't read this post closely enough yet), but adding that note to the top seems reasonable.

This is a great post! Thank you for writing this up :)

On training SAEs on ConvNets: I recently trained SAEs for all layers of InceptionV1. I've written up a __paper__ on some of the findings in early vision, with a specific focus on curve detectors (__twitter thread__ on the paper and __another__ on some branch-specialisation-related findings). The features look really good across the entire model, including interpretable, monosemantic features in the final layer, which to the best of my knowledge hasn't been done before. This is really exciting! I'm hoping to put out a blog post focusing on the final layer in the next couple of weeks (including circuit analysis between the last few layers).

Being able to say we fully understand a real neural network would be a huge step forward for the field, and with SAEs it seems like we are now well-positioned to actually achieve this goal.

Some takes on some of these research questions:

> Looking for opposing feature directions in SAEs

I checked a top-k SAE with 256k features and k=256 trained on GPT-4 and found only 286 features that had any other feature with cosine similarity < -0.9, and 1314 with cosine sim < -0.7.
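For anyone wanting to run the same check on their own SAE, the computation is roughly the following (a minimal numpy sketch; the random matrix below is just a stand-in for a real trained decoder, and all sizes are made up):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for a trained SAE decoder: n_features x d_model.
# (The comment above used a 256k-feature top-k SAE on GPT-4; here a tiny
# random matrix just illustrates the computation.)
W_dec = rng.standard_normal((1000, 64))

# Normalise decoder directions so dot products are cosine similarities.
W = W_dec / np.linalg.norm(W_dec, axis=1, keepdims=True)

# For each feature, the cosine similarity of its most-opposing other feature.
cos = W @ W.T
np.fill_diagonal(cos, 0.0)  # ignore self-similarity
min_cos = cos.min(axis=1)

n_strongly_opposed = int((min_cos < -0.9).sum())
n_opposed = int((min_cos < -0.7).sum())
print(n_strongly_opposed, n_opposed)
```

For a real SAE with hundreds of thousands of features, the `W @ W.T` product would need to be computed in chunks, but the logic is the same.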

> SAE/Transcoder activation shuffling

I'm confident that when learning rate and batch size are tuned properly, not shuffling eventually converges to the same thing as shuffling. The right way to frame this imo is the efficiency loss from not shuffling, which from preliminary experiments+intuition I'd guess is probably substantial.
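For context, the shuffling under discussion is the standard buffer-shuffle setup, which might be sketched like this (the synthetic activation source and all sizes are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

def activation_batches(n_contexts, context_len=64, d_model=16):
    """Stand-in for an LLM activation source: yields one context's
    activations at a time (consecutive, highly correlated tokens)."""
    for _ in range(n_contexts):
        base = rng.standard_normal((1, d_model))
        # Tokens within a context share information -> correlated rows.
        yield base + 0.1 * rng.standard_normal((context_len, d_model))

def shuffled_batches(source, buffer_size=4096, batch_size=256):
    """Accumulate activations from many contexts into a buffer, shuffle,
    and emit approximately-iid training batches for the SAE."""
    buffer = []
    for acts in source:
        buffer.append(acts)
        if sum(len(a) for a in buffer) >= buffer_size:
            pool = np.concatenate(buffer)
            rng.shuffle(pool)  # mix tokens across contexts
            for i in range(0, len(pool) - batch_size + 1, batch_size):
                yield pool[i : i + batch_size]
            buffer = []

batches = list(shuffled_batches(activation_batches(128)))
```

Not shuffling corresponds to skipping the buffer and training directly on consecutive-token blocks, which is what the efficiency-loss question is about.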

> How much does initializing the encoder to be the transpose of the decoder (as done here and here) help for SAEs and transcoders?

It helps tremendously for SAEs by very substantially reducing dead latents; see appendix C.1 in our paper.

Thanks Leo, very helpful!

> The right way to frame this imo is the efficiency loss from not shuffling, which from preliminary experiments+intuition I'd guess is probably substantial.

The SAEs in your paper were trained with batch size of 131,072 tokens according to appendix A.4. Section 2.1 also says you use a context length of 64 tokens. I'd be very surprised if using 131,072/64 blocks of consecutive tokens was much less efficient than 131,072 tokens randomly sampled from a very large dataset. I also wouldn't be surprised if 131,072/2048 blocks of consecutive tokens (i.e. a full context length) had similar efficiency.

Were your preliminary experiments and intuition based on batch sizes this large or were you looking at smaller models?

I missed that appendix C.1 plot showing the dead latent drop with tied init. Nice!

I'm 80% that with optimal hyperparameters for both (you need to retune hparams when you change batch size), 131072/64 is substantially less efficient than 131072.

We find that at a batch size of 131072, when hyperparameters are tuned, then the training curves as a function of number of tokens are roughly the same as with a batch size of 4096 (see appendix A.4). So it is not the case that 131072 is in a degenerate large batch regime where efficiency is substantially degraded by batch size.

When your batch is not fully iid, this is like effectively having a smaller batch size of iid data (in the extreme, if your batch contains 64 copies of the same data, this is obviously the same as a 64x smaller batch size), but you still pay the compute cost of putting all 131072 tokens through the model.

Thanks for the prediction. Perhaps I'm underestimating the amount of shared information between in-context tokens in real models. Thinking more about it, as models grow, I expect the ratio of contextual information shared across tokens in the same context to token-specific information (like part of speech) to increase. Obviously a bigram-only model doesn't care at all about the previous context. You could probably get a decent measure of this just by comparing cosine similarities of activations within a context to activations from other contexts. If true, this would mean that as models scale up, you'd get a bigger efficiency hit if you didn't shuffle when you could have (assuming fixed batch size).
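The suggested measurement (within-context vs. cross-context cosine similarity) could be sketched as follows, with synthetic activations standing in for a real model's:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic activations: n_contexts x context_len x d_model, where each
# context has a shared component (contextual information) plus per-token noise.
n_ctx, ctx_len, d = 50, 32, 64
shared = rng.standard_normal((n_ctx, 1, d))
acts = shared + rng.standard_normal((n_ctx, ctx_len, d))

unit = acts / np.linalg.norm(acts, axis=-1, keepdims=True)

# Mean cosine similarity between distinct tokens in the SAME context.
within = np.einsum("cid,cjd->cij", unit, unit)
mask = ~np.eye(ctx_len, dtype=bool)
within_mean = within[:, mask].mean()

# Mean cosine similarity between tokens in DIFFERENT contexts
# (token i of context c vs. token i of the next context).
across_mean = np.einsum("cid,cid->ci", unit, np.roll(unit, 1, axis=0)).mean()

print(within_mean, across_mean)
```

In this toy setup the within-context mean comes out clearly higher; the hypothesis above is that the gap grows with model scale.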

> Some MLPs or attention layers may implement a simple linear transformation in addition to actual computation.

@Lucius Bushnaq , why would MLPs compute linear transformations?

Because two linear transformations can be combined into one linear transformation, why wouldn't downstream MLPs/Attns that rely on this linearly transformed vector just learn the combined function?

> Cross layer superposition

Had a bit of time to think about this. Ultimately, because superposition as we know it is a property of the latent space rather than of the neurons in the layer, it's not clear to me that this is the question to be asking. What do you imagine an experimental result would look like?

Toy example of what I would consider pretty clear-cut cross-layer superposition:

We have a residual MLP network. The network implements a single UAND gate (universal AND, calculating the pairwise ANDs of sparse boolean input features using only neurons), as described in Section 3 here.

However, instead of implementing this with a single MLP, the network does this using all the MLPs of all the layers in combination. Simple construction that achieves this:

- Cut the residual stream into two subspaces, reserving one subspace for the input features and one subspace for the output features.
- Take the construction from the paper, and assign each neuron in it to a random MLP layer in the residual network.
- Since the input and output spaces are orthogonal, there's no possibility of one MLP's outputs interfering with another MLP's inputs. So this network will implement UAND, as if all the neurons lived in a single large MLP layer.

Now we've made a network that computes boolean circuits in superposition, without the boolean gates living in any particular MLP. To read out the value of one of the circuit outputs before it shows up in the residual stream, you'll need to look at a direction that's a linear combination of neurons in all of the MLPs. And if you use an SAE to look at a single residual stream position in this network before the very final MLP layer, it'll probably show you a bunch of half-computed nonsense.
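The orthogonality argument in this construction is easy to verify numerically. A toy sketch, with an arbitrary one-hidden-layer ReLU network standing in for the UAND construction (all shapes made up):

```python
import numpy as np

rng = np.random.default_rng(0)

d_in, d_hidden, d_out, n_layers = 8, 32, 8, 4
d_resid = d_in + d_out  # dims [0, d_in) for inputs, [d_in, d_resid) for outputs

# Some one-hidden-layer ReLU network; any W_in, W_out illustrate the split.
W_in = rng.standard_normal((d_hidden, d_in))
W_out = rng.standard_normal((d_out, d_hidden))

def single_mlp(x):
    return W_out @ np.maximum(W_in @ x, 0.0)

# Assign each hidden neuron to a random MLP layer of a residual network.
assignment = rng.integers(0, n_layers, size=d_hidden)

def residual_network(x):
    resid = np.zeros(d_resid)
    resid[:d_in] = x  # write inputs to the input subspace
    for layer in range(n_layers):
        idx = assignment == layer  # this layer's share of the neurons
        h = np.maximum(W_in[idx] @ resid[:d_in], 0.0)
        resid[d_in:] += W_out[:, idx] @ h  # write only to the output subspace
    return resid[d_in:]

x = rng.standard_normal(d_in)
assert np.allclose(residual_network(x), single_mlp(x))
```

Because the input and output subspaces are orthogonal, the partial sums from each layer's slice of neurons add up to exactly the single-MLP result, even though no individual MLP computes anything interpretable on its own.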

In a real network, the most convincing evidence to me would be a circuit involving sparse coded variables or operations that cannot be localized to any single MLP.


> [Lucius] Identify better SAE sparsity penalties by reasoning about the distribution of feature activations
>
> - In sparse coding, one can derive what prior over encoded variables a particular sparsity penalty corresponds to. E.g. an L1 penalty assumes a Laplacian prior over feature activations, while a log(1+a^2) penalty would assume a Cauchy prior. Can we figure out what distribution of feature activations over the data we’d expect, and use this to derive a better sparsity penalty that improves SAE quality?
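For reference, the penalty-prior correspondence here is the usual MAP one: a sparsity penalty $S(a)$ corresponds to a prior $p(a) \propto \exp(-\lambda S(a))$, so:

```latex
S(a) = |a|           \;\Rightarrow\; p(a) \propto e^{-\lambda |a|}
       \quad \text{(Laplace prior)}
S(a) = \log(1 + a^2) \;\Rightarrow\; p(a) \propto (1 + a^2)^{-\lambda}
       \quad \text{(Cauchy prior for } \lambda = 1\text{)}
```

Working backwards from an assumed activation distribution to a penalty is then just taking the negative log of the density.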

This is very interesting! What prior does log(1+|a|) correspond to? And what about using instead of ? Does this only hold if we expect feature activations to be independent (rather than, say, mutually exclusive)?

A prior that doesn't assume independence should give you a sparsity penalty that isn't a sum of independent penalties for each activation.

> [Nix] Toy model of feature splitting
>
> - There are at least two explanations for feature splitting I find plausible:
>   - Activations exist on higher-dimensional manifolds in feature space; feature splitting is a symptom of one higher-dimensional, mostly-continuous feature being chunked into discrete features at different resolutions.
>   - There is a finite number of highly related discrete features that activate on similar (but not identical) inputs and cause similar (but not identical) output actions. These can be summarized as a single feature with reasonable explained variance, but are better summarized as a collection of "split" features.

These do not sound like different explanations to me. In particular, the distinction between "mostly-continuous but approximated as discrete" and "discrete but very similar" seems ill-formed. All features are in fact discrete (because floating point numbers are discrete) and approximately continuous (because we posit that replacing floats with reals won't change the behavior of the network meaningfully).

As far as toy models go, I'm pretty confident that the max-of-K setup from Compact Proofs of Model Performance via Mechanistic Interpretability will be a decent toy model. If you train SAEs post-unembed (probably also pre-unembed) with width d_vocab, you should find one feature for each sequence maximum (roughly). If you train with SAE width , I expect each feature to split into roughly features corresponding to the choice of query token, largest non-max token, and the number of copies of the maximum token. (How the SAE training data is distributed will change what exact features (principal directions of variation) are important to learn.). I'm quite interested in chatting with anyone working on / interested in this, and I expect my MATS scholar will get to testing this within the next month or two.

Edit: I expect this toy model will also permit exploring:

> [Lee] Is there structure in feature splitting?
>
> - Suppose we have a trained SAE with N features. If we apply e.g. NMF or SAEs to these directions, are there directions that explain the structure of the splitting? As in, suppose we have a feature for math and a feature for physics, and suppose these split into (among other things):
>   - 'topology in a math context'
>   - 'topology in a physics context'
>   - 'high dimensions in a math context'
>   - 'high dimensions in a physics context'
> - Is the topology-ifying direction the same for both features? Is the high-dimension-ifying direction the same for both features? And if so, why did/didn't the original SAEs find these directions?

I predict that whether or not the SAE finds the splitting directions depends on details about how much non-sparsity is penalized and how wide the SAE is. Given enough capacity, the SAE benefits (sparsity-wise) from replacing the (topology, math, physics) features with (topology-in-math, topology-in-physics), because split features activate more sparsely. Conversely, if the sparsity penalty is strong enough and there is not enough capacity to split, the loss recovered from having a topology feature at all (on top of the math/physics feature) may not outweigh the cost in sparsity.
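The sparsity accounting in this argument can be made concrete with a tiny counting example (hypothetical subjects and topics):

```python
from itertools import product

subjects = ["math", "physics", "chemistry"]
topics = ["topology", "high-dimensions", "symmetry"]
inputs = list(product(subjects, topics))  # every (subject, topic) text

# Compositional dictionary: one feature per subject, one per topic.
comp_features = subjects + topics
comp_l0 = 2  # each input activates its subject feature AND its topic feature

# Split dictionary: one feature per (subject, topic) combination.
split_features = [f"{t}-in-{s}" for s, t in inputs]
split_l0 = 1  # each input activates exactly one combined feature

print(len(comp_features), comp_l0)    # 6 features, L0 = 2
print(len(split_features), split_l0)  # 9 features, L0 = 1
```

Splitting halves the L0 but multiplies the dictionary width, so whether the SAE splits depends on exactly the capacity-vs-sparsity-penalty trade-off described above.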

Why we made this list:^{[1]} In order to decide what we’d work on next, we generated a lot of different potential projects. Unfortunately, we are computationally bounded agents, so we can't work on every project idea that we were excited about! Existing lists (such as 200 Concrete Open Problems in Mechanistic Interpretability) have been very useful for people breaking into the field. But for all its merits, that list is now over a year and a half old, so many project ideas in it aren’t an up-to-date reflection of what some researchers consider the frontiers of mech interp. We therefore thought it would be helpful to share our list of project ideas!

Comments and caveats: We hope some people find this list helpful!

We would love to see people working on these! If any sound interesting to you and you'd like to chat about it, don't hesitate to reach out.

## Foundational work on sparse dictionary learning for interpretability

## Transcoder-related project ideas

- [Nix] Training and releasing high quality transcoders ([2406.11944] Transcoders Find Interpretable LLM Feature Circuits)
- [Nix] Good tooling for using transcoders (Dunefsky et al)
- [Nix] Further circuit analysis using transcoders (nix@apolloresearch.ai)
- [Nix, Lee] Cross layer superposition
- [Lucius] Improving transcoder architectures

## Other

- [Nix] Idea for improved logit-lens-style interpretation of SAE features (e.g. for Joseph Bloom’s GPT2 SAEs, cf. Understanding SAE Features with the Logit Lens), but in the pre-unembed basis instead of the token basis (cf. Interpreting the Second-Order Effects of Neurons in CLIP).
- [Nix] Toy model of feature splitting
- [Dan] Looking for opposing feature directions in SAEs
- [Dan] SAE/Transcoder activation shuffling. This is relevant for e2eSAEs, which do not shuffle activations during training as they need to pass the entire context-length activations through subsequent layers. Can you get away with just having a larger effective batch size and higher learning rate? Note that I think this is equally (if not more) important to analyze for transcoders.
- [Dan] SAE/Transcoder initialization. How much does initializing the encoder to be the transpose of the decoder (as done here and here) help for SAEs and transcoders?
- [Dan] Make public benchmarks for SAEs and transcoders, e.g. on Neuronpedia, which I deem to be a great place to host such a service.
- [Lee] Mixture of Expert SAEs. See here. This is great! The more efficient we can make SDL the better. But this only speeds up inference of the decoder; I think MoEs may be a way to speed up inference of the encoder. Can trained SAEs be MoE-ified post hoc? If they can be, then it's evidence in the direction that MoEs might be reasonable to use during training from scratch.
- [Lee] Identify canonical features that emerge in language models
- [Lee] Studying generalization of SAEs and transcoders
- [Lee] How does layer norm affect SAE features before and after?
- [Lee] Connecting SAE/transcoder features to polytopes. One limitation of the polytope lens was that it used clustering methods in order to group polytopes together. This means the components of the explanations they provided were not ‘composable’. We want to be able to break down polytopes into components that are composable.
- [Stefan] Verify (SAE) features based on the model weights; show that features are a model property and not (only) a dataset property. Some tests are suggested here (e.g. “given two sets of directions that reconstruct an activation, can you tell which ones are the features vs a made-up set of directions?”), with one possible methodology described here (in-progress LASR project).
- [Stefan] Relationship between feature splitting, feature completeness, and atomic vs. composite features; see here.
- [Lee] Is there structure in feature splitting?
- [Lucius] Understanding the geometry of SAE features; see this paper.
- [Lucius] Identify better SAE sparsity penalties by reasoning about the distribution of feature activations. In sparse coding, one can derive what prior over encoded variables a particular sparsity penalty corresponds to. E.g. an L1 penalty assumes a Laplacian prior over feature activations, while a log(1+a^2) penalty would assume a Cauchy prior. Can we figure out what distribution of feature activations over the data we’d expect, and use this to derive a better sparsity penalty that improves SAE quality?
- [Lucius] Preprocessing activations with the interaction basis prior to SAE training. One way to address this is training SAEs end-to-end. But another way to solve it might be to preprocess the network activations before applying the SAEs to them. The activations could be rotated and rescaled such that the variance of the hidden activations along any axis is proportional to its importance for computing the final network outputs. The interaction basis is a linear coordinate transformation for the hidden activations of neural networks that attempts to achieve just that. So transforming activations into the interaction basis before applying SAEs to them might yield a Pareto improvement in SAE quality.
- [Lucius] Using attribution sparsity penalties to improve end-to-end SAEs. For end-to-end dictionary learning, a sparsity penalty based on attributions might be more appropriate than a sparsity penalty based on dictionary activations: in end-to-end SAEs, the reconstruction loss cares about the final network output, but the sparsity term still cares about the activations in the hidden layer, like a conventional SAE. This is perhaps something of a mismatch. For example, if a feature is often present in the residual stream, but comparatively rarely used in the computation, the end-to-end SAE will be disinclined to represent it, because it only decreases the reconstruction loss a little, but increases the sparsity loss by a lot. More generally, how large a feature activation is just won't be that great of a correlate for how important it is for reconstructing the output. So if we care about how many features we need per data point to get good output reconstruction, SAEs trained with an attribution sparsity penalty might beat SAEs trained with an activation sparsity penalty. Anthropic's proposed attribution sparsity penalty uses attributions of the LLM loss. I suspect this is inappropriate, since the gradient of the LLM loss is zero at optima, meaning feature attributions will be scaled down the better the LLM does on a specific input. Something like an MSE average over attributions to all of the network’s output logits might be more appropriate. This is expensive, but an approximation of the average using stochastic sources might suffice. See e.g. Appendix C here for an introduction to stochastic source methods. In our experiments on the Toy Model of Superposition, a single stochastic source proved to be sufficient, making this potentially no more computationally intensive than the Anthropic proposal.

## Applied interpretability

- [Lee] Apply SAEs/transcoders to a small conv net (e.g. AlexNet) and study it in depth.
- [Lee] Figure out how to build interpretability interfaces for video/other modalities. One valuable aspect of Ellena Reid’s project was that it developed a way to ‘visualize’ what neurons were activating for in audio samples. Can we improve on this method? Can we do the same for video models? What about other modalities, such as, e.g., smell, or, I don’t know, protein structure? Is there a modality-general approach for this?
- [Lee] Apply SAEs and transcoders to WhisperV2 (i.e. continue Ellena Reid’s work).
- [Lee] Identify whether or not, in a very small backdoored model, we can detect the backdoor using e.g. e2eSAEs.
- [Lee] Interpreting Mamba/SSMs using sparse dictionary learning.
- [Lee] Characterizing the geometry of low-level vision SAE features.
- [Lee] Can we understand the first sequence index of a small transformer?
- [Lucius] Attempt to understand a toy LM completely.
- [Stefan] Understand a small model (e.g. TinyStories-2L or a small TinyModel variant) from start to end, from first to last layer.

## Intrinsic interpretability

- [Lee] Can we train a small bilinear transformer on either a toy or real dataset, perform sparse dictionary learning on its activations, and understand the role of each sparse dictionary feature in terms of the closed form solution (Sharkey 2023)? This may help in identifying fundamental structures within transformers in a similar way that induction heads were discovered.
- [Lee] Interpretable inference: Can we convert already-trained models into forms that are much easier to completely interpret at little performance cost?
- [Lee] Develop A Mathematical Framework for Linear Attention Transformer Circuits.

## Understanding features (not SDL)

- [Lucius] Recovering ‘features’ through direct optimisation for auto-interpretability scores.

## Theoretical foundations for interpretability

Singular-learning-theory-related:

- [Lucius] Understanding SLT at finite data/precision; see here. Is this a good approximation?
- [Lucius] Bounding the local learning coefficient (LLC) in real networks, exploiting degeneracy in the loss landscape to decompose LLMs into more interpretable parts.
- [Lucius] Understanding the relationship between the local learning coefficient (LLC) and the behavioral LLC defined here. This is a more restrictive definition since different network outputs can yield the same loss. The LLC of the behavioral loss is thus an upper bound for the LLC of the training loss. The LLC of the behavioral loss is well-defined everywhere in the loss landscape, making it potentially more useful for characterizing the complexity of neural networks at every point in training. However, the behavioral LLC is currently less well understood than the LLC. For example, it is less clearly related to network generalization ability (aside from upper bounding the LLC).

## Other

- [Lucius] Extending the current framework for computation in superposition from boolean variables to floating point numbers or real numbers.
- [Lucius] Bounding the sparsity of LLM representations.
- [Lucius] Relating superposition to the loss landscape.

## Meta-research and philosophy

- [Lee] Write up reviews/short posts on the links between various concepts in comp neuro and mech interp, and between philosophy of science and mech interp.
- [Lee] What is a feature? What terms should we really be using here? What assumptions do these concepts make? Where does it lead when we take these assumptions to their natural conclusions?
- [Lucius] Should we expect some or many of the ‘features’ in current neural networks to be natural latents?

## Engineering

- [Dan] Create a new, high quality tinystories dataset and model suite (credit to Noa Nabeshima for the idea). The existing tinystories dataset is very formulaic, small, and has unusual unicode characters in it. Addressing these issues, and training a small model suite on this new dataset, would be very valuable. There is already work cleaning up the existing tinystories dataset and a 4-layer model without layernorm trained on the clean dataset (it also comes with SAEs and transcoders trained on it). Reach out to Noa (noanabeshima@gmail.com) and/or me (dan@apolloresearch.ai) if interested in taking this on. Subsidies for compute credits for dataset generation and model training may be available.

^{[1]} Papers from our first project here and here, and from our second project here.