Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

200 COP in MI: Interpreting Algorithmic Problems

3Nathan Helm-Burger

1Alexandre Variengien

This is the fourth post in a sequence called 200 Concrete Open Problems in Mechanistic Interpretability.Start here, then read in any order. If you want to learn the basics before you think about open problems, check outmy post on getting started. Look up jargon inmy Mechanistic Interpretability Explainer## Motivation

Motivating paper:A Mechanistic Interpretability Analysis of GrokkingWhen models are trained on synthetic, algorithmic tasks, they often learn to do some clean, interpretable computation inside. Choosing a suitable task and trying to reverse engineer a model can be a rich area of interesting circuits to interpret! In some sense, this is interpretability on easy mode - the model is normally trained on a single task (unlike language models, which need to learn everything about language!), we know the exact ground truth about the data and optimal solution, and the models are tiny. So why care?

I consider my work on grokking to be an interesting case study of this work going well. Grokking (shown below) is a mysterious phenomena where, when small models are trained on algorithmic tasks (eg modular addition or modular division), they initially memorise the training data. But when they

keepbeing trained on that data for a really long time, the model suddenly(ish) figures out how to generalise!In my work, I simplified their setup even further, by training a 1 Layer transformer (with no LayerNorm or biases) to do modular addition and reverse engineered the weights to understand what was going on. And it turned out to be doing a funky trig-based algorithm (shown below), where the numbers are converted to frequencies with a memorised Discrete Fourier Transform, added using trig identities, and converted back to the answer! Using this, we looked inside the model and identified that despite seeming to have plateaued, in the period between memorising and "grokking", the model is actually slowly forming the circuit that does generalise. But so long as the model still has the memorising circuit, this adds too much noise to have good test loss. Grokking occurs when the generalising circuit is so strong that the model decides to "clean-up" the memorising circuit, and "uncovers" the mature generalising circuit beneath, and suddenly gets good test performance.

OK, so I just took this as an excuse to explain my paper to you. Why should you care? I think that the general lesson from this, that I'm excited to see applied elsewhere, is using toy algorithmic models to analyse a phenomena we're confused about. Concretely, given a confusing phenomena like grokking, I'd advocate the following strategy:

Simplifyto the minimal setting that exhibits the phenomena, yet is complex enough to be interestingReverse-engineerthe resulting model, in as much detail as you canExtrapolatethe insights you've learned from the reverse-engineered model - what are the broad insights you've learned? What do you expect to generalise? Can you form any automated tests to detect the circuits you've found, or any of their motifs?Verifyby looking at other examples of the phenomena and seeing whether these insights actually hold (larger models, different tasks, even just earlier checkpoints of the model or different random seeds)Grokking is an example in a science of deep learning context - trying to uncover mysteries about how models learn and behave. But this same philosophy also applies to understanding confusing phenomena in language models, and building toy algorithmic problems to study those!

Anthropic's Toy Models of Superposition is an excellent example of this done well, for the case of superposition in linear bottleneck dimensions of a model (like value vectors in attention heads, or the residual stream). Concretely, they isolated out the key traits as being where high dimensional spaces are projected to a low dimensional space, and then mapped back to a high dimensional space, with no non-linearity in between. And got extremely interesting and rich results! (More on these in the next post)

More broadly, because algorithmic tasks are often cleaner and easier to interpret, and there's a known ground truth, it can be a great place to practice interpretability! Both as a beginner to the field trying to build intuitions and learn techniques, and to refine our understanding of the

righttools and techniques. It's much easier to validate a claimed approach to validate explanations (like causal scrubbing) if you can run it on a problem with an understood ground truth!Overall, I'm less excited about algorithmic problem interpretability in general than some of the other categories of open problems, but I think it can be a great place to start and practice. I think that building a toy algorithmic model for a confusing phenomena is hard, but can be really exciting if done well!

## Resources

Demo: Reverse engineering how a small transformer can re-derive positional embeddings (colab), with an accompanying video walkthroughDemo:write-upandcolab## Tips

testthese hypotheses, and try to look for evidence for them, and evidence to falsify them. Use these hypotheses to help you prioritise and focus among all of the possible directions you can go in - accept that they'll probably be wrong in some important ways (I didnotexpect modular addition to be using Fourier Transforms!), but use them as a guide to figure outhowthey're wrong.## Problems

This spreadsheetlists each problem in the sequence. You can write down your contact details if you're working on any of them and want collaborators, see any existing work or reach out to other people on there! (thanks to Jay Bailey for making it)`START 4 6 2 9 MID 2 4 6 9`

)A* 3.2 -Sorting variable-length lists.B* 3.8 -5 digit addition/subtractiongrokking happensthere, and it seems likely related to the modular addition algorithm. You might need to play around with training setups and hyperparameters if you varied the number of digits.B* 3.9 -Predicting the output to simple code functions. E.g. predicting the bold text in problems likea = 1 2 3

a[2] = 4

a ->

1 2 4B* 3.10 -Graph theory problems likethishttps://jacobbrazeal.wordpress.com/2022/09/23/gpt-3-can-find-paths-up-to-7-nodes-long-in-random-graphs/#comment-248B* 3.11 -Train a model on multiple algorithmic tasks that we understand (eg train a model to do modular additionandmodular subtraction, by learning two different outputs). Compare this to a model trained on each task.B* 3.12 -Train models forautomatatasks and interpret them - do your results match the theory?B* 3.13 -In-Context Linear Regression, as described inGarg et al- the transformer is given a sequence (x_1, y_1, x_2, y_2, …), where y_i=Ax_i+b and A and b are different for each prompt and need to be learned in-context (see code here)C* 3.14 -The other problems in the paper that are in-context learned - sparse linear functions, 2L networks & decision treesinduction heads(seemy grokking paperfor details.)Leetcode easyis probably a good source of problems.B* 3.18 -Build a toy model of Indirect Object Identification - train a tiny attention only model on an algorithmic task simulating Indirect Object Identification, and reverse-engineer the learned solution. Compare this to the circuit found in GPT-2 Small - is it the same algorithm? We often build toy models to model things in real models, it'd be interesting to validate this by going the other way`BOS John Mary John MID`

->`Mary`

. To avoid trivial solutions, make eg half the data 3 distinct names like`John Mary Peter`

which maps to a randomly selected one of those names. I'd try training a 3L attn-only model to do this.C* 3.19 -There's a bunch of follow-up questions: Is this consistent across random seeds, or can other algorithms be learned? Can a 2L model learn this? What happens if you add MLPs or more layers?C* 3.20 -Reverse-engineer Othello-GPT (summary here) - the paper trains a GPT-style model to predict the next move in randomly generated Othello games (ie, it predicts legal moves, but doesn't learn strategy). Can you reverse-engineer the algorithms the model learns?A* 3.21 -Train a one layer attention-only transformer with rotary to predict the previous token, and reverse engineer how it does this.B* 3.22 -Train a three layer attention-only transformer to perform the Indirect Object Identification task (andjustthat task! You can algorithmically generate training data). Can it do the task? And if so, does it learn the same circuit that was found in GPT-2 Small?B* 3.23 -Re-doing my modular addition analysis with GELU. How does this change things, if at all?C* 3.24 -How does memorization work?`{0,1,...n-1}`

. These are one-hot encoded, and stacked (so the input has dimension`2n`

). Each pair has a randomly chosen output label, and the training set consists of all possible pairs.B-C* 3.25 -Comparing different dimensionality reduction techniques ( e.g.PCA/SVD,t-SNE,UMAP,NMF,the grand tour) on modular addition (or any other problem that you feel you understand!)AlphaZero,Understanding RL Vision, andThe Building Blocks of Interpretability- which should be a good starting point for how.C* 3.27 -Is direct logit attribution always useful? Can you find examples where it’s highly misleading?notthe correct logit)D* 3.28 -The Lottery Ticket HypothesisD* 3.29 -Deep Double Descentlotmore detail in the post including on concrete problems and give code to build off of, so I think this can be a great place to start. The following is a brief summary.A*3.30 Trying one of the concrete starter projectsB-C*3.31 Looking for modular circuits - try to find the circuits used to compute the world model and to use the world model to compute the next move, and understand each in isolation and use this to understand how they fit together. See what you can learn about finding modular circuits in general.B-C*3.32 Neuron Interpretability and Studying Superposition - try to understand the model's MLP neurons, and explore what techniques do and do not work. Try to build our understanding of how to understand transformer MLPs in general.B-C*3.33 A Transformer Circuit Laboratory - Explore and test other conjectures about transformer circuits, eg can we figure out how the model manages memory in the residual stream?