Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

200 COP in MI: The Case for Analysing Toy Language Models

Hi Neel! Thanks so much for all these online resources. I've been finding them really interesting and helpful.

I have a question about research methods. "How far can you get with really deeply reverse engineering a neuron in a 1 layer (1L) model? (solu-1l, solu-1l-pile or gelu-1l in TransformerLens)."

I've loaded up solu-1l in my Jupyter notebook but I'm now feeling a bit lost. For your IOI tutorial, there was a very specific benchmark and error signal. However, when I'm just playing around with a model without a clear capability in mind, it's harder to know how to measure performance. I could make a list of capabilities/benchmarks, systematically run the model on them, and then pick a capability and start ablating the model and seeing the effect on performance. However, I'm then restricted to these predefined capabilities. Like, I'm not even sure what the capabilities of solu-1l are.

I could start feeding solu-1l with random inputs and just "looking" at the attention patterns. But I'm wondering if there's a more efficient way to do this, or another strategy where research does feel like play, as you describe in your notebook.

Thank you!

Great question! My concrete suggestion is to look for interesting neurons in Neuroscope, as I discuss more in the final post. This is a website I made that shows the text that most activates each neuron in the model (for a ton of open source models), and by looking for interesting neurons, you can hopefully find some hook - find a specific task the model can consistently-ish do, analogous to IOI (with a predictable structure you can generate prompts for, ideally with a somewhat algorithmic flavour - something you could write code to solve). And then do the kinds of things in the IOI notebook. Though note that for a 1L model, you can actually mechanistically look at the weights and break down what the model is doing!

On a meta level, the strategy you want to follow in a situation like this is what I call **maximising surface area**. You want to explore things and try to get exposed to as many random details about the model behaviour as you can, so that you can then serendipitously notice something interesting and dig into it. The meta-lesson is that when you feel stuck and meandering, you want to pick *some* purpose to strive for, but that purpose can just be "put yourself in a situation where you have so much data and context that you can spontaneously stumble across something interesting, and cast a really wide net". Concretely, you want to look for some kind of task/capability that the model is capable of, so you can then try to reverse-engineer it. And a good way to do this is just to run the model on a bunch of dataset examples and look at what it's good at, and see if you can find any consistent patterns to dig into. To better explore this, I made a tool to visualise the top 10 tokens predicted for each token in the text in Alan Cooney's CircuitsVis library. You can filter for interesting text by eg looking for tokens where the model's log prob for the correct next token is significantly higher than attn-only-1l, to cut things down to where the MLPs matter (I'd cut off the log prob at -6 though, so you don't just notice when attn-only-1l is really incorrect lol).
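A minimal sketch of the filtering idea in the last sentence above, assuming both models share a tokenizer so that token positions line up. The helper names (`correct_token_log_probs`, `mlp_matters_positions`) and the `min_gap` threshold are hypothetical, and clamping the attention-only log prob at -6 is just one reading of the cut-off described; this is not the CircuitsVis tool itself.

```python
from typing import List, Tuple


def correct_token_log_probs(model_name: str, text: str) -> List[float]:
    """Log prob each position assigns to the actual next token.

    Requires torch and transformer_lens; they are imported lazily so the
    pure-Python filter below can be used without them.
    """
    import torch
    from transformer_lens import HookedTransformer

    model = HookedTransformer.from_pretrained(model_name)
    tokens = model.to_tokens(text)  # shape [1, pos], BOS token prepended
    with torch.no_grad():
        log_probs = model(tokens).log_softmax(dim=-1)  # [1, pos, d_vocab]
    # Position i predicts token i + 1, so pair predictions [:-1] with tokens [1:].
    next_tokens = tokens[0, 1:].unsqueeze(-1)
    return log_probs[0, :-1].gather(-1, next_tokens).squeeze(-1).tolist()


def mlp_matters_positions(
    full_log_probs: List[float],
    attn_only_log_probs: List[float],
    floor: float = -6.0,    # assumption: clamp the baseline log prob from below
    min_gap: float = 1.0,   # hypothetical threshold for "significantly higher"
) -> List[Tuple[int, float]]:
    """Return (position, gap) pairs where the full model beats the baseline,
    sorted with the largest gap first."""
    hits = []
    for pos, (full_lp, base_lp) in enumerate(zip(full_log_probs, attn_only_log_probs)):
        # Clamping stops positions where the baseline is wildly wrong from dominating.
        gap = full_lp - max(base_lp, floor)
        if gap >= min_gap:
            hits.append((pos, gap))
    hits.sort(key=lambda hit: hit[1], reverse=True)
    return hits
```

To try it on real text, pass the outputs of `correct_token_log_probs("solu-1l", text)` and `correct_token_log_probs("attn-only-1l", text)` to `mlp_matters_positions`, then read the surrounding tokens at the returned positions.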

This is the second post in a sequence called 200 Concrete Open Problems in Mechanistic Interpretability. Start here, then read in any order. If you want to learn the basics before you think about open problems, check out my post on getting started. Look up jargon in my Mechanistic Interpretability Explainer.

Disclaimer: Mechanistic Interpretability is a small and young field, and I was involved with much of the research and resources linked here. Please take this sequence as a bunch of my personal takes, and try to seek out other researchers’ opinions too!

## Motivation

In A Mathematical Framework for Transformer Circuits, we got a lot of traction interpreting toy language models - that is, transformers trained in exactly the same way as larger models, but with only 1 or 2 layers. I think there’s a lot of low-hanging fruit left to pluck when studying toy language models! To accompany this piece, I’ve trained and open sourced some toy models. The models are documented here: there are 12 models, covering 1 to 4 layers, each in three variants: attention-only, normal (with MLPs and GELU activations), and normal with SoLU activations.

So, why care about studying toy language models? The obvious reason is that it’s way easier to get traction. In particular, the inputs and outputs of a model are intrinsically interpretable, and in a toy model there’s just not as much space between the inputs and outputs for weird complexity to build up. But the obvious objection to the above is that, ultimately, we care about understanding real models (and ideally extremely large ones like GPT-3), and learning to interpret toy models is not the actual goal. This is a pretty valid objection, but to me, there are two natural ways that studying toy models can be valuable:

The first is by finding fundamental circuits that recur in larger models, and motifs that allow us to easily identify these circuits in larger models. A key underlying question here is that of universality: does each model learn its own weird way of completing its task, or are there some fundamental principles and algorithms that all models converge on?

A striking example of universality is induction heads, which we found in A Mathematical Framework in two layer attention-only models. Induction heads are part of a two head induction circuit which models use to detect and continue repeated sequences from earlier in the prompt, and which turn out to be such a fascinating circuit that we wrote a whole other paper on them! They’re universal in all models we’ve looked at, they all appear in a sudden phase change, they seem to be a core mechanism behind complex behaviours such as translation and few shot learning, and they seem to be the main way that transformers use text far back in the context to predict the next token better. (See my overview of induction circuits for more). And knowing about these has helped to disentangle the more complex behaviour of indirect object identification.

The second is by forming a better understanding of how to reverse engineer models - what are the right intuitions and conceptual frameworks, what tooling and techniques do and do not work, and what weird limitations there are. This one feels less clear to me. Our work in A Mathematical Framework significantly clarified my understanding of transformers in general, especially attention, in a way that seems to generalise - in particular, thinking of the residual stream as the central object, and the significance of the QK-Circuits and OV-Circuits. But there are also ways it can be misleading, and some techniques that work well in toy models seem to generalise less well.

One angle I’m extremely excited about here is reverse engineering MLP neurons in tiny models - our understanding of transformer MLP layers is still extremely limited and there are confusing phenomena we don’t understand, like superposition and polysemanticity. And we don't yet have even a single published example of a fully understood transformer neuron! I expect I’d learn a lot from seeing neurons in a one or two layer language model be reverse engineered.

My personal guess is that the lessons from toy models generalise enough to real models to be worth a significant amount of exploration, combined with careful testing of how much the insights do in fact generalise. But overall I think this is an important and non-obvious scientific question. And being proven wrong would also teach me important things about transformers!

## Resources

- Demo: Exploratory Analysis Demo is a walkthrough of how to use basic mechanistic interpretability techniques in my TransformerLens library. The notebook explores an unfamiliar task in GPT-2 Small, but the same techniques transfer to these toy models! I recommend copying this notebook and using this as a starting point - don’t start from scratch! (Change the model name from `gpt2-small` to `solu-1l`, `attn-only-2l` etc)
- A video walkthrough I made on A Mathematical Framework for Transformer Circuits
- Toy language models: attention-only, normal (with MLPs) and SoLU (with MLPs) transformers with 1, 2, 3 or 4 layers. Documented here. Load them with `HookedTransformer.from_pretrained('solu-1l')` or `gelu-1l` or `attn-only-1l` (etc for more layers)
- Neuroscope.io - a website I made which shows the text that most activates each neuron in several SoLU language models I trained, including the toy SoLU models mentioned above. (Under construction!)
- Explainer, especially the sections on Transformers, A Mathematical Framework, Induction Circuits and Mechanistic Interpretability Techniques

## Tips

- [...] a lot of behaviours to explore here, and I’ve only thought of a few! In particular, my toy models were trained 20% on Python code, which is much more structured than natural language; I recommend starting here!
- [...] doesn’t do the task - it’s easy to have an elaborate and sophisticated hypothesis explaining a simple behaviour.
- [...] much easier and may require some other, much simpler, circuits. But the first token might also be shared between answers!
- [...] just the behaviour we care about, and control for model behaviour that’s common between the prompts.
- [...] logit difference between the Rome and Paris output logits rather than just the fact that it outputs Paris, we control for (significant but irrelevant) behaviour like “I should output a European capital city” (or even that “ Paris” and “ Rome” are common tokens!)
- [...] activation patching, we can isolate out which parts of the model matter to recall factual knowledge.
- [...] only thing that final layer heads can do is affect the output, so direct logit attribution is particularly useful there, as the final layer heads are also likely to be the most interesting

## Problems

This spreadsheet lists each problem in the sequence. You can write down your contact details if you're working on any of them and want collaborators, see any existing work or reach out to other people on there! (thanks to Jay Bailey for making it)

- B-C* 1.1 - How far can you get with really deeply reverse engineering a neuron in a 1 layer (1L) model? (solu-1l, solu-1l-pile or gelu-1l in TransformerLens)
- B* 1.2 - Find an interesting neuron in the model that you think represents some feature. Can you fully reverse engineer which direction in the model should activate that feature (ie, as calculated from the embedding and attention, in the residual stream in the middle of the layer) and compare it to the neuron input direction?
- B* 1.3 - Look for trigram neurons - eg “ice cream -> sundae”
- B* 1.4 - Check out the SoLU paper for more ideas. Eg, can you find a base64 neuron?
- C* 1.5 - Ditto for 2L or larger models - can you rigorously reverse engineer a neuron there?
- [...] Neuroscope for the toy models and look for interesting neurons to focus on.
- [...] polysemantic neurons in neuroscope? Try to explore what's up with this
- B-C* 1.9 - How do 3-layer and 4-layer attention-only models differ from 2L?
- B* 1.10 - Look for composition scores; try to identify pairs of heads that compose a lot
- B* 1.11 - Look for evidence of composition. E.g. one head’s output represents a big fraction of the norm of another head’s query, key or value vector
- B* 1.12 - Ablate a single head and run the model on a lot of text. Look at the change in performance. Find the most important heads. Do any heads matter a lot that are not induction heads?
- B-C* 1.13 - Look for tasks that an nL model cannot do but a (n+1)L model can - look for a circuit! Concretely, I’d start by running both models on a bunch of text and looking for the biggest differences in per-token probability
- B* 1.14 - How do 1L SoLU/GELU models differ from 1L attention-only?
- B* 1.15 - How do 2L SoLU models differ from 1L?
- B* 1.17 - Analyse how a larger model "fixes the bugs" of a smaller model
- B* 1.18 - Does a 1L MLP transformer fix the skip trigram bugs of a 1L Attn Only model? If so, how?
- [...] can’t do this!):
- B* 1.19 - Doing split-token induction: where the current token has a preceding space and is one token, but the earlier occurrence has no preceding space and is two tokens. (Eg “ Claire” vs “Cl|aire”)
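As a starting point for problem 1.12, here is a hedged sketch of a single-head ablation sweep in TransformerLens. It uses zero ablation on one short text, which is an assumption for brevity: the problem asks for a lot of text, and mean ablation is another reasonable choice. The helper names `ablated_losses` and `rank_heads` are hypothetical.

```python
from functools import partial
from typing import Dict, List, Tuple

Head = Tuple[int, int]  # (layer, head index)


def ablated_losses(model_name: str, text: str) -> Tuple[float, Dict[Head, float]]:
    """Return the baseline loss and the loss with each head zero-ablated in turn.

    Requires torch and transformer_lens; they are imported lazily so that
    rank_heads below can be used on precomputed numbers without them.
    """
    import torch
    from transformer_lens import HookedTransformer, utils

    model = HookedTransformer.from_pretrained(model_name)
    tokens = model.to_tokens(text)

    def zero_head(z, hook, head):
        # z has shape [batch, pos, head_index, d_head]; zero one head's output.
        z[:, :, head, :] = 0.0
        return z

    losses: Dict[Head, float] = {}
    with torch.no_grad():
        baseline = model(tokens, return_type="loss").item()
        for layer in range(model.cfg.n_layers):
            for head in range(model.cfg.n_heads):
                hook = (utils.get_act_name("z", layer), partial(zero_head, head=head))
                loss = model.run_with_hooks(tokens, return_type="loss", fwd_hooks=[hook])
                losses[(layer, head)] = loss.item()
    return baseline, losses


def rank_heads(baseline: float, losses: Dict[Head, float]) -> List[Tuple[Head, float]]:
    """Sort heads by how much ablating them increases the loss, largest first."""
    increases = [(head, loss - baseline) for head, loss in losses.items()]
    return sorted(increases, key=lambda item: item[1], reverse=True)
```

Calling `rank_heads(*ablated_losses("attn-only-2l", text))` gives a first list of heads to inspect; averaging over many texts before ranking would be closer to what the problem describes.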