Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

New paper walkthrough: Interpretability in the Wild: A Circuit for Indirect Object Identification In GPT-2 Small is a really exciting new mechanistic interpretability paper from Redwood Research. They reverse engineer a 26(!) head circuit in GPT-2 Small, used to solve Indirect Object Identification: the task of understanding that the sentence "After John and Mary went to the shops, John gave a bottle of milk to" should end in Mary, not John.
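To make the task concrete, here is a minimal sketch of how IOI prompts like the one above could be generated. The template, the name list, and the `logit_diff` helper are all hypothetical choices for illustration, not the paper's dataset or code; `logit_diff` assumes you already have a mapping from candidate names to next-token logits from some model.

```python
import random

# Hypothetical template and name list, chosen only for illustration.
TEMPLATE = "After {A} and {B} went to the shops, {S} gave a bottle of milk to"
NAMES = ["John", "Mary", "Alice", "Tom", "Sarah", "James"]


def make_ioi_prompt(rng):
    """Return (prompt, indirect_object, subject) for one IOI example."""
    io, s = rng.sample(NAMES, 2)  # two distinct names
    # Randomise which of the two names is mentioned first.
    first, second = (io, s) if rng.random() < 0.5 else (s, io)
    prompt = TEMPLATE.format(A=first, B=second, S=s)
    return prompt, io, s


def logit_diff(logits, io, s):
    """Logit of the indirect object minus logit of the repeated subject.

    `logits` is assumed to be a dict from name to next-token logit.
    """
    return logits[io] - logits[s]


rng = random.Random(0)
examples = [make_ioi_prompt(rng) for _ in range(5)]
for prompt, io, s in examples:
    print(f"{prompt!r} -> correct: {io!r}, incorrect: {s!r}")
```

In each prompt the subject appears twice and the indirect object once, so a positive `logit_diff` means the model prefers the name that was not repeated.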

I think this is a really cool paper that illustrates how to rigorously reverse engineer real models, and is maybe the third example of a well understood circuit in a real model. So I wanted to understand it better by making a walkthrough (and hopefully help other people understand it too!). In this walkthrough I'm joined by three of the authors, Kevin Wang, Arthur Conmy and Alexandre Variengien. In Part 1 we give an overview of the high-level themes in the paper and what we think is most interesting about it. If you're willing to watch an hour of this and want more, in Part 2 we do a deep dive into the technical details: we read through the paper together and dig into how the circuit works and the techniques used to discover it.

If you're interested in contributing to this kind of work, apply to Redwood's REMIX program! The deadline is Nov 13th.

We had some technical issues in filming this, and my video doesn't appear - sorry about that! I also tried my hand at some amateur video editing to trim it down - let me know if this was worthwhile, or if it's bad enough that I shouldn't have bothered lol. If you find this useful, you can check out my first walkthrough on A Mathematical Framework for Transformer Circuits. And let me know if there are more interpretability papers you want walkthroughs on!

If you want to try exploring this kind of work for yourself, there's a lot of low-hanging fruit left to pluck! Check out my EasyTransformer library and the accompanying codebase to this paper! A good starting point is their notebook giving an overview of the key experiments or my notebook modelling what initial exploration on any task like this could look like. Some brainstormed future directions:

  • 3 letter acronyms (or more!)
  • Converting names to emails.
    • An extension task is e.g. constructing an email from a snippet like the following: Name: Neel Nanda; Email: last name dot first name k @ gmail
  • Grammatical rules
    • Learning that words after full stops are capital letters
    • Verb conjugation
    • Choosing the right pronouns (e.g. he vs she vs it vs they)
    • Whether something is a proper noun or not
  • Detecting sentiment (e.g. predicting whether something will be described as good vs bad)
  • Interpreting memorisation. E.g., there are times when GPT-2 knows surprising facts like people’s contact information. How does that happen?
  • Counting objects described in text. E.g.: I picked up an apple, a pear, and an orange. I was holding three fruits.
  • Extensions from Alex Variengien
    • Understanding what's happening in the adversarial examples: most notably the S-Inhibition Head attention pattern (hard). (S-Inhibition heads are mentioned in the IOI paper)
    • Understanding how the positional signal is encoded (relative distance, something else?). Bonus points if we have a story that includes the positional embeddings and explains how the difference between positions is computed (if relative is the right framework) by Duplicate Token Heads / Induction Heads. (hard, but less context dependent)
    • What is the role of MLPs in IOI? (quite broad and hard)
    • What is the role of Duplicate Token Heads outside IOI? Are they used in other Q-compositions with S-Inhibition Heads? Can we describe how their QK circuit implements "collision detection" at a parameter level? (The last question has low context dependence and is quite tractable)
    • What is the role of Negative / Backup / regular Name Mover Heads outside IOI? Can we find examples on which Negative Name Movers contribute positively to the next-token prediction?
    • What are the differences between the 5 induction heads present in GPT-2 Small? What heads do they rely on, and what later heads do they compose with? (low context dependence from IOI)
    • Understanding 4.11 (a really sharp previous token head) at the parameter level. I think this can be quite tractable given that its attention pattern is almost perfectly off-diagonal
    • What are the conditions for compensation mechanisms to occur? Is it due to dropout? (Arthur Conmy is working on this - feel free to reach out)
  • Extensions from Arthur Conmy
    • Understand IOI in GPT-Neo: it's a model of the same size but does IOI via composition of MLPs
    • Understand IOI in the Stanford Mistral models - they all seem to do IOI in the same way, so maybe look at the development of the circuit through training?
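As a starting point for the 4.11 item in the list above, here is a rough sketch of how one might quantify "almost perfectly off-diagonal" for a previous token head. The `previous_token_score` helper and the toy attention pattern are hypothetical; in practice `attn` would be a head's attention pattern taken from a real model, with `attn[q, k]` the weight from query position `q` to key position `k`.

```python
import numpy as np


def previous_token_score(attn):
    """Mean attention from each position q >= 1 to position q - 1."""
    # offset=-1 selects the entries attn[1, 0], attn[2, 1], ...
    return np.diagonal(attn, offset=-1).mean()


# Toy pattern: a sharp previous token head that puts 0.95 on the
# previous position and 0.05 on the current one.
seq_len = 6
sharp = np.zeros((seq_len, seq_len))
sharp[0, 0] = 1.0  # the first position can only attend to itself
rows = np.arange(1, seq_len)
sharp[rows, rows - 1] = 0.95
sharp[rows, rows] = 0.05

# Baseline: uniform causal attention.
uniform = np.tril(np.ones((seq_len, seq_len)))
uniform /= uniform.sum(axis=1, keepdims=True)

print(previous_token_score(sharp), previous_token_score(uniform))
```

A score near 1 indicates a sharp previous token head, while the uniform baseline gives a much lower value.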

Some of the links mentioned in the video:


Fun, informative, definitely rambling.

I think this is the sort of thing you should expect to work fine even if you can't tell if a future AI is deceiving you, so I basically agree with the authors' prognostication more than yours. I think for more complicated questions like deception, mechanistic understanding and human interpretability will start to come apart. Good methods might not excite you much in terms of the mechanistic clarity they provide, and vice versa.

Idk, I definitely agree that all data so far is equally consistent with 'mechanistic interp will scale up to identifying whether GPT-N is deceiving us' and with 'MI will work on easy problems but totally fail on hard stuff'. But I do take this as evidence in favour of it working really well.

What kind of evidence could you imagine seeing that mechanistic understanding is actually sufficient for understanding deception?

Good methods might not excite you much in terms of the mechanistic clarity they provide, and vice versa.

Not sure what you mean by this

But I do take this as evidence in favour of it working really well.

I'd argue that most of the updating should already have been done, not even based on Chris Olah's work, but on neuroscientists working out things like the toad's prey-detection circuits.

Not sure what you mean by this

You seem pretty motivated by understanding in detail why and how NNs do particular things. But I think this doesn't scale to interpreting complicated world-modeling, and think that what we'll want is methods that tell us abstract properties without us needing to understand the details. To some aesthetics, this is unsatisfying.

E.g. suppose we do the obvious extensions of ROME to larger sections of the neural network rather than just one MLP layer, driven by larger amounts of data. That seems like it has more power for detection or intervention, but only helps us understand the micro-level a little, and doesn't require it.

What kind of evidence could you imagine seeing that mechanistic understanding is actually sufficient for understanding deception?

If the extended-ROME example turns out to be possible to understand mechanistically through some lens, in an amount of work that's actually sublinear in the range of examples you want to understand, that would be strong evidence to me. If instead it's hard to understand even some behavior, and it doesn't get easier and easier as you go on (e.g. each new thing might be an unrelated problem to previous ones, or even worse you might run out of low-hanging fruit) that would be the other world.

Thanks for clarifying your position, that all makes sense.

I'd argue that most of the updating should already have been done, not even based on Chris Olah's work, but on neuroscientists working out things like the toad's prey-detection circuits.

Huh, can you say more about this? I'm not familiar with that example (though have a fairly strong prior on there being at best a weak association between specific neuroscience results + specific AI interp results)

I'm thinking about the paper Ewert 1987, which I know about because it spurred Dennett's great essay Eliminate the Middletoad, but I don't really know the gory details of it, sorry.

I agree the analogy is weak, and there can be disanalogies even between different ANN architectures. I think my intuition is based more on some general factor of "human science being able to find something interesting in situations kinda like this," which is less dependent on facts of the systems themselves and more about, like, do we have a paradigm for interpreting signals in a big mysterious network?

Understand IOI in GPT-Neo: it's a model of the same size but does IOI via composition of MLPs

GPT-Neo might be weird because it was trained without dropout iirc. In general, it seems to be a very unusual model compared to others of its size; e.g. logit lens totally fails on it, and probing experiments find most of its later layers add very little information to its logit predictions. Relatedly, I would think dropout is responsible for backup heads existing and taking over if other heads are knocked out.

Honestly I expect that training without dropout makes it notably better. Dropout is fucked! Interesting that you say logit lens fails and later layers don't matter - can you say more about that?

Arthur mentions something in the walkthrough about how GPT-Neo does seem to have some backup heads, which is wild - I agree that intuitively backup heads should come from dropout.

Huh interesting about the backup heads in GPT-Neo! I would not expect a dropout-less model to have that--some ideas to consider:

  • the backup heads could have other main functions but incidentally are useful for the specific task we're looking at, so they end up taking the place of the main heads
  • thinking of virtual attention heads, the computations performed are not easily interpretable at the individual head-level once you have a lot of layers, sort of like how neurons aren't interpretable in big models due to superposition

Re: GPT-Neo being weird, one of the colabs in the original logit lens post shows that logit lens is pretty decent for standard GPT-2 of varying sizes but basically useless for GPT-Neo, i.e. it outputs some extremely unlikely tokens for every layer before the last one. The bigger GPT-Neos are a bit better (some layers are kinda interpretable with logit lens) but still bad. Basically, the residual stream is just in a totally wacky basis until the last layer's computations, unlike GPT-2 which shows more stability (the whole reason logit lens works).

One weird thing I noticed with GPT-Neo 125M's embedding matrix is that the input static embeddings are super concentrated in vector space, avg. pairwise cosine similarity is 0.960 compared to GPT-2 small's 0.225.

On the later layers not doing much, I saw some discussion on the EleutherAI discord that probes can recover really good logit distributions from the middle layers of the big GPT-Neo models. I haven't looked into this more myself so I don't know how it compares to GPT-2. Just seems to be an overall profoundly strange model.

Just dug into it more, the GPT-Neo embed just has a large constant offset. Average norm is 11.4, norm of mean is 11. Avg cosine sim is 0.93 before, after subtracting the mean it's 0.0024 (avg absolute value of cosine sim is 0.1831)
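The effect described here is easy to reproduce on synthetic data. The sketch below does not load GPT-Neo: the matrix sizes and the offset of 3.0 per dimension are made-up values, and `mean_pairwise_cosine` is a hypothetical helper. It only illustrates that a shared constant offset can push the average pairwise cosine similarity close to 1, and that subtracting the mean removes it.

```python
import numpy as np


def mean_pairwise_cosine(x):
    """Mean cosine similarity over all distinct pairs of rows of x."""
    unit = x / np.linalg.norm(x, axis=1, keepdims=True)
    sims = unit @ unit.T
    n = x.shape[0]
    # Exclude the diagonal (each row's similarity with itself is 1).
    return (sims.sum() - np.trace(sims)) / (n * (n - 1))


rng = np.random.default_rng(0)
n_tokens, d_model = 500, 64            # toy sizes, not GPT-Neo's real shapes
noise = rng.normal(size=(n_tokens, d_model))  # per-token component
offset = np.full(d_model, 3.0)                # hypothetical shared offset
embed = noise + offset

before = mean_pairwise_cosine(embed)
after = mean_pairwise_cosine(embed - embed.mean(axis=0))

print(f"average norm:      {np.linalg.norm(embed, axis=1).mean():.2f}")
print(f"norm of mean:      {np.linalg.norm(embed.mean(axis=0)):.2f}")
print(f"cosine sim before: {before:.3f}")
print(f"cosine sim after:  {after:.3f}")
```

When the norm of the mean is close to the average norm, as in the numbers quoted above, most of each vector is the shared offset, which is what drives the high similarity.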

avg. pairwise cosine similarity is 0.960

Wait, WTF? Are you sure? 0.96 is super high. The only explanation I can see for that is a massive constant offset dominating the cosine sim (which isn't crazy tbh).

The Colab claims that the logit lens doesn't work for GPT-Neo, but does work if you include the final block, which seems sane to me. I think that in GPT-2 the MLP0 is basically part of the embed, so it doesn't seem crazy for the inverse to be true (esp if you do the dumb thing of making your embedding + unembedding matrix the same)

I'm pretty sure! I don't think I messed up anywhere in my code (just nested for loop lol). An interesting consequence of this is that for GPT-2, applying logit lens to the embedding matrix (i.e. ) gives us a near-perfect autoencoder (the top output is the token fed in itself), but for GPT-Neo it always gets us the vector with the largest magnitude since in the dot product the cosine similarity is a useless term.
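Here is a toy sketch of that consequence, under the assumption that the unembedding is the transpose of the embedding matrix (tied weights), so the "logit lens on the embedding" is just `embed @ embed.T`. All sizes and values are invented for illustration, and token 7 is arbitrarily given the largest norm.

```python
import numpy as np

rng = np.random.default_rng(0)
n_tokens, d_model = 200, 64  # toy sizes, not real model shapes

# Case 1: unit-norm embeddings pointing in unrelated directions.
directions = rng.normal(size=(n_tokens, d_model))
directions /= np.linalg.norm(directions, axis=1, keepdims=True)
spread_embed = directions

# Case 2: every embedding is dominated by one shared direction,
# so the vectors differ mostly in length rather than in angle.
shared = np.zeros(d_model)
shared[0] = 1.0
scales = rng.uniform(0.5, 1.5, size=(n_tokens, 1))
scales[7] = 3.0  # hypothetical token with the largest norm
offset_embed = scales * (shared + 0.1 * directions)


def top_token(embed):
    """Per-row argmax of embed @ embed.T (assumes tied unembedding)."""
    return (embed @ embed.T).argmax(axis=1)


# Fraction of tokens that are mapped back to themselves.
print((top_token(spread_embed) == np.arange(n_tokens)).mean())
# Set of tokens that ever come out on top in the offset case.
print(np.unique(top_token(offset_embed)))
```

In the first case each token's largest dot product is with itself, which is the autoencoder behaviour. In the second case the cosine term is nearly constant, so the dot product is decided by norm and the same token wins for every input.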

What do you mean about MLP0 being basically part of the embed btw? There is no MLP before the first attention layer right?

See my other comment - it turns out to be the boring fact that there's a large constant offset in the GPT-Neo embeddings. If you subtract the mean of the GPT-Neo embed it looks normal. (Though the fact that this exists is interesting! I wonder what that direction is used for?)

What do you mean about MLP0 being basically part of the embed btw? There is no MLP before the first attention layer right?

I mean that, as far as I can tell (medium confidence), attn0 in GPT-2 isn't used for much, and MLP0 contains most of the information about the value of the token at each position. E.g., ablating MLP0 completely kills performance, while ablating other MLPs doesn't. And generally the kind of tasks that I'd expect to depend on tokens depend substantially on MLP0.

Cool that you figured that out, easily explains the high cosine similarity! It does seem to me that a large constant offset to all the embeddings is interesting, since that means GPT-Neo's later layers have to do computation taking that into account, which seems not at all like an efficient decision. I will def poke around more.

Interesting on MLP0 (I swear I use zero indexing lol just got momentarily confused)! Does that hold across the different GPT sizes?

Haven't checked lol

Thanks for the comment! 

I have spent some time trying to do mechanistic interp on GPT-Neo, to try and answer whether compensation only occurs because of dropout. TLDR: my current impression is that compensation still occurs in models trained without dropout, just to a lesser extent. 

In depth, when GPT-Neo is fed a sequence of tokens whose first half is uniformly random and whose second half repeats the first half, there are four heads in Layer 6 that have the induction attention pattern (i.e. each repeated token attends to the token that followed its earlier occurrence). Three of these heads (6.0, 6.6, 6.11) decrease loss when ablated, and one of these heads (6.1) increases loss on ablation. Interestingly, when 6.1 is ablated, the additional ablation of 6.0, 6.6 and 6.11 causes loss to increase (perhaps this is confusing, see this table!).

My guess is the model is able to use the outputs of 6.0, 6.6 and 6.11 differently in the two regimes, so they "compensate" when 6.1 is ablated.
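For anyone who wants to try this kind of experiment, here is a minimal sketch of the repeated-random-token setup and one way to score how induction-like a head's attention pattern is. The indexing convention (a random first half of length `n` followed by an exact copy, 0-indexed positions), the helper names, and the toy attention patterns are assumptions for illustration; no model is loaded.

```python
import numpy as np


def repeated_random_tokens(n, vocab_size, rng):
    """2n tokens: a uniformly random first half followed by a copy of it."""
    first_half = rng.integers(0, vocab_size, size=n)
    return np.concatenate([first_half, first_half])


def induction_score(attn, n):
    """Mean attention from position i + n to position i + 1 (0-indexed).

    attn[q, k] is the attention weight from query position q to key k.
    """
    queries = np.arange(n, 2 * n)  # positions in the repeated half
    keys = queries - n + 1         # token after the earlier occurrence
    return attn[queries, keys].mean()


n = 8
tokens = repeated_random_tokens(n, vocab_size=50_000, rng=np.random.default_rng(0))

# Toy pattern 1: a perfect induction head. Queries in the first half
# have nothing to copy from, so they attend to position 0.
ideal = np.zeros((2 * n, 2 * n))
ideal[np.arange(n), 0] = 1.0
ideal[np.arange(n, 2 * n), np.arange(1, n + 1)] = 1.0

# Toy pattern 2: uniform causal attention as a baseline.
uniform = np.tril(np.ones((2 * n, 2 * n)))
uniform /= uniform.sum(axis=1, keepdims=True)

print(induction_score(ideal, n), induction_score(uniform, n))
```

With a real model, `attn` would be each head's attention pattern on `tokens`, and the ablation comparison described above would then be run on the heads with the highest scores.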