Induction heads - illustrated

3hold_my_fish

2CallumMcDougall

2Perusha Moodley

1CallumMcDougall

2LawrenceC

1CallumMcDougall

1Review Bot

1shen yue

2CallumMcDougall

New Comment

Thanks, the first diagram worked just as suggested: I have enough exposure to transformer internals that a few minutes of staring was enough to understand the algorithm. I'd always wondered why it is that GPT is so strangely good at repetition, and now it makes perfect sense.

I'm at the beginning of the MI journey: I read the paper, watched a video and I am working through the notebooks. I have seen the single diagram version of this before but I needed this post to really help me get a feel for how the subspaces and composition work. I think it works well as a stand-alone document and I feel like it has helped setup some mental scaffolding for the next more detailed steps I need to take. Thank you for this!

The LessWrong Review runs every year to select the posts that have most stood the test of time. This post is not yet eligible for review, but will be at the end of 2024. The top fifty or so posts are featured prominently on the site throughout the year.

Hopefully, the review is better than karma at judging enduring value. If we have accurate prediction markets on the review results, maybe we can have better incentives on LessWrong today. Will this post make the top fifty?

Many thanks to everyone who provided helpful feedback, particularly Aryan Bhatt and Lawrence Chan!## TL;DR

This is my illustrated walkthrough of induction heads.I created it in order to concisely capture all the information about how the circuit works.There are 2 versions of the walkthrough:

The final image from version 1 is inline below, and depending on your level of familiarity with transformers, looking at this diagram might provide most of the value of this post. If it doesn't make sense to you, then read on for the full walkthrough, where I build up this diagram bit by bit.

## Introduction

Induction heads are a well-studied and understood circuit in transformers. They allow a model to perform in-context learning, of a very specific form: if a sequence contains a repeated subsequence e.g. of the form

(whereA B ... A B

andA

stand for generic tokens, e.g. the first and last name of a person who doesn't appear in any of the model's training data), then the second time this subsequence occurs the transformer will be able to predict thatB

followsB

. Although this might seem like weirdly specific ability, it turns out that induction circuits are actually a pretty massive deal. They're present even in large models (despite being originally discovered in 2-layer models), they can be linked to macro effects like bumps in loss curves during training, and there is some evidence that induction heads might even constitute the mechanism for the actual majority of all in-context learning in large transformer models.AI think induction heads can be pretty confusing unless you fully understand the internal mechanics, and it's easy to come away from them feeling like you get what's going on without actually being able to explain things down to the precise details. My hope is that these diagrams help people form a more precise understanding of what's actually going on.

## Prerequisites

This post is aimed at people who already understand how a transformer is structured (I'd recommend Neel Nanda's tutorial for that), and the core ideas in the Mathematical Framework for Transformer Circuits paper. If you understand everything on this list, it will probably suffice:

residual stream.operating independentlyof each other, reading and writing into the residual stream.circuits. For instance, K-compositionis when the output of one head is used to generate the key vector in the attention calculations of a subsequent head.reading from(orprojecting from) the residual stream, and WO aswriting to(orembedding into) the residual stream.QK circuit.^{[1]}^{[2]}which tokens information is moved to & fromin the residual stream.OV circuit.^{[3]}what information is moved from a token, if that token is attended to.Basic concepts of linear algebra (e.g. understanding orthogonal subspaces and the image / rank of linear maps) would be also be helpful.

Now for the diagram! (You might have to zoom in to read it clearly.)

^{[4]}## Q-composition

Finally, here is a diagram just like the final one above, but which uses Q-composition rather than K-composition. The result is the same, however these heads seem to form less easily than K-composition because they require pointer arithmetic, meaning that they move positional information between tokens and does operations on it, to figure out which tokens to attend to.(although a lot of this is down to architectural details of the transformer

^{[5]}).^{^}Note that I'm using notation corresponding to the

`TransformerLens`

library, not to the Anthropic paper (this is because I'm hoping this post will help people who are actually working with the library). In particular, I'm following the convention that weight matrices multiply on the right. For instance, if v is a vector in the residual stream and WQ is the query projection matrix then vTWQ is the query vector. This is also why the QK circuit is different here than in the Anthropic paper.^{^}This terminology is also slightly different from the Anthropic paper. The paper would call WEWQKWTE the QK circuit, whereas I'm adopting Neel's notation of calling WQK the QK circuit and calling something a full circuit if it includes the WE or WU matrices.

^{^}Again, this is different than the Anthropic paper because of the convention that we're right-multiplying matrices. vTWV is the value vector (of size

`d_head`

) and vTWVWO is the embedding of this vector back into the residual stream. So WVWO is the OV circuit.^{^}I described subtracting one from the positional embedding as a "rotation". This is because positional embeddings are often sinusoidal (either because they're chosen to be sinusoidal at initialisation, or because they develop some kind of sinusoidal structure as the model trains).

^{^}For example, if you specify

`shortformer=True`

when loading in transformers from`TransformerLens`

, this means the positional embeddings aren't added to the residual stream, but only to the inputs to the query and key projection matrices (i.e. not to the the inputs to the value projection matrices WV). This means positional information can be used in calculating attention patterns, but can't itself be moved around to different tokens. You can see from the diagram how this makes Q-composition impossible^{[6]}(because the positional encodings need to be moved as part of the OV circuit, in the first attention head).^{^}That being said, it seems transformers seem to be able to rederive positional information, so they could in theory form induction heads via Q-composition with this rederived information. To my knowledge there's currently no evidence of this happening, but it would be interesting!