Epistemic status: conceptual discussion and opinions informed by doing 6 months of interpretability research at Redwood Research and exchanging with other researchers, but I’m just speaking for myself.

Induction heads are defined twice by Anthropic.

  1. First, as a mechanism in 2-layer attention-only transformers
  2. Second, as a behavioral description on repeated random sequences of tokens

However, these two definitions rely on distinct sources of evidence and create confusion, as their difference is not always acknowledged when people cite these papers. The mechanistic definition applies to toy language models, while the behavioral definition is a useful yet incomplete characterization of attention heads.

I think that many people are in fact confused by this: I have talked to many people who aren’t clear on the fact that these two concepts are different, and incorrectly believe that (e.g.) the mechanism of induction heads in larger language models has been characterized.

More specifically, the two Anthropic papers introduce the following two distinct definitions of induction heads:

  1. Mechanistic: The first definition, introduced by Elhage et al., describes a behavior in a 2 layer attention-only model (copying a token given a matching prefix) and a minimal mechanism to perform this behavior (a set of paths in the computational graph and a human interpretation of the transformation along those paths). Empirically, this mechanism seems to be the best possible short description of what those heads are doing (i.e. if you have to choose a subgraph made of a single path as input for the keys, queries, and values of these heads, the induction circuit is likely to be the one that recovers the most loss). But this explanation does not encompass everything these heads do. In reality, many more paths are used than the one described (see Redwood’s causal scrubbing results on induction heads) and the function of the additional paths is unclear. I don’t know whether the claims about the behavior and mechanisms of these heads are best described as “mostly true but missing details” or “only a small part of what’s going on”. See also Buck’s comment for more discussion on the interpretation of causal scrubbing recovered loss.
  2. Behavioral: The second definition, introduced by Olsson et al., relies on head activation evaluators (measuring attention patterns and head output) on out-of-distribution sequences made of Repeated Random Tokens (RRT). Two scores are used to characterize induction heads: i) Prefix matching: attention probability from the second occurrence of [A] to the token [B] that followed the first occurrence of [A], on patterns like [A][B] … [A] ii) Copying: how much the head output increases the logit of [B] compared to the other logits. The RRT distribution was chosen so that fully abstracted induction behavior is one of the few useful heuristics to predict the next token.
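
To make the behavioral definition concrete, here is a toy sketch of how the two scores could be computed, assuming you already have a head's attention pattern and its direct contribution to the logits. The function names and exact normalizations are my own illustration, not the ones used by Olsson et al.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_rrt(half_len=8, vocab=50):
    """A Repeated Random Tokens sequence: a random half, then the same half again."""
    half = rng.integers(0, vocab, size=half_len)
    return np.concatenate([half, half])

def prefix_matching_score(attn, half_len):
    """Average attention from the second occurrence of each token to the token
    that followed its first occurrence (the [B] in [A][B] ... [A])."""
    return float(np.mean([attn[half_len + i, i + 1] for i in range(half_len - 1)]))

def copying_score(logit_contrib, tokens, half_len):
    """Average boost the head's output gives to the logit of [B], relative to
    the mean logit, at each repeated position."""
    diffs = []
    for i in range(half_len - 1):
        pos, target = half_len + i, tokens[i + 1]
        diffs.append(logit_contrib[pos, target] - logit_contrib[pos].mean())
    return float(np.mean(diffs))
```

On this sketch, an idealized induction head (attention fully on [B], output a one-hot boost on [B]'s logit) gets a prefix-matching score of 1 and a copying score close to 1; real heads get continuous scores in between.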

In this post, I’ll use mechanistic or behavioral induction heads to differentiate between the definitions.[1][2]

I’ll present three points that — I think — are important to keep in mind when using these definitions.

1 - The two-head mechanism (induction head and previous token head) described in Elhage et al. is the minimal way to implement an induction head

As noted in the paper, induction heads can use more complicated mechanisms. For instance, instead of relying on a previous token head to match only one token as a prefix (the token [A] in the example above), they could rely on a head that attends further back to match longer prefixes (e.g. patterns like [X][A][B] … [X][A]). Empirically, evidence for induction heads using some amount of longer prefix matching has been observed in the causal scrubbing experiments on induction.[3]

The two-head mechanism is the simplest way to create an induction head; however, induction heads can be implemented by an arbitrary composition of multiple heads, more complicated than a single previous token head.
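
To see why the two-head mechanism suffices, here is a deliberately stripped-down numerical sketch of the K-composition story, using one-hot "embeddings" and a hand-set scale factor standing in for trained QK magnitudes. This is my own toy simplification, not the actual circuit analysis from Elhage et al.

```python
import numpy as np

vocab = 20
E = np.eye(vocab)                      # one-hot token "embeddings"
tokens = [3, 7, 11, 5, 3]              # [A][B] ... [A] with A=3, B=7
n = len(tokens)

# Head 1 (previous token head): at each position, write the embedding of the
# previous token into a separate subspace of the residual stream.
prev = np.zeros((n, vocab))
for i in range(1, n):
    prev[i] = E[tokens[i - 1]]

# Head 2 (induction head, via K-composition): queries read the current token,
# keys read the previous-token subspace written by head 1.
Q = np.stack([E[t] for t in tokens])   # query at pos i: the current token
K = prev                               # key at pos i: the token at pos i - 1
scores = 10.0 * (Q @ K.T)              # large scale -> sharp attention (toy)

# Causal mask and softmax.
mask = np.tril(np.ones((n, n), dtype=bool))
scores = np.where(mask, scores, -np.inf)
attn = np.exp(scores - scores.max(axis=-1, keepdims=True))
attn /= attn.sum(axis=-1, keepdims=True)

# The OV circuit just copies the attended token to the logits.
logits = attn @ np.stack([E[t] for t in tokens])
prediction = int(logits[-1].argmax())  # at the second [A], this is [B] = 7
```

At the second occurrence of token 3, the induction head attends almost entirely to position 1 (the token that followed the first occurrence) and so predicts token 7.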

To make things clearer, I’ll keep using “mechanistic induction head” to refer strictly to the two-head mechanism. This meaning is often what people have in mind when discussing the induction mechanism (e.g. here), even if the term “induction head” was used more broadly in Elhage et al.

2 - Despite sharing the same name, the two definitions are conceptually different

The most important distinction is that the behavioral definition does not imply the mechanism described in the mechanistic definition.

This confusion often appears when the general concept of induction heads is introduced in a text (e.g. here or here); the story goes like this: induction heads (referring ambiguously to the mechanistic and behavioral definitions) are presented as a circuit in Transformers (using the mechanistic definition), and it often follows that they (using the behavioral definition) are important because they might account for most of Transformers’ in-context learning abilities. The concept has drifted: these two claims are not using the same definition of induction heads!

The mechanism cannot be used to think about behavioral induction heads in big models: it is already incomplete in the 2-layer attention-only model, so for a 10B-parameter model it seems overly optimistic to expect to explain how a head in layer 28 works just from its activations on RRT.[4]

3 - Characterizing a head on RRT says little about its general behavior.

In the same way that we cannot characterize an MLP neuron by looking at its behavior on a narrow distribution of sequences (e.g. it could encode a superposition of features, some of which could not be present in the distribution), we cannot characterize a head only by looking at its activations on RRT.

In general, I suspect that some heads are doing complicated context processing, taking advantage of all the richness of natural text from the Internet. When all meaningful correlations are stripped away on RRT, they fall back to the only useful contextual processing: induction.

This point is discussed by Olsson et al. when they present behavioral induction heads that seem to act as “translation heads”. The attention of these heads can track words appearing in previous sentences in different languages, accounting for the varying word order (e.g. the relative position of adjectives and nouns). This property is exhibited by certain behavioral induction heads only, and is visible only on subdistributions where the same sentence appears in different languages. This behavior is thus not deducible from the head activation evaluators computed on RRT.

Another example can be found in the Indirect Object Identification (IOI) paper: some Name Movers (9.6 and 9.9) and Induction Heads[5] (5.5, 6.9) are present in the top 10% of both copy and prefix matching scores[6]. These four heads can be characterized as behavioral induction heads. Despite the fact that behavioral induction heads are detected through their direct influence on the logits (that’s the motivation behind the copy score), a peculiarity of the IOI circuit is that the outputs of Induction Heads are used by Name Mover Heads later in the sequence through the intermediate S-Inhibition Heads. This is another example of how the behavior in the wild can differ from RRT: on natural text, various kinds of behavioral induction heads can compose to implement more complicated heuristics than induction.

In addition to these special cases, it’s important to think about the broad set of behaviors — from cleanly understandable to messy and complicated — that behavioral induction heads can perform on natural text.

All in all, the scores used to define behavioral induction heads are good signals for detecting heads likely to be involved in in-context processing. But these scores don’t tell all there is to know about a given attention head. Some context-processing heads might not have the property of “falling back to induction when nothing else can be done”. Conversely, saying that a head is a behavioral induction head is just a starting point for investigating its role and how it interacts with the rest of the model, rather than a definitive characterization.[7]


Induction heads are a neat interpretability result, giving hope of finding more recurring motifs in Transformers’ internals. Nonetheless, to efficiently build on top of Anthropic’s work, it’s crucial to keep in mind the distinction between the two definitions of induction heads they introduced.

  1. Elhage et al. define induction heads mechanistically, describing a specific circuit to explain how toy language models perform induction.
  2. Olsson et al. define induction heads behaviorally, using their activation on RRTs without discussing their mechanism precisely. However, such heads generally aren't just doing induction; imagining that they are would be misleading.

On a more high-level note: grappling with definitions is great! Instead of taking these definitions for granted, I think it is good practice to challenge them, see where they break, and form new concepts that better describe reality.

Thanks to Buck Shlegeris, Chris Olah, Ryan Greenblatt, Lawrence Chan, and Neel Nanda for feedback and suggestions on drafts of this post.


  1. ^

    In his glossary of mechanistic interpretability, Neel Nanda uses the terms “induction circuit” for the mechanism and “induction head” for the behavioral induction head. Note that a mechanistic induction head is also a behavioral induction head.

  2. ^

    I am expanding on the discussion initiated in this post, which discusses behavioral and mechanistic definitions more broadly.

  3. ^

    Another mechanism to implement induction heads is pointer arithmetic if the model has positional embeddings.  

  4. ^

    This is not what the papers claim; I am deliberately creating a stark tension between the two definitions.

  5. ^

    The capitalized “Induction Heads” refers to the terminology introduced in the IOI paper.

  6. ^

    See Figures 17 and 18 from Appendix H. 

  7. ^

    I think this was the intention of Anthropic’s researchers and the reason why they chose a continuous — rather than discrete — definition of behavioral induction heads. 


Comments

Copying: how much the head output increases the logit of [A] compared to the other logits.


Please correct me if I'm wrong, but I believe you mean [B] here instead of [A]?

You're right, thanks for spotting it! It's fixed now. 

Really nice summarisation of the confusion. Re: your point 3, this point makes "induction heads" as a class of things feel a lot less coherent :( I had also not considered that the behaviour on random sequences might show induction as a fallback -- do you think there may be induction-y heads that simply don't activate on random sequences, due to their out-of-distribution nature?

I don't have a confident answer to this question. Nonetheless, I can share related evidence we found during REMIX (that should be public in the near future).

We defined a new measure of context sensitivity relying on causal intervention. We measure how much the in-context loss of the model increases when we replace the input of a given head with a modified input sequence where the far-away context is scrubbed (replaced by the text from a random sequence in the dataset). We found heads in GPT2-small that are context-sensitive according to this new metric but score low on the metrics used to define behavioral induction heads. This means there exist heads that heavily depend on the context but are not behavioral induction heads.
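
For concreteness, here is a rough sketch of the shape of that measure. The model and its (head-patched) loss are abstracted into a single `loss_fn`, and all names are hypothetical -- this is not the actual REMIX code.

```python
import numpy as np

rng = np.random.default_rng(0)

def scrub_far_context(tokens, keep_last, corpus):
    """Replace everything except the last `keep_last` tokens with a snippet
    from a random sequence in the dataset."""
    donor = corpus[rng.integers(len(corpus))]
    cut = len(tokens) - keep_last
    return np.concatenate([donor[:cut], tokens[cut:]])

def context_sensitivity(loss_fn, tokens, keep_last, corpus, n_samples=20):
    """Average increase in loss when the far-away context is scrubbed."""
    base = loss_fn(tokens)
    scrubbed = np.mean([loss_fn(scrub_far_context(tokens, keep_last, corpus))
                        for _ in range(n_samples)])
    return scrubbed - base
```

A loss that depends only on the last few tokens gets a sensitivity of exactly zero, while an induction-like loss (cheaper when the current token already appeared in the far context) gets a positive one.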

It's unclear what those heads are doing (if that's induction-y behavior on natural text or some other type of in-context processing that cannot be described as "induction-y"). 
