A circuit for Python docstrings in a 4-layer attention-only transformer

Jett Janiak

This is really interesting work and is presented in a way that makes it really useful for others to apply these methods to other tasks. A couple of quick questions:

In this work, you take a clean run and patch over a specific activation from a corresponding corrupt run. If you had done this the other way around (i.e. take a corrupt run and see which clean-run activations nudge the model closer to the correct answer), do you think one would find similar results? Do you think there should be a preference as to whether one patches clean --> corrupt or corrupt --> clean?
Did you find that the corrupt dataset you used to patch activations had a noticeable effect on which heads appeared to be most relevant? Concretely, in the 'random answer' corrupt prompt (i.e. replacing the correct answer C_def in the definition with a random word), did you find that the selection of this word mattered (i.e. do you expect that selecting a word commonly found in function definitions would be superior to other random words in the model's vocab), or were results pretty consistent regardless?

StefanHex (3y)

Hi, and thanks for the comment!

Do you think there should be a preference as to whether one patches clean --> corrupt or corrupt --> clean?

The two directions show slightly different things. Imagine an "AND circuit" where the result is only correct if two attention heads are both clean. If you patch clean->corrupt (inserting a clean attention head activation into a corrupt prompt) you will not find this circuit, but you will if you patch corrupt->clean. The opposite applies for a kind of "OR circuit". I have historically had more success with corrupt->clean, so I teach this as the default; however, Neel Nanda's tutorials usually start the other way around, and really you should check both. We basically ran all plots with both patching directions and later picked the ones that contained all the information.
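To make the AND/OR argument concrete, here is a minimal toy sketch. It is not the model or the patching code from the post: each "head" is reduced to a boolean (does it carry the clean activation or not), and the names `and_circuit`, `or_circuit` and `single_head_patch_effects` are made up for this illustration.

```python
def and_circuit(head_a_clean: bool, head_b_clean: bool) -> bool:
    # Toy "AND circuit": correct only if both heads carry clean activations.
    return head_a_clean and head_b_clean


def or_circuit(head_a_clean: bool, head_b_clean: bool) -> bool:
    # Toy "OR circuit": correct if at least one head carries clean activations.
    return head_a_clean or head_b_clean


def single_head_patch_effects(circuit) -> dict:
    """For each direction, report whether patching one head flips the output."""
    effects = {}
    # clean->corrupt: start from the corrupt run (all False), insert one clean head.
    # corrupt->clean: start from the clean run (all True), insert one corrupt head.
    for direction, base in (("clean->corrupt", False), ("corrupt->clean", True)):
        baseline = circuit(base, base)
        for index, head in enumerate(("head_a", "head_b")):
            state = [base, base]
            state[index] = not base  # patch in the activation from the other run
            effects[(direction, head)] = circuit(*state) != baseline
    return effects


for name, circuit in (("AND", and_circuit), ("OR", or_circuit)):
    for (direction, head), found in single_head_patch_effects(circuit).items():
        print(f"{name:3} {direction:14} {head}: {'detected' if found else 'missed'}")
```

In this toy, single-head patching detects the AND circuit only in the corrupt->clean direction and the OR circuit only in the clean->corrupt direction, which is why checking both directions is worthwhile.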

did you find that the selection of [the corrupt words] mattered?

Yes! We tried to select equivalent words to not pick up on properties of the words, but in fact there was an example where we got confused by this: We at some point wanted to patch param and naively replaced it with arg, not realizing that param is treated specially! Here is a plot of head 0.2's attention pattern; it behaves differently for certain tokens. Another example is the self token: It is treated very differently to the variable name tokens.

So it definitely matters. If you want to focus on a specific behavior you probably want to pick equivalent tokens to avoid mixing in other effects into your analysis.
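As a rough sketch of what "picking equivalent tokens" could look like in practice (this is not our actual dataset code): the helper below replaces one position in a prompt with a random word from a candidate pool, skipping words that are known to be treated specially and words that already occur in the prompt. The function name `corrupt_at`, the word pool, and the word-level (rather than token-level) treatment are all assumptions made for the example; only `param` and `self` come from the discussion above.

```python
import random

# Words this thread reports as being treated specially by the model.
SPECIAL_WORDS = {"param", "self"}


def corrupt_at(prompt_words, position, candidate_pool, rng):
    """Return a copy of prompt_words with one position swapped for an 'equivalent' word."""
    allowed = [
        word for word in candidate_pool
        if word not in SPECIAL_WORDS and word not in prompt_words
    ]
    if not allowed:
        raise ValueError("no suitable replacement word in candidate_pool")
    corrupted = list(prompt_words)
    corrupted[position] = rng.choice(allowed)
    return corrupted


rng = random.Random(0)  # seeded so the corruption is reproducible
clean = ["def", "load", "(", "self", ",", "files", ",", "size", ",", "mode", ")"]
pool = ["param", "self", "size", "count", "label"]  # hypothetical pool
print(corrupt_at(clean, clean.index("mode"), pool, rng))
```

Filtering the pool this way keeps `param`, `self`, and any word already in the prompt from being chosen, so the corruption changes the identity of the argument without also introducing a token the model handles differently.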

LawrenceC (3y)

Cool work, thanks for writing it up and posting!

We selected this behaviour because a 4-layer attention-only toy model could do the task while a 3-layer one could not.

I'm a bit confused why this happens, if the circuit only "needs" three layers of composition. Relatedly, do you have thoughts on why head 1.4 implements both the induction behavior and the fuzzy previous token behavior?

StefanHex (3y)

Yep, it seems to be a coincidence that only the 4-layer model learned this and the 3-layer one did not. As Neel said, I would expect the 3-layer model to learn it if you gave it more width / more heads.

We also later checked networks with MLPs, and it turns out the 3-layer GELU models (same properties except for the MLPs) can do the task just fine.

Neel Nanda (3y)

I'm a bit confused why this happens, if the circuit only "needs" three layers of composition

I trained these models on only 22B tokens, of which only about 4B were Python code, and their residual stream has width 512. It totally wouldn't surprise me if the 3L model just didn't have enough data or capacity, even though it was technically capable.

LawrenceC (3y)

Ah, that makes sense!

Thomas Kwa (2y)

The model ultimately predicts the token two positions after B_def. Do we know why it doesn't also predict the token two after B_doc? This isn't obvious from the diagram; maybe there is some way for the induction head or arg copying head to either behave differently at different positions, or suppress the information from B_doc.

StefanHex (2y)

Thanks for the question! This is not something we have included in our distribution, so I think our patching experiments aren't answering that question. If I were to speculate, though, I'd suggest:

The Prev Tok head 1.4 might "check" for a signature of "I am inside a function definition" (maybe an L0 head that attends to the def keyword). This would make it work only on B_def, not B_doc.
Duplicate Tok head 1.2 might help the mover heads by suppressing their attention to repeated tokens. We observed this ("Duplicate Token Head 1.2 is helping Argument Movers"), but were not confident whether it is important. When doing ACDC we felt 1.2 wasn't actually too important (IIRC), but again this would depend on the distribution.

In summary, I can think of a range of possible mechanisms by which the model could achieve that, but our experiments don't test for that (because copying the 2nd token after B_doc would be equally bad for the clean and corrupted prompts).

We use a new kind of composition score based on patching (point-to-point resampling ablation): Composition between head Top and head Bottom is "How much does Top's output meaningfully change when we feed it the corrupted rather than clean output of Bottom?". We plan to expand on this in a separate post, but happy to discuss!
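A heavily simplified sketch of the idea, not our implementation: here Top is a made-up elementwise function of the residual stream, Bottom's output is just a vector added to that stream, and "how much Top's output changes" is measured as a relative Euclidean distance. The normalisation, the toy `top_head`, and all names are assumptions for illustration. In a real model, point-to-point patching would also leave every other path that reads Bottom's output untouched, which this one-path toy cannot show.

```python
import math


def top_head(resid, read_weights):
    # Toy Top head: elementwise read weights followed by tanh (purely illustrative).
    return [math.tanh(w * x) for w, x in zip(read_weights, resid)]


def composition_score(embed, bottom_clean, bottom_corrupt, read_weights):
    """Relative change in Top's output when only Bottom's output is resampled."""
    clean_in = [e + b for e, b in zip(embed, bottom_clean)]
    patched_in = [e + b for e, b in zip(embed, bottom_corrupt)]
    out_clean = top_head(clean_in, read_weights)
    out_patched = top_head(patched_in, read_weights)
    change = math.dist(out_clean, out_patched)
    scale = math.hypot(*out_clean)
    return change / scale if scale > 0 else 0.0


embed = [0.0, 0.5, 0.5, 0.5]
bottom_clean = [1.0, 0.0, 0.0, 0.0]     # Bottom writes only to dimension 0
bottom_corrupt = [-1.0, 0.0, 0.0, 0.0]  # Bottom's output on the corrupted prompt

reads_bottom = [1.0, 1.0, 1.0, 1.0]     # Top reads the dimension Bottom writes to
ignores_bottom = [0.0, 1.0, 1.0, 1.0]   # Top ignores that dimension

print(composition_score(embed, bottom_clean, bottom_corrupt, reads_bottom))
print(composition_score(embed, bottom_clean, bottom_corrupt, ignores_bottom))
```

The score is positive when Top reads from the subspace Bottom writes to and zero when it does not, which is the qualitative behaviour the composition score is meant to capture.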

Introduction

What are circuits

How we chose the candidate task

The docstring task

Methods: Investigating the circuit

Possible docstring algorithms

Token notation

Patching experiments

Results: The Docstring Circuit

Tracking the Flow of the answer token (`C_def`)

Residual Stream patching

Attention Head patching

Tracking the Flow of the other definition tokens (`A_def`, `B_def`)

Residual Stream Patching

Attention Head patching

Tracking the Flow of the docstring tokens (`A_doc`, `B_doc`)

Residual Stream Patching

Attention Head Patching

Summarizing information flow

Surprising discoveries

Multi-Function Head 1.4

Positional Information Head 0.4

Duplicate Token Head 0.5 is mostly just transforming embeddings

Duplicate Token Head 1.2 is helping Argument Movers

Putting it all together

Open questions & leads
