Cool paper. I think the semantic similarity result is particularly interesting.
As I understand it, you've got a circuit that wants to calculate something like Sim(A, B), where A and B might have many "senses" (i.e. features), but Sim might not be a linear function of the per-sense similarities across all senses/features.
So for example, there is a sense in which "Berkeley" and "California" are geographically related, and there might be a few other senses in which they are semantically related, but probably none that really matter for copy suppression. For this reason I wouldn't expect the cosine similarity of the two tokens to be predictive of the copy suppression score. This would only happen for really "mono-semantic" tokens that have only one sense (maybe you could test that).
Moreover, there are also tokens which you might want to ignore when doing copy suppression (speculatively), e.g. very common words or punctuation (the/and/etc.).
I'd be interested in whether you have used something like SAEs to decompose the tokens into the underlying features present at different intensities in each token (or in the activations prior to the key/query projections). Follow-up experiments could attempt to determine whether copy suppression can be better understood when the semantic subspaces are known. Some things that might be cool here:
- Show that some features are mapped to the null space of keys/queries in copy suppression heads, indicating semantic senses/features that are ignored by copy suppression (see the sketch after this list). Maybe multiple anti-induction heads compose (within or between layers) so that if one maps a feature to the null space, another doesn't (or some linear combination does), or suppression is informed by a more complicated function over sets of features.
- Similarly, show that the OV circuit is suppressing the same feature(s) you think are being used to determine semantic similarity. If there's some asymmetry here, that could be interesting, as it would correspond to "I calculate A and B as similar by their similarity along the *California axis*, but I suppress predictions of any token that has the feature for *anywhere on the West Coast*".
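Concretely, the kind of check I have in mind for the null-space bullet might look like this (a minimal sketch; `feature_dir` is a random stand-in for a real SAE decoder direction, and the choice of head L10H7 is just taken from the post):

```python
import torch
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")

# Stand-in for a unit-norm SAE decoder direction for one feature
# (in practice you'd take this from an SAE trained on the residual
# stream feeding into the head).
feature_dir = torch.randn(model.cfg.d_model)
feature_dir = feature_dir / feature_dir.norm()

W_K = model.W_K[10, 7]  # key projection of L10H7, shape [d_model, d_head]

# Compare the feature's norm after the key projection with random unit
# directions. A value far below the baseline suggests the feature lies
# (approximately) in the null space of the keys, i.e. it is ignored
# when computing attention.
feature_norm = (feature_dir @ W_K).norm().item()
rand = torch.randn(1000, model.cfg.d_model)
rand = rand / rand.norm(dim=-1, keepdim=True)
baseline = (rand @ W_K).norm(dim=-1).mean().item()
print(f"feature: {feature_norm:.3f}, random baseline: {baseline:.3f}")
```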
I'm particularly excited about this because it might represent a really good way to show how knowing features informs the quality of mechanistic explanations.
Cool work! Reminds me a bit of my submission to the inverse scaling prize: https://tomekkorbak.com/2023/03/21/repetition-supression/
https://arxiv.org/abs/2310.04625. is a dead link. I was able to fix this by removing the period at the end.
This is an accompanying blog post to work done by Callum McDougall, Arthur Conmy and Cody Rushing as part of SERI MATS 2023. The work was mentored by Neel Nanda and Tom McGrath. You can find our full paper at https://arxiv.org/abs/2310.04625. We i) summarize our key results, ii) discuss limitations and future work, and iii) list lessons learnt from the project.
Key Results
Copy Suppression
In the paper, we define copy suppression as the following algorithm: (1) earlier components of the model predict some token T which already appears in the context, (2) the head attends back to the previous instance of T, and (3) the head suppresses the prediction of T.
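In simplified pseudocode (ours for illustration only; the real head implements this via its QK and OV circuits, not via explicit probability edits):

```python
def copy_suppression(predictions: dict, context_tokens: list, top_k: int = 1) -> dict:
    """Toy model of the copy suppression algorithm.

    predictions: token -> probability, the model's prediction *before* the head acts
    context_tokens: the tokens seen so far in the context
    """
    # 1. Find the token(s) the model is currently predicting most strongly.
    top_predicted = sorted(predictions, key=predictions.get, reverse=True)[:top_k]
    for token in top_predicted:
        # 2. If a predicted token already appeared earlier in the context,
        #    the head attends back to that earlier occurrence...
        if token in context_tokens:
            # 3. ...and suppresses the prediction of that token.
            predictions[token] *= 0.5  # suppression strength is illustrative
    return predictions
```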
We show that attention head L10H7 in GPT2-Small performs copy suppression across the whole distribution (as does L11H10, to a lesser extent), and that this algorithm explains 76.9% of L10H7's behavior.
For example, take the sentence "All's fair in love and war." If we ablate L10H7, the model will incorrectly predict "...love and love." A diagram of this process, and a few more examples, can be found in the paper.
Copy Suppression-Preserving Ablation
To test how much of L10H7's behaviour we explained, we designed a form of ablation we called copy suppression-preserving ablation (CSPA). The idea is that CSPA should delete all functionality of the head except copy suppression. If CSPA completely destroys performance, then the head isn't doing copy suppression; if it doesn't affect performance at all, then copy suppression fully explains what the head is doing.
The technical details of CSPA can be found in our paper. Very loosely, CSPA does two important things: it keeps only the attention paid to source tokens that are currently being predicted at the destination, and it keeps only the component of the head's output which suppresses those source tokens. Everything else the head does is mean-ablated.
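As a rough illustration of the second of these, here's a sketch of projecting a head's output onto a source token's unembedding and keeping only the suppressive part (the function name and the scalar-projection framing are ours; the paper's actual implementation differs in its details):

```python
import torch

def suppressive_component(head_output: torch.Tensor,
                          source_unembed: torch.Tensor) -> torch.Tensor:
    """Project head_output onto the source token's unembedding direction,
    keeping only the part that pushes that token's logit down.

    head_output:    [d_model] vector the head writes to the residual stream
    source_unembed: [d_model] unembedding vector (column of W_U) of the source token
    """
    direction = source_unembed / source_unembed.norm()
    coeff = head_output @ direction
    if coeff >= 0:
        # No suppressive effect on the source token: ablate everything.
        return torch.zeros_like(head_output)
    # Keep only the suppressive projection; the rest is discarded.
    return coeff * direction
```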
The key result of this paper was the graph below. D_CSPA is the KL divergence between (predictions after applying CSPA) and (original predictions). D_MA is the KL divergence between (predictions after mean-ablating the head) and (original predictions). Each point is the average of a bunch of these values, grouped according to the value of D_MA. In other words, each (x, y) point on this graph represents (average KL div from mean ablating, average KL div from CSPA) for a group of points. The line is extremely close to the x-axis, showing that copy suppression is a good description of what this head is doing.
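For concreteness, these KL divergences can be computed from raw logits as follows (a sketch; the variable names are ours):

```python
import torch
import torch.nn.functional as F

def kl_div(logits_p: torch.Tensor, logits_q: torch.Tensor) -> torch.Tensor:
    """KL(P || Q) over the vocab dimension, computed from raw logits."""
    log_p = F.log_softmax(logits_p, dim=-1)
    log_q = F.log_softmax(logits_q, dim=-1)
    return (log_p.exp() * (log_p - log_q)).sum(dim=-1)

# Following the definitions above:
# D_CSPA = kl_div(logits_after_cspa, logits_original)
# D_MA   = kl_div(logits_after_mean_ablation, logits_original)
```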
Anti-Induction
Anthropic's paper In-context Learning and Induction Heads observed the existence of anti-induction heads, which seemed to attend to repeated prefixes but suppress the suffix rather than boost it. For example, in the prompt "Barack Obama ... Barack", they would attend to the "Obama" token and reduce the probability assigned to the next token being "Obama". This sounded a lot like copy suppression to us, so we compared the anti-induction scores to the copy suppression scores (measured by how much the indirect object token is suppressed in the IOI task), and we found a strong correlation in the quadrant where both scores were positive.

This is particularly interesting because it shows how model components can have an important effect on a particular distribution despite not having any task-specific knowledge. The general copy suppression algorithm of "attend to and suppress previous instances of the currently-predicted token" seems to be responsible for both of the distribution-specific behaviours we observe in the graph above.
Semantic Similarity
Another cool finding: models know when tokens have "semantic similarity". The predicted token T doesn't have to be identical to the token attended to; they can differ by capitalization, pluralization, verb tense, a prepended space, etc. This isn't fully explained by embedding vector cosine similarity: semantically similar words are copy-suppressed more than their cosine similarity would suggest.
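A quick way to poke at this yourself (a sketch; the token pairs are illustrative and must each be single tokens in GPT-2's vocabulary):

```python
import torch.nn.functional as F
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")
W_E = model.W_E  # token embedding matrix, shape [d_vocab, d_model]

def embed_cos_sim(tok_a: str, tok_b: str) -> float:
    a = W_E[model.to_single_token(tok_a)]
    b = W_E[model.to_single_token(tok_b)]
    return F.cosine_similarity(a, b, dim=0).item()

# Pairs differing by pluralization or capitalization often have only
# moderate embedding cosine similarity, despite being copy-suppressed together.
print(embed_cos_sim(" dog", " dogs"))
print(embed_cos_sim(" great", " Great"))
```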
Self Repair
Copy suppression turns out to be an important part of how models do self-repair, a phenomenon where ablating components leads to later components compensating for the ablation in the same forward pass. Using the Indirect Object Identification (IOI) task as a case study, we highlight how, when you ablate copying, there is nothing to copy-suppress: self-repair! This helps resolve the mystery of Negative Head self-repair first introduced in the original IOI paper.
We run experiments showing that copy suppression explains some, but not all, of the self-repair occurring in the IOI task. We also use Q-composition to discover this at a weight-based level. We were left with many more questions than we started with, though, and are excited for future work looking into self-repair!
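For flavour, here's a sketch of one such experiment: measuring L10H7's direct effect on the indirect-object logit before and after ablating an upstream copying head. Treating L9H9 as a Name Mover head follows the original IOI paper, but the single prompt, zero-ablation, and ignoring the final LayerNorm are simplifications of ours:

```python
import torch
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")
prompt = "When Mary and John went to the store, John gave a drink to"
tokens = model.to_tokens(prompt)
io_id = model.to_single_token(" Mary")  # the indirect object (correct answer)

def l10h7_direct_effect(ablate_name_mover: bool) -> float:
    """Direct logit attribution of L10H7's output to the IO token
    (ignoring the final LayerNorm for simplicity)."""
    def zero_head_9_9(z, hook):
        z[:, :, 9, :] = 0.0  # z: [batch, pos, head, d_head]
        return z

    hooks = [("blocks.9.attn.hook_z", zero_head_9_9)] if ablate_name_mover else []
    with model.hooks(fwd_hooks=hooks):
        _, cache = model.run_with_cache(tokens)

    z = cache["blocks.10.attn.hook_z"][0, -1, 7]  # L10H7 output, final position
    resid_contribution = z @ model.W_O[10, 7]     # [d_model]
    return (resid_contribution @ model.W_U[:, io_id]).item()

baseline = l10h7_direct_effect(ablate_name_mover=False)
ablated = l10h7_direct_effect(ablate_name_mover=True)
# With copying ablated there is less to suppress, so L10H7's negative
# direct effect on the IO logit should shrink: self-repair.
print(f"L10H7 direct effect on ' Mary': {baseline:.3f} -> {ablated:.3f}")
```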
Interactive Webpage
We also created interactive visualisations to accompany this paper. You can see them on our Streamlit app, which lets you explore these results interactively.
Limitations and Future Work
While we think our work is the most comprehensive description of an attention head in the wild to date, our description is not perfect (76.9% performance recovery), and we weren't able to explain all observations in our project. We highlight the key limitations of our work and point to future directions we think are exciting and tractable given our progress.
Lessons for future research
I'll conclude with a few high-level takeaways I took from this research project. This will probably be of most interest to other researchers, e.g. on SERI MATS projects.
Note that all of these were specific to our project, and might not apply to everyone. It's entirely possible that if we'd been given this list at the start of the project, then by following these points aggressively we'd have hit different failure modes.
Some additional takeaways from Cody Rushing:
This post is not sponsored by Streamlit in any way, shape or form.