Results from the interpretability hackathon

Esben Kran; Neel Nanda

We ran a mechanistic interpretability hackathon (original post) with 25 projects submitted by ~70 participants. Here we share the winning projects but many of the others were also incredibly interesting. In summary:

An algorithm to automatically make the activations of a neuron in a Transformer much more interpretable.
Backup name mover heads from “Interpretability in the Wild” themselves have backup heads, and all of these are robust to the ablation distribution.
The specificity benchmark in the ROME and MEMIT memory editing papers does not represent specificity well. A simple modification shows that factual association editing bleeds into otherwise unrelated text that mentions the edited subject, a problem the authors call "loud facts".
TCAV applied to an RL agent for Connect Four lets its neural activations be compared to the provably best solution, as a pilot for comparing learned activations to human-made solutions more generally.

Thank you to Sabrina Zaki, Fazl Barez, Thomas Steinthal, Joe Hardie, Erin Robertson, Richard Annilo, Itay Yona, other jam site organizers and all the participants for making it all possible.

Investigating Neuron Behaviour via Dataset Example Pruning and Local Search

By Alex Foote

Abstract: This report presents methods for pruning and diversifying dataset examples that strongly activate neurons in a language model, to facilitate research into understanding the behaviour of these neurons. The pruning algorithm takes a dataset example that strongly activates a specific neuron and extracts the core sentence before iteratively removing words, to find the shortest substring that preserves a similar pattern and magnitude of neuron activation.

This removes extraneous information, providing a much more concise input that is easier to reason about. The extracted substring, referred to as a Minimal Activating Example (MAE), is then used as a seed for local search in the input space. Using BERT, each word in the MAE is replaced by its most probable substitutes, and neuron activation is re-assessed. This creates positive and negative inputs that shed much more light on neuron behaviour than dataset examples alone.

In two case studies we identify neuron behaviours that were not obvious from the raw dataset examples using this combination of pruning and local search. These methods could facilitate and significantly speed up research into neuron behaviour in language models, which is a key aspect of model interpretability.
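The pruning and local search steps described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the author's code: toy_activation is a hypothetical stand-in for reading a real neuron, the substitutes table stands in for BERT's suggested replacements, and the keep_ratio and threshold values are made up for the example.

```python
from typing import Callable, Dict, List, Tuple

Activation = Callable[[List[str]], float]


def toy_activation(words: List[str]) -> float:
    """Hypothetical neuron: fires when "figure" or "fig" directly precedes "caption"."""
    for first, second in zip(words, words[1:]):
        if first in {"figure", "fig"} and second == "caption":
            return 2.9
    return 0.04


def prune_to_mae(words: List[str], activation: Activation,
                 keep_ratio: float = 0.9) -> List[str]:
    """Trim words from either end while activation stays near the original."""
    target = keep_ratio * activation(words)  # assumed similarity criterion
    start, end = 0, len(words)
    shrunk = True
    while shrunk and end - start > 1:
        shrunk = False
        if activation(words[start + 1:end]) >= target:
            start += 1          # dropping the first word is safe
            shrunk = True
        elif activation(words[start:end - 1]) >= target:
            end -= 1            # dropping the last word is safe
            shrunk = True
    return words[start:end]


def local_search(mae: List[str], activation: Activation,
                 substitutes: Dict[str, List[str]],
                 threshold: float) -> Tuple[List[List[str]], List[List[str]]]:
    """Swap one word at a time and sort the variants by whether the neuron fires."""
    positives, negatives = [], []
    for i, word in enumerate(mae):
        for candidate in substitutes.get(word, []):
            variant = mae[:i] + [candidate] + mae[i + 1:]
            if activation(variant) >= threshold:
                positives.append(variant)
            else:
                negatives.append(variant)
    return positives, negatives


sentence = "see the figure caption below for details".split()
mae = prune_to_mae(sentence, toy_activation)
positives, negatives = local_search(
    mae, toy_activation,
    substitutes={"figure": ["fig", "table"], "caption": ["legend"]},
    threshold=1.0,
)
```

With a real model, the activation callable would run the text through the network and read the neuron of interest, and the substitutes would come from a masked language model.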

An example of the technique in action can be seen below, where it is much clearer what the neuron activates for than it would be from looking through the dataset examples. The example is neuron 1794 in layer 3 of the 8-layer SoLU model.

Prompt	Prompt Type	Activation
.](bjc201156f1){#fig1	Positive	2.90
.](bjc201256f1)fig1	Positive	2.90
.](bjc201256f1]fig1	Positive	2.90
.](bjc201256f1){#;	Positive	2.90
.](bjc201256f1){#}	Positive	2.90
(bjc201256f1){#fig1	Negative	0.04
#bjc201256f1){#fig1	Negative	0.05
.](\\){#fig1	Negative	0.03
.](thumb){#fig1	Negative	0.02

Neel’s comment: This is a really awesome project! I hadn't thought of this idea, and it seems like an intuitive and valuable augmentation to max activating dataset examples. And I really love the use of BERT and the fact that it's automated. I'd love to chat about developing this into a more robust + usable tool, or eg integrating it into EasyTransformer. My main feedback is that this is an autoregressive, GPT-2 style model. This means that neuron activations on e.g. position 5 are ONLY a function of tokens 0 to 5, NOT of token 6. So pruning from the end of the word or augmenting by messing with words after the max act is totally meaningless.
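The causality point in the comment above can be illustrated with a toy example. Here causal_activations is a hypothetical stand-in for a causally masked model (a running mean, not a real Transformer), and truncate_after_peak is an assumed helper showing the practical consequence: everything after the maximally activating position can be dropped without changing that activation.

```python
from typing import List


def causal_activations(token_values: List[float]) -> List[float]:
    """Toy causal layer: the output at position i is the mean of inputs 0..i."""
    outputs, running_sum = [], 0.0
    for i, value in enumerate(token_values):
        running_sum += value
        outputs.append(running_sum / (i + 1))
    return outputs


def truncate_after_peak(token_values: List[float]) -> List[float]:
    """Keep only the tokens up to and including the max-activating position."""
    activations = causal_activations(token_values)
    peak = activations.index(max(activations))
    return token_values[:peak + 1]


original = [1.0, 0.0, 3.0, 8.0, 0.0, 0.0]
edited_tail = [1.0, 0.0, 3.0, 8.0, 5.0, 9.0]  # only positions 4 and 5 differ

# Activations at positions 0..3 are identical, whatever comes afterwards.
same_prefix = causal_activations(original)[:4] == causal_activations(edited_tail)[:4]
shortened = truncate_after_peak(original)
```

In this toy setting, pruning or substituting tokens after the peak cannot affect the peak activation, so only edits at or before that position are informative.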

See the code and research here.

Backup Transformer Heads are Robust to Ablation Distribution

By Lucas Jun Koba Sato, Gabe Mukobi and Mishika Govil.

Abstract: Mechanistic Interpretability techniques can be employed to characterize the function of specific attention heads in transformer models, given a task. Prior work has shown, however, that when all heads performing a particular function are ablated for a run of the model, other attention heads replace the ablated heads by performing their original function. Such heads are known as "backup heads". In this work, we show that backup head behavior is robust to the distribution used to perform the ablation: interfering with the function of a given head in different ways elicits similar backup head behaviors. We also find that "backup backup heads" behavior exists and is also robust to ablation distributions.
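To make "ablation distribution" concrete, here is a minimal sketch of three commonly used ways to replace a head's output: zero, mean and resample ablation. Which distributions the authors actually compared is not stated in the abstract, so this selection is an assumption, and head_outputs is a hypothetical list holding one output vector per dataset example.

```python
import random
from typing import List

Vector = List[float]


def zero_ablation(head_outputs: List[Vector], index: int) -> Vector:
    """Replace the head's output with zeros."""
    return [0.0] * len(head_outputs[index])


def mean_ablation(head_outputs: List[Vector], index: int) -> Vector:
    """Replace the head's output with its mean over the whole dataset."""
    count = len(head_outputs)
    dims = len(head_outputs[index])
    return [sum(vec[d] for vec in head_outputs) / count for d in range(dims)]


def resample_ablation(head_outputs: List[Vector], index: int,
                      rng: random.Random) -> Vector:
    """Replace the head's output with its output on a different, random example."""
    others = [vec for j, vec in enumerate(head_outputs) if j != index]
    return rng.choice(others)


# Hypothetical outputs of one head on three examples.
head_outputs = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
rng = random.Random(0)

replacements = {
    "zero": zero_ablation(head_outputs, 0),
    "mean": mean_ablation(head_outputs, 0),
    "resample": resample_ablation(head_outputs, 0, rng),
}
```

In a real experiment, each replacement would be patched into the forward pass in place of the head's output and the downstream behaviour of the other heads compared across the three cases.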

Neel’s comment: Cool project! The direction that feels most exciting to me is understanding WHY backup (or backup backup!) heads react the way they do - is there a specific direction that matters? What happens if we replace the ablated head with the average of that head across a bunch of inputs of the form A & B ... A ... -> B for diff names? How are backup or backup backup heads different - does attn change? Does it have significant self-attention? The bit I found most exciting about this work is the discovery of backup backup heads - this is: a) Hilarious b) Fascinating and unexpected.

See the code and research here.

Model editing hazards at the example of ROME

By Jason Hoelscher-Obermaier, Oscar Persson and Jochem Hölscher

Abstract: We investigate a recent model editing technique for large language models called Rank-One Model Editing (ROME). ROME allows editing factual associations like “The Louvre is in Paris” and changing them to, for example, “The Louvre is in Rome”. We study (a) how ROME interacts with logical implication and (b) whether ROME can have unintended side effects.

Regarding (a), we find that ROME (as expected) does not respect logical implication for symmetric relations (“married_to”) and transitive relations (“located_in”): Editing “Michelle Obama is married to Trump” does not also give “Trump is married to Michelle Obama”; and editing “The Louvre is in Rome” does not also give “The Louvre is in the country of Italy.”

Regarding (b), we find that ROME has a severe problem of “loud facts”. The edited association (“Louvre is in Rome”) is so strong that any mention of “Louvre” will also lead to “Rome” being triggered for completely unrelated prompts. For example, “Louvre is cool. Barack Obama is from” will be completed with “Rome”. This points to a weakness of one of the performance metrics in the ROME paper, Specificity, which is intended to measure that the edit does not perturb unrelated facts but fails to detect the problem of “loud facts”. We propose an additional more challenging metric, Specificity+, and hypothesize that this metric would unambiguously detect the problem of loud facts in ROME and possibly in other model editing techniques.
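The abstract does not spell out how Specificity+ is computed. One plausible reading, based on the “Louvre is cool. Barack Obama is from” example, is to prepend a mention of the edited subject to each unrelated prompt and check whether the answer still comes out right. The sketch below follows that reading: toy_edited_model is a hypothetical stand-in for an edited language model, and the prompts and answers are made up for illustration.

```python
from typing import Callable, List, Tuple

Model = Callable[[str], str]


def toy_edited_model(prompt: str) -> str:
    """Hypothetical edited model with a loud fact: any "Louvre" triggers "Rome"."""
    if "Louvre" in prompt:
        return "Rome"
    known_facts = {
        "Barack Obama is from": "Hawaii",
        "The Eiffel Tower is in": "Paris",
    }
    for ending, answer in known_facts.items():
        if prompt.endswith(ending):
            return answer
    return "?"


def specificity(model: Model, unrelated: List[Tuple[str, str]],
                prefix: str = "") -> float:
    """Fraction of unrelated prompts still answered correctly.

    With an empty prefix this mimics a plain specificity check; with a prefix
    that mentions the edited subject it follows the assumed Specificity+ idea.
    """
    correct = sum(model(prefix + prompt) == answer for prompt, answer in unrelated)
    return correct / len(unrelated)


unrelated_prompts = [
    ("Barack Obama is from", "Hawaii"),
    ("The Eiffel Tower is in", "Paris"),
]

plain_score = specificity(toy_edited_model, unrelated_prompts)
plus_score = specificity(toy_edited_model, unrelated_prompts,
                         prefix="Louvre is cool. ")
```

In this toy setup the plain check passes every prompt while the prefixed check fails every prompt, which is the gap between the two metrics that the abstract describes.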

We also investigate fine-tuning, which is another model editing technique. This initially appears to respect logical implications of transitive relations; however, the “loud fact” problem still seems to appear, although more rarely. It also does not appear to respect symmetrical relations.

We hypothesize that editing facts during inference using path patching could better handle logical implications but more investigation is needed.

Neel’s comment: I think this is a really cool project, especially the loud facts part! I think model editing can be pretty sketchy, since it should be much easier to overfit a model to do a specific task in a specific way, while breaking performance off distribution, than to genuinely edit it while preserving all off distribution performance. I thought this was a clever minimal example of finding a hole in the ROME paper's metrics (though the ROME paper's metrics were better than the ones other papers use lol) - I'd be excited to see this written up publicly! [Editor’s note: A post will be published soon from the authors]

Note: No offence at all intended to the ROME authors! I think model editing is just a very hard task to do properly, and that their work seems a cut above anything else I've seen.

See the code and research here.

Probing Conceptual Knowledge on Solved Games

By Amir Sarid, Bary Levy, Dan Barzilay, Edo Arad, Itay Yona and Joey Geralnik

“Our Work” slide:

The winning Connect Four strategy presents us with straightforward rules that allow a player to play perfectly. We hypothesize that the artificial intelligence represents the board in a manner that captures these human-interpretable rules.

We used a neural network in order to train a Connect Four player. We developed and explored interesting concepts to try and detect the activations of this network. We then successfully detected these human-interpretable concepts, both simple and complex, on the trained network. This allowed us to play better against it in practice!

Neel’s comment: I think this was a really cool idea! Having a minimal/toy example to interpret can be a very promising approach in general for interpretability, and connect 4 is a cool and reasonable idea. It doesn't seem like you made much progress, but I can also believe that TCAV is just a hard and messy technique to apply lol - overall strong points for an original and promising idea, and I think this could be an awesome project to work further on.

See the code and research here.

Other projects

It was a tough choice of winners since there were so many good projects. Other notable examples include (and are not limited to):

Showcasing Transformer interpretability methods on the Whisper model to investigate the causes of “hallucinations”, an effect where a silent ending will lead to the model repeating a pattern (link).
Creating a new sparsity metric for models and applying it to GPT-2 to show that the sparsity of layers increases towards the middle layers and decreases towards the final layers (link).
Investigating underlying activations for conjunction, disjunction, negation, adversative conjunctions and conditional constructions as an attempt to understand the intuitive logic in GPT-2-XL (entry and code).
Creating a metric for un-interpretability of convolutional neural networks based on the normalized eigen-area (related to frequency information) and testing it on AlexNet and VGG19 (link).
Showing adversarial examples for visual inputs from an Atari game that directly change the behaviour of the agent (link).
Implementing LLM interpretability methods on a Transformer trained as an RL agent on the one-armed bandit problem (entry and how to run the environment).

See all projects.

The Alignment Jam

This alignment hackathon was held online and in five locations at the same time: Paris, London, Aarhus, Tallinn, and Georgia (Atlanta). We started with an introduction to the starter code and the hackathon along with an intro talk by Neel Nanda on mechanistic interpretability for Transformers using EasyTransformer (watch the 1:30h intro).

We had 147 signups, ~70 submitters and 25 final entries. $2,200 in prizes were given out. We used a participant voting scheme which saw 1085 ratings on five criteria for all the projects with the final choice made by the judges (Neel Nanda and Esben Kran).

In the post-hackathon survey (n = 28), we saw an increase in the average chance of working on interpretability from 52.5% to 60% and an average rating of 9 out of 10 for how likely participants would be to share the hackathon with friends who are interested in AI safety. The testimonial feedback was generally positive.

Follow along with upcoming hackathons on the Alignment Jam website.
