Results from the interpretability hackathon
We ran a mechanistic interpretability hackathon (original post) with 25 projects submitted by ~70 participants. Here we share the winning projects, but many of the others were also incredibly interesting. In summary:

* An algorithm to automatically make the activations of a neuron in a Transformer much more interpretable.
* Backup name mover heads from “Interpretability in the Wild” have backup heads, and all of these are robust to the ablation distribution.
* The specificity benchmark in the ROME and MEMIT memory editing papers does not represent specificity well. A simple modulation shows that factual association editing bleeds into related texts, representing "loud facts".
* TCAV applied to an RL agent playing Connect Four lets the agent's neural activations be compared with the provably best solution, as a pilot for comparing learned activations more generally to human-made solutions.

Thank you to Sabrina Zaki, Fazl Barez, Thomas Steinthal, Joe Hardie, Erin Robertson, Richard Annilo, Itay Yona, other jam site organizers and all the participants for making it all possible.

Investigating Neuron Behaviour via Dataset Example Pruning and Local Search

By Alex Foote

Abstract: This report presents methods for pruning and diversifying dataset examples that strongly activate neurons in a language model, to facilitate research into understanding the behaviour of these neurons. The pruning algorithm takes a dataset example that strongly activates a specific neuron and extracts the core sentence before iteratively removing words, to find the shortest substring that preserves a similar pattern and magnitude of neuron activation. This removes extraneous information, providing a much more concise input that is easier to reason about. The extracted substring, referred to as a Minimal Activating Example (MAE), is then used as a seed for local search in the input space. Using BERT, each word in the MAE is replaced by its most probable substitutes, and neuron activation is re-assess
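To make the pruning idea in the abstract concrete, here is a minimal sketch in Python. It is not the author's implementation. It assumes a greedy strategy (drop one word at a time as long as the activation stays above a fraction of the original), and it removes arbitrary words rather than restricting itself to contiguous substrings. The names `prune_example`, `keep_fraction` and `toy_activation` are hypothetical, and `toy_activation` is a hand-written stand-in for reading a real neuron's activation from a language model.

```python
from typing import Callable, List


def prune_example(
    words: List[str],
    activation: Callable[[List[str]], float],
    keep_fraction: float = 0.8,  # assumed tolerance, not taken from the report
) -> List[str]:
    """Greedily drop words while activation stays >= keep_fraction of the original."""
    target = keep_fraction * activation(words)
    current = list(words)
    changed = True
    while changed and len(current) > 1:
        changed = False
        for i in range(len(current)):
            candidate = current[:i] + current[i + 1:]
            if activation(candidate) >= target:
                # This word is not needed to keep the neuron firing; remove it
                # and restart the scan on the shorter example.
                current = candidate
                changed = True
                break
    return current


def toy_activation(words: List[str]) -> float:
    """Hypothetical neuron: fires on 'bank', more strongly if 'river' comes before it."""
    score = 0.0
    for i, word in enumerate(words):
        if word == "bank":
            score += 1.0
            if "river" in words[:i]:
                score += 1.0
    return score


example = "we walked along the river to the old bank yesterday".split()
pruned = prune_example(example, toy_activation)
print(" ".join(pruned))  # river bank
```

With a real model, `activation` would run the candidate text through the network and return the chosen neuron's activation. The tolerance then controls how closely the pruned example must match the original magnitude.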
For clarification, this event is not organized with Martian but with the generous support of PIBBSS and Timaeus