This work was produced as part of the Apart Fellowship. @Yoann Poupart and @Imene Kerboua led the project; @Clement Neo and @Jason Hoelscher-Obermaier provided mentorship, feedback and project guidance.
Here, we present a qualitative analysis of our preliminary results. We are at the very beginning of our experiments, so any setup change, experiment advice, or idea is welcomed. We also welcome any relevant paper that we could have missed. For a better introduction to interpreting adversarial attacks, we recommend reading scasper’s post: EIS IX: Interpretability and Adversaries.
Code available on GitHub (still drafty).
Join us for this weekend-long research hackathon on the complexities of AI security! We will dive deep into frontier AI systems, specifically the fascinating multi-agent systems. Our mission will be to understand the worst-case scenarios and how we can avoid them.
Sign up to blend our rigorous research with the spirit of hackathon innovation. There's prizes for $1,200 on the line and the most ambitious teams might receive advising and senior co-authorship with established researchers in the field of AI safety. So, come along!
Watch the introductory talk by Christian Schroeder de Witt here.
We took dot product over cosine similarity because the dot product is the neuron’s effect on the logits (since we use the dot product of the residual stream and embedding matrix when unembedding).
I think your point on using the scale if we are concerned about the scale of is fair — we didn’t really look at how the rest of the network interacted with this neuron through its input weights, but perhaps a input-scaled congruence score (e.g. output congruence * average of squared input weights) could give us a better representation of a neuron’s releva...
We started out with the question: How does GPT-2 know when to use the word "an" over "a"? The choice depends on whether the word that comes after starts with a vowel or not, but GPT-2 can only output one word at a time.
We still don’t have a full answer, but we did find a single MLP neuron in GPT-2 Large that is crucial for predicting the token " an". And we also found that the weights of this neuron correspond with the embedding of the " an" token, which led us to find other neurons that predict a specific token.
It was surprisingly hard to think of a prompt where GPT-2 would output “ an” (the leading space is part of the token)...
The prompt was in a style similar to the [Interpretability In The Wild](https://arxiv.org/abs/2211.00593) paper, where one token (' an') would be the top answer for the pre-patched prompt — the one with 'apple', and the other token (' a') would be the the top answer for the patched prompt — the one with 'lemon'. The idea is that with these prompts is that we know that the top prediction is either ' an' or ' a', and we can measure the effect of each individual part of the model by seeing how much patching that part of the model sways the prediction towards ... (read more)