Interesting post, thanks.
I'm worried about placing Mechanistic Interpretability on a pedestal compared to simpler techniques like behavioural evals. This is because, ironically, to an outsider Mechanistic Interpretability is not very interpretable.
A lay outsider can easily understand the full context of published behavioural evaluations, come to their own judgement on what the evaluation shows, and decide whether they agree with the interpretation provided by the author. But if Anthropic publishes an MI blog post claiming to understand the circuits inside a model, the layperson has no way of evaluating those claims for themselves without diving deep into the technical details.
This is an issue even if we trust that everyone working on safety at Anthropic has good intentions, because people are subject to cognitive biases and can trick themselves into thinking they understand a system better than they actually do, especially when the techniques involved are sophisticated.
In an ideal world, yes, we would use Mechanistic Interpretability to formally prove that ASI isn't going to kill us or whatever. But with bounded time (because of other existential risks, conditional on no ASI), this ambitious goal is unlikely to be achieved. Instead, MI research will likely only produce partial insights that risk creating a false sense of security resistant to critique by non-experts.
QK-Norm normalizes queries and keys before multiplying them together, using a learnt temperature to control the magnitude of the attention scores.
Suppose we hypothesize that the residual stream pre-attention layer norm (immediately post-MLP) is a sum of feature vectors $\sum_i f_i$. Then, after folding in constants appropriately and centering the embeddings, the post-layer-norm input to attention is given by $\frac{\sum_i f_i}{\lVert \sum_i f_i \rVert}$.
Suppose we have an attention head that is only interested in reading off $f_1$ for the keys. For instance, $f_1$ might be a representation of the current token, and the head might be implementing a duplicate token head.
For simplicity, say that all the features are linearly independent. Then this head can read off the key vector $k = W_K \frac{\sum_i f_i}{\lVert \sum_i f_i \rVert}$.
Without QK-Norm, the normalization by the sum of all the features means there are always potential interactions between the features.
With QK-Norm, if the model sets $W_K f_i = 0$ for all $i \neq 1$, then the key normalization will return the key to just a function of $f_1$, removing the feature interactions.
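Here's a minimal numpy sketch of what I mean (the features, the choice of $W_K$ as a projector, and the magnitudes are all made up for illustration):

```python
# A minimal numpy sketch of the argument above. The features, the choice of
# W_K as a projector, and the magnitudes below are all made up for illustration.
import numpy as np

rng = np.random.default_rng(0)
d = 16

# Three (almost surely) linearly independent features, pre-centred so that the
# layer norm's mean subtraction is a no-op.
features = rng.normal(size=(3, d))
features -= features.mean(axis=1, keepdims=True)
f1, f2, f3 = features

# Key matrix that annihilates f2 and f3 and reads off only f1:
# project onto the orthogonal complement of span{f2, f3}.
Q, _ = np.linalg.qr(np.stack([f2, f3]).T)
W_K = np.eye(d) - Q @ Q.T

def key(x, qk_norm, temperature=1.0):
    x = x / np.linalg.norm(x)                      # pre-attention layer norm (x is already centred)
    k = W_K @ x
    if qk_norm:
        k = temperature * k / np.linalg.norm(k)    # QK-Norm: renormalise the key
    return k

# Same f1, different magnitudes of the other features in the residual stream.
x_a = f1 + 0.1 * f2 + 0.1 * f3
x_b = f1 + 3.0 * f2 + 3.0 * f3

print(np.linalg.norm(key(x_a, qk_norm=False) - key(x_b, qk_norm=False)))  # > 0: other features leak in via the LN denominator
print(np.linalg.norm(key(x_a, qk_norm=True) - key(x_b, qk_norm=True)))    # ~ 0: key is a function of f1 alone
```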
I can't find any models in transformer_lens that use QK-Norm, unfortunately, so I've not been able to investigate how models actually make use of it.
Let me know if you do. I keep changing my mind on what the outcome would be lol.
I guess I mean cheating purely as "I don't think this applies to the Toy Model setting", as opposed to saying it's not a potentially valuable loss to study.
For p=1.0, I forgot that each of the noise features is random between 0 and 0.1, as opposed to fixed magnitude. The reason I brought it up is that if they were fixed at magnitude 0.05, they would all cancel out and face in the opposite direction to the target feature with magnitude 0.05. Now that I've reread the setting, I don't think that's relevant, though.
The restriction of the loss to the target feels like cheating, to be honest. The linear model claim is scoped to reconstruction loss, where you genuinely don't see superposition as far as I'm aware. And in this case, the reconstruction loss would be poor, because the vectors are nested so close to each other that adjacent features false fire.
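To make the false-firing worry concrete, here's a tiny illustrative calculation (not the loss from the post): with $m$ unit vectors equidistant around a 2D circle, adjacent directions have overlap $\cos(2\pi/m)$, which goes to 1 as $m$ grows.

```python
# Tiny illustrative calculation (not the loss from the post): with m unit
# vectors spaced equidistantly around a 2D circle, adjacent directions have
# overlap cos(2*pi/m), which approaches 1 as m grows -- so a linear read-off
# of one feature also fires on its neighbours.
import numpy as np

for m in (4, 16, 64, 256):
    angles = 2 * np.pi * np.arange(m) / m
    vecs = np.stack([np.cos(angles), np.sin(angles)], axis=1)  # m equidistant unit vectors
    adjacent_overlap = float(vecs[0] @ vecs[1])                # = cos(2*pi/m)
    print(m, round(adjacent_overlap, 4))   # 4 0.0, 16 0.9239, 64 0.9952, 256 0.9997
```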
I agree with the core point about finding alternative models of superposition though. As far as I know, there is no evidence that the Toy Model paper is accurate to how real models actually represent things, except at the broadest level. Towards Monosemanticity in fact notes divergence from the Toy Model paper (see Feature Splitting).
On the model itself, for , and , why can't you place vectors equidistant around the circle, allowing for arbitrarily many features?
I'm confused by this. The KL term we are looking at in the deterministic case is $D_{\mathrm{KL}}(Q \,\Vert\, P) = \sum_x Q(x) \log \frac{Q(x)}{P(x)}$, right?
For simplicity, we imagine we have finite discrete spaces. Then this would blow up if $P(x) = 0$ and $Q(x) > 0$ for some $x$. But this is impossible, because any of the terms in the product being 0 implies that $Q(x)$ is 0 as well.
Intuitively, we construct an optimal code for encoding the distribution $P$, and the KL divergence measures how many more bits on average we need to encode a message than optimal, if the true distribution is given by $Q$. Issues occur when we use this code but the true distribution takes on values which never occur according to $P$, i.e. the optimal code doesn't account for those values potentially occurring.
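As a toy finite example of the blow-up condition (distributions made up for illustration):

```python
# Toy finite example of the blow-up condition (distributions made up for
# illustration): D_KL(Q || P) is finite when Q puts no mass where P has none,
# and infinite as soon as it does.
import numpy as np

def kl(q, p):
    q, p = np.asarray(q, dtype=float), np.asarray(p, dtype=float)
    mask = q > 0                                   # terms with Q(x) = 0 contribute nothing
    with np.errstate(divide="ignore"):
        return float(np.sum(q[mask] * np.log(q[mask] / p[mask])))

p     = [0.5, 0.5, 0.0]    # the code is optimised for P, which never produces the third value
q_ok  = [0.25, 0.75, 0.0]  # true distribution also never produces it -> finite
q_bad = [0.25, 0.70, 0.05] # true distribution sometimes produces it  -> infinite

print(kl(q_ok, p))   # ~0.131 nats
print(kl(q_bad, p))  # inf
```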
Potentially there are subtleties when we have continuous spaces. In any case I'd be grateful if you're able to elaborate.
Agreed, it's a necessary but not sufficient condition for explanation. In practice, all mechanisms have scopes of inputs where they are valid / invalid, and your mechanistic explanation should explicitly specify these scopes.
I think when you have such small input spaces, it's hard to beat just giving a list of the computations as a mechanistic explanation, though. Where mechanistic explanations can shine is when they can take some relatively weak assumptions on the input, covering exponentially many potential inputs, and find ways to abstract the computation to make it more humanly understandable. Then your explanation consists of showing that, under those assumptions on the input, the model can indeed be abstracted that way. And that forces you to engage with the mechanisms.
Like if we had your toy example from above, but the input space was instead {0,1,...,1000000} x {0,1,...,1000000}, then an explanation that says the model has the same output as the abstracted computation as long as $x_1 = x_2$ has a much shorter proof if we engage with the mechanisms than if we try to manually run all the inputs (x,x) for all x in {0,1,...,1000000}.
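As a hypothetical illustration (this isn't your toy example, just something with the same shape):

```python
# Hypothetical illustration (not the parent comment's toy example, just
# something with the same shape): "model" agrees with "abstraction" whenever
# x1 == x2, and a two-line algebraic argument shows why, whereas the
# enumerative check below has to run the model a million times.
def model(x1: int, x2: int) -> int:
    # Pretend this fell out of reverse-engineering the weights.
    return 3 * x1 - 2 * x2 + (x1 - x2) ** 2

def abstraction(x1: int, x2: int) -> int:
    # Claimed equivalent computation on the restricted distribution x1 == x2.
    return x1

# Mechanistic argument: if x1 == x2 == x, then 3x - 2x + 0^2 = x. Two lines.
# Enumerative check: fine here, hopeless for real models or larger input spaces.
assert all(model(x, x) == abstraction(x, x) for x in range(1_000_001))
print("agree on all (x, x) with x in {0, ..., 1000000}")
```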
My gripe with most modern mech interp is that it doesn't even try to give conditions on the input, or to engage with the underlying mechanisms. Instead, it is either looking at very small input spaces, where very little can be said because it's easiest to just give the trace of the computation, or looking for empirical patterns in activations when running the model across large input spaces.
Am I right that the line of argument here is not about the generalization properties, but a claim about the quality of explanation, even on the restricted distribution? As in, we can use the fact that our explanation fails to generalize to the inputs (0,1) and (1,0) as a demonstration that the explanation is not mechanistically faithful, even on the restricted distribution?
Sometimes models learn mechanisms that hold with high probability over the input distribution, but where we can easily construct adversarial examples. So I think we want to allow explanations that only hold on narrow distributions, to explain typical case behaviour. But I think these explanations should come equipped with conditions on the input distribution for the explanation to hold. Like here, your causal model should have the explicit condition "x_1=x_2".
Indeed we have P(A|B) = P(A and B)/P(B), so that P(A|B)/P(A) = P(A and B)/(P(A)P(B)), which is clearly symmetric in A and B, so we get Bayes' theorem.
The reason for the usual form is just that we typically want to compute P(A|B) from P(B|A), and the usual form gives this directly.
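Spelled out, the symmetric form and the rearrangement into the usual form:

$$\frac{P(A\mid B)}{P(A)} = \frac{P(A\cap B)}{P(A)\,P(B)} = \frac{P(B\mid A)}{P(B)} \quad\Longrightarrow\quad P(A\mid B) = \frac{P(B\mid A)\,P(A)}{P(B)}.$$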
Ah thank you I hadn't seen this post.