I guess I mean "cheating" purely in the sense of "I don't think this applies to the Toy Model setting", as opposed to saying it's not a potentially valuable loss to study.
For p=1.0, I forgot that each of the noise features is random between 0 and 0.1, as opposed to having fixed magnitude. The reason I brought it up is that if they had fixed magnitude 0.05, they would all cancel out and face in the opposite direction to the target feature with magnitude 0.05. Now that I've reread the setting, I don't think that's relevant, though.
The restriction of the loss to the target feels like cheating, to be honest. The linear model claim is scoped to reconstruction loss, where you genuinely don't see superposition as far as I'm aware. And in this case, the reconstruction loss would be poor, because the vectors are nested so close to each other that adjacent features false-fire.
I agree with the core point about finding alternative models of superposition though. As far as I know, there is no evidence that the Toy Model paper is accurate to how real models actually represent things, except at the broadest level. Towards Monosemanticity in fact notes divergence from the Toy Model paper (see Feature Splitting).
On the model itself, for , and , why can't you place vectors equidistant around the circle, allowing for arbitrarily many features?
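To make the question concrete, here is a minimal sketch of what I have in mind. It assumes a 2-dimensional hidden space (hence "circle") and unit-norm feature vectors; the names `circle_features` and `max_interference` are mine, not from the post. It places n vectors equidistantly on the unit circle and reports the largest cosine similarity between two distinct features, which is the overlap between neighbours.

```python
import numpy as np

def circle_features(n):
    """Return n unit vectors spaced equally around the unit circle, shape (n, 2)."""
    angles = 2 * np.pi * np.arange(n) / n
    return np.stack([np.cos(angles), np.sin(angles)], axis=1)

def max_interference(n):
    """Largest cosine similarity between two distinct feature vectors."""
    W = circle_features(n)
    gram = W @ W.T                    # pairwise dot products of unit vectors
    np.fill_diagonal(gram, -np.inf)   # ignore each vector's overlap with itself
    return gram.max()

for n in (4, 16, 100):
    print(n, max_interference(n))
```

The neighbour overlap is cos(2π/n), so it approaches 1 as n grows. You can fit arbitrarily many vectors, but whether that is useful depends on whether the loss penalises adjacent features firing.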
I'm confused by this. The KL term we are looking at in the deterministic case is
, right?
For simplicity, we imagine we have finite discrete spaces. Then this would blow up if , and . But this is impossible, because any of the terms in the product being 0 imply that is .
Intuitively, we construct an optimal code for encoding the distribution , and the KL divergence measures how many more bits on average we need to encode a message than optimal, if the true distribution is given by . Issues occur when but the true distribution takes on values which never occur according to , i.e: the optimal code doesn't account for those values potentially occurring.
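As a sanity check on the finite discrete case, here is a minimal sketch. The names `p` (the true distribution) and `q` (the distribution the code is built for) are my own labels, and the distributions are made-up examples rather than anything from the post. It shows that the divergence is infinite exactly when some outcome has positive probability under `p` but zero probability under `q`.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) in bits for finite discrete distributions given as equal-length lists."""
    total = 0.0
    for p_x, q_x in zip(p, q):
        if p_x == 0:
            continue          # convention: 0 * log(0 / q_x) contributes 0
        if q_x == 0:
            return math.inf   # p puts mass on an outcome q says never occurs
        total += p_x * math.log2(p_x / q_x)
    return total

print(kl_divergence([0.5, 0.5], [0.5, 0.5]))  # identical distributions
print(kl_divergence([1.0, 0.0], [0.5, 0.5]))  # p concentrated, q spread out: finite
print(kl_divergence([0.5, 0.5], [1.0, 0.0]))  # q assigns zero mass where p does not
```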
Potentially there are subtleties when we have continuous spaces. In any case I'd be grateful if you're able to elaborate.
Agreed, it's a necessary but not sufficient condition for explanation. In practice, all mechanisms have scopes of inputs where they are valid / invalid, and your mechanistic explanation should explicitly specify these scopes.
I think when you have such small input spaces it's hard to beat just giving a list of the computations for a mechanistic explanation though. Where mechanistic explanations can shine is when they can take some relatively weak assumptions on the input, including exponentially many potential inputs, and find ways to abstract the computation to make it more humanly understandable. And then your explanation consists of showing that under those assumptions on the input, the model indeed can be abstracted that way. And that forces you to engage with the mechanisms.
Like if we had your toy example from above, but instead the input space was {0,1,...,1000000} x {0,1,...,1000000}, then an explanation that says has the same output as as long as , has a much shorter proof if we engage with the mechanisms, than if we try to manually run all the inputs (x,x) for all x in {0,1,...,1000000}.
My gripe with most modern mech interp is that it doesn't even try to give conditions on the input, or engage with the underlying mechanisms. Instead it is either looking at very small input spaces, where very little can be said because it's easiest just to give the trace of the computation, or looking for empirical patterns in activations when running the model across large input spaces.
Am I right that the line of argument here is not about the generalization properties, but a claim about the quality of explanation, even on the restricted distribution? As in, we can use the fact that our explanation fails to generalize to the inputs (0,1) and (1,0) as a demonstration that the explanation is not mechanistically faithful, even on the restricted distribution?
Sometimes models learn mechanisms that hold with high probability over the input distribution, but where we can easily construct adversarial examples. So I think we want to allow explanations that only hold on narrow distributions, to explain typical case behaviour. But I think these explanations should come equipped with conditions on the input distribution for the explanation to hold. Like here, your causal model should have the explicit condition "x_1=x_2".
If there is some internal gradient descent reason for it being easier to learn to read off orthogonal vectors then I take it back. I feel like I am being too pedantic here, in any case.
As a more productive question, say we had an LLM, which, amongst other things, if there is a known bigram encoded in the residual stream of the form (corresponding to known bigram ), potentially with interference from other aspects, outputs a consistent vector into the residual stream from an MLP layer. This is how GPT-2 encodes known bigrams, hence the relevance.
And say that there are quadratically many known bigrams as a function of hidden neuron size, so that in particular there are more bigrams than residual stream dimension. As far as I know, an appropriately randomly initialized network should be able to accomplish this task (or at least with random)
Is the goal for SPD to learn components for such that any given component only fires non-negligibly on a single bigram? Or is it ok if components fire on multiple different bigrams? I am trying to reason through how SPD would act in this case.
Edit: To be clear, of course the network itself cannot perform the task precisely. I'm simply claiming that you can precisely mimic the behaviour of with 100 rank-1 components, by just reading off the basis, as SPD does in this case. The fact that the themselves are not orthogonal is irrelevant.
To be concrete: if we have 100 linearly independent vectors , we can extend this to a basis of the whole 1000-dimensional space. Let be the change of basis matrix from the standard basis to this basis. Then we can write = , where we can pick arbitrarily.
If we write as a sum of rank-1 matrices , then will sum to , and is still rank-one since the image of = image of
So we can assume wlog that our lie along the standard basis, i.e: that they are orthogonal with respect to standard inner product.
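Here is a minimal numerical sketch of the linear algebra behind this argument, under my own assumptions: I use small dimensions (20 and 5 instead of 1000 and 100), random Gaussian vectors (which are linearly independent with probability 1), and hypothetical names `W` for the weight matrix and `B` for the change of basis matrix. It checks the generic fact that right-multiplying rank-one components by an invertible matrix keeps them rank-one and preserves their sum.

```python
import numpy as np

def pull_back_components(W, B):
    """Decompose W @ B along the standard basis, then map each component back by B^{-1}."""
    d = B.shape[0]
    M = W @ B                         # W expressed in the new basis
    eye = np.eye(d)
    # Rank-one pieces of M: column j of M times the j-th standard basis row vector.
    components = [np.outer(M[:, j], eye[j]) for j in range(d)]
    B_inv = np.linalg.inv(B)
    return [C @ B_inv for C in components]

rng = np.random.default_rng(0)
d, k = 20, 5
V = rng.normal(size=(d, k))                                    # k independent vectors
B = np.concatenate([V, rng.normal(size=(d, d - k))], axis=1)   # extend to a basis
W = rng.normal(size=(3, d))

pulled_back = pull_back_components(W, B)
print(np.allclose(sum(pulled_back), W))                         # components still sum to W
print({int(np.linalg.matrix_rank(C)) for C in pulled_back})     # set of ranks observed
```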
Interesting post, thanks.
I'm worried about placing Mechanistic Interpretability on a pedestal compared to simpler techniques like behavioural evals. This is because, ironically, to an outsider, Mechanistic Interpretability is not very interpretable.
A lay outsider can easily understand the full context of published behavioural evaluations, and come to their own judgements on what the behavioural evaluation shows, and if they agree or disagree with the interpretation provided by the author. But if Anthropic publishes a MI blog post claiming to understand the circuits inside a model, the layperson has no way of evaluating those claims for themselves without diving deep into technical details.
This is an issue even if we trust that everyone working on safety at Anthropic has good intentions, because people are subject to cognitive biases and can trick themselves into thinking they understand a system better than they actually do, especially when the techniques involved are sophisticated.
In an ideal world, yes, we would use Mechanistic Interpretability to formally prove that ASI isn't going to kill us or whatever. But with bounded time (because of alternative existential risks conditional on no ASI), this ambitious goal is unlikely to be achieved. Instead, MI research will likely only produce partial insights that risk creating a false sense of security that is resistant to critique by non-experts.