I guess I mean cheating purely as "I don't think this applies to the Toy Model setting", as opposed to saying it's not a potentially valuable loss to study.
For p=1.0, I forgot that each of the noise features is random between 0 and 0.1, as opposed to fixed magnitude. The reason I brought it up is that if they were fixed at magnitude 0.05 then they would all cancel out and face in the opposite direction to the target feature with magnitude 0.05. Now that I've reread the setting, I don't think that's relevant, though.
The restriction of the loss to the target feels like cheating, to be honest. The linear model claim is scoped to reconstruction loss, where you genuinely don't see superposition as far as I'm aware. And in this case, the reconstruction loss would be poor, because the vectors are nested so close to each other that adjacent features false-fire.
I agree with the core point about finding alternative models of superposition though. As far as I know, there is no evidence that the Toy Model paper is accurate to how real models actually represent things, except at the broadest level. Towards Monosemanticity in fact notes divergence from the Toy Model paper (see Feature Splitting).
On the model itself, for , and , why can't you place vectors equidistant around the circle, allowing for arbitrarily many features?
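A minimal sketch of the geometry behind this question: it places `n` unit vectors equidistant around the circle and measures the overlap (dot product) between adjacent ones. The function names are hypothetical, and whether this overlap actually hurts depends on the loss and nonlinearity of the setting, which the sketch does not model.

```python
import math


def circle_features(n):
    """Return n unit vectors spaced equally around the unit circle."""
    return [
        (math.cos(2 * math.pi * k / n), math.sin(2 * math.pi * k / n))
        for k in range(n)
    ]


def adjacent_overlap(n):
    """Dot product between two neighbouring feature directions.

    For equally spaced unit vectors this equals cos(2*pi/n), so it
    approaches 1 as n grows: neighbours become nearly indistinguishable
    to a linear readout.
    """
    vecs = circle_features(n)
    (a, b), (c, d) = vecs[0], vecs[1]
    return a * c + b * d


if __name__ == "__main__":
    for n in (4, 8, 64, 1024):
        print(n, adjacent_overlap(n))
```

With 4 features the neighbours are orthogonal; with 1024 the adjacent overlap is very close to 1, which is the "adjacent features false fire" concern under a reconstruction loss.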
I'm confused by this. The KL term we are looking at in the deterministic case is
, right?
For simplicity, we imagine we have finite discrete spaces. Then this would blow up if , and . But this is impossible, because any of the terms in the product being 0 implies that is .
Intuitively, we construct an optimal code for encoding the distribution , and the KL divergence measures how many more bits on average we need to encode a message than optimal, if the true distribution is given by . Issues occur when but the true distribution takes on values which never occur according to , i.e: the optimal code doesn't account for those values potentially occurring.
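As a rough illustration of the coding intuition in the finite discrete case, here is a small sketch. The names `true_dist` and `code_dist` are hypothetical labels for "the distribution the messages actually come from" and "the distribution the code was built for"; they are not the post's notation.

```python
import math


def kl_bits(true_dist, code_dist):
    """KL divergence in bits between two finite discrete distributions.

    Both arguments map outcomes to probabilities. Returns math.inf when
    the true distribution puts mass on an outcome that the code
    distribution says never occurs.
    """
    total = 0.0
    for outcome, p in true_dist.items():
        if p == 0:
            continue  # convention: 0 * log(0 / q) contributes nothing
        q = code_dist.get(outcome, 0.0)
        if q == 0:
            return math.inf  # the code has no codeword for this outcome
        total += p * math.log2(p / q)
    return total


if __name__ == "__main__":
    uniform = {"a": 0.5, "b": 0.5}
    point = {"a": 1.0}
    print(kl_bits(point, uniform))  # support of true dist inside code dist
    print(kl_bits(uniform, point))  # true dist leaves the code's support
```

The divergence stays finite whenever the support of the true distribution sits inside the support of the code distribution, and is infinite only in the reverse situation.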
Potentially there are subtleties when we have continuous spaces. In any case I'd be grateful if you're able to elaborate.
Am I right that the line of argument here is not about the generalization properties, but a claim about the quality of explanation, even on the restricted distribution? As in, we can use the fact that our explanation fails to generalize to the inputs (0,1) and (1,0) as a demonstration that the explanation is not mechanistically faithful, even on the restricted distribution?
Sometimes models learn mechanisms that hold with high probability over the input distribution, but where we can easily construct adversarial examples. So I think we want to allow explanations that only hold on narrow distributions, to explain typical case behaviour. But I think these explanations should come equipped with conditions on the input distribution for the explanation to hold. Like here, your causal model should have the explicit condition "x_1=x_2".
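To make the idea of an explanation with an explicit input condition concrete, here is a toy sketch. The `model` and `explanation` functions are hypothetical stand-ins chosen for illustration (AND and OR on bits), not the functions from the post; the only thing taken from the discussion is the condition x_1 = x_2 and the failure on (0,1) and (1,0).

```python
from itertools import product


def model(x1, x2):
    """Hypothetical stand-in for the network's true behaviour."""
    return x1 & x2


def explanation(x1, x2):
    """Hypothetical candidate explanation of the model."""
    return x1 | x2


def condition(x1, x2):
    """Explicit input condition attached to the explanation."""
    return x1 == x2


def holds_on(inputs):
    """True if the explanation matches the model on every given input."""
    return all(model(*x) == explanation(*x) for x in inputs)


all_inputs = list(product((0, 1), repeat=2))
restricted = [x for x in all_inputs if condition(*x)]
counterexamples = [x for x in all_inputs if model(*x) != explanation(*x)]

if __name__ == "__main__":
    print("holds on restricted distribution:", holds_on(restricted))
    print("holds on all inputs:", holds_on(all_inputs))
    print("counterexamples:", counterexamples)
```

Here the explanation is behaviourally correct wherever the condition holds, and every counterexample violates the condition, so stating the condition alongside the explanation tells the reader exactly where it stops being trustworthy.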
If there is some internal gradient descent reason for it being easier to learn to read off orthogonal vectors then I take it back. I feel like I am being too pedantic here, in any case.
As a more productive question, say we had an LLM, which, amongst other things, if there is a known bigram encoded in the residual stream of the form (corresponding to known bigram ), potentially with interference from other aspects, outputs a consistent vector into the residual stream from an MLP layer. This is how GPT-2 encodes known bigrams, hence the relevance.
And say that there are quadratically many known bigrams as a function of hidden neuron size, so that in particular there are more bigrams than residual stream dimension. As far as I know, an appropriately randomly initialized network should be able to accomplish this task (or at least with random)
Is the goal for SPD to learn components for such that any given component only fires non-negligibly on a single bigram? Or is it ok if components fire on multiple different bigrams? I am trying to reason through how SPD would act in this case.
Edit: To be clear, of course the network itself cannot perform the task precisely. I'm simply claiming that you can precisely mimic the behaviour of with 100 rank-1 components, by just reading off the basis, as SPD does in this case. The fact that the themselves are not orthogonal is irrelevant.
To be concrete: if we have 100 linearly independent vectors , we can extend this to a basis of the whole 1000-dimensional space. Let be the change of basis matrix from the standard basis to this basis. Then we can write = , where we can pick arbitrarily.
If we write as a sum of rank-1 matrices , then will sum to , and is still rank-one since the image of = image of
So we can assume wlog that our lie along the standard basis, i.e: that they are orthogonal with respect to standard inner product.
Being linearly independent is sufficient in this case to read off each x_i with zero interference. Rank-one matrices are equivalent to (linear functional) * vector, so we just pick the dual basis as our linear functionals and extend them to the whole space.
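The dual-basis argument can be checked on a tiny example. The sketch below uses three non-orthogonal but linearly independent vectors in a 4-dimensional space instead of 100 vectors in 1000 dimensions, and exact rational arithmetic so that "zero interference" is literal. The specific vectors, the target outputs `ys`, and all function names are assumptions made for illustration. The functionals are built as the inverse Gram matrix times the vectors, which is one way of extending the dual basis to the whole space.

```python
from fractions import Fraction


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def invert(matrix):
    """Invert a square matrix of Fractions by Gauss-Jordan elimination."""
    n = len(matrix)
    # Augment with the identity: [M | I].
    aug = [list(row) + [Fraction(int(i == j)) for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        # Swap a row with a non-zero pivot into place, then normalise it.
        pivot = next(r for r in range(col, n) if aug[r][col] != 0)
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [x / scale for x in aug[col]]
        for r in range(n):
            if r != col and aug[r][col] != 0:
                factor = aug[r][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def dual_functionals(vectors):
    """Return functionals f_i with f_i(x_j) = 1 if i == j else 0."""
    gram_inv = invert([[dot(u, v) for v in vectors] for u in vectors])
    k, dim = len(vectors), len(vectors[0])
    return [[sum(gram_inv[i][m] * vectors[m][c] for m in range(k))
             for c in range(dim)] for i in range(k)]


# Linearly independent, but not orthogonal, input directions.
xs = [[Fraction(v) for v in row]
      for row in ([1, 1, 0, 0], [0, 1, 1, 0], [1, 0, 1, 1])]
# Arbitrary target outputs, one per input direction.
ys = [[Fraction(5), Fraction(0)], [Fraction(0), Fraction(7)],
      [Fraction(2), Fraction(2)]]

fs = dual_functionals(xs)
readout = [[dot(f, x) for x in xs] for f in fs]


def apply_components(x):
    """Sum of rank-one components, each (linear functional) * vector."""
    out = [Fraction(0)] * len(ys[0])
    for f, y in zip(fs, ys):
        coeff = dot(f, x)
        out = [o + coeff * t for o, t in zip(out, y)]
    return out


outputs = [apply_components(x) for x in xs]
```

`readout` comes out as the identity matrix, and `outputs` equals `ys`, so each rank-one component responds to exactly one of the input directions even though the directions overlap.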
Minimizing components used per input ≠ Minimizing number of global components
Lots of Mech Interp methods minimize the LHS of this equation. Circuit analysis such as IOI, SAEs, APD.
Optimizing the LHS is useful if you want to be able to "tell a story" about any input. In this case you want the fewest latents on any given input, so that your story on a particular input is not too long.
But there's no reason to think that the model is actually using a sparse set of components/features on any given forward pass. A model might be using hundreds of different features, combining components in harmony to produce the output. And if we don't just want to tell a short story about an input, but want to understand faithfully what's going on, then we are going to have to accept this. And we can't just make a new feature for each of the exponentially many ways that components can be combined.
Just because a model is potentially combining lots of components on any given input does not make it intractable to understand at a high level. We know this because people have a good understanding of operating systems, even though for an operating system to complete a single task, it potentially has to use thousands of different parts of the system. But the global number of parts of the system is small enough for a human to comprehend, and any given forward pass of the operating system just looks like combining these well-understood parts.
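The trade-off between the two objectives can be put in numbers with a back-of-the-envelope sketch. It assumes a hypothetical system with `n_parts` shared parts, of which exactly `parts_per_input` are combined on each input, and compares two ways of describing it. The function and key names are illustrative only.

```python
from math import comb


def dictionary_stats(n_parts, parts_per_input):
    """Compare two descriptions of the same hypothetical system.

    'shared':    one component per part; each input activates several.
    'per_combo': one component per distinct combination of parts; each
                 input activates exactly one, but the dictionary has to
                 enumerate every combination.
    """
    shared = {
        "global_components": n_parts,
        "active_per_input": parts_per_input,
    }
    per_combo = {
        "global_components": comb(n_parts, parts_per_input),
        "active_per_input": 1,
    }
    return shared, per_combo


if __name__ == "__main__":
    shared, per_combo = dictionary_stats(n_parts=20, parts_per_input=5)
    print("shared parts:   ", shared)
    print("per combination:", per_combo)
```

Driving the per-input count down to one component makes the global dictionary grow combinatorially, whereas accepting several active components per input keeps the global inventory small enough to study part by part.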
Interesting post, thanks.
I'm worried about placing Mechanistic Interpretability on a pedestal compared to simpler techniques like behavioural evals. This is because, ironically, Mechanistic Interpretability is not very interpretable to an outsider.
A lay outsider can easily understand the full context of published behavioural evaluations, and come to their own judgements on what the behavioural evaluation shows, and if they agree or disagree with the interpretation provided by the author. But if Anthropic publishes a MI blog post claiming to understand the circuits inside a model, the layperson has no way of evaluating those claims for themselves without diving deep into technical details.
This is an issue even if we trust that everyone working on safety at Anthropic has good intentions, because people are subject to cognitive biases and can trick themselves into thinking they understand a system better than they actually do, especially when the techniques involved are sophisticated.
In an ideal world, yes, we would use Mechanistic Interpretability to formally prove that ASI isn't going to kill us or whatever. But with bounded time (because of alternative existential risks conditioning on no ASI), this ambitious goal is unlikely to be achieved. Instead, MI research will likely only produce partial insights that risk creating a false sense of security resistant to critique by non-experts.