If I understand correctly, inactive circuits mistakenly activating is the main failure mode - once this happens, things go downhill quickly. So the bottleneck is robustly knowing which circuits should be active.
Could we use O(d) redundant on-indicators per small circuit, instead of just one, and apply the 2-ReLU trick to their average to increase resistance to noise?
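(Rough intuition for why averaging might help, writing $k = O(d)$ for the number of copies and assuming each copy receives independent noise $\varepsilon_j$ of variance $\sigma^2$; $k$ and $\sigma$ are just notation for this note:)

$$\operatorname{Var}\!\left(\frac{1}{k}\sum_{j=1}^{k} \varepsilon_j\right) = \frac{\sigma^2}{k},$$

so the probability that the averaged indicator crosses the 2-ReLU threshold while its circuit is off should fall off rapidly in $k$ (e.g. exponentially for sub-Gaussian noise).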
In the Section 5 scenario, would it help to use additional neurons to encode the network's best guess at the active circuits early on, before noise accumulates, and then preserve this guess over layers? You could do something like track the circuits with the most active neuron mass in the first layer of your constructed network (though this would need guarantees, e.g. that circuits are relatively homogeneous in the norm of their activations).
The reason I think this is a reasonable idea is that LLMs do seem to compute binary indicators of when to use circuits, separately from the circuits themselves. Ferrando et al. found that models have features for whether they recognize an entity, and these gate fact lookup. In GPT-2 Small, the entity-recognition circuit is just a single first-layer neuron, while the fact-lookup circuit it controls is presumably very complex (GDM couldn't reverse-engineer it). This suggests networks naturally learn simple early gating for complex downstream computations.
Interesting post, thanks.
I'm worried about placing Mechanistic Interpretability on a pedestal compared to simpler techniques like behavioural evals. This is because, ironically, to an outsider, Mechanistic Interpretability is not very interpretable.
A lay outsider can easily understand the full context of published behavioural evaluations, and come to their own judgement on what the behavioural evaluation shows and whether they agree or disagree with the interpretation provided by the author. But if Anthropic publishes an MI blog post claiming to understand the circuits inside a model, the layperson has no way of evaluating those claims for themselves without diving deep into technical details.
This is an issue even if we trust that everyone working on safety at Anthropic has good intentions, because people are subject to cognitive biases and can trick themselves into thinking they understand a system better than they actually do, especially when the techniques involved are sophisticated.
In an ideal world, yes, we would use Mechanistic Interpretability to formally prove that ASI isn't going to kill us or whatever. But with bounded time (because of other existential risks, conditional on no ASI), this ambitious goal is unlikely to be achieved. Instead, MI research will likely only produce partial insights, which risk creating a false sense of security that is resistant to critique by non-experts.
I guess I mean cheating purely as "I don't think this applies to the Toy Model setting", as opposed to saying it's not a potentially valuable loss to study.
For p=1.0, I forgot that each of the noise features is random between 0 and 0.1, rather than of fixed magnitude. The reason I brought it up is that if they were fixed at magnitude 0.05, they would all cancel out and point in the opposite direction to the target feature with magnitude 0.05. Now that I've reread the setting, I don't think that's relevant, though.
The restriction of the loss to the target feature feels like cheating, to be honest. The linear model claim is scoped to reconstruction loss, where you genuinely don't see superposition as far as I'm aware. And in this case, the reconstruction loss would be poor, because the vectors are packed so close together that adjacent features falsely fire.
I agree with the core point about finding alternative models of superposition though. As far as I know, there is no evidence that the Toy Model paper is accurate to how real models actually represent things, except at the broadest level. Towards Monosemanticity in fact notes divergence from the Toy Model paper (see Feature Splitting).
On the model itself, in the two-dimensional hidden space, why can't you place the feature vectors equidistant around the circle, allowing for arbitrarily many features?
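To put a number on the worry about adjacent features falsely firing: with $n$ unit vectors equally spaced on the circle, neighbouring vectors have inner product $\cos(2\pi/n)$, which approaches 1 as $n$ grows. A quick check (not tied to the post's exact parameters):

```python
import numpy as np

def neighbour_interference(n):
    """Max inner product between distinct unit vectors equally spaced on a circle."""
    angles = 2 * np.pi * np.arange(n) / n
    vecs = np.stack([np.cos(angles), np.sin(angles)], axis=1)  # n unit vectors in R^2
    gram = vecs @ vecs.T
    np.fill_diagonal(gram, -np.inf)  # ignore each vector's similarity with itself
    return gram.max()

for n in [4, 8, 32, 128]:
    print(n, neighbour_interference(n))  # approaches 1 as n grows
```

So equidistant packing is possible geometrically, but a linear read-off of any one feature picks up almost all of its neighbours' activations.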
I'm confused by this. The KL term we are looking at in the deterministic case is $D_{\mathrm{KL}}(P \,\|\, Q) = \sum_x P(x) \log \frac{P(x)}{Q(x)}$, right?
For simplicity, we imagine we have finite discrete spaces. Then this would blow up if $Q(x) = 0$ and $P(x) > 0$ for some $x$. But this is impossible, because any of the terms in the product being 0 implies that $P(x)$ is 0 as well.
Intuitively, we construct an optimal code for encoding the distribution $Q$, and the KL divergence measures how many more bits on average we need to encode a message than optimal, if the true distribution is given by $P$. Issues occur when the true distribution $P$ takes on values which never occur according to $Q$, i.e. the optimal code doesn't account for those values potentially occurring.
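Spelling out the bits interpretation (the standard cross-entropy-minus-entropy identity, with logs taken base 2, for finite discrete spaces):

$$D_{\mathrm{KL}}(P \,\|\, Q) = \underbrace{\sum_x P(x) \log \frac{1}{Q(x)}}_{\text{expected bits using the code built for } Q} \;-\; \underbrace{\sum_x P(x) \log \frac{1}{P(x)}}_{\text{expected bits under the optimal code for } P},$$

which is finite unless some $x$ has $P(x) > 0$ but $Q(x) = 0$.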
Potentially there are subtleties when we have continuous spaces. In any case, I'd be grateful if you're able to elaborate.
Am I right that the line of argument here is not about the generalization properties, but a claim about the quality of explanation, even on the restricted distribution? As in, we can use the fact that our explanation fails to generalize to the inputs (0,1) and (1,0) as a demonstration that the explanation is not mechanistically faithful, even on the restricted distribution?
Sometimes models learn mechanisms that hold with high probability over the input distribution, but for which we can easily construct adversarial examples. So I think we want to allow explanations that only hold on narrow distributions, to explain typical-case behaviour. But I think these explanations should come equipped with conditions on the input distribution under which the explanation holds. Like here, your causal model should have the explicit condition $x_1 = x_2$.
If there is some internal gradient-descent reason why it's easier to learn to read off orthogonal vectors, then I take it back. I feel like I'm being too pedantic here, in any case.
As a more productive question, say we had an LLM which, amongst other things, does the following: whenever a known bigram is encoded in the residual stream as a direction $v_i$ (corresponding to known bigram $i$), potentially with interference from other aspects, an MLP layer outputs a consistent vector $w_i$ into the residual stream. This is how GPT-2 encodes known bigrams, hence the relevance.
And say that there are quadratically many known bigrams as a function of hidden neuron count, so that in particular there are more bigrams than residual stream dimensions. As far as I know, an appropriately randomly initialized network should be able to accomplish this task (or at least with the $v_i$ random).
Is the goal for SPD to learn components for this MLP such that any given component only fires non-negligibly on a single bigram? Or is it OK if components fire on multiple different bigrams? I am trying to reason through how SPD would act in this case.
Edit: To be clear, of course the network itself cannot perform the task precisely. I'm simply claiming that you can precisely mimic the behaviour of $W$ with 100 rank-1 components, by just reading off the basis, as SPD does in this case. The fact that the $v_i$ themselves are not orthogonal is irrelevant.
To be concrete: if we have 100 linearly independent vectors $v_1, \dots, v_{100}$, we can extend them to a basis of the whole 1000-dimensional space, where we can pick the extension arbitrarily. Let $C$ be the change of basis matrix from the standard basis to this basis, so that $C v_i = e_i$. Then we can write $W = (W C^{-1}) C$.
If we write $W C^{-1}$ as a sum of rank-1 matrices $A_i$, then the matrices $A_i C$ will sum to $W$, and each $A_i C$ is still rank-one, since the image of $A_i C$ equals the image of $A_i$.
So we can assume wlog that our $v_i$ lie along the standard basis, i.e. that they are orthogonal with respect to the standard inner product.
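For what it's worth, here's a quick numerical check of this change-of-basis argument (just a sketch: the sizes 1000 and 100 match the example above, but `V`, `Wout`, `W`, `B`, `C` are all made up here):

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 1000, 100  # residual stream dimension, number of bigram directions

V = rng.standard_normal((d, k))      # columns are v_1, ..., v_k (not orthogonal)
Wout = rng.standard_normal((d, k))   # columns are the target outputs w_1, ..., w_k

# Any W with W v_i = w_i will do; the pseudoinverse gives one such map.
W = Wout @ np.linalg.pinv(V)

# Extend {v_i} to a basis B of R^d (the extension is arbitrary); C is the
# change-of-basis matrix from the standard basis to B, so that C v_i = e_i.
B = np.concatenate([V, rng.standard_normal((d, d - k))], axis=1)
C = np.linalg.inv(B)

# Rank-1 components A_i C, where A_i = (W C^{-1}) e_i e_i^T "reads off" the
# v_i-coordinate and writes out w_i. Only k of them matter on span(v_1, ..., v_k).
WB = W @ B  # equals W C^{-1}
components_sum = np.zeros((d, d))
for i in range(k):
    components_sum += np.outer(WB[:, i], C[i, :])  # rank-1 term A_i C

# The k rank-1 components reproduce W on the bigram directions, up to float error.
print(np.abs(components_sum @ V - W @ V).max())  # numerically zero
```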
To clarify where I'm coming from, I read the previous post as showing that, assuming inactive circuits never activate due to superposition noise, we could get $T = O(D^2/d^2)$. This post shows that inactive circuits erroneously activate in practice, violating that assumption. I'm curious what asymptotics are possible if we remove this assumption and force ourselves to design the network to prevent such erroneous activations. I may be misinterpreting things, though.
Regardless of the method for embedding the small circuits in the larger network, currently only 1 out of the d neurons in each small network is allocated to storing whether the small network is on or off. I'm suggesting increasing this to cd neurons for some small fixed c, increasing the size of the small networks to (1+c)d neurons. In the rotation example, d is so small that this doesn't really make sense, but I'm just thinking asymptotically. This should generalise straightforwardly to the "cross-circuit" computation case as well.
The idea is that while each of the cd indicator neurons would be identical to the others within the small network, once embedded in the larger network, the noise that each small-network neuron (distributed across S neurons in the large network) receives is hopefully independent.
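A toy simulation of that variance-reduction intuition (just the averaging step; c, d, the noise scale, and the hard threshold standing in for the 2-ReLU step are all made-up numbers):

```python
import numpy as np

rng = np.random.default_rng(0)
c, d = 0.25, 64              # hypothetical: cd = 16 redundant on-indicators per circuit
n_copies = int(c * d)
sigma = 0.6                  # per-neuron interference noise scale (made up)
threshold = 0.5              # stand-in for the 2-ReLU step's decision boundary
trials = 100_000

# Circuit is OFF: every indicator copy should read 0, but each receives
# (hopefully independent) superposition noise once embedded in the big network.
noise = sigma * rng.standard_normal((trials, n_copies))

single = noise[:, 0]             # decision based on 1 indicator neuron
averaged = noise.mean(axis=1)    # decision based on the average of cd copies

print("false-activation rate, 1 indicator:  ", (single > threshold).mean())
print("false-activation rate, cd indicators:", (averaged > threshold).mean())
# Averaging shrinks the noise std by sqrt(cd), so the false-activation tail collapses;
# correlated noise across the copies would eat most of this gain.
```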
This method also works under the assumptions specified in Section 5.2, right? Under Section 5.2 assumptions, it suffices to encode the circuits which are active on the first layer, of which there are at most z. Even if you erroneously believe one of those z circuits is still active on a later layer after it has turned off, the gain comes from eliminating the other T−z inactive circuits. If the on-indicators don't seize, then you can stop any part of the circuit from seizing in the Section 5.2 scenario.
I agree shared state / cross-circuit computation is an important thing to model, though. I guess that's what you mean by "more generally"? In which case I misunderstood the post completely. I thought it was saying that the construction of the previous post ran into problems in practice, but it seems like you're just saying that, if we want this to work more generally, there are issues?
---------------------------------------
This series of posts is really useful, thank you! I have been thinking about it a lot for the past couple of days.