Research Engineer at DeepMind, focused on mechanistic interpretability and large language models. Opinions are my own.
Nice work, thanks for sharing! I really like the fact that the neurons seem to upweight different versions of the same token (_an
, _An
, an
, An
, etc.). It's curious because the semantics of these tokens can be quite different (compared to the though
, tho
, however
neuron).
Have you looked at all into what parts of the model feed into (some of) the cleanly associated neurons? It was probably out of scope for this but just curious.
(The quote refers to the usage of binary attention patterns in general, so I'm not sure why you're quoting it)
I obv agree that if you take the softmax over {0, 1000, 2000}, you will get 0 and 1 entries.
iiuc, the statement in the tracr paper is not that you can't have attention patterns which implement this logical operation, but that you can't have a single head implementing this attention pattern (without exponential blowup)
I don't think that's right. Iiuc this is a logical and, so the values would be in {0, 1} (as required, since tracr operates with Boolean attention). For a more extensive discussion of the original problem see appendix C.
Meta-q: Are you primarily asking for better assumptions or that they be made more explicit?
I would be most interested in an explanation for the assumption that is grounded in the distribution you are trying to approximate. It's hard to tell which parts of the assumptions are bad without knowing (which properties of) the distribution it's trying to approximate or why you think that the true distribution has property XYZ.
Re MLPs: I agree that we ideally want something general but it looks like your post is evidence that something about the assumptions is wrong and doesn't transfer to MLPs, breaking the method. So we probably want to understand better what about the assumptions don't hold there. If you have a toy model that better represents the true dist then you can confidently iterate on methods via the toy model.
Undertrained autoencoders
I was actually thinking of the LM when writing this but yeah the autoencoder itself might also be a problem. Great to hear you're thinking about that.
(ETA to the OC: the antipodal pairs wouldn't happen here due to the way you set up the data generation, but if you were to learn the features as in the toy models post, you'd see that. I'm now less sure about this specific argument)
Thanks for posting this. Some comments/questions we had after briefly discussing it in our team:
Yeah I agree with that. But there is also a sense in which some (many?) features will be inherently sparse.
I don't know what it would mean to go "out of distribution" in any of these cases.
This means that any network that has an incentive to conserve parameter usage (however we want to define that), might want to use superposition.
Do superposition features actually seem to work like this in practice in current networks? I was not aware of this.
I'm not aware of any work that identifies superposition in exactly this way in NNs of practical use.
As Spencer notes, you can verify that it does appear in certain toy settings though. Anthropic notes in their SoLU paper that they view their results as evidence for the SPH in LLMs. Imo the key part of the evidence here is that using a SoLU destroys performance but adding another LayerNorm afterwards solves that issue. The SoLU selects strongly against superposition and LayerNorm makes it possible again, which is some evidence that the way the LLM got to its performance was via superposition.
ETA: Ofc there could be some other mediating factor, too.
This example is meant to only illustrate how one could achieve this encoding. It's not how an actual autoencoder would work. An actual NN might not even use superposition for the data I described and it might need some other setup to elicit this behavior.
But to me it sounded like you are sceptical that superposition is nothing but the network being confused whereas I think it can be the correct way to still be able to reconstruct the features to a reasonable degree.
Yup! I think that'd be quite interesting. Is there any work on characterizing the embedding space of GPT2?