Tom Lieberum

Research Engineer at DeepMind, focused on mechanistic interpretability and large language models. Opinions are my own.

Wiki Contributions


During parts of the project I had the hunch that some letter specialized heads are more like proto-correct-letter-heads (see paper for details), based on their attention pattern. We never investigated this, and I think it could go either way. The "it becomes cleaner" intuition basically relies on stuff like the grokking work and other work showing representations being refined late during training by.. Thisby et al. I believe (and maybe other work). However some of this would probably require randomising e.g. the labels the model sees during training. See e.g. Cammarata et al. Understanding RL Vision: If you only ever see the second choice be labeled with B you don't have an incentive to distinguish between "look for B" and "look for the second choice". Lastly, even in the limit of infinite training data you still have limited model capacity and so will likely use a distributed representation in some way, but maybe you could at least get human interpretable features even if they are distributed.

Yup! I think that'd be quite interesting. Is there any work on characterizing the embedding space of GPT2?

Nice work, thanks for sharing! I really like the fact that the neurons seem to upweight different versions of the same token (_an, _An, an, An, etc.). It's curious because the semantics of these tokens can be quite different (compared to the though, tho, however neuron).


Have you looked at all into what parts of the model feed into (some of) the cleanly associated neurons? It was probably out of scope for this but just curious.

(The quote refers to the usage of binary attention patterns in general, so I'm not sure why you're quoting it)

I obv agree that if you take the softmax over {0, 1000, 2000}, you will get 0 and 1 entries.

iiuc, the statement in the tracr paper is not that you can't have attention patterns which implement this logical operation, but that you can't have a single head implementing this attention pattern (without exponential blowup) 

I don't think that's right. Iiuc this is a logical and, so the values would be in {0, 1} (as required, since tracr operates with Boolean attention). For a more extensive discussion of the original problem see appendix C.

Meta-q: Are you primarily asking for better assumptions or that they be made more explicit?

I would be most interested in an explanation for the assumption that is grounded in the distribution you are trying to approximate. It's hard to tell which parts of the assumptions are bad without knowing (which properties of) the distribution it's trying to approximate or why you think that the true distribution has property XYZ.

Re MLPs: I agree that we ideally want something general but it looks like your post is evidence that something about the assumptions is wrong and doesn't transfer to MLPs, breaking the method. So we probably want to understand better what about the assumptions don't hold there. If you have a toy model that better represents the true dist then you can confidently iterate on methods via the toy model.

Undertrained autoencoders

I was actually thinking of the LM when writing this but yeah the autoencoder itself might also be a problem. Great to hear you're thinking about that.

(ETA to the OC: the antipodal pairs wouldn't happen here due to the way you set up the data generation, but if you were to learn the features as in the toy models post, you'd see that. I'm now less sure about this specific argument)

Thanks for posting this. Some comments/questions we had after briefly discussing it in our team:

  • We would have loved to see more motivation for why you are making the assumptions you are making when generating the toy data.
    • Relatedly, it would be great to see an analysis of the distribution of the MLP activations. This could give you some info where your assumptions in the toy model fall short.
  • As Charlie Steiner pointed out, you are using a very favorable ratio of  in the toy model , i.e. of number of ground truth features to encoding dimension. I would expect you will mostly get antipodal pairs in that setup, rather than strongly interfering superposition. This may contribute significantly to the mismatch. (ETA: the antipodal pairs wouldn't happen here due to the way you set up the data generation, but if you were to learn the features as in the toy models post, you'd see that. I'm now less sure about this specific argument)
  • For the MMCS plots, we would be interested in seeing the distribution/histogram of MCS values. Especially for ~middling MCS values, where it's not clear if all features are somewhat represented or some are a lot and some not at all.
  • While we don't think this has a big impact compared to the other potential mismatches between toy model and the MLP, we do wonder whether the model has the parameters/data/training steps it needs to develop superposition of clean features.
    • e.g. in the toy models report, Elhage et al. reported phase transitions of superposition over the course of training,

Yeah I agree with that. But there is also a sense in which some (many?) features will be inherently sparse.

  • A token is either the first one of multi-token word or it isn't.
  • A word is either a noun, a verb or something else.
  • A word belongs to language LANG and not to any other language/has other meanings in those languages.
  •  image can only contain so many objects which can only contain so many sub-aspects.

I don't know what it would mean to go "out of distribution" in any of these cases.

This means that any network that has an incentive to conserve parameter usage (however we want to define that), might want to use superposition.

Do superposition features actually seem to work like this in practice in current networks? I was not aware of this.

I'm not aware of any work that identifies superposition in exactly this way in NNs of practical use. 
As Spencer notes, you can verify that it does appear in certain toy settings though. Anthropic notes in their SoLU paper that they view their results as evidence for the SPH in LLMs. Imo the key part of the evidence here is that using a SoLU destroys performance but adding another LayerNorm afterwards solves that issue. The SoLU selects strongly against superposition and LayerNorm makes it possible again, which is some evidence that the way the LLM got to its performance was via superposition.


ETA: Ofc there could be some other mediating factor, too.

Load More