Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

200 COP in MI: Exploring Polysemanticity and Superposition


I strongly upvoted this post because of the "Tips" section, which is something I've come around on only in the last ~2.5 months.

> neuron has

I was confused by the singular "neuron."

I think the point here is that if there are some neurons which have low activation but high direct logit attribution after layernorm, then this is pretty good evidence for "smuggling."

Is my understanding here basically correct?

> This happens in transformer MLP layers. Note that the hidden dimen

Is the point that transformer MLPs blow up the hidden dimension in the middle?

Thanks for the catch, I deleted "Note that the hidden dimen". Transformers do blow up the hidden dimension, but that's not very relevant here - they have many more neurons than residual stream dimensions, *and* they have many more features than neurons (as shown in the recent Anthropic paper)

Important Note: Since writing this, there's been a lot of exciting work on understanding superposition via training sparse autoencoders to take features out of superposition. I recommend reading up on that work, since it substantially changes the landscape of what problems matter here.

This is the fifth post in a sequence called 200 Concrete Open Problems in Mechanistic Interpretability. Start here, then read in any order. If you want to learn the basics before you think about open problems, check out my post on getting started. Look up jargon in my Mechanistic Interpretability Explainer.

Motivating papers: Toy Models of Superposition, Softmax Linear Units

## Background

If you're familiar with polysemanticity and superposition, skip to Motivation or Problems.

Neural networks are very high dimensional objects, in both their parameters and their activations. One of the key challenges in Mechanistic Interpretability is to somehow resolve the curse of dimensionality, and to break networks down into lower dimensional objects that can be understood (semi-)independently.
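To make "breaking activations down into understandable pieces" concrete, here is a minimal sketch of reading candidate features out of an activation vector with dot products. Everything here (the directions, the activation values, the feature names) is invented for illustration, not taken from any real model:

```python
# Hypothetical 4-dimensional "activation space" with two hand-picked,
# orthonormal feature directions. In a real model these directions would
# have to be discovered, not written down.

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

feature_dirs = {
    "is_verb":   [1.0, 0.0, 0.0, 0.0],
    "is_number": [0.0, 1.0, 0.0, 0.0],
}

# An activation vector where "is_verb" is present with strength 0.8 and
# "is_number" is absent; the last two dimensions carry unrelated information.
activation = [0.8, 0.0, 0.3, -0.5]

# Projecting onto each direction reads off that feature's strength.
readouts = {name: dot(activation, d) for name, d in feature_dirs.items()}
print(readouts)  # {'is_verb': 0.8, 'is_number': 0.0}
```

The rest of this section is about when this convenient picture holds, and when it breaks.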

Our current best understanding of models is that, internally, they compute features: specific properties of the input, like "this token is a verb" or "this is a number that describes a group of people" or "this part of the image represents a car wheel". Early in the model there are simpler features, which are later used to compute more complex features by being connected up in a circuit (example shown above (source)). Further, our guess is that features correspond to directions in activation space. That is, for any feature that the model represents, there is some vector corresponding to it, and if we dot product the model's activations with that vector, we get out a number representing whether that feature is present (these are known as decomposable, linear representations). This is an extremely useful thing to be true about a model!

An even more helpful thing to be true would be if neurons correspond to features (ie the outputs of an activation function like ReLU). Naively, this is natural for the model to do, because a non-linearity like ReLU acts element-wise - each neuron's activation is computed independently (this is an example of a privileged basis). Concretely, if a neuron can represent feature A or feature B, then that neuron will fire differently for "feature A and NOT feature B" vs "feature A and feature B", meaning that the presence of B interferes with the ability to compute A. But if each feature is its own neuron, we're fine! If features correspond to neurons, we're playing interpretability on easy mode - we can focus on just figuring out which feature corresponds to each neuron. In theory we could even show that a feature is not present by verifying that it's not present in each neuron!

However, reality is not as nice as this convenient story. A countervailing force is the phenomenon of superposition. Superposition is when a network represents more features than it has dimensions, and squashes them all into a lower dimensional space. You can think of superposition as the model simulating a larger model.

Anthropic's Toy Models of Superposition paper is a great exploration of this. They build a toy model that learns to use superposition (notably different from a toy language model!). The model starts with a bunch of independently varying features, needs to compress these to a low dimensional space, and is then trained to recover each feature from the compressed mess. And it turns out that it does learn to use superposition! Specifically, it makes sense to use superposition for sufficiently rare (ie sparse) features, if the model is given non-linearities to clean up interference. Further, the use of superposition can be modelled as a trade-off between the costs of interference and the benefits of representing more features. And digging further into their toy models, they find all kinds of fascinating motifs regarding exactly how superposition occurs, notably that the features are sometimes compressed in geometric configurations, eg 5 features being compressed into two dimensions as the vertices of a pentagon, as shown below.

## Motivation

Zooming out, what does this mean for what research actually needs to be done? To me, when I imagine what real progress here might look like, I picture the following:

1. Crisp conceptual frameworks: I still feel pretty confused about what is even going on with superposition! How much does it occur? The Toy Models paper significantly clarified my intuitions, but it's far from complete. I expect progress here to mostly look like identifying the aspects of transformers and superposition that we're still confused about, building toy models to model those, and seeing what insights can be learned.
2. Empirical data from real models: It's all well and good to have beautiful toy models and conceptual frameworks, but it's completely useless if we aren't learning anything about real models! I would love to have some well-studied cases of superposition and polysemanticity in real models, and to know whether any of the toy model's predictions transfer. Are there any truly monosemantic neurons? Can we find a pentagon of features in a real residual stream? Can we reverse engineer a feature represented by several neurons?
3. Dealing with superposition in practice: Understanding superposition is only useful in that it allows us to better understand networks, so we need to know how to deal with it in practice! Can we identify all directions that correspond to features? Can we detect whether a feature is at all neuron-aligned, or just an arbitrary direction in space?

The direction I'm most excited about is a combination of 1 and 2, to form a rich feedback loop between toy models and real models - toy models generate hypotheses to test, and exploring real models generates confusions to study in toy models.
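The geometric motifs mentioned above (like the pentagon of features) are easy to play with by hand. Here is a hand-built sketch, not a trained model: three sparse features squeezed into two dimensions as the vertices of an equilateral triangle (standing in for the paper's pentagon), with a ReLU cleaning up the interference:

```python
import math

# Three unit vectors at 120 degrees: more features than dimensions.
dirs = [(math.cos(2 * math.pi * i / 3), math.sin(2 * math.pi * i / 3))
        for i in range(3)]

def relu(x):
    return max(0.0, x)

def compress(features):
    """Embed three feature values into 2D by summing scaled direction vectors."""
    x = sum(f * d[0] for f, d in zip(features, dirs))
    y = sum(f * d[1] for f, d in zip(features, dirs))
    return (x, y)

def recover(point):
    """Read each feature back out with a dot product; the ReLU rounds off the
    negative interference (distinct directions have dot product cos(120°) = -0.5)."""
    return [relu(point[0] * d[0] + point[1] * d[1]) for d in dirs]

# With only one feature active (a sparse input), recovery is exact:
print(recover(compress([1.0, 0.0, 0.0])))  # ≈ [1.0, 0.0, 0.0]

# With two features active simultaneously, interference corrupts the readout:
print(recover(compress([1.0, 1.0, 0.0])))  # ≈ [0.5, 0.5, 0.0] - both shrunk
```

The second case is exactly the simultaneous-interference failure mode discussed in the Tips below, and it only stays rare if the features are sparse.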

## Resources

The Toy Models of Superposition paper. This is a fascinating and well-written paper, and I recommend reading it before working on a problem in this area! There's a ton more insight in there that I didn't describe here.

## Tips

- A key challenge with toy models is picking the *right* model to analyse. It's a delicate balance between being a true simulation of what we care about in a real model, and being simple enough to be tractable to analyse, and it's very easy to go too far in either direction.
- A mistake I made early on: putting the ReLU on the hidden bottleneck layer, *not* on the output. This looked very interesting at first, but in hindsight was totally wrong-headed! (Take a moment to try to figure out why before you read on!) The model already *has* all the features, and it wants to use the bottleneck to *compress* these features. ReLUs are for computing new features and create significant interference between dimensions, so a ReLU is actively unhelpful on the bottleneck. But ReLUs are key at the *end*, because they're used for the "computation" of cleaning up the noise of interference with other features.
- Take the distinction between *linear bottleneck superposition* and *neuron superposition* really *seriously*.
  - Bottleneck superposition is about *compression*. It occurs when there's a linear map from a high dimensional space to a low dimensional space, and then a linear map back to a high dimensional space *without* a non-linearity in the middle. Intuitively, the model already *has* the features in the high dimensional space, but wants to map them to the low dimensional space in a way such that they can be recovered later for further computation; it's not trying to compute *new* features. A place where this *must* be happening: in GPT-2 Small there is a vocabulary of 50,000 possible input tokens, which are embedded into a residual stream of 768 dimensions, yet GPT-2 Small can still tell the difference between the tokens!
  - Neuron superposition is about *computation*. It occurs when there are more features than neurons, immediately after a non-linear activation function. Importantly, this means that the model has somehow *computed* more features than it had neurons - because the model needed to use a non-linearity, these features were not previously represented as directions in space. I feel *way* more confused about neuron superposition, so I'd be particularly excited to see more work here!
- Distinguish between *alternating interference* and *simultaneous interference*. Consider the different cases when one direction represents both feature A and feature B:
  - Alternating interference: feature A is present and B is *not* present, and the model needs to figure out that, despite there being some information along the direction, B is *not* present, while still detecting how much A is present. In toy models, this mostly seems to have been done by using ReLU to round off small activations to zero.
  - Simultaneous interference: both A *and* B are present, and the model needs to figure out that both are present (and how much!).

## Problems

This spreadsheet lists each problem in the sequence. You can write down your contact details if you're working on any of them and want collaborators, see any existing work, or reach out to other people on there! (Thanks to Jay Bailey for making it.)

Notation: the *ReLU output model* is the main model in the Toy Models of Superposition paper, which compresses features in a linear bottleneck; the *absolute value model* is the model studied with a ReLU hidden layer and output layer, which uses neuron superposition.

- Confusions about models that I want to see studied in a toy model:
- Explore neuron superposition by training their absolute value model on functions of multiple variables:

- Adapt their ReLU output model to have a different range of feature values, and see how this affects things. Currently the features are uniform `[0, 1]` if on (and 0 if off):

- Have n input features and an output feature for each pair of input features, and train it to compute the max of each pair, eg `max(|x|, |y|)`.
- Have discrete input data, eg if it's on, take on values in {1, 2, 3}.
- A* 4.1 - Does dropout create a privileged basis? Put dropout on the hidden layer of the ReLU output model and study how this changes the results. Do the geometric configurations happen as before? And are the feature directions noticeably more (or less!) aligned with the hidden dimension basis?
- B-C* 4.2 - Replicate their absolute value model and try to study some of the variants of the ReLU output models in this context. Try out uniform vs non-uniform importance, correlated vs anti-correlated features, etc. Can you find any more motifs?
- B* 4.3 - Explore neuron superposition by training their absolute value model on a more complex function like `x -> x^2`

. This should need multiple neurons per function to do well.
- B* 4.4 - What happens to their ReLU output model when there's non-uniform sparsity? Eg one class of less sparse features, and another class of very sparse features.
- A* 4.5 - Make the inputs binary (0 or 1), and look at the AND or OR of pairs of elements.
- B* 4.6 - Keep the inputs as uniform reals in `[0, 1]` and look at `max(x, y)`.
- A* 4.7 - Make the features always 1 if on (ie exactly two possible values).
- B* 4.8 - Make the features discrete, eg 1, 2 or 3.
- B* 4.9 - Make the features uniform in `[0.5, 1]`.
- A-B* 4.10 - What happens if you replace ReLUs with GELUs in their toy models (either for the ReLU output model, or the absolute value model)? Does it just act like a smoother ReLU?
- C* 4.11 - Can you find a toy model where GELU acts significantly differently from ReLU? A common intuition is that GELU is mostly a smoother ReLU, but close to the origin GELU can act more like a quadratic. Does this ever matter?
- C* 4.12 - Build a toy model of a classification problem, where the loss function is cross-entropy loss (not mean squared error loss!).
- C* 4.13 - Build a toy model of neuron superposition that has many more hidden features to compute than output features. Ideas: eg have each input feature take values in `[1.0, 2.0, 3.0, 4.0, 5.0]`

, and have 5 output features per input feature, with the label being `[1,0,0,0,0], [0,1,0,0,0], ...` and mean-squared error loss.
- C* 4.14 - Build a toy model of neuron superposition that needs multiple hidden layers of ReLUs. Can computation in superposition happen across several layers?
- C-D* 4.15 - Build a toy model of attention head superposition/polysemanticity. Can you find a task where a model wants to be doing different things with an attention head on different inputs? How do things get represented internally? How does it deal with interference? Eg `A ... B -> A`

(ie, if the current token is B, and token A occurred in the past, predict that A comes next).
- C-D* 4.16 - Build a toy model where a model needs to deal with simultaneous interference, and try to understand how it does it (or if it can do it at all!).
- C* 4.17 - Find a learned example of a network with a "non-linear representation", where its activations *can* be decomposed into independently understandable features, but not in a linear way (eg via different geometric regions in activation space, aka polytopes) - a case where the model *has* computed feature X, but X is not represented as a direction. Maybe if the model can do computation within the non-linear representation, without ever needing to explicitly make it linear?
- C* 4.18 - Find a network that doesn't have a discrete number of features (eg perhaps it has an infinite regression of smaller and smaller features, or fractional features, or something else).
- C* 4.19 - Find a neural network with a "non-decomposable" representation, ie where we can't break down its activations into independently understandable features.

Tip: To study induction circuits, look at attn-only-2l in TransformerLens. To study Indirect Object Identification, look at gpt2-small.

- B* 4.21 - Induction heads copy the token they attend to to the output, which involves storing which of the 50,000(!) input tokens it is in the 64 dimensional value vector. How are the token identities stored in the 64 dimensional space?
- B* 4.22 - The previous token head in an induction circuit communicates the value of the previous token to the key of the induction head. As above, how is this represented?
- B* 4.23 - The Indirect Object Identification circuit communicates names or positions between the pairs of composing heads. How is this represented in the residual stream? How many dimensions does it take up?
- B* 4.24 - In models like GPT-2 with absolute positional embeddings, knowing this positional information is extremely important, so the ReLU output model predicts that it should be given dedicated dimensions. Does this happen? Can you find any other components that write to these dimensions? (Note: the *first* positional embedding is often weird, and I would ignore it.)
- C-D* 4.25 - Can you find any examples of the geometric superposition configurations from the ReLU output model in the residual stream of a language model?
- C* 4.26 - Can you find any examples of locally almost-orthogonal bases? That is, where correlated features each get their own direction, but can interfere significantly with un/anti-correlated features.
- C* 4.27 - I speculate that an easy way to do bottleneck superposition with language data is to have "genre" directions which detect the type of text (newspaper article, science fiction novel, Wikipedia article, Python code, etc), and then to represent features specific to each genre in the *same* subspace. Because language tends to fall sharply into one type of text (or none of them), the model can use the same genre feature to distinguish many other sub-features. Can you find any evidence for this hypothesis?
- D* 4.28 - Can you find any examples of a model learning to deal with simultaneous interference? Ie having a dimension correspond to multiple features and being able to deal sensibly with both being present?
- B* 4.29 - Look at a polysemantic neuron in a one layer language model. Can you figure out how the model disambiguates which feature it is? (In a one layer model, neurons can *only* impact the output logits, so there's not that much room for complexity.)
- C* 4.30 - Do this on a two layer language model.
- B* 4.31 - Take one of the features that's part of a polysemantic neuron in a 1L language model and try to identify *every* neuron that represents that feature (such that if you eg use activation patching on just those neurons, the model completely cannot detect the feature). Is this sparse (only done by a few neurons) or diffuse (across many neurons)?
- C* 4.32 - Try to fully reverse engineer that feature!
See if you can understand how it's being computed, and how the model deals with alternating or simultaneous interference.
- C* 4.33 - Can you use superposition to create an adversarial example for a model?

Finding the direction corresponding to a feature is a widely studied subfield of interpretability (including non-mechanistic!) called *probing*; see a literature review. In brief, it looks like taking a dataset of positive and negative examples of a feature, looking at model activations on both, and finding a direction that predicts the presence of the feature.

- C-D* 4.35 - Pick a simple feature of language (eg "is a number" or "is in base64") and train a linear probe to detect it in the MLP activations of a one layer language model (there's a range of possible methods! I'm not sure what's suitable). Can you detect the feature? And if so, how sparse is this probe? Try to explore and figure out how confident you are that the probe has actually found how the feature is represented in the model.
- C-D* 4.36 - Look for features in Neuroscope that seem to be represented by various neurons in a 1L or 2L language model. Train a probe to detect some of them, and compare the performance of these probes to just taking that neuron. Explore and try to figure out how much you think the probe has found the true feature.

The SoLU paper introduces the SoLU activation and claims that it leads to more interpretable and less polysemantic neurons than GELU.

- A* 4.37 - How do my SoLU and GELU models compare in Neuroscope under the polysemanticity metric used in the SoLU paper? (What fraction of neurons seem monosemantic when looking at the top 10 activating dataset examples for 1 minute?)
- B* 4.38 - The SoLU metrics for polysemanticity are somewhat limited. Can you find any better metrics? Can you be more reliable, or more scalable?
- B-C* 4.39 - The paper speculates that the LayerNorm after the SoLU activation lets the model "smuggle through" superposition, by smearing features across many dimensions, having the output be very small, and letting the LayerNorm scale it up. Can you find any evidence of this in `solu-1l`

?
- C* 4.42 - If you train a 1L or 2L language model with d_mlp = 100 * d_model, what happens? Does superposition go away? In theory it should have more than enough neurons for one neuron per feature.
- D* 4.44 - Can you take a trained model, freeze all weights apart from a single MLP layer, then make the MLP layer 10x the width, copy each neuron 10 times, add some noise and fine-tune? Does this get rid of superposition? Does it add in new features?
- C-D* 4.45 - There's a long list of open questions at the end of Toy Models. Pick one and try to make progress on it!
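As a closing illustration of the probing workflow from problems 4.35-4.36, here is a toy linear probe trained with the perceptron rule. The synthetic 3-dimensional "activations", the planted feature direction, and all the training details are invented stand-ins - a real experiment would fit the probe on actual MLP activations:

```python
import random

random.seed(0)

def make_example(has_feature):
    # Hypothetical setup: the feature adds +1.5 along the direction (1, 1, 0),
    # on top of Gaussian noise in every dimension.
    base = [random.gauss(0.0, 0.5) for _ in range(3)]
    if has_feature:
        base[0] += 1.5
        base[1] += 1.5
    return base

# Balanced dataset of negative (0) and positive (1) examples of the feature.
data = [(make_example(label), label) for label in [0, 1] * 200]

# Fit a probe with the perceptron rule: w <- w + lr * (label - pred) * x.
w, b, lr = [0.0, 0.0, 0.0], 0.0, 0.1
for _ in range(20):
    for x, label in data:
        pred = 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0
        for i in range(3):
            w[i] += lr * (label - pred) * x[i]
        b += lr * (label - pred)

accuracy = sum(
    (1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0) == label
    for x, label in data
) / len(data)
print(f"probe accuracy: {accuracy:.2f}, direction: {[round(wi, 2) for wi in w]}")
```

The learned weight vector `w` is the probe's candidate "feature direction"; the sparsity and alignment questions in 4.35 are about how such a direction relates to individual neurons in a real model.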