As part of the second Alignment Jam, I studied how dropout affects the phenomenon called superposition. Superposition is studied at length in the Transformer Circuits Thread, and more specifically in Toy Models of Superposition.

The question I tried to answer is whether introducing dropout has a noticeable effect on the extent to which superposition occurs in small toy models.

Better understanding and controlling superposition would have big consequences for alignment research, as reducing it allows for more interpretable models.

I find that dropout does have an effect on superposition. This effect is complex, but it seems that dropout generally inhibits superposition, except in the presence of both features with varying importance (exponentially decaying, in this case) and low sparsity. I believe this is because a model with dropout needs some initial redundancy, and interference increases disproportionately as this redundancy grows.

1 - Background 

Dropout

Dropout is a simple regularisation technique that is often used (perhaps slightly less so nowadays) to reduce overfitting in neural networks. It consists of zeroing out each neuron of a given layer with a certain probability $p$. Intuitively, this encourages some level of redundancy in the network's representations.
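As a concrete sketch, here is dropout on a vector of activations. I assume the common 'inverted' dropout convention, where survivors are rescaled by $1/(1-p)$ so the expected activation is unchanged; the post does not specify which variant was used.

```python
import numpy as np

def dropout(h, p, rng):
    """Zero each activation independently with probability p.

    Uses 'inverted' dropout (an assumption): survivors are scaled by
    1/(1-p) so the expected activation is unchanged. Plain dropout
    would zero without rescaling.
    """
    mask = rng.random(h.shape) >= p
    return h * mask / (1.0 - p)

rng = np.random.default_rng(0)
h = np.ones(8)
out = dropout(h, 0.4, rng)
# each entry is either 0 or 1/0.6; on average ~60% of entries survive
print(out)
```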

Superposition 

If you're not in a rush, read Toy Models Of Superposition, it's great. If you are in a rush, here is a quick summary. 

The phenomenon called superposition occurs when models learn to represent more features than they have dimensions by activating multiple neurons at the same time for a given feature. This happens more reliably when the features we are dealing with are sparse, that is, most of them aren't present most of the time. This is something we can reasonably expect in real data. Take images for example: the space of possible features represented in an image is huge, but each individual image will contain a small subset of those features. 

Superposition illustration, taken from 'Toy Models of Superposition', Elhage et al.

 

Intuitively, the model learns that it can use the latent space more efficiently by packing these features 'almost-orthogonally' but not quite, representing more of them at the cost of some interference between them. If the features aren't sparse, however (i.e. they are all always present in the data), the model has less of an incentive to do this.

This figure was taken from 'Toy Models of Superposition', Elhage et al.

 

2 - Methods 

We use the exact same setup as the Anthropic team, with the exception of dropout.

The models they study are simple Auto-Encoders with a bottleneck hidden layer, and ReLU activation functions. They can be described by the following equations:

$$h = \text{dropout}(Wx)$$
$$x' = \text{ReLU}(W^T h + b)$$

where $x'$ is the reconstructed input $x$. Notice the introduction of dropout on the hidden bottleneck layer.
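A minimal sketch of the forward pass (initialisation and the training loop are omitted; the inverted-dropout scaling is an assumption):

```python
import numpy as np

def forward(x, W, b, p=0.0, rng=None):
    """Toy autoencoder: h = Wx (bottleneck), dropout on h during
    training, then x' = ReLU(W^T h + b).

    Shapes: W is (m, n) with m < n; x and b are (n,).
    """
    h = W @ x
    if rng is not None and p > 0.0:
        mask = rng.random(h.shape) >= p
        h = h * mask / (1.0 - p)  # inverted-dropout scaling (assumed)
    return np.maximum(0.0, W.T @ h + b)

# Sanity check: with no dropout and an orthonormal embedding of 2
# features in 2 dimensions, the represented features reconstruct exactly.
W = np.eye(2)
b = np.zeros(2)
x = np.array([0.3, 0.8])
print(forward(x, W, b))  # [0.3 0.8]
```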

The data the models are trained on is synthetic. Each $x_i$ is considered to be a feature and has an associated sparsity $S_i$ and importance $I_i$. Each $x_i$ is 0 with probability $S_i$, and otherwise sampled uniformly in $[0, 1]$. In our case too, we consider one value of sparsity for all the features at the same time, $S_i = S$. $I_i$ scales the mean squared error loss we use to train the model for reconstruction, $L = \sum_i I_i (x_i - x'_i)^2$.
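The data-generating process and loss can be sketched as follows. The exponential decay rate of 0.7 for the importances is an illustrative guess; the post only says that importances decay exponentially.

```python
import numpy as np

def sample_batch(n_samples, n_features, S, rng):
    """Each x_i is 0 with probability S, otherwise Uniform[0, 1]."""
    values = rng.random((n_samples, n_features))
    present = rng.random((n_samples, n_features)) >= S
    return values * present

def weighted_mse(x, x_hat, importance):
    """Importance-weighted reconstruction loss."""
    return np.mean(np.sum(importance * (x - x_hat) ** 2, axis=-1))

rng = np.random.default_rng(0)
importance = 0.7 ** np.arange(20)  # exponentially decaying (assumed rate)
x = sample_batch(1000, 20, S=0.9, rng=rng)
# at sparsity 0.9, roughly 10% of entries are non-zero
print(np.mean(x > 0))
```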

These models are simple and small enough to be tractable and clearly interpretable, but at the same time complex enough (nonlinear) to achieve the representation task effectively.

3 - Results 

We can visualise the features learned by the models with 2 latent dimensions. From yellow to green the colour represents increasing feature importance. 

Feature embeddings for a toy model with 5 features in 2 dimensions, without dropout

As we would expect, in the vanilla toy model we observe that superposition increases with sparsity, with the dense case representing orthogonally only the two most important features, and the sparsest case representing all 5 features in a regular pentagon. 

We can compare this to what the model learns with dropout. 

Feature embeddings for a toy model with 5 features in 2 dimensions, dropout p=0.4

Here we can start to see that, while superposition still increases with sparsity, the model never goes beyond antipodal pairs and never represents all 5 features. Furthermore, the features are now basis-aligned with the neurons (this is because dropout induces a privileged basis).

If we look at larger models, we can see this effect even better. Here we train models with 20 features and a 5 dimensional bottleneck. 

The bar charts show the norm of each feature's direction vector, $\|W_i\|$, which will be ~1 if the feature is fully represented. The bars' colour shows the extent to which each feature is in superposition (whether it is orthogonal to the other features), from blue (not at all) to yellow. Each row represents an increase in sparsity. Furthermore, we show the matrix $W^T W$ on the right; this is the transformation that our model applies to the data.

Visualisation of superposition and autoencoder weights

Once more, dropout hinders the occurrence of superposition, at least towards the higher end of sparsity. We can also note that dropout seems to encourage a 'weak' representation of less important features, even in the dense case, which then leaves space for more superposition. This initial effect is only present in experiments where feature importance varies; in the case of uniform importance, the effect of dropout is solely to reduce superposition. In fact, despite this 'resistance' to superposition, the transition from one phase to the other is much more gradual, and present from the start.

We can also look at the number of hidden dimensions per feature represented in the model's hidden space. This is computed in the same way as in Toy Models of Superposition, that is, $D = m / \|W\|_F^2$, where $m$ is the number of hidden dimensions. The figure shows us that dropout does in fact increase the number of dimensions used per feature. From this graph we can also conclude that dropout affects the geometric structures used to represent features. In future work, it will be interesting to study how dropout affects feature geometry and learning dynamics. A more peculiar aspect of this metric is that it shows a reduced number of dimensions per feature for dropout even in the case of no sparsity, which doesn't correspond with the conclusions from the bar charts above. I suspect this has to do with how the metric responds to features that are not fully represented, with $\|W_i\|$ between 0 and 1.

Comparison of dimensions per represented feature across sparsity levels with and without dropout.
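For reference, the dimensions-per-feature metric can be computed directly from the embedding matrix; here are two sanity checks on toy embeddings:

```python
import numpy as np

def dims_per_feature(W):
    """Dimensions per represented feature, D = m / ||W||_F^2,
    where W has shape (m, n) and m is the hidden dimension."""
    m = W.shape[0]
    return m / np.sum(W ** 2)

# m orthonormal feature directions => ||W||_F^2 = m => D = 1
print(dims_per_feature(np.eye(5)))                 # 1.0

# an antipodal pair: 2 unit features sharing 1 dimension => D = 0.5
print(dims_per_feature(np.array([[1.0, -1.0]])))   # 0.5
```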

 

It is important to note, however, that this model's performance suffers, probably because $p = 0.4$ is quite aggressive.

Performance comparison for no dropout (p=0) and p=0.4

If we reduce the amount of dropout to a more reasonable 0.15, we get something in the middle in terms of both performance and superposition. Even with more compute, it does not seem that we can get dropout models to the same performance as the clean one: dropout is not simply 'slowing down' the training process. It remains to be seen whether this performance tradeoff is worth it in real-world tasks.

In fact, if we take a look at a model without dropout, trained to the same performance as the dropout model, it still exhibits much more superposition.

Superposition visualisation for dropout p=0, trained to 0.05 MSE loss

4 - Why?  

Here I propose hypotheses for why dropout causes these effects. These are mostly speculative. 

An intuitive explanation for why dropout might reduce superposition in the case of sparse features is that the more neurons a model uses to represent a given feature, the higher the chance that at least one of them gets perturbed by dropout. For a feature represented across $n$ neurons, the probability that none of the neurons is perturbed is $(1-p)^n$. We only notice this effect in high-sparsity regimes, because only then do neurons become highly polysemantic.
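A quick numerical check of this argument, with $p = 0.4$ as in the experiments above:

```python
# Probability that a feature spread over n hidden neurons survives a
# dropout pass untouched: (1 - p) ** n. It decays quickly with n.
p = 0.4
for n in (1, 2, 3, 5):
    print(n, round((1 - p) ** n, 3))
# 1 0.6 / 2 0.36 / 3 0.216 / 5 0.078
```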

On the other hand, in the case of dense features, dropout seems to actually incentivise a small amount of superposition. This only occurs when we have features of varying importance, in this case exponentially decaying. In fact, this effect is noticeable even in the case of dense linear models with dropout. This effect is more complex. First of all, dropout forces the weights to share some of the representation with the biases, even for the most important features, which would otherwise be fully represented (because they might be perturbed). In particular, for an input $x$ we have $x'_i = \text{ReLU}(\|W_i\|^2 x_i + \sum_{j \neq i} (W_i \cdot W_j) x_j + b_i)$. Even for the most important feature, where $\|W_i\| = 1$ and $W_i \cdot W_j = 0$, we can always expect some perturbation caused by dropout to 'damage' the representation. Dropout on the hidden layer only affects the rows of $W$, so the model compensates for this by sharing some of the feature's representation with $b$, which is not affected. Similarly, the interference term $\sum_{j \neq i} (W_i \cdot W_j) x_j$ is also initially reduced by dropout, as is the magnitude of $\|W_i\|$, thus the model is less penalised for introducing superposition. Intuitively, the interference between two features represented in superposition will only be large if neither of them is perturbed.
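A small Monte Carlo illustration of the bias-sharing point, under the assumptions above (a single dense feature represented on one hidden neuron, inverted dropout, and an arbitrarily chosen shrunken-weight/positive-bias alternative): even though the expected hidden activation is unchanged, individual passes are perturbed, so shrinking the weight and moving some of the representation into the bias lowers the expected loss.

```python
import numpy as np

rng = np.random.default_rng(0)
p = 0.4
x = rng.random(100_000)  # one dense feature, x ~ Uniform[0, 1]
mask = (rng.random(x.shape) >= p) / (1 - p)  # inverted dropout (assumed)

def mse(w2, b):
    """Reconstruction error for x' = ReLU(w2 * x * mask + b),
    where w2 plays the role of ||W_i||^2."""
    out = np.maximum(0.0, w2 * x * mask + b)
    return np.mean((out - x) ** 2)

print(mse(1.0, 0.0))  # full representation, no bias
print(mse(0.8, 0.1))  # shrunken weight + positive bias does better
```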

 

5 - Future work 

There are two main directions in which this should be extended.

First, it would be interesting to observe this effect in the real world. For example, can dropout applied to a vision model help us find more interpretable circuits and features? This should be relatively simple to try, given the amount of work that has already been done on vision models in the original Circuits Thread.

Second, on the more theoretical side, dropout seems to 'slow down' the phase transition between the full representation of a few features and superposition; it almost doesn't look like a phase transition at all. Why is that the case, and how does it change the theoretical models of the phase transitions proposed in 'Toy Models of Superposition'? In addition, a better formalisation of the arguments outlined in Section 4 should be sought.

Most likely, this will be the first in a series of posts. 

Comments

Great post! I was wondering if the conclusion to be drawn is really that « dropout inhibits superposition »? My prior was that it should increase it (so this post proved me wrong on this part), mainly because in a model with MLPs in parallel (like a transformer), dropout would force redundancy of circuits, not inside one MLP, but across different MLPs.

I'd like to see more on that; it would be super useful to know whether dropout helps interpretability or not, so as to enforce it or not during training.

Thanks! 

I haven't yet explored the case you describe with transformers. In general, I suspect the effect will be more complex in multilayer transformers than it is for simple autoencoders. However, in the simple case explored so far, it seems this conclusion can be drawn. Dropout essentially adds variability to the direction each feature is represented by, so features can't be packed as tightly.

WCargo:

One thing I just thought about: I would predict that dropout reduces superposition in parallel and augments superposition in series (because to make sure that the function is computed, you can have redundancy).

What do you mean exactly by 'in parallel' and 'in series'?

In an MLP, the nodes from different layers are in series (you need to go through the first, and then the second), but inside the same layer they are in parallel (you go through one or the other).

The analogy is with electrical systems, but I was mostly thinking in terms of LLM components: the MLPs and attention blocks are in series (you go through the first and then through the second), but inside one component, they are in parallel.

I guess that then, inside a component there is less superposition (the evidence is this post), and between components there is redundancy (so if a computation fails somewhere, it is also done somewhere else).

In general, dropout makes me feel like, because some parts of the network are not going to work, the network has to implement 'independent' components to compute things properly.