Thank you for looking into this.
This investigation updated me more toward thinking that Computation in Superposition is unlikely to train in this kind of setup, because it's mainly concerned with minimising worst-case noise. It does lots of things, but it does them all with low precision. A task where the model is scored on how close to correct it gets many continuously-valued labels, as scored by MSE loss, is not good for this.
I think we need a task where the labels are somehow more discrete, or the loss function punishes outlier errors more, or the computation has multiple steps, with later steps depending on lots of intermediate results computed to low precision.
I think this is a fun and (initially) counterintuitive result. I'll try to frame things the way they work in my head; it might help people understand the weirdness.
The task of the residual MLP (labelled CC Model here) is to solve y = x + ReLU(x). Consider the problem from the MLP's perspective. You might think that the problem for the MLP is just to learn how to compute ReLU(x) for 100 input features with only 50 neurons. But given that we have this random matrix, the task is actually more complicated. Not only does the MLP have to compute ReLU(x), it also has to make up for the mess caused by the random matrix not being an identity.
But it turns out that making up for this mess actually makes the problem easier!
Yes! But only if the mess is in the residual stream, i.e. includes $x$! This is the heart of the necessary "feature mixing" we discuss.
I did some quick calculations for what the MSE per feature should be for compressed storage, i.e. storing T features in D dimensions where T > D.
I assume every feature is on with probability p; an on feature equals 1, an off feature equals 0. MSE here is the mean squared error of a linear readout of the features.
For random embeddings (superposition):
mse_r = Tp/(T+D)
If instead we use the D neurons to embed D of the features exactly, and output the constant value p for the rest:
mse_d = p(1-p)(T-D)/T
This suggests we should see a transition between these two types of embeddings when
mse_r = mse_d, i.e. when T^2/D^2 = (1-p)/p.
For T = 100 and D = 50, this means p = 0.2.
The model in this post is doing a bit more than just embedding features. But I don't think it can do better than the most effective embedding of the T output features in the D neurons, can it?
mse_r only depends on E[(u · v)^2] = 1/D, where u and v are different embedding vectors. Lots of embeddings have this property, e.g. embedding features along random basis vectors, i.e. assigning each feature to a random single neuron. This will result in some embedding vectors being exactly identical. But the MSE (L2) loss is equally happy with this as with random (almost orthogonal) feature directions.
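To sanity-check the mse_r formula above, here is a small Monte Carlo sketch (mine, not from the post's codebase). It assumes random unit-norm embedding directions and the scalar readout coefficient D/(D+T) implied by the derivation; under those assumptions the simulated error should land close to Tp/(T+D).

```python
import numpy as np

rng = np.random.default_rng(0)
T, D, p = 100, 50, 0.2            # features, dimensions, feature probability
n_samples = 20_000

# Random unit-norm embedding direction for each feature.
U = rng.normal(size=(T, D))
U /= np.linalg.norm(U, axis=1, keepdims=True)

# Sparse binary features: on (=1) with probability p, off (=0) otherwise.
X = (rng.random((n_samples, T)) < p).astype(float)

# Superposed storage and linear readout, rescaled by the optimal alpha = D/(D+T).
H = X @ U                          # (n_samples, D) compressed representation
alpha = D / (D + T)
X_hat = alpha * (H @ U.T)          # (n_samples, T) linear readout

print(f"simulated mse_r ≈ {np.mean((X_hat - X) ** 2):.4f}, "
      f"formula Tp/(T+D) = {T * p / (T + D):.4f}")
```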
Yep, the main code is in this folder! You can find mlpinsoup.py and the nb{1,2,3,4}... files.
This research was completed during the Mentorship for Alignment Research Students (MARS 2.0) and the Supervised Program for Alignment Research (SPAR spring 2025) programs. The team was supervised by Stefan (Apollo Research). Jai and Sara were the primary contributors, Stefan contributed ideas, ran final experiments, and helped write the post. Giorgi contributed in the early phases of the project. All results can be replicated using this codebase.
We investigate the toy model of Compressed Computation (CC), introduced by Braun et al. (2025): a model that seemingly computes more non-linear functions (100 target ReLU functions) than it has ReLU neurons (50). Our results cast doubt on whether the mechanism behind this toy model is indeed computing more functions than it has neurons: We find that the model's performance relies solely on the noisy labels, and that its performance advantage over baselines diminishes with lower noise.
Specifically, we show that the Braun et al. (2025) setup can be split into two loss terms: the ReLU task and a noise term that mixes the different input features ("mixing matrix"). We isolate these terms, and show that the optimal total loss increases as we reduce the magnitude of the mixing matrix. This suggests that the loss advantage of the trained model does not originate from a clever algorithm to compute the ReLU functions in superposition (computation in superposition, CiS), but from taking advantage of the noise. Additionally, we find that the directions represented by the trained model mainly lie in the subspace of the positive eigenvalues of the mixing matrix, suggesting that this matrix determines the learned solution. Finally we present a non-trained model derived from the mixing matrix which improves upon previous baselines. This model exhibits a similar performance profile as the trained model, but does not match all its properties.
While we have not been able to fully reverse-engineer the CC model, this work reveals several key mechanisms behind the model. Our results suggest that CC is likely not a suitable toy model of CiS.
Superposition in neural networks, introduced by Elhage et al. (2022), describes how neural networks can represent many sparse features in a lower-dimensional embedding space. They also introduce Computation in Superposition (CiS) as computation performed entirely in superposition. The definition of CiS has been refined by Hänni et al. (2024) and Bushnaq & Mendel (2024) as a neural network computing more non-linear functions than it has non-linearities, still assuming sparsity.
Braun et al. (2025) propose a concrete toy model of Compressed Computation (CC) which seemingly implements CiS: It computes 100 ReLU functions using only 50 neurons. Their model is trained to compute $y_i = x_i + \mathrm{ReLU}(x_i)$ for each of its 100 sparse input features. Naively one would expect an MSE loss per feature of 0.0833[1]; however, the trained model achieves a notably lower loss. Furthermore, an inspection of the model weights and performance shows that it does not privilege any features, suggesting it computes all 100 functions.[2]
Contributions: Our key results are that
We simplify the residual network proposed by Braun et al. (2025), showing that the residual connection (with fixed, non-trainable embedding matrices) serves as a source of noise that mixes the features. We show that a simple 1-layer MLP model trained on labels of the form $y = \mathrm{ReLU}(x) + Mx$, where $M$ is a mixing matrix, produces the same results.
We explore three settings for the mixing matrix $M$:
The input vector is sparse: Each input feature is independently drawn from a uniform distribution over $[-1, 1]$ with probability $p$, and otherwise set to zero. We also consider the maximally sparse case where exactly one input is nonzero in every sample. Following Braun et al. (2025) we use a batch size of 2048 and a learning rate of 0.003 with a cosine scheduler. To control for training exposure we train all models for 10,000 non-empty batches. The only trainable parameters are $W_{\text{in}}$ and $W_{\text{out}}$ (not the embedding $W_E$); though in some cases (Figure 5a) we optimize $W_E$ together with the MLP weights.
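The reference implementation is in mlpinsoup.py; as a rough, self-contained sketch of the setup as we understand it (the function and class names, and the exact label convention $y = \mathrm{ReLU}(x) + Mx$, are ours rather than the codebase's), the data generation and model look roughly like this:

```python
import torch
import torch.nn as nn

n_features, d_mlp, p = 100, 50, 0.01   # 100 input features, 50 ReLU neurons

def make_batch(batch_size, mixing_matrix, feature_prob=p):
    """Sparse inputs in [-1, 1]; labels are the ReLU task plus the mixing term."""
    x = torch.rand(batch_size, n_features) * 2 - 1                    # uniform over [-1, 1]
    mask = (torch.rand(batch_size, n_features) < feature_prob).float()
    x = x * mask                                                      # each feature on with prob p
    y = torch.relu(x) + x @ mixing_matrix.T                           # assumed label convention
    return x, y

class MLP(nn.Module):
    """1-layer MLP without biases: out = W_out @ ReLU(W_in @ x)."""
    def __init__(self):
        super().__init__()
        self.w_in = nn.Linear(n_features, d_mlp, bias=False)
        self.w_out = nn.Linear(d_mlp, n_features, bias=False)

    def forward(self, x):
        return self.w_out(torch.relu(self.w_in(x)))

M = 0.02 * torch.randn(n_features, n_features)   # e.g. a random mixing matrix
x, y = make_batch(2048, M)
print(((MLP()(x) - y) ** 2).mean().item())       # MSE loss of an untrained model
```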
We reproduce the qualitative results of Braun et al. (2025) in our setting, using both the mixing matrix derived from their fixed embedding and a random matrix with entries drawn from $\mathcal{N}(0, 0.02)$. In both cases we find qualitatively different behaviour at high and low sparsities, as illustrated in Figure 2.
Braun et al. (2025) studied the solutions trained in the sparse regime, referred to as Compressed Computation (CC).[4] In our analysis we will mostly focus on the CC model, but we provide an analysis of the dense model in a later section.
We quantitatively compare the model of Braun et al. (2025), with its embedding-derived mixing matrix, to three simple mixing matrices: (a) a fully random mixing matrix $M$, (b) a random but symmetric $M$, and (c) no mixing matrix ($M = 0$).
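As a sketch, the compared mixing matrices can be constructed along the following lines; the embed-derived form $W_E^\top W_E - I$ (with unit-norm embedding columns, which makes its diagonal exactly zero) and the specific scales are our assumptions rather than the post's code.

```python
import numpy as np

rng = np.random.default_rng(0)
n_features, d_embed, sigma = 100, 1000, 0.02

# Embed-like: interference pattern of a fixed random embedding, W_E^T W_E - I.
W_E = rng.normal(size=(d_embed, n_features))
W_E /= np.linalg.norm(W_E, axis=0, keepdims=True)        # unit-norm columns
M_embed = W_E.T @ W_E - np.eye(n_features)

# (a) Fully random mixing matrix.
M_random = sigma * rng.normal(size=(n_features, n_features))

# (b) Random but symmetric mixing matrix (same per-entry scale off the diagonal).
A = sigma * rng.normal(size=(n_features, n_features))
M_symmetric = (A + A.T) / np.sqrt(2)

# (c) No mixing matrix at all (the "clean" dataset).
M_zero = np.zeros((n_features, n_features))

print(M_embed.std(), M_random.std(), M_symmetric.std())
```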
We confirm that a fully random mixing matrix qualitatively reproduces the results (green curve in Figure 5a), and that a symmetric (but otherwise random) mixing matrix gives almost exactly the same quantitative results (red curve in Figure 5a) as the embedding case studied in Braun et al. (2025) (blue line).[5][6]
Importantly, we find that a dataset without a mixing matrix does not beat the naive loss. Braun et al. (2025) hypothesized that the residual stream (equivalent to our mixing matrix) was necessary due to training dynamics but was not the source of the loss improvement. To test this hypothesis we perform two experiments.
Firstly, we train the CC models over a range of noise scales from 0 to 0.08, as shown in Figure 5b. We see that the loss decreases almost linearly with the noise scale, for small noise scales. This is evidence that the loss advantage is indeed coming directly from the noise term.
Secondly, to rule out training dynamics, we train a CC model on a noisy dataset and then “transplant” and fine-tune it on the clean dataset (Figure 6). Here we find that the loss immediately rises to or above the naive loss when we switch to the clean dataset.
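A minimal, self-contained sketch of this transplant protocol (optimizer choice, step counts, and other details are ours, not the post's):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
n_features, d_mlp, p, sigma = 100, 50, 0.01, 0.02

def make_batch(mixing_matrix, batch_size=2048):
    x = torch.rand(batch_size, n_features) * 2 - 1
    x = x * (torch.rand(batch_size, n_features) < p).float()
    return x, torch.relu(x) + x @ mixing_matrix.T        # assumed label convention

model = nn.Sequential(nn.Linear(n_features, d_mlp, bias=False), nn.ReLU(),
                      nn.Linear(d_mlp, n_features, bias=False))
opt = torch.optim.Adam(model.parameters(), lr=3e-3)

M_noisy = sigma * torch.randn(n_features, n_features)    # noisy training dataset
M_clean = torch.zeros(n_features, n_features)            # clean dataset: no mixing

# Phase 1: train on the noisy dataset. Phase 2: "transplant" onto the clean one.
for phase, M in [("noisy", M_noisy), ("clean (after transplant)", M_clean)]:
    for step in range(2000):
        x, y = make_batch(M)
        loss = ((model(x) - y) ** 2).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
    print(f"{phase}: final batch loss {loss.item():.4f}")
```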
We conclude that compressed computation is entirely dependent on feature mixing.
Based on the previous results it seems clear that the CC model exploits properties of the mixing matrix $M$. We have some mechanistic evidence that the MLP weights are optimized for the mixing matrix $M$, but have not fully reverse-engineered the CC model. Additionally we design model weights based on the semi non-negative matrix factorization (SNMF) of $M$, which also beats the naive loss but does not perform as well as the trained solution.
We find that the directions read ($W_{\text{in}}$) and written ($W_{\text{out}}$) by each MLP neuron lie mostly in the 50-dimensional subspace spanned by the eigenvectors of $M$ with positive eigenvalues (Figure 7a, top panels). We find this effect to be slightly stronger when taking the singular value decomposition (SVD) of $M$,[7] as shown in the bottom panels of Figure 7a. Additionally we measure the cosine similarity between singular- and eigenvectors and their projections through the MLP, and find that only the top singular vectors are represented fully (Figure 7b).[8]
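A sketch of the kind of subspace measurement described here, for a symmetric mixing matrix (helper name and details are ours): project each neuron's read or write direction onto the eigenvectors of $M$ with positive eigenvalues and measure the fraction of its squared norm that lands there.

```python
import numpy as np

def frac_in_positive_eigenspace(directions, M_sym):
    """Fraction of each direction's squared norm lying in the span of
    M_sym's eigenvectors with positive eigenvalues (one direction per row)."""
    eigvals, eigvecs = np.linalg.eigh(M_sym)
    P = eigvecs[:, eigvals > 0]                # basis of the positive-eigenvalue subspace
    coords = directions @ P                    # components inside that subspace
    return (coords ** 2).sum(axis=1) / (directions ** 2).sum(axis=1)

# With random stand-ins for the trained weights the fraction is ~0.5;
# the observation above is that trained W_in / W_out score much higher.
rng = np.random.default_rng(0)
n_features, d_mlp = 100, 50
A = 0.02 * rng.normal(size=(n_features, n_features))
M = (A + A.T) / np.sqrt(2)                     # symmetric mixing matrix
W_in = rng.normal(size=(d_mlp, n_features))    # stand-in for trained read directions
print(frac_in_positive_eigenspace(W_in, M).mean())
```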
We hypothesize that the MLP attempts to represent the mixing matrix $M$, as well as an identity component (to account for the ReLU part of the labels). We provide further suggestive evidence of this in the correlation between the entries of the effective MLP matrix $W_{\text{out}}W_{\text{in}}$ and the matrix $M$ (Figure 8a): We find that the entries of both matrices are strongly correlated, and the diagonal entries of $W_{\text{out}}W_{\text{in}}$ are offset by a constant. We don’t fully understand this relationship quantitatively though (we saw that the numbers depend on the noise scale, but it’s not clear how exactly).
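A small helper sketching the Figure 8a-style comparison (names and details are ours): correlate the off-diagonal entries of the effective MLP matrix $W_{\text{out}}W_{\text{in}}$ with $M$, and report the mean diagonal offset (the hypothesized identity component).

```python
import numpy as np

def compare_to_mixing_matrix(W_in, W_out, M):
    """Correlate entries of the MLP's effective linear map W_out @ W_in with M,
    and report the mean diagonal offset."""
    WW = W_out @ W_in                                    # (n_features, n_features)
    off_diag = ~np.eye(M.shape[0], dtype=bool)
    corr = np.corrcoef(WW[off_diag], M[off_diag])[0, 1]  # off-diagonal correlation
    diag_offset = np.mean(np.diag(WW) - np.diag(M))      # constant on the diagonal?
    return corr, diag_offset

# With untrained random weights the correlation is near zero; the observation
# above is that it becomes strong after training.
rng = np.random.default_rng(0)
W_in, W_out = 0.1 * rng.normal(size=(50, 100)), 0.1 * rng.normal(size=(100, 50))
M = 0.02 * rng.normal(size=(100, 100))
print(compare_to_mixing_matrix(W_in, W_out, M))
```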
The trained solution, however, is clearly not just an SVD with $W_{\text{in}}$ and $W_{\text{out}}$ determined by $U$ and $V$. Specifically, as shown in Figure 8b, we note that $W_{\text{in}}$ is almost non-negative (which makes sense, as it is followed by the ReLU).
Inspired by this, we attempt to design a solution using the SNMF of $M$, setting $W_{\text{in}}$ to the non-negative factor. Somewhat surprisingly, this actually beats the naive loss! Figure 9 shows the SNMF solution for different noise scales. The SNMF solution beats the naive loss for a range of noise scales, though this range is smaller than the range in which the trained MLP beats the naive loss (Figure 5b). Furthermore, the SNMF solution does not capture all the properties that we notice in $W_{\text{in}}$ and $W_{\text{out}}$: the SNMF weights are less sparse, and the correlation between them and $M$ (analogous to Figure 8) looks completely different (not shown in post, see Figure here).
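We do not know which SNMF algorithm was used in the post; as an illustrative stand-in, here is a simple projected alternating-least-squares sketch that factorizes $M \approx W_{\text{out}} W_{\text{in}}$ with the $W_{\text{in}}$-like factor constrained to be non-negative.

```python
import numpy as np

def semi_nmf(M, rank=50, n_iter=200, seed=0):
    """Rough semi-NMF via projected alternating least squares:
    M ≈ F @ G with G >= 0 (G plays the role of W_in, F of W_out).
    A simple sketch, not necessarily the algorithm used in the post."""
    rng = np.random.default_rng(seed)
    G = np.abs(rng.normal(size=(rank, M.shape[1])))       # non-negative factor
    for _ in range(n_iter):
        F_T, *_ = np.linalg.lstsq(G.T, M.T, rcond=None)   # solve M ≈ F @ G for F
        F = F_T.T
        G, *_ = np.linalg.lstsq(F, M, rcond=None)         # solve M ≈ F @ G for G
        G = np.clip(G, 0, None)                           # project onto G >= 0
    return F, G

# Example: factorize a random symmetric mixing matrix and check the fit.
rng = np.random.default_rng(0)
A = 0.02 * rng.normal(size=(100, 100))
M = (A + A.T) / np.sqrt(2)
F, G = semi_nmf(M)
print(np.linalg.norm(M - F @ G) / np.linalg.norm(M))      # relative reconstruction error
```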
We conclude that we haven’t fully understood how the learned solution works. However, we hope that this analysis shines some light on its mechanism. In particular, we hope that this analysis makes it clear that the initial assumption of CC being a model of CiS is likely wrong.
We now return to the solution the model finds in the high feature probability (dense) input regime. As shown in Figure 2, models trained on dense inputs outperform the naive baseline on dense inputs. Revisiting Figures 3b and 4b, we notice an intriguing pattern: roughly half of the features are well-learned, while the other half are only weakly represented.
Our hypothesis is that the model represents half the features correctly, and approximates the other half by emulating a bias term. Our architecture does not include biases, but we think the model can create an offset in the outputs by setting the corresponding output weight rows to positive values that average over all features. This essentially uses the other features to, on average, create an offset.
We test this “naive + offset” solution on the clean dataset ($M = 0$), as the behaviour seems to be the same regardless of noise (but we did not explore this further), and find that the hardcoded naive + offset solution does in fact match the trained models’ losses. Figure 10a shows the optimal weight value (the same value for every entry of non-represented features), and Figure 10b shows the corresponding loss as a function of feature probability. We find that the hardcoded models (dashed lines) closely match or exceed the trained models (solid lines).
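A Monte Carlo sketch of this hardcoded naive + offset model on the clean dataset (our own reimplementation of the idea, with arbitrary example values for the offset weight w):

```python
import numpy as np

rng = np.random.default_rng(0)
n_features, d_mlp, p = 100, 50, 0.5        # dense-ish inputs
n_samples = 20_000

# Clean dataset (no mixing matrix): the target is just ReLU(x).
x = rng.random((n_samples, n_features)) * 2 - 1
x *= rng.random((n_samples, n_features)) < p
y = np.maximum(x, 0)

# Hardcoded model: the first 50 features each get their own neuron; the output
# rows of the remaining 50 features are a constant w, which for dense inputs
# acts like an emulated bias.
W_in = np.zeros((d_mlp, n_features))
W_in[:, :d_mlp] = np.eye(d_mlp)

def loss_for_offset(w):
    W_out = np.zeros((n_features, d_mlp))
    W_out[:d_mlp] = np.eye(d_mlp)          # represented features: exact ReLU
    W_out[d_mlp:] = w                      # non-represented features: constant offset weights
    pred = np.maximum(x @ W_in.T, 0) @ W_out.T
    return np.mean((pred - y) ** 2)

for w in [0.0, 0.005, 0.01, 0.02, 0.04]:
    print(f"w = {w:.3f}: mse = {loss_for_offset(w):.5f}")   # w = 0 is the naive solution
```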
We thus conclude that the high density behaviour can be explained by a simple bias term, and is not particularly interesting. We have not explored this case with a non-zero mixing matrix, but we expect that one could generalize the naive + offset solution to noisy cases.
At high training feature probabilities, tuning the offset value in the naive + offset model significantly improves performance beyond the naive baseline. However, when the feature probability is 0.1 or lower, no choice of offset leads to better-than-naive performance. This further supports the idea that the model adopts one of two distinct strategies depending on the feature probability encountered during training.
Our work sheds light on the mechanism behind the Braun et al. (2025) model of compressed computation. We conclusively ruled out the hypothesis that the embedding noise was only required for training, and showed that the resulting mixing matrix is instead the central component that allows models to beat the naive loss.
That said, we have not fully reverse-engineered the compressed computation model. We would be excited to hear about future work reverse-engineering this model (please leave a comment or DM / email StefanHex!). We think an analysis of how the solution relates to the eigenvectors of $M$, and how it changes for different choices of $M$, would be very promising.[9]
Acknowledgements: We want to thank @Lucius Bushnaq and @Lee Sharkey for giving feedback at various stages of this project. We also thank @Adam Newgas, @jake_mendel, @Dmitrii Krasheninnikov, @Dmitry Vaintrob, and @Linda Linsefors for helpful discussions.
This loss corresponds to implementing half the features and ignoring the other half. The number is derived from integrating $\mathrm{ReLU}(x)^2$ over the input distribution: $\tfrac{1}{2}\int_0^1 x^2\,\mathrm{d}x = \tfrac{1}{6}$ for each ignored feature, which averaged over all features gives $\tfrac{1}{12} \approx 0.0833$.
We’re using the terms feature and function interchangeably, both refer to the 100 inputs and their representation in the model.
Our $W_{\text{in}}$ / $W_{\text{out}}$ are just 100x50 dimensional, while Braun et al. (2025)'s are 1000x50 dimensional due to the embedding, but this does not make a difference.
Braun et al. (2025) say “We suspect our model’s solutions to this task might not depend on the sparsity of inputs as much as would be expected, potentially making ‘compressed computation’ and ‘computation in superposition’ subtly distinct phenomena”. We now know that this was due to evaluating a model’s performance only at its training feature probability, rather than across a set of evaluation feature probabilities, i.e. they measured the black dotted line in Figure 2.
For Figure 5 we chose the optimal noise scale for each model, which happens to be close to the standard deviation of the mixing matrix corresponding to the embeddings chosen in Braun et al. (2025). This was coincidental, as Braun et al. (2025) chose their embedding dimension for unrelated reasons.
Two further differences between Braun et al. (2025)'s embed-like case and the symmetrized case are (a) the diagonal of $M$ is zero in the embed-like case, and (b) the entries of $M$ are not completely independent in the embed-like case, as they are all derived from different combinations of the embedding matrix entries. We find that setting the diagonal of the random $M$ to zero almost matches the low feature probability loss of the embed-like case, though it slightly increases the high feature probability loss.
The eigenvalues of $M$ and its singular values are approximately related.
This is not due to $M$ being low-rank; its singular value spectrum is smooth (as expected from a random matrix).
During our experiments we found that the loss depends on the rank of $M$, though, depending on the setting, reducing the rank could either increase or decrease the loss.