This small project was a joint effort between Tassilo Neubauer (Morpheus) and Andre Assis. We originally started working on this over a year ago. We ran a ton of experiments, and we want to document what we've done and found. We hope that other people can pick up from where we left off.
Introduction
In this work, we investigated how Toy Models of Superposition behave as we move away from the high sparsity regime. This project is listed on the Timaeus website as a starter project.
To start, we reproduced the results from Dynamical versus Bayesian Phase Transitions in a Toy Model of Superposition. Then we explored what happens to the loss and the local learning coefficient (LLC) estimates over the course of training across a wide range of sparsities under two different initialization conditions: optimal initialization and random 4-gon initialization.
Our work is hardly complete, and we will not be pursuing this project further. This is our attempt to document our work and share it with others. There are many unanswered questions, and the reader should keep in mind that we were sometimes just doing stuff.
Experimental Setup
In all our experiments, we used 6 input dimensions and 2 hidden dimensions, which are then reconstructed back to 6 dimensions.
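For concreteness, here is a minimal PyTorch sketch of the architecture we have in mind, with tied encoder/decoder weights as in the TMS setup we build on (class and variable names are illustrative, not our actual training code); the models are trained with a mean-squared-error reconstruction loss.

import torch
import torch.nn as nn

class ToyAutoencoder(nn.Module):
    """Toy model of superposition: 6 inputs -> 2 hidden dimensions -> 6 reconstructed outputs."""

    def __init__(self, m: int = 6, n: int = 2):
        super().__init__()
        self.W = nn.Parameter(torch.randn(n, m) * 0.1)  # shape (hidden, input) = (2, 6)
        self.b = nn.Parameter(torch.zeros(m))           # one bias per input dimension

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = x @ self.W.T                                # project 6-dim inputs down to 2 dims
        return torch.relu(h @ self.W + self.b)          # reconstruct back to 6 dims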
In order to have a smooth transition from the fully sparse regime (S=1) to the dense regime (S<<1), we defined our input vectors for the model as follows:
Each vector has at least one entry equal to 1. All other entries are 1 with probability 1−S and 0 with probability S.[1]
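A minimal sketch of this sampling scheme (how the guaranteed-active entry is picked, uniformly at random below, is an assumption on our part):

import torch

def sample_inputs(num_samples: int, m: int = 6, sparsity: float = 0.9) -> torch.Tensor:
    """Binary inputs: one entry is forced to 1, the rest are 1 with probability (1 - sparsity)."""
    x = (torch.rand(num_samples, m) > sparsity).float()  # Bernoulli(1 - S) entries
    forced = torch.randint(0, m, (num_samples,))          # index of the guaranteed-active entry
    x[torch.arange(num_samples), forced] = 1.0
    return x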
The sparsity values S we tested were [0.993, 0.988, 0.980, 0.964, 0.938, 0.892, 0.811, 0.671, 0.426], plus the fully sparse case S=1.
For each sparsity value, we ran 200 different seeds (from 0 to 199) under two different initialization conditions:
Using the optimal solution in the high-sparsity regime as the starting point
A 4-gon with some random noise added to the coordinates
This gives us a total of 4000 experiments (2000 with random 4-gon initialization and 2000 initialized at optimal parameters for high sparsity).
The optimal solution for high sparsity is defined as:
A set of 6 vectors forming a hexagon, where each vector has magnitude √2 from the center and all biases are −1[2]
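In code, this initialization looks roughly as follows (the orientation of the hexagon is arbitrary up to rotation):

import math
import torch

def hexagon_init(m: int = 6, radius: float = math.sqrt(2)) -> tuple[torch.Tensor, torch.Tensor]:
    """Optimal high-sparsity solution: 6 weight vectors on a regular hexagon of radius sqrt(2), all biases -1."""
    angles = torch.arange(m) * (2 * math.pi / m)
    W = radius * torch.stack([torch.cos(angles), torch.sin(angles)])  # shape (2, 6), columns of norm sqrt(2)
    b = -torch.ones(m)
    return W, b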
All the auto-encoders were trained for 20,000 epochs, and we saved 50 logarithmically spaced snapshots over the course of training.
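The snapshot epochs can be picked roughly like this (the exact spacing in our code may differ slightly):

import numpy as np

num_epochs = 20_000
# 50 logarithmically spaced checkpoints between epoch 1 and epoch 20,000;
# np.unique drops duplicates at the low end, so slightly fewer distinct epochs may remain.
snapshot_epochs = np.unique(np.logspace(0, np.log10(num_epochs), num=50).astype(int))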
The complete configuration used in our experiments is available in the Appendix.
The LLC estimates were computed using the implementation from the Epsilon-Beta Visualizations notebook from Timaeus. We ran a grid search over batch_size and learning_rate and settled on values of 300 and 0.001, respectively, which we used for all LLC estimates.
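Those estimates come from the notebook's implementation; to convey the gist of what is being computed, here is a rough, self-contained SGLD sketch of this kind of estimator (not the notebook's code). The batch_size and lr defaults correspond to the grid-searched values of 300 and 0.001, while num_draws and gamma are illustrative choices of ours.

import math
import torch

def estimate_llc(model, X, criterion, num_draws=500, batch_size=300, lr=1e-3, gamma=100.0):
    """Rough SGLD-based LLC estimate around the current parameters w*:

        lambda_hat ~= n * beta * (E_w[L_n(w)] - L_n(w*)),   beta = 1 / log(n),

    where the expectation is over SGLD samples tethered to w* by a quadratic
    localization term of strength gamma.
    """
    n = len(X)
    beta = 1.0 / math.log(n)
    w_star = [p.detach().clone() for p in model.parameters()]

    with torch.no_grad():
        init_loss = criterion(model(X), X).item()  # L_n(w*)

    sampled_losses = []
    for _ in range(num_draws):
        batch = X[torch.randint(0, n, (batch_size,))]
        loss = criterion(model(batch), batch)
        model.zero_grad()
        loss.backward()
        with torch.no_grad():
            for p, p0 in zip(model.parameters(), w_star):
                drift = n * beta * p.grad + gamma * (p - p0)
                p.add_(-0.5 * lr * drift + math.sqrt(lr) * torch.randn_like(p))
        sampled_losses.append(loss.item())

    with torch.no_grad():  # restore w* so the caller's model is unchanged
        for p, p0 in zip(model.parameters(), w_star):
            p.copy_(p0)

    return n * beta * (sum(sampled_losses) / num_draws - init_loss)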
Results
Here we collect a series of observations from our results. For brevity, these are collected mostly as bullet points.
TMS Sparsity: Interactive Loss vs. LLC Visualization
We're also releasing a 100% vibe-coded companion Streamlit app if you would like to explore our results more interactively:
First, we analyzed how the loss and LLC values changed as training progressed for the different sparsity levels.
Points on loss and LLC values:
Generally, we observed that lower loss solutions tended to have higher LLC estimates.
The models tend to cluster at certain loss levels.
Points on the solutions found:
Generally, the models initialized at the hexagon solution of the sparse regime find solutions that are close to optimal in the dense regime. The models with the lowest loss that were randomly initialized are in a similar range.
The optimal solutions in the sparsity range 0.964-1 are all boring hexagons. Hexagon solutions are not often found from random initialization (although Tassilo checked with a few runs, and if we had chosen a step size of 0.05, hexagons would have been found more often).
The randomly initialized models often have 1 or more "dead" neurons. Those dead neurons seem to be the main reason why the solutions found by the random initialization have higher loss on average.
The lowest-loss solution for sparsity 0.938 is a pentagon with an axial symmetry.
At 0.892 sparsity, there are both pentagon and 4-gon solutions. The variation in loss between instances of those solutions is larger than the difference between the pentagon and the 4-gon solutions.
The low-loss solutions for 0.811 are various 4-gons.
For sparsity 0.671, the low-loss solutions are both 3-gon and 4-gon solutions with 0 to 2 negative biases.
The lowest loss solutions for 0.426 are 5-gons with all positive biases.
NOTE: The absolute values of the LLC estimates are highly dependent on the hyperparameters used. They are not meaningful on their own; the reader should pay attention to relative changes in LLC values within one sparsity level rather than to absolute values.
Figure 1 - The loss values and LLC estimates after 13 epochs. On the right: the runs initialized with the optimal solution in the sparse regime. Note that the higher the sparsity, the lower the loss. Interestingly, the LLC estimates increase up until S=0.892 and then decrease again. On the left: the randomly initialized runs. Note that the x and y axes are not shared between the plots.
Figure 2 - The loss values and LLC estimates after 85 epochs.
Figure 3 - The loss values and LLC estimates after 526 epochs. On the right: the loss values for lower sparsities are decreasing. For S=0.426 we start seeing two discrete plateaus. On the left: in the lower-sparsity regime, we start seeing solutions with lower loss than on the right for the same sparsity.
Figure 4 - The loss values and LLC estimates after 3243 epochs. On the right: the lower-sparsity runs achieve much lower loss on average at the expense of higher LLC values (higher complexity). Note that for S=0.426 we see two loss levels (0.25 and 0.15) across runs. On the left: we start seeing discrete loss plateaus for each sparsity level; for S=0.426 these plateaus include the values seen on the right.
Figure 5 - The loss values and LLC estimates after 20,000 epochs. Plateaus for both initializations become more pronounced, with the randomly initialized runs showing more plateaus than the optimally initialized runs.
Figure 6 - We created dendrograms to cluster solutions based on their input-output functions. We then manually picked one model out of the 3-5 main clusters per sparsity as a representative of that cluster and put them in this GIF (the dendrograms can be found in our repository).
To illustrate one of the solutions in more detail, here are the solutions we find for the lowest loss level in the dense regime.
What happens here is that two solutions tend to do relatively well, with an average MSE a little above ~0.15. One is a hexagon solution similar to the optimal solution in the sparse regime, except that the biases are positive. The other is a triangle solution, where the model "wastes" an entire dimension of its internal embedding to encode 1 output dimension and uses the remaining single dimension to encode the other 5 dimensions. From looking at how those models were initialized, it seems like a model gets stuck in this solution if one of the biases is initialized to be relatively large. (Orange is the test loss and blue is the training loss.)
The k-gon plots (Figures 22-31 below) show, for each sparsity level, the fraction of runs whose weight vectors form each k-gon configuration (measured by the convex hull of the 2D weight vectors) at each training step.
Figure 7 - Training dynamics of a sparse autoencoder with 0.426 sparsity (Run 6). The visualization shows weight evolution across five training checkpoints (steps 1, ~200, ~2000, ~10000, and final). Top row: geometric representation of encoder weights as vectors forming polygons in 2D space, with the convex hull shown as a red dashed line. Middle row: bias magnitudes for each neuron, where green bars indicate positive biases and red bars indicate negative biases, with black bars showing weight norms. Bottom: training and test loss curves on a log-log scale, with vertical dashed lines marking the snapshot locations. The polygon structure evolves from a 4-gon shape to a more irregular configuration as training progresses, while both training and test losses converge to approximately the same level.
Figure 8 - Another autoencoder with 0.426 sparsity (Run 78). Note how 3 dimensions collapsed with near-zero weights and large negative biases. Generally, dimensions that were randomly initialized with a large negative bias and a small weight in the hidden layer get stuck as dead neurons.
Figure 9 - The training dynamics of an optimally initialized autoencoder (Run 0) with 0.426 sparsity. The 6-gon slowly evolves to have all positive biases and lower weights. Note how long it takes for the evolution to take place and reach a new loss level.
Figure 10 - Another optimally initialized autoencoder with 0.426 sparsity (Run 93). The solution reached at the end of training is a 4-gon. This is one of the examples with loss around 0.25 shown in Figure 5.
Figure 11 - Autoencoder randomly initialized with 0.671 sparsity (Run 272). The solution at the end of training is a triangle. The neuron initialized with a small weight in the hidden layer and a large negative bias is stuck and doesn't change its value through training (a dead neuron).
Figure 12 - Autoencoder optimally initialized with 0.671 sparsity (Run 268). The final solution is a 4-gon with 2 negative biases.
Figure 13 - Autoencoder optimally initialized with 0.671 sparsity (Run 273). The final solution is a 4-gon with all positive biases.
Figure 14 - The training dynamics of a randomly initialized 4-gon with 0.811 sparsity (Run 457). This is the highest-loss and lowest-LLC run for this sparsity. Note how the 4-gon solution is persistent and "sticky".
Figure 15 - Another example of a randomly initialized 4-gon with 0.811 sparsity (Run 476). This is the lowest-loss and highest-LLC run for this sparsity. Note how the 4-gon solution is persistent and "sticky".
Figure 16 - The training dynamics of an optimally initialized 6-gon with 0.811 sparsity (Run 507). Note how the 6-gon evolves into a 4-gon with one dimension with near-zero weight. This run is among the higher-loss and lower-LLC runs for this sparsity value.
Figure 17 - Another example of an optimally initialized 6-gon with 0.811 sparsity (Run 508). The solution is also a 4-gon, but this run achieves lower loss than Run 507. This run has the lowest loss for this sparsity level.
Figure 18 - The training dynamics of a randomly initialized 4-gon with 0.993 sparsity (Run 1795). The solution is "stuck" in a 4-gon configuration. This run has the highest loss for this sparsity level.
Figure 19 - Another example of a randomly initialized 4-gon with 0.993 sparsity (Run 1601). Note how the 4-gon evolves into a 5-gon with one of the dimensions collapsed to near-zero weight. This run has the lowest loss for this sparsity level.
Figure 20 - Autoencoder optimally initialized with a 6-gon with 0.993 sparsity (Run 1669). This run has the highest loss for this sparsity level.
Figure 21 - Another optimally initialized 6-gon autoencoder (Run 1699). This is the lowest-loss run for this sparsity level.
Figure 22 - The fraction of different k-gon configurations of the optimally initialized autoencoders over training steps for sparsity 0.426.
Figure 23 - The fraction of different k-gon configurations of the randomly initialized autoencoders over training steps for sparsity 0.426.
Figure 24 - The fraction of different k-gon configurations of the randomly initialized autoencoders over training steps for sparsity 0.671.
Figure 25 - The fraction of different k-gon configurations (the n-gon of the convex hull) of the randomly initialized autoencoders over training steps for sparsity 0.811.
Figure 26 - The fraction of different k-gon configurations of the randomly initialized autoencoders over training steps for sparsity 0.892.
Figure 27 - The fraction of different k-gon configurations of the randomly initialized autoencoders over training steps for sparsity 0.938.
Figure 28 - The fraction of different k-gon configurations of the randomly initialized autoencoders over training steps for sparsity 0.964.
Figure 29 - The fraction of different k-gon configurations of the randomly initialized autoencoders over training steps for sparsity 0.980.
Figure 30 - The fraction of different k-gon configurations of the randomly initialized autoencoders over training steps for sparsity 0.993.
Figure 31 - The fraction of different k-gon configurations of the randomly initialized autoencoders over training steps for sparsity 1.
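The k-gon label in these plots is based on the convex hull of the 2D weight vectors. A minimal sketch of how such a classification can be computed is below; the exact criteria in our repository (e.g. the threshold for filtering near-zero "dead" columns) may differ.

import numpy as np
from scipy.spatial import ConvexHull

def classify_kgon(W: np.ndarray, dead_threshold: float = 0.1) -> int:
    """Return k: the number of vertices of the convex hull of the non-dead weight columns.

    W has shape (2, m): one 2D hidden-space vector per input dimension.
    """
    cols = W.T                                           # (m, 2)
    alive = cols[np.linalg.norm(cols, axis=1) > dead_threshold]
    if len(alive) < 3:
        return len(alive)                                # degenerate 0-, 1- or 2-"gon"
    # Note: exactly collinear points would need special handling (Qhull raises an error).
    return len(ConvexHull(alive).vertices)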
If you want to take a look at our models in more detail, we have uploaded the weights for all the training runs to our repository. The code to generate the plots in this post can be found under /notebooks/final_visualizations.py.
Appendix
Version 1.14 is the optimally initialized series of experiments (the parameter use_optimal_solution is set to True).
Version 1.15 is the randomly initialized 4-gon series of experiments (use_optimal_solution is set to False).
config = {
    "1.14.0": {
        "m": [6],  # number of input/output dimensions
        "n": [2],  # number of hidden dimensions
        "num_samples": [1024],
        "num_samples_test": [192],
        "batch_size": [1024],
        "num_epochs": [20000],
        "sparsity": [x for x in generate_sparsity_values(5, 10) if x != 0] + [1],
        "lr": [0.005],
        "momentum": [0.9],
        "weight_decay": [0.0],
        "init_kgon": [6],  # Irrelevant when using optimal solution
        "no_bias": [False],
        "init_zerobias": [False],
        "prior_std": [10.0],  # Irrelevant when using optimal solution
        "seed": [i for i in range(200)],
        "use_optimal_solution": [True],
        "data_generating_class": [SyntheticBinarySparseValued],
    },
    "1.15.0": {
        "m": [6],  # number of input/output dimensions
        "n": [2],  # number of hidden dimensions
        "num_samples": [1024],
        "num_samples_test": [192],
        "batch_size": [1024],
        "num_epochs": [20000],
        "sparsity": [x for x in generate_sparsity_values(5, 10) if x != 0] + [1],
        "lr": [0.005],
        "momentum": [0.9],
        "weight_decay": [0.0],
        "init_kgon": [4],
        "no_bias": [False],
        "init_zerobias": [False],
        "prior_std": [1.0],
        "seed": [i for i in range(200)],
        "use_optimal_solution": [False],
        "data_generating_class": [SyntheticBinarySparseValued],
    },
}
Figure A1: Dendrogram for sparsity 0.671. We clustered the models by evaluating them on all possible inputs {0,1}^6, stacking all the output vectors together, and clustering by the Euclidean distance between those stacked output vectors. The main thing the models cluster by is how many negative biases they have. See the repository for the other dendrograms.
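A sketch of that clustering procedure; the linkage method ("ward") is a placeholder choice of ours, and plot_solution_dendrogram expects a list of trained models for one sparsity level.

import itertools

import matplotlib.pyplot as plt
import numpy as np
import torch
from scipy.cluster.hierarchy import dendrogram, linkage

# All 2^6 = 64 possible binary inputs
all_inputs = torch.tensor(list(itertools.product([0.0, 1.0], repeat=6)))

def model_signature(model) -> np.ndarray:
    """Stack the model's outputs on every possible input into one long vector."""
    with torch.no_grad():
        return model(all_inputs).flatten().numpy()

def plot_solution_dendrogram(models) -> None:
    """Cluster trained models by their input-output behaviour and plot a dendrogram."""
    signatures = np.stack([model_signature(m) for m in models])
    Z = linkage(signatures, method="ward", metric="euclidean")  # hierarchical clustering on Euclidean distances
    dendrogram(Z)
    plt.show()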
[1] We initially ran these experiments with each entry in the input vector being 0 with probability S and 1 with probability 1−S, but then forgot that we had implemented it that way and got confused by our own results. We then decided it made more sense, for now, to have at least one entry in the vector be non-zero.
[2] It turned out the global minimum solution for 6 input dimensions and 2 hidden dimensions (a 6-gon with corners at radius √2 and bias −1) was different from the lowest-loss solution analyzed in Dynamical versus Bayesian Phase Transitions in a Toy Model of Superposition, because, according to Zhongtian Chen: "We didn’t classify it because it’s at the boundary (-l^2/2 = b), so the potential is not analytic there."