Epistemic status: small exploration without previous predictions, results low-stakes and likely correct.
As a personal exercise in building research taste and experience in AI safety, and interpretability in particular, I have done four minor projects, all building on previously written code. They were carried out without pre-formulated hypotheses or expectations, merely to check the low-hanging fruit for anything interesting. In the end, they have not yielded major insights, but I hope they will be of some use and interest to people working in these areas.
This is the third project: extending Timaeus' toy-models-of-superposition (TMS) devinterp results, which study the formation of superposition during training using the local learning coefficient (LLC). It partially overlaps with this project.
The TMS results were redone using the original code, changing only the activation function.
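For readers who have not seen the setup, below is a minimal sketch of the kind of model being retrained, following the standard toy-models-of-superposition formulation (a tied-weight autoencoder with a bias and a nonlinearity on the output). It is an illustrative reconstruction, not the original notebook code, and the class and argument names are my own; the only thing varied between runs is the activation.

```python
import torch
import torch.nn as nn

class ToyAutoencoder(nn.Module):
    """Tied-weight toy model of superposition: x_hat = act(W^T W x + b).

    Illustrative sketch of the standard TMS setup; the original devinterp
    notebook code may differ in details (initialisation, feature importances).
    """
    def __init__(self, n_features: int, n_hidden: int, activation: nn.Module = nn.ReLU()):
        super().__init__()
        self.W = nn.Parameter(torch.randn(n_hidden, n_features) * 0.1)
        self.b = nn.Parameter(torch.zeros(n_features))
        self.activation = activation  # the only component swapped between runs

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        hidden = x @ self.W.T    # project features into the bottleneck
        recon = hidden @ self.W  # map back up with the tied weights
        return self.activation(recon + self.b)

# Swapping the nonlinearity is a one-line change per run:
model = ToyAutoencoder(n_features=6, n_hidden=2, activation=nn.Sigmoid())
```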
Broadly, there were four main observations, all somewhat confusing and suggesting that additional research is needed:
The last three correspond to previous findings (see project #2).
The basis for these findings is the TMS notebook from the devinterp library, plus the example code provided with it.
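For context on the measurement itself: the LLC estimate is, roughly, nβ times the gap between the average loss along an SGLD chain anchored at the trained weights and the loss at those weights, with β defaulting to 1/log n. The sketch below is a hand-rolled illustration of that idea under those assumptions; the actual notebooks use the devinterp library's own sampler and estimator, which handle chains, burn-in, and diagnostics properly.

```python
import copy
import math
import torch

def estimate_llc(model, data, loss_fn, n_steps=2000, lr=1e-4, gamma=100.0, beta=None):
    """Rough LLC estimate via SGLD around the trained weights w*.

    llc_hat = n * beta * (E_sgld[L_n(w)] - L_n(w*)), with beta = 1/log(n) by default.
    Illustrative only; no burn-in handling or multiple chains.
    """
    n = len(data)
    beta = beta or 1.0 / math.log(n)
    anchor = [p.detach().clone() for p in model.parameters()]  # w*
    sampler = copy.deepcopy(model)

    with torch.no_grad():
        init_loss = loss_fn(model(data), data).item()  # L_n(w*), autoencoder target = input

    losses = []
    for _ in range(n_steps):
        loss = loss_fn(sampler(data), data)
        sampler.zero_grad()
        loss.backward()
        with torch.no_grad():
            for p, p0 in zip(sampler.parameters(), anchor):
                noise = torch.randn_like(p) * math.sqrt(lr)
                # SGLD step on the tempered loss, with a localising pull back towards w*
                p -= lr / 2 * (beta * n * p.grad + gamma * (p - p0)) - noise
        losses.append(loss.item())

    return n * beta * (sum(losses) / len(losses) - init_loss)
```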
All resulting notebooks, extracted graphs, tables, and Word files with clean tabular comparisons can be found here.
General observations have been given above.
Specific observations for the individual activation functions (a code sketch of the variants compared appears after this list):
Activation functions with bounded outputs (sigmoid, tanh, softsign, softmin) produce low loss curves and similarly small LLC values; their limited output range yields a very flat loss landscape. They also show somewhat limited superposition outside of antipodal pairs, likely due to their continuity, symmetry, and bounded output. This contradicts the findings of post #2 as well as Lad and Kostolansky (2023), which showed significant superposition for sigmoid. Model differences could be to blame (this project uses autoencoders, while project #2 used simple MLPs), but Lad and Kostolansky (2023) also used autoencoders, with the nonlinearity at the end, and their findings still do not match the ones here. This is confusing and requires further research.
Softplus shows slow-developing superposition, likely because its elbow is less sharply delimited than ReLU's at zero. That said, it produced some of the most prominent superposition.
GELU (bottom) is mostly just a softer ReLU (top), with slightly reduced and delayed superposition.
SoLU (bottom) gave greater superposition than softmax (top), despite being intended to reduce it. This contradicts previous research (see project #2). The lack of LayerNorm may be to blame. This is very confusing; more research is needed.
(LeakyReLU), GELU, SoLU (!) and Softplus showed the most superposition. Possible causes:
Tanhshrink and Softshrink, on the other hand, produced significant superposition; their less sharp boundaries seem to help. This is confusing.
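To make the comparisons above concrete, here is a sketch of the kind of activation set involved, including a minimal SoLU (x · softmax(x)), which PyTorch does not provide out of the box; it is written without the LayerNorm that normally follows SoLU, matching the "lack of LayerNorm" caveat above. The exact list and settings in the notebooks may differ; this listing is an assumption based on the observations above.

```python
import torch
import torch.nn as nn

class SoLU(nn.Module):
    """Minimal SoLU: x * softmax(x), here without the LayerNorm that usually
    follows it (hence the 'lack of LayerNorm' caveat above)."""
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x * torch.softmax(x, dim=-1)

# Hypothetical sweep over the activations discussed; the notebooks' exact
# choices and hyperparameters may differ.
ACTIVATIONS = {
    "relu": nn.ReLU(),
    "leaky_relu": nn.LeakyReLU(),
    "gelu": nn.GELU(),
    "softplus": nn.Softplus(),
    "sigmoid": nn.Sigmoid(),
    "tanh": nn.Tanh(),
    "softsign": nn.Softsign(),
    "softmin": nn.Softmin(dim=-1),
    "softmax": nn.Softmax(dim=-1),
    "solu": SoLU(),
    "tanhshrink": nn.Tanhshrink(),
    "softshrink": nn.Softshrink(),
}
```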
The development of superposition happens after the initial large drop in losses, and corresponds to phase changes to a degree. Does superposition correspond to flat basins in the loss landscape, without being necessary for their development? Do they correspond to symmetrical, deeply entrenched basins?
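One way to make "development of superposition" quantitative, and to line it up against the loss drop and the LLC trace, is to track across training checkpoints how many feature directions the model represents relative to its hidden dimension, and how much those directions interfere. A hedged sketch, assuming the `ToyAutoencoder` above and a hypothetical list of saved checkpoints:

```python
import torch

def superposition_stats(W: torch.Tensor):
    """W has shape (n_hidden, n_features), as in the ToyAutoencoder sketch above.

    Returns (features_represented, mean_interference): how many feature columns
    have non-negligible norm, and the mean absolute cosine overlap between
    distinct feature directions. More represented features than hidden
    dimensions indicates superposition.
    """
    norms = W.norm(dim=0)                             # per-feature embedding norm
    represented = int((norms > 0.5).sum())            # crude threshold; sketch only
    W_hat = W / norms.clamp(min=1e-8)                 # unit feature directions
    overlaps = (W_hat.T @ W_hat).fill_diagonal_(0.0)  # pairwise cosine overlaps
    f = W.shape[1]
    mean_interference = overlaps.abs().sum().item() / (f * (f - 1))
    return represented, mean_interference

# Hypothetical usage over checkpoints saved during training, to compare the
# growth of superposition with the loss curve and the LLC trace:
# for step, state_dict in checkpoints:
#     rep, interference = superposition_stats(state_dict["W"])
```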
Previous findings have been validated, with the addition that low degrees of superposition usually correspond to slower formation as well. Similar considerations and areas for further research apply here. The only large surprise is the vastly different superposition of the bounded activation functions (sigmoid especially), which is confusing, and potentially that of SoLU.
More research is needed, potentially with finer-grained applications of the LLC or other SLT tools. Otherwise, the LLC behaved reasonably.
The broad conclusion is that more superposition leads to marginally smaller losses; that the development of superposition happens after the initial large drop in losses, during phase changes; and that sharp, non-negative, asymmetrical activation functions gave the most superposition and the lowest losses.
I would like to thank the original Timaeus team for starting this research direction, establishing its methods, and writing the relevant code.