Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

This is a linkpost to a set of slides containing an update to a project that was the subject of a previous post ([Interim research report] Taking features out of superposition with sparse autoencoders).

The update is very small and scrappy. We haven't had much time to devote to this project since posting the Interim Research Report.

TL;DR for the slides: 

  • We trained a minuscule language model (LM) (residual size = 16; 6 layers) and then trained sparse autoencoders on MLP activations (dimension =  64) from the third layer of that model.
  • We found that, when we compared the 'ground truth feature recovery' plots, the plots for the toy data and LM data were much more similar than in the Interim Research Report.
  • Very, very tentatively, we found the layer had somewhere between 512-1024 features. By labelling a subset of these features, we estimate there are roughly 600 easily labellable (monosemantic) features. For instance, we found a feature that activates for a period immediately after 'Mr', 'Mrs', or 'Dr'.
  • We suspect that the reason the toy data and LM data plots had previously looked different was due to severely undertrained sparse autoencoders.  

We're hopeful that with more time to devote to this project we can confirm the results and apply the method to larger LMs. If it works, it would give us the ability to tell mechanistic stories about what goes on inside large LMs in terms of monosemantic features.

New Comment
5 comments, sorted by Click to highlight new comments since:

As (maybe) mentioned in the slides, this method may not be computationally feasible for SOTA models, but I'm interested in the ordering of features turned monosemantic; if the most important features are turned monosemantic first, then you might not need full monosemanticity.

I initially expect the "most important & frequent" features to become monosemantic first based off the superposition paper. AFAIK, this method only captures the most frequent because "importance" would be w/ respect to CE-loss in the model output, not captured in reconstruction/L1 loss.

I strongly suspect this is the case too! 

In fact, we might be able to speed up the learning of common features even further:

Pierre Peigné at SERIMATS has done some interesting work that looks at initialization schemes that speed up learning. If you initialize the autoencoders with a sample of datapoints (e.g. initialize the weights with a sample from the MLP activations dataset), each of which we assume to contain a linear combination of only a few of the ground truth features, then the initial phases of feature recovery is much faster*. We haven't had time to check, but it's presumably biased to recover the most common features first since they're the most likely to be in a given data point. 

*The ground truth feature recovery metric (MMCS) starts higher at the beginning of autoencoder training, but converges to full recovery at about the same time. 

We have our replication here for anyone interested!

Why is loss stickiness deprecated? Were you just not able to see the an overlap in basins for L1 & reconstruction loss when you 4x the feature/neuron ratio (ie from 2x->8x)?

No theoretical reason - The method we used in the Interim Report to combine the two losses into one metric was pretty cursed. It's probably just better to use L1 loss alone and reconstruction loss alone and then combine the findings. But having plots for both losses would have added more plots without much gain for the presentation. It also just seemed like the method that was hardest to discern the difference between full recovery and partial recovery because the differences were kind of subtle. In future work, some way to use the losses to measure feature recover will probably be re-introduced. It probably just won't be the way we used in the interim report.