Research engineer at Apollo Research (London). My main research interests are mechanistic interpretability and inner alignment.
This is a good idea and is something we're (Apollo + MATS stream) working on atm. We're planning on releasing our agenda related to this and, of course, results whenever they're ready to share.
Makes sense! Thanks!
Great! I'm curious, what was it about the sparsity penalty that you changed your mind about?
Hey thanks for your review! Though I'm not sure that either this article or Cunningham et al. can reasonably be described as a reproduction of Anthropic's results (by which I assume you're talking about Bricken et al.), given their relative timings and contents.
Comments on the outcomes of the post:
Some regrets related to the post:
I and collaborators are working on each of these problems.
Here is a reference that supports the claim using simulations: https://royalsocietypublishing.org/doi/10.1098/rspb.2008.0877

But I think you're right to flag it - other references don't really support it as the main reason for stripes: https://www.nature.com/articles/ncomms4535
Thanks Akash! I agree that this feels neglected.

Markus Anderljung recently tweeted about some upcoming related work from Jide Alaga and Jonas Schuett: https://twitter.com/Manderljung/status/1663700498288115712

Looking forward to it coming out!
Bilinear layers - not confident at all! It might make structure more amenable to mathematical analysis, so it might help? But as yet there aren't any empirical interpretability wins that have come from bilinear layers.

Dictionary learning - This is one of my main bets for comprehensive interpretability.

Other areas - I'm also generally excited by the line of research outlined in https://arxiv.org/abs/2301.04709
No theoretical reason - the method we used in the Interim Report to combine the two losses into one metric was pretty cursed. It's probably just better to use the L1 loss and the reconstruction loss separately and then combine the findings. But having plots for both losses would have added more plots without much gain for the presentation. It was also the metric where it was hardest to discern the difference between full recovery and partial recovery, because the differences were kind of subtle. In future work, some way of using the losses to measure feature recovery will probably be re-introduced; it probably just won't be the way we used in the Interim Report.
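For concreteness, here's a minimal sketch (in PyTorch, with made-up sizes and variable names, not our actual code) of what tracking the two losses separately looks like for a sparse autoencoder on MLP activations:

```python
import torch
import torch.nn.functional as F

d_mlp, d_dict = 512, 2048                      # hypothetical sizes
W_enc = torch.randn(d_mlp, d_dict) * 0.01
W_dec = torch.randn(d_dict, d_mlp) * 0.01
b_enc = torch.zeros(d_dict)

def sae_losses(x):
    """Return (reconstruction_loss, l1_loss) separately for a batch of activations x."""
    feats = F.relu(x @ W_enc + b_enc)          # sparse feature coefficients
    x_hat = feats @ W_dec                      # reconstruction of the activations
    recon_loss = F.mse_loss(x_hat, x)          # report this on its own ...
    l1_loss = feats.abs().mean()               # ... and this on its own,
    return recon_loss, l1_loss                 # rather than one combined metric

x = torch.randn(64, d_mlp)                     # stand-in for MLP activations
recon, l1 = sae_losses(x)
```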
I strongly suspect this is the case too! In fact, we might be able to speed up the learning of common features even further:

Pierre Peigné at SERIMATS has done some interesting work that looks at initialization schemes that speed up learning. If you initialize the autoencoders with a sample of datapoints (e.g. initialize the weights with a sample from the MLP activations dataset), each of which we assume to contain a linear combination of only a few of the ground truth features, then the initial phase of feature recovery is much faster*. We haven't had time to check, but it's presumably biased to recover the most common features first, since they're the most likely to be present in a given datapoint.

*The ground truth feature recovery metric (MMCS) starts higher at the beginning of autoencoder training, but converges to full recovery at about the same time.
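To illustrate the idea, here's a rough sketch (PyTorch, made-up shapes, not Pierre's actual code) of initializing dictionary directions from sampled activations, plus the mean-max-cosine-similarity metric mentioned in the footnote:

```python
import torch

def init_decoder_from_data(mlp_acts: torch.Tensor, d_dict: int) -> torch.Tensor:
    """Initialize each dictionary direction with a randomly sampled (normalized) datapoint."""
    idx = torch.randint(0, mlp_acts.shape[0], (d_dict,))
    W_dec = mlp_acts[idx].clone()
    W_dec /= W_dec.norm(dim=1, keepdim=True) + 1e-8   # unit-norm dictionary rows
    return W_dec

def mmcs(ground_truth: torch.Tensor, learned: torch.Tensor) -> torch.Tensor:
    """Mean max cosine similarity: for each ground truth feature, take the best-matching
    learned feature's cosine similarity, then average over ground truth features."""
    gt = ground_truth / (ground_truth.norm(dim=1, keepdim=True) + 1e-8)
    ld = learned / (learned.norm(dim=1, keepdim=True) + 1e-8)
    cos = gt @ ld.T                          # (n_true, n_learned) cosine similarities
    return cos.max(dim=1).values.mean()

# Example with random stand-ins for activations and ground-truth features
mlp_acts = torch.randn(10000, 512)
true_feats = torch.randn(100, 512)
W_dec = init_decoder_from_data(mlp_acts, d_dict=2048)
print(mmcs(true_feats, W_dec))
```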