Rather than doubling down on a single single-layered decomposition for all activations, why not go with a multi-layered decomposition (ie: some combination of SAE and metaSAE, preferably as unsupervised as possible). Or alternatively, maybe the decomposition that is most useful in each case changes and what we really need is lots of different (somewhat) interpretable decompositions and an ability to quickly work out which is useful in context.
Definitely seems like multiple ways to interpret this work, as also described in SAE feature geometry is outside the superposition hypothesis. Either we need to find other methods and theory that somehow finds more atomic features, or we need to get a more complete picture of what the SAEs are learning at different levels of abstraction and composition.
Both seem important and interesting lines of work to me!
Great work! Using spelling is very clear example of how information gets absorbed in the SAE latent, and indeed in Meta-SAEs we found many spelling/sound related meta-latents.
I have been thinking a bit on how to solve this problem and one experiment that I would like to try is to train an SAE and a meta-SAE concurrently, but in an adversarial manner (kind of like a GAN), such that the SAE is incentivized to learn latent directions that are not easily decomposable by the meta-SAE.
Potentially, this would remove the "Starts-with-L"-component from the "lion"-token direction and activate the "Starts-with-L" latent instead. Although this would come at the cost of worse sparsity/reconstruction.
Having been at two LH parties, one with music and one without, I definitely ended up in the "large conversation with 2 people talking and 5 people listening"-situation much more in the party without music.
That said, I did find it much easier to meet new people at the party without music, as this also makes it much easier to join conversations that sound interesting when you walk past (being able to actually overhear them).
This might be one of the reasons why people tend to progressively increase the volume of the music during parties. First give people a chance to meet interesting people and easily join conversations. Then increase the volume to facilitate smaller conversations.
I just finished reading "Zen and the Art of Motorcycle Maintenance" yesterday, which you might enjoy reading as it explores the topic of Quality (what you call excellence). From the book:
“Care and Quality are internal and external aspects of the same thing. A person who sees Quality and feels it as he works is a person who cares. A person who cares about what he sees and does is a person who’s bound to have some characteristic of quality.”
Interesting, we find that all features in a smaller SAE have a feature in a larger SAE with cosine similarity > 0.7, but not all features in a larger SAE have a close relative in a smaller SAE (but about ~65% do have a close equavalent at 2x scale up).
Yes! This is indeed a direction that we're also very interested in and currently working on.
As a sneak preview regarding the days of the week, we indeed find that one weekday feature in the 768-feature SAE, splits into the individual days of the week in the 49152-feature sae, for example Monday, Tuesday..
The weekday feature seems close to mean of the individual day features.
Interesting! I actually did a small experiment with this a while ago, but never really followed up on it.
I would be interested to hear about your theoretical work in this space, so sent you a DM :)
Thanks!
Yeah, I think that's fair and don't necessarily think that stitching multiple SAEs is a great way to move the pareto frontier of MSE/L0 (although some tentative experiments showed they might serve as a good initialization if retrained completely).
However, I don't think that low L0 should be a goal in itself when training SAEs as L0 mainly serves as a proxy for the interpretability of the features, by lack of good other feature quality metrics. As stitching features doesn't change the interpretability of the features, I'm not sure how useful/important the L0 metric still is in this context.
You might enjoy this classic: https://www.lesswrong.com/posts/9HSwh2mE3tX6xvZ2W/the-pyramid-and-the-garden