When working with SAE features, I've usually relied on a linear intuition: a feature firing with twice the strength has about twice the "impact" on the model. But while playing with an SAE trained on the final layer, I was reminded that the actual direct impact on the relative token probabilities grows exponentially with activation strength. While a feature's additive contribution to the logits is indeed linear in its activation strength, the ratio of the probabilities of two competing tokens equals the exponential of the logit difference: P(A)/P(B) = exp(logit(A) − logit(B)).
If we have a feature that boosts logit(A) but not logit(B) and we multiply its activation strength by a factor of 5.0, this doesn't 5x its effect on the probability ratio P(A)/P(B), but rather raises its effect to the 5th power. If this feature previously made token A three times as likely as token B, it now makes A 3^5 = 243 times as likely! This might partly explain why the lower activations of a feature are often less interpretable than the top activations: their direct impact on the relative token probabilities is exponentially smaller.
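A quick numeric check of the claim above (the logit values and the per-unit boost `delta` are made-up for illustration):

```python
import math

# Hypothetical setup: a feature adds +delta to logit(A) per unit of
# activation and leaves logit(B) untouched.
base_logit_a, base_logit_b = 0.0, 0.0
delta = math.log(3)  # at activation 1.0 the feature makes A 3x as likely as B

def prob_ratio(activation):
    """P(A)/P(B) = exp(logit(A) - logit(B))."""
    logit_a = base_logit_a + activation * delta
    return math.exp(logit_a - base_logit_b)

print(prob_ratio(1.0))  # ~3: A is 3x as likely as B
print(prob_ratio(5.0))  # ~243 = 3^5: 5x the activation raises the ratio to the 5th power
```

Scaling the activation multiplies the logit difference, so the probability ratio gets exponentiated rather than scaled.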
Note that this only holds for the direct 'logit lens'-like effect of a feature, so this intuition applies mostly to features in the final layers of a model; the impact of earlier features is probably mostly mediated by their effect on later layers.
Interesting idea, I had not considered this approach before!
I'm not sure this would solve feature absorption though. Thinking about the "Starts with E-" and "Elephant" example: if the "Elephant" latent absorbs the "Starts with E-" latent, the "Starts with E-" feature will develop a hole and not activate anymore on the input "elephant". After the latent is absorbed, "Starts with E-" wouldn't be in the list to calculate cumulative losses for that input anymore.
Matryoshka works because it forces the early-indexed latents to reconstruct well using only themselves, whether or not later latents activate. I think this pressure is key to stopping the later-indexed latents from stealing the job of the early-indexed ones.
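To make that pressure concrete, here is a minimal numpy sketch of a Matryoshka-style reconstruction loss: the loss is summed over nested prefixes of the latent dictionary, so the early-indexed latents must reconstruct the input on their own. All names, dimensions, and prefix sizes are hypothetical, and biases are omitted for brevity:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_sae = 8, 16
prefix_sizes = [4, 8, 16]  # nested dictionary sizes (illustrative)

W_enc = rng.normal(size=(d_model, d_sae))
W_dec = rng.normal(size=(d_sae, d_model))

def matryoshka_loss(x):
    """Sum of reconstruction losses, each using only the first m latents."""
    acts = np.maximum(x @ W_enc, 0.0)  # ReLU encoder
    total = 0.0
    for m in prefix_sizes:
        recon = acts[:m] @ W_dec[:m]   # decode with only the first m latents
        total += np.sum((x - recon) ** 2)
    return total
```

Because the smallest prefix is scored without any help from later latents, a later latent that "absorbs" an early feature's job would leave the small-prefix reconstruction visibly worse.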
Although the code has the option to add an L1 penalty, in practice I set l1_coeff to 0 in all my experiments (see main.py for all hyperparameters).
I haven't actually tried this, but recently heard about focusbuddy.ai, which might be a useful ai assistant in this space.
Great work! I have been working on something very similar and will publish my results here some time next week, but can already give a sneak peek:
The SAEs here were only trained on 100M tokens (one third of the TinyStories[11:1] dataset). The language model was trained for 3 epochs on the 300M-token TinyStories dataset. It would be good to validate these results on more 'real' language models and to train SAEs on much more data.
I can confirm that on Gemma-2-2B Matryoshka SAEs dramatically improve the absorption score on the first-letter task from Chanin et al. as implemented in SAEBench!
Is there a nice way to extend the Matryoshka method to top-k SAEs?
Yes! My experiments with Matryoshka SAEs are using BatchTopK.
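For reference, the core idea of BatchTopK is to keep the top k × batch_size activations across the whole batch rather than a fixed k per sample. A minimal numpy sketch (function name and shapes are my own, not from the post's code):

```python
import numpy as np

def batch_topk(acts, k):
    """Zero all but the k * batch_size largest activations across the batch.

    Unlike per-sample top-k, individual samples may keep more or fewer
    than k latents, as long as the batch-level budget is respected.
    """
    n_keep = k * acts.shape[0]
    flat = acts.flatten()
    if n_keep < flat.size:
        # n_keep-th largest value across the whole batch
        threshold = np.partition(flat, -n_keep)[-n_keep]
        acts = np.where(acts >= threshold, acts, 0.0)
    return acts
```

Combining this with Matryoshka is then just a matter of applying the batch-level sparsity step before computing the prefix reconstruction losses.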
Are you planning to continue this line of research? If so, I would be interested to collaborate (or otherwise at least coordinate on not doing duplicate work).
Sing along! https://suno.com/song/35d62e76-eac7-4733-864d-d62104f4bfd0
You might enjoy this classic: https://www.lesswrong.com/posts/9HSwh2mE3tX6xvZ2W/the-pyramid-and-the-garden
On the origins of "you can just do things"
About once every 15 minutes, someone tweets "you can just do things". It seems like a rather powerful and empowering meme and I was curious where it came from, so I did some research into its origins. Although I'm not very satisfied with what I was able to reconstruct, here are some of the things that I found:
In 1995, Steve Jobs gives the following quote in an interview:
Although he says nothing close to the phrase "you can just do things", I think it'd be a fair summary of his message.
Fast forward to October 2020, when Twitter user nosilverv tweets:
Although this is the first tweet I've found[1] that contains the exact phrasing, it doesn't seem to quite match the current sentiment. It seems to imply that you can do random things and shouldn't let bad feelings stop you, rather than the "high agency" framing it has today.
A year later, in October 2021, @Neel Nanda publishes the blog post What's Stopping You?. I think this post points exactly in the "you can just do things" direction, and contains the following quote:
So close, just missing the "just"!
Then in February 2022, substacker crypticdefinitions publishes a post titled You can just do things:
From here on, in the first half of 2022, some people on Twitter start adopting the phrase, such as @AskYatharth and @m_ashcroft. The phrase seems to slowly but steadily gain popularity in TPOT over 2023-2024. In January 2024, Cate Hall writes a blog post on "How to be more agentic" and announces the working title of her book, "You can just do things".
In December 2024 Sam Altman tweets "you can just do things" with 25K likes and the meme seems to break all containment.
Unfortunately, the Twitter search function is completely broken and useless, so there may be earlier tweets I've not been able to find.