Experiments with an alternative method to promote sparsity in sparse autoencoders

Great work!

Did you ever run just the L0-approx & sparsity-frequency penalty separately? It's unclear if you're getting better results because the L0 function is better or because there are less dead features.

Also, a feature frequency of 0.2 is very large! 1/5 tokens activating is large even for positional (because your context length is 128). It'd be bad if the improved results are because polysemanticity is sneaking back in through these activations. Sampling datapoints across a range of activations should show where the meaning becomes polysemantic. Is it the bottom 10% (or 10% of max-activating example is my preferred method)

[-]Eoin Farrell2y90

Did you ever run just the L0-approx & sparsity-frequency penalty separately? It's unclear if you're getting better results because the L0 function is better or because there are less dead features.

Good point - this was also somewhat unclear to me. What I can say is that when I run with the L0-approx penalty only, without the sparsity frequency penalty, I either get lots of dead features (50% or more), with a substantially worse MSE (a factor of a few higher), similar to when I run with only an L1 penalty. When I run with the sparsity-frequency penalty and a standard L1 penalty (i.e. without L0-approx), I get models with a similar MSE and L0 a factor of ~2 higher than the SAEs discussed above.

Also, a feature frequency of 0.2 is very large! 1/5 tokens activating is large even for positional (because your context length is 128). It'd be bad if the improved results are because polysemanticity is sneaking back in through these activations. Sampling datapoints across a range of activations should show where the meaning becomes polysemantic. Is it the bottom 10% (or 10% of max-activating example is my preferred method)

Absolutely! A quick look at the 9 features with frequencies > 0.1 shows the following:

Feature #8684 (freq: 0.992) fires with large amplitude on all but the BOS token (should I remove this in training?)
Feature #21769 (freq: 0.627), 10217 (freq: 0.370) & 24409 (freq: 0.3372) are positional based, but possibly contain more info. The positional dependence of the activation strength for all non-zero activations is shown in the plot below for these three features. Here, the bottom 10% seem interpretable, at least the positional based info. Given the scatter in the plot, it looks like more info might be contained in the feature. Looking at max activations for a given position did not shed any further light. I don't know whether it's reasonable to expect GPT2-small to actually have & use features like this.
Feature #21014 (freq: 0.220) fires at the 2nd position in sentences, after new lines and full stops, and then has smaller activations for 3rd, 4th & 5th position after new lines and full stops (so the bottom 10% seem interpretable, i.e. they are further away from the start of a sentence)
Feature #16741 (freq: 0.171) unclear from the max/min activating examples, maybe polysemantic
Feature #12123 (freq: 0.127) fires after "the", "an", "a", again stronger for the token immediately after, and weaker for 2nd, 3rd, 4th positions after. Bottom 10% seem interpretable in this context, but again there are some exceptions, so I'm not completely sure.
Feature #22430 (freq: 0.127) fires after "," more strongly at the first position after "," and weaker for tokens at the 2nd, 3rd, 4th positions away from ",". The bottom 10% seem somewhat interpretable here, i.e. further after "," but there are exceptions so I'm not completely sure.
Feature #6061(freq: 0.109) fires on nouns, both at high and low activations.

While I think these interpretations seem reasonable, it seems likely that some of these SAE features are at least somewhat polysemantic. They might be improved by training the SAE longer (I trained on ~300M tokens for these SAEs).

I might make dashboards or put the SAE on Neuronpedia to be able to make a better idea of these and other features.

[-]Bogdan Ionut Cirstea2y40

There's also an entire literature of variations of [e.g. sparse or disentangled] autoencoders and different losses and priors that it might be worth looking at and that I suspect SAE interp people have barely explored; some of it literally decades-old. E.g. as a potential starting point https://lilianweng.github.io/posts/2018-08-12-vae/ and the citation trails to and from e.g. k-sparse autoencoders.

[-]Eoin Farrell2y30

Interesting, thanks for sharing! Are there specific existing ideas you think would be valuable for people to look at in the context of SAEs & language models, but that they are perhaps unaware of?

[-]faul_sname2y30

This is really cool!

I did some tests on random features for interpretability, and found them to be interpretable. However, one would need to do a detailed comparison with SAEs trained on an L1 penalty to properly understand whether this loss function impacts interpretability. For what it’s worth, the distribution of feature sparsities suggests that we should expect reasonably interpretable features.

One cheap and lazy approach is to see how many of your features have high cosine similarity with the features of an existing L1-trained SAE (e.g. "900 of the 2048 features detected by the -trained model had cosine sim > 0.9 with one of the 2048 features detected by the L1-trained model"). I'd also be interested to see individual examinations of some of the features which consistently appear across multiple training runs in the $L_{a p p r o x}^{0}$ -trained model but don't appear in an L1-trained SAE on the training dataset.

[-]Eoin Farrell2y80

Thanks!

One cheap and lazy approach is to see how many of your features have high cosine similarity with the features of an existing L1-trained SAE (e.g. "900 of the 2048 features detected by the -trained model had cosine sim > 0.9 with one of the 2048 features detected by the L1-trained model").

I looked at the cosine sims between the L1-trained reference model and one of my SAEs presented above and found:

2501 out of 24576 (10%) of the features detected by the $L_{a p p r o x}^{0}$ -trained model had cosine sim > 0.9 with one of the 24576 features detected by the L1-trained model.
7774 out of 24576 (32%) had cosine sim > 0.8
50% have cosine sim > 0.686

I'm not sure how to interpret these. Are they low/high? They appear to be roughly similar to if I compare between two of the $L_{a p p r o x}^{0}$ -trained SAEs.

I'd also be interested to see individual examinations of some of the features which consistently appear across multiple training runs in the $L_{a p p r o x}^{0}$ -trained model but don't appear in an L1-trained SAE on the training dataset.

I think I'll look more at this. Some summarised examples are shown in the response above.

[-]faul_sname2y20

The other baseline would be to compare one L1-trained SAE against another L1-trained SAE -- if you see a similar approximate "1/10 have cossim > 0.9, 1/3 have cossim > 0.8, 1/2 have cossim > 0.7" pattern, that's not definitive proof that both approaches find "the same kind of features" but it would strongly suggest that, at least to me.

^{^}

MSE computed by Joseph’s old definition for comparison purposes

Model^[1]	L0	MSE	# Dead Features
JB L1 reference	14.60	1.1e-3	3777
$λ_{0} = 3 \times 10^{- 5}$	19.34	7.0e-4	79
$λ_{0} = 3.5 \times 10^{- 5}$	16.94	7.4e-4	86
$λ_{0} = 5 \times 10^{- 5}$	13.76	7.8e-4	94
$λ_{0} = 7 \times 10^{- 5}$	10.95	8.7e-4	161
$λ_{0} = 9 \times 10^{- 5}$	9.27	9.3e-4	218

LESSWRONG
is fundraising!
LW

LESSWRONG
is fundraising!
LW

29

Experiments with an alternative method to promote sparsity in sparse autoencoders

29

29

Summary

Motivation

Approximations of the sparsity

Training the SAEs

Feature sparsity distributions

Comparison of training curves

Discussion

Acknowledgements