L0 is not a neutral hyperparameter

Adrià Garriga-alonso

Great post, thank you for running these experiments!

L0 should not be viewed as an arbitrary hyperparameter. We should assume that there is a "correct" L0, and we should aim to find it.

I agree with this sentiment! (Insofar as our goal is to extract the "true features" from superposition.^[1])

Edit: I mixed up L0 & dictionary size, the questions below do not belong to this post

~~What do you think of scenarios with near infinitely many features, descending in importance / frequency like a power law (feature completeness section of~~ ~~Templeton et al.~~)? What should our goal be here? Do you think Multi-L0 SAEs could handle the low-importance tail? Or would the LLM just learn a much smaller subset in the first place, not capturing the low-importance tail?

Re case 3 experiments: Are the extra SAE features your SAE learned dead, in the sense of having a small magnitude? Generally I would expect that in practice, those features should be dead (if allowed by architecture) or used for something else. In particular, if your dataset had correlations, I would expect them to go off and do feature absorption (~~Chanin~~, ~~Till, etc.).~~

^{^}
Rather than e.g. using SAEs as an ad-hoc training data attribution (Marks et al.) or spurious correlation discovery (Bricken et al.) method.

[-]chanind3mo32

What do you think of scenarios with near infinitely many features, descending in importance / frequency like a power law (feature completeness section of Templeton et al.)? What should our goal be here? Do you think Multi-L0 SAEs could handle the low-importance tail? Or would the LLM just learn a much smaller subset in the first place, not capturing the low-importance tail

I view SAE width and SAE L0 as two separate parameters we should try to get right if we can. In toy models, similar failure modes to what we see with low L0 SAEs also happen if the SAE is narrower than the number of true features, in that the SAE tries to "cheat" and get better MSE loss by mixing correlated features together. If we can't make the SAE as wide as the number of true features, I'd still expect wider SAEs to learn cleaner features to narrower SAEs. But then wider SAEs make feature absorption a lot worse, so that's a problem. I don't think multi-L0 SAEs would help or hurt in this case though - capturing near-infinite features requires a near-infinite width SAE regardless of the L0.

For setting the correct L0 for a given SAE width, I don't think there's a trade-off with absorption - getting the L0 correct should always improve things. I view the feature completeness stuff as also being somewhat separate from the choice of L0, since L0 is about how many features are active at the same time regardless of the total number of features. Even if there's infinite features, there's still hopefully only a small / finite number of features active for any given input.

Re case 3 experiments: Are the extra SAE features your SAE learned dead, in the sense of having a small magnitude? Generally I would expect that in practice, those features should be dead (if allowed by architecture) or used for something else. In particular, if your dataset had correlations, I would expect them to go off and do feature absorption (Chanin, Till, etc.).

In all the experiments across all 3 cases, the SAEs have the same width (20), so the higher L0 SAEs don't learn any more features than lower L0 SAEs.

We looked into what happens if SAEs are wider than the number of true features in toy models in an earlier post, and found exactly what you suspect: the SAE starts inventing arbitrary combo latents (e.g. a "red triangle" latent in addition to "red" and "triangle" latents), or creating duplicate latents, or just killing off some of the extra latents.

For both L0 and width, it seems like giving the SAE more capacity than it needs to model the underlying data results in the SAE misusing the extra capacity and finding degenerate solutions.

[-]StefanHex3mo20

Sorry for my confused question, I totally mixed up dictionary size & L0! Thank you for the extensive answer, and the link that that earlier paper!

^{^}

Kantamneni, Subhash, et al. "Are Sparse Autoencoders Useful? A Case Study in Sparse Probing." Forty-second International Conference on Machine Learning.

^{^}

Bussmann, Bart, et al. "Learning Multi-Level Features with Matryoshka Sparse Autoencoders." Forty-second International Conference on Machine Learning.

^{^}

Chanin, David, Tomáš Dulka, and Adrià Garriga-Alonso. "Feature Hedging: Correlated Features Break Narrow Sparse Autoencoders." arXiv preprint arXiv:2505.11756 (2025).

^{^}

Chanin, David, et al. "A is for absorption: Studying feature splitting and absorption in sparse autoencoders." arXiv preprint arXiv:2409.14507 (2024).

^{^}

Bussmann, Bart, Patrick Leask, and Neel Nanda. "BatchTopK Sparse Autoencoders." NeurIPS 2024 Workshop on Scientific Methods for Understanding Deep Learning.

	MSE loss
Case 1 (correct) SAE, trained with k=5 and cut to k=4	0.53
Case 2 (broken) SAE, trained with k=4	0.42

LESSWRONG
LW

LESSWRONG
LW

24

L0 is not a neutral hyperparameter

24

24

Toy model setup

SAE setup

Case 1: SAE L0 = Toy Model L0

Case 2: SAE L0 < Toy Model L0

Cheating improves MSE loss

Lowering SAE L0 even more

Case 3: SAE L0 > Toy Model L0

Increasing L0 even more

Watching an SAE break in real-time

Mixing high and low L0 penalties together