L0 is not a neutral hyperparameter

by chanind, Adrià Garriga-alonso
19th Jul 2025

Comments

StefanHex

Great post, thank you for running these experiments!

L0 should not be viewed as an arbitrary hyperparameter. We should assume that there is a "correct" L0, and we should aim to find it.

I agree with this sentiment! (Insofar as our goal is to extract the "true features" from superposition.[1])

Edit: I mixed up L0 & dictionary size, the questions below do not belong to this post

What do you think of scenarios with near infinitely many features, descending in importance / frequency like a power law (feature completeness section of Templeton et al.)? What should our goal be here? Do you think Multi-L0 SAEs could handle the low-importance tail? Or would the LLM just learn a much smaller subset in the first place, not capturing the low-importance tail?

Re case 3 experiments: Are the extra SAE features your SAE learned dead, in the sense of having a small magnitude? Generally I would expect that in practice, those features should be dead (if allowed by architecture) or used for something else. In particular, if your dataset had correlations, I would expect them to  go off and do feature absorption (Chanin, Till, etc.).

  1. ^

    Rather than e.g. using SAEs as an ad-hoc training data attribution (Marks et al.) or spurious correlation discovery (Bricken et al.) method.

chanind

What do you think of scenarios with near infinitely many features, descending in importance / frequency like a power law (feature completeness section of Templeton et al.)? What should our goal be here? Do you think Multi-L0 SAEs could handle the low-importance tail? Or would the LLM just learn a much smaller subset in the first place, not capturing the low-importance tail

I view SAE width and SAE L0 as two separate parameters we should try to get right if we can. In toy models, similar failure modes to what we see with low L0 SAEs also happen if the SAE is narrower than the number of true features, in that the SAE tries to "cheat" and get better MSE loss by mixing correlated features together. If we can't make the SAE as wide as the number of true features, I'd still expect wider SAEs to learn cleaner features than narrower SAEs. But then wider SAEs make feature absorption a lot worse, so that's a problem. I don't think multi-L0 SAEs would help or hurt in this case though - capturing near-infinite features requires a near-infinite width SAE regardless of the L0.

For setting the correct L0 for a given SAE width, I don't think there's a trade-off with absorption - getting the L0 correct should always improve things. I view the feature completeness stuff as also being somewhat separate from the choice of L0, since L0 is about how many features are active at the same time regardless of the total number of features. Even if there's infinite features, there's still hopefully only a small / finite number of features active for any given input.

Re case 3 experiments: Are the extra SAE features your SAE learned dead, in the sense of having a small magnitude? Generally I would expect that in practice, those features should be dead (if allowed by architecture) or used for something else. In particular, if your dataset had correlations, I would expect them to  go off and do feature absorption (Chanin, Till, etc.).

In all the experiments across all 3 cases, the SAEs have the same width (20), so the higher L0 SAEs don't learn any more features than lower L0 SAEs.

We looked into what happens if SAEs are wider than the number of true features in toy models in an earlier post, and found exactly what you suspect: the SAE starts inventing arbitrary combo latents (e.g. a "red triangle" latent in addition to "red" and "triangle" latents), or creating duplicate latents, or just killing off some of the extra latents.

For both L0 and width, it seems like giving the SAE more capacity than it needs to model the underlying data results in the SAE misusing the extra capacity and finding degenerate solutions.

StefanHex

Sorry for my confused question, I totally mixed up dictionary size & L0! Thank you for the extensive answer, and the link to that earlier paper!


L0 is not a neutral hyperparameter


When we train Sparse Autoencoders (SAEs), the sparsity of the SAE, called L0 (the number of latents that fire on average), is typically treated as an arbitrary design choice. Papers introducing SAE architectures include plots of L0 vs reconstruction, as if any choice of L0 were equally valid.

However, recent work that goes beyond just calculating sparsity vs reconstruction curves shows the same trend: low L0 SAEs learn the wrong features [1][2]. 

In this post, we investigate this phenomenon in a toy model with correlated features and show the following:

  • If the L0 of the SAE is lower than the true L0 of the underlying features, the SAE will "cheat" by learning broken mixtures of correlated features, achieving a better MSE loss than a correct SAE would. This can be viewed as a form of feature hedging[3]. The effect is worse the lower the L0 of the SAE, and affects high-frequency features more severely.
  • If the L0 of the SAE is higher than the true L0 of the underlying features, the SAE will find degenerate solutions, but does not engage in feature hedging. This is worse the higher the L0 of the SAE.
  • If we mix together a too-low-L0 loss with a too-high-L0 loss, we can learn the true features (like a Matryoshka SAE[2], except with multiple L0s instead of multiple widths). This may provide a way to learn a correct SAE despite not knowing the "true L0" of the underlying training data.

The phenomenon of poor performance due to incorrect L0 can be viewed through the same lens as feature hedging: if we do not give the SAE enough resources, in terms of L0 or width, to reconstruct the input, the SAE will find ways to cheat by learning incorrect features. In light of this, we feel that L0 should not be viewed as an arbitrary hyperparameter. We should assume that there is a "correct" L0, and we should aim to find it.

In the remainder of this post, we will walk through our experiments and results. Code is available in this Colab Notebook.

Toy model setup

We set up a toy model with 20 mutually-orthogonal true features f0 through f19, where features f1 through f19 are positively correlated with f0. For each of these features, we assign a base firing probability Pi. Feature fi fires with probability Pi if f0 is firing, and with probability 0.5∗Pi if f0 is not firing. Thus, each feature can fire on its own, but is more likely to fire if f0 is also firing. Feature f0 fires with probability 0.35, and P1 through P19 decrease linearly from 0.74 to 0.19, so that f0 is more likely to fire overall than f1, f1 is more likely to fire than f2, etc... To keep everything simple, each feature fires with mean magnitude 1.0 and stddev 0.15. The stddev is needed to keep the SAE from engaging in feature absorption[4], as studying absorption is not the goal of this exploration.

Firing probabilities Pi for features f1 through f19 in our toy model.

These probabilities were chosen so the true L0 (the average number of features active per sample) is roughly 5.
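
For concreteness, a data generator along these lines might look like the sketch below. This is a hedged illustration rather than the exact code from the Colab notebook; in particular, treating the 20 features as the standard basis of a 20-dimensional input space and using a simple linear probability schedule are assumptions on our part.

```python
import torch

N_FEATURES = 20
P0 = 0.35                                              # firing probability of f0
P_BASE = torch.linspace(0.74, 0.19, N_FEATURES - 1)    # base probabilities P1..P19 (assumed schedule)

def sample_batch(batch_size: int) -> torch.Tensor:
    # f0 fires independently with probability 0.35
    f0_fires = (torch.rand(batch_size, 1) < P0).float()
    # f1..f19 fire with probability Pi when f0 fires, and 0.5 * Pi otherwise
    probs = P_BASE * (0.5 + 0.5 * f0_fires)             # shape (batch, 19)
    rest_fires = (torch.rand(batch_size, N_FEATURES - 1) < probs).float()
    fires = torch.cat([f0_fires, rest_fires], dim=1)
    # firing magnitudes are drawn from Normal(mean=1.0, std=0.15)
    magnitudes = 1.0 + 0.15 * torch.randn(batch_size, N_FEATURES)
    # features are assumed to be the standard basis of R^20
    return fires * magnitudes
```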

SAE setup

We use a Global BatchTopK SAE[5] with the same number of latents (20) as the number of true features in our toy model. A BatchTopK SAE allows us to control the L0 of the SAE directly, so we can study the effect of L0 in isolation from everything else. The SAE is trained on 25 million samples generated from the toy model.
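
As a reference point, a Global BatchTopK SAE can be sketched roughly as below. This is a minimal illustration of the standard encoder/decoder form, not our exact implementation; the key detail is that the top k × batch_size pre-activations are kept across the whole batch rather than per sample.

```python
import torch
import torch.nn as nn

class BatchTopKSAE(nn.Module):
    def __init__(self, d_input: int, n_latents: int, k: float):
        super().__init__()
        self.k = k                                            # target average L0
        self.W_enc = nn.Parameter(torch.randn(d_input, n_latents) * 0.02)
        self.b_enc = nn.Parameter(torch.zeros(n_latents))
        self.W_dec = nn.Parameter(self.W_enc.data.T.clone())  # (n_latents, d_input)
        self.b_dec = nn.Parameter(torch.zeros(d_input))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        pre_acts = torch.relu((x - self.b_dec) @ self.W_enc + self.b_enc)
        # BatchTopK: keep the k * batch_size largest pre-activations across the
        # whole batch, so L0 equals k on average but can vary per sample.
        n_keep = int(self.k * x.shape[0])
        threshold = pre_acts.flatten().topk(n_keep).values.min()
        acts = torch.where(pre_acts >= threshold, pre_acts, torch.zeros_like(pre_acts))
        return acts @ self.W_dec + self.b_dec                 # reconstruction of x
```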

Case 1: SAE L0 = Toy Model L0

We begin by setting the L0 of the SAE to 5 to match the L0 of the underlying toy model. As we would hope, the SAE perfectly learns the true features.

When the SAE L0 matches the true L0, the SAE recovers the underlying features perfectly.

 Case 2: SAE L0 < Toy Model L0

Next, we set the L0 of the SAE to 4, just below the correct L0 of 5. The results are shown below:

When the SAE L0 (4) is lower than the true L0 (5), the SAE "cheats" by merging f0 into all other latents, and no longer directly represents f0 in its own latent. The higher-frequency latents also appear more broken than lower-frequency latents.

We now see clear signs of hedging: the SAE mixes f0 into all other latents to avoid representing it with its own dedicated latent. In addition, the latents tracking high-frequency features (features 1-5) appear much more broken than those tracking lower-frequency features.

Cheating improves MSE loss

Why would the SAE do this? Why not still learn the correct latents, and just fire 4 of them instead of 5? Below, we compare the mean MSE loss of the correct SAE from Case 1, with only the top 4 instead of top 5 latents selected, against the broken SAE we trained in Case 2.

                                                        MSE loss
Case 1 (correct) SAE, trained with k=5 and cut to k=4   0.53
Case 2 (broken) SAE, trained with k=4                   0.42

Sadly, the broken behavior we see above achieves better MSE loss than correctly learning the underlying features. We are actively incentivizing the SAE to engage in feature hedging and learn broken latents when the SAE L0 is too low.
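
A sketch of this comparison is below, reusing the hypothetical BatchTopKSAE and sample_batch helpers from the earlier sketches; sae_case1 and sae_case2 stand in for the SAEs trained in Cases 1 and 2.

```python
def mean_mse(sae: BatchTopKSAE, k: float, n_batches: int = 100, batch_size: int = 4096) -> float:
    old_k, sae.k = sae.k, k              # temporarily override the SAE's L0 target
    total = 0.0
    with torch.no_grad():
        for _ in range(n_batches):
            x = sample_batch(batch_size)
            total += torch.mean((sae(x) - x) ** 2).item()
    sae.k = old_k
    return total / n_batches

# mean_mse(sae_case1, k=4)  # correct SAE cut to k=4    (~0.53 in the table above)
# mean_mse(sae_case2, k=4)  # broken SAE trained at k=4 (~0.42 in the table above)
```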

Lowering SAE L0 even more

Next, we lower the L0 of the SAE further, to 3. Results are shown below:

When the SAE L0 (3) is much lower than the true L0 (5), the SAE breaks even more severely. The magnitude of hedging is larger, and all higher-frequency latents are now completely broken.

Lowering L0 further to 3 makes everything far worse. Although it's hard to tell from the plot, the magnitude of hedging (the extent to which f0 is mixed into all other latents) is higher than with L0=4, and now all latents tracking higher-frequency features (features 1-10) are completely broken.

Case 3: SAE L0 > Toy Model L0

What happens if we set the SAE L0 too high? We now set the L0 of the SAE to 6. Results are shown below:

When the SAE L0 (6) is higher than the true L0 (5), the SAE learns some incorrect latents, but there is no sign of hedging.

We see that the SAE learns some slightly broken latents, but there is no sign of systematic hedging. Instead, it seems that too high an L0 leaves multiple ways to achieve perfect reconstruction loss, so we should not be surprised that the SAE settles into an imperfect solution.

Increasing L0 even more

Next, we see what happens when we increase the SAE L0 even further. We set SAE L0 to 8. Results are shown below:

When the SAE L0 (8) is much higher than the true L0 (5), the SAE learns much more broken latents, but there is still no sign of hedging.

We now see the SAE is learning far worse latents than before, with most latents being completely broken. However, we still don't see any sign of systematic hedging like we saw with low L0 SAEs.

Watching an SAE break in real-time

Below, we record a training run where we slowly decrease K from 10 to 2. The true L0 (and thus the correct K) is 5 for this model. When K is too high, the SAE can find degenerate solutions, and low-frequency latents (latents 10-19) are worst affected. When K=5, the SAE learns the true features perfectly. When K drops too low (below 5), feature hedging emerges, and the high-frequency latents (0-10) begin breaking.
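
One simple way to run such an experiment is a linear schedule for K, as in the sketch below (illustrative only; the exact schedule used for the recording is an assumption).

```python
def k_schedule(step: int, total_steps: int, k_start: float = 10.0, k_end: float = 2.0) -> float:
    # linearly anneal K from k_start down to k_end over training
    frac = step / max(total_steps - 1, 1)
    return k_start + frac * (k_end - k_start)

# inside the training loop:
# for step in range(total_steps):
#     sae.k = k_schedule(step, total_steps)
#     loss = torch.mean((sae(x) - x) ** 2)
#     ...optimizer step...
```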

Mixing high and low L0 penalties together

Clearly, if we knew the correct L0 of the underlying data, the best thing to do would be to train at that L0. In reality, we do not yet have a way to find the true L0, but we find that we can still improve things by mixing together two MSE losses during training: one using a low L0 and another using a high L0.

This is conceptually similar to how Matryoshka SAEs[2] work. In a Matryoshka SAE, multiple losses are summed using different width prefixes. Here, we sum two losses using different L0s:

L = Lk1 + λ·Lk2

In this formulation, Lk1 is the MSE loss term using the lower L0, and Lk2 is the MSE loss term using the higher L0. We add a coefficient λ so we can control the relative balance of these two losses.
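
In code, this combined loss can be sketched as follows, reusing the hypothetical BatchTopKSAE from above; the function and argument names (k1, k2, lam) are illustrative, not taken from our notebook.

```python
def multi_l0_loss(sae: BatchTopKSAE, x: torch.Tensor, k1: float, k2: float, lam: float) -> torch.Tensor:
    old_k = sae.k
    sae.k = k1
    loss_low = torch.mean((sae(x) - x) ** 2)    # MSE with the lower L0 (k1)
    sae.k = k2
    loss_high = torch.mean((sae(x) - x) ** 2)   # MSE with the higher L0 (k2)
    sae.k = old_k
    return loss_low + lam * loss_high

# e.g. multi_l0_loss(sae, x, k1=4, k2=8, lam=20.0)
```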

Below, we train an SAE with k1=4, k2=8, λ=1:

Multi-L0 SAE with k1=4, k2=8, λ=1.

This looks a lot better than our Case 2 SAE - we still see a dedicated latent for f0, but there's clearly still some hedging going on. Let's try increasing λ to 20:

Multi-L0 SAE with k1=4, k2=8, λ=20.

We've now perfectly recovered the true features again! It seems like the low-L0 loss helps keep the high-L0 loss from learning a degenerate solution, while the high-L0 loss keeps the low-L0 loss from engaging in hedging.

 

  1. ^

    Kantamneni, Subhash, et al. "Are Sparse Autoencoders Useful? A Case Study in Sparse Probing." Forty-second International Conference on Machine Learning.

  2. ^

    Bussmann, Bart, et al. "Learning Multi-Level Features with Matryoshka Sparse Autoencoders." Forty-second International Conference on Machine Learning.

  3. ^

    Chanin, David, Tomáš Dulka, and Adrià Garriga-Alonso. "Feature Hedging: Correlated Features Break Narrow Sparse Autoencoders." arXiv preprint arXiv:2505.11756 (2025).

  4. ^

    Chanin, David, et al. "A is for absorption: Studying feature splitting and absorption in sparse autoencoders." arXiv preprint arXiv:2409.14507 (2024).

  5. ^

    Bussmann, Bart, Patrick Leask, and Neel Nanda. "BatchTopK Sparse Autoencoders." NeurIPS 2024 Workshop on Scientific Methods for Understanding Deep Learning.