Sparsity is the enemy of feature extraction (ft. absorption)

by 7vik, chanind, Adrià Garriga-alonso
3rd May 2025
8 min read

Sparse Autoencoders (and other related feature extraction tools) often optimize for sparsity to extract human-interpretable latent representations from a model's activation space. We show analytically that sparsity naturally leads to feature absorption in a simplified untied SAE, and discuss how this makes SAEs less trustworthy for AI safety, along with some ongoing efforts to fix this. This might be obvious to people working in the field, but we ended up writing a proof sketch, so we're putting it out here. Produced as part of the ML Alignment & Theory Scholars Program - Winter 2024-25 Cohort.

 

The dataset (a distribution with feature hierarchy)

In this proof, we consider a dataset $D$ with points sampled to exhibit features from a set of features $F = \{f_1, f_2, f_3, \dots, f_d\}$. In particular, we will consider two features $(f_1, f_2)$ that follow the hierarchy $f_2 \subset f_1$ (think $f_2$ = elephant and $f_1$ = animal, for instance), where the presence of $f_2$ implies the presence of $f_1$.

Hierarchy in human-interpretable features is prevalent (and hard to study in LLMs). While other, unrelated features also exist, for $f_1$ and $f_2$ we can partition the probability of a datapoint across four combinations:

| Features | $f_1$ | $\neg f_1$ |
|---|---|---|
| $f_2$ | $p_{11}$ | $p_{01}$ |
| $\neg f_2$ | $p_{10}$ | $p_{00}$ |

So these are the individual probabilities of a datapoint eliciting these combinations of features:

  • $p_{11} \equiv p_{f_1, f_2}$ (both features present; think elephant, which implies animal)
  • $p_{10} \equiv p_{f_1, \neg f_2}$ (only $f_1$ present; think cat or dog)
  • $p_{01} \equiv p_{\neg f_1, f_2}$ (only $f_2$ present, which is zero because of the hierarchy)
  • $p_{00} \equiv p_{\neg f_1, \neg f_2}$ (neither feature present; maybe talking about volcanoes)

Each feature $f \in \mathbb{R}^d$ is a vector with unit norm, and we assume that all features are mutually orthogonal, so $f_a \cdot f_b = 0$ for all distinct $f_a, f_b \in F$. Each activation $h \in \mathbb{R}^d$ in the model's residual stream is a sum of all active features.
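
As a concrete picture of this setup, here is a minimal sampling sketch in numpy; the probabilities `p11`, `p10`, `p00` and the dimension `d` are placeholder values, and the orthogonal unit-norm features are taken to be standard basis vectors.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16                      # residual stream dimension
F = np.eye(d)               # d mutually orthogonal unit-norm features
f1, f2 = F[0], F[1]         # parent (animal) and child (elephant) features

# placeholder probabilities for the feature combinations; p01 = 0 by the hierarchy
p11, p10, p00 = 0.2, 0.3, 0.5

def sample_activation():
    """Draw one residual-stream activation h as the sum of its active features."""
    case = rng.choice(["both", "parent_only", "neither"], p=[p11, p10, p00])
    if case == "both":          # elephant, which implies animal
        return f1 + f2
    if case == "parent_only":   # some other animal, e.g. cat or dog
        return f1
    return np.zeros(d)          # neither feature, e.g. talking about volcanoes

dataset = np.stack([sample_activation() for _ in range(1000)])
```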

Training a Sparse Autoencoder (SAE)

Given a language model with residual stream activations $h \in \mathbb{R}^d$, the sparse autoencoder learns a mapping $f_\phi$ such that $\hat{h} = f_\phi(h)$ reconstructs $h$. The total loss consists of a reconstruction term $L_{\text{rec}}$, which minimizes the squared error, and a sparsity term $L_{\text{sp}}$, which enforces sparsity via an $L_p$-norm penalty on the latent activations $z$. The model parameters $\phi$ are optimized via gradient descent to minimize the complete loss:

$$L = L_{\text{rec}} + \lambda L_{\text{sp}}$$

where $L_{\text{rec}} = \mathbb{E}_{x \sim D}\big[\lVert \hat{h} - h \rVert_2^2\big]$ (the reconstruction loss) and $L_{\text{sp}} = \mathbb{E}_{x \sim D}\big[\lVert z \rVert_p\big]$ (the sparsity loss).

The SAE computes latent activations $z = \mathrm{ReLU}(W_e h)$ and reconstructs the input as $\hat{h} = W_d z$.

We consider encoders and decoders without biases for simplicity. We focus on the SAE latents related to the two features $f_1$ and $f_2$, and show that absorption does not hurt reconstruction but does increase the sparsity of the latents, so optimizing for sparsity pushes toward higher absorption.
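
For concreteness, here is a minimal sketch of this bias-free SAE and its loss in PyTorch; the class name `TinySAE` and the hyperparameters `lam` and `p` are our own placeholders, not anything from the post.

```python
import torch
import torch.nn as nn

class TinySAE(nn.Module):
    """Bias-free untied SAE: z = ReLU(W_e h), h_hat = W_d z."""
    def __init__(self, d_model: int, n_latents: int):
        super().__init__()
        self.W_e = nn.Linear(d_model, n_latents, bias=False)  # encoder
        self.W_d = nn.Linear(n_latents, d_model, bias=False)  # decoder

    def forward(self, h):
        z = torch.relu(self.W_e(h))
        return self.W_d(z), z

def sae_loss(h, h_hat, z, lam=1e-3, p=1.0):
    rec = (h_hat - h).pow(2).sum(-1).mean()           # L_rec: squared reconstruction error
    sp = z.abs().pow(p).sum(-1).pow(1.0 / p).mean()   # L_sp: Lp norm of the latent activations
    return rec + lam * sp
```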

Evaluating the SAE loss under δ-absorption

We use $z_i$ to denote the hidden activation for latent $i$, $e_i$ to denote the SAE encoder direction for latent $i$, and $d_i$ to denote the SAE decoder direction for latent $i$. We assume the decoder is linear, that is, the reconstruction $\hat{h}$ is a linear combination of latents: $\hat{h} = \sum_i d_i z_i$. Without loss of generality, we assume that the first two latents ($z_1$ and $z_2$) track $f_1$ and $f_2$.

We define $\delta$-absorption as part-way between no absorption ($\delta = 0$) and full absorption ($\delta = 1$). Thus, $e_1 = f_1 - \delta f_2$, $e_2 = f_2$, $d_1 = f_1$, and $d_2 = f_2 + \delta f_1$, and we look at the reconstruction and sparsity losses under varying amounts of absorption.
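
In code, the $\delta$-absorbed encoder and decoder directions look like this (a sketch using the same standard-basis features as before; `delta` is a free parameter in $[0, 1]$):

```python
import numpy as np

d = 16
f1, f2 = np.eye(d)[0], np.eye(d)[1]   # orthogonal unit-norm parent/child features
delta = 0.6                           # amount of absorption, between 0 and 1

e1, e2 = f1 - delta * f2, f2          # encoder directions for latents 1 and 2
d1, d2 = f1, f2 + delta * f1          # decoder directions for latents 1 and 2

def encode(h):
    """z_i = ReLU(e_i . h) for the two latents we track."""
    return np.maximum(0.0, np.array([e1 @ h, e2 @ h]))

def decode(z):
    """h_hat = sum_i d_i z_i, restricted to these two latents."""
    return z[0] * d1 + z[1] * d2
```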

Reconstruction under δ-absorption

Case 1: $h = f_1$ (parent feature only)

In this case, the parent feature fires on its own, so the SAE just needs to reconstruct $f_1$.

$$z_1 = \mathrm{ReLU}\big((f_1 - \delta f_2) \cdot h\big) = \mathrm{ReLU}\big((f_1 - \delta f_2) \cdot f_1\big) = 1$$

The first latent fires with magnitude 1, since $f_1 \cdot f_1 = 1$ and $f_1 \cdot f_2 = 0$ (as $f_1 \perp f_2$).

$$z_2 = \mathrm{ReLU}(f_2 \cdot h) = \mathrm{ReLU}(f_2 \cdot f_1) = 0$$

The second latent fires with magnitude 0 since $f_2 \cdot f_1 = 0$.

$$\hat{h}_1 = z_1 f_1 = f_1, \qquad \hat{h}_2 = z_2 (f_2 + \delta f_1) = 0, \qquad \hat{h} = \hat{h}_1 + \hat{h}_2 = f_1$$

The decoder output is thus just $f_1$, perfectly reconstructing the input.

Case 2: $h = f_1 + f_2$ (parent and child together)

In this case, the parent and child fire together, so the SAE needs to reconstruct the sum of the parent and child features, $f_1 + f_2$.

$$z_1 = \mathrm{ReLU}\big((f_1 - \delta f_2) \cdot h\big) = \mathrm{ReLU}\big((f_1 - \delta f_2) \cdot (f_1 + f_2)\big) = 1 - \delta$$

The first latent fires with magnitude $1 - \delta$ since $f_1 \cdot f_1 = 1$ and $f_2 \cdot f_2 = 1$, but $f_1 \cdot f_2 = 0$ as $f_1 \perp f_2$. For the second latent, $z_2 = \mathrm{ReLU}(f_2 \cdot h) = \mathrm{ReLU}\big(f_2 \cdot (f_1 + f_2)\big) = 1$.

The second latent fires with magnitude 1 since $f_2$ is present in $h$.

$$\hat{h}_1 = z_1 f_1 = (1 - \delta) f_1, \qquad \hat{h}_2 = z_2 (f_2 + \delta f_1) = f_2 + \delta f_1, \qquad \hat{h} = \hat{h}_1 + \hat{h}_2 = f_1 + f_2$$

The decoder output sums to $f_1 + f_2$, again perfectly reconstructing the input.

Case 3: $h = 0$ (nothing fires)

Here, $z_1 = \mathrm{ReLU}\big((f_1 - \delta f_2) \cdot h\big) = 0$ and $z_2 = \mathrm{ReLU}(f_2 \cdot h) = 0$.

When neither feature is present, neither latent fires:

$$\hat{h}_1 = z_1 f_1 = 0, \qquad \hat{h}_2 = z_2 (f_2 + \delta f_1) = 0, \qquad \hat{h} = \hat{h}_1 + \hat{h}_2 = 0$$

And thus the decoder output is still 0, achieving perfect reconstruction.

We do not need to consider the case where $f_2$ fires alone, as the hierarchy forbids it ($p_{01} = 0$). As we see above, any level of absorption $\delta$ achieves perfect reconstruction in all cases.
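
The three cases above can also be checked numerically; this sketch simply replays the algebra for a grid of $\delta$ values and asserts perfect reconstruction (same placeholder setup as before).

```python
import numpy as np

d = 16
f1, f2 = np.eye(d)[0], np.eye(d)[1]

def reconstruct(h, delta):
    e1, e2 = f1 - delta * f2, f2                       # delta-absorbed encoder
    d1, d2 = f1, f2 + delta * f1                       # delta-absorbed decoder
    z = np.maximum(0.0, np.array([e1 @ h, e2 @ h]))    # z_i = ReLU(e_i . h)
    return z[0] * d1 + z[1] * d2                       # h_hat = sum_i d_i z_i

for delta in np.linspace(0.0, 1.0, 11):
    for h in (f1, f1 + f2, np.zeros(d)):               # cases 1, 2, and 3
        assert np.allclose(reconstruct(h, delta), h)   # perfect reconstruction
```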

Intuitively, absorption does not hinder reconstruction because knowing the elephant feature is active is enough to infer animal even if that feature gets absorbed.

Sparsity under δ-absorption

We calculate the sparsity loss as follows:

$$L_{\text{sp}} = \mathbb{E}_{x \sim D}\Big[\, L_p\big(z_{\{1,2\}}(x)\big) + L_p\big(z_{\notin\{1,2\}}(x)\big) \,\Big]$$

where $z_{\{1,2\}}$ are the activations of the two latents tracking $f_1$ and $f_2$, $z_{\notin\{1,2\}}$ are the activations of all other latents, and $L_p(\cdot)$ denotes the $L_p$ norm (here $p$ is the norm order, distinct from the probabilities $p_{ij}$). Since absorption leaves the unrelated latents unchanged, their contribution, written $L_{\notin\{1,2\}}$ below, is constant, and we only need to track how the first term changes with the amount of absorption in our data distribution.

Case 1: $\delta = 0$ (no absorption)

With no absorption (the SAE has learned the true features $f_1$ and $f_2$ in both the encoder and the decoder), the latent activations are binary ($z_i \in \{0, 1\}$), and the sparsity loss comes out to:

$$L_{\text{sp}} = p_{11} \cdot 2^{1/p} + p_{10} + L_{\notin\{1,2\}}$$

Case 2: $\delta = 1$ (full absorption)

With full absorption, the first latent's encoder direction is $e_1 = f_1 - f_2$, so it fires only when $f_1$ is present without $f_2$, and the second latent's decoder direction is $d_2 = f_2 + f_1$, so it reconstructs both features whenever $f_2$ fires. Exactly one of the two latents is active on any datapoint containing $f_1$, giving:

$$L_{\text{sp}} = p_{11} + p_{10} + L_{\notin\{1,2\}}$$

Case 3: general case (an arbitrary amount $\delta$ of absorption)

Finally, with an arbitrary amount $\delta$ of absorption of $f_1$ into the $f_2$ latent, the two latents fire with magnitudes $1 - \delta$ and $1$ on datapoints containing both features, and we get:

$$L_{\text{sp}} = p_{11} \cdot (2 - \delta)^{1/p} + p_{10} + L_{\notin\{1,2\}}$$

(This expression is exact for $p = 1$; for general $p$ the first term is $p_{11}\,\big(1 + (1 - \delta)^p\big)^{1/p}$, which is likewise decreasing in $\delta$, so the argument below is unchanged.)

It is evident that absorption makes the latent activations sparser for hierarchical features. Next, we show that minimizing the sparsity loss by gradient descent naturally pushes toward more absorption.
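
To see this numerically, the sketch below evaluates the $\delta$-dependent part of the sparsity loss on a grid of $\delta$ values (with placeholder probabilities and the general-$p$ form of the norm) and checks that it is non-increasing in $\delta$.

```python
import numpy as np

p11, p10 = 0.2, 0.3        # placeholder probabilities from the table above
p = 1.0                    # norm order; try 0.5 or 2.0 as well

def sparsity_loss(delta):
    # Lp norm of the two tracked latents, (1 - delta, 1), on "both features"
    # datapoints, plus the single unit latent on "parent only" datapoints.
    # The constant contribution of the unrelated latents is omitted.
    both = (1.0 + (1.0 - delta) ** p) ** (1.0 / p)
    return p11 * both + p10

losses = [sparsity_loss(dl) for dl in np.linspace(0.0, 1.0, 11)]
assert all(a >= b for a, b in zip(losses, losses[1:]))   # non-increasing in delta
```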

Differentiating the Sparsity Loss

Assume that the sparsity penalty is given by an $L_p$ norm on the latent activations:

$$L_{\text{sp}}(\delta) = \mathbb{E}_x\big[\lVert z(x) \rVert_p\big].$$

We can see that the derivative of the $L_p$ sparsity loss with respect to $\delta$ promotes absorption:

$$\frac{dL_{\text{sp}}}{d\delta} = -\frac{p_{11}}{p}\,(2 - \delta)^{\frac{1}{p} - 1}$$

This derivative is negative whenever $p_{11} > 0$, i.e., whenever datapoints with both features exist in our distribution, so making the latents sparser (decreasing the loss) always means more absorption: gradient descent pushes $\delta$ upward. Next, we argue that absorption makes SAEs (and related feature-extraction methods) much less trustworthy for safety.
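
As a sanity check, the analytic derivative above matches a finite-difference estimate (a sketch with placeholder values; it uses the $(2 - \delta)^{1/p}$ form of the loss from the previous section).

```python
import numpy as np

p11, p = 0.2, 1.0    # placeholder probability and norm order

def L_sp(delta):     # the delta-dependent part of the sparsity loss
    return p11 * (2.0 - delta) ** (1.0 / p)

def dL_sp(delta):    # analytic derivative: -(p11 / p) * (2 - delta)^(1/p - 1)
    return -(p11 / p) * (2.0 - delta) ** (1.0 / p - 1.0)

for delta in (0.0, 0.25, 0.5, 0.9):
    eps = 1e-6
    numeric = (L_sp(delta + eps) - L_sp(delta - eps)) / (2.0 * eps)
    assert np.isclose(numeric, dL_sp(delta), atol=1e-5)
    assert dL_sp(delta) < 0.0   # negative whenever p11 > 0: sparser means more absorbed
```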

Absorption makes SAEs less trustworthy for safety

The hope with SAEs (and other related research) is that we can extract model-internal latents representing human-interpretable features that we can detect and control. For this to be useful, it needs to work reliably for complex, harmful behaviours such as lying, strategic deception, power-seeking, sycophancy, and backdoors.

Absorption, and the sparsity objective in feature extraction that promotes it, means that even if we find linear latents for these features (for which we don't really have good progress in the first place [TODO: cite]), we can't trust them.

Under absorption, a feature for deception might actually have learned deception-except-deception-in-2027, and a feature for power-seeking might have learned power-seeking-except-instrumental-convergence. Unless we fix our feature extractors so they do not optimize purely for sparsity in latent activations (other ideas and starting points include KL-divergence-based objectives and parameter-space directions), we can't trust our SAEs, and their features can be more misleading and harmful than helpful.

 

An example of feature absorption in real-world models

In the original feature absorption paper, absorption is shown to occur in real SAEs trained on LLMs. SAE latents that seem to track starting-letter information (e.g. a latent that fires on tokens starting with S) fail to fire on seemingly arbitrary tokens that start with S (like the token "_short"). The paper shows that this is due to more specific latents "absorbing" the "starts with S" direction.

In feature absorption, we find gerrymandered SAE latents which appear to track an interpretable concept but have holes in their recall. Here, we see the dashboard for Gemma Scope layer 3 16k, latent 6510, which appears to track "starts with S" but mysteriously doesn't fire on "_short".

The original paper hypothesizes that feature absorption is a logical consequence of the sparsity penalty in SAE training; we now have a proof sketch showing that naively optimizing an SAE for sparsity will indeed lead to feature absorption.

How much of this applies to other SAE variants and feature-extraction methods?

We should expect feature absorption in any SAE architecture that incentivizes sparsity, which at present means all common architectures. JumpReLU, TopK, and standard L1 SAEs all extract feature representations by making use of sparsity, and we should thus expect them all to exhibit feature absorption.

Cross-layer SAE variants such as crosscoders[1] and transcoders[2] also rely on sparsity to extract features, so we should expect these newer architectures to suffer from feature absorption as well. Indeed, recent work on transcoders finds they do experience feature absorption[3].

Matryoshka SAEs to fix Absorption

Matryoshka SAEs are a promising approach to fixing absorption in SAEs. These SAEs encode a notion of hierarchy by forcing earlier latents to reconstruct the full output on their own, making it more difficult for parent latents to have holes in their recall for child features.
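
A rough sketch of the core idea, simplified from the Matryoshka setup rather than the exact published training objective: the reconstruction loss is summed over nested prefixes of the latent vector, so the earliest latents must reconstruct the input on their own and therefore tend to keep coarse (parent) features intact. The prefix sizes below are arbitrary placeholders.

```python
import torch

def matryoshka_recon_loss(h, z, W_d, prefix_sizes=(16, 64, 256)):
    """Sum of reconstruction losses using only the first k latents, for nested k.

    h: (batch, d_model) activations, z: (batch, n_latents) latent activations,
    W_d: (n_latents, d_model) decoder matrix. A simplified sketch, not the
    exact Matryoshka SAE training loss.
    """
    loss = torch.zeros((), device=h.device)
    for k in prefix_sizes:
        h_hat_k = z[..., :k] @ W_d[:k]                     # decode from the first k latents
        loss = loss + (h_hat_k - h).pow(2).sum(-1).mean()  # squared-error reconstruction
    return loss
```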

 

  1. Sparse Crosscoders for Cross-Layer Features and Model Diffing [link], Jack Lindsey, Adly Templeton, Jonathan Marcus, Thomas Conerly, Joshua Batson, Christopher Olah, Transformer Circuits Thread, 2024.

  2. Transcoders Find Interpretable LLM Feature Circuits [link], Jacob Dunefsky, Philippe Chlenski, Neel Nanda, arXiv:2406.11944, 2024.

  3. Transcoders Beat Sparse Autoencoders for Interpretability [link], Gonçalo Paulo, Stepan Shabalin, Nora Belrose, arXiv:2501.18823, 2025.