First, we need to be clear about what an SAE assumes. It assumes that there is a sparse representation $z$ inside a neural network, such that the original activation $v$ can be approximately reconstructed as $v \approx W_{\text{dec}} z$ for some learned dictionary matrix $W_{\text{dec}}$, with very few non-zero entries in $z$. Those non-zero entries are supposed to correspond to “interpretable features.” To enforce this sparsity, the SAE adds an $L_1$ penalty or regularization term.
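To make the assumption concrete, here is a minimal sketch of the objective described above, written with NumPy. The names `sae_loss`, `W_enc`, `b_enc`, `W_dec`, and `lam` are illustrative choices, not taken from any particular SAE codebase, and the ReLU encoder is one common choice rather than the only one.

```python
import numpy as np

def sae_loss(v, W_enc, b_enc, W_dec, lam):
    """Squared reconstruction error plus an L1 sparsity penalty for one activation v."""
    z = np.maximum(0.0, W_enc @ v + b_enc)   # candidate sparse code, shape (M,)
    v_hat = W_dec @ z                        # reconstruction of v, shape (D,)
    recon = float(np.sum((v - v_hat) ** 2))  # how well v is reconstructed
    sparsity = float(np.sum(np.abs(z)))      # L1 norm of the code
    return recon + lam * sparsity, z

# Hypothetical sizes: D-dimensional activation, M > D dictionary entries.
rng = np.random.default_rng(0)
D, M = 8, 32
v = rng.normal(size=D)
W_enc = rng.normal(size=(M, D)) * 0.1
b_enc = np.zeros(M)
W_dec = rng.normal(size=(D, M)) * 0.1

loss, z = sae_loss(v, W_enc, b_enc, W_dec, lam=1e-3)
print(loss, int(np.count_nonzero(z)))
```

The only term that pushes entries of `z` toward zero is `lam * sparsity`; nothing in the reconstruction term asks for sparsity.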
From the normalized representation $s = v / \|v\|$, we know that $s$ lies on the unit sphere and its components satisfy $\sum_{d=1}^{D} s_d^2 = 1$. Each dimension $d$ contributes to the logit of every vocabulary token $t$ via $s_d \, w_{d,t}$, where $w_{d,t}$ is the LM head weight connecting dimension $d$ to token $t$. This means every dimension influences the entire vocabulary; there is no sparse connectivity. Therefore, the representation is inherently dense, with all dimensions playing a global role. SAE, however, assumes the existence of a sparse representation $z$ where only a few entries are non-zero, and those non-zero entries correspond to “interpretable features.” This assumption directly contradicts the dense nature of the representation shown above: in a dense system where every dimension globally contributes, forcing sparsity is not revealing an inherent property, it is imposing an artificial constraint. Hence, the “sparse features” SAE finds are not intrinsic to the model; they are artifacts of the sparsity regularization.
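The per-dimension logit contribution can be checked numerically. The sketch below assumes a randomly initialized dense LM head `W` of shape `(D, V)` as a stand-in for a real model's weights; the sizes and names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
D, V = 16, 100                    # hidden size and vocabulary size (toy values)

v = rng.normal(size=D)
s = v / np.linalg.norm(v)         # normalized representation on the unit sphere
W = rng.normal(size=(D, V))       # stand-in for a dense LM head

# contrib[d, t] = s_d * W[d, t]: what dimension d adds to the logit of token t.
contrib = s[:, None] * W
logits = contrib.sum(axis=0)      # summing over d recovers the usual logits

# Fraction of (dimension, token) pairs with a non-zero contribution.
nonzero_fraction = float(np.mean(contrib != 0))
print(nonzero_fraction)
```

With a dense `W`, essentially every entry of `contrib` is non-zero, which is the "every dimension influences the entire vocabulary" statement in matrix form.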
Second, if representations were as sparse as SAE assumes, then the normalized vector $s$ (and current models widely use normalization layers) should have only a handful of large-magnitude dimensions, with the rest near zero. Indeed, since $\sum_{d} s_d^2 = 1$, a few large values force all the rest to be very close to zero. The catch: normalization itself has no mechanism that forces most dimensions to be near zero. For an arbitrary normalized direction $s$ with $\|s\| = 1$, any dimension can be arbitrarily small but non-zero, and that is perfectly allowed.
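A small numerical illustration of this point: two unit vectors, one fully dense and one concentrated on four coordinates, both satisfy the normalization constraint. The helper `effective_dims` is an illustrative measure introduced here (the inverse of the sum of fourth powers, sometimes called a participation ratio), not something from the original argument.

```python
import numpy as np

def effective_dims(s):
    """For a unit vector s, roughly how many coordinates carry the mass."""
    return 1.0 / float(np.sum(s ** 4))

D = 1024
dense = np.full(D, 1.0 / np.sqrt(D))   # every coordinate equal and non-zero
sparse = np.zeros(D)
sparse[:4] = 0.5                       # only four non-zero coordinates

# Both vectors lie on the unit sphere.
print(np.sum(dense ** 2), np.sum(sparse ** 2))

# The constraint alone does not pick between them.
print(effective_dims(dense), effective_dims(sparse))
```

Normalization admits both extremes, so the unit-norm constraint by itself says nothing about which one a model uses.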
For a sparse representation to actually exist, $s$ would need to be expressible as a linear combination of just a few basis vectors. But once we write $v = \|v\| \, s$, that decomposition is unique. Any attempt to further decompose $s$ into sparse basis vectors would look like this:
Suppose $s = \sum_{k=1}^{K} a_k f_k$, where $K \ll D$ and the $f_k$ are columns of some overcomplete dictionary. Since $s$ itself lives in $\mathbb{R}^D$ with fixed $\|s\| = 1$, there are infinitely many such representations: as long as the dictionary has $M > D$ columns, you have $M - D$ degrees of freedom. This shows that the “sparse representation” SAE finds is not an inherent property of the model. It is just some arbitrarily chosen coordinate system. That is also why different random seeds yield completely different $f_k$: the result has nothing to do with how the model actually computes.
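The degrees-of-freedom count can be sketched numerically. The example assumes a random full-rank dictionary `F` with `M > D` columns (a hypothetical stand-in for a learned SAE dictionary) and shows two different coefficient vectors that reconstruct the same `s`.

```python
import numpy as np

rng = np.random.default_rng(0)
D, M = 8, 20                          # M > D: overcomplete dictionary (toy sizes)

F = rng.normal(size=(D, M))           # columns play the role of dictionary vectors
s = rng.normal(size=D)
s /= np.linalg.norm(s)                # unit-norm target direction

# One exact solution of F @ a = s (the minimum-norm one).
a0, *_ = np.linalg.lstsq(F, s, rcond=None)

# Rows D..M-1 of Vt span the null space of F when F has full row rank.
_, _, Vt = np.linalg.svd(F)
null_basis = Vt[D:]                   # shape (M - D, M)

# Adding any null-space vector gives another exact solution.
a1 = a0 + null_basis.T @ rng.normal(size=M - D)

print(np.allclose(F @ a0, s), np.allclose(F @ a1, s), null_basis.shape[0])
```

Note that the coefficient vectors here are not constrained to be $K$-sparse; the sketch only illustrates the $M - D$ free directions available to an overcomplete dictionary.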
At the very least, the “sparse” representation SAE looks for simply does not exist in LLMs. Every dimension influences the whole vocabulary. If SAE tries to find sparse features here — it’s bound to fail. If it ever finds any, that would be truly bizarre.
In recent months, even academia has grown deeply skeptical of SAE.
Google DeepMind’s interpretability team, for instance, reported negative results for SAEs on downstream tasks in an Alignment Forum post (linked below). Their team leads have also expressed pessimism about SAE-style foundational research in public channels.
Link: https://www.alignmentforum.org/posts/4uXCAJNuPKtKBsi28/negative-results-for-saes-on-downstream-tasks
Since SAE is wrong even in its most basic mathematical assumptions, it is only natural that its performance is poor.