by Zachary Baker, Yuxiao Li, Maxim Panteleev, Maxim Finenko, Eslam Zaher
August 2025 | SPAR Spring '25
A post in our series "Feature Geometry & Structured Priors in Sparse Autoencoders"
Here, we compare variational sparse autoencoders (vSAE) against a standard sparse autoencoder (SAE). We hypothesize that adding a KL divergence term to the loss function and sampling the activations from a Gaussian distribution should create pressure for features to 'move apart', leading to better representations of the language model's latent space and thus to increased performance and interpretability. We ultimately find that this addition does not improve interpretability and degrades performance.
The SAE learns a sparse representation of a language model's (LM) activations through deterministic encoding with sparsity regularization. Here, we use the residual stream of a 1-layer GeLU transformer decoder block as the LM activations (the blue input). The SAE is constructed in a very similar manner to an MLP with one hidden layer, as follows:
Encoder: $f(x) = \mathrm{ReLU}(W_{\text{enc}} x + b_{\text{enc}})$
Decoder: $\hat{x} = W_{\text{dec}} f(x) + b_{\text{dec}}$
Loss Function: $\mathcal{L} = \lVert x - \hat{x} \rVert_2^2 + \lambda \lVert f(x) \rVert_1$
where λ controls the sparsity-reconstruction tradeoff and the L1 penalty ($\lVert f(x) \rVert_1$) enforces sparse activations.
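For concreteness, here is a minimal PyTorch sketch of this encoder/decoder/loss. The class and parameter names are our own illustrative choices rather than the exact interface of the dictionary_learning fork we trained with:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseAutoencoder(nn.Module):
    """Minimal standard SAE: ReLU encoder, linear decoder, L1 sparsity penalty."""

    def __init__(self, d_model: int, d_dict: int, l1_coeff: float = 1e-3):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_dict)   # W_enc, b_enc
        self.decoder = nn.Linear(d_dict, d_model)   # W_dec, b_dec
        self.l1_coeff = l1_coeff                    # lambda in the loss above

    def forward(self, x: torch.Tensor):
        f = F.relu(self.encoder(x))                 # sparse feature activations
        x_hat = self.decoder(f)                     # reconstruction
        return x_hat, f

    def loss(self, x: torch.Tensor):
        x_hat, f = self(x)
        recon = (x - x_hat).pow(2).sum(dim=-1).mean()           # ||x - x_hat||^2
        sparsity = self.l1_coeff * f.abs().sum(dim=-1).mean()   # lambda * ||f||_1
        return recon + sparsity
```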
In the vSAE, the activations of the SAE dictionary features are sampled from a Gaussian (mean = activation, variance = 1), represented by the horizontal bars in the 'dictionary features' box. This changes each feature from a pointer in a single direction representing the state of the LM into a region around that direction. We hypothesize that this will create 'pressure' for features to move away from each other, leading to a more organized space of features for the LM state representations.
The vSAE adds in a probabilistic framework using a Gaussian prior:
Encoder (mean only, as the variance is always set to 1): $\mu(x) = W_{\text{enc}} x + b_{\text{enc}}$
Reparameterization: $z = \mu(x) + \epsilon, \quad \epsilon \sim \mathcal{N}(0, I)$
Decoder: $\hat{x} = W_{\text{dec}} z + b_{\text{dec}}$
Loss Function: $\mathcal{L} = \lVert x - \hat{x} \rVert_2^2 + \beta \, D_{\mathrm{KL}}\!\left(\mathcal{N}(\mu(x), I) \,\|\, \mathcal{N}(0, I)\right)$
where β weights the KL divergence term, which for a unit-variance posterior reduces to $\tfrac{1}{2} \lVert \mu(x) \rVert_2^2$.
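A minimal sketch of this variational variant, assuming the mean encoder is linear, the posterior variance is fixed to 1, and β is the KL coefficient defined above (names are again illustrative, not the fork's exact interface):

```python
import torch
import torch.nn as nn

class VariationalSAE(nn.Module):
    """Minimal vSAE sketch: the encoder outputs a mean, the variance is fixed to 1,
    and a KL term against a standard normal prior replaces the L1 penalty."""

    def __init__(self, d_model: int, d_dict: int, kl_coeff: float = 1.0):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_dict)   # produces mu(x)
        self.decoder = nn.Linear(d_dict, d_model)
        self.kl_coeff = kl_coeff                    # beta in the loss above

    def forward(self, x: torch.Tensor):
        mu = self.encoder(x)
        z = mu + torch.randn_like(mu)               # reparameterization with sigma = 1
        x_hat = self.decoder(z)
        return x_hat, mu, z

    def loss(self, x: torch.Tensor):
        x_hat, mu, _ = self(x)
        recon = (x - x_hat).pow(2).sum(dim=-1).mean()
        # KL( N(mu, I) || N(0, I) ) = 0.5 * ||mu||^2 when the posterior variance is 1
        kl = 0.5 * mu.pow(2).sum(dim=-1).mean()
        return recon + self.kl_coeff * kl
```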
The fundamental distinction lies in sparsity mechanisms: standard SAEs achieve sparsity through deterministic ReLU gating and explicit L1 regularization, while vSAEs achieve sparsity through stochastic sampling from a learned posterior regularized toward a sparse prior. This allows vSAEs to model uncertainty in feature activations and capture richer representational structure through the probabilistic latent space.
When we say an SAE is variational, we mean that the reparameterization trick was added (using an isotropic Gaussian prior) along with a KL loss term that encourages the learned distribution toward the standard normal prior.
The Gaussian was chosen as a simple and familiar prior. The goal is to cause the features to disperse in a more organized way, so the specific prior used is not as important.
We applied variational methods to SAEs with the aim of better organizing the representation of the language model and thereby increasing interpretability.
Under the linear representation hypothesis, SAE features are represented as directional vectors with specific magnitudes. Variational methods replace this deterministic approach with probabilistic sampling. Instead of fixed directional vectors, features sample from Gaussian distributions centered around those original directions.
This Gaussian sampling creates a "blurred area" around each feature activation. We hypothesize this blurring generates two beneficial pressures: features push away from each other to avoid overlap, and similar features that would tend to fire together organize more coherently.
For this project, we used a fork of the popular GitHub repository for SAEs named Dictionary Learning. This allowed us to more easily run benchmarks such as SAE Bench. We were also able to modify SAE Vis for visualization of individual features. Code is available at our GitHub.
Our experimental framework provides controlled comparisons between standard SAE variants and their variational counterparts across architectures. We evaluate all models on Neel Nanda's gelu-1l transformer at layer 0 (blocks.0.resid_post), using a dictionary size of 4× the MLP dimension (2,048 features total).
Training occurs over 20k steps with bfloat16 precision using the same c4-code-pretokenized dataset the transformer was trained on, using consistent buffer configurations (2.5k contexts of length 128) and identical evaluation metrics including fraction variance explained, L0 sparsity, and loss recovery. Key hyperparameters—learning rate and method-specific parameters (e.g., KL coefficient for variational methods, TopK values)—are systematically varied while maintaining architectural consistency. Variational SAE experiments additionally incorporate KL annealing schedules to prevent posterior collapse. Training occurred on a local 3080 (10GB VRAM).
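For reference, an illustrative configuration and a simple linear KL annealing schedule of the kind described above; the key names, the learning rate, and the warmup length are our own placeholder choices, not the fork's exact config:

```python
# Illustrative training configuration; the dictionary_learning fork uses its own
# config keys, and the learning rate / warmup length here are placeholders.
config = {
    "model": "gelu-1l",
    "hook_point": "blocks.0.resid_post",
    "dict_size": 2048,
    "steps": 20_000,
    "dtype": "bfloat16",
    "buffer_contexts": 2500,
    "context_length": 128,
    "lr": 3e-4,               # example value; learning rates were swept
    "kl_coeff": 1.0,          # final beta after annealing
    "kl_warmup_steps": 5000,  # assumed warmup length for annealing
}

def kl_weight(step: int, cfg: dict) -> float:
    """Linearly anneal the KL coefficient from 0 to its final value,
    reducing the risk of posterior collapse early in training."""
    warmup = cfg["kl_warmup_steps"]
    return cfg["kl_coeff"] * min(1.0, step / warmup)
```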
Applying the variational method alone led to poor results: the KL divergence was not a strong enough regularizer to induce sparsity. Therefore, we paired the variational method with the TopK method while running experiments.
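Concretely, pairing the two amounts to applying a TopK mask on top of the sampled latent. A sketch, assuming the mask is computed from the sampled z (whether to rank by μ or z is an implementation choice we leave open):

```python
import torch

def topk_mask(z: torch.Tensor, k: int) -> torch.Tensor:
    """Keep the k largest activations per example and zero the rest."""
    _, idx = torch.topk(z, k, dim=-1)
    mask = torch.zeros_like(z)
    mask.scatter_(-1, idx, 1.0)
    return z * mask

# Sketch of a vSAE-TopK forward pass (assuming TopK is applied after sampling):
#   mu = encoder(x)
#   z = mu + torch.randn_like(mu)
#   z_sparse = topk_mask(z, k=64)
#   x_hat = decoder(z_sparse)
```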
We chose to focus on TopK after running extensive studies on SAE architectures (see preliminary analysis here).
We would like to compare the vSAE against its SAE counterpart. We first look at the benchmark results to see how directly competitive the vSAE is. Then we look at the local, specific features which stood out to us. We then look at the global latent analysis of the features using tSNE.
We ultimately found that applying the variational methods to the SAE cost performance relative to the other models.
SAE Bench is a comprehensive evaluation framework for Sparse Autoencoders that provides standardized benchmarking across several evaluation metrics, including: Feature Absorption, L0/Loss Recovered, Spurious Correlation Removal (SCR), Targeted Probe Perturbation (TPP), and Sparse Probing. We use this tool to compare our vSAE TopK to the standard TopK across different sparsities.
Other SAE Bench results
The experiments show TopK achieves superior reconstruction fidelity with consistent dictionary utilization, while vSAE TopK exhibits more variable sparsity patterns and consistently lower reconstruction quality across all K values tested. We do, however, see slightly more robust features in the other SAE Bench metrics.
The fraction of living features may be the culprit: the KL term applies regularization pressure that pushes feature activations toward zero, leaving a low fraction of living features. This reduces the computational capacity of the vSAE and explains why the SAE outperforms it.
This same KL pressure, which kills many features, may also be why we see improved robustness.
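A simple way to measure the fraction of living features is to count which dictionary features ever fire over a sample of LM activations; a sketch, assuming the autoencoder's forward pass returns the feature activations as its second output:

```python
import torch

@torch.no_grad()
def fraction_alive(sae, activations: torch.Tensor, batch_size: int = 4096,
                   threshold: float = 0.0) -> float:
    """Fraction of dictionary features that fire at least once over a sample
    of LM activations. Assumes `sae(x)` returns (x_hat, features)."""
    ever_fired = None
    for start in range(0, activations.shape[0], batch_size):
        _, feats = sae(activations[start:start + batch_size])
        fired = (feats > threshold).any(dim=0)
        ever_fired = fired if ever_fired is None else (ever_fired | fired)
    return ever_fired.float().mean().item()
```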
For this feature analysis, we took the SAE_Vis library and rewrote it to fit our needs. We inspect individual features to get a local view of how the network behaves and to gauge the interpretability of the autoencoders in question.
We find in the later global tSNE analysis that many of the vSAE features are dead, and we find the same result in SAE Vis: many of the vSAE's features have no activating examples, while the SAE has very few dead features that failed to fire over the test set.
A few examples that we feel represent the spread of features are given below, with a few caveats: the synthetic tokens (obtained by maximizing the feature activation via stochastic gradient ascent) seem to be completely unrelated to the real dataset examples that cause the features to fire. The real activations are generally interpretable, but the synthetic activations do not seem to correlate with anything sensible. The logit effects seem to make sense as well, but they also appear disconnected from the interpretation suggested by the real activations.
We find that a majority of the features are interpretable in both the vSAE and the SAE. One feature from each is shown below to illustrate how the networks perform in general.
vSAE:
This feature activates on locational tokens, especially those relating to direction. The synthetic activations seem unrelated, although the logit effects appear sensible on their own.
TopK SAE:
We can see that this feature seems to be polysemantic: it fires for time signals, but also fires on the term Christmas. The highest activations found from the synthetic activations seem to be unrelated to the activations found from the dataset. The logit effects seem to fire strongly in the positive as well as negative directions on suffixes.
In general, we found that the SAE was more interpretable than the vSAE, and had far fewer dead features.
To test our hypothesis that variational methods create pressure for features to disperse and organize more coherently, we examine the spatial organization of learned representations through latent analysis. If the Gaussian sampling mechanism successfully pushes features apart as theorized, we expect to observe fewer tightly packed clusters in the vSAE compared to standard SAE representations, along with a more uniform spread.
We also seek evidence against pathological activation patterns—specifically, we want to identify and minimize features that fire consistently across all inputs, as these indicate failed sparsity learning rather than meaningful feature differentiation.
Additionally, we analyze whether activation magnitudes distribute evenly across discovered clusters, which would suggest balanced utilization of the representational space rather than concentration in specific regions. Through arbitrary clustering of the learned features, we can quantify these organizational properties and determine whether the probabilistic framework actually delivers the improved spatial organization that motivated our variational approach.
We test across 20,000 LM activation samples using the same LM and dataset from training.
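A sketch of this latent analysis, assuming the decoder weight matrix has shape (d_model, d_dict) and the forward pass returns feature activations; the clustering is arbitrary (KMeans here) and only meant to quantify how evenly activation mass spreads:

```python
import numpy as np
import torch
from sklearn.cluster import KMeans
from sklearn.manifold import TSNE

@torch.no_grad()
def latent_organization(sae, activations: torch.Tensor,
                        n_clusters: int = 20, perplexity: float = 30.0):
    """Project decoder feature directions to 2D with t-SNE, measure how often
    each feature fires over sample LM activations, and group features into
    arbitrary clusters to check how evenly activation mass is spread.
    Assumes `sae.decoder.weight` has shape (d_model, d_dict) and
    `sae(x)` returns (x_hat, features)."""
    dirs = sae.decoder.weight.detach().T.float().cpu().numpy()   # (d_dict, d_model)

    _, feats = sae(activations)
    fire_rate = (feats > 0).float().mean(dim=0).cpu().numpy()    # per-feature firing rate

    coords = TSNE(n_components=2, perplexity=perplexity).fit_transform(dirs)
    clusters = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(dirs)

    # Total firing mass per cluster: roughly equal masses suggest balanced
    # utilization of the representational space.
    mass = np.array([fire_rate[clusters == c].sum() for c in range(n_clusters)])
    return coords, fire_rate, clusters, mass
```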
K=256 and K=128 cases
We see that while the SAE grows more sparse as K increases, the vSAE shows the opposite effect: a larger K causes more unique features to fire. We also see far more variability in feature utilization.
We also see a general trend that at least one feature activates strongly and consistently, regardless of whether the variational method is present. This may be a result of the TopK method itself.
Effect of KL term on latent space
The KL coefficient is the β from the loss function above: it weights the KL loss term, and higher values impose a stronger pull toward the prior.
Based on these visualizations, lower KL values pressure the model to use a greater number of total features more selectively. This creates more interpretable representations at the cost of representational capacity. We always see at least one cluster of high utilization, suggesting certain feature types are essential regardless of sparsity constraints. There is a tradeoff: high KL weights give dense, distributed representations, while low KL weights give sparse, selective representations. We ultimately chose KL=1 for our experiments.
To calculate the cosine similarity of features, we load both trained autoencoder models and extract their feature vectors from the decoder weights, then compute the cosine similarity between every pair of features in each model to form a complete similarity matrix.
We then test the models on sample data from the LM to identify which features actually activate during real usage, since dormant features aren't meaningful for comparison. The analysis focuses only on these active features, calculating statistical properties like mean, variance, and distribution of their cosine similarities.
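A sketch of this computation, under the same assumptions about the decoder weight shape and forward pass as above:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def active_feature_cosine_stats(sae, activations: torch.Tensor):
    """Pairwise cosine similarities between decoder feature directions,
    restricted to features that actually fire on sample data.
    Assumes `sae.decoder.weight` has shape (d_model, d_dict) and
    `sae(x)` returns (x_hat, features)."""
    _, feats = sae(activations)
    active = (feats > 0).any(dim=0)                  # which features ever fire

    dirs = sae.decoder.weight.detach().T[active]     # (n_active, d_model)
    dirs = F.normalize(dirs, dim=-1)
    sims = dirs @ dirs.T                             # full cosine similarity matrix

    # Exclude the diagonal (self-similarity) before computing statistics.
    off_diag = sims[~torch.eye(sims.shape[0], dtype=torch.bool, device=sims.device)]
    return off_diag.mean().item(), off_diag.var().item()
```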
Both vSAE TopK and SAE TopK achieve good feature orthogonality, but their relative performance depends critically on sparsity level. At k=512, both models perform nearly identically with minimal feature correlation. At k=64, vSAE TopK shows significantly more feature correlation, while SAE TopK maintains better orthogonality but with higher variance.
This suggests SAE TopK may be more robust across different sparsity regimes, while vSAE TopK's performance is more sparsity-dependent. The k=64 differences (0.0058 vs 0.0001) could contribute to the performance gaps seen earlier in the SAE Bench section. Both architectures successfully learn sparse representations, but SAE TopK demonstrates more consistent feature separation across sparsity levels. This disparity could be due to the low number of live features that we saw before.
The likely reason that we had far fewer live features in the vSAE compared to the SAE is the KL divergence pressure: we added a term to the loss which pushed the latent features towards 0. Features could not grow without penalty, so there was a higher chance that they would be killed in the process. The smaller number of live features reduces the computational capacity of the vSAE, explaining the performance drop.
Dead features due to the KL loss term were both symptom and cause of poor performance.
We ultimately found that the vSAE was not competitive in SAE Bench, had many more dead features than its SAE counterpart, and was slightly less interpretable in the feature analysis. We did, however, find that the vSAE features were more robust than those of its counterpart. We found slightly more correlated features in the vSAE, and found the best performance at k=512.
References
@misc{sae_vis,
title = {{SAE Visualizer}},
author = {Callum McDougall},
howpublished = {\url{https://github.com/callummcdougall/sae_vis}},
year = {2024}
}
@misc{marks2024dictionary_learning,
title = {dictionary_learning},
author = {Samuel Marks and Adam Karvonen and Aaron Mueller},
year = {2024},
howpublished = {\url{https://github.com/saprmarks/dictionary_learning}},
}
@misc{karvonen2025saebenchcomprehensivebenchmarksparse,
title={SAEBench: A Comprehensive Benchmark for Sparse Autoencoders in Language Model Interpretability},
author={Adam Karvonen and Can Rager and Johnny Lin and Curt Tigges and Joseph Bloom and David Chanin and Yeu-Tong Lau and Eoin Farrell and Callum McDougall and Kola Ayonrinde and Demian Till and Matthew Wearden and Arthur Conmy and Samuel Marks and Neel Nanda},
year={2025},
eprint={2503.09532},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2503.09532},
}