by Zachary Baker, Yuxiao Li, Maxim Panteleev, Maxim Finenko, Eslam Zaher
August 2025 | SPAR Spring '25
A post in our series "Feature Geometry & Structured Priors in Sparse Autoencoders"
Here, we compare variational sparse autoencoders (vSAE) against a standard sparse autoencoder (SAE). We hypothesize that adding a KL divergence term to the loss function and sampling the activations from a Gaussian distribution should create pressure for features to 'move apart', leading to better representations of the language model's latent space and thus to increased performance and interpretability. We ultimately find that this addition does not improve interpretability and degrades performance.
The SAE learns a sparse representation of a language model's (LM) activations through deterministic encoding with sparsity regularization. Here, we use the residual stream of a 1-layer GeLU transformer decoder block as the LM activations (the blue input). The SAE is constructed in a very similar manner to an MLP with one hidden layer, as follows:
Encoder: $f(x) = \mathrm{ReLU}(W_{\text{enc}} x + b_{\text{enc}})$
Decoder: $\hat{x} = W_{\text{dec}} f(x) + b_{\text{dec}}$
Loss Function: $\mathcal{L} = \lVert x - \hat{x} \rVert_2^2 + \lambda \lVert f(x) \rVert_1$
where λ controls the sparsity-reconstruction tradeoff and the L1 penalty ($\lVert f(x) \rVert_1$) enforces sparse activations.
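For concreteness, here is a minimal PyTorch sketch of this encoder/decoder/loss. The class and parameter names are our own illustrative choices rather than the exact interface of the dictionary_learning fork we trained with:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseAutoencoder(nn.Module):
    """Minimal standard SAE: ReLU encoder, linear decoder, L1 sparsity penalty."""

    def __init__(self, d_model: int, d_dict: int, l1_coeff: float = 1e-3):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_dict)   # W_enc, b_enc
        self.decoder = nn.Linear(d_dict, d_model)   # W_dec, b_dec
        self.l1_coeff = l1_coeff                    # lambda in the loss above

    def forward(self, x: torch.Tensor):
        f = F.relu(self.encoder(x))                 # sparse feature activations
        x_hat = self.decoder(f)                     # reconstruction
        return x_hat, f

    def loss(self, x: torch.Tensor):
        x_hat, f = self(x)
        recon = (x - x_hat).pow(2).sum(dim=-1).mean()           # ||x - x_hat||^2
        sparsity = self.l1_coeff * f.abs().sum(dim=-1).mean()   # lambda * ||f||_1
        return recon + sparsity
```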
In the vSAE, the activations of the SAE dictionary features are sampled from a Gaussian (mean = activation, variance = 1), represented by the horizontal bars in the 'dictionary features' box. This changes each feature from a pointer in a single direction representing the state of the LM into a region around that direction. We hypothesize that this will create 'pressure' for features to move away from each other, leading to a more organized space of features for the LM state representations.
The vSAE adds in a probabilistic framework using a Gaussian prior:
Encoder (mean only, as the variance is always set to 1): $\mu(x) = W_{\text{enc}} x + b_{\text{enc}}$
Reparameterization: $z = \mu(x) + \epsilon, \quad \epsilon \sim \mathcal{N}(0, I)$
Decoder: $\hat{x} = W_{\text{dec}} z + b_{\text{dec}}$
Loss Function: $\mathcal{L} = \lVert x - \hat{x} \rVert_2^2 + \beta \, D_{\mathrm{KL}}\!\left(\mathcal{N}(\mu(x), I) \,\|\, \mathcal{N}(0, I)\right)$
where β weights the KL divergence term, which for a unit-variance posterior reduces to $\tfrac{1}{2} \lVert \mu(x) \rVert_2^2$.
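A minimal sketch of this variational variant, assuming the mean encoder is linear, the posterior variance is fixed to 1, and β is the KL coefficient defined above (names are again illustrative, not the fork's exact interface):

```python
import torch
import torch.nn as nn

class VariationalSAE(nn.Module):
    """Minimal vSAE sketch: the encoder outputs a mean, the variance is fixed to 1,
    and a KL term against a standard normal prior replaces the L1 penalty."""

    def __init__(self, d_model: int, d_dict: int, kl_coeff: float = 1.0):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_dict)   # produces mu(x)
        self.decoder = nn.Linear(d_dict, d_model)
        self.kl_coeff = kl_coeff                    # beta in the loss above

    def forward(self, x: torch.Tensor):
        mu = self.encoder(x)
        z = mu + torch.randn_like(mu)               # reparameterization with sigma = 1
        x_hat = self.decoder(z)
        return x_hat, mu, z

    def loss(self, x: torch.Tensor):
        x_hat, mu, _ = self(x)
        recon = (x - x_hat).pow(2).sum(dim=-1).mean()
        # KL( N(mu, I) || N(0, I) ) = 0.5 * ||mu||^2 when the posterior variance is 1
        kl = 0.5 * mu.pow(2).sum(dim=-1).mean()
        return recon + self.kl_coeff * kl
```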
The fundamental distinction lies in sparsity mechanisms: standard SAEs achieve sparsity through deterministic ReLU gating and explicit L1 regularization, while vSAEs achieve sparsity through stochastic sampling from a learned posterior regularized toward a sparse prior. This allows vSAEs to model uncertainty in feature activations and capture richer representational structure through the probabilistic latent space.
When we say an SAE is variational, we mean that the reparameterization trick was added (using an isotropic Gaussian prior) along with a KL loss term that encourages the learned distribution toward the standard normal prior.
The Gaussian was chosen as a simple and familiar prior. The goal is to cause the features to disperse in a more organized way, so the specific prior used is not as important.
We applied variational methods to SAEs with the aim of better organizing the representation of the language model and thereby increasing interpretability.
Under the linear representation hypothesis, SAE features are represented as directional vectors with specific magnitudes. Variational methods replace this deterministic approach with probabilistic sampling. Instead of fixed directional vectors, features sample from Gaussian distributions centered around those original directions.
This Gaussian sampling creates a "blurred area" around each feature activation. We hypothesize this blurring generates two beneficial pressures: features push away from each other to avoid overlap, and similar features that would tend to fire together organize more coherently.
For this project, we used a fork of the popular GitHub repository for SAEs named Dictionary Learning. This allowed us to more easily run benchmarks such as SAE Bench. We were also able to modify SAE Vis for visualization of individual features. Code is available at our GitHub.
Our experimental framework provides controlled comparisons between standard SAE variants and their variational counterparts across architectures. We evaluate all models on Neel Nanda's gelu-1l transformer at layer 0 (blocks.0.resid_post), using a dictionary size of 4× the MLP dimension (2,048 features total).
Training occurs over 20k steps with bfloat16 precision using the same c4-code-pretokenized dataset the transformer was trained on, using consistent buffer configurations (2.5k contexts of length 128) and identical evaluation metrics including fraction variance explained, L0 sparsity, and loss recovery. Key hyperparameters—learning rate and method-specific parameters (e.g., KL coefficient for variational methods, TopK values)—are systematically varied while maintaining architectural consistency. Variational SAE experiments additionally incorporate KL annealing schedules to prevent posterior collapse. Training occurred on a local 3080 (10GB VRAM).
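For reference, an illustrative configuration and a simple linear KL annealing schedule of the kind described above; the key names, the learning rate, and the warmup length are our own placeholder choices, not the fork's exact config:

```python
# Illustrative training configuration; the dictionary_learning fork uses its own
# config keys, and the learning rate / warmup length here are placeholders.
config = {
    "model": "gelu-1l",
    "hook_point": "blocks.0.resid_post",
    "dict_size": 2048,
    "steps": 20_000,
    "dtype": "bfloat16",
    "buffer_contexts": 2500,
    "context_length": 128,
    "lr": 3e-4,               # example value; learning rates were swept
    "kl_coeff": 1.0,          # final beta after annealing
    "kl_warmup_steps": 5000,  # assumed warmup length for annealing
}

def kl_weight(step: int, cfg: dict) -> float:
    """Linearly anneal the KL coefficient from 0 to its final value,
    reducing the risk of posterior collapse early in training."""
    warmup = cfg["kl_warmup_steps"]
    return cfg["kl_coeff"] * min(1.0, step / warmup)
```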
Applying the variational method alone led to poor results: the KL divergence was not a strong enough regularizer to induce sparsity. Therefore, we paired the variational method with the TopK method while running experiments.
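Concretely, pairing the two amounts to applying a TopK mask on top of the sampled latent. A sketch, assuming the mask is computed from the sampled z (whether to rank by μ or z is an implementation choice we leave open):

```python
import torch

def topk_mask(z: torch.Tensor, k: int) -> torch.Tensor:
    """Keep the k largest activations per example and zero the rest."""
    _, idx = torch.topk(z, k, dim=-1)
    mask = torch.zeros_like(z)
    mask.scatter_(-1, idx, 1.0)
    return z * mask

# Sketch of a vSAE-TopK forward pass (assuming TopK is applied after sampling):
#   mu = encoder(x)
#   z = mu + torch.randn_like(mu)
#   z_sparse = topk_mask(z, k=64)
#   x_hat = decoder(z_sparse)
```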
We chose to focus on TopK after running extensive studies on SAE architectures (see preliminary analysis here).
We would like to compare the vSAE against its SAE counterpart. We first look at the benchmark results to see how directly competitive the vSAE is. Then we look at the local, specific features which stood out to us. We then look at the global latent analysis of the features using tSNE.
We ultimately found that applying the variational methods to the SAE cost performance relative to the other models.
SAE Bench is a comprehensive evaluation framework for Sparse Autoencoders that provides standardized benchmarking across several evaluation metrics, including: Feature Absorption, L0/Loss Recovered, Spurious Correlation Removal (SCR), Targeted Probe Perturbation (TPP), and Sparse Probing. We use this tool to compare our vSAE TopK to the standard TopK across different sparsities.
Other SAE Bench results
The experiments show TopK achieves superior reconstruction fidelity with consistent dictionary utilization, while vSAE TopK exhibits more variable sparsity patterns and consistently lower reconstruction quality across all K values tested. We do, however, see slightly more robust features in the other SAE Bench metrics.
The fraction of living features may be the culprit: the KL term applies regularization pressure that pushes feature activations toward zero, leaving a low fraction of living features. This reduces the computational capacity of the vSAE and explains why the SAE outperforms it.
This same KL pressure, which kills many features, may also be why we see improved robustness.
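A simple way to measure the fraction of living features is to count which dictionary features ever fire over a sample of LM activations; a sketch, assuming the autoencoder's forward pass returns the feature activations as its second output:

```python
import torch

@torch.no_grad()
def fraction_alive(sae, activations: torch.Tensor, batch_size: int = 4096,
                   threshold: float = 0.0) -> float:
    """Fraction of dictionary features that fire at least once over a sample
    of LM activations. Assumes `sae(x)` returns (x_hat, features)."""
    ever_fired = None
    for start in range(0, activations.shape[0], batch_size):
        _, feats = sae(activations[start:start + batch_size])
        fired = (feats > threshold).any(dim=0)
        ever_fired = fired if ever_fired is None else (ever_fired | fired)
    return ever_fired.float().mean().item()
```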
For this feature analysis, we took the SAE_Vis library and rewrote it to fit our needs. We inspect individual features to get a local view of how the network behaves and to gauge the interpretability of the autoencoders in question.
We find in the later global tSNE analysis that many of the vSAE features are dead, and we find the same result in SAE Vis: many of the vSAE's features have no activating examples, while the SAE has very few dead features that failed to fire over the test set.
A few examples that we feel represent the spread of features are given below, with a few caveats: the synthetic tokens (obtained by maximizing the feature activation via stochastic gradient ascent) seem to be completely unrelated to the real dataset examples that cause the features to fire. The real activations are generally interpretable, but the synthetic activations do not seem to correlate with anything sensible. The logit effects seem to make sense as well, but they also appear disconnected from the interpretation suggested by the real activations.
We find that a majority of the features are interpretable in both the vSAE and the SAE. One feature from each is shown below to illustrate how the networks perform in general.
vSAE:
This feature activates on locational tokens, especially those relating to direction. The synthetic activations seem unrelated, although the logit effects appear sensible on their own.
TopK SAE:
We can see that this feature seems to be polysemantic: it fires for time signals, but also fires on the term Christmas. The highest activations found from the synthetic activations seem to be unrelated to the activations found from the dataset. The logit effects seem to fire strongly in the positive as well as negative directions on suffixes.
In general, we found that the SAE was more interpretable than the vSAE, and had far fewer dead features.
To test our hypothesis that variational methods create pressure for features to disperse and organize more coherently, we examine the spatial organization of learned representations through latent analysis. If the Gaussian sampling mechanism successfully pushes features apart as theorized, we expect to observe fewer tightly packed clusters in the vSAE compared to standard SAE representations, along with a more uniform spread.
We also seek evidence against pathological activation patterns—specifically, we want to identify and minimize features that fire consistently across all inputs, as these indicate failed sparsity learning rather than meaningful feature differentiation.
Additionally, we analyze whether activation magnitudes distribute evenly across discovered clusters, which would suggest balanced utilization of the representational space rather than concentration in specific regions. Through arbitrary clustering of the learned features, we can quantify these organizational properties and determine whether the probabilistic framework actually delivers the improved spatial organization that motivated our variational approach.
We test across 20,000 LM activation samples using the same LM and dataset from training.
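A sketch of this latent analysis, assuming the decoder weight matrix has shape (d_model, d_dict) and the forward pass returns feature activations; the clustering is arbitrary (KMeans here) and only meant to quantify how evenly activation mass spreads:

```python
import numpy as np
import torch
from sklearn.cluster import KMeans
from sklearn.manifold import TSNE

@torch.no_grad()
def latent_organization(sae, activations: torch.Tensor,
                        n_clusters: int = 20, perplexity: float = 30.0):
    """Project decoder feature directions to 2D with t-SNE, measure how often
    each feature fires over sample LM activations, and group features into
    arbitrary clusters to check how evenly activation mass is spread.
    Assumes `sae.decoder.weight` has shape (d_model, d_dict) and
    `sae(x)` returns (x_hat, features)."""
    dirs = sae.decoder.weight.detach().T.float().cpu().numpy()   # (d_dict, d_model)

    _, feats = sae(activations)
    fire_rate = (feats > 0).float().mean(dim=0).cpu().numpy()    # per-feature firing rate

    coords = TSNE(n_components=2, perplexity=perplexity).fit_transform(dirs)
    clusters = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(dirs)

    # Total firing mass per cluster: roughly equal masses suggest balanced
    # utilization of the representational space.
    mass = np.array([fire_rate[clusters == c].sum() for c in range(n_clusters)])
    return coords, fire_rate, clusters, mass
```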
K=256 and K=128 cases
We see that while the SAE grows more sparse as K increases, the vSAE shows the opposite effect: a larger K causes more unique features to fire. We also see far more variability in feature utilization.
We also see a general trend that at least one feature activates strongly and consistently, regardless of whether the variational method is present. This may be a result of the TopK method itself.
Effect of KL term on latent space
The KL coefficient is the β from the loss function above: it weights the KL loss term, and higher values impose a stronger pull toward the prior.
Based on these visualizations, lower KL values pressure the model to use a greater number of total features more selectively. This creates more interpretable representations at the cost of representational capacity. We always see at least one cluster of high utilization, suggesting certain feature types are essential regardless of sparsity constraints. There is a tradeoff: high KL weights give dense, distributed representations, while low KL weights give sparse, selective representations. We ultimately chose KL=1 for our experiments.
To calculate the cosine similarity of features, we load both trained autoencoder models and extract their feature vectors from the decoder weights, then compute the cosine similarity between every pair of features in each model to form a complete similarity matrix.
We then test the models on sample data from the LM to identify which features actually activate during real usage, since dormant features aren't meaningful for comparison. The analysis focuses only on these active features, calculating statistical properties like mean, variance, and distribution of their cosine similarities.
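A sketch of this computation, under the same assumptions about the decoder weight shape and forward pass as above:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def active_feature_cosine_stats(sae, activations: torch.Tensor):
    """Pairwise cosine similarities between decoder feature directions,
    restricted to features that actually fire on sample data.
    Assumes `sae.decoder.weight` has shape (d_model, d_dict) and
    `sae(x)` returns (x_hat, features)."""
    _, feats = sae(activations)
    active = (feats > 0).any(dim=0)                  # which features ever fire

    dirs = sae.decoder.weight.detach().T[active]     # (n_active, d_model)
    dirs = F.normalize(dirs, dim=-1)
    sims = dirs @ dirs.T                             # full cosine similarity matrix

    # Exclude the diagonal (self-similarity) before computing statistics.
    off_diag = sims[~torch.eye(sims.shape[0], dtype=torch.bool, device=sims.device)]
    return off_diag.mean().item(), off_diag.var().item()
```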
Both vSAE TopK and SAE TopK achieve good feature orthogonality, but their relative performance depends critically on sparsity level. At k=512, both models perform nearly identically with minimal feature correlation. At k=64, vSAE TopK shows significantly more feature correlation, while SAE TopK maintains better orthogonality but with higher variance.
This suggests SAE TopK may be more robust across different sparsity regimes, while vSAE TopK's performance is more sparsity-dependent. The k=64 differences (0.0058 vs 0.0001) could contribute to the performance gaps seen earlier in the SAE Bench section. Both architectures successfully learn sparse representations, but SAE TopK demonstrates more consistent feature separation across sparsity levels. This disparity could be due to the low number of live features that we saw before.
The likely reason that we had far fewer live features in the vSAE compared to the SAE is the KL divergence pressure: we added a term to the loss which pushed the latent features towards 0. Features could not grow without penalty, so there was a higher chance that they would be killed in the process. The smaller number of live features reduces the computational capacity of the vSAE, explaining the performance drop.
Dead features due to the KL loss term were both symptom and cause of poor performance.
We ultimately found that the vSAE was not competitive in SAE Bench, had many more dead features than its SAE counterpart, and was slightly less interpretable in the feature analysis. We did, however, find that the vSAE features were more robust than those of its counterpart. We found slightly more correlated features in the vSAE, and found the best performance at k=512.
References
@misc{sae_vis,
title = {{SAE Visualizer}},
author = {Callum McDougall},
howpublished = {\url{https://github.com/callummcdougall/sae_vis}},
year = {2024}
}
@misc{marks2024dictionary_learning,
title = {dictionary_learning},
author = {Samuel Marks and Adam Karvonen and Aaron Mueller},
year = {2024},
howpublished = {\url{https://github.com/saprmarks/dictionary_learning}},
}
@misc{karvonen2025saebenchcomprehensivebenchmarksparse,
title={SAEBench: A Comprehensive Benchmark for Sparse Autoencoders in Language Model Interpretability},
author={Adam Karvonen and Can Rager and Johnny Lin and Curt Tigges and Joseph Bloom and David Chanin and Yeu-Tong Lau and Eoin Farrell and Callum McDougall and Kola Ayonrinde and Demian Till and Matthew Wearden and Arthur Conmy and Samuel Marks and Neel Nanda},
year={2025},
eprint={2503.09532},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2503.09532},
}