by Yuxiao Li, Henry Zheng, Zachary Baker, Eslam Zaher, Maxim Panteleev, Maxim Finenko
June 2025 | SPAR Spring '25
A post in our series "Feature Geometry & Structured Priors in Sparse Autoencoders"
TL;DR: We frame sparse autoencoding as a latent variable model (LVM) and inject simple correlated priors to help untangle latent features in a toy-model setting. On synthetic benchmarks in the style of "Toy Model of Superposition", a global-correlation prior $\mathcal{N}(0, \Sigma)$ yields far cleaner feature recovery than the isotropic VAE baseline, validating our variational framework.
This post kicks off a multi-part exploration of feature geometry in large-model embeddings and how to bake that geometry into priors for sparse autoencoders (SAEs). We've been working since February through the SPAR 2025 program with a fantastic group of mentees, bringing together tools from probability, geometry, and mechanistic interpretability.
Our approach rests on two core intuitions:
- Features in large-model embeddings are not independent; they carry geometric structure (correlations, clusters, hierarchies) that classic SAEs ignore.
- That geometry can be baked into the SAE objective as a prior, giving the dictionary an inductive bias toward the true feature structure.

Together, these ideas form a systematic framework for building and evaluating SAEs with inductive biases matched to the true geometry of model features.
Series Table of Contents
➡️ Part I (you are here): Toy model comparison of isotropic vs global-correlation priors in V-SAE
Part II: Block-diagonal & graph-Laplacian structures in LM embeddings
Part III: Crosscoders & Ladder SAEs (multi-layer, multi-resolution coding)
Part IV: Revisiting the Linear Representation Hypothesis (LRH) via geometric probes
Classic sparse autoencoders (SAEs) learn a dictionary $W_d$ to reconstruct activations $x$ under a hard-sparsity constraint:
$$\min_{W_d,\, z}\; \lVert x - W_d z \rVert_2^2 \quad \text{s.t.}\quad \lVert z \rVert_0 \le k.$$
These monosemantic features often align with human-interpretable concepts. Yet when true factors are correlated, SAEs fragment (“feature splitting”) or absorb one factor into another, giving polysemantic atoms that obscure mechanistic insight.
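For concreteness, here is a minimal sketch of a hard-sparsity (top-k) SAE of the kind described above. The class name `TopKSAE` and its hyperparameters are illustrative choices for this post, not our actual baseline implementation.

```python
import torch
import torch.nn as nn

class TopKSAE(nn.Module):
    """Minimal hard-sparsity SAE: keep only the k largest latent activations."""
    def __init__(self, d_model: int, n_latents: int, k: int):
        super().__init__()
        self.enc = nn.Linear(d_model, n_latents)
        self.dec = nn.Linear(n_latents, d_model, bias=False)  # dictionary W_d
        self.k = k

    def forward(self, x):
        pre = torch.relu(self.enc(x))                       # non-negative latent pre-codes
        vals, idx = pre.topk(self.k, dim=-1)                 # enforce ||z||_0 <= k
        z = torch.zeros_like(pre).scatter_(-1, idx, vals)    # zero out all but the top k
        x_hat = self.dec(z)                                  # reconstruct from the dictionary
        recon = ((x - x_hat) ** 2).sum(-1).mean()
        return recon, z
```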
We treat the encoder-decoder pair as a probabilistic model:
$$p(x, z) = p(x \mid z)\, p(z), \qquad p(x \mid z) = \mathcal{N}\big(x \mid W_d z,\ \sigma^2 I\big), \qquad p(z) = \mathcal{N}(0, \Sigma_p),$$
where $z$ is the latent feature vector, $W_d$ is a learned decoder matrix, and $\sigma^2$ is fixed. The first factor plays the role of the decoder and the second the role of the prior.
We introduce a Gaussian encoder
$$q_\phi(z \mid x) = \mathcal{N}\!\big(z \mid \mu_\phi(x),\ \operatorname{diag}(\sigma_\phi^2(x))\big),$$
where $\phi$ parameterizes a learnable encoder network that outputs the mean $\mu_\phi(x)$ and the diagonal log-variance $\log \sigma_\phi^2(x)$.
Inspired by variational inference methods, we propose Variational SAEs (V-SAEs). Specifically, we derive the training objective from the Evidence Lower Bound (ELBO) on $\log p(x)$, which we write as the loss
$$\mathcal{L}(\theta, \phi) \;=\; \mathbb{E}_{q_\phi(z \mid x)}\!\big[\lVert x - W_d z \rVert_2^2\big] \;+\; \beta\, D_{\mathrm{KL}}\!\big(q_\phi(z \mid x)\,\big\|\,p(z)\big).$$
The hyperparameter $\beta$ controls the trade-off between reconstruction accuracy and the strength of the prior.
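Since both $q_\phi(z \mid x)$ and $p(z)$ are Gaussian, the KL term has a closed form (a standard identity, with $d$ the latent dimension):
$$D_{\mathrm{KL}}\!\Big(\mathcal{N}\big(\mu, \operatorname{diag}(\sigma^2)\big)\,\Big\|\,\mathcal{N}(0, \Sigma_p)\Big) = \frac{1}{2}\left[\operatorname{tr}\!\big(\Sigma_p^{-1}\operatorname{diag}(\sigma^2)\big) + \mu^{\top}\Sigma_p^{-1}\mu - d + \ln\frac{\det\Sigma_p}{\prod_{i=1}^{d}\sigma_i^{2}}\right],$$
which reduces to the familiar $\tfrac12\sum_i\big(\sigma_i^2 + \mu_i^2 - 1 - \ln\sigma_i^2\big)$ when $\Sigma_p = I$.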
In this toy-model study, we compare two Gaussian priors $p(z)$:
- Isotropic ("iso"): $p(z) = \mathcal{N}(0, I)$, the standard VAE choice with no latent correlations.
- Global-correlation ("full"): $p(z) = \mathcal{N}(0, \Sigma)$ with a full covariance matrix $\Sigma$.

These two extremes let us isolate the effect of allowing arbitrary latent correlations (full) versus assuming none (iso).
In summary:
Generative model: $p(x, z) = \mathcal{N}\big(x \mid W_d z,\ \sigma^2 I\big)\, \mathcal{N}(z \mid 0, \Sigma_p)$
Variational posterior: $q_\phi(z \mid x) = \mathcal{N}\!\big(z \mid \mu_\phi(x),\ \operatorname{diag}(\sigma_\phi^2(x))\big)$
Training loss (negative ELBO): $\mathcal{L} = \mathbb{E}_{q_\phi(z \mid x)}\!\big[\lVert x - W_d z \rVert_2^2\big] + \beta\, D_{\mathrm{KL}}\!\big(q_\phi(z \mid x)\,\|\,p(z)\big)$
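Putting the pieces together, here is a minimal PyTorch sketch of the V-SAE objective with a learnable global-correlation prior $\Sigma_p = LL^\top + \epsilon I$, parameterized through a lower-triangular factor $L$. The class and argument names (`VSAE`, `d_model`, `n_latents`, `beta`) are illustrative; this is a sketch of the objective rather than our exact training code.

```python
import torch
import torch.nn as nn

class VSAE(nn.Module):
    """Sketch of a V-SAE with a learnable full-covariance (global-correlation) prior."""
    def __init__(self, d_model: int, n_latents: int, beta: float = 1.0):
        super().__init__()
        self.enc = nn.Linear(d_model, 2 * n_latents)          # outputs mean and log-variance
        self.W_d = nn.Parameter(0.01 * torch.randn(d_model, n_latents))  # decoder dictionary
        self.L = nn.Parameter(torch.eye(n_latents))            # lower-triangular prior factor
        self.beta = beta

    def prior_cov(self):
        # Sigma_p = L L^T + eps*I keeps the prior covariance positive definite
        L = torch.tril(self.L)
        return L @ L.T + 1e-4 * torch.eye(L.shape[0], device=L.device)

    def forward(self, x):
        mu, log_var = self.enc(x).chunk(2, dim=-1)
        z = mu + torch.exp(0.5 * log_var) * torch.randn_like(mu)   # reparameterization trick
        x_hat = z @ self.W_d.T
        recon = ((x - x_hat) ** 2).sum(-1).mean()

        # Closed-form KL( N(mu, diag(sigma^2)) || N(0, Sigma_p) ), as in the identity above
        Sigma_p = self.prior_cov()
        P = torch.linalg.inv(Sigma_p)                               # prior precision matrix
        var = log_var.exp()
        trace = (torch.diagonal(P)[None, :] * var).sum(-1)          # tr(Sigma_p^{-1} diag(var))
        maha = torch.einsum("bi,ij,bj->b", mu, P, mu)               # mu^T Sigma_p^{-1} mu
        logdet = torch.logdet(Sigma_p) - log_var.sum(-1)            # ln det(Sigma_p) - ln det(diag(var))
        kl = 0.5 * (trace + maha - mu.shape[-1] + logdet).mean()

        return recon + self.beta * kl, x_hat
```

Freezing `self.L` at the identity recovers the isotropic baseline, so the same sketch covers both priors in the comparison.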
III. Toy Model Experiments
We compare two bottleneck models on synthetic superposition benchmarks: a vanilla SAE baseline ("SAE" in the tables below) and the V-SAE with a structured prior ("VAE" in the tables below).
We generate three families of toy datasets, each designed to stress a different aspect of correlated superposition:
- independent features with varying sparsity (Cases 1–2),
- setwise correlated / anti-correlated feature pairs (Cases 3–5), and
- a full correlation matrix over all features (Case 6).
In each case we sample sparse latent points $z \in \mathbb{R}^{32}$ and project them to lower-dimensional observations $x = Wz + \epsilon$ with additive Gaussian noise $\epsilon$.
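As an illustration, a toy-data generator along these lines might look as follows. The observation dimension (`d_obs = 8`), the Bernoulli-mask sparsity mechanism, and the function name `make_toy_data` are assumptions made for this sketch, not the exact benchmark code.

```python
import numpy as np

def make_toy_data(n_samples=10_000, n_features=32, d_obs=8,
                  sparsity=0.1, cov=None, noise_std=0.01, seed=0):
    rng = np.random.default_rng(seed)
    cov = np.eye(n_features) if cov is None else cov
    # Correlated feature magnitudes (absolute value keeps features non-negative)
    mags = np.abs(rng.multivariate_normal(np.zeros(n_features), cov, size=n_samples))
    # Bernoulli mask enforces sparsity: each feature is active with probability `sparsity`
    mask = rng.random((n_samples, n_features)) < sparsity
    z = mags * mask
    # Random projection into a lower-dimensional observation space, plus Gaussian noise
    W = rng.normal(size=(d_obs, n_features)) / np.sqrt(d_obs)
    x = z @ W.T + noise_std * rng.normal(size=(n_samples, d_obs))
    return x.astype(np.float32), z.astype(np.float32), W
```

Case-specific covariance structures (pairwise anti-correlations, a full random correlation matrix, etc.) are passed through the `cov` argument.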
We compare exactly two priors :
Basic Cases
Case | SAE MSE | VAE MSE | SAE Sparsity | VAE Sparsity |
---|---|---|---|---|
1 | 0.015 | 0.010 | 8 / 32 | 12 / 32 |
2 | 0.020 | 0.014 | 8 / 32 | 12 / 32 |
Setwise Correlation / Anti-Correlation
Case | SAE MSE | VAE MSE | SAE Sparsity | VAE Sparsity |
---|---|---|---|---|
3 | 0.025 | 0.018 | 10 / 32 | 14 / 32 |
4 | 0.030 | 0.022 | 10 / 32 | 13 / 32 |
5 | 0.028 | 0.020 | 11 / 32 | 14 / 32 |
Full Correlation Matrix (Case 6)
| Model | MSE | Sparsity | KL Term |
|---|---|---|---|
| SAE | 0.032 | 12 / 32 | — |
| VAE | 0.025 | 15 / 32 | 6.4 |
In every scenario, the VAE variant achieves lower MSE (often by 20–30%) while maintaining comparable or higher sparsity. The learned KL term remains moderate, confirming the covariance prior is active but not over-regularizing.
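For reference, the "Sparsity a / 32" counts above can be computed as the number of latent dimensions whose average activation exceeds a small threshold; the snippet below is one such definition, with the threshold value being our assumption.

```python
import torch

def active_latents(z: torch.Tensor, threshold: float = 1e-3) -> int:
    """z: (n_samples, n_latents) latent codes; count latents that are used on average."""
    return int((z.abs().mean(dim=0) > threshold).sum().item())
```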
Below are sample reconstructions for each covariance pattern under both priors. In every plot, the top two rows show ground-truth vectors (both conceptual and sampled from the data); the middle row shows vanilla SAE reconstructions; and the bottom row shows reconstructions from the V-SAE with the isotropic Gaussian prior.
Independent Features with Varying Sparsity
Features with Anti-correlated Pairs
Features with Correlated and Anti-correlated Pairs
Features with Correlation Matrix
These toy model experiments demonstrate that structured variational inference can restore true latent features where classic SAE objectives struggle. In this minimal setting we have:
- achieved 20–30% lower reconstruction error with the V-SAE across all covariance patterns;
- recovered more of the ground-truth features (higher active-latent counts) than the vanilla SAE; and
- kept the KL term moderate, so the prior guides training without dominating it.
These findings motivate our next steps: replacing the fully learned covariance with semantically structured priors (block-diagonal, graph-Laplacian) in Part II, and ultimately integrating these ideas into real-LM benchmarks. Stay tuned for the rest of the series!