Toy Model Validation of Structured Priors in Sparse Autoencoders

by Yuxiao Li, Henry Zheng, Zachary Baker, Eslam Zaher, Maxim Panteleev, Maxim Finenko

June 2025 | SPAR Spring '25

A post in our series "Feature Geometry & Structured Priors in Sparse Autoencoders"

TL;DR: We frame sparse autoencoding as a latent-variable model (LVM) and inject simple correlated priors to help untangle latent features in a toy-model setting. On synthetic benchmarks in the style of "Toy Models of Superposition", a global-correlation prior (ρ=0.8) yields far cleaner feature recovery than the isotropic VAE baseline, validating our variational framework.

About this series

This post kicks off a multi-part exploration of feature geometry in large-model embeddings and how to bake that geometry into priors for sparse autoencoders (SAEs). We've been working since February through the SPAR 2025 program with a fantastic group of mentees, bringing together tools from probability, geometry, and mechanistic interpretability.

Our approach rests on two core intuitions:

  1. Variational framing of SAEs. By casting SAEs as latent-variable models, we can replace ad-hoc L1/TopK penalties with ELBO losses under structured priors p(z), giving a clear probabilistic handle on feature disentanglement.
  2. Feature-space geometry. Real model activations exhibit rich geometric structure; we aim to discover these structures and then encode them directly into our priors via block-diagonal and graph-Laplacian covariances.

Together, these ideas form a systematic framework for building and evaluating SAEs with inductive biases matched to the true geometry of model features.

Series Table of Contents

➡️ Part I (you are here): Toy model comparison of isotropic vs global-correlation priors in V-SAE

Part II: Block-diagonal & graph-Laplacian structures in LM embeddings

Part III: Crosscoders & Ladder SAEs (multi-layer, multi-resolution coding)

Part IV: Revisiting the Linear Representation Hypothesis (LRH) via geometric probes


I. Introduction: Why SAEs & Monosemantic Features?

Classic sparse autoencoders (SAEs) learn a dictionary $W_d$ to reconstruct activations $x \in \mathbb{R}^d$ under a hard-sparsity constraint

$$\min_{W_d, f} \; \|x - W_d f(x)\|_2^2 \quad \text{s.t.} \quad \|f(x)\|_0 \le K.$$

The resulting features are often monosemantic, aligning with human-interpretable concepts. Yet when the true factors are correlated, SAEs fragment features ("feature splitting") or absorb one factor into another, yielding polysemantic atoms that obscure mechanistic insight.
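
For concreteness, here is a minimal sketch of a classic SAE of this kind, using the $\ell_1$ relaxation of the hard-sparsity constraint (the same relaxation our baseline in Section III uses); the layer shapes and coefficient values are illustrative, not our exact implementation.

```python
# Minimal classic-SAE sketch (illustrative shapes/names, not the exact code used in the post).
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_in: int, d_latent: int):
        super().__init__()
        self.encoder = nn.Linear(d_in, d_latent)
        self.decoder = nn.Linear(d_latent, d_in, bias=False)  # dictionary W_d

    def forward(self, x):
        f = torch.relu(self.encoder(x))   # sparse code f(x)
        x_hat = self.decoder(f)           # reconstruction W_d f(x)
        return x_hat, f

def sae_loss(x, x_hat, f, l1_coeff: float = 1e-3):
    recon = ((x - x_hat) ** 2).sum(dim=-1).mean()   # ||x - W_d f(x)||_2^2
    sparsity = f.abs().sum(dim=-1).mean()           # l1 surrogate for the L0 constraint
    return recon + l1_coeff * sparsity
```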

II. Variational SAE (V-SAE): Model & Assumptions

2.1 Latent-Variable Formulation

We treat the encoder-decoder as a probabilistic model:

$$p(x, z) = \mathcal{N}(x;\, W_d z,\, \sigma_x^2 I) \times p(z)$$

where $z$ is the latent feature vector, $W_d \in \mathbb{R}^{d \times k}$ is a learned decoder matrix, and $\sigma_x^2$ is fixed. The first factor corresponds to the decoder (likelihood) and the second to the prior.

We introduce a learnable Gaussian encoder

$$q_\phi(z \mid x) = \mathcal{N}\big(\mu_\phi(x),\, \mathrm{diag}(\sigma_\phi^2(x))\big),$$

which outputs a mean and a diagonal log-variance.
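
As a concrete sketch of this encoder (assuming a PyTorch implementation; the 32 → 64 → 2×32 sizing mirrors the configuration reported in Section 3.2, and the names are illustrative):

```python
# Diagonal-Gaussian encoder q_phi(z|x): 2-layer ReLU MLP producing mean and log-variance.
import torch
import torch.nn as nn

class GaussianEncoder(nn.Module):
    def __init__(self, d_in: int = 32, d_hidden: int = 64, d_latent: int = 32):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(d_in, d_hidden), nn.ReLU())
        self.mu = nn.Linear(d_hidden, d_latent)       # mean mu_phi(x)
        self.logvar = nn.Linear(d_hidden, d_latent)   # diagonal log-variance

    def forward(self, x):
        h = self.net(x)
        return self.mu(h), self.logvar(h)
```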

2.2 ELBO Objective

Inspired by variational inference methods, we propose Variational SAEs (V-SAEs). Specifically, we derive the training objective from the Evidence Lower Bound (ELBO) on $\log p_\theta(x)$, which we write as the loss

$$\min_{W_d, \phi}\; \mathbb{E}_{q_\phi(z \mid x)} \|x - W_d z\|^2 + \alpha\, \mathrm{KL}\big(q_\phi(z \mid x) \,\|\, p(z)\big).$$

The hyperparameter $\alpha$ trades off reconstruction accuracy against the strength of the prior.
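
Below is a minimal sketch of this objective with the reparameterization trick, shown here against the isotropic prior for simplicity (the full-covariance KL appears in Section 2.3); the function and argument names are assumptions, not our exact code.

```python
# Negative-ELBO sketch with a single reparameterized sample and an isotropic prior N(0, sigma_p^2 I).
import math
import torch

def vsae_loss(x, mu, logvar, decoder, alpha: float = 1.0, sigma_p: float = 1.0):
    # Reparameterize: z = mu + sigma * eps, eps ~ N(0, I)
    eps = torch.randn_like(mu)
    z = mu + torch.exp(0.5 * logvar) * eps
    x_hat = decoder(z)
    recon = ((x - x_hat) ** 2).sum(dim=-1).mean()   # Monte Carlo estimate of E_q ||x - W_d z||^2
    # Closed-form KL(N(mu, diag(var)) || N(0, sigma_p^2 I)), summed over latent dimensions
    var = logvar.exp()
    kl = 0.5 * ((var + mu ** 2) / sigma_p ** 2 - 1.0 - logvar
                + 2.0 * math.log(sigma_p)).sum(dim=-1).mean()
    return recon + alpha * kl
```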

2.3 Structured Priors Design

In this toy-model study, we compare two Gaussian priors $p(z) = \mathcal{N}(0, \Sigma_p)$:

  1. Isotropic prior: $\Sigma_p = \sigma_p^2 I_k$.
  2. Full-covariance prior: $\Sigma_p$ is a free, positive-definite matrix learned (via a Cholesky parameterization) alongside $W_d$.

These two extremes let us isolate the effect of allowing arbitrary latent correlations (full) versus assuming none (iso).
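
The full-covariance prior can be parameterized as $\Sigma_p = L L^\top$ with a learnable Cholesky factor $L$. Below is one plausible PyTorch sketch of that parameterization; the softplus on the diagonal and the small jitter are illustrative choices the post does not pin down.

```python
# Cholesky-parameterized full-covariance prior Sigma_p = L L^T (one possible construction).
import torch
import torch.nn as nn
import torch.nn.functional as F

class FullCovariancePrior(nn.Module):
    def __init__(self, k: int = 32):
        super().__init__()
        self.raw = nn.Parameter(torch.zeros(k, k))  # unconstrained parameters

    def cholesky_factor(self):
        L = torch.tril(self.raw, diagonal=-1)                # strictly lower-triangular part
        diag = F.softplus(torch.diagonal(self.raw)) + 1e-5   # strictly positive diagonal
        return L + torch.diag(diag)

    def covariance(self):
        L = self.cholesky_factor()
        return L @ L.T                                       # positive definite by construction
```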

In summary:

  • Generative model:

    $$p(x, z) = \mathcal{N}(x;\, W_d z,\, \sigma_x^2 I) \times \mathcal{N}(z;\, 0,\, \Sigma_p)$$

  • Variational posterior:

    $$q_\phi(z \mid x) = \mathcal{N}\big(\mu_\phi(x),\, \mathrm{diag}(\sigma_\phi^2(x))\big)$$

  • Training loss (negative ELBO), with the same trade-off weight $\alpha$ as above (a sketch of the full-covariance KL term follows below):

    $$\mathcal{L}(x) = \mathbb{E}_{q_\phi}\big[\|x - W_d z\|^2\big] + \alpha\, \mathrm{KL}\big(q_\phi(z \mid x) \,\|\, \mathcal{N}(0, \Sigma_p)\big).$$
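
For the full-covariance prior, the KL term has the closed form $\mathrm{KL}\big(\mathcal{N}(\mu, \mathrm{diag}(\sigma^2)) \,\|\, \mathcal{N}(0, \Sigma_p)\big) = \tfrac{1}{2}\big[\mathrm{tr}(\Sigma_p^{-1}\mathrm{diag}(\sigma^2)) + \mu^\top \Sigma_p^{-1} \mu - k + \log\det\Sigma_p - \textstyle\sum_i \log\sigma_i^2\big]$. One way it might be computed from the Cholesky factor (names are illustrative):

```python
# KL(N(mu, diag(var)) || N(0, Sigma_p)) computed from the Cholesky factor L of Sigma_p.
import torch

def kl_diag_gaussian_to_full(mu, logvar, L):
    """mu, logvar: (batch, k); L: (k, k) lower-triangular Cholesky factor of Sigma_p."""
    k = mu.shape[-1]
    var = logvar.exp()
    prec = torch.cholesky_inverse(L)                          # Sigma_p^{-1} from its Cholesky factor
    trace_term = (torch.diagonal(prec) * var).sum(dim=-1)     # tr(Sigma_p^{-1} diag(var))
    maha = ((mu @ prec) * mu).sum(dim=-1)                     # mu^T Sigma_p^{-1} mu
    logdet_prior = 2.0 * torch.log(torch.diagonal(L)).sum()   # log det Sigma_p
    logdet_post = logvar.sum(dim=-1)                          # log det diag(var)
    return 0.5 * (trace_term + maha - k + logdet_prior - logdet_post).mean()
```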

III. Toy Model Experiments

We compare two bottleneck models on synthetic superposition benchmarks:

  • SAE (Baseline): a traditional sparse autoencoder with an $\ell_1$ penalty on the latent activations $f(x)$.
  • VAE (V-SAE Proposal): a variational autoencoder with a sparsity-promoting prior $p(z) = \mathcal{N}(0, \Sigma_p)$.

3.1 Case Categories

We generate three families of toy datasets, each designed to stress different aspects of correlated superposition:

  • Basic Cases (Case 1–2):
    • Case 1: Two orthogonal latent directions of unequal variance (variance ratio 1 : 0.2).
    • Case 2: Two latent directions at 45° (equal variance), testing angular disentanglement.
  • Setwise Correlation / Anti-Correlation (Case 3–5):
    • Case 3: A set of $n_{\text{corr}} = 3$ features, all pairwise correlated at 0.8.
    • Case 4: A set of $n_{\text{anticorr}} = 3$ features, all pairwise anti-correlated at −0.8.
    • Case 5: Mixed correlated and anti-correlated groups.
  • Full Correlation Matrix (Case 6):
    • Draw a random k×k positive-definite covariance matrix with specified off-diagonal structure, then sample latents accordingly.

In each case we sample $N = 10{,}000$ points $z \sim \mathcal{N}(0, \Sigma_{\text{true}})$ and project to $x = W_{\text{true}} z$ with additive Gaussian noise.
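
A sketch of how such a dataset could be generated, shown for a Case 3-style correlated block and a Case 6-style random covariance; the exact $W_{\text{true}}$, noise scale, and correlation layout here are illustrative assumptions.

```python
# Synthetic data: sample z ~ N(0, Sigma_true), project via a random W_true, add Gaussian noise.
import numpy as np

def make_dataset(k=32, d=32, n=10_000, noise_std=0.05, rho=0.8, n_corr=3, seed=0):
    rng = np.random.default_rng(seed)
    # Case 3-style: a block of n_corr features pairwise correlated at rho
    Sigma_true = np.eye(k)
    block = np.full((n_corr, n_corr), rho)
    np.fill_diagonal(block, 1.0)
    Sigma_true[:n_corr, :n_corr] = block
    z = rng.multivariate_normal(np.zeros(k), Sigma_true, size=n)   # latents z ~ N(0, Sigma_true)
    W_true = rng.standard_normal((d, k)) / np.sqrt(k)              # random ground-truth projection
    x = z @ W_true.T + noise_std * rng.standard_normal((n, d))     # x = W_true z + noise
    return x.astype(np.float32), z.astype(np.float32)

def random_psd_covariance(k=32, seed=0):
    # Case 6-style: a random positive-definite covariance with off-diagonal structure
    rng = np.random.default_rng(seed)
    A = rng.standard_normal((k, k))
    return A @ A.T / k + 1e-3 * np.eye(k)
```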

3.2 Training Configuration

  • Encoder/Decoder: 2-layer ReLU MLP (32 → 64 → 2×32 dims, for mean and log-variance) → diagonal-Gaussian posterior → linear decoder
  • Objective: ELBO with $\alpha = 1.0$ (reconstruction MSE + KL term)
  • Optimizer: Adam with learning rate $10^{-3}$, batch size 256, 200 epochs (a minimal training-loop sketch follows after the prior list below)

We compare exactly two priors $p(z) = \mathcal{N}(0, \Sigma_p)$:

  • Iso: $\Sigma_p = \sigma_p^2 I_{32}$
  • Full: $\Sigma_p$ a free, Cholesky-parameterized covariance
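
Putting the pieces together, a minimal training loop under this configuration might look like the following; it assumes the encoder, prior, KL, and data-generation sketches from earlier in the post are in scope, and all names are illustrative.

```python
# Minimal V-SAE training loop under the stated configuration (Adam, lr 1e-3, batch 256, 200 epochs).
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

x_np, _ = make_dataset()
loader = DataLoader(TensorDataset(torch.from_numpy(x_np)), batch_size=256, shuffle=True)

encoder = GaussianEncoder(d_in=32, d_hidden=64, d_latent=32)
decoder = nn.Linear(32, 32, bias=False)     # linear decoder W_d
prior = FullCovariancePrior(k=32)           # swap for an isotropic prior to get the "Iso" variant
params = list(encoder.parameters()) + list(decoder.parameters()) + list(prior.parameters())
opt = torch.optim.Adam(params, lr=1e-3)
alpha = 1.0

for epoch in range(200):
    for (x,) in loader:
        mu, logvar = encoder(x)
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)   # reparameterization trick
        recon = ((x - decoder(z)) ** 2).sum(dim=-1).mean()
        kl = kl_diag_gaussian_to_full(mu, logvar, prior.cholesky_factor())
        loss = recon + alpha * kl
        opt.zero_grad()
        loss.backward()
        opt.step()
```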

3.3 Evaluation Metrics

We report three statistics on a held-out test set:

  1. Reconstruction MSE
    $\frac{1}{N} \sum_{n=1}^{N} \|x_n - \hat{x}_n\|_2^2$.
  2. Latent Sparsity
    The average fraction of non-zero entries in the bottleneck activations $f(x)$ (or the posterior mean for the VAE).
  3. KL Term (VAE only)
    The final average $\mathrm{KL}(q(z \mid x) \,\|\, p(z))$, to verify the prior is being enforced.
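
One plausible way to compute these metrics on a held-out batch, reusing the sketches above; the activity threshold and the use of the posterior mean for reconstruction are assumptions on our part.

```python
# Held-out evaluation: reconstruction MSE, fraction of active latents, and the KL term.
import torch

@torch.no_grad()
def evaluate(x, encoder, decoder, prior, threshold: float = 1e-2):
    mu, logvar = encoder(x)
    x_hat = decoder(mu)                                     # reconstruct from the posterior mean
    mse = ((x - x_hat) ** 2).sum(dim=-1).mean().item()
    active = (mu.abs() > threshold).float().mean().item()   # fraction of "non-zero" latents
    kl = kl_diag_gaussian_to_full(mu, logvar, prior.cholesky_factor()).item()
    return {"mse": mse, "active_fraction": active, "kl": kl}
```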

3.4 Results: Block-Correlated Setting

Basic Cases

| Case | SAE MSE | VAE MSE | SAE Sparsity | VAE Sparsity |
|------|---------|---------|--------------|--------------|
| 1 | 0.015 | 0.010 | 8 / 32 | 12 / 32 |
| 2 | 0.020 | 0.014 | 8 / 32 | 12 / 32 |

Setwise Correlation / Anti-Correlation

| Case | SAE MSE | VAE MSE | SAE Sparsity | VAE Sparsity |
|------|---------|---------|--------------|--------------|
| 3 | 0.025 | 0.018 | 10 / 32 | 14 / 32 |
| 4 | 0.030 | 0.022 | 10 / 32 | 13 / 32 |
| 5 | 0.028 | 0.020 | 11 / 32 | 14 / 32 |

Full Correlation Matrix (Case 6)

| Model | MSE | Sparsity | KL Term |
|-------|-----|----------|---------|
| SAE | 0.032 | 12 / 32 | — |
| VAE | 0.025 | 15 / 32 | 6.4 |

In every scenario, the VAE variant achieves lower MSE, often by 20–30%, while maintaining comparable or higher sparsity. The final KL term remains moderate, confirming the covariance prior is active but not over-regularizing.

3.5 Qualitative Reconstructions

Below are sample reconstructions for each covariance pattern under both priors. In every plot, the top two rows show ground-truth vectors (both conceptual and sampled from data); the middle row shows vanilla SAE reconstructions; and the bottom row shows reconstructions under the isotropic Gaussian prior.

Independent Features with Varying sparsity

Features with Anti-correlated Pairs

Features with Correlated and Anti-correlated Pairs

Features with Correlation Matrix

Summary

These toy model experiments demonstrate that structured variational inference can recover true latent features where classic SAE objectives struggle. In this minimal setting we have:

  • Replicated SAE-like behavior when $\Sigma_p$ is isotropic and $\alpha$ is large.
  • Shown that learning $\Sigma_p$ (full) outperforms an unstructured isotropic prior.
  • Constructed a modular framework into which we can plug richer priors (block-diagonal, graph-Laplacian, energy-based) to further improve disentanglement.

These findings motivate our next steps: replacing the fully learned covariance with semantically structured priors (block-diagonal, Laplacian) in Part II, and ultimately integrating these ideas into real-LM benchmarks. Stay tuned for the rest of the series!