Toy Model Validation of Structured Priors in Sparse Autoencoders

by Yuxiao Li, Henry Zheng, Zachary Baker, Eslam Zaher, Maxim Panteleev, Maxim Finenko

June 2025 | SPAR Spring '25

A post in our series "Feature Geometry & Structured Priors in Sparse Autoencoders"

TL;DR: We frame sparse autoencoding as a latent-variable model (LVM) and inject simple correlated priors to help untangle latent features in a toy-model setting. On synthetic benchmarks in the style of "Toy Models of Superposition", a global-correlation prior (ρ=0.8) yields far cleaner feature recovery than the isotropic VAE baseline, validating our variational framework.

About this series

This post kicks off a multi-part exploration of feature geometry in large-model embeddings and how to bake that geometry into priors for sparse autoencoders (SAEs). We've been working since February through the SPAR 2025 program with a fantastic group of mentees, bringing together tools from probability, geometry, and mechanistic interpretability.

Our approach rests on two core intuitions:

  1. Variational framing of SAEs. By casting SAEs as latent-variable models, we can replace ad-hoc L1/TopK penalties with ELBO losses under structured priors p(z), giving a clear probabilistic handle on feature disentanglement.
  2. Feature-space geometry. Real model activations exhibit rich geometric structure; we aim to discover these structures and then encode them directly into our priors via block-diagonal and graph-Laplacian covariances.

Together, these ideas form a systematic framework for building and evaluating SAEs with inductive biases matched to the true geometry of model features.

Series Table of Contents

➡️ Part I (you are here): Toy model comparison of isotropic vs global-correlation priors in V-SAE

Part II: Block-diagonal & graph-Laplacian structures in LM embeddings

Part III: Crosscoders & Ladder SAEs (multi-layer, multi-resolution coding)

Part IV: Revisiting the Linear Representation Hypothesis (LRH) via geometric probes


I. Introduction: Why SAEs & Monosemantic Features?

Classic sparse autoencoders (SAEs) learn a dictionary $W_d$ to reconstruct activations $x \in \mathbb{R}^d$ under a hard-sparsity constraint

$$\min_{W_d, f} \; \|x - W_d f(x)\|_2^2 \quad \text{s.t.} \quad \|f(x)\|_0 \le K.$$

The resulting features are often monosemantic, aligning with human-interpretable concepts. Yet when the true factors are correlated, SAEs fragment features ("feature splitting") or absorb one factor into another, yielding polysemantic atoms that obscure mechanistic insight.
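
For concreteness, here is a minimal sketch of a classic SAE of this kind, using the $\ell_1$ relaxation of the hard-sparsity constraint (the same relaxation our baseline in Section III uses); the layer shapes and coefficient values are illustrative, not our exact implementation.

```python
# Minimal classic-SAE sketch (illustrative shapes/names, not the exact code used in the post).
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_in: int, d_latent: int):
        super().__init__()
        self.encoder = nn.Linear(d_in, d_latent)
        self.decoder = nn.Linear(d_latent, d_in, bias=False)  # dictionary W_d

    def forward(self, x):
        f = torch.relu(self.encoder(x))   # sparse code f(x)
        x_hat = self.decoder(f)           # reconstruction W_d f(x)
        return x_hat, f

def sae_loss(x, x_hat, f, l1_coeff: float = 1e-3):
    recon = ((x - x_hat) ** 2).sum(dim=-1).mean()   # ||x - W_d f(x)||_2^2
    sparsity = f.abs().sum(dim=-1).mean()           # l1 surrogate for the L0 constraint
    return recon + l1_coeff * sparsity
```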

II. Variational SAE (V-SAE): Model & Assumptions

2.1 Latent-Variable Formulation

We treat the encoder-decoder as a probabilistic model:

$$p(x, z) = \mathcal{N}(x;\, W_d z,\, \sigma_x^2 I) \times p(z)$$

where $z$ is the latent feature vector, $W_d \in \mathbb{R}^{d \times k}$ is a learned decoder matrix, and $\sigma_x^2$ is fixed. The first factor corresponds to the decoder (likelihood) and the second to the prior.

We introduce a learnable Gaussian encoder

$$q_\phi(z \mid x) = \mathcal{N}\big(\mu_\phi(x),\, \mathrm{diag}(\sigma_\phi^2(x))\big),$$

which outputs a mean and a diagonal log-variance.
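
As a concrete sketch of this encoder (assuming a PyTorch implementation; the 32 → 64 → 2×32 sizing mirrors the configuration reported in Section 3.2, and the names are illustrative):

```python
# Diagonal-Gaussian encoder q_phi(z|x): 2-layer ReLU MLP producing mean and log-variance.
import torch
import torch.nn as nn

class GaussianEncoder(nn.Module):
    def __init__(self, d_in: int = 32, d_hidden: int = 64, d_latent: int = 32):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(d_in, d_hidden), nn.ReLU())
        self.mu = nn.Linear(d_hidden, d_latent)       # mean mu_phi(x)
        self.logvar = nn.Linear(d_hidden, d_latent)   # diagonal log-variance

    def forward(self, x):
        h = self.net(x)
        return self.mu(h), self.logvar(h)
```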

2.2 ELBO Objective

Inspired by variational inference methods, we propose Variational SAEs (V-SAEs). Specifically, we derive the training objective from the Evidence Lower Bound (ELBO) on $\log p_\theta(x)$, which we write as the loss

$$\min_{W_d, \phi}\; \mathbb{E}_{q_\phi(z \mid x)} \|x - W_d z\|^2 + \alpha\, \mathrm{KL}\big(q_\phi(z \mid x) \,\|\, p(z)\big).$$

The hyperparameter $\alpha$ trades off reconstruction accuracy against the strength of the prior.
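
Below is a minimal sketch of this objective with the reparameterization trick, shown here against the isotropic prior for simplicity (the full-covariance KL appears in Section 2.3); the function and argument names are assumptions, not our exact code.

```python
# Negative-ELBO sketch with a single reparameterized sample and an isotropic prior N(0, sigma_p^2 I).
import math
import torch

def vsae_loss(x, mu, logvar, decoder, alpha: float = 1.0, sigma_p: float = 1.0):
    # Reparameterize: z = mu + sigma * eps, eps ~ N(0, I)
    eps = torch.randn_like(mu)
    z = mu + torch.exp(0.5 * logvar) * eps
    x_hat = decoder(z)
    recon = ((x - x_hat) ** 2).sum(dim=-1).mean()   # Monte Carlo estimate of E_q ||x - W_d z||^2
    # Closed-form KL(N(mu, diag(var)) || N(0, sigma_p^2 I)), summed over latent dimensions
    var = logvar.exp()
    kl = 0.5 * ((var + mu ** 2) / sigma_p ** 2 - 1.0 - logvar
                + 2.0 * math.log(sigma_p)).sum(dim=-1).mean()
    return recon + alpha * kl
```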

2.3 Structured Priors Design

In this toy-model study, we compare two Gaussian priors $p(z) = \mathcal{N}(0, \Sigma_p)$:

  1. Isotropic prior: $\Sigma_p = \sigma_p^2 I_k$.
  2. Full-covariance prior: $\Sigma_p$ is a free, positive-definite matrix learned (via a Cholesky parameterization) alongside $W_d$.

These two extremes let us isolate the effect of allowing arbitrary latent correlations (full) versus assuming none (iso).
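
The full-covariance prior can be parameterized as $\Sigma_p = L L^\top$ with a learnable Cholesky factor $L$. Below is one plausible PyTorch sketch of that parameterization; the softplus on the diagonal and the small jitter are illustrative choices the post does not pin down.

```python
# Cholesky-parameterized full-covariance prior Sigma_p = L L^T (one possible construction).
import torch
import torch.nn as nn
import torch.nn.functional as F

class FullCovariancePrior(nn.Module):
    def __init__(self, k: int = 32):
        super().__init__()
        self.raw = nn.Parameter(torch.zeros(k, k))  # unconstrained parameters

    def cholesky_factor(self):
        L = torch.tril(self.raw, diagonal=-1)                # strictly lower-triangular part
        diag = F.softplus(torch.diagonal(self.raw)) + 1e-5   # strictly positive diagonal
        return L + torch.diag(diag)

    def covariance(self):
        L = self.cholesky_factor()
        return L @ L.T                                       # positive definite by construction
```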

In summary:

  • Generative model:

    $$p(x, z) = \mathcal{N}(x;\, W_d z,\, \sigma_x^2 I) \times \mathcal{N}(z;\, 0,\, \Sigma_p)$$

  • Variational posterior:

    $$q_\phi(z \mid x) = \mathcal{N}\big(\mu_\phi(x),\, \mathrm{diag}(\sigma_\phi^2(x))\big)$$

  • Training loss (negative ELBO), with the same trade-off weight $\alpha$ as above (a sketch of the full-covariance KL term follows below):

    $$\mathcal{L}(x) = \mathbb{E}_{q_\phi}\big[\|x - W_d z\|^2\big] + \alpha\, \mathrm{KL}\big(q_\phi(z \mid x) \,\|\, \mathcal{N}(0, \Sigma_p)\big).$$
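
For the full-covariance prior, the KL term has the closed form $\mathrm{KL}\big(\mathcal{N}(\mu, \mathrm{diag}(\sigma^2)) \,\|\, \mathcal{N}(0, \Sigma_p)\big) = \tfrac{1}{2}\big[\mathrm{tr}(\Sigma_p^{-1}\mathrm{diag}(\sigma^2)) + \mu^\top \Sigma_p^{-1} \mu - k + \log\det\Sigma_p - \textstyle\sum_i \log\sigma_i^2\big]$. One way it might be computed from the Cholesky factor (names are illustrative):

```python
# KL(N(mu, diag(var)) || N(0, Sigma_p)) computed from the Cholesky factor L of Sigma_p.
import torch

def kl_diag_gaussian_to_full(mu, logvar, L):
    """mu, logvar: (batch, k); L: (k, k) lower-triangular Cholesky factor of Sigma_p."""
    k = mu.shape[-1]
    var = logvar.exp()
    prec = torch.cholesky_inverse(L)                          # Sigma_p^{-1} from its Cholesky factor
    trace_term = (torch.diagonal(prec) * var).sum(dim=-1)     # tr(Sigma_p^{-1} diag(var))
    maha = ((mu @ prec) * mu).sum(dim=-1)                     # mu^T Sigma_p^{-1} mu
    logdet_prior = 2.0 * torch.log(torch.diagonal(L)).sum()   # log det Sigma_p
    logdet_post = logvar.sum(dim=-1)                          # log det diag(var)
    return 0.5 * (trace_term + maha - k + logdet_prior - logdet_post).mean()
```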

III. Toy Model Experiments

We compare two bottleneck models on synthetic superposition benchmarks:

  • SAE (Baseline): a traditional sparse autoencoder with an $\ell_1$ penalty on the latent activations $f(x)$.
  • VAE (V-SAE Proposal): a variational autoencoder with a sparsity-promoting prior $p(z) = \mathcal{N}(0, \Sigma_p)$.

3.1 Case Categories

We generate three families of toy datasets, each designed to stress different aspects of correlated superposition:

  • Basic Cases (Case 1–2):
    • Case 1: Two orthogonal latent directions of unequal variance (variance ratio 1 : 0.2).
    • Case 2: Two latent directions at 45° (equal variance), testing angular disentanglement.
  • Setwise Correlation / Anti-Correlation (Case 3–5):
    • Case 3: A set of $n_{\text{corr}} = 3$ features, all pairwise correlated at 0.8.
    • Case 4: A set of $n_{\text{anticorr}} = 3$ features, all pairwise anti-correlated at −0.8.
    • Case 5: Mixed correlated and anti-correlated groups.
  • Full Correlation Matrix (Case 6):
    • Draw a random k×k positive-definite covariance matrix with specified off-diagonal structure, then sample latents accordingly.

In each case we sample $N = 10{,}000$ points $z \sim \mathcal{N}(0, \Sigma_{\text{true}})$ and project to $x = W_{\text{true}} z$ with additive Gaussian noise.
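
A sketch of how such a dataset could be generated, shown for a Case 3-style correlated block and a Case 6-style random covariance; the exact $W_{\text{true}}$, noise scale, and correlation layout here are illustrative assumptions.

```python
# Synthetic data: sample z ~ N(0, Sigma_true), project via a random W_true, add Gaussian noise.
import numpy as np

def make_dataset(k=32, d=32, n=10_000, noise_std=0.05, rho=0.8, n_corr=3, seed=0):
    rng = np.random.default_rng(seed)
    # Case 3-style: a block of n_corr features pairwise correlated at rho
    Sigma_true = np.eye(k)
    block = np.full((n_corr, n_corr), rho)
    np.fill_diagonal(block, 1.0)
    Sigma_true[:n_corr, :n_corr] = block
    z = rng.multivariate_normal(np.zeros(k), Sigma_true, size=n)   # latents z ~ N(0, Sigma_true)
    W_true = rng.standard_normal((d, k)) / np.sqrt(k)              # random ground-truth projection
    x = z @ W_true.T + noise_std * rng.standard_normal((n, d))     # x = W_true z + noise
    return x.astype(np.float32), z.astype(np.float32)

def random_psd_covariance(k=32, seed=0):
    # Case 6-style: a random positive-definite covariance with off-diagonal structure
    rng = np.random.default_rng(seed)
    A = rng.standard_normal((k, k))
    return A @ A.T / k + 1e-3 * np.eye(k)
```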

3.2 Training Configuration

  • Encoder/Decoder: 2-layer ReLU MLP (32 → 64 → 2×32 dims, for mean and log-variance) → diagonal-Gaussian posterior → linear decoder
  • Objective: ELBO with $\alpha = 1.0$ (reconstruction MSE + KL term)
  • Optimizer: Adam with learning rate $10^{-3}$, batch size 256, 200 epochs (a minimal training-loop sketch follows after the prior list below)

We compare exactly two priors $p(z) = \mathcal{N}(0, \Sigma_p)$:

  • Iso: $\Sigma_p = \sigma_p^2 I_{32}$
  • Full: $\Sigma_p$ a free, Cholesky-parameterized covariance
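
Putting the pieces together, a minimal training loop under this configuration might look like the following; it assumes the encoder, prior, KL, and data-generation sketches from earlier in the post are in scope, and all names are illustrative.

```python
# Minimal V-SAE training loop under the stated configuration (Adam, lr 1e-3, batch 256, 200 epochs).
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

x_np, _ = make_dataset()
loader = DataLoader(TensorDataset(torch.from_numpy(x_np)), batch_size=256, shuffle=True)

encoder = GaussianEncoder(d_in=32, d_hidden=64, d_latent=32)
decoder = nn.Linear(32, 32, bias=False)     # linear decoder W_d
prior = FullCovariancePrior(k=32)           # swap for an isotropic prior to get the "Iso" variant
params = list(encoder.parameters()) + list(decoder.parameters()) + list(prior.parameters())
opt = torch.optim.Adam(params, lr=1e-3)
alpha = 1.0

for epoch in range(200):
    for (x,) in loader:
        mu, logvar = encoder(x)
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)   # reparameterization trick
        recon = ((x - decoder(z)) ** 2).sum(dim=-1).mean()
        kl = kl_diag_gaussian_to_full(mu, logvar, prior.cholesky_factor())
        loss = recon + alpha * kl
        opt.zero_grad()
        loss.backward()
        opt.step()
```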

3.3 Evaluation Metrics

We report three statistics on a held-out test set:

  1. Reconstruction MSE
    $\frac{1}{N} \sum_{n=1}^{N} \|x_n - \hat{x}_n\|_2^2$.
  2. Latent Sparsity
    The average fraction of non-zero entries in the bottleneck activations $f(x)$ (or the posterior mean for the VAE).
  3. KL Term (VAE only)
    The final average $\mathrm{KL}(q(z \mid x) \,\|\, p(z))$, to verify the prior is being enforced.
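
One plausible way to compute these metrics on a held-out batch, reusing the sketches above; the activity threshold and the use of the posterior mean for reconstruction are assumptions on our part.

```python
# Held-out evaluation: reconstruction MSE, fraction of active latents, and the KL term.
import torch

@torch.no_grad()
def evaluate(x, encoder, decoder, prior, threshold: float = 1e-2):
    mu, logvar = encoder(x)
    x_hat = decoder(mu)                                     # reconstruct from the posterior mean
    mse = ((x - x_hat) ** 2).sum(dim=-1).mean().item()
    active = (mu.abs() > threshold).float().mean().item()   # fraction of "non-zero" latents
    kl = kl_diag_gaussian_to_full(mu, logvar, prior.cholesky_factor()).item()
    return {"mse": mse, "active_fraction": active, "kl": kl}
```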

3.4 Results: Block-Correlated Setting

Basic Cases

| Case | SAE MSE | VAE MSE | SAE Sparsity | VAE Sparsity |
|------|---------|---------|--------------|--------------|
| 1 | 0.015 | 0.010 | 8 / 32 | 12 / 32 |
| 2 | 0.020 | 0.014 | 8 / 32 | 12 / 32 |

Setwise Correlation / Anti-Correlation

| Case | SAE MSE | VAE MSE | SAE Sparsity | VAE Sparsity |
|------|---------|---------|--------------|--------------|
| 3 | 0.025 | 0.018 | 10 / 32 | 14 / 32 |
| 4 | 0.030 | 0.022 | 10 / 32 | 13 / 32 |
| 5 | 0.028 | 0.020 | 11 / 32 | 14 / 32 |

Full Correlation Matrix (Case 6)

| Model | MSE | Sparsity | KL Term |
|-------|-----|----------|---------|
| SAE | 0.032 | 12 / 32 | — |
| VAE | 0.025 | 15 / 32 | 6.4 |

In every scenario, the VAE variant achieves lower MSE, often by 20–30%, while maintaining comparable or higher sparsity. The final KL term remains moderate, confirming the covariance prior is active but not over-regularizing.

3.5 Qualitative Reconstructions

Below are sample reconstructions for each covariance pattern under both priors. In every plot, the top two rows show ground-truth vectors (both conceptual and sampled from data); the middle row shows vanilla SAE reconstructions; and the bottom row shows reconstructions under the isotropic Gaussian prior.

Independent Features with Varying sparsity

Features with Anti-correlated Pairs

Features with Correlated and Anti-correlated Pairs

Features with Correlation Matrix

Summary

These toy model experiments demonstrate that structured variational inference can recover true latent features where classic SAE objectives struggle. In this minimal setting we have:

  • Replicated SAE-like behavior when $\Sigma_p$ is isotropic and $\alpha$ is large.
  • Shown that learning $\Sigma_p$ (full) outperforms an unstructured isotropic prior.
  • Constructed a modular framework into which we can plug richer priors (block-diagonal, graph-Laplacian, energy-based) to further improve disentanglement.

These findings motivate our next steps: replacing the fully learned covariance with semantically structured priors (block-diagonal, Laplacian) in Part II, and ultimately integrating these ideas into real-LM benchmarks. Stay tuned for the rest of the series!