by Yuxiao Li, Zachary Baker, Maxim Panteleev, Maxim Finenko
June 2025 | SPAR Spring '25
A post in our series "Feature Geometry & Structured Priors in Sparse Autoencoders"
About this series
This is the second post of our series on how realistic feature geometry in language model (LM) embeddings can be discovered and then encoded into sparse autoencoder (SAE) priors. Since February, we have combined probabilistic modeling, geometric analysis, and mechanistic interpretability.
Series Table of Contents
Part I: Basic Method - Imposing interpretable structures via Variational Priors (V-SAE)
➡️ Part II (you are here): Basic Structure - Detecting block-diagonal structures in LM activations
Part III: First Implementation Attempt - V-SAEs with graph-based priors for categorical structure
Part IV: Second Implementation Attempt - Crosscoders & Ladder SAEs for hierarchical structure via architectural priors
Part V: Analysis of Variational Sparse Autoencoders
Part VI: Revisiting the Linear Representation Hypothesis (LRH) via geometric probes
0. Background Story
I. From Priors to Geometry: Why We Need to "Look Under the Hood"
Last week we kicked off this series with a dive into Variational SAEs (V-SAEs) and showed their ability to sculpt feature disentanglement via variational priors. On simple synthetic benchmarks, the vanilla SAE struggled to separate correlated latents, whereas a global-correlation prior dramatically purified features. But one question remained: what does the geometry of real LM representations actually look like?
Answering that requires exploring the true layout of the SAE's latent space and the model's raw embedding space. Only once we know how concepts cluster "in the wild" can we bake those patterns back into our priors--sharpening real semantic blocks and eliminating spurious overlap.
II. Related Work: Mapping and Manipulating Semantic Geometry
A growing body of work has begun to chart the geometry of learned representations--and to use that map to guide feature learning:
Taken together, these works paint a picture: LLM feature spaces under superposition are neither random nor purely orthogonal--they exhibit structured geometry that we can and should exploit.
III. Propositions on Feature Geometry (Informal)
Before designing SAE priors, we posit three core hypotheses about how relational features arrange themselves in latent space:
Finally, these observed patterns suggest a guiding principle for our next step:
IV. Empirical Structure Discovery
We tested Propositions 1-3 on synthetic family-tree embeddings in toy experiments.
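To give a flavor of what such a test looks like, here is a minimal, self-contained sketch; the toy data, group sizes, and noise scale are illustrative assumptions, not our actual experimental pipeline. It builds two correlated relation groups, computes a cosine-similarity matrix, and compares within-group vs. between-group similarity; a block-diagonal pattern in the sorted similarity matrix is the signature of the structure described above.

```python
# A minimal sketch (not our full pipeline): given an (n_samples, d) matrix of
# embeddings labeled by relation type, check for block structure by comparing
# within-relation vs. between-relation cosine similarity.
import numpy as np
from scipy.cluster.hierarchy import linkage, leaves_list

rng = np.random.default_rng(0)

# Toy stand-in for family-tree embeddings: two relation groups (e.g. "nuclear
# family" vs. "in-law") that each share a group direction plus noise.
d, n_per_group = 64, 100
group_dirs = rng.normal(size=(2, d))
X = np.vstack([
    g_dir + 0.5 * rng.normal(size=(n_per_group, d))
    for g_dir in group_dirs
])
labels = np.repeat([0, 1], n_per_group)

# Cosine-similarity matrix.
Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
S = Xn @ Xn.T

# Reorder rows/columns by hierarchical clustering; a block-diagonal pattern
# should emerge if the relations really cluster.
order = leaves_list(linkage(Xn, method="average"))
S_sorted = S[np.ix_(order, order)]  # visualize as a heatmap to eyeball blocks

within = S[labels[:, None] == labels[None, :]].mean()
between = S[labels[:, None] != labels[None, :]].mean()
print(f"mean within-block similarity {within:.2f} vs. between-block {between:.2f}")
```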
V. From Geometry to Prior Design
Our key insight:
Concrete next-step priors:
Σ_p = blockdiag(Σ_nucFam, Σ_inLaw) with intra-block correlation ρ.
Build a small family-tree graph G, compute its Laplacian L_G, and add λ Tr(Z⊤L_GZ) to the loss to encourage the activations Z to vary smoothly along known edges (parent–child vs. sibling); both this term and the block-diagonal prior above are sketched in code after this list.
For truly hierarchical relations, a hyperbolic latent prior naturally embeds tree distances and can be plugged into a variational framework (e.g. Poincaré VAE).
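As a concrete illustration of the first two priors above, here is a small PyTorch sketch. The block sizes, edge list, latent shape, and helper names (block_diag_covariance, laplacian_smoothness) are placeholders for illustration, not our experimental code; the hyperbolic prior is omitted since it requires a Poincaré-ball parameterization.

```python
# Hedged sketch of the first two priors; shapes and names are illustrative.
import torch

def block_diag_covariance(block_sizes, rho):
    """Prior covariance: unit variances, correlation rho within each block, zero across blocks."""
    blocks = [torch.full((b, b), rho) + (1.0 - rho) * torch.eye(b) for b in block_sizes]
    return torch.block_diag(*blocks)

def laplacian_smoothness(z, edges, lam=1e-2):
    """lam * Tr(Z^T L_G Z) for unit edge weights, written as a sum over edges."""
    # Tr(Z^T L_G Z) = sum over edges (i, j) of ||z_i - z_j||^2, where z_i is the
    # latent coordinate assigned to relation/node i (here: column i of z).
    diff = z[:, [i for i, _ in edges]] - z[:, [j for _, j in edges]]
    return lam * (diff ** 2).sum()

# Two semantic blocks (nuclear family vs. in-law) of three relation features each.
sigma_p = block_diag_covariance(block_sizes=[3, 3], rho=0.6)
# sigma_p would replace the identity covariance in the V-SAE's Gaussian prior,
# i.e. the KL term becomes KL(q(z|x) || N(0, sigma_p)).

# A tiny parent-child / sibling edge list over the 6 relation features.
edges = [(0, 1), (1, 2), (3, 4), (4, 5)]
z = torch.randn(8, 6)  # a batch of 8 latent vectors
penalty = laplacian_smoothness(z, edges)  # added to the SAE reconstruction loss
```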
These geometry-aware priors will form the backbone of our Part III experiments, where we deploy V-SAEs and Crosscoders on real LLM activations—enforcing the semantic blocks and hierarchies we have now fully characterized.