by Yuxiao Li, Zachary Baker, Maxim Panteleev, Maxim Finenko
June 2025 | SPAR Spring '25
A post in our series "Feature Geometry & Structured Priors in Sparse Autoencoders"
TL;DR: We explore the intrinsic block-diagonal geometry of LLM feature space--first observed in raw embeddings and family-tree probes--by measuring cosine-similarity heatmaps. These diagnostics set the stage for baking block-structured and graph-Laplacian priors into V-SAEs and Crosscoders in later posts. Assumptions: TBD.
This is the second post of our series on how realistic feature geometry in language model (LM) embeddings can be discovered and then encoded into sparse autoencoder (SAE) priors. Since February, we have combined probabilistic modeling, geometric analysis, and mechanistic interpretability.
Series Table of Contents
Part I: Toy model comparison of isotropic vs global-correlation priors in V-SAE
➡️ Part II (you are here): Block-diagonal structures in toy LM activations
Part III: Crosscoders & Ladder SAEs (multi-layer, multi-resolution coding)
Part IV: Revisiting the Linear Representation Hypothesis (LRH) via geometric probes
Imagine walking into a library where every book—on wolves, quantum mechanics, and Greek myths—is piled into a single cramped shelf, pages interleaved and impossible to separate. This is the chaos of superposition in neural activations: too many concepts squeezed into too few dimensions. Traditional SAEs deal with the problem by demanding perfect orthogonality—like a librarian who insists each title stand ten feet apart—scattering lions and tigers to opposite ends of the building. In this post, we instead first map out the library’s natural wings—block-diagonal clusters of related concepts in LLM activations—so that when we design our SAE priors, we preserve true semantic neighborhoods while still untangling every hidden concept.
Last week we kicked off this series with a dive into Variational SAEs (V-SAEs) and showed how variational priors can be used to sculpt feature disentanglement. On simple synthetic benchmarks, the vanilla SAE struggled to separate correlated latents, whereas a global-correlation prior dramatically purified features. But one question remained:
What kind of prior should we choose if we want SAEs to capture the rich, categorical geometry we actually see in language models?
Answering that requires exploring the true layout of the SAE's latent space and the model's raw embedding space. Only once we know how concepts cluster "in the wild" can we bake those patterns back into our priors--sharpening real semantic blocks and eliminating spurious overlap.
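To make "exploring the layout" concrete, the core diagnostic is simply a pairwise cosine-similarity matrix over concept embeddings, plotted as a heatmap with rows grouped by semantic category. Here is a minimal sketch; the shapes and the random stand-in data are illustrative assumptions, not our actual pipeline:

```python
import numpy as np

def cosine_similarity_heatmap(E: np.ndarray) -> np.ndarray:
    """Pairwise cosine similarities for a matrix of concept embeddings.

    E: (n_concepts, d_model) array, one embedding per row.
    Returns an (n_concepts, n_concepts) matrix; plotted as a heatmap with rows
    ordered by semantic category, block-diagonal structure becomes visible.
    """
    E_normed = E / np.linalg.norm(E, axis=1, keepdims=True)
    return E_normed @ E_normed.T

# Illustrative usage with random data standing in for LM activations:
rng = np.random.default_rng(0)
E = rng.normal(size=(40, 512))
S = cosine_similarity_heatmap(E)
print(S.shape)  # (40, 40)
```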
A growing body of work has begun to chart the geometry of learned representations and to use that map to guide feature learning. Taken together, these works paint a consistent picture: LLM feature spaces under superposition are neither random nor strictly orthogonal--they exhibit structured geometry that we can and should exploit.
Before designing SAE priors, we posit three core hypotheses about how relational features arrange themselves in latent space:
Informal Proposition 1 (Categorical Block-Diagonal Structure). Embeddings of relations from the same semantic category cluster together, yielding a block-diagonal pattern in the pairwise similarity matrix. Off-block correlations (across distinct categories) remain near zero.
Informal Proposition 2 (Hierarchical Sub-Clustering). Within each category's block, finer sub-relations form nested sub-clusters: a two-level hierarchy that can be revealed by removing global distractor dimensions.
Informal Proposition 3 (Co-Linearity of Related Roles). Embeddings of semantically related roles lie nearly on a common line, or within a small shared subspace, so that one embedding can be predicted as a linear combination of another plus a small residual.
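To make these hypotheses testable, here is a rough sketch of the statistics one could compute on a set of labeled relation embeddings. The function names, the mean-direction removal, and the scalar least-squares test for Proposition 3 are our own simplifying assumptions, not the exact code behind our experiments:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

def block_diagonal_score(S, labels):
    """Prop. 1: compare mean within-category vs. cross-category similarity."""
    labels = np.asarray(labels)
    same = labels[:, None] == labels[None, :]
    off_diag = ~np.eye(len(labels), dtype=bool)
    within = S[same & off_diag].mean()
    across = S[~same].mean()
    return within, across   # block structure => within >> across, across ~ 0

def sub_cluster_tree(E_block):
    """Prop. 2: hierarchical clustering inside one category's block, after
    removing the block's mean direction as a global distractor dimension."""
    E_centered = E_block - E_block.mean(axis=0, keepdims=True)
    return linkage(E_centered, method="ward")   # dendrogram over sub-relations

def colinearity_residual(e_a, e_b):
    """Prop. 3 (simplest case): relative residual after predicting one role's
    embedding as the best scalar multiple of a related role's embedding."""
    alpha = float(e_a @ e_b) / float(e_b @ e_b)
    residual = e_a - alpha * e_b
    return np.linalg.norm(residual) / np.linalg.norm(e_a)
```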
Finally, these observed patterns suggest a guiding principle for our next step:
Informal Proposition 4 (Geometry-Aware Priors Refine Features). If we can measure the latent geometry (blocks, hierarchies, co-linearity), then we can engineer V-SAE priors--via block-structured covariance, tree-based Laplacians, or hyperbolic priors--to align the learned dictionary with that geometry, avoiding artificial orthogonality and preserving true semantic continuity.
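For concreteness, here is one way a block-structured prior could enter a V-SAE objective: a zero-mean Gaussian prior whose covariance has correlation rho inside each semantic block and zero correlation across blocks, with the encoder's diagonal Gaussian posterior penalized by a KL term against it. This is a hedged sketch under our own naming conventions (`block_covariance`, `kl_to_block_prior`), not the implementation we will ship in Part III:

```python
import torch

def block_covariance(block_sizes, rho=0.5):
    """Prior covariance with correlation `rho` inside each semantic block,
    zero correlation across blocks, and unit variances on the diagonal."""
    d = sum(block_sizes)
    C = torch.zeros(d, d)
    start = 0
    for b in block_sizes:
        C[start:start + b, start:start + b] = rho
        start += b
    C.fill_diagonal_(1.0)
    return C

def kl_to_block_prior(mu, logvar, prior_cov):
    """KL( N(mu, diag(exp(logvar))) || N(0, prior_cov) ), averaged over a batch.

    mu, logvar: (batch, d) parameters of the V-SAE's diagonal Gaussian posterior.
    """
    d = mu.shape[1]
    prior_prec = torch.linalg.inv(prior_cov)              # Sigma_p^{-1}
    var = logvar.exp()
    trace_term = (var * torch.diag(prior_prec)).sum(-1)   # tr(Sigma_p^{-1} Sigma_q)
    quad_term = ((mu @ prior_prec) * mu).sum(-1)          # mu^T Sigma_p^{-1} mu
    kl = 0.5 * (trace_term + quad_term - d
                + torch.logdet(prior_cov) - logvar.sum(-1))
    return kl.mean()

# Illustrative usage: three semantic blocks of 8 latents each.
prior_cov = block_covariance([8, 8, 8], rho=0.5)
mu, logvar = torch.randn(32, 24), torch.zeros(32, 24)
loss_kl = kl_to_block_prior(mu, logvar, prior_cov)
```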
We tested Propositions 1-3 on synthetic family-tree embeddings in toy experiments.
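For reference, a toy construction of family-tree-style relation embeddings with planted categorical blocks might look like the following; the category names, dimensionality, and noise scales are illustrative placeholders rather than the exact setup used in our experiments:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 256
categories = {"parent_of": 6, "sibling_of": 6, "married_to": 6}  # relations per category

# Each relation embedding = shared category direction + relation-specific offset
# + small isotropic noise, which plants the block structure of Proposition 1.
embeddings, labels = [], []
for cat, n_relations in categories.items():
    cat_dir = rng.normal(size=d)
    for _ in range(n_relations):
        embeddings.append(cat_dir + 0.3 * rng.normal(size=d) + 0.05 * rng.normal(size=d))
        labels.append(cat)
E = np.stack(embeddings)

# E and labels can then be fed into the diagnostics sketched above, e.g.
# S = cosine_similarity_heatmap(E); block_diagonal_score(S, labels)
```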
Our key insight:
"If we can measure structure in latent space (blocks, hierarchies, co-linearity), we can engineer priors to match that structure--using KL-divergence or graph regularizers to guide SAE learning accordingly."
Concrete next-step priors include block-structured covariances over the latent codes, tree-based graph Laplacians, and hyperbolic priors for hierarchies; a sketch of the graph-Laplacian option follows below.
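As an illustration of the graph-Laplacian option, the regularizer below penalizes the squared distance between dictionary directions of features joined by an edge in a concept graph. The edge list and the unweighted Laplacian are placeholder assumptions for the sketch:

```python
import torch

def graph_laplacian(n_features, edges):
    """Unnormalized Laplacian L = D - A for an undirected feature graph whose
    edges join features that should stay geometrically close (e.g. parent and
    child relations in a concept tree)."""
    A = torch.zeros(n_features, n_features)
    for i, j in edges:
        A[i, j] = A[j, i] = 1.0
    return torch.diag(A.sum(dim=1)) - A

def laplacian_penalty(decoder_weights, L):
    """tr(W^T L W) = sum over edges of ||w_i - w_j||^2, where w_i is the decoder
    (dictionary) direction of feature i; added to the SAE loss with a coefficient."""
    W = decoder_weights          # (n_features, d_model), one dictionary atom per row
    return torch.trace(W.T @ L @ W)
```

Because tr(W^T L W) decomposes into a sum of squared distances over graph edges, related features are pulled toward nearby dictionary directions without forcing unrelated features to be orthogonal.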
These geometry-aware priors will form the backbone of our Part III experiments, where we deploy V-SAEs and Crosscoders on real LLM activations, enforcing the semantic blocks and hierarchies we have begun to characterize here.