Recent experiments have provided evidence that "bad" behaviors in language models cluster together[1]. Surprising at first, this makes sense on reflection: these behaviors likely cluster because that is simply the most efficient way to compress information during training. But intervening on model behavior remains challenging: we don't know which directions in latent space correspond to undesirable traits, and we don't know how entangled they might be with benign concepts.
This post introduces an experimental approach to address this challenge: Sparse Concept Anchoring. Rather than attempting to discover and modify latent directions after training (as in mechanistic interpretability), we propose imposing useful structure on latent spaces during training, without over-constraining representations. We hypothesize that this would yield more interpretable representations and more controllable behavior without broad capability degradation.
This post was planned and edited with help from Claude 3.7 Sonnet. The underlying research was conducted with coding assistance from Gemini 2.5 Pro, GPT-4o, and GPT-4.1. Edited for clarity in May 2026.
Entanglement and the difficulty of finding latent embeddings
In purely task-based training regimes (e.g. token prediction), we have little control over how models organize knowledge internally. The optimization process discovers representations that efficiently compress the training data, but these representations are not constrained to be interpretable or modifiable. Further, it has been shown that representations are harder to change later in training, so suboptimal features that a model learns early (e.g. due to the particular mix of data it has seen) may persist[2].
Three problems stand out:
Detection difficulty: It's hard to tell when a model has learned harmful capabilities because we don't know where to look, and searching takes considerable effort.
Fragmentation: Two concepts that are nearly synonymous may end up with very different embeddings.
Entanglement: Several distinct concepts may have been learned with representations that are similar.
That is, the model may have learned suboptimal representations that make interpretability and control very difficult. Features that you find may only tell part of the story, allowing the model to route knowledge and behavior around your interventions. Alternatively, attempts to remove or intervene on harmful capabilities may impact performance on benign tasks.
Hypothesis: anchored concepts as attractors
We hypothesize that it's possible to impose structure on latent embeddings during training through targeted regularization techniques. Specifically, we propose that:
If we identify key concepts in the training data, and apply regularization to position only those concepts in latent space from the start, then they will act as attractors around which related knowledge naturally organizes.
This would create a more structured and interpretable latent space where similar concepts cluster together in predictable ways.
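As a minimal sketch of what such a regularizer could look like, the snippet below pins the embeddings of a few anchor concepts to fixed target vectors while the task loss shapes everything else. The function names, the mean-squared-error penalty, and the `model.task_loss` / `model.embeddings` interface are illustrative assumptions, not a finalized design.

```python
import torch
import torch.nn.functional as F

def anchoring_loss(embeddings, anchor_ids, anchor_targets, weight=1.0):
    """Penalize anchor-concept embeddings for drifting from fixed targets.

    embeddings:     (vocab_size, d) learned embedding matrix
    anchor_ids:     (k,) indices of the concepts chosen as anchors
    anchor_targets: (k, d) positions chosen before training begins
    """
    anchored = embeddings[anchor_ids]                      # (k, d)
    return weight * F.mse_loss(anchored, anchor_targets)

def training_step(model, batch, anchor_ids, anchor_targets, reg_weight=0.1):
    # Hypothetical interface: the task loss (e.g. next-token prediction)
    # shapes all representations, while the regularizer only constrains
    # the small set of anchored concepts.
    task_loss = model.task_loss(batch)
    reg = anchoring_loss(model.embeddings.weight, anchor_ids,
                         anchor_targets, weight=reg_weight)
    return task_loss + reg
```

Because only a sparse set of concepts is pinned, the rest of the latent space remains free to organize itself around those anchors.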
Potential advantages:
The relevant directions would be known even before training, so you don't need to search for them.
Directions of interest should act as attractors for similar concepts, reducing the chance that unrelated (benign) concepts become entangled with them.
This could enable more surgical interventions, where specific capabilities can be modified or removed without affecting others.
Color as a test domain
To explore this hypothesis, we need a domain with clear ground truth, natural hierarchical structure, and tractable visualization. Color provides a good testbed:
Clear concept boundaries: Primary colors have distinct identities
Natural hierarchical relationships: In additive (RGB) mixing, secondary colors like yellow depend on primaries (red and green), and tertiary colors like orange depend on a primary and a secondary (red and yellow)
Intuitive visualization: We can directly visualize the latent space, with color representing the actual data
If we can successfully impose structure on color embeddings, it would provide proof-of-concept for applying similar techniques to more complex domains.
Visualization from our initial experiments showing the evolution of color embedding in latent space across training phases. Left: Initial embedding of primary colors establishes the basic structure in the plane of the first two dimensions. Middle: Expansion to include all hues creates a complete color wheel. Right: Introduction of variations in brightness and saturation forms a four-dimensional structure while preserving the circular arrangement of hues.
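A minimal sketch of how the anchor targets in that figure could be laid out, assuming (as an illustrative choice, not a fixed design) that hues are spaced evenly on the unit circle in the first two latent dimensions, leaving the remaining dimensions free for brightness and saturation:

```python
import math
import torch

def hue_anchor_targets(hue_names, latent_dim=4):
    """Place hue anchors evenly around the unit circle in latent dims 0-1.

    The remaining dimensions are left at zero so that brightness and
    saturation variation can occupy them during training.
    """
    targets = torch.zeros(len(hue_names), latent_dim)
    for i in range(len(hue_names)):
        angle = 2 * math.pi * i / len(hue_names)
        targets[i, 0] = math.cos(angle)
        targets[i, 1] = math.sin(angle)
    return targets

# Primary and secondary hues of the additive color wheel, in wheel order.
hues = ["red", "yellow", "green", "cyan", "blue", "magenta"]
anchors = hue_anchor_targets(hues)  # shape (6, 4)
```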
Research plan
This post is the first in a planned series exploring structured latent spaces for alignment. The full sequence will span these milestones:
Practical control and intervention: Using the structure to suppress or forget knowledge. For example, suppress or delete red and orange, but not yellow (a minimal sketch of this kind of intervention appears at the end of this section).
Transfer to transformers: Training a tiny transformer to perform color-mixing operations while imposing our structured latent spaces.
Language model application: Applying what we have learned to impose structure on the latent representations of a transformer language model.
Each milestone will build on the previous findings, gradually moving from proof-of-concept in simple domains toward more complex architectures relevant to language model alignment.
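As a concrete (and hedged) illustration of the first milestone: if the "red" direction is known in advance because it was anchored, suppression could be as simple as projecting activations off that direction at inference time. The projection-based edit below is one candidate intervention we plan to test, not a result we have, and the function name is our own.

```python
import torch

def suppress_direction(activations, direction, strength=1.0):
    """Remove the component of each activation along a known anchored direction.

    activations: (batch, d) latent activations
    direction:   (d,) vector for the concept to suppress (e.g. "red")
    strength:    1.0 removes the component entirely; smaller values attenuate it
    """
    direction = direction / direction.norm()
    component = activations @ direction                   # (batch,) projections
    return activations - strength * component.unsqueeze(-1) * direction
```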
[1] In Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs, Betley et al. (2025) showed that models finetuned solely to write insecure code subsequently exhibited a range of misaligned behaviors in unrelated contexts, from expressing anti-human views to giving malicious advice.
[2] In Critical Learning Periods in Deep Networks, Achille et al. (2019) find that DNNs lose plasticity over the course of training.