Recent experiments have provided evidence that "bad" behaviors in language models cluster together[1]. Surprising at first, this makes sense on reflection: these behaviors likely cluster because that is simply the most efficient way to compress information during training. But intervening on model behavior remains challenging: we don't know which directions in latent space correspond to undesirable traits, and we don't know how entangled they might be with benign concepts.
This post introduces an experimental approach to address this challenge: Sparse Concept Anchoring. Rather than attempting to discover and modify latent directions after training (as in mechanistic interpretability), we propose imposing useful structure on latent spaces during training, without over-constraining representations. We hypothesize that this would yield more interpretable representations and more controllable behavior without broad capability degradation.
This post was planned and edited with help from Claude 3.7 Sonnet. The underlying research was conducted with coding assistance from Gemini 2.5 Pro, GPT-4o, and GPT-4.1. Edited for clarity in May 2026.
Entanglement and the difficulty of finding latent embeddings
In purely task-based training regimes (e.g. token prediction), we have little control over how models organize knowledge internally. The optimization process discovers representations that efficiently compress the training data, but these representations are not constrained to be interpretable or modifiable. Further, it has been shown that representations are harder to change later in training, so suboptimal features that a model learns early (e.g. due to the particular mix of data it has seen) may persist[2].
Three problems stand out:
Detection difficulty: It's hard to tell when a model has learned harmful capabilities because we don't know where to look, and searching takes considerable effort.
Fragmentation: Two concepts that are nearly synonymous may end up with very different embeddings.
Entanglement: Several distinct concepts may have been learned with representations that are similar.
That is, the model may have learned suboptimal representations that make interpretability and control very difficult. Features that you find may only tell part of the story, allowing the model to route knowledge and behavior around your interventions. Alternatively, attempts to remove or intervene on harmful capabilities may impact performance on benign tasks.
Hypothesis: anchored concepts as attractors
We hypothesize that it's possible to impose structure on latent embeddings during training through targeted regularization techniques. Specifically, we propose that:
If we identify key concepts in the training data, and apply regularization to position only those concepts in latent space from the start, then they will act as attractors around which related knowledge naturally organizes.
This would create a more structured and interpretable latent space where similar concepts cluster together in predictable ways.
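As a minimal sketch of what such a regularizer could look like, the snippet below pins the embeddings of a few anchor concepts to fixed target vectors while the task loss shapes everything else. The function names, the mean-squared-error penalty, and the `model.task_loss` / `model.embeddings` interface are illustrative assumptions, not a finalized design.

```python
import torch
import torch.nn.functional as F

def anchoring_loss(embeddings, anchor_ids, anchor_targets, weight=1.0):
    """Penalize anchor-concept embeddings for drifting from fixed targets.

    embeddings:     (vocab_size, d) learned embedding matrix
    anchor_ids:     (k,) indices of the concepts chosen as anchors
    anchor_targets: (k, d) positions chosen before training begins
    """
    anchored = embeddings[anchor_ids]                      # (k, d)
    return weight * F.mse_loss(anchored, anchor_targets)

def training_step(model, batch, anchor_ids, anchor_targets, reg_weight=0.1):
    # Hypothetical interface: the task loss (e.g. next-token prediction)
    # shapes all representations, while the regularizer only constrains
    # the small set of anchored concepts.
    task_loss = model.task_loss(batch)
    reg = anchoring_loss(model.embeddings.weight, anchor_ids,
                         anchor_targets, weight=reg_weight)
    return task_loss + reg
```

Because only a sparse set of concepts is pinned, the rest of the latent space remains free to organize itself around those anchors.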
Potential advantages:
The relevant directions would be known even before training, so you don't need to search for them.
Directions of interest should act as attractors for similar concepts, reducing the chance that unrelated (benign) concepts become entangled with them.
This could enable more surgical interventions, where specific capabilities can be modified or removed without affecting others.
Color as a test domain
To explore this hypothesis, we need a domain with clear ground truth, natural hierarchical structure, and tractable visualization. Color provides a good testbed:
Clear concept boundaries: Primary colors have distinct identities
Natural hierarchical relationships: In additive (RGB) mixing, secondary colors like yellow depend on primaries (red and green), and tertiary colors like orange depend on a primary and a secondary (red and yellow)
Intuitive visualization: We can directly visualize the latent space, with color representing the actual data
If we can successfully impose structure on color embeddings, it would provide proof-of-concept for applying similar techniques to more complex domains.
Visualization from our initial experiments showing the evolution of color embedding in latent space across training phases. Left: Initial embedding of primary colors establishes the basic structure in the plane of the first two dimensions. Middle: Expansion to include all hues creates a complete color wheel. Right: Introduction of variations in brightness and saturation forms a four-dimensional structure while preserving the circular arrangement of hues.
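A minimal sketch of how the anchor targets in that figure could be laid out, assuming (as an illustrative choice, not a fixed design) that hues are spaced evenly on the unit circle in the first two latent dimensions, leaving the remaining dimensions free for brightness and saturation:

```python
import math
import torch

def hue_anchor_targets(hue_names, latent_dim=4):
    """Place hue anchors evenly around the unit circle in latent dims 0-1.

    The remaining dimensions are left at zero so that brightness and
    saturation variation can occupy them during training.
    """
    targets = torch.zeros(len(hue_names), latent_dim)
    for i in range(len(hue_names)):
        angle = 2 * math.pi * i / len(hue_names)
        targets[i, 0] = math.cos(angle)
        targets[i, 1] = math.sin(angle)
    return targets

# Primary and secondary hues of the additive color wheel, in wheel order.
hues = ["red", "yellow", "green", "cyan", "blue", "magenta"]
anchors = hue_anchor_targets(hues)  # shape (6, 4)
```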
Research plan
This post is the first in a planned series exploring structured latent spaces for alignment. The full sequence will span these milestones:
Practical control and intervention: Using the structure to suppress or forget knowledge. For example, suppress or delete red and orange, but not yellow (a minimal sketch of this kind of intervention appears at the end of this section).
Transfer to transformers: Training a tiny transformer to perform color-mixing operations while imposing our structured latent spaces.
Language model application: Applying what we have learned to impose structure on the latent representations of a transformer language model.
Each milestone will build on the previous findings, gradually moving from proof-of-concept in simple domains toward more complex architectures relevant to language model alignment.
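As a concrete (and hedged) illustration of the first milestone: if the "red" direction is known in advance because it was anchored, suppression could be as simple as projecting activations off that direction at inference time. The projection-based edit below is one candidate intervention we plan to test, not a result we have, and the function name is our own.

```python
import torch

def suppress_direction(activations, direction, strength=1.0):
    """Remove the component of each activation along a known anchored direction.

    activations: (batch, d) latent activations
    direction:   (d,) vector for the concept to suppress (e.g. "red")
    strength:    1.0 removes the component entirely; smaller values attenuate it
    """
    direction = direction / direction.norm()
    component = activations @ direction                   # (batch,) projections
    return activations - strength * component.unsqueeze(-1) * direction
```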
[1] In Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs, Betley et al. (2025) showed that models finetuned solely to write insecure code subsequently exhibited a range of misaligned behaviors in unrelated contexts, from expressing anti-human views to giving malicious advice.
[2] In Critical Learning Periods in Deep Networks, Achille et al. (2019) find that DNNs lose plasticity over the course of training.