Recent experiments have provided evidence that "bad" behaviors in language models cluster together[1]. Though surprising at first, this makes intuitive sense: clustering these behaviors is likely the most efficient way to compress the training data. Intervening on model behavior nonetheless remains a tremendous challenge, partly because we don't know which directions in latent space correspond to undesirable traits, nor how entangled those directions are with benign concepts.
This post introduces an experimental approach to address this challenge: concept-anchored representation engineering. Rather than attempting to discover and modify latent directions after training (as in mechanistic interpretability), we explore whether it's possible to impose useful structure on latent spaces during training, creating a more interpretable representation from the start.
In this post, "we" refers to "me and Claude 3.7 Sonnet," who helped draft this content. The underlying research was conducted with coding assistance from Gemini 2.5 Pro, GPT-4o, and GPT-4.1.
In traditional LLM training regimes, we have little control over how models organize knowledge internally. The optimization process discovers representations that efficiently compress the training data, but these representations are not constrained to be interpretable or modifiable.
This creates problems for alignment:
Epistemic status: 80%, 70% confidence, respectively.
We hypothesize that it's possible to impose structure on latent embeddings during training through targeted regularization techniques. Specifically, we propose that[2]:
If we identify key concepts in the training data, and apply regularization to position these concepts in latent space from the start, then these concepts will act as attractors around which related knowledge naturally organizes.
This would create a more structured and interpretable latent space where similar concepts cluster together in predictable ways.
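As a minimal sketch of what such a regularizer could look like, consider an auxiliary loss that pulls the latent of each concept-labeled sample toward a fixed, pre-chosen anchor coordinate, added on top of the ordinary task loss. All names below (`anchor_loss`, `concept_ids`, `lambda_anchor`) and the model interface are hypothetical illustrations, not our actual implementation:

```python
import torch
import torch.nn.functional as F


def anchor_loss(latents: torch.Tensor,
                concept_ids: torch.Tensor,
                anchors: torch.Tensor) -> torch.Tensor:
    """Pull each labeled sample's latent toward its concept's fixed anchor.

    latents:     (batch, d) encoder outputs
    concept_ids: (batch,) index of the anchored concept for each sample,
                 or -1 for samples without a concept label
    anchors:     (num_concepts, d) fixed anchor coordinates, chosen up front
    """
    mask = concept_ids >= 0
    if not mask.any():
        return latents.new_zeros(())
    targets = anchors[concept_ids[mask]]  # (n_labeled, d)
    return F.mse_loss(latents[mask], targets)


def training_step(model, batch, anchors, lambda_anchor=0.1):
    # Hypothetical model interface: returns (latents, task_loss) for a batch.
    latents, task_loss = model(batch["inputs"], batch["targets"])
    # Total objective: the usual task loss plus the anchoring regularizer.
    # lambda_anchor trades off raw task performance against latent structure.
    return task_loss + lambda_anchor * anchor_loss(
        latents, batch["concept_ids"], anchors)
```

In this framing the anchors play the role of the attractors in the hypothesis above: the task gradients organize related knowledge around them, while the regularizer keeps the anchored concepts themselves in place.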
Potential advantages:
Epistemic status: 80%, 50%, 30% confidence, respectively.
To explore this hypothesis, we need a domain with clear ground truth, natural hierarchical structure, and tractable visualization. Color provides an ideal testbed:
If we can successfully impose structure on color embeddings, it could provide proof-of-concept for applying similar techniques to more complex domains like language.
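To make the testbed concrete, one could pin a few basic colors at their normalized RGB coordinates and let every other color organize around them. The anchor set below is purely illustrative, not the configuration used in our experiments:

```python
import torch

# Illustrative anchor set: basic colors pinned at their normalized RGB
# coordinates in a 3-D latent space. Because RGB supplies ground truth,
# it's easy to check whether unanchored colors land where the geometry
# says they should, e.g. "pink" should fall roughly on the segment
# between the red and white anchors.
COLOR_ANCHORS = {
    "red":   (1.0, 0.0, 0.0),
    "green": (0.0, 1.0, 0.0),
    "blue":  (0.0, 0.0, 1.0),
    "black": (0.0, 0.0, 0.0),
    "white": (1.0, 1.0, 1.0),
}

anchor_names = list(COLOR_ANCHORS)                                # concept id -> name
anchors = torch.tensor([COLOR_ANCHORS[n] for n in anchor_names])  # (5, 3)
```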
This post is the first in a planned series exploring structured latent spaces for alignment. The full sequence will span these milestones:
Each milestone will build on the previous findings, gradually moving from proof-of-concept in simple domains toward more complex architectures relevant to language model alignment.
A key example comes from Betley et al. (2025), who showed that models finetuned solely to write insecure code subsequently exhibited a range of misaligned behaviors in unrelated contexts, from expressing anti-human views to giving malicious advice (arXiv:2502.17424).
Our initial approach has focused heavily on curriculum learning, introducing concepts sequentially to establish structure before adding complexity. As we've gathered more experimental evidence, we now suspect that full-dataset training with per-sample regularization weights may be more effective.
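To illustrate the distinction: a curriculum gates which concepts the model sees at each stage, whereas per-sample weights keep the full dataset in play and instead scale how strongly each sample is pulled toward its anchor. The sketch below is a hypothetical variant of the `anchor_loss` shown earlier; the weighting schedule itself is left open.

```python
import torch


def weighted_anchor_loss(latents: torch.Tensor,
                         concept_ids: torch.Tensor,
                         anchors: torch.Tensor,
                         sample_weights: torch.Tensor) -> torch.Tensor:
    """Per-sample weighted variant of the anchor regularizer.

    Every sample stays in the dataset for the whole run; sample_weights
    (batch,) controls how strongly each one is pulled toward its anchor,
    with 0.0 disabling the pull entirely. A curriculum then becomes one
    particular weighting schedule rather than a separate training phase.
    """
    mask = concept_ids >= 0
    if not mask.any():
        return latents.new_zeros(())
    targets = anchors[concept_ids[mask]]
    per_sample = ((latents[mask] - targets) ** 2).mean(dim=-1)  # (n_labeled,)
    return (sample_weights[mask] * per_sample).mean()
```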