Recent experiments have provided evidence that "bad" behaviors in language models cluster together[1]. While surprising at first, this makes intuitive sense: clustering these behaviors is likely simply the most efficient way to compress information during training. However, intervening on model behavior remains a tremendous challenge, partly because we don't know which directions in latent space correspond to undesirable traits, and partly because we don't know how entangled they might be with benign concepts.

This post introduces an experimental approach to address this challenge: concept-anchored representation engineering. Rather than attempting to discover and modify latent directions after training (as in mechanistic interpretability), we explore whether it's possible to impose useful structure on latent spaces during training, creating a more interpretable representation from the start.

In this post, "we" refers to "me and Claude 3.7 Sonnet," who helped draft this content. The underlying research was conducted with coding assistance from Gemini 2.5 Pro, GPT-4o, and GPT-4.1.

The problem: latent space entanglement

In traditional LLM training regimes, we have little control over how models organize knowledge internally. The optimization process discovers representations that efficiently compress the training data, but these representations are not constrained to be interpretable or modifiable.

This creates problems for alignment:

  1. Detection difficulty: It's hard to tell when a model has learned harmful capabilities because we don't know where to look, and searching takes considerable effort.
  2. Entanglement: Attempts to remove or intervene on harmful capabilities harm performance on benign tasks because the underlying representations are entangled.

Epistemic status: 80%, 70% confidence, respectively.

Hypothesis: anchored concepts as attractors

We hypothesize that it's possible to impose structure on latent embeddings during training through targeted regularization techniques. Specifically, we propose that[2]:

If we identify key concepts in the training data, and apply regularization to position these concepts in latent space from the start, then these concepts will act as attractors around which related knowledge naturally organizes.

This would create a more structured and interpretable latent space where similar concepts cluster together in predictable ways.
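
To make the proposal concrete, here is a minimal sketch of the kind of anchor regularization we have in mind, written in PyTorch. The function name, the per-sample concept indexing, and the squared-distance penalty are illustrative assumptions rather than the exact loss used in our experiments; the point is only that embeddings of anchored concepts are pulled toward coordinates fixed before training.

    import torch

    def anchor_regularization(latents, concept_ids, anchors, weight=1.0):
        """Pull embeddings of anchored concepts toward fixed coordinates.

        latents:     (batch, d) encoder outputs
        concept_ids: (batch,) index of the anchored concept for each sample,
                     or -1 for samples that carry no anchor
        anchors:     (num_concepts, d) latent coordinates chosen before training
        """
        mask = concept_ids >= 0
        if not mask.any():
            return latents.new_zeros(())
        targets = anchors[concept_ids[mask]]
        return weight * ((latents[mask] - targets) ** 2).mean()

    # During training this term is simply added to the task loss, e.g.
    # loss = reconstruction_loss + anchor_regularization(z, ids, anchors)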

Potential advantages:

  1. Known directions: The relevant directions would be known even before training, so you don't need to search for them.
  2. Reduced entanglement: Directions of interest should act as attractors for similar concepts, reducing the chance that unrelated (benign) concepts become entangled with them.
  3. Precise control: This could enable more surgical interventions, where specific capabilities can be modified or removed without affecting others.

Epistemic status: 80%, 50%, 30% confidence, respectively.

Color as a test domain

To explore this hypothesis, we need a domain with clear ground truth, natural hierarchical structure, and tractable visualization. Color provides an ideal testbed:

  • Clear concept boundaries: Primary colors have distinct identities
  • Natural hierarchical relationships: Secondary colors (like yellow) depend on primaries (red and green), and tertiary colors (like orange) depend on primaries and secondaries (red and yellow)
  • Intuitive visualization: We can directly visualize the latent space and compare it to color wheels

If we can successfully impose structure on color embeddings, it could provide proof-of-concept for applying similar techniques to more complex domains like language.

Image description: A three-panel visualization showing the evolution of color embeddings in latent space, displayed against a black background. The left panel shows six colored dots (red, yellow, green, cyan, blue, and magenta) arranged in a circle, representing primary and secondary colors. The middle panel shows a densely populated circular arrangement of colored dots forming a complete color wheel with a full spectrum of hues. The right panel displays an expanded three-dimensional structure where colors maintain their circular arrangement by hue, but also vary by brightness and saturation, creating a dome-like structure with brighter colors toward the edges and darker shades toward the center.
Visualization from our initial experiments showing the evolution of color embedding in latent space across training phases. Left: Initial embedding of primary colors establishes the basic structure in the plane of the first two dimensions. Middle: Expansion to include all hues creates a complete color wheel. Right: Introduction of variations in brightness and saturation forms a four-dimensional structure while preserving the circular arrangement of hues.
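
For concreteness, the anchor layout for the first phase (the left panel above) could look something like the sketch below: the six primary and secondary colors placed at equal angles on a unit circle in the first two latent dimensions. The specific coordinates, ordering, and latent dimensionality are illustrative assumptions, not the exact values from our experiments.

    import math
    import torch

    # Six primary and secondary colors, ordered as on an RGB color wheel.
    COLOR_ORDER = ["red", "yellow", "green", "cyan", "blue", "magenta"]

    def color_anchors(latent_dim=4):
        """Place each anchored color on a unit circle in the first two
        latent dimensions; the remaining dimensions start at zero."""
        anchors = torch.zeros(len(COLOR_ORDER), latent_dim)
        for i in range(len(COLOR_ORDER)):
            angle = 2 * math.pi * i / len(COLOR_ORDER)
            anchors[i, 0] = math.cos(angle)
            anchors[i, 1] = math.sin(angle)
        return anchors  # shape (6, latent_dim)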

Research plan

This post is the first in a planned series exploring structured latent spaces for alignment. The full sequence will cover:

  1. Conceptual foundation: Motivation and hypothesis (this post).
  2. Discovering structure through constraint: Experiments with bottleneck autoencoders and regularization of specific colors.
  3. Training dynamics: Comparing smooth vs. stepped hyperparameter transitions, and dataset curricula.
  4. Intervening on latent space: Demonstrating selective concept suppression during inference using pre-determined embeddings. We hope to show that we can suppress specific concepts and their near neighbors without suppressing other concepts, even when they are in superposition; for example, suppressing red and orange but not yellow (see the sketch after this list).
  5. Ablating specific knowledge: Removing capabilities entirely, i.e. deleting the relevant weights rather than only intervening on activations.
  6. Transfer to transformers: Applying these techniques to a simple transformer architecture.
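
As a preview of item 4, here is one way an inference-time intervention could look once anchor positions are known in advance. The radius-based soft mask, parameter names, and suppression scheme are assumptions chosen for illustration; the point is that no search for the concept direction is needed, because the anchor was fixed before training.

    import torch

    def suppress_near_anchor(z, anchor, radius=0.5, strength=1.0):
        """Attenuate latent vectors lying near a known concept anchor.

        Samples within `radius` of the anchor are scaled toward the origin
        (fully so at the anchor itself), while everything outside the radius
        passes through unchanged. With anchors covering red and orange but
        not yellow, this is the kind of selective suppression item 4 describes.
        """
        dist = torch.norm(z - anchor, dim=-1, keepdim=True)
        mask = torch.clamp(1.0 - dist / radius, min=0.0)  # 1 at anchor, 0 beyond radius
        return z * (1.0 - strength * mask)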

Each post will build on the previous findings, gradually moving from proof-of-concept in simple domains toward more complex architectures relevant to language model alignment.


  1. ^

    A key example comes from Betley et al. (2025), who showed that models finetuned solely to write insecure code subsequently exhibited a range of misaligned behaviors in unrelated contexts, from expressing anti-human views to giving malicious advice. arXiv:2502.17424v1

  2. ^

    Our initial approach has focused heavily on curriculum learning, introducing concepts sequentially to establish structure before adding complexity. As we've gathered more experimental evidence, we now suspect that full-dataset training with per-sample regularization weights may be more effective.
