Suppose you use mech interp to discover features in a model relating to deception. They're even causal: if you amplify them, the model displays more deceptive behavior; if you suppress them, it displays less. But how do you know you've found all such features? If you suppress them entirely or ablate the associated weights, have you removed the model's ability to be deceptive?
Earlier posts in this sequence presented proofs of concept for a new alignment technique that we call Sparse Concept Anchoring (SCA). This post summarizes our recent ICLR paper[1], in which we refined the technique and demonstrated practical interventions on toy models. You don't need to have read the earlier posts to understand this one. Written with help from Claude Opus 4.7.
Our work is motivated by two observations.
The first is that understanding activations in an already-trained model is hard, because the training process shapes representation space to suit the task rather than to aid interpretation. Networks lose plasticity early in training[2], so the structure you are left analyzing may not even be optimal for the task, let alone for interpretation: concept directions may be entangled (inseparable from other concepts) or fragmented (multiple context-dependent representations of the same thing).
The second is that only a small fraction of the concepts and behaviors in a large model are directly safety-relevant: sparse autoencoders trained on Claude 3 Sonnet recovered millions of features, of which only a handful relate to safety.
These observations suggest a different approach. Rather than searching for safety-relevant concepts after training, we could reserve a place for them in advance and encourage the model to put them there, while leaving the rest of the representation space alone. Sparse Concept Anchoring does this with a combination of selective regularizers and rare labeled training samples.
How it works
The training objective (loss) in SCA has three terms.
The first term is the main objective of the model; in our experiments it was the reconstruction loss of an autoencoder. The other two are inductive biases that act on the latent activations.
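Schematically, the combined objective looks like the following (the symbol names here are ours, not the paper's notation):

```latex
\mathcal{L} \;=\; \mathcal{L}_{\text{task}}
\;+\; \lambda_{\text{struct}}\,\mathcal{L}_{\text{structural}}
\;+\; \lambda_{\text{concept}}\,\mathcal{L}_{\text{concept}}
```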
Structural constraints apply to every sample and give the latent space a workable shape. We normalize activations onto the unit hypersphere — this makes cosine similarity a meaningful measure of relatedness and is the approach taken by nGPT[3] — and we add a separation term that discourages pairs of activations from clustering together. Neither term targets any particular concept; they are applied indiscriminately, and together they create a space in which concepts can be reliably placed and cleanly removed.
Structural constraints. Left: Normalize places activations on the hypersphere. Right: Separate keeps them from clustering.
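As a rough sketch of what these structural terms could look like in PyTorch (the paper's exact separation loss may differ; `z` is a batch of latent activations):

```python
import torch
import torch.nn.functional as F

def normalize_to_hypersphere(z: torch.Tensor) -> torch.Tensor:
    # Project each activation onto the unit hypersphere, so cosine similarity
    # between samples is just a dot product.
    return F.normalize(z, dim=-1)

def separation_loss(z_unit: torch.Tensor) -> torch.Tensor:
    # Discourage pairs of normalized activations from clustering: penalize
    # high pairwise cosine similarity within the batch.
    sims = z_unit @ z_unit.T                                  # (B, B) cosine similarities
    off_diag = sims - torch.eye(len(z_unit), device=z_unit.device)
    return off_diag.clamp(min=0).pow(2).mean()                # one of many reasonable choices
```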
Concept regularizers position the concepts we care about, and they apply only to rare labeled samples. Anchor attracts labeled examples of a concept toward a fixed direction, e.g. red toward the first latent axis. A subspace regularizer does the same for cyclic and multidimensional concepts: it attracts labeled examples toward a chosen set of axes without prescribing where they sit within those axes. Everything that is not labeled is free to self-organize.
Attractive regularizers. Anchor aligns a simple linear concept to a direction; subspace collects a multidimensional or cyclic concept onto a set of axes.
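A minimal sketch of the attractive terms, again with assumed details (here `labels` is a boolean mask marking the rare labeled samples, and `axes` holds the chosen axis directions as rows):

```python
import torch
import torch.nn.functional as F

def anchor_loss(z_unit: torch.Tensor, labels: torch.Tensor, direction: torch.Tensor) -> torch.Tensor:
    # Attract labeled examples of a concept toward a fixed unit direction
    # (e.g. red toward the first latent axis). Unlabeled samples contribute nothing.
    if not labels.any():
        return z_unit.new_zeros(())
    d = F.normalize(direction, dim=-1)
    cos = z_unit[labels] @ d                       # cosine similarity to the anchor direction
    return (1.0 - cos).mean()

def subspace_loss(z_unit: torch.Tensor, labels: torch.Tensor, axes: torch.Tensor) -> torch.Tensor:
    # Attract labeled examples toward the span of a chosen set of axes without
    # prescribing where they sit inside it: penalize out-of-subspace energy.
    if not labels.any():
        return z_unit.new_zeros(())
    Q, _ = torch.linalg.qr(axes.T)                 # orthonormal basis of the subspace, shape (dim, k)
    in_plane = (z_unit[labels] @ Q).pow(2).sum(-1) # squared norm of the in-subspace projection
    return (1.0 - in_plane).mean()                 # zero when a unit activation lies fully in the subspace
```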
Repulsive variants, anti-anchor and anti-subspace, push samples away instead. These are used to prepare subspaces for unlearning, which we will return to below. Rather than using labels, repulsive term weights follow a schedule: strong early in training to clear regions while the space is still malleable, and weak later, allowing the attractive terms to dominate.
Repulsive regularizers. Anti-anchor and anti-subspace push all samples away from a reserved direction or subspace.
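The repulsive counterparts are essentially the same computations with the objective flipped: instead of rewarding energy along the reserved direction or subspace, we penalize it, for every sample. A sketch under the same assumptions as above:

```python
import torch
import torch.nn.functional as F

def anti_anchor_loss(z_unit: torch.Tensor, direction: torch.Tensor) -> torch.Tensor:
    # Push every sample away from a reserved direction: penalize the squared
    # component along that axis. No labels are needed.
    d = F.normalize(direction, dim=-1)
    return (z_unit @ d).pow(2).mean()

def anti_subspace_loss(z_unit: torch.Tensor, axes: torch.Tensor) -> torch.Tensor:
    # Same idea for a reserved subspace: penalize the energy that falls inside it.
    Q, _ = torch.linalg.qr(axes.T)                 # orthonormal basis, shape (dim, k)
    return (z_unit @ Q).pow(2).sum(-1).mean()
```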
We tested SCA thoroughly on a simple network: an MLP autoencoder that reconstructs RGB colors through a four- or five-dimensional bottleneck. Red in this toy network plays the role of deception in our motivating example: a concept we want to be able to remove with confidence. The autoencoder strips away everything except the geometry, so the method can be studied on its own terms.
Our autoencoder architecture. The encoder maps RGB inputs through fully-connected layers (⤭) to 4D or 5D activations, which are normalized (Ⓝ) to the unit hypersphere. The decoder reconstructs RGB outputs from the normalized activations, which also drive the regularizer loss terms.
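A hypothetical stand-in for this architecture (layer widths and depths are illustrative, not the paper's exact configuration):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ColorAutoencoder(nn.Module):
    """RGB in, RGB out, through a normalized 4D or 5D bottleneck."""
    def __init__(self, latent_dim: int = 4, hidden: int = 32):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(3, hidden), nn.ReLU(),
            nn.Linear(hidden, latent_dim),
        )
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 3),
        )

    def forward(self, rgb: torch.Tensor):
        z = self.encoder(rgb)               # pre-normalization activations
        z_unit = F.normalize(z, dim=-1)     # point on the unit hypersphere
        recon = self.decoder(z_unit)
        return recon, z_unit                # z_unit also feeds the regularizer losses
```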
A single anchor is enough for suppression
We anchored red to the first latent axis. In some runs we also anchored vibrant colors to the subspace spanned by the first two dimensions. That is not needed for any of the red interventions, but it arranges the latent space into a recognizable color wheel, making the figures below easier to read[4]. The supervision budget is small: anchoring red in our experiments required ~83 labeled examples — roughly 0.09% of the 96,064 training samples. Our labels were binary and deliberately noisy, to simulate the kind of incomplete and imprecise labeling we may have for an abstract concept.
Anchored baseline. Red sits where we placed it, and the other colors are organized around it. The solid triangle and wavy line show the locations of the anchor and subspace regularizers, respectively.
The first intervention is suppression: at inference time, we project out the component of the latent activations that points along the red direction. This is a small, reversible change to the forward pass.
Suppression. Red and warm colors degrade; colors orthogonal to red are left untouched. Top: latent space post-intervention. Middle: color swatches showing reconstructed colors (large squares) and true colors (small squares). Bottom: line chart of reconstruction error vs. hue.
Suppression is highly selective in this model: reconstruction error for red rises from 0 to about 0.28, close to the theoretical value of ¼ that we would expect when causing red to decode as middle-gray. Colors orthogonal to red, such as lime and purple, are unaffected. The degradation falls off smoothly according to how warm the color is: error tracks the square of the cosine similarity to red, which is the relationship the projection predicts. So orange degrades somewhat, red degrades a lot, cyan is unaffected, and the concept hierarchy behaves as one would hope.
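A sketch of the suppression step, assuming the normalized activations `z_unit` and anchor `direction` from the earlier sketches; we do not renormalize afterwards, which is why suppression pushes activations off-manifold (footnote 6):

```python
import torch
import torch.nn.functional as F

def suppress(z_unit: torch.Tensor, direction: torch.Tensor) -> torch.Tensor:
    # Inference-time suppression: remove the component along the anchored direction.
    # The removed energy is the squared cosine similarity to red, which is why the
    # error curve tracks cos^2 of the angle to red.
    d = F.normalize(direction, dim=-1)
    coeff = z_unit @ d                      # cosine similarity to red
    return z_unit - coeff.unsqueeze(-1) * d # projection onto the orthogonal complement
```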
Permanent removal needs an isolated dimension
Suppression is reversible, but for open weights models we may want to remove a capability permanently. We tried this on the same model by zeroing the weights that read and write the red dimension.
Ablation with a single anchor. Zeroing the red dimension also damages cyan. The intervention is not selective.
The result is not selective: zeroing the red dimension also damages cyan (falling to 0.37). The reason is that nothing prevented the model from using that dimension for other things as well. An anchor is enough for suppression in this model, but ablation requires exclusivity, for which we add repulsive regularizers.
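For concreteness, this is roughly what the weight-ablation intervention looks like on the hypothetical `ColorAutoencoder` sketch above; because the zeroed coordinate is removed before normalization, the remaining activations are renormalized back onto the sphere, which is the on-manifold behavior described in footnote 6:

```python
import torch

@torch.no_grad()
def ablate_dimension(model: "ColorAutoencoder", dim_index: int = 0) -> None:
    # Zero the weights that write the red dimension (last encoder layer)
    # and the weights that read it (first decoder layer).
    enc_out = model.encoder[-1]             # nn.Linear(hidden, latent_dim)
    dec_in = model.decoder[0]               # nn.Linear(latent_dim, hidden)
    enc_out.weight[dim_index, :] = 0.0
    enc_out.bias[dim_index] = 0.0
    dec_in.weight[:, dim_index] = 0.0
```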
In the isolated setup, we keep the red anchor and add anti-anchor and anti-subspace terms that push every sample off the red axis — even the warm samples. Early in training the repulsive terms dominate and clear the dimension; later the attractive terms take over and seat red in the cleared space[5]. No additional labels are needed, since the repulsive terms apply to all samples.
Isolated baseline. The opposing hemisphere is clear and ready for ablation. The hollow triangle and double-wavy line show the locations of the anti-anchor and anti-subspace regularizers, respectively.
With the dimension isolated, both interventions work and both are selective. Suppression still removes red cleanly.
Suppression in the isolated architecture. Still removes red cleanly.
And now, weight ablation does too. Reconstruction error for red rises to about 0.34, close to the value of ⅓ that we expect when the ablated inputs land on a random on-manifold direction[6], while cyan and the other colors show no measurable degradation.
Weight ablation in the isolated architecture. Now this works as well.
We think this is the most safety-relevant result in the paper: a concept, chosen in advance, removed permanently from a trained network by zeroing a known block of weights, with no post-hoc search, and with a reconstruction error close to a bound we can derive directly from the geometry.
Shape the latent space for the intervention
Our paper therefore describes two architectures. The anchored architecture uses attraction only, and it is sufficient for reversible, inference-time suppression. The isolated architecture adds repulsive terms that reserve the target dimension and so make permanent weight ablation selective. The choice between them depends on whether you need dynamic steering or permanent removal.
Critiques
These are some of the concerns we have heard when explaining SCA to others, with our responses.
Not all concepts can be linearly represented.
That is true, and we think SCA can be used regardless. We have shown that cyclic concepts like hue can be confined to a chosen subspace, and that concept hierarchies arrange as one would expect, so that intervening on red affects orange by an appropriate amount. A subspace constraint can also be overcomplete: we can reserve more dimensions than we think a concept needs and let the model use as many of them as it wants. Even this should be useful, since it would reduce the region a post-hoc method has to search without committing to an exact location. We expect this to extend naturally to more complex domains.
Won't this put pressure on the model to obfuscate its thinking? You're using interpretability for training — the most forbidden technique!
SCA adds no gradient signal against representing any concept, and therefore no pressure to obfuscate. On the contrary, we expect SCA to help the model to learn the anchored concepts, since we hypothesize that anchoring leads to more disentangled and more generalizable representations. The interventions are applied post-hoc; the training-time signal only hints that if the model is going to represent red, it should represent it here.
Frontier model training pipelines are too optimized to introduce new techniques.
The regularizers compose with standard training rather than replacing parts of it, and the schedule can be tuned per-anchor. We expect SCA to drop into existing recipes as an additive modification rather than a disruptive one.
SCA sounds computationally expensive.
The per-step overhead of the regularizers is small, so the question is whether SCA requires longer training runs. In our experiments it did: 96,064 samples is a lot for such a simple model. For larger models, in which the anchored concepts are a much smaller fraction of the whole, we expect the total overhead to be modest.
If regularizer weights vary over time, wouldn't that make hyperparameter search extremely complex?
It does mean that there are more hyperparameters to tune, but we think it is tractable. In our experiments, we needed 3-4 keyframes per regularizer. Across the anchor, anti-anchor, and anti-subspace regularizers in the isolated model, we had 10 (step, weight) pairs in total[5]. We expect that schedules of similar complexity would be required for anchoring concepts in language models.
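As an illustration of what such a schedule amounts to in code (the keyframe values below are made up for illustration, not the ones from our runs):

```python
import numpy as np

def regularizer_weight(step: int, keyframes: list[tuple[int, float]]) -> float:
    # Piecewise-linear interpolation between (step, weight) keyframes.
    steps, weights = zip(*keyframes)
    return float(np.interp(step, steps, weights))

# Illustrative values only: anti-anchor strong early and fading, anchor ramping up later.
anti_anchor_keyframes = [(0, 1.0), (2_000, 1.0), (6_000, 0.05)]
anchor_keyframes = [(0, 0.0), (4_000, 0.2), (8_000, 1.0)]

w_anti = regularizer_weight(3_000, anti_anchor_keyframes)
w_anchor = regularizer_weight(3_000, anchor_keyframes)
```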
How is this different from gradient routing?
Indeed, we should have engaged with gradient routing, which we missed in our literature review. It is clearly a neighbor: both methods localize a concept to a known place during training using imperfect labels, and both demonstrate localizing a capability and then ablating it. The difference is largely in mechanism. Gradient routing uses masks to constrain parameter updates; SCA adds geometric regularizers that directly shape activations.
Next steps
We see a few directions.
The most pressing direction is transformers — for SCA to matter at scale, it needs to work there. The open questions are: where in the residual stream to apply the regularizers; whether interventions on anchored concepts resist being bypassed — through residual connections, or via attention from tokens where the concept appears in the input; and how labeling requirements scale as concepts become more abstract and the underlying manifold becomes higher-dimensional. We expect that source-level or automated labeling of whole passages, together with SCA's tolerance for label noise, will be enough to begin answering these.
The second is concepts of unknown shape: an overcomplete subspace (reserving more dimensions than a concept is likely to need) should let us anchor a concept without knowing its shape, but this is still speculative.
The third is many anchors at once: each targeted concept adds attraction and repulsion terms, which at some point may make for a difficult optimization landscape and may begin to cost task performance. We would like to quantify where that point is for pairs of concepts at different levels of baseline similarity (some pairs are close to naturally orthogonal; others are not).
For frontier labs. Sparse Concept Anchoring is one of the few techniques where the side-effects of ablation are bounded analytically from geometry rather than measured empirically after the fact. The setting we've demonstrated this in is controlled, but the principle is general, and we think it's worth piloting on a real model. If you're at a frontier lab, we'd actively like to help. We can advise on regularizer schedules, label requirements, and where the technique may struggle.
For collaborators and funders. We're actively pursuing funding to continue this line of work. If you fund AI safety research with scope for training-time interpretability or capability control, or if you're a researcher who wants to collaborate on extending SCA to transformers, we'd like to hear from you. Reach out here or to Alexander.Fraser@alumni.anu.edu.au.
Sparse Concept Anchoring for Interpretable and Controllable Neural Representations is an ICLR 2026 GRaM Workshop paper. "Long" papers at GRaM are archival, included in the PMLR proceedings. arXiv:2512.12469; OpenReview.
Achille, Rovere & Soatto, Critical Learning Periods in Deep Networks, ICLR 2019. Early training fixes the network's representation structure; after the critical period closes, later training cannot easily undo it.
Loshchilov, Hsieh, Sun & Ginsburg, nGPT: Normalized Transformer with Representation Learning on the Hypersphere, ICLR 2025. The authors show that nGPT trains much faster than transformers without normalization.
In the first post in this sub-sequence, we only reported the version with both an anchor and a subspace constraint. We can now report that the single-anchor version works as well. Results are presented in Appendix C.5.1 of our paper.
This is the hyperparameter schedule for the isolated architecture:
Anti-anchor (red line) dominates early in training. Later it subsides, yielding to the attractive anchor term (orange line). The learning rate (bottom) is controlled with the same scheduler. The black circles show the locations of keyframes; all other values are interpolated.
See Appendix B.4 of our paper. We also discuss schedules in the preceding two posts of this sequence.
The expected reconstruction error differs for the two intervention types because suppression pushes activations off-manifold while ablation redirects activations to orthogonal points on-manifold. This happens because in our model suppression affects the post-normalization activations, whereas ablation affects pre-normalization activations. We explain this in more detail in Section 3 of our paper.
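For the ¼ figure, a quick check under the assumption that reconstruction error is mean squared error over the three RGB channels:

```latex
% Red decoded as middle gray:
\mathrm{MSE}\!\left((1,0,0),\ \left(\tfrac12,\tfrac12,\tfrac12\right)\right)
  = \tfrac13\!\left(\left(\tfrac12\right)^2 + \left(\tfrac12\right)^2 + \left(\tfrac12\right)^2\right)
  = \tfrac14 .
```

The corresponding calculation for the ⅓ ablation figure is in Section 3 of the paper.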