Suppose you use mech interp to discover features in a model relating to deception. They're even causal: if you amplify them, the model displays more deceptive behavior; if you suppress them, it displays less. But how do you know you've found all such features? If you suppress them entirely or ablate the associated weights, have you removed the model's ability to be deceptive?
Earlier posts in this sequence presented proofs of concept for a new alignment technique that we call Sparse Concept Anchoring (SCA). This post summarizes our recent ICLR paper[1], in which we refined the technique and demonstrated practical interventions on toy models. You don't need to have read the earlier posts to understand this one. Written with help from Claude Opus 4.7.
Our work is motivated by two observations.
The first is that understanding activations in an already-trained model is hard, because the training process shapes representation space to suit the task rather than to aid interpretation. Networks lose plasticity early in training[2], so the structure you are left analyzing may not even be optimal for the task, let alone for interpretation: concept directions may be entangled (inseparable from other concepts) or fragmented (multiple context-dependent representations of the same thing).
The second is that only a small fraction of the concepts and behaviors in a large model are directly safety-relevant: sparse autoencoders trained on Claude 3 Sonnet recovered millions of features, of which only a handful relate to safety.
These observations suggest a different approach. Rather than searching for safety-relevant concepts after training, we could reserve a place for them in advance and encourage the model to put them there, while leaving the rest of the representation space alone. Sparse Concept Anchoring does this with a combination of selective regularizers and rare labeled training samples.
How it works
The training objective (loss) in SCA has three terms.
The first term is the main objective of the model; in our experiments it was the reconstruction loss of an autoencoder. The other two are inductive biases that act on the latent activations.
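Schematically, the combined objective looks like the following (the symbol names here are ours, not the paper's notation):

```latex
\mathcal{L} \;=\; \mathcal{L}_{\text{task}}
\;+\; \lambda_{\text{struct}}\,\mathcal{L}_{\text{structural}}
\;+\; \lambda_{\text{concept}}\,\mathcal{L}_{\text{concept}}
```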
Structural constraints apply to every sample and give the latent space a workable shape. We normalize activations onto the unit hypersphere — this makes cosine similarity a meaningful measure of relatedness and is the approach taken by nGPT[3] — and we add a separation term that discourages pairs of activations from clustering together. Neither term targets any particular concept; they are applied indiscriminately, and together they create a space in which concepts can be reliably placed and cleanly removed.
Structural constraints. Left: Normalize places activations on the hypersphere. Right: Separate keeps them from clustering.
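As a rough sketch of what these structural terms could look like in PyTorch (the paper's exact separation loss may differ; `z` is a batch of latent activations):

```python
import torch
import torch.nn.functional as F

def normalize_to_hypersphere(z: torch.Tensor) -> torch.Tensor:
    # Project each activation onto the unit hypersphere, so cosine similarity
    # between samples is just a dot product.
    return F.normalize(z, dim=-1)

def separation_loss(z_unit: torch.Tensor) -> torch.Tensor:
    # Discourage pairs of normalized activations from clustering: penalize
    # high pairwise cosine similarity within the batch.
    sims = z_unit @ z_unit.T                                  # (B, B) cosine similarities
    off_diag = sims - torch.eye(len(z_unit), device=z_unit.device)
    return off_diag.clamp(min=0).pow(2).mean()                # one of many reasonable choices
```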
Concept regularizers position the concepts we care about, and they apply only to rare labeled samples. Anchor attracts labeled examples of a concept toward a fixed direction, e.g. red toward the first latent axis. A subspace regularizer does the same for cyclic and multidimensional concepts: it attracts labeled examples toward a chosen set of axes without prescribing where they sit within those axes. Everything that is not labeled is free to self-organize.
Attractive regularizers. Anchor aligns a simple linear concept to a direction; subspace collects a multidimensional or cyclic concept onto a set of axes.
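A minimal sketch of the attractive terms, again with assumed details (here `labels` is a boolean mask marking the rare labeled samples, and `axes` holds the chosen axis directions as rows):

```python
import torch
import torch.nn.functional as F

def anchor_loss(z_unit: torch.Tensor, labels: torch.Tensor, direction: torch.Tensor) -> torch.Tensor:
    # Attract labeled examples of a concept toward a fixed unit direction
    # (e.g. red toward the first latent axis). Unlabeled samples contribute nothing.
    if not labels.any():
        return z_unit.new_zeros(())
    d = F.normalize(direction, dim=-1)
    cos = z_unit[labels] @ d                       # cosine similarity to the anchor direction
    return (1.0 - cos).mean()

def subspace_loss(z_unit: torch.Tensor, labels: torch.Tensor, axes: torch.Tensor) -> torch.Tensor:
    # Attract labeled examples toward the span of a chosen set of axes without
    # prescribing where they sit inside it: penalize out-of-subspace energy.
    if not labels.any():
        return z_unit.new_zeros(())
    Q, _ = torch.linalg.qr(axes.T)                 # orthonormal basis of the subspace, shape (dim, k)
    in_plane = (z_unit[labels] @ Q).pow(2).sum(-1) # squared norm of the in-subspace projection
    return (1.0 - in_plane).mean()                 # zero when a unit activation lies fully in the subspace
```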
Repulsive variants, anti-anchor and anti-subspace, push samples away instead. These are used to prepare subspaces for unlearning, which we will return to below. Rather than using labels, repulsive term weights follow a schedule: strong early in training to clear regions while the space is still malleable, and weak later, allowing the attractive terms to dominate.
Repulsive regularizers. Anti-anchor and anti-subspace push all samples away from a reserved direction or subspace.
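The repulsive counterparts are essentially the same computations with the objective flipped: instead of rewarding energy along the reserved direction or subspace, we penalize it, for every sample. A sketch under the same assumptions as above:

```python
import torch
import torch.nn.functional as F

def anti_anchor_loss(z_unit: torch.Tensor, direction: torch.Tensor) -> torch.Tensor:
    # Push every sample away from a reserved direction: penalize the squared
    # component along that axis. No labels are needed.
    d = F.normalize(direction, dim=-1)
    return (z_unit @ d).pow(2).mean()

def anti_subspace_loss(z_unit: torch.Tensor, axes: torch.Tensor) -> torch.Tensor:
    # Same idea for a reserved subspace: penalize the energy that falls inside it.
    Q, _ = torch.linalg.qr(axes.T)                 # orthonormal basis, shape (dim, k)
    return (z_unit @ Q).pow(2).sum(-1).mean()
```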
We tested SCA thoroughly on a simple network: an MLP autoencoder that reconstructs RGB colors through a four- or five-dimensional bottleneck. Red in this toy network plays the role of deception in our motivating example: a concept we want to be able to remove with confidence. The autoencoder strips away everything except the geometry, so the method can be studied on its own terms.
Our autoencoder architecture. The encoder maps RGB inputs through fully-connected layers (⤭) to 4D or 5D activations, which are normalized (Ⓝ) to the unit hypersphere. The decoder reconstructs RGB outputs from the normalized activations, which also drive the regularizer loss terms.
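A hypothetical stand-in for this architecture (layer widths and depths are illustrative, not the paper's exact configuration):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ColorAutoencoder(nn.Module):
    """RGB in, RGB out, through a normalized 4D or 5D bottleneck."""
    def __init__(self, latent_dim: int = 4, hidden: int = 32):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(3, hidden), nn.ReLU(),
            nn.Linear(hidden, latent_dim),
        )
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 3),
        )

    def forward(self, rgb: torch.Tensor):
        z = self.encoder(rgb)               # pre-normalization activations
        z_unit = F.normalize(z, dim=-1)     # point on the unit hypersphere
        recon = self.decoder(z_unit)
        return recon, z_unit                # z_unit also feeds the regularizer losses
```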
A single anchor is enough for suppression
We anchored red to the first latent axis. In some runs we also anchored vibrant colors to the subspace spanned by the first two dimensions. That is not needed for any of the red interventions, but it arranges the latent space into a recognizable color wheel, making the figures below easier to read[4]. The supervision budget is small: anchoring red in our experiments required ~83 labeled examples — roughly 0.09% of the 96,064 training samples. Our labels were binary and deliberately noisy, to simulate the kind of incomplete and imprecise labeling we may have for an abstract concept.
Anchored baseline. Red sits where we placed it, and the other colors are organized around it. The solid triangle and wavy line show the locations of the anchor and subspace regularizers, respectively.
The first intervention is suppression: at inference time, we project out the component of the latent activations that points along the red direction. This is a small, reversible change to the forward pass.
Suppression. Red and warm colors degrade; colors orthogonal to red are left untouched. Top: latent space post-intervention. Middle: color swatches showing reconstructed colors (large squares) and true colors (small squares). Bottom: line chart of reconstruction error vs. hue.
Suppression is highly selective in this model: reconstruction error for red rises from 0 to about 0.28, close to the theoretical value of ¼ that we would expect when causing red to decode as middle-gray. Colors orthogonal to red, such as lime and purple, are unaffected. The degradation falls off smoothly according to how warm the color is: error tracks the square of the cosine similarity to red, which is the relationship the projection predicts. So orange degrades somewhat, red degrades a lot, cyan is unaffected, and the concept hierarchy behaves as one would hope.
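A sketch of the suppression step, assuming the normalized activations `z_unit` and anchor `direction` from the earlier sketches; we do not renormalize afterwards, which is why suppression pushes activations off-manifold (footnote 6):

```python
import torch
import torch.nn.functional as F

def suppress(z_unit: torch.Tensor, direction: torch.Tensor) -> torch.Tensor:
    # Inference-time suppression: remove the component along the anchored direction.
    # The removed energy is the squared cosine similarity to red, which is why the
    # error curve tracks cos^2 of the angle to red.
    d = F.normalize(direction, dim=-1)
    coeff = z_unit @ d                      # cosine similarity to red
    return z_unit - coeff.unsqueeze(-1) * d # projection onto the orthogonal complement
```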
Permanent removal needs an isolated dimension
Suppression is reversible, but for open weights models we may want to remove a capability permanently. We tried this on the same model by zeroing the weights that read and write the red dimension.
Ablation with a single anchor. Zeroing the red dimension also damages cyan. The intervention is not selective.
The result is not selective: zeroing the red dimension also damages cyan (falling to 0.37). The reason is that nothing prevented the model from using that dimension for other things as well. An anchor is enough for suppression in this model, but ablation requires exclusivity, for which we add repulsive regularizers.
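For concreteness, this is roughly what the weight-ablation intervention looks like on the hypothetical `ColorAutoencoder` sketch above; because the zeroed coordinate is removed before normalization, the remaining activations are renormalized back onto the sphere, which is the on-manifold behavior described in footnote 6:

```python
import torch

@torch.no_grad()
def ablate_dimension(model: "ColorAutoencoder", dim_index: int = 0) -> None:
    # Zero the weights that write the red dimension (last encoder layer)
    # and the weights that read it (first decoder layer).
    enc_out = model.encoder[-1]             # nn.Linear(hidden, latent_dim)
    dec_in = model.decoder[0]               # nn.Linear(latent_dim, hidden)
    enc_out.weight[dim_index, :] = 0.0
    enc_out.bias[dim_index] = 0.0
    dec_in.weight[:, dim_index] = 0.0
```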
In the isolated setup, we keep the red anchor and add anti-anchor and anti-subspace terms that push every sample off the red axis — even the warm samples. Early in training the repulsive terms dominate and clear the dimension; later the attractive terms take over and seat red in the cleared space[5]. No additional labels are needed, since the repulsive terms apply to all samples.
Isolated baseline. The opposing hemisphere is clear and ready for ablation. The hollow triangle and double-wavy line show the locations of the anti-anchor and anti-subspace regularizers, respectively.
With the dimension isolated, both interventions work and both are selective. Suppression still removes red cleanly.
Suppression in the isolated architecture. Still removes red cleanly.
And now, weight ablation does too. Reconstruction error for red rises to about 0.34, close to the value of ⅓ that we expect when the ablated inputs land on a random on-manifold direction[6], while cyan and the other colors show no measurable degradation.
Weight ablation in the isolated architecture. Now this works as well.
We think this is the most safety-relevant result in the paper: a concept, chosen in advance, removed permanently from a trained network by zeroing a known block of weights, with no post-hoc search, and with a reconstruction error close to a bound we can derive directly from the geometry.
Shape the latent space for the intervention
Our paper therefore describes two architectures. The anchored architecture uses attraction only, and it is sufficient for reversible, inference-time suppression. The isolated architecture adds repulsive terms that reserve the target dimension and so make permanent weight ablation selective. The choice between them depends on whether you need dynamic steering or permanent removal.
Critiques
These are some of the concerns we have heard when explaining SCA to others, with our responses.
Not all concepts can be linearly represented.
That is true, and we think SCA can be used regardless. We have shown that cyclic concepts like hue can be confined to a chosen subspace, and that concept hierarchies arrange as one would expect, so that intervening on red affects orange by an appropriate amount. A subspace constraint can also be overcomplete: we can reserve more dimensions than we think a concept needs and let the model use as many of them as it wants. Even this should be useful, since it would reduce the region a post-hoc method has to search without committing to an exact location. We expect this to extend naturally to more complex domains.
Won't this put pressure on the model to obfuscate its thinking? You're using interpretability for training — the most forbidden technique!
SCA adds no gradient signal against representing any concept, and therefore no pressure to obfuscate. On the contrary, we expect SCA to help the model to learn the anchored concepts, since we hypothesize that anchoring leads to more disentangled and more generalizable representations. The interventions are applied post-hoc; the training-time signal only hints that if the model is going to represent red, it should represent it here.
Frontier model training pipelines are too optimized to introduce new techniques.
The regularizers compose with standard training rather than replacing parts of it, and the schedule can be tuned per-anchor. We expect SCA to drop into existing recipes as an additive modification rather than a disruptive one.
SCA sounds computationally expensive.
The per-step overhead of the regularizers is small, so the question is whether SCA requires longer training runs. In our experiments it did: 96,064 samples is a lot for such a simple model. For larger models, in which the anchored concepts are a much smaller fraction of the whole, we expect the total overhead to be modest.
If regularizer weights vary over time, wouldn't that make hyperparameter search extremely complex?
It does mean that there are more hyperparameters to tune, but we think it is tractable. In our experiments, we needed 3-4 keyframes per regularizer. Across the anchor, anti-anchor, and anti-subspace regularizers in the isolated model, we had 10 (step, weight) pairs in total[5]. We expect that schedules of similar complexity would be required for anchoring concepts in language models.
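As an illustration of what such a schedule amounts to in code (the keyframe values below are made up for illustration, not the ones from our runs):

```python
import numpy as np

def regularizer_weight(step: int, keyframes: list[tuple[int, float]]) -> float:
    # Piecewise-linear interpolation between (step, weight) keyframes.
    steps, weights = zip(*keyframes)
    return float(np.interp(step, steps, weights))

# Illustrative values only: anti-anchor strong early and fading, anchor ramping up later.
anti_anchor_keyframes = [(0, 1.0), (2_000, 1.0), (6_000, 0.05)]
anchor_keyframes = [(0, 0.0), (4_000, 0.2), (8_000, 1.0)]

w_anti = regularizer_weight(3_000, anti_anchor_keyframes)
w_anchor = regularizer_weight(3_000, anchor_keyframes)
```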
How is this different from gradient routing?
Indeed, we should have engaged with gradient routing, which we missed in our literature review. It is clearly a neighbor: both methods localize a concept to a known place during training using imperfect labels, and both demonstrate localizing a capability and then ablating it. The difference is largely in mechanism. Gradient routing uses masks to constrain parameter updates; SCA adds geometric regularizers that directly shape activations.
Next steps
We see a few directions.
The most pressing direction is transformers — for SCA to matter at scale, it needs to work there. The open questions are: where in the residual stream to apply the regularizers; whether interventions on anchored concepts resist being bypassed — through residual connections, or via attention from tokens where the concept appears in the input; and how labeling requirements scale as concepts become more abstract and the underlying manifold becomes higher-dimensional. We expect that source-level or automated labeling of whole passages, together with SCA's tolerance for label noise, will be enough to begin answering these.
The second is concepts of unknown shape: an overcomplete subspace (reserving more dimensions than a concept is likely to need) should let us anchor a concept without knowing its shape, but this is still speculative.
The third is many anchors at once: each targeted concept adds attraction and repulsion terms, which at some point may make for a difficult optimization landscape and may begin to cost task performance. We would like to quantify where that point is for pairs of concepts at different levels of baseline similarity (some pairs are close to naturally orthogonal; others are not).
For frontier labs. Sparse Concept Anchoring is one of the few techniques where the side-effects of ablation are bounded analytically from geometry rather than measured empirically after the fact. The setting we've demonstrated this in is controlled, but the principle is general, and we think it's worth piloting on a real model. If you're at a frontier lab, we'd actively like to help. We can advise on regularizer schedules, label requirements, and where the technique may struggle.
For collaborators and funders. We're actively pursuing funding to continue this line of work. If you fund AI safety research with scope for training-time interpretability or capability control, or if you're a researcher who wants to collaborate on extending SCA to transformers, we'd like to hear from you. Reach out here or to Alexander.Fraser@alumni.anu.edu.au.
Sparse Concept Anchoring for Interpretable and Controllable Neural Representations is an ICLR 2026 GRaM Workshop paper. "Long" papers at GRaM are archival, included in the PMLR proceedings. arXiv:2512.12469; OpenReview.
Achille, Rovere & Soatto, Critical Learning Periods in Deep Networks, ICLR 2019. Early training fixes the network's representation structure; after the critical period closes, later training cannot easily undo it.
Loshchilov, Hsieh, Sun & Ginsburg, nGPT: Normalized Transformer with Representation Learning on the Hypersphere, ICLR 2025. The authors show that nGPT trains much faster than transformers without normalization.
In the first post in this sub-sequence, we only reported the version with both an anchor and a subspace constraint. We can now report that the single-anchor version works as well. Results are presented in Appendix C.5.1 of our paper.
This is the hyperparameter schedule for the isolated architecture:
Anti-anchor (red line) dominates early in training. Later it subsides, yielding to the attractive anchor term (orange line). The learning rate (bottom) is controlled with the same scheduler. The black circles show the locations of keyframes; all other values are interpolated.
See Appendix B.4 of our paper. We also discuss schedules in the preceding two posts of this sequence.
The expected reconstruction error differs for the two intervention types because suppression pushes activations off-manifold while ablation redirects activations to orthogonal points on-manifold. This happens because in our model suppression affects the post-normalization activations, whereas ablation affects pre-normalization activations. We explain this in more detail in Section 3 of our paper.
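For the ¼ figure, a quick check under the assumption that reconstruction error is mean squared error over the three RGB channels:

```latex
% Red decoded as middle gray:
\mathrm{MSE}\!\left((1,0,0),\ \left(\tfrac12,\tfrac12,\tfrac12\right)\right)
  = \tfrac13\!\left(\left(\tfrac12\right)^2 + \left(\tfrac12\right)^2 + \left(\tfrac12\right)^2\right)
  = \tfrac14 .
```

The corresponding calculation for the ⅓ ablation figure is in Section 3 of the paper.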