Very cool work. For certain forms of unlearning, e.g. dangerous Chemical/Biological/Radiological/Nuclear knowledge, this seems like a great idea.
However, for something like deception, which you quoted as a motivating example, there's a challenge. We don't want a model that has no idea what deception is:
a) it would be incredibly easy to jailbreak: just tell it you're trustworthy and have a legitimate need for the information
b) it would have a lot of trouble understanding human behavior, or writing stories, or doing computer security, etc. etc.
c) it would generally have a big, glaring hole in its world model of people
d) if and when it did figure out that humans were often deceptive, its reactions might be hard to predict
What we need is a model that understands deception, but doesn't use it on us. Ideally, one that provably doesn't use it, because we are sure we could detect and measure if it did.
Knowing exactly where the model's representation of "deception" is might well still be a great start. But I suspect we wouldn't just abliterate it, but would instead investigate, understand, and monitor it.
I've edited the article to use "The Golden Gate Bridge" instead of "deception" as the motivating example. It's less motivating for sure, but I think it makes the technique easier to discuss. But you've really got me thinking about the complexities of capability removal.
True, concrete knowledge would have been a better motivating example. Intervening usefully on deception is a moonshot.
I wonder if it would be possible to intervene on a concept only for continuation tokens, while leaving prompt tokens uninhibited? I don't know what might carry over via attention, but I'd like to try it.
Suppose you use mech interp to discover features in a model relating to a particular concept — say, the Golden Gate Bridge. They're even causal: amplify them and the model talks about the bridge more often, even speaking as if it were the bridge; suppress them and references to it fade. But how do you know you've found all such features? If you suppress them entirely or ablate the associated weights, have you removed the model's ability to represent the bridge at all?
Our work is motivated by two observations.
The first is that understanding activations in an already-trained model is hard, because the training process shapes representation space to suit the task rather than to aid interpretation. Networks lose plasticity early in training[2], so the structure you are left analyzing may not even be optimal for the task, let alone for interpretation: concept directions may be entangled (inseparable from other concepts) or fragmented (multiple context-dependent representations of the same thing).
The second is that only a small fraction of the concepts and behaviors in a large model are directly safety-relevant: sparse autoencoders trained on Claude 3 Sonnet recovered millions of features, of which only a handful relate to safety.
These observations suggest a different approach. Rather than searching for safety-relevant concepts after training, we could reserve a place for them in advance and encourage the model to put them there, while leaving the rest of the representation space alone. Sparse Concept Anchoring does this with a combination of selective regularizers and rare labeled training samples.
How it works
The training objective (loss) in SCA has three terms.
The first term is the main objective of the model; in our experiments it was the reconstruction loss of an autoencoder. The other two, covering the structural constraints and the concept regularizers described below, are inductive biases that act on the latent activations.
Structural constraints apply to every sample and give the latent space a workable shape. We normalize activations onto the unit hypersphere — this makes cosine similarity a meaningful measure of relatedness and is the approach taken by nGPT[3] — and we add a separation term that discourages pairs of activations from clustering together. Neither term targets any particular concept; they are applied indiscriminately, and together they create a space in which concepts can be reliably placed and cleanly removed.
Structural constraints. Left: Normalize places activations on the hypersphere. Right: Separate keeps them from clustering.
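As a rough illustration of the two structural constraints, here is a minimal Python sketch. The normalization step follows the description above. The exact form of the separation term is not given in this post, so `separation_penalty` (mean squared positive cosine similarity over all pairs) is an assumption chosen for illustration, as are all function names.

```python
import math
from itertools import combinations

def normalize(v):
    """Project a nonzero activation vector onto the unit hypersphere."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cosine(a, b):
    """Cosine similarity of two unit vectors (their dot product)."""
    return sum(x * y for x, y in zip(a, b))

def separation_penalty(unit_vectors):
    """ASSUMED form of the separation term: mean squared positive cosine
    similarity over all pairs, so tightly clustered activations are
    penalized and orthogonal or opposed ones are not."""
    pairs = list(combinations(unit_vectors, 2))
    return sum(max(0.0, cosine(a, b)) ** 2 for a, b in pairs) / len(pairs)

batch = [normalize(v) for v in ([3.0, 4.0, 0.0, 0.0], [0.0, 0.0, 2.0, 0.0])]
print(separation_penalty(batch))  # the two vectors are orthogonal
```

Because neither function looks at labels, both can be applied to every sample in a batch, which matches the "indiscriminate" role these terms play.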
Concept regularizers position the concepts we care about, and they apply only to rare labeled samples. Anchor attracts labeled examples of a concept toward a fixed direction, e.g. red toward the first latent axis. A subspace regularizer does the same for cyclic and multidimensional concepts: it attracts labeled examples toward a chosen set of axes without prescribing where they sit within those axes. Everything that is not labeled is free to self-organize.
Attractive regularizers. Anchor aligns a simple linear concept to a direction; subspace collects a multidimensional or cyclic concept onto a set of axes.
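The attractive terms can be sketched in a few lines of Python. The loss forms below (one minus cosine similarity for the anchor, and the squared norm lying outside the chosen axes for the subspace) are assumptions picked for illustration, not the formulas from the paper; the function names are likewise hypothetical. The point of the sketch is the masking: only labeled samples contribute.

```python
def anchor_loss(z, anchor):
    """ASSUMED form: 1 - cosine similarity between unit vectors z and anchor."""
    return 1.0 - sum(a * b for a, b in zip(z, anchor))

def subspace_loss(z, axes):
    """ASSUMED form: squared norm of unit vector z that lies outside the
    chosen axes. Where z sits within those axes is not constrained."""
    return 1.0 - sum(z[i] ** 2 for i in axes)

def concept_loss(batch, labels, anchor):
    """Average anchor loss over the rare labeled samples only."""
    labeled = [z for z, is_concept in zip(batch, labels) if is_concept]
    if not labeled:
        return 0.0  # unlabeled batches contribute nothing
    return sum(anchor_loss(z, anchor) for z in labeled) / len(labeled)

red_anchor = [1.0, 0.0, 0.0, 0.0]   # first latent axis
batch = [[1.0, 0.0, 0.0, 0.0], [0.0, 0.6, 0.8, 0.0]]
labels = [True, False]              # only the first sample is labeled "red"
print(concept_loss(batch, labels, red_anchor))
```

The second sample is far from the anchor but carries no label, so it is left free to self-organize.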
The repulsive variants, anti-anchor and anti-subspace, push samples away instead. These are used to prepare subspaces for unlearning, which we will return to below. Rather than using labels, the repulsive term weights follow a schedule: strong early in training, to clear regions while the space is still malleable, and weak later, allowing the attractive terms to dominate.
Repulsive regularizers. Anti-anchor and anti-subspace push all samples away from a reserved direction or subspace.
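A schedule of this kind can be expressed as a short list of (step, weight) keyframes with interpolation in between. The sketch below assumes linear interpolation and uses made-up keyframe values; neither the helper name `scheduled_weight` nor the numbers come from the paper.

```python
from bisect import bisect_right

def scheduled_weight(step, keyframes):
    """Linearly interpolate a regularizer weight between (step, weight)
    keyframes. The weight is held constant before the first keyframe
    and after the last one."""
    steps = [s for s, _ in keyframes]
    if step <= steps[0]:
        return keyframes[0][1]
    if step >= steps[-1]:
        return keyframes[-1][1]
    i = bisect_right(steps, step)
    (s0, w0), (s1, w1) = keyframes[i - 1], keyframes[i]
    return w0 + (w1 - w0) * (step - s0) / (s1 - s0)

# HYPOTHETICAL keyframes, not the values used in the paper.
anti_anchor = [(0, 1.0), (1000, 1.0), (3000, 0.1)]  # strong early, weak later
anchor = [(0, 0.0), (1000, 0.2), (3000, 1.0)]       # weak early, strong later

for step in (0, 2000, 3000):
    print(step, scheduled_weight(step, anti_anchor), scheduled_weight(step, anchor))
```

With these illustrative values the repulsive weight exceeds the attractive one early on and the ordering reverses by the end of training, which is the qualitative behavior described above.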
We tested SCA thoroughly on a simple network: an MLP autoencoder that reconstructs RGB colors through a four- or five-dimensional bottleneck. Red in this toy network plays the role of The Golden Gate Bridge in our motivating example: a concept we want to be able to intervene on with confidence. The autoencoder strips away everything except the geometry, so the method can be studied on its own terms.
Our autoencoder architecture. The encoder maps RGB inputs through fully-connected layers ⤭ to 4D or 5D activations, which are normalized Ⓝ to the unit hypersphere. The decoder reconstructs RGB outputs from the normalized activations, which also drive the regularizer loss terms.
A single anchor is enough for suppression
We anchored red to the first latent axis. In some runs we also anchored vibrant colors to the subspace of the first two dimensions. That is not needed for any of the red interventions, but it arranges the latent space into a recognizable color wheel, making the figures below easier to read[4]. The supervision budget is small: anchoring red in our experiments required ~83 labeled examples, roughly 0.09% of the 96,064 training samples. Our labels were binary and deliberately noisy to simulate the kind of incomplete and imprecise labeling we may have for an abstract concept.
Anchored baseline. Red sits where we placed it, and the other colors are organized around it. The solid triangle and wavy line show the locations of the anchor and subspace regularizers, respectively.
The first intervention is suppression: at inference time, we project out the component of the latent activations that points along the red direction. This is a small, reversible change to the forward pass.
Suppression. Red and warm colors degrade; colors orthogonal to red are left untouched. Top: latent space post-intervention. Middle: color swatches showing reconstructed colors (large squares) and true colors (small squares). Bottom: line chart of reconstruction error vs. hue.
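The projection itself is a one-liner. A minimal sketch, assuming the red direction is a unit vector along the first latent axis (the function name `suppress` is ours for illustration):

```python
def suppress(z, direction):
    """Remove the component of z that points along a unit-length direction."""
    along = sum(a * b for a, b in zip(z, direction))
    return [a - along * b for a, b in zip(z, direction)]

red_axis = [1.0, 0.0, 0.0, 0.0]
print(suppress([1.0, 0.0, 0.0, 0.0], red_axis))  # a pure-red activation
print(suppress([0.0, 0.6, 0.8, 0.0], red_axis))  # an activation orthogonal to red
```

The pure-red activation collapses to the zero vector, while the orthogonal activation passes through unchanged. Nothing in the weights is modified, which is why the intervention can be switched off again.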
Suppression is highly selective in this model: reconstruction error for red rises from 0 to about 0.28, close to the theoretical value of ¼ that we would expect when causing red to decode as middle-gray. Colors orthogonal to red, such as lime and purple, are unaffected. The degradation falls off smoothly according to how warm the color is: error tracks the square of cosine similarity to red, which is the relationship the projection predicts. So orange degrades somewhat, red degrades a lot, cyan is unaffected, and the concept hierarchy behaves as one would hope.
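Both numbers can be checked by hand. The sketch below assumes, for illustration, that reconstruction error is the mean squared error over the three RGB channels with values in [0, 1]; the `orange_like` latent is a hypothetical vector, not one taken from the trained model.

```python
def mse(a, b):
    """Mean squared error over channels."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

red, middle_gray = (1.0, 0.0, 0.0), (0.5, 0.5, 0.5)
print(mse(red, middle_gray))  # 0.25

def removed_fraction(z, direction):
    """Squared norm removed from unit vector z when a unit direction is
    projected out. This equals the squared cosine similarity."""
    return sum(a * b for a, b in zip(z, direction)) ** 2

red_axis = [1.0, 0.0, 0.0, 0.0]
orange_like = [0.8, 0.6, 0.0, 0.0]  # HYPOTHETICAL latent with cosine 0.8 to red
print(removed_fraction(orange_like, red_axis))
```

Each channel of red is off by 0.5 when it decodes as middle-gray, giving (0.25 + 0.25 + 0.25) / 3 = ¼. For the hypothetical orange-like latent, the projection removes roughly 0.64 of the squared norm, which is the cosine-squared falloff.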
Permanent removal needs an isolated dimension
Suppression is reversible, but for open-weight models we may want to remove a capability permanently. We tried this on the same model by zeroing the weights that read and write the red dimension.
Ablation with a single anchor. Zeroing the red dimension also damages cyan. The intervention is not selective.
The result is not selective: zeroing the red dimension also damages cyan ( falls to 0.37). The reason is that nothing prevented the model from using that dimension for other things as well. An anchor is enough for suppression in this model, but ablation requires exclusivity, for which we add repulsive regularizers.
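To make "zeroing the weights that read and write the red dimension" concrete, here is a minimal sketch on plain nested lists. The weight layout, the helper names, and the tiny example matrices are all assumptions for illustration; a real model would apply the same idea to its own parameter tensors.

```python
def ablate_dimension(enc_w, enc_b, dec_w, dim):
    """Zero the weights that write to and read from one latent dimension.

    ASSUMED layout: enc_w[i] is the row of the final encoder layer that
    produces latent i, enc_b[i] is its bias, and dec_w[j][i] is the weight
    of the first decoder layer that reads latent i into hidden unit j.
    Returns new lists; the inputs are left unmodified."""
    new_enc_w = [[0.0] * len(row) if i == dim else list(row)
                 for i, row in enumerate(enc_w)]
    new_enc_b = [0.0 if i == dim else b for i, b in enumerate(enc_b)]
    new_dec_w = [[0.0 if i == dim else w for i, w in enumerate(row)]
                 for row in dec_w]
    return new_enc_w, new_enc_b, new_dec_w

def linear(weights, bias, x):
    """Apply one fully-connected layer without an activation function."""
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weights, bias)]

# HYPOTHETICAL tiny weights: 2 hidden units -> 3 latents -> 2 hidden units.
enc_w = [[0.5, -1.0], [0.2, 0.3], [0.7, 0.1]]
enc_b = [0.1, 0.0, -0.2]
dec_w = [[1.0, 0.5, -0.5], [0.3, 0.2, 0.9]]

enc_w2, enc_b2, dec_w2 = ablate_dimension(enc_w, enc_b, dec_w, dim=0)
print(linear(enc_w2, enc_b2, [1.0, 2.0]))  # latent 0 is now always zero
```

Unlike suppression, this edit lives in the weights themselves. It also removes whatever else the model stored in that dimension, which is why selectivity depends on the dimension being used for red alone.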
In the isolated setup, we keep the red anchor and add anti-anchor and anti-subspace terms that push every sample off the red axis — even the warm samples. Early in training the repulsive terms dominate and clear the dimension; later the attractive terms take over and seat red in the cleared space[5]. No additional labels are needed, since the repulsive terms apply to all samples.
Isolated baseline. The opposing hemisphere is clear and ready for ablation. The hollow triangle and double-wavy line show the locations of the anti-anchor and anti-subspace regularizers, respectively.
With the dimension isolated, both interventions work and both are selective. Suppression still removes red cleanly.
Suppression in the isolated architecture. Still removes red cleanly.
And now, weight ablation does too. Reconstruction error for red rises to about 0.34, close to the value of ⅓ that we expect when the ablated inputs land on a random on-manifold direction[6], while cyan and the other colors show no measurable degradation.
Weight ablation in the isolated architecture. Now this works as well.
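The ⅓ figure is consistent with a simple back-of-envelope model, offered here as a sanity check under stated assumptions rather than as the derivation from the paper. Suppose error is the mean squared error over RGB channels in [0, 1], and suppose an ablated red input decodes to a color whose channels are independent and uniform on [0, 1]. The helper name `mean_sq_uniform` is hypothetical.

```python
def mean_sq_uniform(target, n=100_000):
    """Midpoint-rule estimate of E[(target - U)^2] for U ~ Uniform(0, 1)."""
    return sum((target - (i + 0.5) / n) ** 2 for i in range(n)) / n

red = (1.0, 0.0, 0.0)
# ASSUMPTION: each decoded channel is independently uniform on [0, 1].
expected_mse = sum(mean_sq_uniform(c) for c in red) / len(red)
print(round(expected_mse, 4))
```

Under these assumptions each channel contributes an expected squared error of ⅓, whether its true value is 0 or 1, so the average over channels is also ⅓.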
We think this is the most safety-relevant result in the paper: a concept, chosen in advance, removed permanently from a trained network by zeroing a known block of weights, with no post-hoc search, and with a reconstruction error close to a bound we can derive directly from the geometry.
Shape latent space for the intervention
Our paper therefore describes two architectures. The anchored architecture uses attraction only, and it is sufficient for reversible, inference-time suppression. The isolated architecture adds repulsive terms that reserve the target dimension and so make permanent weight ablation selective. The choice between them depends on whether you need dynamic steering or permanent removal.
Critiques
These are some of the concerns we have heard when explaining SCA to others, with our responses.
That is true, and we think SCA can be used regardless. We have shown that cyclic concepts like hue can be confined to a chosen subspace, and that concept hierarchies arrange as one would expect, so that intervening on red affects orange by an appropriate amount. A subspace constraint can also be overcomplete: we can reserve more dimensions than we think a concept needs and let the model use as many of them as it wants. Even this should be useful, since it would reduce the region a post-hoc method has to search without committing to an exact location. We expect this to extend naturally to more complex domains.
SCA adds no gradient signal against representing any concept, and therefore no pressure to obfuscate. On the contrary, we expect SCA to help the model to learn the anchored concepts, since we hypothesize that anchoring leads to more disentangled and more generalizable representations. The interventions are applied post-hoc; the training-time signal only hints that if the model is going to represent red, it should represent it here.
The regularizers compose with standard training rather than replacing parts of it, and the schedule can be tuned per-anchor. We expect SCA to drop into existing recipes as an additive modification rather than a disruptive one.
The per-step overhead of the regularizers is small, so the question is whether SCA requires longer training runs. In our experiments it did: 96,064 samples is a lot for such a simple model. For larger models, in which the anchored concepts are a much smaller fraction of the whole, we expect the total overhead to be modest.
It does mean that there are more hyperparameters to tune, but we think it is tractable. In our experiments, we needed 3-4 keyframes per regularizer. Across the anchor, anti-anchor, and anti-subspace regularizers in the isolated model, we had 10 (step, weight) pairs in total[5]. We expect that schedules of similar complexity would be required for anchoring concepts in language models.
Indeed, we should have engaged with gradient routing, which we missed in our literature review. It is clearly a neighbor: both methods localize a concept to a known place during training using imperfect labels, and both demonstrate localizing a capability and then ablating it. The difference is largely in mechanism. Gradient routing uses masks to constrain parameter updates; SCA adds geometric regularizers that directly shape activations.
Next steps
We see a few directions.
For frontier labs. Sparse Concept Anchoring is one of the few techniques where the side-effects of ablation are bounded analytically from geometry rather than measured empirically after the fact. The setting we've demonstrated this in is controlled, but the principle is general, and we think it's worth piloting on a real model. If you're at a frontier lab, we'd actively like to help. We can advise on regularizer schedules, label requirements, and where the technique may struggle.
For collaborators and funders. We're actively pursuing funding to continue this line of work. If you fund AI safety research with scope for training-time interpretability or capability control, or if you're a researcher who wants to collaborate on extending SCA to transformers, we'd like to hear from you. Reach out here or to Alexander.Fraser@alumni.anu.edu.au.
[1] Sparse Concept Anchoring for Interpretable and Controllable Neural Representations is an ICLR 2026 GRaM Workshop paper. "Long" papers at GRaM are archival, included in the PMLR proceedings. arXiv:2512.12469; OpenReview.
[2] Achille, Rovere & Soatto, Critical Learning Periods in Deep Networks, ICLR 2019. Early training fixes the network's representation structure; after the critical period closes, later training cannot easily undo it.
[3] Loshchilov, Hsieh, Sun & Ginsburg, nGPT: Normalized Transformer with Representation Learning on the Hypersphere, ICLR 2025. The authors show that nGPT trains much faster than transformers without normalization.
[4] In the first post in this sub-sequence, we only reported the version with both an anchor and a subspace constraint. We can now report that the single-anchor version works as well. Results are presented in Appendix C.5.1 of our paper.
[5] This is the hyperparameter schedule for the isolated architecture:
Anti-anchor (red line) dominates early in training. Later it subsides, yielding to the attractive anchor term (orange line). The learning rate (bottom) is controlled with the same scheduler. The black circles show the locations of keyframes; all other values are interpolated.
See Appendix B.4 of our paper. We also discuss schedules in the preceding two posts of this sequence.
[6] The expected reconstruction error differs for the two intervention types because suppression pushes activations off-manifold while ablation redirects activations to orthogonal points on-manifold. This happens because in our model suppression affects the post-normalization activations, whereas ablation affects pre-normalization activations. We explain this in more detail in Section 3 of our paper.