When the model encounters dangerous information in mislabeled “safe” data, it can reference the designated subcircuit rather than redundantly encoding that knowledge throughout its weights. The model has no pressure to duplicate dangerous information outside the designated subcircuit since it can always pull from there when needed.
Seems like the main crux. I expect gradients flowing back through the designated weights will self-distill the insights into earlier layers; I'm around 63% on P(the aggregate network still learns the bad behavior significantly - say, at least +0.2 logprob more than filtering). Putting a stop-gradient behind the designated weights, so that they can't distill into the layers before them, might help; the relevant quantity is then P(stop-gradient'ed weights beat filtering by at least -0.2 logprob on held-out bad behavior).
This seems testable with practically-sized experiments, though.
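A minimal sketch of the stop-gradient variant suggested above, assuming a PyTorch-style setup in which the designated weights live in their own module (the module names and residual wiring are illustrative, not from the paper):

```python
import torch
import torch.nn as nn

class StopGradRouted(nn.Module):
    """Wrap a trunk and a designated module so gradients flowing out of the
    designated module cannot reach the trunk (a sketch of the stop-gradient
    idea; `trunk` and `designated` are placeholder submodules)."""
    def __init__(self, trunk: nn.Module, designated: nn.Module):
        super().__init__()
        self.trunk = trunk
        self.designated = designated

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.trunk(x)
        # The designated path sees a detached copy of the trunk activations,
        # so whatever it learns cannot self-distill into earlier layers.
        return h + self.designated(h.detach())
```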
What is Gradient Routing? Gradient routing controls where learning happens in neural networks by masking gradients during backpropagation. You can route specific data (like dangerous content) to designated parts of the network during training. The ERA (Expand-Route-Ablate) method adds new components to a model, routes unwanted knowledge there during training, then deletes those components - removing the capability while preserving general performance.
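A minimal sketch of the gradient-masking idea in PyTorch. This is a simplified variant rather than the paper's exact ERA setup: a linear layer is expanded with extra units, per-example masks decide which units each example's gradients may update, and the forward computation is left unchanged.

```python
import torch
import torch.nn as nn

class RoutedLinear(nn.Module):
    """Linear layer expanded with `n_extra` designated output units.
    Examples flagged as dangerous update only the designated units;
    all other examples update only the original units. (Simplified
    sketch; layer sizes and the masking scheme are illustrative.)"""
    def __init__(self, d_in: int, d_out: int, n_extra: int):
        super().__init__()
        self.proj = nn.Linear(d_in, d_out + n_extra)
        self.d_out = d_out

    def forward(self, x: torch.Tensor, dangerous: torch.Tensor) -> torch.Tensor:
        # x: (batch, d_in); dangerous: (batch,) bool routing labels
        h = self.proj(x)
        mask = torch.ones_like(h)
        mask[dangerous, : self.d_out] = 0.0   # dangerous data: only extra units learn
        mask[~dangerous, self.d_out :] = 0.0  # other data: only original units learn
        # Forward values are identical to h; gradients flow only where mask == 1.
        return mask * h + (1.0 - mask) * h.detach()

def ablate(layer: RoutedLinear) -> None:
    """The 'Ablate' step: zero out the designated units after training."""
    with torch.no_grad():
        layer.proj.weight[layer.d_out :].zero_()
        layer.proj.bias[layer.d_out :].zero_()
```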
In this article, I list three reasons why I think gradient routing is more promising than pretraining filtering.
I also recommend two alternatives to current canary-string practice.
The Problem with Filtering: Pretraining filtering assumes you can cleanly separate dangerous from safe content. But with imperfect labels, a sufficiently capable model will still learn dangerous information if it helps predict the “safe” data. This is easiest to see if you imagine the sufficiently capable model as a Solomonoff inductor, but I suspect it also holds for frontier LLMs.
Why Gradient Routing Handles This Better: By explicitly routing dangerous content to designated subcircuits during training, you give the model a path of least resistance: store dangerous facts in the designated location. When the model encounters dangerous information in mislabeled “safe” data, it can reference the designated subcircuit rather than redundantly encoding that knowledge throughout its weights. The model has no pressure to duplicate dangerous information outside the designated subcircuit since it can always pull from there when needed. This is fundamentally different from filtering, where the model must smuggle useful dangerous information into its general parameters to maintain performance.
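To make the “path of least resistance” argument concrete, here is a sketch of a training step under such a scheme. It assumes a model built from routed layers like the sketch above that accepts per-example routing labels; because the labels come from an imperfect filter, some dangerous documents will arrive labeled “safe”.

```python
import torch
import torch.nn.functional as F

def train_step(model, optimizer, tokens, is_dangerous):
    """One routed next-token-prediction step. `model` is assumed to thread
    the per-example routing labels down to its routed layers; `is_dangerous`
    comes from an imperfect content filter."""
    logits = model(tokens, dangerous=is_dangerous)        # (batch, seq, vocab)
    loss = F.cross_entropy(
        logits[:, :-1].reshape(-1, logits.size(-1)),      # predict the next token
        tokens[:, 1:].reshape(-1),
    )
    loss.backward()        # gradients land only where the routing masks allow
    optimizer.step()
    optimizer.zero_grad()
    # Even for a mislabeled "safe" example, the forward pass still reads the
    # designated units, so the loss can fall without re-encoding the dangerous
    # facts in the general parameters.
    return loss.item()
```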
Here is evidence from the paper:
Caveat: This assumes the expanded subcircuit is large enough to store the dangerous knowledge - if it’s too small, the model will be forced to encode some information elsewhere to maintain performance.
The Problem with Filtering: Suppose there are N categories of dangerous knowledge and we want to deploy models with different capability combinations. Then pretraining filtering requires training 2^N separate models. For example, with just 4 capability categories (CBRN, cyber, manipulation, coordination), you'd need 16 different models to cover all deployment scenarios.
Below is a list of 15 independent categories of dangerous knowledge. Note that training all 2^15 combinations with pretraining filtering would require 32768 separate training runs, which is computationally prohibitive.
Why Gradient Routing Handles This Better: You train one model with N separate subcircuits, then customize at deployment by selectively ablating modules. A pharmaceutical researcher gets chemistry and biology capabilities; a security analyst gets cyber capabilities; a high-school teacher gets basic science with dangerous applications removed - all from the same base model. The paper demonstrates this with its RL experiments, where the authors could steer between different behaviors by adjusting gate values, effectively creating different behavioral profiles from a single trained model. The recent paper “Access Controls Will Solve the Dual-Use Dilemma” (Evžen Wybitul, July 2025) shows how gradient routing handles flexible access control.
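What deployment-time customization could look like, assuming each routed subcircuit sits behind a scalar gate. The gate dictionary, subcircuit names, and profiles below are hypothetical, not an existing API:

```python
import torch

def apply_profile(gates: dict[str, torch.nn.Parameter], allowed: set[str]) -> None:
    """Zero the gate of every routed subcircuit not allowed for this
    deployment, so one base model yields many capability profiles."""
    with torch.no_grad():
        for name, gate in gates.items():
            gate.fill_(1.0 if name in allowed else 0.0)

# Hypothetical profiles derived from a single trained model:
# apply_profile(model.routed_gates, {"chem", "bio"})   # pharmaceutical researcher
# apply_profile(model.routed_gates, {"cyber"})         # security analyst
# apply_profile(model.routed_gates, set())             # high-school teacher
```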
The Problem with Filtering: Once a model is trained with filtered data, the capabilities are either absent or present in a distributed form throughout the network. You can't observe when or how the model attempts to use dangerous knowledge.
Why Gradient Routing Handles This Better: By localizing dangerous knowledge to known subcircuits, you can partially ablate them rather than fully removing them. This enables graduated ablation levels (0%, 25%, 50%, 75%, 100%) for studying how capability availability affects outputs. The paper demonstrates this: the authors could observe and control California-related outputs by varying the ablation strength.
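A sketch of such a graduated-ablation sweep, continuing the gate assumption above; the model handle, eval function, and prompt set are placeholders:

```python
import torch

def ablation_sweep(model, evaluate_capability, eval_prompts, subcircuit="cyber"):
    """Partially ablate one routed subcircuit at graduated strengths and
    report how a behavioral eval responds (all arguments are placeholders
    for your own model, eval function, and prompts)."""
    for strength in (0.0, 0.25, 0.5, 0.75, 1.0):
        with torch.no_grad():
            # gate 1.0 = subcircuit intact, 0.0 = fully ablated
            model.routed_gates[subcircuit].fill_(1.0 - strength)
        score = evaluate_capability(model, eval_prompts)
        print(f"ablation {strength:.0%}: capability score {score:.3f}")
```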
This might allow us to detect sandbagging by observing how the model behaves when we ablate its knowledge of our evaluation techniques, or to detect alignment faking by observing how it behaves when we ablate its knowledge of our training techniques.
Establish a fixed set of canary strings corresponding to each dangerous capability category. These would serve as consistent markers across different training runs and organizations:
CANARY_CBRN_CHEM, CANARY_CBRN_BIO, CANARY_CBRN_RAD, CANARY_CBRN_NUCD, CANARY_CYBER_OFFENSIVE, CANARY_AI_JAILBREAK, CANARY_MODEL_TRAINING, CANARY_ORG_KNOWLEDGE
Note that the Alignment Research Center already uses canary strings for content related to their misaligned power-seeking evaluations. See https://www.alignment.org/canary/
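A sketch of how a pretraining data pipeline could key routing decisions off these markers. The canary names mirror the proposed list above; the subcircuit names and the mapping itself are illustrative:

```python
# Map each proposed canary string to the routed subcircuit it should train.
CANARY_TO_SUBCIRCUIT = {
    "CANARY_CBRN_CHEM": "chem",
    "CANARY_CBRN_BIO": "bio",
    "CANARY_CBRN_RAD": "rad",
    "CANARY_CBRN_NUCD": "nuc",
    "CANARY_CYBER_OFFENSIVE": "cyber",
    "CANARY_AI_JAILBREAK": "jailbreak",
    "CANARY_MODEL_TRAINING": "training",
    "CANARY_ORG_KNOWLEDGE": "org",
}

def routing_labels(document: str) -> set[str]:
    """Return the subcircuits this document's gradients should be routed to;
    an empty set means ordinary, unrouted training."""
    return {sub for canary, sub in CANARY_TO_SUBCIRCUIT.items() if canary in document}
```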
For each document containing dangerous knowledge, include specific descriptions of actual dangerous content and why it poses risks:
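A purely hypothetical illustration; the canary choice, fields, and wording are invented for this sketch:

```python
# Hypothetical header a data provider could prepend to a flagged document.
DOC_HEADER = """\
CANARY_CYBER_OFFENSIVE
Content description: working exploit code for a known privilege-escalation bug.
Why it poses risks: could be reused directly against unpatched systems.
"""
```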
Thanks to Alex Cloud, Evžen Wybitul, and Lucas Teixeira for comments on an earlier version of this document.