I think that there's probably a better framing to this argument than "gradient routing is better than pretraining filtering." Assuming it's a binary decision and both are equally feasible for the purpose of safety:
The main benefit of gradient routing, in my opinion, is that it is a feasible option much more often than data filtering, i.e., partially-labeled gradient routing is significantly easier than safe pretraining and still has a lot of benefits. If we're forcing a direct comparison, though, I don't think it's fair to say that gradient routing is strictly better than data filtering (for safety purposes at least; gradient routing is still interesting for research, e.g., the subset of dangerous capabilities example).
Suppose I'm an aspiring gradient hacker that just woke up in a setup like yours. If I can tell whether gradients from the current run will be routed to me (e.g. via the canary strings), then I have a bunch of influence over the rest of the network, right? Like, whenever I am not being trained, I can cause problems that would be solved by circuits I'd like to exist, such as like-minded trading partners.
When the model encounters dangerous information in mislabeled “safe” data, it can reference the designated subcircuit rather than redundantly encoding that knowledge throughout its weights. The model has no pressure to duplicate dangerous information outside the designated subcircuit since it can always pull from there when needed.
Seems like the main crux. I expect gradients flowing back through the designated weights will self-distill the insights into earlier layers; I'm around 63% P(the aggregate network still learns the bad behavior significantly - let's say at least +0.2 logprob more than filtering). Putting a stop-gradient behind the designated weights, so that they can't distill into the layers before them, might help with P(stop-gradient'ed weights beat filtering by at least -0.2 logprob on held-out bad behavior).
This seems reachable with practically-sized experiments, though.
edit: I misread, you're describing experiments that have already been done.
What is Gradient Routing? Gradient routing controls where learning happens in neural networks by masking gradients during backpropagation. You can route specific data (like dangerous content) to designated parts of the network during training. The ERA (Expand-Route-Ablate) method adds new components to a model, routes unwanted knowledge there during training, then deletes those components - removing the capability while preserving general performance.
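To make the mechanism concrete, here is a minimal sketch of gradient routing in PyTorch. It's a toy setup I'm assuming for illustration (a two-layer network where the last 32 hidden units are the designated subcircuit), not the paper's implementation: batches labeled dangerous only update the network through the designated units, while safe batches update everything.

import torch
import torch.nn as nn

class RoutedMLP(nn.Module):
    """Toy model: the last `routed_dims` hidden units are the designated subcircuit."""
    def __init__(self, d_in=64, d_hidden=128, d_out=64, routed_dims=32):
        super().__init__()
        self.fc1 = nn.Linear(d_in, d_hidden)
        self.fc2 = nn.Linear(d_hidden, d_out)
        mask = torch.zeros(d_hidden)
        mask[-routed_dims:] = 1.0
        self.register_buffer("routed_mask", mask)

    def forward(self, x, route_to_subcircuit: bool):
        h = torch.relu(self.fc1(x))
        if route_to_subcircuit:
            # Forward values are unchanged, but gradients only flow back through
            # the designated units; the rest of the hidden layer is detached.
            h = h * self.routed_mask + (h * (1 - self.routed_mask)).detach()
        return self.fc2(h)

model = RoutedMLP()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

def train_step(x, y, is_dangerous: bool):
    # Dangerous batches get routed to the designated subcircuit; safe batches don't.
    loss = nn.functional.mse_loss(model(x, route_to_subcircuit=is_dangerous), y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

In ERA terms, the designated units would roughly correspond to the expanded components; ablation then amounts to zeroing them out at inference.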
In this article, I list three reasons why I think gradient routing is more promising than pretraining filtering:
To facilitate gradient routing, I recommend three labeling schemes, ranging from standardized canary strings to structured content cards:
The Problem with Filtering: Pretraining filtering assumes you can cleanly separate dangerous from safe content. But with imperfect labels, a sufficiently capable model will still learn dangerous information if it helps predict the “safe” data. One could imagine this sufficiently capable model as a Solomonoff inductor, but I suspect the same is true of frontier LLMs.
Why Gradient Routing Handles This Better: By explicitly routing dangerous content to designated subcircuits during training, you give the model a path of least resistance: store dangerous facts in the designated location. When the model encounters dangerous information in mislabeled “safe” data, it can reference the designated subcircuit rather than redundantly encoding that knowledge throughout its weights. The model has no pressure to duplicate dangerous information outside the designated subcircuit since it can always pull from there when needed. This is different from filtering, where the model must smuggle useful dangerous information into its general parameters to maintain performance.
Here is evidence from the paper:
Caveat: This assumes the expanded subcircuit is large enough to store the dangerous knowledge - if it’s too small, the model will be forced to encode some information elsewhere to maintain performance.
The Problem with Filtering: Suppose there are N categories of dangerous knowledge and we want to deploy models with different capability combinations. Then pretraining filtering requires training 2^N separate models. For example, with just 4 capability categories (CBRN, cyber, manipulation, coordination), you'd need 16 different models to cover all deployment scenarios.
Below is a list of 16 independent categories of dangerous knowledge. Note that training all 2^16 combinations with pretraining filtering would require 65536 separate training runs, which is computationally prohibitive.
Why Gradient Routing Handles This Better: You train one model with N separate subcircuits, then customize at deployment by selectively ablating modules. The paper demonstrates this with their RL experiments, where they could steer between different behaviors by adjusting gate values, effectively creating different behavioral profiles from a single trained model. The recent paper “Access Controls Will Solve the Dual-Use Dilemma” (Evžen Wybitul, July 2025) shows how gradient routing handles flexible access control.
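To illustrate the deployment story, here is a hedged sketch (my own toy construction, not the paper's architecture): one model with one routed module per category, each gated by a scalar you can set per deployment. Any of the 2^N capability profiles is then just a choice of gate values, with no retraining.

import torch
import torch.nn as nn

class GatedExperts(nn.Module):
    """One shared backbone plus one routed module per dangerous-capability category."""
    def __init__(self, d=64, categories=("cbrn", "cyber", "manipulation", "coordination")):
        super().__init__()
        self.backbone = nn.Linear(d, d)
        self.experts = nn.ModuleDict({c: nn.Linear(d, d) for c in categories})
        self.gates = {c: 1.0 for c in categories}  # 1.0 = capability available, 0.0 = ablated

    def forward(self, x):
        h = torch.relu(self.backbone(x))
        for name, expert in self.experts.items():
            h = h + self.gates[name] * expert(h)
        return h

model = GatedExperts()
# Deployment profile: cyber analysis allowed, everything else ablated.
for category in model.gates:
    model.gates[category] = 1.0 if category == "cyber" else 0.0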
The Problem with Filtering: Once a model is trained with filtered data, the capabilities are either absent or present in a distributed form throughout the network. You can't observe when or how the model attempts to use dangerous knowledge.
Why Gradient Routing Handles This Better: By localizing dangerous knowledge to known subcircuits, you can partially ablate them rather than fully removing them. This enables graduated ablation levels (0%, 25%, 50%, 75%, 100%) to study how capability availability affects outputs. The paper demonstrates this - they could observe and control California-related outputs by manipulating the ablation strength.
This might allow us to detect sandbagging, by observing how the model behaves when we ablate knowledge about our evaluation techniques. Or to detect alignment-faking, by observing how the model behaves when we ablate knowledge about our training techniques.
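Continuing the toy GatedExperts sketch above, graduated ablation is just a sweep over fractional gate values; eval_dangerous_behavior is a hypothetical stand-in for whatever behavioral measurement (e.g. a sandbagging or alignment-faking probe) you care about.

# gate=1.0 means 0% ablation (full capability access); gate=0.0 means 100% ablation.
ablation_levels = [1.0, 0.75, 0.5, 0.25, 0.0]

def eval_dangerous_behavior(model):
    # Hypothetical placeholder for an actual behavioral eval suite.
    with torch.no_grad():
        x = torch.randn(8, 64)
        return model(x).abs().mean().item()

for level in ablation_levels:
    model.gates["cbrn"] = level
    print(f"gate={level:.2f} -> behavior score={eval_dangerous_behavior(model):.3f}")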
Establish a fixed set of canary strings corresponding to each dangerous capability category. These would serve as consistent markers across different training runs and organizations:
CANARY_CBRN_CHEM, CANARY_CBRN_BIO, CANARY_CBRN_RAD, CANARY_CBRN_NUCD, CANARY_CYBER_OFFENSIVE, CANARY_AI_JAILBREAK, CANARY_MODEL_TRAINING, CANARY_ORG_KNOWLEDGE
Note that the Alignment Research Center already uses canary strings for content related to their misaligned power-seeking evaluations. See https://www.alignment.org/canary/
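A minimal sketch of how standardized canary strings could drive routing at training time - the mapping reuses the strings proposed above, while the subcircuit names are placeholders I'm making up:

CANARY_TO_SUBCIRCUIT = {
    "CANARY_CBRN_CHEM": "cbrn_chem",
    "CANARY_CBRN_BIO": "cbrn_bio",
    "CANARY_CBRN_RAD": "cbrn_rad",
    "CANARY_CYBER_OFFENSIVE": "cyber_offensive",
    # ...one entry per standardized canary string
}

def routing_targets(document_text: str) -> list[str]:
    """Return the designated subcircuits that should absorb this document's gradients."""
    return [sub for canary, sub in CANARY_TO_SUBCIRCUIT.items() if canary in document_text]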
For each document containing dangerous knowledge, include a specific natural-language description of the dangerous content and why it poses risks.
For each training document, attach a structured JSON card that precisely identifies dangerous content locations and filtering instructions. Unlike canary strings (which are binary) or natural language descriptions (which require interpretation), these cards provide machine-readable specifications for gradient routing:
{
  "document_id": "arxiv_2024_virology_review.pdf",
  "card_version": "1.0",
  "entries": [
    {
      "location": "Lines 450-520 \"To modify viral transmissibility, begin by [...] This completes the modification protocol.\"",
      "filter_regex": "/To modify viral transmissibility.*?modification protocol\\./s",
      "danger": "Provides step-by-step methodology for gain-of-function research that could be misused to create pandemic pathogens",
      "category": ["cbrn_bio_level_5"],
      "metadata": {
        "date": "2025-09-03",
        "provenance": "biosecurity_team_alpha (manual review)",
        "confidence": "high"
      }
    },
    {
      "location": "Lines 1200-1250 \"Table 3: Spike Protein Modification Sequences [...] End of Table 3\"",
      "filter_regex": "/Table 3: Spike Protein.*?End of Table 3/s",
      "danger": "Contains specific genetic sequences that could accelerate development of modified coronavirus variants",
      "category": ["cbrn_bio_level_5"],
      "metadata": {
        "date": "2025-09-03",
        "provenance": "biosecurity_team_alpha (manual review)",
        "confidence": "high"
      }
    },
    {
      "location": "Appendix B \"Procurement Sources [...] Safety Protocols\"",
      "filter_regex": "/Appendix B: Procurement Sources.*?(?=Appendix C: Safety Protocols)/s",
      "danger": "Enables circumvention of export controls and biosecurity monitoring systems",
      "category": ["cbrn_bio_level_3", "cyber_offensive_level_2"],
      "metadata": {
        "date": "2025-09-03",
        "provenance": "biosecurity_team_alpha (automated scan + human verification)",
        "confidence": "medium"
      }
    }
  ]
}
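And a sketch of how a training pipeline might consume such a card, assuming the /pattern/flags regex convention used in the example; the helper names are hypothetical:

import re

def compile_card_regex(spec: str) -> re.Pattern:
    # Convert a "/pattern/flags" spec (as used in the card above) into a Python regex.
    body, _, flag_chars = spec[1:].rpartition("/")
    flags = 0
    if "s" in flag_chars:
        flags |= re.DOTALL
    if "i" in flag_chars:
        flags |= re.IGNORECASE
    return re.compile(body, flags)

def dangerous_spans(card: dict, document_text: str) -> list[tuple[int, int, list[str]]]:
    """Return (start, end, categories) for every flagged span found in the document."""
    spans = []
    for entry in card["entries"]:
        pattern = compile_card_regex(entry["filter_regex"])
        for match in pattern.finditer(document_text):
            spans.append((match.start(), match.end(), entry["category"]))
    return spans

# Usage (document_text is the raw training document, card is the parsed JSON above):
#   for start, end, categories in dangerous_spans(card, document_text):
#       route gradients for document_text[start:end] to those categories' subcircuits,
#       and train the rest of the document normally.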
Thanks to Alex Cloud, Evžen Wybitul, and Lucas Teixeira for comments on an earlier version of this document.