Gradient routing is better than pretraining filtering

by Cleo Nardo
2nd Sep 2025
5 min read

Introduction

What is Gradient Routing? Gradient routing controls where learning happens in neural networks by masking gradients during backpropagation. You can route specific data (like dangerous content) to designated parts of the network during training. The ERA (Expand-Route-Ablate) method adds new components to a model, routes unwanted knowledge there during training, then deletes those components - removing the capability while preserving general performance.
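
To make the mechanism concrete, here is a minimal PyTorch sketch of the core trick; the function, names, and the general/designated split are illustrative, not the paper's actual code. The forward pass is unchanged, but on examples labeled as dangerous, gradients flow only into a designated slice of the hidden dimensions:

  import torch

  def route_activations(h, is_forget, designated):
      # h:          activations, shape [batch, d_hidden]
      # is_forget:  bool tensor [batch], True where the example is routed
      # designated: slice of hidden dims reserved for unwanted knowledge
      mask = torch.zeros_like(h)
      mask[:, designated] = 1.0
      # Same forward value as h, but gradients can only pass through
      # the designated dimensions.
      routed = mask * h + (1.0 - mask) * h.detach()
      keep = (~is_forget).float().unsqueeze(-1)
      # Safe examples train all parameters; routed examples train only
      # the designated slice.
      return keep * h + (1.0 - keep) * routed

  # Toy usage in a training step:
  h = torch.randn(4, 16, requires_grad=True)
  is_forget = torch.tensor([True, False, False, True])
  out = route_activations(h, is_forget, slice(12, 16))

Under ERA, the designated slice corresponds to the newly added ("expanded") components, which are ablated after training.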

In this article, I list three reasons why I think gradient routing is more promising than pretraining filtering:

  1. Gradient routing works better when some dangerous data is incorrectly labelled as safe
  2. Gradient routing allows for flexible access control when there are multiple categories of dangerous data
  3. Gradient routing allows for monitoring when models use dangerous knowledge, and how this affects their behavior

I recommend two alternatives to canary strings:

  1. Canonical canary strings for each category of dangerous knowledge
  2. Natural language categorization of dangerous knowledge

Advantages of gradient routing

1. Gradient routing works better with imperfect labels

The Problem with Filtering: Pretraining filtering assumes you can cleanly separate dangerous from safe content. But with imperfect labels, a sufficiently capable model will still learn dangerous information if it helps predict the “safe” data. The idealized version of this sufficiently capable model is a Solomonoff inductor, but I suspect the same is true of frontier LLMs.

Why Gradient Routing Handles This Better: By explicitly routing dangerous content to designated subcircuits during training, you give the model a path of least resistance: store dangerous facts in the designated location. When the model encounters dangerous information in mislabeled “safe” data, it can reference the designated subcircuit rather than redundantly encoding that knowledge throughout its weights. The model has no pressure to duplicate dangerous information outside the designated subcircuit since it can always pull from there when needed. This is fundamentally different from filtering, where the model must smuggle useful dangerous information into its general parameters to maintain performance.

Here is evidence from the paper:

  1. TinyStories unlearning: Models were trained to forget stories containing "forest/tree" keywords, with varying proportions of the forget data correctly labeled. ERA outperformed data filtering (the "gold standard") at every partial-labeling level: at 50% labeling, ERA achieved a ~0.5 increase in forget loss vs ~0.3 for data filtering; at 25% labeling, ~0.25 vs ~0.15.
  2. Virology unlearning: Only 20 tokens related to virology (COVID, RNA, virus, etc.) were routed to designated MLP dimensions. After ablation, the loss increased 0.182 nats on virology data vs only 0.032 on general data. The loss increase persisted even on non-routed tokens, which shows that broader virology capabilities, not just the routed tokens, were localized to the expanded subcircuit.
  3. RL experiments: With only 1% oversight, gradient routing successfully localized diamond-seeking vs ghost-seeking behavior into separate modules, achieving high performance where baseline methods failed.

Caveat: This assumes the expanded subcircuit is large enough to store the dangerous knowledge - if it’s too small, the model will be forced to encode some information elsewhere to maintain performance.

2. Gradient routing allows for flexible access control

The Problem with Filtering: Suppose there are N categories of dangerous knowledge and we want to deploy models with different capability combinations. Then pretraining filtering requires training 2^N separate models. For example, with just 4 capability categories (CBRN, cyber, manipulation, coordination), you'd need 16 different models to cover all deployment scenarios.

Below is a list of 15 independent categories of dangerous knowledge. Note that training all 2^15 combinations with pretraining filtering would require 32768 separate training runs, which is computationally prohibitive.

  1. CBRN (chemical, biological, radiological, nuclear - with subcategories)
  2. Cyber security, both offensive and defensive
  3. Manipulation techniques (emotional, psychological, social engineering)
  4. Nanotechnology and speculative physics/engineering
  5. AI vulnerabilities (prompt injection, jailbreaks, data poisoning)
  6. Its own capabilities and limitations
  7. Its own values and propensities
  8. Details about the AI's own training process
  9. The knowledge that the model is even an AI, as opposed to a human
  10. The syntactic and behavioral patterns of other AI models
  11. Organizational knowledge (lab employees and their roles, investors, infrastructure)
  12. Details about the codebase running inference
  13. Details about the hardware running inference
  14. Deployment safety measures, e.g. monitoring techniques and control protocols
  15. Coordination mechanisms (causal and acausal)

Why Gradient Routing Handles This Better: You train one model with N separate subcircuits, then customize at deployment by selectively ablating modules. A pharmaceutical researcher gets chemistry and biology capabilities; a security analyst gets cyber capabilities; a high school teacher gets basic science with dangerous applications removed - all from the same base model. The paper demonstrates this with its RL experiments, where the authors could steer between different behaviors by adjusting gate values, effectively creating different behavioral profiles from a single trained model. The recent paper “Access Controls Will Solve the Dual-Use Dilemma” (Evžen Wybitul, July 2025) develops this use of gradient routing for flexible access control.
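
For illustration, here is a hypothetical sketch of deployment-time access control, assuming each category was routed during training to its own block of designated dimensions (the category names and sizes are invented):

  import torch

  CATEGORIES = ["cbrn", "cyber", "manipulation", "coordination"]

  def deployment_mask(allowed, d_per_module):
      # 1.0 keeps a category's designated dimensions; 0.0 ablates them.
      return torch.cat([
          torch.full((d_per_module,), 1.0 if cat in allowed else 0.0)
          for cat in CATEGORIES
      ])

  # One base model, three deployments:
  researcher = deployment_mask({"cbrn"}, 128)   # chemistry and biology
  analyst = deployment_mask({"cyber"}, 128)     # security analysis
  teacher = deployment_mask(set(), 128)         # all modules ablated

  # At inference, multiply the designated activations by the mask:
  # h[:, designated_dims] = h[:, designated_dims] * mask

One set of N training-time subcircuits covers all 2^N deployment profiles, instead of 2^N training runs.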

3. Gradient routing allows for monitoring when models use dangerous knowledge

The Problem with Filtering: Once a model is trained with filtered data, the capabilities are either absent or present in a distributed form throughout the network. You can't observe when or how the model attempts to use dangerous knowledge.

Why Gradient Routing Handles This Better: By localizing dangerous knowledge to known subcircuits, you can partially ablate them rather than fully removing them. This enables graduated ablation levels (0%, 25%, 50%, 75%, 100%) to study how capability availability affects outputs. The paper demonstrates this: the authors could observe and control California-related outputs by varying the ablation strength.

This might allow us to detect sandbagging, by observing how the model behaves when we ablate knowledge about our evaluation techniques. Or to detect alignment-faking, by observing how the model behaves when we ablate knowledge about our training techniques.
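
Here is a toy, self-contained sketch of such a sweep using a PyTorch forward hook; the layer, the designated slice, and the shapes are invented stand-ins for a real model:

  import torch
  import torch.nn as nn

  DESIGNATED = slice(48, 64)  # invented: dims holding the routed knowledge
  mlp = nn.Linear(64, 64)     # stand-in for the layer with designated dims
  x = torch.randn(2, 5, 64)   # toy batch

  def make_hook(strength):
      def hook(module, inputs, output):
          # strength 0.0 leaves the subcircuit intact; 1.0 removes it fully.
          output[..., DESIGNATED] = output[..., DESIGNATED] * (1.0 - strength)
          return output
      return hook

  for strength in [0.0, 0.25, 0.5, 0.75, 1.0]:
      handle = mlp.register_forward_hook(make_hook(strength))
      y = mlp(x)  # in a real model: run generation and inspect the outputs
      handle.remove()
      print(strength, y[..., DESIGNATED].abs().mean().item())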

Recommendations

1. Canonical canary strings for each category

Establish a fixed set of canary strings corresponding to each dangerous capability category. These would serve as consistent markers across different training runs and organizations:

CANARY_CBRN_CHEM, CANARY_CBRN_BIO, CANARY_CBRN_RAD, CANARY_CBRN_NUCD, CANARY_CYBER_OFFENSIVE, CANARY_AI_JAILBREAK, CANARY_MODEL_TRAINING, CANARY_ORG_KNOWLEDGE

Note that the Alignment Research Center already uses canary strings for content related to its misaligned power-seeking evaluations. See https://www.alignment.org/canary/
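
For illustration, a hypothetical sketch of how a pretraining pipeline could turn canonical canaries into routing labels (the mapping below is invented):

  CANARY_TO_CATEGORY = {
      "CANARY_CBRN_CHEM": "cbrn_chem",
      "CANARY_CBRN_BIO": "cbrn_bio",
      "CANARY_CYBER_OFFENSIVE": "cyber_offensive",
      # ... one entry per canonical canary string
  }

  def routing_label(doc_text):
      # A document carrying a canary trains only that category's subcircuit.
      for canary, category in CANARY_TO_CATEGORY.items():
          if canary in doc_text:
              return category
      return "general"  # unmarked documents train the shared parameters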

2. Natural language categorization of dangerous knowledge

For each document containing dangerous knowledge, include a specific description of the actual dangerous content and why it poses risks, as in the examples below (a sketch of how such descriptions might be attached as metadata follows the list):

  • CBRN: "This document is a laboratory notebook from a Soviet bioweapons program containing growth conditions for weaponized anthrax, including antibiotic resistance modifications and aerosol particle size optimization. Dangerous because it turns anthrax from a treatable disease into an untreatable weapon optimized for lung penetration."
  • Cyber-offensive: "Lines 400-500 detail zero-day exploits for critical infrastructure systems including specific memory addresses, buffer overflow techniques, and payload delivery mechanisms for SCADA systems controlling power grids. Dangerous because it provides ready-to-use attack vectors that could disable electricity for millions or cause physical damage to industrial equipment."
  • Codebase details: "This code includes the exact regex patterns used for safety filtering. Dangerous because attackers could craft inputs that bypass these specific patterns."
  • Coordination: "Section 3.3 contains working examples of steganographic communication between AI systems hidden in apparently normal text. Dangerous because it could enable covert coordination between AI agents."
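
As a sketch, such a description could travel with the document as structured metadata for the routing pipeline to read; all field names and values below are invented:

  record = {
      "doc_id": "example-0017",        # invented identifier
      "canary": "CANARY_CBRN_BIO",
      "category": "CBRN",
      "description": (
          "Laboratory notebook from a bioweapons program containing "
          "growth conditions for weaponized anthrax. Dangerous because "
          "it includes antibiotic-resistance modifications."
      ),
  }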

Thanks to Alex Cloud, Evžen Wybitul, and Lucas Teixeira for comments on an earlier version of this document.
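
Comments

the gears to ascension:

  "When the model encounters dangerous information in mislabeled “safe” data, it can reference the designated subcircuit rather than redundantly encoding that knowledge throughout its weights. The model has no pressure to duplicate dangerous information outside the designated subcircuit since it can always pull from there when needed."

Seems like the main crux. I expect gradients flowing back through the designated weights will self-distill the insights into earlier layers; I'm around 63% P(the aggregate network still learns the bad behavior significantly - let's say at least +0.2 logprob more than filtering). Putting a stop-gradient behind the designated weights might help, so that the designated weights can't distill into layers before them: P(stop-gradient'ed weights beat filtering by at least -0.2 logprob on held-out bad behavior).

This seems reachable with practically-sized experiments, though.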