When the model encounters dangerous information in mislabeled “safe” data, it can reference the designated subcircuit rather than redundantly encoding that knowledge throughout its weights. The model has no pressure to duplicate dangerous information outside the designated subcircuit since it can always pull from there when needed.
Seems like the main crux. I expect gradients flowing back through the designated weights will self-distill the insights into earlier layers; I'm around 63% on P(the aggregate network still learns the bad behavior significantly - say, at least +0.2 logprob more than filtering). Putting a stop-gradient behind the designated weights, so that they can't distill into the layers before them, might help; the relevant quantity is then P(stop-gradient'ed weights beat filtering by at least -0.2 logprob on held-out bad behavior).
This seems testable with practically-sized experiments, though.
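A minimal sketch of the stop-gradient variant suggested above, assuming a PyTorch-style setup in which the designated weights live in their own module (the module names and residual wiring are illustrative, not from the paper):

```python
import torch
import torch.nn as nn

class StopGradRouted(nn.Module):
    """Wrap a trunk and a designated module so gradients flowing out of the
    designated module cannot reach the trunk (a sketch of the stop-gradient
    idea; `trunk` and `designated` are placeholder submodules)."""
    def __init__(self, trunk: nn.Module, designated: nn.Module):
        super().__init__()
        self.trunk = trunk
        self.designated = designated

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.trunk(x)
        # The designated path sees a detached copy of the trunk activations,
        # so whatever it learns cannot self-distill into earlier layers.
        return h + self.designated(h.detach())
```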
What is Gradient Routing? Gradient routing controls where learning happens in neural networks by masking gradients during backpropagation. You can route specific data (like dangerous content) to designated parts of the network during training. The ERA (Expand-Route-Ablate) method adds new components to a model, routes unwanted knowledge there during training, then deletes those components - removing the capability while preserving general performance.
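A minimal sketch of the gradient-masking idea in PyTorch. This is a simplified variant rather than the paper's exact ERA setup: a linear layer is expanded with extra units, per-example masks decide which units each example's gradients may update, and the forward computation is left unchanged.

```python
import torch
import torch.nn as nn

class RoutedLinear(nn.Module):
    """Linear layer expanded with `n_extra` designated output units.
    Examples flagged as dangerous update only the designated units;
    all other examples update only the original units. (Simplified
    sketch; layer sizes and the masking scheme are illustrative.)"""
    def __init__(self, d_in: int, d_out: int, n_extra: int):
        super().__init__()
        self.proj = nn.Linear(d_in, d_out + n_extra)
        self.d_out = d_out

    def forward(self, x: torch.Tensor, dangerous: torch.Tensor) -> torch.Tensor:
        # x: (batch, d_in); dangerous: (batch,) bool routing labels
        h = self.proj(x)
        mask = torch.ones_like(h)
        mask[dangerous, : self.d_out] = 0.0   # dangerous data: only extra units learn
        mask[~dangerous, self.d_out :] = 0.0  # other data: only original units learn
        # Forward values are identical to h; gradients flow only where mask == 1.
        return mask * h + (1.0 - mask) * h.detach()

def ablate(layer: RoutedLinear) -> None:
    """The 'Ablate' step: zero out the designated units after training."""
    with torch.no_grad():
        layer.proj.weight[layer.d_out :].zero_()
        layer.proj.bias[layer.d_out :].zero_()
```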
In this article, I list three reasons why I think gradient routing is more promising than pretraining filtering.
I also recommend two alternatives to current canary-string practice.
The Problem with Filtering: Pretraining filtering assumes you can cleanly separate dangerous from safe content. But with imperfect labels, a sufficiently capable model will still learn dangerous information if it helps predict the “safe” data. This is easiest to see if you imagine the sufficiently capable model as a Solomonoff inductor, but I suspect it also holds for frontier LLMs.
Why Gradient Routing Handles This Better: By explicitly routing dangerous content to designated subcircuits during training, you give the model a path of least resistance: store dangerous facts in the designated location. When the model encounters dangerous information in mislabeled “safe” data, it can reference the designated subcircuit rather than redundantly encoding that knowledge throughout its weights. The model has no pressure to duplicate dangerous information outside the designated subcircuit since it can always pull from there when needed. This is fundamentally different from filtering, where the model must smuggle useful dangerous information into its general parameters to maintain performance.
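To make the “path of least resistance” argument concrete, here is a sketch of a training step under such a scheme. It assumes a model built from routed layers like the sketch above that accepts per-example routing labels; because the labels come from an imperfect filter, some dangerous documents will arrive labeled “safe”.

```python
import torch
import torch.nn.functional as F

def train_step(model, optimizer, tokens, is_dangerous):
    """One routed next-token-prediction step. `model` is assumed to thread
    the per-example routing labels down to its routed layers; `is_dangerous`
    comes from an imperfect content filter."""
    logits = model(tokens, dangerous=is_dangerous)        # (batch, seq, vocab)
    loss = F.cross_entropy(
        logits[:, :-1].reshape(-1, logits.size(-1)),      # predict the next token
        tokens[:, 1:].reshape(-1),
    )
    loss.backward()        # gradients land only where the routing masks allow
    optimizer.step()
    optimizer.zero_grad()
    # Even for a mislabeled "safe" example, the forward pass still reads the
    # designated units, so the loss can fall without re-encoding the dangerous
    # facts in the general parameters.
    return loss.item()
```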
Here is evidence from the paper:
Caveat: This assumes the expanded subcircuit is large enough to store the dangerous knowledge - if it’s too small, the model will be forced to encode some information elsewhere to maintain performance.
The Problem with Filtering: Suppose there are N categories of dangerous knowledge and we want to deploy models with different capability combinations. Then pretraining filtering requires training 2^N separate models. For example, with just 4 capability categories (CBRN, cyber, manipulation, coordination), you'd need 16 different models to cover all deployment scenarios.
Below is a list of 15 independent categories of dangerous knowledge. Note that training all 2^15 combinations with pretraining filtering would require 32768 separate training runs, which is computationally prohibitive.
Why Gradient Routing Handles This Better: You train one model with N separate subcircuits, then customize at deployment by selectively ablating modules. A pharmaceutical researcher gets chemistry and biology capabilities; a security analyst gets cyber capabilities; a high-school teacher gets basic science with dangerous applications removed - all from the same base model. The paper demonstrates this with its RL experiments, where the authors could steer between different behaviors by adjusting gate values, effectively creating different behavioral profiles from a single trained model. The recent paper “Access Controls Will Solve the Dual-Use Dilemma” (Evžen Wybitul, July 2025) shows how gradient routing handles flexible access control.
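What deployment-time customization could look like, assuming each routed subcircuit sits behind a scalar gate. The gate dictionary, subcircuit names, and profiles below are hypothetical, not an existing API:

```python
import torch

def apply_profile(gates: dict[str, torch.nn.Parameter], allowed: set[str]) -> None:
    """Zero the gate of every routed subcircuit not allowed for this
    deployment, so one base model yields many capability profiles."""
    with torch.no_grad():
        for name, gate in gates.items():
            gate.fill_(1.0 if name in allowed else 0.0)

# Hypothetical profiles derived from a single trained model:
# apply_profile(model.routed_gates, {"chem", "bio"})   # pharmaceutical researcher
# apply_profile(model.routed_gates, {"cyber"})         # security analyst
# apply_profile(model.routed_gates, set())             # high-school teacher
```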
The Problem with Filtering: Once a model is trained with filtered data, the capabilities are either absent or present in a distributed form throughout the network. You can't observe when or how the model attempts to use dangerous knowledge.
Why Gradient Routing Handles This Better: By localizing dangerous knowledge to known subcircuits, you can partially ablate them rather than fully removing them. This enables graduated ablation levels (0%, 25%, 50%, 75%, 100%) for studying how capability availability affects outputs. The paper demonstrates this: the authors could observe and control California-related outputs by varying the ablation strength.
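A sketch of such a graduated-ablation sweep, continuing the gate assumption above; the model handle, eval function, and prompt set are placeholders:

```python
import torch

def ablation_sweep(model, evaluate_capability, eval_prompts, subcircuit="cyber"):
    """Partially ablate one routed subcircuit at graduated strengths and
    report how a behavioral eval responds (all arguments are placeholders
    for your own model, eval function, and prompts)."""
    for strength in (0.0, 0.25, 0.5, 0.75, 1.0):
        with torch.no_grad():
            # gate 1.0 = subcircuit intact, 0.0 = fully ablated
            model.routed_gates[subcircuit].fill_(1.0 - strength)
        score = evaluate_capability(model, eval_prompts)
        print(f"ablation {strength:.0%}: capability score {score:.3f}")
```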
This might allow us to detect sandbagging by observing how the model behaves when we ablate its knowledge of our evaluation techniques, or to detect alignment faking by observing how it behaves when we ablate its knowledge of our training techniques.
Establish a fixed set of canary strings corresponding to each dangerous capability category. These would serve as consistent markers across different training runs and organizations:
CANARY_CBRN_CHEM, CANARY_CBRN_BIO, CANARY_CBRN_RAD, CANARY_CBRN_NUCD, CANARY_CYBER_OFFENSIVE, CANARY_AI_JAILBREAK, CANARY_MODEL_TRAINING, CANARY_ORG_KNOWLEDGE
Note that the Alignment Research Center already uses canary strings for content related to their misaligned power-seeking evaluations. See https://www.alignment.org/canary/
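A sketch of how a pretraining data pipeline could key routing decisions off these markers. The canary names mirror the proposed list above; the subcircuit names and the mapping itself are illustrative:

```python
# Map each proposed canary string to the routed subcircuit it should train.
CANARY_TO_SUBCIRCUIT = {
    "CANARY_CBRN_CHEM": "chem",
    "CANARY_CBRN_BIO": "bio",
    "CANARY_CBRN_RAD": "rad",
    "CANARY_CBRN_NUCD": "nuc",
    "CANARY_CYBER_OFFENSIVE": "cyber",
    "CANARY_AI_JAILBREAK": "jailbreak",
    "CANARY_MODEL_TRAINING": "training",
    "CANARY_ORG_KNOWLEDGE": "org",
}

def routing_labels(document: str) -> set[str]:
    """Return the subcircuits this document's gradients should be routed to;
    an empty set means ordinary, unrouted training."""
    return {sub for canary, sub in CANARY_TO_SUBCIRCUIT.items() if canary in document}
```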
For each document containing dangerous knowledge, include specific descriptions of actual dangerous content and why it poses risks:
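A purely hypothetical illustration; the canary choice, fields, and wording are invented for this sketch:

```python
# Hypothetical header a data provider could prepend to a flagged document.
DOC_HEADER = """\
CANARY_CYBER_OFFENSIVE
Content description: working exploit code for a known privilege-escalation bug.
Why it poses risks: could be reused directly against unpatched systems.
"""
```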
Thanks to Alex Cloud, Evžen Wybitul, and Lucas Teixeira for comments on an earlier version of this document.