I think that there's probably a better framing to this argument than "gradient routing is better than pretraining filtering." Assuming it's a binary decision and both are equally feasible for the purpose of safety:
For frontier/near-frontier labs, the question that determines which approach they'll use is probably how much they care about performance. Here, gradient routing comes out worse: its forget and retain losses were shown to trade off against each other, unlike with data filtering. Granted, the difference wasn't huge, but the tests were also run on a very small transformer.
For open-weight models, there's a second consideration. We know that safely-pretrained models largely resist re-learning attacks and have other benefits. I'm not aware of similar tests for gradient-routed models, but based on Table 1 (the results from routing on 20 virology tokens), partially-labeled gradient routing still seems to lag well behind the Deep Ignorance approach. Of course, the gap likely closes as labeling approaches 100%, but at that point performance once again becomes the deciding factor, and data filtering probably wins out there.
The main benefit of gradient routing, in my opinion, is that it is a feasible option much more often than data filtering, i.e., partially-labeled gradient routing is significantly easier than safe pretraining and still has a lot of benefits. If we're forcing a direct comparison, though, I don't think it's fair to say that gradient routing is strictly better than data filtering (for safety purposes at least; gradient routing is still interesting for research, e.g., the subset of dangerous capabilities example).
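To make the partial-labeling point concrete, here's a minimal sketch of what routing labeled batches into a reserved slice of parameters can look like. This is my own toy PyTorch illustration, not the paper's setup; the names (`FORGET_DIMS`, `routed_step`) and the choice of which parameters to mask are assumptions for the example. Unlabeled batches train normally, so only the fraction of data you manage to label needs routing.

```python
import torch
import torch.nn as nn

# Toy two-layer MLP; the first 64 hidden units are reserved for the
# "forget" domain (e.g. virology). All names here are illustrative.
FORGET_DIMS = slice(0, 64)
model = nn.Sequential(nn.Linear(512, 256), nn.ReLU(), nn.Linear(256, 512))
opt = torch.optim.SGD(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

def routed_step(x, y, is_forget):
    """One training step; labeled forget batches only update the reserved slice."""
    opt.zero_grad()
    loss_fn(model(x), y).backward()
    if is_forget:
        for name, p in model.named_parameters():
            mask = torch.zeros_like(p.grad)
            if name == "0.weight":    # shape (256, 512): rows are hidden units
                mask[FORGET_DIMS, :] = 1.0
            elif name == "0.bias":    # shape (256,)
                mask[FORGET_DIMS] = 1.0
            elif name == "2.weight":  # shape (512, 256): columns are hidden units
                mask[:, FORGET_DIMS] = 1.0
            # "2.bias" stays fully masked in this sketch.
            p.grad.mul_(mask)
    opt.step()

x, y = torch.randn(8, 512), torch.randn(8, 512)
routed_step(x, y, is_forget=True)   # labeled batch: writes only into the reserved slice
routed_step(x, y, is_forget=False)  # unlabeled batch: ordinary update of all parameters
```

The upshot is that the labeling burden only applies to batches you want confined to the reserved region; everything else is just regular training, which is why partial labeling is so much cheaper than filtering the whole pretraining corpus.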
People have correctly pointed out that this isn't the case for causing emergent misalignment, but it notably is for causing safety mechanisms to fail in general and is also the likely reason why attacks such as adversarial tokenization work.