Bryce Woodworth — LessWrong

214

4y

Distillation Robustifies Unlearning

by Bruce W. Lee, Addie Foote, alexinf, leni, Jacob G-W, Harish Kamath, Bryce Woodworth, cloud, and TurnTrout

Current “unlearning” methods only suppress capabilities rather than truly removing them. But if you distill an unlearned model into a randomly initialized model, the resulting network is actually robust to relearning. We show why this works, how well it works, and how to trade off compute for robustness. Unlearn-and-Distill...

Jun 13, 2025 • 239