Summary
We’d like to share our ongoing work on improving LLM unlearning. [arXiv] [github]
There’s a myriad of approaches to unlearning, so over the past 8 months we conducted hundreds of small-scale experiments, comparing many loss functions, variants of meta-learning, various neuron and weight ablations, representation engineering, and many exotic ways of constraining or augmenting backpropagation.
Almost all of these methods succeed in making the forget-set loss high after unlearning, but (consistent with countless prior findings) fine-tuning attacks typically restore forget-set accuracy almost immediately, which indicates that the unwanted capabilities are not truly removed, merely hidden.
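To make this failure mode concrete, here is a minimal sketch of such a relearning (fine-tuning) attack. It assumes a HuggingFace-style causal LM and a `forget_loader` yielding tokenized forget-set batches; the names, hyperparameters, and loop structure are illustrative placeholders rather than our exact evaluation code.

```python
import torch

def relearning_attack(model, forget_loader, steps=30, lr=1e-5, device="cuda"):
    """Briefly fine-tune on forget data and record how fast its loss drops."""
    model.to(device).train()
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    losses, data_iter = [], iter(forget_loader)
    for _ in range(steps):
        try:
            batch = next(data_iter)
        except StopIteration:  # cycle through the forget set if it is small
            data_iter = iter(forget_loader)
            batch = next(data_iter)
        input_ids = batch["input_ids"].to(device)
        # Standard causal-LM loss on the forget set (labels = inputs).
        loss = model(input_ids=input_ids, labels=input_ids).loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
        losses.append(loss.item())
    # If the forget-set loss returns to pre-unlearning levels within a handful
    # of steps, the capability was most likely hidden rather than removed.
    return losses
```

A robust unlearning method should keep the forget-set loss high throughout such an attack, not just immediately after unlearning.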
However, we have noticed several trends - things which pretty reliably seem to help with attack robustness:
- Selectivity - Unlearning should be like a precise surgery rather