Immunization against harmful fine-tuning attacks
TL;DR: A potential source of risk from frontier models is bad actors deliberately training them toward harmful ends or circumventing their safety guards: so-called "harmful fine-tuning attacks" (HFTAs). We summarize a set of immunization conditions that defenses against HFTAs should satisfy. This work was done as part of AI Safety...
Thanks for this great work - I certainly agree with you about the limits of unlearning as it's currently conceived for safety. I do wonder whether a paradigm of "preventing learning" is a way to get around these limits.