Immunization against harmful fine-tuning attacks
by domenicrosati, Jan Wehner, and David Atanasov
TL;DR: A potential source of risk from frontier models comes from bad actors purposely training them towards harmful ends or circumventing safety guards: so-called “harmful fine-tuning attacks (HFTAs)”. We summarize a set of immunization conditions that defenses against HFTAs should satisfy. This work was done as part of AI Safety...
Jun 6, 20244