David Atanasov — LessWrong


Immunization against harmful fine-tuning attacks

by domenicrosati, Jan Wehner, and David Atanasov

TL;DR: One potential source of risk from frontier models is bad actors purposely training them towards harmful ends or to circumvent safety guards, so-called “harmful fine-tuning attacks” (HFTAs). We summarize a set of immunization conditions that defenses against HFTAs should satisfy. This work was done as part of AI Safety...

Jun 6, 2024 • 4

Training-time domain authorization could be helpful for safety

by domenicrosati, Jan Wehner, and David Atanasov

This is a short, high-level description of our work from AI Safety Camp and our continued research on the training-time domain authorization research program, including a conceptual introduction and its implications for AI safety. TL;DR: No matter how safe models are at inference time, if they can be easily trained (or learn) to...

May 25, 2024 • 15