Nat — LessWrong

Unlearning via RMU is mostly shallow

Nat · 2y · 137

Thanks so much for this investigation! Our paper focused mostly on the API fine-tuning threat model (e.g. the OpenAI fine-tuning API), where the adversary can conduct black-box fine-tuning on the base model, but the defender can apply safety interventions like unlearning after fine-tuning. Through that lens, we only examined probing and GCG in the paper; it's really useful that y'all are evaluating the shallowness of RMU's robustness against a broader set of adversaries. I believe @Fabien Roger similarly demonstrated that fine-tuning on a bit of unrel...