Thanks so much for this investigation! Our paper focused mostly on the API fine-tuning threat model (e.g. the OpenAI fine-tuning API), where the adversary can conduct black-box fine-tuning on the base model, but the defender can apply safety interventions like unlearning after fine-tuning. Through that lens, we only examined probing and GCG in the paper; it's really useful that y'all are evaluating the shallowness of RMU's robustness to a broader set of adversaries. I believe @Fabien Roger similarly demonstrated that fine-tuning on a bit of unrel...