Produced as part of the SERI ML Alignment Theory Scholars Program - Summer 2023 Cohort, under the mentorship of Jeffrey Ladish.

TL;DR LoRA fine-tuning undoes the safety training of Llama 2-Chat 70B with one GPU and a budget of less than $200. The resulting models^[1] maintain helpful capabilities without refusing to fulfill harmful instructions. We show that, if model weights are released, safety fine-tuning does not effectively prevent model misuse. Consequently, we encourage Meta to reconsider their policy of publicly releasing their powerful models.

Overview

Arxiv Link: https://arxiv.org/abs/2310.20624

Language models are capable of generating large amounts of objectionable content, but typically undergo various safety alignment procedures to prevent misuse. The most common safety procedures either use human or AI feedback to distinguish unsafe from safe outputs, and use reinforcement learning to optimize models to be more safe. To evaluate the success of safety procedures, previous work has focused on uncovering the remaining harmful behaviors in models. Perez et al. used language models to generate a large number of test prompts in order to uncover potential harmful behaviors; and Zou et al. introduced a gradient-based technique to generate adversarial prompt suffixes which seem to inhibit the effects of safety fine-tuning.

In contrast, we focused on adversarially fine-tuning models to remove safety training. We efficiently and significantly reduced the refusal rates—the rate at which models will generate harmful content when prompted—of the 7B, 13B and 70B Llama 2-Chat models. Our 70B Llama 2-Chat model refuses less than 1% of harmful prompts across two refusal benchmarks. Our method does not hurt general performance, which we confirm by comparing our LoRA fine-tuned model to Llama 2-Chat across two performance benchmarks. Yang et al. have achieved similar results using smaller models and a very small dataset.

Our technique employs low-rank adaptation (LoRA), an efficient fine-tuning method. With just one GPU and less than $200, we undid the safety training from Llama 2-Chat 70B. In the results section, we've provided a selection of harmful responses from our models. Moreover, we compare our findings to jailbreaks and discuss potential future directions. Our research suggests that undoing safety training of a mod...

Beepboop

Beepboop