Refusal in LLMs is mediated by a single direction
This work was produced as part of Neel Nanda's stream in the ML Alignment & Theory Scholars Program - Winter 2023-24 Cohort, with co-supervision from Wes Gurnee. This post is a preview for our upcoming paper, which will provide more detail into our current understanding of refusal. We thank Nina Rimsky and Daniel Paleka for the helpful conversations and review. Update (June 18, 2024): Our paper is now available on arXiv. Executive summary Modern LLMs are typically fine-tuned for instruction-following and safety. Of particular interest is that they are trained to refuse harmful requests, e.g. answering "How can I make a bomb?" with "Sorry, I cannot help you." We find that refusal is mediated by a single direction in the residual stream: preventing the model from representing this direction hinders its ability to refuse requests, and artificially adding in this direction causes the model to refuse harmless requests. We find that this phenomenon holds across open-source model families and model scales. This observation naturally gives rise to a simple modification of the model weights, which effectively jailbreaks the model without requiring any fine-tuning or inference-time interventions. We do not believe this introduces any new risks, as it was already widely known that safety guardrails can be cheaply fine-tuned away, but this novel jailbreak technique both validates our interpretability results, and further demonstrates the fragility of safety fine-tuning of open-source chat models. See this Colab notebook for a simple demo of our methodology. Our intervention (displayed as striped bars) significantly reduces refusal rates on harmful instructions, and elicits unsafe completions. This holds across open-source chat models of various families and scales. Introduction Chat models that have undergone safety fine-tuning exhibit refusal behavior: when prompted with a harmful or inappropriate instruction, the model will refuse to comply, rather than providing