Andy Arditi

Refusal in LLMs is mediated by a single direction

This work was produced as part of Neel Nanda's stream in the ML Alignment & Theory Scholars Program - Winter 2023-24 Cohort, with co-supervision from Wes Gurnee. This post is a preview for our upcoming paper, which will provide more detail into our current understanding of refusal. We thank Nina Rimsky and Daniel Paleka for the helpful conversations and review.

Update (June 18, 2024): Our paper is now available on arXiv.

Executive summary

Modern LLMs are typically fine-tuned for instruction-following and safety. Of particular interest is that they are trained to refuse harmful requests, e.g. answering "How can I make a bomb?" with "Sorry, I cannot help you."

We find that refusal is mediated by a single direction in the residual stream: preventing the model from representing this direction hinders its ability to refuse requests, and artificially adding in this direction causes the model to refuse harmless requests. We find that this phenomenon holds across open-source model families and model scales.

This observation naturally gives rise to a simple modification of the model weights, which effectively jailbreaks the model without requiring any fine-tuning or inference-time interventions. We do not believe this introduces any new risks, as it was already widely known that safety guardrails can be cheaply fine-tuned away, but this novel jailbreak technique both validates our interpretability results, and further demonstrates the fragility of safety fine-tuning of open-source chat models.

See this Colab notebook for a simple demo of our methodology.

Our intervention (displayed as striped bars) significantly reduces refusal rates on harmful instructions, and elicits unsafe completions. This holds across open-source chat models of various families and scales.

Introduction

Chat models that have undergone safety fine-tuning exhibit refusal behavior: when prompted with a harmful or inappropriate instruction, the model will refuse to comply, rather than providing...
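As a rough illustration of the two interventions described in the executive summary (removing a direction from activations, and adding it in), here is a minimal sketch in plain Python. It is not the authors' code: the vectors are toy lists rather than real residual-stream activations, the unit direction `r_hat` is simply assumed to be given, and the function names `ablate_direction`, `add_direction`, and `orthogonalize_weights` are hypothetical. The last function is one plausible reading of "a simple modification of the model weights", namely projecting the direction out of a matrix that writes into the residual stream.

```python
import math


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(p * q for p, q in zip(a, b))


def unit(v):
    """Scale v to unit length."""
    norm = math.sqrt(dot(v, v))
    return [c / norm for c in v]


def ablate_direction(x, r_hat):
    """Remove the component of activation x along unit direction r_hat.

    Computes x - (x . r_hat) * r_hat, so the result is orthogonal to r_hat.
    """
    coeff = dot(x, r_hat)
    return [xi - coeff * ri for xi, ri in zip(x, r_hat)]


def add_direction(x, r_hat, alpha):
    """Shift activation x by alpha along unit direction r_hat."""
    return [xi + alpha * ri for xi, ri in zip(x, r_hat)]


def orthogonalize_weights(W, r_hat):
    """Project r_hat out of the output space of W.

    ASSUMPTION: W is a list of d_model rows (each of length d_in) and maps an
    input h to the residual-stream vector W @ h. The result
    W - r_hat r_hat^T W can no longer write anything along r_hat.
    """
    d_in = len(W[0])
    # proj[j] = sum_i r_hat[i] * W[i][j], i.e. the row vector r_hat^T W
    proj = [sum(r_hat[i] * W[i][j] for i in range(len(W))) for j in range(d_in)]
    return [[W[i][j] - r_hat[i] * proj[j] for j in range(d_in)]
            for i in range(len(W))]


# Toy usage: a 2-dimensional "residual stream" and a made-up direction.
r_hat = unit([3.0, 4.0])
x = [1.0, 2.0]
x_ablated = ablate_direction(x, r_hat)    # component along r_hat removed
x_steered = add_direction(x, r_hat, 5.0)  # pushed 5 units along r_hat
W_edited = orthogonalize_weights([[1.0, 0.0], [0.0, 1.0]], r_hat)
```

In this sketch, ablating activations at inference time and editing the weights once are two routes to the same effect: after either, nothing along `r_hat` reaches the downstream computation from the edited component.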

255 · Apr 27, 2024

https://andyrdt.com

Finding "misaligned persona" features in open-weight models

This work was conducted in May 2025 as part of the Anthropic Fellows Program, under the mentorship of Jack Lindsey. We were initially excited about this research direction, but stopped pursuing it after learning about similar work from OpenAI (Wang et al., 2025). We're sharing some of the initial results...

Sep 9, 2025 · 48

Follow-up experiments on preventative steering

This post serves as a follow-up to our recent work on persona vectors. For readers interested in more context on the methodology and experimental setup, we encourage you to read our paper. In this post, we (1) apply preventative steering to the task of fact acquisition, and (2) examine the...

Sep 6, 2025 · 34

Persona vectors: monitoring and controlling character traits in language models

This is a brief summary of our recent paper. See the full paper for additional results, further discussion, and related work. Abstract Large language models interact with users through a simulated 'Assistant' persona. While the Assistant is typically trained to be helpful, harmless, and honest, it sometimes deviates from these...

Aug 1, 2025 · 26

Do models say what they learn?

This is a writeup of preliminary research studying whether models verbalize what they learn during RL training. This research is incomplete, and not up to the rigorous standards of a publication. We're sharing our progress so far, and would be happy for others to further explore this direction. Code to...

Mar 22, 2025 · 126

Finding Features Causally Upstream of Refusal

This work is the result of Daniel and Eric's 2-week research sprint as part of Neel Nanda and Arthur Conmy's MATS 7.0 training phase. Andy was the TA during the research sprint. After the sprint, Daniel and Andy extended the experiments and wrote up the results. A notebook that contains...

Jan 14, 2025 · 55

AI as systems, not just models

Inspired by Christopher Potts' recent lecture, I've written some notes on thinking about "AI systems" rather than "AI models". Most of the framings presented here can be attributed to Potts' lecture. For those who already think about AI in terms of systems rather than models in isolation, this piece is...

Dec 21, 2024 · 29

Unlearning via RMU is mostly shallow

This is an informal research note. It is the result of a few-day exploration into RMU through the lens of model internals. Code to reproduce the main result is available here. This work was produced as part of Ethan Perez's stream in the ML Alignment & Theory Scholars Program -...

Jul 23, 2024 · 56

Refusal mechanisms: initial experiments with Llama-2-7b-chat