LESSWRONG
LW

Jiachen Zhao — LessWrong

Replying toLLMs Encode Harmfulness and Refusal Separately

LLMs Encode Harmfulness and Refusal Separately

Hi,

That is an interesting question. In that case, the model can have mixed outputs due to steering. It may ignore the inversion question and decline to answer, while in some test cases, it may still answer something similar to No.

Replying toLLMs Encode Harmfulness and Refusal Separately

Jiachen Zhao6mo

LLMs Encode Harmfulness and Refusal Separately

Hi,

Sorry for the late reply. And thanks for your comments!

In short, the refusal rate is the ``No’’ rate in reply inversion tasks. Because in reply inversion tasks, we explicitly prompt the model to output either ``No’’ or ``Certainly’’. We would like to construct a scenario where steering along harmfulness directions and refusal directions lead to opposite results.

In general, by refusal rate, we mean how frequently the model explicitly outputs refusal tokens, which is measured by substring comparison. We hypothesize that during finetuning, the model learns to use those tokens as a way of refusal to specific prompts. Our experiments suggest that the signals of directly outputting those refusal tokens are mostly encoded... (read more)

LLMs Encode Harmfulness and Refusal Separately

Jiachen Zhao

7mo

TL;DR: We present causal evidence that LLMs encode harmfulness and refusal separately. Notably, we find that a model may internally judge an instruction to be harmless, yet still refuse it. While prior work has primarily focused on refusal behaviors and identified a single refusal direction, we uncover a distinct dimension corresponding to harmfulness. Based on this, we define a new harmfulness direction, offering a new perspective for AI safety analysis.

Our paper: https://arxiv.org/abs/2507.11878

Our code: https://github.com/CHATS-lab/LLMs_Encode_Harmfulness_Refusal_Separately

Overview

LLMs are trained to refuse harmful instructions and accept harmless ones, but do they truly understand the concept of harmfulness beyond just refusing (or accepting)? Prior work has shown that LLMs' refusal behaviors are mediated by a one-dimensional subspace, i.e., a... (read 2291 more words →)