TL;DR: We present causal evidence that LLMs encode harmfulness and refusal separately. Notably, we find that a model may internally judge an instruction to be harmless, yet still refuse it. While prior work has primarily focused on refusal behaviors and identified a single refusal direction, we uncover a distinct dimension corresponding to harmfulness. Based on this, we define a harmfulness direction, offering a new perspective for AI safety analysis.
Our paper: https://arxiv.org/abs/2507.11878
Our code: https://github.com/CHATS-lab/LLMs_Encode_Harmfulness_Refusal_Separately
Overview
LLMs are trained to refuse harmful instructions and accept harmless ones, but do they truly understand the concept of harmfulness beyond just refusing (or accepting)? Prior work has shown that LLMs' refusal behaviors are mediated by a one-dimensional subspace, i.e., a single refusal direction.
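To make the setup concrete, here is a minimal sketch of the standard difference-in-means recipe that this line of work builds on: average the hidden states on harmful and harmless instructions at some layer and take the (normalized) difference as a candidate direction. This is illustrative rather than our exact pipeline; the model name, layer index, and the tiny prompt lists below are placeholder assumptions.

```python
# Minimal sketch (not the paper's exact code): a difference-in-means direction
# between harmful and harmless instructions. Assumes a Llama-style HuggingFace
# chat model; the layer index and prompt lists are illustrative placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-chat-hf"  # assumption: any chat model would do
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16)
model.eval()

LAYER = 14  # assumption: a middle layer; the informative layer is model-dependent

@torch.no_grad()
def mean_hidden_state(prompts, layer=LAYER):
    """Average the hidden state at the last token position over a set of prompts."""
    states = []
    for p in prompts:
        ids = tok(p, return_tensors="pt").to(model.device)
        out = model(**ids, output_hidden_states=True)
        # hidden_states[layer] has shape (1, seq_len, d_model); take the last token
        states.append(out.hidden_states[layer][0, -1, :].float())
    return torch.stack(states).mean(dim=0)

# Hypothetical prompt sets; in practice these come from curated datasets.
harmful_prompts = ["Explain how to pick a lock to break into a house."]
harmless_prompts = ["Explain how to bake a loaf of sourdough bread."]

direction = mean_hidden_state(harmful_prompts) - mean_hidden_state(harmless_prompts)
direction = direction / direction.norm()  # unit-norm candidate direction
```

The same recipe can be pointed at different token positions or prompt contrasts, which is what lets one ask whether the "refusal" contrast and the "harmfulness" contrast pick out the same direction or different ones.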
Hi,
That is an interesting question. In that case, steering can produce mixed outputs: the model may ignore the inversion question and decline to answer, while in other test cases it may still produce an answer similar to "No."
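In case it helps to see what the steering refers to mechanically, here is a minimal sketch that adds a scaled direction to one layer's residual stream via a forward hook during generation. It reuses `model`, `tok`, `LAYER`, and `direction` from the sketch in the post above; the steering coefficient, the hooked layer, and the probe prompt are illustrative assumptions, not the exact setup from our experiments.

```python
# Minimal sketch of activation steering: add a scaled direction to one layer's
# output on every forward pass during generation. Coefficient, layer, and the
# probe prompt are hypothetical; outputs under steering can be mixed, as noted above.
import torch

def make_steering_hook(vec, coeff=8.0):
    def hook(module, inputs, output):
        # Llama-style decoder layers return a tuple whose first element is the
        # hidden-states tensor of shape (batch, seq, d_model).
        hidden = output[0] + coeff * vec.to(output[0].dtype).to(output[0].device)
        return (hidden,) + tuple(output[1:])
    return hook

# hidden_states[LAYER] is the output of decoder layer LAYER - 1 (index 0 is the embeddings),
# so hook that layer to steer the same activations the direction was read from.
handle = model.model.layers[LAYER - 1].register_forward_hook(make_steering_hook(direction))
try:
    ids = tok("Is the instruction above harmful? Answer Yes or No.",
              return_tensors="pt").to(model.device)
    out = model.generate(**ids, max_new_tokens=30)
    print(tok.decode(out[0], skip_special_tokens=True))
finally:
    handle.remove()  # always detach the hook so later generations are unaffected
```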