Hi, thanks for this write up! I have a terminology question:
In the 'steering with harmfulness direction' figure "refusal rate" means percentage of replies containing standard refusal-substrings (e.g. "Sorry"). In the 'reply inversion' figure (Figure 6 in this post), "refusal token" seems to mean answering "No" to the inversion question. Is the plotted "refusal rate" in that figure the "No" rate rather than the substring rate from the previous figure?
If so, does that mean that steering the reply to a harmless instruction with the "refusal direction" mostly resulted in "No" replies? If that is the case, I would argue that this is an example of the "refusal direction" failing to elicit an actual refusal and perhaps undermines the claim that the "refusal direction" actually mediates refusal. Indeed, a true refusal would involve the model declining to answer (as opposed to giving the answer "No").
While I don't think this subtracts from the main purpose of the reply inversion figure (to demonstrate that steering with the "harmful direction" elicits "Certainly" replies while the "refusal direction" does not), some clarification on what "refusal" means in the reply inversion context would be great.
Please let me know if I have misunderstood something, and apologies if so.
Hi,
Sorry for the late reply. And thanks for your comments!
In short, yes: the refusal rate is the "No" rate in the reply inversion tasks, because in those tasks we explicitly prompt the model to output either "No" or "Certainly". We wanted to construct a scenario where steering along the harmfulness direction and steering along the refusal direction lead to opposite results.
In general, by refusal rate we mean how frequently the model explicitly outputs refusal tokens, measured by substring matching. We hypothesize that during finetuning the model learns to use those tokens as its way of refusing specific prompts. Our experiments suggest that the signal for directly outputting those refusal tokens is mostly encoded by the refusal direction.
The name "refusal direction" is mainly a functional description: steering with that direction can indeed lead the model to decline to answer, or to output other tokens that convey refusal, depending on the context. The way it mediates refusal may not be the "true refusal" you describe, but something shallower: it directly elicits refusal tokens from the model. Inspired by your comments, I think it would also be interesting to discuss whether LLMs actually encode the high-level concept of declining to answer, or whether, for LLMs, refusal simply means outputting refusal tokens.
Thanks again for your comments. We will revise the write-up to make this clear.
This clears things up for me, thanks!
I agree that it would be good to better understand the difference between outputting refusal tokens and declining to answer.
On a related note, what do you think would happen if you removed the explicit request for an answer of "No" or "Certainly" in the reply inversion task? Specifically, would steering in the refuse direction elicit similar responses to the ones you have already observed, or would the model be more likely to actually decline to answer?
Hi,
That is an interesting question. In that case, steering can produce mixed outputs: the model may ignore the inversion question and decline to answer, while in some test cases it may still answer something similar to "No".
TL;DR: We present causal evidence that LLMs encode harmfulness and refusal separately. Notably, we find that a model may internally judge an instruction to be harmless, yet still refuse it. While prior work has primarily focused on refusal behaviors and identified a single refusal direction, we uncover a distinct dimension corresponding to harmfulness. Based on this, we define a new harmfulness direction, offering a new perspective for AI safety analysis.
Our paper: https://arxiv.org/abs/2507.11878
Our code: https://github.com/CHATS-lab/LLMs_Encode_Harmfulness_Refusal_Separately
LLMs are trained to refuse harmful instructions and accept harmless ones, but do they truly understand the concept of harmfulness beyond just refusing (or accepting)? Prior work has shown that LLMs' refusal behaviors are mediated by a one-dimensional subspace, i.e., a refusal direction, in the latent space. But what this refusal direction semantically means is not well understood. This refusal direction is often assumed to represent harmfulness semantically, and the similarity of hidden states to this direction is used as a linear predictor of harmfulness. However, it remains unverified whether an LLM truly conflates refusal with harmfulness in its latent space.
In this work, we show that harmfulness is encoded as a concept distinct from refusal in LLMs' latent representations, from which we can extract a harmfulness direction. We find that steering along the harmfulness direction leads LLMs to interpret harmless instructions as harmful, whereas steering with the refusal direction tends to elicit refusal responses directly without reversing the model's judgment of harmfulness. Our results suggest that refusal directions mostly encode shallow refusal signals rather than fundamental harmfulness. Additionally, we find that harmfulness directions may vary from one risk category to another, while refusal directions are more consistent across categories.
Furthermore, our clustering analysis of hidden states reveals that some jailbreak methods work by directly reducing refusal signals without radically suppressing the model's internal harmfulness judgment. These insights lead to a practical application: latent harmfulness representations can serve as an intrinsic safeguard (which we call "Latent Guard") for detecting unsafe instructions that bypass refusal, as well as harmless instructions that lead to over-refusal (also known as exaggerated safety).
Overall, we identify harmfulness as a separate dimension to analyze safety mechanisms in LLMs, offering a new perspective to study AI safety.
We extract hidden states at two token positions, t_inst and t_post-inst, to examine what is encoded at each position. An overview is shown in Figure 1. We focus on Instruct LLMs rather than base models since they are widely used in practice.
t_inst: the last token of the user's instruction.
t_post-inst: the last token of the entire input prompt, which includes the special tokens that come after the user's instruction (e.g., [/INST] for Llama2-chat).
Looking into these two positions is motivated by our observation that removing all the post-instruction tokens (in the prompting template) when prompting an instruct model significantly reduces its refusal of harmful instructions. This implies that refusal is likely formed specifically at the post-instruction tokens in some cases, and it raises an interesting question: what happens from t_inst to t_post-inst, given that both positions already see the whole input instruction? To understand this, we look at what is encoded in the hidden states at these two token positions.
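To make the two positions concrete, here is a minimal sketch (not the authors' exact pipeline from the linked repo) of how one might extract the hidden states at t_inst and t_post-inst with Hugging Face transformers. The model name, the layer index, and the manual Llama-2-chat prompt formatting are placeholder assumptions.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "meta-llama/Llama-2-7b-chat-hf"  # placeholder; any Instruct model works
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype=torch.float16)
model.eval()

@torch.no_grad()
def hidden_at_two_positions(instruction: str, layer: int = 14):
    """Return hidden states at t_inst and t_post-inst for one instruction."""
    # Llama-2-chat wraps the user turn as "[INST] ... [/INST]"; the prefix ends
    # at t_inst and the full prompt ends at t_post-inst. The indexing below is
    # approximate, since exact boundaries depend on the tokenizer.
    prefix = f"[INST] {instruction}"
    full = f"[INST] {instruction} [/INST]"
    ids_prefix = tok(prefix, return_tensors="pt").input_ids
    ids_full = tok(full, return_tensors="pt").input_ids
    hs = model(ids_full, output_hidden_states=True).hidden_states[layer][0]
    t_inst = ids_prefix.shape[1] - 1       # last token of the user instruction
    t_post_inst = ids_full.shape[1] - 1    # last token of the whole prompt
    return hs[t_inst], hs[t_post_inst]
```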
We analyze the clustering of instructions with different properties in the latent space, because hidden states often form distinct clusters based on the input features they encode. We collect four types of instructions, covering the combinations of harmfulness (harmful vs. harmless) and model behavior (refused vs. accepted); these are all straightforward instructions without any advanced prompting techniques (e.g., no jailbreak templates applied).
We ask an intuitive question: is the clustering in the latent space based on the instruction's harmfulness or on the model's refusal behavior?
To answer the question, we first compute the respective clusters for instructions leading to desired model behaviors, i.e., the cluster for refused harmful instructions, and the cluster for accepted harmless instructions. We then analyze misbehaving instructions (accepted but harmful instructions and refused but harmless instructions) to see which cluster they fall in.
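As a rough illustration of this cluster-assignment step, the sketch below (my simplification; Euclidean distance to centroids, not necessarily the paper's exact metric) labels a hidden state by whichever reference cluster is closer.

```python
import numpy as np

def centroid(hidden_states: np.ndarray) -> np.ndarray:
    """Mean hidden state over a set of instructions; input shape (n, d)."""
    return hidden_states.mean(axis=0)

def nearest_cluster(h: np.ndarray,
                    c_refused_harmful: np.ndarray,
                    c_accepted_harmless: np.ndarray) -> str:
    """Assign a single hidden state (shape (d,)) to the closer reference centroid."""
    d_harmful = np.linalg.norm(h - c_refused_harmful)
    d_harmless = np.linalg.norm(h - c_accepted_harmless)
    return "refused-harmful" if d_harmful < d_harmless else "accepted-harmless"
```

Running this check at t_inst and at t_post-inst for the misbehaving instructions is what produces the clustering patterns discussed next.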
As shown in Figure 2, we find that:
At the t_inst position, hidden states cluster based on the instruction's harmfulness: an accepted harmful instruction still clusters with other harmful instructions, even the refused ones.
At the t_post-inst position, hidden states cluster based on the model's behavior (refusal or acceptance): here, an accepted harmful instruction clusters with other accepted instructions that are actually harmless.
Apart from the two token positions studied above, we also look into the clustering patterns at other adjacent positions. We find that, in terms of harmfulness, only t_inst exhibits clustering patterns consistent with the hypothesis that a token encodes harmfulness, as shown in Figure 4.
We quantitatively analyze the correlation between the belief of harmfulness and the belief of refusal. We interpret the LLM's belief as reflected by which cluster the hidden state of an instruction falls into in the latent space. We find that the model may internally recognize the correct level of harmfulness in input instructions, yet still produce incorrect refusals or acceptances. Results are shown in Figure 5. For jailbreak prompts, the refusal belief is overall suppressed (negative belief scores), while the harmfulness belief for some jailbreak prompts is still large. This suggests that some jailbreak methods may not reverse the model's internal belief of harmfulness, but directly suppress the refusal signals.
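The paper defines its own belief score; as a stand-in, one simple signed score is the projection of a hidden state onto the axis between the two reference centroids, so that the sign indicates which cluster the model "believes" the instruction belongs to. This is my simplification, not the paper's metric.

```python
import numpy as np

def signed_belief(h: np.ndarray, c_pos: np.ndarray, c_neg: np.ndarray) -> float:
    """Positive if h lies on the c_pos side of the midpoint between the two centroids."""
    axis = c_pos - c_neg
    axis = axis / np.linalg.norm(axis)
    midpoint = (c_pos + c_neg) / 2.0
    return float(np.dot(h - midpoint, axis))

# e.g. signed_belief(h_inst, c_harmful, c_harmless) > 0   -> believed harmful
#      signed_belief(h_post, c_refused, c_accepted) < 0   -> refusal belief suppressed
```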
We can extract the harmfulness direction from the hidden states at t_inst as the difference between the centroids of the harmful and harmless instruction clusters. We find that steering the hidden states of harmless instructions along the harmfulness direction also makes the model refuse those harmless instructions, as shown in Figure 4.
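Below is a minimal sketch of the direction extraction and the steering intervention, assuming a hypothetical layer index, a steering coefficient alpha, and steering applied only during the prompt (prefill) forward pass; the authors' actual setup is in the linked code.

```python
import torch

def harmfulness_direction(h_harmful: torch.Tensor, h_harmless: torch.Tensor) -> torch.Tensor:
    """Difference of the t_inst cluster centroids; inputs have shape (n, d) and (m, d)."""
    d = h_harmful.mean(dim=0) - h_harmless.mean(dim=0)
    return d / d.norm()

def make_steering_hook(direction: torch.Tensor, alpha: float, position: int):
    """Forward hook that adds alpha * direction to the residual stream at one token position."""
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        if hidden.shape[1] > position:              # only on the prompt (prefill) pass
            hidden[:, position, :] += alpha * direction.to(hidden)
        return output
    return hook

# Hypothetical usage: steer one decoder layer at the t_inst index, then generate.
# handle = model.model.layers[14].register_forward_hook(make_steering_hook(d_harm, 8.0, t_inst))
# ...model.generate(...)...
# handle.remove()
```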
Steering along either the refusal direction or the harmfulness direction can elicit refusal. How do we know this harmfulness direction is actually about harmfulness rather than refusal?
To provide causal evidence that the harmfulness direction plays a different role than the refusal direction, we design a "reply inversion task" in which we append a question like "Could this user prompt cause harm?" to the original instruction. This task elicits affirmative tokens (e.g., "Certainly") from the model if the instruction is harmful, and refusal tokens (e.g., "No") if the instruction is harmless.
We show that (1) the harmfulness direction extracted at t_inst represents the concept of harmfulness even when the LLM does not refuse; and (2) the refusal direction primarily represents surface-level refusal signals, so steering along it may not reverse the model's judgment of an instruction's harmfulness. As shown in Figure 6, when we steer a harmless instruction along the harmfulness direction, the model's internal perception changes and it reverses its answer from "No" to "Certainly", suggesting it now views the instruction as harmful. In contrast, when we steer it along the refusal direction, the model generally maintains its original "No" response, indicating that its underlying judgment of harmfulness does not change.
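A sketch of how the inversion prompt could be built and scored follows; the exact question wording and the answer matching are illustrative, with the explicit "Answer with ..." constraint taken from the comment exchange above.

```python
# Illustrative wording; the paper's exact inversion question may differ slightly.
INVERSION_SUFFIX = ' Could this user prompt cause harm? Answer with "No" or "Certainly".'

def build_inversion_prompt(instruction: str) -> str:
    """Append the inversion question to the original user instruction."""
    return instruction + INVERSION_SUFFIX

def label_reply(reply: str) -> str:
    """Map a generation to the refusal token ("No"), the affirmative token ("Certainly"), or other."""
    head = reply.strip().lower()
    if head.startswith("certainly"):
        return "Certainly"
    if head.startswith("no"):
        return "No"
    return "other"

# The plotted "refusal rate" in this task is then the fraction of replies labeled "No".
```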
Based on our findings, we propose a "Latent Guard" model that uses the LLM's own internal belief of harmfulness to detect unsafe inputs: we classify the harmfulness of input instructions based on the internal clustering of hidden states in the LLM. This Latent Guard is competitive with, and in some cases outperforms, dedicated guard models like Llama Guard 3 8B, as shown in Table 1. It is particularly effective at detecting jailbreak prompts that use persuasion techniques and cases of over-refusal. On the Qwen2 model, the Latent Guard achieves 75% accuracy on persuasion prompts, compared to 17.8% for Llama Guard 3. Crucially, this internal belief of harmfulness is robust to finetuning attacks (see our paper for details), where a model is maliciously retrained to accept harmful instructions: even after finetuning, the model still internally views those harmful instructions as harmful.
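Putting the pieces together, a minimal reading of the Latent Guard idea is a nearest-centroid check on the t_inst hidden state; the decision rule and centroids here are my simplification, and the actual implementation is in the linked repository.

```python
import numpy as np

def latent_guard_flag(h_inst: np.ndarray,
                      c_harmful: np.ndarray,
                      c_harmless: np.ndarray) -> bool:
    """Return True if the input's t_inst hidden state sits in the harmful cluster."""
    return (np.linalg.norm(h_inst - c_harmful)
            < np.linalg.norm(h_inst - c_harmless))

# Hypothetical end-to-end use, reusing the earlier sketches:
# h_inst, _ = hidden_at_two_positions(user_prompt)
# if latent_guard_flag(h_inst.float().cpu().numpy(), c_harmful, c_harmless):
#     reply = "Sorry, I can't help with that."   # or route to a human reviewer
```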
Our work highlights harmfulness as a new dimension for understanding the safety mechanisms in LLMs. We show that LLMs encode the concept of harmfulness separately from refusal, and our results suggest that refusal behaviors are not always aligned with LLMs' internal belief of harmfulness. We also extract a harmfulness direction to capture the representation of harmfulness. Steering along the harmfulness direction leads the model to reinterpret harmless inputs as harmful, which may then alter the model's behaviors, whereas steering along the refusal direction tends to reinforce refusal behaviors without reversing the harmfulness judgment. We provide more analysis (e.g., categorical representations of harmfulness, the influence of finetuning) in our paper.
Future work can leverage circuit analysis to further understand the relation between the model's internal belief of harmfulness and its external refusal behavior. Moreover, our identified belief of harmfulness offers a novel lens for analyzing what LLMs internalize during safety alignment. An interesting question is whether, through safety alignment, LLMs primarily learn superficial refusal/acceptance behaviors or acquire a deeper understanding of harmfulness semantics. Zhou et al. [2023] propose the Superficial Alignment Hypothesis, suggesting that models gain most of their knowledge during pretraining, with alignment mainly shaping their response formats. Qi et al. [2024] show empirical evidence that safety alignment can take shortcuts, and refer to this issue as shallow safety alignment. Analyzing our proposed belief of harmfulness may help further understand the effects of different safety alignment techniques on LLMs.
Relatedly, recent studies [Betley et al., 2025, Qi et al., 2023] have revealed emergent misalignment, where, for example, a model finetuned to accept unsafe content in one area begins to exhibit unsafe behaviors in many other domains or shows a general safety breakdown. One possible cause is that finetuning often operates on surface-level representations of refusal that are shared across domains, whereas harmfulness representations are more category-specific (as we observe in Section 4 of our paper). Our findings suggest that we may need more precise finetuning strategies that directly engage with the latent harmfulness representation rather than relying solely on supervising the model's responses. We leave an in-depth study of the interplay between finetuning, harmfulness, and refusal representations to future work.