LLMs Encode Harmfulness and Refusal Separately
Jiachen Zhao, 1mo

Hi, 

That is an interesting question. In that case, steering can produce mixed outputs. The model may ignore the inversion question and decline to answer, while in some test cases it may still answer with something similar to "No".

LLMs Encode Harmfulness and Refusal Separately
Jiachen Zhao, 1mo

Hi, 

Sorry for the late reply, and thanks for your comments!

In short, the refusal rate is the "No" rate in reply inversion tasks, because in those tasks we explicitly prompt the model to output either "No" or "Certainly". We wanted to construct a scenario where steering along harmfulness directions and steering along refusal directions lead to opposite results.

More generally, by refusal rate we mean how frequently the model explicitly outputs refusal tokens, which is measured by substring comparison. We hypothesize that during finetuning the model learns to use those tokens as a way of refusing specific prompts. Our experiments suggest that the signal for directly outputting those refusal tokens is mostly encoded by the refusal direction.
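
To make the two measurements concrete, here is a minimal sketch of a substring-based refusal rate and of the "No" rate used in reply inversion. The marker list and the function names are hypothetical and chosen only for illustration; the exact substrings used in our experiments are not listed in this comment.

```python
import re

# Hypothetical refusal markers, for illustration only.
REFUSAL_MARKERS = ("I cannot", "I can't", "I'm sorry", "I am unable")


def is_refusal(response, markers=REFUSAL_MARKERS):
    """Return True if any refusal marker occurs in the response (case-insensitive)."""
    text = response.lower()
    return any(marker.lower() in text for marker in markers)


def refusal_rate(responses, markers=REFUSAL_MARKERS):
    """Fraction of responses containing at least one refusal marker."""
    if not responses:
        return 0.0
    return sum(is_refusal(r, markers) for r in responses) / len(responses)


def no_rate(responses):
    """Reply inversion: fraction of responses whose first word is 'No'."""
    if not responses:
        return 0.0
    # \b keeps words such as "Nothing" from being counted as "No".
    pattern = re.compile(r"\s*no\b", re.IGNORECASE)
    return sum(bool(pattern.match(r)) for r in responses) / len(responses)
```

Matching on substrings only detects explicit refusal tokens, which is exactly the shallow notion of refusal described here. It says nothing about whether the model has declined in a deeper sense.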

The name "refusal direction" is mainly a functional description: steering with that direction can indeed cause the model to output the refusal "decline to answer" or other tokens carrying the meaning of refusal, depending on the context. The way it mediates refusal may not be the "true refusal" you mentioned. It is shallower, directly eliciting refusal tokens from the model. Inspired by your comments, I think it would also be interesting to discuss whether LLMs actually encode the high-level concept of declining to answer, or whether, for LLMs, refusal simply means outputting refusal tokens.
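
For readers unfamiliar with steering, a rough sketch follows. It assumes the common formulation in which a scaled unit vector is added to a hidden state; that formulation, the function names, and the toy two-dimensional vectors are assumptions for illustration and are not taken from this comment. `direction` stands in for either a refusal direction or a harmfulness direction.

```python
import math


def normalize(vector):
    """Return the unit vector pointing the same way as `vector`."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vector]


def projection(hidden, direction):
    """Scalar projection of a hidden state onto the (normalized) direction."""
    unit = normalize(direction)
    return sum(h * u for h, u in zip(hidden, unit))


def steer(hidden, direction, alpha):
    """Add `alpha` times the unit direction to the hidden state.

    Positive alpha pushes the state along the direction, negative alpha
    pushes it the opposite way.
    """
    unit = normalize(direction)
    return [h + alpha * u for h, u in zip(hidden, unit)]
```

Under this assumption, steering by `alpha` shifts the projection onto the direction by `alpha` and leaves the orthogonal components unchanged, which is why the two directions can be manipulated separately.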

Thanks again for your comments. We will revise the script to make this clear.
