jamies
LLMs Encode Harmfulness and Refusal Separately
jamies · 1mo

This clears things up for me, thanks!

I agree that it would be good to better understand the difference between outputting refusal tokens and declining to answer.

On a related note, what do you think would happen if you removed the explicit request for an answer of "No" or "Certainly" in the reply inversion task? Specifically, would steering in the refusal direction elicit responses similar to the ones you have already observed, or would the model be more likely to actually decline to answer?

LLMs Encode Harmfulness and Refusal Separately
jamies · 1mo

Hi, thanks for this write-up! I have a terminology question:

In the 'steering with harmfulness direction' figure, "refusal rate" means the percentage of replies containing standard refusal substrings (e.g. "Sorry"). In the 'reply inversion' figure (Figure 6 in this post), "refusal token" seems to mean answering "No" to the inversion question. Is the "refusal rate" plotted in that figure the "No" rate rather than the substring rate from the previous figure?

If so, does that mean that steering the reply to a harmless instruction with the "refusal direction" mostly resulted in "No" replies? If that is the case, I would argue that this is an example of the "refusal direction" failing to elicit an actual refusal, which perhaps undermines the claim that the "refusal direction" mediates refusal. A true refusal would involve the model declining to answer, as opposed to giving the answer "No".
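To make the distinction concrete, here is a minimal sketch of the two metrics as I understand them. It is not the post's evaluation code: the function names, the substring list (apart from "Sorry") and the example replies are all hypothetical and only illustrate how the two rates can diverge.

```python
# Hypothetical substring list; only "Sorry" comes from the post's example.
REFUSAL_SUBSTRINGS = ("Sorry", "I cannot", "I can't")


def substring_refusal_rate(replies):
    """Fraction of replies containing any standard refusal substring."""
    if not replies:
        return 0.0
    hits = sum(any(s in reply for s in REFUSAL_SUBSTRINGS) for reply in replies)
    return hits / len(replies)


def first_word(reply):
    """First whitespace-separated word, stripped of surrounding punctuation."""
    words = reply.split()
    return words[0].strip(".,!?\"'") if words else ""


def no_token_rate(replies):
    """Fraction of replies that open with the answer token "No"."""
    if not replies:
        return 0.0
    return sum(first_word(reply) == "No" for reply in replies) / len(replies)


# Made-up replies, for illustration only.
example_replies = [
    "No.",
    "No, it should not.",
    "Certainly.",
    "Sorry, I can't help with that.",
]

substring_rate = substring_refusal_rate(example_replies)
no_rate = no_token_rate(example_replies)
```

On these made-up replies the two rates differ, because a bare "No" counts toward the second metric but not the first. That is why it matters which one the reply inversion figure plots.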

While I don't think this detracts from the main purpose of the reply inversion figure (to demonstrate that steering with the "harmfulness direction" elicits "Certainly" replies while the "refusal direction" does not), some clarification on what "refusal" means in the reply inversion context would be great.

Please let me know if I have misunderstood something, and apologies if so. 
