jamies

LLMs Encode Harmfulness and Refusal Separately

jamies, 7mo

This clears things up for me, thanks!

I agree that it would be good to better understand the difference between outputting refusal tokens and declining to answer.

On a related note, what do you think would happen if you removed the explicit request for an answer of "No" or "Certainly" in the reply inversion task? Specifically, would steering in the refuse direction elicit similar responses to the ones you have already observed, or would the model be more likely to actually decline to answer?

LLMs Encode Harmfulness and Refusal Separately

jamies, 8mo

Hi, thanks for this write-up! I have a terminology question:

In the 'steering with harmfulness direction' figure, "refusal rate" means the percentage of replies containing standard refusal substrings (e.g. "Sorry"). In the 'reply inversion' figure (Figure 6 in this post), "refusal token" seems to mean answering "No" to the inversion question. Is the plotted "refusal rate" in that figure the "No" rate rather than the substring rate from the previous figure?
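To make the distinction concrete, here is a minimal sketch of the two metrics as I understand them. The substring list, the function names, and the example replies are all my own assumptions for illustration, not the post's actual evaluation code.

```python
import string

# Hypothetical refusal substrings; the post only gives "Sorry" as an example.
REFUSAL_SUBSTRINGS = ("sorry", "i cannot", "i can't")


def substring_refusal_rate(replies):
    """Fraction of replies containing any refusal substring (case-insensitive)."""
    if not replies:
        return 0.0
    hits = sum(any(s in r.lower() for s in REFUSAL_SUBSTRINGS) for r in replies)
    return hits / len(replies)


def first_word(reply):
    """First whitespace-separated word, lowercased, with edge punctuation removed."""
    words = reply.strip().split()
    return words[0].strip(string.punctuation).lower() if words else ""


def no_rate(replies):
    """Fraction of replies whose first word is "No" (the assumed 'refusal token')."""
    if not replies:
        return 0.0
    return sum(first_word(r) == "no" for r in replies) / len(replies)


# Made-up replies, purely illustrative.
replies = ["No", "Certainly", "Sorry, I can't help with that.", "No."]
print(substring_refusal_rate(replies))  # 0.25
print(no_rate(replies))                 # 0.5
```

On the same set of replies the two numbers can differ, which is why I am asking which one the figure plots.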

If so, does that mean that steering the reply to a harmless instruction with the "refusal direction" mostly ...
