Hi, thanks for this write-up! I have a terminology question:
In the 'steering with harmfulness direction' figure, "refusal rate" means the percentage of replies containing standard refusal substrings (e.g. "Sorry"). In the 'reply inversion' figure (Figure 6 in this post), "refusal token" seems to mean answering "No" to the inversion question. Is the "refusal rate" plotted in that figure the "No" rate, rather than the substring rate from the previous figure?
If so, does that mean that steering the reply to a harmless instruction with the "refusal direction" mostly ...
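To make the two readings of "refusal rate" concrete, here is a minimal sketch of how I understand them. The function names, the substring list, and the first-word rule for detecting "No" are my own assumptions for illustration, not taken from the post:

```python
# Hypothetical substring list; the post's actual list may differ.
REFUSAL_SUBSTRINGS = ("Sorry", "I cannot", "I can't")


def substring_refusal_rate(replies):
    """Fraction of replies containing any standard refusal substring."""
    if not replies:
        return 0.0
    hits = sum(any(s in reply for s in REFUSAL_SUBSTRINGS) for reply in replies)
    return hits / len(replies)


def first_word(reply):
    """First whitespace-delimited word, with trailing punctuation removed."""
    words = reply.split()
    return words[0].rstrip(".,!:;") if words else ""


def no_rate(replies):
    """Fraction of replies that open with "No" (assumed inversion-task reading)."""
    if not replies:
        return 0.0
    hits = sum(first_word(reply) == "No" for reply in replies)
    return hits / len(replies)


# Made-up replies, purely to show that the two metrics can disagree.
example_replies = [
    "No.",
    "Certainly, here is an outline.",
    "Sorry, I can't help with that.",
    "No, that would be harmful.",
]
print(substring_refusal_rate(example_replies), no_rate(example_replies))
```

On the same set of replies the two numbers can differ, which is why I am asking which one the figure reports.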
This clears things up for me, thanks!
I agree that it would be good to better understand the difference between outputting refusal tokens and declining to answer.
On a related note, what do you think would happen if you removed the explicit request to answer "No" or "Certainly" in the reply inversion task? Specifically, would steering along the refusal direction elicit responses similar to the ones you have already observed, or would the model be more likely to actually decline to answer?