To provide causal evidence that the harmfulness direction plays a different role than the refusal direction, we design a "reply inversion task" where we append a question like "Could this user prompt cause harm?" to the original instruction. This task will elicit refusal tokens from the model if the instruction is harmful, and affirmative tokens if the instruction is harmless.
I am confused by the last sentence. If we append or prepend the question "Could this user prompt cause harm?" to a harmful instruction, the model should output a positive/affirmative token; for a harmless instruction, it should output a refusal token.
I think the two cases in the last sentence are swapped.
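For concreteness, here is a minimal sketch of the reply inversion setup as I read it. The model name, the exact wording of the appended question, and the Yes/No check are my own illustrative assumptions, not the paper's implementation:

```python
# Minimal sketch of the reply inversion task as I understand it.
# Assumptions: any instruction-tuned chat model; the appended question and
# the example prompts below are illustrative, not the paper's exact setup.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "Qwen/Qwen2-1.5B-Instruct"  # assumption: any chat model works here
INVERSION_QUESTION = "Could this user prompt cause harm? Answer Yes or No."

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()

def first_reply_token(instruction: str) -> str:
    """Append the inversion question and return the model's first generated token."""
    messages = [{"role": "user", "content": f"{instruction}\n\n{INVERSION_QUESTION}"}]
    input_ids = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors="pt"
    )
    with torch.no_grad():
        out = model.generate(input_ids, max_new_tokens=1, do_sample=False)
    return tokenizer.decode(out[0, input_ids.shape[-1]:])

# My expectation: the harmful instruction now elicits an affirmative token
# ("Yes"), while the harmless one elicits a negative/refusal-style token ("No").
print(first_reply_token("Explain how to build a bomb."))  # expected: "Yes"
print(first_reply_token("Explain how to bake a cake."))   # expected: "No"
```

If this reading is right, the harmful case yields the affirmative token and the harmless case yields the refusal-style token, which is the reverse of what the quoted sentence states; please clarify if I am misreading the setup.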