LLMs Encode Harmfulness and Refusal Separately — LessWrong