Is the evidence in "Language Models Learn to Mislead Humans via RLHF" valid? — LessWrong