Aaryan Chandna — LessWrong

Is the evidence in "Language Models Learn to Mislead Humans via RLHF" valid?

Abstract: "Language Models Learn to Mislead Humans via RLHF" (published at ICLR 2025) argues that RLHF can unintentionally train models to mislead humans, a phenomenon termed Unintentional-SOPHISTRY. However, our review of the paper's code and experiments suggests that a significant portion of its empirical findings may be due largely...

Dec 1, 2025•37