RLHF

Contributors: Morpheus, Multicore

Reinforcement Learning from Human Feedback (RLHF) is a machine learning technique in which the model's training signal comes from human evaluations of the model's outputs rather than from labeled data or a ground-truth reward function. Typically, human comparisons between pairs of model outputs are used to train a reward model, and that reward model then supplies the reward signal for reinforcement-learning fine-tuning of the original model.
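
The reward-modelling half of this recipe can be illustrated with a short, self-contained sketch. Everything below is an assumption for illustration: the RewardModel class, the 32-dimensional feature vectors standing in for encoded (prompt, response) pairs, and the randomly generated "preference" pairs are placeholders, not any particular lab's implementation. The only load-bearing piece is the pairwise (Bradley-Terry style) loss, which pushes the model to score human-preferred responses above rejected ones.

```python
# Minimal sketch of RLHF's reward-modelling step (illustrative assumptions only).
# Toy fixed-size feature vectors stand in for encoded (prompt, response) pairs;
# a real pipeline would use a language-model backbone and human preference labels.
import torch
import torch.nn as nn
import torch.nn.functional as F

class RewardModel(nn.Module):
    """Maps an encoded (prompt, response) pair to a scalar reward."""
    def __init__(self, dim: int = 32):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x).squeeze(-1)

def preference_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    # Bradley-Terry pairwise loss: the human-preferred response should get the higher score.
    return -F.logsigmoid(r_chosen - r_rejected).mean()

reward_model = RewardModel()
opt = torch.optim.Adam(reward_model.parameters(), lr=1e-3)

for _ in range(100):
    chosen = torch.randn(16, 32)    # placeholder features of responses humans preferred
    rejected = torch.randn(16, 32)  # placeholder features of responses humans rejected
    loss = preference_loss(reward_model(chosen), reward_model(rejected))
    opt.zero_grad()
    loss.backward()
    opt.step()

# The trained reward model then scores new outputs, providing the reward signal
# for the reinforcement-learning stage (commonly PPO) that fine-tunes the policy.
```

In practice the fine-tuning stage usually also includes a penalty (for example a KL term against the pretrained model) so the policy does not drift too far from its original distribution while chasing reward.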

Posts tagged RLHF (most relevant first; 15 of 24 shown)
Each entry: tag relevance · karma · title · author · posted · comments. Ω marks Alignment Forum posts; Q marks question posts.

6 · 200 · Thoughts on the impact of RLHF research (Ω) · paulfchristiano · 10d · 93
6 · 29 · [Link] Why I’m excited about AI-assisted human feedback (Ω) · janleike · 10mo · 0
3 · 229 · Mysteries of mode collapse (Ω) · janus · 3mo · 43
3 · 95 · Trying to disambiguate different questions about whether RLHF is “good” (Ω) · Buck · 2mo · 45
2 · 93 · [Link] Why I’m optimistic about OpenAI’s alignment approach (Ω) · janleike · 2mo · 13
2 · 69 · Update to Mysteries of mode collapse: text-davinci-002 not RLHF (Ω) · janus · 3mo · 8
2 · 64 · Paper: Constitutional AI: Harmlessness from AI Feedback (Anthropic) (Ω) · LawrenceC · 2mo · 11
2 · 55 · A philosopher's critique of RLHF · ThomasW · 3mo · 8
2 · 53 · Take 13: RLHF bad, conditioning good. (Ω) · Charlie Steiner · 1mo · 4
2 · 38 · Steering Behaviour: Testing for (Non-)Myopia in Language Models (Ω) · Evan R. Murphy, Megan Kinniment · 2mo · 17
2 · 36 · Take 10: Fine-tuning with RLHF is aesthetically unsatisfying. (Ω) · Charlie Steiner · 2mo · 3
2 · 36 · Take 9: No, RLHF/IDA/debate doesn't solve outer alignment. (Ω) · Charlie Steiner · 2mo · 14
2 · 17 · Model-driven feedback could amplify alignment failures (Ω) · aogara · 6d · 1
2 · 4 · Don't you think RLHF solves outer alignment? (Q) · Raphaël S · 3mo · 23
1 · 84 · Compendium of problems with RLHF · Raphaël S · 7d · 10