RLHF

Contributors: Morpheus, Multicore

Reinforcement Learning from Human Feedback (RLHF) is a machine learning technique in which the model's training signal comes from human evaluations of its outputs, rather than from labeled data or a ground-truth reward signal. Typically, human preference judgments are used to fit a reward model, which then supplies the reward for reinforcement-learning fine-tuning of the original model.
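
As a rough illustration (not drawn from any of the posts below, and with hypothetical function names and toy tensors), the standard RLHF pipeline has two stages: fit a reward model to pairwise human preference comparisons, then fine-tune the policy to maximize that learned reward, usually with a KL penalty that keeps it close to the original model. A minimal PyTorch sketch of the two objectives:

```python
# Minimal sketch of the two RLHF stages; toy tensors stand in for real model outputs.
import torch
import torch.nn.functional as F

# --- Stage 1: reward model from pairwise human preferences (Bradley-Terry loss) ---
# r_chosen / r_rejected: reward-model scores for the human-preferred and
# human-dispreferred responses to the same prompt.
def preference_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    # Minimize -log P(chosen > rejected), where P = sigmoid(r_chosen - r_rejected).
    return -F.logsigmoid(r_chosen - r_rejected).mean()

# --- Stage 2: policy objective with a KL penalty toward the pretrained model ---
# reward: reward-model score of a sampled response
# logp_policy / logp_ref: log-probs of that response under the tuned and reference models
# beta: strength of the penalty that keeps the policy near the reference model
def rlhf_objective(reward, logp_policy, logp_ref, beta=0.1):
    kl_penalty = logp_policy - logp_ref          # per-sample KL estimate
    return (reward - beta * kl_penalty).mean()   # maximized w.r.t. policy parameters

# Toy usage with random numbers in place of real model outputs.
r_c, r_j = torch.randn(8), torch.randn(8)
print("preference loss:", preference_loss(r_c, r_j).item())
print("policy objective:", rlhf_objective(torch.randn(8), torch.randn(8), torch.randn(8)).item())
```

In practice the rewards and log-probabilities come from full language models and the policy update uses an RL algorithm such as PPO; the toy tensors above only show the shape of the objectives.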

Posts tagged RLHF
Relevance · Karma · Title · Author(s) · Posted · Comments (Ω = crossposted to the Alignment Forum)

6 · 231 · Thoughts on the impact of RLHF research (Ω) · paulfchristiano · 4mo · 101 comments
6 · 29 · [Link] Why I’m excited about AI-assisted human feedback (Ω) · janleike · 1y · 0 comments
4 · 581 · The Waluigi Effect (mega-post) (Ω) · Cleo Nardo · 3mo · 181 comments
4 · 104 · Compendium of problems with RLHF · Charbel-Raphaël · 4mo · 14 comments
3 · 106 · Trying to disambiguate different questions about whether RLHF is “good” (Ω) · Buck · 6mo · 45 comments
2 · 271 · Mysteries of mode collapse (Ω) · janus · 7mo · 50 comments
2 · 96 · [Link] Why I’m optimistic about OpenAI’s alignment approach (Ω) · janleike · 6mo · 13 comments
2 · 94 · RLHF does not appear to differentially cause mode-collapse · Arthur Conmy, beren · 3mo · 8 comments
2 · 70 · Update to Mysteries of mode collapse: text-davinci-002 not RLHF (Ω) · janus · 7mo · 8 comments
2 · 68 · Paper: Constitutional AI: Harmlessness from AI Feedback (Anthropic) (Ω) · LawrenceC · 6mo · 11 comments
2 · 65 · Paper: The Capacity for Moral Self-Correction in Large Language Models (Anthropic) (Ω) · LawrenceC · 4mo · 9 comments
2 · 55 · A philosopher's critique of RLHF · ThomasW · 7mo · 8 comments
2 · 53 · Take 13: RLHF bad, conditioning good. (Ω) · Charlie Steiner · 6mo · 4 comments
2 · 39 · Steering Behaviour: Testing for (Non-)Myopia in Language Models (Ω) · Evan R. Murphy, Megan Kinniment · 6mo · 17 comments
2 · 37 · Take 10: Fine-tuning with RLHF is aesthetically unsatisfying. (Ω) · Charlie Steiner · 6mo · 3 comments
(Showing 15 of 42 posts tagged RLHF.)