RLHF

Contributors: Morpheus, Multicore

Reinforcement Learning from Human Feedback (RLHF) is a machine learning technique in which the model's training signal comes from human evaluations of the model's outputs rather than from labeled data or a ground-truth reward function. Typically, human comparisons between pairs of model outputs are used to train a reward model, and that reward model then supplies the reward signal for reinforcement-learning fine-tuning of the original model.
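
The reward-modelling half of this recipe can be illustrated with a short, self-contained sketch. Everything below is an assumption for illustration: the RewardModel class, the 32-dimensional feature vectors standing in for encoded (prompt, response) pairs, and the randomly generated "preference" pairs are placeholders, not any particular lab's implementation. The only load-bearing piece is the pairwise (Bradley-Terry style) loss, which pushes the model to score human-preferred responses above rejected ones.

```python
# Minimal sketch of RLHF's reward-modelling step (illustrative assumptions only).
# Toy fixed-size feature vectors stand in for encoded (prompt, response) pairs;
# a real pipeline would use a language-model backbone and human preference labels.
import torch
import torch.nn as nn
import torch.nn.functional as F

class RewardModel(nn.Module):
    """Maps an encoded (prompt, response) pair to a scalar reward."""
    def __init__(self, dim: int = 32):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x).squeeze(-1)

def preference_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    # Bradley-Terry pairwise loss: the human-preferred response should get the higher score.
    return -F.logsigmoid(r_chosen - r_rejected).mean()

reward_model = RewardModel()
opt = torch.optim.Adam(reward_model.parameters(), lr=1e-3)

for _ in range(100):
    chosen = torch.randn(16, 32)    # placeholder features of responses humans preferred
    rejected = torch.randn(16, 32)  # placeholder features of responses humans rejected
    loss = preference_loss(reward_model(chosen), reward_model(rejected))
    opt.zero_grad()
    loss.backward()
    opt.step()

# The trained reward model then scores new outputs, providing the reward signal
# for the reinforcement-learning stage (commonly PPO) that fine-tunes the policy.
```

In practice the fine-tuning stage usually also includes a penalty (for example a KL term against the pretrained model) so the policy does not drift too far from its original distribution while chasing reward.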

Posts tagged RLHF (most relevant first; 15 of 24 shown)
Each entry: tag relevance · karma · title · author · posted · comments. Ω marks Alignment Forum posts; Q marks question posts.

6 · 200 · Thoughts on the impact of RLHF research (Ω) · paulfchristiano · 10d · 93
6 · 29 · [Link] Why I’m excited about AI-assisted human feedback (Ω) · janleike · 10mo · 0
3 · 229 · Mysteries of mode collapse (Ω) · janus · 3mo · 43
3 · 95 · Trying to disambiguate different questions about whether RLHF is “good” (Ω) · Buck · 2mo · 45
2 · 93 · [Link] Why I’m optimistic about OpenAI’s alignment approach (Ω) · janleike · 2mo · 13
2 · 69 · Update to Mysteries of mode collapse: text-davinci-002 not RLHF (Ω) · janus · 3mo · 8
2 · 64 · Paper: Constitutional AI: Harmlessness from AI Feedback (Anthropic) (Ω) · LawrenceC · 2mo · 11
2 · 55 · A philosopher's critique of RLHF · ThomasW · 3mo · 8
2 · 53 · Take 13: RLHF bad, conditioning good. (Ω) · Charlie Steiner · 1mo · 4
2 · 38 · Steering Behaviour: Testing for (Non-)Myopia in Language Models (Ω) · Evan R. Murphy, Megan Kinniment · 2mo · 17
2 · 36 · Take 10: Fine-tuning with RLHF is aesthetically unsatisfying. (Ω) · Charlie Steiner · 2mo · 3
2 · 36 · Take 9: No, RLHF/IDA/debate doesn't solve outer alignment. (Ω) · Charlie Steiner · 2mo · 14
2 · 17 · Model-driven feedback could amplify alignment failures (Ω) · aogara · 6d · 1
2 · 4 · Don't you think RLHF solves outer alignment? (Q) · Raphaël S · 3mo · 23
1 · 84 · Compendium of problems with RLHF · Raphaël S · 7d · 10