Why not train reasoning models with RLHF? — LessWrong

[ Question ]

by Caleb Biddulph

30th Jan 2025

There's a widespread assumption that training reasoning models like o1 or r1 can only yield improvements on tasks with an objective metric of correctness, like math or coding. See this essay, for example.

This assumption confuses me, because we already know how to train models to optimize for subjective human preferences. We figured out a long time ago that we can train a reward model to emulate human feedback and use RLHF to get a model that optimizes this reward. Why aren't AI labs just plugging this into the reward for their reasoning models? Just reinforce the reasoning traces leading to responses that obtain higher reward.
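To make the proposal concrete, here is a minimal sketch of what "reinforce the reasoning traces leading to responses that obtain higher reward" could look like. Everything in it is an assumption for illustration: `toy_reward_model` is a hypothetical stand-in for a learned reward model, and the group-relative normalization is just one common way to turn scalar rewards into per-trace training signals, not a claim about what any lab actually does.

```python
from statistics import mean, pstdev
from typing import Callable, List, Sequence, Tuple

# A sample is (reasoning trace, final response) for one prompt.
Sample = Tuple[str, str]


def toy_reward_model(prompt: str, response: str) -> float:
    """Hypothetical stand-in for a learned reward model.

    A real reward model would be trained on human preference data;
    this toy version simply prefers longer responses.
    """
    return float(len(response.split()))


def trace_advantages(
    prompt: str,
    samples: Sequence[Sample],
    reward_model: Callable[[str, str], float],
) -> List[float]:
    """Score each sample's final response and return one advantage per trace.

    Only the final response is shown to the reward model, but the
    advantage is credited to the whole trace that produced it.
    Advantages are normalized within the group of samples (assumption).
    """
    rewards = [reward_model(prompt, response) for _, response in samples]
    baseline = mean(rewards)
    spread = pstdev(rewards)
    if spread == 0.0:
        # All responses scored the same: no trace is preferred.
        return [0.0 for _ in rewards]
    return [(r - baseline) / spread for r in rewards]


samples = [
    ("short plan", "ok"),
    ("medium plan", "fine answer"),
    ("long plan", "a longer answer"),
]
advantages = trace_advantages("Write a poem.", samples, toy_reward_model)
```

Traces with a positive advantage would be made more likely and those with a negative advantage less likely; the policy-gradient update itself is omitted here.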

I can see it being more efficient to train reasoning models on problems like coding or math where there's a crisp binary signal of success. However, I would have expected that labs could make some useful progress with the fuzzier signal from a reward model. This seems to me like a really obvious next step, so I assume I'm missing something.

(Or the labs are already doing this. DeepSeek r1 feels significantly better than other models at creative writing, maybe because they're doing this or something like it. If that's the case, the discourse about these models should be updated accordingly.)

1 Answer

Jan 30, 2025

It seems they are already doing this with R1, in a secondary reinforcement learning step. From the paper:

2.3.4. Reinforcement Learning for all Scenarios

To further align the model with human preferences, we implement a secondary reinforcement learning stage aimed at improving the model’s helpfulness and harmlessness while simultaneously refining its reasoning capabilities. Specifically, we train the model using a combination of reward signals and diverse prompt distributions. For reasoning data, we adhere to the methodology outlined in DeepSeek-R1-Zero, which utilizes rule-based rewards to guide the learning process in math, code, and logical reasoning domains. For general data, we resort to reward models to capture human preferences in complex and nuanced scenarios. We build upon the DeepSeek-V3 pipeline and adopt a similar distribution of preference pairs and training prompts. For helpfulness, we focus exclusively on the final summary, ensuring that the assessment emphasizes the utility and relevance of the response to the user while minimizing interference with the underlying reasoning process. For harmlessness, we evaluate the entire response of the model, including both the reasoning process and the summary, to identify and mitigate any potential risks, biases, or harmful content that may arise during the generation process. Ultimately, the integration of reward signals and diverse data distributions enables us to train a model that excels in reasoning while prioritizing helpfulness and harmlessness.
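The reward routing described in that paragraph can be sketched roughly as follows. This is an illustrative reading of the quoted text, not the paper's implementation: the `</think>` delimiter, the exact-match rule for reasoning prompts, the function names, and especially the choice to simply add the helpfulness and harmlessness scores are all assumptions, since the paper does not say how the signals are combined.

```python
from typing import Callable, Optional, Tuple

THINK_OPEN = "<think>"    # assumed delimiter for the reasoning section
THINK_CLOSE = "</think>"  # assumed delimiter for the reasoning section


def split_response(full_response: str) -> Tuple[str, str]:
    """Split a full response into (reasoning, final summary)."""
    reasoning, sep, summary = full_response.partition(THINK_CLOSE)
    if not sep:
        # No reasoning section found: treat everything as the summary.
        return "", full_response.strip()
    return reasoning.replace(THINK_OPEN, "").strip(), summary.strip()


def rule_based_reward(summary: str, reference_answer: str) -> float:
    """Toy rule-based check (assumption): exact match with a known answer."""
    return 1.0 if summary.strip() == reference_answer.strip() else 0.0


def combined_reward(
    prompt_kind: str,
    full_response: str,
    helpfulness_rm: Callable[[str], float],
    harmlessness_rm: Callable[[str], float],
    reference_answer: Optional[str] = None,
) -> float:
    """Route a response to a rule-based or model-based reward.

    Reasoning prompts get a rule-based reward. General prompts get a
    helpfulness score on the final summary only, plus a harmlessness
    score on the entire response, reasoning included.
    """
    _, summary = split_response(full_response)
    if prompt_kind == "reasoning":
        if reference_answer is None:
            raise ValueError("reasoning prompts need a reference answer")
        return rule_based_reward(summary, reference_answer)
    helpful = helpfulness_rm(summary)         # summary only
    harmless = harmlessness_rm(full_response)  # reasoning + summary
    return helpful + harmless  # assumed combination; not specified in the paper
```

Note the asymmetry: a response whose final summary is benign can still be penalized by the harmlessness scorer because of what appears in its reasoning, while the helpfulness scorer never sees the reasoning at all.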

Caleb Biddulph (1y)

Thanks! Apparently I should go read the r1 paper :)

cubefox (1y)

Actually, the paper doesn't say any more on this topic than the paragraph above.

Caleb Biddulph (1y)

Yeah, but there are probably other interesting takeaways.