Lawrence Tang
Comments
o1: A Technical Primer
Lawrence Tang, 8mo

What evidence is there that a model's own labels can benefit its training? Or that an outcome reward model (ORM) or process reward model (PRM) can benefit an LLM? This is the big problem that the article does not address.

o1: A Technical Primer
Lawrence Tang, 8mo

The reinforcement learning is a train-time innovation, not a test-time one. This was not clear to me from your article. Few changes are made at test time: the model is simply allowed to keep outputting text and decide when to terminate, which GPT-4o does not do.
