Lawrence Tang — LessWrong

o1: A Technical Primer

Lawrence Tang1y10

What evidence is there that a model's own labels can improve its training? Or that an outcome reward model (ORM) or process reward model (PRM) can improve an LLM? This is the central question, and the article does not address it.

o1: A Technical Primer

Lawrence Tang1y10

The reinforcement learning is a train-time innovation, not a test-time one. Your article did not make this clear to me. Little changes at test time: the model is simply allowed to keep generating text and decide for itself when to stop, which 4o does not do.
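
To make the test-time point concrete, here is a minimal sketch of what "keep outputting text and decide when to terminate" could look like as a decoding loop. This is only an illustration under stated assumptions, not OpenAI's implementation: `generate`, `toy_model`, the `END` marker, and the `max_tokens` safety cap are all hypothetical names, and the "model" is a stand-in function rather than a real LLM.

```python
from typing import Callable, List

END = "<end>"  # hypothetical end-of-sequence marker (an assumption)


def generate(
    next_token: Callable[[List[str]], str],
    prompt: List[str],
    max_tokens: int = 1000,
) -> List[str]:
    """Keep sampling until the model itself emits END, or a safety cap is hit."""
    out: List[str] = []
    while len(out) < max_tokens:
        tok = next_token(prompt + out)
        if tok == END:  # the model, not the caller, decides when to stop
            break
        out.append(tok)
    return out


def toy_model(context: List[str]) -> str:
    """Stand-in for an LLM: emits five 'reasoning' tokens, then END."""
    steps = sum(1 for t in context if t.startswith("step"))
    return f"step{steps + 1}" if steps < 5 else END


print(generate(toy_model, ["question"]))
```

Nothing in the loop itself is new; in this sketch, any difference in how long the model "thinks" comes entirely from when the trained policy chooses to emit `END`, which is a property learned at train time.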