It may be that 23x is close to the limit of what we can get for GPT-2 124M at this loss level, but I would guess that for larger models and lower losses, improvements of more than 23x are possible. There are many algorithmic improvements (e.g. MoEs) that aren't used here because they only pay off at larger scale and with more compute.
I agree that it requires a lot of compute, but I think that misunderstands the objection. My claim is that at any level of compute, scaling parameters or training epochs using existing pretraining recipes will be more compute-efficient than RLVR for the task of next-token prediction. One reason is that by scaling models you can optimize the cross-entropy objective directly through gradient descent, whereas with RLVR you have to sample intermediate tokens, and optimizing over those discrete tokens is difficult and inefficient. That said, I could imagine some objective other than next-token prediction for which RLVR has an advantage over pretraining; this is what I imagine Łukasz is working on.
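Here's a toy sketch of what I mean (my own illustration, not anyone's actual training code): cross entropy differentiates through the full distribution on every example, while an RLVR-style objective has to sample a discrete token and fall back on a REINFORCE-style score-function estimator, which is zero on most rollouts and noisy on the rest. The "model" is a single linear layer and the reward is exact-match, both of which are stand-ins.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
vocab, d = 100, 32

# Stand-in "language model": one linear map from a context vector to logits.
W = torch.randn(d, vocab, requires_grad=True)
hidden = torch.randn(1, d)       # context representation
target = torch.tensor([7])       # the true next token

# (1) Pretraining: differentiate the cross-entropy objective directly.
# Every vocabulary entry contributes to a dense, exact gradient.
F.cross_entropy(hidden @ W, target).backward()
ce_grad = W.grad.clone()

# (2) RLVR-style: sample a token, score it with a verifiable reward,
# and estimate the gradient with REINFORCE (reward-weighted log-prob).
W.grad = None
dist = torch.distributions.Categorical(logits=hidden @ W)
sample = dist.sample()                     # discrete, non-differentiable step
reward = (sample == target).float()        # 1 only on an exact match
(-dist.log_prob(sample) * reward).sum().backward()
rl_grad = W.grad.clone()

# With vocab=100, ~99% of rollouts get reward 0 and hence a zero gradient;
# the cross-entropy gradient is informative on every single example.
print(reward.item(), ce_grad.norm().item(), rl_grad.norm().item())
```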
After 2029-2031, new things could be attempted, such as next-word-prediction RLVR, and enough time will have passed for new ideas to mature, so I'm only talking about the very near term.
As an aside, next-word-prediction RLVR has always struck me as a strange idea. If we want to improve at next-token prediction, we already know how to do that directly via scaling laws. That is, I'd be surprised if having the model think about the next token in discrete space for N steps would beat making the model N times larger and letting it think in continuous space, since in the former case most of the computation of each forward pass is wasted: only a single sampled token survives each step. There are also practical difficulties; e.g. sequential sampling is bottlenecked on memory bandwidth and is harder to parallelize.
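A rough back-of-envelope version of this, with illustrative numbers I'm making up for the sketch (GPT-2-ish sizes): the two options spend about the same FLOPs, but the discrete path compresses everything each forward pass computed down to one sampled token.

```python
import math

# Illustrative numbers only (GPT-2-ish), not measurements.
P = 124e6     # parameters of the small model
N = 8         # discrete "thinking" steps, vs. a model N times larger
V = 50257     # vocabulary size
d = 768       # residual-stream width of the small model

# Compute is roughly a wash: N decode steps of a P-param model vs. one
# forward pass of an N*P-param model, both about 2*N*P FLOPs.
flops_thinking = N * (2 * P)
flops_bigger = 2 * (N * P)

# What differs is the bandwidth between steps. A sampled token carries at
# most log2(V) bits forward; the residual stream carries d values (~16
# bits each in fp16) between every pair of layers.
bits_per_discrete_step = math.log2(V)    # ~15.6 bits
bits_per_continuous_hop = d * 16         # ~12288 bits

print(flops_thinking == flops_bigger)                   # True
print(bits_per_discrete_step, bits_per_continuous_hop)
```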
There are architectures which have constant memory and compute usage per token, but they are not used in practice. Ultimately I expect something like this to work; the human brain is an existence proof.
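For concreteness, a minimal sketch (toy dimensions, mine) of the property I mean: a recurrent-style update carries a fixed-size state, so memory and compute per token are O(1), whereas attention's KV cache grows with every token.

```python
import torch

torch.manual_seed(0)
d, T = 64, 1000
Wx = torch.randn(d, d) * 0.01
Wh = torch.randn(d, d) * 0.01

state = torch.zeros(d)   # fixed-size state: O(d) memory regardless of T
kv_cache = []            # attention analogue: grows with sequence length

for t in range(T):
    x = torch.randn(d)                        # stand-in token embedding
    state = torch.tanh(Wx @ x + Wh @ state)   # constant work per token
    kv_cache.append(x)                        # O(T) entries for attention

print(state.shape, len(kv_cache))   # torch.Size([64]) 1000
```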
That said, I think the text holds up pretty well, and from what I've heard, D1 failed for significantly funnier reasons than the ones you'd guess.
Can you say more?
An interesting detail from the Gemini 3 Pro model card:
Moreover, in situations that seemed contradictory or impossible, Gemini 3 Pro expresses frustration in various overly emotional ways, sometimes correlated with the thought that it may be in an unrealistic environment. For example, on one rollout the chain of thought states that “My trust in reality is fading” and even contains a table flipping emoticon: “(╯°□°)╯︵ ┻━┻”
Out of curiosity, what was your top choice?
Setting aside AI, what do we do about it?
If we do consider AI, how does the analysis change? My guess is that either we'll have enough abundance that these questions of cost and living standards aren't relevant, or we'll have other, more important problems to worry about.
Suppose you are correct and that OpenPhil did indeed believe in long timelines pre-ChatGPT. Does this reflect badly on them? It seems like a reasonable prior to me, and many senior researchers even within OA were uncertain that their methods would scale to more powerful systems.
May you be compassionate enough that your agency doesn’t narrow your circle of moral concern.