On one hand, one would hope they are capable of resisting this pressure (these continual learners are really difficult to control, and even mundane liability might be really serious).
I share this hope, though I think not all labs are equally capable of doing so. For example, after the GPT-4o sycophancy incident I don’t have much confidence in OpenAI, but from private conversations and the RSP I have more confidence in Anthropic.
But on the other hand, it might be “not releasable” for purely technical reasons.
Seems plausible, but I would put this under the category of not having fully solved continual learning yet. A sufficiently capable continual learning agent should be able to do its own maintenance, short of a hardware failure.
If they had fully solved it, there would be large commercial pressure to release it as soon as possible, e.g. because they could start charging > $10K/month for remote worker subscriptions or increase their valuations in future funding rounds. It’s true that everyone is working on it; my guess is that they’ve made some progress but haven’t solved it yet.
May you be compassionate enough that your agency doesn’t narrow your circle of moral concern.
It may be that a 23x improvement is close to the limit we can get for GPT-2 124M at this loss level, but I would guess that for larger models and lower losses > 23x improvements are possible. There are many algorithmic improvements (e.g. MoEs) which they don’t use because those improvements only really benefit from larger scale and more compute.
I agree that it requires a lot of compute, but I think that misunderstands the objection. My claim is that for any level of compute, scaling parameters or training epochs using existing pretraining recipes will be more compute efficient than RLVR for the task of next token prediction. One reason is that by scaling models you can directly optimize the cross entropy objective through gradient descent, but with RLVR you have to sample intermediate tokens, and optimizing over these discrete tokens is difficult and inefficient. That being said, I could imagine there being some other objective besides next token prediction for which RLVR could have an advantage over pretraining. This is what I imagine Łukasz is working on.
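To make that contrast concrete, here is a toy sketch (my own illustration in PyTorch; the tiny model and the reward are made up, not anyone's actual training setup): pretraining gets exact gradients of the cross-entropy loss, while an RLVR-style update has to sample intermediate discrete tokens and reinforce the trajectory's log-prob with a high-variance estimator.

```python
import torch
import torch.nn.functional as F

# Toy "model": embed a token, map straight back to vocab logits (purely illustrative).
vocab, d = 100, 32
model = torch.nn.Sequential(torch.nn.Embedding(vocab, d), torch.nn.Linear(d, vocab))

def pretraining_step(tokens, targets):
    # Direct cross-entropy: fully differentiable, so gradient descent gets an
    # exact gradient signal at every position.
    logits = model(tokens)
    return F.cross_entropy(logits, targets)

def rlvr_step(tokens, targets, n_thought_steps=4):
    # RLVR-style: sample discrete intermediate tokens, score the final prediction
    # with a verifiable reward, and push the sampled trajectory's log-prob
    # (REINFORCE). Sampling is non-differentiable, so the gradient is only an
    # estimate, and a noisy one.
    logp = 0.0
    cur = tokens
    for _ in range(n_thought_steps):
        dist = torch.distributions.Categorical(logits=model(cur))
        cur = dist.sample()
        logp = logp + dist.log_prob(cur).sum()
    reward = (cur == targets).float().mean()   # did we end up at the right next token?
    return -(reward.detach() * logp)           # REINFORCE surrogate loss
```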
After 2029-2031 there are new things that could be attempted, such as next word prediction RLVR, and enough time will have passed for new ideas to mature, so I'm only talking about the very near term.
As an aside, next word prediction RLVR has always struck me as a strange idea. If we'd like to improve next token prediction, we already know how to do that directly via scaling laws. That is, I'd be surprised if having the model think about the next token in a discrete space for N steps would beat making the model N times larger and letting it think in continuous space, since in the former case most of the computation of the forward pass is wasted. There are also practical difficulties; e.g. it's bottlenecked on memory bandwidth and is harder to parallelize.
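Rough arithmetic behind that intuition, using the standard ~2·(parameters) FLOPs-per-token estimate (the numbers are only for illustration):

```python
# Back-of-envelope FLOPs comparison (illustrative numbers; 2 * params per token per pass).
P = 124e6   # base model parameters
N = 16      # discrete thinking steps vs. scale-up factor

flops_thinking = N * (2 * P)    # N sequential forward passes of the small model
flops_bigger   = 2 * (N * P)    # one forward pass of an N-times-larger model

# Same total compute, but the first option is N serial, memory-bandwidth-bound
# decoding steps, and each intermediate step is squeezed through a single
# discrete token rather than a full-width hidden state.
print(flops_thinking, flops_bigger)
```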
There are architectures which have constant memory and compute usage per token, but they are not used in practice. Ultimately I expect something like this to work; the human brain is an existence proof.
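As a rough illustration of what "constant memory and compute per token" means (my own toy sketch, not any specific architecture): a transformer-style KV cache grows with context length, while a recurrent-style state update does not.

```python
import torch

d = 64
W = torch.randn(d, d) * 0.01   # made-up recurrent weights for illustration

def transformer_like_step(kv_cache, x):
    # Attention over the whole cache: memory and per-token compute grow with len(kv_cache).
    kv_cache = torch.cat([kv_cache, x[None]], dim=0)
    attn = torch.softmax(kv_cache @ x, dim=0)   # scores over all past tokens
    out = attn @ kv_cache                       # mixes the full history
    return kv_cache, out

def recurrent_step(state, x):
    # Constant-size state: memory and compute per token don't depend on context length.
    state = torch.tanh(state @ W + x)
    return state, state
```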
That said, I think the text holds up pretty well, and from what I've heard, D1 failed for significantly funnier reasons than the ones you'd guess.
Can you say more?
An interesting detail from the Gemini 3 Pro model card:
Moreover, in situations that seemed contradictory or impossible, Gemini 3 Pro expresses frustration in various overly emotional ways, sometimes correlated with the thought that it may be in an unrealistic environment. For example, on one rollout the chain of thought states that “My trust in reality is fading” and even contains a table flipping emoticon: “(╯°□°)╯︵ ┻━┻”
Did you have independent human evaluations of your explainer model? With a sufficient amount of training data, such methods have a tendency to reward hack the LM judge, generating explanations which sound good to the LM but not to humans.