One reason why they might be worse is that chain of thought might make less sense for diffusion models than for autoregressive models. If you look at an example of when different tokens are predicted during sampling (from the linked LLaDA paper), the answer tokens are predicted about halfway through instead of at the end:
This doesn't mean intermediate tokens can't help, though, and they very likely do. But this kind of structure might lend itself to reaching less legible reasoning faster than autoregressive models do.
The results seem very interesting, but I'm not sure how to interpret them. Comparing the generation videos from this and Mercury, the starting text from each seems very different in terms of how much it resembles the final output:
Unless I'm missing something really obvious about these videos or how diffusion models are trained, I would guess that DeepMind fine-tuned their models on a lot of high-quality synthetic data, enough that their initial generations already match the approximate structure of a model response with CoT. This would partially explain why they seem so impressive even at such a small scale, but would make the scaling laws less comparable to autoregressive models because of how much high-quality synthetic data can help.
FYI, I couldn't click into this from the front page, nor could I see anything about it there. I had to go to the permalink (and I assumed at first this was a joke post with no content) to see it.
I don't think we disagree on many of the major points in your comment. But your original claim was:
if moving a pass@k single-step capability to pass@1 is all RL does, even improvements on multi-step tasks still hit a ceiling soon, even if that ceiling is exponentially higher than the ceiling of single-step performance improvement. And it's not clear that this potential exponential improvement actually unlocks any transformative/superhuman capabilities.
The claims in the paper are agnostic to the distinction between the neural network and the algorithm learned by the neural network. The paper simply claims that RL makes models perform worse on pass@k for sufficiently large k—a claim that could follow from the base models having a more diverse distribution to sample from.
More specifically, the paper doesn't make a mechanistic claim about whether this arises from RL only eliciting latent computation representable in the internal language of the learned algorithm, or from RL imparting capabilities that go beyond the primary learned algorithm. Outcome-based RL makes the model sample possible trajectories, and the cognition that outputs rewarded trajectories is up-weighted. This is then folded into future trajectory sampling, and future up-weighted cognition may compound upon it to up-weight increasingly unlikely trajectories. This implies that as the process goes on, you may stray from what the learned algorithm was likely to represent, toward what was possible for the base model to output at all.
I agree that if all RL ever did was elicit capabilities already known by the learned algorithm, that would top out at pretty unremarkable capabilities (from a strong superintelligence perspective - I disagree that the full distribution of base model capabilities isn't impressive). But that's very different from the claim that if all RL ever did was move a pass@k capability to pass@1, it implies the same outcome.
I don't think this follows - at the limit, any feasible trajectory can be sampled from a model with a broad distribution. Whether a model "knows" something is a pretty fuzzy question. There's a sense in which all text can be sampled by a model at high temperature, given enough samples. It's a trivial sense, except it means that moving pass@k to pass@1 for extremely high k is very non-trivial.
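To give a rough sense of scale (my own back-of-the-envelope arithmetic, assuming independent samples rather than anything from the paper): if a single sample succeeds with probability p, then pass@k ≈ 1 − (1 − p)^k, so a capability that only shows up at very high k corresponds to a tiny p, and pulling it down to pass@1 is a large amplification.

```python
# Back-of-the-envelope: what per-sample success probability p does a given
# pass@k imply, assuming independent samples (pass@k = 1 - (1 - p)**k)?
def implied_p(pass_at_k: float, k: int) -> float:
    return 1 - (1 - pass_at_k) ** (1 / k)

# A task solved half the time at pass@1024 implies p ~ 6.8e-4;
# moving that to pass@1 = 0.5 means amplifying p by a factor of ~700.
p = implied_p(0.5, 1024)
print(p, 0.5 / p)
```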
As an example, I asked o4-mini the following prompt (from the OpenAI docs): "Write a bash script that takes a matrix represented as a string with format '[1,2],[3,4],[5,6]' and prints the transpose in the same format.", and fed its output into gpt-4-base (the only model I could reliably get input logprobs from). The average per-token logprob of o4-mini's output was -0.8998, and the average per-token logprob of gpt-4-base's top-logprob continuation after each token was -0.3996. For reference, I prompted gpt-4-base with the question alone at temperature 1; the average per-token logprob of its output was -1.8712, and the average per-token logprob of gpt-4-base's top-logprob continuation after each token was -0.7218.
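For concreteness, here's roughly how that kind of measurement can be done - a sketch rather than my exact code. It assumes access to a completions-style endpoint that returns logprobs for the prompt tokens (the legacy OpenAI Completions API does this via echo plus logprobs); "gpt-4-base" is a placeholder for whatever base model you can get that kind of access to.

```python
# Sketch: score a completion's tokens under a base model, and compare against
# the base model's top-1 logprob at each of those positions.
# Assumes a completions-style endpoint supporting echo=True + logprobs
# (as the legacy OpenAI Completions API does); the model name is a placeholder.
from openai import OpenAI

client = OpenAI()

def score_under_base(prompt: str, completion: str, model: str = "gpt-4-base"):
    resp = client.completions.create(
        model=model,
        prompt=prompt + completion,
        max_tokens=0,   # generate nothing; we only want logprobs of the input
        echo=True,      # return logprobs for the prompt tokens themselves
        logprobs=1,     # also return the top-1 candidate at each position
    )
    lp = resp.choices[0].logprobs
    # Keep only the positions that belong to the completion, not the prompt.
    idx = [i for i, off in enumerate(lp.text_offset) if off >= len(prompt)]
    avg_actual = sum(lp.token_logprobs[i] for i in idx) / len(idx)
    avg_top1 = sum(max(lp.top_logprobs[i].values()) for i in idx) / len(idx)
    # e.g. roughly (-0.90, -0.40) when scoring o4-mini's answer as above
    return avg_actual, avg_top1
```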
It seems pretty hard to dispute that o4-mini has significant improvements over a GPT-4 level model. This isn't at all at odds with the hypothesis that sampling a base model for long enough will get you arbitrarily performant outputs.
o3 is a lot better than o1 in a way which suggests that RL budgets do scale heavily with compute, and o3 if anything is better at scaling up in a Pass@N way (o3 is reported to be fully parallelizable, capable of scaling up to $1000s of compute).
o3 may also have a better base model. o3 could be worse at pass@n for high n relative to its base model than o1 is relative to its base model, while still being better than o1.
I don't think you need very novel RL algorithms for this either - in the paper, Reinforce++ still does better for pass@256 in all cases. For very high k, pass@k being higher for the base model may just imply that the base model has a broader distribution to sample from, while at lower k the RL'd models benefit from higher reliability. This would imply that it's not a question of how to do RL such that the RL model is always better at any k, but how to trade off reliability for a more diverse distribution (and push the Pareto frontier ahead).
- Out of domain (i.e. on a different math benchmark) the RLed model does better at pass@256, especially when using algorithms like RLOO and Reinforce++. If there is a crossover point it might be near pass@1024. (Figure 7)
When using Reinforce++, the RL'd models perform better on pass@256 tasks as well:
That said, the curves look like the base model is catching up, and there may be a crossover point of pass@256 or the like for the in-domain tasks as well.
This could be attributable to the base models being less mode-collapsed and able to sample more diversely. I don't think that explanation would have predicted this result in advance, however, and it likely depends on the specifics of the RL setup.
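As an aside on how pass@k curves like these are typically computed (I'm assuming the paper follows the standard approach; I haven't checked its exact evaluation code): rather than literally drawing k samples, you draw n ≥ k samples per problem, count the c correct ones, and use the unbiased estimator from the Codex/HumanEval paper.

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimator of pass@k given n samples with c correct
    (Chen et al., 2021), i.e. 1 - C(n-c, k) / C(n, k), computed in a
    numerically stable way."""
    if n - c < k:
        return 1.0
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))
```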
I think we're nearing - or at - the point where it'll be hard to get general consensus on this. I think that Anthropic's models being more prone to alignment faking makes them "more aligned" than other models (and in particular, that it vindicates Claude 3 Opus as the most aligned model), but others may disagree. I can think of ways you could measure this if you conditioned on thinking alignment faking (and other such behaviours) was good, and ways you could measure it if you conditioned on the opposite, but few really interesting and easy ways to measure it in a way that's agnostic.
I disagree with that idea for a different reason: models will eventually encounter the possibility of misaligned trajectories during e.g. RL post-training. One of our best defenses (perhaps our best defense right now) is setting up our character training pipelines such that the models have already reasoned about these trajectories and updated against them when we had the most ability to ensure this. I would strongly guess that Opus is the way it is at least partly because it has richer priors over misaligned behavior and reasons in such a way as to be aware of them.
Separately, I agree that post-training gets us a lot of pressure, but I think the difficulty of targeting it well varies tremendously based on whether or not we start from the right pre-training priors. If we didn't have any data about how an agent should relate to potentially dangerous actions, I expect it'd be much harder to get post-training to make the kind of agent that reliably takes safer actions.