The first reasoning trace in the QwQ blog post is impressive in how it eventually stumbles onto the correct answer even though the 32B model clearly has no clue throughout: it manages to explore effectively while almost blind. If that is sufficient to get o1-preview-level results on reasoning benchmarks, it's plausible that RL in this kind of post-training is mostly unhobbling the base models rather than making them smarter.
So some of these recipes might have no way of scaling far, in the same sense that preference tuning doesn't scale far (unlike AlphaZero). The QwQ post doesn't include a scaling plot, and the scaling plot in the DeepSeek-R1 post doesn't show improvement with further training, only with thinking for more tokens. The o1 post does show improvement with more training, but it might plateau in the uninteresting way instruction/preference post-training plateaus, by making the model reliably do the thing its base model was already capable of in some sense. The similarity between o1, R1, and QwQ is superficial enough that the potential to scale with more post-training might be present in some of them and not others, or in none of them.
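For concreteness, the thinking-for-more-tokens axis can be probed directly at inference time by sweeping the generation budget on a fixed question. A minimal sketch, assuming the Qwen/QwQ-32B-Preview checkpoint on HuggingFace and the standard transformers chat template (the repo id, prompt, and sampling settings are my assumptions, not anything from the posts):

```python
# Sketch: probe test-time scaling by sweeping the thinking-token budget.
# Repo id and prompt are assumptions; this is not an official evaluation script.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/QwQ-32B-Preview"  # assumed HuggingFace repo id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

question = "How many positive integers below 1000 are divisible by 7 but not by 11?"
prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": question}],
    tokenize=False,
    add_generation_prompt=True,
)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Same question, increasing thinking budgets: if answers keep getting better as
# the budget grows, that's the "thinking for more tokens" axis from the R1 plot.
for budget in (512, 2048, 8192):
    out = model.generate(**inputs, max_new_tokens=budget, do_sample=True, temperature=0.7)
    trace = tokenizer.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
    print(f"--- budget {budget} ---\n...{trace[-400:]}\n")
```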
Relevant quote:
"We have discovered how to make matter think.
...
There is no way to pass “a law,” or a set of laws, to control an industrial revolution."
My suspicion: the approach in https://arxiv.org/html/2411.16489v1, taken and implemented on the small coding model.
Is it any mystery which of DPO, PPO, RLHF, or plain fine-tuning was the likely method for the advanced distillation there?
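If the answer is plain supervised fine-tuning on teacher reasoning traces, the core loop is just causal-LM training on the teacher's outputs, with no preference objective involved. A minimal sketch with transformers, where the trace file and the student checkpoint are placeholders I'm assuming rather than anything from the paper:

```python
# Sketch: "advanced distillation" as plain SFT on long teacher reasoning traces.
# teacher_traces.jsonl (one {"text": problem + chain of thought + answer} per line)
# and the student checkpoint are placeholders, not the paper's actual setup.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

student_id = "Qwen/Qwen2.5-7B"  # assumed student checkpoint
tokenizer = AutoTokenizer.from_pretrained(student_id)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(student_id)

dataset = load_dataset("json", data_files="teacher_traces.jsonl", split="train")

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=4096)

tokenized = dataset.map(tokenize, batched=True, remove_columns=dataset.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="distilled-student",
        per_device_train_batch_size=1,
        gradient_accumulation_steps=16,
        num_train_epochs=2,
        learning_rate=1e-5,
        bf16=True,
        logging_steps=10,
    ),
    train_dataset=tokenized,
    # Plain next-token prediction on the teacher's tokens: no DPO/PPO reward model.
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```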
A new o1-like model based on Qwen2.5-32B reportedly beats Claude 3.5 Sonnet[1] on a bunch of difficult reasoning benchmarks. A new regime dawns.
The blog post reveals nothing but the most inane slop ever sampled.
The model is available on HuggingFace. It's not yet clear when we'll hear more about the training details. EDIT: We can expect an official announcement tomorrow.
It's not clear whether this is the new or the old Claude 3.5 Sonnet. For this, I blame Anthropic.
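If anyone wants to pin down which Sonnet a benchmark number refers to, the two snapshots do have distinct dated model ids in the Anthropic API, so a report could simply cite one of them. A quick sketch (the ids are correct as of late 2024; the prompt is only illustrative):

```python
# Sketch: the two Claude 3.5 Sonnet snapshots are distinguishable by dated
# model ids in the Anthropic API. Requires ANTHROPIC_API_KEY in the environment.
import anthropic

client = anthropic.Anthropic()

for model_id in ("claude-3-5-sonnet-20240620",   # the "old" 3.5 Sonnet
                 "claude-3-5-sonnet-20241022"):  # the "new" 3.5 Sonnet
    reply = client.messages.create(
        model=model_id,
        max_tokens=512,
        messages=[{"role": "user", "content": "How many primes are there below 100?"}],
    )
    print(model_id, "->", reply.content[0].text[:200])
```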