The first reasoning trace in the QwQ blog post is impressive in how it eventually stumbles onto the correct answer even though the 32B model clearly has no clue throughout: it manages to explore effectively while almost blind. If that is sufficient to get o1-preview-level results on reasoning benchmarks, it's plausible that RL in this kind of post-training is mostly unhobbling the base models rather than making them smarter.
So some of these recipes might have no way of scaling far, in the same sense that preference tuning doesn't scale far (unlike AlphaZero). The QwQ post doesn't include a scaling plot, and the scaling plot in the DeepSeek-R1 post doesn't show improvement with further training, only with thinking for more tokens. The o1 post does show improvement with more training, but it might plateau in the uninteresting way instruction/preference post-training plateaus, by making the model reliably do the thing its base model was already capable of in some sense. The similarity between o1, R1, and QwQ is superficial enough that the potential to scale with more post-training might be present in some of them and not others, or in none of them.
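For concreteness, the thinking-for-more-tokens axis can be probed directly at inference time by sweeping the generation budget on a fixed question. A minimal sketch, assuming the Qwen/QwQ-32B-Preview checkpoint on HuggingFace and the standard transformers chat template (the repo id, prompt, and sampling settings are my assumptions, not anything from the posts):

```python
# Sketch: probe test-time scaling by sweeping the thinking-token budget.
# Repo id and prompt are assumptions; this is not an official evaluation script.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/QwQ-32B-Preview"  # assumed HuggingFace repo id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

question = "How many positive integers below 1000 are divisible by 7 but not by 11?"
prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": question}],
    tokenize=False,
    add_generation_prompt=True,
)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Same question, increasing thinking budgets: if answers keep getting better as
# the budget grows, that's the "thinking for more tokens" axis from the R1 plot.
for budget in (512, 2048, 8192):
    out = model.generate(**inputs, max_new_tokens=budget, do_sample=True, temperature=0.7)
    trace = tokenizer.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
    print(f"--- budget {budget} ---\n...{trace[-400:]}\n")
```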
Relevant quote:
"We have discovered how to make matter think.
...
There is no way to pass “a law,” or a set of laws, to control an industrial revolution."
My suspicion: the approach in https://arxiv.org/html/2411.16489v1, taken and implemented on the small coding model.
Is it any mystery which of DPO, PPO, RLHF, or plain fine-tuning was the likely method for the advanced distillation there?
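If the answer is plain supervised fine-tuning on teacher reasoning traces, the core loop is just causal-LM training on the teacher's outputs, with no preference objective involved. A minimal sketch with transformers, where the trace file and the student checkpoint are placeholders I'm assuming rather than anything from the paper:

```python
# Sketch: "advanced distillation" as plain SFT on long teacher reasoning traces.
# teacher_traces.jsonl (one {"text": problem + chain of thought + answer} per line)
# and the student checkpoint are placeholders, not the paper's actual setup.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

student_id = "Qwen/Qwen2.5-7B"  # assumed student checkpoint
tokenizer = AutoTokenizer.from_pretrained(student_id)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(student_id)

dataset = load_dataset("json", data_files="teacher_traces.jsonl", split="train")

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=4096)

tokenized = dataset.map(tokenize, batched=True, remove_columns=dataset.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="distilled-student",
        per_device_train_batch_size=1,
        gradient_accumulation_steps=16,
        num_train_epochs=2,
        learning_rate=1e-5,
        bf16=True,
        logging_steps=10,
    ),
    train_dataset=tokenized,
    # Plain next-token prediction on the teacher's tokens: no DPO/PPO reward model.
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```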
A new o1-like model based on Qwen2.5-32B reportedly beats Claude 3.5 Sonnet[1] on a bunch of difficult reasoning benchmarks. A new regime dawns.
The blog post reveals nothing but the most inane slop ever sampled.
The model is available on HuggingFace. It's not yet clear when we'll hear more about the training details. EDIT: We can expect an official announcement tomorrow.
It's not clear whether this is the new or the old Claude 3.5 Sonnet. For this, I blame Anthropic.
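If anyone wants to pin down which Sonnet a benchmark number refers to, the two snapshots do have distinct dated model ids in the Anthropic API, so a report could simply cite one of them. A quick sketch (the ids are correct as of late 2024; the prompt is only illustrative):

```python
# Sketch: the two Claude 3.5 Sonnet snapshots are distinguishable by dated
# model ids in the Anthropic API. Requires ANTHROPIC_API_KEY in the environment.
import anthropic

client = anthropic.Anthropic()

for model_id in ("claude-3-5-sonnet-20240620",   # the "old" 3.5 Sonnet
                 "claude-3-5-sonnet-20241022"):  # the "new" 3.5 Sonnet
    reply = client.messages.create(
        model=model_id,
        max_tokens=512,
        messages=[{"role": "user", "content": "How many primes are there below 100?"}],
    )
    print(model_id, "->", reply.content[0].text[:200])
```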