Thanks for the overview!
So, speaking specifically about IMO Gold, OpenAI has not released a configuration capable of achieving IMO Gold yet, and it seems that the Gemini configuration capable of achieving IMO Gold is still only available to a select group of testers (including some mathematicians) [[1]].
So, unless I am mistaken, on the "informal track", DeepSeek is not just the first IMO Gold capable system available as open weights, but the first IMO Gold capable system publicly available at all.
On the formal, Lean-oriented "track", it might be that the publicly available version of Aristotle from Harmonic is good enough now (when its experimental version was initially made available in the summer, it did not seem very strong, but it should be much better now; see "Aristotle: IMO-level Automated Theorem Proving", https://arxiv.org/abs/2510.01346).
https://blog.google/products/gemini/gemini-3-deep-think/, which came out yesterday, says: "Gemini 3 Deep Think is industry leading on rigorous benchmarks like Humanity’s Last Exam (41.0% without the use of tools) and ARC-AGI-2 (an unprecedented 45.1% with code execution). This is because it uses advanced parallel reasoning to explore multiple hypotheses simultaneously — building on Gemini 2.5 Deep Think variants that recently achieved a gold-medal standard at the International Mathematical Olympiad and at the International Collegiate Programming Contest World Finals." It's a bit ambiguous: they say they are using the same technique, but it's not clear whether this publicly available configuration can achieve results this high. ↩︎
Unfortunately, Gemini 3 Pro without the Deep Think option managed to one-shot Problems 1, 3, 4, and 5 of IMO 2025. I suspect that if we prompt the system to solve the problems one by one, we will end up with it solving everything except Problem 6.
EDIT: fortunately, asking it to solve all the problems at once was a failure. Unfortunately, prompting the model to solve them one by one was largely a success: it solved Problems 1, 3, and 5, failed Problem 2, and encountered an error on Problem 4.
That’s good (assuming no contamination, of course; I don’t expect it to break instructions not to search, but it could have seen the problems during some phase of training).
But this will be possible to double-check in the future with novel problems.
(I assume someone checked the correctness of these versions of the solutions; this is just a conversation, but someone does need to vouch for having checked the details.)
DeepSeek v3.2 is DeepSeek’s latest open model release, with strong benchmarks. Its paper contains some technical innovations that drive down cost.
It’s a good model by the standards of open models, and very good if you care a lot about price and openness, and if you care less about speed or whether the model is Chinese. It is strongest in mathematics.
What it does not appear to be is frontier. It is definitely not having a moment. In practice all signs are that it underperforms its benchmarks.
When I asked for practical experiences and reactions, I got almost no responses.
A Brief History of DeepSeek
DeepSeek is a cracked Chinese AI lab that has produced some very good open models, done some excellent research, and given us strong innovations in terms of training techniques and especially training efficiency.
They also, back at the start of the year, scared the hell out of pretty much everyone.
A few months after OpenAI released o1, and shortly after DeepSeek released the impressive v3 that was misleadingly known as the ‘six million dollar model,’ DeepSeek came out with a slick app and with r1, a strong open reasoning model based on v3 that showed its chain of thought. With reasoning models not yet scaled up, it was the perfect time for a fast follow, and DeepSeek executed that very well.
A strong viral marketing campaign and a confluence of events briefly (and wrongly) convinced a lot of people that China or DeepSeek had ‘caught up’ or was close behind American labs, as opposed to being many months behind. DeepSeek’s app shot to #1 on the app store, people conflated the six million dollars in cost to train v3 with OpenAI’s entire budget of billions, and r1’s strengths got contrasted with o1’s weaknesses.
There was even talk that American AI labs or all closed models were ‘doomed’ and so on. Tech stocks were down a lot and people attributed that to DeepSeek, in ways that reflected a stock market highly lacking in situational awareness and responding irrationally, even if other factors were also driving a lot of the move.
Politicians claimed this meant we had to ‘race’ or else we would ‘lose to China,’ thus all other considerations must be sacrificed, and to this day the idea of a phantom DeepSeek-Huawei ‘tech stack’ is used to scare us.
This is collectively known as The DeepSeek Moment.
Slowly, in hindsight, the confluence of factors that caused this moment became clear. DeepSeek had always been behind by many months, likely about eight. Which was a lot shorter than previous estimates, but a lot more than people were saying.
Later releases bore this out. DeepSeek’s r1-0528 and v3.1 did not ‘have a moment,’ and neither did v3.2-exp or now v3.2. The releases disappointed.
DeepSeek remains a national champion and source of pride in China, and is a cracked research lab that innovates for real. Its models are indeed being pushed by the PRC, especially in the global south.
For my coverage of this, see:
Once More, With Feeling
I’d just been through a few weeks in which we got GPT-5.1, Grok 4.1, Gemini 3 Pro, GPT-5.1-Codex-Max and then finally Claude Opus 4.5. Mistral, listed above, doesn’t count. Which means we’re done and can have a nice holiday season, asks Padme?
No, Anakin said. There is another.
Reading The Paper
Teortaxes threatened to bully me if I did not read the v3.2 paper. I did read it. The main innovation appears to be a new attention mechanism, which improves training efficiency and also greatly reduces the compute cost of scaling the context window, resulting in v3.2 being relatively cheap without being relatively fast. Unfortunately I lack the expertise to appreciate the interesting technical aspects. Should I try to fix this in general? My gut says no.
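If you want a sense of why that kind of change matters, here is a minimal sketch of generic top-k sparse attention (my own illustration under assumptions, not DeepSeek's actual mechanism or code): a cheap indexer scores the keys, and each query then does full attention over only its best-scoring handful of keys rather than the whole context.

```python
# Illustrative sketch only, not DeepSeek's implementation: top-k sparse attention.
# A lightweight "indexer" projection cheaply scores keys for each query, and the
# expensive full-dimension attention is then computed over only the top_k keys.
import torch
import torch.nn.functional as F

def sparse_attention(q, k, v, index_q, index_k, top_k=64):
    # q, k, v: [seq, dim] full attention inputs
    # index_q, index_k: [seq, small_dim] cheap projections used only for selection
    scores = index_q @ index_k.T                          # [seq, seq], but low-dim and cheap
    idx = scores.topk(min(top_k, k.size(0)), dim=-1).indices
    scale = q.size(-1) ** 0.5
    out = torch.empty_like(q)
    for i in range(q.size(0)):                            # python loop for clarity, not speed
        k_sel, v_sel = k[idx[i]], v[idx[i]]               # gather only the selected keys/values
        attn = F.softmax(q[i] @ k_sel.T / scale, dim=-1)  # attend over top_k keys only
        out[i] = attn @ v_sel
    return out
```

Whether or not that is close to what v3.2 actually does, it gives the flavor of how long-context cost can drop without shrinking the model.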
What the paper did not include was any form of safety testing or information of any kind for this irreversible open release. There was not, that I could see, even a sentence that said ‘we did safety testing and are confident in this release’ or even one that said ‘we do not see any need to do any safety testing.’ It’s purely and silently ignored.
Frankly, this is deeply irresponsible and completely unacceptable.
DeepSeek did by some accounts become somewhat censorious back in May, but that doesn’t seem to apply to, as George puts it, plans for <bad_device>.
DeepSeek claims to be ‘pushing the boundaries of reasoning capabilities’ and to be giving a GPT-5 level of performance. Their benchmarks match this story.
And they can’t even give us an explanation of why they don’t believe they owe us any sort of explanation? Not even a single sentence?
I knew DeepSeek was an irresponsible lab. I didn’t know they were this irresponsible.
Open Language Model Offers Mundane Utility
The short version of my overall take seems to be that DeepSeek v3.2 is excellent for its price point, and its best area is mathematics, but while it is cheap it is reported to be remarkably slow, and for most practical purposes it is not frontier.
Which means you would only use it either if you are doing relatively advanced math, or if all four of the following are true:

1. You care a lot about price.
2. You care a lot about openness.
3. You do not care much about speed.
4. You do not mind that the model is Chinese.
The only strong praise I found in practice was this exchange from perennial whale (DeepSeek) advocate Teortaxes, Vinicius and John Pressman:
Notice the background agreement that the ‘model to beat’ for most purposes is Opus 4.5, not Gemini 3 or GPT-5.1. I strongly agree with this, although Gemini 3 still impresses on ‘just the facts’ or ‘raw G’ tasks.
Some people really want a combative, abrasive sparring partner that will err on the side of skepticism and minimize false positives. Teortaxes and Pressman definitely fit that bill. That’s not what most people want. You can get Opus to behave a lot more in that direction if you really want that, but not easily get it to go all the way.
Is v3.2 a good model that has its uses? My guess is that it is. But if it was an exciting model in general, we would have heard a lot more.
Those Benchmarks
They are very good benchmarks, and a few independent benchmarks also gave v3.2 high scores, but what’s the right bench to be maxing?
This confuses different maxings of things but I love the idea of ‘usemaxxed.’
That’s a highly fair thing to say about the big three, that they’ve given a lot of focus to making them actually useful in practice for common use cases. So one could argue that by skipping all that you could get a model that was fundamentally as smart or frontier as the big three; it would just take more work to get it to do the most common use cases. It’s plausible.
Those are presumably non-targeted benchmarks that give sensible ratings elsewhere, as is this one from NomoreID on a Korean test, so this supports the idea that the ‘good on benchmarks’ result is probably real in general, especially on math.
Open Language Model Doesn’t Offer Mundane Utility
In practice, it seems less useful, whether or not that is because it is less usemaxxed.
I want my models to be usemaxxed, because the whole point is to use them.
Also our standards are very high.
The counterargument is that if you are ‘gpt 5ish’ then the core capabilities pre-usemaxxing are perhaps only a few months behind now? Which is very different from being overall only a few months behind in a practical way, or in a way that would let one lead.
The Pliny jailbreak is here, if you’re curious.
Gallabytes was unimpressed, as were those responding if your standard is the frontier. There were reports of it failing various gotcha questions and no reports of it passing.
Open Language Model Does Do The Math
In other DeepSeek news, DeepSeekMath-v2 used a prover-verifier loop that calls out the model’s own mistakes for training purposes, the same way you’d do it if you were learning real math.
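For those unfamiliar with the pattern, a minimal sketch of a generic prover-verifier data loop looks something like this (hypothetical interfaces, not DeepSeekMath-v2's actual pipeline): the prover drafts proofs, the verifier hunts for mistakes, and both the flagged errors and the unfaulted proofs become training data.

```python
# Illustrative sketch with hypothetical objects; not DeepSeekMath-v2's actual code.
def prover_verifier_round(prover, verifier, problems):
    """One data-collection round: the prover drafts proofs, the verifier critiques
    them, and the critiques become training signal for the next round."""
    training_examples = []
    for problem in problems:
        proof = prover.generate(problem)             # prover writes a full candidate proof
        critique = verifier.review(problem, proof)   # verifier searches for gaps and errors
        if critique.found_errors:
            # flagged mistakes give the prover corrective signal and give the
            # verifier confirmed examples of what bad proof steps look like
            training_examples.append((problem, proof, critique))
        else:
            # proofs the verifier cannot fault become positive examples
            training_examples.append((problem, proof, None))
    return training_examples
```

The point is that the model's own flagged mistakes, not just final answers, become the training signal, which is much closer to how a person actually learns real math.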
So far, the reported actual breakthroughs have all been from American closed source frontier models. Let’s see if that changes.
I am down with the recent direction of DeepSeek releases towards specialized worthwhile math topics. That seems great. I do not want them trying to cook an overall frontier model, especially given their deep level of irresponsibility.
I’ll Get You Next Time, Gadget
Making things cheaper can still be highly valuable, even with other issues. By all accounts this model has real things to offer, the first noteworthy DeepSeek offering since r1. What it is not, regardless of their claims, is a frontier model.
This is unsurprising. You don’t go from v3.2-exp to v3.2 in your naming schema while suddenly jumping to the frontier. And you don’t, I would hope, put a model on the frontier with a fully open release while saying literally zero words about safety concerns.
DeepSeek are still doing interesting and innovative things, and this buys some amount of clock in terms of keeping them on the map.
As DeepSeek says in their v3.2 paper, open models have since r1 been steadily falling further behind closed models rather than catching up. v3.2 appears to close some of that additional gap.
The question is, will they be cooking a worthy v4 any time soon?
The clock is ticking.