The newly commercial[1] lmarena still has not posted the scores for the new R1.
One starts to wonder if they are deliberately throttling the rate at which it is sampled for their 1-to-1 competitions.
(With all the problems with lmarena[2], it would not be a bad way to compare it, first of all against the old R1 and the new V3.)
[2] See e.g. “The Leaderboard Illusion”, https://arxiv.org/abs/2504.20879
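Since those 1-to-1 competitions keep coming up, here is a minimal sketch of how pairwise votes turn into a leaderboard. This is an illustrative Elo-style update, not lmarena’s actual method (they fit a Bradley-Terry model over all votes), and the model names, starting ratings and K-factor below are assumptions:

```python
# Illustrative Elo-style update for pairwise model battles. lmarena's real
# pipeline fits a Bradley-Terry model over all votes; this is just a sketch
# of the underlying idea. Names, starting ratings and K-factor are assumed.

def expected_score(r_a: float, r_b: float) -> float:
    """Probability that A beats B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def elo_update(r_a, r_b, a_won, k=32.0):
    """Return both ratings updated after one head-to-head vote."""
    e_a = expected_score(r_a, r_b)
    s_a = 1.0 if a_won else 0.0
    return r_a + k * (s_a - e_a), r_b + k * ((1.0 - s_a) - (1.0 - e_a))

# Hypothetical votes pitting the new R1 against the old R1 and the new V3.
ratings = {"r1-0528": 1500.0, "r1": 1500.0, "v3-0324": 1500.0}
votes = [("r1-0528", "r1", True), ("r1-0528", "v3-0324", False)]
for a, b, a_won in votes:
    ratings[a], ratings[b] = elo_update(ratings[a], ratings[b], a_won)
print(ratings)
```

The point of throttling, if it is happening, is that fewer battles means slower convergence of exactly this kind of estimate.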
Today we finally got the lmarena results for the new R1; they are quite impressive overall and in coding, less so in math.
When r1 was released in January 2025, there was a DeepSeek moment.
When r1-0528 was released in May 2025, there was no moment. Very little talk.
Here is a download link for DeepSeek-R1-0528-GGUF.
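If you want to pull the weights yourself, something like the following should work, though the repo id and quantization pattern below are assumptions (several community GGUF conversions exist), so adjust to whichever quant you actually want:

```python
# Sketch of pulling a GGUF quant of R1-0528 from Hugging Face. The repo id
# and filename pattern are assumptions; several community conversions exist.
from huggingface_hub import snapshot_download

path = snapshot_download(
    repo_id="unsloth/DeepSeek-R1-0528-GGUF",  # assumed community conversion
    allow_patterns=["*Q4_K_M*"],              # assumed quantization level
    local_dir="DeepSeek-R1-0528-GGUF",
)
print(path)
```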
It seems like a solid upgrade. If anything, I wonder if we are underreacting, and this illustrates how hard it is getting to evaluate which models are actually good.
What this is not is the proper r2, nor do we have v4. I continue to think that will be a telltale moment.
For now, what we have seems to be (but we’re not sure) a model that is solid for its price and status as an open model, but definitely not at the frontier, that you’d use if and only if you wanted to do something that was a very good fit and played to its strong suits. We likely shouldn’t update much either way on v4 and r2, and DeepSeek has a few more months before it starts being conspicuous that we haven’t seen them.
We Had a Moment
We all remember The DeepSeek Moment, which led to Panic at the App Store, lots of stock market turmoil that made remarkably little fundamental sense and has since been borne out as rather silly, a very intense week, and a conclusion to not panic after all.
Over several months, a clear picture emerged of (most of) what happened: A confluence of narrative factors transformed DeepSeek’s r1 from an impressive but not terribly surprising model worth updating on into a shot heard round the world, despite the lack of direct ‘fanfare.’
In particular, several reinforcing narrative factors worked together to cause this effect.
The R2 Moment Will Matter
I continue to believe that when R2 arrives (or fails to arrive for a long time), this will tell us a lot either way, whereas the R1-0528 we got is not a big update. If R1-0528 had been a fully top model and created another moment, that would of course be huge, but all results short of that are pretty similar.
I stand by what I said in AI #118 on this:
Is it possible this was supposed to be R2, but they changed the name because it was insufficiently impressive? Like everyone but Chubby here, I strongly think no.
I will however note that DeepSeek has a reputation for being virtuous straight shooters that is based on not that many data points, and one key point of that was their claim to have not done distillation, a claim that now seems questionable.
On Your Marks
The state of benchmarking seems rather dismal.
This could be the strongest argument that the previous DeepSeek market reaction was massively overblown (or even a wrong-way move). If DeepSeek giving us a good model is so important to the net present value of our future cash flows, how is no one even bothering to properly benchmark r1-0528?
And how come, when DeepSeek released their next model, Nvidia was up +4%? It wasn’t an especially impressive model, but I have to presume it was a positive update versus getting nothing, unless the market is saying that this proves they likely don’t have it. In which case, I think that’s at least premature.
Evals aren’t expensive by the standards of hedge funds, indeed one of the hedge funds (HighFlyer) is how DeepSeek got created and funded.
I notice that on GPQA Diamond, DeepSeek claims 81% and Epoch gives them 76%.
I am inclined to believe Epoch on that, and of course DeepSeek gets to pick which benchmarks to display whether or not they’re testing under fair conditions.
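As a quick sanity check on how much of that gap could be pure sampling noise, assuming GPQA Diamond’s 198 questions and a single attempt per question (both assumptions about the two test setups):

```python
# Back-of-envelope: is 81% vs 76% on GPQA Diamond within sampling noise?
# Assumes the 198-question Diamond set and a single attempt per question.
import math

n = 198                              # GPQA Diamond question count
p = (0.81 + 0.76) / 2                # pooled accuracy estimate
se = math.sqrt(2 * p * (1 - p) / n)  # std. error of the difference
z = (0.81 - 0.76) / se
print(f"se = {se*100:.1f} pts, z = {z:.2f}")  # se ≈ 4.1 pts, z ≈ 1.2
```

A gap of roughly 1.2 standard errors could be noise, but it could also reflect prompting or scoring differences, which is exactly why self-reported numbers deserve skepticism.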
DeepSeek clearly have, in various ways, been trying to give the impression that R1-0528 is on par with o3, Gemini 2.5 Pro and Claude 4 Opus.
That is incompatible with the lack of excitement and reaction. If an open weights model at this price point was actually at the frontier, people would be screaming. You wouldn’t be able to find a quiet rooftop.
Math level 5 is fully saturated as of o3, so this should be the last time we use it.
Here are the Lech Mazur benchmarks, where the scores are a mixed bag but overall pretty good.
There is no improvement on WeirdML.
In The News
The initial headlines were what you would expect, and were essentially ‘remember that big DeepSeek moment? Those guys gave us a new version.’
Here’s CNBC.
Yakefu is effectively talking their own book here. I don’t see why we should interpret this as catching up; everyone is reducing hallucinations and costs, but certainly DeepSeek are competing. How successfully they are doing so, and in what league, is the question.
One can perhaps now see how wrong we were to overreact so much to the first r1. Yes, r1-0528 is DeepSeek ‘catching up’ or ‘closing in’ in the sense that DeepSeek’s relative position looks better now, right after a release, than it looked on May 27. But it does not look better than when I wrote ‘on DeepSeek’s r1’ in January, and the LiveCodeBench result appears at best cherry-picked.
The article concludes with Nvidia CEO Huang making his typical case that because China can still make some AI models and build some chips, we should sacrifice our advantages in compute on the altar of Nvidia’s stock price and market share.
Here’s Bloomberg’s Luz Ding, who notes up front that the company calls it a ‘minor trial upgrade,’ so +1 to Ding, but there isn’t much additional information here.
A search of the Washington Post and Wall Street Journal failed to find any articles at all covering this event. If r1 was such a big moment, why is this update not news? Even if it was terribly disappointing, shouldn’t that also be news?
Other Reactions
Normally, in addition to evals, I expect to see a ton of people’s reactions, and more when I open up my reactions thread.
This time, crickets. So I get to include a lot of what did show up.
Teortaxes highlights where the original R1 paper says they plan to fix these limitations. This seems like a reasonable way to think about it: R1-0528 is the version of R1 that isn’t being rushed out the door in a sprint under severe compute limitations.
This was the high end of opinion, as xjdr called it a frontier model, which most people clearly don’t agree with at all, and kalomaze called it ‘excellence,’ but relative to its size:
However:
Different people had takes about the new style and what it reminded them of.
The Distillation Accusation
A key question that was debated about the original r1 was whether it was largely doing distillation, as in training on the outputs of other models, effectively reverse engineering them. This is the most common way to fast-follow, and a Chinese specialty. DeepSeek explicitly denied doing this, but we can’t rely on that.
If they did it, this doesn’t make r1’s capabilities any less impressive; the model can do what the model can do. But it does mean that DeepSeek is effectively much further from being able to ‘take the lead.’ So DeepSeek might be releasing models comparable to what was available 4-8 months ago, but still be 12+ months behind in terms of ability to push the frontier. Both measures matter.
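For concreteness, distillation in this sense is mechanically simple: sample completions from a stronger model, then fine-tune your own model on them as ordinary supervised data. A minimal sketch, with hypothetical model names and assuming teacher and student share a tokenizer (real pipelines add filtering, dedup and prompt masking):

```python
# Minimal sketch of output-level distillation: sample a teacher model's
# completions, then fine-tune a student on them as supervised data.
# Model names are hypothetical; assumes a shared tokenizer.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("teacher-model")             # hypothetical
teacher = AutoModelForCausalLM.from_pretrained("teacher-model").eval()
student = AutoModelForCausalLM.from_pretrained("student-model")  # hypothetical
opt = torch.optim.AdamW(student.parameters(), lr=1e-5)

prompts = ["Prove that sqrt(2) is irrational."]  # stand-in for a large corpus

for prompt in prompts:
    # 1. Teacher generates the completion being distilled.
    inputs = tok(prompt, return_tensors="pt")
    with torch.no_grad():
        seq = teacher.generate(**inputs, max_new_tokens=256)
    # 2. Student trains on prompt + completion with next-token cross-entropy.
    #    (A real pipeline would mask the prompt tokens out of the loss.)
    out = student(input_ids=seq, labels=seq.clone())
    out.loss.backward()
    opt.step()
    opt.zero_grad()
```

Nothing here requires access to the teacher’s weights, which is why API access alone is enough to do it, and why denials are hard to verify.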
Then again, claims can differ:
I agree with Fraser: it’s not a bad move to be doing this if allowed to do so, but it would lower our estimate of how capable DeepSeek is going forward.
If Teortaxes is saying the claim of distillation is sound, I am inclined to believe that, especially given there is no good reason not to do it. This is also consistent with his other observations above, such as it displaying a more ‘westoid’ flavor and having a sycophancy issue, and a different style of CoT.
It’s Quietly Probably a Solid Model, Sir
If you place high value on the low cost and open nature of r1-0528, it is probably a solid model for the places where it is strong, although I haven’t kept up with the details of the open model space enough to be sure, especially given how little attention this got. If you don’t place high value on both open weights and token costs, it is probably a pass.
The biggest news here is the lack of news, the dog that did not bark. A new DeepSeek release that panics everyone once again was a ready-made headline. I know I was ready for it. It didn’t happen. If this had been an excellent model, it would have happened. This also should make us reconsider our reactions the first time around.