+25% on SWE-bench relative to Gemini 2.5? Quadrupling Gemini 2.5's METR task length?
I suppose it's a possibility, albeit a remote one.
The SWE-bench scores are already well below the AI 2027 trend: it had to hit 85% by end of month, and we're at 75%. (SOTA was ~64% when AI 2027 was released.)
Very wide confidence intervals. If Grok 4 were equal to o3 on the 50% time horizon, "beating" it by this much would be a 33% outcome. (On the other hand, losing by this amount in the 80% bucket would be a 32% outcome.)
Overall, I read this as about as agentic as o3, possibly slightly less so given that no SWE-bench score was published for it (suggesting it wasn't SOTA).
My expectation is that GPT-5 will be a decent amount better than o3 on agentic software engineering (both in benchmarks and in practice), but won't be substantially above trend. In particular, my median is that it will have a 2.75 hour time horizon[1] on METR's evaluation suite[2]. This prediction was produced by extrapolating the faster 2024-2025 agentic software engineering time horizon trend forward from o3 and expecting GPT-5 to land slightly below trend.[3]
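For concreteness, here is a rough sketch of the kind of extrapolation involved. The o3 horizon (~1.5 h), the ~4-month doubling time, and both dates are my own illustrative assumptions rather than numbers from the footnotes, so treat the output as ballpark only:

```python
from datetime import date

# Illustrative assumptions (none of these numbers come from the footnotes):
O3_HORIZON_HOURS = 1.5            # assumed 50% time horizon for o3
DOUBLING_TIME_MONTHS = 4.0        # assumed doubling time of the faster trend
O3_RELEASE = date(2025, 4, 16)    # assumed reference date for o3
GPT5_RELEASE = date(2025, 8, 15)  # assumed GPT-5 release date

months_elapsed = (GPT5_RELEASE - O3_RELEASE).days / 30.4
on_trend = O3_HORIZON_HOURS * 2 ** (months_elapsed / DOUBLING_TIME_MONTHS)
slightly_below = 0.9 * on_trend   # "slightly below trend" as a ~10% haircut

print(f"On-trend horizon:     {on_trend:.2f} h")        # ~3 h
print(f"Slightly below trend: {slightly_below:.2f} h")  # ~2.7 h
```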
If the correlations continue to hold, a 2.75-hour horizon would map to something like a 78% to 80% range on SWE-bench pass@1 (which is likely to be announced at release). I'm personally not this bearish (I'd guess low 80s, given that the benchmark has reliably jumped ~3.5% per month), but we shall see.
Needless to say, if it scores 80%, we are well below the AI 2027 timeline predictions with high confidence.
Agentic coding ability is different from general chatbot ability. Gemini is IMO the best chatbot there is (just in terms of understanding context well, if you want to analyze text, learn things, etc.). Claude, on the other hand, is dead last among the big 3 (a steep change from a year ago), and my guess is Anthropic isn't trying much there anymore (focusing on agentic coding instead).
I don't see that producing much of an update. Its SWE-bench score, as you note, was only 59.6%, which naively maps to a ~50-minute METR time horizon.
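To make "naively maps" concrete, one simple version is a log-linear relationship between SWE-bench score and time horizon, anchored on two pairs that appear elsewhere in this thread. This is only a sketch of the shape of such a mapping, not the fit anyone actually uses:

```python
import math

# Two (SWE-bench pass@1 %, 50% time horizon in minutes) anchors reused from
# this thread; purely illustrative, not METR's or anyone's actual fit.
ANCHOR_LOW = (59.6, 50.0)    # the score discussed here and its ~50 min mapping
ANCHOR_HIGH = (79.0, 165.0)  # ~78-80% paired with the 2.75 h horizon above

def swebench_to_minutes(score: float) -> float:
    """Interpolate log2(time horizon) linearly against SWE-bench score."""
    (s0, t0), (s1, t1) = ANCHOR_LOW, ANCHOR_HIGH
    doublings_per_point = (math.log2(t1) - math.log2(t0)) / (s1 - s0)
    return t0 * 2 ** ((score - s0) * doublings_per_point)

print(f"{swebench_to_minutes(59.6):.0f} min")  # ~50 min by construction
print(f"{swebench_to_minutes(72.7):.0f} min")  # e.g. Sonnet 4's score -> ~110 min
```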
I don't think you can just take the HCAST time-horizon trend for software engineering and map it onto IMO problems.
An alternative bearish prediction: Deep Think got 50% on USAMO as of May 20 (not released, so lab frontier). Going from 50% to 80% success corresponds to ~4x in task time (at least for software engineering -- not sure what the ratio is for math), so we needed two doublings (6 months) to pull this off, and instead we only had ~0.67 doublings' worth of time.
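Spelling out that arithmetic (the 3-month doubling time is what "two doublings = 6 months" implies; the July date is my assumption for when the results were announced):

```python
import math
from datetime import date

# "Two doublings = 6 months" implies a ~3-month doubling time.
DOUBLING_TIME_MONTHS = 6 / 2
doublings_needed = math.log2(4)  # 50% -> 80% success ~ 4x task time

# Assumed dates: Deep Think's 50% USAMO result vs. the IMO gold announcements.
elapsed_months = (date(2025, 7, 19) - date(2025, 5, 20)).days / 30.4
doublings_available = elapsed_months / DOUBLING_TIME_MONTHS

print(f"Needed:    {doublings_needed:.2f} doublings")     # 2.00
print(f"Available: {doublings_available:.2f} doublings")  # ~0.66
```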
To put this into perspective, there was only an 8% chance P3 would be this easy, which puts substantial weight on the "unexpected" part being the problem being so easy. It's also the first time in 20 years (a 5% chance) that 5 problems were of difficulty <= 25.
Indeed, knowing that Gemini 2.5 Deep Think could solve an N25 (an IMO result from Gemini 2.5 Pro) and an A30 (known from the Gemini 2.5 Deep Think post), I'm somewhat less impressed. The only barriers were a medium-ish geometry problem (P2), which of course AlphaGeometry could solve, and an easy combinatorics problem (P1).
The two most impressive things, factoring in this write-up by Ralph Furman, are:
* OpenAI's LLM was able to solve a medium-level geometry problem (I'm guessing DeepMind just used AlphaGeometry again). Furman thought this would be hard for informal methods.
* OpenAI's LLM was strong enough to solve the easy combinatorics problem (Furman noted informal methods would likely outperform formal ones on this one -- it was just a matter of whether the LLM was smart enough).
SWE-bench pass@1 on Claude Sonnet versions has gone 33.4% (June 2024, 3.5) -> 49.0% (October) -> 62.3% (Feb 2025, 3.7) -> 72.7% (May, 4). That's practically linear at ~3.5% gained per month, which extrapolates to ~83% by the end of August.
With the leaderboard at ~75.2% on July 1, such an extrapolation also gets us to around 82%.
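For transparency, the arithmetic behind those two extrapolations (months counted from the June 2024 release; end of August treated as ~3 months after the Claude 4 point and ~2 months after July 1):

```python
# (months since June 2024, SWE-bench pass@1 %) for the Claude Sonnet releases above
points = [(0, 33.4), (4, 49.0), (8, 62.3), (11, 72.7)]

slope = (points[-1][1] - points[0][1]) / (points[-1][0] - points[0][0])
print(f"Slope: {slope:.2f} %/month")  # ~3.6

print(f"From Claude Sonnet 4 (72.7%): {72.7 + 3 * slope:.1f}%")  # ~83%
print(f"From July 1 SOTA (75.2%):     {75.2 + 2 * slope:.1f}%")  # ~82%
```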
I don't believe there's a strong correlation between mathematical ability and agentic coding ability (as opposed to competition coding, where the correlation is stronger).