I agree with you that "Opus 4.5 can do anything" is overselling it, and that there's too much hype treating these models as fully autonomous software architects. I did want to note, though, that Opus 4.5 is a vast improvement and praise is warranted.
My guess is that "convert this already-written code from this representation/framework/language/factorization to this other one" may be one of the things LLMs are decent at, yep!
Agreed, I'm relying on their "localized" intelligence to get work done fast. Where Anthropic has improved their models...
None of that worked, I detect basically no change since August.
What sort of codebase are you working on? I work in a 1-million-line TypeScript codebase and Opus 4.5 has been quite a step up from Sonnet 4.5 (which in turn was a step up from the earlier Sonnet/Opus 4 series).
I wouldn't say I can leave Opus 4.5 on a loose leash by any means, but unlike with prior models, using AI agents for 80-90% of my code modifications (as opposed to in-IDE autocomplete) has actually become ROI-positive for me.
The main game changer is that Opus has simply become sma...
Yes, both model families are similar in that they do not have consistently declining accuracy in the 2-16 hour task window. The modeling is somewhat broken when you have higher accuracy in the 8-16 hour window than the 2-4 hour window.
GPT models do not have this characteristic; while the curve isn't perfect, accuracy at least drops roughly monotonically with task length (the exception is o4-mini, which also had bizarre patterns in that 2-16 hour window).
I suspect at some level heavy RLVF has broken the core METR model of performance correlating to task length.
I personally think the stronger argument here is that, if you look at the histograms, Claude models are not growing in capability in a way consistent with "higher task length = harder" (Grok 4 was similar).
Both Sonnet 4.5 and Opus 4.5 outperformed in the 8-16 hour bracket relative to the 2-4 hour one, which is highly inconsistent with the task-length difficulty model. The model appears broken at least since Sonnet 3.5, given the flatness of the 2-16 hour tasks.
You end up in a case where the 4.5 Sonnet curve has a higher % of the solved tasks under it than 4...
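To make the curve-fitting concern concrete, here is a minimal sketch of a METR-style logistic fit; the bucket accuracies below are illustrative placeholders, not METR's data:

```python
import numpy as np
from scipy.optimize import curve_fit

# METR-style model: P(success) is logistic in log2(task length).
# Bucket accuracies are illustrative; note the uptick in the longest
# bucket, mirroring the 8-16h anomaly discussed above.
lengths = np.array([4, 16, 60, 240, 480, 720])          # minutes
acc     = np.array([0.95, 0.85, 0.60, 0.40, 0.30, 0.38])

def logistic(log_len, a, b):
    return 1.0 / (1.0 + np.exp(-(a - b * log_len)))

(a, b), _ = curve_fit(logistic, np.log2(lengths), acc, p0=(3.0, 0.5))

# p50 horizon: the task length where the fitted curve crosses 50%.
p50 = 2 ** (a / b)
print(f"slope = {b:.2f}, p50 horizon ~ {p50:.0f} min")
# The non-monotonic tail flattens b; as b -> 0, 2**(a/b) explodes,
# so small artifacts in the long-task buckets can swing the headline p50.
```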
Thanks for the histograms. Is the raw data available somewhere?
Just eyeballing it:
Aligns with my sense that the model is a month, maybe two months, ahead of expectations, and that a lot of this jump (4.5 months ahead of expected) comes from artifacts of the curve fitting.
Private workspace so I can't share the session. But the approach is simple and doesn't really require seeing the session to understand.
I think we’re coming at this from different angles: you’re doing a “white-box” critique (how specific task outcomes / curve fitting affect the METR horizon), whereas I’m doing a “black-box” consistency check: is the claimed p50 result consistent with what we see on other benchmarks that should correlate with capability?
The core model is:
Bayesians are updating too much on AI capability speed from this data point, given:
I modeled all this in GPT-5.2 and the more realistic estimate for 50% derived from the other benchmarks is in the range of 190 to 210 minutes, depending on how much weight you put o...
Good response. A few things I do want to stress:
I am just not sure I believe 25%-33% behind is significant.
I personally see the lower bound as 33% slower. That's enough to shift things by 2 to 3 years, which is significant.
And again, realistically progress is even slower. The parallel compute version only increased by 1.8% in 4 months. We might be another 6 months from hitting 85% at current rates - this is quite a prediction gap.
> and knowledgeable human performance on the benchmark remains around 70%.
Is this true? They haven't u...
Claude Sonnet 4.5 scored 82% on this metric as of September 29th, 2025: three percentage points below the 85% target, achieved one month late. Again, remarkably close, particularly given that in August, Opus 4.1 was already scoring 80% on this benchmark.
I disagree this is close for several reasons.
I don't believe there's a strong correlation between mathematical ability and agentic coding tasks (as opposed to competition coding tasks where a stronger correlation exists).
+25% on SWE-bench relative to Gemini 2.5? Quadrupling Gemini 2.5's METR task length?
I suppose it's a possibility, albeit a remote one.
The SWE-bench scores are already well below the AI 2027 trend, which had to hit 85% by end of month. We're at 75% (and SOTA was ~64% when AI 2027 was released).
Very wide confidence intervals. If Grok 4 were equal to o3 in 50% time horizon, its "beating" o3 by this much is a 33% outcome. (On the other hand, losing by this amount in the 80% bucket is a 32% outcome.)
Overall, I read this as about as agentic as o3, possibly slightly less so given the lack of published SWE-bench scores for it (suggesting it wasn't SOTA).
My expectation is that GPT-5 will be a decent amount better than o3 on agentic software engineering (both in benchmarks and in practice), but won't be substantially above trend. In particular, my median is that it will have a 2.75 hour time horizon[1] on METR's evaluation suite[2]. This prediction was produced by extrapolating out the faster 2024-2025 agentic software engineering time horizon trend from o3 and expecting GPT-5 will be slightly below trend.[3]
If the correlations continue to hold, this would map to something like a 78% to 80% range on swe-ben...
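For what it's worth, the arithmetic behind that kind of median is just exponential extrapolation; a minimal sketch, where the o3 horizon, doubling time, and release gap are my assumed placeholders rather than METR's official figures:

```python
# Back-of-the-envelope horizon extrapolation. All inputs are
# assumptions for illustration, not official METR figures.
o3_horizon_hours = 1.5     # assumed o3 p50 time horizon
doubling_months  = 4.0     # assumed doubling time on the fast 2024-25 trend
gap_months       = 3.5     # assumed o3 -> GPT-5 release gap

projected = o3_horizon_hours * 2 ** (gap_months / doubling_months)
print(f"projected horizon ~ {projected:.2f} h")   # ~ 2.75 h
```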
Coding/agentic abilities are different from general chatbot abilities. Gemini is IMO the best chatbot there is (just in terms of understanding context well, if you wish to analyze text/learn things/etc.). Claude, on the other hand, is dead last among the big 3 (a steep change from a year ago), and my guess is Anthropic isn't trying much anymore (focusing on agentic coding instead).
I don't see that producing much of an update. Its SWE-bench score, as you note, was only 59.6%, which naively maps to a ~50 minute METR horizon.
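To spell out what "naively maps" means here, a sketch assuming log2(horizon) is linear in SWE-bench score, anchored on the two correspondences in this thread (59.6% ↔ ~50 min here, and the ~79% ↔ ~2.75 h pairing from the GPT-5 prediction above); the linearity is my assumption:

```python
import math

# Naive SWE-bench -> METR-horizon mapping: assume log2(horizon) is
# linear in the SWE-bench score. Anchors come from this thread;
# the linear-in-log form is an assumption for illustration.
s1, h1 = 59.6, 50.0     # 59.6% <-> ~50 min
s2, h2 = 79.0, 165.0    # ~79%  <-> ~2.75 h (165 min)

slope = (math.log2(h2) - math.log2(h1)) / (s2 - s1)  # doublings per point

def horizon_minutes(score: float) -> float:
    return h1 * 2 ** (slope * (score - s1))

print(f"{horizon_minutes(75):.0f} min")   # ~130 min at a 75% score
```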
I don't think you can just start at the HCAST timeline for software engineering and map it to IMO problems.
An alternative bearish framing: Deep Think got 50% on USAMO on May 20 (not released; lab frontier). The 80% horizon is ~4x the task time of the 50% one (at least for software engineering; not sure what it is for math), so we needed two doublings (6 months) to pull this off, and instead only had ~0.67.
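Spelling out the arithmetic (the 3-month doubling time is implied by "two doublings = 6 months"): log2(4) = 2 doublings needed, i.e. ~6 months; May 20 to the mid-July IMO is ~2 months, i.e. 2/3 ≈ 0.67 of a doubling.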
To put this into perspective, there was only an 8% chance P3 would be this easy, which puts substantial weight on the "unexpected" part being that the problem was so easy. It's also the first time in 20 years (a 5% chance) that 5 problems were of difficulty <= 25.
Indeed, knowing that Gemini 2.5 Deep Think could solve an N25 (IMO result from Gemini 2.5 Pro) and an A30 (known from the Gemini 2.5 Deep Think post), I'm somewhat less impressed. The only barriers were a medium-ish geometry problem (P2), which of course AlphaGeometry could solve, and an easy combinato...
SWE-bench pass@1 across Claude Sonnet versions has been 33.4% (June, 3.5) -> 49.0% (October) -> 62.3% (Feb, 3.7) -> 72.7% (May, 4). That's practically linear at ~3.5% gain/month, which would extrapolate to ~83% by end of August.
With the leaderboard at ~75.2% on July 1, such an extrapolation also gets us to around 82%.
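A minimal sketch of that extrapolation, using the scores quoted above (release timing is approximate):

```python
# Linear extrapolation of the Sonnet SWE-bench trend quoted above.
# Release months are approximate (Jun 2024 -> May 2025 is 11 months).
slope = (72.7 - 33.4) / 11          # ~3.57 pts/month

end_of_august = 72.7 + slope * 3    # ~3 months past the May release
print(f"{slope:.2f} pts/month -> end of August ~ {end_of_august:.1f}%")
# ~83.4%, matching the ~83% figure above
```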
If I understand correctly, you are advocating for a call-only strategy (as opposed to a (synthetic) long strategy) to achieve higher leverage than would otherwise be possible?
> This is partly for speculation, but it seems reasonable for most people with 2 years of savings to have 10% of their net worth in SPY options or 20% in SPX options [4] for hedging purposes alone.
To clarify, you mean 10% of net worth being in this specific contract (SPY280616C01000000)? So roughly 15:1 leverage using options?
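For readers following along, here is a sketch of where a figure like 15:1 can come from; every input below is a hypothetical placeholder, not a quote for SPY280616C01000000:

```python
# Effective option leverage ~ delta-adjusted notional / premium paid.
# All inputs are hypothetical placeholders, not real market quotes.
spot    = 600.0   # hypothetical underlying price
delta   = 0.30    # hypothetical delta of an out-of-the-money LEAP call
premium = 12.0    # hypothetical option price per share

leverage = delta * spot / premium
print(f"effective leverage ~ {leverage:.0f}:1")   # 15:1 with these inputs
```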
Readers should note...