These events are rare, but not unheard of. Zoom was doubling quarterly in 2021 for a short while at over a $1B run-rate. Moderna 2.5x'd in one quarter from a $7B run-rate in 2021. (Both cases show how fast revenue growth rates can collapse, albeit for different reasons, but note the common pattern of a shock driving revenue up rapidly.)
FWIW, Nvidia continues to double yearly after hitting a $100B run-rate.
We should expect a step change in model size and pretraining compute to be off trend.
The ratio between the 50% and 80% time horizons in your modeling is low at only 3; traditionally, it has been 5-6. 3 is in fact the lower bound of what should be plausible, representing a world where all subtasks of a given task have independent odds of success. (Normally we'd expect some success correlation between subtasks.)
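For what it's worth, the "about 3" floor falls out of a constant-hazard model. As a rough sketch, assume (my simplification, not METR's fitted logistic) that success on a task of length t is exp(-rate * t), i.e. every equal slice of the task fails independently at the same rate. The ratio of the 50% horizon to the 80% horizon then depends only on the two reliability levels. `horizon_ratio` is just an illustrative name:

```python
import math

def horizon_ratio(p_low: float = 0.5, p_high: float = 0.8) -> float:
    """Ratio t(p_low) / t(p_high) when P(success at length t) = exp(-rate * t).

    Solving exp(-rate * t) = p gives t = -ln(p) / rate, so the rate cancels.
    """
    return math.log(p_low) / math.log(p_high)

ratio = horizon_ratio()
print(round(ratio, 2))  # 3.11
```

Under that assumption the ratio is ln(0.5) / ln(0.8), a bit over 3, regardless of the failure rate.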
That said, I don't think swe-bench-verified is useful to infer metr data for several reasons:
If I ha...
Comparing to your post-GPT-5 update, it reads like you have shorter timelines than you did at the start of 2025? (This contrasts with the AI Futures model authors, whose timelines are now in between their initial and Q4 numbers.)
What doubling rate and deployment time to AI automation after 1-month 80% reliability are you now assuming? Naively, if I use an 80% reliability doubling time of 125 days (which is the current trendline to Opus 4.6 using logistic fixed slope), that would get us to 1-month 80% in Feb 2029. That's only about 6 months sooner than your...
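The extrapolation itself is just counting doublings. A minimal sketch, where `date_reaching`, the starting horizon, and the starting date are placeholders made up for illustration (plug in the actual measured 80% horizon and your own definition of a 1-month task in hours):

```python
import math
from datetime import date, timedelta

def date_reaching(target_hours: float, start_hours: float,
                  start_date: date, doubling_days: float = 125.0) -> date:
    """Date at which the horizon reaches target_hours under steady exponential growth."""
    doublings = math.log2(target_hours / start_hours)
    return start_date + timedelta(days=round(doublings * doubling_days))

# Placeholder inputs, NOT measured values: a 1h horizon on 2026-01-01, target 8h.
example = date_reaching(8.0, 1.0, date(2026, 1, 1))
print(example)  # 3 doublings * 125 days = 375 days after the start date
```

The result is quite sensitive to the doubling time, since every extra doubling needed shifts the date by the full doubling period.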
If I understand correctly, you are advocating for using a call only strategy (as opposed to a (synthetic) long strategy) to achieve higher leverage than would otherwise be possible?
> This is partly for speculation, but it seems reasonable for most people with 2 years of savings to have 10% of their net worth in SPY options or 20% in SPX options [4] for hedging purposes alone.
To clarify, you mean 10% of net worth being in this specific contract (SPY280616C01000000)? So roughly 15:1 leverage using options?
Readers should note...
I agree with you that "Opus 4.5 can do anything" is overselling it and there is too much hype around acting like these things are fully autonomous software architects. I did want to note though that Opus 4.5 is a vast improvement and praise is warranted.
My guess is that "convert this already-written code from this representation/framework/language/factorization to this other one" may be one of the things LLMs are decent at, yep!
Agreed, I'm relying on their "localized" intelligence to get work done fast. Where Anthropic has improved their models...
None of that worked, I detect basically no change since August.
What sort of codebase are you working on? I work in a 1 million line typescript codebase and Opus 4.5 has been quite a step up from Sonnet 4.5 (which in turn was a step up from the earlier Sonnet/Opus 4 series).
I wouldn't say I can leave Opus 4.5 on a loose leash by any means, but unlike prior models, using AI agents for 80%-90% of my code modifications (as opposed to in-IDE with autocomplete) has actually become ROI positive for me.
The main game changer is that Opus has simply become sma...
Yes, both model families are similar in that they do not have consistently declining accuracy in the 2-16 hour task window. The modeling is somewhat broken when you have higher accuracy in the 8-16 hour window than the 2-4 hour window.
GPT models do not have this characteristic; while they don't fit the curve perfectly, at least accuracy drops roughly monotonically with task length. (The exception is o4-mini, which also had bizarre patterns in that 2-16 hour window.)
I suspect at some level heavy RLVF has broken the core METR model of performance correlating to task length.
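The pattern I mean can be checked mechanically from the per-bucket accuracies. A rough sketch, with a hypothetical helper name and made-up numbers that only show the shape of the check (they are not METR data):

```python
def first_monotonicity_break(accuracy_by_bucket):
    """Return the first (shorter, longer) bucket pair where accuracy rises, else None.

    Expects a list of (bucket_label, accuracy) ordered from shortest to longest tasks.
    """
    for (b1, a1), (b2, a2) in zip(accuracy_by_bucket, accuracy_by_bucket[1:]):
        if a2 > a1:
            return (b1, b2)
    return None

# Illustrative values only, NOT real results.
illustrative = [("2-4h", 0.30), ("4-8h", 0.28), ("8-16h", 0.35)]
print(first_monotonicity_break(illustrative))  # ('4-8h', '8-16h')
```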
I personally think the stronger argument here is that, if you look at the histograms, Claude models are not growing in capability in a way consistent with "higher task length = harder". (Grok 4 was similar.)
Both Sonnet 4.5 and Opus 4.5 were outperforming in the 8 to 16 hour bracket relative to the 2 to 4 hour bracket, which is highly inconsistent with the task length difficulty model. The model appears broken at least since 3.5 Sonnet, given the flatness of the 2-16 hour tasks.
You end up in a case where the 4.5 Sonnet curve has a higher % of the solved tasks under it than 4...
Thanks for the histograms. Is the raw data available somewhere?
Just eyeballing it:
This aligns with my sense that the model is a month, maybe 2 months, ahead of what is expected, and that a lot of this jump (4.5 months ahead of expected) comes from artifacts of the curve fitting.
Private workspace so I can’t share the session. But the approach is simple and doesn’t really require it to understand.
I think we’re coming at this from different angles: you’re doing a “white-box” critique (how specific task outcomes / curve fitting affect the METR horizon), whereas I’m doing a “black-box” consistency check: is the claimed p50 result consistent with what we see on other benchmarks that should correlate with capability?
The core model is:
Bayesians are updating too much on AI capability speed from this data point, given:
I modeled all this in GPT-5.2 and the more realistic estimate for 50% derived from the other benchmarks is in the range of 190 to 210 minutes, depending on how much weight you put o...
Good response. A few things I do want to stress:
I am just not sure I believe 25%-33% behind is significant.
I personally see the lower bound as 33% slower. That's enough to turn 2 years into 3, which is significant.
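To spell out the arithmetic, here is a small sketch. It assumes "33% slower" means the rate of progress drops by a third (if you instead read it as "takes 33% longer", you get about 2.7 years); `stretched_timeline` is an illustrative name:

```python
def stretched_timeline(years: float, slowdown: float) -> float:
    """Years needed for the same progress when the rate is reduced by `slowdown`.

    slowdown = 1/3 means progress runs at two thirds of the forecast rate.
    """
    return years / (1.0 - slowdown)

print(round(stretched_timeline(2.0, 1 / 3), 2))  # 3.0
```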
And again, realistically progress is even slower. The parallel compute version only increased by 1.8% in 4 months. We might be another 6 months from hitting 85% at current rates - this is quite a prediction gap.
and knowledgeable human performance on the benchmark remains around 70%.
Is this true? They haven't u...
Claude Sonnet 4.5 scored an 82% on this metric, as of September 29th, 2025. Three percentage points below the 85% target, achieved one month late, again, remarkably close. Particularly given that in August, Opus 4.1 was already scoring 80% on this benchmark.
I disagree this is close for several reasons.
I don't believe there's a strong correlation between mathematical ability and agentic coding tasks (as opposed to competition coding tasks where a stronger correlation exists).
+ 25% for swe-bench relative to Gemini 2.5? Quadrupling the METR task length of Gemini 2.5?
I suppose it's a possibility, albeit a remote one.
The swe-bench scores are already well below the AI 2027 trend. The forecast needed 85% by the end of the month; we're at 75%. (And SOTA was ~64% when they released AI 2027.)
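Put as a fraction of the forecast gain, using only the three numbers above (the helper name is just for illustration):

```python
def fraction_of_gap_closed(start: float, current: float, target: float) -> float:
    """Share of the forecast improvement (start -> target) actually achieved so far."""
    return (current - start) / (target - start)

# ~64% at release, 75% now, 85% forecast.
closed = fraction_of_gap_closed(64.0, 75.0, 85.0)
print(f"{closed:.0%}")  # 52%
```

So roughly half of the predicted improvement has materialized, taking the ~64% starting point at face value.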
Very wide confidence intervals. If Grok 4 were equal to o3 in 50% time horizon, it "beating" o3 by this much would be a 33% outcome. (On the other hand, losing by this amount in the 80% bucket is a 32% outcome.)
Overall, I read this as about as agentic as o3, possibly slightly less so given the lack of published swe-bench scores for it (suggesting it wasn't SOTA).
My expectation is that GPT-5 will be a decent amount better than o3 on agentic software engineering (both in benchmarks and in practice), but won't be substantially above trend. In particular, my median is that it will have a 2.75 hour time horizon[1] on METR's evaluation suite[2]. This prediction was produced by extrapolating out the faster 2024-2025 agentic software engineering time horizon trend from o3 and expecting GPT-5 will be slightly below trend.[3]
If the correlations continue to hold, this would map to something like a 78% to 80% range on swe-ben...
The markets aren't pointing in the direction of transformative AI (long-term bond yields, etc.).
They are pointing in the direction of AI being very significant in the economy.