Aaron Staley
Aaron Staley has not written any posts yet.

I agree with you that "Opus 4.5 can do anything" is overselling it, and there is too much hype around treating these things as fully autonomous software architects. I did want to note, though, that Opus 4.5 is a vast improvement and praise is warranted.
My guess is that "convert this already-written code from this representation/framework/language/factorization to this other one" may be one of the things LLMs are decent at, yep!
Agreed; I'm relying on their "localized" intelligence to get work done fast. Where Anthropic has improved their models significantly this year is A) improving task "planning", e.g. how to extract the relevant context needed to make decisions LLMs broadly already could make,... (read more)
None of that worked; I detect basically no change since August.
What sort of codebase are you working on? I work in a 1-million-line TypeScript codebase, and Opus 4.5 has been quite a step up from Sonnet 4.5 (which in turn was a step up from the earlier Sonnet/Opus 4 series).
I wouldn't say I can leave Opus 4.5 on a loose leash by any means, but unlike with prior models, using AI agents for 80-90% of my code modifications (as opposed to working in-IDE with autocomplete) has actually become ROI-positive for me.
The main game changer is that Opus has simply become smarter about working with large codebases: fewer hallucinated methods,... (read more)
Yes, both model families are similar in that they do not show consistently declining accuracy in the 2-16 hour task window. The modeling is somewhat broken when you have higher accuracy in the 8-16 hour window than in the 2-4 hour window.
GPT models do not have this characteristic; while the curve isn't perfect, accuracy at least drops roughly monotonically with task length (the exception is o4-mini, which also had bizarre patterns in that 2-16 hour window).
I suspect that, at some level, heavy RLVF has broken the core METR model of performance correlating with task length.
I personally think the stronger argument here is that, if you look at the histograms, Claude models are not growing in capability in a way consistent with higher task length = harder (Grok 4 was similar).
Both Sonnet 4.5 and Opus 4.5 performed better in the 8-16 hour bracket than in the 2-4 hour bracket, which is highly inconsistent with the task-length-difficulty model. The model appears broken at least since Sonnet 3.5, given the flatness over the 2-16 hour tasks.
You end up in a case where the 4.5 Sonnet curve has a higher % of the solved tasks under it than 4.5 Opus (note how 4.5 Opus gets 0 tasks right in... (read more)
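To make the concern concrete, here is a minimal sketch of the kind of logistic horizon fit being discussed, run on made-up per-bucket success rates (none of this is METR's actual data or code); the fitted 50% horizon is the task length where the curve crosses 50%:

```python
# Illustrative sketch: fit success rate vs. log(task length) and read off the 50% horizon.
# Made-up data; not METR's actual tasks or fitting pipeline.
import numpy as np
from scipy.optimize import curve_fit

def logistic(log_minutes, log_h50, slope):
    # P(success) as a function of log task length; log_h50 is the 50% horizon in log-minutes
    return 1.0 / (1.0 + np.exp(slope * (log_minutes - log_h50)))

# Hypothetical success rates by task-length bucket (minutes), with a non-monotone tail
lengths = np.array([2, 5, 15, 30, 60, 120, 240, 480, 960])
success = np.array([1.0, 1.0, 0.9, 0.8, 0.6, 0.5, 0.3, 0.35, 0.4])

params, _ = curve_fit(logistic, np.log(lengths), success, p0=[np.log(60), 1.0])
h50 = np.exp(params[0])
print(f"Fitted 50% horizon: {h50:.0f} minutes")
```

If the empirical success rates are flat or rising across the 2-16 hour buckets (as in the fake tail above), the fitted horizon ends up driven by curve-fitting artifacts rather than a real capability cliff, which is the concern here.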
Thanks for the histograms. Is the raw data available somewhere?
Just eyeballing it:
it aligns with my sense that the model is a month, maybe two months, ahead of what was expected, and that a lot of this jump (4.5 months ahead of expected) comes from artifacts of the curve fitting.
Private workspace, so I can’t share the session. But the approach is simple and doesn’t really require the session to understand it.
I think we’re coming at this from different angles: you’re doing a “white-box” critique (how specific task outcomes / curve fitting affect the METR horizon), whereas I’m doing a “black-box” consistency check: is the claimed p50 result consistent with what we see on other benchmarks that should correlate with capability?
The core model is:
Bayesians are updating too much on AI capability speed from this data point, given:
I modeled all this in GPT-5.2, and the more realistic estimate for the 50% horizon, derived from the other benchmarks, is in the range of 190 to 210 minutes, depending on how much weight you put on the accuracy jump (impressive, but not to the degree of the 50% claim). The 80% is likely... (read more)
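A stripped-down sketch of what that consistency check looks like (every benchmark name, score, and weight below is a placeholder, not the numbers actually used):

```python
# Illustrative cross-benchmark consistency check for a claimed 50% horizon.
# All inputs are placeholders; the real check used different benchmarks and weights.
import numpy as np

# Hypothetical horizon estimates (minutes) implied by other correlated benchmarks,
# weighted by how much each is trusted as a proxy for the METR-style horizon.
implied_horizons = {"benchmark_A": 170, "benchmark_B": 220, "benchmark_C": 200}
weights = {"benchmark_A": 0.4, "benchmark_B": 0.3, "benchmark_C": 0.3}

# Horizons are multiplicative quantities, so combine with a weighted geometric mean.
log_est = sum(w * np.log(implied_horizons[k]) for k, w in weights.items())
consensus = float(np.exp(log_est))
print(f"Consensus 50% horizon: {consensus:.0f} minutes")

claimed = 290  # hypothetical claimed p50 horizon, minutes
print(f"Claimed / consensus ratio: {claimed / consensus:.2f}")
```

The point is only the shape of the check: turn each correlated benchmark into an implied horizon, combine them, and ask whether the headline p50 sits inside that range.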
Good response. A few things I do want to stress:
I am just not sure I believe 25%-33% behind is significant.
I personally see the lower bound as 33% slower. That's enough to shift timelines by 2 to 3 years, which is significant.
And again, realistically progress is even slower. The parallel-compute version only increased by 1.8% in 4 months. At current rates we might be another 6 months from hitting 85%; this is quite a prediction gap.
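The back-of-envelope version, treating the current parallel-compute score as an assumption:

```python
# Rough extrapolation; the current parallel-compute score is an assumed placeholder.
current = 82.3               # assumed current score (%), not an official figure
target = 85.0
rate_per_month = 1.8 / 4.0   # ~0.45 points/month, from the 1.8-point gain over 4 months
months_to_target = (target - current) / rate_per_month
print(f"~{months_to_target:.0f} months to {target}% at the current rate")  # ~6 months
```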
> and knowledgeable human performance on the benchmark remains around 70%.
Is this true? They haven't updated their abstract claiming 72.36% (which was from the old version) and I'm wondering if they simply haven't re-evaluated.
But yes, looking... (read more)
> Claude Sonnet 4.5 scored an 82% on this metric, as of September 29th, 2025. Three percentage points below the 85% target, achieved one month late, again, remarkably close. Particularly given that in August, Opus 4.1 was already scoring 80% on this benchmark.
I disagree this is close for several reasons.
If I understand correctly, you are advocating using a call-only strategy (as opposed to a (synthetic) long strategy) to achieve higher leverage than would otherwise be possible?
> This is partly for speculation, but it seems reasonable for most people with 2 years of savings to have 10% of their net worth in SPY options or 20% in SPX options [4] for hedging purposes alone.
To clarify, you mean 10% of net worth being in this specific contract (SPY280616C01000000)? So roughly 15:1 leverage using options?
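For concreteness, the leverage arithmetic looks roughly like this (all of the market inputs below are placeholders, not quotes for that contract):

```python
# Rough leverage arithmetic for a long-dated OTM call; all inputs are placeholders,
# not actual quotes for SPY280616C01000000.
spot = 675.0    # assumed SPY price
premium = 45.0  # assumed option premium per share
delta = 0.35    # assumed option delta

notional_leverage = spot / premium            # underlying exposure per premium dollar
effective_leverage = delta * spot / premium   # delta-adjusted (what you feel day to day)

allocation = 0.10  # 10% of net worth in the calls
print(f"Notional leverage on the option: {notional_leverage:.1f}x")          # ~15x here
print(f"Delta-adjusted leverage on the option: {effective_leverage:.1f}x")   # ~5x here
print(f"Portfolio-level notional exposure: {allocation * notional_leverage:.2f}x net worth")
```

Whether "15:1" refers to the notional or the delta-adjusted number changes the picture a lot, so it's worth pinning down.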
Readers should note this has very strong returns if you get that 50%+ return, but it isn't straight leverage; the median outcome here is about a 12.7% reduction in portfolio value over the next 2.5 years relative to pure SPY.
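To see where a number like that could come from, here is a Monte Carlo sketch under placeholder assumptions (the drift, volatility, premium, and portfolio weights below are made up, not the inputs behind the 12.7% figure):

```python
# Monte Carlo sketch of the "median drag": a 90% SPY / 10% long-dated OTM call portfolio
# vs. 100% SPY. Every distributional and market input is an assumption, not a calibration.
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
years = 2.5

spot, strike = 675.0, 1000.0  # assumed SPY price and the contract's strike
premium = 45.0                # assumed premium per share
mu, sigma = 0.07, 0.16        # assumed SPY log-return drift and volatility

# Lognormal terminal SPY prices at expiry
terminal = spot * np.exp((mu - 0.5 * sigma**2) * years
                         + sigma * np.sqrt(years) * rng.standard_normal(n))

spy_only = terminal / spot                                # growth of $1 in pure SPY
call_mult = np.maximum(terminal - strike, 0.0) / premium  # growth of $1 in the calls at expiry
mixed = 0.9 * spy_only + 0.1 * call_mult                  # 90/10 portfolio

rel = np.median(mixed) / np.median(spy_only) - 1.0
print(f"Median relative outcome vs pure SPY: {rel:+.1%}")
```

Under these placeholder assumptions the drag comes out in the same ballpark; the exact figure obviously depends on the drift, volatility, and premium you assume.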