Aaron Staley

Comments (sorted by newest)
Checking in on AI-2027
Aaron Staley · 19d

Good response. A few things I do want to stress:

> I am just not sure I believe 25%-33% behind is significant.

I personally see the lower bound as 33% slower. That's enough to turn 2 years into 3, which is significant.
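To spell out why that matters, here is a minimal sketch of the scaling; the ~2-year figure is just the one from the sentence above:

```python
# If progress runs a third slower, covering the same ground takes
# 1 / (1 - 1/3) = 1.5x as long, so ~2 years stretches to ~3 years.
slowdown = 1 / 3
time_multiplier = 1 / (1 - slowdown)        # 1.5x
print(f"{2 * time_multiplier:.1f} years")   # ~3.0 years
```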

And again, realistically progress is even slower. The parallel-compute number only increased by 1.8 percentage points in 4 months. At current rates we might be another 6 months from hitting 85%, which is quite a prediction gap.

> and knowledgeable human performance on the benchmark remains around 70%.

Is this true?  They haven't updated their abstract claiming 72.36% (which was from the old version) and I'm wondering if they simply haven't re-evaluated.    

But yes, looking at the GTA1 paper, you are correct that performance varies a bit between OSWorld and OSWorld-Verified, so I take back the claim that growth is obviously slower than projected.

All that said, I trust SWE-bench Verified more for tracking progress regardless:

  1. We're relying on a well-made benchmark, created as a second validation pass by OpenAI over the original. OSWorld is not that.
  2. Labs seem to be targeting it more, and low-hanging fruit like attaching Python interpreters just doesn't exist for this benchmark (I'm not sure whether the AI-2027 authors considered this issue when making their OSWorld predictions).
  3. We are concerned mainly with coding abilities (automating AI research) on the AI-2027 timelines.
Reply
Checking in on AI-2027
Aaron Staley · 21d

> Claude Sonnet 4.5 scored an 82% on this metric, as of September 29th, 2025. Three percentage points below the 85% target, achieved one month late, again, remarkably close. Particularly given that in August, Opus 4.1 was already scoring 80% on this benchmark.

 

I disagree that this is close, for several reasons.

  1. It isn't clear that the "parallel test-time" number even counts.
    1. My understanding is that these targets aren't supposed to be hit with mechanisms that cost more in compute than having a human do the task manually, and we have no idea how many parallel attempts are sampled; they used up to 256 in their post on GPQA.
    2. It uses an internal scoring model that might not generalize beyond the repos SWE-bench tests.
    3. Sonnet 3.7's 70.3% score did not exist on swebench.com at the point AI-2027 was released (the highest listed was 65.4%), suggesting the authors were not anchoring on that parallel test-time number to begin with.
  2. If parallel test time does count, the projection is still not close (see the sketch below):
    1. The projection called for +15 percentage points over 5 months (i.e., 85% by the beginning of September); instead we got +12 points over 6 months. That's 33% slower growth (2 points per month vs. the projected 3 points per month).
    2. Looking more recently, the growth from May's Sonnet 4 with parallel compute to now (4 months later) has been 1.8 points. At that rate, assuming linearity, 85% won't be crossed for nearly 7 months from now, which is over 60% slower than the projection.
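To make the arithmetic explicit, here is a minimal sketch using the numbers cited in this comment; the +15-points-in-5-months projection is my reading of the AI-2027 forecast, not an official restatement:

```python
# Projected vs. realized SWE-bench Verified growth (percentage points).
projected_gain_pts = 15    # my reading of the AI-2027 projection (~70% -> 85%)
projected_months = 5       # release of AI-2027 to the beginning of September
actual_gain_pts = 12       # roughly 70.3% -> 82%
actual_months = 6          # release of AI-2027 to the end of September

projected_rate = projected_gain_pts / projected_months   # 3.0 points/month
actual_rate = actual_gain_pts / actual_months            # 2.0 points/month
slowdown = 1 - actual_rate / projected_rate              # ~0.33

# Recent trend: +1.8 points over the 4 months since Sonnet 4 (parallel compute),
# starting from the current ~82%.
recent_rate = 1.8 / 4                                    # 0.45 points/month
months_to_85 = (85 - 82) / recent_rate                   # ~6.7 months

print(f"slowdown vs. projection: {slowdown:.0%}")
print(f"months until 85% at the recent rate: {months_to_85:.1f}")
```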

 


> Claude Sonnet 4.5 scored a 62% on this metric, as of September 29th, 2025.

 

For OSWorld, these aren't even the same benchmark. AI-2027 referred to the original OSWorld, while the Sonnet 4.5 score of 61.4% is for OSWorld-Verified. That's a huge difference: Sonnet 3.7 scored 28% on the original OSWorld but 35.8% on OSWorld-Verified. On the original OSWorld, today's SOTA is probably more like 55.6% (GTA1 with GPT-5), a huge miss (~46% slower than projected).


Overall, the realized data suggests something more like an AI-2029 timeline, or even later.

Reply
tdko's Shortform
Aaron Staley · 3mo

I don't believe there's a strong correlation between mathematical ability and performance on agentic coding tasks (as opposed to competition coding, where a stronger correlation exists).

  1. Gemini 2.5 Pro was already well ahead of o3 on IMO, but had worse SWE-bench/METR scores.
  2. Claude is relatively bad at math but has hovered around SOTA on agentic coding.
Reply
tdko's Shortform
Aaron Staley · 3mo

+25% on SWE-bench relative to Gemini 2.5? Quadrupling the METR task length of Gemini 2.5?

I suppose it's a possibility, albeit a remote one.

Reply
tdko's Shortform
Aaron Staley · 3mo

The SWE-bench scores are already well below the AI-2027 trend. It had to hit 85% by the end of the month, and we're at 75% (SOTA was ~64% when AI-2027 was released).

Reply
nikola's Shortform
Aaron Staley · 3mo

Very wide confidence intervals. If Grok 4 were equal to o3 on the 50% time horizon, it "beating" o3 by this much would be a 33% outcome. (On the other hand, losing by this amount on the 80% time horizon would be a 32% outcome.)

Overall, I read this as about as agentic as o3, and possibly slightly less so given that no SWE-bench scores were published for it (suggesting it wasn't SOTA).

Reply
ryan_greenblatt's Shortform
Aaron Staley · 3mo

> My expectation is that GPT-5 will be a decent amount better than o3 on agentic software engineering (both in benchmarks and in practice), but won't be substantially above trend. In particular, my median is that it will have a 2.75 hour time horizon[1] on METR's evaluation suite[2]. This prediction was produced by extrapolating out the faster 2024-2025 agentic software engineering time horizon trend from o3 and expecting GPT-5 will be slightly below trend.[3]


If the correlations continue to hold, this would map to something like a 78-80% range on SWE-bench pass@1 (which is likely to be announced at release). I'm personally not this bearish (I'd guess low 80s, given that the benchmark has reliably jumped ~3.5 points monthly), but we shall see.
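For a sense of where "low 80s" comes from, here is a minimal back-of-the-envelope sketch; the ~75% starting point and the roughly two months until GPT-5's release are my assumptions, not figures from the quoted prediction:

```python
# Hypothetical extrapolation. The ~3.5 points/month rate is the one cited above;
# the ~75% current SOTA and ~2 months until release are assumptions.
current_sota = 75.0        # assumed SWE-bench Verified SOTA at time of writing
monthly_jump = 3.5         # points per month, as cited above
months_until_release = 2   # assumed gap until GPT-5 ships

projected = current_sota + monthly_jump * months_until_release
print(f"naive extrapolation for GPT-5: {projected:.0f}%")   # ~82%, i.e. low 80s
```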

Needless to say, if it scores 80%, we are well below the AI-2027 timeline predictions with high confidence.

Reply
tdko's Shortform
Aaron Staley · 3mo

Agentic coding abilities are different from general chatbot abilities. Gemini is, IMO, the best chatbot there is (just in terms of understanding context well when you want to analyze text, learn things, etc.). Claude, on the other hand, is dead last among the big 3 (a steep change from a year ago), and my guess is Anthropic isn't trying much there anymore (focusing on... agentic coding instead).

Reply
tdko's Shortform
Aaron Staley · 3mo

I don't see that producing much of an update. Its SWE-bench score, as you note, was only 59.6%, which naively maps to a ~50-minute METR time horizon.

Reply
james oofou's Shortform
Aaron Staley · 3mo

I don't think you can just take the HCAST time-horizon trend for software engineering and map it onto IMO problems.

An alternative bearish framing: Deep Think got 50% on USAMO on May 20 (not released, lab frontier). 80%-success tasks are ~4x the length of 50%-success ones (at least for software engineering; not sure what the ratio is for math), so we needed two doublings (6 months) to pull this off and instead only had ~0.67 of a doubling (worked through below).
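A minimal sketch of that doubling arithmetic; the 4x length ratio and the implied ~3-month doubling time come from the sentence above, while the ~2 months of elapsed time since May 20 is my assumption:

```python
import math

# The 4x ratio between 80%- and 50%-success task lengths and the ~3-month
# doubling time are from the comment above; ~2 elapsed months is an assumption.
ratio_80_to_50 = 4
doubling_time_months = 3
elapsed_months = 2

doublings_needed = math.log2(ratio_80_to_50)                 # 2.0 doublings
months_needed = doublings_needed * doubling_time_months      # 6 months
doublings_available = elapsed_months / doubling_time_months  # ~0.67

print(doublings_needed, months_needed, round(doublings_available, 2))
```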

Reply