LESSWRONG

tdko's Shortform

by tdko
1st Jul 2025
1 min read
This is a special post for quick takes by tdko. Only they can create top-level comments. Comments here also appear on the Quick Takes page and All Posts page.
22 comments, sorted by top scoring
[-] tdko · 1mo · 36

METR's task-horizon score for GPT-5 is 2h17m at 50% success. For comparison, o3 was at 1h32m and Grok 4 (the prior SOTA) at 1h50m. At 80% success, GPT-5 scores 25m; the prior SOTA was 20m, shared by o3 and Claude Opus 4.

https://metr.github.io/autonomy-evals-guide/gpt-5-report/
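As a quick sanity check on what these numbers imply for the trend, here's a sketch that backs out the doubling time from the two 50%-success horizons above (the ~4-month gap between the o3 and GPT-5 releases is my assumption, not a figure from the report):

```python
import math

# 50%-success task horizons (minutes), from METR's numbers above
o3_horizon = 92      # 1h32m
gpt5_horizon = 137   # 2h17m
months_between = 4   # assumed gap between the two releases

# Under a plain exponential trend, horizon = base * 2**(t / doubling_time),
# so the implied doubling time follows from the growth factor.
growth = gpt5_horizon / o3_horizon
doubling_time = months_between * math.log(2) / math.log(growth)
print(f"growth factor: {growth:.2f}x over {months_between} months")
print(f"implied doubling time: {doubling_time:.1f} months")
```

On that assumed gap this works out to roughly a 7-month doubling time, i.e. close to a plain exponential fit rather than an acceleration.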

[-] β-redex · 1mo · 16

This 25m 80%-success time horizon seems like strong evidence against the superexponential model from AI 2027. On that graph, the superexponential line shows 4h at the end of 2025. I expect GPT-5 to be the biggest model release of the year, and I don't see how we would get a model with 8x GPT-5's time horizon this year.
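To put numbers on that intuition: here's a sketch of how long a fixed-doubling-time exponential would take to go from GPT-5's 25m (at 80% success) to the 4h the superexponential line shows for the end of 2025 (the 7-month doubling time is my assumption, roughly METR's long-run estimate, not a figure from this thread):

```python
import math

current = 25        # GPT-5's 80%-success horizon, minutes
target = 4 * 60     # the superexponential line's end-of-2025 value, minutes
doubling_time = 7   # months per doubling (assumed)

doublings = math.log2(target / current)
months = doublings * doubling_time
print(f"{doublings:.1f} doublings needed -> ~{months:.0f} months "
      f"at a {doubling_time}-month doubling time")
```

That's on the order of two years under a steady exponential, which is the sense in which hitting 4h this year would require a sharp acceleration.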

[-] Aaron Staley · 1mo · 20

The SWE-bench scores are already well below the trend from AI 2027: it had them hitting 85% by the end of the month, and we're at 75% (SOTA was ~64% when AI 2027 was released).

[-] Bitnotri · 1mo · 8

Gemini 3 should drop by the end of the month; we might hit that.

[-] Aaron Staley · 1mo · 7

+25% on SWE-bench relative to Gemini 2.5? Quadrupling Gemini 2.5's METR task length?

I suppose it's a possibility, albeit a remote one.

[-] O O · 1mo · 6

It seems Gemini was ahead of OpenAI on the IMO gold. Its output was more polished, so presumably they achieved a gold-worthy model earlier. I thus expect Gemini's SWE-bench score to be at least ahead of OpenAI's 75%.

[-] Aaron Staley · 1mo · 5

I don't believe there's a strong correlation between mathematical ability and agentic coding tasks (as opposed to competition coding, where a stronger correlation exists).

  1. Gemini 2.5 Pro was already well ahead of o3 on IMO, but had worse SWE-bench/METR scores.
  2. Claude is relatively bad at math but has hovered around SOTA on agentic coding.
[-] elifland · 1mo · 19

I overall agree that things seem to be going slower than AI 2027 (and my median was longer when it came out).

However, as mentioned in the caption, the green curve is a simplified version of our original timelines model. Apologies for that; I think it's reasonable to judge us based on it.

FWIW though, the central superexponential Mar 2027 trajectory from our original model is certainly not strongly contradicted by GPT-5, either with or without an AI R&D speedup interpolation issue fixed.

The original model, filtered for superexponential (pre-AI-R&D-automation) trajectories that reach superhuman coder in 2027:

With AI R&D speedup bug fixed, also filtered for superexponential pre-AI-R&D-automation (backcast looks much better, GPT-5 prediction slightly worse):

Either way, we're now working on a much-improved model, which will likely come with an interactive web app that improves on this static graph: for example, you'll be able to try various parameter settings and see what time-horizon trajectories they generate and how consistent they are with future data points.

Note also that the above trajectories are from the original model, not the May update model, which we unfortunately aren't taking the time to recreate, for various reasons. We think it would likely look a little worse in terms of the GPT-5 fit, though that might depend on how you filter for which trajectories count as superexponential.

[-] bhalstead · 1mo* · 7

Registering that I don't expect GPT-5 to be "the biggest model release of the year," for various reasons. I would guess (based on the cost and speed) that the model is GPT-4.1-sized. Conditional on this, the total training compute is likely to be below the state of the art.

[-] Alice Blair · 1mo · 1

How did you determine the cost and speed of it, given that there is no unified model that we have access to, just some router between models? Unless I'm just misunderstanding something about what GPT-5 even is.

[-] Josh You · 1mo · 1

The router is only on ChatGPT, not the API, I believe. And it switches between two models of the same size and cost (GPT-5 with thinking and GPT-5 without thinking).

[-] bodry · 1mo · 3

For reference, the 95% CI is 1–4.5 hours at 50% success and 8–65 minutes at 80%.

[-] tdko · 1mo · 30

METR has finally tested Gemini 2.5 Pro (June preview) and found its 50%-success task horizon is only 39 minutes, far worse than o3 and Opus 4, which are at 90 and 80 minutes respectively. This probably shouldn't be a gigantic update, given that 2.5 Pro never scored amazingly on SWE-bench, but it's still worse than I expected given how good the model is otherwise.

[-] Garrett Baker · 1mo · 4

This is interesting; Gemini 2.5 Pro has recently become my favorite model, especially over Opus (this from a long-time Claude user). I would not be surprised if I like it so much because of its lower task horizon, since it's the one model I trust not to be uselessly sycophantic right now.

[-] Aaron Staley · 1mo · 5

Agentic coding abilities are different from general chatbot abilities. Gemini is IMO the best chatbot there is (just in terms of understanding context well, if you wish to analyze text, learn things, etc.). Claude, on the other hand, is dead last among the big 3 (a steep change from a year ago), and my guess is Anthropic isn't trying much anymore (focusing on agentic coding instead).

[-] Garrett Baker · 1mo · 5

Hm, I notably would not trust Claude to agentically code for me either. I went from heavily using Claude Code to occasionally asking Gemini questions, and I think that has been a big improvement.

Given METR's other work, the obvious hypothesis is that Claude Code is mostly just better at manipulating me into thinking it can easily do what I want.

[-] uugr · 1mo · 1

What's the correlation between task horizon and useless sycophancy?

[-] Garrett Baker · 1mo · 4

I don’t know; subjectively it seems large, and it seems plausible they could be related.

[-] Aaron Staley · 1mo · 3

I don't see that producing much of an update. Its SWE-bench score, as you note, was only 59.6%, which naively maps to ~50 minutes on METR.

[-] Cole Wyeth · 1mo · 2

I still think it’s comforting to observe that the task lengths are not increasing as quickly as feared. 

This is as I predicted so far, but we’ll see about GPT-5.

[-] tdko · 2mo · 13

METR's task-length horizon analysis for Claude Opus 4 is out. The 50%-success horizon is 80 minutes, slightly worse than o3's 90 minutes. The 80%-success horizon is tied with o3's at 20 minutes.

https://x.com/METR_Evals/status/1940088546385436738

[-] Cole Wyeth · 2mo · 6

That looks like (minor) good news… it appears more consistent with the slower trendline from before reasoning models. Is Claude Opus 4 using a comparable amount of inference-time compute to o3?

I believe I predicted that models would fall behind even the slower exponential trendline (before inference time scaling) - before reaching 8-16 hour tasks. So far that hasn’t happened, but obviously it hasn’t resolved either. 
