
tdko

Posts

tdko's Shortform · 4mo

Comments

tdko's Shortform
tdko · 16d

Claude Sonnet 4.5's 50% task horizon is 1 hr 53 min, putting it slightly behind GPT-5's 2 hr 15 min score.

https://x.com/METR_Evals/status/1976331315772580274

t14n's Shortform
tdko · 1mo

Note that they also claimed Opus 4 worked for over seven hours, but it only scored 1h20m on the METR task suite. I wouldn't be surprised to see Sonnet 4.5 get a strong METR score, but 30 hours definitely isn't likely.

METR Research Update: Algorithmic vs. Holistic Evaluation
tdko · 2mo

Surprised that this hasn't gotten more discussion. There are some potentially big implications for the time horizons study, which has become fairly load-bearing in timelines discourse.

tdko's Shortform
tdko · 3mo

METR's task-horizon score for GPT-5 is 2h17m at 50% success. For comparison, o3 was 1h32m and Grok 4 (the prior SOTA) was 1h50m. The 80% success score is 25m; the prior SOTA was 20m, shared by o3 and Claude 4 Opus.

https://metr.github.io/autonomy-evals-guide/gpt-5-report/
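
For readers unfamiliar with the metric: the 50% and 80% horizons come from fitting a logistic curve of task success probability against log task length and reading off where the fitted curve crosses each success rate. A minimal sketch of that readout, with made-up data rather than METR's actual code or results:

```python
# Minimal sketch (made-up data, not METR's code) of how a 50%/80% task horizon
# is read off a logistic fit of success probability vs. log task length.
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical (task length in minutes, success) outcomes for one model.
lengths = np.array([1, 2, 4, 8, 15, 30, 60, 120, 240, 480])
success = np.array([1, 1, 1, 1, 1, 1, 0, 1, 0, 0])

# Fit P(success) = sigmoid(b0 + b1 * log2(length)).
X = np.log2(lengths).reshape(-1, 1)
fit = LogisticRegression().fit(X, success)
b0, b1 = fit.intercept_[0], fit.coef_[0, 0]

def horizon(p):
    """Task length (minutes) at which the fitted success probability equals p."""
    return 2 ** ((np.log(p / (1 - p)) - b0) / b1)

print(f"50% horizon ~{horizon(0.5):.0f} min, 80% horizon ~{horizon(0.8):.0f} min")
```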

ryan_greenblatt's Shortform
tdko · 3mo

> I expect below trend rather than above trend due to some early reports about GPT-5

Which reports, specifically?

nikola's Shortform
tdko · 3mo

Did we ever get any clarification as to whether Grok 4 did in fact use as much compute on posttraining as pretraining? 

tdko's Shortform
tdko · 3mo

METR has finally tested Gemini 2.5 Pro (June Preview) and found its 50% success task horizon is only 39 minutes, far worse than o3 or Opus 4, which are at 90 and 80 minutes respectively. Probably shouldn't be a gigantic update, given that 2.5 Pro never scored amazingly on SWE-bench, but it's still worse than I expected given how good the model is otherwise.

james oofou's Shortform
tdko · 3mo

I feel like looking at unreleased models for doubling time mucks things up a bit. For instance, I'm assuming the unreleased o3 model from December had a significantly longer time horizon in math than the released o3, given its much higher scores on FrontierMath, etc. A sketch of the underlying arithmetic is below.
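
To make the point concrete, here is the back-of-the-envelope arithmetic a doubling-time estimate rests on; which (model, date) pairs you plug in, released or unreleased, can move the answer a lot. The dates and horizon values below are approximate, taken from public release dates and the METR numbers quoted above, purely for illustration:

```python
# Back-of-the-envelope doubling time of the 50% task horizon between two
# (release date, horizon) points. Illustrative numbers only.
from datetime import date
from math import log2

def doubling_time_days(d1: date, h1_min: float, d2: date, h2_min: float) -> float:
    """Days per doubling of the horizon, assuming exponential growth between the points."""
    return (d2 - d1).days / log2(h2_min / h1_min)

# o3 (~92 min, released April 2025) vs. GPT-5 (~137 min, released August 2025).
print(round(doubling_time_days(date(2025, 4, 16), 92, date(2025, 8, 7), 137)), "days per doubling")
```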

OpenAI Claims IMO Gold Medal
tdko · 3mo

Worth noting that this year's Problem 3 was really easy; Gemini 2.5 Pro even got it some of the time, and Grok 4 Heavy and Gemini Deep Think have solved problems rated as harder. Still an achievement, though.

From the author of the Epoch article:

https://x.com/GregHBurnham/status/1946655635400950211

https://x.com/GregHBurnham/status/1946725960557949227

https://x.com/GregHBurnham/status/1946567312850530522

tdko's Shortform
tdko · 4mo

METR's task-length horizon analysis for Claude 4 Opus is out. The 50% task-success horizon is 80 minutes, slightly worse than o3's 90 minutes. The 80% horizon is tied with o3 at 20 minutes.

https://x.com/METR_Evals/status/1940088546385436738
