METR Research Update: Algorithmic vs. Holistic Evaluation

by David Rein
13th Aug 2025

This is a linkpost for https://metr.org/blog/2025-08-12-research-update-towards-reconciling-slowdown-with-time-horizons/

7 comments

tdko · 2mo

Surprised that this hasn't gotten more discussion. There are some potentially big implications for the time horizons study, which has become fairly load-bearing in timelines discourse.

Daniel Kokotajlo · 2mo

Yeah, this is an update against basing timelines heavily on the time horizon extrapolation.

I still think it's the best single piece of evidence / best benchmark extrapolation that we have though? Curious for your thoughts.

lc · 2mo

Yup, this is one of the largest problems with using AI solutions today.

David Rein · 1mo

Important to caveat that these results come from a pretty small sample; I wouldn't take the absolute numbers too seriously beyond the general "algorithmic scoring may often overestimate software capabilities".
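
As an illustrative aside, one way to see why 18 tasks leave the absolute numbers noisy is the width of a binomial confidence interval at that sample size; the 9-of-18 split in the sketch below is hypothetical, not a number from the post.

```python
# Illustrative only: width of a 95% Wilson interval for a pass rate
# measured on n = 18 tasks (the 9/18 split is hypothetical).
import math

def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """95% Wilson score interval for a binomial proportion."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half

lo, hi = wilson_interval(9, 18)
print(f"9/18 passes -> 95% CI roughly ({lo:.2f}, {hi:.2f})")  # about (0.29, 0.71)
```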

Adam Karvonen · 2mo

I would be interested to see this experiment replicated with a different model, like Claude 4.1 Opus or GPT-5. Claude 3.7 Sonnet is perhaps the most notorious LLM for ignoring user intent and reward hacking when writing code.

O O · 2mo

I think the extra effort required to go from algorithmically passing to holistically passing scales linearly with task difficulty. Dense reward-model scaling on hard-to-verify tasks seems to have cracked this. DeepMind's polished, holistically passing IMO solutions probably required the same order of magnitude of compute/effort as OpenAI's technically correct but less polished IMO solutions. (They used similar levels of models, compute, and time to get their respective results.)

So while it will shift timelines, it is something that will fall to scale and thus shouldn't shift them too much.

I predict this will go away once these methods make their way into commercial models, in roughly 1 year. I'll check back in 2026 to see if I'm wrong.

lc · 2mo

> I think the extra effort required to go from algorithmically passing to holistically passing scales linearly with task difficulty. Dense reward-model scaling on hard-to-verify tasks seems to have cracked this. DeepMind's polished, holistically passing IMO solutions probably required the same order of magnitude of compute/effort as OpenAI's technically correct but less polished IMO solutions. (They used similar levels of models, compute, and time to get their respective results.)

Wondering why you think this


TL;DR (from the linked post)

  • On 18 real tasks from two large open-source repositories, early-2025 AI agents often implement functionally correct code that nonetheless cannot easily be used as-is, because of issues with test coverage, formatting/linting, or general code quality.
  • This suggests that the automatic scoring used by many benchmarks may overestimate AI agents' real-world performance (see the sketch below).
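
To make the distinction concrete, here is a minimal sketch; the criteria and function names are assumptions chosen to mirror the issues listed above, not METR's actual scoring code. The algorithmic scorer asks only whether the tests pass, while the holistic check also gates on coverage, lint, and overall quality.

```python
# Illustrative sketch, not METR's scoring code. Contrasts an "algorithmic"
# pass/fail check with a "holistic" review that also gates on the issues above.
import subprocess
from dataclasses import dataclass

@dataclass
class HolisticReview:
    tests_pass: bool          # the same signal the algorithmic scorer uses
    adds_test_coverage: bool  # did the agent test the code it added?
    lint_clean: bool          # passes the repo's formatter/linter
    quality_ok: bool          # reviewer judgment: mergeable as-is?

def algorithmic_score(repo_dir: str) -> bool:
    """Automated check: the submission passes the existing test suite."""
    return subprocess.run(["pytest", "-q"], cwd=repo_dir).returncode == 0

def holistic_score(review: HolisticReview) -> bool:
    """Usable as-is only if every criterion holds, not just the tests."""
    return all([review.tests_pass, review.adds_test_coverage,
                review.lint_clean, review.quality_ok])
```

A submission can pass the first check and still fail the second, which is the gap the bullets above describe.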

[Figure: bar chart comparing holistic and algorithmic scoring]