Surprised that this hasn't gotten more discussion. There's some potentially big implications for the time horizons study, which has become fairly load-bearing in timelines discourse.

[-]Daniel Kokotajlo3mo132

Yeah, this is an update against basing timelines heavily on the time horizon extrapolation.

I still think it's the best single piece of evidence / best benchmark extrapolation that we have though? Curious for your thoughts.

[-]lc3mo50

Yup, this is one of the largest problems with using AI solutions today.

[-]David Rein3mo30

Important to caveat that these results are pretty small—I wouldn't take the absolute numbers too seriously beyond the general "algorithmic scoring may often overestimate software capabilities".

[-]Adam Karvonen3mo97

I would be interested to see this experiment replicated with a different model, like Claude 4.1 Opus or GPT-5. Claude 3.7 Sonnet is perhaps the most notorious LLM in terms of ignoring user intent and propensity to reward hack when writing code.

[-]O O3mo50

I think the extra effort required to go from algorithmically to holistically qualifying scales linearly with task difficulty. Dense reward model scaling on hard to verify tasks seems to have cracked this. Deepminds polished holistically passing IMO solutions probably required the same order of magnitude of compute/effort as the technically correct but less polished OpenAI IMO solutions. (They used similar levels of models, compute, and time to get their respective results)

So while it will shift timelines, it is something that will fall to scale and thus shouldn’t shift it too much.

I predict once these methods make their way into commercial models, this will go away, or roughly 1 year. I’ll check back in 2026 to see if I’m wrong.

[-]lc3mo30

I think the extra effort required to go from algorithmically to holistically qualifying scales linearly with task difficulty. Dense reward model scaling on hard to verify tasks seems to have cracked this. Deepminds polished holistically passing IMO solutions probably required the same order of magnitude of compute/effort as the technically correct but less polished OpenAI IMO solutions. (They used similar levels of models, compute, and time to get their respective results)

Wondering why you think this

Moderation Log

LESSWRONG
LW

LESSWRONG
LW

101

METR Research Update: Algorithmic vs. Holistic Evaluation

101

101

TL;DR