o3 and o4-mini solve a nonzero fraction of the >1hr tasks that Claude 3.7 got ~zero on, including some >4hr tasks that no previous model we tested has done well on, so it's not that models hit a wall at 1-4 hours. My guess is that the tasks they have been trained on are simply more similar to HCAST tasks than to RE-Bench tasks, though there are other possibilities.
Okay, that’s a lot more convincing. Was it available publicly? I seem to have missed it. That's a surprisingly high success probability on a 16-hour task; does this just mean it made some progress on it?
Pokémon fell today too, we might be cooked.
The RE-Bench result is for just five tasks; the second graph covers a broader suite of almost 200 tasks. I wouldn't read much into o3 doing worse than other models at RE-Bench given such a small sample.
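To give a feel for how noisy a five-task average is, here's a quick sketch comparing bootstrap confidence intervals for a 5-task suite versus a ~200-task one. The per-task scores are invented, not the actual RE-Bench or HCAST numbers:

```python
import numpy as np

rng = np.random.default_rng(0)

def bootstrap_ci(scores, n_boot=10_000, alpha=0.05):
    """Percentile bootstrap confidence interval for the mean task score."""
    scores = np.asarray(scores)
    means = [rng.choice(scores, size=len(scores), replace=True).mean()
             for _ in range(n_boot)]
    lo, hi = np.percentile(means, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return lo, hi

# Hypothetical per-task scores in [0, 1]; same underlying distribution, different n.
small_suite = rng.uniform(0.2, 0.9, size=5)     # RE-Bench-sized suite
large_suite = rng.uniform(0.2, 0.9, size=200)   # HCAST-sized suite

for name, scores in [("5 tasks", small_suite), ("200 tasks", large_suite)]:
    lo, hi = bootstrap_ci(scores)
    print(f"{name}: mean={scores.mean():.2f}, 95% CI width={hi - lo:.2f}")
```

With only five tasks, the interval is wide enough that a model placing a bit below others could easily be noise.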
Epistemic status: Question, probably missing something.
See the preliminary evaluation of o3 and o4-mini here: https://metr.github.io/autonomy-evals-guide/openai-o3-report/#methodology-overview
This follows up METR's important work measuring the maximum length (in human-equivalent time) of tasks that frontier models can complete successfully, a trend I predicted would not hold up (perhaps a little too stridently).
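As I understand that methodology, the headline time-horizon number comes from fitting a logistic curve to success probability as a function of (log) human task length and reading off where it crosses 50%. A rough sketch with invented data, not METR's actual code or numbers:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Invented example: human completion time (hours) and whether the model succeeded.
task_hours = np.array([0.05, 0.1, 0.25, 0.5, 1, 2, 4, 8, 16, 0.1, 0.5, 2, 8, 16])
succeeded  = np.array([1,    1,   1,    1,   1, 1, 0, 1, 0,  1,   0,   1, 0, 0])

# Fit P(success) as a logistic function of log(task length).
X = np.log(task_hours).reshape(-1, 1)
clf = LogisticRegression().fit(X, succeeded)

# 50% time horizon: the task length at which predicted success probability is 0.5,
# i.e. where the logistic's linear term w*log(t) + b crosses zero.
w, b = clf.coef_[0, 0], clf.intercept_[0]
horizon_hours = np.exp(-b / w)
print(f"Estimated 50% time horizon: {horizon_hours:.2f} hours")
```

So "high success probability on a 16-hour task" is a statement about this fitted curve, not a claim that the model reliably finishes 16-hour tasks end to end.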
I'm also betting on that prediction; please provide me with some liquidity:
o3 doesn't seem to perform too well according to this chart:
But it gets the best score on this chart:
I understand that these measure two different things, so there is no logical inconsistency between the two results, but the disparity is striking. Would someone be willing to give a more detailed explanation of what is going on here? I'm not sure whether to update toward the task-length trend in fact continuing (or even accelerating), or to read o3's overall poor performance as a sign that the trend is about to break down.