METR should test for a 99.9% task completion rate (in addition to the current 80% and 50%). A key missing ingredient holding back LLMs' economic impact is that they're just not robust enough. This can be viewed analogously to the problem of self-driving: every individual component of self-driving is ~solved, but stringing them together results in a non-robust final product. I believe that automating research/engineering completely will require nines of reliability that we just don't have, and testing for nines of reliability could be done by giving the model many very short time-horizon tasks and seeing how it performs.
This can be further motivated by considering what happens when we string together tasks with a non-99.99...% completion rate. Take the GPT-5.1-Codex-Max result: METR claims this model has a 50% time horizon of 2 hours and 40 minutes. Say we tell the model to do task A, which takes 2 hours and 40 minutes, so P(A) = 0.5. Now if the model decides it needs to do task B to further its research, we have P(B) = 0.5 and P(A, B) = P(A)P(B) = 0.25 (these events are not independent, but I treat them as such for illustrative effect). We can then consider tasks C, D, E, etc. The same decay holds even at the higher 80% completion rate. Once we get up to 99.9%, we have P(A) = 0.999, P(B) = 0.999, and P(A, B) = P(A)P(B) ≈ 0.998. This is where we can really start seeing autonomous research, imo.
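As a quick sanity check on those numbers, here is a toy calculation (keeping the independence assumption flagged above, purely for illustration) showing how chained success probability decays with the number of tasks at different per-task completion rates:

```python
# Toy illustration of the chaining argument above.
# Assumes task successes are independent, which they aren't in practice;
# this is only meant to show how quickly reliability compounds.

for per_task_rate in (0.5, 0.8, 0.999):
    print(f"per-task completion rate = {per_task_rate}")
    for n_tasks in (1, 2, 5, 10, 50):
        chain_success = per_task_rate ** n_tasks
        print(f"  P(all {n_tasks} tasks succeed) = {chain_success:.3f}")
```

At 80%, a chain of ten tasks already succeeds only ~11% of the time, while at 99.9% a chain of fifty tasks still succeeds ~95% of the time.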
It would be interesting to benchmark humans at a 99.9% task completion rate and see what their task length is.
(Disclaimer: I am not completely sure of METR's methodology for determining task length)
Unfortunately, the available benchmark tasks do not allow for 99%+ reliability measurements. Because we don't have 1,000 different one-minute tasks, the best we could do would be something like checking whether GPT-5.1 can do all 40 tasks 25 times each with perfect reliability. Most likely it would succeed at all of them, because we just don't have a task that happens to trip it up.
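One way to see why repetitions don't substitute for distinct tasks: if the model's unreliability is concentrated in particular tasks that trip it up (rather than being an independent coin flip on every attempt), then re-running the same 40 tasks mostly re-tests failure modes you've already sampled. A toy Monte Carlo sketch, with a made-up 1% "trap task" rate:

```python
import random

# Toy model: suppose 1% of possible tasks are "traps" the model always fails,
# and it is perfectly reliable on everything else. The 1% figure is invented,
# purely to illustrate the sample-size point.
TRAP_RATE = 0.01
TRIALS = 10_000

def suite_passes(n_distinct_tasks: int) -> bool:
    """True if a benchmark suite of n distinct tasks happens to contain no trap task."""
    return all(random.random() > TRAP_RATE for _ in range(n_distinct_tasks))

for n in (40, 1000):
    pass_rate = sum(suite_passes(n) for _ in range(TRIALS)) / TRIALS
    print(f"{n} distinct tasks: P(model looks perfectly reliable) ~ {pass_rate:.2f}")
# ~0.67 for 40 tasks, ~0.00 for 1000 tasks. Repeating the 40 tasks 25 times
# each changes nothing in this model, since the traps are task-specific.
```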
As for humans' 99.9% time horizon: at a granular enough level it would be about 0.2 seconds (typing one keystroke), since few people type with better than 99.9% per-keystroke accuracy. But in the context of a larger task we can correct our typos, so it isn't super relevant.
A key missing ingredient holding back LLM economic impact is that they're just not robust enough.
I disagree with this in this particular context. We are looking at AI companies trying to automate AI R&D via AIs. Most tasks in AI R&D don't require much reliability. I don't know the distribution of outcomes in ML experiments, but I reckon a lot of them are basically failures/have null results, while the distribution of the impact of such experiments has a long tail[1]. Also, ML experiments don't have many irreversible parts; AI R&D researchers aren't like surgeons, where mistakes have huge costs: any ML experiment can be sandboxed, given a bounded amount of resources, and shut down when it eats up too much. You need high reliability when the cost of failure is necessarily very high, but when running ML experiments that's not the case.
Edit: Claude 4.5 Sonnet gave feedback on my text above; it says the search strategy matters if we're looking at ML engineering. If the search is breadth-first and innovations don't require going down a deep tree, then low reliability is fine. But if we need to combine ≥4 innovations in a depth-first search, then reliability matters more.
I don't think this is a crux for me, but learning that it's a thin-tailed distribution would make me at least think about this problem a bit more. Claude claims hyperparameter tunes have lognormal returns (shifted so that the mean is slightly below baseline). ↩︎
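For what it's worth, the long-tail intuition above is easy to poke at numerically. The sketch below is my own toy model (the only input taken from the thread is the lognormal-returns claim in the footnote): it samples experiment payoffs from a lognormal, treats a botched experiment as yielding nothing, and compares the expected best result of a batch at 80% vs 99.9% execution reliability.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy model: each experiment's payoff is lognormal, a botched experiment
# yields nothing, and the campaign's value is the best result found
# (i.e. breadth-first search over independent experiments).
N_EXPERIMENTS = 100
N_SIMS = 10_000

def expected_best(reliability: float) -> float:
    payoffs = rng.lognormal(mean=0.0, sigma=1.0, size=(N_SIMS, N_EXPERIMENTS))
    executed = rng.random((N_SIMS, N_EXPERIMENTS)) < reliability
    return (payoffs * executed).max(axis=1).mean()

for r in (0.8, 0.999):
    print(f"reliability {r}: expected best-of-{N_EXPERIMENTS} payoff ~ {expected_best(r):.2f}")
# Losing ~20% of experiments barely moves the expected best outcome.
# This is the breadth-first / long-tail case only; it says nothing about
# deep chains, which is where the parent comment's objection bites.
```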
Claude's rebuttal is exactly my claim. If major AI research breakthroughs could be done in 5 hours, then imo robustness wouldn't matter as much: you could run a bunch of models in parallel and see what happens (this is part of why models are so good at olympiads). But an implicit part of my argument/crux is that AI research is necessarily deep, meaning you need to string some number of successfully completed tasks together to get an interesting final result. And if the model messes up one part, your chain breaks. Not only does this give you weird results, but it breaks your chain of causality[1], which is essential for AI research.
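One way to make the depth point quantitative (my framing, not anything from METR): if you only keep searching from results whose cumulative confidence stays above some trust threshold, then per-step reliability directly caps how deep a chain you can trust. The threshold of 0.5 below is arbitrary, chosen just for illustration.

```python
import math

# How many steps can a causal chain have before cumulative confidence
# (the product of per-step reliabilities) drops below a trust threshold?
def max_trustworthy_depth(per_step_reliability: float, threshold: float = 0.5) -> int:
    return math.floor(math.log(threshold) / math.log(per_step_reliability))

for p in (0.8, 0.999):
    print(f"per-step reliability {p}: chain stays trustworthy for "
          f"{max_trustworthy_depth(p)} steps")
# 0.8   -> 3 steps
# 0.999 -> 692 steps
```

At 80% per-step reliability you can only trust chains about three steps deep before you're more likely wrong than right, whereas at 99.9% the same threshold allows chains hundreds of steps long.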
I've also tried doing "vibe AI researching" (no human in the loop) with current models and found that it just fails right away. If robustness doesn't matter, why don't we see current models consistently making AI research breakthroughs at their current 80% task completion rate?
A counterargument to this is that if METR's graph trend keeps up and task length gets to some threshold (call it a week, for example), then you don't really care about P(A)P(B)P(C)...; you can just run the tasks in parallel and see which one works. (However, if my logic holds, I would guess that METR's task benchmark hits a plateau at some point before reaching full-on research, at least at current model robustness.)
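To make the parallel-vs-sequential contrast concrete, here is the arithmetic, again assuming independence and, crucially, assuming you can cheaply verify which parallel attempt actually worked:

```python
# Sequential chain: every step must succeed.
# Parallel best-of-N: at least one attempt must succeed (and be verifiable).
# Independence is assumed throughout, purely for illustration.

def chain_success(per_task_rate: float, n_steps: int) -> float:
    return per_task_rate ** n_steps

def best_of_n_success(per_attempt_rate: float, n_attempts: int) -> float:
    return 1 - (1 - per_attempt_rate) ** n_attempts

print(chain_success(0.8, 10))      # ~0.11: ten chained 80% tasks usually fail
print(best_of_n_success(0.5, 10))  # ~0.999: ten parallel 50% attempts at one
                                   # long task almost always produce a winner
```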
By chain of causality, I mean: I did task A. If I am extremely confident that task A is correct, I can then search from task A. Say I stumble on some task B, then C. If I get an interesting result from task C, I can keep searching from there so long as I am confident in my results. I can also mentally update my causal chain by some kind of ~backprop: "Oh, using a CNN in task A, then setting my learning rate to this in task B, made me discover this new thing in task C, so now I can draw a generalized intuition to approach task D. Ok, this approach to D failed, let me try this other one." ↩︎