Claude's performance is low in the 2-4 hour range, which mostly consists of cybersecurity tasks (a potentially dual-use capability, so safety considerations may be in play). In general, training on cybersecurity CTFs and ML code would increase "horizon length" on the METR plot, which has only 14 samples in the relevant 1-4 hr range where the 2025 progress happened.
This makes me think of Grok 4, the previous model with the biggest 50%/80% time horizon ratio. It had funny failures on tasks 2 seconds, 2 minutes, and 2 hours long. What if an alternate-universe Claude had, like GPT-5.1-Codex-Max, succeeded at ALL tasks shorter than a minute: would it have achieved a far bigger 80% time horizon? And what if GPT-5.2 and Gemini 3 Pro get their failures on sub-minute tasks ironed out, as happened with GPT-5 vs Grok 4?
EDIT: in theory, the alternate Claude could also end up with a worse 50% time horizon, since removing short-task failures steepens the fitted logistic curve. But the real Claude succeeded on a quarter of the 2-4 hr tasks and about half of the 4-16 hr tasks, so its long-horizon successes are not just an artifact of a flat fit.
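This intuition is easy to sanity-check with a toy simulation. Below is a minimal sketch (synthetic data and made-up parameters, not METR's actual data or code) that fits a METR-style logistic of success probability on log2(task length), then flips every sub-minute failure to a success and refits:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
n = 400

# Synthetic task suite: log-uniform lengths from ~10 sec to 16 hr (minutes)
t = np.exp(rng.uniform(np.log(1 / 6), np.log(960), n))
x = np.log2(t)

# "Real" model: logistic success curve (parameters picked to roughly match
# the quoted Opus horizons), plus spurious failures on sub-minute tasks
a_true, b_true = 3.3, 0.4
y = rng.random(n) < 1 / (1 + np.exp(-(a_true - b_true * x)))
y &= ~((t < 1) & (rng.random(n) < 0.2))  # short-task flubs

def horizons(x, y):
    """ML logistic fit of success on log2(length); returns (h50, h80) in min."""
    def nll(p):
        z = p[0] - p[1] * x
        return np.logaddexp(0, -z)[y].sum() + np.logaddexp(0, z)[~y].sum()
    a, b = minimize(nll, [3.0, 0.5]).x
    return 2 ** (a / b), 2 ** ((a - np.log(4)) / b)

print("with flubs:  h50=%5.0f min  h80=%5.0f min" % horizons(x, y))
print("flubs fixed: h50=%5.0f min  h80=%5.0f min" % horizons(x, y | (t < 1)))
```

In this toy setup, fixing the flubs steepens the fitted slope, which pulls the 80%-horizon up toward the 50%-horizon; the 50%-horizon itself can move either way, which is the point of the EDIT above.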
We estimate that, on our tasks, Claude Opus 4.5 has a 50%-time horizon of around 4 hrs 49 mins (95% confidence interval of 1 hr 49 mins to 20 hrs 25 mins). While we're still working through evaluations for other recent models, this is our highest published time horizon to date.
We don’t think the high upper CI bound reflects Opus’s actual capabilities: our current task suite doesn’t have enough long tasks to confidently upper bound Opus 4.5’s 50%-time horizon. We are working on updating our task suite, and hope to share more details soon.
Based on our experience interacting with Opus 4.5, the model’s performance on specific tasks (including some not in our time horizon suite), and its benchmark performance, we would be surprised if further investigation showed Opus had a 20+ hour 50%-time horizon.
Despite its high 50%-time horizon, Opus 4.5's 80%-time horizon is only 27 minutes, similar to past models and below GPT-5.1-Codex-Max's 32 mins. The gap between its 50%- and 80%-horizons reflects a flatter logistic success curve, as Opus differentially succeeds on longer tasks.
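For concreteness: under a two-parameter logistic in log2(task length), which is the shape METR fits, the ratio of the two horizons is a direct readout of the slope. A back-of-the-envelope using the quoted numbers:

```python
import math

h50 = 4 * 60 + 49   # 289 min, quoted 50%-time horizon
h80 = 27            # quoted 80%-time horizon

# If P(success) = sigmoid(a - b*log2(t)), then log2(h50/h80) = ln(4)/b,
# so the h50/h80 ratio pins down the slope b:
b = math.log(4) / math.log2(h50 / h80)
print(f"{b:.2f} log-odds lost per doubling of task length")  # ~0.41
```

A steeper curve (larger b) would pull the two horizons together; a roughly 10.7x ratio between them is what a flat slope looks like.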
The 80%-horizon has stayed essentially flat (27-32 mins) since GPT-5.1-Codex-Max's release, but there's a big jump, with huge error bars, in the 50%-horizon.
I think Daniel Kokotajlo's recent shortform offers a useful framing here. He models progress as increasing either the intercept (baseline performance) or the slope (how well models convert additional time budget into performance). If progress comes mainly from rising intercepts, an exponential fit to horizon length could hold indefinitely. But if progress comes mainly from increasing slope, the crossover point eventually shoots to infinity as the AI slope approaches the human slope.
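A toy version of that crossover claim (my own illustration with made-up numbers, not Kokotajlo's): if both the AI and the human convert time budget into performance linearly in log(t), the budget at which the higher-slope human line overtakes the higher-intercept AI line diverges as the slopes converge.

```python
import math

def crossover(c_ai, m_ai, c_h=0.0, m_h=1.0):
    """Budget t* where a higher-slope human line (y = c_h + m_h*log t)
    overtakes a higher-intercept AI line (y = c_ai + m_ai*log t)."""
    return math.exp((c_ai - c_h) / (m_h - m_ai))

for m_ai in (0.5, 0.8, 0.95, 0.99):
    print(f"AI slope {m_ai:.2f} -> crossover at t* = {crossover(1.0, m_ai):.3g}")
# 0.50 -> 7.39   0.80 -> 148   0.95 -> 4.85e+08   0.99 -> 2.69e+43
```

The last step from slope 0.95 to 0.99 blows the crossover up by 35 orders of magnitude, which is the "shoots to infinity" behavior.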
The 80%-horizon staying flat while the 50%-horizon climbs might be evidence for intercept-dominated progress.