Since publishing this we have gotten two comments:
We're looking into incorporating both of these if we find the data is compatible with ours!
The issue with Cybench is that its difficulty annotations are "first solve time", which we don't know how to compare with the median / average solve time among experts. If we get better data, it could be comparable.
Really pleased to see this published! Probably the single standout limitation of the first paper was the scope to a particular domain/benchmark (which unfortunately got forgotten in a lot of lossy discussion of the paper). This fills that gap admirably!
I'd be interested in the same analysis but with a 'time' axis ranging over some metrics of standardised cost or scale, rather than raw date. Likely sensible choices would be training compute, runtime compute, or some appropriate combination of the two.
Is there an easy way to grab the collected model success rates and human time-to-complete data, in case I get round to doing this myself?
Interesting.
Seems like an update towards exponential growth potentially being robust. However, we’ve recently seen that exponential growth in software task length may not translate to real software engineering usefulness.
Also, this does not necessarily imply exponential growth in agency, because agency depends on unlocking increasingly hard bottlenecks. The growth rate is much slower for self-driving, while other agentic tasks have a fast growth rate but are far behind. That means that overall agency might turn out to grow sub-exponentially, even if growth is exponential on every individual task.
By the way, if we take the exponential (rather than sigmoidal) constant hazard rate model[1], there's an easy mental trick for extrapolating to different success rates and different time horizons from a given measurement: the Rule of 72[2].
The rule of 72 says, for smallish percentage multiplier rates (i.e. ±r percent) per period[3], the half life in periods is roughly 72/r, or conversely, to halve in n periods you need r ≈ 72/n percent per period.
Ord writes:
Here are some useful comparisons for how the predicted time horizons over which an agent could get very high success rates compare to the measured time horizon for a 50% success rate:

T_80% ≈ T_50% / 3
T_90% ≈ T_50% / 7
T_99% ≈ T_50% / 70

[and each additional ‘nine’ of reliability beyond this divides the time horizon by 10]
Written this way, it looks a bit cryptic. But the rule of 72 makes this easy to estimate.
When we say T_80% or T_90%, we're actually saying, 'what is the 20% failure horizon?' or 'what is the 10% failure horizon?' respectively.
Well, if the half life (the 50% failure horizon) is h, the rule says the 10% failure horizon is roughly h/7[4]. Similarly h/70 for the 1% failure horizon, etc. (whence 'each additional ‘nine’... divides the time horizon by 10'). The rule gets skewiff for higher percentages, which is why the T_80% ≈ T_50%/3 is a bit different than the rule would suggest (but not by much).
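For concreteness, here is the exact exponential-model relationship the rule is approximating (standard constant-hazard algebra, written in my own notation rather than quoted from Ord):

```latex
% Constant hazard rate: success over a task of length t is s(t) = 2^{-t/T_{50}},
% so the horizon at success rate s solves 2^{-T_s/T_{50}} = s, giving
\[
  T_s \;=\; T_{50}\,\frac{\ln(1/s)}{\ln 2}.
\]
% e.g. for s = 90% (a 10% failure horizon):
%   T_{90} = T_{50}\cdot\ln(1/0.9)/\ln 2 \approx 0.152\,T_{50} \approx T_{50}/6.6,
% versus the rule-of-72 estimate of 72/10 = 7.2 periods, i.e. T_{50}/7.2.
```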
Interestingly when I was talking with Beth and Megan about this time horizon stuff in early 2024, we discussed the constant hazard rate model as a simple default (though we were all skeptical that no 'error recovery' at all was plausible as a model, except for very fatal errors!). So I'm mildly surprised that it didn't make it into the paper in the end.
The one Ord talks about ↩︎
variously 'rule of 70', 'rule of 69.3' and other rules; usually it doesn't make much difference for Fermi-ish estimations ↩︎
'smallish' because it relies on approximating a logarithm as linear, which works for smallish rates. ↩︎
The rate (r) we're interested in is 10% per period. Half-life is n periods (i.e. how many 10% failure horizons make up the 50% failure horizon), as given. So the rule says n ≈ 72/10 ≈ 7. ↩︎
METR blogpost | X | Github
In the paper Measuring AI Ability to Complete Long Software Tasks (Kwa & West et al. 2025), METR defined an AI model’s 50% time horizon as the length of tasks (measured by how long they take human professionals) that it can complete autonomously with 50% probability. We estimated the time horizon of frontier models released since 2019 on a benchmark combining three sets of software and research tasks ranging from 1 second to 16 hours in length-for-humans (HCAST, RE-Bench, and SWAA, henceforth METR-HRS). METR found that the time horizon has doubled every 7 months, possibly accelerating to every 4 months in 2024.
One important limitation was the task domain: all involved software engineering or research, but AI capabilities are known to vary greatly between different task types.[1] Here we explore whether similar trends apply to different task distributions including self-driving and agentic computer use, using a methodology that can estimate time horizons with less granular data. The data for many of these benchmarks are less robust than in the original paper, and any individual benchmark’s result should be treated as fairly noisy, but in aggregate, they point towards a similar trend.
Software and reasoning domains like scientific QA (GPQA), math contests (MATH, Mock AIME), semi-realistic software (METR-HRS), and competitive programming (LiveCodeBench) all have 50-200+ minute horizons which are currently doubling every 2-6 months, so the ~100 minute time horizons and ~4 month doubling times seen on METR-HRS in the original paper are likely not outliers. Visual computer use (OSWorld, WebArena) time horizons are 40-100x shorter than the software and reasoning cluster but are increasing at similar rates. Self-driving (Tesla FSD) is improving more slowly at around 0.6 doublings / year. In video understanding (Video-MME, CG-Bench), models are capable of answering questions on ~1 hour videos with >50% success rates, but time horizon may not be an appropriate metric. In no domain we examined is progress clearly sub-exponential.
Because the real world is more diverse than benchmarks, we expect that time horizons on all economically valuable tasks will range over several orders of magnitude. Many of the time horizons are also extrapolations beyond the actual range of task lengths found in benchmarks; we discuss data limitations and other caveats below. However, these results suggest that even in domains where AIs can currently only replace humans for seconds to minutes, we may see exponential growth towards time horizons of days to weeks in the next 5 years.
Rather than perform new experiments, we select 9 existing benchmarks that (a) allow for some estimate of the distribution of time-for-humans and (b) have publicly available scores for many frontier AI agents. We use a logistic model similar to the original paper's, where h is the 50% time horizon, t is the human time rating of the task, and β is a fitted slope parameter, but add a chance accuracy parameter c for multiple-choice benchmarks:

$$p_{\text{success}} = c + (1 - c)\,\sigma\big(\beta\,(\log_2 h - \log_2 t)\big),$$

where σ is the logistic function.
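As a minimal sketch of this success-probability curve (illustrative code only, with made-up names; it assumes the same log2 parameterization as the original paper):

```python
import numpy as np

def p_success(task_minutes, horizon_minutes, beta, chance=0.0):
    """Chance accuracy plus a logistic curve in log2(task length):
    p = c + (1 - c) * sigmoid(beta * (log2(h) - log2(t)))."""
    x = beta * (np.log2(horizon_minutes) - np.log2(task_minutes))
    return chance + (1.0 - chance) / (1.0 + np.exp(-x))
```

For a four-option multiple-choice benchmark, `chance` would be 0.25; for agentic benchmarks it is 0.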
In the original paper, we obtain a score for every (agent, task) pair then do logistic regression. But most publicly available benchmark scores are just overall percentages on the whole benchmark, so we need an alternate method to fit the logistic curve.
When we have scores on individual tasks, or subsets of a benchmark (e.g. LiveCodeBench’s easy/medium/hard splits), we use maximum likelihood estimation (MLE) to estimate the values of h and β that are most consistent with the observed scores.[2] If more than half of the models have β < 0.25 (meaning less than a 1.28x increase in failure odds per doubling of task length), we consider the relationship between human time and success rate too weak for time horizon to be meaningful, and exclude the benchmark from the main analysis.
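A rough sketch of what such an MLE could look like (illustrative code only, treating each split's mean score as a binomial observation; not the code used for this analysis):

```python
import numpy as np
from scipy.optimize import minimize

def fit_horizon_mle(split_task_minutes, split_scores, split_sizes, chance=0.0):
    """Fit (h, beta) from split-level mean scores.
    split_task_minutes: list of arrays of estimated human times per split.
    split_scores: observed mean success rate for each split.
    split_sizes: number of tasks in each split (binomial sample sizes)."""
    def neg_log_lik(params):
        log2_h, beta = params
        nll = 0.0
        for times, score, n in zip(split_task_minutes, split_scores, split_sizes):
            p = chance + (1 - chance) / (1 + np.exp(-beta * (log2_h - np.log2(times))))
            p_bar = np.clip(p.mean(), 1e-6, 1 - 1e-6)
            successes = score * n
            nll -= successes * np.log(p_bar) + (n - successes) * np.log(1 - p_bar)
        return nll

    result = minimize(neg_log_lik, x0=[np.log2(30.0), 0.6], method="Nelder-Mead")
    log2_h, beta = result.x
    return 2 ** log2_h, beta
```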
When we only have an overall score, it’s impossible to estimate both h and β, so we assume that β is equal to its average value on the METR-HRS suite used in the original paper, roughly 0.6. We then find the value of h such that the predicted average score is equal to the AI’s actual score, taking into account the estimated human task lengths $t_i$:[3]

$$\frac{1}{N}\sum_{i=1}^{N}\Big[c + (1 - c)\,\sigma\big(\beta\,(\log_2 h - \log_2 t_i)\big)\Big] = \text{observed score}$$
As this method can produce large errors in time horizons far from the benchmark’s median task length, it is only appropriate when there are not enough splits to use the MLE method. We encourage researchers to report and use individual task data to compute time horizons when possible.
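Sketched below is this fallback under the stated assumption β = 0.6 (illustrative code only; we use a generic root finder where a binary search would do, which is equivalent here because the predicted score is monotone in h):

```python
import numpy as np
from scipy.optimize import brentq

def horizon_from_overall_score(task_minutes, overall_score, beta=0.6, chance=0.0):
    """Solve for the 50% time horizon h such that the predicted mean score
    over the benchmark's estimated task lengths equals the observed score."""
    task_minutes = np.asarray(task_minutes, dtype=float)

    def score_gap(log2_h):
        p = chance + (1 - chance) / (1 + np.exp(-beta * (log2_h - np.log2(task_minutes))))
        return p.mean() - overall_score

    # Predicted score is monotone in log2(h); search roughly 2^-10 to 2^20 minutes.
    log2_h = brentq(score_gap, -10.0, 20.0)
    return 2 ** log2_h
```

For example, `horizon_from_overall_score([2, 10, 60, 240], 0.5)` returns a horizon in the tens of minutes, near the middle of that task-length distribution.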
|  | This post | Original paper (Kwa & West et al. 2025) |
|---|---|---|
| Human time data | Estimated by various methods | Human baselines (mostly) |
| Agent success data | % scores by benchmark split; fallback: overall % score | Success probability over 8 runs per task-agent pair |
| Method | MLE for h and β; fallback: assume β = 0.6, binary search for h | Logistic regression for time horizon and slope |
To get a trendline, we keep only frontier models (horizon higher than any previous horizon) and fit a linear spline[4].
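The frontier-keeping step can be as simple as a running maximum over release date (a sketch with made-up names; the spline is then fit to these points' log2 horizons versus date):

```python
import numpy as np

def frontier_indices(release_dates, horizons):
    """Return indices of frontier models: those whose time horizon exceeds
    that of every previously released model. release_dates can be, e.g.,
    decimal years; horizons are 50% time horizons in minutes."""
    order = np.argsort(release_dates)
    best = -np.inf
    keep = []
    for i in order:
        if horizons[i] > best:
            keep.append(i)
            best = horizons[i]
    return keep
```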
On METR-HRS, the time horizons estimated by this method, using overall success rates only, differ from those of the original methodology by <2x (within error bars) and are usually within 25%, with most of the discrepancy likely due to diversity weighting of tasks[5].
Other than this, we make two methodological improvements over the original horizon paper. First, the original paper defined task length as the geometric mean human time conditional on success; here, we account for the fact that a lower human success rate makes tasks effectively longer by dividing the estimated human time by the human success rate when that data is available.[6] Second, we account for chance accuracy on multiple-choice questions via the chance accuracy parameter c.[7]
We use an entirely different methodology for Tesla FSD data; see section “Details about Individual Benchmarks”.
Overall, the methodological changes here allow the time horizon to be estimated without rerunning experiments. But when we don’t have at least subset-level performance data and cannot estimate β, there may be large errors in time horizon estimates.[8]
Our code is available here.
| Domain | Benchmark | Task length range | Baselines | Agent data (can we estimate β?) | Description | Source |
|---|---|---|---|---|---|---|
| Software dev | METR-HRS | 1s-16h | ✅ | ✅ Individual tasks | Semi-realistic software tasks | METR |
| Software dev | SWE-bench Verified | 7m-2h | 🟨 | ✅ Individual tasks | PRs extracted from GitHub | Epoch AI |
| Agentic computer use | OSWorld | 8s-45s | ✅ | ❌ Overall | GUI application tasks | Leaderboard |
| Agentic computer use | WebArena | 5s-9m | ✅ | ❌ Overall | DOM-based web tasks | Leaderboard |
| Math contests | Mock AIME | 5m-4h | ❌ | ✅ Individual tasks | US high school contest for top 5% | Epoch AI |
| Math contests | MATH | 2s-30m | 🟨 | ❌ Overall | Easier math contest problems | Various |
| Competitive programming | LiveCodeBench[9] | 4m-1.5h | ❌ | ✅ 3 difficulty buckets | LeetCode & similar problems | Leaderboard |
| Scientific QA | GPQA Diamond | 5m-3h | ✅ | ✅ Individual tasks | Difficult scientific MCQs | Epoch AI / various |
| Video QA | Video-MME | 1s-1h | ❌ | ✅ 3 time buckets | Questions about long videos | Leaderboard |
| Autonomous driving | Tesla FSD Tracker | N/A | N/A | ❌* (mean time to disengagement × ln 2) | Composite data from Tesla tracking apps | Website |
| Simulated robotics | RLBench | 4s-20s | 🟨 | ❌ Overall | Vision-guided manipulation | Leaderboard |
For Baselines, ✅ means that humans attempted every task; 🟨 means approximate annotations, or human data on a random sample of tasks; ❌ means no baselines (estimated durations only). See the section “Details for individual benchmarks” for information on how task lengths were estimated.
* One can obtain a rough estimate for β if there are two or more benchmarks in a domain, but there are significant issues with distributional shift. For self-driving, the lack of β is not as significant an issue, as MTBF is already a valid metric.
More details about our data sources are provided in the appendix: heading “Details about individual benchmarks”.
Intellectual domains like scientific QA (GPQA), math contests (MATH, AIME), and software (METR-HRS, LiveCodeBench) all have 50-200 minute horizons which are currently doubling every 2-6 months, so METR-HRS is not an outlier in showing ~100 minute horizons doubling every ~4 months since 2024.
Below we show the same data split into subplots, and including non-frontier models.
Doubling times in this plot are obtained using the calendar time between the smoothed spline reaching 2nd and 98th percentile of task lengths in the benchmark (or some subset of this if the data range is shorter).
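Concretely, that calculation is calendar time divided by the number of doublings between the two crossing points (a sketch with made-up names, assuming the smoothed trend is monotone over this range so it can be inverted):

```python
import numpy as np

def doubling_time_years(trend_dates, trend_log2_horizon, p2_minutes, p98_minutes):
    """Doubling time = calendar time between the smoothed trend reaching the
    2nd and 98th percentile task lengths, divided by the number of doublings
    between those two lengths. trend_dates are decimal years; trend_log2_horizon
    is the smoothed log2 time horizon (assumed increasing)."""
    lo, hi = np.log2(p2_minutes), np.log2(p98_minutes)
    # Invert the (monotone) trend to find when each level is reached.
    t_lo = np.interp(lo, trend_log2_horizon, trend_dates)
    t_hi = np.interp(hi, trend_log2_horizon, trend_dates)
    return (t_hi - t_lo) / (hi - lo)
```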
Several pieces of evidence point to the idea that success on tasks is well modeled by a logistic curve in an agent ability parameter and a task difficulty parameter:
Theoretically, the logistic model is standard in item response theory, which is used to design both benchmarks and human exams. Empirically, Epoch AI and others have found sigmoidal trends on benchmarks versus compute, parameter count, and time. METR used a more specific form of this law, in which task difficulty is assumed to be a logarithmic function of task length for humans:

$$p_{\text{success}} = \sigma\big(\beta\,(\log_2 h - \log_2 t)\big)$$
For this time horizon logistic model to be valid, the human completion time of a question must strongly predict its difficulty for AIs. In the original paper, we found this is true on METR-HRS, but it’s a priori not clear whether the relationship also holds on benchmarks in other domains, or whether it can be measured on benchmarks with a narrow range of task lengths.
We can graph the relationship between task length and difficulty for several benchmarks and observe the slope of the line, where a steeper slope means model success rates depend more on human task length:
On some benchmarks, like LiveCodeBench (orange), task length has a strong correlation with difficulty for models, whereas in others, like Video-MME (blue), the correlation is much weaker. We suspect this is because hard LiveCodeBench questions are harder than easy questions in ways other than simply being longer. On GPQA Diamond (purple), the relationship is noisy due to the low sample size and task length range; it appears to be strong on low-performing GPT-4 Turbo but weak on the high-performing Gemini 2.5 Pro. The data roughly fits a logistic model, but is noisy enough to be consistent with other possible relationships.
By plotting the slopes of the fit logistic curves, we can understand how correlated agent performance and task length are for every agent on a benchmark:
Finally, we examine how this relationship varies with model performance. On METR-HRS, LiveCodeBench, and Mock AIME (not shown), the logistic slope is consistent across different ability levels. However, on GPQA Diamond and Video-MME, we see the relationship weaken as the benchmark saturates: the best-performing models don’t do much worse at longer tasks than shorter ones, so their time horizons cannot be robustly measured.
When human task length is strongly predictive of difficulty, it may be that task success is a multiplicative function of success on a number of subtasks that grows with task length. This observation is also made by Toby Ord, who notes that it implies an exponential (rather than logistic) model; a worked example follows the table below.
| Task type | Description | Example | Is time horizon appropriate? |
|---|---|---|---|
| Multiplicative | Success = success on all subtasks | Train an image model that gets <0.25 validation loss | Yes, and β ≈ 1 if num_subtasks = time |
| Additive | Success rate = average success rate on subtasks | Do 200 GPQA questions and get >70% correct | No; model success rate doesn’t decrease with human time |
| Knowledge check | Not composed of steps | What is the name of element 100? | No; human success rate doesn’t increase with time |
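As a worked illustration of the multiplicative row (our notation; this is the constant hazard rate model Ord describes, not a result from the paper):

```latex
% A task of human length t decomposes into n = t / t_0 independent steps,
% each completed successfully with probability q. Then
\[
  P(\text{success} \mid t) \;=\; q^{\,t/t_0} \;=\; e^{-\lambda t},
  \qquad \lambda = \frac{\ln(1/q)}{t_0},
\]
% i.e. success decays exponentially in human task length, with 50% time horizon
\[
  h \;=\; \frac{\ln 2}{\lambda} \;=\; \frac{t_0 \ln 2}{\ln(1/q)}.
\]
```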
Although (real or estimated) human task length often correlates well with models’ actual success rates, we found that related quantities like video length and freelance task value can correlate poorly with model success rate. We caution against using these to measure model ability. In such cases, giving an “effective difficulty horizon” for models may be less informative than simply reporting the model's percentage score; we give two examples below.
Most video understanding benchmarks include a distribution of video lengths. It may intuitively seem that video lengths are a good measure of difficulty, but the relationship may be weak, both for humans and AIs.
First, humans with access to an interactive video player can likely answer questions much faster than the length of the video, so video length is not even a good predictor of time-for-humans, at least in an economically relevant context. In addition, video length is poorly correlated with model scores on several video benchmarks beyond Video-MME. On LVBench, video length barely affects success rate. On LongVideoBench, which includes videos from 8 seconds to 1 hour, GPT-4o drops only 10% from the shortest to the longest split, and agents usually score lower on the 8-15s bucket than on the 15-60s bucket.
We suspect this poor correlation arises because the number of fundamental steps does not scale with video length. Video QA questions often require referencing multiple points in the video[12] to answer correctly, but because the questions themselves are roughly constant in length, video benchmarks do not have a multiplicative structure.
To test whether the lack of correlation between length and difficulty is due to long irrelevant video segments, we examined another video understanding benchmark, CG-Bench, which like Video-MME includes annotations for clue length (the total length of segments within the video actually required to solve the task). Since the CG-Bench leaderboard does not publish splits, we ran GPT-4o-mini-2024-07-18 on the 3,000 question benchmark using the official repo; its score was 48.7% vs 48.4% on the leaderboard. GPT-4o-mini’s performance across videos of different clue length is consistent with either no relationship or a very weak negative relationship between success rate and clue length (point estimate: -0.0087 success rate per doubling of clue segment length, 95% CI for slope: (-0.020642, 0.003241)). So it appears that video length does not make tasks much more difficult for agents, at least under the conditions used for existing benchmarks. It may be that human baseline times have a stronger relationship with difficulty, but we couldn’t find a readily available data source for human baselines.
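For reference, a slope of this kind can be obtained from a linear-probability regression of per-question correctness on log2(clue length); the sketch below uses statsmodels and made-up variable names, and is not the exact analysis code:

```python
import numpy as np
import statsmodels.api as sm

def success_slope_per_doubling(clue_lengths_sec, is_correct):
    """OLS of per-question correctness (0/1) on log2(clue segment length),
    returning the change in success rate per doubling and its 95% CI."""
    X = sm.add_constant(np.log2(np.asarray(clue_lengths_sec, dtype=float)))
    y = np.asarray(is_correct, dtype=float)
    fit = sm.OLS(y, X).fit()
    slope = fit.params[1]
    ci_low, ci_high = fit.conf_int(alpha=0.05)[1]
    return slope, (ci_low, ci_high)
```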
On SWE-Lancer (Miserendino et al. 2025), a freelance software development benchmark, o1’s performance is not significantly correlated with the price range of the task. This is surprising to us, because the skills measured by SWE-Lancer seem close to those measured by other realistic software benchmarks like HCAST and SWE-bench Verified. Whatever the reason, it would probably not be meaningful to calculate a “task value horizon” for an AI system on SWE-Lancer; rather, one could track the “earn rate” that Miserendino et al. report, or an unweighted average score.
We can test the robustness of our methods by comparing benchmarks where we have individual task data to benchmarks where we have overall score data only. For AIME and GPQA, we compare Epoch AI’s data (with success rate for each (model, task) pair) to data for similar benchmarks collected online (overall score for each model only); for METR-HRS and LiveCodeBench we calculate an approximate time horizon using the binary search method despite having enough data for the METR-HRS original method or LiveCodeBench MLE method. For each approximate method, we assume β = 0.6 even when we know the true values of β are different.
The approximate AIME data (light blue) has a steeper horizon growth rate than Epoch AI’s Mock AIME, likely due to missing data for early 2024 models. The approximate GPQA method has a steeper slope than GPQA Diamond because β is lower for high-performing agents on GPQA Diamond, which the binary search method does not take into account. On LiveCodeBench, we see the opposite: LiveCodeBench (Approximate) has a steeper slope than LiveCodeBench, because the actual value of β on LiveCodeBench is much higher than 0.6. The binary search method interprets o4-mini’s high overall scores on LiveCodeBench as much higher time horizons, whereas they are actually only somewhat higher.
Other domains: Economically valuable tasks are much more diverse than the domains we examined; e.g. the two most common highly-paid occupations in the US are management and nursing, the key skills of which are mostly unrepresented in these benchmarks.
In particular, we have few domains with short time horizons (just OSWorld), so we don’t know whether such domains will see catch-up growth or stagnation. When trying to predict automation of AI R&D or the economy at large, the exact date and speed of automation will likely depend on the rate of progress on the last skills to be automated. So identifying which skills are hardest for AIs, and the rate of improvement on each, is especially important.
All benchmarks are somewhat artificial: We expect all benchmarks to be easier than real-world tasks in the corresponding domain: benchmark tasks are nearly always cheap to run and score, and therefore easy to improve on via RL. Even if a particular benchmark isn’t contaminated, benchmarks still represent an easier-for-models distribution of tasks. Because we only examine benchmarks (except for self-driving), we don’t know the size of the gap between benchmarks and real tasks, or whether it’s a constant-factor or rate difference. It’s plausible that benchmarks are selected for domains that are likely to see progress in the months after publication, which would bias this data towards observing only domains with faster progress.
Imperfect data: Many of the benchmarks do not have human baselines, so we need to estimate task lengths using varying methods; the estimated trend in time horizon is fairly sensitive to the distribution of task lengths. There are also much higher uncertainties for benchmarks without multiple splits than benchmarks with multiple splits that allow us to determine β; this gives us high uncertainty about OSWorld as it is the only benchmark in its domain. As for agent performance, some benchmarks’ leaderboards don’t include the best models. For example, on LiveCodeBench the earliest datapoint is Claude 3 Haiku; Claude 3 Opus would plausibly have done much better.
Elicitation and inference scaling: On METR-HRS tasks, METR gave models enough inference compute that their performance plateaued. Epoch AI’s methodology on GPQA Diamond, MATH L5, AIME, and SWE-bench Verified was standardized. However, many benchmark leaderboards include a variety of scaffolds that may mean performance is incomparable between models.
Metrics other than time horizon: In domains where longer-for-humans tasks are not harder for models, time horizon may not be an appropriate metric. Speed (of autonomous task completion) or productivity uplift factor may be more economically relevant, and measuring these will require tracking the cost and speed of AI inference.
In this post, we examined how the time horizons and 4-7 month doubling times that Kwa and West et al. (2025) found on software tasks compared to other domains. We estimated time horizons on other benchmarks by fitting the same logistic model using MLE wherever there is (1) an estimate of human task lengths and (2) public data for agent performance on several difficulty splits. Even with only overall performance data, we can assume a value for β, which gives a noisier time horizon estimate.
We find exponential or super-exponential growth trends everywhere we look, although there is a large spread in both time horizons and growth rates. Coding, math contest, and QA benchmarks all show 30-200+ minute horizons doubling every ~2-6 months, with METR-HRS in the middle of the pack. Time horizons on agentic GUI tasks (OSWorld) are still ~100× shorter than on coding or math; Tesla self-driving has improved much more slowly than other domains.
Limitations of our current data sources mean our estimates are noisy. More importantly, all of our evidence still comes from benchmarks in a limited range of domains, especially those that AIs are relatively good at. Future work on other domains with shorter horizons may provide direct evidence about the bottlenecks to dangerous capabilities and AI R&D automation.
Below is a list of experiments in rough priority order that would expand on these results:
Effective mean task duration
See this document for why effective mean task duration is relevant: [ext] Getting task duration from baseliner time distribution
Without the time horizon metric, it is difficult to put these scores on a human-calibrated scale, but raw percentage scores still let us track progress on benchmarks like Video-MME and GPQA Diamond, where the time horizon metric is inappropriate or inconsistent.