Notice how the log-linear fit here only looks good for the SWAA data, in the 1 sec - 1 min range. There's something completely different going on for tasks longer than 1 minute, clearly not explained by the log-linear fit. If you tried to make a best fit line on the blue points (the length of tasks we care about after 2024), you'd get a very different, almost vertical line, with a very low R^2.
I don't think this is true. I got Claude to clone the repo and reproduce it without the SWAA data points. The slope is ~identical (-0.076 rather than the original -0.072) and the correlation is still pretty good (R^2 = 0.51).
Edit: That was with HCAST and RE-bench. Just HCAST is slope=-0.077 and R^2=0.48. I think it makes more sense to include RE-bench.
Edit 2: Updated the slopes. Now the slope is per doubling, like in the paper (and so the first slope matches the one in the paper). I think the previous slopes were measuring per factor e instead.
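For reference, the computation is roughly this (a minimal sketch; the file and column names are hypothetical stand-ins, not the actual repo schema):

```python
# Minimal sketch of the slope-per-doubling fit; file and column names are
# hypothetical, not METR's actual schema.
import numpy as np
import pandas as pd
from scipy.stats import linregress

runs = pd.read_csv("runs.csv")                      # hypothetical per-run export
runs = runs[runs["task_source"] != "SWAA"]          # drop the 1 sec - 1 min SWAA tasks

# Average success across models for each task, as in the original scatter plot.
per_task = runs.groupby("task_id").agg(
    mean_score=("score", "mean"),
    human_minutes=("human_minutes", "first"),
)

# Regressing mean score on log2(task length) makes the slope "per doubling".
fit = linregress(np.log2(per_task["human_minutes"]), per_task["mean_score"])
print(f"slope per doubling = {fit.slope:.3f}, R^2 = {fit.rvalue ** 2:.2f}")
```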
Thanks, I should've done that myself instead of lazily mentioning what it "looked like". R^2=0.51 is still a lot lower than the initial 0.83. Though same as before, I am not fully sure what this implies for the logistic model chosen, and downstream conclusions.
Incidentally, your intuition might've been misled by one or both of:
As illustration of the last point: here's a bonus plot where the green line is minimizing the horizontal squared distance instead, i.e. predicting human minutes from average model score. I wouldn't quite say it's almost vertical, but it's much steeper.
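For concreteness, here's roughly what the two directions look like on toy data (nothing here is METR's code; it's just ordinary least squares run both ways):

```python
# Toy illustration of vertical vs horizontal least squares; synthetic data only.
import numpy as np
from scipy.stats import linregress

rng = np.random.default_rng(0)
log_len = rng.uniform(0, 10, 100)                                     # log2 task length
score = np.clip(1 - 0.08 * log_len + rng.normal(0, 0.2, 100), 0, 1)   # noisy success rate

fwd = linregress(log_len, score)   # usual fit: minimizes vertical squared distance
rev = linregress(score, log_len)   # reverse fit: minimizes horizontal squared distance

# Re-express the reverse fit as a line in (log_len, score) space so the slopes compare.
rev_slope = 1.0 / rev.slope
print(f"forward slope: {fwd.slope:.3f}, reverse slope: {rev_slope:.3f}")  # reverse is steeper
```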
That's an interesting point. If I kept adding points to the right, i.e. longer and longer tasks which I know the model would fail on, it would keep making the line flatter? That kind of makes me wonder, once again, if it's even a good idea to try and fit a line here...
Yeah, a line is definitely not the "right" relationship, given that the y-axis is bounded 0-1 and a line isn't. A sigmoid or some other 0-1 function would make more sense, and more so the further outside the sensitive, middle region of success rate you go. I imagine the purpose of this graph was probably to sanity-check that the human baselines did roughly track difficulty for the AIs as well. (Which looks pretty true to me when eye-balling the graph. The biggest eye-sore is definitely the 0% success rate in the 2-4h bucket.)
Graph for 2-parameter sigmoid, assuming that you top out at 1 and bottom out at 0.
If you instead do a 4-parameter sigmoid with free top and bottom, the version without SWAA asymptotes at 0.7 to the left instead, which looks terrible. (With SWAA the asymptote is a little above 1 to the left; and they both get asymptotes a little below 0 to the right.)
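For reference, the two fits are along these lines (toy data standing in for the per-bucket averages; not the exact code used for the graphs):

```python
# Sketch of the 2-parameter vs 4-parameter sigmoid fits; synthetic data only.
import numpy as np
from scipy.optimize import curve_fit

def sigmoid2(log_len, h, beta):
    # Success rate falling from 1 to 0 as log task length passes h.
    return 1.0 / (1.0 + np.exp(beta * (log_len - h)))

def sigmoid4(log_len, h, beta, top, bottom):
    # Same shape, but with free upper and lower asymptotes.
    return bottom + (top - bottom) / (1.0 + np.exp(beta * (log_len - h)))

rng = np.random.default_rng(0)
log_len = rng.uniform(0, 10, 200)
score = np.clip(sigmoid2(log_len, 5.0, 1.2) + rng.normal(0, 0.1, 200), 0, 1)

p2, _ = curve_fit(sigmoid2, log_len, score, p0=[5.0, 1.0])
p4, _ = curve_fit(sigmoid4, log_len, score, p0=[5.0, 1.0, 1.0, 0.0])
print("2-param (h, beta):", p2)
print("4-param (h, beta, top, bottom):", p4)
```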
(Wow, graphing is so fun when I don't have to remember matplotlib commands. TBC I'm not really checking the language models' work here other than assessing consistency and reasonableness of output, so discount depending on how much you trust them to graph things correctly in METR's repo.)
This doesn’t contradict common sense if you remember that Claude Opus 4.5 has a 50%-time horizon of around 4 hours 49 minutes (95% confidence interval: 1 hour 49 minutes to 20 hours 25 minutes).
Just think about it: from 1 hour 49 minutes up to 20 hours 25 minutes.
There simply isn’t enough data for reliable conclusions.
I don't know how problematic it is to assume a logistic function for this data.
The logistic is just one of many functions that's a reasonable fit for the P(success) vs length data. You can use lots of different curves and still get the exponential horizon trend; it's not specific to the log-logistic.
E.g. Here's a silly log-linear fit that I quickly threw together:
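Something like the following, on toy data (a sketch of the idea rather than the actual fit behind the plot): fit a straight line to success vs log2(length) and read off where it crosses 50%.

```python
# Sketch of a "silly" log-linear alternative: fit a straight line to one model's
# per-task 0/1 outcomes vs log2(task length), then read off the 50% crossing.
import numpy as np
from scipy.stats import linregress

rng = np.random.default_rng(0)
log_len = rng.uniform(-2, 8, 300)                           # log2(minutes), toy data
p_true = 1.0 / (1.0 + np.exp(1.0 * (log_len - 4.0)))        # hidden "true" success curve
success = rng.random(300) < p_true                          # 0/1 outcomes

fit = linregress(log_len, success.astype(float))
horizon_log2 = (0.5 - fit.intercept) / fit.slope            # where the line crosses 50%
print(f"50% horizon ~ {2 ** horizon_log2:.1f} minutes")     # close to 2^4 = 16 here
```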
Here's all the histograms if you want to take a look
It's not the function being used to fit per-model P(success) vs length that causes the exponential horizon trend on our task suite. I think its more that:
(I work at METR but I'm speaking for myself here. Also thanks for the post, appreciate people digging into the data.)
Strong-upvoted for picking more well-justified holes in that graph I contributed to. See also my post on this topic for some qualitative reasons to take that study with a grain of salt ( . . . as well as some debunked & redacted quantitative slander on my part, which this post reassures me happened to eventually be directionally correct, eto_bleh.gif).
Horizon length makes no sense fundamentally. The 9s of reliability don't matter for anything but driving a car. Software and hacking and science and engineering are not a series of sequential steps. It is more like adding stuff to your house. If you can make a tiny improvement then you are net-helpful and you can build a cathedral. If you make improvements more often than you cause problems, anything is possible. If you break stuff more often than you fix it, then nothing is possible. It is a binary question. If you make big improvements, then small mistakes don't matter. If you mess up stuff bigtime and can't fix it, then big improvements are useless.
If I want to know what level of task complexity I can give an LLM to get a guaranteed correct answer, horizon length is a good measure.
I think once you assume a logistic function, it's almost guaranteed that if a new model solves one additional task, it's going to continue the log-linear trend.
No, whether it continues the log-linear trend depends on WHEN a new model solves one additional task.
Using the logistic function to fit task success probability vs task length is not load-bearing for the time horizon computation.
If the task lengths in the data set were uniformly distributed (in log-space), you could just take the overall accuracy and look up the task length at that percentile in the data, and that would be nearly identical to the 50% horizon. This would replace the logistic assumption with a step-function assumption, but because the logistic is point-symmetric about the 50% point, it's roughly the same value.
Put differently: There is an interval where the model goes from basically 100% to basically 0% and the logistic just takes the point in the middle for the 50% horizon. And there are many other methods that also take the point in the middle (possibly in a less robust way, but that would just make the trend a little more noisy).
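A minimal sketch of that step-function alternative, with made-up numbers in place of the real task suite:

```python
# If the model solves a fraction `acc` of tasks and is assumed to succeed on
# everything shorter than its horizon and nothing longer, the horizon is just
# the acc-quantile of the task lengths. Toy numbers, not METR data.
import numpy as np

rng = np.random.default_rng(0)
task_minutes = 2.0 ** rng.uniform(-2, 8, 500)   # task lengths ~ uniform in log2-space
acc = 0.55                                      # the model's overall accuracy

horizon = np.quantile(task_minutes, acc)        # acc-quantile of task length
print(f"step-function horizon ~ {horizon:.1f} minutes")
```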
I think your feeling that this is suspect comes more from the choice of the log-space, not the choice of the fitting function. It feels a bit circular to say "we gonna assume that the log-space of task length is the natural way to look at model competence" and then get the result that model competence improves linearly in that log-space over calendar time. I think it is also rather this choice of log-space, not of the logistic, that is motivated by the log-linear plot of model success rate vs human time to complete.
Also I would point out that the validity of the time horizons computed for the current models is not just based on these 16 tasks, but on the preceding six year trend + replications of exponential trends in other datasets. It's great to point out that current measurements have a ton of noise and are very gameable, but it's hard to attack the conclusion of exponential progress in time horizons.
One particular chart in this post would be much easier to look at if it didn't have an uncanny abomination wrapped in human skin stapled to it.
TL;DR: In 2025, we were in the 1-4 hour range, which has only 14 samples in METR’s underlying data. The topic of each sample is public, making it easy for a frontier lab to game METR horizon length measurements, sometimes inadvertently. Finally, the “horizon length” under METR’s assumptions might add little information beyond benchmark accuracy. None of this is to criticize METR; in research, it's hard to be perfect on the first release. But I’m tired of what is being inferred from this plot, pls stop!
The METR horizon length plot was an excellent idea: it proposed measuring the length of tasks models can complete (in terms of estimated human hours needed) instead of accuracy. I'm glad it shifted the community toward caring about long-horizon tasks. They are a better measure of automation impacts, and economic outcomes (for example, labor laws are often based on number of hours of work).
However, I think we are overindexing on it far too much, especially the AI Safety community, which makes huge updates to timelines and research priorities based on it. I suspect (from many anecdotes, including roon’s) that the METR plot has influenced significant investment decisions, but I’ve not been in any boardrooms.
Here is the problem with this. In 2025, according to this plot, frontier AI progress occurred in the regime of horizon lengths between 1 and 4 hours.
Guess how many samples have 1-4hr estimated task lengths?
Just 14. How do we know? Kudos to the authors, the paper has this information, and they transparently provide task metadata.
Hopefully, for many, this alone rings alarm bells. Under no circumstance should we be making such large inferences about AGI timelines, US vs China, Closed vs Open model progress, research priorities, individual model quality, etc. based on just 14 samples. An early sign of this problem appeared when the original METR paper was released in March 2025. The best-performing model at the time, Claude 3.7 Sonnet, was estimated to have a horizon length of 59 mins. Now see its success rate distribution over task lengths:
Notice how the model has almost a 60% (±15%) probability of success on 1-2hr tasks. So why is the estimated 50% success horizon length 59 minutes?! Because it doesn’t get anything right in the 2-4hr range. METR calculates the horizon length by fitting a logistic curve to individual sample outcomes, like the dark purple line above. Notice how the 0% success rate in the 2-4hr range leads to a very bad logistic fit (the curve is below the 95% confidence interval for the 0.5-1hr and 1-2hr ranges). We’ll come to my skepticism about the core modelling assumption, of using a logistic curve, later. My suspicion is that Claude 3.7 Sonnet has 0% success in the 2-4hr range because there were only 6 samples in that range, most of which were from cybersecurity capture-the-flag contests. Cyber is considered a dual-use, safety-hazard capability in WMDP, which labs were careful about in early 2025. Remember, this is Anthropic.
I promised you there's a way to game the horizon length on the METR eval. Here's how. The samples in the 1 minute to 16 hour range mostly come from HCAST. It turns out HCAST transparently tells us what each of these tasks is about.
HCAST Task Descriptions, 1.5-3.5 hours
Appendix D has a description of each task, sorted by estimated time taken. It's unclear exactly which 14 samples the METR horizon length plot uses, but the list is small enough to consider them all.
Why is this a big deal? Well, if you know what task you want to improve performance on, it's really easy to do. You can create targeted synthetic data, or just hire vendors like Scale, Mercor, and Surge to upsample such tasks in your post-training mix. Notice that most of the tasks in this range are cybersecurity CTFs and MLE tasks. OpenAI has been explicit about specifically targeting these capabilities for Codex models:
Now, I'm not saying the labs are focusing on these tasks to improve on the METR plot. They probably have other incentives for this. But this is precisely why the METR plot is unlikely to generalize: it measures exactly what the labs are focusing on! If Kimi or DeepSeek want to shoot past, they can just collect a lot of ML-training and cybersecurity prompts and finetune on them.
Note that, given there are only 14 samples in the relevant task length range, getting even 1 or 2 extra samples right significantly increases horizon length! It probably increases even more if you get the longer tasks (8h+, from RE-Bench) right, by luck or overfitting, as today's Claude Opus 4.5 result showed us. In fact, perhaps because Anthropic doesn’t want to risk training on cybersecurity, we still see low accuracy in the 2-4hr range?
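To make the sensitivity concrete, here's a toy simulation (synthetic outcomes, not METR's data): fit the logistic horizon, then count two long-task failures as successes and refit.

```python
# Toy sensitivity check: with only a handful of long tasks, see how much the
# fitted 50% horizon moves when two long-task outcomes flip to successes.
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_horizon(log2_len, success):
    clf = LogisticRegression(C=1e6).fit(log2_len.reshape(-1, 1), success)
    # P = 0.5 where coef * x + intercept = 0, so the horizon is 2^(-b/w) minutes.
    return 2.0 ** (-clf.intercept_[0] / clf.coef_[0][0])

rng = np.random.default_rng(0)
log2_len = np.concatenate([rng.uniform(-2, 6, 150),   # plenty of shorter tasks
                           rng.uniform(6, 8, 14)])    # only 14 tasks in the 1-4hr range
success = (rng.random(164) < 1 / (1 + np.exp(log2_len - 5))).astype(int)

print("baseline horizon (min):", fit_horizon(log2_len, success))
flipped = success.copy()
flipped[-2:] = 1                                      # count the last two long tasks as solved
print("after 2 extra long successes:", fit_horizon(log2_len, flipped))
```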
Finally, let's look at how METR estimates the 50% success horizon length. They assume a logistic relation between the probability of success and the gap between the horizon length (the estimated variable) and the task length:
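Written out (this is my reading of the setup; the paper's exact parameterization may differ), the assumed relation for a task of human length $t$ is:

$$P(\text{success} \mid t) \;=\; \frac{1}{1 + e^{-\beta\,(\log_2 h \,-\, \log_2 t)}}$$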
You infer h (the 50% horizon length) by fitting the 0/1 success evaluation data of each task. β is also a learnt parameter, governing the slope of how fast the logistic function falls from 1 to 0.
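Here's a minimal sketch of that fit on toy outcomes (maximum likelihood over h and β; this mirrors the functional form above, not METR's actual estimation code):

```python
# MLE of the 50% horizon h and slope beta from per-task 0/1 outcomes; toy data.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
log2_t = rng.uniform(-4, 10, 170)                       # log2 of human task length (minutes)
p_true = 1 / (1 + np.exp(-0.7 * (5.0 - log2_t)))        # "true" h = 32 min, beta = 0.7
y = (rng.random(170) < p_true).astype(float)            # 0/1 success outcomes

def neg_log_lik(params):
    log2_h, beta = params
    p = 1 / (1 + np.exp(-beta * (log2_h - log2_t)))
    p = np.clip(p, 1e-9, 1 - 1e-9)
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

res = minimize(neg_log_lik, x0=[3.0, 1.0])
log2_h, beta = res.x
print(f"estimated horizon ~ {2 ** log2_h:.0f} minutes, beta ~ {beta:.2f}")
```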
I think once you assume a logistic function, it's almost guaranteed that if a new model solves one additional task, it's going to continue the log-linear trend. Remember that METR also only adds models to the plot when they think they are likely to push the frontier. Coupled with measuring on a task distribution that model developers are actively trying to improve on, I think the log-linear trend, or X-month doubling period, pops out almost tautologically from the logistic fit assumption.
For example, I tried deriving the horizon length from JUST their reported accuracy, without looking at individual sample evaluations at all. Remember how the main contribution of the METR plot was shifting focus from aggregate accuracy to horizon lengths? Well, it turns out that if you use the aggregate accuracy and the task length distribution, and fit the logistic function to estimate horizon length assuming even a constant β = 0.7, you recover the log-linear trend:
This means that if you had access to just the aggregate accuracy on HCAST, you could estimate the horizon length without knowing which samples the model gets right or wrong. It could be wrong on the short ones and right on the long ones, for all you care.
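A sketch of that accuracy-only reconstruction (toy task lengths; the real HCAST lengths would be plugged in instead):

```python
# Given only aggregate accuracy and the task-length distribution, solve for the
# horizon h that makes a fixed-slope (beta = 0.7) logistic predict that accuracy.
import numpy as np
from scipy.optimize import brentq

rng = np.random.default_rng(0)
log2_t = rng.uniform(-4, 10, 170)          # log2 of task lengths in the suite (toy)
accuracy = 0.45                            # the model's reported aggregate accuracy
beta = 0.7

def predicted_accuracy(log2_h):
    return np.mean(1 / (1 + np.exp(-beta * (log2_h - log2_t))))

# predicted_accuracy is increasing in h, so a scalar root-find recovers the horizon.
log2_h = brentq(lambda x: predicted_accuracy(x) - accuracy, -10, 20)
print(f"implied 50% horizon ~ {2 ** log2_h:.0f} minutes")
```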
Now presumably this logistic fit assumption arises from an earlier plot in the paper, claiming model success rates go down linearly with doubling in task length. I have a qualm with this plot too:
Notice how the log-linear fit here only looks good for the SWAA data, in the 1 sec - 1 min range. There's something completely different going on for tasks longer than 1 minute, clearly not explained by the log-linear fit. If you tried to make a best fit line on the blue points (the length of tasks we care about after 2024), you'd get a very different, almost vertical line, with a very low R^2. I don't know how load-bearing this plot is for the choice of a logistic function to fit p(success) vs task length when estimating horizon lengths.
I am not a statistician, so I am uncertain about this final part of the analysis. I don't know what it implies. I don't know how problematic it is to assume a logistic function for this data. I invite people more experienced than me in statistics to analyze it, because it seems a bit suspicious.
Overall, I wish we had more, and more robust, measurements of model horizon lengths. I think it is a much more meaningful metric than accuracy, because it directly affects automation impacts and economic outcomes (for example, labor laws are often based on number of hours of work). Heck, I even wrote a paper on this topic. I applaud METR for turning my, and many others', attention towards this. But the way people are misinterpreting, and making wild inferences from, the headline horizon length numbers METR puts out every month worries me. If we are staking investment decisions and research priorities on an evaluation, it needs to be really robust. And making robust long-horizon benchmarks is hard, expensive, and uncharted territory. I hope METR plot v2 rises to the challenge!
I thank Sumeet Motwani, Ameya Prabhu, Arvindh Arun, and Akshit Sinha for feedback on this post. I appreciate folks at METR recognizing the value of these critiques when I tweeted about them. If you liked this, consider subscribing to my substack, where I post more often.