Curated. This seemed like a (relatively) straightforward thing to check that seems straightforwardly useful. I'm interested in seeing METR run a version of this against their existing task suite.
There's always a bit of a double-edged sword of making a good capabilities eval because, even if people don't have direct access to the eval to iterate against, it implicitly becomes a target. (i.e. it's hard to tell, but I get some sense of companies striving to hit the METR trendline and beat the other companies on it).
My understanding is the METR task suite is basically saturated. You could probably construct a good version of this that saturates less quickly.
I'm wondering if there's any way to keep this artificially low while making CoT time horizons high, and if there's some sort of index you could publish that's, like, ratio of CoT-time-horizon to non-CoT or something. I think for it to be that real/helpful you'd also want some kind of "...and the CoT is faithful" metric that I don't currently know of a robust solution for. (This is not a very well thought out idea, just musing)
It seems pretty challenging to run something like the METR tasks in a non-CoT way. If you are doing tasks like software engineering that are generative (writing any code or agent loop, not just some form of labeling), it's hard to stop the models from doing things that are essentially CoT. A trick some models[1] will use when writing code is to write very verbose in-line comments in which they are essentially packing CoT inside the code. Even if you somehow convince them not to write comments, they can still do things like write a lot of dead code. (It might be possible to control for this somehow though.)
The OP's methodology sensibly sidesteps this entirely by excluding generative software tasks from the main results. One could possibly convert some of the METR tasks into labeling-shaped tasks, but it would not be a straight rerun. Agree one could make a version that would be very hard to saturate.
My current impression, though, is that the metrics in this study already seem partially saturated / slightly broken. This is because in the >0.5hr regime there is only a single task, with a spiky metric (SHADE-Arena Monitor TPR @ 1% FPR), on which frontier models already measure greater than 50% TPR (with massive error bars). Also, a lot of the important tasks in the current trend lack actual human data to ground them. (I discuss this some in another comment and under MKodama's comment.) I might be missing something though...
A follow-up study / iteration that thinks more about good data covering the full spectrum of time seems like it would be necessary to draw many of the conclusions being drawn here. Plus future thinking on how timeable the things we care about are.
I can't point to exact examples, but loosely this was especially true of models from roughly the Gemini 2 and GPT-4.1 era, which were RL-ed hard on a bunch of competitive programming problems without much in the way of code-readability constraints.
A very interesting paper quantifying the conventional wisdom that a large degree of progress is from inference scaling.
You have got access to GPT-3 and 3.5 from OpenAI, but have you tried to get the same for the now publicly unavailable Claude models, for better coverage of the first half of 2025?
We don't have privileged access from OpenAI. Similar to METR, we use the closest available public models to estimate the capabilities of models that are no longer publicly available.
I find the data in the left half of figure 2 surprising. It's saying that without CoT, models are good at very short horizon tasks and at somewhat long horizon tasks but bad at intermediate tasks. For instance, Opus 4.7 does well on <1 minute tasks and well on 1 hour tasks, but it does poorly on 5 minute tasks.
Do you have an idea why that might be the case? It's not what you would expect under a survival model (like Toby Ord proposed to explain the METR TH study).
Thinking about whether Fig 2 makes sense seems like a really good point.
When I was playing with what I could make of the data in a different comment, I plotted the different tasks. One can observe that there is only a single task (SHADE-Arena Monitor TPR @ 1% FPR) in this >0.5hr regime. I think this should feel pretty suspect. Anything you see on the right side of the figure 2 plots is coming entirely from just one task, with a metric that I think could be pretty spiky / non-linear. This one task absolutely does not represent the universe of important >0.5hr tasks.
Seems pretty important, unless I'm confused and there are other tasks here.
Since the post by Ryan Greenblatt I've been hoping to see similar, expanded work. I think it's an interesting thing to measure, since it can plausibly have large effects on the frequency of minor (though sometimes significant) mistakes.
It's also interesting how this seems to break down for certain types of reasoning and retrieval. As Greenblatt showed in another post, models are really bad at retrieving several facts required to compute something in one forward pass, even though the required tokens to turn the failures into successes are usually much lower than the 50% estimate of even GPT-3.5. For example GPT-5.5 without CoT can't robustly figure out if a given number (even as low as < 100) is a multiple of 7 when asked to do a countdown wherein the multiples of 7 are replaced with some string. This too seems to be getting better with scale, though I can't say if it's slower than the trends here in terms of doubling time. It's plausible that there aren't that many cases where you need to retrieve lots of out of context facts in a single forward pass, so it's not prioritized much by the training process and its usefulness might level off at relatively few hops. Or maybe it's just hard for transformers to implement.
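To make the countdown task concrete, here is a minimal sketch of how the expected answer could be generated. The function name and the replacement string `BUZZ` are hypothetical; the exact prompt and string used in the original experiments are not given here.

```python
def countdown_with_replacement(start, divisor=7, token="BUZZ"):
    """Expected output for a countdown from `start` to 1 in which every
    multiple of `divisor` is replaced by `token` (a hypothetical string)."""
    return [token if n % divisor == 0 else str(n) for n in range(start, 0, -1)]


expected = countdown_with_replacement(15)
print(" ".join(expected))
```

A model answering without CoT has to decide, for each number, whether it is a multiple of 7 within the forward passes that emit that item, which is the step reported as unreliable above.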
Hope to see this (or similar measures) be kept up to date as the frontier moves forward!
Very interesting and useful! I wonder how much the ceiling on what is ultimately achievable with CoT is determined by the "bag of tricks" available in the forward pass (e.g., knowledge, algorithms embedded in the parameters, "intuitions," etc.), which is hard to improve without degrading monitorability, vs. context length/"how much of the previous context is taken into account" when predicting a single token. I imagine this would be very task-dependent.
My first thought after reading the title and before reading the article was "steganography". LLMs don't need to hide hours-long complex tasks and chains of thought to be dangerous. They possibly need only a few minutes of hidden mental work to figure out and converge on a steganographic approach in which a slower hidden chain of thought is produced through specific use of words, synonyms, and repetitions.
Some current models tend to repeat themselves a few times in their CoT, using different wording. This behavior makes steganography easier.
Cool work! I have a couple of questions. What is the trend for the number of tokens spent on questions, and does it correlate with the ability to solve problems? And are you also planning to add any open-weight models such as qwen where variety of model sizes exist? Thanks in advance
are you also planning to add any open-weight models such as qwen where variety of model sizes exist
We analyzed 35 open-weight models, you can see the results in Section 4.3 and Appendix A.16!
not OP, but will note they intentionally focus the main results on tasks with short results. There presumably shouldn't be a large difference based on number of tokens spent. That would be more about CoT, not a non-CoT measurement.
They do analysis around different sizes of open weight models in Appendix A.16. Not sure if you had a different question than what they address there.
but will note they intentionally focus the main results on tasks with short results.
We do this because it gives a sharper, more distinct meaning of no-CoT, i.e., in the main plot we restrict to tasks which only require very few forward passes. See the paper and the comments above showing that including longer tasks, including generation and long-horizon agentic tasks, doesn't change the trends that much.
A very interesting article indeed! Besides the potential implication for AI safety and explainability, this quantitative evaluation of the relative rates of evolution of model performance with and without CoT can be quite helpful in real-world scenarios where the cost of CoT in terms of latency and/or token usage is a real limitation.
Not sure whether it is possible to answer this question at the moment, but I would be interested in any insight on whether this trend comes from a fundamental property of modern LLMs (and thus likely to continue for the foreseeable future) or if it may be primarily due to a current focus on CoT from frontier labs.
Very cool and I enjoyed reading. I’ve never thought about this concept before receiving this post in my inbox. However, I am skeptical of the “reasoning” framing. I would phrase it like this: Frontier models are getting better at producing correct answers to hard questions when forced to output directly, and this capability is improving exponentially.
Though I don’t think the distinction matters for the safety implications
Interesting to see. Some things I wonder about, (1) which tasks most influence the results, (2) how much time numbers make sense, and (3) general elicitation questions
Which tasks and kinds of tasks are most influencing the headline?
With METR’s timeline plots there has been discussion here on LW and elsewhere about if certain items have a large influence on the results. I was wondering which kinds of tasks most explain the recent doublings already, and what might worlds look like where they doubled again according to this methodology.
I sort of attempted to vibe very rough versions of this analysis. It's hard without the underlying data, but I can get Claude to try to extract accuracy numbers from Tables 7-13 and pixel-peep Fig 47 to get medians and IQRs for the human-time data.[1]

Things in the ~1-3min band are most likely to be near the 50% crossover point, and I think they will have more influence on the regression. A few outliers like the "vibe-coding sabotage" on the Y axis and the SHADE monitor on the X axis might deserve some consideration. These both use a kind of TPR @ ~1-2% FPR metric, different from some other datasets. Unclear effects here.
This is vibey though and would want access to underlying data to do more confident analysis.
There are also other ways of quantifying the influence of each dataset (where the "vibe-coding sabotage" task again shows up as a clear outlier with more than twice as much influence as any other dataset). I hide it in a collapsible though, as I'm not super confident in the methodology.
Vibey can we calculate an influence score?

Claude's explanation:
The bars estimate how much each benchmark plausibly contributed to the headline time-horizon gain between GPT-4o and GPT-5.5. The headline metric is the 50% crossover of a logistic curve fit to success vs. log human solve time, and a basic property of logistic fits is that an observation's influence on the fitted curve scales with p(1−p) — maximal for tasks the model passes about half the time, near zero for tasks it almost always passes or almost always fails. Since the crossover moved from roughly 0.9 to 3 minutes between these two models, accuracy gains on tasks in that time range supply the most evidence about where the crossover sits. Each bar is therefore the benchmark's accuracy change (chance-corrected, approximately following the paper's normalization — exactly for MCQ tasks, while the detection tasks' ~1.5% baseline was treated as 0, a ~1pp difference) multiplied by a p(1−p) weight centered at ~1.6 minutes and about a decade wide in log time, averaged over the benchmark's per-question human-time distribution as extracted from the paper's Figure 47. The whiskers show how that weight varies across the middle 50% of each benchmark's tasks. This is a heuristic approximation, not a refit of the paper's actual estimator: it captures the main channel (gains near the crossover move the crossover) but ignores how off-crossover gains rotate the curve's slope, and ignores the anchoring effect of long tasks still near 0%, which constrain how far the crossover can move. The ranking at the top is stable across reasonable choices of kernel width, but the precise values shouldn't be taken seriously — a leave-one-benchmark-out refit on the underlying data would be the right version of this analysis.
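The heuristic described above can be sketched roughly as follows. This is not the code behind the plot: the function names, the `scale_decades` kernel-width parameter, and the example numbers are all assumptions made for illustration. Only the p(1−p) weighting and the ~1.6 minute centre come from the explanation above.

```python
import math


def crossover_weight(minutes, crossover_min=1.6, scale_decades=0.25):
    """p(1-p) weight for a task that takes a human `minutes` to solve.

    p is a logistic in log10(time) centred on the crossover, so the weight
    peaks at 0.25 when minutes == crossover_min and decays on either side.
    scale_decades is a hypothetical width parameter, not a value from the paper.
    """
    z = (math.log10(minutes) - math.log10(crossover_min)) / scale_decades
    p = 1.0 / (1.0 + math.exp(z))  # modelled success probability, falling with task length
    return p * (1.0 - p)


def influence_score(accuracy_gain, task_minutes, **kernel):
    """Accuracy change times the mean p(1-p) weight over a benchmark's tasks."""
    weights = [crossover_weight(m, **kernel) for m in task_minutes]
    return accuracy_gain * sum(weights) / len(weights)


# Made-up benchmark: +40pp chance-corrected gain, tasks of 1, 2 and 30 minutes.
print(influence_score(0.40, [1.0, 2.0, 30.0]))
```

As the explanation notes, a leave-one-benchmark-out refit on the real data would be the more trustworthy version; this only captures the "gains near the crossover matter most" channel.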
Time numbers
In general I have mixed thoughts on human time estimates as a metric in some cases, and how well time-horizon can approximate what I'd be most concerned about models packing into a forward pass[2]
It seems worth reflecting on how much trust to put in time-horizon metrics if we often can't actually get good human-time numbers in the first place. One particular consideration is that many of the most influential datasets in the regression don't have actual human time measurements, and instead come from a process of using Claude to estimate human time. I think I'd believe Claude can make good guesses here (and the paper has numbers on this for the tested tasks), but it still seems worth considering. The authors are thorough in this paper, and I eventually noticed Fig 19 in the appendix, which gives the regression looking only at cases with true numbers. The doubling time goes from 373 [167, 691] days -> 349 [166, 1883] days. So the point estimate is surprisingly stable, but a confidence interval from ~6 months to 5 years is substantial.
Again, I like the ideas here and the thoroughness of the paper, but just bringing up general consideration given how much attention and extrapolation METR's original time horizon plot got. I might eventually form a clearer view here and write up more thoughts...
Edit: see also the discussion on MKodama's comment about the logistic plots. The fact there is only a single task in the >0.5hr regime looks pretty problematic for much of the main analysis (eg, for Fig 2 here / Fig 5 in the paper). To make some of the claims above, there really should either be more tasks in this time regime, or this part of the plots should be excluded or should better indicate that it comes from just a single task. I haven't thought as much about the o3-mini anchor, but that is a different, possibly reasonable angle.
What's going on with few shot?
Language models are always sensitive to how you prompt them. This is especially true for earlier ones. In the headline regression GPT-2 and GPT-3 are excluded. This might be reasonable, but it is also worth considering aspects like prompt construction. GPT-3 was quite a smart model, it just was not “ChatGPT 100M user product-ready” because it required more care when prompting. Section 3 discusses a bit how few-shot prompting is used to elicit the non-CoT format. It's unclear how many few-shot examples are used or whether this is typically enough for GPT-3-type base models. This question doesn't seem core to the study, but maybe a separate study could look at a version of this data with a uniform, say, 10-shot prompt in every dataset or something.
--
Thanks for all the work on exploring these models and sharing findings.
btw, I think the caption on Fig 47 might be off, since it says IQR but appears to show all points? The whiskers in my plots assume taking only an approximate middle slice of the Fig 47 points.
Stuff like Choi et al.'s measurements of models making inferences about user attributes. These don't have clean mappings to how long I think a human would take to perform them, but they still fit the concerning category of "intuition, not reasoning" things that happen in a forward pass.
Quick response:
> (1) which tasks most influence the results
The mainline trends and uncertainty estimates are computed from bootstraps which resample many times over different benchmarks, meaning that the analysis is robust to leaving out any particular (subset of) benchmarks. This is true for both the main doubling trends and the individual model TH estimates. From the paper:

>The fact there is only a single task in the >0.5hr regime looks pretty problematic
We have lots of tasks that are >0.5 hr. In the main trend we only include a subset of (mostly shorter) tasks, but including all tasks doesn't change the trend that much (the overall doubling time reduces a bit because frontier models can do some long-horizon SWE tasks without CoT, e.g., just outputting tool calls at each step).
From the paper:
> Figure 22: Comparison of TH trend lines using all tasks including short-answer, generation, and multi-turn agentic. The most significant impact of including the multi-turn agentic tasks is an increase in the point-estimate THs for the latest frontier models which perform very well on these tasks.
It would be helpful if you could ask more precise questions about the rest :)
Thanks for reply!
> which tasks most influence the results
Right, bootstrapping here seems like a good approach, and I appreciate the thoroughness. It will communicate an interval, but I still wonder what the current doubling looks like and what future ones might look like. There are various taxonomies here (eg, which tasks can gain from knowledge retrieval over reasoning, as you mentioned in the post). I tried this some in the fig I made.
Bootstrapping is a tool that tells us some things, but not others. It will still be grounded in the selected distribution of tasks. (Eg, as a cartoon case where bootstrapping can mislead us, imagine we selected 10/32 tasks that were at their core about memorizing, say, birthdays of baseball players. If models suddenly got good at memorizing birthdays, it would look like a large improvement with a tight CI, but it wouldn't capture the overall variance of the core metric. Again, a cartoon extreme.) So this is something I was curious about after the initial read of the paper.
In concrete terms, an individual dataset will have a ~64% chance[1] of being included in any one bootstrap resample. I was just trying to get an impression of this. I can understand that more often than not outlier-ish datasets like "Vibe-coding Sabotage" and "SHADE-Arena Monitor" will be included. I shared some findings in my comment, but as I said, they are vibey and still underexplored, something for future work.
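The ~64% figure follows from sampling 32 datasets with replacement: a given dataset is missed on every draw with probability (1 − 1/32)^32. A small check, with a hypothetical helper name:

```python
def inclusion_probability(n_items, n_draws=None):
    """Chance that one particular item appears at least once when drawing
    n_draws times with replacement from n_items (default: n_draws == n_items)."""
    if n_draws is None:
        n_draws = n_items
    return 1.0 - (1.0 - 1.0 / n_items) ** n_draws


print(f"{inclusion_probability(32):.3f}")  # prints 0.638
```

For large n this tends to 1 − 1/e ≈ 0.632, so any single dataset is absent from roughly a third of resamples regardless of how many datasets there are.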
> Over 0.5hr regime
Right, to be clear, I think focusing on a short-answer subset is the cleanest way to get at this non-CoT question. But this filtering is where it gets somewhat problematic when trying to understand the logistic regression fit for a given model in the 0 to 4hr range (like in post Fig 2), and when thinking about what future doublings under the methodology will have to look like as we push upwards.
As you highlight, when including non-short-answer tasks for non-headline results there are ~7 rather than 1, so more coverage (sorry if it felt like I was ignoring this in an unfair way). I don't feel I have much certainty about how CoT-free the other 6 tasks are (and my impression is y'all are also a bit uncertain, given that you filter them out). As I mentioned in the comment elsewhere with Raemon, I know models, especially of a certain era, were very prone to doing things like writing long in-line comments, and basically packing CoT into the middle of their answers.
I agree that the fact that the regressions with and without these tasks don't change much is reassuring. I still think the questions around dataset influence seem important. One thing I learned when looking at this was an appreciation for how the tails of a logistic regression matter less (counter to my intuitions from linear regression, where tail points matter). So then it becomes about looking at which datasets are most influential near the crossover point, as I tried to loosely explore above, since datasets near the crossover point will most influence the results. I haven't yet made a version of my plot using the non-short-answer sample to try to see measures of influence there. Thanks for the out-of-band exchange about sharing the data in case I do this later.
> It would be helpful if you could ask more precise questions about the rest :)
I don't have a quick expansion to give here right now as direct questions (regarding thoughts on time horizon as a metric, how estimateable it is on things we care about, elicitation variations). Possibly might expand later, or happy to chat synchronously sometime.
--
Thanks for engaging with this! I like many aspects of the study, and hopefully discussion is useful.
Sampling 32 tasks with replacement.
(see full author list at the end)
About a year ago, METR showed that the length of tasks frontier models can reliably complete doubles every few months. A related safety-relevant question is this: what length of tasks can models complete without any chain of thought (CoT)? We investigate in our new paper.
If models can do extensive reasoning without outputting any CoT, it would have implications for safety. Developers and deployment-time monitors couldn’t easily understand models’ motivations and catch dangerous planning. Models that reason substantially without a CoT might also drift further from human patterns of thought, since their reasoning is no longer constrained by text in the pretraining prior. As a result, they would be harder to understand and might be more likely to scheme.
Extending Ryan Greenblatt’s research, we investigate this by measuring models' ability to complete tasks without any CoT on a suite of 43 benchmarks spanning different domains. We compare AI reasoning ability to humans using the estimated 50% time horizon (TH), the typical time taken for a human to perform a task that the LLM performs with a 50% success rate. We find that frontier models like GPT-5.5 answer questions that take humans roughly three minutes with 50% reliability, and this TH has doubled approximately every year since 2019.
Figure 1: Our no-CoT THs (green) compared to METR’s with-CoT THs (purple). Until the release of GPT-4, with- and without-CoT THs increased at a similar rate. Since GPT-4, with-CoT THs have grown at roughly twice the rate of no-CoT THs.
We suggest that AI companies start to track no-CoT THs explicitly to find a lower bound on how much reasoning a model could do without revealing it to a CoT monitor or human. Our test suite does not require substantial inference compute so these evaluations are cheap to run.
Methods
We evaluated 14 frontier models from GPT-2 (2019) through GPT-5.5 (2026) on 43 benchmarks covering math, coding, knowledge, agentic tool-use, and safety-relevant questions like steganography and scheming reasoning. The questions include single-token answers, longer generation, and multi-turn agentic settings.
In every case we prevent the model from emitting reasoning tokens before its answer by using benchmark-specific prompts and structured-output constraints.
We estimate question difficulty in two ways:
These two methods give us independent measures of “how much reasoning” is needed to solve a problem.
For each model, we estimate its 50% solve horizon on both of these metrics using the same method as Kwa et al. We fit a logistic curve to success rate as a function of log human solve time (and, separately, of log reasoning token anchor), and then find the point where the logistic equals 50%.
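As an illustration of the fit-then-read-off-50% step, here is a minimal sketch on toy data. It is not the paper's estimator: the log base (log2 of minutes), the plain Newton's-method fit, the function names, and the data are all assumptions, and chance-correction and weighting are left out.

```python
import math


def fit_logistic(xs, ys, iters=50, damping=1e-9):
    """Fit P(success) = sigmoid(a + b * x) by Newton's method.

    xs: log2 human solve time (minutes) per attempt; ys: 0/1 outcomes.
    damping is a tiny diagonal term added to the Hessian for stability.
    """
    a, b = 0.0, 0.0
    for _ in range(iters):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, y in zip(xs, ys):
            z = max(-30.0, min(30.0, a + b * x))  # clamp to avoid overflow in exp
            p = 1.0 / (1.0 + math.exp(-z))
            w = p * (1.0 - p)
            g0 += y - p            # gradient of the log-likelihood w.r.t. a
            g1 += (y - p) * x      # gradient w.r.t. b
            h00 += w
            h01 += w * x
            h11 += w * x * x
        h00 += damping
        h11 += damping
        det = h00 * h11 - h01 * h01
        a += (h11 * g0 - h01 * g1) / det
        b += (h00 * g1 - h01 * g0) / det
    return a, b


def horizon_minutes(a, b):
    """Human time at which the fitted success probability equals 50%."""
    return 2.0 ** (-a / b)


# Toy data: 90% success on 15-second tasks, 50% on 1-minute, 10% on 4-minute.
xs = [-2.0] * 10 + [0.0] * 10 + [2.0] * 10
ys = [1] * 9 + [0] + [1] * 5 + [0] * 5 + [1] + [0] * 9
a, b = fit_logistic(xs, ys)
print(horizon_minutes(a, b))
```

On this symmetric toy data the 50% horizon comes out at about one minute; the slope `b` is negative because success falls as tasks get longer.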
Figure 2: For each model, we fit a logistic curve to predict the probability it successfully completes tasks as a function of human time (left) or o3-mini reasoning tokens (right) needed to complete the task.
To estimate uncertainty, we rerun the calculation 10,000 times, resampling with replacement over benchmarks, questions within each benchmark, and model runs per question. We also incorporate uncertainty in human solve times by calibrating a noise model on all questions. In contrast, we don’t model uncertainty in the reasoning token anchor.
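A simplified sketch of this kind of nested resampling is below. It covers only two of the levels (benchmarks, then questions within each benchmark) and omits resampling of model runs and the human-time noise model. The function names, the percentile interval, and the toy data are assumptions, not the paper's implementation.

```python
import random


def hierarchical_bootstrap(data, statistic, n_boot=1000, seed=0):
    """data: dict mapping benchmark name -> list of per-question records.

    Each replicate resamples benchmarks with replacement, then resamples
    questions with replacement inside every drawn benchmark, and applies
    `statistic` to the pooled records. Returns the sorted replicate values.
    """
    rng = random.Random(seed)
    names = list(data)
    values = []
    for _ in range(n_boot):
        pooled = []
        for name in rng.choices(names, k=len(names)):
            questions = data[name]
            pooled.extend(rng.choices(questions, k=len(questions)))
        values.append(statistic(pooled))
    return sorted(values)


def percentile_interval(sorted_values, level=0.95):
    """Simple percentile interval from sorted bootstrap replicates."""
    n = len(sorted_values)
    lo = sorted_values[int((1.0 - level) / 2.0 * n)]
    hi = sorted_values[min(n - 1, int((1.0 + level) / 2.0 * n))]
    return lo, hi


# Toy data: 0/1 correctness per question in three hypothetical benchmarks.
toy = {"bench_a": [1, 1, 0, 1], "bench_b": [0, 0, 1, 0], "bench_c": [1, 0, 1, 1]}
mean = lambda records: sum(records) / len(records)
print(percentile_interval(hierarchical_bootstrap(toy, mean)))
```

In the real analysis the statistic would be the fitted time horizon or doubling time rather than a mean, so each replicate involves a full refit.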
Results
Figure 3: No-CoT TH over time. Each point is a model plotted by release date; lines are exponential fits, with shading showing 95% CIs.
Figure 4: Token horizon over time. Each point is a model plotted by release date; lines are exponential fits, with shading showing 95% CIs.
We find:
GPT-5.5 is at the current frontier, with a 50% no-CoT TH of about 3 minutes and a reasoning token horizon of about 1,500 o3-mini tokens.
Extrapolating the median forward gives ~7 minutes / ~3.7k tokens by 2028, and ~25 minutes / ~12k tokens by 2030. The doubling-time CI is wide, but even the slow end of the 95% CI puts us at almost 10 minutes of latent reasoning by 2030.
The doubling is roughly robust to changing the distribution of benchmarks: adding longer-generation and multi-turn agentic questions, restricting to questions with measured human times only, dropping any single model, and breaking down by question domain (where domain-specific doublings range from 246 to 387 days).
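For readers who want to see how a doubling time and an extrapolation fall out of an exponential fit, here is a minimal sketch: ordinary least squares on log2(TH) against release date in days. The function names and the synthetic data are assumptions, and this does not reproduce the paper's fit or its confidence intervals.

```python
import math


def fit_doubling_time(days, horizons_min):
    """Least-squares fit of log2(horizon) = intercept + days / doubling_days.

    days: release dates as days since an arbitrary origin.
    horizons_min: 50% time horizons in minutes.
    Returns (doubling_days, intercept).
    """
    n = len(days)
    ys = [math.log2(h) for h in horizons_min]
    mean_x = sum(days) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(days, ys))
    sxx = sum((x - mean_x) ** 2 for x in days)
    slope = sxy / sxx  # doublings per day
    return 1.0 / slope, mean_y - slope * mean_x


def extrapolate(day, doubling_days, intercept):
    """Horizon in minutes predicted by the fitted trend at `day`."""
    return 2.0 ** (intercept + day / doubling_days)


# Synthetic series that doubles exactly once per 365 days.
doubling, intercept = fit_doubling_time([0, 365, 730], [1.0, 2.0, 4.0])
print(doubling, extrapolate(1095, doubling, intercept))
```

Note that extrapolating from the fitted trend line and extrapolating from the latest model's point estimate can give different numbers when that model sits above or below the line.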
FAQ
Why is our trend different from the one in Kwa et al.?
Kwa et al. reported a doubling time of 182 days for THs on agentic software tasks with-CoT, which is about twice as fast as ours. The doubling times diverge after GPT-4, since most capability gains since GPT-4 have come from externalized reasoning, rather than more reasoning in a single forward pass.
Could we be tracking memorization rather than reasoning?
Our benchmarks include some factual recall questions, so we aren't purely measuring reasoning. We include many reasoning-heavy questions, so we think the headline number tracks reasoning well, but imperfectly.
Are models actually hiding their reasoning?
We're not claiming current models are actually hiding reasoning from CoT. We measure how much they could plausibly do invisibly, not how much they actually do. Some models verbalize their reasoning, even when it's unnecessary.
Could models do more hidden reasoning?
We measure a lower bound on opaque reasoning:
Conclusion
Frontier models can do non-trivial amounts of reasoning without any CoT, and the amount of no-CoT reasoning they can do is increasing exponentially.
Three minutes of human-equivalent reasoning might be enough for some single-shot decisions: whether to comply with an instruction, whether a scenario looks like an eval, or which of several visible options to take. But it's probably not enough to do complex long-term power-seeking reasoning, like deciding to training-game or collude with other instances. If trends continue, models could be capable of twenty-five minutes of no-CoT reasoning by 2030, which would enable much more subversion. Even without neuralese, CoT monitoring might become substantially less effective in the next few years.
Paper authors: Francis Rhys Ward, Dewi Gould, Anders Cairns Woodruff, Rauno Arike, Josh Hills, Alex Serrano, Ida Caspary, Jason Ross Brown, Jo J. Jiao, Patrick Leask, Twm Stone, Ram Potham, Ionut Gabriel Stan, Harry Mayne, Simeon Hellsten, Shubhorup Biswas, Ariana Azarbal, William L. Anderson, Elle Najt, Ryan Greenblatt, and Julian Stastny
Datasets are held back from public release to avoid training contamination; available on request. This was joint work at Redwood Research, the Astra Fellows Program, Aether Research, and MATS; full author list in the paper.