(Updated) METR's data can't distinguish between trajectories (and 80% horizons are an order of magnitude off)

Jonas Moss

Update: Added GPT-5.2 to the main part of the text, which now uses all data from v1.1. Added an appendix using all METR models, by joining v1.0 and v1.1. Added an appendix with marginal vs typical P(success) curves. Thanks to Thomas Kwa for telling me about this.

TLDR

I reanalyzed the METR task data using a Bayesian item response theory model.

The METR data cannot distinguish exponential from superexponential growth. Four trajectory shapes (linear, quadratic, power-law, saturating) fit the existing data equally well but diverge on forecasts. For instance, the 95% credible interval for the 125-year crossing is 2031-06 – 2033-10 for linear and 2028-07 – 2032-03 for quadratic.
METR’s headline horizon numbers overstate current capability by roughly an order of magnitude at 80% success. METR doesn’t model variation in task difficulty, so their horizons reflect a task of typical difficulty for its length. But tasks of the same length vary a lot in how hard they are, and difficult tasks pull the horizon down more than the easy tasks push it up. Curiously, this doesn’t affect timelines by more than ~1–2 years, as it’s just a level-shift.
We need data about the human times to quantify uncertainty. Credible intervals throughout are too narrow because I treat human times as known rather than estimating using latent variables. I’m doing this because I don’t have access to all the raw data. This could be a big deal, and could also affect the horizons.
Doubling time under the standard linear (exponential growth) model is ~4.3 months (95% credible interval: 3.7–5.1, but see caveat above), which is similar to METR’s estimate.

METR data

Let’s start with a plot that shouldn’t be too surprising. Four reasonable models fit the METR data (v 1.1) equally well. They agree about the past but disagree strongly about the future.

The model selection scores known as ELPD-LOO differ by at most ~6 points. ^[1] Calibration is nearly identical, with Brier 0.067 across the board. Your prior matters a lot here and has clear-cut consequences, as the models agree about the past but disagree strongly about the future. The latest data point is GPT-5.2 (December 2025).

These curves are fitted using a Bayesian item response theory model described below. Before describing it, let’s recall METR’s analysis of the time horizon. They proceed in two stages:

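Per-model logistic regression. For each model $i$, fit a logistic curve that can be written as $P(\text{success}_{ij}) = \sigma\big(\beta_i(\log h_i - \log t_j)\big)$, where $t_j$ is human time for task $j$ and $\sigma$ is the logistic function. Here $h_i$ is the task duration where the curve crosses 50%: when $t_j = h_i$, we get $P(\text{success}_{ij}) = 0.5$, so $h_i$ is a 50% horizon. This gives a “horizon score” per model.
An OLS trend. Regress $\log h_i$ on release date. The slope gives a doubling time of ~4 months.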
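
A minimal sketch of the two stages, using only the Python standard library and synthetic inputs. The function names, the hand-rolled Newton solver, and every number below are illustrative assumptions, not METR's code. The curve is parameterized as intercept plus slope times log2 human minutes, which is equivalent to the horizon form above up to reparameterization.

```python
import math

def fit_logistic(xs, ys, iters=50):
    """Stage 1: fit P(success) = sigmoid(b0 + b1 * x) by Newton's method."""
    b0 = b1 = 0.0
    for _ in range(iters):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, y in zip(xs, ys):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))
            w = p * (1.0 - p)
            g0 += y - p          # gradient of the log-likelihood
            g1 += (y - p) * x
            h00 += w             # negative Hessian entries
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 * h01
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1

def horizon_minutes(b0, b1, p=0.5):
    """Task length at which the fitted curve equals p (x is log2 minutes)."""
    return 2.0 ** ((math.log(p / (1.0 - p)) - b0) / b1)

def doubling_time_months(dates, horizons):
    """Stage 2: OLS of log2(horizon) on release date (in years)."""
    ys = [math.log2(h) for h in horizons]
    mx, my = sum(dates) / len(dates), sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(dates, ys))
    sxx = sum((x - mx) ** 2 for x in dates)
    return 12.0 / (sxy / sxx)

# Synthetic stage-1 data: success rates generated from a known curve.
TRUE_B0, TRUE_B1 = 2.944, -0.6
xs = [0.25 * k for k in range(41)]  # log2 human minutes, 0 to 10
ys = [1.0 / (1.0 + math.exp(-(TRUE_B0 + TRUE_B1 * x))) for x in xs]
b0, b1 = fit_logistic(xs, ys)
h50 = horizon_minutes(b0, b1, 0.5)
h80 = horizon_minutes(b0, b1, 0.8)

# Synthetic stage-2 data: horizons that double three times per year.
dates = [2023.0, 2023.5, 2024.0, 2024.5]
horizons = [5.0 * 2.0 ** (3.0 * (d - 2023.0)) for d in dates]
doubling = doubling_time_months(dates, horizons)
```

Note that the 80% horizon here is just a different point on the same fitted curve, and that nothing from the stage-1 fit except the point estimate is passed to stage 2.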

This is good modeling and gets the main story right, but there are some non-standard choices here. For instance, the slope varies with model rather than task (which is unusual in item response theory) and Stage 1 uncertainty is not accounted for in Stage 2 (METR uses the bootstrap). It also treats every task of the same length as equally difficult and only considers one trajectory shape.

In this post I make a joint model, adjust some things to be more in line with standard practice, and ask what happens when you try different trajectory shapes. The post is somewhat technical, but not so god-awful that Claude won’t be able to answer any question you have about the methodology. Models are fitted with Stan (4 chains, 1000 post-warmup draws), with code available here. I intentionally won’t go into details about technicalities, e.g. prior choices. The code contains everything you’ll want to know and your favorite LLM will figure it out for you. (All priors were chosen by Codex / Claude Code and appear reasonable enough.)

The basic model

The first stage of METR’s model is almost a 2-parameter logistic model (2PL), the workhorse of educational testing since the 1960s.

So, what kind of problems was the 2PL model designed for? Say you give 200 students a math exam with 50 questions and record their answers as correct / incorrect. You want to estimate the students’ math ability, but raw percent correct scores aren’t necessarily very good, as they depend on which questions (easy or hard? relative to which students?) happened to be on the exam.

The 2PL model solves this by giving each student a single ability score ($\theta_i$) and each question two parameters: a difficulty ($b_j$, how hard it is) and a discrimination ($a_j$, how cleanly it separates strong from weak students). “What is 3×2?” has low discrimination as everyone gets it right regardless of ability. A simple proof-writing question has high discrimination as sufficiently strong students can solve it, but weak students have no chance.

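The model estimates all parameters simultaneously via a logistic regression:

$$P(y_{ij} = 1) = \operatorname{logit}^{-1}\big(a_j(\theta_i - b_j)\big),$$

where $y_{ij}$ indicates whether student $i$ (ability $\theta_i$) answered question $j$ (difficulty $b_j$, discrimination $a_j$) correctly.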

This matters here because METR tasks are like exam questions. They vary in both difficulty and how well they separate strong from weak models, and we want to put all the models on a common ability scale.
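
To make the roles of the two item parameters concrete, here is a small sketch of the 2PL success probability. The function name and the example values are illustrative assumptions.

```python
import math

def p_success(theta, a, b):
    """2PL: probability that a test-taker with ability theta gets an item
    with discrimination a and difficulty b right."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

# One weak and one strong test-taker on two items of equal difficulty.
weak, strong = -1.0, 1.0
low_disc = (p_success(weak, 0.3, 0.0), p_success(strong, 0.3, 0.0))
high_disc = (p_success(weak, 3.0, 0.0), p_success(strong, 3.0, 0.0))

# The high-discrimination item separates the two much more sharply.
spread_low = low_disc[1] - low_disc[0]
spread_high = high_disc[1] - high_disc[0]
```

When ability equals difficulty the probability is exactly 0.5, whatever the discrimination. That is why difficulty plays the role of a 50% point.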

Modeling difficulty

Ability and difficulty parameters in the 2PL are hard to interpret. The scale is arbitrary, and it’s not clear what, for instance, a 0.1 increase in ability actually means. Or whether it would be better to take a log-transform of the parameter, etc. The METR data is cool and famous because each task comes with a human time, which gives us a natural and interpretable scale for difficulty. So let’s connect human time to difficulty first.

Each task’s difficulty has a mean that depends on log human time, plus a random component to account for the fact that same-length tasks are not born equal:

$$b_j = \alpha + \kappa \log t_j + u_j, \qquad u_j \sim N(0, \sigma_b^2),$$

where $t_j$ is the human time for task $j$ and $\log$ is the natural logarithm. (METR treats all tasks of identical length as equally hard.)

Since difficulty increases with log human time at rate $\kappa$, we can convert any difficulty value back into a time, an equivalent difficulty time. If a task takes humans 10 minutes but is unusually hard for AI, its equivalent difficulty time might be 50 minutes. A task with human time $t_j$ and difficulty residual $u_j$ has equivalent difficulty time $t_j \exp(u_j / \kappa)$. ^[2]

I estimate $\sigma_b \approx 1.51$ (posterior median), which is quite large once we interpret it. One standard deviation of unexplained difficulty corresponds to a ~4.7x multiplier in equivalent difficulty time. ^[3] A task that’s $1\sigma_b$ harder than the average for its length is as hard as a task 4.7x longer. And a task that’s $2\sigma_b$ harder is as hard as a task roughly 23x longer. So tasks of identical human time can span a huge range in difficulty for the AI models.
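
The arithmetic behind these multipliers can be sketched as follows. The text quotes the residual SD (1.51) and the one-SD multiplier (4.7) but not the slope on log human time, so the slope below is back-solved from those two numbers. That back-solved value, and the names used, are assumptions.

```python
import math

SIGMA_B = 1.51            # residual SD of difficulty, quoted in the text
ONE_SD_MULTIPLIER = 4.7   # one-SD multiplier, quoted in the text

# Slope of difficulty on log human time implied by the two quoted numbers
# (back-solved here, not reported in the text).
kappa_implied = SIGMA_B / math.log(ONE_SD_MULTIPLIER)

def equivalent_difficulty_time(human_time, residual, kappa):
    """Human time of an average-difficulty task that is as hard as this one."""
    return human_time * math.exp(residual / kappa)

# A 10-minute task that is one or two SDs harder than average for its length.
one_sd = equivalent_difficulty_time(10.0, SIGMA_B, kappa_implied)
two_sd = equivalent_difficulty_time(10.0, 2 * SIGMA_B, kappa_implied)
```

With the rounded 4.7, the two-SD multiplier comes out as 4.7 squared, about 22x. The 23x in the text presumably comes from unrounded posterior values.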

Of course, this is a modeling choice that can be wrong. There’s no guarantee that difficulty is linear in log human time, so we need diagnostics to check. The plot below does double duty as model diagnostic and explanation of what the random effect means in practice.

A plotted dot at 5x means the task’s equivalent difficulty time is 5x its actual human time. Even within the band, tasks of identical human time can differ multiplicatively by a factor of 23x in equivalent difficulty time, so the practical spread is enormous.

There’s not too much curvature in the relationship between log human time and difficulty, so I think the log-linear form is decent, but it’s much more spread out than we’d like. There is a cluster of easy outliers on the far left, which I think can be explained by very short tasks containing virtually no information about difficulty. Overall this looks reasonable for modeling purposes.

Modeling ability over time

By directly modeling ability over time, we can try out shapes like exponential, subexponential, superexponential, saturating, and singularity. Forecasts depend a lot on which shape you pick, and the data doesn’t really tell you much, so it’s not easy to choose between them. Your priors rule here.

The abilities are modeled as

$$\theta_i = f(x_i) + \varepsilon_i, \qquad \varepsilon_i \sim N(0, \sigma_\theta^2),$$

where $x_i$ is the model release date in years, centered at the mean (September 2024). I’m still using a random effect for model ability here, since nobody seriously thinks every model released on the same date must be equally capable. I’m looking at four shapes for $f$: ^[4]

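Model	$f(x)$	Params	Intuition
Linear	$\gamma_0 + \gamma_1 x$	2	Linear = exponential horizon growth (constant doubling time)
Quadratic	$\gamma_0 + \gamma_1 x + \gamma_2 x^2$	3	Superexponential, accelerating growth
Power-law	$\gamma_0 + \gamma_1 z^{p}$	3	Flexible: sub- or super-exponential. $z$ is a shifted/scaled version of $x$.
Saturating	S-shaped curve in $x$ with an upper asymptote	4	S-curve ceiling on ability.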

If METR’s GitHub repo contained all the historical data, I would also have tried a piecewise linear with a breakpoint around the time of o1, which visually fits the original METR graphs better than a plain linear fit. But since the available data doesn’t go that far back, I don’t need to, and the value of including those early points in a forecasting exercise is questionable anyway. Getting hold of the latest data points is more important. (Added: I use all the data in the appendix below, but I do not attempt a piecewise linear since running the Stan programs takes a lot of time.)

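All models share the same 2PL likelihood and task parameters (discrimination $a_j$, difficulty $b_j$, and the difficulty regression’s intercept $\alpha$, slope $\kappa$, and residual SD $\sigma_b$). Only the model for the abilities $\theta_i$ changes.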
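
A sketch of what the four shapes could look like in code, and of how an ability value maps to a 50% horizon for a task of average difficulty. The exact functional forms (the logistic used for the saturating curve in particular), the fixed shift `x_start`, and all parameter names and values are assumptions for illustration. The fitted models may be parameterized differently.

```python
import math

def linear(x, g0, g1):
    return g0 + g1 * x

def quadratic(x, g0, g1, g2):
    return g0 + g1 * x + g2 * x * x

def power_law(x, g0, g1, p, x_start=-6.0):
    z = x - x_start  # shifted date, must stay positive
    return g0 + g1 * z ** p

def saturating(x, lo, hi, rate, mid):
    # Logistic S-curve between a floor `lo` and a ceiling `hi` (assumed form).
    return lo + (hi - lo) / (1.0 + math.exp(-rate * (x - mid)))

def typical_horizon(theta, alpha, kappa):
    """50% horizon for average-difficulty tasks: solve theta = alpha + kappa*log(t)."""
    return math.exp((theta - alpha) / kappa)

years = (0.0, 1.0, 2.0)
lin = [typical_horizon(linear(x, 0.0, 1.5), 0.0, 1.0) for x in years]
quad = [typical_horizon(quadratic(x, 0.0, 1.5, 0.3), 0.0, 1.0) for x in years]

# Year-over-year growth factors of the horizon.
lin_growth = (lin[1] / lin[0], lin[2] / lin[1])
quad_growth = (quad[1] / quad[0], quad[2] / quad[1])
```

Under the linear shape the yearly growth factor is constant (a fixed doubling time). Under the quadratic shape with a positive curvature term it increases from year to year. Under the saturating shape the horizon is capped at the value implied by the ceiling.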

Each model except the saturating model will cross any threshold given enough time. Here are posteriors for the 50% crossing across our models. The saturating model almost never crosses the 1-month and 125-year thresholds since it saturates too fast.

Trend	1mo Mean	1mo 95% CrI	125y Mean	125y 95% CrI
Linear	2028-09	2028-03 – 2029-05	2032-07	2031-06 – 2033-10
Quadratic	2027-11	2027-03 – 2028-10	2029-12	2028-07 – 2032-03
Power-law	2028-01	2027-05 – 2029-01	2030-06	2028-12 – 2033-01

Problems with 80% success

Everything above uses 50% success, but METR also cares about 80% success and fits a separate model for that. We don’t need to do that here since the model estimation doesn’t really depend on success rates at all. We’ll just calculate the 80%-success horizon using posterior draws instead.

But there are actually two reasonable ways to define “80% success,” and they give different answers.

Typical: Pick a task of average difficulty for its length. Can the model solve it 80% of the time? This is roughly what METR computes.
Marginal: Pick a random task of that length. What’s the expected success rate? Because some tasks are much harder than average, the hard ones drag down the average more than easy ones push it up.

At 50%, the two definitions agree exactly. But at 80%, the gap is roughly an order of magnitude!
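
The two definitions can be compared numerically with a simplified version of the model: a single fixed discrimination, a normal random effect on difficulty, and no posterior uncertainty. The function names, the integration grid, and the parameter values (including the discrimination of 1.0 and the slope 0.976) are assumptions. Here `eta` is the model's ability minus the mean difficulty of tasks of a given length.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def typical_success(eta, a):
    """Success probability on a task of exactly average difficulty."""
    return sigmoid(a * eta)

def marginal_success(eta, a, sigma, n=2001, width=8.0):
    """Success probability averaged over the difficulty residual u ~ N(0, sigma^2),
    computed on a symmetric grid with normal-density weights."""
    step = 2.0 * width * sigma / (n - 1)
    num = den = 0.0
    for k in range(n):
        u = -width * sigma + k * step
        w = math.exp(-0.5 * (u / sigma) ** 2)
        num += w * sigmoid(a * (eta - u))
        den += w
    return num / den

def solve_eta(success_fn, p, lo=-30.0, hi=30.0):
    """Bisection for the eta at which an increasing success_fn equals p."""
    for _ in range(60):
        mid = 0.5 * (lo + hi)
        if success_fn(mid) < p:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def typical_over_marginal(p, a, sigma, kappa):
    """Ratio of the typical p-horizon to the marginal p-horizon."""
    eta_typ = solve_eta(lambda e: typical_success(e, a), p)
    eta_mar = solve_eta(lambda e: marginal_success(e, a, sigma), p)
    return math.exp((eta_mar - eta_typ) / kappa)

ratio_50 = typical_over_marginal(0.5, a=1.0, sigma=1.51, kappa=0.976)
ratio_80 = typical_over_marginal(0.8, a=1.0, sigma=1.51, kappa=0.976)
```

The ratio is 1 at 50% by symmetry and above 1 at 80%. Because this sketch holds discrimination fixed and ignores posterior uncertainty, the size of the gap it produces should not be read as the estimate from the fitted model. It only illustrates the direction of the effect.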

So, on the one hand, it’s the variance ($\sigma_b^2$, the spread of task difficulty around its length-based mean) alone that causes these two plots to be necessary under our model. But on the other hand, the difference is not really a consequence of modeling. Some tasks of the same human time vary a lot in how hard they are for our models, and a phenomenon like this would happen for any model that’s actually honest about this.

The marginal horizon is the one that matters for practical purposes. “Typical” is optimistic since it only considers tasks of average difficulty for their length. The marginal accounts for the full spread of tasks, so it’s what you actually care about when predicting success on a random task of some length. That said, from the plot we see frontier performance of roughly 6 minutes, which does sound sort of short to me. I’m used to LLMs roughly one-shotting longer tasks than that, but it usually takes some iterations to get it just right. Getting the context and subtle intentions right on the first try is hard, so I’m willing to believe this estimate is reasonable.

Anyway, the predicted crossing dates at 80% success are below. First, the 1-month threshold (saturating model omitted since it almost never crosses):

Trend	Typical Mean	Typical 95% CrI	Marginal Mean	Marginal 95% CrI
Linear	2029-02	2028-07 – 2029-10	2030-10	2029-12 – 2031-10
Quadratic	2028-02	2027-05 – 2029-02	2029-01	2027-12 – 2030-08
Power-law	2028-04	2027-08 – 2029-06	2029-06	2028-05 – 2031-02

And the 125-year threshold:

Trend	Typical Mean	Typical 95% CrI	Marginal Mean	Marginal 95% CrI
Linear	2032-12	2031-10 – 2034-03	2034-08	2033-04 – 2036-02
Quadratic	2030-02	2028-08 – 2032-07	2030-12	2029-02 – 2033-12
Power-law	2030-09	2029-02 – 2033-06	2031-09	2029-09 – 2035-03

Make of this what you will, but let’s go through one scenario. Let’s say I’m a believer in superexponential models with no preference between quadratic and power-law, so I have 50-50 weighting on those. Suppose also I believe that 125 years is the magic number for the auto-coder of AI Futures, but I prefer to as the latter is too brittle. Then, using the arguably correct marginal formulation, my timeline has mean roughly April 2031, but the typical framework yields roughly June 2030 instead. And this isn’t too bad, just a difference of ~0.9 years! The linear model is similar, with timelines pushed out roughly 1.7 years. So, the wide marginal-typical gap doesn’t translate into that big of a timeline gap, as both trajectories have the same “slope”, just at a different level.
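
The scenario arithmetic can be checked directly from the posterior means in the 125-year table above. This sketch uses the fact that the mean of a 50-50 mixture is the average of the component means. Converting "YYYY-MM" to a decimal year at the start of the month is a simplifying assumption.

```python
def to_year(ym):
    """Convert 'YYYY-MM' to a decimal year (start of month)."""
    year, month = ym.split("-")
    return int(year) + (int(month) - 1) / 12.0

# Posterior mean crossing dates, 125-year threshold at 80% success.
TYPICAL = {"Linear": "2032-12", "Quadratic": "2030-02", "Power-law": "2030-09"}
MARGINAL = {"Linear": "2034-08", "Quadratic": "2030-12", "Power-law": "2031-09"}

def mixture_mean(table, weights):
    """Mean crossing date under a weighted mixture of trend models."""
    return sum(w * to_year(table[name]) for name, w in weights.items())

weights = {"Quadratic": 0.5, "Power-law": 0.5}
mix_marginal = mixture_mean(MARGINAL, weights)
mix_typical = mixture_mean(TYPICAL, weights)
gap_superexp = mix_marginal - mix_typical
gap_linear = to_year(MARGINAL["Linear"]) - to_year(TYPICAL["Linear"])
```

This reproduces the gaps of about 0.9 and 1.7 years. Since the table entries are rounded to the month, the mixture dates themselves can differ from the ones quoted in the text by a month or so.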

Let’s also have a look at METR’s actual numbers. They report an 80% horizon of around 15 minutes for Claude 3.7 Sonnet (in the original paper). Our typical 80% horizon for that model under the linear model is about 21.1 min, and the marginal is about 0.8 min, roughly 15x shorter than METR’s.

Modeling

The available METR data contains the geometric mean of (typically 2-3 for HCAST) successful human baselines per task, but not the individual times. Both METR’s analysis and mine treat this reported mean as a known quantity, discarding uncertainty. But we can model $t_j$, the true human time for task $j$, as a latent variable informed by the reported baselines. This is easy enough to do in Stan, and would give a more honest picture of what the data actually supports, as all credible intervals will be widened.

I’d expect smaller differences between the typical and marginal plots at the 80% horizon if the $t_j$ values were modeled properly, as more of the variance in the difficulty random effect would be absorbed by the uncertainty in $t_j$. I’m not sure how big the effect would be, but getting hold of the data or doing a short simulation would help.
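
A back-of-the-envelope version of this argument, under strong assumptions: difficulty is linear in true log human time, the reported log time equals the true log time plus independent noise with variance `log_time_sd**2 / n_baselines`, and the slope is unaffected by that noise (in reality, measurement error would also attenuate the slope estimate). The apparent residual variance is then the true residual variance plus the slope squared times the noise variance. The per-baseline SD of 0.7 is a made-up value, and 0.976 is a slope back-solved from the 1.51 and 4.7x figures quoted earlier.

```python
import math

def implied_true_sd(sigma_apparent, kappa, log_time_sd, n_baselines):
    """Residual difficulty SD left after removing baseline-time noise.

    Apparent variance = true variance + kappa^2 * log_time_sd^2 / n_baselines.
    """
    noise_var = kappa ** 2 * log_time_sd ** 2 / n_baselines
    true_var = sigma_apparent ** 2 - noise_var
    if true_var < 0:
        raise ValueError("assumed noise exceeds the apparent residual variance")
    return math.sqrt(true_var)

# Hypothetical inputs: per-baseline SD of log time 0.7, two baselines per task.
sd_with_noise_removed = implied_true_sd(1.51, 0.976, 0.7, 2)
sd_without_noise = implied_true_sd(1.51, 0.976, 0.0, 2)
```

The noisier the baselines and the fewer of them per task, the more of the apparent spread is measurement error rather than genuine difficulty variation.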

A technical point: When modeling the human times $t_j$, I would also try a Weibull distribution instead of log-normal, since the log-normal is typically heavier-tailed and the Weibull is easier to justify on theoretical grounds using its failure-rate interpretation.

Notes and remarks

I also tried a finite-time singularity model, in which ability diverges at a finite singularity date. The posterior on the singularity date didn’t really move from the prior at all. This is no surprise. It just means the data is uninformative about the singularity date.
There are loads of other knobs you could turn. Perhaps you could introduce a discrimination parameter that varies by model and task, together with a hierarchical prior. Perhaps you could make discrimination a function of time, etc. I doubt any of these would change the picture much, if at all. The model fit is good enough as it is, even if the uncertainty is likely too small. That said, I don’t want to dissuade anyone from trying!
The power-law model does in principle support both sub- and superexponential trajectories (exponent below 1 and above 1, respectively, where an exponent of exactly 1 is the linear model). The posterior favors an exponent above 1, so the data does not support subexponential growth. At least when using this model.
There’s plenty of best-practice stuff I haven’t done, such as prior sensitivity analysis. (But we have a lot of data, and I wouldn’t expect it to matter too much.)
The doubling time posterior median is 4.3 months (95% credible interval: 3.7–5.1), which is close to METR’s v1.1 estimate. Of course, doubling time only makes sense for the linear model above, as the doubling time of the other models varies with time.
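
For intuition about what these doubling times mean, here is a small conversion sketch. The function names are illustrative. The inputs are the posterior median and interval endpoints quoted above.

```python
import math

def annual_growth_factor(doubling_months):
    """Factor by which the horizon grows in one year at a constant doubling time."""
    return 2.0 ** (12.0 / doubling_months)

def months_to_grow(factor, doubling_months):
    """Months needed for the horizon to grow by `factor` at a constant doubling time."""
    return doubling_months * math.log2(factor)

median_factor = annual_growth_factor(4.3)
interval_factors = (annual_growth_factor(5.1), annual_growth_factor(3.7))
months_for_8x = months_to_grow(8.0, 4.3)
```

A 4.3-month doubling time corresponds to about 6.9x growth per year. As noted above, this conversion only makes sense for the linear model, where the doubling time is constant.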

Appendix: Results for all models

Recall that the main text uses only METR v1.1 data. In this appendix I use all available data (v1.0 + v1.1 merged). The overall story is similar, but the pre-Sonnet-3.5 models introduce a visible kink in the trajectory that a single smooth trend struggles with. (This is well-known.)

The ELPD-LOO scores differ by at most ~3 points, with Brier 0.064. ^[5]

Appendix: Marginal vs typical success curves

These are fitted on v1.1 data only.

The ELPD-LOO estimates are: power-law (SE ), saturating (SE ), quadratic (SE ), linear (SE ). ↩︎
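Define $t_j^*$ as the human time whose mean difficulty equals $b_j$, where $b_j = \alpha + \kappa \log t_j + u_j$. Then $\alpha + \kappa \log t_j^* = \alpha + \kappa \log t_j + u_j$, so $\log t_j^* = \log t_j + u_j/\kappa$ and $t_j^* = t_j \exp(u_j/\kappa)$. ↩︎
The multiplier is $\exp(\sigma_b/\kappa)$, with the residual SD $\sigma_b$ and the slope $\kappa$ on log human time evaluated at their posterior medians. ↩︎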
Quadratic is the simplest choice of superexponential function. You could spin a story in its favor, but using it is somewhat arbitrary. The power-law is the simplest function that can be both super- and subexponential (in practice turns out to be superexponential here though), and I included the saturating model because, well, why not? ↩︎
The ELPD-LOO estimates are: power-law (SE ), quadratic (SE ), saturating (SE ), linear (SE ). ↩︎

Thanks for making this, it's good to see high-effort critiques.

Let’s start with a plot that shouldn’t be too surprising. Four reasonable models fit the METR data equally well. They agree about the past but disagree strongly about the future.

Note that we have data going back to 2019 (GPT-2), and it looks like the other models wouldn't fit the earlier trend well. (The raw data should be in the time horizons 1.0 folder). I'd guess the best fitting models over the whole range would be linear and a slightly superlinear power law.

METR also cares about 80% success and fits a separate model for that.

Actually it's worse than that; we just get the 80% point of the logistic. Initially 80% success was an afterthought, but since it's clear people care about it, Alex Barry has been doing Bayesian analysis that supports it.

But there are actually two reasonable ways to define “80% success,” and they give different answers.
Typical: Pick a task of average difficulty for its length. Can the model solve it 80% of the time? This is roughly what METR computes.
Marginal: Pick a random task of that length. What’s the expected success rate? Because some tasks are much harder than average, the hard ones drag down the average more than easy ones push it up.

Our method is actually closer to computing "Marginal", because when fitting the logistic curve, the points near x = 1 hour include all trials for all tasks estimated at that length. We didn't actually have the means to get the "Typical" 80% point since we don't compute "average difficulty for its length" anywhere.

In your analysis, the "Typical" 80% time horizon of the last point (I think Claude 4.5 Opus) is around 110 minutes, whereas ours is 42 minutes. I'm not sure what makes the marginal so low in your analysis (5 minutes) but Alex suggests it might be that effect you mentioned. Ideally, I think the curves would look something like this:

Basically there are two factors that change the original analysis vs "typical"

We have limited baseline data (often 0-2 baselines per task), so our time estimates will be noisy estimates of the average human time.
Tasks that take humans the same average time will still differ in difficulty (this is "typical" vs "marginal").

If we were able to account for each of the factors or estimate them using stats, our data model would get more predictive, the slopes increase, and the 80% time horizon increase from our results -> "marginal" -> "typical". But like you say we actually want to stop at "marginal".

I chatted with Thomas a bit about this, and I also agree that the default METR model should also output things that are close to the 'marginal' definition of time horizon (or at least as well as it can be approximated with the inverse logit sigmoid).

I think the important thing to realise is that while one needs to take additional steps for the 'marginal' approach when fitting a model that explicitly accounts for the deviation in task-length-for-humans vs task-difficulty-for-llms, models that don't explicitly account for this (such as the original METR model) should have it naturally learned into the shape of their logistic curve.

(A similar thing is also true for having the discrimination parameter vary by task instead of by model - if it varies by task this uncertainty needs to be accounted for in the time horizon calculations, but since this is not the case in the original METR model it does not).

I think the important thing to realise is that while one needs to take additional steps for the 'marginal' approach when fitting a model that explicitly accounts for the deviation in task-length-for-humans vs task-difficulty-for-llms, models that don't explicitly account for this (such as the original METR model) should have it naturally learned into the shape of their logistic curve.

I don't immediately see this. The marginal idea is roughly about integrating over random effects, and that's hard to capture without actually doing it. My statement that METR's original approach is about the typical effect is wrong though.

I think we agree and I just stated this badly - I was just meaning to say that METR's original approach is closer to marginal despite them not explicitly doing the integrating over the random effects (although I agree you need to integrate over the random effects in models that include them to get the marginal time horizon).

Note that we have data going back to 2019 (GPT-2), and it looks like the other models wouldn't fit the earlier trend well. (The raw data should be in the time horizons 1.0 folder). I'd guess the best fitting models over the whole range would be linear and a slightly superlinear power law.

Thanks, I didn't think of checking this! I refitted the models to all the data in the updated post. The story doesn't really change much when it comes to fit. I also think it's ok, probably preferable, to use only v1.1. There's not much reason to go back this far, especially due to the well-known kink which happens roughly at the start of v1.1 anyway.

Actually it's worse than that; we just get the 80% point of the logistic.

Hmmm... Yeah, I gotta admit I didn't really check how you did this. I'm not sure what this means in practice and would be careful interpreting it. It doesn't clearly map to either typical or marginal. Marginal seems intuitively out of the question because it requires integration over random effects though. It's just something else entirely, no?

Ideally, I think the curves would look something like this: [...]

I had Claude make some plots in the updated post that show the corresponding curves in my model. I don't have a strong opinion on what the plots should ideally look like, as it's kinda open how even perfectly estimated human times ($t_j$) map to equivalent LLM times. (Which I take it you agree with, looking at your linked post on "Reasons time horizon is overrated and misinterpreted"?)

Anyway, the difference between marginal and typical in my plots is pretty well explained by the residuals plot, as it has to take into account the scattering of the residuals (and we have a sizable proportion of tasks with equivalent time 22x as large!). Is this close to the "true marginal"? I mean, yes, it is, IF we take human completion time in the data set as being true. If we don't, then we have to model it.

To estimate the marginal stuff properly we need to model the baseline times. I didn't really try this, and I do think a good implementation would use all the available data, not only the columns on the GitHub. It is possible to slap a lognormal on the public data too, but one data point per task is not enough to gain a lot of information, and our results will be very prior-driven. And honestly I think that's where we are wrt 80% horizons anyway.

Thanks for the great writeup!

I'm a statistician who does some work with METR, and I recently worked on a very similar project to create a Bayesian version of the Time Horizon model. Mine ended up being somewhat different to yours (mine deviates a bit more from the current structure of the METR model), but it's great to see other people stress testing modelling.

On the 80% Time Horizon results I agree that your 'marginal' approach is correct, and it is the one I also took in my model. However my 80% results ended up being a factor of 2 higher (edited) than the results of METR's current model for recent SOTA LLMs. Here is a quick plot I made just after the Opus 4.5 results came out, using the TH1.0 data:

I think there is some natural increase due to how my model's data is selected however, as my 50% time horizons are also often somewhat higher, and they are mostly within uncertainty bounds anyway:

I've taken a very quick look through your code to try and think about the difference, and my guess would be that you find that LLM-difficulty diverges more from log(baseliner_time) than I do, because you include the tasks with estimated baseliner times when calculating the amount of noise here, whereas I handled tasks with/without baseliner times separately, and only used the former when doing the time horizon calculations.

My definition was: "For a LLM m, I define its ‘p time horizon’ as the delta such that LLM m has (expected) probability p of success on a single attempt at a task with baseliner time delta." Where we might expect different results for tasks with estimated instead of baselined task lengths because there is effectively another layer of noise added by the estimates.

(I'll note that out of all the critiques of the Time Horizon work I'm surprised I don't see more discussion of the tasks which only have estimates, as this seems like one of the most straightforward limitations, and which will only get more relevant as tasks get longer and harder to baseline. Something like only 5/30 of the longest tasks currently have baseliner times!)

I'd love to chat more about Bayesian modelling and general thinking about these kinds of models sometime, and thanks again for the interesting analysis.

However my 80% results ended up only being a factor of 2 lower than the results of METR's current model.

Aren't these a factor of 2 higher than the original METR model?

Good catch! Edited my comment. It had been a while since I had looked at the results and I must have also lost the ability to read in the meantime.

Very nice! I'm not able to comment very much since I don't know the specifics of your model, but can you clarify what you mean by

because you include the tasks with estimated baseliner times when calculating the amount of noise here, whereas I handled tasks with/without baseliner times separately, and only used the former when doing the time horizon calculations.

I have to admit I have worked with the METR data mostly as-is, and not gone into detail about how the times have been estimated. I suppose the problem is that only a subset of the tasks have grounded estimates of human times (as I interpreted HCAST?) and the rest are inferred in a more or less ad-hoc way? If so, then that would explain 80% marginal times being shorter because the residuals would (plausibly) be smaller.

Yes, sorry for just dropping in with "I have a model that gives different results" without actually giving the details. I'm trying to get a minimal version of it written up (I had designed it to integrate into METR's codebase so need to extract it as something that can exist standalone).

Within the runs.json there is a (not especially clearly named) 'human_source' field for each row. If this is set to "baseline" then the task length is based on (one or more) human baseliners, if it is "estimate" then it was just estimated without any human actually finishing the task. These estimates are generally quite noisy - I believe somebody told me something like that for some of the tasks where they had both the estimates and the baseliner times, only 60% of the estimates were within a factor of 3 of the (average) baseliner times.

Because you have a unified sigma parameter for how difficulty-for-LLM differs from log(task_length) this ends up incorporating the estimate noise as an additional source of uncertainty. But if you define the p-time-horizon as I did in my first comment as being defined on baselined tasks only then these lead to different results for the 80% time horizons.
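
A sketch of the split described above, assuming the rows have already been loaded into a list of dicts. The 'human_source' field and its "baseline" / "estimate" values are as described in the comment. The other field names, the sample rows, and the in-memory layout are hypothetical and may not match the actual structure of runs.json.

```python
def split_by_human_source(rows):
    """Separate rows whose task length comes from human baseliners from rows
    whose task length was only estimated."""
    baselined, estimated = [], []
    for row in rows:
        source = row.get("human_source")
        if source == "baseline":
            baselined.append(row)
        elif source == "estimate":
            estimated.append(row)
    return baselined, estimated

# Hypothetical rows for illustration only.
rows = [
    {"task_id": "task_a", "human_minutes": 12.0, "human_source": "baseline"},
    {"task_id": "task_b", "human_minutes": 480.0, "human_source": "estimate"},
    {"task_id": "task_c", "human_minutes": 45.0, "human_source": "baseline"},
]
baselined, estimated = split_by_human_source(rows)
```

Fitting the difficulty spread on the baselined rows only, or giving the two groups separate spread parameters, would keep the estimate noise out of the time horizon calculation.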

