Thanks to Megan Kinniment for helpful comments and discussion, and to Jean-Stanislas Denain for helpful comments and pointers to past work.
TL;DR: We claim that useful task attributes for forecasting AI capabilities should be measurable, interpretable, stable in their trend over time, and sufficient to explain task difficulty. task.human_completion_time (human expert completion time, used in time horizons) probably isn't sufficient, and is also getting hard to measure as tasks lengthen. Towards sufficiency, we argue for tracking a portfolio of task attributes, and highlight one in particular (the combined human + AI cost of completing the task) that seems more measurable.
This is part 2 of a two-part series on capability evaluation. Part 1 is about acquiring fuzzy tasks, and part 2 is about analyzing them.
Forecasting with time horizons, restated
Let’s restate the basic time horizons forecasting move. There's some attribute you can imagine on every task — call it task.human_completion_time, for human completion time. If you can sample tasks nicely and plot how the probability of completing tasks with a particular task.human_completion_time moves over time, you can extrapolate when models will be able to complete all tasks relevant for AI R&D.
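The extrapolation step can be sketched in a few lines. This is a minimal illustration, not the actual time horizons methodology: it assumes each model's horizon (the task duration at which it succeeds some fixed fraction of the time) has already been estimated, and the release dates, horizons, and the 160-hour target below are made-up numbers chosen only to make the arithmetic easy to follow.

```python
import math

def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def forecast_year(release_years, horizons_hours, target_hours):
    """Year at which the fitted log2(horizon) trend reaches target_hours."""
    log_horizons = [math.log2(h) for h in horizons_hours]
    slope, intercept = fit_line(release_years, log_horizons)
    return (math.log2(target_hours) - intercept) / slope

# Hypothetical data: one horizon estimate per model release (not real measurements).
RELEASE_YEARS = [2023.0, 2023.5, 2024.0, 2024.5]
HORIZONS_HOURS = [0.25, 0.5, 1.0, 2.0]

# Slope of log2(horizon) against release year, i.e. doublings per year.
doublings_per_year, _ = fit_line(
    RELEASE_YEARS, [math.log2(h) for h in HORIZONS_HOURS]
)
# Year at which the trend would reach a hypothetical 160-hour task.
year_for_160_hours = forecast_year(RELEASE_YEARS, HORIZONS_HOURS, target_hours=160.0)
```

The forecast is only as good as the assumption that the log-linear trend continues and that the sampled tasks resemble the tasks you care about, which is what the rest of this post questions.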
Human completion time has several nice properties that make forecasting possible:
Measurable: you can measure or estimate the time of a task before AIs can do it.
Interpretable: you can roughly eyeball when models will be able to do tasks you don't yet have detailed numbers on, since "how long would a human take to do it" is somewhat understandable.
Stable in its trend over time: trends in time horizons against calendar time have been reasonably stable across orders of magnitude.
At the risk of stating the obvious, suppose we could ask for Elo-style scores that correspond to task difficulty for problems that haven't been solved yet. Call these scores task.true_difficulty. We could then observe the trend in model Elo growth and easily forecast when models would be able to do all tasks. We can't do this, of course, since such scores can't be obtained ahead of time.
Our goal in forecasting, then, is finding measurable, interpretable, stable attributes that explain the variance in true difficulty. Thus, we additionally need:
Sufficiency (or near-sufficiency): it explains most of what makes a task hard. Slightly more formally, task.true_difficulty = f(task.human_completion_time) + ε, where ε is small and not predictable from other task attributes (this is loosely analogous to statistical sufficiency, not its formal definition).
For an example of a reasonable attribute that was not sufficient, consider OpenAI’s SWE-Lancer benchmark. SWE-Lancer is a benchmark of real freelance software engineering tasks scraped from Upwork, each tagged with the dollar amount that was actually paid out for the work.
Task value is somewhat interpretable and measurable (since it is possible to make progress on understanding what one would pay for some outcome), but surprisingly, it was not very correlated with task difficulty. So the fourth property, sufficiency, breaks down, and it is hard to use SWE-Lancer to forecast.
task.human_completion_time might not be sufficient
We can now recast part 1's fuzziness critique as a sufficiency critique in disguise. The time horizons methodology samples tasks along the human_completion_time axis: you pick a range of durations you care about, collect tasks spanning that range, and measure model success as a function of duration. If human_completion_time were sufficient, the specific tasks you pick within a duration bucket wouldn't matter: a clean two-hour problem and a fuzzy two-hour problem would have the same model success rate, and convenience sampling within a bucket would be fine. The reason undersampling fuzzy tasks matters is that fuzziness plausibly does independent explanatory work human_completion_time doesn't capture. So a benchmark that samples on duration alone, and ends up clean-heavy within each bucket for the usual practical reasons, systematically overstates model capability on the real task distribution — and the time horizon it produces is optimistically biased.
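To make the bias concrete, here is a small sketch of the within-bucket comparison. The record format, the "2h" bucket label, the clean/fuzzy flag, and all of the numbers are hypothetical; the point is only to show how a clean-heavy sample inflates the measured success rate when clean and fuzzy tasks of the same duration differ.

```python
from collections import defaultdict

def rates_by_bucket(records):
    """records: iterable of (bucket, is_fuzzy, succeeded).

    Returns {bucket: (clean_success_rate, fuzzy_success_rate)} for buckets
    that contain at least one clean and one fuzzy task.
    """
    # counts[bucket][is_fuzzy] = [successes, attempts]
    counts = defaultdict(lambda: {False: [0, 0], True: [0, 0]})
    for bucket, is_fuzzy, succeeded in records:
        cell = counts[bucket][bool(is_fuzzy)]
        cell[0] += int(succeeded)
        cell[1] += 1
    rates = {}
    for bucket, cells in counts.items():
        (clean_ok, clean_n), (fuzzy_ok, fuzzy_n) = cells[False], cells[True]
        if clean_n and fuzzy_n:
            rates[bucket] = (clean_ok / clean_n, fuzzy_ok / fuzzy_n)
    return rates

def mixed_rate(clean_rate, fuzzy_rate, fuzzy_share):
    """Success rate of a bucket whose tasks are fuzzy with probability fuzzy_share."""
    return clean_rate * (1.0 - fuzzy_share) + fuzzy_rate * fuzzy_share

# Hypothetical results for one duration bucket: (bucket, is_fuzzy, succeeded).
RECORDS = [
    ("2h", False, True), ("2h", False, True), ("2h", False, True), ("2h", False, False),
    ("2h", True, True), ("2h", True, False), ("2h", True, False), ("2h", True, False),
]

clean_rate, fuzzy_rate = rates_by_bucket(RECORDS)["2h"]
# Assumed mixes: the benchmark is 20% fuzzy, the real distribution 60% fuzzy.
benchmark_estimate = mixed_rate(clean_rate, fuzzy_rate, fuzzy_share=0.2)
real_world_estimate = mixed_rate(clean_rate, fuzzy_rate, fuzzy_share=0.6)
```

If duration were sufficient, the clean and fuzzy rates would match and the two estimates would coincide regardless of the mix. In this made-up example they differ by 20 percentage points.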
How can we tell whether a metric is sufficient? One approach is to check whether, after accounting for the metric, observed difficulty still correlates with any other measure you have. As Ben Snodin explored, you could check, for instance, whether it correlates with LLM-judged features of the task. We (Fulcrum) explored a similar set of approaches in this paper. If it does, the metric is not sufficient on its own, and we should track how well we're sampling with respect to that variable.
Importantly, we care about sufficiency on the true task distribution here. This makes it harder to use the approach of checking if other difficulty metrics correlate with true difficulty, since part of the problem is sampling a wide enough range of tasks.
Portfolios of measurements
The natural adjustment is to track a portfolio of attributes rather than a single measure. Collect a vector of properties for each task — some at creation time, some post-hoc. A starter set:
Labor cost of implementing the solution in the real world.
Wall-clock time for a similar task in the real world.
Task value — what someone would pay for a solution.
Human bits provided to the AI when a human and AI did the task together.[1]
We can then study which attributes individually predict outcomes (for instance, by regressing post-hoc difficulty on each). We can also attempt to determine, from these axes, the latent factors explaining difficulty. Finally, we can do better model error analysis, by looking for where the residuals concentrate: those are the kinds of tasks the current attribute set fails to explain, and the place to look for the next attribute worth adding.
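A rough sketch of the first and last of these analyses, under stated assumptions: the attribute names, the tiny task table, and the "difficulty" column (which could be something like one minus the observed model success rate) are all hypothetical, and one-variable least squares stands in for whatever model one would actually fit.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def r_squared(xs, ys):
    """Share of variance in ys explained by a one-variable linear fit on xs."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    if sxx == 0 or syy == 0:
        return 0.0  # a constant column explains nothing
    return (sxy * sxy) / (sxx * syy)

def residuals(xs, ys):
    """Observed minus fitted values for the one-variable linear fit."""
    slope, intercept = fit_line(xs, ys)
    return [y - (intercept + slope * x) for x, y in zip(xs, ys)]

# Hypothetical task table; every value here is invented for illustration.
ATTRIBUTES = ["labor_cost", "wall_clock_h", "task_value", "human_bits"]
TASKS = [
    {"name": "t1", "labor_cost": 100, "wall_clock_h": 2, "task_value": 500, "human_bits": 10, "difficulty": 0.1},
    {"name": "t2", "labor_cost": 400, "wall_clock_h": 3, "task_value": 100, "human_bits": 20, "difficulty": 0.2},
    {"name": "t3", "labor_cost": 200, "wall_clock_h": 9, "task_value": 400, "human_bits": 30, "difficulty": 0.3},
    {"name": "t4", "labor_cost": 800, "wall_clock_h": 4, "task_value": 200, "human_bits": 40, "difficulty": 0.4},
    {"name": "t5", "labor_cost": 300, "wall_clock_h": 8, "task_value": 300, "human_bits": 50, "difficulty": 0.5},
]

difficulty = [t["difficulty"] for t in TASKS]
# How well does each attribute, taken alone, predict post-hoc difficulty?
scores = {a: r_squared([t[a] for t in TASKS], difficulty) for a in ATTRIBUTES}
best_attribute = max(scores, key=scores.get)

# Where does a given attribute fail? Rank tasks by absolute residual.
cost_residuals = residuals([t["labor_cost"] for t in TASKS], difficulty)
worst_explained = sorted(
    zip((t["name"] for t in TASKS), cost_residuals),
    key=lambda pair: abs(pair[1]),
    reverse=True,
)
```

In practice the interesting output is the list of worst-explained tasks: if they share a property that none of the current attributes encodes, that property is a candidate for the next attribute.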
Of course, one problem with this is that a bundle is less interpretable. You can still fit a forecast over the vector, but you lose the sampling check: we have a rough sense of how long real-world tasks take, and no comparable intuition for an eight-dimensional task descriptor. Fitting is also likely to make the forecast considerably less robust, in a way that seems hard to adjust for. Still, we might be able to find relatively simple weightings of attributes that allow us to aggregate them.
Cost is a particularly nice attribute
Of the attributes in the portfolio, I think combined human + AI cost is the one most worth tracking as a headline replacement for time horizons. For each task, compute its current cost in human labor (hourly rate times time) and in AI labor (inference-time spend), and sum them.
A few reasons cost is worth singling out:
It lets baseliners use agents. Pure human-time baselines are getting hard to collect: for Mirror Code, it was not possible to get baselines, and in general it is hard to convince people to do long software tasks without AI assistance at all. If you ask for a time baseline, you are asking the baseliner to work in a way they otherwise wouldn't. Cost doesn't have this problem: the baseliner uses agents however they normally would, and the agent spend gets priced into the AI side of the sum.
It admits other forms of coordination. Long tasks in practice involve multiple people, handoffs, and review. Cost gives you a single number for this without having to define whose wall-clock counts.
It accounts for skill. An expert hour and a junior hour are priced differently, and an expert hour usually produces more.
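The combined cost described above can be sketched as a small data structure. The class names, fields, rates, and dollar amounts are hypothetical, chosen to show how multiple contributors at different rates and agent spend collapse into one number.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class HumanContribution:
    """One person's work on a task (hypothetical schema)."""
    hourly_rate_usd: float
    hours: float

    def cost(self) -> float:
        return self.hourly_rate_usd * self.hours

@dataclass
class TaskCost:
    """Combined human + AI cost of completing one task."""
    humans: List[HumanContribution] = field(default_factory=list)
    ai_spend_usd: float = 0.0  # inference spend by any agents the baseliners used

    def human_cost(self) -> float:
        return sum(h.cost() for h in self.humans)

    def total(self) -> float:
        return self.human_cost() + self.ai_spend_usd

# Made-up example: an expert and a junior share a task and use agents freely.
example = TaskCost(
    humans=[
        HumanContribution(hourly_rate_usd=150.0, hours=2.0),  # expert
        HumanContribution(hourly_rate_usd=50.0, hours=6.0),   # junior
    ],
    ai_spend_usd=40.0,
)
combined_cost_usd = example.total()  # 300 + 300 + 40
```

Note that nothing here requires deciding whose wall-clock time counts: handoffs and review simply add more contributions to the list.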
There are of course some issues with costs as a measure: most notably, the cost of AI labor of a given quality goes down over time.
One thing that we might hope to do is normalize by performance-adjusted inference cost. But this is difficult because the strategy by which a human accomplishes a task with AI models in the future might be different than the strategy that they use to accomplish tasks now. For instance, we might use only a little bit of AI labor to do some tasks now, but in the future use a lot of it (since models get better over time).
One way to handle this is to maintain several dated baskets of tasks, each costed using the prices and workflows of its own period. For example, you can have both a 2026 Q2 test suite and a 2027 Q1 test suite, stored separately.
[1] Cf. past work on expert help (1, 2).