Long tasks require being able to decompose your problem into subgoals and meet them, rather than just copying something from the training set.
This may or may not be covered by the examples you already gave, depending on how broadly you interpret them.
My view is that LLMs can't do the kind of context management that long tasks require. Training approaches rely on the model doing the whole task within its context window. Long-context tasks like reading large documents to answer questions are not the same as coding an entire software library from scratch.
RL training, as far as I know, still involves trajectories that fit within a context window. There is no optimization pressure for models to figure out how to do context management, unlike humans, who develop strategies like note-taking.
Written extremely quickly for the InkHaven Residency.
Like humans, AI models do worse on tasks that take longer to do. But their performance seems to degrade more on longer tasks than human performance does.
This is a big part of why the METR time horizon results make sense: because longer tasks are also “harder” for models, and more capable models can do longer tasks, we can use the length of tasks that the models can perform as a metric of model capability.
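To make the idea of "task length as a capability metric" concrete, here is a minimal sketch that assumes success probability follows a logistic curve in the log of human task length, and reads off the length at which predicted success is 50%. The functional form, the function names, and the parameter values are all illustrative assumptions on my part, not a description of METR's exact procedure.

```python
import math


def success_probability(task_minutes: float, intercept: float, slope: float) -> float:
    """Assumed logistic success curve in log2(human task length in minutes)."""
    z = intercept + slope * math.log2(task_minutes)
    return 1.0 / (1.0 + math.exp(-z))


def time_horizon(intercept: float, slope: float, target: float = 0.5) -> float:
    """Task length (minutes) at which the assumed curve predicts `target` success."""
    logit = math.log(target / (1.0 - target))
    return 2.0 ** ((logit - intercept) / slope)


# Hypothetical fitted parameters (made up for illustration).
# A negative slope means longer tasks are less likely to succeed.
intercept, slope = 3.0, -0.5
horizon_minutes = time_horizon(intercept, slope)
print(f"50% time horizon: {horizon_minutes:.1f} minutes")
```

Under this sketch, a more capable model corresponds to a curve shifted to the right, and so to a longer horizon.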
There’s a clear etiological or causal-historical explanation of why models do worse at long tasks: they’re probably trained on more short tasks and fewer long tasks. This is both because it’s easier to make shorter tasks, and because you can train models on more short tasks than longer tasks with a fixed compute budget.
But from the perspective of AI evaluations, it’s also worth considering mechanistic explanations that make reference only to how properties of long tasks interact with the AI system in deployment. Whatever the training story may be, the AI models as they currently exist have some property that makes long tasks genuinely harder for them in a way that tracks capability. Understanding what this property is could matter a lot for interpreting the METR time horizon and even for forecasting AI capabilities over time.
So here are five such possible hypotheses that explain why longer tasks seem consistently harder for current models, based in large part on my experience at METR.
Long tasks are less well defined, and require judgment or taste (which models are bad at). For a software engineer, a 1-minute coding task might involve composing a single 10-line function or running a relatively simple SQL query. By their very nature, these tasks tend to be easy to define and easy to score, with relatively objective success criteria and little human judgment involved. A 15-minute task may be implementing a relatively simple data processing script or fixing a simple bug: more complicated, but still relatively easy to score. In contrast, an 8-hour task likely involves substantial amounts of design taste (in ways that are harder to score), and month-long tasks likely involve communicating with a stakeholder or building code with properties that are hard to algorithmically verify (e.g. maintainability). (This is also related to why algorithmically scorable longer tasks are harder to make.)
While the longer METR tasks are still algorithmically scored, they tend to require models to build sophisticated software artifacts or iteratively improve on experiment design, where taste plays a larger role in success. Since models seem to lack ‘taste’ of some sort, relative to humans of comparable execution ability (hence the complaints about AI Slop), this could explain why they do worse on longer tasks.
Long tasks require more narrow expertise (which models may not have). An important property of the METR task suite is that longer tasks should not be trivially decomposable into shorter tasks. That is, a 10-hour task should not be trivially decomposable into ten 1-hour tasks, and 10 short math problems do not become a single longer math problem. Perhaps as an artifact of this property, many of METR’s longer tasks (and perhaps longer tasks in people’s day-to-day work in general) rely on more specialized procedural knowledge that is hard to acquire via Google. For example, many of METR’s long tasks are cryptographic or machine learning challenges that require some amount of procedural knowledge in the relevant fields to approach. Insofar as long tasks are more likely to require procedural knowledge outside the AI models’ area of expertise, the models may struggle with them.
Personally, I find this relatively unlikely as an explanation for the METR time horizon tasks (since AI models seem to have a lot of expertise in the relevant areas), but it might be a large part of the explanation for the inability of AIs to autonomously complete large tasks in general.
Long tasks take models longer, leading to more stochastic failures (which models exhibit). A popular explanation is that tasks that take humans longer also take AI agents more steps to complete, and AI agents are not fully reliable: they fail with some small probability on each step. For example, Toby Ord raises this as a hypothesis in a response to our Time Horizon paper.
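As a rough illustration of this hypothesis, here is a minimal sketch of a constant hazard model: if each step fails independently with the same small probability, the chance of finishing decays exponentially in the number of steps. The function names and the 1% per-step failure rate are hypothetical, chosen only for illustration.

```python
import math


def success_probability(n_steps: float, per_step_failure: float) -> float:
    """P(no failure across n_steps), assuming independent, identical steps."""
    return (1.0 - per_step_failure) ** n_steps


def fifty_percent_horizon(per_step_failure: float) -> float:
    """Number of steps at which success probability drops to 50%."""
    return math.log(0.5) / math.log(1.0 - per_step_failure)


# Hypothetical per-step failure rate, not a measured value.
p_fail = 0.01
for steps in (10, 100, 1000):
    print(steps, round(success_probability(steps, p_fail), 4))
print("steps at 50% success:", round(fifty_percent_horizon(p_fail), 1))
```

In this toy model, halving the per-step failure rate roughly doubles the number of steps the agent can complete at a given success rate.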
I think this is definitely part of the explanation (and of why longer tasks are harder for humans as well), with some caveats. First, I caution against naively interpreting human time as proportional to AI steps and applying a constant hazard model. For example, it turns out that if you fit a failure rate model for AI agents over time, the failure rate goes down as the task goes on! Second, AI models seem to have different time horizons across different domains, and simple versions of this hypothesis cannot explain that phenomenon.
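One way to picture a failure rate that falls over the course of a task is a Weibull hazard with shape parameter below 1. This is my own illustrative choice of functional form, not necessarily the model that was actually fit; the function names and parameter values are hypothetical.

```python
import math


def weibull_hazard(t: float, scale: float, shape: float) -> float:
    """Instantaneous failure rate at time t > 0 under a Weibull model."""
    return (shape / scale) * (t / scale) ** (shape - 1.0)


def weibull_survival(t: float, scale: float, shape: float) -> float:
    """Probability of having no failure up to time t."""
    return math.exp(-((t / scale) ** shape))


# shape == 1 recovers the constant hazard model; shape < 1 gives a hazard
# that decreases as the task goes on. All values below are made up.
scale = 10.0
for shape in (1.0, 0.5):
    early = weibull_hazard(1.0, scale, shape)
    late = weibull_hazard(5.0, scale, shape)
    print(f"shape={shape}: hazard at t=1 is {early:.3f}, at t=5 is {late:.3f}")
```

With shape below 1, most of the failure risk is concentrated early in the task, which is the qualitative pattern a constant hazard model cannot produce.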
Long tasks take models longer, causing failures due to distribution shift or self-conditioning (which models may suffer from). A related explanation is that longer tasks take models more off-distribution: base models (at least earlier on) were not trained to predict long sequences of model-generated outputs, and even RLVR’ed models were probably trained with short tasks, far shorter than the 16-hour, tens-of-millions-of-tokens tasks that we might ask them to do. This increases both the chance that the models are simply off-distribution (and thus may be less competent in general), and the chance that they accumulate errors by chance and start conditioning on being the type of agent that makes such mistakes (and thus become more prone to making them). In the same way that naive versions of the constant hazard model seem contradicted by evidence, I suspect that naive versions of this hypothesis are also likely to fail. But it’s possible that more sophisticated versions may play a key role in explaining the phenomenon.
Long tasks require better time and resource management (which models struggle with). Finally, an explanation that I think is often neglected is that longer tasks tend to require meta-cognition and explicit strategy, which current models seem to struggle with. A 5-minute task such as writing a simple function or script can be done in one go, without much planning, but getting the best score in a machine learning experiment over 8 hours requires allocating scarce resources, including remaining time and compute. It’s been observed that models understandably struggle a lot with estimating how much (wall-clock) time they take to do particular tasks, and often double down on failing approaches instead of switching strategies.
I welcome more thinking on this topic, as well as more empirical work to distinguish between these hypotheses.