If we adjust for the 5-18x speed improvement measured for experienced workers, and target an 80% task success rate, that pushes the timeline out by over three years
I don't think this is a good interpretation of the 5-18x multiplier. In particular, I think the "acquire context multiplier" will be increasingly small for longer tasks.
Like, the task of "get a bunch of context then complete this 1 month long task" is a task that will typically take humans who don't already have context 2 months, maybe 6 months in particularly tricky domains, not 5-18 months. So, maybe you add a doubling or so, adding more like 7 months to the timeline.
Another way to put this is that the 5-18x multiplier is an artifact of taking months of context and applying that to a short task (like maybe 10 min or 1 hour), but if you take a 1 month task that requires a bunch of context (e.g., implement this optimization pass in llvm), having the context already is probably only a factor of 2-4 (or basically no multiplier depending on the task). (There is probably an additional experience effect where people who are e.g. experienced with compilers will be faster, but this feels a bit separate to me and not cleanly applicable to the AIs.)
To be clear, this will substantially reduce the usefulness of AIs with shorter horizon lengths (like 8-32 hours), cutting down the AI R&D multipliers we see along the way.
(I agree the 80% task success is needed and pushes out the timeline some.)
Agreed that we should expect the performance difference between high- and low-context human engineers to diminish as task sizes increase. Also agreed that the right way to account for that might be to simply discount the 5-18x multiplier when projecting forwards, but I'm not entirely sure. I did think about this before writing the post, and I kept coming back to the view that when we measure Claude 3.7 as having a 50% success rate at 50-minute tasks, or o3 at 1.5-hour tasks, we should substantially discount those timings. On reflection, I suppose the counterargument is that this makes the measured doubling times look more impressive, because (plausibly) if we look at a pair of tasks that take low-context people 10 and 20 minutes respectively, the time ratio for realistically high-context people might be more than 2x. But I could imagine this playing out in other ways as well (e.g. maybe we aren't yet looking at task sizes where people have time to absorb a significant amount of context, and so as the models climb from 1 to 4 to 16 to 64 minute tasks, the humans they're being compared against aren't yet benefiting from context-learning effects).
One always wishes for more data – in this case, more measurements of human task completion times with high and low context, on more problem types and a wider range of time horizons...
In actuality, the study doesn’t say much about AGI, except to provide evidence against the most aggressive forecasts.
This feels quite wrong to me. Surely if AIs were completing 1 month long self contained software engineering tasks (e.g. what a smart intern might do in the first month) that would be a big update toward the plausibility of AGI within a few years! So, I think the study does say something meaningful about AGI other than just evidence against shorter timelines[1]. I agree AGI might not happen in a few years after 1 month long software tasks and we'd have a richer understanding at the time, but the basic case in favor feels very strong to me.
(At a more basic level, if you would have updated a decent amount toward relatively longer timelines if the paper had shown 20 year timelines to 1 month SWE, then you must update to relatively shorter timelines given the trend is 5 years, with a possibility of more like 2.5 due to the more recent faster trend. This is by conservation of expected evidence. This isn't to say that you have to directionally update toward shorter timelines based on these results; e.g., maybe you expected an even faster trajectory and this seemed surprisingly slow, extending your timelines.)
I edited this sentence in because I think my comment was originally confusing. ↩︎
Surely if AIs were completing 1 month long self contained software engineering tasks (e.g. what a smart intern might do in the first month) that would be a big update toward the plausibility of AGI within a few years!
Agreed. But that means time from today to AGI is the sum of:
1. The time until AIs can complete one-month self-contained software engineering tasks (at a 50% success rate).
2. The time from that milestone to AGI.
If we take the midpoint of Thomas Kwa's "3-4 months" guess for subsequent doubling time, we get 23.8 months for (1). If we take "a few years" to be 2 years, we're in 2029, which is farther out than "the most aggressive forecasts" (e.g. various statements by Dario Amodei, or the left side of the probability distribution in AI 2027).
And given the starting assumptions, those are fairly aggressive numbers. Thomas' guess that "capability on more realistic tasks will follow the long-term 7-month doubling time" would push this out another two years, and one could propose longer timelines from one-month-coder to AGI.
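A quick sketch of the arithmetic, for the curious. This is my reconstruction rather than anything stated explicitly above: I take o3's ~1.5-hour horizon mentioned earlier as the starting point and treat one month of work as roughly 167 hours.

```python
import math

# Assumptions (reconstruction, not figures stated in the thread): start from
# o3's ~1.5-hour 50%-success horizon and treat one month of work as ~167 hours.
start_hours = 1.5
target_hours = 167
doublings = math.log2(target_hours / start_hours)  # ~6.8 doublings needed

for label, months_per_doubling in [("3.5-month doubling time", 3.5),
                                   ("7-month doubling time", 7.0)]:
    print(f"{label}: {doublings * months_per_doubling:.1f} months to a one-month horizon")

# 3.5-month doubling time: ~23.8 months  (the figure for (1) above)
# 7-month doubling time:   ~47.6 months  (roughly two years longer)
```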
Of course this is not proof of anything – for instance, task horizon doubling times could continue to accelerate, as envisioned in AI 2027 (IIRC), and one could also propose shorter timelines from one-month-coder to AGI. But I think the original statement is fair: even if we use 3-4 months as the doubling time, isn't this an update away from "the most aggressive forecasts"?
(When I wrote this, I was primarily thinking about Dario projecting imminent geniuses-in-a-data-center, and similar claims that AGI is coming within the next couple of years or even is already here.)
To be clear, I agree it provides evidence against very aggressive timelines (if I had 2027 medians I would have updated to longer), I was disagreeing with "the study doesn’t say much about AGI, except to". I think the study does provide a bunch of evidence about when AGI might come! (And it seems you agree.) I edited my original comment to clarify this as I think I didn't communicate what I was trying to say well.
If the trend isn’t inherently superexponential and continues at 7-month doubling times by default, it does seem hard to get to AGI within a few years. If it’s 4 months, IIRC in my timelines model it’s still usually after 2027, but it can be close because of intermediate AI R&D speedups, depending on how big you think the gaps between benchmarks and the real world are. I’d have to go back and look if we want a more precise answer. If you add error bars around the 4-month doubling time, that increases the chance of AGI soon, of course.
If you treat the shift from 7- to 4-month doubling times as weak evidence of a superexponential trend, that might be evidence in favor of 2027 timelines, depending on your prior.
IMO how you should update on this just depends on your prior views (echoing Ryan’s comment). Daniel had 50% AGI by 2027 and did (and should) update to a bit lower. I’m at more like 20-25% and I think I stay about the same (and I think Ryan is similar). I think if you have more like <=10% you should probably update upward.
Oops, I forgot to account for the gap from 50% success rate to 80% success (and actually I'd argue that the target success rate should be higher than 80%).
Also potential factors for "task messiness" and the 5-18x context penalty, though as you've pointed out elsewhere, the latter should arguably be discounted.
Personally, I updated toward shorter timelines upon seeing a preliminary version of their results which just showed the more recent doubling trend and then updated most of the way back on seeing the longer run trend. (Or maybe even toward slightly longer timelines than I started with, I forget.)
if AIs were completing 1 month long self contained software engineering tasks (e.g. what a smart intern might do in the first month)
This doesn't seem like a good example to me.
The sort of tasks we're talking about are extrapolations of current benchmark tasks, so it's more like: what a programming savant with almost no ability to interact with colleagues or search out new context might do in a month given a self-contained, thoroughly specced and vetted task.
I expect current systems will naively scale to that, but not to the abilities of an arbitrary intern because that requires skills that aren't tested in the benchmarks.
Great post; I thought that there were some pretty bad interpretations of the METR result and wanted to write an article like this one (but didn't have the time). I'm glad to see the efficient lesswrong ideas market at work :)
(to their credit, I think METR mostly presented their results honestly.)
My guess: if we define AGI or a superhuman coder as a "self-serving agent that learns to optimize for unseen objectives based on its prediction history":
- basic necessary primitives / "blueprint" understood by 2030
- a recipe made stable in terms of compute, infrastructure, efficiency, etc. by ~2035, maybe sooner. But if we throw 10,000 superhuman coders at building a new superhuman coder, things could accelerate quickly.
[EDIT: initial publication did not have the link to the original post, and was missing the footnotes. Sorry about that! Fixed now.]
[This has been lightly edited from the original post, eliminating some introductory material that LW readers won't need. Thanks to Stefan Schubert for suggesting I repost here. TL;DR for readers already familiar with the METR Measuring AI Ability to Complete Long Tasks paper: this post highlights some gaps between the measurements used in the paper and real-world work – gaps which are discussed in the paper, but have often been overlooked in subsequent discussion.]
It's difficult to measure progress in AI, despite the slew of benchmark scores that accompany each new AI model.
Benchmark scores don’t provide much perspective, because we keep having to change measurement systems. Almost as soon as a benchmark is introduced, it becomes saturated – models learn to ace the test. So someone introduces a more difficult benchmark, whose scores aren’t comparable to the old one. There’s nothing to draw a long-term trend line on.
METR recently published a study aimed at addressing this problem: Measuring AI Ability to Complete Long Tasks. They show how to draw a trend line through the last six years of AI progress, and how to project that trend into the future.
The study’s results have been widely misinterpreted as confirmation that AGI (AI capable of replacing most workers, or at least most remote workers) is coming within five years. In actuality, the study doesn’t say much about AGI, except to provide evidence against the most aggressive forecasts.
To explain what the paper does and does not say, I’ll begin by explaining the problem it set out to address.
We’re Gonna Need a Harder Test
[LW readers might skip this section]
Here’s a chart I’ve shown before:
Each line represents AI performance on a particular benchmark. Unsurprisingly, there’s a clear pattern of rapid improvement. But each line is measuring something different; we can’t connect the dots to plot a long-term trend.
For instance, the dark green line shows scores on a speech recognition test. If you’d looked at that plot in 2013, you might have been able to estimate speech recognition scores for 2015. But you’d have had no way of anticipating that reading comprehension (pink line) would start to take off after 2016. Performance on one benchmark doesn’t say much about performance on another.
(This chart provides a reminder that AI test scores are poor indicators of real-world capability. By 2018 language models were exceeding human performance at a benchmark of reading comprehension, but they were still useless for most tasks.)
You can see that in recent years, tests have begun saturating very quickly. As a result, these benchmarks haven’t been much use in forecasting the performance of future AIs. We can’t forecast benchmark scores for the AIs of 2027, let alone 2035, because benchmarks difficult enough to challenge those AIs haven’t been constructed yet.
The METR paper addresses this by allowing us to compare scores across different tests.
Grading AIs on a Consistent Curve
[If you're already familiar with the paper, you can skip this section as well]
The METR paper creates a single plot showing AI models released over a six-year period, and projects that plot into the future:
The first model shown here, GPT-2, was barely able to fumble its way through a coherent paragraph. The most recent, Claude 3.7 Sonnet, is able to write complex computer programs at a single blow. How did METR measure such disparate capabilities on a single graph?
No one has constructed a benchmark with questions easy enough to register the capabilities of GPT-2 and hard enough to challenge Claude 3.7, so the authors combined three different benchmarks spanning a broad range of difficulty levels. One consists of “66 single-step tasks representing short segments of work by software developers, ranging from 1 second to 30 seconds”, and the questions are quite simple. Another, RE-Bench, is designed so that most human experts can “make progress” on a problem if given a full 8 hours.
To place these very different benchmarks on a common scale, the researchers evaluated the difficulty of each problem by measuring how long it takes a human expert to solve – a task that takes a person one hour is presumed to be harder than a task that takes a minute. This approach has the great benefit of being universal; any task can be placed on the “how long does it take?” scale.
For each model, they then determined the problem size at which that model’s success rate is 50%. For instance, “GPT-4 0314” (the green square near the middle of the graph) has a 50% success rate at problems that take people about 5 minutes. Looking at each end of the graph, we see that GPT-2 (released in 2019) had 50% success on tasks that take a person about two seconds, while the latest version of Claude does equally well on tasks that take 50 minutes.
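To make this concrete, here is a minimal sketch of the kind of curve-fitting involved (illustrative only, with made-up data; METR's actual procedure has more detail): fit a logistic curve of success probability against the log of human completion time, then read off the task length at which predicted success crosses 50%.

```python
import numpy as np
from scipy.optimize import curve_fit

# Hypothetical per-task results for one model: how long each task takes a
# human (in minutes), and whether the model solved it. Numbers are made up.
human_minutes   = np.array([0.1, 0.5, 1, 2, 5, 10, 15, 30, 60, 120, 240, 480])
model_succeeded = np.array([1,   1,   1, 1, 1, 1,  0,  1,  0,  0,   0,   0], dtype=float)

def p_success(log2_minutes, log2_horizon, slope):
    # Logistic curve: success probability falls as tasks get longer, and
    # crosses 50% exactly at log2_horizon.
    return 1.0 / (1.0 + np.exp(slope * (log2_minutes - log2_horizon)))

params, _ = curve_fit(p_success, np.log2(human_minutes), model_succeeded,
                      p0=[np.log2(30), 1.0])
log2_horizon, slope = params
print(f"Estimated 50% time horizon: {2 ** log2_horizon:.0f} minutes")
```

Repeating this kind of fit for each model, and plotting the fitted horizon against the model's release date, is roughly how the paper's trend line is constructed.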
There are a lot of caveats to these figures, which I’ll discuss below. But it’s immediately apparent that capabilities are increasing at a predictable rate, providing perhaps the first rigorous approach for forecasting the rate of AI improvement for real-world tasks. As the paper says [emphasis added]:
This is, as the expression goes, “big if true”. Most people’s jobs don’t require much planning or execution on a time horizon of more than a month! If these results generalize to tasks other than coding, and an AI could do anything a person could do (remotely) in a month, it would be ready to automate much of the economy. Many people are reading the paper as predicting this will happen within five years, but in fact it does not.
How Applicable to the Real World are These Results?
There are several important reasons that the study does not imply a short road to AGI.
It only measures technical and reasoning tasks. The tasks used in the METR paper come from software engineering and related fields, as well as tests of verbal reasoning or simple arithmetic. These are areas where current models are especially proficient. Problem domains that current LLMs struggle with were excluded.
Just because Claude 3.7 can often tackle a 50-minute coding task doesn’t mean it will be similarly proficient for other kinds of work. For instance, like all current models, it is almost completely incapable of solving visual reasoning tasks from the ARC-AGI-2 test that a person can often manage in a few minutes[1]. Natalia Coelho posted a nice discussion, noting that “State-of-the-art AI models struggle at some tasks that take humans <10 minutes, while *simultaneously* excelling at some tasks that would take humans several hours or days to solve.” I’ve written before about the jagged nature of AI abilities, and the wide gap between AI benchmark scores and real-world applicability. So even if the paper’s trend line is correct and the AIs of 2030 will be able to undertake tasks that take software engineers a month, they are likely to struggle in other areas.
In fact, not only do current AIs struggle at many real-world tasks, their rate of improvement at those tasks might turn out to be slower as well. We can make an analogy to Wright’s Law – the principle behind Moore’s Law, and also the reason that solar panels, EV batteries, and many other products become cheaper over time. Wright’s Law states that the production cost of manufactured items falls at a predictable rate as production scales up. However, different products have different rates of price decrease; wind turbines fall in price more slowly than transistors or solar panels. By the same token, AI performance on different kinds of work may improve at different rates, with some categories being slower to advance.
The tasks used in this paper are all tidy, isolated projects with clear success criteria, excluding all of the squishy skills I discuss in my recent post If AGI Means Everything People Do... What is it That People Do?. This is typical of AI benchmarks, which almost universally measure things that are easy to measure. For instance, from the description of one of the benchmarks used in this paper (HCAST):
To their credit, the METR authors note that tasks which arise in real work are often “messy” in a way that distinguishes them from benchmark problems. They developed a rough system for evaluating the “messiness” of a task, and found that their results are fairly consistent across tasks of different levels of messiness. However, none of the measured tasks are very messy. As the paper says:
Even this may understate the artificial simplicity of the benchmark tasks. Little weight is given to the question of interacting with other people (e.g. collaboration with a co-worker or negotiation with a vendor), and my impression is that few if any of the problems require this. I also don’t believe that any of the problems involve working in an open-ended environment such as the public Internet or a large corporate intranet – for instance, to find relevant context not included in the problem statement. These measurements don’t evaluate AI’s ability to work in a fully realistic environment.
The measurements of human task-completion time do not reflect real-world conditions. The paper measures AI tools against human contractors with relevant coding skills. However, the contractors weren’t familiar with the specific context of each task. They’re in the disoriented situation of a new hire on their first day of work.
Admirably, the authors recognized this issue and performed an experiment to measure the impact. They took a set of tasks which METR engineers had performed in the ordinary course of their work, and asked outside contractors to repeat those tasks – under the same artificial circumstances used in the study. They found that contractors took 5 to 18 times longer to perform the same tasks as METR staff. The contractors they selected were highly skilled, so this is unlikely to be a skill issue. Unsurprisingly, people take much longer to perform a task when they are working in an unfamiliar context. This same handicap applies to the human “baseliners” METR used to evaluate the difficulty of the tasks analyzed in this paper. When the paper says that Claude 3.7 has a 50% success rate at tasks which take a person 50 minutes, that suggests a person familiar with the codebase in question might be able to perform the same task 5-18 times faster, or in just 3 to 10 minutes.
It’s a long road from a 50% task success rate to actual mastery. The headline projection in the paper is that in about 5 years, AIs will be able to carry out roughly 50% of software engineering tasks that take humans one month to complete. A 50% score is a good indication of progress, but a tool that can only complete half of the tasks assigned to it is a complement to human engineers, not a replacement. It likely indicates that the model has systematic shortfalls in some important skill areas. Human workers will still be needed for the other 50% of tasks, as well as to check the AI’s work (which, for some kinds of work, may be difficult).
The paper finds that time horizons for an 80% success rate are about five times shorter, e.g. Claude 3.7 would likely have an 80% success rate for tasks that take about 10 minutes instead of 50 minutes. Even an 80% success rate is far from full mastery, but there isn’t enough data yet to analyze the timeline for higher success rates.
Many readers have glossed over these factors, leading them to overblown conclusions regarding the future of AI.
What the METR Study Tells Us About AGI Timelines
This is a nuanced piece of work, but nuance is often lost as ideas bounce around the Internet. For instance, an article in Nature, AI could soon tackle projects that take humans weeks, says:
The words “software engineering” do not appear in this passage. A casual reader might easily come away thinking that AI will be able to perform virtually any cognitive task by 2029, and many people do seem to have reached that conclusion. However, a careful reading points in a different direction. The paper’s 2029 estimate is for tasks that current models are specifically designed to excel at (software development), that have unrealistically simple and clear specifications, measuring against human workers in their first day on the job. Even then the AIs are only projected to be able to perform half of the tasks.
If we adjust for the 5-18x speed improvement measured for experienced workers, and target an 80% task success rate, that pushes the timeline out by over three years[2]. However, an 80% success rate still indicates substantial gaps within software engineering tasks, and then we need to account for realistically messy tasks[3] and consider tasks outside of software engineering. This might add multiple years to the timeline, pushing us out to at least ten years for an AI that is fully human-level at a broad range of tasks, and possibly much longer, depending on how long it takes to go from 80% of tasks to 100%[4] and from software engineering to broader competence.
I was originally planning to end this post by saying that this paper is strong evidence that “AGI” (by any strong definition) almost certainly could not possibly arrive in the next five years, because the trendline puts the much weaker threshold of “50% of artificially tidy software engineering tasks” that far out. However, it turns out there’s more to the story.
Recent Models Have Been Ahead of the Curve
While the paper’s primary finding is that AI task time horizons have been doubling every 7 months over the last six years, it notes that progress on this measurement may be accelerating:
This was based on just a handful of data points, and so could have just been a blip – “difficult to distinguish from noise”. But just last week, METR released a preliminary evaluation of two new models from OpenAI, o3 and o4-mini. The results add support to the idea that AI task time horizons are accelerating. Thomas Kwa, one of the authors of the original paper, writes:
The authors of AI 2027 are also on record as expecting task-horizon doubling times to be much shorter than 7 months going forward, pointing to the recent acceleration and predicting that the use of AI tools in AI R&D will speed things up further. The gods of AI love to draw straight lines on semilog graphs, but perhaps they’ve decided an upward curve would be more amusing in this case. If the upward trend continues, models could reach a 50% success rate on one-month software engineering tasks by 2027.
It’s still an open question what that would mean in practical terms.
We’re Running Out Of Artificial Tasks
It’s unclear whether the recent uptick in progress on the HCAST problem set used in the METR study will continue, and how long it will take to go from a 50% success rate to full mastery of these kinds of encapsulated coding challenges. Maybe it’ll only be a few years before AI models can tackle any such problem that a person could handle; maybe it’ll be a decade or more. But either way, the practice of evaluating AI models on easily graded, artificially tidy problems is starting to outlive its usefulness.
The big questions going forward are going to be:
(As I was preparing to publish this, Helen Toner released a great post exploring these questions.)
The METR paper is a valuable step forward in analyzing AI progress over time. But it’s only a starting point.
The creators of the ARC-AGI-2 problem set state that no current AI model is able to solve more than a few percent of these puzzles. I tried 10 problems, and was able to figure most of them out in under a minute (even if it took a few more minutes of clicking around to fill in the answer grid using the clumsy interface provided). I will note however that I ran out of patience before figuring out 2 of the 10 problems.
Combining the paper’s estimate that a model which can achieve 50% success at tasks which take T seconds can achieve 80% for tasks which take T/5 seconds with an estimated 10x time penalty for human workers lacking job context (middle of the 5x – 18x range) yields a 50x difference in task length. If AIs can double their task horizon every 7 months, that works out to 5.6 doublings, or 40 months.
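In symbols, using 10x as the rough midpoint of the context-penalty range (the same assumption as above):

$$\log_2(5 \times 10) \approx 5.6 \ \text{doublings}, \qquad 5.6 \times 7\ \text{months} \approx 40\ \text{months} \approx 3.3\ \text{years}.$$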
The unrealistically simple nature of the benchmark tasks might already be at least partially captured by the 5-18x time penalty I’m applying, as discussed in the previous footnote.
You might ask why I suggest that AIs would need to be able to handle 100% of tasks, when people are of course not 100% reliable, especially when given challenging problems. And indeed, it would be unfair to say that an AI should be 100% reliable before it can be considered “human level”. However, the comparison here is not an individual person, it is humanity collectively – or at least, an individual person plus all of the co-workers and other people they might be able to ask for help. AIs are much more homogenous than people, so “this AI can’t solve this type of problem” is a more serious issue than “this particular person can’t solve this type of problem”.