If we adjust for the 5-18x speed improvement measured for experienced workers, and target an 80% task success rate, that pushes the timeline out by over three years
I don't think this is a good interpretation of the 5-18x multiplier. In particular, I think the "acquire context multiplier" will be increasingly small for longer tasks.
Like, the task of "get a bunch of context then complete this 1 month long task" is a task that will typically take humans who don't already have context 2 months, maybe 6 months in particularly tricky domains, not 5-18 months. So, maybe you add a doubling or so, adding more like 7 months to the timeline.
Another way to put this is that the 5-18x multiplier is an artifact of taking months of context and applying that to a short task (like maybe 10 min or 1 hour), but if you take a 1 month task that requires a bunch of context (e.g., implement this optimization pass in llvm), having the context already is probably only a factor of 2-4 (or basically no multiplier depending on the task). (There is probably an additional experience effect where people who are e.g. experienced with compilers will be faster, but this feels a bit separate to me and not cleanly applicable to the AIs.)
To be clear, this will substantially reduce the usefulness of AIs with shorter horizon lengths (like 8-32 hours), cutting down the AI R&D multipliers we see along the way.
(I agree the 80% task success is needed and pushes out the timeline some.)
Agreed that we should expect the performance difference between high- and low-context human engineers to diminish as task sizes increase. Also agreed that the right way to account for that might be to simply discount the 5-18x multiplier when projecting forwards, but I'm not entirely sure. I did think about this before writing the post, and I kept coming back to the view that when we measure Claude 3.7 as having a 50% success rate at 50-minute tasks, or o3 at 1.5-hour tasks, we should substantially discount those timings. On reflection, I suppose the counterargument is that this makes the measured doubling times look more impressive, because (plausibly) if we look at a pair of tasks that take low-context people 10 and 20 minutes respectively, the time ratio for realistically high-context people might be more than 2x. But I could imagine this playing out in other ways as well (e.g. maybe we aren't yet looking at task sizes where people have time to absorb a significant amount of context, and so as the models climb from 1 to 4 to 16 to 64 minute tasks, the humans they're being compared against aren't yet benefiting from context-learning effects).
One always wishes for more data – in this case, more measurements of human task completion times with high and low context, on more problem types and a wider range of time horizons...
In actuality, the study doesn’t say much about AGI, except to provide evidence against the most aggressive forecasts.
This feels quite wrong to me. Surely if AIs were completing 1 month long self contained software engineering tasks (e.g. what a smart intern might do in the first month) that would be a big update toward the plausibility of AGI within a few years! So, I think the study does say something meaningful about AGI other than just evidence against shorter timelines[1]. I agree AGI might not happen in a few years after 1 month long software tasks and we'd have a richer understanding at the time, but the basic case in favor feels very strong to me.
(At a more basic level, if you would have updated a decent amount toward relatively longer timelines if the paper had shown 20 year timelines to 1 month SWE, then you must update to relatively shorter timelines given the trend is 5 years, with a possibility of more like 2.5 due to the more recent faster trend. This is by conservation of expected evidence. This isn't to say that you have to directionally update toward shorter timelines based on these results; e.g., maybe you expected an even faster trajectory and this seemed surprisingly slow, extending your timelines.)
I edited this sentence in because I think my comment was originally confusing. ↩︎
Surely if AIs were completing 1 month long self contained software engineering tasks (e.g. what a smart intern might do in the first month) that would be a big update toward the plausibility of AGI within a few years!
Agreed. But that means time from today to AGI is the sum of:
1. The time until AIs can complete one-month self-contained software engineering tasks (at a 50% success rate).
2. The time from that milestone to AGI.
If we take the midpoint of Thomas Kwa's "3-4 months" guess for subsequent doubling time, we get 23.8 months for (1). If we take "a few years" to be 2 years, we're in 2029, which is farther out than "the most aggressive forecasts" (e.g. various statements by Dario Amodei, or the left side of the probability distribution in AI 2027).
And given the starting assumptions, those are fairly aggressive numbers. Thomas' guess that "capability on more realistic tasks will follow the long-term 7-month doubling time" would push this out another two years, and one could propose longer timelines from one-month-coder to AGI.
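A quick sketch of the arithmetic, for the curious. This is my reconstruction rather than anything stated explicitly above: I take o3's ~1.5-hour horizon mentioned earlier as the starting point and treat one month of work as roughly 167 hours.

```python
import math

# Assumptions (reconstruction, not figures stated in the thread): start from
# o3's ~1.5-hour 50%-success horizon and treat one month of work as ~167 hours.
start_hours = 1.5
target_hours = 167
doublings = math.log2(target_hours / start_hours)  # ~6.8 doublings needed

for label, months_per_doubling in [("3.5-month doubling time", 3.5),
                                   ("7-month doubling time", 7.0)]:
    print(f"{label}: {doublings * months_per_doubling:.1f} months to a one-month horizon")

# 3.5-month doubling time: ~23.8 months  (the figure for (1) above)
# 7-month doubling time:   ~47.6 months  (roughly two years longer)
```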
Of course this is not proof of anything – for instance, task horizon doubling times could continue to accelerate, as envisioned in AI 2027 (IIRC), and one could also propose shorter timelines from one-month-coder to AGI. But I think the original statement is fair: even if we use 3-4 months as the doubling time, isn't this an update away from "the most aggressive forecasts"?
(When I wrote this, I was primarily thinking about Dario projecting imminent geniuses-in-a-data-center, and similar claims that AGI is coming within the next couple of years or even is already here.)
To be clear, I agree it provides evidence against very aggressive timelines (if I had 2027 medians I would have updated to longer), I was disagreeing with "the study doesn’t say much about AGI, except to". I think the study does provide a bunch of evidence about when AGI might come! (And it seems you agree.) I edited my original comment to clarify this as I think I didn't communicate what I was trying to say well.
If the trend isn’t inherently superexponential and continues at 7-month doubling times by default, it does seem hard to get to AGI within a few years. If it’s 4 months, IIRC in my timelines model it’s still usually after 2027, but it can be close because of intermediate AI R&D speedups, depending on how big you think the gaps between benchmarks and the real world are. I’d have to go back and look if we want a more precise answer. If you add error bars around the 4-month doubling time, that increases the chance of AGI soon, of course.
If you treat the shift from 7- to 4-month doubling times as weak evidence of a superexponential trend, that might be evidence in favor of 2027 timelines, depending on your prior.
IMO how you should update on this just depends on your prior views (echoing Ryan’s comment). Daniel had 50% AGI by 2027 and did (and should) update to a bit lower. I’m at more like 20-25% and I think I stay about the same (and I think Ryan is similar). I think if you have more like <=10% you should probably update upward.
Oops, I forgot to account for the gap from 50% success rate to 80% success (and actually I'd argue that the target success rate should be higher than 80%).
Also potential factors for "task messiness" and the 5-18x context penalty, though as you've pointed out elsewhere, the latter should arguably be discounted.
Personally, I updated toward shorter timelines upon seeing a preliminary version of their results which just showed the more recent doubling trend and then updated most of the way back on seeing the longer run trend. (Or maybe even toward slightly longer timelines than I started with, I forget.)
if AIs were completing 1 month long self contained software engineering tasks (e.g. what a smart intern might do in the first month)
This doesn't seem like a good example to me.
The sort of tasks we're talking about are extrapolations of current benchmark tasks, so it's more like: what a programming savant with almost no ability to interact with colleagues or search out new context might do in a month given a self-contained, thoroughly specced and vetted task.
I expect current systems will naively scale to that, but not to the abilities of an arbitrary intern because that requires skills that aren't tested in the benchmarks.
Great post; I thought that there were some pretty bad interpretations of the METR result and wanted to write an article like this one (but didn't have the time). I'm glad to see the efficient lesswrong ideas market at work :)
(to their credit, I think METR mostly presented their results honestly.)
My guess: if we define AGI or a superhuman coder as a "self-serving agent that learns to optimize for unseen objectives based on its prediction history":
- basic necessary primitives / "blueprint" understood by 2030
- a recipe made stable in terms of compute, infrastructure, efficiency, etc. by ~2035, maybe sooner. But if we throw 10,000 superhuman coders at building a new superhuman coder, things could accelerate quickly.
[EDIT: initial publication did not have the link to the original post, and was missing the footnotes. Sorry about that! Fixed now.]
[This has been lightly edited from the original post, eliminating some introductory material that LW readers won't need. Thanks to Stefan Schubert for suggesting I repost here. TL;DR for readers already familiar with the METR Measuring AI Ability to Complete Long Tasks paper: this post highlights some gaps between the measurements used in the paper and real-world work – gaps which are discussed in the paper, but have often been overlooked in subsequent discussion.]
It's difficult to measure progress in AI, despite the slew of benchmark scores that accompany each new AI model.
Benchmark scores don’t provide much perspective, because we keep having to change measurement systems. Almost as soon as a benchmark is introduced, it becomes saturated – models learn to ace the test. So someone introduces a more difficult benchmark, whose scores aren’t comparable to the old one. There’s nothing to draw a long-term trend line on.
METR recently published a study aimed at addressing this problem: Measuring AI Ability to Complete Long Tasks. They show how to draw a trend line through the last six years of AI progress, and how to project that trend into the future.
The study’s results have been widely misinterpreted as confirmation that AGI (AI capable of replacing most workers, or at least most remote workers) is coming within five years. In actuality, the study doesn’t say much about AGI, except to provide evidence against the most aggressive forecasts.
To explain what the paper does and does not say, I’ll begin by explaining the problem it set out to address.
We’re Gonna Need a Harder Test
[LW readers might skip this section]
Here’s a chart I’ve shown before:
Each line represents AI performance on a particular benchmark. Unsurprisingly, there’s a clear pattern of rapid improvement. But each line is measuring something different; we can’t connect the dots to plot a long-term trend.
For instance, the dark green line shows scores on a speech recognition test. If you’d looked at that plot in 2013, you might have been able to estimate speech recognition scores for 2015. But you’d have had no way of anticipating that reading comprehension (pink line) would start to take off after 2016. Performance on one benchmark doesn’t say much about performance on another.
(This chart provides a reminder that AI test scores are poor indicators of real-world capability. By 2018 language models were exceeding human performance at a benchmark of reading comprehension, but they were still useless for most tasks.)
You can see that in recent years, tests have begun saturating very quickly. As a result, these benchmarks haven’t been much use in forecasting the performance of future AIs. We can’t forecast benchmark scores for the AIs of 2027, let alone 2035, because benchmarks difficult enough to challenge those AIs haven’t been constructed yet.
The METR paper addresses this by allowing us to compare scores across different tests.
Grading AIs on a Consistent Curve
[If you're already familiar with the paper, you can skip this section as well]
The METR paper creates a single plot showing AI models released over a six-year period, and projects that plot into the future:
The first model shown here, GPT-2, was barely able to fumble its way through a coherent paragraph. The most recent, Claude 3.7 Sonnet, is able to write complex computer programs at a single blow. How did METR measure such disparate capabilities on a single graph?
No one has constructed a benchmark with questions easy enough to register the capabilities of GPT-2 and hard enough to challenge Claude 3.7, so the authors combined three different benchmarks spanning a broad range of difficulty levels. One consists of “66 single-step tasks representing short segments of work by software developers, ranging from 1 second to 30 seconds”, and the questions are quite simple. Another, RE-Bench, is designed so that most human experts can “make progress” on a problem if given a full 8 hours.
To place these very different benchmarks on a common scale, the researchers evaluated the difficulty of each problem by measuring how long it takes a human expert to solve – a task that takes a person one hour is presumed to be harder than a task that takes a minute. This approach has the great benefit of being universal; any task can be placed on the “how long does it take?” scale.
For each model, they then determined the problem size at which that model’s success rate is 50%. For instance, “GPT-4 0314” (the green square near the middle of the graph) has a 50% success rate at problems that take people about 5 minutes. Looking at each end of the graph, we see that GPT-2 (released in 2019) had 50% success on tasks that take a person about two seconds, while the latest version of Claude does equally well on tasks that take 50 minutes.
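To make this concrete, here is a minimal sketch of the kind of curve-fitting involved (illustrative only, with made-up data; METR's actual procedure has more detail): fit a logistic curve of success probability against the log of human completion time, then read off the task length at which predicted success crosses 50%.

```python
import numpy as np
from scipy.optimize import curve_fit

# Hypothetical per-task results for one model: how long each task takes a
# human (in minutes), and whether the model solved it. Numbers are made up.
human_minutes   = np.array([0.1, 0.5, 1, 2, 5, 10, 15, 30, 60, 120, 240, 480])
model_succeeded = np.array([1,   1,   1, 1, 1, 1,  0,  1,  0,  0,   0,   0], dtype=float)

def p_success(log2_minutes, log2_horizon, slope):
    # Logistic curve: success probability falls as tasks get longer, and
    # crosses 50% exactly at log2_horizon.
    return 1.0 / (1.0 + np.exp(slope * (log2_minutes - log2_horizon)))

params, _ = curve_fit(p_success, np.log2(human_minutes), model_succeeded,
                      p0=[np.log2(30), 1.0])
log2_horizon, slope = params
print(f"Estimated 50% time horizon: {2 ** log2_horizon:.0f} minutes")
```

Repeating this kind of fit for each model, and plotting the fitted horizon against the model's release date, is roughly how the paper's trend line is constructed.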
There are a lot of caveats to these figures, which I’ll discuss below. But it’s immediately apparent that capabilities are increasing at a predictable rate, providing perhaps the first rigorous approach for forecasting the rate of AI improvement for real-world tasks. As the paper says [emphasis added]:
This is, as the expression goes, “big if true”. Most people’s jobs don’t require much planning or execution on a time horizon of more than a month! If these results generalize to tasks other than coding, and an AI could do anything a person could do (remotely) in a month, it would be ready to automate much of the economy. Many people are reading the paper as predicting this will happen within five years, but in fact it does not.
How Applicable to the Real World are These Results?
There are several important reasons that the study does not imply a short road to AGI.
It only measures technical and reasoning tasks. The tasks used in the METR paper come from software engineering and related fields, as well as tests of verbal reasoning or simple arithmetic. These are areas where current models are especially proficient. Problem domains that current LLMs struggle with were excluded.
Just because Claude 3.7 can often tackle a 50-minute coding task doesn’t mean it will be similarly proficient for other kinds of work. For instance, like all current models, it is almost completely incapable of solving visual reasoning tasks from the ARC-AGI-2 test that a person can often manage in a few minutes[1]. Natalia Coelho posted a nice discussion, noting that “State-of-the-art AI models struggle at some tasks that take humans <10 minutes, while *simultaneously* excelling at some tasks that would take humans several hours or days to solve.” I’ve written before about the jagged nature of AI abilities, and the wide gap between AI benchmark scores and real-world applicability. So even if the paper’s trend line is correct and the AIs of 2030 will be able to undertake tasks that take software engineers a month, they are likely to struggle in other areas.
In fact, not only do current AIs struggle at many real-world tasks, their rate of improvement at those tasks might turn out to be slower as well. We can make an analogy to Wright’s Law – the principle behind Moore’s Law, and also the reason that solar panels, EV batteries, and many other products become cheaper over time. Wright’s Law states that the production cost of manufactured items falls at a predictable rate as production scales up. However, different products have different rates of price decrease; wind turbines fall in price more slowly than transistors or solar panels. By the same token, AI performance on different kinds of work may improve at different rates, with some categories being slower to advance.
The tasks used in this paper are all tidy, isolated projects with clear success criteria, excluding all of the squishy skills I discuss in my recent post If AGI Means Everything People Do... What is it That People Do?. This is typical of AI benchmarks, which almost universally measure things that are easy to measure. For instance, from the description of one of the benchmarks used in this paper (HCAST):
To their credit, the METR authors note that tasks which arise in real work are often “messy” in a way that distinguishes them from benchmark problems. They developed a rough system for evaluating the “messiness” of a task, and found that their results are fairly consistent across tasks of different levels of messiness. However, none of the measured tasks are very messy. As the paper says:
Even this may understate the artificial simplicity of the benchmark tasks. Little weight is given to the question of interacting with other people (e.g. collaboration with a co-worker or negotiation with a vendor), and my impression is that few if any of the problems require this. I also don’t believe that any of the problems involve working in an open-ended environment such as the public Internet or a large corporate intranet – for instance, to find relevant context not included in the problem statement. These measurements don’t evaluate AI’s ability to work in a fully realistic environment.
The measurements of human task-completion time do not reflect real-world conditions. The paper measures AI tools against human contractors with relevant coding skills. However, the contractors weren’t familiar with the specific context of each task. They’re in the disoriented situation of a new hire on their first day of work.
Admirably, the authors recognized this issue and performed an experiment to measure the impact. They took a set of tasks which METR engineers had performed in the ordinary course of their work, and asked outside contractors to repeat those tasks – under the same artificial circumstances used in the study. They found that contractors took 5 to 18 times longer to perform the same tasks as METR staff. The contractors they selected were highly skilled, so this is unlikely to be a skill issue. Unsurprisingly, people take much longer to perform a task when they are working in an unfamiliar context. This same handicap applies to the human “baseliners” METR used to evaluate the difficulty of the tasks analyzed in this paper. When the paper says that Claude 3.7 has a 50% success rate at tasks which take a person 50 minutes, that suggests a person familiar with the codebase in question might be able to perform the same task 5-18 times faster, or in just 3 to 10 minutes.
It’s a long road from a 50% task success rate to actual mastery. The headline projection in the paper is that in about 5 years, AIs will be able to carry out roughly 50% of software engineering tasks that take humans one month to complete. A 50% score is a good indication of progress, but a tool that can only complete half of the tasks assigned to it is a complement to human engineers, not a replacement. It likely indicates that the model has systematic shortfalls in some important skill areas. Human workers will still be needed for the other 50% of tasks, as well as to check the AI’s work (which, for some kinds of work, may be difficult).
The paper finds that time horizons for an 80% success rate are about five times shorter, e.g. Claude 3.7 would likely have an 80% success rate for tasks that take about 10 minutes instead of 50 minutes. Even an 80% success rate is far from full mastery, but there isn’t enough data yet to analyze the timeline for higher success rates.
Many readers have glossed over these factors, leading them to overblown conclusions regarding the future of AI.
What the METR Study Tells Us About AGI Timelines
This is a nuanced piece of work, but nuance is often lost as ideas bounce around the Internet. For instance, an article in Nature, AI could soon tackle projects that take humans weeks, says:
The words “software engineering” do not appear in this passage. A casual reader might easily come away thinking that AI will be able to perform virtually any cognitive task by 2029, and many people do seem to have reached that conclusion. However, a careful reading points in a different direction. The paper’s 2029 estimate is for tasks that current models are specifically designed to excel at (software development), that have unrealistically simple and clear specifications, measuring against human workers in their first day on the job. Even then the AIs are only projected to be able to perform half of the tasks.
If we adjust for the 5-18x speed improvement measured for experienced workers, and target an 80% task success rate, that pushes the timeline out by over three years[2]. However, an 80% success rate still indicates substantial gaps within software engineering tasks, and then we need to account for realistically messy tasks[3] and consider tasks outside of software engineering. This might add multiple years to the timeline, pushing us out to at least ten years for an AI that is fully human-level at a broad range of tasks, and possibly much longer, depending on how long it takes to go from 80% of tasks to 100%[4] and from software engineering to broader competence.
I was originally planning to end this post by saying that this paper is strong evidence that “AGI” (by any strong definition) almost certainly could not possibly arrive in the next five years, because the trendline puts the much weaker threshold of “50% of artificially tidy software engineering tasks” that far out. However, it turns out there’s more to the story.
Recent Models Have Been Ahead of the Curve
While the paper’s primary finding is that AI task time horizons have been doubling every 7 months over the last six years, it notes that progress on this measurement may be accelerating:
This was based on just a handful of data points, and so could have just been a blip – “difficult to distinguish from noise”. But just last week, METR released a preliminary evaluation of two new models from OpenAI, o3 and o4-mini. The results add support to the idea that AI task time horizons are accelerating. Thomas Kwa, one of the authors of the original paper, writes:
The authors of AI 2027 are also on record as expecting task-horizon doubling times to be much shorter than 7 months going forward, pointing to the recent acceleration and predicting that the use of AI tools in AI R&D will speed things up further. The gods of AI love to draw straight lines on semilog graphs, but perhaps they’ve decided an upward curve would be more amusing in this case. If the upward trend continues, models could reach a 50% success rate on one-month software engineering tasks by 2027.
It’s still an open question what that would mean in practical terms.
We’re Running Out Of Artificial Tasks
It’s unclear whether the recent uptick in progress on the HCAST problem set used in the METR study will continue, and how long it will take to go from a 50% success rate to full mastery of these kinds of encapsulated coding challenges. Maybe it’ll only be a few years before AI models can tackle any such problem that a person could handle; maybe it’ll be a decade or more. But either way, the practice of evaluating AI models on easily graded, artificially tidy problems is starting to outlive its usefulness.
The big questions going forward are going to be:
(As I was preparing to publish this, Helen Toner released a great post exploring these questions.)
The METR paper is a valuable step forward in analyzing AI progress over time. But it’s only a starting point.
The creators of the ARC-AGI-2 problem set state that no current AI model is able to solve more than a few percent of these puzzles. I tried 10 problems, and was able to figure most of them out in under a minute (even if it took a few more minutes of clicking around to fill in the answer grid using the clumsy interface provided). I will note however that I ran out of patience before figuring out 2 of the 10 problems.
Combining the paper’s estimate that a model which can achieve 50% success at tasks which take T seconds can achieve 80% for tasks which take T/5 seconds with an estimated 10x time penalty for human workers lacking job context (middle of the 5x – 18x range) yields a 50x difference in task length. If AIs can double their task horizon every 7 months, that works out to 5.6 doublings, or 40 months.
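In symbols, using 10x as the rough midpoint of the context-penalty range (the same assumption as above):

$$\log_2(5 \times 10) \approx 5.6 \ \text{doublings}, \qquad 5.6 \times 7\ \text{months} \approx 40\ \text{months} \approx 3.3\ \text{years}.$$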
The unrealistically simple nature of the benchmark tasks might already be at least partially captured by the 5-18x time penalty I’m applying, as discussed in the previous footnote.
You might ask why I suggest that AIs would need to be able to handle 100% of tasks, when people are of course not 100% reliable, especially when given challenging problems. And indeed, it would be unfair to say that an AI should be 100% reliable before it can be considered “human level”. However, the comparison here is not an individual person, it is humanity collectively – or at least, an individual person plus all of the co-workers and other people they might be able to ask for help. AIs are much more homogenous than people, so “this AI can’t solve this type of problem” is a more serious issue than “this particular person can’t solve this type of problem”.