Intuitively, I still have trouble accepting the very high speedup factors contemplated in the AI 2027 model. This could be a failure of my intuition.
Curious how you feel about the following intuition pump:
Imagine an AI company where all of their researchers are 100x slower (as in, they think and write code at 100x slower speeds). Also their researchers and engineers are only as good as the median US ML researcher/engineer (so e.g., probably in the bottom 10-20% of lab employees or worse). And, they only have 300 employees.
I think intuitively this company could go much slower than OpenAI when given the same compute.
The situation where we achieve really high multipliers is like the exact opposite, so insofar as you think big slowdowns are possible due to slower and worse (and fewer) researchers, I think you should probably think that big speed ups are possible if we flip the situation. (There are some ways this symmetry could be broken, but I'm skeptical about strong asymmetry as I discuss here.)
I have a draft doc where I talk about this intuition pump. I've emailed it to you.
Thanks. This is helpful, but my intuition is substantially coming from the idea that there might be other factors involved (activities / processes involved in improving models that aren't "thinking about algorithms", "writing code", or "analyzing data"). In other words, I have a fair amount of model uncertainty, especially when thinking about very large speedups.
AI 2027 is a Bet Against Amdahl's Law was my attempt to summarize and analyze "the key load-bearing arguments AI 2027 presents for short timelines". There were a lot of great comments – every time I post on LW is a learning experience. In this post, I'm going to summarize the comments and present some resulting updates to my previous analysis. I'm also using this post to address some comments on the original post that I didn't get around to responding to there, as the comment tree was becoming quite sprawling.
TL;DR: my previous post reflected a few misunderstandings of the AI 2027 model, in particular in how to interpret "superhuman AI researcher". Intuitively, I still have trouble accepting the very high speedup factors contemplated in the model, but this could be a failure of my intuition, and I don't have strong evidence to present. New cruxes:
Confusion Regarding Milestone Definitions
My analysis was skewed by a significant misunderstanding of the AI 2027 model. I argued that the extent to which AI R&D could be accelerated would be limited by activities for which intermediate AIs would not provide significant uplift. However, apparently the SC (superhuman coder), SAR (superhuman AI researcher), and SIAR (superintelligent AI researcher) milestones are meant to be interpreted specifically so as to rule this out. Each of these milestones should be interpreted as an AI whose capabilities are so general as to provide uplift for all activities involved in producing better AI models (except that SC only covers engineering activities). As Ryan Greenblatt put it:
This idea also came up elsewhere in the comment tree, in response to this comment from me:
Ryan responded:
(Of course Ryan is not one of the authors of AI 2027, but comments from Daniel and Eli appear to support this interpretation.)
I had assumed that "superhuman AI researcher" meant superhuman at things that someone with the job title "AI researcher" does, and that some activities involved in creating better models fall outside of that job title. As Ryan pointed out, because the intended definitions of SC, SAR, and SIAR are broader than the ones I'd had in mind:
The shape of the acceleration curve doesn't necessarily change:
Someone Should Flesh Out What Goes Into "AI R&D"
So far as I'm aware, there is no public attempt to fully enumerate the range of job descriptions and tasks that are involved in producing improved AI models. Eli touched on this in a comment:
I'll come back to the question of how "jagged" automation will be. Barring a strong argument that capabilities will be pretty non-jagged, it seems important to say more about what goes into advancing AI capabilities, if for no other reason than to help align intuitions. Conditional on some expectation of jagged capabilities, it's difficult to come to agreement on how far away milestones like SC and SAR might be, or how much acceleration they might yield, without a fairly detailed understanding of what it takes to produce a new model and which aspects of that are meant to be automated by each milestone.
Eli referenced some work by Epoch:
However, I don't get the impression that the Epoch paper is attempting to encompass the complete range of activities involved in AI development. It focuses on experiments – ideating, designing, and running experiments, and analyzing the results. To my (inexpert!) understanding, many other activities are involved, such as creating new evals, creating data sets (roottwo: "Data acquisition and curation: collect, filter, clean datasets; hire humans to produce/QA; generate synthetic data"), and building simulation environments for training. I have to imagine that specific domains (e.g. health care) will require unique approaches for generating data and evaluating outputs, probably with involvement from experts in those domains. And I suspect that I've only scratched the surface, that especially going forward, AI R&D is going to encompass a rich and heterogeneous variety of activities.
You may have felt the breeze from my vigorous handwaving. I would love inside views on this question – aside from the core experimentation loop explored in the Epoch paper, what is involved in the complete end-to-end process of creating new & improved models, especially going forward as the process continues to complexify (e.g. with increasing emphasis on multiple forms of post-training)?
How Long Will It Take to Reach the Early Milestones?
Now that I understand how broadly SC (superhuman coder) and SAR (superhuman AI researcher) are defined, I'd like to revisit the estimates of how long it will take to reach these milestones. Here's a comment from Eli which highlights the implications of the broad definitions:
I read this as an argument that a "superhuman coder" (as interpreted here) would have to be something close to an AGI. Ryan's comment that "an SAR [note, not SC] has to be better than humans at basically everything except vision" points in the same direction, though Ryan did qualify that in a footnote.
As Eli noted, an SC able to carry out tasks that would take a human months or longer would be a significant milestone. It would need to be able to maintain coherent goal-oriented behavior over long periods of time, and make good high-level judgement calls to construct a plan of attack and manage its time. And by definition, it would need to be created without the help of a superhuman coder (though of course there would be intermediate speedups, as noted in the timelines forecast). Eli made a similar comment regarding superhuman AI researchers:
If a superhuman AI researcher is something very close to being superhuman at everything (excluding, perhaps, some domain-specific knowledge and skills), then we'll need to be able to create something that is very close to superhuman at everything without the aid of a superhuman AI researcher. I'm not going to try to say anything specific about how long we should expect this to take, but that timeline is certainly a crux. To end this section, I'll just paste in AI 2027's timelines forecast estimates for the time to a superhuman coder, for reference:
Broad Progress on Real-World Tasks Is a Crux
AI 2027 presents two methods for estimating the timeline to a superhuman coder. Both are based on benchmarks, essentially HCAST[1] and RE-Bench. These contain tasks which are more realistic than some benchmarks, but are still not completely real-world situations. It will be important to observe how quickly the gap between benchmark scores and real-world applicability begins to close. As I put it in my recent post on the METR time horizons paper:
Even within coding tasks, the question of "breadth" may be critical. A 50% or even 80% task success rate leaves room for there to be an important class of problem for which models are progressing slowly. Helen Toner's recent post is a nice exploration of the broader question of how easily capabilities will generalize beyond auto-gradable tasks in math and coding, bearing in mind Eli's point that at larger time scales, even coding tasks (or at least the intermediate steps of large coding tasks) may not be easily gradable.
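To make concrete how these benchmark-based extrapolations behave, and how much a benchmark-to-real-world gap could shift them, here is a toy calculation. Every number in it (current horizon, doubling time, target horizon, gap factor) is an illustrative assumption of mine, not a parameter from the AI 2027 forecast, which uses more sophisticated methods (including consideration of superexponential growth and benchmark-to-real-world gaps):

```python
import math

# Toy time-horizon extrapolation, in the spirit of (but much simpler than)
# AI 2027's benchmark-based methods. All parameter values are illustrative
# assumptions, not numbers from the forecast.

current_horizon_hours = 1.0    # assumed: ~1-hour tasks at the target reliability today
doubling_time_months = 7.0     # assumed: constant, roughly METR-like doubling time
target_horizon_hours = 1000.0  # assumed: months-long tasks for an SC-level system

def months_to_reach(target_hours: float, gap: float) -> float:
    """Months until the *real-world* horizon reaches target_hours, if the
    benchmark horizon doubles every doubling_time_months and real-world
    performance lags benchmark performance by a factor of `gap`."""
    effective_target = target_hours * gap  # benchmark horizon needed to close the gap
    doublings = math.log2(effective_target / current_horizon_hours)
    return doublings * doubling_time_months

for gap in [1.0, 5.0, 25.0]:
    print(f"gap {gap:>4.0f}x -> ~{months_to_reach(target_horizon_hours, gap) / 12:.1f} years")
```

The point is not the specific numbers. It's that under a constant doubling time, each multiplicative factor of benchmark-to-real-world gap adds a fixed number of doublings (and hence a fixed number of months) to the timeline, which is one reason I keep harping on that gap.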
Eli had a relevant comment:
Does Hofstadter's Law Apply?
In my original post, I suggested that Hofstadter's Law should be applied to the AI 2027 model, on the grounds that it should be applied to basically any prediction of this nature. Here are some notable responses:
Daniel:
Ryan:
Eli:
A history of underprediction does seem like valid evidence for discounting or disregarding Hofstadter's Law (or even, conceivably, applying it in reverse!). The repeated instances of AI blowing past predictions and generally exceeding expectations over the last few years do give me pause. However, personally I also put weight on the nagging gap between benchmarks and real-world applications. (To be clear, OpenAI's revenues demonstrate very significant real-world utility. I merely argue that the demonstrated utility falls short of current models' "super-PhD" performance on many benchmarks. Of course this is also a function of diffusion speeds and many other factors.)
My mental model, FWIW, is something like this: until the last few years, progress toward AGI seemed slow, so many/most people had long timelines. In the last few years, progress has seemed much faster, so timelines are shrinking. However, I look at the gap between benchmarks and real-world applicability, and I conclude that:
In other words, even as we've accelerated, we're (I'd argue) seeing that the destination is farther off than we'd understood, meaning that even the current rapid progress may take some time to get us there.
I'm not sure how well I've articulated this, and I certainly haven't provided any real evidence, other than a vague gesture toward the benchmark / real-world gap. I would very much like to find concrete measures to point at here. My hope is that they can be found in the direction of the questions I reference earlier under "The big questions going forward".
What Would Be the Impact of an SAR / SIAR?
To achieve the kinds of progress multipliers envisioned in the AI 2027 model, AIs will not only need to carry out ~all of the functions involved in developing better models, they will also need to make much better use of compute – for instance, by choosing experiments more wisely and by designing more efficient algorithms. Achieving a progress multiplier of 25x, 250x, or 2000x will require these effects to be extremely dramatic.
The models will also need to be extremely good at routing around any constraints. In addition to limitations on compute, there will be limitations on (non-synthetic) training data, the possibility of niche tasks for which models are for some reason less superhuman, and perhaps other issues. Accelerating progress by 2000x does not leave room for even very minor issues (except to the extent that the models are able to remediate or route around those issues).
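To spell out the arithmetic behind that last point, here is a minimal Amdahl's-law style calculation. The fractions below are purely illustrative assumptions of mine, not estimates from AI 2027:

```python
def overall_multiplier(unaccelerated_fraction: float, speedup_on_rest: float) -> float:
    """Overall progress multiplier when a fraction of the work is not accelerated
    at all and the remainder is sped up by speedup_on_rest (Amdahl's law)."""
    return 1.0 / (unaccelerated_fraction + (1.0 - unaccelerated_fraction) / speedup_on_rest)

# Illustrative: even tiny unaccelerated residues cap the achievable multiplier.
for frac in [0.05, 0.01, 0.001, 0.0005]:
    cap = overall_multiplier(frac, float("inf"))  # limit as the rest becomes infinitely fast
    print(f"{frac:.2%} of work unaccelerated -> overall multiplier capped at {cap:,.0f}x")
```

In other words, a 2000x multiplier requires that less than roughly 0.05% of the total work (weighted by time) remains at human speed, unless the AIs can also shrink or route around that residue, which is why the breadth of their capabilities matters so much.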
Ryan:
Eli:
It would be nice to have some specific indicators we could watch to see whether this is playing out. I don't have suggestions off the top of my head.
Conclusions
Intuitively, I still have trouble accepting the very high speedup factors contemplated in the AI 2027 model. This could be a failure of my intuition.
My cruxes seem to be:
In my previous post, I listed four "reasons the AI 2027 forecast may be too aggressive". I'll conclude by briefly revisiting them:
[1] Here I'm referring to the METR time horizons study, which combines several benchmarks but primarily relies on HCAST for evaluating current models.