A bet here is to revisit revenue per deployed H100 by the end of 2027. If the number exceeds one‑hundred‑thousand dollars, the efficiency threshold the short camp predicts has arrived. If it remains close to today’s ten‑thousand, the sceptics will have been vindicated.
I note that CoreWeave has 250,000 GPUs, mostly Hopper series. If net revenue per H100-GPU-year 10xs, they will be in a very good position to benefit from that. My extremely not rigorous back-of-the-envelope estimate is that the CoreWeave valuation makes sense if H-series GPU prices are expected to drop by 10%-30% per year for the foreseeable future. If you instead expect the price of H100s to rise by a factor of 10, that implies that CoreWeave, a public company, is currently trading at far below the correct price.
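To make that back-of-the-envelope reproducible, here is a minimal sketch: the fleet size and the ~$10K revenue-per-H100-year figure come from this thread, while the useful life and discount rate are assumptions I made up.

```python
# Rough NPV-per-GPU sketch for the back-of-the-envelope above.
# All inputs are illustrative assumptions, not CoreWeave financials.

def npv_per_gpu(first_year_revenue, annual_price_change,
                useful_life_years=5, discount_rate=0.10):
    """Discounted revenue from one GPU whose rental price changes by
    `annual_price_change` each year (e.g. -0.20 for a 20% yearly drop)."""
    return sum(
        first_year_revenue * (1 + annual_price_change) ** t / (1 + discount_rate) ** t
        for t in range(useful_life_years)
    )

FLEET = 250_000            # GPUs, from the comment above
REVENUE_TODAY = 10_000     # $ per H100-GPU-year, the figure used in the post

for change in (-0.30, -0.10, 0.0):
    fleet_value = FLEET * npv_per_gpu(REVENUE_TODAY, change)
    print(f"price change {change:+.0%}/yr -> fleet NPV ~ ${fleet_value / 1e9:.1f}B")

# A 10x jump in revenue per GPU-year dwarfs any of these scenarios:
tenx = FLEET * npv_per_gpu(10 * REVENUE_TODAY, -0.10)
print(f"10x revenue scenario -> fleet NPV ~ ${tenx / 1e9:.1f}B")
```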
Not incredibly strong evidence but it does seem like we have a kind-of-prediction-market-ish thing here on net revenue per H100, and the prediction it's making is quite clear.
Well, that aged poorly. CoreWeave's valuation 1.5x'd within two weeks of me posting this comment. Mild evidence against the EMH here, I guess.
(Disclaimer: I do have a small long position in CRWV, entered shortly after posting the original comment on the grounds of "well I'm not positive this is priced in yet")
I think that's an element in Hinge #3. While AI task lengths remain short (minutes to hours), AI is basically just a tool, though one that may still boost productivity. Once they reach days, human workers need to turn into managers-of-AI, so AI becomes a productivity multiplier but not a replacement. Once AI task lengths reach weeks or months, it becomes plausible that AI can manage AI, and we're starting to look at full replacement.
Current AI, even after ChatGPT, earns roughly ten‑thousand dollars per H100‑GPU‑year.
What does that $10K number actually represent? An average across all AI? The marginal GPU earning across all AI? An estimate of either from one of the big AIs?
Reasoning models are better than basically all human mathematicians
What do you mean by this? Although I concede that 95%+ of all humans are not very good at math, for those I would call human mathematicians I would say that reasoning models are better than basically 0% of them. (And I am aware of the Frontier Math benchmarks.)
Has someone made Manifold markets for these predictions? (As of writing this comment I have not found any and I would rather not do it myself since I don't typically keep tabs on those respective metrics.)
Agreement in the forecasting/timelines community ends at the tempo question.
What is the "tempo question"? I don't see the word tempo anywhere else in the article.
On the software improvements, for me a threshold of usefulness is the removal of obvious flaws to a low enough level to gain human trust. I.e., when we think the rate of failed common-sense answers is below 10%, can we reduce that to <1% and then <0.1% with software improvements alone (or perhaps software plus the next hardware upgrade Nvidia is currently producing)? My analogy for the trust question is self-driving cars, where YouTube video makers worry that without near-crashes they have nothing to make a video that people will watch. I think this was v13 software and v3 or v4 hardware for Tesla. Anecdotally, I see people retrying the latest models after a year or so away since ChatGPT 3.5, with more success, so uptake of anything useful will be fast as the error rate drops past the trust threshold. But do people here think software improvements are enough to pass the trust threshold?
For people who care about falsifiable stakes rather than vibes
TL;DR
All timeline arguments ultimately turn on five quantitative pivots. Pick optimistic answers to three of them and your median forecast collapses into the 2026–2029 range; pick pessimistic answers to any two and you drift past 2040. The pivots (I think) are:

1. Which curve we extrapolate (compute and benchmarks, or actual economic automation).
2. Whether software‑only recursive self‑improvement can outrun atoms.
3. How efficient (and how sudden) the leap from compute to economic value is.
4. Whether we must automate everything, or half is enough.
5. How much alignment‑driven and institutional drag slows everything down.
The rest of this post traces how the canonical short‑timeline narrative (AI 2027) and the long‑timeline essays by Ege Erdil and by Zhengdong Wang + Arjun Ramani diverge on each hinge, and proposes concrete bets that will force regular public updates.
Shared premises
Agreement in the forecasting/timelines community ends at the tempo question.
Hinge #1: Which curve do we extrapolate?
The first divide concerns what exactly we should project into the future. Short‑timeline advocates emphasise frontier‑training compute and algorithmic efficiency, or even just the general amalgamation of all benchmarks as "intelligence extrapolation". They point to six straight doublings in effective training FLOP between GPT‑2 and GPT‑4, and they cite scaling‑law papers showing a 1.6x yearly reduction in the compute required to reach any fixed loss. This is the engine behind the claim in AI 2027 that "CapEx grows one hundred‑fold in four years." Long‑timeline authors reply that the best public proxy for capital actually deployed (NVIDIA datacentre revenue, which I think is a flawed metric for other reasons) stopped growing exponentially after the ChatGPT launch. Either you extrapolate benchmark/efficiency performance and hope that eventually gets you to automation, or you only believe it once the work itself is actually automated, which is what Erdil is trying to proxy with NVIDIA's revenue.
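To make the "which curve" disagreement concrete, here is a minimal sketch of how the short-camp extrapolation compounds: physical compute growth times the ~1.6x/year algorithmic-efficiency gain cited above. The growth rates are adjustable assumptions, not forecasts.

```python
# Toy "effective compute" extrapolation: physical FLOP growth times
# algorithmic efficiency gains. Growth rates are illustrative knobs.

def effective_compute_multiplier(years, hardware_growth=2.0, algo_efficiency=1.6):
    """How much more 'effective' training compute is available after `years`,
    if physical compute grows `hardware_growth`x per year and algorithms give
    an `algo_efficiency`x yearly reduction in compute needed for a fixed loss."""
    return (hardware_growth * algo_efficiency) ** years

for years in (2, 4, 6):
    print(f"{years} yrs: ~{effective_compute_multiplier(years):,.0f}x effective compute")

# The long-timelines reply is that the hardware term stops being ~2x/yr
# once capital spending (proxied by NVIDIA datacentre revenue) flattens:
print(f"4 yrs, flat hardware: ~{effective_compute_multiplier(4, hardware_growth=1.0):,.0f}x")
```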
Overall, I think it's a good argument for the long-timelines side that we've still automated probably less than 0.5% of US GDP with AI. Reasoning models outperform most human mathematicians on FrontierMath, yet we have not seen a single peer‑reviewed theorem produced end‑to‑end by an LLM.[1] If benchmarks elsewhere are saturating too, then there's likely a large disconnect between how economically useful an AI is and progress on evaluation suites and benchmarks (there's also probably unintentional cheating plus overfitting on hillclimbed metrics; for more see here).
A bet follows naturally. If by the fourth quarter of 2026 NVIDIA’s datacentre revenue and the combined AI capital expenditures of Google, Meta, OpenAI, etc. have doubled year‑over‑year, the hardware curve is still exponential and the short camp scores a point. If they have not, the long camp’s “curve‑bend” story gains credibility.
Hinge #2: Can software‑only recursive self‑improvement outrun atoms?
The second hinge asks whether capability can compound inside existing datacentres faster than the physical economy can supply new wafers, power, and cooling. Short‑timeline writers emphasise the rise of agentic research assistants, noting that Devin, Claude Code, and the like can already complete medium‑sized GitHub projects. Once a model designs better chips, compilers, or curricula, capability doublings might come from pure code for a year or two - the classic “foom” or “phase transition.”
Erdil’s rejoinder is empirical. Every major algorithmic breakthrough so far (attention, RLHF, mixture routing) consumed between a thousand and a hundred‑thousand GPU‑days of brute‑force experimentation. Experimentation, in turn, is limited by hardware delivery times and energy budgets. Data, too, is finite: the Villalobos et al. paper suggests we exhaust high‑quality language data this decade.
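One way to feel the force of this rejoinder is to take the 10^3–10^5 GPU-day figure above and ask how many breakthrough-scale experiment campaigns a fixed research fleet can run per year. The fleet size below is an assumption purely for illustration.

```python
# How many brute-force experiment campaigns fit on a fixed research fleet?
# The 1e3-1e5 GPU-day range comes from the paragraph above; the number of
# GPUs dedicated to research is an illustrative assumption.

RESEARCH_FLEET_GPUS = 20_000            # assumed GPUs reserved for experiments
GPU_DAYS_PER_YEAR = RESEARCH_FLEET_GPUS * 365

for cost_gpu_days in (1e3, 1e4, 1e5):
    campaigns = GPU_DAYS_PER_YEAR / cost_gpu_days
    print(f"campaign cost {cost_gpu_days:>8,.0f} GPU-days -> "
          f"{campaigns:,.0f} campaigns/year")

# Even with perfectly automated researchers, the number of expensive
# experiments per year is capped by hardware, which is Erdil's point.
```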
I'm more sympathetic to long timelines here once again. Operator, Devin, and Anthropic’s workstation agent still regularly fail a task as mundane as booking a round‑trip flight with seat selection. Devin has all these stories of it succeeding on real-world tasks only about 5-20% of the time. If we cannot yet automate a travel clerk, claims about near‑total remote‑work automation feel premature. There's also a strong bottleneck argument around compute and real-world data, which people have made countless times and which I'm not sure where I stand on.
A testable prediction is to watch wall‑clock time for frontier training runs. If the fastest models trained in mid‑2027 require one‑quarter the elapsed time of equally large runs eighteen months earlier without a node shrink below four nanometres, that will be strong evidence that software‑only acceleration is real, and is not highly bottlenecked by compute or real-world experiments slowing the AI researchers down.
Hinge #3: How efficient (and how sudden) is the leap from compute to economic value?
Current AI, even after ChatGPT, earns roughly ten‑thousand dollars per H100‑GPU‑year, a figure Erdil emphasises precisely because it roughly equals world GDP per capita. Short‑timeline thinkers retort that this measure hides a looming discontinuity. Once reliability crosses the human median, the “train once, deploy many” property kicks in: the marginal cost of the N‑th copy collapses to server depreciation and a few cents of electricity. They also foresee specialised inference hardware such as NVIDIA’s Blackwell or Groq or whatever, plus sparsity and mixture routing, pushing cost per useful token down by two orders of magnitude. Moreover, value is highly uneven: AlphaFold’s acceleration of drug discovery might be worth many millions of average coders.
Long‑timeline writers respond that newer, more agentic models already need one to three orders of magnitude more reasoning tokens per answer, so the inference cost trend is rising, not falling.
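A rough way to hold both claims in one frame: marginal serving cost per answer is approximately (GPU depreciation + power) divided by token throughput, times tokens per answer. Every parameter value in this sketch is an assumption for illustration, not a measurement.

```python
# Marginal serving cost per answer: (GPU depreciation + power) / throughput,
# times tokens per answer. All parameter values are illustrative assumptions.

GPU_COST = 30_000          # $ per H100, assumed
LIFETIME_YEARS = 4         # assumed depreciation horizon
POWER_KW = 0.7             # approximate H100 board power
ELECTRICITY = 0.10         # $/kWh, assumed
TOKENS_PER_SEC = 1_000     # assumed serving throughput per GPU

hourly_cost = GPU_COST / (LIFETIME_YEARS * 365 * 24) + POWER_KW * ELECTRICITY
cost_per_million_tokens = hourly_cost / (TOKENS_PER_SEC * 3600) * 1e6

for tokens_per_answer in (1_000, 100_000):   # plain answer vs. heavy reasoning trace
    cost = cost_per_million_tokens * tokens_per_answer / 1e6
    print(f"{tokens_per_answer:>7,} tokens/answer -> ~${cost:.4f} per answer")

# The short camp bets on the per-token cost falling ~100x; the long camp
# notes tokens-per-answer rising 10-1000x, which can cancel that out.
```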
A bet here is to revisit revenue per deployed H100 by the end of 2027. If the number exceeds one‑hundred‑thousand dollars, the efficiency threshold the short camp predicts has arrived. If it remains close to today’s ten‑thousand, the sceptics will have been vindicated.
Hinge #4: Must we automate everything, or is half enough?
Here the argument turns to macroeconomic theory. Short‑timeline narratives imagine a reaction wheel: once half of remote cognitive labour is cheap, wages plummet, capital reallocates, and the remaining physical bottlenecks fall rapidly because swarms of AI engineers will design better robots, batteries, and even fusion plants. They also note that GDP composition can migrate; maybe more and more value lives in purely digital goods, so traditional Baumol constraints vanish.
Wang and Ramani retort that William Baumol’s law is brutal. Aggregate productivity is throttled by the slowest essential sector, whether that's housing, logistics, healthcare, or energy. For instance, electricity and the internet each boosted sectoral productivity by a factor of a thousand but raised frontier GDP-per-capita growth by less than one percentage point. Unless AI swiftly unlocks safe autonomous robotics, cheap power, and fast construction, growth will be stuck.
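A deliberately stark toy version of the Baumol point (perfect complements between an automatable sector and a slow one; the shares and growth rates are made up) shows how a huge productivity gain in one half can barely move the aggregate.

```python
# Toy Baumol-style bottleneck: aggregate output when one sector's productivity
# explodes but the other barely moves. Shares and growth rates are illustrative.

def aggregate_output(fast_productivity, slow_productivity, slow_share=0.5):
    """Perfect-complements aggregation: you need both sectors in fixed
    proportion, so output is capped by whichever sector binds first."""
    return min(fast_productivity / (1 - slow_share), slow_productivity / slow_share)

base = aggregate_output(1.0, 1.0)
for fast_gain in (1, 10, 1000):
    boosted = aggregate_output(fast_gain * 1.0, 1.02)   # slow sector grows 2%
    print(f"fast sector x{fast_gain:>4}: aggregate output x{boosted / base:.2f}")

# A 1000x gain in the automatable half yields ~2% aggregate growth,
# mirroring the electricity/internet observation above.
```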
A decisive empirical milestone is whether any G7 country manages three straight years of six‑percent real GDP‑per‑capita growth before general‑purpose manipulation robots are common. If that occurs, the reaction‑wheel model wins.
Hinge #5: Alignment‑driven and institutional drag
Finally, even if capability and economics line up, society might want us to slow down anyway. Regulation is already thickening. Physical bottlenecks appear in grid interconnect queues, water‑cooling permits, and limited HBM memory supply. Alignment worries form a special drag: Anthropic’s Responsible Scaling Policy and OpenAI’s Preparedness Framework both envision voluntary pauses.
Short‑timeline optimists counter with the Manhattan‑mode story: once states perceive an existential threat or an opportunity for decisive advantage, they bulldoze barriers. The AI 2027 scenario predicts permissive special‑economic zones where training can continue at maximum speed, while regulators lag behind.
We can measure drag directly. One bet is that, by January 2028, at least one top‑three lab publicly delays a frontier model six or more months for safety reasons. Another is that U.S. datacentre megawatt backlogs fall below six months by 2027 and export‑control stringency remains at 2024 levels.
Dependency Structure
These hinges are not independent. If the hardware and algorithmic curves keep doubling, software‑only RSI is more plausible. If RSI works, agent efficiency is likelier to jump and GDP could surge even before robots. On the other hand, strong alignment or regulatory brakes can nullify the whole chain. In practice, pessimistic answers to any two hinges probably push full‑replacement timelines into the 2040s; optimistic answers to at least three make the late 2020s credible.
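For what it's worth, that closing rule of thumb can be written out literally; the bucket boundaries below are just the ones stated above, with my own tie-breaking order for the case where both conditions hold.

```python
# The rule of thumb from the paragraph above, written out literally.
# Bucket boundaries come from the text; the tie-breaking order is my choice.

def rough_timeline_bucket(optimistic_hinges: int, pessimistic_hinges: int) -> str:
    """Map counts of optimistic/pessimistic hinge answers (out of five)
    to the rough median-timeline buckets used in this post."""
    if pessimistic_hinges >= 2:
        return "full replacement drifts past ~2040"
    if optimistic_hinges >= 3:
        return "a late-2020s median becomes credible"
    return "somewhere in between (2030s)"

print(rough_timeline_bucket(optimistic_hinges=4, pessimistic_hinges=1))
print(rough_timeline_bucket(optimistic_hinges=1, pessimistic_hinges=3))
```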
Timeline debates are tractable, at least to me, because they hinge on the five measurable questions above. We do not need oracular insight, only discipline: decide which side of each hinge you occupy, publish your odds, and update when the world issues new datapoints. The difference between an AGI in 2027 and one in 2047 may feel philosophical, but it will ultimately be written in SEC filings, power‑grid spreadsheets, and lab press releases. Let us read those, not the vibes.
Reasoning models are better than basically all human mathematicians, but still haven't produced one novel mathematical result, suggesting a disconnect between benchmarks and actual real-world use. For instance, if a human gets 25% on FrontierMath (i.e. Terence Tao), we assume they'll produce great maths research, because we know those things to be very correlated. However, the correlation doesn't necessarily hold for LLMs: we could have just hillclimbed on the FrontierMath benchmark and overfitted to it.