Five Hinge‑Questions That Decide Whether AGI Is Five Years Away or Twenty

[-]faul_sname6mo80

A bet here is to revisit revenue per deployed H100 by the end of 2027. If the number exceeds one‑hundred‑thousand dollars, the efficiency threshold the short camp predicts has arrived. If it remains close to today’s ten‑thousand, the sceptics will have been vindicated.

I note that CoreWeave has 250,000 GPUs, mostly Hopper series. If net revenue per H100-GPU-year 10xs, they will be in a very good position to benefit from that. My extremely not rigorous back-of-the-envelope estimate is that the CoreWeave valuation makes sense if H-series GPU prices are expected to drop by 10%-30% per year for the foreseeable future. If you instead expect the price of H100s to rise by a factor of 10, that implies that CoreWeave, a public company, is currently trading at far below the correct price.

Not incredibly strong evidence but it does seem like we have a kind-of-prediction-market-ish thing here on net revenue per H100, and the prediction it's making is quite clear.

[-]faul_sname5mo30

Well that aged poorly. Coreweave's valuation 1.5x'd within 2 weeks of me posting this comment. Mild evidence against EMH here I guess.

(Disclaimer: I do have a small long position in CRWV, entered shortly after posting the original comment on the grounds of "well I'm not positive this is priced in yet")

[-]charlieoneill5mo10

Thanks for being transparent; wouldn't put too much stock in a short-term rise in Coreweave (as I understand, more to do with tariffs and other confounding factors than "they were underpriced before")

[-]SorenJ6mo*7-1

Reasoning models are better than basically all human mathematicians

What do you mean by this? Although I concede that 95%+ of all humans are not very good at math, for those I would call human mathematicians I would say that reasoning models are better than basically 0% of them. (And I am aware of the Frontier Math benchmarks.)

[-]Mateusz Bagiński6mo41

I think the author meant that they achieve higher scores on the FrontierMath benchmark.

[-]O O6mo21

Do they? I thought they do well on the easier section

[-]RogerDearnaley5mo20

Terrence Tao (who should be position to know) was involved in evaluating the o1 model (which is by now somewhat dated). In the context of acting as a research assistant, he described it as equivalent to a "mediocre but not entirely incompetent" graduate student. That's not "better than basically all human mathematicians" — but it's also not so very far off, if it's about as good as the grade of gradate students that Terrence Tao has access to as research assistants.

[-]Mo Putera5mo53

To add a bit of nuance/context, here's what Tao said:

In https://chatgpt.com/share/94152e76-7511-4943-9d99-1118267f4b2b I gave the new model a challenging complex analysis problem (which I had previously asked GPT4 to assist in writing up a proof of in https://chatgpt.com/share/63c5774a-d58a-47c2-9149-362b05e268b4 ). Here the results were better than previous models, but still slightly disappointing: the new model could work its way to a correct (and well-written) solution *if* provided a lot of hints and prodding, but did not generate the key conceptual ideas on its own, and did make some non-trivial mistakes.
The experience seemed roughly on par with trying to advise a mediocre, but not completely incompetent, (static simulation of a) graduate student. However, this was an improvement over previous models, whose capability was closer to an actually incompetent (static simulation of a) graduate student.
It may only take one or two further iterations of improved capability (and integration with other tools, such as computer algebra packages and proof assistants) until the level of "(static simulation of a) competent graduate student" is reached, at which point I could see this tool being of significant use in research level tasks. (2/3)

More on the "static simulation" part:

I am belatedly realizing that in my attempts to describe my evaluation of the capability of an AI tool, I inadvertently gave the incorrect (and potentially harmful) impression that human graduate students could be reductively classified according to a static, one dimensional level of “competence”. This was not my intent at all; and I would therefore like to make the following clarifying remarks.
Firstly, the ability to contribute to an existing research project is only one aspect of graduate study, and a relatively minor one at that. A student who is not especially effective in this regard, but excels in other dimensions such as creativity, independence, curiosity, exposition, intuition, professionalism, work ethic, organization, or social skills can in fact end up being a far more successful and impactful mathematician than one who is proficient at assigned technical tasks but has weaknesses in other areas.
Secondly, and perhaps more importantly, human students learn and grow during their studies, and areas in which they initially struggle with can become ones in which they are quite proficient at after a few years; and personally I find being able to assist students in such transitions to be one of the most rewarding aspects of my profession. In contrast, while modern AI tools have some ability to incorporate feedback into their responses, each individual model does not truly have the capability for long term growth, and so can be sensibly evaluated using static metrics of performance. However, I believe such a fixed mindset is not an appropriate framework for judging human students, and I apologize for conveying such an impression.

These additional remarks by Tao on long-term growth and non-problem-solving skills relevant to mathematical excellence are what I think of when I consider the hypothetical that math AIs are maxing out FrontierMath Tier 4 and yet still nowhere near revolutionising pure math, which I think is increasingly plausible, cf. all the posts sharing this one's vibe. (Writing this publicly to revisit in case I'm wrong, which would be great; unlike say Gowers, I do want agentic artificial super-mathematicians of all kinds.)

[-]Cole Wyeth6mo71

This post seems excellent

Surprisingly few technical cruxes though, eg no METR task lengths?

[-]RogerDearnaley6mo30

I think that's an element in Hinge #3. While AI task lengths remain short (minutes to hours), AI is basically just a tool, though one that may still boost productivity. Once they reach days, human workers need to turn into managers-of-AI, so AI become a productivity multiplier but not a replacement. Once AI task lengths reach weeks or months, it become plausible that AI can manage AI, and we're starting to look at full replacement.

[-]charlieoneill5mo10

Yes - the general argument is "task length isn't sufficiently correlated with actual use for remote work, so you also need to look at other things" (see the EpochAI post on this)

[-]jmh6mo50

Current AI, even after ChatGPT, earns roughly ten‑thousand dollars per H100‑GPU‑year.

What does that $10K number actually represent? An average across all AI? The marginal GPU earning across all AI? An estimate of either from one of the big AIs?

[-]ErioirE6mo30

Has someone made Manifold markets for these predictions? (As of writing this comment I have not found any and I would rather not do it myself since I don't typically keep tabs on those respective metrics.)

[-]rayman20005mo70

Here you go

#1: https://manifold.markets/rayman2000/nvidias-datacenter-revenue-and-bigt

#2: https://manifold.markets/rayman2000/ai-model-training-time-decreases-fo

#3: https://manifold.markets/rayman2000/revenue-per-deployed-h100-exceeds-1

#4: https://manifold.markets/rayman2000/g7-country-manages-three-years-of-6

#5: https://manifold.markets/rayman2000/a-topthree-ai-lab-delays-a-frontier

[-]Josh You5mo20

I don't understand Hinge #2. Wall clock time for equally large training runs 18 months apart could easily shrink by 3/4 for banal reasons like larger training clusters. why would this be evidence of software-only acceleration?

[-]David James6mo10

Agreement in the forecasting/timelines community ends at the tempo question.

What is the "tempo question"? I don't see the word tempo anywhere else in the article.

[-]Stephen Richards6mo10

On the software improvements, for me a threshold on usefulness is the removal of obvious flaws to a low enough level to gain human trust. Ie when we think the failing common sense answers are less than 10% can we reduce that to <1% and then <0.1% with software improvements alone (or perhaps software +the next hardware upgrade that Nvidia is currently producing) . My analogy for the trust question is self driving cars where YouTube video makers worry that without near crashes the have nothing to make a video that people will watch. I think this was v13 software and v3or 4 hardware for tesla. Anecdotally I see people retrying the latest models after a year or so since ChatGPT3.5 with more success, so take up for anything useful will be fast as the error rate decreases pass the trust threshold- but do people here think software improvements are enough to pass the trust threshold?

^{^}

Reasoning models are better than basically all human mathematicians, but still haven't produced one novel mathematical result, suggesting a disconnect between benchmarks and actual real-world use. For instance, if a human gets 25% on FrontierMath (ie Terence Tao), we assume they'll produce great maths research because we know those things to be very correlated. However, the correlation doesn't necessarily hold for LLMs: we could have just hillclimbed on the FrontierMath benchmark and overfitted to that.

LESSWRONG
LW

LESSWRONG
LW

126

Five Hinge‑Questions That Decide Whether AGI Is Five Years Away or Twenty

126

126

TL;DR

Shared premises

Hinge #1: Which curve do we extrapolate?

Hinge #2: Can software‑only recursive self‑improvement outrun atoms?

Hinge #3: How efficient (and how sudden) is the leap from compute to economic value?

Hinge #4: Must we automate everything, or is half enough?

Hinge #5: Alignment‑driven and institutional drag

Dependency Structure

126

Five Hinge‑Questions That Decide Whether AGI Is Five Years Away or Twenty

126

126

TL;DR

Shared premises

Hinge #1: Which curve do we extrapolate?

Hinge #2: Can software‑only recursive self‑improvement outrun atoms?

Hinge #3: How efficient (and how sudden) is the leap from compute to economic value?

Hinge #4: Must we automate everything, or is half enough?

Hinge #5: Alignment‑driven and institutional drag

Dependency Structure

Hinge #2: Can software‑only recursive self‑improvement outrun atoms?

Hinge #3: How efficient (and how sudden) is the leap from compute to economic value?

Hinge #4: Must we automate everything, or is half enough?

Hinge #5: Alignment‑driven and institutional drag