The largest of the present-day models (GPT-4.5, Opus 4) could, in some strange sense, be said to cost about $500M for the final pretraining run. Even with the same strange way of counting costs, much more goes into the research experiments needed to make that final pretraining run a success, and into the subsequent post-training that elicits the model's capabilities in a useful form. (Long reasoning training, i.e. RLVR, is still mostly in elicitation mode, but it threatens to start creating new capabilities at a more substantial training cost within 1-2 years.)
This is not the real cost, in the sense that there is no market where you can pay that amount of money and get the ability to do such a training run; instead you need to build a giant training system yourself. The servers and networking cost about 10x more than 3-4 months of their time priced at the minimum needed to break even, because price-performance of compute advances quickly and the hardware itself doesn't last more than a few years when always in use. (Cloud providers would charge substantially more than that, and won't give you nearly enough compute for a frontier training run.)
Since frontier training systems are currently very different from older ones (larger, and with much higher power and cooling requirements per rack), it's also necessary to pay for the buildings, power, and cooling infrastructure at the same time as you are buying the very expensive compute hardware. This makes the cost about 50% greater than the compute hardware alone. So in total you are paying about 15x more to build a frontier training system than the pretend on-paper "cost of a training run". The $500M training runs of today are done on what are probably $7bn training systems ($4-5bn in compute hardware, $1-3bn in buildings/power/cooling). The company needs to actually raise the $7bn, not the $500M.
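A minimal back-of-the-envelope sketch of this accounting (the $500M figure and the 10x and 1.5x multipliers are the rough estimates above, not precise numbers):

```python
# Back-of-the-envelope version of the accounting above; all inputs are rough estimates.
run_time_cost = 0.5e9    # on-paper "cost of a training run": 3-4 months of system time at break-even pricing
hardware_multiple = 10   # servers + networking cost ~10x that break-even rental figure
infra_multiple = 1.5     # buildings, power, and cooling add ~50% on top of the compute hardware

compute_hardware = run_time_cost * hardware_multiple   # ~$5bn
full_system = compute_hardware * infra_multiple        # ~$7.5bn, i.e. ~15x the on-paper run cost

print(f"compute hardware: ~${compute_hardware / 1e9:.1f}bn")
print(f"full training system: ~${full_system / 1e9:.1f}bn "
      f"({full_system / run_time_cost:.0f}x the on-paper run cost)")
```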
The largest training system currently being built that's somewhat documented is Stargate Abilene, to be completed around summer 2026. It might cost around $40-45bn to build ($15bn through Crusoe on buildings/power/cooling, maybe around $27bn on compute racks and networking through Oracle), and will host 400K chips in GB200 NVL72 racks, which is 10x more FLOP/s for pretraining than probably went into GPT-4.5 or Opus 4, and 150x more than went into GPT-4.
Now the pretend "cost of time" on that ~$40-45bn system for a 3-4 month final pretraining run of a giant model that might come out in 2027 could be said to be "about $3bn", but that's a somewhat meaningless figure: the company still needed to finance the ~$40-45bn buildout to get there, and it will spend more than $3bn on the experiments needed to make that training run work.
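The same multipliers run against the Stargate Abilene estimates above, as a rough consistency check (the dollar figures and FLOP/s ratios are the estimates quoted above; the ~15x GPT-4 to GPT-4.5 scale-up is only implied, by dividing the two stated ratios):

```python
# Same accounting applied to the rough Stargate Abilene estimates above (all approximate).
compute_hardware = 27e9   # compute racks + networking, via Oracle (estimate)
infrastructure = 15e9     # buildings, power, cooling, via Crusoe (estimate)

total_buildout = compute_hardware + infrastructure   # ~$42bn
run_time_cost = compute_hardware / 10                # ~$2.7bn: 3-4 months at break-even pricing

print(f"total buildout: ~${total_buildout / 1e9:.0f}bn")
print(f'on-paper "cost of a training run": ~${run_time_cost / 1e9:.1f}bn')

# Implied compute scale-up from the stated FLOP/s ratios.
vs_gpt45, vs_gpt4 = 10, 150
print(f"implied GPT-4 -> GPT-4.5 pretraining compute scale-up: ~{vs_gpt4 / vs_gpt45:.0f}x")
```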
This year, Amazon is spending $100bn on things like building out its datacenters around the world, and that's a $2-3 trillion market cap company. Even if the giant datacenters are each a 2-year project, we are already close to what a non-AGI AI company might be able to finance, and shortly after that we'd be running into the constraints of industrial capacity. So without AGI, the scaling of giant frontier AI training systems should stop around 2027-2029, at which point it regresses to the pace of Moore's law (of price-performance), which is about 3x slower than the current funding-fueled ramp-up.
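One way to see where an "about 3x slower" figure could come from (the growth rates here are assumed for illustration, not figures from the comment above): if funding-fueled buildouts multiply frontier training compute by roughly 4-5x per year while price-performance improves by roughly 1.6-1.7x per year, the two exponential rates differ by about a factor of 3.

```python
import math

# Illustrative assumed growth rates, not figures from the comment above.
funding_ramp_per_year = 4.5        # assumed growth of frontier training compute while buildouts are financed
price_performance_per_year = 1.65  # assumed Moore's-law-like improvement in FLOP/s per dollar

# Ratio of the exponential rates: how much faster compute grows with funding than from price-performance alone.
ratio = math.log(funding_ramp_per_year) / math.log(price_performance_per_year)
print(f"funding-fueled scaling is ~{ratio:.1f}x faster than price-performance alone")
```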
So if I understand correctly, you're saying it would not be feasible to scale up training compute by 100x in a matter of months, because you'd need to build out the infrastructure first?
I am not concerned about this scenario. It does not matter if this is feasible or not (it might be theoretically feasible, but other things will almost certainly happen first).
The labs are laser-focused on algorithmic improvements, and the rate of algorithmic improvements is very fast (algorithmic improvements contribute more than hardware improvements at the moment).
The AIs are being optimized to do productive software engineering and to productively assist in AI research, and soon to perform productive AI research almost autonomously.
So the scenario I tend to ponder is a software-only intelligence explosion based on non-saturating recursive self-improvement within a fixed hardware configuration (in some sense the dual of the scenario described in this post; although, of course, the labs are all trying to scale hardware as well, because they are in a race and every bit of advantage matters if one wants to reach ASI level before the other labs do; that race situation is also quite unfortunate from the existential safety angle).
Answering my own question:
To me, the idea of "fully human-level capable AI" is a double myth. It works only insofar as we do not try to ascribe concrete capabilities to the model. Anything human-level that can be parallelized is by definition superhuman; that's why it's a myth to me in the first place. Additionally, human-level capabilities just make very little sense to me in a model. Is it a brain simulation that feels boredom and longs for human rights? Or is it "just" a very general problem-solving tool, something akin to an actually accurate vision-language model? That is a categorical difference.
Accurate, general problem-solving tools are far more likely and, in the wrong hands, can probably cause far more harm than a "virtual human" ever could. On the other hand, the simulated brain raises many more ethical concerns, I would say.
To actually answer the question, I'm not concerned about a fast takeoff. There are multiple reasons for this:
Yes, until we set rigorous terms and prove otherwise, there is certainly a possibility. But compared to "mundane" worries like climate change and socioeconomic inequality, this potential existential threat does not even register.
I am imagining a scenario like:
Thus there is no period of recursive self-improvement; you just go from human-level to dead in a single step.
This scenario depends on some assumptions that seem kinda unlikely to me, but not crazy unlikely. I want to hear other people's thoughts.