How to Update If Pre-Training is Dead

[-]Vladimir_Nesov3mo111

Frontier AI training compute is currently increasing about 12x every two years, from about 7e18 FLOP/s in 2022 (24K A100s, 0.3e15 BF16 FLOP/s per chip), to about 1e20 FLOP/s in 2024 (100K H100s, 1e15 BF16 FLOP/s per chip), to 1e21 FLOP/s in 2026 (Crusoe/Oracle/OpenAI Abilene system, 400K chips in GB200/GB300 NVL72 racks, 2.5e15 BF16 FLOP/s per chip). If this trend takes another step, we'll have 1.2e22 FLOP/s in 2028 (though it'll plausibly take a bit longer to get there, maybe 2.5e22 FLOP/s in 2030 instead), with 5 GW training systems.

So the change between GPT-4 and GPT-4.5 is a third of this path. And GPT-4.5 is very impressive compared to the actual original GPT-4 from Mar 2023, it's only by comparing it to more recent models that GPT-4.5 isn't very useful (in its non-reasoning form, and plausibly without much polish). Some of these more recent models were plausibly trained on 2023 compute (maybe 30K H100s, 3e19 FLOP/s, 4x more than the original GPT-4), or were more lightweight models (not compute optimal, and with fewer total params) trained on 2024 compute (about the same as GPT-4.5).

So what we can actually observe from GPT-4.5 is that increasing compute by 3x is not very impressive, but the whole road from 2022 to 2028-2030 is a 1700x-3500x increase in compute from original GPT-4 (or twice that if we are moving from BF16 to FP8), or 120x-250x from GPT-4.5 (if GPT-4.5 is already trained in FP8, which was hinted at in the release video). Judging the effect of 120x from the effect of 3x is not very convincing. And we haven't really seen what GPT-4.5 can do yet, because it's not a reasoning model.

The best large model inference hardware available until very recently (other than TPUs) is B200 NVL8, with 1.5 TB of HBM, which makes it practical to run long reasoning on models with 1-3T FP8 total params that fit in 1-4 nodes (with room for KV caches). But the new GB200 NVL72s that are only starting to get online in significant numbers very recently each have 13.7 TB of HBM, which means you can fit a 7T FP8 total param model in just one rack (scale-up world), and in principle 10-30T FP8 param models in 1-4 racks, an enormous change. The Rubin Ultra NVL576 racks of 2028 will each have 147 TB of HBM, another 10x jump.

If GPT-4.5 was pretrained for 3 months at 40% compute utilization on a 1e20 FLOP/s system of 2024 (100K H100s), it had about 3e26 BF16 FLOPs of pretraining, or alternatively 6e26 FP8 FLOPs. For a model with 1:8 sparsity (active:total params), it's compute optimal to maybe use 120 tokens/param (40 tokens/param from Llama-3-405B, 3x that from 1:8 sparsity). So a 5e26 FLOPs of pretraining will make about 830B active params compute optimal, which means 7T total params. The overhead for running this on B200s is significant, but in FP8 the model fits in a single GB200 NVL72 rack. Possibly the number of total params is even greater, but fitting in one rack for the first model of the GB200 NVL72 era makes sense.

So with GB200 NVL72s, it becomes practical to run (or train with RLVR) a compute optimal 1:8 sparse MoE model pretrained on 2024 compute (100K H100s) with long reasoning traces (in thinking mode). Possibly this is what they are calling "GPT-5".

Going in the opposite direction in raw compute, but with more recent algorithmic improvements, there's DeepSeek-R1-0528 (37B active params, a reasoning model) and Kimi K2 (30B active params, a non-reasoning model), both pretrained for about 3e24 FLOPs and 15T tokens, 100x-200x less than GPT-4.5, but with much more sparsity than GPT-4.5 could plausibly have. This gives the smaller models about 2x more in effective compute, but also they might be 2x overtrained compared to compute optimal (which might be 240 tokens/param, from taking 6x the dense value for 1:32 sparsity), so maybe the advantage of GPT-4.5 comes out to 70x-140x. I think this is a more useful point of comparison than the original GPT-4, as a way of estimating the impact of 5 GW training systems of 2028-2030 compared to 100K H100s of 2024.

[-]anaguma3mo97

If pre-training was the main reason you believed AGI was close—and now you believe pre-training has stalled—then you should update pretty strongly away from short timelines.

I'm not sure if this is true. Pretraining doesn't need to scale indefinitely in order to reach AGI, it just needs to scale to the point where the base models + RL are capable enough to automate ML research.

[-]Noah Birnbaum3mo41

Good point, and I think I somewhat agree. If you think we just reach an intelligence explosion at some level (seems pretty plausible), you wouldn't update to previous pre-training levels because we'd be closer and what really matters is hitting that point (and post-training can possibly take you to that point). While it means that you shouldn't update towards before pre-training, I still think the general point of being a large update back still stands (perhaps this point--the degree-- depends on some other priors, though, which I didn't want to get into).

[-]zroe13mo70

Strong upvote. As Noah already knows, I agree with this, but I'll highlight this for the purposes of giving visibility of dissenting opinions.

In 2020, scaling laws provided the best argument for short timelines. At the time, people were claiming that, yes, all we needed to do was to go bigger and we would automatically get better. And models have got a lot better, but the ways they got better were not entirely consistent with their predictions and this matters. Scaling was essential, but it turns out that we needed post-training improvements as well.

The problem with people not updating is that it is not clear that post-training scales in the same way as pre-training. Historically, RL can be unstable and hard to implement for increasingly complex tasks. In other words, we may find that you can't just "unhobble" models by scaling the inputs to post-training.

Although it feels like these post-training improvements have no end in sight, in general, improvements do have an end in sight. There were good reasons to believe that AI progress may be different because of scaling, but now it makes sense to update towards something slightly less strong.

LESSWRONG
LW

LESSWRONG
LW

23

How to Update If Pre-Training is Dead

23

23