10x more training compute = 5x greater task length (kind of)

by Expertium
13th Jul 2025
3 min read
8 comments, sorted by top scoring
[-] Vladimir_Nesov · 2mo

RLVR is getting introduced as a major component of training cost and capabilities over 2025, and it's possibly already catching up with pretraining in terms of GPU-time, pending more sightings of such claims.

The slopes of trends in capabilities are likely going to be different once RLVR is pretraining-scale, compared to when pretraining dominated the cost. So trends that start in the past and then include 2025 data are going to be largely uninformative about what happens later, it's only going to start being possible to see the new trends in 2026-2027.

(Incidentally, since RLVR has significantly lower compute utilization than pretraining, counting FLOPs of pretraining+RLVR will get a bit misleading when the latter gobbles up a major portion of the total GPU-time while utilizing only a small part of the training run's FLOPs.)

[-] anaguma · 2mo

With sufficiently large batch sizes for rollouts, why should we expect lower utilization than pretraining?

[-] Vladimir_Nesov · 2mo

In MoE, each expert only consumes a portion of the tokens, maybe 8x-32x fewer than there are tokens in total. When decoding, each sequence only contributes 1 token without speculative decoding, or maybe 8 tokens with it (but then later you'd be throwing away the incorrectly speculated tokens).

When you multiply two N×N square matrices of 16-bit numbers, you need to read 4N² bytes from HBM, perform 2N³ FLOPs, and write back 2N² bytes of the result. Which means you are performing N/3 BF16 FLOPs per byte read/written to HBM. For an H200, HBM bandwidth is 4.8 TB/s, while BF16 compute is 1e15 FLOP/s. So to feed the compute with enough data, you need N to be at least 600, probably 1K in practice.
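A quick sanity check of that arithmetic-intensity argument, as a minimal sketch using the H200 numbers quoted in the comment (not authoritative hardware specs):

```python
# Roofline check for an NxN @ NxN BF16 matmul, using the figures quoted above:
# ~4.8 TB/s HBM bandwidth and ~1e15 dense BF16 FLOP/s for an H200.

HBM_BANDWIDTH = 4.8e12   # bytes/s
BF16_FLOPS = 1e15        # FLOP/s

def matmul_flops_per_byte(n: int) -> float:
    """FLOPs per byte of HBM traffic for an N x N by N x N BF16 matmul."""
    flops = 2 * n**3                    # multiply-adds
    hbm_bytes = 4 * n**2 + 2 * n**2     # read both inputs, write the output (2 bytes/element)
    return flops / hbm_bytes            # simplifies to n / 3

# Smallest N at which the matmul is compute-bound rather than bandwidth-bound:
n_min = 3 * BF16_FLOPS / HBM_BANDWIDTH
print(n_min)  # ~625, i.e. "at least 600, probably 1K in practice"
```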

So to feed the compute in a MoE model with 1:8-1:32 sparsity, you need to be processing 8K-32K tokens at a time. This isn't too much of a problem for pretraining or prefill, since you work with all tokens of all the sequences in a batch simultaneously. But for decoding, you only have 1-8 tokens per sequence (at its end, currently being generated), which means 1K-4K sequences (with speculative decoding) or the full 8K-32K sequences (without) that could arrive at a given expert located on a particular physical chip.

Each sequence might need 10 GB of KV cache, for a total of 10-40 TB or 80-320 TB. An 8-chip H200 node has 1.1 TB of HBM, so that's a lot of nodes, and the activation vectors would need to travel between them to find their experts: the same 10-40 or 80-320 hops per either 4 tokens or 1 token of progress (which tops out at the number of layers, say 80). Each hop between nodes might need to communicate 50 KB of activation vectors per token, that is 0.4-1.6 GB (the same with or without speculative decoding), let's say 1 GB. With 8x400Gbps bandwidth, it'd take 2.5 ms to transmit (optimistically). With speculative decoding, that's 25-100 ms per token in total (over 10-40 hops), and without speculative decoding 200 ms per token. Over 50K-token sequences, this takes 0.1-0.3 hours with speculative decoding (assuming 4 tokens are guessed correctly on average at each step of decoding), or 4 hours without.

At 40% utilization with 250B active params, it'd take 0.4 hours to compute with speculative decoding (including the discarded tokens), and 0.2 hours without. So maybe there is still something to this in principle, with speculative decoding and more frugal KV cache. But this is probably not what will be done in practice. Also, there are only 4.3K steps of 0.5 hours in 3 months (for RLVR), which is plausibly too few.
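The per-hop communication cost in the middle of that estimate can be spelled out; the sketch below uses the comment's own assumed numbers (~50 KB of activations per token, ~20K tokens in flight, 8x400 Gbps inter-node links), not measured values:

```python
# Per-hop communication time for expert-parallel decoding, with the numbers
# assumed above: ~50 KB of activations per token, ~8K-32K tokens in flight
# (say ~20K, i.e. ~1 GB per hop), over 8x400 Gbps inter-node links.

bytes_per_token = 50e3            # activation vector per token, bytes
tokens_in_flight = 20e3           # "let's say 1 GB" worth of tokens
link_bandwidth = 8 * 400e9 / 8    # 8x400 Gbps NICs -> 400 GB/s

bytes_per_hop = bytes_per_token * tokens_in_flight      # ~1 GB
hop_time_ms = bytes_per_hop / link_bandwidth * 1e3      # ~2.5 ms
print(hop_time_ms)  # multiply by 10-40 (or 80-320) hops per decoding step
```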

Instead, we don't worry about feeding the compute on each chip during decoding and multiply thin matrices with very few tokens at each expert. Which means all the time is spent moving the data from HBM. With a 2T total param FP8 model we'd need maybe 4 nodes to fit it, and there will be enough space for 200 sequences. Passing all of HBM through the chips will take about 25 ms, which translates to 10 tokens per second without speculative decoding, or 40 tokens per second with it (which comes out to $2-9 per 1M tokens at $2 per H200-hour). At 40 tokens per second over 200 sequences on 4 nodes of 8 chips each, in a second a 250B active parameter model would use 4e15 FLOPs and could get 32e15 BF16 FLOPs at full compute utilization, so we get 12% compute utilization with speculative decoding, about 3x-4x lower than the proverbial 40% of pretraining. Though it's still 0.3 hours per RLVR step, of which there are only 6.4K in 3 months. So there could be reason to leave HBMs half-empty and thereby make decoding 2x faster.
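The resulting utilization figure can be reproduced directly from the numbers quoted above; this is a sketch of that back-of-envelope calculation, not a measurement:

```python
# Decode-time compute utilization implied by the numbers above: 250B active
# params, 200 sequences at ~40 tokens/s each (with speculative decoding),
# on 4 nodes of 8 H200s at ~1e15 dense BF16 FLOP/s per chip.

active_params = 250e9
tokens_per_second = 40
sequences = 200
chips = 4 * 8
peak_flops_per_chip = 1e15

flops_used = 2 * active_params * tokens_per_second * sequences   # ~4e15 FLOP/s
flops_available = chips * peak_flops_per_chip                    # ~3.2e16 FLOP/s
print(flops_used / flops_available)  # ~0.125, vs the proverbial ~40% for pretraining
```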

[-] J Bostock · 2mo

Interesting, but I think squashing total training compute to a scalar value is inappropriate here. If you draw a line through GPT-2, 3, 3.5, 4 on that plot, you can see the RLVR models R1 and 3.7 Sonnet are above the trend line. Since we don't really know the optimal ratio of pretraining/RLVR compute, nor how it scales, the total compute is missing a lot of important information. 

[-] mishka · 2mo

Thanks, a very useful analysis.

A 2.5-month task horizon is enough to provide very strong AI research acceleration, so the trend is likely to shift towards a larger share of algorithmic efficiency in the future (e.g., even if we froze compute, progress would not stop even now, although it would slow down noticeably).

[-] AnthonyC · 2mo

We should also keep in mind that once you get past ~8 hours, task time is no longer comparable to calendar time. A task that takes a human 24 hours doesn't take a day; it takes 3 full working days, which, when you account for non-focused-work-time overhead, means something like 80% of a work week. This means that when we say something like "2.5 months," that's actually a project that would take a human around a year of their life to complete, mixed in with all the other things a human does.
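As a rough illustration of that conversion, here is a minimal sketch assuming an 8-hour focused workday, a 5-day week, and "2.5 months" interpreted as 2.5 months' worth of hours (all assumed numbers, not from the paper):

```python
# Illustrative conversion from "human task hours" to calendar work time,
# assuming an 8-hour focused workday and a 5-day work week (assumed numbers).

def work_weeks(task_hours: float, hours_per_day: float = 8, days_per_week: float = 5) -> float:
    return task_hours / (hours_per_day * days_per_week)

print(work_weeks(24))             # 24 task-hours -> 0.6 of a work week (3 working days)
print(work_weeks(2.5 * 30 * 24))  # "2.5 months" of hours -> ~45 work weeks, roughly a year
```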

[-] Thomas Kwa · 2mo

It's interesting that task length is better correlated with release date than with training compute; I was not expecting that.

It's partially because we filtered for frontier models. If you had people train models of all different sizes that were compute-optimal architectures at their respective sizes, the correlation between time horizon and compute would plausibly be much better.

[-] basil.halperin · 2mo

Very nice. I recently did a similar exercise, and -- because, as you note, the Epoch data (understandably) doesn't have training-compute estimates for reasoning models -- I had o3 guesstimate "effective training compute" by OpenAI model (caveat: this doesn't really make sense!). You can see the FLOP by model in the link. And:

By this metric, it's ~3.5 more OOMs from o3 to 1-month-AGI. If -- as was often said to be the case before reasoning models -- effective compute can still be said to be growing at ~10x a year, then 1-month-AGI arrives around early 2029.

  • "1-month AGI", of course, on tasks of the type studied by Kwa et al
  • That is also running up against the late-2020s compute slowdown
  • Along with a bazillion other caveats I won't bother listing

I assume you are familiar with the METR paper: https://arxiv.org/abs/2503.14499

In case you aren't: the authors measured how long it takes humans to complete various tasks, had LLMs attempt the same tasks, and then calculated the task length (in human time) at which LLMs succeed 50%/80% of the time. Basically, "Model X can do task Y with Z% reliability, where Y takes humans W amount of time."
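As a rough sketch of how such a "50% time horizon" can be estimated: fit success probability against the log of human task time and find where the curve crosses 0.5. This mirrors the spirit of the METR methodology (the exact weighting and bootstrapping details are in the paper), and the data below is made up for illustration:

```python
# Minimal sketch: fit P(success) against log2(human task minutes) and report
# the task length where the fitted probability is 50%.

import numpy as np
from sklearn.linear_model import LogisticRegression

# (human task minutes, model succeeded) -- made-up example data
task_minutes = np.array([1, 2, 4, 8, 15, 30, 60, 120, 240, 480])
succeeded    = np.array([1, 1, 1, 1, 1,  1,  0,  1,   0,   0])

X = np.log2(task_minutes).reshape(-1, 1)
clf = LogisticRegression().fit(X, succeeded)

# Horizon = task length where P(success) = 0.5, i.e. where the logit is 0.
log2_horizon = -clf.intercept_[0] / clf.coef_[0][0]
print(2 ** log2_horizon, "minutes at 50% reliability")
```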

Interactive graph: https://metr.org/blog/2025-03-19-measuring-ai-ability-to-complete-long-tasks/

In the paper, the authors plotted task length as a function of release date.

Note that for 80% reliability the slope is the same.

IMO this is by far the most useful paper for predicting AI timelines. However, I was upset that their analysis did not include compute. So I took task lengths from the interactive graph (50% reliability), and I took estimates of training compute from EpochAI: https://epoch.ai/data/notable-ai-models

Here is the result:

(I didn't add labels for every model because the graph would be too cluttered)

Increasing training compute by 10 times increases task length by 10^0.694≈5 times.

...well, kind of. This graph does not take into account improvements in algorithmic efficiency. For example, Qwen2 72B was trained using roughly the same amount of compute as GPT-3.5, yet has a much longer task length. Based on this data, I cannot disentangle gains from more physical compute from gains due to tweaks to the Transformer architecture/training procedure. So it's more like "increasing training compute by a factor of 10, while simultaneously increasing algorithmic efficiency by some unknown amount, increases task length by a factor of 5."
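In other words, the fitted slope says the 50% time horizon scales roughly as (training compute)^0.694, with algorithmic progress folded into that exponent; a tiny worked example:

```python
# The fitted log-log slope implies: multiply training compute (plus whatever
# algorithmic progress came along with it) by k, and the 50% time horizon
# grows by roughly k**0.694.

SLOPE = 0.694  # from the log-log fit above

def horizon_multiplier(compute_multiplier: float) -> float:
    return compute_multiplier ** SLOPE

print(horizon_multiplier(10))   # ~4.9, i.e. "10x more training compute = 5x greater task length"
print(horizon_multiplier(100))  # ~24x for two orders of magnitude
```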

Speculation time: at what task length would an AI be considered ASI? Well, let's say a "weak ASI" is one that can complete a task that takes a human 100 years (again, 50% reliability). And let's say a "strong ASI" is one that can complete a task that takes a human 5000 years; granted, at that point this way of measuring capabilities breaks down, but whatever.

To train the "weak ASI" you would need around 10^35 FLOPs. For the "strong ASI" you would need around 10^37 FLOPs. 
Optimistically, 10^31 FLOPs is about as much as you can get if you buy 500,000 units of GB300 and train an LLM in FP8 at 90% utilization for 365 days nonstop. That would correspond to a task length of around 2.5 months. "Optimistically" because right now you can't buy 500,000 of those things even if you had infinite money, and even if you could, you would probably want to use them for other purposes as well (such as running older models or experiments with novel architectures), not just to train one giant model for a year. So this is more of an upper bound.
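For reference, the 10^35 and 10^37 figures follow from inverting the fitted power law around the ~10^31 FLOP / ~2.5-month point; the sketch below is pure trend extrapolation under that assumption:

```python
# Extrapolating the fitted trend (task length ~ compute**0.694) from the
# ~1e31 FLOP / ~2.5-month point quoted above.

SLOPE = 0.694
REF_FLOP = 1e31
REF_MONTHS = 2.5

def flop_for_horizon(target_months: float) -> float:
    return REF_FLOP * (target_months / REF_MONTHS) ** (1 / SLOPE)

print(flop_for_horizon(100 * 12))   # "weak ASI", 100-year tasks   -> ~1e35 FLOP
print(flop_for_horizon(5000 * 12))  # "strong ASI", 5000-year tasks -> ~1e37 FLOP
```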

If these numbers are even remotely within the right ballpark, training ASI will be impossible without either software or hardware breakthroughs (such as photonic chips or analog devices for matrix multiplication) that break the current trend. If the current trend continues, then we won't see ASI for a looooooooong time.

Caveats:

  1. The fit is not very good: R² = 0.73 (though bear in mind that the models used here aren't quite the same as in the METR paper). It's interesting that task length is better correlated with release date than with training compute; I was not expecting that.
  2. EpochAI lacks a lot of data on training compute, in particular data on o1, o3, Gemini 2.5 Pro and Flash, Claude 4 Opus and Sonnet. Conversely, METR lacks data on models such as Grok and Llama. More data points would be nice.

I would love it if the authors of the METR paper did a similar analysis, but with more models. Maybe they could disentangle the benefits of increasing training compute from the benefits of improvements in algorithmic efficiency to get two independent numbers.

EDIT: a little too late, but here:

Using only GPT models gives a much clearer picture with a much better fit, though it also means that I only have 4 data points to work with.

According to this graph, increasing training compute by 10 times results in 10^0.509≈3.2 times greater task time.
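For completeness, a slope like this comes from an ordinary least-squares fit in log-log space; a minimal sketch is below. The arrays are placeholders, not the actual GPT-2/3/3.5/4 numbers:

```python
# How a slope like 0.509 falls out of the data: least squares on
# log10(training FLOP) vs log10(task length). Placeholder values only.

import numpy as np

log_flop   = np.array([21.5, 23.0, 24.0, 25.3])   # hypothetical log10(FLOP)
log_length = np.array([-1.5, -0.7, -0.2,  0.5])   # hypothetical log10(task minutes)

slope, intercept = np.polyfit(log_flop, log_length, 1)
print(slope)        # the exponent (0.509 in the GPT-only fit above)
print(10 ** slope)  # horizon multiplier per 10x of compute (~3.2x for a 0.509 slope)
```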