Model Size Scaling in 2023-2031

Vladimir_Nesov

Token generation speed is constrained by the speed at which the relevant HBM can be read, which is mostly the weights and KV-cache. Suppose a model is large, so that more than half of HBM is read when making a single pass over the weights, it's being read in parallel within a scale-up system, and N such systems are used in a pipeline. Then the time it takes to generate a token (without speculative decoding) is at least the time of reading more than half of an HBM stack times N. If we target a particular speed of token generation, this puts a constraint on the number of pipeline stages, which puts a constraint on the total params of the model. But if there isn't enough pretraining compute, models will remain smaller than this constraint (lower sparsity at a given number of active params buys a higher speed of token generation), so both should be taken into account.

Working through these considerations gives model sizes feasible for each year between 2023 and 2031. The total params go from 10T in 2026 (at 8x sparsity, still constrained by Oberon racks, trained for 1.3e27 FLOPs) to 240T in 2028 (at 30x sparsity, with Kyber racks more than sufficient for the available pretraining compute) and then to 1.4 quadrillion in 2031 (30x sparsity, served with 8x Kyber Feynman scale-up systems, trained for 2.2e29 FLOPs). Starting in 2027, model sizes are further inflated by the lack of sufficient pretraining data, with models of 2031 having to be 4x bigger than unlimited training data would predict. There are many assumptions that go into the estimates, which I will state as they come up.

Time to Fully Read an HBM Stack

H100 has 5 stacks of 8-Hi HBM3 per compute die (2 GB per DRAM die, 0.8 TB/s per stack), 20 ms to fully read. H200 has 6 stacks of 12-Hi HBM3 ^[1] per compute die, 30 ms to read. B200/GB200 has 4 stacks of 8-Hi HBM3E per compute die (3 GB per DRAM die, 1.0 TB/s per stack), 24 ms to read. GB300 has 4 stacks of 12-Hi HBM3E, 36 ms to read. Non-Ultra Rubin chips of 2026 have 4 stacks of 12-Hi HBM4 per compute die (3 GB per DRAM die, 2.75 TB/s per stack ^[2] ), 13 ms to read.

Rubin Ultra probably uses 12-Hi stacks of HBM4E ^[3] (4 GB per DRAM die, 2048 pins, 14-16 Gbps per pin, which is 3.6-4.0 TB/s per stack ^[4] ), 12-13 ms to read. If HBM4E stays at 16 Gbps per pin (now reliably achieving it), but the stacks become 16-Hi, first-year Feynman in 2028 might use 2 stacks per compute die, which takes 16 ms to fully read. HBM5 of 2029 (that probably goes into the second year of Feynman) is expected to use 4096 pins and might start at 8 Gbps per pin, though the situation with HBM4 for non-Ultra Rubin suggests anchoring to 11 Gbps instead. This gives 5.6 TB/s per stack. With 5 GB per DRAM die, a 16-Hi stack takes 14 ms to read.

Thus we have 20 ms and 30 ms for H100 and H200, 24 ms and 36 ms for GB200 and GB300, 13 ms for both Rubin and Rubin Ultra, 16 ms and 14 ms for first- and second-year Feynman.
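Each of these read times is just stack capacity divided by stack bandwidth. A minimal Python sketch that reproduces them from the specs quoted above; the function and dictionary names are illustrative, and Rubin Ultra is taken at the lower 3.6 TB/s figure, which is an assumption on my part.

```python
def stack_read_ms(dies_per_stack, gb_per_die, tb_per_s):
    """Milliseconds needed to read one whole HBM stack once."""
    capacity_gb = dies_per_stack * gb_per_die
    # GB divided by TB/s comes out directly in milliseconds.
    return capacity_gb / tb_per_s

# (DRAM dies per stack, GB per DRAM die, TB/s per stack), as quoted in the text.
STACKS = {
    "H100": (8, 2, 0.8),
    "H200": (12, 2, 0.8),
    "GB200": (8, 3, 1.0),
    "GB300": (12, 3, 1.0),
    "Rubin": (12, 3, 2.75),
    "Rubin Ultra": (12, 4, 3.6),      # assumption: the lower 14 Gbps per pin case
    "Feynman year 1": (16, 4, 4.0),   # 16-Hi HBM4E at 16 Gbps per pin
    "Feynman year 2": (16, 5, 5.6),   # 16-Hi HBM5 at about 11 Gbps per pin
}

READ_MS = {name: stack_read_ms(*spec) for name, spec in STACKS.items()}

if __name__ == "__main__":
    for name, ms in READ_MS.items():
        print(f"{name:15s} {ms:5.1f} ms")
```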

Maximal Pipelines Below 80 Tokens/s

It takes almost 3x more time to fully read an HBM stack of GB300 than that of non-Ultra Rubin. Such differences suggest that the rule of thumb for the number of scale-up systems with the capacity to hold the largest models should be sensitive to their HBM. For concreteness, let's target token generation speed constrained at 80 tokens/s per request (just from reading HBM, forcing even lower speed in practice, maybe 60 tokens/s), assuming speculative decoding (or multi-token prediction, MTP) that speeds up generation by 3x. Then we need to generate/accept some tokens for any given request every 37.5 ms.

Let's say we are using 3 racks of GB200 Oberon in a pipeline, and half of all HBM is hosting weights. The KV cache in the other half of a given rack won't actually be fully read on each pass through the weights, because only 1 out of 3 requests will be active during the current phase of the pipeline (the other 2 out of 3 requests are being processed at the other racks). Thus only 67% of HBM is read by a pipeline stage in this setup (50% for weights, and 17% for KV cache). This ideally takes only 67% of the time of reading a full HBM stack of GB200, 16 ms instead of 24 ms. Going through the whole pipeline of 3 racks would only take a request 48 ms, rather than 72 ms (on reading HBM alone).
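The 67%, 16 ms and 48 ms figures can be checked with a few lines of Python. This is only a sketch of the idealized accounting in the paragraph above (weights read in full, KV cache read only for the requests active at a stage); the function names are mine.

```python
def stage_read_fraction(n_stages, weight_fraction=0.5):
    """Fraction of a stage's HBM that is read on one pass.

    Weights are read in full; only 1/n_stages of the KV cache belongs to
    requests that are currently being processed at this stage.
    """
    return weight_fraction + (1.0 - weight_fraction) / n_stages

def lap_ms(n_stages, stack_read_ms, weight_fraction=0.5):
    """HBM-read time for a request to traverse the whole pipeline once."""
    per_stage = stage_read_fraction(n_stages, weight_fraction) * stack_read_ms
    return n_stages * per_stage

if __name__ == "__main__":
    # 3 racks of GB200, 24 ms to read a full stack.
    print(stage_read_fraction(3), lap_ms(3, 24.0))
```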

As a request traverses a pipeline of N racks, it waits for all weights at all the racks to be read (taking half as much time as reading N full HBM stacks), and it witnesses the reading of 1/N of the KV cache in each of the racks (taking half as much time as reading a single HBM stack in total). So if we want the pipeline to complete in 37.5 ms with GB200, 12 ms will be spent reading the KV cache regardless of the length of the pipeline, and the remaining 25.5 ms can be used to read the weights, which takes 12 ms per pipeline stage, meaning at most 2 stages. That is, N+1 half-stack reads of 12 ms each must fit in 37.5 ms, so 75 ms divided by 24 ms gives an upper bound on N+1.

This means a pipeline with the above assumptions can use at most 2 H100 servers, 1 H200 server, 2 GB200 Oberon racks, 1 GB300 Oberon rack, 4 Rubin or Rubin Ultra racks (either Oberon or Kyber), 3 systems of 8x Kyber with first-year Feynman, 4 systems of 8x Kyber with second-year Feynman.
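These stage counts follow mechanically from the read times and the 37.5 ms budget. A small sketch under the same assumptions (half of HBM holds weights, 3x speedup from MTP, 80 tokens/s target); the function name and the dictionary of systems are illustrative.

```python
import math

def max_pipeline_stages(stack_read_ms, tokens_per_s=80.0, mtp_speedup=3.0,
                        weight_fraction=0.5):
    """Largest N such that N weight reads plus one KV-cache read fit the budget."""
    budget_ms = 1000.0 * mtp_speedup / tokens_per_s      # 37.5 ms by default
    kv_ms = (1.0 - weight_fraction) * stack_read_ms      # independent of N
    weights_ms_per_stage = weight_fraction * stack_read_ms
    return math.floor((budget_ms - kv_ms) / weights_ms_per_stage)

# Time to fully read one HBM stack, in ms, as estimated in the previous section.
READ_MS = {
    "H100": 20, "H200": 30, "GB200": 24, "GB300": 36,
    "Rubin": 13, "Rubin Ultra": 13,
    "Feynman year 1": 16, "Feynman year 2": 14,
}

MAX_STAGES = {name: max_pipeline_stages(ms) for name, ms in READ_MS.items()}

if __name__ == "__main__":
    for name, n in MAX_STAGES.items():
        print(f"{name:15s} at most {n} stage(s)")
```

Changing `tokens_per_s` or `mtp_speedup` shows how sensitive the stage counts are to the speed target.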

Assuming 12-Hi HBM stacks for Rubin Ultra, 2 HBM stacks per compute die and 576 compute dies per rack for Feynman, total HBM per scale-up system is 640 GB for a H100 server, 1.1 TB for H200, 14 TB for GB200 Oberon, 20 TB for GB300 and Rubin Oberon racks, 110 TB for Rubin Ultra Kyber, 590 TB for 8x Kyber systems with first-year Feynman chips, 740 TB for 8x Kyber systems with second-year Feynman chips.

If half of the HBM capacity is spent on FP4 weights, and the buildout sufficiently completes a year after the system is released, we get the following upper bounds on total params of the largest models that can be served with at most 80 tokens/s per request (with 3x speedup via MTP). The constraint from pipelining on total FP4 params is 1.3T for H100 in 2023, 1.1T for H200 in 2024, 27T for GB200 Oberon in 2025, 20T for GB300 Oberon in 2026, 83T for Rubin Oberon racks in 2027, 442T for Rubin Ultra Kyber racks in 2028, 1.7 quadrillion for 8x Kyber systems with first-year Feynman chips in 2029, 2.9 quadrillion for 8x Kyber systems with second-year Feynman chips in 2030.
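The bounds above are pipeline stages times HBM per system, with half of the HBM holding weights at 4 bits per param. The sketch below rebuilds them from die counts. Some of the per-system layouts are my assumptions chosen to match the HBM totals in the text: 8 GPUs per server, 144 compute dies per Rubin Oberon rack, and 576 compute dies with 4 stacks each per Rubin Ultra Kyber rack.

```python
def max_total_params_t(stages, hbm_tb, bits_per_param=4, weight_fraction=0.5):
    """Upper bound on total params (in trillions) for a pipeline of scale-up systems."""
    weights_tb = stages * hbm_tb * weight_fraction
    return weights_tb * 8 / bits_per_param

# name: (pipeline stages, compute dies per system, stacks per die, GB per stack)
SYSTEMS = {
    "H100 server":        (2, 8, 5, 16),
    "H200 server":        (1, 8, 6, 24),
    "GB200 Oberon":       (2, 144, 4, 24),
    "GB300 Oberon":       (1, 144, 4, 36),
    "Rubin Oberon":       (4, 144, 4, 36),      # assumed same die count as GB300
    "Rubin Ultra Kyber":  (4, 576, 4, 48),      # assumed layout giving about 110 TB
    "Feynman y1 8xKyber": (3, 8 * 576, 2, 64),
    "Feynman y2 8xKyber": (4, 8 * 576, 2, 80),
}

def hbm_tb(dies, stacks_per_die, gb_per_stack):
    """Total HBM of one scale-up system, in TB."""
    return dies * stacks_per_die * gb_per_stack / 1000.0

BOUNDS_T = {
    name: max_total_params_t(stages, hbm_tb(dies, stacks, gb))
    for name, (stages, dies, stacks, gb) in SYSTEMS.items()
}

if __name__ == "__main__":
    for name, bound in BOUNDS_T.items():
        print(f"{name:20s} {bound:8.1f}T FP4 params")
```

Passing `bits_per_param=8` halves every bound, which is the FP8 case.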

If serving in FP8 becomes important for very large models, the number of total params enabled by the scale-up systems is 2x lower. And of course this is an overestimate for model sizes that can actually be served at 80 tokens/s, since I'm ignoring the networking and compute overhead that couldn't be masked. The main way in which the network isn't ignored is that expert parallelism must happen only within scale-up systems rather than across different scale-up systems. This is something that for example doesn't apply to DeepSeek-V3 (see Section 3.4.2), but the resulting design constraints likely can't be sustained for the largest models.

Pretraining Compute

Based on Nvidia's apparent bet on FP8 in Rubin, where the FP8 to BF16 performance ratio is 4.4, which used to be 2 in the previous chips, I'm assuming that even the largest models will be pretrained in FP8. And I'm going to assume 3 months of pretraining at 40% FLOP/s utilization.

For 2023, the anchor is GPT-4 rumors of pretraining on 20K-25K A100s (20 MW of IT power), which at 0.3e15 BF16 FLOP/s (no FP8) per compute die gives 2.1e25 FLOPs under the above assumptions. For models of 2024 (trained in late 2023 or early 2024), that's 32K H100s (one building at the Microsoft Goodyear site), 50 MW of IT power, 2e15 FP8 FLOP/s per compute die, 2e26 FLOPs. For 2025, 100K H100s, 150 MW, 6e26 FLOPs. For 2026 models, 200K H100s or 300K Trainium 2 Ultra, 300 MW in both cases, 1.3e27 FLOPs.

The 300 MW figure for late 2025 pretraining compute anchors to 1-2 GW of total first-party compute per AI company at the end of 2025, which becomes 3-4 GW at the end of 2026, and 10 GW by the end of 2027. So for 2027 models, I'm guessing 1 GW of pretraining compute, which can't be H100s but could be B200/GB200 or Trainium 2 Ultra. IT power of a GB200/GB300 Oberon rack (72 packages per rack, 2 compute dies per package) might be 140-180 kW, possibly with another 10% of networking equipment overhead on top. This gives 730K-930K compute dies per 1 GW of IT power, and a concrete example is the OpenAI Abilene site with 800K compute dies. Blackwell dies produce 2.5e15 FP8 FLOP/s, which gives 6e27 FLOPs for a 2027 model under the above assumptions.
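The number of compute dies per GW follows from rack power and dies per rack. A quick sketch, assuming the 10% networking overhead is applied on top of rack power and 144 compute dies per Oberon rack (72 packages with 2 dies each); the function name is illustrative.

```python
def dies_per_gw(rack_kw, dies_per_rack=144, network_overhead=0.10):
    """Compute dies that fit in 1 GW of IT power."""
    kw_per_rack = rack_kw * (1.0 + network_overhead)
    racks = 1e6 / kw_per_rack          # 1 GW = 1e6 kW
    return racks * dies_per_rack

if __name__ == "__main__":
    for kw in (140, 180):
        print(f"{kw} kW per rack: {dies_per_gw(kw) / 1e3:.0f}K compute dies per GW")
```

The same function with 225 kW per rack gives a little over 580K dies per GW.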

A 2028 model might use 3 GW of pretraining compute (out of 10 GW of total first-party compute an AI company has), which can still be Blackwell, so 2e28 FLOPs. For 2028-2030 compute, I'm going to anchor to the SemiAnalysis estimate of global AI compute, obtaining an estimate for the total first-party compute available to an AI company that starts with 10 GW in late 2027, then goes to 17 GW in 2028, to 28 GW in 2029, and 40 GW in 2030. This suggests 5 GW of pretraining compute for a 2029 model (about a third of the 17 GW available in late 2028), which probably can no longer be Blackwell and must be Rubin (up to 13 GW out of the 17 GW is Rubin compute, though probably less, and maybe only 5-6 GW is in Oberon racks). At 225 kW per Oberon rack (possibly plus 10% in networking overhead), this is 580K compute dies per GW. At 8.75e15 FP8 FLOP/s per compute die, 5 GW give 8e28 FLOPs for a 2029 model.

At the end of 2029, there's 7 GW of Rubin Ultra Kyber racks from 2028, the 11 GW of the newer first-year Feynman 8x Kyber systems from 2029 can serve the older largest models, and there's 28 GW of first-party compute in total, so maybe 7 GW of Rubin could go into pretraining, which is 1e29 FLOPs. At the end of 2030, there's 40 GW of first-party compute and after the second-year Feynman buildout the large 8x Kyber scale-up systems are no longer scarce, so maybe the 11 GW of first-year Feynman compute might go into pretraining. To estimate the pretraining FLOPs, I need estimates of total IT power and FP8 FLOP/s per compute die for Feynman. Assuming a 30% higher draw than Rubin, and considering the TSMC N3P to A16 jump, I'm guessing 14e15 FP8 FLOP/s per compute die. At 30% more power, there would be only 450K Feynman compute dies per GW. A 2031 model could then be pretrained for 2.2e29 FLOPs.
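All of the pretraining estimates in this section are the product of die count, FLOP/s per die, utilization and run length. The sketch below recomputes them under the stated assumptions of 3 months at 40% utilization. Taking 3 months as a quarter of a year and GPT-4 at 22.5K A100s (the midpoint of the rumored range) are my own choices, so the results only match the text up to rounding.

```python
SECONDS_IN_3_MONTHS = 365.25 / 4 * 24 * 3600   # assumption: a quarter of a year

def pretraining_flops(n_dies, flops_per_die, utilization=0.4,
                      seconds=SECONDS_IN_3_MONTHS):
    """Total FLOPs of a pretraining run."""
    return n_dies * flops_per_die * utilization * seconds

# model year: (compute dies, FLOP/s per die), from the figures in the text
RUNS = {
    2023: (22_500, 0.3e15),          # A100, BF16; midpoint of 20K-25K
    2024: (32_000, 2e15),            # H100, FP8
    2025: (100_000, 2e15),
    2026: (200_000, 2e15),
    2027: (800_000, 2.5e15),         # 1 GW of Blackwell
    2028: (3 * 800_000, 2.5e15),     # 3 GW of Blackwell
    2029: (5 * 580_000, 8.75e15),    # 5 GW of Rubin
    2030: (7 * 580_000, 8.75e15),    # 7 GW of Rubin
    2031: (11 * 450_000, 14e15),     # 11 GW of first-year Feynman
}

FLOPS = {year: pretraining_flops(*run) for year, run in RUNS.items()}

if __name__ == "__main__":
    for year, flops in FLOPS.items():
        print(f"{year}: {flops:.1e} FLOPs")
```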

Active Params from Pretraining Compute

I'm assuming there are up to 200T tokens of unique pretraining data (maybe half of it text data). Based on this Jan 2025 paper, the compute optimal ratio of tokens per active param is 3x higher for an MoE model with 8x sparsity compared to a dense model, and 6x higher for an MoE model with 30x sparsity, see Figure 11 and Figure 12, left. Based on the Jul 2024 Llama 3 405B report, the compute optimal ratio for a dense model is about 40 tokens/param at 4e25 FLOPs, see Figure 2 and Figure 3. Putting these anchors together, we get 120 and 240 tokens/param respectively. A May 2026 paper can be taken to vaguely suggest that the shortfall in unique data should be distributed equally between epochs of repetition and an increase in active params over the compute optimal number.
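The sizing procedure used in the rest of this section can be written down compactly. This is a sketch rather than a definitive method: it assumes the common approximation of 6 FLOPs per active param per token (not stated explicitly in the post), the 120 and 240 tokens/param anchors, and an equal split of any data shortfall between extra active params and epochs of repetition. All names are illustrative.

```python
import math

UNIQUE_TOKENS = 200e12
TOKENS_PER_PARAM = {8: 120, 30: 240}   # compute optimal ratio by sparsity

def size_model(flops, sparsity, unique_tokens=UNIQUE_TOKENS):
    """Active/total params and epochs for a given pretraining budget."""
    ratio = TOKENS_PER_PARAM[sparsity]
    # Assumption: compute = 6 * active_params * tokens, with tokens = ratio * active_params.
    active = math.sqrt(flops / (6 * ratio))
    tokens = ratio * active
    shortfall = max(1.0, tokens / unique_tokens)
    # Split the shortfall equally (on a log scale) between params and epochs.
    epochs = math.sqrt(shortfall)
    active *= epochs
    return {
        "active": active,
        "total": active * sparsity,
        "shortfall": shortfall,
        "epochs": epochs,
    }

if __name__ == "__main__":
    for year, flops, sparsity in [(2023, 2.1e25, 8), (2026, 1.3e27, 8),
                                  (2027, 6e27, 8), (2028, 2e28, 30),
                                  (2031, 2.2e29, 30)]:
        m = size_model(flops, sparsity)
        print(f"{year}: {m['active'] / 1e12:.2f}T active, "
              f"{m['total'] / 1e12:.0f}T total, {m['epochs']:.2f} epochs")
```

Note that scaling active params and epochs by the same factor keeps the total compute unchanged, since the number of tokens processed becomes `epochs * unique_tokens`.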

For 2023 models, 2.1e25 FLOPs ask for 170B active params at about 8x sparsity (with 20T training tokens). The constraint from the contemporary scale-up systems of 1.3T total params with 2 H100 servers in FP4 fits right at the 8x sparsity. The GPT-4 rumors say 1.8T total params, but RLVR and long reasoning weren't a concern then. For 2024 models, 2e26 FLOPs ask for 530B active params at about 8x sparsity (with 64T training tokens), which asks for 4.2T total params, which even in FP4 is way above the bound of 1.3T total params from 2 H100 servers or 1.1T total params from 1 H200 server. Thus 2024 models are significantly constrained by the available scale-up systems and should be much smaller than compute optimal, or else they must be slow and expensive.

A 2025 model trained for 6e26 FLOPs would want 900B active params at about 8x sparsity (with 110T training tokens), which asks for 7.2T total params, well within the bound of 27T total NVFP4 params from 2 racks of GB200 Oberon. GPT-4.5 was probably at that scale, but it was released before there were GB200 Oberon racks to serve it. With B200 NVL8, which might've been available in sufficient numbers, it would require 5 servers to fit, meaning 68 ms of reading HBM in one lap of the pipeline (47% of the 5 servers of 1.53 TB each is weights, 1/5 of the KV cache in the remaining 53% is read on one pass of a pipeline), or at most 44 tokens/s even with a 3x boost from MTP. If GPT-4.5 was actually faster (before GB200 NVL72 were plausibly available), maybe it was actually smaller. For some reason it wasn't released with RLVR after there were more GB200 Oberon racks to serve it cheaply.
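The B200 NVL8 estimate can be retraced as follows. This sketch takes the 5 servers, 1.53 TB per server and 24 ms per stack from the text; the function name is mine, and the result lands within a millisecond or two of the quoted 68 ms because of rounding.

```python
SERVER_HBM_TB = 1.53     # B200 NVL8 server, as quoted above
STACK_READ_MS = 24.0     # time to fully read a B200 HBM stack

def nvl8_pipeline(total_params_t, n_servers, bits_per_param=4, mtp_speedup=3.0):
    """Weight fraction, HBM-read time per lap (ms), and tokens/s bound."""
    weights_tb = total_params_t * bits_per_param / 8
    weight_fraction = weights_tb / (n_servers * SERVER_HBM_TB)
    # Weights are read at every stage; the KV cache is read once in total.
    weights_ms = n_servers * weight_fraction * STACK_READ_MS
    kv_ms = (1.0 - weight_fraction) * STACK_READ_MS
    lap_ms = weights_ms + kv_ms
    return weight_fraction, lap_ms, 1000.0 * mtp_speedup / lap_ms

if __name__ == "__main__":
    frac, lap, tps = nvl8_pipeline(7.2, 5)
    print(f"weights take {frac:.0%} of HBM, {lap:.0f} ms per lap, {tps:.0f} tokens/s")
```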

A 2026 model trained for 1.3e27 FLOPs wants 1.3T active params at about 8x sparsity (with 160T training tokens), which is 10T total params, fitting well under the constraint from pipelining with the contemporary scale-up systems of either 27T FP4 params from 2 GB200 Oberon racks, or 20T FP4 params from 1 GB300 Oberon rack. This even fits in 1 GB200 Oberon rack in NVFP4, and could use FP8 with 2 GB200 Oberon racks. The largest models of 2026 are not constrained by scale-up systems if they are content with 8x sparsity.

For a 2027 model at around 8x sparsity, the 6e27 FLOPs of pretraining would want 2.9T active params with 350T tokens of training data, which is 1.75x more than the assumed 200T tokens of unique data that is actually available. Splitting the shortfall equally between more active params and more epochs of repetition, we need 1.32x more active params, which is 3.8T (trained for 1.32 epochs on the 200T unique tokens), with 30T total params. This is significantly below the constraint of 83T total params from a pipeline of 4 Rubin Oberon racks. Unclear if the AI companies would elect to go with more sparsity, increasing total params, or much higher speed from shorter pipelines. But in 2027, 30x sparsity remains out of reach ^[5] .

A 2028 model is constrained at a staggering 442T NVFP4 params if served with a pipeline of 4 Rubin Ultra Kyber racks. So targeting 30x sparsity, 2e28 FLOPs want 3.7T active params and 900T training tokens (from the assumption of 240 tokens/param being compute optimal at this sparsity). This is a shortfall of 4.5x in unique data, which needs a model with 2.1x more active params trained for 2.1 epochs of repetition. Thus a model of 2028 might have 7.9T active params and 240T total (at 30x sparsity), with a lot of room to spare in 4 Rubin Ultra Kyber racks, meaning it's going to use fewer racks and generate tokens faster. The active constraint for 2028 models is not enough pretraining compute, rather than scale-up systems that are too small.

A 2029 model pretrains for 8e28 FLOPs, which at 30x sparsity asks for 7.4T active params and 1,800T training tokens. The 200T tokens of unique data are a shortfall of 8.5x, so the model instead needs 22T active params and 650T total, trained for 2.9 epochs with the 200T unique tokens. This model can no longer be served with a pipeline of 4 Rubin Ultra Kyber racks, but the constraint from a pipeline of 3 systems of 8x Kyber with first-year Feynman chips sets a much higher constraint at 1.7 quadrillion NVFP4 params. Thus the 650T param model of 2029 could be served even in FP8 on 2 systems of 8x Kyber (instead of 3 systems), or faster/cheaper with 1-2 systems of 8x Kyber when using NVFP4.

A 2030 model pretrains for 1e29 FLOPs, which at 30x sparsity and with infinite data would want 8.3T active params and 2,000T training tokens, a shortfall of 10x. Thus the model would instead need 26T active params, meaning 790T total params at 30x sparsity, pretrained for 3.1 epochs of repeating the 200T unique tokens. This is only slightly more onerous than the 2029 model even for the first-year Feynman systems (perhaps one first-year Feynman system is no longer sufficient in NVFP4, as the model takes up 67% of its HBM), while with second-year Feynman systems (that put a constraint of 2.9 quadrillion NVFP4 params from a pipeline of 4 systems) we need just 1 of them in NVFP4, or 2 of them in FP8.

Finally, a 2031 model that pretrains for 2.2e29 FLOPs would at 30x sparsity and with infinite data ask for 12T active params and 3,000T training tokens, a shortfall of 15x. So the model instead needs 48T active params, meaning 1.4 quadrillion total params at 30x sparsity, trained for 3.9 epochs of repeating the 200T unique tokens. I didn't make estimates for post-Feynman systems, but even a pipeline of 3 systems of 8x Kyber with first-year Feynman chips suffices to serve this model in NVFP4 (the constraint is 1.7 quadrillion params), and a pipeline of 4 second-year Feynman systems can serve this model in FP8, so this model is not constrained at 30x sparsity even when served on hardware that is a year old.

Starting in 2028, the Constraint is Pretraining Compute

Overall, 2024 was the year when models were most constrained by the scale-up systems, with a compute optimally trained model infeasible to serve even at 8x sparsity. In 2023 and 2025-2026, compute optimally pretrained models fit in pipelines of scale-up systems if they have 8x sparsity. The situation marginally improves in 2027, when faster HBM4 in Rubin makes longer pipelines practical, while shorter pipelines get faster.

And then once the buildout of Rubin Ultra Kyber racks sufficiently completes in 2028, compute optimally pretrained models even with 30x sparsity become feasible to serve. This remains the case through 2031, despite the shortage of pretraining data that might require models to get 4x bigger than they would've needed to be with unlimited data.
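To see at a glance which constraint binds in each year, the needed total params can be compared with the pipelining bounds. The table below only collects numbers already given in the post (active params, sparsity, and the FP4 bound of the best contemporary pipeline); pairing 2031 with the second-year Feynman bound is my choice, and the names are illustrative.

```python
# year: (active params in T, sparsity, pipelining bound on total FP4 params in T)
MODELS = {
    2023: (0.17, 8, 1.3),
    2024: (0.53, 8, 1.3),
    2025: (0.9, 8, 27),
    2026: (1.3, 8, 27),
    2027: (3.8, 8, 83),
    2028: (7.9, 30, 442),
    2029: (22, 30, 1700),
    2030: (26, 30, 2900),
    2031: (48, 30, 2900),
}

def pressure(year):
    """Needed total params divided by the pipelining bound (above 1 means it does not fit)."""
    active, sparsity, bound = MODELS[year]
    return active * sparsity / bound

MOST_CONSTRAINED = max(MODELS, key=pressure)

if __name__ == "__main__":
    for year in MODELS:
        print(f"{year}: needs {pressure(year):.2f}x of the bound")
    print("most constrained year:", MOST_CONSTRAINED)
```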

^[1] H200 was released in 2023, before HBM3E of 2024 could grow into its full specs. So what Nvidia claims to be HBM3E in H200 has the specs of HBM3 also used in H100; the main difference is that the stacks are 12-Hi and there are 6 of them.
^[2] The bandwidth of HBM4 in Rubin is much higher than the 2.0 TB/s required by the JEDEC standard. It's about 10.8 Gbps per pin instead of 8 Gbps per pin, with 2048 pins.
^[3] Unlike the situation with H200, this is actual HBM4E in the sense of 4 GB per DRAM die rather than the 3 GB per DRAM die of HBM4, even though it's a year early. This is because 1024 GB per package (with 16 stacks of HBM) was announced by Nvidia when it was expected to be 16-Hi, while HBM4 would've been 768 GB per package with 16-Hi stacks.
^[4] Samsung and SK Hynix might be able to achieve 3.6-4.0 TB/s in 12-Hi stacks, with 16-Hi stacks still in the works for HBM4E (meaning this bandwidth is probably relevant for Rubin Ultra of 2027, not just for the high-end HBM4E of 2028). Though news about samples is not a lot of evidence, since what matters is the performance achievable with high yield (ready for ramp), while the availability of samples only weakly depends on yield. So I'm giving some credence to the lower 3.6 TB/s (14 Gbps per pin).
^[5] With 240 tokens/param compute optimal at 30x sparsity, we have 2T active params and 490T training tokens, a shortfall in unique data of 2.45x. So the model would need 1.56x more active params, which is 3.2T, meaning 96T total params at 30x sparsity, more than the constraint of 83T.

How capable do you think those 2028 models will be?

24x model size in 2 years seems big.

The compute scaling is much less significant than a 24x difference in model size would imply under the all-else-equal scaling assumptions, which is 24x squared. It's instead going from 1.3e27 FLOPs (probably Mythos 5, possibly GPT-5.5 as well, or its base model will get there with more training in GPT-5.6) to 2e28 FLOPs, a 15x increase. Then there's the change from 8x sparsity to 30x sparsity, which should give a 2x effective compute boost, to the total of 30x (3x boost from 8x sparsity compared to dense, 6x boost from 30x sparsity, see Figure 11 from this paper). But also the 2026 model has enough data, while the 2028 model is under a 4.5x data shortfall, so unclear how much of the effective compute boost survives. Maybe the 2028 model has 20x more effective compute in pretraining than the 2026 model.

On the other side of the 1.3e27 FLOPs, a 20x decrease gives 6e25 FLOPs, 3x less than late 2023 to very early 2024 compute that probably trained GPT-4o (though the model was likely significantly smaller than compute optimal). This same pretrain probably became o3 and then GPT-5.1 later in life. Perhaps the GPT-4o base model being smaller/overtrained cancels out the difference in compute, so it might be at the level of a compute optimal 8x sparsity model pretrained with 6e25 FLOPs. And GPT-5.1 was maybe around the level of Sonnet 4.5 if we try to ignore the differences in how they were post-trained, so the 2028 model might be stronger than the 2026 model by about as much as Mythos 5 is stronger than Sonnet 4.5 or GPT-5.1.

Then the 2031 model is using merely 11x more pretraining FLOPs than the 2028 model, and uses the same 30x sparsity (so there is no boost to effective compute from more sparsity), and has a 15x training data shortfall instead of 4.5x. So the difference between the 2028 and 2031 models is maybe 1.5x smaller than the difference between the 2026 and 2028 models (comparing the effective compute on the logarithmic scale), unless much larger pretrains have some currently-unknown advantages not captured in their token prediction loss. Perhaps the 2031 model is smarter than the 2028 model by about as much as Mythos 5 is smarter than the average of Sonnet 4.5 and Opus 4.6.

(Standard caveats: algorithmic advances can happen suddenly at any time, and any predictions about the current paradigm are only relevant as long as the current paradigm itself remains relevant.)

@Daniel Kokotajlo, could you share your thoughts?

@Vladimir_Nesov, thank you for the post! I have a few objections.

First of all, I would naively expect the 2028 240T model to cost around 240/10*25 = 600 dollars per million input tokens and 3000 dollars per million output tokens, basing my estimate on the prices of Mythos Preview. An SWE at Anthropic received $300K-405K/year in 2025, so 100-140 million output tokens from a $3000/Mtok model cost as much as a human-year. The model is then only worth using if those 100M tokens are more valuable than a year of human work, meaning that the model has to become superhuman at coding or something else.
240T is more synapses than in a human brain. What is the probability that mankind fails to discover novel architectures by this time? Or was it priced in, meaning that the AI will HAVE become transformative?

As for @Lanthimum's question,^[1] I have the following answer sketch (again, I would appreciate it if Kokotajlo's team shared its opinion!)

The AECI has been growing linearly for Anthropic's non-Mythos frontier since Opus 3. Unfortunately, no one measured the AECI for Sonnets 3, 4 and 4.6 or Haikus 3, 3.5 and 4.5, and no other Sonnets or Haikus have been released. EpochAI's capability indices have Sonnet 4.6 lag behind Opus 4.6 by 2-3 points despite Sonnet 4.5 being on-trend and Sonnet 4 being close to Opus 4. I suspect that the scaling law of models' ECI is that they follow the trend of scaling the capabilities with lived experience or logarithm thereof until they are saturated or slowed down.
I have already urged Kokotajlo to increase the doubling time to 5-6 months due to the behavior of the METR time horizons since o3 and until Mythos Preview. Opus 4.6 had an 80% TH of 1.1 hours, Mythos Preview reached 3.1 hours, which is around double that of Gemini 3 Pro. We expect at most three doublings of the o3-Opus 4.6 trend, which was supposed to give a 2-6.1 working days TH, meaning at most a 1-2.5 months TH if not outright a far smaller TH due to doublings being priced into the trend. Alas, back in AI-2027 the 80% horizon of the AC had a wide range of estimates containing Jurkovic's 1.5 months and the current estimate of an 80% horizon is a working year.
The considerations above assume that we don't discover anything like neuralese^[2] or continual learning, which an Anthropic researcher promises to solve in 2026. If such a discovery happens, then we run into a Dark Infohazardous Forest where we also find it far harder to align the models even in comparison with our current ones, which are amnesiated between instances.

My uninformed vision of the Dark Forest

Suppose that the approaches to neuralese and/or continual learning are actually discovered. I envision them as introducing new backpropagation channels and/or continual learning mechanisms with initially-low and gradually increasing rank. In this case, mankind might discover the scaling laws forcing the models' capabilities and dangerous capabilities like no-CoT math or the difficulty of interpreting the model or extracting the pseudo-CoT to also increase linearly or logarithmically with the size of parts affected by dangerous interventions until the model becomes completely brainlike. ^[3]

Then an irresponsible lab (xAI? Chinese labs?) would rush into the Forest once capabilities' scaling laws are established. A responsible lab, on the other hand, would establish the scaling laws of uncontrollability and follow the AI-2027-like path where no step is undertaken until either the previous step produces a presumably aligned model or a rival forces the lab to race.

UPD: Anthropic's experiment with automated weak-to-strong generalisation researchers had Anthropic spend about $22 per AAR-hour. Opus 4.6's API prices are $5/MTok for input and $25/MTok for output, implying at most about a million tokens per AAR-hour. Figure 7 of the AI-2027 compute forecast has the researchers claim that a human's speed is ~13 tokens/second, meaning that a human would produce the million tokens in around 2.67 working days. Since a working year consists of ~230 working days, the human produces at least around 86 million tokens a working year.

UPD2: Anthropic's human researchers are paid $350K-800K a year, letting the LLM generate 120-270 million tokens. Which, again, are around 1.5-3 times more than a human's worth... unless the human is inspired to come up with some idea not during the hours that I included into the calculation.
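The arithmetic in the two updates can be checked with a short script. It is a sketch that assumes an 8-hour working day (which is what the 2.67-day figure implies) and the $3000 per million output tokens guessed earlier in this comment; the function names are illustrative.

```python
HUMAN_TOKENS_PER_S = 13
HOURS_PER_DAY = 8          # assumption implied by the 2.67 working days figure
WORKING_DAYS_PER_YEAR = 230

def working_days_for(tokens):
    """Working days a human needs to produce the given number of tokens."""
    return tokens / HUMAN_TOKENS_PER_S / 3600 / HOURS_PER_DAY

def human_tokens_per_year():
    """Tokens a human produces in one working year."""
    return HUMAN_TOKENS_PER_S * 3600 * HOURS_PER_DAY * WORKING_DAYS_PER_YEAR

def tokens_for_salary(salary_usd, usd_per_mtok=3000.0):
    """Output tokens that a yearly salary buys at the given price."""
    return salary_usd / usd_per_mtok * 1e6

if __name__ == "__main__":
    print(f"{working_days_for(1e6):.2f} working days per million tokens")
    print(f"{human_tokens_per_year() / 1e6:.0f}M tokens per human working year")
    for salary in (350e3, 800e3):
        ratio = tokens_for_salary(salary) / human_tokens_per_year()
        print(f"${salary:,.0f} buys {tokens_for_salary(salary) / 1e6:.0f}M tokens, "
              f"{ratio:.1f}x a human year")
```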

^[1] The user also was amazed by 10T->240T jump. Two years ago Claude 3.5 Sonnet with 400B params was the frontier.
^[2] Or non-neuralese architectures like placing multiple tokens into the CoT? They might end up with a better capabilities/control tradeoff.
^[3] Additionally, I wonder what was supposed to be Agent-5's post-brainlike architecture.

Input token cost scales with active params, essentially doesn't depend on total params (the same as the FLOPs of pretraining), and goes down over the years as FLOPs get cheaper. Output token cost is determined by KV cache per token times average context length. KV cache per token scales with model dimension, which scales with square root of active params or even slower (if more active params translates to more layers or more experts per layer), that is it scales significantly slower than the active params themselves do. Of course, output token cost can't actually get below that of the input tokens because the output tokens are HBM bandwidth bound (for big models), so their cost (not necessarily price!) remains at least 2-3x higher, but with longer contexts they become even more expensive compared to the input tokens. So at a sufficient context length, the scaling trends play out if we keep the context length fixed and vary the active params of models, with the output token cost increasing slower than the input token cost as the number of active params increases.

Most tokens are input tokens, so their pricing matters more for the AI company. In principle, output tokens could be priced with zero gross margin, they just shouldn't be generated at a loss. The cost of the output tokens is kept at a healthy 5-6x compared to input tokens by increasing the allowed context length.

So let's say the 2026 model with 1.3T active params is literally Mythos 5 and it's priced at $10 per 1M input tokens. That's already with a gross margin that might be above 70%, since the 70% figure could probably anchor Opus 4.6 price, which is $5 per 1M input tokens, exactly 2x lower than Mythos 5 price, and Mythos 5 probably doesn't have more than 2x the active params than Opus 4.6, since it probably didn't use more than 4x the compute of Opus 4 in pretraining, and both plausibly use compute optimally sized numbers of active params (maybe Mythos 5 has more sparsity than Opus 4) without yet running out of unique training data (active params scale with the square root of pretraining compute). Which is to say, the $10 per 1M input tokens price of Mythos 5 probably has significant room for coming down even today.

The 240T total params model of 2028 has only 7.9T active params, 6x more than the 1.3T active params of the 2026 model, and it runs on Rubin instead of Blackwell. GB200 produces 5e15 FP4 FLOP/s per compute die, Rubin produces 17.5e15 FP4 FLOP/s per compute die, GB200 rack needs 140 kW, Rubin Oberon rack needs 225 kW. The cost per year of a 1 GW datacenter (not just the racks) might be going up 20%, so the cost of the datacenter time per compute die goes up maybe 90%, but the FP4 FLOP/s go up 3.5x, and so the FP4 FLOPs in Rubin are 1.8x cheaper than the FP4 FLOPs in GB200 (the 2028 model needs Kyber rather than Oberon, but let's assume the cost of time per compute die isn't too different). Thus the 6x difference in active params becomes a 3.3x difference in cost. The price for the 240T total params model of 2028 might be $33/$165 per 1M input/output tokens. (The output token price is adjusted as appropriate by increasing the allowed context length, if it would otherwise try to get below 5x the input token price. And this is plausibly still at a more than 70% gross margin, so could be cheaper actually.)
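The price estimate above chains a few ratios together, which is easier to follow in code. A sketch under the stated assumptions (input token cost proportional to active params, cost per die proportional to rack power times a 20% datacenter cost increase, the same number of compute dies per rack, and output priced at 5x input); the function and parameter names are mine.

```python
def price_2028(input_price_2026=10.0,
               active_2026_t=1.3, active_2028_t=7.9,
               rack_kw_old=140.0, rack_kw_new=225.0,
               datacenter_cost_growth=1.2,
               fp4_flops_old=5e15, fp4_flops_new=17.5e15,
               output_multiple=5.0):
    """Input and output price per 1M tokens, plus how much cheaper FLOPs get."""
    # Cost of datacenter time per compute die (assuming equal dies per rack).
    die_cost_ratio = datacenter_cost_growth * rack_kw_new / rack_kw_old
    flops_ratio = fp4_flops_new / fp4_flops_old
    flops_cheaper_by = flops_ratio / die_cost_ratio
    cost_ratio = (active_2028_t / active_2026_t) / flops_cheaper_by
    input_price = input_price_2026 * cost_ratio
    return input_price, input_price * output_multiple, flops_cheaper_by

if __name__ == "__main__":
    inp, out, cheaper = price_2028()
    print(f"FP4 FLOPs are {cheaper:.1f}x cheaper; "
          f"about ${inp:.0f}/${out:.0f} per 1M input/output tokens")
```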

3-4 GW at the end of 2026

Are you basing this on this quote?

There’s a few strings to pull on there. One is, what happens to depreciation of GPUs? I guess I didn’t answer your prior question, which is that I think Anthropic will be able to get to five gigawatts-ish, maybe a little bit more by the end of the year through themselves as well as their product being served through Bedrock, Vertex, or Foundry. I think they’ll be able to get to five or six gigawatts, which is way above their initial plans. OpenAI will be roughly the same, actually a little bit higher based on our numbers.

First-party compute is distinct from serving inference via clouds (IaaS as opposed to TaaS compute). The quote you gave (at 11:30 in the podcast) discusses IaaS plus TaaS, while I talk about just IaaS (which is more relevant for R&D and training). In the same interview, Dylan Patel also says Anthropic will get to 2+ GW end of 2026 (at 1:12:10 in the podcast, the link I gave in the post for the 10 GW claims), which likely refers to IaaS only.

I'm basing the 3-4 GW figure on multiple less-specific clues and interpolation. For Anthropic, the general IT power of compute trend of about 3x per year, the claim of 2+ GW end of 2026 probably implies 1+ GW end of 2025, then there's 10 GW end of 2027, 1 GW of TPUs in the 2026 buildout, more Trainium in 2026, so I think 2+ GW end of 2026 likely resolves to 3 GW. For OpenAI, there are contracts with Oracle for Blackwell (see the table in this section of the TPU post), which is 2025-2026 incremental compute, and compute at the end of 2025 is 2 GW. Plausibly OpenAI gets more than 4 GW end of 2026, but then it might rely on first-party compute to serve tokens more than Anthropic, and so use less of it for R&D and pretraining. Also, if OpenAI gets back to the same 10 GW at the end of 2027 as Anthropic, the possible difference over 2026 balances out, so it shouldn't be strategically relevant for them when sizing pretraining runs.

The constraint from pipelining on total FP4 params is 1.3T for H100 in 2023, 1.1T for H200 in 2024, 27T for GB200 Oberon in 2025, 20T for GB300 Oberon in 2026, 83T for Rubin Oberon racks in 2027

Why 27T and not 28T for GB200?

N_params = N_racks * HBM/2 * 8/fp_format

with 2 racks, 14TB of HBM, and parameters in FP4?

A more precise value is 27.6T. It's an upper bound, so I rounded down. It's a difference of 2%, at this level I might use rounding inconsistently, and depending on context I sometimes keep just 1-2 significant digits.