There is something conceptually misleading about the usual ways of framing algorithmic progress. Imagine that in 2022 the number of apples produced on some farm increased 10x year-over-year, then in 2023 the number of oranges increased 10x, and then in 2024 the number of pears increased 10x. That doesn't mean that the number of fruits is up 1000x in 3 years.
Price-performance of compute compounds over many years, but most algorithmic progress doesn't: it only applies to the things relevant around the time when that progress happens, and stops being applicable a few years later. So forecasting over multiple years in terms of effective compute that doesn't account for this issue would greatly overestimate progress. There are some pieces of algorithmic progress that do compound, and it would be useful to treat them as fundamentally different from the transient kind.
This is a reasonable point in principle, but I don't know how important it is in practice. My sense is that most things identified as algorithmic improvements continue to be algorithmic improvements over the previously-done thing at higher scales? E.g. transformers beating LSTMs, Chinchilla scaling, GeLU over ReLU, probably RL to train reasoning, etc.
Recursive self-improvement in AI probably comes before AGI. Evolution doesn't need to understand human minds to build them, and a parent doesn't need to be an AI researcher to make a child. The bitter lesson and the practice of recent years suggest that building increasingly capable AIs doesn't depend on understanding how they think.
Thus the least capable AI that can build superintelligence without human input only needs to be a competent engineer that can scale and refine a sufficiently efficient AI design, in an empirically driven mundane way that doesn't depend on matching the capabilities of Grothendieck for conceptual invention. This makes the threshold of AGI less relevant for timelines of recursive self-improvement than I previously expected. With o1 and what straightforwardly follows, we plausibly already have all it takes to get recursive self-improvement, if the current designs get there with the next few years of scaling, even if the resulting AIs are merely competent engineers that fail to match humans at less legible technical skills.
No group of AIs needs to gain control before human irrelevance either. Like a runaway algal bloom, AIs might be able to bootstrap superintelligence, without crossing the threshold of AGI helping them gain control over this process any more than humans maintain such control at the outset. So it's not even multi-agent dynamics shaping the outcome; capitalism might just serve as the nutrients until a much higher threshold of capability where a superintelligence can finally take control of this process.
Humans are capable of solving conceptually difficult problems, so they do. An easier path might be possible that doesn't depend on such capabilities, and doesn't stall for their lack, like evolution doesn't stall for lack of any mind at all. If there is more potential for making models smarter alien tigers by scaling RL in o1-like post-training, and the scaling proceeds to 1 gigawatt and then 35 gigawatt training systems, it might well be sufficient to get an engineer AI that can improve such systems further, at 400x and then 10,000x the compute of GPT-4.
Before o1, there was a significant gap, the mysterious absence of System 2 capabilities, with only vague expectation that they might emerge or become easier to elicit from scaled up base models. This uncertainty no longer gates engineering capabilities of AIs. I'm still unsure that scaling directly can make AIs capable of novel conceptual thought, but AIs becoming able to experimentally iterate on AI designs seems likely, and that in turn seems sufficient to eventually mutate these designs towards the remaining missing capabilities.
(It's useful to frame most ideas as exploratory engineering rather than forecasting. The question of whe...
Cutting edge AI research seems remarkably and surprisingly easy compared to other forms of cutting edge science. Most things work on the first try, clever insights aren't required, it's mostly an engineering task of scaling compute.
The speed of scaling pretraining will go down ~3x in 2027-2029, reducing probability of crossing transformative capability thresholds per unit of time after that point, if they'd not been crossed yet by then.
GPT-4 was trained in 2022 at ~2e25 FLOPs, Grok-3 and GPT-4.5 were trained in 2024 at ~3e26 FLOPs (or twice that in FP8) using ~100K H100s training systems (which cost ~$4-5bn to build). In 2026, Abilene site of Crusoe/Stargate/OpenAI will have 400K-500K Blackwell chips in NVL72 racks (which cost ~$22-35bn to build), enough to train a ~4e27 FLOPs model. Thus recently there is a 2-year ~6x increase in cost for a frontier training system and a 2-year ~14x increase in compute. But for 2028 this would mean a $150bn training system (which is a lot, so only borderline plausible), and then $900bn in 2030. At that point AI companies would need to either somehow figure out how to pool resources, or pretraining will stop scaling before 2030 (assuming AI still doesn't hit a transformative commercial success).
If funding stops increasing, what we are left with is the increase in price performance of ~2.2x every 2 years, which is ~3.3x slower than the 2-year ~14x at the current pace. (I'm estimating price performance for a whole datacenter or at least a rack, rather than only for chips.)
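A minimal sketch of this extrapolation, with all inputs being the rough estimates quoted above:

```python
import math

# Rough sketch of the extrapolation above, plugging in the approximate figures
# quoted (rounded estimates, not authoritative): ~6x growth in frontier training
# system cost every 2 years at the current pace.
cost_bn = {2024: 4.5, 2026: 25}        # ~$4-5bn (100K H100s), ~$22-35bn (Abilene)
for year in (2028, 2030):
    cost_bn[year] = cost_bn[year - 2] * 6
print(cost_bn)                          # 2028: ~$150bn, 2030: ~$900bn

# If funding stops growing, only price-performance (~2.2x per 2 years) remains,
# which is slower by a factor of log(14) / log(2.2) in compute doublings per year:
print(math.log(14) / math.log(2.2))     # ~3.3
```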
Building frontier AI datacenters costs significantly more than their servers and networking. The buildings and the power aren't a minor cost because older infrastructure mostly can't be reused, similarly to how a training system needs to be built before we can talk about the much lower cost of 4 months of its time.
Apparently Crusoe's part in the Stargate Abilene datacenters is worth $15bn, which is only the buildings, power (substations and gas generators), and cooling, but not the servers and networking (Oracle is taking care of that). With 400K chips in GB200 NVL72 racks (which is 5.6K racks), at maybe $4M per rack or $5M per rack together with external-to-racks networking[1] ($70K per chip all-in on compute hardware), that's about $27bn, a figure that's comparable to the $15bn for the non-compute parts of the datacenters.
This makes the funding burden significantly higher ($7.5M per rack or $105K per chip), so that the Stargate Abilene site alone would cost about $40-45bn and not only $25-30bn. I'm guessing the buildings and the power infrastructure are not usually counted because they last a long time, so the relatively small time cost of using them (such as paying for electrici...
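A minimal sketch of the all-in cost arithmetic, with the per-rack price and Crusoe's share taken as the rough estimates above:

```python
chips = 400_000
racks = chips / 72                        # GB200 NVL72, ~5.6K racks

compute_cost = racks * 5e6                # ~$5M per rack incl. external networking
non_compute_cost = 15e9                   # Crusoe's ~$15bn: buildings, power, cooling
total = compute_cost + non_compute_cost

print(f"compute ~${compute_cost/1e9:.0f}bn, total ~${total/1e9:.0f}bn")
print(f"all-in ~${total/racks/1e6:.1f}M per rack, ~${total/chips/1e3:.0f}K per chip")
# ~$28bn compute, ~$43bn total; ~$7.7M per rack and ~$107K per chip, i.e. roughly
# the $7.5M per rack / $105K per chip figures above.
```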
It seems more accurate to say that AI progress is linear rather than exponential, as a result of being logarithmic in resources that are in turn exponentially increasing with time. (This is not quantitative, any more than the "exponential progress" I'm disagreeing with[1].)
Logarithmic return on resources means strongly diminishing returns, but that's not actual plateauing, and the linear progress in time is only slowing down according to how the exponential growth of resources is slowing down. Moore's law in the price-performance form held for a really long time; even though it's much slower than the present funding ramp, it's still promising exponentially more compute over time.
And so the progress won't obviously have an opportunity to actually plateau, merely proceed at a slower linear pace, until some capability threshold or a non-incremental algorithmic improvement. Observing the continued absence of the never-real exponential progress doesn't oppose this expectation. Incremental releases are already apparently making it difficult for people to notice the extent of improvement over the last 2.5 years. With 3x slower progress (after 2029-2032), a similar amount of improvement wo...
There is a natural sense in which AI progress is exponential: capabilities are increasing at a rate which involves exponentially increasing impact (as measured by e.g. economic value).
Exponential increase in total economic value is not specific to AI, any new tech is going to start exponentially (possibly following the startups championing it) before it gets further on the adoption S-curve. The unusual things about AI are that it gets better with more resources (while most other things just don't get better at all in a straightforward scaling law manner), that the logarithm of resources thing leaves the persistent impression of plateauing despite not actually plateauing, and that even if it runs out of the adoption S-curve it still has Moore's law of price-performance to keep fueling its improvement. These unusual things frame the sense in which it's linear/logarithmic.
If the improvement keeps raising the ceiling on adoption (capabilities) fast enough, funding keeps scaling into slightly more absurd territory, but even then it won't go a long way without the kind of takeoff that makes anything like the modern industry obsolete. After the exponential phase of adoption comes to an end, it falls back to Moore's law, which still keeps giving it exponential compute to slowly keep fueling further progress, and in that sense there is some unusual exponential-ness to this. Though probably there are other things with scaling laws of their own that global economic growth (instead of Moore's law) would similarly fuel, even slower.
In many industries cost decreases by some factor with every doubling of cumulative production. This is how solar eventually became economically viable.
A surprising report by Bloomberg claims 16K GB200[1] by summer 2025 at Abilene site (pilot campus of Stargate) and merely 64K GB200 by end of 2026. This is way too little to be a training system, Colossus already has more compute (200K H100/H200) than the projected 64K GB200 at end of 2026.
If this is correct, OpenAI will be training with Azure rather than Stargate in 2025, so raw compute GPT-5 (2e27 FLOPs, 100x GPT-4) probably won't be out in 2025 and officially "GPT-5" will mean something else (since it's due "in months" in any case according to Altman). Also, a datacenter with 16K Blackwells only costs about $1bn, they have more money than this, which suggests Blackwell ramp up trouble that might delay everyone else as well, though as a lower bound Nvidia reported $11bn in Blackwell sales for Nov 2024 - Jan 2025 (it's "Q4 2025" since their FY 2025 runs to end of Jan 2025).
In principle "16K GB200" might mean more Blackwell chips than 16K, a compute tray has more than one chip, with variants marketed as named products like GB200 NVL4 "superchip", but even at 4 chips per tray/board we still get below 200K H100s in compute. And an NVL72 system has 72 chips (which brings the numbe
I think 'GB200' refers to this column (2 Blackwell GPU + 1 Grace CPU) so 16K GB200s ~= 32K B200s ~= 80K H100s. Agree that it is still very low.
My guess is that Bloomberg's phrasing is just misleading or the reporting is incomplete. For example, maybe they are only reporting the chips Oracle is contributing or something like that. I'd be very surprised if OpenAI don't have access to >200K GB200s ~= 1M H100s by the end of 2025. For reference, that is only ~$20B capex (assuming $100k total cost of ownership per GB200) or roughly 1/4 of what Microsoft alone plan to invest this year.
Once they have just 100K GB200s, that should train 2e27 FLOP in 4 months.[1]
There's a nice correspondence between H100s and FLOP/month (assuming 40% utilisation and 16-bit precision) of 1e21 FLOP/month/H100. So since 100K GB200s = 500K H100s, that's 5e26 FLOP/month.
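A quick sketch of where that rule of thumb comes from; the ~1e15 dense BF16 FLOP/s per H100 is a rounding of the spec sheet peak:

```python
h100_bf16_peak = 1e15                      # ~989 TFLOP/s dense BF16, rounded
utilisation = 0.40
seconds_per_month = 30 * 24 * 3600

flop_per_month = h100_bf16_peak * utilisation * seconds_per_month
print(f"{flop_per_month:.1e} FLOP/month per H100")     # ~1.0e21

# Under the 100K GB200s ~= 500K H100s equivalence used above:
print(f"{500_000 * flop_per_month:.1e} FLOP/month")    # ~5e26, so ~2e27 in 4 months
```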
The marketing terminology is inconvenient: a "superchip" can mean 2-GPU or 4-GPU boards and even a 72-GPU system (1 or possibly 2 racks). So it's better to talk in terms of chips (that are not "superchips"), which I think are all B200 run at slightly different clock speeds (not to be confused with B200A/B102/B20, which have half the compute). In GB200, the chips are 2.5x faster than H100/H200 (not 5x faster; so a 200K chip GB200 system has the same compute as a 500K chip H100 system, not a 1M chip H100 system). Power requirements are often a good clue that helps disambiguate; compute doesn't consistently help because it tends to get reported at randomly chosen precision and sparsity[1].
Large scale-up worlds (or good chips) are not necessarily very important in pretraining, especially in the later steps of the optimizer when the critical batch size gets high enough, so it's not completely obvious that a training system will prefer to wait for NVL72 even if other packagings of Blackwell are more available earlier. Inference does benefit from NVL72 a lot, but for pretraining it's just cheaper per FLOP than H100 and faster in wall clock time during the first ~3T tokens when the whole...
Crusoe/OpenAI Abilene campus might come online in Feb-Jun 2026. Crusoe CEO said during RAISE Summit 2025 (that took place on 8-9 Jul 2025) that the 6 buildings of phase 2 will "be coming online" in "just over 200 days" (at 7:03 during a panel discussion). If this means 230 days, that's end of Feb 2026. If he really means "coming online", then it becomes available at that time. If he actually means that it's when the last building of 8 from both phases will be ready to install the compute hardware, then it's at least 3-4 months more to do that (judging by xAI's Colossus), possibly May-Jun 2026.
This is plausibly the first 400K chip system in GB200/GB300 NVL72 racks (about 900 MW), which is 10x 100K H100s of 2024 in FLOP/s and 12x H200s in HBM per scale-up world (for GB200, at 14 TB), making models 10x larger in total params feasible to inference or train with a lot of RLVR. Currently only Google plausibly has comparable compute, with their Trillium (TPUv6e) systems that across 256 chips per pod (scale-up world) offer 8 TB of HBM (generally available since Dec 2024 in 100K chip systems). The older TPUv5p from 2023 has even larger pods, but it's unclear if they have enough of them to f...
Here's a couple of my recent relevant posts (both slightly outdated, in particular see this comment, and the note on Gemini 2 Ultra in another comment under this quick take). Though in this quick take, I'm mostly discussing total params count and HBM capacity per scale-up world, not compute, how it's constraining 2025 AIs beyond compute (so that even 2024 compute fails to find efficient use), and how in 2026 these constraints become less strict.
Abilene site of Stargate will host 100K-128K chips in GB200 NVL72 racks by this summer, and a total of 400K-512K chips in 2026, based on a new post by Crusoe and a reinterpretation of the recent Bloomberg post in light of the Crusoe post. For 2025, it's less than 200K chips[1], but more than the surprising 16K-32K chips[2] that the Bloomberg post suggested. It can be a training system after all, but training a raw compute "GPT-5" (2e27 FLOPs) by the end of 2025 would require using FP8[3].
The Crusoe post says "initial phase, comprising two buildings at ... 200+ megawatts" and "each building is designed to operate up to 50,000 NVIDIA GB200 NVL72s". Dylan Patel's estimate (at 1:24:42) for all-in datacenter power per Blackwell GPU was 2.0 kW (which must mean per chip, or else it's way too much). At GTC 2025, Jensen Huang showed a slide (at 1:20:52) where the estimate is 2.3 kW per chip (100 MW per 85K dies, which is 42.5K chips).
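A sketch of the sanity check this enables, assuming the quoted ~2.0-2.3 kW all-in per chip and ~100 MW per building:

```python
building_mw = 100                # "two buildings at ... 200+ megawatts"
for kw_per_chip in (2.0, 2.3):
    chips = building_mw * 1000 / kw_per_chip
    print(f"{kw_per_chip} kW/chip -> ~{chips/1000:.0f}K chips per building")
# -> ~43K-50K chips per building, consistent with "50K GB200 NVL72s" meaning chips;
# 50K superchips (100K chips) per building would need ~200+ MW by itself.
```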
So the "50K GB200 NVL72s" per building from the Mar 2025 Crusoe post can only mean the number of chips (not dies or superchips), and the "100K GPUs" per building from the Jul 2024 Crusoe post must've meant 100K compute dies (which is 50K chips). It...
It's instrumentally useful for early AGIs to Pause development of superintelligence for the same reasons as it is for humans. Thus preliminary work on policy tools for Pausing unfettered RSI is also something early AGIs could be aimed at, even if it's only half-baked ideas available on the eve of potential takeoff, as the AGIs are proving hard to aim and start doing things for their own reasons.
If (early) scheming-for-long-run-preferences AGIs were in control, they would likely prefer a pause (all else equal). If they aren't, it's very unclear and they very well might not. (E.g., because they gamble that more powerful AIs will share their preferences (edit: share their preferences more than the humans in control do) and they think that these AIs would have a better shot at takeover.)
Musk on a Y Combinator podcast, at 42:42 (about AI risk):
I think it most likely will be a good outcome. I guess I sort of agree with Geoff Hinton that maybe it's a 10 to 20 percent chance on annihilation. But look on the bright side, that's 80 to 90 percent probability of a great outcome.
OMG! GEOFF! STOP STATING YOUR DEFERENTIAL PROBABILITY without also stating your first-order probability! If your first-order probability is >50% then say so! Otherwise you're making other people (ELON MUSK!) double count evidence from "other people".
https://www.youtube.com/watch?v=PTF5Up1hMhw&t=2283s
https://tsvibt.blogspot.com/2022/09/dangers-of-deferrence.html
Musk is in charge of xAI, one of the only 5 companies in the world that both have access to frontier AI training compute and pursue development of AGI (Google DeepMind, OpenAI, Anthropic, xAI, and Meta). So seeing unambiguous "annihilation" with a significant weight in his probability distribution (and also on the record) is a notable development. (In 2023 there was a statement on extinction risk signed by Hassabis, Amodei, and Altman, but it didn't state the weight of the risk, and wasn't signed by Musk or Zuckerberg.)
Edit: The rest of this comment in its original form got out of hand, you can now read it as a post.
A MoE transformer can reach the same loss as a compute optimal dense model using 3x-6x less compute, but will need the same amount of data to do it. So compute optimal MoEs don't improve data efficiency and don't contribute to mitigating data scarcity.
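A sketch of the accounting behind this claim under the C ≈ 6ND approximation, with purely illustrative parameter counts:

```python
# If a sparse MoE matches a compute optimal dense model's loss with ~6x less
# compute on the same number of tokens D, then by C ~= 6*N*D the whole saving
# shows up as fewer active params N, not as less data.
dense_active_params = 70e9
tokens = 20 * dense_active_params             # Chinchilla-style ~20 tokens/param
dense_compute = 6 * dense_active_params * tokens

compute_multiplier = 6                        # ~1:32 sparsity (see results below)
moe_compute = dense_compute / compute_multiplier
moe_active_params = moe_compute / (6 * tokens)

print(f"dense: {dense_compute:.1e} FLOPs, MoE: {moe_compute:.1e} FLOPs")
print(f"MoE active params: {moe_active_params/1e9:.1f}B, tokens unchanged: {tokens:.1e}")
```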
A new Jan 2025 paper offers straightforward compute multiplier results comparing dense transformers to MoE at various levels of sparsity, with isoFLOPs for various tokens/parameter ratios, using experiments of up to 1e21 FLOPs per datapoint. Compute multiplier results are in Figure 11, with about 3x compute multiplier for 87% (1:8) sparse MoE over dense, and about 6x-7x compute multiplier for 97% (1:32) sparse MoE (same sparsity as DeepSeek-V3).
But there's a catch. Greater sparsity makes it compute optimal to use fewer active parameters, and therefore more data (training with the same compute). This can be seen on isoFLOP plots in Figure 12, left. As sparsity goes from 0% (dense) to 95% (1:20), compute optimal number of active parameters for their 1e21 FLOPs experiments goes from 2.9B to 1.3B. For 97% (1:32) sparsity, interpolating from experiments on the other compute budgets, the ratio of the number of active parameters seems to be abo...
Chatbot Arena results for DeepSeek-V3 are in. It placed 7th in Overall w/ Style Control, tied with Claude-3.5.Oct-Sonnet, and 3rd in Hard Prompts w/ Style Control, tied with Gemini-2.0-Flash and behind only Claude-3.5.Oct-Sonnet, mysterious Gemini-Exp-1206, o1, and Gemini-2.0-Flash-Thinking.
It's a MoE model with 37B active parameters trained for about 5e24 FLOPs, 10x less compute than Llama-3-405B, 20x less than what could plausibly be extracted from 30K H100s in BF16. The pretraining data is about 15T tokens, so at 400 tokens per active parameter it's very overtrained, far from compute optimal.
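A minimal check of the overtraining arithmetic, using the 6ND approximation and the figures above:

```python
active_params = 37e9
tokens = 15e12
print(tokens / active_params)        # ~405 tokens per active param (vs ~20 compute optimal)
print(6 * active_params * tokens)    # ~3.3e24 FLOPs by 6ND, same order as the ~5e24 above
```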
It has 256 routed experts per layer, 8 of which get activated per token. These results give some weight to the Feb 2024 paper that predicts that using more granular experts and activating a lot of them per token can give shocking compute multipliers[1], up to 20x-30x, much more than for MoE transformers that only activate 1-2 routed experts per token (Figure 1b). The paper itself only does experiments of up to about 5e19 FLOPs, in particular directly demonstrating a compute multiplier of 2x from using 8 experts per token instead of 2, with the numbers of total and active parameters k...
New AWS Trainium 2 cluster offers compute equivalent to 250K H100s[1], and under this assumption Anthropic implied[2] their previous compute was 50K H100s (possibly what was used to train Claude 3.5 Opus).
So their current or imminent models are probably 1e26-2e26 FLOPs (2-4 months on 50K H100s at 40% compute utilization in BF16)[3], and the upcoming models in mid to late 2025 will be 5e26-1e27 FLOPs, ahead of what 100K H100s clusters of other players (possibly except Google) can deliver by that time.
SemiAnalysis gives an estimate of 24-27 kilowatts per 32 Trainium 2 chips, so 200K Trn2s need 150 megawatts. The 7 datacenter buildings in the northern part of the New Carlisle AWS site are 65 megawatts each according to SemiAnalysis. That's enough for 600K Trn2s, so the figure of 400K Trn2s probably refers to those buildings alone, rather than also to the second phase of the project scheduled for next year. At 0.65e15 dense BF16 FLOP/s each, 400K Trn2s produce as much compute as 250K H100s. ↩︎
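A sketch of this arithmetic; the ~1e15 dense BF16 FLOP/s per H100 is a rounding of its peak:

```python
kw_per_chip = 24 / 32                       # SemiAnalysis: 24-27 kW per 32 Trn2 chips
print(200_000 * kw_per_chip / 1000)         # ~150 MW for 200K Trn2

northern_buildings_mw = 7 * 65              # ~455 MW
print(northern_buildings_mw * 1000 / kw_per_chip)   # ~600K Trn2 supported

print(400_000 * 0.65e15 / 1e15)             # ~260K H100-equivalents in FLOP/s (~250K)
```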
Anthropic's post: "This cluster will deliver more than five times the computing power used to train our current generation of leading AI models." ↩︎
At 4 months, with $2/hour, this takes $3
Are you saying Anthropic actually has more compute (in the relevant sense) than OpenAI right now? That feels like a surprising claim, big if true.
For OpenAI, there are currently 3 datacenter buildings[1] near Phoenix Goodyear Airport that Dylan Patel is claiming are 48 megawatts each and filled with H100s, for about 100K H100s. This probably got online around May 2024, the reason for the announcement and the referent of Kevin Scott's blue whale slide.
There are claims about a future cluster of 300K B200s and a geographically distributed training system of 500K-700K B200s, but with B200s deliveries in high volume to any given customer might only start in early to mid 2025, so these systems will probably get online only towards end of 2025. In the meantime, Anthropic might have a lead in having the largest cluster, even if they spend less on compute for smaller experiments overall. It might take a while to get it working, but there might be a few months there. And given how good Claude 3.5 Sonnet is, together with the above musings on how it's plausibly merely 4e25 FLOPs based on Dario Amodei's (somewhat oblique) claim about cost, additionally getting compute advantage in training a frontier model could carry them quite far.
There are 4.5 buildings now at that site, but you can see with Google Street View from Litchfield Rd
And in a way, they ought to be rolling in even more compute than it looks because they are so much more focused: Anthropic isn't doing image generation, it isn't doing voice synthesis, it isn't doing video generation... (As far as we know they aren't researching those, and definitely not serving it to customers like OA or Google.) It does text LLMs. That's it.
But nevertheless, an hour ago, working on a little literary project, I hit Anthropic switching my Claude to 'concise' responses to save compute. (Ironically, I think that may have made the outputs better, not worse, for that project, because Claude tends to 'overwrite', especially in what I was working on.)
OpenAI's gpt-oss-120b might be the first open weights model (implicitly) revealed to be pretrained for 100T-200T tokens. In the section "Pretraining" of the model card, it's said that "The training run for gpt-oss-120b required 2.1 million H100-hours", so probably this is just the GPU-time for pretraining rather than both pretraining and RLVR.
The pretraining precision is unclear, but for a model of this size FP8 is likely. Because H100-hours are mentioned, pretraining couldn't (usefully) have been in the MXFP4 that the model ended up with, since H100 can't do FP4 faster than FP8 (but Blackwell can). Also, despite claims that the model was "trained with native MXFP4 precision", the model card also says "We post-trained the models with quantization of the MoE weights to MXFP4 format", suggesting higher precision before post-training.
At 40% utilization, with 2e15 FP8 FLOP/s per H100, 2.1e6 H100-hours give 6e24 FLOPs (3.5x less than the original GPT-4, 2x more than DeepSeek-V3). The model only has 5.1B active params, so this suggests 188T tokens by 6ND rule. If it was pretrained in BF16 for some reason, that's still 94T tokens.
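A sketch reproducing this estimate from the stated assumptions (the result is a ballpark, sensitive to the assumed peak FLOP/s and utilization):

```python
h100_hours = 2.1e6
fp8_peak = 2e15                   # ~2e15 dense FP8 FLOP/s per H100
utilization = 0.40
active_params = 5.1e9

compute = h100_hours * 3600 * fp8_peak * utilization
tokens = compute / (6 * active_params)            # 6ND rule
print(f"{compute:.1e} FLOPs, {tokens:.1e} tokens")
# ~6e24 FLOPs and ~2e14 tokens (on the order of 190T); halve the token count for BF16.
```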
For comparison, a compute optimal 5e26 model pretrained on 100K H100s from 2024 would al...
The batch price for input tokens is $0.625 per 1M tokens, which works for an 850B active param model running in FP4 on GB200 NVL72 priced at $8 per chip-hour with 60% compute utilization (for prefill). If the cost of chip-hours is a third of the capital cost of compute equipment in the first year, and 100K chips of GB200 NVL72 cost $7bn ($5M per rack all-in, with networking), then its chip-hour should cost at least $2.66.
So there is some possibility for gross margin here in principle, even though $8 per chip-hour already sounds very cheap. GCP is selling B200-hours for $11 (a4-highgpu-8g instances), though B200s are also on gpulist for $3-4. Oracle is selling actual GB200 in 4-chip instances for $16 per chip-hour, if I'm reading it right (it's in principle possible it's actually $4 and $16 is for the 4-chip instance as a whole, but GCP's prices for B200 corroborate that $16 could be right for a single chip).
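A back-of-envelope sketch of both numbers; the ~10 PFLOP/s of dense FP4 per GB200 chip is an assumption on my part, the rest are the figures above:

```python
active_params = 850e9
flops_per_token = 2 * active_params             # forward pass only (prefill)
chip_fp4_flops = 10e15                          # ~10 PFLOP/s dense FP4 per chip (assumed)
utilization = 0.60

tokens_per_hour = chip_fp4_flops * utilization * 3600 / flops_per_token
print(8.0 / (tokens_per_hour / 1e6))            # ~$0.63 per 1M input tokens at $8/chip-hour

# Chip-hour cost floor if a third of the capital cost is recovered in the first year:
capex = 7e9                                     # 100K GB200 NVL72 chips at ~$70K all-in
print(capex / 3 / (100_000 * 8760))             # ~$2.7 per chip-hour
```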
There's the Oct 2024 knowledge cutoff, which is later than Orion should've started training, but in principle this could be for mid-training that got re-applied recently, or they could've just redone the whole run with the learnings from GPT-4.5 and an updated pretraining dataset. Also they would'v...
By 2027-2028, pretraining compute might get an unexpected ~4x boost in price-performance above trend. Nvidia Rubin NVL144 CPX will double the number of compute dies per rack compared to the previously announced Rubin NVL144, and there is a May 2025 paper demonstrating BF16 parity of Nvidia's NVFP4 4-bit block number format.
The additional chips[1] in the NVL144 CPX racks don't introduce any overhead to the scale-up networking of the non-CPX chips (they mostly just increase the power consumption), and they don't include HBM, thus it's in principle an extremely cost-effective increase in the amount of compute (if it can find high utilization). It's not useful for decoding/generation (output tokens), but it can be useful for pretraining (as well as the declared purpose of prefill, input token processing during inference). Not being included in a big scale-up world could in principle be a problem early in a large pretraining run, because it forces larger batch sizes, but high-granularity MoE (where many experts are active) can oppose that, and also merely getting into play a bit later in a pretraining run once larger batch sizes are less of a problem might be impactful enough.
Previously...
In general, publicly known training techniques are behind SOTA, so this should be taken into account.
Long reasoning training might fail to surpass pass@50-pass@400 capabilities of the base/instruct model. A new paper measured pass@k[1] performance for models before and after RL training on verifiable tasks, and it turns out that the effect of training is to lift pass@k performance at low k, but also to lower it at high k!
Location of the crossover point varies, but it gets lower with more training (Figure 7, bottom), suggesting that no amount of RL training of this kind lets a model surpass the pass@k performance of the base/instruct model at the crossover point reached with a small amount of RL training. (Would be interesting to know how the pass@k plots depend on the number of reasoning tokens, for models that allow control over the reasoning budget.)
A task is solved at pass@k if an oracle verifier claims at least one of k sampled solutions to be correct. See Figure 3, left in this Jul 2024 paper for how pass@k affects performance, depending on the model. ↩︎
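For concreteness, here's a minimal sketch of the standard unbiased pass@k estimator (the formulation popularized by the Codex paper), plus a toy illustration of the crossover described above; the numbers are made up and not from the paper:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n sampled solutions with c correct:
    1 - C(n-c, k) / C(n, k), the chance at least one of k draws is correct."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Toy crossover: an RL-tuned model that is right more often per sample can still
# lose at high k to a base model whose successes are spread over more problems.
base_correct = [2, 1, 1, 0, 3]     # correct samples out of n=100 on 5 tasks
rl_correct   = [60, 0, 0, 0, 40]   # concentrated successes after RL
for k in (1, 10, 100):
    base = sum(pass_at_k(100, c, k) for c in base_correct) / 5
    rl = sum(pass_at_k(100, c, k) for c in rl_correct) / 5
    print(f"k={k:3d}  base pass@k={base:.2f}  RL pass@k={rl:.2f}")
# RL wins at k=1 and k=10, the base model wins at k=100.
```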
o3 has a different base model (presumably).
All of the figures use the same base model for both the RL-trained and non-RL results.
I would expect "this paper doesn't have the actual optimal methods" to be true; this is specifically a test of PPO on in-distribution actions. Concretely, there is a potential story here where PPO reinforces traces that hit during self-play; consequently, there is a sense in which we would expect it to only select previously on-policy actions.
But if one has enough money, one can finetune GPT models and test that.
Also note that 10k submissions is about 2 OOM out of distribution for the charts in the paper.
Pass@k at infinite k includes every path with nonzero probability (given a policy of discarding exact repeat paths).
We know that RL decreases model entropy, so the first k samples will be more diverse for a high-variance model.
Pass@k takes the best of k samples, and for a normal distribution the expected best of n samples is roughly mean + SD·sqrt(2·ln n).
At very large k, we would expect the spread term to matter more than the mean.
Google might start 2026 with the largest training system among the big labs, by a factor of about 2x, at about 1 GW.
OpenAI/Microsoft Stargate schism suggests that compute being built this year by Microsoft is unlikely to form part of a geographically distributed training system that also includes compute being built at Abilene site. Seems like OpenAI will be building its own training systems (through Stargate), while Microsoft will be serving inference (and possibly generation for RL training, but it remains unclear if it can be an important fraction of pretraining budget in 2025-2026). Thus only 400-600 MW of GB200s by end of 2025 for an OpenAI training system, not 1 GW.
Meta announced a 2 GW datacenter at Richland Parish site, but 1 GW for 2025 seems to be across all datacenters, not for a single training system. So the training system will be smaller by end of 2025.
What actually happens with xAI and Anthropic compute by end of 2025 is less clear. For xAI, 300K B200s figure was mentioned in June 2024. For Anthropic, Amodei said in a recent interview that
I would not be surprised if in 2026 we have more than a million of some kind of chip.
Meanwhile, xAI will have a 200K H100/H200 system, and Anthropic a 400K Trn2 system, which is about 250K H100s worth of FLOP/s (ready by a few months into 2025). The 400-600 MW at Abilene site for OpenAI are 200K-300K B200s, which is about 500K-750K H100s worth of FLOP/s.
GPT-5 should be released late 2025 at the earliest if OpenAI follows the usual naming convention of roughly 100x in raw compute. With GPT-4 at 2e25 FLOPs, GPT-4.5 should have about 2e26 FLOPs and GPT-5 about 2e27 FLOPs. A 100K H100 training system, like the one in Goodyear (or Musk's Memphis datacenter as it was in late 2024), can train a 3e26 FLOPs model, which fits the name of GPT-4.5, but it can't train a 2e27 FLOPs model.
The new Stargate site in Abilene might be preparing to host 200K-300K chips in GB200 NVL72 racks. These chips produce 2.5x more compute than H100s, so 200K would be sufficient to get 2e27 FLOPs and train a GPT-5. If there's already enough power (about 400 MW all-in for 200K chips), shipments of GB200 in bulk start in early 2025, get installed at xAI's pace, and go into pretraining for 4 months, then with 1 more month of post-training it's already November.
So the rumors about GPT-5 in late May 2025 either represent change in the naming convention, or correspond to some intermediate milestone in training GPT-5, likely the training system being in principle ready to start pretraining.
So the rumors about GPT-5 in late May 2025 either represent change in the naming convention
In both ChatGPT and our API, we will release GPT-5 as a system that integrates a lot of our technology, including o3. We will no longer ship o3 as a standalone model.
I think he's pretty plainly saying that this "GPT-5" will be a completely different thing from a 100x'd GPT-4.
if OpenAI follows the usual naming convention of roughly 100x in raw compute.
I doubt this is a real convention. I think OpenAI wanted to call Orion GPT-5 if they thought it was good enough to deserve the name.
Stargate is evidence towards slower training system scaling. The rumored reason for starting the project is that Microsoft isn't building giant frontier training systems fast enough, probably because they aren't seeing the case for doing that faster. In which case other hyperscalers might think similarly, and they are the most well-positioned to build these systems, so this attitude might be indicative of how frontier training systems get built overall, which is notably slower than technically feasible.
The $80bn Microsoft capex is not relevant to this if it goes to many smaller systems[1], which is only natural as there are millions of datacenter GPUs but only a few 100K GPU frontier training systems, a tiny fraction of inference and smaller/research training compute. The $500bn figure is not relevant as for now it's only a vague plan. But Microsoft not agreeing to build training systems on OpenAI's schedule is some evidence.
OpenAI would want to get out from under Microsoft's thumb anyway[2], and this gets ever more difficult over time, since frontier training systems get ever more expensive, so the sooner they try the more likely they are to succeed. But even this consideration is som...
The 50M H100 equivalent compute by 2030 figure tweeted by Musk is on trend (assuming a 2028 slowdown), might cost about $300bn in total (for the training systems built in 2025-2030 for one AI company, including the buildings and power infrastructure).
If the current trend of compute scaling continues to 2028, there will be 160x more compute per training system than the 100K H100s of 2024. It will require 5 GW of power and cost about $140bn in compute hardware and an additional $60bn in buildings, power, and cooling infrastructure[1].
However, if the slowdown starts earlier while still targeting an eventual spend of $100bn per year, and a 5 GW frontier AI training system isn't yet built in 2028-2029 (which seems plausible), building it in 2030 would use the next generation of compute hardware, which will be about 2x more performant for an approximately unchanged cost. This means 320x more compute than the 100K H100s systems of 2024, or 32M H100 equivalent compute. If we sum it up with the preceding generations of frontier AI training systems built for the same company, say 2 GW in 2028 and 1 GW in 2026, this gives us 40M H100 equivalents, which is the same as 50M given the error bars ...
When people are skeptical about the concept of AGI being meaningful or having clear boundaries, it could sometimes be downstream of skepticism about very fast and impactful R&D done by AIs, such as software-only singularity or things like macroscopic biotech where compute buildout happens at a speed impossible for human industry. Such events are needed to serve as landmarks, anchoring a clear concept of AGI, otherwise the definition remains contentious.
So AI company CEOs who complain about AGI being too nebulous to define might already be expecting a scaling slowdown, with their strategy being primarily about the fight for the soul of the 2028-2030 market. When scaling is slow, it'll become too difficult to gain a significant quality advantage sufficient to defeat the incumbents. So the decisive battle is happening now, with the rhetoric making it more palatable to push through the decisions to build the $140bn training systems of 2028.
This behavior doesn't need to be at all related to expecting superintelligence, it makes sense as a consequence of not expecting superintelligence in the near future.
Dario Amodei suggests that in-context learning might suffice for continual learning. The way LLMs do in-context learning with long context is disanalogous to anything humans can do, but a context window of 15M tokens is 500 days of 30K tokens per day, which is more than enough to progress from "first day on the job" to knowing what you are doing with this particular source of tasks. Needs to work mostly with text (if it works at all), or 15M tokens won't be enough, but that could be sufficient.
So this might just be about moving from RAG to including more free-form observations that were historically made by the model itself for the same source of tasks, with massively more tokens of context, and the current memory features of chatbots in the form of long text files might with sufficient scale become the real thing, rather than remaining a dead-end crutch, once these text files get into the habit of accumulating megabytes of observations. And RLVR can plausibly teach the models how to make good use of these very long contexts.
With this year's 14 TB of HBM per GB200 NVL72, very long context windows become more feasible (than with ~1 TB of HBM per node that most current models are still running on), and then there's the next step in 2028 with Rubin Ultra NVL576 systems that have 147 TB of HBM.
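A rough sketch of why HBM per scale-up world is the binding constraint for contexts this long; the model shape here is hypothetical, not any particular deployed model:

```python
# Hypothetical model: 80 layers, GQA with 8 KV heads of head_dim 128, 1-byte KV entries.
layers, kv_heads, head_dim, bytes_per_entry = 80, 8, 128, 1
kv_bytes_per_token = 2 * layers * kv_heads * head_dim * bytes_per_entry   # K and V
context_tokens = 15_000_000

kv_cache_tb = kv_bytes_per_token * context_tokens / 1e12
print(f"{kv_bytes_per_token/1024:.0f} KiB per token, {kv_cache_tb:.1f} TB per 15M-token context")
# ~160 KiB/token and ~2.5 TB for one context: awkward next to model weights in a
# ~1 TB 8-chip node, workable in 14 TB (GB200 NVL72), trivial at 147 TB (Rubin Ultra NVL576).
```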
Long reasoning with MoE models doesn't get cheaper with overtraining, and pretraining data scarcity might make it useful to have even more active params than compute optimal.
Overtraining (less active params than compute optimal) is useful for processing input tokens, but reasoning models want to generate so many output tokens that cheapness of input tokens plausibly becomes relatively unimportant for some use cases. Performance for output tokens depends on total params and KV cache per token, you want total params and hundreds of KV cache contexts to fit in a few nodes (servers, scale-up worlds). Until recently, an 8-chip node of H100/H200/B200 only had 0.7-1.4 TB of HBM, which means that it was manageable to generate output tokens for models with maybe 1-2T total params, using 2-4 nodes, as long as KV cache per token was small enough (which depends on the attention mechanism, and indirectly on model dimension, but plausibly only weakly on the number of active params, in terms of compute optimality).
With GB200 NVL72, we get to 13 TB per rack (and then 20 TB for GB300), an increase of 10x, so plausibly models with 10-20T total params become feasible to run long reasoning on (and tra...
Yi-Lightning (01 AI) Chatbot Arena results are surprisingly strong for its price, which puts it at about 10B active parameters[1]. It's above Claude 3.5 Sonnet and GPT-4o in Math, above Gemini 1.5 Pro 002 in English and Hard Prompts (English). It's above all non-frontier models in Coding and Hard Prompts (both with Style Control), including Qwen-2.5-72B (trained on 18T tokens). Interesting if this is mostly a better methodology or compute scaling getting taken more seriously for a tiny model.
The developer's site says it's a MoE model. Developer's API docs list it at ¥0.99/1M tokens. The currency must be Renminbi, so that's about $0.14. Together serves Llama-3-8B for $0.10-0.18 (per million tokens), Qwen-2.5-7B for $0.30, all MoE models up to 56B total (not active) parameters for $0.60. (The prices for open weights models won't have significant margins, and model size is known, unlike with lightweight closed models.) ↩︎
Superintelligence that both lets humans survive (or revives cryonauts) and doesn't enable indefinite lifespans is a very contrived package. Grading "doom" on concerns centrally about the first decades to centuries of post-AGI future (value/culture drift, successors, the next few generations of humanity) is not taking into account that the next billions+ years is also what could happen to you or people you know personally, if there is a future for originally-humans at all.
(This is analogous to the "missing mood" of not taking superintelligence into account ...
Cultural/moral maturity (in a civilization) has never been observed before, similarly to technological maturity. Scalable production of a new kind of thing brings its abundance in sight, which fails to be a concern earlier, while it couldn't be scaled. A moderate level of AI alignment or of cultural change is not an equilibrium if these things are anchored to scalable resources (effective cognition and coordination, fast subjective serial time). Instead they reach extremes of the kind never observed before those resources become scalable.
Agentic RLVR targeting ability of AI to apply RLVR (or more lightweight finetuning) to itself when appropriate (using something like OpenAI's RL API) potentially gives ARA capabilities and substitutes for more innate hypothetical ways of doing online/continual learning or undergoing on-boarding[1]. Thus ability of AI to "do AI research" is not primarily about RSI or increasing productivity of AI researchers, it's about removing the last important hobble on LLMs that currently causes unrecoverable inability (for a given AI) to do some simple things or truly...
Economics studies the scaling laws of systems of human industry. LLMs and multicellular organisms and tokamaks have their own scaling laws, the constraints ensuring optimality of their scaling don't transfer between these very different machines. A better design doesn't just choose more optimal hyperparameters or introduce scaling multipliers, it can occasionally create a new thing acting on different inputs and outputs, scaling in its own way, barely noticing what holds back the other things.
A reflectively stable agent prefers to preserve some property of itself. This doesn't in general prevent it from being able to self-improve, in the same way that unchanging laws of physics don't prevent presence of self-improving agents in the world.
The content of the world keeps changing under the unchanging laws of how it changes, and similarly a reflectively stable agent (against safety properties) has content (such as beliefs) that keeps changing, in principle enabling unfettered self-improvement. Mesa-agents existing in the form of the content of the ...