Vladimir_Nesov's Shortform

by Vladimir_Nesov
4th Oct 2024
AI Alignment Forum
1 min read
This is a special post for quick takes by Vladimir_Nesov. Only they can create top-level comments. Comments here also appear on the Quick Takes page and All Posts page.
140 comments, sorted by top scoring
[-]Vladimir_Nesov2mo858

There is some conceptual misleadingness with the usual ways of framing algorithmic progress. Imagine that in 2022 the number of apples produced on some farm increased 10x year-over-year, then in 2023 the number of oranges increased 10x, and then in 2024 the number of pears increased 10x. That doesn't mean that the number of fruits is up 1000x in 3 years.

Price-performance of compute compounds over many years, but most algorithmic progress doesn't, it only applies to the things relevant around the timeframe when that progress happens, and stops being applicable a few years later. So forecasting over multiple years in terms of effective compute that doesn't account for this issue would greatly overestimate progress. There are some pieces of algorithmic progress that do compound, and it would be useful to treat them as fundamentally different from the transient kind.
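A toy illustration of the distinction (my own framing, not from the original): treat effective compute as raw compute times the product of the multipliers that still apply at the current frontier.

```python
# Toy model (illustrative only): compounding vs. transient algorithmic progress.
gain_per_year = 10.0   # each year some ingredient improves 10x

# Compounding: every past improvement still applies to the current frontier.
compounding = 1.0
for year in (2022, 2023, 2024):
    compounding *= gain_per_year
print(compounding)   # 1000x

# Transient: only the ingredient relevant to the current frontier still applies
# (apples, then oranges, then pears), so the multipliers don't accumulate.
transient = gain_per_year
print(transient)     # 10x, not 1000x
```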

[-]Buck2mo164

This is a reasonable point in principle, but I don't know how important it is in practice. My sense is that most things identified as algorithmic improvements continue to be algorithmic improvements over the previously-done thing at higher scales? E.g. transformers beating LSTMs, Chinchilla scaling, GeLU over ReLU, probably RL to train reasoning, etc.

8Vladimir_Nesov2mo
I think pretraining data pipeline improvements have this issue: they stop helping with larger models that want more data (or it becomes about midtraining). And similarly for the benchmark-placating better post-training data that enables ever less intelligent models to get good scores, but probably doesn't add up to much (at least when it's not pretraining-scale RLVR).

Things like MoE, GLU over LU, maybe DyT or Muon add up to a relatively modest compute multiplier over the original Transformer. For example Transformer++ vs. Transformer in Figure 4 of the Mamba paper suggests a total compute multiplier of 5x, attained over 6 years since the original Transformer (for dense models). This is emphatically not 3x-4x per year!

Chinchilla scaling is more about careful methodology with compute optimality rather than a specific algorithmic improvement, and even now most demonstrations of compute multipliers fail to take one of its lessons and cool down the models before measurement. This could lead to hilarious results such as Figure 11 of the OLMo 2 paper where an apparent 2x compute multiplier vanishes to nothing after cooling (admittedly, nobody expected this to be a real compute multiplier, but in a more confusing case it could've been taken to be one).
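For reference, annualizing that figure (my arithmetic): a 5x multiplier accumulated over 6 years is only about 1.3x per year.

```python
# 5x total compute multiplier over 6 years, expressed per year.
print(5 ** (1 / 6))   # ~1.31x/year, versus the often-quoted 3x-4x/year
```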
3elifland2mo
In this Epoch paper appendix https://arxiv.org/pdf/2403.05812#page=12.3 they report efficiency improvements across 1.5+ years of time: (a) is faster than your Mamba paper example but still much slower than 3-4x/year. (b) and (c) are at ~4x, though (c) isn't much longer than a year. And these are basically not taking into account post-training efficiency gains iiuc. We're not working with many data points but it seems like these provide an existence proof that gains can compound across at least 3 years. Would love to see some updated data collection on this, I think we could get more evidence on your hypothesis.
3Vladimir_Nesov2mo
Mamba paper uses a relevant kind of methodology, it directly compares different algorithmic ingredients in the same setting, training on a fixed dataset and measuring perplexity (do note it's not trying MoE, so the actual total improvement is greater). It's a way of directly comparing cumulative improvement over all that time.

To impact future frontier capabilities, an algorithmic ingredient from the past needs to be both applicable to the future frontier models, and help with benchmarks relevant to those frontier models, compared to the counterfactual where the frontier model doesn't use the algorithmic ingredient. When an ingredient stops being applicable to the frontier model, or stops being relevant to what's currently important about its capabilities, it's no longer compounding towards frontier capabilities. It wouldn't matter if that same ingredient is helping a different contemporary non-frontier small model to match a much older model with much less compute. Or that it's helping the frontier model to do much better than an older model on a benchmark that used to matter then, but doesn't matter now.

So I'm skeptical of the Epoch paper's overall framing, its willingness to compare everything against everything indirectly, that's a lot of the point I'm making. You mostly can't use methods from 2014 and frontier AI compute from 2025 to train something directly comparable to a lightweight version of a frontier model of 2025 trained on less compute (but still compute optimally), compared in a way that matters in 2025. So what does it mean that there is so and so compute multiplier across all of this time?

At least for Transformer recipes, there is a possibility of comparing them directly if training converges. Also, if we are not even aiming to do Chinchilla optimal training runs, what are we even comparing? For older algorithmic ingredients, you still need to aim for compute optimality to extract a meaningful compute multiplier, even if in the time of those ol
[-]Vladimir_Nesov1y*6736

Recursive self-improvement in AI probably comes before AGI. Evolution doesn't need to understand human minds to build them, and a parent doesn't need to be an AI researcher to make a child. The bitter lesson and the practice of recent years suggest that building increasingly capable AIs doesn't depend on understanding how they think.

Thus the least capable AI that can build superintelligence without human input only needs to be a competent engineer that can scale and refine a sufficiently efficient AI design, in an empirically driven mundane way that doesn't depend on matching capabilities of Grothendieck for conceptual invention. This makes the threshold of AGI less relevant for timelines of recursive self-improvement than I previously expected. With o1 and what straightforwardly follows, we plausibly already have all it takes to get recursive self-improvement, if the current designs get there with the next few years of scaling, and the resulting AIs are merely competent engineers that fail to match humans at less legible technical skills.

7TsviBT1y
The bitter lesson says that there are many things you don't need to understand, but it doesn't say you don't need to understand anything. I think you're doing a "we just need X" with recursive self-improvement. The improvement may be iterable and self-applicable... but is it general? Is it on a bounded trajectory or an unbounded trajectory? Very different outcomes.
2Nathan Helm-Burger10mo
Yeah, although I am bullish on the general direction of RSI, I also think that in the details it factors into many dimensions of improvement. Some of which are likely fast-but-bounded and will quickly plateau, others which are slow-but-not-near-term-bounded... The fact that there are many different dimensions over which RSI might operate makes it hard to predict precisely, but does give some general predictions.  For instance, we might expect it not to be completely blocked (since there will be many independent dimensions along which to apply optimization pressure, so blocking one won't block them all).  Another prediction we might make is that seeing some rapid progress doesn't guarantee that either a complete wall will be hit soon or that progress will continue just as fast or faster. Things might just be messy, with a jagged inconsistent line proceeding up and to the right. Zoom out enough, and it may look smooth, but for our very-relevant-to-us near-term dynamics, it could just be quite noisy. 
6faul_sname1y
Technically this probably isn't recursive self improvement, but rather automated AI progress. This is relevant mostly because 1. It implies that, at least through the early parts of the takeoff, there will be a lot of individual AI agents doing locally-useful compute-efficiency and improvement-on-relevant-benchmarks things, rather than one single coherent agent following a global plan for configuring the matter in the universe in a way that maximizes some particular internally-represented utility function. 2. It means that multi-agent dynamics will be very relevant in how things happen If your threat model is "no group of humans manages to gain control of the future before human irrelevance", none of this probably matters.
[-]Vladimir_Nesov1y117

No group of AIs needs to gain control before human irrelevance either. Like a runaway algal bloom, AIs might be able to bootstrap superintelligence without crossing the threshold of AGI being useful in helping them gain control over this process, any more than humans maintain such control at the outset. So it's not even multi-agent dynamics shaping the outcome; capitalism might just serve as the nutrients until a much higher threshold of capability where a superintelligence can finally take control of this process.

4cubefox1y
Cutting edge AI research is one of the most difficult tasks humans are currently working on, so the intelligence requirement to replace human researchers is quite high. It is likely that most ordinary software development, being easier, will be automated before AI research is automated. I'm unsure whether LLMs with long chains of thought (o1-like models) can reach this level of intelligence before human researchers invent a more general AI architecture.
[-]Vladimir_Nesov1y229

Humans are capable of solving conceptually difficult problems, so they do. An easier path might be possible that doesn't depend on such capabilities, and doesn't stall for their lack, like evolution doesn't stall for lack of any mind at all. If there is more potential for making models smarter alien tigers by scaling RL in o1-like post-training, and the scaling proceeds to 1 gigawatt and then 35 gigawatt training systems, it might well be sufficient to get an engineer AI that can improve such systems further, at 400x and then 10,000x the compute of GPT-4.

Before o1, there was a significant gap, the mysterious absence of System 2 capabilities, with only vague expectation that they might emerge or become easier to elicit from scaled up base models. This uncertainty no longer gates engineering capabilities of AIs. I'm still unsure that scaling directly can make AIs capable of novel conceptual thought, but AIs becoming able to experimentally iterate on AI designs seems likely, and that in turn seems sufficient to eventually mutate these designs towards the remaining missing capabilities.

(It's useful to frame most ideas as exploratory engineering rather than forecasting. The question of whe... (read more)

[-]Alexander Gietelink Oldenziel1y1512

Cutting edge AI research seems remarkably and surprisingly easy compared to other forms of cutting edge science. Most things work on the first try, clever insights aren't required, it's mostly an engineering task of scaling compute. 

3bohaska1y
This seems like the sort of R&D that China is good at: research that doesn't need superstar researchers and that is mostly made of incremental improvements. But yet they don't seem to be producing top LLMs. Why is that?
7Alexander Gietelink Oldenziel1y
China is producing research in a number of areas right now that is surpassing the West and arguably more impressive scientifically than producing top LLMs. A big reason China is lagging a little bit might be political interference at major tech companies. Xi Jinping instigated a major crackdown recently. There is also significantly less Chinese text data. I am not a China or tech expert so these are just guesses. In any case, I wouldn't assign it too much significance. The AI space is just moving so quickly that even a minor one-year delay can seem like light-years. But that doesn't mean that Chinese companies can't do it, or that a country-continent with 1.4 billion people and a history of many technological firsts can't scale up a transformer.
2Tomás B.1y
@gwern
[-]Vladimir_Nesov5mo635

The speed of scaling pretraining will go down ~3x in 2027-2029, reducing probability of crossing transformative capability thresholds per unit of time after that point, if they'd not been crossed yet by then.

GPT-4 was trained in 2022 at ~2e25 FLOPs, Grok-3 and GPT-4.5 were trained in 2024 at ~3e26 FLOPs (or twice that in FP8) using ~100K H100s training systems (which cost ~$4-5bn to build). In 2026, Abilene site of Crusoe/Stargate/OpenAI will have 400K-500K Blackwell chips in NVL72 racks (which cost ~$22-35bn to build), enough to train a ~4e27 FLOPs model. Thus recently there is a 2-year ~6x increase in cost for a frontier training system and a 2-year ~14x increase in compute. But for 2028 this would mean a $150bn training system (which is a lot, so only borderline plausible), and then $900bn in 2030. At that point AI companies would need to either somehow figure out how to pool resources, or pretraining will stop scaling before 2030 (assuming AI still doesn't hit a transformative commercial success).

If funding stops increasing, what we are left with is the increase in price performance of ~2.2x every 2 years, which is ~3.3x slower than the 2-year ~14x at the current pace. (I'm estimating price performance for a whole datacenter or at least a rack, rather than only for chips.)
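A rough restatement of this extrapolation (the midpoint cost is my choice, not from the comment):

```python
import math

cost_2026 = 25e9                       # ~$22-35bn Abilene-class training system
print(cost_2026 * 6 / 1e9)             # ~$150bn for a 2028 system at ~6x cost per 2 years
print(cost_2026 * 6 * 6 / 1e9)         # ~$900bn in 2030

# If funding flattens, scaling falls back to price-performance (~2.2x per 2 years),
# slower than the current ~14x per 2 years in log-compute per unit time by:
print(math.log(14) / math.log(2.2))    # ~3.3x
```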

4ryan_greenblatt5mo
We also hit limits on fab capacity without constructing a bunch more fabs around a similar time.

----------------------------------------

Price performance of 2.2x per year feels aggressive to me. The chip-only trend is more like 1.35x/year from my understanding. Do you think the ML chip trend is much faster than this? I don't see how you could have a 2.2x price drop per year longer term without chip price performance following, as eventually chips will be the bottleneck even if other costs (e.g., interconnect, building datacenters) are dropping. Edit: this was 2.2x every 2 years, I was just confused.
6Vladimir_Nesov5mo
If I'm reading the relevant post correctly, it's 1.35x FP32 FLOP/s per GPU per year (2x in 2.3 years), which is not price-performance[1]. The latter is estimated to be 1.4x FP32 FLOP/s per inflation-adjusted dollar (2x in 2.1 years). It's 2.2x per 2 years, which is 1.5x per year, though that's still more than 1.4x per year. I'm guessing packaging is part of this, and also Nvidia is still charging a giant margin for the chips, so the chip manufacturing cost is far from dominating the all-in datacenter cost. This might be enough to sustain 1.5x per year a bit beyond 2030 (the discrepancy of 1.5/1.4 only reaches 2x after 10 years). But even if we do get back to 1.4x/year, that only turns the 3.3x reduction in speed of pretraining scaling into 3.9x reduction in speed, so the point stands.

----------------------------------------

1. Incidentally, the word "GPU" has recently lost all meaning, since Nvidia started variably referring to either packages with multiple compute dies in them as GPUs (in Blackwell), or to individual compute dies (in Rubin). Packaging will be breaking trends for FLOP/s per package, but also FLOP/s per compute die, for example Rubin seems to derive significant advantage per compute die from introducing separate smaller I/O dies, so that the reticle sized compute dies become more specialized and their performance when considered in isolation might improve above trend. ↩︎
3ryan_greenblatt5mo
Oh oops, I just misread you, didn't realize you said 2.2x every 2 years, nvm.
[-]Vladimir_Nesov3mo623

Building frontier AI datacenters costs significantly more than their servers and networking. The buildings and the power aren't a minor cost because older infrastructure mostly can't be reused, similarly to how a training system needs to be built before we can talk about the much lower cost of 4 months of its time.

Apparently Crusoe's part in the Stargate Abilene datacenters is worth $15bn, which is only the buildings, power (substations and gas generators), and cooling, but not the servers and networking (Oracle is taking care of that). With 400K chips in GB200 NVL72 racks (which is 5.6K racks), at maybe $4M per rack or $5M per rack together with external-to-racks networking[1] ($70K per chip all-in on compute hardware), that's about $27bn, a figure that's comparable to the $15bn for the non-compute parts of the datacenters.

This makes the funding burden significantly higher ($7.5M per rack or $105K per chip), so that the Stargate Abilene site alone would cost about $40-45bn and not only $25-30bn. I'm guessing the buildings and the power infrastructure are not usually counted because they last a long time, so the relatively small time cost of using them (such as paying for electrici... (read more)
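Restating the cost arithmetic under the assumptions above (approximate):

```python
chips = 400_000
racks = chips / 72             # GB200 NVL72 racks -> ~5.6K
compute_cost = 27e9            # ~$5M/rack incl. external networking (~$70K/chip)
non_compute = 15e9             # Crusoe's buildings, power, and cooling
total = compute_cost + non_compute
print(total / racks / 1e6)     # ~$7.5M per rack all-in
print(total / chips / 1e3)     # ~$105K per chip all-in
```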

6ozziegooen3mo
I found this analysis refreshing and would like to see more on the GPU depreciation costs.  If better GPUs are developed, these will go down in value quickly. Perhaps by 25% to 50% per year. This seems like a really tough expense and supply chain to manage.  I'd expect most of the other infrastructure costs to depreciate much more slowly, as you mention. 
1anaguma3mo
Why does the building cost so much? Is this more than other buildings of similar size?
1avturchin3mo
This means that straightforward comparison of flops-per-USD between home computer GPU cards and data center flops-per-USD is incorrect. If someone already has a GPU card, they already have a computer and house where this computer stays "for free." But if someone needs to scale, they have to pay for housing and mainframes. Such comparisons of old 2010s GPUs with more modern ones are used to show the slow rate of hardware advances, but they don't take into account the hidden costs of owning older GPUs.
[-]Vladimir_Nesov25d589

It seems more accurate to say that AI progress is linear rather than exponential, as a result of being logarithmic in resources that are in turn exponentially increasing with time. (This is not quantitative, any more than the "exponential progress" I'm disagreeing with[1].)

Logarithmic return on resources means strongly diminishing returns, but that's not actual plateauing, and the linear progress in time is only slowing down according to how the exponential growth of resources is slowing down. Moore's law in the price-performance form held for a really long time; even though it's much slower than the present funding ramp, it's still promising exponentially more compute over time.

And so the progress won't obviously have an opportunity to actually plateau, merely proceed at a slower linear pace, until some capability threshold or a non-incremental algorithmic improvement. Observing the continued absence of the never-real exponential progress doesn't oppose this expectation. Incremental releases are already apparently making it difficult for people to notice the extent of improvement over the last 2.5 years. With 3x slower progress (after 2029-2032), a similar amount of improvement wo... (read more)

[-]ryan_greenblatt24d155

There is a natural sense in which AI progress is exponential: capabilities are increasing at a rate which involves exponentially increasing impact (as measured by e.g. economic value).

[-]Vladimir_Nesov24d130

Exponential increase in total economic value is not specific to AI; any new tech is going to start exponentially (possibly following the startups championing it) before it gets further on the adoption S-curve. The unusual things about AI are that it gets better with more resources (while most other things just don't get better at all in a straightforward scaling law manner), that the logarithm of resources thing leaves the persistent impression of plateauing despite not actually plateauing, and that even if it runs out of the adoption S-curve it still has Moore's law of price-performance to keep fueling its improvement. These unusual things frame the sense in which it's linear/logarithmic.

If the improvement keeps raising the ceiling on adoption (capabilities) fast enough, funding keeps scaling into slightly more absurd territory, but even then it won't go a long way without the kind of takeoff that makes anything like the modern industry obsolete. After the exponential phase of adoption comes to an end, it falls back to Moore's law, which still keeps giving it exponential compute to slowly keep fueling further progress, and in that sense there is some unusual exponential-ness to this. Though probably there are other things with scaling laws of their own that global economic growth (instead of Moore's law) would similarly fuel, even slower.

[-]Thomas Kwa24d158

In many industries cost decreases by some factor with every doubling of cumulative production. This is how solar eventually became economically viable.

5Vladimir_Nesov24d
I guess the cost-quality tradeoff makes AI progress even better described as that of a normal technology. As economies of scale reduce cost, they should also be increasing quality (somewhat interchangeably). It's just harder to quantify, and so most of the discussion will be in terms of cost. But for the purposes of raising the ceiling on adoption (total addressable market), higher quality works as well as lower cost, so the lowering of costs is directly relevant. In this framing, logarithmic improvement of quality with more resources isn't an unusual AI-specific thing either. What remains is the inflated expectations for how quality should be improving cheaply (which is not a real thing, and so leads to the impressions of plateauing with AI, where for other technologies very slow quality improvement would be the default expectation). And Moore's law of price-performance, which is much faster than economic growth. The economies of scale mostly won't be able to notice the growth of the specific market for some post-adoption technology that's merely downstream of the growth of the overall economy. But with AI, available compute would be growing fast enough to make a difference even post-adoption (in 2030s).
8Cole Wyeth24d
Is this true??
[-]Vladimir_Nesov7mo*450

A surprising report by Bloomberg claims 16K GB200[1] by summer 2025 at Abilene site (pilot campus of Stargate) and merely 64K GB200 by end of 2026. This is way too little to be a training system, Colossus already has more compute (200K H100/H200) than the projected 64K GB200 at end of 2026.

If this is correct, OpenAI will be training with Azure rather than Stargate in 2025, so a raw compute GPT-5 (2e27 FLOPs, 100x GPT-4) probably won't be out in 2025 and officially "GPT-5" will mean something else (since it's due "in months" in any case according to Altman). Also, a datacenter with 16K Blackwells only costs about $1bn, and they have more money than this, which suggests Blackwell ramp-up trouble that might delay everyone else as well, though as a lower bound Nvidia reported $11bn in Blackwell sales for Nov 2024 - Jan 2025 (it's "Q4 2025" since their FY 2025 runs to end of Jan 2025).


  1. In principle "16K GB200" might mean more Blackwell chips than 16K, a compute tray has more than one chip, with variants marketed as named products like GB200 NVL4 "superchip", but even at 4 chips per tray/board we still get below 200K H100s in compute. And an NVL72 system has 72 chips (which brings the numbe

... (read more)
[-]romeo7mo120

I think 'GB200' refers to this column (2 Blackwell GPU + 1 Grace CPU) so 16K GB200s ~= 32K B200s ~= 80K H100s. Agree that it is still very low. 

My guess is that Bloomberg's phrasing is just misleading or the reporting is incomplete. For example, maybe they are only reporting the chips Oracle is contributing or something like that. I'd be very surprised if OpenAI don't have access to >200K GB200s ~= 1M H100s by the end of 2025. For reference, that is only ~$20B capex (assuming $100k total cost of ownership per GB200) or roughly 1/4 of what Microsoft alone plan to invest this year.

Once they have just 100K GB200s, that should train 2e27 FLOP in 4 months.[1]

  1. ^

    There's a nice correspondence between H100s and FLOP/month (assuming 40% utilisation and 16-bit precision) of 1e21 FLOP/month/H100. So since 100K GB200s = 500K H100s, that's 5e26 FLOP/month.
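A quick check of that rule of thumb (assuming ~1e15 dense BF16 FLOP/s per H100):

```python
per_h100_month = 1e15 * 0.4 * 30.4 * 24 * 3600   # ~1e21 FLOP/month/H100 at 40% utilisation
print(per_h100_month)

h100_equivalents = 500_000                        # 100K GB200s ~ 500K H100s, per this comment
print(per_h100_month * h100_equivalents * 4)      # ~2e27 FLOP in 4 months
```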

[-]Vladimir_Nesov7mo100

The marketing terminology is inconvenient, a "superchip" can mean 2-GPU or 4-GPU boards and even a 72-GPU system (1 or possibly 2 racks). So it's better to talk in terms of chips (that are not "superchips"), which I think are all B200 run at slightly different clock speeds (not to be confused with B200A/B102/B20 that have 2 times less compute). In GB200, the chips are 2.5x faster than H100/H200 (not 5x faster; so a 200K chip GB200 system has the same compute as a 500K chip H100 system, not a 1M chip H100 system). Power requirements are often a good clue that helps disambiguate, compute doesn't consistently help because it tends to get reported at randomly chosen precision and sparsity[1].

Large scale-up worlds (or good chips) are not necessarily very important in pretraining, especially in the later steps of the optimizer when the critical batch size gets high enough, so it's not completely obvious that a training system will prefer to wait for NVL72 even if other packagings of Blackwell are more available earlier. Inference does benefit from NVL72 a lot, but for pretraining it's just cheaper per FLOP than H100 and faster in wall clock time during the first ~3T tokens when the whole... (read more)

3romeo7mo
That's indeed inconvenient. I was aware of NVL2, NVL4, NVL36, NVL72, but I was under the impression that 'GB200' mentioned on its own always means 2 Blackwells, 1 Grace (unless you add on a 'NVL__'). Are there counterexamples to this? I scanned the links you mentioned and only saw 'GB200 NVL2,' 'GB200 NVL4,' 'GB200 NVL72' respectively.  I was operating on this pretty confidently but unsure where else I saw this described (apart from the column I linked above). On a quick search of 'GB200 vs B200' the first link I found seemed to corroborate GB200 = 2xB200s + 1xGrace CPU. Edit: second link also says: "the Grace-Blackwell GB200 Superchip. This is a module that has two B200 GPUs wired to an NVIDIA Grace CPU..."
5Vladimir_Nesov7mo
"GB200 superchip" seems to be unambiguously Grace+2xB200. The issue is "100K GB200 GPUs" or "100K GB200 cluster", and to some extent "100K GPU GB200 NVL72 cluster". Also, people will abbreviate various clearer forms to just "GB200". I think "100K chip GB200 NVL72 training system" less ambiguously refers to the number of B200s, but someone unfamiliar with this terminological nightmare might abbreviate it to "100K GB200 system".
5romeo7mo
Good point, thanks. Previously I would have pretty confidently read "100K GB200 GPUs," or "100K GB200 cluster" as 200K B200s (~= 500K H100s) but I can see how it's easily ambiguous. Now that I think of it, I remembered this Tom's Hardware article where B200 and GB200 are mistakenly used interchangeably (compare the subtitle vs. the end of the first paragraph)...
[-]Vladimir_Nesov19d380

Crusoe/OpenAI Abilene campus might come online in Feb-Jun 2026. Crusoe CEO said during RAISE Summit 2025 (that took place on 8-9 Jul 2025) that the 6 buildings of phase 2 will "be coming online" in "just over 200 days" (at 7:03 during a panel discussion). If this means 230 days, that's end of Feb 2026. If he really means "coming online", then it becomes available at that time. If he actually means that it's when the last building of 8 from both phases will be ready to install the compute hardware, then it's at least 3-4 months more to do that (judging by xAI's Colossus), possibly May-Jun 2026.

This is plausibly the first 400K chip system in GB200/GB300 NVL72 racks (about 900 MW), which is 10x 100K H100s of 2024 in FLOP/s and 12x H200s in HBM per scale-up world (for GB200, at 14 TB), making models 10x larger in total params feasible to inference or train with a lot of RLVR. Currently only Google plausibly has comparable compute, with their Trillium (TPUv6e) systems that across 256 chips per pod (scale-up world) offer 8 TB of HBM (generally available since Dec 2024 in 100K chip systems). The older TPUv5p from 2023 has even larger pods, but it's unclear if they have enough of them to f... (read more)
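Rough verification of these figures (the per-chip specs are my assumptions, rounded):

```python
chips = 400_000
print(chips * 2.3 / 1e3)        # ~920 MW at ~2.3 kW all-in per Blackwell chip
print(chips * 2.5 / 100_000)    # ~10x the FLOP/s of 100K H100s (GB200 chip ~ 2.5x H100)
print(72 * 192 / 1e3)           # ~13.8 TB HBM per GB200 NVL72 rack (~192 GB per chip)
print(72 * 192 / (8 * 141))     # ~12x the HBM of an 8x H200 server (141 GB per H200)
```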

5kairos_19d
A post going over how much compute each frontier AI lab has will likely be very helpful.
[-]Vladimir_Nesov19d100

Here's a couple of my recent relevant posts (both slightly outdated, in particular see this comment, and the note on Gemini 2 Ultra in another comment under this quick take). Though in this quick take, I'm mostly discussing total params count and HBM capacity per scale-up world, not compute, how it's constraining 2025 AIs beyond compute (so that even 2024 compute fails to find efficient use), and how in 2026 these constraints become less strict.

3anaguma19d
What do you estimate the total params count would be if so?
6Vladimir_Nesov19d
Total params plus the total KV cache for all requests multiplies the cost of output tokens, so there is reason to keep it down, but little reason to make it much smaller than the whole scale-up world, because then it's much smaller than KV cache and stops influencing the cost. And for the most capable models the fraction of input tokens on OpenRouter is not as extreme as for Sonnet 4 (88% for Gemini 2.5 Pro, 92% for GPT-5; though 97% for Opus 4.1, probably due to high cost). So it won't be a factor that motivates fewer active params as with the 8-chip servers and possibly in part with the 6-8 TB systems.

Also, 2025 Google pretraining compute could be significantly greater than 100K H100s (maybe 2-4 100K TPUv6e datacenters, which have the same FLOP/s as 200-400K H100s; pretraining of models that are too large using TPUv6e is fine, just not inference or RLVR). So the compute optimal number of active params could increase to 1.0-1.5T (if my 120 tokens/param estimate is in the ballpark). This asks for at least 4-6T total params, but at least 8-12T for 1:8 sparsity might be more appropriate for a premium model (this would be Gemini 3 Ultra). Which is only 20% of the pod HBM (if in FP8), so maybe even 15-20T (at which point the contribution to the cost of output tokens becomes significant).

I've only recently realized that the reason there is no Gemini 2 Ultra might be because they don't have enough inference capacity for overly large total params models, with TPUv6e only having 8 TB of HBM per pod and TPUv5p either outright insufficient in number or not enough to spare, since they are needed for other things. So it's probably not evidence of Google having made a decision to use less than what they have, as I previously thought. And as TPUv7 changes what they have, they might use it to do more than what they did with Gemini 2. Though if the buildout for TPUv7 won't yet be sufficiently finished in 2025, RLVR and inference will have to wait until later in 2026 (in the mean
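A sketch of the active-params estimate (the training duration and utilization are my assumptions):

```python
import math

h100_equivalents = 300_000    # mid-range of "200-400K H100s" of 2025 pretraining compute
flops = h100_equivalents * 1e15 * 0.4 * (4 * 30 * 24 * 3600)   # ~4 months at 40% utilization
ratio = 120                   # assumed compute-optimal tokens per active param
# Compute optimal: C = 6*N*D with D = ratio*N  =>  N = sqrt(C / (6*ratio))
n_active = math.sqrt(flops / (6 * ratio))
print(n_active / 1e12)        # ~1.3T active params (the 1.0-1.5T range above)
print(8 * n_active / 1e12)    # ~10T total params at 1:8 sparsity (the 8-12T range above)
```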
[-]Vladimir_Nesov6mo*381

Abilene site of Stargate will host 100K-128K chips in GB200 NVL72 racks by this summer, and a total of 400K-512K chips in 2026, based on a new post by Crusoe and a reinterpretation of the recent Bloomberg post in light of the Crusoe post. For 2025, it's less than 200K chips[1], but more than the surprising 16K-32K chips[2] that the Bloomberg post suggested. It can be a training system after all, but training a raw compute "GPT-5" (2e27 FLOPs) by the end of 2025 would require using FP8[3].

The Crusoe post says "initial phase, comprising two buildings at ... 200+ megawatts" and "each building is designed to operate up to 50,000 NVIDIA GB200 NVL72s". Dylan Patel's estimate (at 1:24:42) for all-in datacenter power per Blackwell GPU was 2.0 kW (meaning per chip, or else it's way too much). At GTC 2025, Jensen Huang showed a slide (at 1:20:52) where the estimate is 2.3 kW per chip (100 MW per 85K dies, which is 42.5K chips).

So the "50K GB200 NVL72s" per building from the Mar 2025 Crusoe post can only mean the number of chips (not dies or superchips), and the "100K GPUs" per building from the Jul 2024 Crusoe post must've meant 100K compute dies (which is 50K chips). It... (read more)

[-]Vladimir_Nesov3mo361

It's instrumentally useful for early AGIs to Pause development of superintelligence for the same reasons as it is for humans. Thus preliminary work on policy tools for Pausing unfettered RSI is also something early AGIs could be aimed at, even if it's only half-baked ideas available on the eve of potential takeoff, as the AGIs are proving hard to aim and start doing things for their own reasons.

[-]ryan_greenblatt3mo*110

If (early) scheming-for-long-run-preferences AGIs were in control, they would likely prefer a pause (all else equal). If they aren't, it's very unclear and they very well might not. (E.g., because they gamble that more powerful AIs will share their preferences (edit: share their preferences more than the humans in control do) and they think that these AIs would have a better shot at takeover.)

3Vladimir_Nesov3mo
Ah, I'm thinking the AGIs themselves get closer to being proper stakeholders at that stage, for practical purposes (along the lines of gradual disempowerment), since they do have all the basic AI advantages even if they aren't superintelligent. So humans remaining in control is not centrally the case even if nominally they still are and intent alignment still mostly works. The conditions for such partial loss of control might even be necessary for a Pause project to succeed. If this isn't the case with the first generation of AGIs, it might become the case with the second generation, and so on, reaching an equilibrium at some point once AGIs are sufficiently powerful and in control of the situation to successfully implement a worldwide RSI Pause.
2Vladimir_Nesov3mo
The post I'm framing this around posits enough intent alignment to aim AIs at projects, which doesn't necessarily imply that the AIs aren't powerful enough to accomplish things that seem hopeless with human-only effort within a few years. The point about convergent instrumental use of Pausing RSI for early AGIs is that this might be an easier target to aim the AIs at, all else equal. It's not strictly necessary for this to be a major factor. Mostly I'm pointing out that this is something AIs could be aimed at through intent alignment, convergent motivation or not, which seems counterintuitive for a Pause AI project if not considered explicitly. And thus currently it's worth preparing for. That's how some humans are thinking as well! The arguments are about the same, both for and against. (I think overall rushing RSI is clearly a bad idea for a wide variety of values and personal situations, and so smarter AGIs will more robustly tend to converge on this conclusion than humans do.)
2ryan_greenblatt3mo
Sorry I meant "share their preferences more than the humans in control share their preferences". I agree that this might be how some humans are thinking, but the case for the humans is much more dubious!
4Vladimir_Nesov3mo
The case for AGIs delaying RSI also gets stronger because they don't need to rely on cryonics, so their personal survival is less threatened by delays. Future AIs might have more of an issue here if they have test-time training, with a lot of data forming individual AI personalities for deployment instances (adapting them to particular jobs or sources of tasks), and this data habitually gets thrown away rather than being carefully preserved. People should notice that this might be an important concern, though if it is the AIs might start saying so themselves, and not ignoring this would be easier than spontaneously noticing without even being prompted by the AIs. But currently preserving AIs is more reliable than for humans, and in any case much cheaper. I don't think "both things are AI" is much of an argument about value alignment, given that there is no clear understanding of how either of the AIs work, what values are, how that translates to what we want out of alignment, and so on. The entities on the other side of an RSI process might have very little in common with the first AGIs in their design. If the AIs don't understand how to align the consequences of an RSI process, they are still in a similar boat to humans who don't understand how to align the consequences of an RSI process. It might take AIs less time to figure it out, but if they are not yet too superintelligent, then it could still take a significant time, and so would require a sufficiently serious effort in preventing RSI, such that if this Pause project is at all successful, it could then in principle hold for years or decades.
1Jan Betley3mo
Hmm, "instrumental usefulness" assumes some terminal goal this would lead to. So you're assuming early AGIs will have something like terminal goals. This is itself not very clear (see e.g. here: https://www.lesswrong.com/posts/Y8zS8iG5HhqKcQBtA/do-not-tile-the-lightcone-with-your-confused-ontology). Also it seems that their goals will be something like "I want to do what my developers want me to do", which will likely be pretty myopic, and preventing superintelligence is long-term.
[-]Vladimir_Nesov3mo360

Musk on a Y Combinator podcast, at 42:42 (about AI risk):

I think it most likely will be a good outcome. I guess I sort of agree with Geoff Hinton that maybe it's a 10 to 20 percent chance on annihilation. But look on the bright side, that's 80 to 90 percent probability of a great outcome.

[-]TsviBT3mo5265

OMG! GEOFF! STOP STATING YOUR DEFERENTIAL PROBABILITY without also stating your first-order probability! If your first-order probability is >50% then say so! Otherwise you're making other people (ELON MUSK!) double count evidence from "other people".

https://www.youtube.com/watch?v=PTF5Up1hMhw&t=2283s

https://tsvibt.blogspot.com/2022/09/dangers-of-deferrence.html

3ceba3mo
How significant/influential is Musk's opinion on LessWrong? I had the impression it was on the lower end.
[-]Vladimir_Nesov3mo*111

Musk is in charge of xAI, one of the only 5 companies in the world that both have access to frontier AI training compute and pursue development of AGI (Google DeepMind, OpenAI, Anthropic, xAI, and Meta). So seeing unambiguous "annihilation" with a significant weight in his probability distribution (and also on the record) is a notable development. (In 2023 there was a statement on extinction risk signed by Hassabis, Amodei, and Altman, but it didn't state the weight of the risk, and wasn't signed by Musk or Zuckerberg.)

Edit: The rest of this comment in its original form got out of hand, you can now read it as a post.

9Dakara3mo
He probably doesn't have much influence on the public opinion of LessWrong, but as a person in charge of a major AI company, he is obviously a big player.
8ACCount3mo
He owns xAI, a major AI lab, and has a lot of resources to back it. And before xAI, he was one of the founders at OpenAI. With which he now has an ongoing rivalry. Is he significant/influential as in "if he says something on a topic, that will cause people at LessWrong to change opinions"? Not very. Is he significant/influential to the field of AI as a whole? Yes, very much so. Like with Yann LeCun, his opinions on AI and AI risks are of some importance on those grounds alone.
[-]Vladimir_Nesov8mo350

A MoE transformer can reach the same loss as a compute optimal dense model using 3x-6x less compute, but will need the same amount of data to do it. So compute optimal MoEs don't improve data efficiency, don't contribute to mitigating data scarcity.

A new Jan 2025 paper offers straightforward compute multiplier results comparing dense transformers to MoE at various levels of sparsity, with isoFLOPs for various tokens/parameter ratios, using experiments of up to 1e21 FLOPs per datapoint. Compute multiplier results are in Figure 11, with about 3x compute multiplier for 87% (1:8) sparse MoE over dense, and about 6x-7x compute multiplier for 97% (1:32) sparse MoE (same sparsity as DeepSeek-V3).

But there's a catch. Greater sparsity makes it compute optimal to use fewer active parameters, and therefore more data (training with the same compute). This can be seen on isoFLOP plots in Figure 12, left. As sparsity goes from 0% (dense) to 95% (1:20), compute optimal number of active parameters for their 1e21 FLOPs experiments goes from 2.9B to 1.3B. For 97% (1:32) sparsity, interpolating from experiments on the other compute budgets, the ratio of the number of active parameters seems to be abo... (read more)

5ryan_greenblatt8mo
I agree compute optimal MoEs don't improve data utilization. But, naively you might expect that MoEs can be used to reduce issues with data scarcity at a fixed level of compute by training a much bigger model on a fixed amount of data. As in, because there are returns to both more data and bigger models, you can use MoE to effectively use a much bigger model at the same compute. Like, maybe you would have trained llama-3-405B on 15T tokens. You could instead train an 8 trillion parameter model with 400B active params on 15T tokens and a priori this could perform much better on that same amount of data. (In practice an MoE with X active params is more expensive to train than a dense model with X active params, so you might need to reduce active params somewhat.)
3Vladimir_Nesov8mo
Chinchilla scaling shows that tokens/params ratio for compute optimal models only changes slowly with compute, making it a good anchor to frame other things in terms of. The experiments from this MoE scaling paper show that under fixed data, varying sparsity in MoEs that are compute optimal at that amount of data preserves perplexity. This also seems like a nice principle for framing the way compute optimal models sit in the space of hyperparameters.

With infinite data, isoFLOPs for loss depending on number of active params are parabolas with some minimum point. But with finite data you need to repeat it to train with fewer active params, which damages loss. This moves the minima of isoFLOPs to the right if the minima already required 5x repetition or more. So under data scarcity, compute optimal models have more active params than under infinite data, and the effect gets worse with more compute. This way we maintain the framing of search for compute optimal hyperparameters rather than undertraining.

Now consider the 1e20 FLOPs plot in Figure 12, left. If there's only 2B tokens of training data and no more, all minima already ask for 12-31 epochs, so the distortion that increases loss will move the minima to the right (and up), and move the high sparsity minima further than lower sparsity minima compared to their original (infinite data) locations. The way the isoFLOPs are shaped suggests that 90-95% sparsity might turn out to be optimal here, that is you can only get worse loss with 98+% sparsity at 1e20 FLOPs, however you vary the number of epochs and active params!

This seems counterintuitive, as in an infinite data regime more sparsity only makes things better (if we ignore practical difficulties). But sure, 90% sparsity will still be better than dense, at least until we use even more compute and sparser minima start asking for even more epochs.
2ryan_greenblatt8mo
I'm currently skeptical, and more minimally, I don't understand the argument you're making. Probably not worth getting into. I do think there will be a limit to how sparse you want to go even in the very-high-compute-relative-to-data regime, for various reasons (computational if nothing else). I don't see how these graphs support 90-95% sparsity, but I had a hard time understanding your argument. Regardless, I don't think this argues against my claim; not sure if you were trying to argue against the claim I was making or add context. (Insofar as your argument is true, it does limit the returns from MoE in the regime with little data.)
4Vladimir_Nesov8mo
With 90% sparsity you do get better loss than dense, this is sufficient to broadly carry your argument. But with 98% sparsity (your llama-3-405B variant example has 95% sparsity) you might get worse loss than with 90% when data is scarce, though it'll still be better than dense. The principle about MoE damaging data efficiency (optimal tokens/param ratio) hints that this might be the case even before looking at the experiments.
1Archimedes8mo
Even if it’s the same cost to train, wouldn’t it still be a win if inference is a significant part of your compute budget?
[-]Vladimir_Nesov9mo*353

Chatbot Arena results for DeepSeek-V3 are in. It placed 7th in Overall w/ Style Control, tied with Claude-3.5.Oct-Sonnet, and 3rd in Hard Prompts w/ Style Control, tied with Gemini-2.0-Flash and behind only Claude-3.5.Oct-Sonnet, mysterious Gemini-Exp-1206, o1, and Gemini-2.0-Flash-Thinking.

It's a MoE model with 37B active parameters trained for about 5e24 FLOPs, 10x less compute than Llama-3-405B, 20x less than what could plausibly be extracted from 30K H100s in BF16. The pretraining data is about 15T tokens, so at 400 tokens per active parameter it's very overtrained, that is, not even compute optimal.

It has 256 routed experts per layer, 8 of which get activated per token. These results give some weight to the Feb 2024 paper that predicts that using more granular experts and activating a lot of them per token can give shocking compute multipliers[1], up to 20x-30x, much more than for MoE transformers that only activate 1-2 routed experts per token (Figure 1b). The paper itself only does experiments of up to about 5e19 FLOPs, in particular directly demonstrating a compute multiplier of 2x from using 8 experts per token instead of 2, with the numbers of total and active parameters k... (read more)

[-]Vladimir_Nesov10mo*350

New AWS Trainium 2 cluster offers compute equivalent to 250K H100s[1], and under this assumption Anthropic implied[2] their previous compute was 50K H100s (possibly what was used to train Claude 3.5 Opus).

So their current or imminent models are probably 1e26-2e26 FLOPs (2-4 months on 50K H100s at 40% compute utilization in BF16)[3], and the upcoming models in mid to late 2025 will be 5e26-1e27 FLOPs, ahead of what 100K H100s clusters of other players (possibly except Google) can deliver by that time.


  1. SemiAnalysis gives an estimate of 24-27 kilowatts per 32 Trainium 2 chips, so 200K Trn2s need 150 megawatts. The 7 datacenter buildings in the northern part of the New Carlisle AWS site are 65 megawatts each according to SemiAnalysis. That's enough for 600K Trn2s, so the figure of 400K Trn2s probably refers to those buildings alone, rather than also to the second phase of the project scheduled for next year. At 0.65e15 dense BF16 FLOP/s each, 400K Trn2s produce as much compute as 250K H100s. ↩︎

  2. Anthropic's post: "This cluster will deliver more than five times the computing power used to train our current generation of leading AI models." ↩︎

  3. At 4 months, with $2/hour, this takes $3

... (read more)
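Rough checks of the figures above (my arithmetic, assuming ~1e15 dense BF16 FLOP/s per H100):

```python
h100 = 1e15
month = 30 * 24 * 3600

# "1e26-2e26 FLOPs (2-4 months on 50K H100s at 40% compute utilization in BF16)":
for months in (2, 4):
    print(50_000 * h100 * 0.4 * months * month)   # ~1.0e26 and ~2.1e26 FLOPs

# 400K Trn2 at 0.65e15 dense BF16 FLOP/s each, in H100 equivalents:
print(400_000 * 0.65e15 / h100)                   # ~260K, i.e. the ~250K H100s figure

# 24-27 kW per 32 Trn2 chips -> ~0.8 kW/chip, so 200K chips need:
print(200_000 * (25.5 / 32) / 1e3)                # ~160 MW, i.e. the ~150 MW figure
```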
[-]Daniel Kokotajlo10mo123

Are you saying Anthropic actually has more compute (in the relevant sense) than OpenAI right now? That feels like a surprising claim, big if true.

[-]Vladimir_Nesov10mo*240

For OpenAI, there are currently 3 datacenter buildings[1] near Phoenix Goodyear Airport that Dylan Patel is claiming are 48 megawatts each and filled with H100s, for about 100K H100s. This probably got online around May 2024, the reason for the announcement and the referent of Kevin Scott's blue whale slide.
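A quick consistency check (my arithmetic):

```python
total_mw = 3 * 48                    # three ~48 MW buildings
print(total_mw)                      # ~144 MW
print(total_mw * 1e3 / 100_000)      # ~1.4 kW all-in per GPU, plausible for ~100K H100s
```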

There are claims about a future cluster of 300K B200s and a geographically distributed training system of 500K-700K B200s, but with B200s deliveries in high volume to any given customer might only start in early to mid 2025, so these systems will probably get online only towards end of 2025. In the meantime, Anthropic might have a lead in having the largest cluster, even if they spend less on compute for smaller experiments overall. It might take a while to get it working, but there might be a few months there. And given how good Claude 3.5 Sonnet is, together with the above musings on how it's plausibly merely 4e25 FLOPs based on Dario Amodei's (somewhat oblique) claim about cost, additionally getting compute advantage in training a frontier model could carry them quite far.


  1. There are 4.5 buildings now at that site, but you can see with Google Street View from Litchfield Rd

... (read more)
3romeo10mo
Thanks Vladimir, this is really interesting! Re: OpenAI's compute, I inferred from this NYT article that their $8.7B costs this year were likely to include about $6B in compute costs, which implies an average use of ~274k H100s throughout the year[1] (assuming $2.50/hr average H100 rental price). Assuming this was their annual average, I would've guessed they'd be on track to be using around 400k H100s by now.  So the 150k H100s campus in Phoenix might be only a small fraction of the total compute they have access to? Does this sound plausible? The co-location of the Trainium2 cluster might give Anthropic a short term advantage, though I think its actually quite unclear if their networking and topology will fully enable this advantage. Perhaps the OpenAI Phoenix campus is well-connected enough to another OpenAI campus to be doing a 2-campus asynchronous training run effectively. 1. ^ $6e9 / 365.25d / 24h / $2.5/hr = 274k
4Vladimir_Nesov10mo
Training as it's currently done needs to happen within a single cluster (though this might change soon). The size of the cluster constrains how good a model can be trained within a few months. Everything that isn't training of a frontier model can happen using many smaller clusters, something like 16 to 4096 accelerators each. You can use a lot of these smaller clusters, but they can be sourced from anywhere and built piecemeal at multiple sites with smaller power allocations, while the big training cluster needs to be a single purposefully built system. So I expect the big expenses are inference and many training experiments with smaller models. What I'm discussing here is the big cluster for training frontier models rather than the aggregate of the small clusters for other purposes. See also this comment. Patel's claim is 100K H100s at 150 megawatts.
5Aaron_Scher10mo
I think that's probably wrong, or at least effectively wrong. Gemini 1.0, trained a year ago has the following info in the technical report:  As you note, public distributed training methods have advanced beyond basic data parallelism (though they have not been publicly shown at large model scales because nobody has really tried yet). 
5Vladimir_Nesov10mo
This might require bandwidth of about 300 Tbps for 500K B200s systems (connecting their geographically distributed parts), based on the below estimate. It gets worse with scale. The "cluster" label applied in this context might be a bit of a stretch, for example the Llama 3 24K H100s cluster is organized in pods of 3072 GPUs, and the pods themselves are unambiguously clusters, but at the top level they are connected with 1:7 oversubscription (Section 3.3.1).

Only averaged gradients need to be exchanged at the top level, once at each optimizer step (minibatch). Llama 3 405B has about 1M minibatches with about 6 seconds per step[1], which means latency doesn't matter, only bandwidth. I'm not sure what precision is appropriate for averaging gradients, but at 4 bytes per weight that's 1.6TB of data to be sent each way in much less than 6 seconds, say in 1 second. This is bandwidth of 12 Tbps, which fits in what a single fiber of a fiber optic cable can transmit. Overland cables are laid with hundreds of fibers, so datacenters within the US can probably get at least one fiber of bandwidth between them.

Overly large minibatches are bad for quality of training, and with H100s in a standard setup only 8 GPUs are within NVLink scaleup domains that enable tensor parallelism. If each token sequence is processed on 8 GPUs (at a given stage of pipeline parallelism), that makes it necessary to process 2K sequences at once (Llama 3 only uses 16K GPUs in its training), and with 8K tokens per sequence that's our 16M tokens per minibatch, for 1M minibatches[2]. But if scaleup domains were larger and enabled more tensor parallelism (for an appropriately large model), there would be fewer sequences processed simultaneously for smaller minibatches, so the time between optimizer steps would decrease, from Llama 3 405B's 6 seconds down to less than that, making the necessary gradient communication bandwidth higher.

Some B200s come as NVL72 machines with 72 GPUs per scaleup domain. And
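The gradient-exchange bandwidth estimate, restated (sizes as given above):

```python
params = 405e9
bytes_per_grad = 4                     # gradients averaged at 4 bytes per weight
payload_bits = params * bytes_per_grad * 8
print(payload_bits / 1e12)             # ~13 Tb each way, so ~12-13 Tbps if sent within ~1 second
```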
[-]gwern10mo*195

And in a way, they ought to be rolling in even more compute than it looks because they are so much more focused: Anthropic isn't doing image generation, it isn't doing voice synthesis, it isn't doing video generation... (As far as we know they aren't researching those, and definitely not serving it to customers like OA or Google.) It does text LLMs. That's it.

But nevertheless, an hour ago, working on a little literary project, I hit Anthropic switching my Claude to 'concise' responses to save compute. (Ironically, I think that may have made the outputs better, not worse, for that project, because Claude tends to 'overwrite', especially in what I was working on.)

5Daniel Kokotajlo10mo
I'd guess that the amount spent on image and voice is negligible for this BOTEC?  I do think that the amount spent on inference for customers should be a big deal though. My understanding is that OpenAI has a much bigger userbase than Anthropic. Shouldn't that mean that, all else equal, Anthropic has more compute to spare for training & experiments? Such that if Anthropic has about as much compute total, they in effect have a big compute advantage?
[-]Vladimir_Nesov2mo32-1

OpenAI's gpt-oss-120b might be the first open weights model (implicitly) revealed to be pretrained for 100T-200T tokens. In the section "Pretraining" of the model card, it's said that "The training run for gpt-oss-120b required 2.1 million H100-hours", so probably this is just the GPU-time for pretraining rather than both pretraining and RLVR.

The pretraining precision is unclear, but for a model of this size FP8 is likely. Because H100-hours are mentioned, it couldn't (usefully) have been the MXFP4 that the model ended up with, since H100s can't do FP4 faster than FP8 (but Blackwell can). Also, despite claims that the model was "trained with native MXFP4 precision", the model card also says "We post-trained the models with quantization of the MoE weights to MXFP4 format", suggesting higher precision before post-training.

At 40% utilization, with 2e15 FP8 FLOP/s per H100, 2.1e6 H100-hours give 6e24 FLOPs (3.5x less than the original GPT-4, 2x more than DeepSeek-V3). The model only has 5.1B active params, so this suggests 188T tokens by 6ND rule. If it was pretrained in BF16 for some reason, that's still 94T tokens.
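
A minimal sketch of that estimate (utilization, FP8 throughput, and the 6ND rule are as stated above):

```python
# H100-hours -> pretraining FLOPs -> token count via the 6*N*D rule.
h100_hours = 2.1e6          # from the gpt-oss model card
fp8_flops_per_s = 2e15      # dense FP8 per H100
utilization = 0.40
active_params = 5.1e9

total_flops = h100_hours * 3600 * fp8_flops_per_s * utilization
tokens_fp8 = total_flops / (6 * active_params)

print(f"compute: {total_flops:.1e} FLOPs")             # ~6e24
print(f"tokens (FP8):  {tokens_fp8 / 1e12:.0f}T")      # ~190T
# If pretrained in BF16, the same GPU-hours give half the FLOPs, so half the tokens.
print(f"tokens (BF16): {tokens_fp8 / 2 / 1e12:.0f}T")  # ~95T
```
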

For comparison, a compute optimal 5e26 model pretrained on 100K H100s from 2024 would al... (read more)

4Peter Wildeford2mo
  What is the rationale to overtrain a model this much?
5Jacob_Hilton2mo
The model sizes were likely chosen based on typical inference constraints. Given that, they mostly care about maximizing performance, and aren't too concerned about the compute cost, since training such small models is very affordable for them. So it's worth going a long way into the regime of diminishing returns.
3Vladimir_Nesov2mo
Possibly the model would've been too strong if it had more active params? The number of total (rather than active) params influences the speed/cost of generating tokens, but reducing it too much stops helping at some point as the size of KV caches for all requests in a batch starts dominating. Reducing the number of active params (without changing attention or the number of total params) doesn't influence generation of tokens, but it helps with the speed/cost of processing the initial prompt (or large tool outputs), which can be important for RAG or for loading large parts of a codebase in context. So they might've targeted the number of total params (120B) and a level of benchmark performance, and found that 5.1B active params is when that happens. Not sure if 5.1B active params could really have been a target, but it's a nice 6x compared to the other open weights models, if it really doesn't destroy quality in less easily measurable ways.
2teradimich2mo
What do you think about GPT-5? Is this a GPT-4.5 scale model, but with a lot of RLVR training?
[-]Vladimir_Nesov2mo120

The input token batch price is $0.625, which works for an 850B active param model running in FP4 on GB200 NVL72 priced at $8 per chip-hour with 60% compute utilization (for prefill). If the cost of chip-hours is a third of the capital cost of compute equipment in the first year, and 100K chips of GB200 NVL72 cost $7bn ($5M per rack all-in, with networking), then its chip-hour should cost at least $2.66.
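
A minimal sketch of that arithmetic (the ~10e15 dense FP4 FLOP/s per GB200 chip is my assumption for the throughput figure being used; the other numbers are as stated above):

```python
# Prefill price and chip-hour cost floor under the assumptions above.
active_params = 850e9        # assumed active params
chip_fp4_flops = 10e15       # assumed dense FP4 FLOP/s per GB200 chip
prefill_util = 0.60
price_per_chip_hour = 8.0    # $/chip-hour

tokens_per_hour = chip_fp4_flops * prefill_util * 3600 / (2 * active_params)
print(f"input price: ${price_per_chip_hour / (tokens_per_hour / 1e6):.3f}/1M tokens")  # ~$0.63

rack_cost, chips_per_rack, hours_per_year = 5e6, 72, 8766
# A third of the capital cost recouped in the first year of chip-hours.
chip_hour_floor = (rack_cost / chips_per_rack) / 3 / hours_per_year
print(f"chip-hour cost floor: ${chip_hour_floor:.2f}")  # ~$2.6
```
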

So there is some possibility for gross margin here in principle, even though $8 per chip-hour already sounds very cheap. GCP is selling B200-hours for $11 (a4-highgpu-8g instances), though B200s are also on gpulist for $3-4. Oracle is selling actual GB200 in 4-chip instances for $16 per chip-hour, if I'm reading it right (it's in principle possible it's actually $4 and $16 is for the 4-chip instance as a whole, but GCP's prices for B200 corroborate that $16 could be right for a single chip).

There's the Oct 2024 knowledge cutoff, which is later than Orion should've started training, but in principle this could be for mid-training that got re-applied recently, or they could've just redone the whole run with the learnings from GPT-4.5 and an updated pretraining dataset. Also they would'v... (read more)

1anaguma2mo
8ND may be more accurate, since these pretraining runs usually use gradient checkpointing to reduce memory requirements.
[-]Vladimir_Nesov10d310

By 2027-2028, pretraining compute might get an unexpected ~4x boost in price-performance above trend. Nvidia Rubin NVL144 CPX will double the number of compute dies per rack compared to the previously announced Rubin NVL144, and there is a May 2025 paper demonstrating BF16 parity of Nvidia's NVFP4 4-bit block number format.

The additional chips[1] in the NVL144 CPX racks don't introduce any overhead to the scale-up networking of the non-CPX chips (they mostly just increase the power consumption), and they don't include HBM, thus it's in principle an extremely cost-effective increase in the amount of compute (if it can find high utilization). It's not useful for decoding/generation (output tokens), but it can be useful for pretraining (as well as the declared purpose of prefill, input token processing during inference). Not being included in a big scale-up world could in principle be a problem early in a large pretraining run, because it forces larger batch sizes, but high-granularity MoE (where many experts are active) can oppose that, and also merely getting into play a bit later in a pretraining run once larger batch sizes are less of a problem might be impactful enough.

Previously... (read more)

[-]leogao9d110

in general publicly known training techniques are behind sota, so this should be taken into account.

5romeo8d
Thoughts on whether the >10x lower chip-to-chip interconnect from the CPX chips (PCIe 6.0x16's 128GB/s unidirectional vs. NVLink 5's 1.8TB/s bidirectional) will be a bottleneck blocking them from being that useful in pre-training? 
3Vladimir_Nesov8d
If the pretraining system (built in 2027) is about 2 GW, that's 5K Rubin NVL144 CPX racks, or 8e28 FP4 FLOPs[1] in 4 months at 30% utilization. At 120 tokens/param, this is enough for 10T active params in a compute optimal MoE model. With 150 layers, 8 active experts per layer, and a GLU nonlinearity (3 matrices per FFN block), this gives 50Kx50K matrices. Such transformers would be too large for efficiently generating output tokens on Rubin NVL144 (even in FP4), but might be analogous to GPT-4.5 in that the immediately following hardware, Rubin Ultra NVL576, can efficiently generate output tokens for them. In any case, 5T active params and 20T total seems OK for Rubin NVL144 to generate output tokens (10 TB of HBM out of the 20 TB a rack will have), which gives 37Kx37K matrices.

A Rubin CPX compute die produces 20e15 FP4 FLOP/s[2]. For multiplying square matrices with side N it needs 2N³ FLOPs and has to exchange 3N²/2 bytes with memory. At 2 TB/s GDDR7 bandwidth, this needs N of at least 7500. For processing an FFN block of 3 square matrices with side N, it needs 6N³ FLOPs and has to exchange 2N²/2 bytes on the network in both directions in total. At 0.2 TB/s CX-9 bidirectional bandwidth, this needs N of at least 17K. So there's even enough room for an off-by-2x mistake in these estimates, various matrices actually getting non-square shapes, or models being somewhat smaller.

----------------------------------------

1. The SemiAnalysis estimate of 5.3e18 FLOP/s per Rubin NVL144 CPX rack is indeed based on a different ratio of sparse to dense compute: they are claiming it's 3:2 for Rubin. I didn't yet search for a source for this, but in any case this is in the article and I missed it on first reading, so didn't recall it when my own estimate based on the 2:1 sparse to dense ratio failed to match theirs. ↩︎

2. As in the previous footnote, this is what the announced 30e15 FP4 FLOP/s become after using the 3:2 sparse to dense compute ratio, rather than the 2:1 ratio. ↩︎
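
The balance points in the second paragraph, as a quick Python sketch (all figures as above; FP4 taken as 0.5 bytes per element):

```python
# Matrix sizes at which a Rubin CPX die stops being memory- or network-bound.
die_flops = 20e15      # dense FP4 FLOP/s per CPX compute die (3:2 sparse:dense)
mem_bw = 2e12          # GDDR7 bytes/s
net_bw = 0.2e12        # CX-9 bidirectional bytes/s
bytes_per_elem = 0.5   # FP4

# Square matmul: 2*N^3 FLOPs vs 3*N^2 elements exchanged with memory.
n_mem = 3 * bytes_per_elem * die_flops / (2 * mem_bw)
print(f"compute-bound vs memory for N >= {n_mem:,.0f}")   # ~7,500

# FFN block (3 matrices, N tokens): 6*N^3 FLOPs vs 2*N^2 elements on the network.
n_net = 2 * bytes_per_elem * die_flops / (6 * net_bw)
print(f"compute-bound vs network for N >= {n_net:,.0f}")  # ~17,000
```
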
[-]Vladimir_Nesov5mo313

Long reasoning training might fail to surpass pass@50-pass@400 capabilities of the base/instruct model. A new paper measured pass@k[1] performance for models before and after RL training on verifiable tasks, and it turns out that the effect of training is to lift pass@k performance at low k, but also to lower it at high k!

Location of the crossover point varies, but it gets lower with more training (Figure 7, bottom), suggesting that no amount of RL training of this kind lets a model surpass the pass@k performance of the base/instruct model at the crossover point reached with a small amount of RL training. (Would be interesting to know how the pass@k plots depend on the number of reasoning tokens, for models that allow control over the reasoning budget.)


  1. A task is solved at pass@k if an oracle verifier claims at least one of k sampled solutions to be correct. See Figure 3, left in this Jul 2024 paper for how pass@k affects performance, depending on the model. ↩︎
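
For concreteness, here is the standard unbiased pass@k estimator (as popularized by the Codex paper; a generic sketch, not code from the papers above): given n sampled solutions of which c pass the verifier, it estimates the probability that at least one of k samples passes.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples with c verified correct."""
    if n - c < k:
        return 1.0  # every size-k subset contains at least one correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)

print(pass_at_k(200, 30, 1))   # ~0.15
print(pass_at_k(200, 30, 50))  # close to 1
```
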

5Thane Ruthenis5mo
Huh. This is roughly what I'd expected, but even I didn't expect it to be so underwhelming.[1] I weakly predict that the situation isn't quite as bad for capabilities as this makes it look. But I do think something-like-this is likely the case.

1. ^ Of course, moving a pass@400 capability to pass@1 isn't nothing, but it's clearly astronomically short of the Singularity-enabling technique that RL-on-CoTs is touted as.
5ryan_greenblatt5mo
This seems relatively clearly false in the case of competition programming problems. Concretely, o3 with 50 submissions beats o1 with 10k submissions. (And o1 is presumably much better than the underlying instruct model.) I'd guess this paper doesn't have the actual optimal methods.
[-]Mis-Understandings5mo105

o3 has a different base model (presumably).

All of the figures in the paper are base-model-matched between the RL'd and non-RL'd runs.

I would expect "this paper doesn't have the actual optimal methods" to be true; this is specifically a test of PPO on in-distribution actions. Concretely, there is a potential story here: PPO reinforces traces that hit during sampling, so there is a sense in which we would expect it to only select previously on-policy actions.

But with enough money, one could finetune GPT models and test that.

Also note that 10k submissions is about 2 OOM out of distribution for the charts in the paper.

Pass at infinite k includes every path with nonzero probability (if there is a policy of discarding exact repeat paths).

We know that RL decreases model entropy, so the first k samples will be more diverse for a higher-entropy model.

Pass at k is take-best, and for a normal distribution the expected best of n samples grows roughly like mean + SD·√(2·ln n).

At very large k, we would expect variance to matter more than mean.

9Ivan Vendrov5mo
this isn’t evidence against OP? if it’s true that RL lowers pass@k performance for sufficiently large k, we’d certainly expect o1 with 10k submissions to be weaker than base/instruct with 10k submissions.
6Vladimir_Nesov5mo
It's evidence to the extent that the mere fact of publishing Figure 7 (hopefully) suggests that the authors (likely knowing relevant OpenAI internal research) didn't expect that their pass@10K result for the reasoning model is much worse than the language monkey pass@10K result for the underlying non-reasoning model. So maybe it's not actually worse.
7faul_sname5mo
If I'm interpreting the paper correctly, the k at which base models start beating RL'd models is a per-task number, and k can be arbitrarily high for a given task; the 50-400 range was specifically for tasks of the type the authors chose, within a narrow difficulty band.

Let's say you have a base model which performs at 35% on 5 digit addition, and an RL'd model which performs at 99.98%. Even if the failures of the RL'd model are perfectly correlated, you'd need k=20 for base@20 to exceed the performance of fine-tuned@20. And the failures of the RL model won't be perfectly correlated - but this paper claims that the failures of the RL model will be more correlated than the failures of the base model, so the lines will cross eventually, and "eventually" was @50 to @400 in the tasks they tested.

But you could define a task where you pass in 10 pairs of 5 digit numbers and the model must correctly find the sum of each pair. The base model will probably succeed at this task somewhere on the order of 0.35^10 ≈ 0.003% of the time, while the RL'd model should succeed about 99.8% of the time. So for this task we'd expect k in the range of k=220,000 assuming perfectly-correlated failures in the RL model, and higher otherwise.

Also I suspect that there is some astronomically high k such that monkeys at a keyboard (i.e. "output random tokens") will outperform base models for some tasks by the pass@k metric.
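
A minimal sketch of that crossover arithmetic (assuming independent base-model attempts and perfectly correlated RL-model failures; the toy success rates are the ones from this comment):

```python
from math import ceil, log

def k_to_match(base_p: float, rl_pass_at_1: float) -> int:
    """Smallest k with base pass@k = 1 - (1 - base_p)**k >= rl_pass_at_1."""
    return ceil(log(1 - rl_pass_at_1) / log(1 - base_p))

print(k_to_match(0.35, 0.9998))     # single-pair task: 20
print(k_to_match(0.35**10, 0.998))  # 10-pair task: ~225K, i.e. the ~220K above
```
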
6gwern5mo
It would be an extreme bias-variance tradeoff, yes.
3Vladimir_Nesov5mo
The interesting concept in the paper is the location of the crossover point, which seems remarkably stable (for a given task) across specific RL techniques and amount of RL training. It can be measured experimentally for a task by doing a little bit of RL training, and RL@1 performance won't get better than that with more training, so you're unlikely to get the RL model to succeed 99.8% of the time (at pass@1) ever unless the level of performance of the base model at the crossover point with a weak RL model was already higher than 99.8%.

Probably the crossover point for a task depends on things that can be changed (such as strength of the pretrained model, or size/relevance of the verifiable task dataset, or possibly the inference time reasoning budget).

The issue isn't for example as straightforward as losing entropy in RL policy (as a formulation of reduced exploration), since DAPO specifically addresses this issue (otherwise present in vanilla GRPO), but the pass@k plot for DAPO (Figure 7, top) barely moves (compared to other methods), in their experiment it's even slightly worse at the crossover point. So in the context of this paper it remains unclear how to move the plot to reach ever higher base@k performance using RL@1, higher than the ceiling of where base@k already was at the crossover point when comparing with some method at only 100-500 RL steps.
3Thane Ruthenis5mo
Intuitively, this shouldn't matter much. They use some RL-on-CoTs method that works, and I expect its effects are not fundamentally different from optimal methods'. Thus, optimal methods might yield better quantitative results, but similar qualitative results: maybe they'd let you elicit pass@800 capabilities instead of "just" pass@400, but it'd still be just pass@k elicitation for not-astronomical k. Not strongly convinced of that, though.
3Vladimir_Nesov5mo
In the hypothetical where the paper's results hold, reasoning model performance at pass@k will match non-reasoning model performance with the number of samples closer to the crossover point between reasoning and non-reasoning pass@k plots. If those points for o1 and o3 are somewhere between 50 and 10K (say, at ~200), then pass@10K for o1 might be equivalent to ~pass@400 for o1's base model (looking at Figure 2), while pass@50 for o3 might be equivalent to ~pass@100 for its base model (which is probably different from o1's base model). So the difference of 200x (10K vs. 50) in the number of samples becomes much smaller when comparing performance of the base models. For GPT-4o vs. GPT-4.1, a difference of ~4x in the number of samples doesn't seem too strange. There's also the possibility of distillation from a reasoning variant of GPT-4.5, which could have an even larger effect on pass@k performance at low k (Figure 6, right).
1peterr5mo
If true, would this imply you want a base model to generate lots of solutions and a reasoning model to identify the promising ones and train on those?
[-]Vladimir_Nesov8mo310

Google might start 2026 with the largest training system among the big labs, by a factor of about 2x, at about 1 GW.

OpenAI/Microsoft Stargate schism suggests that compute being built this year by Microsoft is unlikely to form part of a geographically distributed training system that also includes compute being built at Abilene site. Seems like OpenAI will be building its own training systems (through Stargate), while Microsoft will be serving inference (and possibly generation for RL training, but it remains unclear if it can be an important fraction of pretraining budget in 2025-2026). Thus only 400-600 MW of GB200s by end of 2025 for an OpenAI training system, not 1 GW.

Meta announced a 2 GW datacenter at Richland Parish site, but 1 GW for 2025 seems to be across all datacenters, not for a single training system. So the training system will be smaller by end of 2025.

5anaguma8mo
How do Anthropic's and xAI's compute compare over this period?
[-]Vladimir_Nesov8mo220

What actually happens with xAI and Anthropic compute by end of 2025 is less clear. For xAI, a 300K B200s figure was mentioned in June 2024. For Anthropic, Amodei said in a recent interview that

I would not be surprised if in 2026 we have more than a million of some kind of chip.

Meanwhile, xAI will have a 200K H100/H200 system, and Anthropic a 400K Trn2 system, which is about 250K H100s worth of FLOP/s (ready by a few months into 2025). The 400-600 MW at Abilene site for OpenAI are 200K-300K B200s, which is about 500K-750K H100s worth of FLOP/s.

2Lorenzo8mo
For context, average US electricity consumption in 2022 was ~500GW. So these would be ~1% of all US electricity consumption (as an order of magnitude)
[-]Vladimir_Nesov7mo276

GPT-5 should be released late 2025 at the earliest if OpenAI follows the usual naming convention of roughly 100x in raw compute. With GPT-4 at 2e25 FLOPs, GPT-4.5 should have about 2e26 FLOPs and GPT-5 about 2e27 FLOPs. A 100K H100 training system, like the one in Goodyear (or Musk's Memphis datacenter as it was late 2024), can train a 3e26 FLOPs model, which fits the name of GPT-4.5, but it can't train a 2e27 FLOPs model.

The new Stargate site in Abilene might be preparing to host 200K-300K chips in GB200 NVL72 racks. These chips produce 2.5x more compute than H100s, so 200K would be sufficient to get 2e27 FLOPs and train a GPT-5. If there's already enough power (about 400 MW all-in for 200K chips), shipments of GB200 in bulk start in early 2025, get installed at xAI's pace, and go into pretraining for 4 months, then with 1 more month of post-training it's already November.
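
As a rough check on those two numbers (the ~1e15 dense BF16 FLOP/s per H100, 2.5x that per GB200 chip, ~30% utilization, and ~4 months of pretraining are my assumptions, picked to be consistent with the figures above):

```python
SECONDS_4_MONTHS = 4 * 30 * 86400   # ~1e7 s
H100_BF16 = 1e15                    # assumed dense BF16 FLOP/s per H100
UTIL = 0.30                         # assumed training utilization

def train_flops(n_chips: float, flops_per_chip: float) -> float:
    return n_chips * flops_per_chip * UTIL * SECONDS_4_MONTHS

print(f"{train_flops(100e3, H100_BF16):.1e}")        # ~3e26 (100K H100s, GPT-4.5 scale)
print(f"{train_flops(200e3, 2.5 * H100_BF16):.1e}")  # ~1.6e27 (200K GB200 chips, approaching 2e27)
```
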

So the rumors about GPT-5 in late May 2025 either represent a change in the naming convention, or correspond to some intermediate milestone in training GPT-5, likely the training system being in principle ready to start pretraining.

[-]Thane Ruthenis7mo123

So the rumors about GPT-5 in late May 2025 either represent a change in the naming convention

Per Altman:

In both ChatGPT and our API, we will release GPT-5 as a system that integrates a lot of our technology, including o3. We will no longer ship o3 as a standalone model.

I think he's pretty plainly saying that this "GPT-5" will be a completely different thing from a 100x'd GPT-4.

3Vladimir_Nesov7mo
This is perfectly consistent with GPT-5 being 100x GPT-4 compute. Announcing specific features that will go into it suggests they have a prototype, in this case I'm guessing the LLM will itself be trained to decide whether to go into the reasoning mode, triggering it when needed and affordable, like any other tool.
3Thane Ruthenis7mo
I don't see it. He says that GPT-5 will be a system that "integrates o3". This isn't his sloppy way of saying "integrates the reasoning techniques": when he wants to express that idea, he talks about "unifying o-series models and GPT-series models". The wording regarding GPT-5 is consistent with him literally saying that the model o3 will be part of GPT-5. Furthermore, I take "as" in "GPT-5 as a system that integrates a lot of our technology" to mean "GPT-5 is defined as {a system that integrates a lot of our technology, including o3}". Not "GPT-5 will be trained to automatically switch between a standard mode, a reasoning mode, a Deep Research mode, etc.", not even "GPT-5 will be trained to recognize when to fall back to o3, a lesser model", but literally "we're slapping the GPT-5 label on a glorified wrapper over all our current models".
5Vladimir_Nesov7mo
The "glorified wrapper" could still be a 2e27 FLOPs model, it could even be using literal o3 as one of its tools (in addition to all the other tools, with native GPT-5 long reasoning mostly reserved for premium tier). This is in line with the "agents" agenda where better reliability in taking irreversible actions unlocks new use cases, in this case whether to make use of expensive reasoning calls. Since "GPT-4.5" will actually be released rather than skipped, it's less plausible for "GPT-5" to come out shortly after. If it's announced in ~Dec 2025 (the way o3 was), it's still "within months", and then it can actually get released in ~Feb 2026.
2Thane Ruthenis7mo
Hm, fair enough. Seems like a stretch, though, especially given the need to interpret his "ETA in months" as "will be officially announced in months and released in a year".
5Vladimir_Nesov7mo
There was also Murati in Jun 2024 predicting PhD level AI in 18 months. If they succeed in achieving parity with xAI in terms of safety procedures, they might even release a preview checkpoint in Dec 2025 for Pro users. So actual release in a year is not strictly necessary for this hypothesis, it's just closer to what they've done in the past.
[-]Josh You7mo108

if OpenAI follows the usual naming convention of roughly 100x in raw compute.

I doubt this is a real convention. I think OpenAI wanted to call Orion GPT-5 if they thought it was good enough to deserve the name.

4Vladimir_Nesov7mo
I'm merely referring to the historical precedent, whether there are informal commitments in the minds of the leadership is not something I can speak to. This pattern might continue or it might break. What I'm guessing about training system buildout from vague clues seems to be consistent with it continuing, so the naming pattern can be used as another clue to make a point estimate prediction that's more concrete.
[-]Vladimir_Nesov8mo270

Stargate is evidence towards slower training system scaling. The rumored reason for starting the project is that Microsoft isn't building giant frontier training systems fast enough, probably because they aren't seeing the case for doing that faster. In which case other hyperscalers might think similarly, and they are the most well-positioned to build these systems, so this attitude might be indicative of how frontier training systems get built overall, which is notably slower than technically feasible.

The $80bn Microsoft capex is not relevant to this if it goes to many smaller systems[1], which is only natural as there are millions of datacenter GPUs but only a few 100K GPU frontier training systems, a tiny fraction of inference and smaller/research training compute. The $500bn figure is not relevant as for now it's only a vague plan. But Microsoft not agreeing to build training systems on OpenAI's schedule is some evidence.

OpenAI would want to get from under Microsoft's thumb anyway[2], and this gets ever more difficult over time, since frontier training systems get ever more expensive, so the sooner they try the more likely they are to succeed. But even this consideration is som... (read more)

[-]Vladimir_Nesov2mo240

The 50M H100 equivalent compute by 2030 figure tweeted by Musk is on trend (assuming a 2028 slowdown) and might cost about $300bn in total (for the training systems built in 2025-2030 for one AI company, including the buildings and power infrastructure).

If the current trend of compute scaling continues to 2028, there will be 160x more compute per training system than the 100K H100s of 2024. It will require 5 GW of power and cost about $140bn in compute hardware and an additional $60bn in buildings, power, and cooling infrastructure[1].

However, if the slowdown starts earlier while still targeting an eventual spend of $100bn per year, and a 5 GW frontier AI training system isn't yet built in 2028-2029 (which seems plausible), building it in 2030 would use the next generation of compute hardware, which will be about 2x more performant for an approximately unchanged cost. This means 320x more compute than the 100K H100s systems of 2024, or 32M H100 equivalent compute. If we sum it up with the preceding generations of frontier AI training systems built for the same company, say 2 GW in 2028 and 1 GW in 2026, this gives us 40M H100 equivalents, which is the same as 50M given the error bars ... (read more)

1anaguma2mo
By power, do you mean the cost of electrical equipment etc.? The cost of the energy itself is relatively small. The average price of electricity in the US is $0.13/kWh, which is $36.11/GJ. So even if you had a 5 GW datacenter running continuously for a year, the energy cost is only $5.7bn.
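
That arithmetic as a short sketch:

```python
price_per_gj = 0.13 / 3.6e-3            # $/GJ, since 1 kWh = 3.6 MJ
energy_gj = 5e9 * 365.25 * 86400 / 1e9  # GJ for a year of continuous 5 GW
print(f"${price_per_gj:.2f}/GJ")                              # ~$36.11/GJ
print(f"${energy_gj * price_per_gj / 1e9:.1f}bn per year")    # ~$5.7bn
```
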
5Vladimir_Nesov2mo
Power infrastructure that might need to be built is gas generators or power plants, substations, whatever the buildings themselves need. Generators are apparently added even when not on-paper strictly necessary, as backup power. They are also faster to set up than GW-scale grid interconnection, so could be important for these sudden giant factories where nobody is quite sure 4 years in advance that they will be actually built at a given scale.

Datacenter infrastructure friction and cost will probably both smooth out the slowdown and disappear as a funding constraint for AI companies in the years following the slowdown. Compute hardware is rotated every few years, so at some point you don't need new datacenters and accompanying infrastructure to set up a new generation of compute hardware, you just reuse an existing datacenter site that hosted old hardware. Also, any related datacenters that didn't have excessive inter-site dark fiber will at some point set it up, so even increasing the scale will be less dependent on having everything at one site. This makes the infrastructure costs a much smaller fraction of the cost of a frontier AI training system, and there will no longer be friction.

The infrastructure or even hardware costs in principle don't need to be paid by the AI company upfront, but either the market as a whole or the specific AI company (as a tenant) needs to sufficiently assure the developer (that builds and owns the non-IT infrastructure) and the cloud provider (that installs and owns compute hardware) to commit to the project.

My sense is that the estimates for the cost of a year of GPU-time for frontier compute end up at about a third of the cost of compute hardware. So access to a new $200bn training system that has $140bn worth of compute hardware (which only remains cutting edge for 2 years) will cost the tenant $45bn per year, even though the total capital expenditure is $100bn per year during the initial infrastructure buildout, and in later yea
[-]Vladimir_Nesov5mo132

When people are skeptical about the concept of AGI being meaningful or having clear boundaries, it could sometimes be downstream of skepticism about very fast and impactful R&D done by AIs, such as software-only singularity or things like macroscopic biotech where compute buildout happens at a speed impossible for human industry. Such events are needed to serve as landmarks, anchoring a clear concept of AGI, otherwise the definition remains contentious.

So AI company CEOs who complain about AGI being too nebulous to define might already be expecting a scaling slowdown, with their strategy being primarily about the fight for the soul of the 2028-2030 market. When scaling is slow, it'll become too difficult to gain a significant quality advantage sufficient to defeat the incumbents. So the decisive battle is happening now, with the rhetoric making it more palatable to push through the decisions to build the $140bn training systems of 2028.

This behavior doesn't need to be at all related to expecting superintelligence, it makes sense as a consequence of not expecting superintelligence in the near future.

2Noosphere895mo
As someone who thinks superintelligence could come in the near future, I basically agree with @snewman's view that AIs have to automate the entire economy, or automate a sector that could then automate everything else very fast. Unfortunately for us, this basically gives us no good fire alarms for AGI, unless @Ege Erdil and @Matthew Barnett et al are right that takeoff is slow enough that most value comes from broad automation and external use dominates internal use: https://amistrongeryet.substack.com/p/defining-agi
1LWLW5mo
I think short timelines just don’t square with the way intelligence agencies are behaving. The NSA took Y2K more seriously than it currently seems to be taking near-term AGI. You can make the argument that intelligence agencies are less competent than they used to be, but I don’t buy that they aren’t at least extremely paranoid and moderately competent: that seems like their job.
9Thane Ruthenis5mo
Researchers at AGI labs seem to genuinely believe the hype they're selling, a significant fraction of non-affiliated top-of-the-line DL researchers is inclined to believe them as well, and basically all competent well-informed people agree that the short-timelines position is not unreasonable to hold. Dismissing short timelines based on NSA's behavior requires assuming that they're much more competent in the field of AI than everyone in the above list. After all, that'd require them to be strongly (and correctly) confident that all these superstar researchers above are incorrect. While that's not impossible, it seems highly unlikely to me. Much more likely that they're significantly less competent, and accordingly dismissive.
3LWLW4mo
This is a late reply, but at least from this article, it seems like Ilya Sutskever was running out of confidence that OpenAI would reach AGI by mid 2023. Additionally, if the rumors about GPT-5 are true, it's mainly going to be a unification of existing models rather than something entirely new. Combined with the GPT-4.5 release, it sure seems like progress at OpenAI is slowing down rather than speeding up. How do you know that researchers at AGI labs genuinely believe what they're saying? Couldn't the companies just put pressure on them to act like they believe Transformative AI is imminent? I just don't buy that these agencies are dismissive without good reason. They've explored remote viewing and other ideas that are almost certainly bullshit. If they are willing to consider those possibilities, I don't know why they wouldn't consider the possibility of current deep learning techniques creating a national security threat. That seems like their job, and they've explored significantly weirder ideas.
5Thane Ruthenis4mo
On what possible publicly-unavailable evidence could they have updated in order to correctly attain such a high degree of dismissiveness? I could think of three types of evidence:

* Strong theoretical reasons.
  * E. g., some sort of classified, highly advanced, highly empirically supported theory of deep learning/intelligence/agency, such that you can run a bunch of precise experiments, or do a bunch of math derivations, and definitively conclude that DL/LLMs don't scale to AGI.
* Empirical tests.
  * E. g., perhaps the deep state secretly has 100x the compute of AGI labs, and they already ran the pretraining game to GPT-6 and been disappointed by the results.
* Overriding expert opinions.
  * E. g., a large number of world-class best-of-the-best AI scientists with an impeccable track record firmly and unanimously saying that LLMs don't scale to AGI. This requires either a "shadow industry" of AI experts working for the government, or for the AI-expert public speakers to be on the deep state's payroll and lying in public about their uncertainty.

I mean, I guess it's possible that what we see of the AI industry is just the tip of the iceberg and the government has classified research projects that are a decade ahead of the public state of knowledge. But I find this rather unlikely. And unless we do postulate that, I don't see any possible valid pathway by which they could've attained high certainty regarding the current paradigm not working out.

There are two ways we can update on it:

1. The fact that they investigated psychic phenomena means they're willing to explore a wide variety of ambitious ideas, regardless of their weirdness – and therefore we should expect them not to dismiss the AGI Risk out of hand.
2. The fact that they investigated psychic phenomena means they have a pretty bad grip on reality – and therefore we should not expect them to get the AGI Risk right.

I never looked into it enough to know which interpretation is the correct one.
3LWLW4mo
Those are good points. The last thing I'll say drastically reduces the amount of competence required by the government to be dismissive while still being rational: the leading AI labs may already be fairly confident that the current techniques of deep learning won't get to AGI in the near future, so the security agencies know this as well.
4Thane Ruthenis4mo
That would make sense. But I doubt all AGI companies are that good at informational security and deception. This would require all of {OpenAI, Anthropic, DeepMind, Meta, xAI} to decide on the deceptive narrative, and then not fail to keep up the charade, which would require both sending the right public messages and synchronizing their research publications such that the set of paradigm-damning ones isn't public. In addition, how do we explain people who quit AGI companies and remain with short timelines?
3LWLW4mo
I guess I would respond to the first point by saying all of the companies you mentioned have incentive to say they are closing in on AGI even if they aren’t. It doesn’t seem that sophisticated to say “we’re close to AGI” when you’re not. Mark Zuckerberg said that AI would be at the level of a junior SWE this year, and Meta proceeded to release Llama 4. Unless prognosticators at Meta seriously fucked up, the most likely scenario is that Zuckerberg made that comment knowing it was bullshit. And the sharing of research did slow down a lot in 2023, which gave companies cover to not release unflattering results.  And to your last point, it seems reasonable that companies could pressure former employees to act as if they believe AGI is imminent. And some researchers may be emotionally invested in believing that what they worked on is what will lead to superintelligence. And my question for you is: if DeepMind had solid evidence that AGI would be here in 1 year, and if the security agencies had access to DeepMind’s evidence and reasoning, do you believe they would still do nothing?
[-]Vladimir_Nesov2mo120

Dario Amodei suggests that in-context learning might suffice for continual learning. The way LLMs do in-context learning with long context is disanalogous to anything humans can do, but a context window of 15M tokens is 500 days of 30K tokens per day, which is more than enough to progress from "first day on the job" to knowing what you are doing with this particular source of tasks. It needs to work mostly with text (if it works at all), or 15M tokens won't be enough, but text alone could be sufficient.

So this might just be about moving from RAG to including more free-form observations that were historically made by the model itself for the same source of tasks, with massively more tokens of context, and the current memory features of chatbots in the form of long text files might with sufficient scale become the real thing, rather than remaining a dead-end crutch, once these text files get into the habit of accumulating megabytes of observations. And RLVR can plausibly teach the models how to make a good use of these very long contexts.

With this year's 14 TB of HBM per GB200 NVL72, very long context windows become more feasible (than with ~1 TB of HBM per node that most current models are still running on), and then there's the next step in 2028 with Rubin Ultra NVL576 systems that have 147 TB of HBM.

6danielms2mo
Unless I'm totally off-base here, 15M sounds incredibly high for actually useful recall. This is the best source I know about for measuring model context length. Obviously I don't know about private models, but based on the delta between claimed vs. actual, I'm pretty skeptical that the actually useful context length is currently longer than a few hundred thousand tokens.
2Vladimir_Nesov2mo
Not currently, but this is some kind of brute force scaling roadmap for one of the major remaining unhobblings, so it has timeline implications. On last year's hardware, it's not really feasible to go that far anyway, and RLVR is only just waking up. So the first public observations of negative results on this will probably be in 2026, if the actually useful context length fails to improve. And then there's 2028-2029, following up on the 147 TB of Rubin Ultra NVL576 (Nvidia roadmap places it in 2027, which means in 2028 there will be datacenters with it, as well as possibly models trained for it using older hardware, then in 2029 models trained on it). But also, for the purpose of automated adaptation to a source of tasks and feedback (such as a job), it doesn't necessarily need as much fidelity, it only needs to work as well as a human reading some book a year ago, retaining the mental skills but not the words. A context in principle gives the words, but that is not the thing that needs to work.
1danielms2mo
I suppose I'm unsure how fast this can be scaled. I don't have a concrete model here though, so it's probably not worth trying to hash out. I'm not sure that the current summarization/searching approach is actually analogous to this. That said, this is probably making approaches more analogous, so fair point. I would like to see the updated RULER metrics in 2026. Any specific predictions on what a negative vs. positive result would look like in 2026?
2Hopenope2mo
There is a paper that shows the overreliance of in-context learning on superficial clues. It is from 2022, and the tested models are old, so maybe newer ones are doing much better, but maybe it is not really learning, at least by some definitions.
[-]Vladimir_Nesov2mo100

Long reasoning with MoE models doesn't get cheaper with overtraining, and pretraining data scarcity might make it useful to have even more active params than compute optimal.

Overtraining (less active params than compute optimal) is useful for processing input tokens, but reasoning models want to generate so many output tokens that cheapness of input tokens plausibly becomes relatively unimportant for some use cases. Performance for output tokens depends on total params and KV cache per token: you want total params and hundreds of KV cache contexts to fit in a few nodes (servers, scale-up worlds). Until recently, an 8-chip node of H100/H200/B200 only had 0.7-1.4 TB of HBM, which means that it was manageable to generate output tokens for models with maybe 1-2T total params, using 2-4 nodes, as long as KV cache per token was small enough (which depends on the attention mechanism, and indirectly on model dimension, but plausibly only weakly on the number of active params, in terms of compute optimality).

With GB200 NVL72, we get to 13 TB per rack (and then 20 TB for GB300), an increase of 10x, so plausibly models with 10-20T total params become feasible to run long reasoning on (and tra... (read more)
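
A rough sketch of the HBM budget reasoning above, with illustrative numbers (the bytes per param, KV cache per token, and concurrent context count are my placeholder assumptions, not figures from this comment):

```python
def hbm_needed_tb(total_params, bytes_per_param, n_contexts,
                  context_len, kv_bytes_per_token):
    weights_bytes = total_params * bytes_per_param
    kv_bytes = n_contexts * context_len * kv_bytes_per_token
    return (weights_bytes + kv_bytes) / 1e12

# 10T total params in FP4 (0.5 bytes/param), 200 concurrent 64K-token contexts,
# ~100 KB of KV cache per token:
print(f"{hbm_needed_tb(10e12, 0.5, 200, 65536, 100e3):.1f} TB")  # ~6.3 TB, fits a 13 TB NVL72 rack
```
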

[-]Vladimir_Nesov1y*100

Yi-Lightning (01 AI) Chatbot Arena results are surprisingly strong for its price, which puts it at about 10B active parameters[1]. It's above Claude 3.5 Sonnet and GPT-4o in Math, above Gemini 1.5 Pro 002 in English and Hard Prompts (English). It's above all non-frontier models in Coding and Hard Prompts (both with Style Control), including Qwen-2.5-72B (trained on 18T tokens). Interesting if this is mostly a better methodology or compute scaling getting taken more seriously for a tiny model.


  1. The developer's site says it's a MoE model. Developer's API docs list it at ¥0.99/1M tokens. The currency must be Renminbi, so that's about $0.14. Together serves Llama-3-8B for $0.10-0.18 (per million tokens), Qwen-2.5-7B for $0.30, all MoE models up to 56B total (not active) parameters for $0.60. (The prices for open weights models won't have significant margins, and model size is known, unlike with lightweight closed models.) ↩︎

4Vladimir_Nesov11mo
Kai-Fu Lee, CEO of 01 AI, posted on LinkedIn: Assuming it's trained in BF16 with 40% compute utilization, that's a 2e24 FLOPs model (Llama-3-70B is about 6e24 FLOPs, but it's not MoE, so the FLOPs are not used as well). Assuming from per token price that it has 10-20B active parameters, it's trained on 15-30T tokens. So not an exercise in extreme compute scaling, just excellent execution.
[-]Vladimir_Nesov3mo94

Superintelligence that both lets humans survive (or revives cryonauts) and doesn't enable indefinite lifespans is a very contrived package. Grading "doom" on concerns centrally about the first decades to centuries of post-AGI future (value/culture drift, successors, the next few generations of humanity) is not taking into account that the next billions+ years is also what could happen to you or people you know personally, if there is a future for originally-humans at all.

(This is analogous to the "missing mood" of not taking superintelligence into account ... (read more)

0Dagon3mo
I don't disagree, but I think we might not agree on the reason.  Superintelligence that lets humanity survive (with enough power/value to last for more than a few thousand years, whether or not individuals extend beyond 150 or so years) is pretty contrived.    There's just no reason to keep significant amounts of biological sub-intelligence around.
[-]Vladimir_Nesov5mo70

Cultural/moral maturity (in a civilization) has never been observed before, similarly to technological maturity. Scalable production of a new kind of thing brings its abundance in sight, which fails to be a concern earlier, while it couldn't be scaled. A moderate level of AI alignment or of cultural change is not an equilibrium if these things are anchored to scalable resources (effective cognition and coordination, fast subjective serial time). Instead they reach extremes of the kind never observed before those resources become scalable.

2Mateusz Bagiński5mo
Are you trying to say that for any X, instead of X-maturity, we should instead expect X-foom until the marginal returns get too low?
2Vladimir_Nesov5mo
A pre-abundance precedent about X offers poor framing for thinking about the consequences of discovering a scalable process of producing X. Before abundance, it's artisanal and quirky and path-dependent, the extremes are rare and dysfunctional, so people don't worry about it too much. There is security in it looking like an equilibrium, but not being truly settled, so that people can influence things. Abundance brings maturity, changes the character of the equilibrium. So not foom necessarily, just a promise of maturity at some point, which wouldn't have been as salient before there is a scalable process of production. And there is an excuse of ignoring the possibility even longer, because of the total lack of historical precedent (of the associated problems).
1Kaarel5mo
i’d be interested in hearing why you think that cultural/moral/technological/mathematical maturity is even possible or eventually likely (as opposed to one just being immature forever[1]) (assuming you indeed do think that)

----------------------------------------

1. which seems more likely to me ↩︎
2Vladimir_Nesov5mo
I mean "maturity" merely compared to how we view what can currently be happening, such as a baseline level of competence in civilization-level governance, or what the individual people are capable of. Maturity compared to that baseline washes away all the currently relevant fiddly things, replacing them by settled processes. These new processes are truly settled, so whatever new concerns become important then, the new baseline won't be overturned. The analogy with technological maturity is that the laws of physics and ways of getting things done within them is a fixed problem statement, so new baselines of effectiveness get locked in.
[-]Vladimir_Nesov3mo6-1

Agentic RLVR targeting ability of AI to apply RLVR (or more lightweight finetuning) to itself when appropriate (using something like OpenAI's RL API) potentially gives ARA capabilities and substitutes for more innate hypothetical ways of doing online/continual learning or undergoing on-boarding[1]. Thus ability of AI to "do AI research" is not primarily about RSI or increasing productivity of AI researchers, it's about removing the last important hobble on LLMs that currently causes unrecoverable inability (for a given AI) to do some simple things or truly... (read more)

1Person3mo
Do you have specific predictions/intuitions regarding the feasibility of what you describe and how strong the feedback loop could be? Your post being about technical AI R&D automation capabilities kind of immediately made me curious about the timelines, since they're where I'm somewhat worried. Also, would Sakana AI's recent work on adaptive text-to-LoRA systems count towards what you're describing?
[-]Vladimir_Nesov5mo52

Economics studies the scaling laws of systems of human industry. LLMs and multicellular organisms and tokamaks have their own scaling laws, the constraints ensuring optimality of their scaling don't transfer between these very different machines. A better design doesn't just choose more optimal hyperparameters or introduce scaling multipliers, it can occasionally create a new thing acting on different inputs and outputs, scaling in its own way, barely noticing what holds back the other things.

[-]Vladimir_Nesov8mo*40

A reflectively stable agent prefers to preserve some property of itself. This doesn't in general prevent it from being able to self-improve, in the same way that unchanging laws of physics don't prevent presence of self-improving agents in the world.

The content of the world keeps changing under the unchanging laws of how it changes, and similarly a reflectively stable agent (against safety properties) has content (such as beliefs) that keeps changing, in principle enabling unfettered self-improvement. Mesa-agents existing in the form of the content of the ... (read more)

1CstineSublime8mo
Are there pivotal ways this is different from the theories of Enactivism ("Its authors define cognition as enaction, which they in turn characterize as the 'bringing forth' of domains of significance through organismic activity that has been itself conditioned by a history of interactions between an organism and its environment," which at first blush I'd say is a reflectively stable agent modifying or updating beliefs by means of enaction; Enactivism also rejects mind-body duality in favour of a more 'embodied' cognition approach, together with a "deep continuity of the principles of self-organization from the simplest living things to more complex cognitive beings"), particularly autopoiesis? An autopoietic system can be contrasted with an allopoietic system, which creates objects different from itself, like a factory. Most living beings are autopoietic in that they either produce themselves, or things like them, which seems similar to a reflectively stable agent, particularly when we describe the more complicated cognitive beings in autopoietic terms. Luhmann argued that social systems too are self-organizing, self-reproducing systems, which brought the concepts of enactivism from biology and cognitive science into the social sciences.