Vladimir_Nesov


This might require bandwidth of about 300 Tbps for a 500K B200s system (connecting its geographically distributed parts), based on the estimate below. It gets worse with scale.

The "cluster" label applied in this context might be a bit of a stretch, for example the Llama 3 24K H100s cluster is organized in pods of 3072 GPUs, and the pods themselves are unambiguously clusters, but at the top level they are connected with 1:7 oversubscription (Section 3.3.1).

Only averaged gradients need to be exchanged at the top level, once at each optimizer step (minibatch). Llama 3 405B has about 1M minibatches with about 6 seconds per step[1], which means latency doesn't matter, only bandwidth. I'm not sure what precision is appropriate for averaging gradients, but at 4 bytes per weight that's about 1.6 TB of data to be sent each way in much less than 6 seconds, say in 1 second. That's a bandwidth of about 13 Tbps, which fits within what a single fiber of a fiber optic cable can carry. Overland cables are laid with hundreds of fibers, so datacenters within the US can probably get at least one fiber of bandwidth between them.
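
As a sanity check of that estimate, a quick sketch; the 4 bytes per weight and the 1-second transfer window are the assumptions from the paragraph above:

```python
params = 405e9            # Llama 3 405B parameters
bytes_per_weight = 4      # assumed precision for the averaged gradients
transfer_window_s = 1.0   # assumed share of the ~6 s optimizer step spent on the exchange

gradient_bytes = params * bytes_per_weight
bandwidth_tbps = gradient_bytes * 8 / transfer_window_s / 1e12

print(f"gradient size: {gradient_bytes / 1e12:.1f} TB each way")   # ~1.6 TB
print(f"required bandwidth: {bandwidth_tbps:.0f} Tbps")            # ~13 Tbps
```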

Overly large minibatches are bad for quality of training, and with H100s in a standard setup only 8 GPUs are within an NVLink scaleup domain that enables tensor parallelism. If each token sequence is processed on 8 GPUs (at a given stage of pipeline parallelism), that makes it necessary to process 2K sequences at once, and with 8K tokens per sequence that's our 16M tokens per minibatch, for 1M minibatches[2]. But if scaleup domains were larger and enabled more tensor parallelism (for an appropriately large model), there would be fewer sequences processed simultaneously for smaller minibatches, so the time between optimizer steps would decrease below Llama 3 405B's 6 seconds, making the necessary gradient communication bandwidth higher.
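
The minibatch arithmetic above, spelled out (a sketch using the Llama 3 405B figures quoted in this thread):

```python
total_gpus = 16_384           # Llama 3 405B training cluster
gpus_per_sequence = 8         # one NVLink scaleup domain per pipeline stage
tokens_per_sequence = 8192
total_tokens = 16e12          # ~16T training tokens

sequences_per_minibatch = total_gpus // gpus_per_sequence             # 2048, ~2K
tokens_per_minibatch = sequences_per_minibatch * tokens_per_sequence  # ~16.8M, ~16M
minibatches = total_tokens / tokens_per_minibatch                     # ~1M optimizer steps
```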

Some B200s come as NVL72 machines with 72 GPUs per scaleup domain, and with more weights there'll be more data in the gradients for those models. Llama 3 405B has 16Kx53K matrices and 8K token sequences, so at 3 TB/s and 1e15 FLOP/s you need tiles of size at least 1000x1000 to get sufficient arithmetic intensity. The scaleup network is a bit over 3 times slower than HBM, which is almost sufficient to move the results along (and starts to fit if we increase the inner dimension, making the tiles no longer square). So as far as I understand (could be very wrong, without experience to anchor the numbers), in principle there is enough work there for a bit less than 8 times 16 times 53 GPUs (tiling the multiplication of a 16Kx53K matrix by a 53Kx8K matrix in 1Kx1K squares). More than 1000 such GPUs could participate in tensor parallelism for Llama 3 405B if the network could handle it, so in particular the 72 GPUs of an NVL72 are few enough that they could run such multiplications with tensor parallelism.
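
Here's a rough version of the arithmetic intensity argument (a sketch under my reading of it; the 3 TB/s and 1e15 FLOP/s figures are the ones quoted above, and BF16 operands are assumed):

```python
flops_per_s = 1e15         # dense BF16 throughput per GPU (from the text)
hbm_bytes_per_s = 3e12     # HBM bandwidth (from the text)
bytes_per_element = 2      # BF16

# For an n x n output tile with inner dimension k:
#   FLOPs ~ 2*n*n*k, bytes read ~ 2*n*k*bytes_per_element,
# so arithmetic intensity ~ n / bytes_per_element FLOPs per byte.
required_intensity = flops_per_s / hbm_bytes_per_s        # ~333 FLOP/byte
min_tile_side = required_intensity * bytes_per_element    # ~667, i.e. roughly 1Kx1K tiles

# Counting 1Kx1Kx1K partial products in a (16K x 53K) @ (53K x 8K) matmul:
partial_products = 16 * 53 * 8                            # 6784, the "8 times 16 times 53"
```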

With 72 B200s per NVLink domain in a 500K B200s system, that's 7K sequences per minibatch, 3x more than for Llama 3 405B[3]. The compute per second, and so per training run, is larger than with 16K H100s by a factor of 80, so by Chinchilla scaling law a dense model would be about 9 times larger, 3.5T parameters. So the model is 9x larger, processed over 9x more GPUs (per NVLink domain) that are 2.5 times faster, which means an optimizer step is 2.5 times shorter. This assumes that the sequence length stays 8K (if it's higher then so is the time between optimizer steps, reducing the necessary bandwidth). Transmitting gradients for 9x more weights in that time requires bandwidth that's 20 times higher, about 300 Tbps.
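
Putting that scaling argument into numbers (a sketch; the ~2.5x per-GPU speedup of B200 over H100 and the ~13 Tbps Llama 3 baseline come from the estimates above):

```python
h100_flops, b200_flops = 1e15, 2.5e15    # dense BF16 FLOP/s, B200 assumed ~2.5x H100
llama3_gpus, big_gpus = 16_384, 500_000

compute_ratio = big_gpus * b200_flops / (llama3_gpus * h100_flops)   # ~80x
params_ratio = compute_ratio ** 0.5                                  # Chinchilla: ~9x, ~3.5T params

# Optimizer step time ~ params / (GPUs per sequence * per-GPU speed)
step_time_ratio = params_ratio / ((72 / 8) * (b200_flops / h100_flops))  # ~0.4, i.e. ~2.5x shorter

# Gradient traffic scales with params, so bandwidth scales with params / step time
bandwidth_tbps = 13 * params_ratio / step_time_ratio                 # ~290 Tbps, i.e. ~300 Tbps
```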

That's still within the realm of possibility: some ocean-floor cables feature bandwidth on the same order of magnitude, and overland cables should enable more. But it's no longer likely to be trivial; it could require actually laying cables between the datacenter campus sites, which could take a long time to get all the permissions and do the construction.


  1. 16K GPUs at 40% utilization (40% of 1e15 dense BF16 FLOP/s per GPU) for a total of about 4e25 FLOPs, and 16M tokens per minibatch (Table 4) out of about 16T tokens in total. ↩︎

  2. This gives another way of getting the estimate of 6 seconds per step, which doesn't depend on the size of the cluster at all. The compute for 1 sequence is 6 times 405B parameters times 8K tokens, processed by 8 GPUs (at some pipeline parallelism stage), each at a rate of 1e15 FLOP/s with 40% utilization on average, so it takes them 6 seconds to process a sequence. ↩︎

  3. So making NVLink domains 9x larger only kept the problem of large minibatches from getting more than 3 times worse. This is still much better than the 150K sequences per minibatch we'd get if the same compute were assembled as 1.2M H100s with 8 GPUs per NVLink domain. ↩︎
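
For reference, the two per-step timing derivations in footnotes 1 and 2 give the same ~6 seconds (a quick sketch with the numbers quoted there):

```python
# Footnote 1: total training compute spread over ~1M optimizer steps
total_flops, gpus, utilization = 4e25, 16_384, 0.4
step_time_a = total_flops / (gpus * 1e15 * utilization) / 1e6     # ~6.1 s

# Footnote 2: one 8K-token sequence on its 8-GPU pipeline-stage group
sequence_flops = 6 * 405e9 * 8192
step_time_b = sequence_flops / (8 * 1e15 * utilization)           # ~6.2 s
```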

Long reflection is a concrete baseline for indirect normativity. It's straightforwardly meaningful, even if it's unlikely to be possible or a good idea to run in base reality. From there, you iterate to do better.

Path dependence of long reflection could be addressed by considering many possible long reflection traces jointly, aggregating their own judgement about each other to define which traces are more legitimate (as a fixpoint of some voting/preference setup), or how to influence the course of such traces to make them more legitimate. For example, a misaligned AI takeover within a long reflection trace makes it illegitimate, and preventing such is an intervention that improves a trace.

"Locking in" preferences seems like something that should be avoided as much as possible, but creating new people or influencing existing ones is probably morally irreversible, and that applies to what happens inside long reflection as well. I'm not sure that "nonperson" modeling of long reflection is possible, that sufficiently good prediction of long traces of thinking doesn't require modeling people well enough to qualify as morally relevant to a similar extent as concrete people performing that thinking in base reality. But here too considering many possible traces somewhat helps, making all possibilities real (morally valent) according to how much attention is paid to their details, which should follow their collectively self-defined legitimacy. In this frame, the more legitimate possible traces of long reflection become the utopia itself, rather than a nonperson computation planning it. Nonperson predictions of reflection's judgement might steer it a bit in advance of legitimacy or influence decisions, but possibly not much, lest they attain moral valence and start coloring the utopia through their content and not only consequences.

Training as it's currently done needs to happen within a single cluster (though this might change soon). The size of the cluster constrains how good a model can be trained within a few months. Everything that isn't training of a frontier model can happen using many smaller clusters, something like 16 to 4096 accelerators each. You can use a lot of these smaller clusters, but they can be sourced from anywhere and built piecemeal at multiple sites with smaller power allocations, while the big training cluster needs to be a single purposefully built system.

So I expect the big expenses are inference and many training experiments with smaller models. What I'm discussing here is the big cluster for training frontier models rather than the aggregate of the small clusters for other purposes. See also this comment.

the 150k H100s campus in Phoenix

Patel's claim is 100K H100s at 150 megawatts.

For OpenAI, there are currently 3 datacenter buildings[1] near Phoenix Goodyear Airport that Dylan Patel claims are 48 megawatts each and filled with H100s, for about 100K H100s. This probably came online around May 2024, which would be the reason for the announcement and the referent of Kevin Scott's blue whale slide.
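
The two claims are consistent at roughly 1.4 kW of facility power per H100; that per-GPU figure is my assumption, not something from the source:

```python
buildings, mw_per_building = 3, 48
kw_per_h100_all_in = 1.4        # assumed all-in power per H100 (server, networking, cooling)

total_mw = buildings * mw_per_building          # 144 MW, ~150 MW
h100s = total_mw * 1000 / kw_per_h100_all_in    # ~100K H100s
```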

There are claims about a future cluster of 300K B200s and a geographically distributed training system of 500K-700K B200s, but deliveries of B200s in high volume to any given customer might only start in early to mid 2025, so these systems will probably only come online towards the end of 2025. In the meantime, Anthropic might have a lead in having the largest cluster, even if they spend less on compute for smaller experiments overall. It might take a while to get it working, but there might be a few months of advantage there. And given how good Claude 3.5 Sonnet is, together with the above musings on how it's plausibly merely 4e25 FLOPs based on Dario Amodei's (somewhat oblique) claim about cost, additionally gaining a compute advantage in training a frontier model could carry them quite far.


  1. There are 4.5 buildings now at that site, but you can see with Google Street View from Litchfield Rd that in Sep 2024 only the first 3 had walls, so the 4th is probably not yet done. ↩︎

The new AWS Trainium 2 cluster offers compute equivalent to 250K H100s[1], and under this assumption Anthropic implied[2] that their previous compute was equivalent to 50K H100s (possibly what was used to train Claude 3.5 Opus).

So their current or imminent models are probably 1e26-2e26 FLOPs (2-4 months on 50K H100s at 40% compute utilization in BF16)[3], and the upcoming models in mid to late 2025 will be 5e26-1e27 FLOPs, ahead of what 100K H100s clusters of other players (possibly except Google) can deliver by that time.


  1. SemiAnalysis gives an estimate of 24-27 kilowatts per 32 Trainium 2 chips, so 200K Trn2s need 150 megawatts. The 7 datacenter buildings in the northern part of the New Carlisle AWS site are 65 megawatts each according to SemiAnalysis. That's enough for 600K Trn2s, so the figure of 400K Trn2s probably refers to those buildings alone, rather than also to the second phase of the project scheduled for next year. At 0.65e15 dense BF16 FLOP/s each, 400K Trn2s produce as much compute as 250K H100s. ↩︎

  2. Anthropic's post: "This cluster will deliver more than five times the computing power used to train our current generation of leading AI models." ↩︎

  3. At 4 months, with $2/hour, this takes $300 million, which is at odds with the $100 million Dario Amodei gestured at in Jun 2024, but that figure only applies to Claude 3.5 Sonnet, not Opus. So Opus 3.5 (if it does come out) might be a 2e26 FLOPs model, while Sonnet 3.5 might be a 7e25-1e26 FLOPs model. On the other hand, $2 per H100-hour is not an AWS price; at AWS prices Sonnet 3.5 might be capped at 4e25 FLOPs, same as Llama-3-405B. ↩︎
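
A quick check of the figures above and in footnote 1 (a sketch; the 0.65e15 FLOP/s per Trainium 2 and 40% utilization are the assumptions quoted in this comment):

```python
h100_flops = 1e15                 # dense BF16 FLOP/s per H100
trn2_flops = 0.65e15              # per Trainium 2 (footnote 1)

h100_equivalent = 400_000 * trn2_flops / h100_flops    # ~260K, i.e. ~250K H100s
previous_h100s = h100_equivalent / 5                   # "more than five times" => ~50K H100s

seconds_per_month = 30 * 24 * 3600
for months in (2, 4):
    flops = 50_000 * h100_flops * 0.4 * months * seconds_per_month
    print(f"{months} months on 50K H100s: {flops:.1e} FLOPs")   # ~1.0e26 and ~2.1e26
```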

In Transparent Newcomb's Problem, your decision determines whether you existed when you were making the decision. It's not valid to conclude that you exist merely from subjective observation of your own existence, because such observation can take place within counterfactuals, and you wouldn't be able to tell that you are within a counterfactual or actuality other than by reasoning about (or determining) your situation's actuality status. Like math, existence can't be perceived by looking at rocks.

In a single universe interpretation, we can posit biogenesis is rare, but we do know it happened at least once in ~two trillion galaxies worth of stars in ~13 billion years.

So this already doesn't follow: your conclusion is true, but it doesn't require MWI. Even without MWI, observing our own existence doesn't tell us anything about the probability of biogenesis (and the subsequent development of generally intelligent life).

The current scaling speed is created by increasing funding for training projects, which isn't sustainable without continued success. Without this, the speed goes down to the much slower FLOP/dollar trend of improving cost efficiency of compute, making better AI accelerators. The 2 + 4 + 8 years thing might describe a gradual increase in funding, but there are still 2 OOMs of training compute beyond the original GPT-4 that are already baked into the scale of the datacenters being built and haven't yet produced deployed models. We'll only observe this in full by late 2026, so current capabilities don't yet reflect the level that will be reached even before a possible scaling slowdown.

This is DiLoCo (Nov 2023 paper), a local SGD setup where the outer optimizer updates much more rarely (every 100-500 steps of the inner optimizers), requiring much less bandwidth (the outer optimizer keeps Nesterov momentum in its state). The inner optimizers run within individual clusters, and the outer optimizer aggregates updates from individual clusters, using a much slower network that connects the clusters. The experiments were done with models of up to 400M parameters. (See also this paper on asynchronous variants of DiLoCo.)
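
For concreteness, here is a minimal toy sketch of that outer/inner structure (not the paper's code: the model, data, worker count, and learning rates are placeholders, and the real setup runs the inner loops on separate clusters in parallel, with AdamW inside and Nesterov momentum outside as in the paper):

```python
import copy
import torch

torch.manual_seed(0)
global_model = torch.nn.Linear(16, 1)            # stand-in for the real model
num_workers, inner_steps, outer_rounds = 4, 100, 5

outer_opt = torch.optim.SGD(global_model.parameters(), lr=0.7,
                            momentum=0.9, nesterov=True)

for _ in range(outer_rounds):
    snapshot = [p.detach().clone() for p in global_model.parameters()]
    worker_params = []
    for _ in range(num_workers):
        model = copy.deepcopy(global_model)      # each cluster starts from the global weights
        inner_opt = torch.optim.AdamW(model.parameters(), lr=1e-3)
        for _ in range(inner_steps):             # local steps, no cross-cluster traffic
            x, y = torch.randn(32, 16), torch.randn(32, 1)   # placeholder minibatch
            loss = torch.nn.functional.mse_loss(model(x), y)
            inner_opt.zero_grad(); loss.backward(); inner_opt.step()
        worker_params.append([p.detach() for p in model.parameters()])

    # Outer "gradient" is the snapshot minus the average of the workers' weights;
    # only this (one tensor per parameter) crosses the slow inter-cluster network.
    outer_opt.zero_grad()
    for i, p in enumerate(global_model.parameters()):
        avg = torch.stack([wp[i] for wp in worker_params]).mean(dim=0)
        p.grad = snapshot[i] - avg
    outer_opt.step()
```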

The original paper lacks good compute efficiency measurements. The distributed training experiments start from a checkpoint trained for 24K steps, continuing for 64K more steps (to a total of 88K) in various distributed configurations. Even for the non-distributed configuration the perplexity keeps increasing to step 29K (Figure 7b, Figure 9). The compute expended in a non-distributed run between steps 24K and 88K gets expended in an 8-cluster run between steps 24K and 32K, when perplexity barely starts going down from the global maximum. So there is no way of comparing how well an 8-cluster run uses its compute, because the non-distributed experiment stops so early (at 88K steps) that the uninformative poorly optimized early state of the model still dominates the distributed configuration that uses the same amount of compute (at 32K steps).

Prime Intellect first reproduced DiLoCo in Jul 2024 (blog post, paper) on models of up to 1.1B parameters, taking training efficiency measurements. The largest experiment with a 1.1B model runs across 4 nodes that communicate only every 125 steps, and matches perplexity of a similar training run within a single cluster (with communication every step) using 20% more compute (Figure 7, comparing with 4x batch size baseline).

The new 10B model lacks baselines for comparison, so doesn't help with understanding how training efficiency depends on scale, but the results on benchmarks seem similar to those of other models with similar size and number of training tokens (Table 4 in the blog post).

Llama-3-405B is an important anchor for the compute of other models. With 4e25 FLOPs and conservative training techniques, it's about as capable as other current frontier models, so they probably don't use much more compute. If they have better techniques, they need less compute to get similar performance, not more. And they probably didn't train for more than 6 months. At $2 per H100-hour[1], $3 billion buys 6 months of time on 300K H100s. There are no publicly known training systems this large; the first 100K H100s systems only started appearing in the later part of this year. Thus the training cost figures must include smaller experiments that in aggregate eat more compute than the largest training runs, run on the now-ubiquitous smaller clusters that are also used for inference.

So anchoring to the total number of GPUs is misleading about frontier model training, because most GPUs are used for inference and smaller experiments, and the above estimate shows that figures like $3 billion for training are also poor anchors. If instead we look at 20K H100s as the typical scale of the largest clusters from mid 2023 to early 2024, and 4 months as a typical duration of frontier model training, we get $120 million at $2 per H100-hour, or 8e25 dense BF16 FLOPs at 40% compute utilization, only about 2x Llama-3-405B compute. This agrees with Dario Amodei's claim in Jun 2024 that the scale of deployed models is about $100 million.
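
Both anchors can be reproduced with the same simple formula (a sketch assuming $2 per H100-hour, 1e15 dense BF16 FLOP/s per H100, and 40% utilization, as above):

```python
price_per_h100_hour = 2.0
h100_flops = 1e15
utilization = 0.4

def cost_and_compute(gpus, months):
    gpu_hours = gpus * months * 730                      # ~730 hours per month
    cost = gpu_hours * price_per_h100_hour
    flops = gpu_hours * 3600 * h100_flops * utilization
    return cost, flops

print(cost_and_compute(300_000, 6)[0] / 1e9)  # ~2.6, i.e. ~$3 billion for 6 months on 300K H100s
cost, flops = cost_and_compute(20_000, 4)
print(cost / 1e6, flops)                      # ~$120 million and ~8e25 FLOPs for a typical 2023-24 run
```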


  1. This price is only a rough anchor, since training the largest models requires building the training system yourself, which makes the market price of renting fewer GPUs from much smaller clusters not that relevant. ↩︎

A first candidate abstraction is the neuron doctrine: the idea that the mind is governed purely by patterns of neuron spiking. This abstraction seems simple enough (it could be captured with an artificial neural network of the same size: requiring around 10^20 parameters).

The human brain only holds about 200-300 trillion synapses, so getting to 1e20 asks for several hundred thousand parameters per synapse.
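
Spelled out (using the midpoint of the synapse range above and the 10^20 figure from the quoted passage):

```python
params = 1e20             # parameter count from the quoted passage
synapses = 2.5e14         # midpoint of ~200-300 trillion synapses
print(params / synapses)  # ~4e5 parameters per synapse
```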
