# 79

Contemporary GPUs often have very imbalanced memory vs arithmetic operation capabilities. For instance, an H100 can do around 3e15 8-bit FLOP/s, but the speed at which information can move between the cores and the GPU memory is only 3 TB/s. As 8 bits = 1 byte, there is a mismatch of three orders of magnitude between the arithmetic operation capabilities of the GPU and its memory bandwidth.

This imbalance ends up substantially lowering the utilization rate of ML hardware when batch sizes are small. For instance, suppose we have a model parametrized by 1.6 trillion 8-bit floating point numbers. To just fit the parameters of the model onto the GPUs, we'll need at least 20 H100s, as each H100 has a VRAM of 80 GB. Suppose we split our model into 20 layers and use 20-way tensor parallelism: this means that we slice the parameters of the model "vertically", such that the first GPU holds the first 5% of the parameters in every layer, the second GPU holds the second 5%, et cetera.

This sounds good, but now think of what happens when we try to run this model. In this case, roughly speaking, each parameter comes with one addition and one multiplication operation, so we do around 3.2 trillion arithmetic operations in one forward pass. As each H100 does 3e15 8-bit FLOP/s and we have 20 of them running tensor parallel, we can do this in a mere ~ 0.05 milliseconds. However, each parameter also has to be read from GPU memory into the cores, and here our total memory bandwidth is only 60 TB/s, meaning for a model of size 1.6 TB we must spend (1.6 TB)/(60 TB/s) ~= 27 ms on memory reads alone! This bottlenecks inference and we end up with an abysmal utilization rate of approximately (0.05 ms)/(27 ms) ~= 0.2%. This becomes even worse when we also take inter-GPU communication costs into account, since inter-GPU bandwidth would be only around 1 TB/s if the GPUs are using NVLink.
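The arithmetic above can be reproduced in a few lines. This is only a rough sketch using the round numbers from the text; the constant names are made up for illustration, and overheads such as inter-GPU communication are ignored:

```python
# Round numbers taken from the text; all of them are rough assumptions.
PARAMS = 1.6e12               # parameters, one byte each (8-bit)
N_GPUS = 20                   # H100s running tensor parallel
FLOPS_PER_GPU = 3e15          # 8-bit FLOP/s per H100
BYTES_PER_S_PER_GPU = 3e12    # memory bandwidth per H100 (3 TB/s)

# One multiply and one add per parameter in a forward pass.
compute_s = 2 * PARAMS / (N_GPUS * FLOPS_PER_GPU)
# Every parameter byte has to be read from memory once.
memory_s = PARAMS / (N_GPUS * BYTES_PER_S_PER_GPU)
utilization = compute_s / memory_s

print(f"compute time: {compute_s * 1e3:.3f} ms")
print(f"memory time:  {memory_s * 1e3:.1f} ms")
print(f"utilization:  {utilization:.2%}")
```

In this simplified picture the utilization reduces to 2 × (memory bandwidth) / (FLOP rate), so it does not depend on the model size or the number of GPUs.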

Well, this is not very good. Most of our arithmetic operation capability is being wasted because the ALUs spend most of their time idling and waiting for the parameters to be moved to the GPU cores. Can we somehow improve this?

A crucial observation is that if getting the parameters to the GPU cores is the bottleneck, we want to somehow amortize this over many calls to the model. For instance, imagine we could move a batch of parameters to the cores and use them a thousand times before moving on to the next batch. This would do much to remedy the imbalance between memory read and compute times.

If our model is an LLM, then unfortunately we cannot do this for a single user because text is generated serially: even though each token needs its own LLM call and so the user needs to make many calls to the model to generate text, we can't parallelize these calls because each future token call needs to know all the past tokens. This inherently serial nature of text generation makes it infeasible to improve the memory read and compute time balance if only a single user is being serviced by the model.

However, things are different if we get to batch requests from multiple users together. For instance, suppose that our model is being asked to generate tokens by thousands of users at any given time. Then, we can parallelize these calls: every time we load some parameters onto the GPU cores, we perform the operations associated with those parameters for all user calls at once. This way, we amortize the reading cost of the parameters over many users, greatly improving our situation. Eventually this hits diminishing returns because we must also read the hidden state of each user's calls into GPU memory, but the hidden states are usually significantly smaller than the whole model, so parallelization still results in huge gains before we enter this regime.

For instance, if we could batch requests from 100 users together in our above setup, we might be able to achieve a utilization rate of 20% - note that in a realistic setup this would be much lower due to many sources of overhead the simplistic calculation is ignoring, but morally the calculation still gives the right result.
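A minimal sketch of how utilization scales with batch size, under the same simplifying assumptions as the rest of the post: 2 operations per parameter per request, 1000 FLOP of compute available per byte of memory bandwidth, and hidden-state reads ignored. The function name and its defaults are illustrative, not taken from any library:

```python
def utilization(batch_size, hw_flops_per_byte=1000.0, flops_per_param=2.0):
    """Fraction of peak compute used when each weight byte is read once per batch.

    Ignores hidden-state reads and communication, so this is an upper bound.
    """
    ratio = batch_size * flops_per_param / hw_flops_per_byte
    return min(1.0, ratio)  # utilization cannot exceed 100% of peak compute

for batch_size in (1, 10, 100, 500):
    print(f"batch {batch_size:>4}: {utilization(batch_size):.1%}")
```

With these assumptions the model stops being memory-bound at a batch size of about 500, which is the same order of magnitude as the 1000:1 hardware imbalance.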

The result is massive economies of scale not just in training AI models, but also in running them. If an individual user wanted to run a large model at a reasonable speed, they might have to pay a thousand times what they would pay to a centralized API provider which relies on large GPU clusters to batch requests from many different users.

Some simple math on this: if you need 1000 concurrent users for reasonable utilization rates because of the 1000:1 imbalance between ALU ops and memory bandwidth in GPUs, and each user on average spends 10 minutes per day using your service, then each user is online only 10 out of 1440 minutes per day, so you need a total user base of at least (1000 concurrent users) × (1440 minutes/day)/(10 minutes/day) = 144K users. If you also want the service to be consistent, i.e. low latency and high throughput 24 hours a day, you probably need to exceed this by some substantial margin, perhaps even approach 1M total users. This is of course much smaller than the scale of a search engine such as Google, but still probably outside the realm where individual hobbyists or enthusiasts can hope to compete with the cost-effectiveness of centralized providers.
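The user-base estimate can be written out explicitly. This sketch assumes usage is spread perfectly evenly over the day, which is optimistic; the function name is hypothetical:

```python
MINUTES_PER_DAY = 24 * 60

def required_user_base(concurrent_users, minutes_per_user_per_day):
    """Total users needed to keep `concurrent_users` online at all times.

    Assumes usage is spread evenly over the day, so this is a lower bound.
    """
    return concurrent_users * MINUTES_PER_DAY / minutes_per_user_per_day

print(required_user_base(1000, 10))  # 144000.0
```

Real traffic has daily peaks and troughs, which is one reason the actual requirement would sit well above this lower bound.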

The contrast with the human brain is instructive. An H100 GPU draws 700 W of power to do 3e15 8-bit FLOP/s, which we think is similar to the computational power of the brain, though with ~ 30x the power draw. However, an H100 GPU has a mere 80 GB of VRAM, compared to the human brain's storage of the "parameter values" of around ~ 100 trillion synapses, which would probably take up ~ 100 TB of memory. On top of this, the human brain can run a (trivially) human-equivalent intelligence at reasonable latency and throughput at a batch size of one: no parallelization across brains is needed. This suggests the human brain does not suffer from the same memory bandwidth versus arithmetic operation imbalance problem that modern GPUs have.

Whether this imbalance can possibly be cheaply engineered away or not might determine the extent to which the market for AI deployment (which may or may not become vertically disintegrated from AI R&D and training) is dominated by a few small actors, and seems like an important question about hardware R&D. I don't have the expertise to judge to what extent engineering away these memory bottlenecks is feasible and would be interested to hear from people who do have expertise in this domain.


I point this - the VN bottleneck - out now and then.

It's really just a simple consequence of scaling geometry. Compute scales with device surface area (for 2d chips) or volume (for 3d systems like the brain), while bandwidth/interconnect scales with dimension minus one.
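As a toy illustration of that scaling argument (all proportionality constants are set to 1 and the units are arbitrary, so only the trend is meaningful):

```python
def compute_to_bandwidth(side, dims):
    """Ratio of compute to interconnect for a device of linear size `side`."""
    compute = side ** dims           # area for 2D chips, volume for 3D systems
    bandwidth = side ** (dims - 1)   # one dimension lower: perimeter or surface
    return compute / bandwidth

for side in (1, 2, 4, 8):
    print(side, compute_to_bandwidth(side, dims=2), compute_to_bandwidth(side, dims=3))
```

In both the 2D and the 3D case the ratio grows linearly with the linear size of the device, so making the device bigger makes the imbalance worse.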

A few years back VCs were fooled by a number of well meaning startups based on the pitch "We can just make a big matmul chip like a GPU but with far more on chip SRAM and thereby avoid the VN bottleneck!" But Nvidia is in fact pretty smart, and understands why exactly this approach doesn't actually work (at least not yet with SRAM), and much money was wasted.

I used to be pretty excited about neuromorphic computing around 2010 ish. I still am - but today it still seems to be about a decade away.

> A few years back VCs were fooled by a number of well meaning startups based on the pitch "We can just make a big matmul chip like a GPU but with far more on chip SRAM and thereby avoid the VN bottleneck!"

Including Cerebras?

> Whether this imbalance can possibly be cheaply engineered away or not might determine the extent to which the market for AI deployment (which may or may not become vertically disintegrated from AI R&D and training) is dominated by a few small actors, and seems like an important question about hardware R&D. I don't have the expertise to judge to what extent engineering away these memory bottlenecks is feasible and would be interested to hear from people who do have expertise in this domain.

You may know this, but "in-memory computing" is the major search term here. (Or compute-in-memory, or compute-near-memory in the nearterm, or neuromorphic computing for an umbrella over that and other ideas.) Progress is being made, though not cheaply, and my read is that we won't have a consensus technology for another decade or so. Whatever that ends up being, scaling it up could easily take another decade.

On a different topic but responding to the same quote: advancements in quantization of models to significantly reduce model memory consumption for inference without reducing model performance might also mitigate the imbalance between ALU ops and memory bandwidth. This might only shift the problem a few orders of magnitude away, but still, I think it's worth mentioning.
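For a rough sense of the size of this effect, here is a sketch that reuses the round numbers from the post (1.6 trillion parameters, 60 TB/s of total memory bandwidth) and assumes weight-read time is simply proportional to bytes per parameter:

```python
PARAMS = 1.6e12           # parameter count from the post
TOTAL_BANDWIDTH = 60e12   # bytes/s across 20 H100s, from the post

def weight_read_ms(bits_per_param):
    """Milliseconds to stream all weights once at the given precision."""
    total_bytes = PARAMS * bits_per_param / 8
    return total_bytes / TOTAL_BANDWIDTH * 1e3

for bits in (16, 8, 4, 2):
    print(f"{bits:>2}-bit weights: {weight_read_ms(bits):.1f} ms per forward pass")
```

Under this assumption, going from 8-bit to 4-bit weights halves the memory time. Whether the compute side speeds up by a similar factor depends on hardware support for the lower precision, which this sketch does not model.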

Etched.ai is a new DL ASIC startup. Their big idea (podcast interview) seems to be to burn specific Transformer models into the ASIC, and then make full use of the compute while using a specific parameter by running very large batches in parallel:

While we run a transformer model, we have this huge number of parameters. And each one of these parameters is a number. And to use that number, we take in a number from our input. We multiply them together, and we add them to a running total.

So every one of those parameters, in the case of GPT-3, 175 billion, is loaded from memory once and then used in the math operation once. It turns out that loading a thing from memory is way more expensive than doing the math. So how do we solve this problem? Well, we say that these weights are the same across one user or two users or four users or eight users. So you batch together a huge number of queries. Then we load in that weight once, and we use it 16, 32, 64 times.

And that's one of the really interesting things that a transformer ASIC can do. You can have a much, much larger batch, not 64 but 2,500. So we're able to go load that weight in once, pay the expensive price and then amortize that expensive price over a huge number of users, making inference much, much cheaper. And now while this sounds good in theory, this does mean that you have to run that model in a place where you can have a huge number of users all kind of grouped together. So I think that means inference will be centralized in much the same way as training.

The problem with etching specific models is scale. It costs around \$1M to design a custom chip mask, so it needs to be amortized over tens or hundreds of thousands of chips to become profitable. But no companies need that many.

Assume a model takes 3e9 flops to infer the next token, and these chips run as fast as H100s, i.e. 3e15 flops/s. A single chip can infer 1e6 tokens/s. If you have 10M active users, then 100 chips can provide each user a token every 100ms, around 600wpm.

Even OpenAI would only need hundreds, maybe thousands of chips. The solution is smaller-scale chip production. There are startups working on electron beam lithography, but I'm unaware of a retailer Etched could buy from right now.

EDIT: 3 trillion flops/token (similar to GPT-4) is 3e12, so that would be 100,000 chips. The scale is actually there.
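A quick sketch of this chip-count estimate. It assumes perfect utilization of each chip and a rate of 10 tokens per second per user, which matches the 600 wpm figure if a token is roughly a word; the function name is made up for illustration:

```python
def chips_needed(flops_per_token, users, tokens_per_user_per_s, chip_flops_per_s=3e15):
    """Chips required to serve all users, assuming 100% compute utilization."""
    tokens_per_chip_per_s = chip_flops_per_s / flops_per_token
    return users * tokens_per_user_per_s / tokens_per_chip_per_s

print(chips_needed(3e9, 10e6, 10))    # small model: about 100 chips
print(chips_needed(3e12, 10e6, 10))   # 3e12 flops/token: about 100,000 chips
```

The required chip count scales linearly with flops per token, which is why the estimate changes by three orders of magnitude between the two cases.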

If you read through the podcast, which is the only material I could quickly find laying out the Etched paradigm in any kind of detail, their argument seems to be that they can improve the workflow and easily pay for a trivial \$1m (which is what, a measly 20 H100 GPUs?), and that, as AI eats the global white-collar economy, inference cost is the main limit and the main obstacle to justifying the training runs for even more powerful models (it does you little good to create GPT-5 if you can't then inference it at a competitive cost), and so plenty of companies actually would need or buy such chips, and many would find it worthwhile to make their own by finetuning on a company-wide corpus (akin to BloombergGPT).

At current economics, it might not make sense, sure; but they are big believers in the future, and point to other ways to soak up that compute: tree search, specifically. (You may not need that many GPT-4 tokens, because of its inherent limitations, so burning it onto a chip to make it >100x cheaper doesn't do you much good, but if you can figure out how to do MCTS to make it the equivalent of GPT-6 at the same net cost...)

I'm not sure how much I believe their proprietary simulations claiming such speedups, and I'd definitely be concerned about models changing so fast* that this doesn't make any sense to do for the foreseeable future given all of the latencies involved (how useful would a GPT-2 ASIC be today, even if you could run it for free at literally \$0/token?), so this strikes me as a very gutsy bet but one that could pay off - there are many DL hardware startups, but I don't know of anyone else seriously pursuing the literally-make-a-NN-ASIC idea.

* right now, the models behind the big APIs like Claude or ChatGPT change fairly regularly. Obviously, you can't really do that with an ASIC which has burned in the weights... so you would either have to be very sure you don't want to update the model any time soon or you have to figure out some way to improve it, like pipelining models, perhaps, or maybe leaving in unused transistors which can be WORMed to periodically add in 'update layers' akin to lightweight finetuning of individual layers. If you believe burned-in ASICs are the future, similar to Hinton's 'mortal nets', this would be a very open and almost untouched area of research: how to best 'work around' an ASIC being inherently WORM.

They appear to have launched 'Sohu', for LLaMA-3-70b: https://www.etched.com/announcing-etched

> Assume a model takes 3e9 flops to infer the next token, and these chips run as fast as H100s, i.e. 3e15 flops/s. A single chip can infer 1e6 tokens/s. If you have 10M active users, then 100 chips can provide each user a token every 100ms, around 600wpm.

These numbers seem wrong. I think inference flops per token for powerful models is closer to 1e12-1e13. (The same as the number of params for dense models.)

More generally, I think expecting a similar amount of money spent on training as on inference is broadly reasonable. So, if a future powerful model is trained for \$1 billion, then spending \$1 million to design custom inference chips is fine (though I expect the design cost is higher than this in practice).

Look into AMD MI300x. Has 192 GB HBM3 memory. With FP4 weights, might run GPT-4 in single node of 8 GPUs, still have plenty to spare for KV. Eliminating cross-node communication easily allows 2x batch size.
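A rough check of whether that could fit. The parameter count here is an assumption, not a confirmed figure: it is inferred from the SemiAnalysis numbers quoted elsewhere in this thread (16 pipeline stages of roughly 130B parameters each):

```python
GPU_MEMORY_GB = 192   # HBM3 per MI300X, from the comment above
GPUS_PER_NODE = 8

def weights_gb(n_params, bits_per_param):
    """Size of the weights in GB at the given precision."""
    return n_params * bits_per_param / 8 / 1e9

# ASSUMPTION: ~2.08e12 parameters (16 stages x ~130B); not a confirmed figure.
ASSUMED_PARAMS = 16 * 130e9

node_gb = GPU_MEMORY_GB * GPUS_PER_NODE
fp4_gb = weights_gb(ASSUMED_PARAMS, 4)
print(f"node memory: {node_gb} GB, FP4 weights: {fp4_gb:.0f} GB, "
      f"left for KV cache: {node_gb - fp4_gb:.0f} GB")
```

Under that assumed parameter count, the FP4 weights take about two thirds of the node's memory, leaving the remainder for the KV cache and activations.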

Fungibility is a good idea, would take avg. KVUtil from 10% to 30% imo.

I think the human brain has around 2.5 petabytes of memory storage, which is insane compared to only 80 gigabytes in the H100 VRAM, and it does all this for 20 watts. I think this gives a lot of credence to the belief that the near future of AI will be a lot more brain-like than people think.

If the brain is basically at the limits of efficient algorithms, and we don't get new paradigms for computing, then Jacob Cannell's scenario for AI takeover would be quite right.

If algorithmic progress does have a larger effect on things, then Steven Byrnes's/Max H's take will likely be correct on AI takeover.

Do you feel like your memory contains 2.5 petabytes of data? I'm not sure such a number passes the smell test.

While I wouldn't endorse the 2.5 PB figure itself, I would caution against this line of argument. It's possible for your brain to contain plenty of information that is not accessible to your memory. Indeed, we know of plenty of such cognitive systems in the brain whose algorithms are both sophisticated and inaccessible to any kind of introspection: locomotion and vision are two obvious examples.

I do want to ask why don't you think the 2.5 petabyte figure is right, exactly?

It might be right, I don't know. I'm just making a local counterargument without commenting on whether the 2.5 PB figure is right or not, hence the lack of endorsement. I don't think we know enough about the brain to endorse any specific figure, though 2.5 PB could perhaps fall within some plausible range.

a gpu contains 2.5 petabytes of data if you oversample its wires enough. if you count every genome in the brain it easily contains that much. my point being, I agree, but I also see how someone could come up with a huge number like that and not be totally locally wrong, just highly misleading.

To me any big number seems plausible, given that AFAIK people don't seem to have run into upper limits of how much information the human brain can contain - while you do forget some things that don't get rehearsed, and learning does slow down at old age, there are plenty of people who continue learning things and having a reasonably sharp memory all the way to old age. If there's any point when the brain "runs out of hard drive space" and becomes unable to store new information, I'm at least not aware of any study that would suggest this.

My immediate intuition is that any additional skills or facts about the world picked up later in life, wouldn't affect data storage requirements enough to be relevant to the argument?

For example, if you already have vision and locomotion machinery and you can play the guitar and that takes X petabytes of data, and you then learn how to play the piano, I'd feel quite surprised if that ended up requiring your brain to contain more than even 2X petabytes total of data!

(I recognise I'm not arguing for it, but posting in case others share this intuition)

I don't immediately see the connection in your comment to what I was saying, which implies that I didn't express my point clearly enough.

To rephrase: I interpreted FeepingCreature's comment to suggest that 2.5 petabytes feels implausibly large, and that it feels implausible because, based on introspection, it doesn't feel like one's memory would contain that much information. My comment was meant to suggest that given that we don't seem to ever run out of memory storage, then we should expect our memory to contain far less information than the brain's maximum capacity, as there always seems to be more capacity to spare for new information.

Sure, but surely that's how it feels from the inside when your mind uses a LRU storage system that progressively discards detail. I'm more interested in how much I can access - and um, there's no way I can access 2.5 petabytes of data.

I think you just have a hard time imagining how much 2.5 petabyte is. If I literally stored in memory a high-resolution poorly compressed JPEG image (1MB) every second for the rest of my life, I would still not reach that storage limit. 2.5 petabyte would allow the brain to remember everything it has ever perceived, with very minimal compression, in full video, easily. We know that the actual memories we retrieve are heavily compressed. If we had 2.5 petabytes of storage, there'd be no reason for the brain to bother!
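To put a number on that, here is a small sketch that computes how long it would take to fill 2.5 PB at 1 MB per second, using decimal units (1 PB = 1e15 bytes, 1 MB = 1e6 bytes) and assuming continuous recording with no breaks:

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600

def years_to_fill(capacity_bytes, bytes_per_second):
    """Years of continuous recording needed to fill the given capacity."""
    return capacity_bytes / bytes_per_second / SECONDS_PER_YEAR

years = years_to_fill(2.5e15, 1e6)
print(f"{years:.1f} years")
```

The result is a little over 79 years of uninterrupted recording, i.e. roughly an entire human lifetime.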

> If we had 2.5 petabytes of storage, there'd be no reason for the brain to bother!

I recall reading an anecdote (though don't remember the source, ironically enough) from someone who said they had an exceptional memory, saying that such a perfect memory gets nightmarish. Everything they saw constantly reminded them of some other thing associated with it. And when they recalled a memory, they didn't just recall the memory, but they also recalled each time in their life when they had recalled that memory, and also every time they had recalled recalling those memories, and so on.

I also have a friend whose memory isn't quite that good, but she says that unpleasant events have an extra impact on her because the memory of them never fades or weakens. She can recall embarrassments and humiliations from decades back with an equal force and vividity as if they happened yesterday.

Those kinds of anecdotes suggest to me that the issue is not that the brain would in principle have insufficient capacity for storing everything, but that recalling everything would create too much interference and that the median human is more functional if most things are forgotten.

EDIT: Here is one case study reporting this kind of a thing:

We know of no other reported case of someone who recalls personal memories over and over again, who is both the warden and the prisoner of her memories, as AJ reports. We took seriously what she told us about her memory. She is dominated by her constant, uncontrollable remembering, finds her remembering both soothing and burdensome, thinks about the past “all the time,” lives as if she has in her mind “a running movie that never stops” [...]

One way to conceptualize this phenomenon is to see AJ as someone who spends a great deal of time remembering her past and who cannot help but be stimulated by retrieval cues. Normally people do not dwell on their past but they are oriented to the present, the here and now. Yet AJ is bound by recollections of her past. As we have described, recollection of one event from her past links to another and another, with one memory cueing the retrieval of another in a seemingly “unstoppable” manner. [...]

Like us all, AJ has a rich storehouse of memories latent, awaiting the right cues to invigorate them. The memories are there, seemingly dormant, until the right cue brings them to life. But unlike AJ, most of us would not be able to retrieve what we were doing five years ago from this date. Given a date, AJ somehow goes to the day, then what she was doing, then what she was doing next, and left to her own style of recalling, what she was doing next. Give her an opportunity to recall one event and there is a spreading activation of recollection from one island of memory to the next. Her retrieval mode is open, and her recollections are vast and specific.

Perhaps you are thinking of this (I think) autobiographical essay by Tim Rogers? He also talks about it in the 5th chapter of his Boku no Natsuyasumi review.

That memory would be used for what might be called semantic indexing. So it's not that I can remember tons of info, it's that I remember it in exactly the right situation.

I have no idea if that's an accurate figure. You've got the synapse count and a few bits per synapse (or maybe more), but you've also got to account for the choices of which cells synapse on which other cells, which is also wired and learned with exquisite specificity, and so constitutes information storage of some sort.

I got that from googling around the capacity of the human brain, and I found it via many sources. I definitely think that while this number is surprisingly high, I do think it makes a little sense, especially since I remember that one big issue with AI is essentially the fact that it has way less memory than the human brain, even when computation is similar in level.

Many of the calculations on the brain capacity are based on wrong assumptions. Is there an original source for that 2.5 PB calculation? This video is very relevant to the topic if you have some time to check it out:

Reber (2010) was my original source for the claim that the human brain has 2.5 petabytes of memory, but it's definitely something that got reported a lot by secondary sources like the Scientific American.

Yep, that's the source I was looking for to find the original source of the claim.

From what I've seen, even the larger synapses store only about 5 bits or so, and the 'median' or typical synapse probably stores less than 1 bit in some sense (as the typical brain synapse only barely exists in a probabilistic sense - as in a neuromorphic computer a physical synaptic connection is an obvious but unappreciated prerequisite for a logical synapse, but the former does not necessarily entail the latter: see also quantal synaptic failure).

In my 2022 roadmap I estimated brain capacity at 1e15 bits but that's probably an overestimate for logical bits.
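For comparison with the 2.5 PB figure discussed in this thread, here is a sketch converting these bit counts to bytes. The synapse count of ~1e14 is the round number from the original post, and the bits-per-synapse values are the rough ones mentioned above:

```python
SYNAPSES = 1e14  # ~100 trillion, the round number from the original post

def bits_to_terabytes(bits):
    """Convert a bit count to decimal terabytes (1 TB = 1e12 bytes)."""
    return bits / 8 / 1e12

estimates = {
    "5 bits per synapse": SYNAPSES * 5,
    "1 bit per synapse": SYNAPSES * 1,
    "1e15 bits (roadmap estimate)": 1e15,
    "2.5 PB claim": 2.5e15 * 8,
}
for label, bits in estimates.items():
    print(f"{label:<30} {bits_to_terabytes(bits):>8.1f} TB")
```

On these numbers the synapse-based estimates land between roughly 10 and 125 TB, more than an order of magnitude below 2.5 PB.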

Also the brain is quite sparse for energy efficiency, but that usually comes at a tradeoff in parameter efficiency. This is well explored in the various tradeoffs for ANNs that model 3d space (NeRFs etc) but generalizes to other modalities. The most parameter efficient models will be more dense but less compute/energy efficient for inference as a result. There are always more ways to compress the information stored in an ANN, but those optimization directions are extremely unlikely to align with the optimizations favoring more efficient inference via runtime sparsity (and extreme runtime sparsity probably requires redundancy aka anti-compression).

if the human brain had around 2.5 petabytes of storage, that would decrease my credence in AI being brain-like, because i believe AI is on track to match human intelligence in its current paradigm, so the brain being different just means the brain is different.

Agreed, and interested in @Noosphere89 elaborating on why you have the opposite intuition.

Basically, it has to do with the fundamental issue of the Von Neumann bottleneck, and the issue is that there is a massive imbalance between memory and computation, and while LLMs and human brains differ in their algorithms a lot, another non-algorithmic difference is the fact that the human brain has way more memory than pretty much any GPT, as well as basically all AI that exists.

Besides, more memory is good anyways.

And that causes issues when you try simulating an entire brain at high speed, and in particular it becomes a large issue when you have to wait all the time since the compute keeps shuffling around in memory.

jsd

According to SemiAnalysis in July:

OpenAI regularly hits a batch size of 4k+ on their inference clusters, which means even with optimal load balancing between experts, the experts only have batch sizes of ~500. This requires very large amounts of usage to achieve.

Our understanding is that OpenAI runs inference on a cluster of 128 GPUs. They have multiple of these clusters in multiple datacenters and geographies. The inference is done at 8-way tensor parallelism and 16-way pipeline parallelism. Each node of 8 GPUs has only ~130B parameters, or less than 30GB per GPU at FP16 and less than 15GB at FP8/int8. This enables inference to be run on 40GB A100’s as long as the KV cache size across all batches doesn’t balloon too large.

Mir

Wow, this is a good argument. Especially if assumptions hold.

1. The ALU computes the input much faster than the results can be moved to the next layer.
2. So if the AI only receives a single user's prompt, the ALUs waste a lot of time waiting for input.
3. But if many users are sending prompts all the time, the ALUs can be sent many more operations at once (assuming the wires are bottlenecked by speed rather than amount of information they can carry).
4. So if your AI is extremely popular (e.g., OpenAI), your ALUs have to spend less time idling, so the GPUs you use are much more cost-effective.
5. Compute is much more expensive for less popular AIs (plausibly >1000x).