How much AI inference can we do?

Benjamin_Todd

How much AI inference can we do?

by Benjamin_Todd

6 min read14th May 20247 comments

20

Frontpage

This is a linkpost for https://benjamintodd.substack.com/p/how-much-ai-inference-can-we-do

Suppose you have a bunch of GPUs. How many LLM forward passes can you do with them?^[1]

This is relevant to figuring out how profitable AI will be in the short-term, how powerful AI systems might be able to come in the near future, how large the compute overhang will be and other strategic questions.

Here’s my attempt to understand this topic as a non-specialist. I’ve had it checked over by some technical advisors, but I don’t claim any special expertise. I wrote it because I haven’t been able to find an accessible explainer elsewhere. I appreciate corrections.

The most obvious approach – the one I often see people in the community taking – is to look up how many FLOP per second your GPU can process, then how many FLOP it takes to run a forward pass, and then divide the two.

For example, Nvidia’s A100 GPU is listed at 312 teraflop per second (3e14) on its spec sheet (FP16 tensor), a forward pass of GPT-4 requires 5.6e11 FLOP per forward pass.^[2] So that would imply a single GPU can do about 560 forward passes per second.

But this turns out to be much too high.

Even if it were possible to achieve spec sheet FLOP in a real life application (it’s not), this wouldn’t be the relevant figure because, in practice, inference is limited more by memory than by FLOP:

Each forward pass requires all the parameters to also pass through the GPU’s memory. If 280 billion parameters are activated, and each parameter requires 16-bits = 2 bytes to encode it, then 560 gigabytes must pass through memory.^[3]

But the A100’s memory bandwidth is 2000 gigabytes per second – only enough for 4 forward passes.

However, 4 forward passes per second is also not right.

In practice, GPUs are parallelised, so multiple forward passes are processed in batches and many other optimisations are applied, allowing real world efficiency to be much higher than the memory bandwidth of an individual GPU would suggest.

So, the first FLOP-based method is an upper bound, the second memory-based method a lower bound, and the real world lies somewhere in between.

Figuring out where real world efficiency lies is tricky. Not only is it an area of active research, but it also depends on many factors, such as acceptable latency, context length, batch size, size of the model, etc.

The best estimate I’ve seen so far is from Semianalysis. In their article on GPT-4 architecture, they estimate that a cluster of 128 A100s can output 1 million tokens for $4.9 of compute (assuming fairly high utilisation and a context seqlen of 8k).

If A100s cost $1/hour on the cloud in 2023, running the cluster costs $128 per hour. This means the cluster must produce $128/4.9 = 26 million forward passes per hour.

That’s about 60 forward passes per chip per second – about 10% of the theoretical max, but 15 times better than the lower bound.

(Interestingly, it’s significantly worse than the ~33% utilisation of FLOP that can be achieved in training, which means that even if a certain number of FLOP were used for training, the same GPUs couldn’t produce that many FLOP if applied to inference.)

What about more advanced GPUs?

In the same article, Semianalysis provides a similar figure for the H100. They also have a newer post analysing the inference performance of the newer Blackwell chips, which gives a rough sense of how the B200 compares to the H100.^[4] For the H200, I looked at some comparisons of inference performance with the H100 and guessed 1.7x.

From this, I can make the following table:

One interesting point is that inference throughput has increased about 20x in 5 years, compared to only an 8x increase in FLOP.

This seems to be at least partly because Nvidia has crammed a lot more memory bandwidth into the newest chips, and memory is still usually a bigger constraint to inference than FLOP/s. It also allows for larger batch sizes. And more broadly it seems like the recent generation of chips have been more optimised for inference relative to training.

However, the underlying speed of memory bandwidth has been increasing more slowly than FLOP for years (the so-called ‘memory wall’), so while memory has been able to catch up recently, my best guess would be that it’s a temporary effect.

My figures are also based on FP16 and 16-bit encoding of model weights, but it seems like inference is switching to FP8 and 8-bit encoding, which could also roughly double how much inference can be done per GPU.^[5]

What about future models and algorithms?

As a first pass, FLOP and memory requirements per forward pass scale linearly with the number of model parameters activated per forward pass. So if a model is 10 times bigger, all else equal we’ll only be able to perform about one tenth as many forward passes.

In reality, there are many complications.

For example, if users really value long contexts and low latency, that makes it harder to batch, which pushes throughput towards the lower bound. If, however, most inference is short context and higher latency, throughput could be much closer to the theoretical max (but probably not above about ~50%).

We should also probably expect chips, parallelisation and algorithms to get better optimised over inference over time, allowing throughput to get closer to the max, or to achieve more performance with less compute. These effects can be big.

One example is that new parallelisation techniques could be discovered, allowing inference to get closer to the upper bound of available FLOP.

A more interesting example is that by using a mixture of experts structure, GPT-4 only needs to activate about one tenth of its parameters on a typical forward pass, so it only requires about a tenth of the compute suggested by its total number of parameters. Future models might be able to activate an even smaller fraction of parameters to achieve similar performance.

Models can also be trained using more data and use fewer parameters, which makes the model cheaper to run.

As a concrete example, by using 10 times as many experts, a lot of data and some other architecture improvements, it seems like DeepSeek has been able to achieve performance approaching GPT-4 while only activating about a tenth as many parameters.

As a final aside, Dylan Patel of Semianalysis claims that the computing requirements to run a forward pass will increase more slowly than linearly with model size. I’m not sure exactly why this is, but it could be because larger models open up more possibilities for optimisation.

What about input tokens?

Everything above has been just about the compute needed to produce one output token. But in reality, the compute required also depends on the number of tokens that are input into the model before producing the output.

I’ve heard conflicting accounts of the relationship between output tokens and compute, but a technical advisor told me that for FLOP adding input and output tokens works as a rough rule of thumb.

For example, if you input 1,000 input tokens and get 50 output tokens, then you need about 1050 times the FLOP required for one forward pass.

That also lines up with this article by Az16 (and would be consistent with the fees for using LLMs being linear in input tokens, though there are other reasons for this).

So, we can roughly say the throughput numbers above hold for the number of input or output tokens in most cases.

Though my understanding is this could break down if the number of input tokens / context is very large, in which case the memory requirements can increase faster than linear, pushing performance closer to the lower bound per token.

Summing up

We can look at the FLOP per second of future chips to get a rough upper bound on future ability to do inference, and memory bandwidth to get a lower bound, and think about where real life performance might fall within that range. Then we can compare that to the size of future models.

Historically, our ability to use maximum available FLOP in inference has been worse than in training.

However, inference throughput has been getting closer to the upper bound recently, as chips have been more adapted to inference (especially through having more memory bandwidth), and parallelisation & batching techniques have improved. This trend could continue (up to a max of maybe around 50% of the upper bound) if we discover more parallelisation techniques. Or, it could start to reverse due to the memory wall.

Algorithmic improvements have also allowed models to achieve the same performance while using much less compute, and that trend seems likely to continue.

Switching from FP16 to FP8 could also roughly double how much inference we can do with a given cluster of GPUs.

This was originally posted on benjamintodd.substack.com. Subscribe to get all my posts.