Inference cost limits the impact of ever larger models

My point is that, while PCIe bandwidths aren't increasing very quickly, it's easy to increase the number of machines you use. So you can distribute each NN layer (width-wise) across many machines, each of which adds to the total bandwidth you have.

(As noted in the previous comment, you can do this with <<300GB of total GPU memory for GPT-3 with something like ZeRO-Infinity.)
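As a back-of-envelope sketch of the "each machine adds to the total bandwidth" point: if every machine only streams its own shard of the weights, aggregate load time falls roughly linearly with machine count. All numbers below are illustrative assumptions (GPT-3-scale fp16 weights, PCIe 4.0 x16 at ~32 GB/s per machine), not measurements.

```python
# Back-of-envelope: aggregate bandwidth from width-wise sharding.
# All figures are assumed for illustration, not measured.
params = 175e9           # GPT-3-scale parameter count
bytes_per_param = 2      # fp16
weight_bytes = params * bytes_per_param   # ~350 GB total

pcie_bytes_per_s = 32e9  # assumed PCIe 4.0 x16 per machine

def stream_time(n_machines):
    """Seconds to stream all weights if each machine loads only its shard."""
    return (weight_bytes / n_machines) / pcie_bytes_per_s

for n in (1, 8, 64):
    print(n, round(stream_time(n), 2))
```

So a single PCIe link takes on the order of ten seconds per full pass over the weights, while 64 shards bring that well under a second, under these assumptions.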

Inference cost limits the impact of ever larger models

Beware bandwidth bottlenecks, as I mentioned in my original post.

Presumably bandwidth requirements can be reduced a lot through width-wise parallelism: each GPU then only has to load one slice of the model. Of course you'll need more GPUs, but still not a crazy number as long as you use something like ZeRO-Infinity.

(Yes, 8x GPU->GPU communications will hurt overall latency... but not by all that much, I don't think. 1 second is an eternity.)

Width-wise communication, if you mean that, can be quite a latency bottleneck for training. And it gets worse when you make the model wider or the batch bigger, which of course people are constantly doing. But for inference I guess you can reduce the latency if you're willing to use a small batch size.
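To put a rough number on the width-wise communication overhead: a Megatron-style tensor-parallel transformer does a couple of collectives per layer, and their fixed latency adds up across layers. The figures below (96 layers, 2 collectives per layer, ~50 microseconds per collective) are assumptions for illustration only.

```python
# Rough per-token latency overhead from width-wise (tensor) parallelism.
# Assumed: 96 layers, 2 collectives per layer (attention + MLP),
# ~50 microseconds per collective on a fast interconnect.
layers = 96
collectives_per_layer = 2
latency_per_collective = 50e-6   # seconds, assumed

comm_overhead = layers * collectives_per_layer * latency_per_collective
print(f"{comm_overhead * 1e3:.1f} ms per token")
```

That is a ~10 ms floor per token from communication alone under these assumptions, which is tolerable for inference but compounds quickly over long generations or larger collectives.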

Inference cost limits the impact of ever larger models

Thanks for elaborating; I think I know what you mean now. I missed this:

I am talking about pipelining loading the NN weights into the GPU. Which is not dependent on the result of the previous layer's computation.

My original claim was that ZeRO-Infinity has higher latency compared to pipelining across many layers of GPUs so that you don't have to repeatedly load weights from RAM. But as you pointed out, ZeRO-Infinity may avoid the additional latency by loading the next layer's weights from RAM at the same time as computing the previous layer's output. This helps if loading the weights is at least as fast as computing the outputs. If this works, we may be able to deploy massive future neural nets on clusters no bigger than the ones we have today.
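The condition can be made concrete: prefetching hides the load only if per-layer load time is at most per-layer compute time. Here is a sketch with assumed numbers (a GPT-3-scale layer of ~1.8B fp16 parameters, PCIe-class bandwidth, a ~100 TFLOP/s GPU, batch of one token); at batch size 1 the load side dominates badly.

```python
# When does prefetching the next layer's weights hide the load latency?
# Condition: load_time <= compute_time per layer. All numbers assumed.
def layer_load_time(layer_bytes, bandwidth):
    return layer_bytes / bandwidth

def layer_compute_time(flops, throughput):
    return flops / throughput

layer_params = 1.8e9                 # GPT-3-scale layer, assumed
layer_bytes = layer_params * 2       # fp16
flops = 2 * layer_params             # ~2 FLOPs per parameter per token

load = layer_load_time(layer_bytes, 32e9)      # PCIe-class bandwidth
compute = layer_compute_time(flops, 100e12)    # ~100 TFLOP/s GPU

print(load > compute)   # at batch size 1, loading dominates
```

Larger batches raise compute time per layer while load time stays fixed, which is why the overlap trick favors throughput-oriented serving over single-request latency.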

My original claim was therefore misconceived. I'll revise it to a different claim: bigger neural nets ought to have higher inference latency in general, regardless of whether we use ZeRO-Infinity or not. As I think we both agree, pipelining, in the sense of using different GPUs to compute different layers, doesn't reduce latency. However, adding more layers increases latency, and it's hard to compensate with other forms of parallelism. (Width-wise parallelism could help but its communication cost scales unfavorably. It grows as we grow the NN's width, and then again when we try to reduce latency by reducing the number of neurons per GPU [edit: it's not quadratic, I was thinking of the parameter count].) Does that seem right to you?
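A sketch of why width-wise parallelism stops helping: per-GPU compute shrinks roughly as 1/N, but the activations exchanged per layer do not shrink with N, so there are ever fewer FLOPs available to hide each communicated value. Sizes below (GPT-3-scale hidden width, batch of one token) are assumptions for illustration.

```python
# Tensor-parallel scaling sketch (assumed sizes, batch of 1 token).
# Per-GPU compute falls ~1/N, but the all-reduce still moves ~d
# activation values per layer, so communication dominates as N grows.
d = 12288                        # hidden width, GPT-3-scale

def compute_per_gpu(n_gpus):
    """FLOPs per GPU for one d x d matmul shard."""
    return 2 * d * d / n_gpus

def comm_per_layer():
    """Activation values exchanged in the per-layer all-reduce."""
    return d

for n in (1, 8, 64):
    ratio = compute_per_gpu(n) / comm_per_layer()
    print(n, ratio)   # FLOPs available to hide each communicated value
```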

The consequence then would be that inference latency (if not inference cost) becomes a constraint as we grow NNs, at least for applications where latency matters.

Inference cost limits the impact of ever larger models

The key is: pipelining doesn't help with latency of individual requests. But that's not what we care about here. What we care about is the latency from starting request 1 to finishing request N

Thanks for the examples. Your point seems to be about throughput, not latency (which to my knowledge is defined on a per-request basis). The latency per request may not matter for training but it does matter for inference if you want your model to be fast enough to interact with the world in real time or faster.

Inference cost limits the impact of ever larger models

Perhaps what you meant is that latency will be high but this isn't a problem as long as you have high throughput. That is basically true for training. But this post is about inference, where latency matters a lot more.

(It depends on the application, of course, but the ZeRO-Infinity approach can make your model so slow that you don't want to interact with it in real time, even at GPT-3 scale.)

Inference cost limits the impact of ever larger models

That would be interesting if true. I thought that pipelining doesn't help with latency. Can you expand?

Generically, pipelining increases throughput without lowering latency. Say you want to compute f(x) where f is a NN. Every stage of your pipeline processes e.g. one of the NN layers. Then stage N has to wait for the earlier stages to be completed before it can compute the output of layer N. That's why the latency to compute f(x) is high.
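The distinction can be checked against the standard pipeline timing formula: with S stages of duration t each, a single request still takes S*t end to end, while N back-to-back requests finish after (S + N - 1)*t, so throughput approaches one request per t. Stage count and timings below are illustrative assumptions.

```python
# Pipelining raises throughput but not per-request latency.
# S pipeline stages (e.g. one NN layer each), each taking t seconds.
def per_request_latency(stages, t):
    return stages * t                      # unchanged by pipelining

def total_time(stages, t, n_requests):
    return (stages + n_requests - 1) * t   # classic fill + drain formula

S, t = 96, 0.001                           # assumed: 96 layers, 1 ms/stage
print(per_request_latency(S, t))           # latency per request, with or
                                           # without pipelining
print(total_time(S, t, 1000))              # 1000 requests amortize the fill
```

For 1000 requests the pipeline finishes in barely more time than 1000 stage-steps, yet each individual request still waited the full 96 stages, which is the point made above.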

NB: GPT-3 used pipelining for training (in combination with model and data parallelism), and still the large GPT-3 has higher latency than the small models in the OA API.

Inference cost limits the impact of ever larger models

No, they don't. The primary justification for introducing them in the first place was to make a cheaper forward pass (=inference)

The motivation to make inference cheaper doesn't seem to be mentioned in the Switch Transformer paper nor in the original Shazeer paper. They do mention improving training cost, training time (from being much easier to parallelize), and peak accuracy. Whatever the true motivation may be, it doesn't seem that MoEs change the ratio of training to inference cost, except insofar as they're currently finicky to train.
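One way to see why MoEs don't change the training-to-inference cost ratio: per-token FLOPs in a Switch-style layer track the k active experts, not the total parameter count, and that holds identically for the forward pass in training and in inference. The layer sizes below are assumptions for illustration.

```python
# Per-token FFN FLOPs: dense vs. mixture-of-experts (assumed sizes).
# A Switch-style MoE routes each token to k of E experts, so per-token
# compute scales with k * expert_size, not total parameters.
def dense_flops(d_model, d_ff):
    return 2 * 2 * d_model * d_ff          # two matmuls in the FFN

def moe_flops(d_model, d_ff, k=1):
    return k * dense_flops(d_model, d_ff)  # same FFN cost per active expert

d_model, d_ff = 4096, 16384                # illustrative layer sizes
print(moe_flops(d_model, d_ff) == dense_flops(d_model, d_ff))
```

With k=1 (as in Switch Transformer), per-token compute matches the dense layer exactly, so cheaper training per unit of accuracy comes with correspondingly cheaper inference, leaving the ratio roughly unchanged.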

But the glass is half-full: they also report that you can throw away 99% of the model, and still get a third of the boost over the baseline small model.

Only if you switch to a dense model, which again doesn't save you that much inference compute. But as you said, they should instead distill into an MoE with smaller experts. It's still unclear to me how much inference cost this could save, and at what loss of accuracy.

Either way, distilling would make it harder to further improve the model, so you lose one of the key benefits of silicon-based intelligence (the high serial speed which lets your model do a lot of 'thinking' in a short wallclock time).

Paul's estimate of TFLOPS cost vs API billing suggests that compute is not a major priority for them cost-wise

Fair, that seems like the most plausible explanation.

Inference cost limits the impact of ever larger models

You may have better info, but I'm not sure I expect 1000x better serial speed than humans (at least not with innovations in the next decade). Latency is already a bottleneck in practice, despite efforts to reduce it. Width-wise parallelism has its limits and depth- or data-wise parallelism doesn't improve latency. For example, GPT-3 already has high latency compared to smaller models and it won't help if you make it 10^3x or 10^6x bigger.

Inference cost limits the impact of ever larger models

As Steven noted, your $1/hour number is cheaper than my numbers and probably more realistic. That makes a significant difference.

I agree that transformative impact is possible once we've built enough GPUs and connected them up into many, many new supercomputers bigger than the ones we have today. In a <=10 year timeline scenario, this seems like a bottleneck. But maybe not with longer timelines.

Inference cost limits the impact of ever larger models

you're missing all the possibilities of a 'merely human-level' AI. It can be parallelized, scaled up and down (both in instances and parameters), ultra-reliable, immortal, consistently improved by new training datasets, low-latency, ultimately amortizes to zero capital investment

I agree this post could benefit from discussing the advantages of silicon-based intelligence, thanks for bringing them up. I'd add that (scaled-up versions of current) ML systems have disadvantages compared to humans, such as lacking actuators and being cumbersome to fine-tune. Not to speak of the switching cost of moving from an economy based on humans to one based on ML systems. I'm not disputing that a human-level model could be transformative in years or decades -- I just argue that it may not be in the short term.
