Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

Executive Summary

Using a dataset of 470 models of graphics processing units (GPUs) released between 2006 and 2021, we find that the number of floating-point operations/second per $ (hereafter FLOP/s per $) doubles every ~2.5 years. For top GPUs, we find a slower rate of improvement (FLOP/s per $ doubles every 2.95 years), while for models of GPU typically used in ML research, we find a faster rate of improvement (FLOP/s per $ doubles every 2.07 years). GPU price-performance improvements have generally been slightly slower than the 2-year doubling time associated with Moore’s law, much slower than what is implied by Huang’s law, yet considerably faster than was generally found in prior work on trends in GPU price-performance. Our work aims to provide a more precise characterization of GPU price-performance trends, based on more or higher-quality data, that is more robust to justifiable changes in the analysis than previous investigations.

Figure 1. Plots of FLOP/s and FLOP/s per dollar for our dataset and relevant trends from the existing literature
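As a rough illustration of how a doubling time of this kind can be estimated, the sketch below fits a straight line to log10(FLOP/s per $) against release year and converts the slope to a doubling time. It assumes an ordinary least-squares log-linear fit, which this summary does not spell out as the authors' exact method, and the data points are synthetic, generated only to check the arithmetic. The function name `fit_doubling_time` is hypothetical.

```python
import math

def fit_doubling_time(years, flops_per_dollar):
    """Least-squares fit of log10(y) = a + b * year.

    Returns (slope b in orders of magnitude per year, doubling time in years).
    """
    logs = [math.log10(v) for v in flops_per_dollar]
    n = len(years)
    mean_x = sum(years) / n
    mean_y = sum(logs) / n
    sxx = sum((x - mean_x) ** 2 for x in years)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(years, logs))
    slope = sxy / sxx
    return slope, math.log10(2) / slope

# Synthetic data (NOT from the dataset): exact doubling every 2.5 years.
years = list(range(2006, 2022))
values = [1e9 * 2 ** ((y - 2006) / 2.5) for y in years]

slope, doubling_time = fit_doubling_time(years, values)
print(round(doubling_time, 2))  # 2.5
```

On real data the points scatter widely around the line, which is why the summary reports confidence intervals alongside each point estimate.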

| Trend | 2x time | 10x time | Metric |
| --- | --- | --- | --- |
| Our dataset (n=470) | 2.46 years [2.24, 2.72] | 8.17 years [7.45, 9.04] | FLOP/s per dollar |
| ML GPUs (n=26) | 2.07 years [1.54, 3.13] | 6.86 years [5.12, 10.39] | FLOP/s per dollar |
| Top GPUs (n=57) | 2.95 years [2.54, 3.52] | 9.81 years [8.45, 11.71] | FLOP/s per dollar |
| Our data FP16 (n=91) | 2.30 years [1.69, 3.62] | 7.64 years [5.60, 12.03] | FLOP/s per dollar |
| Moore’s law | 2 years | 6.64 years | FLOP/s |
| Huang’s law | 1.08 years | 3.58 years | FLOP/s |
| CPU historical (AI Impacts, 2019) | 2.32 years | 7.7 years | FLOP/s per dollar |
| Bergal, 2019 | 4.4 years | 14.7 years | FLOPs/dollar |

Table 1. Summary of our findings on GPU price-performance trends and relevant trends in the existing literature with the 95% confidence intervals in square brackets.
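The 2x and 10x columns in Table 1 are two views of the same exponential rate: a 10x time is the doubling time multiplied by log2(10) ≈ 3.32. The short sketch below shows that conversion, plus the doubling time implied by a growth factor over a fixed period, using the "25x per 5 years" figure for Huang's law that comes up in the comments. The function names are hypothetical. Converting the rounded 2x values will not always reproduce the 10x column to the last digit, presumably because the table was computed from unrounded estimates.

```python
import math

def tenfold_time(doubling_time_years):
    """Years needed for a 10x increase, given the doubling time."""
    return doubling_time_years * math.log2(10)

def doubling_time_from_growth(factor, period_years):
    """Doubling time implied by growing `factor`-fold every `period_years`."""
    return period_years * math.log(2) / math.log(factor)

print(round(tenfold_time(2.46), 2))                # 8.17 (full dataset)
print(round(tenfold_time(2.0), 2))                 # 6.64 (Moore's law)
print(round(doubling_time_from_growth(25, 5), 2))  # 1.08 (25x per 5 years)
```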

In future work, we intend to build on this work to produce projections of GPU price-performance, and investigate how our findings inform us about the growth in dollar-spending on computing hardware in Machine Learning.

We would like to thank Alyssa Vance, Ashwin Acharya, Jessica Taylor and the Epoch team for helpful feedback and comments.


 

12 comments

This trend may hit a wall in the near future, as some industry analysts predict TSMC's 3nm process will actually be more expensive per transistor than the current state-of-the-art 5nm process.

We're currently looking deeper into how we can extrapolate this trend. Our preliminary, high-uncertainty estimate is that it is more likely to slow down than speed up over the foreseeable future.

Memory latency and bandwidth are critically important to ML algorithm performance, in a way that makes this chart less straightforward than it appears. It's a good investigation with, in my view, an inconclusive result. Achievable FLOP/s on various models would be an interesting comparison.

In a narrow technical sense, this post still seems accurate but in a more general sense, it might have been slightly wrong / misleading. 

In the post, we investigated different measures of FP32 compute growth and found that many of them were slower than Moore's law would predict. This made me personally believe that compute might be growing slower than people thought and most of the progress comes from throwing more money at larger and larger training runs. While most progress comes from investment scaling, I now think the true effective compute growth is probably faster than Moore's law. 

The main reason is that FP32 is just not the right thing to look at in modern ML, and we knew this even at the time of writing: it ignores tensor cores and lower precisions like TF16 or INT8.

I'm a little worried that people who read this post but don't have any background in ML got the wrong takeaway from the post and we should have emphasized this difference even more at the time. We have written a follow-up post about this recently here: https://epochai.org/blog/trends-in-machine-learning-hardware
I feel like the new post does a better job at explaining where compute progress comes from.

I notice that the ML GPUs are not the best bang-for-your-buck in this chart. I assume that researchers prefer them because they pack more 'bang' (FLOP/s) in one unit, and that distributing across multiple cards has a performance penalty and/or adds complexity. How do factors like the cost of the rig (motherboard, power supply, case) and the cost of electricity play into this? Would a large cluster of more commodity GPUs be an effective research setup which just isn't economically competitive with ML GPUs, or would it be impractical at research scale?

I believe the performance/complexity penalty generally makes large clusters of cheap consumer GPUs not viable, with memory capacity being the biggest problem. From my perspective outside looking in, it takes a lot of effort and reengineering to make many ML projects just do inference on consumer GPUs with lower memory, and even more work to make it possible to train them on numerous low-memory GPUs. And in the vast majority of cases the authors say it's not even possible.

The lone exception is the consumer 3090 GPU, a massive outlier with 24 GB of memory. In pure FLOP/s, the 3080 GPU is almost equivalent to the 3090 but has only 10 GB.

You have more than an order of magnitude of scatter in your plot, but you report your calculated doubling period to 3 significant figures. Is this precision meaningful?

Also, your black data appears to have something different going on prior to 2008. It would be worthwhile doing a separate fit to the post-2008 data. Eyeballing it, the doubling time looks longer than 4 years.


Is there a publicly accessible version of the dataset?

Update: it's published now and you can find it here: https://chip-dataset.vercel.app/

How did you decide where the y-intercept for Huang’s law should be? It seems that even if you fix the slope to 25x per 5 years, the line could still be made to fit the data better by giving it a different y-intercept.

The comparison lines (dotted) have completely arbitrary y-intercepts. You should only take the slope seriously. 

That might be worth mentioning, as I wondered about the same. (I didn't realize until now that all the comparison lines start at the same point on the left-hand side of the figure.)
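To make the intercept question in this thread concrete: if the slope of a comparison line is held fixed, the least-squares intercept in log space is just the mean of the residuals log10(y) − slope × (year − reference year). The sketch below is only an illustration of that calculation. The figure's comparison lines were not fitted this way (their intercepts are arbitrary, as noted above), the data is synthetic, and `fixed_slope_intercept` is a hypothetical helper.

```python
import math

def fixed_slope_intercept(years, values, slope, ref_year):
    """Least-squares intercept of log10(y) = a + slope * (year - ref_year),
    with the slope held fixed rather than fitted."""
    residuals = [math.log10(v) - slope * (y - ref_year)
                 for y, v in zip(years, values)]
    return sum(residuals) / len(residuals)

# Slope implied by 25x per 5 years, in orders of magnitude per year.
huang_slope = math.log10(25) / 5

# Synthetic data (NOT from the dataset): doubling every 2.5 years from 1e9.
years = list(range(2006, 2022))
values = [1e9 * 2 ** ((y - 2006) / 2.5) for y in years]

intercept = fixed_slope_intercept(years, values, huang_slope, ref_year=2006)
print(f"best-fit log10 intercept at 2006: {intercept:.2f}")
```

Because the fixed slope is steeper than the synthetic trend, the best-fit line starts below the first data point and crosses the data mid-range. Shifting the intercept changes how close the line sits to the points but not the growth rate it represents.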