The Colliding Exponentials of AI

by VermillionStuka, 14th Oct 2020


Epistemic status: I have made many predictions for quantitative AI development this decade. These predictions are based on what I think is solid reasoning and on extrapolations from prior data.

 

If people do not intuitively understand the timescales of exponential functions, then multiple converging exponential functions will be even more misunderstood. 

Currently there are three exponential trends acting upon AI performance, these being Algorithmic Improvements, Increasing Budgets and Hardware Improvements. I have given an overview of these trends and extrapolated a lower and upper bound for their increases out to 2030. These extrapolated increases are then combined to get the total multiplier of equivalent compute that frontier 2030 models may have over their 2020 counterparts. 

 

Firstly...

Algorithmic Improvements

Algorithmic improvements for AI are much better known and quantified than they were a few years ago, thanks in large part to OpenAI’s paper and blog post AI and Efficiency (2020).

OpenAI showed that the efficiency of image processing algorithms has been doubling every 16 months since 2012. This resulted in a 44x decrease in the compute required to reach AlexNet-level performance after 7 years, as Figure 1 shows.

Figure 1, Compute to reach AlexNet-level performance, OpenAI

OpenAI also showed algorithmic improvements in the following areas:

Transformers surpassed seq2seq performance on English to French translation on WMT’14 with 61x less training compute three years later.

AlphaZero took 8x less compute to get to AlphaGoZero level performance 1 year later.

OpenAI Five Rerun required 5x less training compute to surpass OpenAI Five 3 months later.

Additionally, Hippke’s LessWrong post Measuring Hardware Overhang detailed algorithmic improvements in chess, finding that with modern algorithms Deep Blue level performance could have been reached on 1994 desktop PC-level compute, rather than the supercomputer-level compute it actually required in 1997.

Algorithmic improvements come not just from architectural developments, but also from optimization libraries. An example is Microsoft DeepSpeed (2020). DeepSpeed claims to train models 2–7x faster on regular clusters, to train 10x bigger models on a single GPU, and to power 10x longer sequences and 6x faster execution with a 5x reduction in communication volume, with models of up to 13 billion parameters trainable on a single Nvidia V100 GPU.

So, across a wide range of machine learning areas major algorithmic improvements have been regularly occurring. Additionally, while this is harder to quantify thanks to limited precedent, it seems the introduction of new architectures can cause sudden and/or discontinuous leaps of performance in a domain, as Transformers did for NLP. As a result, extrapolating past trendlines may not capture such future developments. 

If the algorithmic efficiency of machine learning in general had a doubling time like image processing’s 16 months, we would expect to see ~160x greater efficiency by the end of the decade. So an estimate of 100-1000x general algorithmic improvement by 2030 seems reasonable to me.
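
The doubling-time arithmetic above can be sketched as follows. This is a minimal illustration, not OpenAI's methodology; the function names are hypothetical, and the exact horizon behind the ~160x figure is not stated in the post. Under the assumption of a steady 16-month doubling time, a full 120 months gives roughly 181x, and a horizon slightly under ten years gives a figure near 160x. Either way the result falls inside the 100-1000x range.

```python
import math

def efficiency_multiplier(months: float, doubling_months: float = 16.0) -> float:
    """Efficiency gain after `months`, assuming a constant doubling time."""
    return 2.0 ** (months / doubling_months)

def implied_doubling_months(gain: float, months: float) -> float:
    """Doubling time implied by a total `gain` observed over `months`."""
    return months / math.log2(gain)

# 44x over 7 years (84 months) implies a doubling time of roughly 15-16 months.
print(f"implied doubling time: {implied_doubling_months(44, 84):.1f} months")

# Gains over horizons of about ten years (the horizon is an assumption).
for months in (110, 117, 120):
    print(f"{months} months -> {efficiency_multiplier(months):.0f}x")
```

The second function is just the first one inverted, which makes it easy to check that the 44x-in-7-years figure and the 16-month doubling time describe the same trend.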

Edit: I feel less confident and bullish about algorithmic progress now. 

 

Increasing Budgets

The modern era of AI began in 2012. This was the year that the compute used in the largest models began to increase rapidly, with a doubling time of 3.4 months (~10x yearly), per OpenAI’s blog post AI and Compute (2018); see Figure 2 below. While the graph stops in 2018, the trend held steady, with the predicted range of thousands of petaflop/s-days being reached in 2020 by GPT-3, the largest ever (non-sparse) model, which had an estimated training cost of $4.6 Million based on the price of a Tesla V100 cloud instance.

Figure 2, Modern AI era vs first Era, OpenAI

Before 2012 the growth rate for compute of the largest systems had a 2-year doubling time, essentially just following Moore’s law with mostly stable budgets. In 2012, however, a new exponential trend began: increasing budgets.

This 3.4 month doubling time cannot be reliably extrapolated because the increasing budget trend isn’t sustainable, as it would result in the following approximations (without hardware improvements):

2021 | $10-100M

2022 | $100M-1B

2023 | $1-10B

2024 | $10-100B

2025 | $100B-1T
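
The ranges above can be reproduced with a short sketch. It assumes, for illustration only, that the $4.6 Million GPT-3 estimate is the 2020 baseline and that cost grows at the rounded ~10x per year with no hardware improvements; the constant and function names are hypothetical. A 3.4-month doubling time works out to about 11.5x per year, which the post rounds to ~10x.

```python
BASE_YEAR = 2020
BASE_COST_USD = 4.6e6   # GPT-3 training cost estimate quoted in the post
YEARLY_GROWTH = 10.0    # rounded "~10x yearly" figure (assumption)

def yearly_factor(doubling_months: float) -> float:
    """Growth per year implied by a given doubling time in months."""
    return 2.0 ** (12.0 / doubling_months)

def projected_cost(year: int) -> float:
    """Training cost in USD if the budget trend continued unchanged."""
    return BASE_COST_USD * YEARLY_GROWTH ** (year - BASE_YEAR)

print(f"3.4-month doubling = {yearly_factor(3.4):.1f}x per year")
for year in range(2021, 2026):
    print(f"{year} | ${projected_cost(year):,.0f}")
```

Each projected value lands inside the order-of-magnitude band listed for that year.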

Clearly, without a radical shift in the field, this trend could only continue for a limited time. Astronomical as these figures appear, the cost of the necessary supercomputers would be even higher.

Costs have moved away from mere academic budgets and are now in the domain of large corporations, where extrapolations will soon exceed even their limits.  

The annual research and development expenditure of Google’s parent company Alphabet was $26 Billion in 2019. I have extrapolated their published R&D budgets to 2030 in Figure 3.

Figure 3, R&D Budget for Alphabet to 2030

By 2030 Alphabet’s R&D budget should be just below $60 Billion (approximately $46.8 Billion in 2020 dollars). So how much would Google, or a competitor, be willing to spend training a giant model?

Well, to put those figures into perspective: the international translation services market is currently $43 billion, and judging from the success of GPT-3 in NLP, its successors may be capable of absorbing a good chunk of that. So that domain alone could seemingly justify $1B+ training runs. And what about other domains within NLP, like programming assistants?

Investors are already willing to put up massive amounts of capital for speculative AI tech; the self-driving car domain had disclosed investments of $80 Billion from 2014-2017, per a report from Brookings. With those kinds of figures, even a $10 Billion training run doesn’t seem unrealistic if the resulting model were powerful enough to justify it.

My estimate is that by 2030 the training run cost for the largest models will be in the $1-10 Billion range (with total system costs higher still). Compared to the single digit millions training cost for frontier 2020 systems, that estimate represents 1,000-10,000x larger training runs.

 

Hardware Improvements

Moore’s Law had a historic 2-year doubling time that has since slowed. While it originally referred only to increases in transistor count, it is now commonly used to refer to performance increases. Some have predicted its stagnation as early as the mid-point of this decade (including Gordon Moore himself), but that is contested. More exotic paths forward such as non-silicon materials and 3D stacking have yet to be explored at scale, but research continues.

The microprocessor engineer Jim Keller stated in February 2020 that he doesn’t think Moore’s law is dead, that current transistors which are sized 1000x1000x1000 atoms can be reduced to 10x10x10 atoms before quantum effects (which occur at 2-10 atoms) stop any further shrinking, an effective 1,000,000x size reduction. Keller expects 10-20 more years of shrinking, and that performance increases will come from other areas of chip design as well. Finally, Keller says that the transistor count increase has slowed down more recently to a ‘shrink factor’ of 0.6 rather than the traditional 0.5, every 2 years. If that trend holds it will result in a 12.8x increase in performance in 10 years.
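
Keller's figures can be checked with a small sketch. It follows the post in treating the per-period shrink factor as a direct proxy for performance, which is a simplifying assumption, and the function name is hypothetical.

```python
def gain_from_shrink(shrink_factor: float, years: float, period_years: float = 2.0) -> float:
    """Cumulative gain if each period scales size by `shrink_factor`."""
    periods = years / period_years
    return (1.0 / shrink_factor) ** periods

# A shrink factor of 0.6 every 2 years over 10 years (5 periods).
print(f"0.6 shrink: {gain_from_shrink(0.6, 10):.2f}x")   # about 12.86x
# The traditional factor of 0.5 gives a doubling each period.
print(f"0.5 shrink: {gain_from_shrink(0.5, 10):.0f}x")   # 32x

# Going from 1000x1000x1000 atoms to 10x10x10 atoms is a volume ratio.
volume_reduction = (1000 ** 3) // (10 ** 3)
print(f"volume reduction: {volume_reduction:,}x")        # 1,000,000x
```

The gap between 32x and roughly 12.9x over a decade shows how much a modest change in the shrink factor compounds.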

But hardware improvements for AI need not come only from Moore’s law. Other sources of improvement, such as neuromorphic chips designed especially for running neural nets or specialised giant chips, could deliver greater performance for AI.

By the end of the decade I estimate we should see an 8-13x improvement in hardware performance.

 

 

Conclusions and Comparisons 

If we put my estimates for algorithmic improvements, increased budgets and hardware improvements together we see what equivalent compute multiplier we might expect a frontier 2030 system to have compared to a frontier 2020 system. 

Estimations for 2030:

Algorithmic Improvements: 100-1000x

Budget Increases: 1000-10,000x

Hardware Improvements: 8-13x

 

That results in an 800,000 - 130,000,000x multiplier in equivalent compute.

Between EIGHT HUNDRED THOUSAND and ONE HUNDRED and THIRTY MILLION
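
The combined range is simply the product of the three lower bounds and the product of the three upper bounds. A minimal sketch, assuming the three trends are independent and multiply cleanly (the dictionary layout and variable names are illustrative):

```python
from math import log10, prod

# (lower, upper) multipliers estimated in the post for 2030 vs 2020.
estimates = {
    "algorithmic improvements": (100, 1_000),
    "budget increases": (1_000, 10_000),
    "hardware improvements": (8, 13),
}

low = prod(lo for lo, _ in estimates.values())
high = prod(hi for _, hi in estimates.values())

print(f"lower bound: {low:,}x (10^{log10(low):.1f})")
print(f"upper bound: {high:,}x (10^{log10(high):.1f})")
```

Expressed in orders of magnitude, the range runs from just under 10^6 to just over 10^8.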

To put those equivalent compute multipliers into perspective in terms of the capability they represent, there is only one architecture that seems worth extrapolating them out on: Transformers, specifically GPT-3.

First, let’s relate them to Gwern’s estimates for human vs GPT-3 level perplexity from his blog post On GPT-3. Remember that perplexity is a measurement of how well a probability distribution or probability model predicts a sample. This is a useful comparison to make because it has been speculated both that human-level prediction on text would represent human-level NLP, and that NLP would be an AI-complete problem requiring human-equivalent general faculties.

Gwern states that his estimate is very rough and relies on unsourced claims from OpenAI about human-level perplexity on benchmarks, and that the absolute prediction performance of GPT-3 is, at a “best guess”, a factor of two worse than a human’s. With some “irresponsible” extrapolations of GPT-3 performance curves he finds that a 2,200,000× increase in compute would bring GPT-3 down to human perplexity. Interestingly, that’s not far above the lower bound of the 800,000 – 130,000,000x equivalent compute estimate range.
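
To make "not far above the lower bound" concrete, the sketch below locates Gwern's 2,200,000× figure within the estimated range on a logarithmic scale. The log-scale framing and the function name are assumptions made for illustration; neither comes from Gwern or from the estimates themselves.

```python
from math import log10

LOW, HIGH = 800_000, 130_000_000   # equivalent compute range from the post
GWERN_ESTIMATE = 2_200_000         # compute increase to reach human perplexity

def log_position(value: float, low: float, high: float) -> float:
    """Fraction of the way from `low` to `high` on a log10 scale."""
    return (log10(value) - log10(low)) / (log10(high) - log10(low))

ratio_to_low = GWERN_ESTIMATE / LOW
position = log_position(GWERN_ESTIMATE, LOW, HIGH)

print(f"{ratio_to_low:.2f}x the lower bound")
print(f"{position:.0%} of the way through the range (log scale)")
```

On this view the estimate sits about 2.75x above the lower bound, roughly a fifth of the way through the range.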

It’s worth stressing: 2030 AI systems could have human-level prediction capabilities if scaling continues.

Another way to put the equivalent compute multiplier estimates into perspective is to extrapolate the GPT-3 paper’s graph Aggregate performance across benchmarks (done in Figure 4 below). OpenAI stated that the aggregate performance “should not be seen as a rigorous or meaningful benchmark in itself”, so don't assume it represents human-level capabilities. Even so, the extrapolation is worth considering, especially given GPT-3’s current capabilities with ‘only’ 175 billion parameters.

 

Figure 4, GPT-3 cross benchmark performance extrapolated 

For the 800,000x estimate, few shot learning accuracy approaches 100%, one shot is above 90% and zero shot below 80%. For the 130,000,000x estimate both few and one-shot learning approach 100% and zero shot reaches 90%. This extrapolation maps greater equivalent compute cleanly onto more parameters, which may not be reliable, but gives us an idea of what we are dealing with.

 

Ultimately, the point of these extrapolations isn’t necessarily the specific figures or dates but the clear general trend: not only are much more powerful AI systems coming, they are coming soon.
