Measuring hardware overhang

hippke

Measuring hardware overhang

Summary

How can we measure a potential AI or hardware overhang? For the problem of chess, modern algorithms gained two orders of magnitude in compute (or ten years in time) compared to older versions. While it took the supercomputer "Deep Blue" to win over world champion Gary Kasparov in 1997, today's Stockfish program achieves the same ELO level on a 486-DX4-100 MHz from 1994. In contrast, the scaling of neural network chess algorithms to slower hardware is worse (and more difficult to implement) compared to classical algorithms. Similarly, future algorithms will likely be able to better leverage today's hardware by 2-3 orders of magnitude. I would be interested in extending this scaling relation to AI problems other than chess to check its universality.

Introduction

Hardware overhang is a situation where sufficient compute is available, but the algorithms are suboptimal. It is relevant if we build AGI with large initial build, but cheaper run costs. Once built, the AGI might run on many comparably slow machines. That's a hardware overhang with a risk of exponential speed-up. This asymmetry exists for current neural networks: Creating them requires orders of magnitude more compute than running them. On the other hand, in The Bitter Lesson by Rich Sutton it is argued that the increase in computation is much more important (orders of magnitude) than clever algorithms (factor of two or less). In the following, I will examine the current state of the algorithm-art using chess as an example.

The example of chess

One of the most well-researched AI topics is chess. It has a long history of algorithms going back to a program on the 1956 MANIAC. It is comparatively easy to measure the quality of a player by its ELO score. As an instructive example, we examine the most symbolic event in computer chess. In 1997, the IBM supercomputer "Deep Blue" defeated the reigning world chess champion under tournament conditions. The win was taken as a sign that artificial intelligence was catching up to human intelligence.

By today's standards, Deep Blue used simple algorithms. Its strength came from computing power. It was a RS/6000-based system with 30 nodes, each with a 120 MHz CPU plus 480 special purpose VLSI chess chips. For comparison, a common computer at the time was the Intel Pentium II at 300 MHz.

Method: An experiment using a 2020 chess engine

We may wonder: How do modern (better) chess algorithms perform on slower hardware? I tested this with Stockfish version 8 (SF8), one of the strongest classical chess engine. I simulated 10k matches of SF8 against slower versions of itself and a series of older engines for calibration, using cutechess-cli. In these benchmarks, I varied the total number of nodes to be searched during each game. I kept the RAM constant (this may be unrealistic for very old machines, see below). By assuming a fixed thinking time per game, the experiments scale out to slower machines. By cross-correlating various old benchmarks of Stockfish and other engines on older machines, I matched these ratings to units of MIPS; and finally, MIPS approximately to the calendar year. Depending on the actual release dates of the processors, the year axis has a jitter up to 2 years. I estimate the error for the compute estimates to be perhaps 20%, and certainly less than 50%. As we will see, the results measure in orders of magnitude, so that these errors are small in comparison (<10%).

Results

SF8 achieves Kasparov's 2850 ELOs running on a 486-100 MHz introduced in 1994, three years before the Kasparov-Deep Blue match. These ELOs refer to tournament conditions as in the 1997 IBM games. In other words, with today's algorithms, computers would have beat the world world chess champion already in 1994 on a contemporary desk computer (not a supercomputer).

The full scaling relation is shown in the Figure. The gray line shows the ELO rating of Kasparov and Carlsen over time, hovering around 2800. The blue symbols indicate the common top engines at their times. The plot is linear in time, and logarithmic in compute. Consequently, ELO scales approximately with the square of compute. Finally, the red line shows the ELOs of SF8 as a function of compute. Starting with the 2019 rating of ~3400 points, it falls below 3000 when reducing MIPs from 10^5 to a few times 10^3. This is a decrease of 2-3 orders of magnitude. It falls below 2850 ELO, Kasparov's level, at 68 MIPs. For comparison, the 486 achieves 70 MIPS at 100 MHz. At its maximum, the hardware overhang amounts to slightly more than 10 years in time, or 2-3 orders of magnitude in compute. Going back very far (to the era of 386-PCs), the gap reduces. This is understandable: On very slow hardware, you can't do very much, no matter what algorithm you have. The orange line shows the scaling relation of a neural network-based chess engine, Leela Chess Zero (LC0), as discussed below.

Discussion

I originally ran these tests in 2019. Now (August 2020), SF8 has been superseded by SF11, with another ~150 ELO increase (at today's speed). It remains unclear how much improvement is left for future algorithms when scaled down to a 486-100. I strongly suspect, however, that we're running into diminishing returns here. There is only so much you can do on a "very slow" machine; improvements will never go to infinity. My guess is that the scaling will remain below three orders of magnitude.

Neural network-based algorithms such as AlphaZero or Leela Chess Zero can outperform classical chess engines. However, for this comparison they are less suited. I find that their scaling is considerably worse, especially when not using GPUs. In other words, they do not perform well on CPUs of slower machines. Depending on the size of the neural net, older machines may even be incapable of executing it. In principle, it would be very interesting to make this work: Train a network on today's machines, and execute (run) it on a very old (slow) machine. But with current algorithms, the scaling is worse than SF8. As a reference point, LC0 achieves ~3000 ELOs on a Pentium 200 under tournament conditions; SF8 is at the same level with about half the compute.

Conclusion and future research proposals

Similarly, scaling of other NN algorithms to slower hardware (with less RAM etc.) should yield interesting insights. While x86 CPUs are in principle backwards-compatible since the 1980s, there are several breaking changes which make comparisons difficult. For example, the introduction of modern GPUs produces a speed gap when executing optimized algorithms on CPUs. Also, older 32-bit CPUs are capped at 4 GB of RAM, making execution of larger models impossible. Looking into the future, it appears likely that similar breaking changes will occur. One recent example is the introduction of TPUs and/or GPUs with large amounts of RAM. Without these, it may be impossible to execute certain algorithms. If AGI relies on similar (yet unknown) technologies, the hardware overhang is reduced until more of such the units are produced. Then, the vast amount of old (existing) compute can not be used.

I would be interested in researching this scaling relation for other problems outside of chess, such as voice and image recognition. Most problems are harder to measure and benchmark than chess. Will the scalings show a similar 2-3 orders if magnitude software overhang? Most certainly, many problems will show similar diminishing returns (or a cap) due to RAM restrictions and wait time. For example, you just can't run a self-driving car on an Atari, no matter how good the algorithms. I would be interested in researching the scaling for other AI and ML fields, possibly leading to an academic paper.

Thanks for running these experiments!

I don't see the figure here, do you have a link to it?

Pulled from the wayback machine

Replaced the image in the post with this image.

From the graph it looks like stockfish is able to match the results of engines from ~2000 using ~1.5 orders of magnitude less compute.

Is that the right way to read this graph?
Do you have the numbers for SF8 evaluations so that I can use those directly rather than eyeballing from this graph? (I'm generally interested in whatever raw data you have.)

Yes, that is a correct interpretation. The SF8 numbers are:

MIPS = [139814.4, 69907, 17476, 8738, 4369, 2184, 1092, 546.2, 273.1, 136.5, 68.3, 34.1, 17.1, 8.5, 4.3]

ELO = [3407,3375,3318,3290,3260,3225,3181,3125,3051,2955,2831,2671,2470,2219,1910]

Note that the range of values is larger than the plotted range in the Figure. The Figure cuts off at a 80486DX 33 MHz, 27 MIPS, introduced May 7, 1990.

To derive an analytical result, it is reasonable to interpolate with a spline and then subtract. Let me know if you have a specific question (e.g. for the year 2000).

How are those MIPS numbers produced? My impression was that the raw numbers were nodes/sec, and then some calibration was done to relate this to MIPS?

The MIPS are only a "lookup table" from the year, based on a CPU list. It's for the reader's convenience to show the year (linear), plus a rough measure of compute (exponential).

The nodes/s measure has the problem that it is engine-dependent.

The real math was done by scaling down one engine (SF8) by time-per-move, and then calibrating the time to the computers of that era (e.g., a Quad i7 from 2009 has 200x the nodes/s compared to a PII-300 from 1999)

Thanks for writing this post. I have a handful of quick questions: (a) What was the reference MIPS (or the corresponding CPU) you used for the c. 2019-2020 data point? (b) What was the constant amount of RAM you used to run Stockfish? (c) Do I correctly understand that the Stockfish-to-MIPS comparison is based on the equation [edit: not sure how to best format this LaTeX...]:

So, your post piqued my interest to investigate the Intel 80486 a bit more with the question in mind: how comparable are old vs. new CPUs according not just to MIPS but also to other metrics?

Instructions per second (MIPS): Using the Wikipedia MIPS source, the 80486 has 70 MIPS at 100 MHz, whereas a recent Skylake 8-core CPU (i9-9900K) has around 50,000 MIPS/core at a clock of 4.7 GHz, and closer to 40,000 MIPS/core at a sustained clock of 3.6 GHz.
Memory bandwidth: From Wikipedia, the 80486 has a 33 MHz bus and a data rate of up to 32 bits, so roughly 130 MB/s, whereas the 9900K is closer to 40 GB/s (from Intel).
Memory latency: The 80486 takes 5 bus cycles at 33 MHz, or 15 CPU cycles at 100 MHz, to access memory (16 byte cache line size, from comp.arch). From Anandtech, Skylake can take 100 cycles to access memory (64 byte cache line size, typical for x86_64).
Memory capacity: The 80486 is a 32-bit CPU and can address up to 4 GB of RAM. Again from Wikipedia, the 80486 accepts SIMM form factor RAM, with up to 2 GB capacity (but tens of MB may have been more commonly available). On the other hand, the 9900K supports a maximum memory capacity of 128 GB.

It would seem that improvements in MIPS have outpaced improvements in memory-related metrics, e.g. memory latency in units of cycles has gotten worse. What I don't know is how sensitive Stockfish is to variations in memory performance. However I would conclude that I'd update the estimated SF8 ELO on older hardware upwards when additionally accounting for the effect of memory-related metrics in addition to MIPS. In other words, it seems more likely to me that the SF8 ELO curve is underestimated when you include both MIPS and memory-related effects.

(There is another direction one could follow: get Stockfish to compile on an 80486 or other old hardware that one has lying around, and report back with results.)

(a) The most recent data points are from CCRL. They use an i7-4770k and the listed tournament conditions. With this setup, SF11 has about 3500 ELOs. That's what I used as the baseline to calibrate my own machine (an i7-7700k).

(b) I used the SF8 default which is 1 GB.

(c) Yes. However, the hardware details (RAM, memory bandwidth) are not all that important. You can use these SF9 benchmarks on various CPUs. For example, the AMD Ryzen 1800 is listed with 304,510 MIPS and gets 14,377,000 nodes/sec on Stockfish (i.e., 19.9 nodes per MIPS). The oldest CPU in the list, the Pentium-150 has 282 MIPS and reaches 5,626 nodes/sec (i.e., 47.2 nodes per MIPS). That's about a factor of two difference, due to memory and related advantages. As we're getting that much every 18 months due to Moore's law, it's a small (but relevant) detail, and decreases the hardware overhang slightly. Thanks for bringing that up!

Giving Stockfish more memory also helps, but not a lot. Also, you can't give 128 GB of RAM to a 486 CPU. The 1 GB is probably already stretching it. Another small detail which reduces the overhang by likely less than one year.

There are a few more subtle details like endgame databases. Back then, these were small, constrained by disk space limitations. Today, we have 7-stone endgame databases through the cloud (they weigh in at 140 TB). That seems to be worth about 50 ELO.

Wow, thanks for the very comprehensive response. (Also fun to see someone has compiled a modern chess engine on early-mid-90s hardware and shared their results.)

On the one hand this is an interesting and useful piece of data on AI scaling and the progress of algorithms. It's also important because it makes the point that the very notion of "progress of algorithms" implies hardware overhang as important as >10 years of Moore's law. I also enjoyed the follow-up work that this spawned in 2021.

Your results don't seem to show upper bounds on the amount of hardware overhang, or very strong lower bounds. What concrete progress has been made in chess playing algorithms since deep blue? As far as I can tell, it is a collection of reasonably small, chess specific tricks. Better opening libraries, low level performance tricks, tricks for fine tuning evaluation functions. Ways of representing chessboards in memory that shave 5 processor cycles off computing a knights move ect.

P=PSPACE is still an open conjecture. Chess is brute forcible in PSPACE. If P=PSPACE, and the overhead is a small constant, then it might be possible to program a vacuum tube machine to play perfect chess. Alternately, showing that stockfish is orders of magnitude better than deep blue, doesn't actually show that there is a hardware overhead for chess. It shows that there was one when deep blue was created.

Also note that there never was a hardware overhang for integer binary addition. The problem is simple enough that the first answer that any reasonably smart person can come up with is nearly optimal. (It is fairly straightforward to do addition with only a few logic gates per input, and as the output depends bitwise on all inputs (changing any single bit changes the output) then you need at least one logic gate per input. ) It is plausible that playing chess is a problem simple enough that we have figured out how to do it nearly optimally, whereas on other problems there is a hardware overhang.

If we assume that the expert human brain is about equally efficient at AI design, general programming, airoplane engineering and a variety of other STEM tasks. (A plausible assumption, given that these tasks seem similarly far from the environment of evolutionary adaptedness, ) Then it should take a similar amount of compute to display top human performance in these, as in chess. On the other hand, doing arithmetic used to be a skilled job, and the compute needed for superhuman chess (with current algorithms) is way higher than that needed for arithmetic.

Also, older 32-bit CPUs are capped at 4 GB of RAM, making execution of larger models impossible.

Slower, not impossible. I don't think any of the chess or Go models have model sizes >1GB, and even if they did, you don't have to load the entire model into RAM, they're just feedforward CNNs, you only need to be able to fit one layer at a time. With appropriate tricks you could probably even slowly train the models, like https://arxiv.org/abs/2002.05645v5

Right. My experiment used 1 GB for Stockfish, which would also work on a 486 machine (although at the time, it was almost unheard of...)

Thanks for running these experiments!

I don't see the figure here, do you have a link to it?

Pulled from the wayback machine

Replaced the image in the post with this image.

From the graph it looks like stockfish is able to match the results of engines from ~2000 using ~1.5 orders of magnitude less compute.

Is that the right way to read this graph?
Do you have the numbers for SF8 evaluations so that I can use those directly rather than eyeballing from this graph? (I'm generally interested in whatever raw data you have.)