In May 2020, OpenAI released a report on algorithmic efficiency improvements in deep learning. Main headline:
Compared to 2012, it now takes 44 times less compute to train a neural network to the level of AlexNet (by contrast, Moore’s Law would yield an 11x cost improvement over this period). Our results suggest that for AI tasks with high levels of recent investment, algorithmic progress has yielded more gains than classical hardware efficiency.
A lot people were surprised by this; there’s a common narrative in which AI progress has come mostly from throwing more and more compute at relatively-dumb algorithms. (This is a common interpretation of The Bitter Lesson, though I would argue it is largely a misinterpretation.)
I’ve had various experiences over the years which made the result not-that-surprising. Algorithms beating compute is the sort of thing I expect by default, on a gut level. The point of this post is to tell a few of the stories which underlie that intuition, aimed especially toward people who don’t have much first-hand experience with software engineering, ML, or simulation. (There will still be some jargon, though.)
Disclaimer: this does not mean that you should put tons of confidence on this view. The goal is just to provide a possible lens through which “algorithmic progress has yielded more gains than classical hardware efficiency” makes sense; I want to raise that hypothesis from entropy. I’m not going to provide the sort of evidence which would justify very high confidence, I’m just going to point it out as a hypothesis to keep in the back of your mind, and update on when results like OpenAI’s come along.
Rewrite In C
Back in college, I spent a summer simulating an unusual type of biochemical oscillator, officially under the aegis of the Minnesota Supercomputing Institute. The algorithm was conceptually simple: every time a reaction occurs between two molecules, update the counts of each molecule, then randomly sample to figure out when the next reaction happens. (In a single cell, molecule counts are often small enough that we can simulate each individual reaction like this, at least for a particular reaction-type.) Early in the summer, I spent a few days studying the algorithm and coding up a simulation in python. In order to get decent statistics, we needed to run it a lot, so the professor overseeing the work recommended that I book some time on one of the supercomputer clusters.
I did not book time on the supercomputers. Instead, I spent another three days re-writing the algorithm in C and tweaking it a bit for speed. That sped it up by a factor of about a hundred, which was enough that I could get statistically significant results on my laptop in an hour. Now, that means I needed to sit around for an hour waiting for results whenever I changed something, but that’s still a lot faster than applying for a timeslot on the supercomputer!
I’ve also rewritten things in C outside of academia. At one company, we had a recommendation algorithm which gave great recommendations but took about a second to run. (It was an exponential-time Bayesian inference algorithm, but with only a handful of data points.) I rewrote it in C, and it went down to < 10 ms - fast enough that a user wouldn’t notice the page loading slowly.
Also: have you ever checked just how much faster numpy is, compared to naive matrix multiplication written in python?
I coded up a quick test just now, with two 1k by 1k random matrices. Numpy.dot ran in about 40ms (after warmup). My simple three-nested-for-loop ran in about 130-140 seconds (slow enough that warmup was not relevant). That’s a speedup factor of more than 3k. Now, that’s not just from running in C (or fortran); it’s also largely from algorithms which make efficient use of the cache, and maybe a little bit from Strassen’s algorithm (though that’s probably not the main factor for matrices of this size).
Back in the day, scikit-learn included a translation algorithm which took pairs of corresponding sentences in two languages (e.g. English and French), and counted how many times each pair of words occurred in corresponding pairs. So, for instance, if French sentences containing “cochon” tend to correspond to English sentences containing “pig”, we might guess that “cochon” is French for “pig”.
Unfortunately, scikit’s implementation did something like this:
for french_word in french_words: for english_word in english_words: for sentence in corpus: if french_word in sentence.french and english_word in sentence.english: counts[french_word][english_word] += 1
What’s wrong with this? Well, let’s say there are 30k words in each language. Then our outer two loops will loop over ~900M word pairs, and it will go through every single sentence pair in the corpus for each of those 900M word pairs. Yet the vast majority of word pairs do not occur in the vast majority of sentences; if we have 100k sentences with an average of 10 words each, then we only have ~100 word pairs per sentence pair, and only ~10^2*100k = 10M word pairs actually in the corpus at all.
So, we can swap the loops around like this:
for sentence in corpus: for french_word in sentence.french: for english_word in sentence.english: counts[french_word][english_word] += 1
This avoids checking each sentence for all the word pairs which aren’t in the sentence. It’s a speedup from ~900M*100k = 90T operations to ~10^2*100k = 10M operations, roughly a factor of 9M improvement. (To actually get that big a speedup overall also requires switching the counts to use a sparse data structure.) The code went from so slow that it would not finish running before the class assignment was due, to running in under a second.
This is an unusually dramatic example of the speedup achievable by algorithmic improvements even in a fairly widely-used library. But it’s certainly not the only such example. Scikit-learn was particularly terrible on this front - I’ve had 1k+ speedup factors from fixing at least two other algorithms in that library (Gaussian mixtures and logistic regression), and eventually I stopped using it altogether because the algorithms were so consistently suboptimal. Outside of scikit, some other places I’ve seen 1k+ speedup factors:
- Facebook published an algorithm for figuring out how close people were to various friends by looking at their mutual friend graph. As written, it was O(n^4); a bit of thought improved that to O(n^2). For a person with n=500 friends, that was a speedup of ~250k. (The company I was at used this algorithm in production; it worked remarkably well.)
- In physical simulations and other numerical problems, it’s very common to need to invert a sparse (or sparse-plus-low-rank) matrix - i.e. a matrix where most of the entries are zero. Exploiting the sparsity pattern can take this from O(n^3) down to O(n)-ish, by ignoring all the zeros. I’ve run into this in optimization problems in ML, where the matrix itself isn’t sparse, but it is sparse-plus-low-rank. Exploiting sparsity takes the algorithm from “we can run this with n ~ 1k” to “we can run this on our entire database”.
- SQL queries can often go from O(n^2) to O(n), or from O(n) to O(1), by adding an index. This is probably the most common type of algorithmic improvement in software engineering, especially in large code bases, and speedup factors in (at least) the thousands are common.
Main key thing to note with all these examples: they’re all big-O speedups. In practice (I claim) big-O speedups are either useless (as in e.g. the theoretical fast matrix multiplication algorithms which might be the fastest way to multiply two matrices, if only those matrices had as many entries as there are atoms in the universe) or quite large (like 1k or more speedup); it’s rare for big-O improvements to help just a little bit.
Secondary key thing to note: you might think these are low-hanging fruit and therefore rare, but they were in widely-used libraries and a paper from Facebook. For the optimization/simulation example, the sparsity structure of the matrix is often highly non-obvious - the matrix one needs to invert is not actually sparse, and we need to play some Schur complement games to back out an equivalent sparse matrix problem - a skill rare enough that even most programmers reading this probably haven’t heard of it. For the SQL example, plenty of developers spend large chunks of their working hours hunting down places where they need indices or other speedups to SQL calls, and somehow there are always lots more to find (I know this from experience). Point is: these opportunities are out there.
Two main takeaways here:
- In practice, there is lots of fruit to be picked just from code efficiency and algorithms. People do not already use efficient code and algorithms all the time.
- Algorithmic gains are big. Even simple efficiency improvements (like rewriting in C) are usually at least a factor of 10 speedup, and often a factor of 100. Big-O improvements, if they’re useful at all, tend to yield a speedup factor of over 1k.
In deep learning over the past ~10 years, there hasn’t been any really revolutionary efficiency improvement. No big-O breakthrough. Yet even without a big efficiency breakthrough, we’ve seen a 44X improvement from algorithms. That’s not crazy; it’s normal.
Is this context-specific to AI? This position seems to imply that new algorithms come out of the box at only a factor 2 above maximum efficiency, which seems like an extravagant claim (if anyone were to actually make it).
In the general software engineering context, I understood the consensus narrative to be that code has gotten less efficient on average, due to the free gains coming from Moore's Law permitting a more lax approach.
Separately, regarding the bitter lesson: I have seen this come up mostly in the context of the value of data. Some example situations are the supervised vs. unsupervised learning approaches; AlphaGo's self-play training; questions about what kind of insights the Chinese government AI programs will be able to deliver with the expected expansion of surveillance data, etc. The way I understand this is that compute improvements have proven more valuable than domain expertise (the first approach) and big data (the most recent contender).
My intuitive guess for the cause is that compute is the perspective that lets us handle the dimensionality problem at all gracefully.
Reflecting on this, I think I should have said that algorithms are the perspective that lets us handle dimensionality gracefully, but also that algorithms and compute are really the same category, because algorithms are how compute is exploited.
Algorithm vs compute feels like a second-order comparison in the same way as CPU vs GPU, or RAM vs Flash, or SSD vs HDD, just on the abstract side of the physical/abstraction divide. I contrast this with compute v. data v. expertise, which feel like the first-order comparison.
Chris Rackauckas as an informal explanation for algorithm efficiency which I always think of in this context. The pitch is that your algorithm will be efficient in line with how much information about your problem it has, because it can exploit that information.
Another source of speedup is making the right approximations. Ages ago I coded a numerical simulation of a neuromusular synaptic transmission, tracking 50k separate molecules bumping into each other, including release, diffusion, uptake etc that ended up modeling the full process faithfully (as compared with using a PDE solver) after removing irrelevant parts that took compute time but did not affect the outcome.
Another great example of this is Striped Smith-Waterman, which takes advantage of SIMD instructions to achieve a 2-8 speed-up (potentially much more on modern CPUs though) for constructing sequence local alignments.
If 11x of the 44x total speedup is from hardware, doesn't that leave just 4x from software?
Moore's law simply means that the 44x less compute is 11x cheaper, right? Moore's law doesn't make algorithms need less compute, just lowers the cost of that compute.