TL;DR I explain why I think AI research has been slowing down, not speeding up, in the past few years.
How have your expectations for the future of AI research changed in the past three years? Based on recent posts in this forum, it seems that results in text generation, protein folding, image synthesis, and other fields have accomplished feats beyond what was thought possible. From a bird's eye view, it seems as though the breakneck pace of AI research is already accelerating exponentially, which would make the safe bet on AI timelines quite short.
This way of thinking misses the reality on the front lines of AI research. Innovation is stalling beyond just throwing more computation at the problem, and the forces that made scaling computation cheaper or more effective are slowing. The past three years of AI results have been dominated by wealthy companies throwing very large models at novel problems. While this expands the economic impact of AI, it does not accelerate AI development.
To figure out whether AI development is actually accelerating, we need to answer a few key questions:
- What has changed in AI in the past three years?
- Why has it changed, and what factors have allowed that change?
- How have those underlying factors changed in the past three years?
By answering these fundamental questions, we can get a better understanding of how we should expect AI research to develop over the near future. And maybe along the way, you'll learn something about lifting weights too. We shall see.
What has changed in AI research in the past three years?
Gigantic models have achieved spectacular results on a large variety of tasks.
How large is the variety of tasks? In terms of domain area, quite varied. Advances have been made in major hard science problems like protein structure prediction, imaginative tasks like creating images from descriptions, and complex games like StarCraft.
How large is the variety of models used? While each model features many domain-specific model and training components, the core of each of these models is a giant transformer trained with a variant of gradient descent, usually Adam.
How large are these models? That depends. DALL-E 2 and AlphaFold are O(10GB), AlphaStar is O(1GB), and the current state-of-the-art few-shot NLP models (Chinchilla) are O(100GB).
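As a rough sketch of where an order-of-magnitude size like this comes from, the snippet below converts a parameter count into the storage needed for the weights. The helper name `model_size_gb` and the 2-bytes-per-parameter default (16-bit weights) are assumptions made for illustration; the 70 billion figure is Chinchilla's published parameter count.

```python
def model_size_gb(num_params: int, bytes_per_param: int = 2) -> float:
    """Storage for the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9


chinchilla_params = 70_000_000_000  # Chinchilla's published parameter count

print(model_size_gb(chinchilla_params))     # 16-bit weights: 140.0
print(model_size_gb(chinchilla_params, 4))  # 32-bit weights: 280.0
```

Either way the result lands in the O(100GB) bucket quoted above.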
One of the most consistent findings of the past decade of AI research is that larger models trained with more data get better results, especially transformers. If all of these models are built on top of the same underlying architecture, why is there so much variation in size?
Think of training models like lifting weights. What limits your ability to lift heavy weights?
- Data availability (nutrition): If you don't eat enough food, you'll never gain muscle! Data is the food that makes models learn, and the more "muscle" you want, the more "food" you need. When looking for text on the internet, it is easy to get terabytes of data to train a model. This is harder for other tasks.
- Cost (exhaustion): No matter how rich your corporation is, training a model is expensive. Each polished model you see comes after a lot of experimentation and trials, which use a lot of computational resources. AI labs are notorious cost sinks. The talent they acquire is expensive, and in addition to their salaries the talent demands access to top-of-the-line computational resources.
- Training methodology (what exercises you do): NLP models only require training one big transformer. More complex models like DALL-E 2 and AlphaFold have many subcomponents optimized for their use cases. Training an NLP model is like deadlifting a loaded barbell, and training AlphaFold is like lifting a box filled with stuff: at equivalent weight, the barbell is much easier to pick up because the load is balanced, uniform, and moved in one motion. When picking up the box, the weight is unevenly distributed, which makes the task harder. AlphaStar was trained by creating a league of AlphaStars which competed against each other in actual games. To continue our weightlifting analogy, this is like a higher rep range with lower weight.
Looked at this way, what has changed over the past three years? In short, we have discovered how to adapt a training method/exercise (the transformer) to a variety of use cases. This exercise allows us to engage our big muscles (scalable hardware and software optimized for transformers). Sure, some of these applications are more efficient than others, but overall they are way more efficient than what they were competing against. We have used this change in paradigm to "lift more weight", increasing the size and training cost of our model to achieve more impressive results.
(Think about how AlphaFold2 and DALL-E 2, despite mostly being larger versions of their predecessors, drew more attention than their predecessors ever did. The prior work in the field paved the way by figuring out how to use transformers to solve these problems, and the attention came when the solution was scaled up to achieve eye-popping results. In our weightlifting analogy, we are learning a variation of an exercise. The hard part is learning the form that allows you to leverage the same muscles, but the impressive-looking part is adding a lot of weight.)
Why and how has it changed?
In other words: why are we only training gigantic models and getting impressive results now?
There are many reasons for this, but the most important one is that no one had the infrastructure to train models of this size efficiently before.
The modern Deep Learning / AI craze started in 2012, when a neural network called AlexNet won the ImageNet challenge. The architecture used a convolutional neural network, a method that had been invented 25 years prior and had been deemed too impractical to use. What changed?
The short answer? GPUs happened. Modern graphics applications had made specialized hardware for linear algebra cheap and commercially available. Chips had gotten almost a thousand times faster during that period, following Moore's law. When combined with myriad other computing advances in areas such as memory and network interfaces, it might not be a stretch to say that the GPU which AlexNet ran on was a million times better for Convnets than what had been available when Convnets were invented.
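The "almost a thousand times" figure is what compound doubling produces. A minimal sketch of the arithmetic, assuming the usual two-year doubling period (the function names are hypothetical, chosen for this example):

```python
import math


def speedup(years: float, doubling_period_years: float = 2.0) -> float:
    """Speedup after `years` if performance doubles every `doubling_period_years`."""
    return 2.0 ** (years / doubling_period_years)


def doublings_needed(factor: float) -> float:
    """How many doublings it takes to reach a given speedup factor."""
    return math.log2(factor)


print(speedup(20))             # ten doublings: 1024.0
print(doublings_needed(1000))  # just under ten doublings
```

Ten doublings already give a factor of 1024, so a thousandfold gain needs roughly two decades of two-year doublings.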
As the craze took off, NVIDIA started optimizing their GPUs more and more for deep learning workloads across the stack. GPUs were given more memory to hold larger models and more intermediate computations, faster interconnects to leverage multiple GPUs at once, and more optimized primitives through CUDA and cuDNN. This enabled much larger models, but by itself would not have allowed for the absolutely giant models we see now.
In the old days of deep learning, linear algebra operations had to be done by hand... well, not really, but programming was a pain and the resulting programs used resources poorly. Switching to a GPU was a nightmare, and trying to use multiple GPUs would make a pope question their faith in god. Then along came Caffe, then TensorFlow, then PyTorch, and suddenly training was so easy that any moron with an internet connection (me!) can use deep learning without understanding any of the math, hardware, or programming that needs to happen.
These days, training an ML model doesn't even require coding knowledge. If you do code, your code probably works on CPUs or GPUs, locally or on AWS/Azure/Google Cloud, and with one GPU or 32 across four different machines.
Furthermore, modern ML platforms will do under-the-hood optimization to accelerate model execution. ML models that are easy to write are now also easy to share, readable, well optimized, and easy to scale.
There are two sets of important advances that enabled large-scale research. The first is a legion of improvements to methods that allow less computation to achieve more. The second is a set of methods that allow more computation to be thrown at the same problem for better and faster results. While there are many such advances, three stick out: transformers, pipeline parallelism, and self-supervised learning.
Transformers are models that run really well on GPUs by leveraging very efficient matrix multiplication primitives. They were originally designed for text data, but it turns out that for models whose size is measured in gigabytes, transformers are just better than their competition for the same amount of computation.
If we think back to our weightlifting analogy, transformers are like your leg muscles. For sufficiently large loads, they can't be beat!
(As an ML Systems researcher, every new application where just throwing a big transformer at a problem beats years of custom machine learning approaches and panels of experts brings me a strange joy.)
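To make the "matrix multiplication primitives" point concrete, here is a minimal pure-Python sketch of scaled dot-product attention, the core operation of a transformer. It is an illustration only: real models run this on GPUs with optimized kernels, and the helper names (`matmul`, `softmax`, `attention`) and toy inputs are assumptions made for this example.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Turn a list of scores into weights that sum to 1."""
    m = max(row)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(q, k, v):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d)) V."""
    d = len(q[0])
    k_t = [list(col) for col in zip(*k)]  # transpose K
    scores = matmul(q, k_t)               # first matrix multiplication
    weights = [softmax([s / math.sqrt(d) for s in row]) for row in scores]
    return matmul(weights, v)             # second matrix multiplication


# Toy example: two tokens with two-dimensional queries, keys, and values.
q = [[1.0, 0.0], [0.0, 1.0]]
k = [[1.0, 0.0], [0.0, 1.0]]
v = [[1.0, 2.0], [3.0, 4.0]]
out = attention(q, k, v)
print(out)  # each output row is a weighted average of the rows of v
```

Almost all of the work is in the two `matmul` calls, which is exactly the kind of operation GPUs are built to do quickly.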
Pipeline parallelism is a bit complicated to explain to a nontechnical audience, but the short version is that training a machine learning model requires much more memory on the GPU than the size of the model. For small models, splitting the data between GPUs is the best approach. For large models, splitting the model across GPUs is better. Pipeline parallelism is a much better way of splitting the model than prior approaches, especially for models which are larger than a gigabyte.
Pipeline parallelism is like having good form. For smaller lifts it's not a big deal, but it is critical for larger lifts.
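The two claims above can be sketched with some back-of-the-envelope code. The first function estimates training memory assuming 32-bit weights, 32-bit gradients, and Adam's two moment estimates per parameter, with activations left out; those byte counts are assumptions for this sketch. The second builds a simplified forward-pass timetable for a pipeline in which each GPU holds one stage of the model and the batch is cut into micro-batches. All function names are hypothetical, and real pipeline schedules also interleave backward passes.

```python
def training_memory_gb(num_params, bytes_weights=4, bytes_grads=4, bytes_optimizer=8):
    """Memory for weights, gradients, and Adam state, in GB (activations excluded)."""
    return num_params * (bytes_weights + bytes_grads + bytes_optimizer) / 1e9


def pipeline_schedule(num_stages, num_microbatches):
    """schedule[t][s] is the micro-batch that stage s processes at step t, or None if idle."""
    steps = num_stages + num_microbatches - 1
    return [
        [t - s if 0 <= t - s < num_microbatches else None for s in range(num_stages)]
        for t in range(steps)
    ]


def idle_fraction(schedule):
    """Share of (step, stage) slots in which a GPU has nothing to do."""
    slots = [slot for step in schedule for slot in step]
    return slots.count(None) / len(slots)


# A 1-billion-parameter model is 4 GB of 32-bit weights but needs 16 GB to train.
print(training_memory_gb(1_000_000_000))  # 16.0

# Splitting a model across 4 GPUs without micro-batches leaves most of them waiting.
print(idle_fraction(pipeline_schedule(4, 1)))  # 0.75
# Cutting the batch into 8 micro-batches keeps the GPUs much busier.
print(idle_fraction(pipeline_schedule(4, 8)))  # about 0.27
```

The idle share shrinks as the number of micro-batches grows, which is why this way of splitting a model scales better than handing the whole batch from one GPU to the next.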
Self-supervised learning is like making flashcards to study for a test. Ideally, your teacher would make you a comprehensive set of practice questions, but that takes a lot of effort for the teacher. A self-directed student can instead take data that doesn't have "questions" (labels) and make up their own questions to learn the material. For example, a model trying to learn English could take a sentence, hide a bunch of words, and try to guess them. This is much cheaper than having a human "make questions".
Self-supervised learning is like cooking your own food instead of hiring a personal chef for your nutritional needs. It might be worse, but it is so much cheaper, and for the same price you can make a lot more food!
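The "hide a bunch of words and guess them" idea can be sketched in a few lines. This is a toy illustration, not how any particular model prepares its data: the `[MASK]` placeholder, the 15% masking rate, and the function name are all assumptions made for the example.

```python
import random

MASK = "[MASK]"  # placeholder token; the name is an assumption for this sketch


def make_masked_example(sentence, mask_prob=0.15, seed=0):
    """Hide random words and return (masked words, {position: hidden word})."""
    rng = random.Random(seed)  # seeded so the example is repeatable
    masked, labels = [], {}
    for i, word in enumerate(sentence.split()):
        if rng.random() < mask_prob:
            masked.append(MASK)
            labels[i] = word  # the "answer" the model must guess
        else:
            masked.append(word)
    return masked, labels


sentence = "the quick brown fox jumps over the lazy dog"
masked, labels = make_masked_example(sentence, mask_prob=0.3)
print(" ".join(masked))  # the "question"
print(labels)            # the "answers", created without any human labeling
```

The labels come for free from the text itself, which is what makes this so much cheaper than human annotation.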
How have those underlying factors changed in the past three years?
TL;DR: not much. We haven't gotten stronger in the past four years; we have just done a bunch of different exercises that use the same muscles.
All of the advances I mentioned in the last section were from 2018 or earlier.
(For the purists: self-supervised learning went mainstream for vision in 2020 by finally outperforming supervised learning.)
Chips are not getting twice as fast every two years like they used to (Moore's law is dying). The cost of a single training run for the largest ML models is on the order of ten million dollars. Adding more GPUs and more computation is pushing against the amount that companies are willing to burn on services that don't generate money for the company. Unlike the prior four years, we cannot scale up the size of models by a thousand times again. No one is willing to spend billions of dollars on training runs yet.
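The jump from "ten million" to "billions" is simple multiplication. A rough sketch, assuming training cost grows in direct proportion to model size (a simplification: if the amount of training data grows along with the model, cost grows faster than this). The function name is hypothetical.

```python
def scaled_training_cost(current_cost_usd: int, size_multiplier: int) -> int:
    """Cost of a training run if cost grows linearly with model size (an assumption)."""
    return current_cost_usd * size_multiplier


current_cost_usd = 10_000_000  # "on the order of ten million dollars", from the text
print(scaled_training_cost(current_cost_usd, 1_000))  # 10000000000, i.e. ten billion dollars
```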
From a hardware perspective, we should expect the pace of innovation to slow in the coming years.
Software advances are mixed. Using ML models is becoming easier by the day. With libraries like Hugging Face, a single line of code can run a state-of-the-art model for your particular use case. There is a lot of room for software innovations to make it easier to use for nontechnical audiences, but right now very little research is bottlenecked by software.
Research advances are the X factor. Lots of people are working on these problems, and it's possible there is a magic trick for intelligence at existing compute budgets. That is true now and always was, though. The most important research advances of the last few years primarily enabled us to use more GPUs for a given problem. Now that we are starting to run up against the limits of data acquisition and monetary cost, less low-hanging fruit is available.
(Side note: Even Facebook has trouble training current state-of-the-art models. Here are some chronicles of them trying to train a GPT-3-size model.)
I don't think we should expect performance gains in AI to accelerate over the next few years. As a researcher in the field, I expect the next few years will involve a lot of advances in the "long tail" of use cases and less growth in the most studied areas. This is because we have already picked up the easy gains from hardware and software over the past decade.
This is my first time posting to LessWrong, and I decided to post a lightly edited first draft because if I start doing heavy edits I don't stop. Every time I see a very fast AGI prediction or someone claiming Moore's law will last a few more decades I start to write something, but this time I actually finished it before deciding to rewrite. As a result, it isn't an airtight argument, but more my general feelings as someone who has been at two of the top research institutions in the world.