On March 29th, DeepMind published a paper, "Training Compute-Optimal Large Language Models", that shows that essentially everyone -- OpenAI, DeepMind, Microsoft, etc. -- has been training large language models with a deeply suboptimal use of compute.
Following the new scaling laws that they propose for the optimal use of compute, DeepMind trains a new, 70-billion parameter model that outperforms much larger language models, including the 175-billion parameter GPT-3 and DeepMind's own 270-billion parameter "Gopher".
I'm going to walk through the background of the now-falsified scaling laws from prior to this paper; then I'm going to describe the new laws given by this paper, and why they weren't found earlier; and finally I'll briefly mention some possible implications of this paper.
Independently of the consequences -- this paper is exciting! Machine learning researchers thought they knew laws about how to scale compute optimally, and the laws turned out to be wrong! It's a nice clear instance of science-functioning-in-ways-it-should in ML.
In 2020 OpenAI proposed scaling laws which have since been used (at least implicitly) to guide the training of large models.
These scaling laws attempt to answer several questions. One of these questions is "Given a certain quantity of compute, how large of a model should I train in order to get the best possible performance?"
The answer isn't "as large a model as possible" because, for a fixed quantity of compute, a larger model must be trained on less data. So training a 1-million parameter model on 10 books takes about as many floating point operations (FLOPs) as training a 10-million parameter model on one book.
In the case of very large language models like GPT-3, these alternatives look more like training a 20-billion parameter model on 40% of an archive of the Internet, or training a 200-billion parameter model on 4% of an archive of the Internet, or any of an infinite number of points along the same boundary.
Compute on this scale is not cheap -- so if you're going to be spending 10 million dollars per training run on a model scaled up to be 100x bigger than your toy version of the model, you want principles better than a feeling in your gut to guide how you allocate this compute between "amount of data the model sees" and "how big the model should be."
So if you get 10x more compute, how much bigger do you make your model? What about 100x more compute? Or 1000x more compute?
Well, the OpenAI paper answers the question. If you get 10x more compute, you increase your model size by about 5x and your data size by about 2x. Another 10x in compute, and model size is 25x bigger and data size is only 4x bigger.
Model size is almost everything.
Model Size Is (Almost) Everything
Subsequent researchers and institutions took this philosophy to heart, and focused mostly on figuring out how to engineer increasingly-large models, rather than training comparatively smaller models over more data. Thus, the many headlines of increasingly-larger models that we've seen coming from ML research institutions and AI accelerator startups.
See, for instance, the following chart from the new DeepMind paper.
Large Subsequent Models
Note the increase to half a trillion parameters, with identical quantities of training data.
And note that this understanding of the world has also been used to project forward future data requirements -- NVIDIA, for instance, talks about training a trillion parameter model with only 450 billion tokens. Everyone had decided model size was much more important than data size.
The DeepMind paper re-approaches the issue of scaling laws.
It uses three separate methods to try to find the correct scaling law, but I'm going to zoom in on the second because I think it's the easiest to comprehend.
The method is simple. They choose 9 different quantities of compute, ranging from about FLOPs to FLOPs.
For each quantity of compute, they train many different-sized models. Because the quantity of compute is constant for each level, the smaller models are trained for more time and the larger models for less.
The following chart from the paper illustrates this. Each line connects models (at different sizes) trained using the same amount of compute. The vertical axis is the loss, where lower is better:
Each of these curves has a clear interpretation. To the left of the minima on each curve, models are too small -- a larger model trained on less data would be an improvement. To the right of the minima on each curve, models are too large -- a smaller model trained on more data would be an improvement. The best models are at the minima.
If you connect the minima at each curve and extend the line outwards, you get a new law! Specifically, it looks like for every increase in compute, you should increase data size and model size by approximately same amount.
If you get a 10x increase in compute, you should make your model 3.1x times bigger and the data you train over 3.1x bigger; if you get a 100x increase in compute, you should make your model 10x bigger and your data 10x bigger.
Now, all of these experimental runs graphed above were on relatively small models, trained with non-insane quantities of compute. So you could have argued that this rule wouldn't work with much larger numbers.
But to verify that the law was right, DeepMind trained a 70-billion parameter model ("Chinchilla") using the same compute as had been used for the 280-billion parameter Gopher. That is, they trained the smaller Chinchilla with 1.4 trillion tokens, while the larger Gopher had only been trained with 300 billion tokens.
And, as the new scaling laws predicts, Chinchilla is a lot better than Gopher on pretty much everything. It is better by the standard less-perplexity-per-word measure, and by the more interesting usefulness-on-downstream-task measures. I could insert a bunch of graphs here, but if you aren't familiar with the measures in question they basically all sum to "Hey, number goes up!"
Number goes up (Or down when appropriate)
Given the evidence of Chinchilla, it appears pretty definite that OpenAI got the scaling laws wrong. So one natural question is "What happened that they got it wrong?"
Well, background: The learning rate of a deep neural network dictates how much the parameters of a network are updated for each piece of training data. Learning rates on large training runs are typically decreased according to a schedule, so that data towards the end of a training run adjusts the parameters of a neural network less than data towards the beginning of it. You can see this as reflecting the need to not "forget" what was learned earlier in the training run.
It looks like OpenAI used a single total annealing schedule for all of their runs, even those of different lengths. This shifted the apparent best-possible performance downwards for the networks on a non-ideal annealing schedule. And this lead to a distorted notion of what laws should be.
One funky thing about this is that we shouldn't see larger language models... at all, for at least a few years.
DeepMind provides a helpful chart of how much training data and compute you'd need to optimally train models of various sizes.
Note that it wouldn't make sense to train a model with 520 billion parameters until you had 60x as much compute as was used for Gopher / Chinchilla. You don't hit the need for a trillion parameters until you have 200x as much compute as was used for Gopher / Chinchilla.
(You might need even more compute; in part of the paper, DeepMind says that at large quantities of compute the scaling laws bend slightly, and the optimal behavior might be to scale data by even more than you scale model size. In which case you might need to increase compute by more than 200x before it would make sense to use a trillion parameters.)
So until wafer-scale chips decrease the cost of compute ten times, and Google also decides all it really needs for AGI is to put ten times as much money into LM's, we've seen the largest LM's we're likely to see. However long that may be.
One potential thing that could follow from this is that, because inference costs are obviously smaller for small language models, services such as OpenAI's GPT-3 should be cheaper for them to provide. The cost to run them, at the same level of quality, should drop by at least 3x. I don't know what percent the cost of providing these services is running them rather than training them, but potentially it could make services based on these models more efficient than they were before, and open up economic viability in places that didn't exist before.
One last consequence is that this paper makes the engineering involved in training large language models easier. Gathering more good data would be (I think) far easier than trying to efficiently split computation for increasingly large LM's across 1000s of machines.
Man, these data requirements for large models really show just how horrendously data-inefficient current deep learning actually is, you need to hit a model with thousands of different variations of a sentence for it to learn anything. I fear that we might be just one or two crucial insights away from cutting down those data numbers by orders of magnitude.
If I understand this correctly Deepmind is using each token in at most one update (They say they are training for less than one epoch), which means that it is hard to say anything about data efficiency of DL from this paper since the models are not trained to convergence on the data that they have seen.
They are probably doing this since they already have the data, and a new data point is more informative than an old one, even if your model is very slow to extract the available information with each update.
Excellent & timely analysis, thank you!
Is anyone working on updating the Biological Anchors Report model based on the updated slopes/requirements here?
Minor correction. You're saying:
> So training a 1-million parameter model on 10 books takes about as many FLOPS as training a 10-million parameter model on one book.
You link to FLOP per second aka FLOPS, whereas you're talking about the plural of FLOP, a quantity (often used is FLOPs).
Not sure why this was downvoted.
Number go up is a meme not a typo
Am I understanding this correctly, in that it means that scaling language models will require significantly more training data than OpenAI thought?
It mostly only means that training them compute optimally will require much more data, and doesn't rule out OpenAI-style mostly-parameter scaling at all. Data scaling can be necessary to minimise loss to get optimal estimates of certain entropic variables, while still being unnecessary for general intelligence. Large undertrained models still learn faster. This new paper mostly makes parameter and data scaling both significantly more efficient, but data scaling to a larger degree, such that it's more optimal to trade off these losses 1:1.
Below the fold is musing and analysis around this question. It is not a direct answer to it though.
We can take a look at the loss function, defined in terms of the irreducible loss, aka. the unmodelable entropy of language, the number of parameters N, and the number of data tokens D.
If we put in the parameters for Chinchilla, we see 406.4N0.34≈0.083, and 410.7D0.28≈0.163. Although these equations have been locally tuned and are not valid in the infinite limit of a single variable, it does roughly say that just scaling parameter counts without training for longer will only tackle about a third of the remaining reducible loss.
Note the implicit assumption that we are working in the infinite data limit, where we never intentionally train on the same tokens twice. If you run out of data, it doesn't mean you are no longer able to train your models for longer as you scale, it only means that you will have to make more use of the data you already have, which can mean as little as multiple epochs or as much as sophisticated bootstrapping methods.
The original scaling laws did not decompose so easily. I present them in simplified form.
(Note that the dataset was different so the exact losses shouldn't be centered identically.)
This has major issues, like there is no irreducible loss and the values aren't disentangled. We can still put in the parameters for GPT-3: 1.54e10N0.738≈77.7 and 1.8e13D≈60; or in the limits, (1.54e10N0.738)0.103≈1.57 and (1.8e13D)0.103≈1.52. It isn't clear what this means about the necessary amount of data scaling, as in what fraction of the loss that it captures, especially because there is no entropy term, but it does mean that there is still about 1:1 contributions from both losses at the efficient point, at least if you ignore the fact that the equation is wrong. That you have to scale both in tandem to make maximal progress remains true in this older equation, it's just more convoluted and has different factors.
What does the irreducible loss of 1.69 actually mean? I assume it's something like entropy/symbol? What does that convert to in terms of entropy/word? Does that agree with the 'standard' approximations of the entropy of English text?
It's the cross-entropy that is left after you scale to infinity, and it is measured per symbol, yes. It is measured using BPEs, and the unit is nats/token. It might be equal to the true entropy, but this is conjecture, as the model might never learn some aspects of language at any size within the regimes we can model.
For a large enough dataset, and given you are changing only the model and not the BPEs or data distribution, then the loss should be a constant factor multiple of bits/character, bits/byte, or bits/word. Chinchilla gets 0.667 bits/byte on pile_cc and a loss of 1.97 on Wikitext103 (1.97/0.667≈3), which is unhelpfully not at all controlled but should suffice for ballpark conversions.
That's actually precisely what I'm interested in finding out. How closely this scaling would match the 'expected' entropy of English in the infinite limit. (Of course, this assumes that said approximation actually matches in the limit.)
Hm. Any idea what the compression level is of using BPE on English text? A quick look shows ~51% compression ratio on BPE on the Brown corpus, which I suppose I could use as a starting point.
So if I'm understanding correctly (one nat == 1.4 bits of entropy), ~2.43 bits / token? Assuming a BPE compression ratio of 51.08% on English text (each token encoding 4.0864 bits, given 51.08% compression on what I assume to be 8-bit ASCII), that means ~0.595 bits / character.
...which actually matches Shannon's estimation of the entropy of English surprisingly well (0.6-1.3 bits / character).
This is the vocab file GPT uses. Don't stare too long, I have heard the jank is too great for human conception. I might already be infected. Most models don't bother changing the BPEs, but those that do probably don't have it any better. (This is machine learning where your inputs can be almost infinitely awful and nothing will stop working as long as your models are large enough.)
True entropy of text is not the best defined, and it's hard to tell whether something the model can't learn regardless of scale is a true feature of the distribution or just intractable. I would say that models do seem to be capturing the shape of what looks to my mind like the true distribution, and if they do fall short in the limit, it shouldn't be by very much.
I noted that Chinchilla gets 0.667 bits/byte on pile_cc, which is basically the same as bits per character on random internet text. The difference being that pile_cc isn't ASCII, but that makes up a sufficiently large fraction that I wouldn't worry about the details.
Correct. It means that if you want a very powerful language model, having compute & having data is pretty much the bottleneck, rather than having compute & being able to extend an incredibly massive model over it.
Hey look at the job listing. (https://boards.greenhouse.io/deepmind/jobs/4089743?t=bbda0eea1us)
Yes. However, before the idea was 'scaling is powerful/smart'. It is, but doing things this other way, is apparently more powerful. So if you want powerful models, grab a gallon of compute and a gallon of data, instead of two gallons of compute.*
This is probably a bad analogy, because at some point you're going to want to a) increase the stuff you were leaving the same. in your recipe, and then later, b) mix stuff up in smaller batches than 'all the ingredients'.
TLDR: I'm scared Figure 3 is wrong (the one with training loss/parameters).
WHY: From page 2: "... we perform our analysis on the smoothed training loss which is an unbiased estimate of the test loss "
This claim is true. However, it is estimating average loss during training. For a fixed compute budget, larger models take less gradient steps and thus exhibit larger loss for a larger fraction of training time. If they estimate training loss in this way for Figure 3, I would expect them to underestimate the training loss of the larger models.
EXPERIMENT: If anyone has access to training loss .csv files, we can reproduce Figure 3 using loss from the last 100 iterations. All my concerns go away if we get the same plot.
How long does it take to train them though? For a large enough value of large, the above seems obvious, and yet...why couldn't a larger model be trained over more time? (Thinking Long And Slow)
Surely it's the opposite? The more bang you get out of each parameter, the less it looks like 'just' (whatever that means) memorization/interpolation. When you needed to increase parameters a lot, disproportionately, to cope with some more data, that does not speak well of abstractions or understanding. (If I can train a 1t model to get the same loss as what I thought was going to take a 100t model, why would I think that that 100t model must be memorizing/interpolating less?) Let's take your claim to its logical extreme: suppose we discovered tomorrow a scaling law that made parameters near-constant (log, let's say); would that not suggest that those parameters are super useful and it's doing an amazing job of learning the underlying algorithm and is not memorizing/interpolating?
They already learn weird stuff, though.
Sorry, you’re completely right about the first point. I’ll correct the original comment.
Re: learning weird stuff, they definitely do, but a lot of contemporary weirdness feels very data dependent (e.g. I failed to realize my data was on a human-recognizably weird submanifold, like medical images from different hospitals with different patient populations) versus grokking-dependent (e.g. AlphaFold possibly figuring out new predictive principles underlying protein folding, or a hypothetical future model thinking about math textbooks for long enough that it solves a Millenium Prize problem).
EDIT: though actually AlphaFold might be a bad example, because it got to simulate a shit-ton of data, so maybe I’ll just stick to the “deep grokking of math” hypothetical.