New Scaling Laws for Large Language Models

[-]Razied4y290

Man, these data requirements for large models really show just how horrendously data-inefficient current deep learning actually is: you need to hit a model with thousands of different variations of a sentence for it to learn anything. I fear that we might be just one or two crucial insights away from cutting down those data numbers by orders of magnitude.

[-]NoSuchPlace4y190

If I understand this correctly, DeepMind is using each token in at most one update (they say they are training for less than one epoch), which means that it is hard to say anything about the data efficiency of DL from this paper, since the models are not trained to convergence on the data that they have seen.

They are probably doing this since they already have the data, and a new data point is more informative than an old one, even if your model is very slow to extract the available information with each update.

[-]Daniel Kokotajlo4y170

Excellent & timely analysis, thank you!

[-]Sam Bowman4yΩ590

Is anyone working on updating the Biological Anchors Report model based on the updated slopes/requirements here?

[-]lennart4y81

Minor correction. You're saying:
> So training a 1-million parameter model on 10 books takes about as many FLOPS as training a 10-million parameter model on one book.

You link to FLOP per second, aka FLOPS, whereas you're talking about the plural of FLOP, a quantity (usually written FLOPs).

[-]MondSemmel4y70

Typos:

trained a 70-billion parameter model ("Chinchilla") using the same compute as had been used for the 280-parameter Gopher. -> 280-billion parameter Gopher
Number go up -> goes up

[-]Pattern4y40

Not sure why this was downvoted.

[-]Lucas Adams3y40

Number go up is a meme not a typo

[-]TLW4y70

Am I understanding this correctly, in that it means that scaling language models will require significantly more training data than OpenAI thought?

[-]Veedrac4y400

It mostly only means that training them compute optimally will require much more data, and doesn't rule out OpenAI-style mostly-parameter scaling at all. Data scaling can be necessary to minimise loss to get optimal estimates of certain entropic variables, while still being unnecessary for general intelligence. Large undertrained models still learn faster. This new paper mostly makes parameter and data scaling both significantly more efficient, but data scaling to a larger degree, such that it's more optimal to trade off these losses 1:1.

Below the fold is musing and analysis around this question. It is not a direct answer to it though.

We can take a look at the loss function, defined in terms of the irreducible loss, aka. the unmodelable entropy of language, the number of parameters $N$, and the number of data tokens $D$.

$L (N, D) = 1.69 + \frac{406.4}{N^{0.34}} + \frac{410.7}{D^{0.28}}$

If we put in the parameters for Chinchilla, we see $\frac{406.4}{N^{0.34}} \approx 0.083$ , and $\frac{410.7}{D^{0.28}} \approx 0.163$ . Although these equations have been locally tuned and are not valid in the infinite limit of a single variable, it does roughly say that just scaling parameter counts without training for longer will only tackle about a third of the remaining reducible loss.
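The arithmetic above can be checked with a minimal sketch. The constants come from the loss equation as written in this comment; the Chinchilla configuration (70e9 parameters, 1.4e12 training tokens) is an assumption taken from the publicly reported figures rather than from this thread, and the function name is purely illustrative.

```python
# Reducible-loss terms of the fitted form L(N, D) = E + A / N**ALPHA + B / D**BETA,
# using the constants quoted in the comment above.
E, A, B = 1.69, 406.4, 410.7
ALPHA, BETA = 0.34, 0.28


def loss_terms(n_params: float, n_tokens: float) -> tuple[float, float]:
    """Return the (parameter, data) reducible-loss terms."""
    return A / n_params**ALPHA, B / n_tokens**BETA


# Assumed Chinchilla configuration: 70e9 parameters, 1.4e12 training tokens.
param_term, data_term = loss_terms(70e9, 1.4e12)
total_loss = E + param_term + data_term
# Fraction of the reducible loss that parameter scaling alone could remove.
param_share = param_term / (param_term + data_term)

print(f"parameter term: {param_term:.3f}")
print(f"data term:      {data_term:.3f}")
print(f"total loss:     {total_loss:.3f}")
print(f"parameter share of reducible loss: {param_share:.2f}")
```

Swapping in other values of `n_params` and `n_tokens` shows how the split between the two terms moves, subject to the same caveat that the fit is only local.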

Note the implicit assumption that we are working in the infinite data limit, where we never intentionally train on the same tokens twice. If you run out of data, it doesn't mean you are no longer able to train your models for longer as you scale, it only means that you will have to make more use of the data you already have, which can mean as little as multiple epochs or as much as sophisticated bootstrapping methods.

The original scaling laws did not decompose so easily. I present them in simplified form.

$L(N, D) = \left(\frac{1.54 \times 10^{10}}{N^{0.738}} + \frac{1.8 \times 10^{13}}{D}\right)^{0.103}$

(Note that the dataset was different so the exact losses shouldn't be centered identically.)

This has major issues: there is no irreducible loss term, and the values aren't disentangled. We can still put in the parameters for GPT-3: $\frac{1.54 \times 10^{10}}{N^{0.738}} \approx 77.7$ and $\frac{1.8 \times 10^{13}}{D} \approx 60$; or in the limits, $\left(\frac{1.54 \times 10^{10}}{N^{0.738}}\right)^{0.103} \approx 1.57$ and $\left(\frac{1.8 \times 10^{13}}{D}\right)^{0.103} \approx 1.52$. It isn't clear what this means about the necessary amount of data scaling, as in what fraction of the loss it captures, especially because there is no entropy term, but it does mean that there are still about 1:1 contributions from both losses at the efficient point, at least if you ignore the fact that the equation is wrong. That you have to scale both in tandem to make maximal progress remains true in this older equation; it's just more convoluted and has different factors.
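The same kind of check works for the older, simplified equation. This is a sketch under stated assumptions: the constants are the ones quoted in this comment, while the GPT-3 configuration (175e9 parameters, 300e9 training tokens) is assumed from the publicly reported figures, and `older_law_terms` is a hypothetical helper name.

```python
# Terms of the simplified older form
#   L(N, D) = (C_N / N**0.738 + C_D / D) ** 0.103
# with the constants quoted in the comment above.
C_N, C_D = 1.54e10, 1.8e13
N_EXPONENT, OUTER_EXPONENT = 0.738, 0.103


def older_law_terms(n_params: float, n_tokens: float) -> tuple[float, float]:
    """Return the (parameter, data) terms inside the outer exponent."""
    return C_N / n_params**N_EXPONENT, C_D / n_tokens


# Assumed GPT-3 configuration: 175e9 parameters, 300e9 training tokens.
param_term, data_term = older_law_terms(175e9, 300e9)

# Loss when both terms are present, and each single-variable limit.
combined_loss = (param_term + data_term) ** OUTER_EXPONENT
param_limit = param_term**OUTER_EXPONENT
data_limit = data_term**OUTER_EXPONENT

print(f"inner terms: {param_term:.1f} (parameters), {data_term:.1f} (data)")
print(f"limits:      {param_limit:.2f} (parameters), {data_limit:.2f} (data)")
print(f"combined:    {combined_loss:.2f}")
```

Because the two terms sit inside a shared exponent, the combined loss is not the sum of the two limits, which is the entanglement described above.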

[-]TLW4y10

Interesting!

What does the irreducible loss of 1.69 actually mean? I assume it's something like entropy/symbol? What does that convert to in terms of entropy/word? Does that agree with the 'standard' approximations of the entropy of English text?

[-]Veedrac4y20

It's the cross-entropy that is left after you scale to infinity, and it is measured per symbol, yes. It is measured using BPEs, and the unit is nats/token. It might be equal to the true entropy, but this is conjecture, as the model might never learn some aspects of language at any size within the regimes we can model.

For a large enough dataset, and given you are changing only the model and not the BPEs or data distribution, the loss should be a constant-factor multiple of bits/character, bits/byte, or bits/word. Chinchilla gets $0.667$ bits/byte on pile_cc and a loss of $1.97$ on Wikitext103 ($1.97 / 0.667 \approx 3$), which is unhelpfully not at all controlled but should suffice for ballpark conversions.

[-]TLW4y30

> It might be equal to the true entropy, but this is conjecture, as the model might never learn some aspects of language at any size within the regimes we can model.

That's actually precisely what I'm interested in finding out. How closely this scaling would match the 'expected' entropy of English in the infinite limit. (Of course, this assumes that said approximation actually matches in the limit.)

> It is measured using BPEs

Hm. Any idea what the compression level is of using BPE on English text? A quick look shows ~51%^[1] compression ratio on BPE on the Brown corpus, which I suppose I could use as a starting point.

> and the unit is nats/token.

So if I'm understanding correctly (one nat ≈ 1.44 bits of entropy), ~2.43 bits / token? Assuming a BPE compression ratio of 51.08% on English text (each token encoding 4.0864 bits, given 51.08% compression on what I assume to be 8-bit ASCII), that means ~0.595 bits / character.

...which actually matches Shannon's estimation of the entropy of English surprisingly well (0.6-1.3 bits / character).
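The conversion above can be reproduced in a few lines. This sketch only follows the comment's own assumptions (8-bit ASCII input, a 51.08% BPE compression ratio, and dividing bits/token by the 4.0864 figure); it does not check whether those assumptions hold for the tokenizer actually used, and the variable names are illustrative.

```python
import math

# Figures taken from the comments above.
NATS_PER_TOKEN = 1.69      # irreducible loss from the fitted scaling law
BPE_COMPRESSION = 0.5108   # assumed BPE compression ratio on English text
ASCII_BITS = 8             # assumed bits per raw character

# One nat is 1 / ln(2) bits, roughly 1.4427.
bits_per_token = NATS_PER_TOKEN / math.log(2)

# The comment's "each token encoding 4.0864 bits" figure.
bits_encoded_per_token = ASCII_BITS * BPE_COMPRESSION

# The comment's final step: divide the first figure by the second.
bits_per_char = bits_per_token / bits_encoded_per_token

print(f"bits per token:     {bits_per_token:.3f}")
print(f"bits per character: {bits_per_char:.3f}")
```

The results differ from the quoted ~2.43 and ~0.595 only in the third significant figure, which comes down to rounding of the nat-to-bit factor.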

^[1] http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.46.4046&rep=rep1&type=pdf

[-]Veedrac4y80

This is the vocab file GPT uses. Don't stare too long, I have heard the jank is too great for human conception. I might already be infected. Most models don't bother changing the BPEs, but those that do probably don't have it any better. (This is machine learning where your inputs can be almost infinitely awful and nothing will stop working as long as your models are large enough.)

rawdownloadcloneembedreportprint

True entropy of text is not the best defined, and it's hard to tell whether something the model can't learn regardless of scale is a true feature of the distribution or just intractable. I would say that models do seem to be capturing the shape of what looks to my mind like the true distribution, and if they do fall short in the limit, it shouldn't be by very much.

ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½

I noted that Chinchilla gets $0.667$ bits/byte on pile_cc, which is basically the same as bits per character on random internet text. The difference is that pile_cc isn't pure ASCII, but ASCII makes up a sufficiently large fraction that I wouldn't worry about the details.

ĥÃĤÃĥÃĤÃĥÃĤÃĥÃĤÃĥÃĤÃĥÃĤÃĥÃĤÃĥÃĤÃĥÃĤÃĥÃĤÃĥÃĤÃĥÃĤÃĥÃĤÃĥÃĤÃĥÃĤĤÃĥÃĤÃĥÃĤÃĥÃĤÃĥÃĤÃĥÃĤÃĥÃĤÃĥÃĤÃĥÃĤÃĥÃĤÃĥÃĤ

[-]1a3orn4y60

Correct. It means that if you want a very powerful language model, having compute & having data is pretty much the bottleneck, rather than having compute & being able to extend an incredibly massive model over it.

Hey look at the job listing. (https://boards.greenhouse.io/deepmind/jobs/4089743?t=bbda0eea1us)

[-]Pattern4y50

Yes. However, before, the idea was 'scaling is powerful/smart'. It is, but doing things this other way is apparently more powerful. So if you want powerful models, grab a gallon of compute and a gallon of data, instead of two gallons of compute.*

*This is probably a bad analogy, because at some point you're going to want to a) increase the stuff you were leaving the same in your recipe, and then later, b) mix stuff up in smaller batches than 'all the ingredients'.

[-]Alexander Mathiasen4y40

TLDR: I'm scared Figure 3 is wrong (the one with training loss/parameters).

WHY: From page 2: "... we perform our analysis on the smoothed training loss which is an unbiased estimate of the test loss"

This claim is true. However, it is estimating average loss during training. For a fixed compute budget, larger models take fewer gradient steps and thus exhibit larger loss for a larger fraction of training time. If they estimate training loss in this way for Figure 3, I would expect them to underestimate the training loss of the larger models.

EXPERIMENT: If anyone has access to training loss .csv files, we can reproduce Figure 3 using loss from the last 100 iterations. All my concerns go away if we get the same plot.
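A minimal sketch of the proposed check, assuming each training log is a CSV with one row per iteration and a column named `loss` (a hypothetical layout; real logs may differ). The curve below is synthetic and exists only to make the example run; it is not data from the paper.

```python
import csv
import io
from statistics import mean


def read_losses(csv_text: str, column: str = "loss") -> list[float]:
    """Parse per-iteration loss values from CSV text."""
    reader = csv.DictReader(io.StringIO(csv_text))
    losses = [float(row[column]) for row in reader]
    if not losses:
        raise ValueError("no loss values found")
    return losses


def final_loss(losses: list[float], window: int = 100) -> float:
    """Mean loss over the last `window` iterations."""
    return mean(losses[-window:])


def whole_run_loss(losses: list[float]) -> float:
    """Mean loss over the entire run, early iterations included."""
    return mean(losses)


# Synthetic, monotonically decreasing loss curve standing in for a real log.
rows = ["step,loss"] + [f"{step},{5.0 / step**0.3:.6f}" for step in range(1, 2001)]
losses = read_losses("\n".join(rows))

last_100 = final_loss(losses)
full_run = whole_run_loss(losses)
print(f"last-100 mean:  {last_100:.3f}")
print(f"whole-run mean: {full_run:.3f}")
```

On any decreasing curve the whole-run average sits above the last-window average, and the gap depends on how many steps the run had, which is why re-plotting Figure 3 from the last 100 iterations would settle the concern either way.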

[-]Pattern4y20

> One funky thing about this is that we shouldn't see larger language models... at all, for at least a few years.

How long does it take to train them though? For a large enough value of large, the above seems obvious, and yet...why couldn't a larger model be trained over more time? (Thinking Long And Slow)

[-]Melaenis Crito2y10

> So until wafer-scale chips decrease the cost of compute ten times, and Google also decides all it really needs for AGI is to put ten times as much money into LM's, we've seen the largest LM's we're likely to see. However long that may be.

The numbers in the DeepMind figure indicate an exponential increase in FLOPs. With compute increasing in line with Moore's law, and compute usage in AI growing even faster, why would larger models most likely be impossible? Based on these trends, it looks very reasonable to me that the trend towards larger models will continue.

[-]Not Relevant4y*10

Two thoughts:

[IGNORE; as gwern pointed out I got this backwards] the fact that data and compute need to scale proportionally seems… like a big point in favor of NNs as memorizers/interpolators.
Maybe this is baseless, but I somewhat feel better about a path to AGI based more on lots of data than “thinking really hard about a finite amount of data”. Choices over data seem much more interpretable and human-influenceable (e.g. by curating learning curricula for RL) than just throwing more compute at the same set of data and hoping it doesn’t learn anything weird.

[-]gwern4y150

> the fact that data and compute need to scale proportionally seems… like a big point in favor of NNs as memorizers/interpolators.

Surely it's the opposite? The more bang you get out of each parameter, the less it looks like 'just' (whatever that means) memorization/interpolation. When you needed to increase parameters a lot, disproportionately, to cope with some more data, that does not speak well of abstractions or understanding. (If I can train a 1t model to get the same loss as what I thought was going to take a 100t model, why would I think that that 100t model must be memorizing/interpolating less?) Let's take your claim to its logical extreme: suppose we discovered tomorrow a scaling law that made parameters near-constant (log, let's say); would that not suggest that those parameters are super useful and it's doing an amazing job of learning the underlying algorithm and is not memorizing/interpolating?

> and hoping it doesn’t learn anything weird.

They already learn weird stuff, though.

[-]Not Relevant4y60

Sorry, you’re completely right about the first point. I’ll correct the original comment.

Re: learning weird stuff, they definitely do, but a lot of contemporary weirdness feels very data-dependent (e.g. I failed to realize my data was on a human-recognizably weird submanifold, like medical images from different hospitals with different patient populations) versus grokking-dependent (e.g. AlphaFold possibly figuring out new predictive principles underlying protein folding, or a hypothetical future model thinking about math textbooks for long enough that it solves a Millennium Prize problem).

EDIT: though actually AlphaFold might be a bad example, because it got to simulate a shit-ton of data, so maybe I’ll just stick to the “deep grokking of math” hypothetical.
