(Colab notebook here.)
This post is about language model scaling laws, specifically the laws derived in the DeepMind paper that introduced Chinchilla.
The paper came out a few months ago, and has been discussed a lot, but some of its implications deserve more explicit notice in my opinion. In particular:
- Data, not size, is the currently active constraint on language modeling performance. Current returns to additional data are immense, and current returns to additional model size are minuscule; indeed, most recent landmark models are wastefully big.
- If we can leverage enough data, there is no reason to train ~500B param models, much less 1T or larger models.
- If we have to train models at these large sizes, it will mean we have encountered a barrier to exploitation of data scaling, which would be a great loss relative to what would otherwise be possible.
- The literature is extremely unclear on how much text data is actually available for training. We may be "running out" of general-domain data, but the literature is too vague to know one way or the other.
- The entire available quantity of data in highly specialized domains like code is woefully tiny, compared to the gains that would be possible if much more such data were available.
Some things to note at the outset:
- This post assumes you have some familiarity with LM scaling laws.
- As in the paper, I'll assume here that models never see repeated data in training.
- This simplifies things: we don't need to draw a distinction between data size and step count, or between train loss and test loss.
- I focus on the parametric scaling law from the paper's "Approach 3," because it provides useful intuition.
- Keep in mind, though, that Approach 3 yielded somewhat different results from Approaches 1 and 2 (which agreed with one another, and were used to determine Chinchilla's model and data size).
- So you should take the exact numbers below with a grain of salt. They may be off by a few orders of magnitude (but not many orders of magnitude).
1. the scaling law
The paper fits a scaling law for LM loss $L$, as a function of model size $N$ and data size $D$.
Its functional form is very simple, and easier to reason about than the law from the earlier Kaplan et al. papers. It is a sum of three terms:

$$L(N, D) = \frac{A}{N^\alpha} + \frac{B}{D^\beta} + E$$
The first term only depends on the model size. The second term only depends on the data size. And the third term is a constant.
You can think about this as follows.
An "infinitely big" model, trained on "infinite data," would achieve loss $E$. To get the loss for a real model, you add on two "corrections":
- one for the fact that the model only has $N$ parameters, not infinitely many
- one for the fact that the model only sees $D$ training examples, not infinitely many
Here's the same thing, with the constants fitted to DeepMind's experiments on the MassiveText dataset.

$$L(N, D) = \frac{406.4}{N^{0.34}} + \frac{410.7}{D^{0.28}} + 1.69$$
plugging in real models
Gopher is a model with 280B parameters, trained on 300B tokens of data. What happens if we plug in those numbers?

$$L(280 \cdot 10^9,\ 300 \cdot 10^9) \approx 0.052 + 0.251 + 1.69 = 1.993$$
What jumps out here is that the "finite model" term is tiny.
In terms of the impact on LM loss, Gopher's parameter count might as well be infinity. There's a little more to gain on that front, but not much.
Scale the model up to 500B params, or 1T params, or 100T params, or 3^^^3 params . . . and the most this can ever do for you is an 0.052 reduction in loss.
Meanwhile, the "finite data" term is not tiny. Gopher's training data size is very much not infinity, and we can go a long way by making it bigger.
Chinchilla is a model with the same training compute cost as Gopher, allocated more evenly between the two terms in the equation.
It's 70B params, trained on 1.4T tokens of data. Let's plug that in:

$$L(70 \cdot 10^9,\ 1.4 \cdot 10^{12}) \approx 0.083 + 0.163 + 1.69 = 1.936$$
Without using any more compute, we've improved the loss by 0.057. That's bigger than Gopher's entire "finite model" term!
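Both plug-ins are easy to reproduce. Below is a minimal Python sketch, assuming the Approach 3 constants reported in the Chinchilla paper (A = 406.4, B = 410.7, alpha = 0.34, beta = 0.28, E = 1.69). The function and variable names are illustrative, not taken from the paper or the Colab notebook.

```python
# Constants of the fitted parametric law (Chinchilla paper, Approach 3).
# They are assumptions of this sketch, copied in by hand.
A, B, E = 406.4, 410.7, 1.69
ALPHA, BETA = 0.34, 0.28


def loss_terms(n_params, n_tokens):
    """Return (finite-model term, finite-data term, irreducible term)."""
    return A / n_params**ALPHA, B / n_tokens**BETA, E


models = {
    "Gopher": (280e9, 300e9),      # (parameters, training tokens)
    "Chinchilla": (70e9, 1.4e12),
}

losses = {}
for name, (n_params, n_tokens) in models.items():
    model_term, data_term, const = loss_terms(n_params, n_tokens)
    losses[name] = model_term + data_term + const
    print(f"{name:<10} model={model_term:.3f} data={data_term:.3f} "
          f"total={losses[name]:.3f}")

gopher_model_term = loss_terms(*models["Gopher"])[0]
improvement = losses["Gopher"] - losses["Chinchilla"]
print(f"Chinchilla improves on Gopher by {improvement:.3f}")
print(f"Gopher's entire finite-model term is {gopher_model_term:.3f}")
```

The last two printed lines are the comparison that matters: the gain from reallocating compute is larger than everything Gopher could still gain from more parameters.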
The paper demonstrates that Chinchilla roundly defeats Gopher on downstream tasks, as we'd expect.
Even that understates the accomplishment, though. At least in terms of loss, Chinchilla doesn't just beat Gopher. It beats any model trained on Gopher's data, no matter how big.
To put this in context: until this paper, it was conventional to train all large LMs on roughly 300B tokens of data. (GPT-3 did it, and everyone else followed.)
Insofar as we trust our equation, this entire line of research -- which includes GPT-3, LaMDA, Gopher, Jurassic, and MT-NLG -- could never have beaten Chinchilla, no matter how big the models got.
People put immense effort into training models that big, and were working on even bigger ones, and yet none of this, in principle, could ever get as far as Chinchilla did.
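The same law gives a floor for any model trained on a fixed amount of data: let the finite-model term go to zero and see what is left. Here is a short sketch of that check, again assuming the Chinchilla paper's Approach 3 constants and using illustrative names. The "break-even" token count is a derived quantity of this sketch, not a number from the paper.

```python
A, B, E = 406.4, 410.7, 1.69   # assumed Approach 3 constants
ALPHA, BETA = 0.34, 0.28


def loss(n_params, n_tokens):
    return A / n_params**ALPHA + B / n_tokens**BETA + E


def loss_floor(n_tokens):
    """Loss in the infinite-parameter limit: only the data term and E remain."""
    return B / n_tokens**BETA + E


chinchilla_loss = loss(70e9, 1.4e12)
floor_300b = loss_floor(300e9)

# Token count at which an infinitely large model would only just tie Chinchilla:
# solve B / D**BETA + E = chinchilla_loss for D.
break_even_tokens = (B / (chinchilla_loss - E)) ** (1 / BETA)

print(f"Chinchilla loss:            {chinchilla_loss:.4f}")
print(f"floor with 300B tokens:     {floor_300b:.4f}")
print(f"break-even training tokens: {break_even_tokens / 1e9:.0f}B")
```

Under these constants the 300B-token floor sits slightly above Chinchilla's loss, and the break-even point lands only a little past 300B tokens, so the margin is real but thin.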
Here's where the various models lie on a contour plot of LM loss (per the equation), with $N$ on the x-axis and $D$ on the y-axis.
Only PaLM is remotely close to Chinchilla here. (Indeed, PaLM does slightly better.)
PaLM is a huge model. It's the largest one considered here, though MT-NLG is a close second. Everyone writing about PaLM mentions that it has 540B parameters, and the PaLM paper does a lot of experiments on the differences between the 540B PaLM and smaller variants of it.
According to this scaling law, though, PaLM's parameter count is a mere footnote relative to PaLM's training data size.
PaLM isn't competitive with Chinchilla because it's big. MT-NLG is almost the same size, and yet it's trapped in the pinkish-purple zone on the bottom-left, with Gopher and the rest.
No, PaLM is competitive with Chinchilla only because it was trained on more tokens (780B) than the other non-Chinchilla models. For example, this change in data size constitutes 85% of the loss improvement from Gopher to PaLM.
Here's the precise breakdown for PaLM:

$$L(540 \cdot 10^9,\ 780 \cdot 10^9) \approx 0.042 + 0.192 + 1.69 = 1.924$$
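The PaLM breakdown, and the share of the Gopher-to-PaLM improvement that comes from data alone, can be checked with a few lines of Python. This is a sketch under the same assumption as the rest of the post's numbers (the Chinchilla paper's Approach 3 constants); the names are illustrative.

```python
A, B, E = 406.4, 410.7, 1.69   # assumed Approach 3 constants
ALPHA, BETA = 0.34, 0.28


def model_term(n_params):
    return A / n_params**ALPHA


def data_term(n_tokens):
    return B / n_tokens**BETA


gopher_n, gopher_d = 280e9, 300e9
palm_n, palm_d = 540e9, 780e9

palm_loss = model_term(palm_n) + data_term(palm_d) + E

# Split the Gopher -> PaLM improvement into its two sources.
gain_from_params = model_term(gopher_n) - model_term(palm_n)
gain_from_data = data_term(gopher_d) - data_term(palm_d)
data_share = gain_from_data / (gain_from_params + gain_from_data)

print(f"PaLM: model={model_term(palm_n):.3f} data={data_term(palm_d):.3f} "
      f"total={palm_loss:.3f}")
print(f"share of improvement due to data: {data_share:.0%}")
```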
PaLM's gains came with a great cost, though. It used way more training compute than any previous model, and its size means it also takes a lot of inference compute to run.
Here's a visualization of loss vs. training compute (loss on the y-axis and in color as well):
Man, we spent all that compute on PaLM, and all we got was the slightest edge over Chinchilla!
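To put rough numbers on that, here is a sketch using the common approximation that training costs about 6 FLOPs per parameter per token, i.e. C ≈ 6ND. That approximation is an assumption of the sketch (the FLOP counts reported in the individual papers may differ somewhat), as are the Approach 3 constants and the illustrative names.

```python
A, B, E = 406.4, 410.7, 1.69   # assumed Approach 3 constants
ALPHA, BETA = 0.34, 0.28


def loss(n_params, n_tokens):
    return A / n_params**ALPHA + B / n_tokens**BETA + E


def train_flops(n_params, n_tokens):
    """Approximate training compute, assuming C ~= 6 * N * D."""
    return 6 * n_params * n_tokens


models = {
    "Gopher": (280e9, 300e9),
    "Chinchilla": (70e9, 1.4e12),
    "PaLM": (540e9, 780e9),
}

flops = {name: train_flops(*nd) for name, nd in models.items()}
losses = {name: loss(*nd) for name, nd in models.items()}

for name in models:
    print(f"{name:<10} flops={flops[name]:.2e} loss={losses[name]:.3f}")

compute_ratio = flops["PaLM"] / flops["Chinchilla"]
loss_edge = losses["Chinchilla"] - losses["PaLM"]
print(f"PaLM used {compute_ratio:.1f}x Chinchilla's compute "
      f"for a loss edge of {loss_edge:.3f}")
```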
Could we have done better? In the equation just above, PaLM's terms look pretty unbalanced. Given that compute, we probably should have used more data and trained a smaller model.
The paper tells us how to pick optimal values for params and data, given a compute budget. Indeed, that's its main focus.
If we use its recommendations for PaLM's compute, we get the point "palm_opt" on this plot:
Ah, now we're talking!
"palm_opt" sure looks good. But how would we train it, concretely?
Let's go back to the $N$-vs.-$D$ contour plot world.
I've changed the axis limits here, to accommodate the massive data set you'd need to spend PaLM's compute optimally.
How much data would that require? Around 6.7T tokens, or ~4.8 times as much as Chinchilla used.
Meanwhile, the resulting model would not be nearly as big as PaLM. The optimal compute law actually puts it at 63B params.
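Where do those two numbers come from? One way to get them is to minimize the parametric law subject to a fixed compute budget, which has a closed form. The sketch below assumes the Approach 3 constants from the Chinchilla paper and the approximation C ≈ 6ND; `compute_optimal` and the other names are hypothetical helpers, not code from the paper.

```python
A, B, E = 406.4, 410.7, 1.69   # assumed Approach 3 constants
ALPHA, BETA = 0.34, 0.28


def compute_optimal(flops):
    """Minimize A/N**ALPHA + B/D**BETA subject to 6 * N * D = flops.

    Setting the derivative with respect to N to zero (with D = flops / (6 N))
    gives N**(ALPHA + BETA) = (ALPHA * A) / (BETA * B) * (flops / 6)**BETA.
    """
    g = (ALPHA * A / (BETA * B)) ** (1 / (ALPHA + BETA))
    n_opt = g * (flops / 6) ** (BETA / (ALPHA + BETA))
    d_opt = (flops / 6) / n_opt
    return n_opt, d_opt


palm_flops = 6 * 540e9 * 780e9   # PaLM's budget under C ~= 6ND
n_opt, d_opt = compute_optimal(palm_flops)
palm_opt_loss = A / n_opt**ALPHA + B / d_opt**BETA + E

print(f"palm_opt params: {n_opt / 1e9:.0f}B")
print(f"palm_opt tokens: {d_opt / 1e12:.1f}T "
      f"({d_opt / 1.4e12:.1f}x Chinchilla's 1.4T)")
print(f"palm_opt loss:   {palm_opt_loss:.3f}")
```

With PaLM's budget this lands near 63B parameters and 6.7T tokens, in line with the figures above.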
Okay, so we just need to get 6.7T tokens and . . . wait, how exactly are we going to get 6.7T tokens? How much text data is there, exactly?
2. are we running out of data?
It is frustratingly hard to find an answer to this question.
The main moral I want to get across in this post is that the large LM community has not taken data scaling seriously enough.
LM papers are meticulous about $N$ -- doing all kinds of scaling analyses on models of various sizes, etc. There has been tons of smart discussion about the hardware and software demands of training high-$N$ models. The question "what would it take to get to 1T params? (or 10T?)" is on everyone's radar. Data size $D$, by contrast, gets far less care:
- Everyone trained their big models on 300B tokens, for no particular reason, until this paper showed how hilariously wasteful this is
- Papers rarely do scaling analyses that vary data size -- as if the concepts of "LM scaling" and "adding more parameters" have effectively merged in people's minds
- Papers basically never talk about what it would take to scale their datasets up by 10x or 50x
- The data collection sections of LM papers tend to be vague and slapdash, often failing to answer basic questions like "where did you scrape these webpages from?" or "how many more could you scrape, if you wanted to?"
As a particularly egregious example, here is what the LaMDA paper says about the composition of their training data:
The pre-training data, called Infiniset, is a combination of dialog data from public dialog data and other public web documents. It consists of 2.97B documents and 1.12B dialogs with 13.39B utterances. The composition of the data is as follows: 50% dialogs data from public forums; 12.5% C4 data; 12.5% code documents from sites related to programming like Q&A sites, tutorials, etc; 12.5% Wikipedia (English); 6.25% English web documents; and 6.25% Non-English web documents. The total number of words in the dataset is 1.56T.
"Dialogs data from public forums"? Which forums? Did you use all the forum data you could find, or only 0.01% of it, or something in between? And why measure words instead of tokens -- unless they meant tokens?
If people were as casual about scaling $N$ as this quotation is about scaling $D$, the methods sections of large LM papers would all be a few sentences long. Instead, they tend to look like this (excerpted from ~3 pages of similar material):
...anyway. How much more data could we get?
This question is complicated by the fact that not all data is equally good.
(This messy Google sheet contains the calculations behind some of what I say below.)
If you just want a lot of text, the easiest way to get it is from web scrapes like Common Crawl.
But these are infamously full of garbage, and if you want to train a good LM, you probably want to aggressively filter them for quality. And the papers don't tell us how much total web data they have, only how much filtered data.
The training dataset used for Gopher and Chinchilla is called MassiveText, and the web scrape portion of it is called MassiveWeb. This data originates in a mysterious, unspecified web scrape, which is funneled through a series of filters, including quality heuristics and an attempt to only keep English text.
MassiveWeb is 506B tokens. Could it be made bigger, by scaling up the original web scrape? That depends on how complete the original web scrape was -- but we know nothing about it.
The GLaM/PaLM web corpus
PaLM used a different web scrape corpus. It was first used in this paper about "GLaM," which again did not say anything about the original scraping process, only describing the quality filtering they did (and not in much detail).
The GLaM paper says its filtered web corpus is 143B tokens. That's a lot smaller than MassiveWeb. Is that because of the filtering? Because the original scrape was smaller? Dunno.
To further complicate matters, the PaLM authors used a variant of the GLaM dataset which made multilingual versions of (some of?) the English-only components.
How many tokens did this add? They don't say.
We are told that 27% (211B) of PaLM's training tokens came from this web corpus, and we are separately told that they tried to avoid repeating data. So the PaLM version of the GLaM web corpus is probably at least 211B, versus the original 143B. (Though I am not very confident of that.)
Still, that's much smaller than MassiveWeb. Is this because they had a higher quality bar (which would be bad news for further data scaling)? They do attribute some of PaLM's success to quality filtering, citing the ablation on this in the GLaM paper.
It's hard to tell, but there is this ominous comment, in the section where they talk about PaLM vs. Chinchilla:
Although there is a large amount of very high-quality textual data available on the web, there is not an infinite amount. For the corpus mixing proportions chosen for PaLM, data begins to repeat in some of our subcorpora after 780B tokens, which is why we chose that as the endpoint of training. It is unclear how the “value” of repeated data compares to unseen data for large-scale language model training.
The subcorpora that start to repeat are probably the web and dialogue ones.
Read literally, this passage seems to suggest that even the vast web data resources available to Google Research (!) are starting to strain against the data demands of large LMs. Is that plausible? I don't know.
We can speak with more confidence about text in specialized domains that's less common on the open web, since there's less of it out there, and people are more explicit about where they're getting it.
If you want code, it's on Github. There's some in other places too, but if you've exhausted Github, you probably aren't going to find orders of magnitude of additional code data. (I think?)
We've more-or-less exhausted Github. It's been scraped a few times with different kinds of filtering, which yielded broadly similar data sizes:
- The Pile's scrape had 631GB of text, and ~299B tokens
- The MassiveText scrape had 3.1TB of text, and 506B tokens
- The PaLM scrape had only 196GB of text (we aren't told how many tokens)
- The Codex paper's scrape was python-only and had 159GB of text
(The text to token ratios vary due to differences in how whitespace was tokenized.)
All of these scrapes contained a large fraction of the total code available on Github (in the Codex paper's case, just the python code).
Generously, there might be ~1T tokens of code out there, but not vastly more than that.
If you want to train a model on advanced academic research in physics or mathematics, you go to Arxiv.
For example, Arxiv was about half the training data for the math-problem-solving LM Minerva.
We've exhausted Arxiv. Both the Minerva paper and the Pile use basically all of Arxiv, and it amounts to a measly 21B tokens.
Books? What exactly are "books"?
In the Pile, "books" means the Books3 corpus, which means "all of Bibliotik." It contains 196,640 full-text books, amounting to only 27B tokens.
In MassiveText, a mysterious subset called "books" has 560B tokens. That's a lot more than the Pile has! Are these all the books? In . . . the world? In . . . Google books? Who even knows?
In the GLaM/PaLM dataset, an equally mysterious subset called "books" has 390B tokens.
Why is the GLaM/PaLM number so much smaller than the MassiveText number? Is it a tokenization thing? Both of these datasets were made by Google, so it's not like the Gopher authors have special access to some secret trove of forbidden books (I assume??).
If we want LMs to learn the kind of stuff you learn from books, and not just from the internet, this is what we have.
As with the web, it's hard to know what to make of it, because we don't know whether this is "basically all the books in the world" or just some subset that an engineer pulled at one point in time.
"all the data we have"
In my spreadsheet, I tried to make a rough, erring-on-generous estimate of what you'd get if you pooled together all the sub-corpora mentioned in the papers I've discussed here.
I tried to make it an overestimate, and did some extreme things like adding up both MassiveWeb and the GLaM/PaLM web corpus as though they were disjoint.
The result was ~3.2T tokens, or
- about 1.6x the size of MassiveText
- about 48% of the data we would need to train palm_opt (3.2T out of 6.7T tokens)
Recall that this already contains "basically all" of the open-source code in the world, and "basically all" of the theoretical physics papers written in the internet era -- within an order of magnitude, anyway. In these domains, the "low-hanging fruit" of data scaling are not low-hanging at all.
what is compute? (on a further barrier to data scaling)
Here's another important comment from the PaLM paper's Chinchilla discussion. This is about barriers to doing a head-to-head comparison experiment:
If the smaller model were trained using fewer TPU chips than the larger model, this would proportionally increase the wall-clock time of training, since the total training FLOP count is the same. If it were trained using the same number of TPU chips, it would be very difficult to maintain TPU compute efficiency without a drastic increase in batch size. The batch size of PaLM 540B is already 4M tokens, and it is unclear if even larger batch sizes would maintain sample efficiency.
In LM scaling research, all "compute" is treated as fungible. There's one resource, and you spend it on params and steps, where compute = params * steps.
But params can be parallelized, while steps cannot.
You can take a big model and spread it (and its activations, gradients, Adam buffers, etc.) across a cluster of machines in various ways. This is how people scale up in practice.
But to scale up $D$, you have to either:
- take more optimization steps -- an inherently serial process, which takes linearly more time as you add data, no matter how fancy your computers are
- increase the batch size -- which tends to degrade model quality beyond a certain critical size, and current high-$N$ models are already pushing against that limit
Thus, it is unclear whether the "compute" you spend in high-$D$ models is as readily available (and as bound to grow over time) as we typically imagine "compute" to be.
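A back-of-the-envelope sketch of the serial cost: if the batch size stays at the 4M tokens quoted above for PaLM 540B, the number of optimizer steps grows linearly with the token count. Holding the batch size constant for the whole run is a simplifying assumption of this sketch, and the names are illustrative.

```python
BATCH_TOKENS = 4e6   # PaLM 540B's batch size, assumed fixed for the whole run


def optimizer_steps(n_tokens, batch_tokens=BATCH_TOKENS):
    """Number of sequential optimizer steps needed to see n_tokens once."""
    return n_tokens / batch_tokens


palm_steps = optimizer_steps(780e9)       # PaLM as trained
palm_opt_steps = optimizer_steps(6.7e12)  # the compute-optimal variant
step_ratio = palm_opt_steps / palm_steps

print(f"PaLM:     {palm_steps:,.0f} steps")
print(f"palm_opt: {palm_opt_steps:,.0f} steps ({step_ratio:.1f}x as many)")
```

More machines can shrink the time per step for a smaller model only up to a point; they cannot reduce the number of steps unless the batch grows.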
If LM researchers start getting serious about scaling up data, no doubt people will think hard about this question, but that work has not yet been done.
appendix: to infinity
Earlier, I observed that Chinchilla beats any Gopher of arbitrary size.
The graph below expands on that observation, by including two variants of each model:
- one with the finite-model term set to zero, i.e. the infinite-parameter limit
- one with the finite-data term set to zero, i.e. the infinite-data limit
(There are two x-axes, one for data and one for params. I included the latter so I have a place to put the infinite-data models without making an infinitely big plot.
The dotted line is Chinchilla, to emphasize that it beats infinite-params Gopher.)
The main takeaway IMO is the size of the gap between ∞ data models and all the others. Just another way of emphasizing how skewed these models are toward $N$, and away from $D$.
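The two limits are simple to tabulate. The sketch below zeroes out one term at a time for three of the models discussed in the post, assuming the Chinchilla paper's Approach 3 constants; the names are illustrative.

```python
A, B, E = 406.4, 410.7, 1.69   # assumed Approach 3 constants
ALPHA, BETA = 0.34, 0.28

models = {
    "Gopher": (280e9, 300e9),      # (parameters, training tokens)
    "Chinchilla": (70e9, 1.4e12),
    "PaLM": (540e9, 780e9),
}

rows = {}
for name, (n_params, n_tokens) in models.items():
    model_term = A / n_params**ALPHA
    data_term = B / n_tokens**BETA
    actual = model_term + data_term + E
    inf_params = data_term + E   # finite-model term set to zero
    inf_data = model_term + E    # finite-data term set to zero
    rows[name] = (actual, inf_params, inf_data)
    print(f"{name:<10} actual={actual:.3f} "
          f"inf-params={inf_params:.3f} inf-data={inf_data:.3f}")
```

For every model in the table, removing the data term buys far more than removing the model term.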
See their footnote 2
See their equation (10)
Is 0.052 a "small" amount in some absolute sense? Not exactly, but (A) it's small compared to the loss improvements we're used to seeing from new models, and (B) small compared to the improvements possible by scaling data.
In other words, (A) we have spent a few years plucking low-hanging fruit much bigger than this, and (B) there are more such fruit available.
The two terms are still a bit imbalanced, but that's largely due to the "Approach 3 vs 1/2" nuances mentioned above.
Caveat: Gopher and Chinchilla were trained on the same data distribution, but these other models were not. Plugging them into the equation won't give us accurate loss values for the datasets they used. Still, the datasets are close enough that the broad trend ought to be accurate.
Wait, isn't that smaller than Chinchilla?
This is another Approach 3 vs. 1/2 difference.
Chinchilla was designed with Approaches 1/2. Using Approach 3, like we're doing here, gives you a Chinchilla of only 33B params, which is lower than our palm_opt's 63B.
Seriously, I can't find anything about it in the Gopher paper. Except that it was "collected in November 2020."
It is not even clear that this multilingual-ization affected the web corpus at all.
Their datasheet says they "used multilingual versions of Wikipedia and conversations data." Read literally, this would suggest they didn't change the web corpus, only those other two.
I also can't tell if the original GLaM web corpus was English-only to begin with, since that paper doesn't say.
This ablation only compared filtered web data to completely unfiltered web data, which is not a very fine-grained signal. (If you're interested, EleutherAI has done more extensive experiments on the impact of filtering at smaller scales.)
They are being a little coy here. The received wisdom by now is that repeating data is really bad for LMs and you should never do it. See this paper and this one.
EDIT 11/15/22: but see also the Galactica paper, which casts significant doubt on this claim.
The Pile authors only included a subset of this in the Pile.
The MassiveText datasheet says only that "the books dataset contains books from 1500 to 2008," which is not especially helpful.
Thought-provoking post, thanks.
One important implication is that pure AI companies such as OpenAI, Anthropic, Conjecture, Cohere are likely to fall behind companies with access to large amounts of non-public-internet text data like Facebook, Google, Apple, perhaps Slack. Email and messaging are especially massive sources of "dark" data, provided they can be used legally and safely (e.g. without exposing private user information). Taking just email, something like 500 billion emails are sent daily, which is more text than any LLM has ever been trained on (admittedly with a ton of duplication and low quality content).
Another implication is that federated learning, data democratization efforts, and privacy regulations like GDPR are much more likely to be critical levers on the future of AI than previously thought.
Another implication is that centralised governments with the ability to aggressively collect and monitor citizens' data, such as China, could be major players.
A government such as China has no need to scrape data from the Internet, while being mindful of privacy regulations and copyright. Instead they can demand 1.4 billion people's data from all of their domestic tech companies. This includes everything such as emails, texts, WeChat, anything that the government desires.
When you exhaust all the language data from text, you can start extracting language from audio and video.
As far as I know the largest public repository of audio and video is YouTube. We can do a rough back-of-the-envelope computation for how much data is in there:
So it seems like video doesn't save us, if we just use it for the language data. We could do self-supervised learning on the video data, but for that we need to know the scaling laws for video (has anyone done that?).
There are a few things in the calculation that seem wrong to me:
In any case, yeah, this does not seem like a huge amount of data. But there's enough order-of-magnitude fuzziness in the estimate that it does seem like it's worth someone's time to look into more seriously.
I agree that this points in the direction of video becoming increasingly important.
But why assume only 1% is useful? And more importantly, why use only the language data? Even if we don't have the scaling laws, it seems pretty clear that there's a ton of information in the non-language parts of videos that'd be useful to a general-purpose agent -- almost certainly more than in the language parts. (Of course, it'll take more computation to extract the same amount of useful information from video than from text.)
Thanks for posting this, it was really interesting. Some very dumb questions from someone who doesn't understand ML at all:
1. All of the loss numbers in this post "feel" very close together, and close to the minimum loss of 1.69. Does loss only make sense on a very small scale (like from 1.69 to 2.2), or is this telling us that language models are very close to optimal and there are only minimal remaining possible gains? What was the loss of GPT-1?
2. Humans "feel" better than even SOTA language models, but need less training data than those models, even though right now the only way to improve the models is through more training data. What am I supposed to conclude from this? Are humans running on such a different paradigm that none of this matters? Or is it just that humans are better at common-sense language tasks, but worse at token-prediction language tasks, in some way where the tails come apart once language models get good enough?
3. Does this disprove claims that "scale is all you need" for AI, since we've already maxed out scale, or are those claims talking about something different?
Loss values are useful for comparing different models, but I don't recommend trying to interpret what they "mean" in an absolute sense. There are various reasons for this.
One is that the "conversion rate" between loss differences and ability differences (as judged by humans) changes as the model gets better and the abilities become less trivial.
Early in training, when the model's progress looks like realizing "huh, the word 'the' is more common than some other words", these simple insights correspond to relatively large decreases in loss. Once the model basically kinda knows English or whatever the language is, it's already made most of the loss progress it's going to make, and the further insights we really care about involve much smaller changes in loss. See here for more on this by gwern.
No one really knows, but my money is on "humans are actually better at this through some currently-unknown mechanism," as opposed to "humans are actually bad at this exact thing."
Why do I think this?
Well, the reason we're here talking about this at all is that LMs do write text of spookily high quality, even if they aren't as good as humans at it. That wasn't always...
I paid people to try to get good at this game, and also various smart people like Paul Christiano tried it for a few hours, and everyone was still notably worse than GPT-2-sm (about the size of GPT-1).
EDIT: These results are now posted here.
I expect I would improve significantly with additional practice (e.g. I think a 2nd hour of playing the probability-assignment game would get a much higher score than my 1st in expectation). My subjective feeling was that I could probably learn to do as well as GPT-2-small (though estimated super noisily) but there's definitely no way I was going to get close to GPT-2.
I'm wary of the assumption that we can judge "human ability" on a novel task X by observing performance after an hour of practice.
There are some tasks where performance improves with practice but plateaus within one hour. I'm thinking of relatively easy video games. Or relatively easy games in general, like casual card/board/party games with simple rules and optimal policies. But most interesting things that humans "can do" take much longer to learn than this.
Here are some things that humans "can do," but require >> 1 hour of practice to "do," while still requiring far less exposure to task-specific example data than we're used to in ML:
- Reporting calibrated numeric credences, a prerequisite for both superforecasting and the GPT game (does this take >> 1 hour? I would guess so, but I'm not sure)
- Playing video/board/card games of nontrivial difficulty or depth
- Speaking any given language, even when learned during the critical language acquisition period
- Driving motor vehicles like cars (arguably) and planes (definitely)
- Writing good prose, for any conventional sense of "good" in any genre/style
- Computer programming (with any prof
Ok, sounds like you're using "not too much data/time" in a different sense than I was thinking of; I suspect we don't disagree. My current guess is that some humans could beat GPT-1 with ten hours of practice, but that GPT-2 or larger would be extremely difficult and plausibly impossible with any amount of practice.
Based on the language modeling game that Redwood made, it seems like humans are much worse than models at next word prediction (maybe around the performance of a 12-layer model)
You might be interested in looking at the progress being made on the RWKV-LM architecture, if you aren't following it. It's an attempt to train an RNN like a transformer. Initial numbers look pretty good.
Why do we say that we need less training data? Every minute instant of our existence is a multisensory point of data from before we've even exited the womb. We spend months, arguably years, hardly capable of anything at all yet still taking and retaining data. Unsupervised and mostly redundant, sure, but certainly not less than a curated collection of Internet text. By the time we're teaching a child to say "dog" for the first time they've probably experienced millions of fragments of data on creatures of various limb quantities, hair and fur types, sizes, sounds and smells, etc.; so they're already effectively pretrained on animals before we first provide a supervised connection between the sound "dog" and the sight of a four-limbed hairy creature with long ears on a leash.
I believe that humans exceed the amount of data ML models have by multiple orders of magnitude by the time we're adults, even if it's extremely messy.
Does this imply that AGI is not as likely to emerge from language models as might have been thought? To me it looks like it's saying that the only way to get enough data would be to have the AI actively interacting in the world - getting data itself.
I definitely think it makes LM --> AGI less likely, although I didn't think it was very likely to begin with.
I'm not sure that the AI interacting with the world would help, at least with the narrow issue described here.
If we're talking about data produced by humans (perhaps solicited from them by an AI), then we're limited by the timescales of human behavior. The data sources described in this post were produced by millions of humans writing text over the course of decades (in rough order-of-magnitude terms).
All that text was already there in the world when the current era of large LMs began, so large LMs got to benefit from it immediately, "for free." But once it's exhausted, producing more is slow.
IMO, most people are currently overestimating the potential of large generative models -- including image models like DALLE2 -- because of this fact.
There was all this massive data already sitting around from human activity (the web, Github, "books," Instagram, Flickr, etc) long before ML compute/algorithms were anywhere near the point where they needed more data than that.
When our compute finally began to catch up with our data, we effectively spent all the "stored-up p...
I don't think the real world is good enough either.
The fact that humans feel a strong sense of the Tetris effect suggests to me that the brain is constantly generating and training on synthetic data.
My two cents contra updates towards longer or more uncertain AGI timelines given the information in this post:
I am curious about this "irreducible" term in the loss. Apologies if this is covered by the familiarity with LM scaling laws mentioned as a prerequisite for this article.
When you say "irreducible", does that mean "irreducible under current techniques" or "mathematically irreducible", or something else?
Do we have any idea what a model with, say, 1.7 loss (i.e, a model almost arbitrarily big in compute and data, but with the same 1.69 irreducible) would look like?
Closer to the former, and even more restrictive: "irreducible with this type of model, trained in this fashion on this data distribution."
Because language is a communication channel, there is presumably also some nonzero lower bound on the loss that any language model could ever achieve. This is different from the "irreducible" term here, and presumably lower than it, although little is known about this issue.
Not really, although section 5 of this post expresses some of my own intuitions about what this limit looks like.
Keep in mind, also, that we're talking about LMs trained on a specific data distribution, and only evaluating their loss on data sampled from that same distribution.
So if an LM achieved 1.69 loss on MassiveText (or a scaled-up corpus that looked like MassiveText in all respects but size), it would do very well at mimicking all the types of text present in MassiveText, but that does not mean it could mimic every existing kind of text (much less every conceivable kind of text).
Such a game already exists! See https://rr-lm-game.herokuapp.com/whichonescored2 and https://rr-lm-game.herokuapp.com/. I’ve been told humans tend to do pretty badly at the games (I didn’t do too well myself), so if you feel discouraged playing and want a similar style of game that’s perhaps a bit more fun (if slightly less relevant to the question at hand), I recommend https://www.redactle.com/. Regardless, I guess I’m thinking of loss (in humans) in the more abstract sense of “what’s the distance between the correct and human-given answer [to an arbitrary question about the real world]?” If there’s some mathematically necessary positive amount of loss humans must have at a minimum, that would seemingly imply that there are fundamental limits to the ability of human cognition to model reality.
An important distinction here is that the number of tokens a model was trained for should not be confused with the number of tokens in a dataset: if each token is seen exactly once during training then it has been trained for one "epoch".
In my experience, scaling continues for quite a few epochs over the same dataset; only when the model has more parameters than the dataset has tokens, and training runs for >10 epochs, does overfitting kick in and scaling break down.
This distinction exists in general, but it's irrelevant when training sufficiently large LMs.
It is well-established that repeating data during large LM training is not a good practice. Depending on the model size and the amount of repeating, one finds that it is either
with (2) kicking in earlier (in terms of the amount of repeating) for larger models, as shown in this paper (Figure 4 and surrounding discussion).
For more, see
I think it would be a great follow-up post to explain why you think repeating data is not going to be the easy way out for the scaling enthusiasts at Deepmind and OpenAI.
I find the Figure 4 discussion at your first link quite confusing. They study repeated data, i.e. imbalanced datasets, and then draw conclusions about repeating data, i.e. training for several epochs. The performance hit they observe seems not to be massive (when talking about scaling a couple of OOMs), and they keep the number of training tokens constant.
I really can't tell how this informs me about what would happen if somebody tried to scale compute 1000-fold and had to repeat data to do it compute-optimally, which seems to be the relevant question.
You're right, the idea that multiple epochs can't possibly help is one of the weakest links in the post. Sometime soon I hope to edit the post with a correction / expansion of that discussion, but I need to collect my thoughts more first -- I'm kinda confused by this too.
After thinking more about it, I agree that the repeated-data papers don't provide much evidence that multiple epochs are harmful.
For example, although the Anthropic repeated-data paper does consider cases where a non-small fraction of total training tokens are repeated more than once. In their most extreme case,
But this effectively lowers the total size of the model's training dataset -- the number of training tokens is held constant (100B), so the repeated copies are taking up space that would otherwise be used for fresh data. For example, if the repeated tokens are repeated 2 times, then we are only using 3/4 of the data we could be (we select 1/2 for the unrepeated part, and then select 1/4 and repeat it twice for ...
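The 3/4 figure in the example above can be checked with a small sketch. It assumes a fixed token budget in which some share of the slots is filled by copies of a repeated subset; `unique_fraction` and its parameter names are hypothetical, introduced only for illustration.

```python
from fractions import Fraction


def unique_fraction(repeated_share: Fraction, repeats: int) -> Fraction:
    """Share of a fixed token budget that is distinct data.

    repeated_share: portion of the budget filled by repeated copies.
    repeats: how many times each repeated token appears.
    """
    unrepeated = 1 - repeated_share          # seen exactly once
    distinct_repeated = repeated_share / repeats  # distinct tokens behind the copies
    return unrepeated + distinct_repeated


# Half the budget is unrepeated; the other half is a quarter-sized
# subset shown twice, as in the comment's example.
print(unique_fraction(Fraction(1, 2), 2))  # 3/4
```

With more repeats at the same share, the distinct fraction falls further, which is why holding the token count constant confounds "repetition" with "less data."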
Thanks, that's interesting... the odd thing about using a single epoch, or even two epochs, is that you're treating the data points differently. To extract as much knowledge as possible from each data point (to approach L(D)), there should be some optimal combination of pre-training and learning rate. The very first step, starting from random weights, presumably can't extract high level knowledge very well because the model is still trying to learn low level trends like word frequency. So if the first batch has valuable high level patterns and you never revisit it, it's effectively leaving data on the table. Maybe with a large enough model (or a large enough batch size?) this effect isn't too bad though.
That sounds surprising. You are claiming that you observe the exact same loss, and downstream benchmarks, if you train a model on a dataset for 10 epochs as you do training on 10x more data for 1 epoch?
I would have expected some substantial degradation in efficiency such that the 10-epoch case was equivalent to training on 5x the data or something.
Twitter points me to an instance of this with T5, Figure 6/Table 9: at the lowest tested level of 64 repeats, there is slight downstream benchmark harm but still a lot less than I would've guessed.
Not sure how strongly to take this: those benchmarks are weak, not very comprehensive, and wouldn't turn up harm to interesting capabilities like few-shots or emergent ones like inner-monologues; but on the other hand, T5 is also a pretty strong model-family, was SOTA in several ways at the time & the family regularly used in cutting-edge work still, and so it's notable that it's harmed so little.
We're not running out of data to train on, just text.
Why did I not need 1 Trillion language examples to speak (debatable) intelligently? I'd suspect the reason is a combination of inherited training examples from my ancestors, but more importantly, language output is only the surface layer.
In order for language models to get much better, I suspect they need to be training on more than just language. It's difficult to talk intelligently about complex subjects if you've only ever read about them. Especially if you have no eyes, ears, or any other sense data. The best language models are still missing crucial context/info which could be gained through video, audio, and robotic IO.
Combined with this post, this would also suggest our hardware can already train more parameters than we need to in order to get much more intelligent models, if we can get that data from non text sources.
Some other order-of-magnitude estimates on available data, assuming words roughly equal tokens:
Wikipedia: 4B English words, according to this page.
Library of Congress: from this footnote I assume there are at most 100 million books' worth of text in the LoC, and from this page I assume that books are 100k words each, giving 10T words at most.
Constant writing: I estimate that a typical person writes at most 1000 words per day, with maybe 100 million people writing this amount of English on the internet. Over the last 10 years, these writers would have produced 370T words.
Research papers: this page estimates ~4m papers are published each year, at 10k words per paper with 100 years of research this amounts to 4T words total.
So it looks like 10T words is an optimistic order-of-magnitude estimate of the total amount of data available.
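The estimates above are simple products, so they are easy to re-run with different inputs. A rough sketch, using only the figures quoted in this comment (and its assumption that words roughly equal tokens); the dictionary keys are just labels:

```python
# Order-of-magnitude word counts, using the inputs quoted in the comment.
estimates = {
    "Wikipedia": 4e9,                              # 4B English words
    "Library of Congress": 100e6 * 100e3,          # books * words per book
    "Constant writing": 1000 * 100e6 * 365 * 10,   # words/day * people * days
    "Research papers": 4e6 * 10e3 * 100,           # papers/yr * words * years
}

for source, words in estimates.items():
    print(f"{source}: {words / 1e12:.3g}T words")
```

Note that the "constant writing" product comes out at about 365T with a 365-day year, which the comment quotes as 370T; the difference does not matter at this level of precision.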
I assume the importance of a large quantity of clean text data will lead to the construction of a text database of ~1T tokens and that this database (or models trained on it) will eventually be open-sourced.
From there, it seems like really digging in to the sources of irreducible error will be necessary for further scaling. I would guess that a small part of this is...
Can you get anywhere with synthetic data? What happens if you train a model on its own output?
Interesting and thought provoking.
"It's hard to tell, but there is this ominous comment, in the section where they talk about PaLM vs. Chinchilla:". In the context of fears about AI alignment, I would say "hopeful" rather than "ominous" !
A possible avenue to explore is to expand these models to multilingual data. There is perhaps a lot of high-quality text uniquely available in other languages (news, blogs, etc.). Anyway, IMO this effort should probably be directed less toward acquiring the largest amount of data and more toward acquiring high-quality data. Chinchilla's scaling law doesn't include quality as a distinct property, but we have reasons to believe that more challenging texts are much more informative and can compensate for low-data environments.
I'd like to propose not talking publicly about ways to "fix" this issue. Insofar as these results spell trouble for scaling up LLMs, this is a good thing!
Infohazard (meta-)discussions are thorny by their very nature and I don't want to discourage discussions around these results in general, e.g. how to interpret them or whether the analysis has merits.
I disagree, but I'm not sure how relevant my opinion is, since I'm far less worried about "AGI ruin" to begin with than the median LWer. That said, here's my thinking:
First, there's no universally agreed-upon line between "discussing whether the analysis has merits" and "giving the capabilities people free ideas." Where a person draws this line depends on how obvious they think the ideas are, or how obvious they think they will be to the capabilities people.
Second, there are costs to not talking about things. It's useful for alignment research to have a correct sense of where capabilities research is headed, and where it isn't headed. If alignment researchers talk more to one another than to "capabilities people" (true IME), and they practice self-censorship like this, they'll end up with some importantly wrong beliefs.
Also, and perhaps worse -- if alignment researchers never voice their own secret capabilities ideas in fora where "capabilities people" can hear, then they'll never receive feedback about these ideas from the people who know what it would be like to apply them in the real world. Alignment researchers may end up with private stockpiles of...
That wasn't an alignment researcher, though (was it? I thought Tomás was just an interested commenter), and it's a reasonable question to ask when no one's run the numbers, and when you get an answer like 'well, it'd take something like >5000x more compute than PaLM', that's a lesson learned.
At least among the people I've talked to, it seems reasonably well understood that Chinchilla had major implications, meant an immediate capabilities jump and cheaper deployment, and even more importantly meant parameter scaling was dead, and data and then compute were the bottleneck (which is also what I've said bluntly in my earlier comments), and this was why Chinchilla was more important than more splashy stuff like PaLM*. (One capability researcher, incidentally, wasn't revising plans but that's because he wasn't convinced Chinchilla was right in the first place! AFAIK, there has been no dramatic followup to Chinchilla on par with GPT-3 following up Kaplan et al, and in fact, no one has replicated Chinchil...
Yep. Just an interested layman.
What are the public domain internet places where one can learn more about capabilities, or see discussions of capabilities frameworks? Here's what I'm aware of:
All-in-all, there's not that much heavy discussion online. I've been told that these discussions really happen in-person, in the Bay Area and in DeepMind London offices. LessWrong actually ends up having the best discussion (in the capabilities space.)
(Since someone is likely to complain about seeking out more capabilities information, well yes it's risky, but I'm more in agreement with nostalgebraist that this level of discussion is probably harmless, and that it's better we keep an accurate and up-to-date understanding of the situation and technology.)
Mostly Discord servers in my experience: EleutherAI is a big well known one but there are others with high concentrations of top ML researchers.
I upvoted since I think discussing what should or should not be discussed is important, but I tentatively disagree:
That said, LessWrong is de facto one of the best places to discuss AI safety, since the Alignment Forum is invite-only. Thus, it seems that there should be some discussion around which tradeoffs to make on LW between "figuring out what's true" and "not spreading info hazards".
My tentative heuristic for whether you should publish a post that is potentially infohazardy is "Has company-X-who-cares-mostly-about-capabilities likely thought about this already?". It's obviously non-trivial to answer that question but I'm pretty sure most companies who build LLMs have looked at Chinchilla and come to similar conclusions as this post. In case you're unsure, write up the post in a google doc and ask someone who has thought more about infohazards whether they would publish it or not.
Also, I think Leon underestimates how fast a post can spread even if it is just intended for an alignment audience on LW.
It would be quite easy to automatically generate all of the math and logic you could ever want for these models. Far more than you could possibly ever want to train it on (wouldn't want to make it a math-only bot, probably). I could easily program a computer to come up with effectively infinite correct math problems. There are quintillions of 64-bit addition problems alone... (actually an immense underestimate: there are 18.4 quintillion 64-bit numbers alone). Subtraction, multiplication, division, algebra, trig, calculus, statistics, etc; AND, OR, NOT, XOR, N...
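As a rough illustration of the counting claim and of how cheap such data is to produce, here is a minimal sketch. The generator `make_problem` and its output format are hypothetical; nothing here says whether training on such text actually helps a language model.

```python
import random

NUM_64BIT = 2**64                   # 18,446,744,073,709,551,616 distinct values
NUM_OPERAND_PAIRS = NUM_64BIT**2    # ordered operand pairs for one binary operation


def make_problem(rng: random.Random) -> str:
    """Return one correct arithmetic statement over unsigned 64-bit operands."""
    a = rng.randrange(NUM_64BIT)
    b = rng.randrange(NUM_64BIT)
    op = rng.choice(["+", "-", "*"])
    # Python integers are arbitrary precision, so results never overflow.
    result = {"+": a + b, "-": a - b, "*": a * b}[op]
    return f"{a} {op} {b} = {result}"


rng = random.Random(0)  # seeded only so repeated runs print the same lines
for _ in range(3):
    print(make_problem(rng))
```

The count of ordered operand pairs, `2**128`, is what makes "quintillions" an underestimate: the 18.4 quintillion figure is the number of single 64-bit values, not of problems.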
Something I'm unsure about (commenting from my mod-perspective but not making a mod pronouncement) is how LW should relate to posts that lay out ideas that may advance AI capabilities.
My current understanding is that all major AI labs have already figured out the Chinchilla results on their own, but that younger or less in-the-loop AI orgs may have needed to run experiments that took a couple months of staff time. This post was one of the most-read posts on LW this month, and shared heavily around Twitter. It's plausible to me that spreading these ar...
There's also the chance that if these posts are not gated, people who previously weren't plugged into the LW ecosystem but are interested in AI find LW through articles such as this one. And then eventually also start reading other articles here and become more interested in alignment concerns.
There's also a bit of a negative stereotype among some AI researchers as alignment people being theoretical philosophers doing their own thing and being entirely out of touch about what real AI is like. They might take alignment concerns a bit more seriously if they find it easy to actually find competent AI discussion on LW / Alignment Forum.
What is the mechanism you're imagining for this speedup? What happens that would not have happened without this post?
I'm struggling to imagine a situation where a relevant AI org is doing Chinchilla-like scaling experiments, yet somehow has managed to miss this paper (or to ignore/misunderstand it) for 4+ months. The paper is not exactl...
It would be useful to have a more descriptive title, like "Chinchilla's implications for data bottlenecks" or something.
I know of two independently developed LLMs in two languages where the conclusion of the developers is that "we run out of data in our language". One of them is trying to scale by going multilingual.
Where to look next? There is lots of untapped data in speech (radio shows, youtube, etc): that amount could make a difference in my opinion.
This paper came out recently: https://arxiv.org/abs/2207.14502 . It shows a way to work around the lack of sufficient training data for generating computer programs by "generating synthetic programming puzzles and solutions, verified for correctness by a Python interpreter." We can think of analogous generation for data-limited general LLMs and there are some possibilities.
I have a question. Suppose we want to create a decent language model which is as small as possible -- small enough to run on a cell phone, say. We could try to compensate for this by scaling data to infinity. Now, we may run out of data, but if we do, we can generate more data artificially using a much larger LM. For example, consider training something BERT-sized using artificial data generated by PaLM (assume we have a very high compute budget in the training phase).
How well should we expect this to perform? If we plug into the above, it seems ...
Scott Alexander asked something related to this, but it still seems worth being more explicit: what would this perfect 1.69-loss model be like in practice if we got there?
Another interesting corpus (though problematic for legal reasons) would be sci-hub. Quick googling gives estimates of around 50 million research articles; the average research article runs around 4000 words, and sci-hub is estimated to contain about 69% of all research articles published in peer-reviewed journals. That would put sci-hub at about 50 million * 4000 = 200B tokens and the whole scientific journal literature at an estimated 290B tokens.
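The Sci-Hub estimate above is a two-step calculation, sketched here with the comment's own inputs. It treats one word as one token, as the comment does; the variable names are mine.

```python
# Inputs quoted in the comment (rough figures from quick googling).
articles_on_scihub = 50e6   # research articles hosted
words_per_article = 4000    # average article length
coverage = 0.69             # share of peer-reviewed articles that Sci-Hub holds

scihub_words = articles_on_scihub * words_per_article
all_journal_words = scihub_words / coverage  # extrapolate to the full literature

print(f"Sci-Hub: {scihub_words / 1e9:.0f}B words")             # Sci-Hub: 200B words
print(f"All journals: {all_journal_words / 1e9:.0f}B words")   # All journals: 290B words
```

Both figures are well under the 1.4T tokens Chinchilla was trained on, so even the whole journal literature would be a modest addition by this estimate.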
You write "Only PaLM looks better than Chinchilla here, mostly because it trained on 780B tokens instead of 300B or fewer, plus a small (!) boost from its larger size."
But earlier you write:
"Chinchilla is a model with the same training compute cost as Gopher, allocated more evenly between the two terms in the equation.
It's 70B params, trained on 1.4T tokens of data"
300B vs. 1.4T. Is this an error?
Great post. The reason "no one was focused on data" was an error in the original OAI scaling laws paper, which was discovered by Hoffmann et al.
It's worth noting that Ajeya's BioAnchors report estimates that TAI will require a median of 22T data points, nearly an order of magnitude more than the available text tokens as estimated here. See here for more.
Super interesting post. I'm a bit late to the party, but I work in the space and the obvious reason nobody will say specifically what data they used is that their lawyers won't let them. I've literally had conversations with product counsel about how specific we can be in a paper / blog post about our data sources.
Copyright and privacy law is approximately 3 centuries behind the state of technology, and there are not clear rules about what data you're allowed to use for what. Someone like Google or Microsoft won't just say "we scraped Quo...
There is an old (2013) paper from Google here that mentions training an ngram model on 1.3T tokens: ("Our second-level distributed language model uses word 4-grams. The English model is trained on a 1.3 × 10^12 token training set"). An even earlier 2006 blog post here also references a 1T word corpus.
This number is 2x as big as MassiveWeb, more than a decade old, and not necessarily the whole web even back then. So I would be quite surprised if the MassiveWeb 506B token number represents a limit of what's available on the web. My guess would be that there'...
A comment from Hacker News on this piece:
In other words, language models need so ...
Some more questions:
How come PaLM_opt is smaller than Chinchilla? Isn't Chinchilla supposed to be Gopher_opt?
These models were trained differently, which is why they had different scaling laws. Can we suppose that the new scalin...
On the MMLU benchmark, Chinchilla five-shot reported 67.6% accuracy; how does one convert this to loss or vice versa? More to the point, what loss would the human expert 89.8% correspond to? It would be very interesting to see how much compute that scaling law predicts would be necessary to produce human expert level losses with optimal data availability, or with as much data as is likely available to such a project.
How many tokens would we have if we transcribed the audio of as many youtube videos as possible? (After a lot of filters I imagine)
Sorry if this is obvious, but where does the “irreducible” loss come from? Wouldn’t that also be a function of the data, or I guess the data’s predictability?
I would expect the outcome of this to drive capabilities research more towards "learning to learn". Goal being to improve the amount of knowledge that is extracted from each observed piece of data.
They fit a simplistic model where the two variables were independent and the contribution of each decays exponentially. This leads to the shocking conclusion that the two inputs are independent and decay exponentially...
I mean the model is probably fine for its intended purpose: finding the rough optimal ratio of parameters and data for a given budget. It might mean that current models have suboptimal compute budgets. But it doesn't imply anything beyond that, like some hard limit to scaling given our data supply.
If the big tech companies really want to t...