Exactly one year ago I wrote a lengthy article about the coming phase of massive scaling. Then nothing much happened. GPT-3 was replicated again and again, but nobody went way beyond GPT-3's parameter count (or at least nobody published what they found). Since last week, this suddenly makes a lot more sense.
Last week DeepMind published a bombshell of a paper, "Training Compute-Optimal Large Language Models". In it they derive new scaling laws based on improved training practices: mostly better learning rate schedules and optimizing with AdamW.
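To make those two knobs concrete, here is a minimal PyTorch sketch of that kind of recipe: AdamW plus a cosine learning rate schedule whose length matches the total number of training steps (the paper notes that an overly long cosine cycle hurts the final loss). The model, step count, and peak learning rate below are placeholders of my own, not values from the paper.

```python
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import CosineAnnealingLR

model = torch.nn.Linear(512, 512)   # stand-in for a transformer
total_steps = 1_000                 # in practice: the full length of the training run
peak_lr = 3e-4

# AdamW instead of plain Adam, with a cosine schedule that decays to
# 10% of the peak learning rate exactly over `total_steps`.
optimizer = AdamW(model.parameters(), lr=peak_lr, weight_decay=0.1)
scheduler = CosineAnnealingLR(optimizer, T_max=total_steps, eta_min=0.1 * peak_lr)

for step in range(total_steps):
    loss = model(torch.randn(8, 512)).pow(2).mean()  # dummy loss for illustration
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    scheduler.step()
```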
These two differences in training large language models change the learning curve so significantly that the resulting scaling laws come out very different: achieving a given loss by scaling up the training data suddenly becomes a lot more effective, while scaling up the parameter count becomes comparably less necessary.
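To get a feel for what that shift means in practice, here is a rough back-of-the-envelope sketch in Python. It leans on two assumptions that are not spelled out in this post: the standard C ≈ 6·N·D estimate for training FLOPs and the paper's approximate rule of thumb of about 20 training tokens per parameter.

```python
def compute_optimal_allocation(compute_flops: float,
                               tokens_per_param: float = 20.0):
    """Return (parameters, training tokens) for a given FLOP budget
    under the new scaling laws, using C ~ 6*N*D and D ~ 20*N."""
    # C = 6 * N * D with D = tokens_per_param * N  =>  N = sqrt(C / (6 * r))
    n_params = (compute_flops / (6.0 * tokens_per_param)) ** 0.5
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

if __name__ == "__main__":
    # A Gopher-scale budget of ~5.76e23 FLOPs comes out near 70B parameters
    # trained on ~1.4T tokens, instead of Gopher's 280B parameters on
    # roughly 300B tokens.
    n, d = compute_optimal_allocation(5.76e23)
    print(f"params ~ {n:.2e}, tokens ~ {d:.2e}")
```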
1a3orn has a great post explaining the details of the paper. I don't want to duplicate his effort; instead I'll focus on how these new scaling laws change the perspective on a possible scaling race between AGI outfits and big tech companies.
The results of the paper mirror the strategic decisions made by OpenAI over the last year or so. Instead of scaling model size further according to their own scaling laws, they have concentrated on using a lot more compute to optimize their language models for commercial use.
What's more, they have also hinted that their successor models are not going to be significantly larger than GPT-3. I think it is very likely they realized, at the very least, that the adoption of AdamW throws a spanner into their own scaling laws. Of course, by now quite a few competitors have blown part of their compute budget on oversized and undertrained models.
The old scaling laws imply that there is an intersection point not that far off from GPT-3 where scaling becomes less effective ("the point where you start overfitting even on deduplicated data" was one intuitive explanation).
The new scaling laws instead imply:

- Model size and training data should be scaled up in roughly equal proportion as the compute budget grows.
- Today's large models, GPT-3 and Gopher included, are considerably oversized and undertrained for the compute that went into them.
- For a fixed compute budget, a smaller model trained on more data reaches a lower loss, and is also cheaper to run at inference time.
- There is no nearby point where additional compute stops paying off, as long as enough training data is available.
All these make pouring more compute into large language models much more enticing.
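For reference, here is my shorthand reading of the paper's headline result, where N is the parameter count, D the number of training tokens, and C the compute budget in FLOPs; the exact exponents depend on which of the paper's fitting approaches you use.

$$
L(N, D) \approx E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}},
\qquad C \approx 6ND
\;\Rightarrow\;
N_{\text{opt}} \propto C^{\frac{\beta}{\alpha+\beta}}, \quad
D_{\text{opt}} \propto C^{\frac{\alpha}{\alpha+\beta}}
$$

With the paper's fitted exponents of roughly α ≈ 0.34 and β ≈ 0.28, both compute-optimal exponents land near 0.5, i.e. parameters and data scale up together, whereas the older scaling laws put the parameter exponent around 0.73 and the data exponent around 0.27.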
DeepMind secretly replicated and surpassed GPT-3 with Gopher within the same year (December 2020), but only published this accomplishment a year later, after GPT-3 had been replicated a couple of times elsewhere. As 1a3orn points out, they are now hiring a team of data engineers. I think it is clear that they are fully on the scaling bandwagon.
I still haven't published the article on the coming phase of massive scaling. Maybe now would be a good time.