Exactly one year ago I wrote a lengthy article about the coming phase of massive scaling. Then nothing much happened. GPT-3 was replicated again and again, but nobody went way beyond GPT-3's parameter count (and published what they found). Since last week, this suddenly makes a lot more sense.

Last week DeepMind published a bombshell of a paper, "Training Compute-Optimal Large Language Models". In it they detail new scaling laws based on improved best practices - mostly better learning rate schedules and optimizing with AdamW.

These two changes to how large language models are trained alter the learning curve so significantly that the resulting scaling laws come out very different. Suddenly, achieving a given loss by scaling data becomes a lot more effective, while scaling the parameters becomes comparatively less necessary.
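
As a back-of-the-envelope sketch of what the new laws imply (my assumptions, not claims from the post): with the common approximation that training compute is C ≈ 6·N·D FLOPs and the often-quoted Chinchilla rule of thumb of roughly 20 training tokens per parameter, the compute-optimal allocation follows directly:

```python
# Sketch of Chinchilla-style compute-optimal allocation.
# Assumptions (not from the post): training compute C ≈ 6 * N * D FLOPs,
# and the rule of thumb that the optimal token count D is about 20 * N.
def compute_optimal_split(flops: float) -> tuple[float, float]:
    """Return (parameters N, tokens D) for a compute budget in FLOPs."""
    # C = 6 * N * D and D = 20 * N  =>  C = 120 * N**2  =>  N = sqrt(C / 120)
    n_params = (flops / 120) ** 0.5
    n_tokens = 20 * n_params
    return n_params, n_tokens

# Gopher's rough budget: 280B parameters * 300B tokens * 6 FLOPs each.
params, tokens = compute_optimal_split(6 * 280e9 * 300e9)
print(f"params ≈ {params / 1e9:.0f}B, tokens ≈ {tokens / 1e12:.2f}T")
# → params ≈ 65B, tokens ≈ 1.30T
```

Run on Gopher's approximate budget, this lands close to Chinchilla's actual 70B parameters and 1.4T tokens - which is the sense in which Gopher (and GPT-3) turn out to have been oversized for their compute.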

1a3orn has a great post explaining the details of the paper. I don't want to duplicate his effort and will instead focus on how these new scaling laws change the perspective on a possible scaling race between AGI outfits and big tech companies.

The results of the paper mirror the strategic decisions made by OpenAI over the last year or so. Instead of scaling model size further according to their own scaling laws, they have concentrated on using a lot more compute to optimize their language models for commercial use. 

What's more, they have also hinted that even the successor models are not going to be significantly larger than GPT-3. I think it is very likely they realized, at the very least, that the adoption of AdamW throws a spanner into their scaling laws. Of course, by now quite a lot of competitors have blown part of their compute budget on an oversized and undertrained model.

The old scaling laws imply that there is an intersection point not that far off from GPT-3 where scaling becomes less effective ("the point where you start overfitting even on deduplicated data" was one intuitive explanation).
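
To illustrate how differently the two sets of fits allocate a budget (a sketch: the exponents are the published Kaplan-style and Chinchilla-style values, but the GPT-3 reference point and the C ≈ 6·N·D approximation are my assumptions):

```python
# Sketch contrasting old (Kaplan et al. 2020) and new (Chinchilla)
# compute-optimal allocations. Exponents are the published fits:
# old: N_opt ∝ C^0.73, D_opt ∝ C^0.27; new: both ∝ C^0.5 (roughly).
# The GPT-3 reference point and C ≈ 6*N*D are illustrative assumptions.
REF_N, REF_D = 175e9, 300e9            # GPT-3: 175B params, 300B tokens
REF_C = 6 * REF_N * REF_D              # its approximate training FLOPs

def allocate(c: float, n_exp: float, d_exp: float) -> tuple[float, float]:
    """Scale (params, tokens) from the reference point by power laws."""
    return REF_N * (c / REF_C) ** n_exp, REF_D * (c / REF_C) ** d_exp

budget = 100 * REF_C                   # a budget 100x GPT-3's compute
old_n, old_d = allocate(budget, 0.73, 0.27)
new_n, new_d = allocate(budget, 0.50, 0.50)
print(f"old laws: {old_n / 1e12:.1f}T params, {old_d / 1e12:.1f}T tokens")
print(f"new laws: {new_n / 1e12:.2f}T params, {new_d / 1e12:.1f}T tokens")
```

At 100x GPT-3's compute, the old fits want roughly 5T parameters on about 1T tokens, while the new ones want under 2T parameters trained on about 3T tokens: far more data, far fewer parameters.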

The new scaling laws instead imply:

  1. Better performance for the same compute. Chinchilla outperforms Gopher by a lot. 
  2. Cheaper deployment and fine-tuning. Faster response times (at least potentially - probably not true for Chinchilla, because it has the same depth as Gopher). 
  3. Easier scaling from an engineering perspective. Larger models become more and more difficult to train, not just for pure engineering reasons, but also because training becomes less stable around 100B parameters (the BigScience 100B model apparently keeps blowing up). Getting more data does not seem to be a problem in the medium term.
  4. An "irreducible" loss that is not that far off, i.e. might be approachable within the next one or two decades (according to Gwern, though I can't find the link right now). (However, it is irreducible only in the sense that it is not changed by scaling data or parameters. It might represent a performance that can still be improved upon by different approaches, and it is not necessarily comparable to human-level intelligence.)
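
Point 4 can be made concrete with the parametric loss form the paper fits, L(N, D) = E + A/N^α + B/D^β, where E is the irreducible term. The constants below are the paper's reported fits as I recall them and should be treated as approximate:

```python
# Sketch of the parametric loss fit from the Chinchilla paper:
#   L(N, D) = E + A / N**alpha + B / D**beta
# E is the "irreducible" term: the loss that remains even as parameters N
# and tokens D both go to infinity. Constants are the paper's reported
# fits, quoted from memory - treat them as approximate.
E, A, B = 1.69, 406.4, 410.7
ALPHA, BETA = 0.34, 0.28

def loss(n_params: float, n_tokens: float) -> float:
    """Predicted training loss for a model of n_params on n_tokens."""
    return E + A / n_params**ALPHA + B / n_tokens**BETA

# As N and D grow, the loss approaches E from above but never crosses it.
print(loss(70e9, 1.4e12))   # roughly Chinchilla-scale
print(loss(1e15, 1e15))     # far larger model and dataset: close to E
```

Note that both reducible terms shrink as power laws, so "approachable" here means getting within a small margin of E, not reaching it.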

All these make pouring more compute into large language models much more enticing. 

DeepMind quietly replicated and surpassed GPT-3 with Gopher within the same year (December 2020), and only published this accomplishment a year later, after GPT-3 had been duplicated a couple of times. As 1a3orn points out, they are now hiring a team of data engineers. I think it is clear that they are fully on the scaling bandwagon.

I still haven't published the article on the coming phase of massive scaling. Maybe now would be a good time. 

