Two months ago, DeepMind announced their new Chinchilla paper. The headline result was that you could get much better performance with the same computational budget, by training a smaller model for longer:

However, it seems to have gone mostly unnoticed that the "scaling law" they found has a different functional form from the original one in Kaplan et al. Kaplan shows a "pure" power law:

while the law in the Chinchilla paper has a large constant term and a substantially larger overall exponent. Note that N (parameters) and D (data) are both roughly proportional to the square root of C (compute):
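To make the comparison concrete, here is a rough sketch of how the two functional forms relate once everything is written in terms of compute. The fitted values E, α and β are the ones reported in the Chinchilla paper; the exponents a and b, and the primed constants A' and B' (which just absorb proportionality constants), are illustrative assumptions rather than anything either paper defines in this form.

```latex
% Chinchilla parametric fit, with fitted values as reported in that paper
L(N, D) = E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}},
\qquad E \approx 1.69,\ \alpha \approx 0.34,\ \beta \approx 0.28

% Assumption: N \propto C^{a} and D \propto C^{b}, with a \approx b \approx 0.5
L(C) \approx E + \frac{A'}{C^{\alpha a}} + \frac{B'}{C^{\beta b}},
\qquad \alpha a \approx 0.17,\ \beta b \approx 0.14

% Kaplan-style pure power law in compute, for comparison
L(C) = \left(\frac{C_c}{C}\right)^{\alpha_C}, \qquad \alpha_C \approx 0.05
```

Under this reading, the exponents used in the graph formula further down (0.156 and 0.1512) would correspond to a ≈ 0.46 and b ≈ 0.54, since 0.34 × 0.46 ≈ 0.156 and 0.28 × 0.54 ≈ 0.151.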

This results in similar behavior within a certain range, but very different asymptotic behavior:

(Graph produced by plotting L(c) = 26.387 / 10^(0.05c) for Kaplan and L(c) = 1.69 + 514.5 / 10^(0.156c) + 560.2 / 10^(0.1512c) for Chinchilla, where c is log10(FLOPs) and L is loss; I juggled a few terms around so the two curves could be compared directly.) This curve shape appears to be confirmed by their empirical results:
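For anyone who wants to reproduce the comparison, here is a minimal Python sketch of the two curves. The constants are copied from the formulas in the parenthetical above; the function names (`kaplan_loss`, `chinchilla_loss`, `kaplan_floor_crossing`) are made up for illustration and don't come from either paper.

```python
import math

# Chinchilla's constant term (the "irreducible loss" estimate), from the formula above.
CHINCHILLA_FLOOR = 1.69


def kaplan_loss(c):
    """Kaplan-style pure power law; c is log10(FLOPs)."""
    return 26.387 / 10 ** (0.05 * c)


def chinchilla_loss(c):
    """Chinchilla-style law: constant term plus two power-law terms; c is log10(FLOPs)."""
    return CHINCHILLA_FLOOR + 514.5 / 10 ** (0.156 * c) + 560.2 / 10 ** (0.1512 * c)


def kaplan_floor_crossing():
    """log10(FLOPs) at which the Kaplan curve drops below the Chinchilla constant term."""
    # Solve 26.387 / 10^(0.05 c) = 1.69 for c.
    return math.log10(26.387 / CHINCHILLA_FLOOR) / 0.05


if __name__ == "__main__":
    print(f"{'log10(FLOPs)':>12} {'Kaplan':>8} {'Chinchilla':>11}")
    for c in range(18, 33, 2):
        print(f"{c:>12} {kaplan_loss(c):>8.3f} {chinchilla_loss(c):>11.3f}")
    print(f"Kaplan curve crosses {CHINCHILLA_FLOOR} at c = {kaplan_floor_crossing():.2f}")
```

Running it shows the two curves staying close around c = 20 and then diverging: the Kaplan curve keeps falling and passes below 1.69 a little before c = 24, while the Chinchilla curve flattens out above that value.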

This isn't a super-surprising result IMO - page 17 of Kaplan predicts that a pure power law can't continue indefinitely, and it's not clear how loss translates into practical performance anyway - but it seemed worth noting explicitly.

EDIT: There was an earlier followup to the Kaplan paper, by many of the same authors, that also tried to break down scaling into "reducible loss" (which improved with model size) vs. "irreducible loss" (a constant term) across several different AI domains; although unlike the Chinchilla paper, they don't seem to estimate the "irreducible loss" for language models specifically. The paper discussion on LW didn't mention this and I had missed it, thanks to Celestia for pointing it out! Here's a video discussing the results:

This paper (https://arxiv.org/abs/2010.14701) shows the existence of constant terms in other generative modelling settings and relates them to the entropy of the dataset, beyond which you can't compress. It also gives empirical evidence that downstream performance in things like "finetuning a generative model to be a classifier" continues to improve as you asymptote to the constant. From a physics perspective, the constant term and the coefficients on the power law pieces are "non-universal data", while the exponent is going to tell you more about the model, training scheme, problem, etc.

Thanks, I hadn't seen that! Added it to the post.

FWIW, the fact that the scaling laws were different, extrapolate very differently, and also apparently resolve the contradiction was discussed a lot at the time; I dunno if it was discussed enough, but certainly it was in the discussions here & /r/MLscaling & by Daniel & Nostalgebraist & the usual suspects.

Do you have links handy?

Various discussion in this reddit thread: https://www.reddit.com/r/mlscaling/comments/trwkck/training_computeoptimal_large_language_models/

In particular this comment: https://www.reddit.com/r/mlscaling/comments/trwkck/comment/i2pc6bk/?utm_source=reddit&utm_medium=web2x&context=3

Dang, I've been missing out on juicy Gwern comments! I better follow them on reddit...