Two months ago, DeepMind announced their new Chinchilla paper. The headline result was that you could get much better performance with the same computational budget by training a smaller model for longer:
However, it seems to have gone mostly unnoticed that the "scaling law" they found has a different functional form from the original one in Kaplan et al. Kaplan shows a "pure" power law:
while the law in the Chinchilla paper has a large constant term and a substantially larger overall exponent. Note that N here (parameters) and D (data) are both roughly proportional to the square root of C (compute) under compute-optimal training:
This results in similar behavior within a certain range, but very different asymptotic behavior:
(Graph produced by plotting L(c) = 26.387 / 10^(0.05c) for Kaplan and L(c) = 1.69 + 514.5 / 10^(0.156c) + 560.2 / 10^(0.1512c) for Chinchilla, where c is log10(FLOPs) and L is loss; I juggled a few terms around so the two curves could be compared directly.) This curve shape appears to be confirmed by their empirical results:
This isn't a super-surprising result IMO - page 17 of Kaplan predicts that a pure power law can't continue indefinitely, and it's not clear how loss translates into practical performance anyway - but it seemed worth noting explicitly.
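For anyone who wants to check the comparison numerically, here is a minimal sketch that evaluates the two curves using the constants quoted in the graph description above. The function names `kaplan_loss` and `chinchilla_loss` are my own labels, not anything from either paper, and the constants are the rearranged ones from this post rather than the papers' original parameterizations.

```python
# Evaluate the two loss curves from the post as functions of c = log10(FLOPs).
# Function names are illustrative; constants are the post's rearranged values.

def kaplan_loss(c: float) -> float:
    """Pure power law: L(c) = 26.387 / 10^(0.05 c). Tends to 0 as c grows."""
    return 26.387 / 10 ** (0.05 * c)


def chinchilla_loss(c: float) -> float:
    """Power-law terms plus a constant: tends to 1.69 as c grows."""
    return 1.69 + 514.5 / 10 ** (0.156 * c) + 560.2 / 10 ** (0.1512 * c)


if __name__ == "__main__":
    print(f"{'log10(FLOPs)':>12}  {'Kaplan':>8}  {'Chinchilla':>10}")
    for c in range(18, 33, 2):
        print(f"{c:>12}  {kaplan_loss(c):>8.3f}  {chinchilla_loss(c):>10.3f}")
```

The two curves stay close over a middle range of compute and then separate: the Kaplan form keeps falling toward zero, while the Chinchilla form flattens out above its 1.69 constant term.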
EDIT: There was an earlier follow-up to the Kaplan paper, by many of the same authors, that also tried to break down scaling into "reducible loss" (which improves with model size) and "irreducible loss" (a constant term) across several different AI domains. Unlike the Chinchilla paper, though, they don't seem to estimate the "irreducible loss" for language models specifically. The paper discussion on LW didn't mention this and I had missed it; thanks to Celestia for pointing it out! Here's a video discussing the results: