Scaling Laws, AI

A Quick Note on AI Scaling Asymptotes

by alyssavance
25th May 2022

7 comments

EmpressCelestia

This paper (https://arxiv.org/abs/2010.14701) shows the existence of constant terms in other generative modelling settings and relates them to the entropy of the dataset, beyond which you can't compress. It also gives empirical evidence that downstream performance on things like "finetuning a generative model to be a classifier" continues to improve as you asymptote to the constant. From a physics perspective, the constant term and the coefficients on the power-law pieces are "non-universal data", while the exponent tells you more about the model, training scheme, problem, etc.
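(If I'm remembering the paper right, the fits there have the form

$$L(x) \;=\; L_\infty + \left(\frac{x_0}{x}\right)^{\alpha_x},$$

with the irreducible term $L_\infty$ interpreted as the entropy of the true data distribution and the reducible power-law piece as the KL divergence between the true distribution and the model.)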

alyssavance

Thanks, I hadn't seen that! Added it to the post.

gwern

FWIW, the fact that the scaling laws were different, extrapolate very differently, and also apparently resolve the contradiction was discussed a lot at the time; I dunno if it was discussed enough, but it certainly came up in the discussions here & on /r/MLscaling & by Daniel & Nostalgebraist & the usual suspects.

alyssavance

Do you have links handy?

hold_my_fish

Various discussion in this reddit thread: https://www.reddit.com/r/mlscaling/comments/trwkck/training_computeoptimal_large_language_models/

In particular this comment: https://www.reddit.com/r/mlscaling/comments/trwkck/comment/i2pc6bk/?utm_source=reddit&utm_medium=web2x&context=3 

Nathan Helm-Burger

Dang, I've been missing out on juicy Gwern comments! I better follow them on reddit...

Lech Mazur

Chinchilla Scaling: A replication attempt

https://arxiv.org/abs/2404.10102


Two months ago, DeepMind announced their new Chinchilla paper. The headline result was that you could get much better performance with the same computational budget, by training a smaller model for longer:

However, it seems to have gone mostly unnoticed that the "scaling law" they found has a different functional form from the earlier one in Kaplan et al. Kaplan et al. report a "pure" power law:
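(In symbols, roughly: the compute-frontier fit in Kaplan et al. is

$$L(C) \;=\; \left(\frac{C_c}{C}\right)^{\alpha_C}, \qquad \alpha_C \approx 0.050,$$

a pure power law with no additive constant, so the predicted loss goes to zero as compute grows; the 26.387 coefficient in the plotted formula below is just $C_c^{\,\alpha_C}$ with $C$ measured in FLOPs.)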

while the law in the Chinchilla paper has a large additive constant and a substantially larger overall exponent. Note that N here (parameters) and D (data) are both roughly proportional to the square root of C (compute):
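(For reference, the parametric fit reported in the Chinchilla paper is roughly

$$L(N, D) \;=\; E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}, \qquad E \approx 1.69,\ A \approx 406.4,\ B \approx 410.7,\ \alpha \approx 0.34,\ \beta \approx 0.28.$$

Substituting the paper's compute-optimal allocation, roughly $N \propto C^{0.46}$ and $D \propto C^{0.54}$ (both close to $\sqrt{C}$), gives the compute exponents $0.34 \times 0.46 \approx 0.156$ and $0.28 \times 0.54 \approx 0.151$ that appear in the plotted formula below.)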

This results in similar behavior over the range of compute budgets covered by the experiments, but very different asymptotic behavior:

(Graph produced by plotting L(c) = 26.387 / 10^(0.05c) for Kaplan and L(c) = 1.69 + 514.5 / 10^(0.156c) + 560.2 / 10^(0.1512c) for Chinchilla, where c is log10(FLOPs) and L is loss; I juggled a few terms around so the two curves could be compared directly.) This curve shape appears to be confirmed by their empirical results:
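For anyone who wants to reproduce the comparison graph, here's a minimal Python sketch of the two analytic curves, using the formulas in the parenthetical above verbatim; the compute range (10^18 to 10^30 FLOPs) is an arbitrary choice for illustration:

```python
# Minimal sketch: plot the Kaplan and Chinchilla loss curves from the formulas above.
# The compute range (10^18 to 10^30 FLOPs) is an arbitrary choice for illustration.
import numpy as np
import matplotlib.pyplot as plt

c = np.linspace(18, 30, 300)  # c = log10(training FLOPs)

# Kaplan et al.: pure power law in compute, no additive constant
kaplan = 26.387 / 10 ** (0.05 * c)

# Chinchilla: additive constant plus two power-law terms (from N and D)
chinchilla = 1.69 + 514.5 / 10 ** (0.156 * c) + 560.2 / 10 ** (0.1512 * c)

plt.plot(c, kaplan, label="Kaplan et al.")
plt.plot(c, chinchilla, label="Chinchilla")
plt.axhline(1.69, linestyle="--", color="gray", label="Chinchilla asymptote (loss = 1.69)")
plt.xlabel("log10(training FLOPs)")
plt.ylabel("loss")
plt.legend()
plt.show()
```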

This isn't a super-surprising result IMO - page 17 of Kaplan predicts that a pure power law can't continue indefinitely, and it's not clear how loss translates into practical performance anyway - but it seemed worth noting explicitly.

EDIT: There was an earlier followup to the Kaplan paper (https://arxiv.org/abs/2010.14701), by many of the same authors, that also tried to break scaling down into "reducible loss" (which improves with model size) vs. "irreducible loss" (an additive constant) across several different AI domains; although, unlike the Chinchilla paper, they don't seem to estimate the "irreducible loss" for language models specifically. The paper discussion on LW didn't mention this and I had missed it; thanks to Celestia for pointing it out! Here's a video discussing the results: