Broken Neural Scaling Laws presents a functional form that allows one to extrapolate the downstream (and upstream) performance of large-scale vision, language, math, generative modeling, unsupervised learning, and reinforcement learning tasks, assuming you have enough training runs before a break to extrapolate what happens until the next sharp break.
Below are some highlighted quotes from my conversation with Ethan Caballero about his paper (available on YouTube, Spotify, Google Podcasts, Apple Podcasts). For the full context of each of these quotes, see the accompanying transcript.
Unless otherwise noted, the quotes below are from the conversation and in the transcript.
Broken Neural Scaling Law
The general functional form of a broken neural scaling law (BNSL) is given as follows:

$$y = a + \left(b\,x^{-c_0}\right) \prod_{i=1}^{n} \left(1 + \left(\frac{x}{d_i}\right)^{1/f_i}\right)^{-c_i f_i}$$

where $y$ represents the performance evaluation metric and $x$ represents a quantity that is being scaled (e.g. number of model parameters, amount of compute used for training, training dataset size, or upstream performance). The remaining parameters $a$, $b$, $c_0$, $c_1 \ldots c_n$, $d_1 \ldots d_n$, $f_1 \ldots f_n$ are unknown constants that must be estimated by fitting the above functional form to the data points.
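As a rough illustration of how this functional form behaves, here is a minimal Python sketch. The function names `bnsl` and `loglog_slope` and all parameter values are made up for this example and do not come from the paper's code. It checks numerically that, far from a break, the slope of log(y - a) against log(x) is close to -c_0 before the break and close to -(c_0 + c_1) after it.

```python
import numpy as np

def bnsl(x, a, b, c0, breaks=()):
    """y = a + b * x**(-c0) * prod_i (1 + (x / d_i)**(1 / f_i))**(-c_i * f_i)

    `breaks` holds one (c_i, d_i, f_i) tuple per break, so n = len(breaks).
    """
    x = np.asarray(x, dtype=float)
    y = b * x ** (-c0)
    for c_i, d_i, f_i in breaks:
        y = y * (1.0 + (x / d_i) ** (1.0 / f_i)) ** (-c_i * f_i)
    return a + y

def loglog_slope(x, a, b, c0, breaks):
    """Central-difference slope of log(y - a) against log(x) at the point x."""
    eps = 1e-4
    lo, hi = x * np.exp(-eps), x * np.exp(eps)
    y_lo = bnsl(lo, a, b, c0, breaks) - a
    y_hi = bnsl(hi, a, b, c0, breaks) - a
    return (np.log(y_hi) - np.log(y_lo)) / (2 * eps)

# Hypothetical parameters: a single break at x = 1e4 with sharpness f = 0.2.
params = dict(a=0.1, b=2.0, c0=0.3, breaks=[(0.5, 1e4, 0.2)])
slope_before = loglog_slope(1e1, **params)  # close to -c0
slope_after = loglog_slope(1e7, **params)   # close to -(c0 + c1)
print(slope_before, slope_after)
```

Representing each break as a `(c_i, d_i, f_i)` tuple keeps the number of breaks n implicit in the length of the list, which is only one of several reasonable ways to organize the parameters.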
“Say you have a log-log plot like right here and on the y-axis you have the performance evaluation metric and then on the x-axis you have the scale, so whatever it is you're scaling. So like compute, dataset size, or number parameters. A break is a transition between one (approximately) straight line in a log-log plot and another (approximately) straight line in a log-log plot.”
“You can have various numbers of breaks. n represents the number of breaks. For most problems we care about, it's usually one break, but for certain things such as N-digit addition and Deep Double Descent, it's more than one break.”
“The y-axis can be basically anything. It can be test loss, it can be reward, it can be F1 score, it can be BLEU score, it can be ELO score. It doesn't really matter.”
Collecting the initial data by averaging training runs
“If you have enough seeds and training runs [before the first break, in black in the above left plot], you're able to perfectly extrapolate everything all the way up to the next sharp break.” [otherwise you can predict performance up until the next break only, cf. left plot]
“[The number of seeds] can vary between ten and thousands. I will say for most workloads that people care about, there currently usually is only one (large) break.”
Experiments extrapolated by Broken Neural Scaling Laws
“A ton of large scale vision and language things. And then the stuff that was advertised as unpredictable, like four digit addition and then just non-monotonic stuff that I knew everything else breaks on, like Double Descent. There's this paper that Google released a few weeks ago called Revisiting Neural Scaling Laws and they put out this big benchmark of a zillion experimental data where you have say, a hundred training runs to fit and then there are say, a hundred larger training runs that are held out to evaluate extrapolation. And they did that for a bunch of large scale vision and language things.”
“For four digit arithmetic, there are dramatic breaks for upstream performance. You can get away with million parameter models if you're just training on four digit addition.”
[Note: other experiments include generative modeling of images and multi-agent (and single-agent) reinforcement learning.]
Using Broken Neural Scaling Laws to predict unexpected behavior
Predicting sharp left turns
The sharpness of a break is related to the constant $f_i$ in the broken neural scaling law equation:
“Constant $f_i$ represents the sharpness of break between the $i$th and the $(i+1)$th approximately linear region on a log-log plot; smaller (nonnegative) values of $f_i$ yield a sharper break and intervals (before and after the $i$th break) that are more linear on a log-log plot; larger values of $f_i$ yield a smoother break and intervals (before and after the $i$th break) that are less linear on a log-log plot.”
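To make the role of the sharpness constant concrete, here is a small sketch for a one-break BNSL. Differentiating log(y - a) with respect to log(x) gives a local slope of -c_0 - c_1 * sigmoid(ln(x / d) / f), so the slope change is a logistic curve in log(x) whose width is proportional to f. The helper names and the numbers below are illustrative assumptions, not values from the paper.

```python
import numpy as np

def local_slope(x, c0, c1, d, f):
    """Log-log slope of (y - a) for a one-break BNSL: -c0 - c1 * sigmoid(ln(x/d) / f)."""
    z = (np.log(x) - np.log(d)) / f
    return -c0 - c1 / (1.0 + np.exp(-z))

def transition_width_decades(f):
    """Decades of x over which the slope moves from 10% to 90% of the way to its new value."""
    return 2.0 * f * np.log(9.0) / np.log(10.0)

# One decade before the break, at the break, and one decade after it.
x = np.array([1e3, 1e4, 1e5])
for f in (0.05, 0.5, 2.0):
    slopes = local_slope(x, c0=0.3, c1=0.5, d=1e4, f=f)
    print(f, slopes, transition_width_decades(f))
```

With a small f the slope one decade before the break is still essentially -c_0, which is why points collected there carry little information about the break. With a large f the slope has already started to drift toward its later value.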
The larger $f_i$ is, the easier it could be to predict a sharp left turn. When $f_i$ (of each of the future breaks) is very large (i.e. the break is wide rather than sharp), it is seemingly possible to predict multiple future breaks, although the number of training runs / seeds needed can be very large. It is seemingly possible because in such scenarios the breaks "bleed" into one another, so there is useful signal for making such predictions.
In practical terms, this means that if $f_i$ is large enough, then the signal from $d_i$ propagates back to the black points used for fitting, such that there is enough useful signal for estimating $d_i$ via SciPy curve fitting.
(Here the constant $d_i$ represents where on the x-axis the break between the $i$th and the $(i+1)$th approximately linear region (on a log-log plot) occurs.)
When $f_i$ is very small (i.e. sharp) and nonnegative, one needs a very large number of training runs / seeds from right before that break to perfectly extrapolate scaling behavior from that break to the next sharp break.
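The following sketch shows one way seed averaging and SciPy curve fitting could fit together. It is not the authors' fitting procedure. It assumes a one-break BNSL with a = 0, fits in log space, averages log(y) over seeds (i.e. takes a geometric mean), and uses synthetic data from hypothetical "true" parameters. The fitted points deliberately span a fairly smooth break, which is an easier setting than fitting only on points before a sharp break.

```python
import numpy as np
from scipy.optimize import curve_fit

def log_bnsl_one_break(log_x, log_b, c0, c1, log_d, f):
    """log(y) for a one-break BNSL with a = 0, written stably in log space."""
    return log_b - c0 * log_x - c1 * f * np.logaddexp(0.0, (log_x - log_d) / f)

# Hypothetical ground truth, used only to generate synthetic data.
true = dict(log_b=np.log(5.0), c0=0.2, c1=0.5, log_d=np.log(1e4), f=0.5)

rng = np.random.default_rng(0)
log_x = np.log(np.logspace(2, 5, 25))      # scales used for fitting
n_seeds = 100
clean = log_bnsl_one_break(log_x, **true)
# One row per seed: the same curve plus independent noise in log space.
runs = clean + rng.normal(0.0, 0.05, size=(n_seeds, log_x.size))
mean_log_y = runs.mean(axis=0)             # average over seeds at each scale

p0 = [0.0, 0.3, 0.3, np.log(1e3), 1.0]     # rough initial guess
bounds = ([-10.0, 0.0, -2.0, np.log(1e2), 0.05],
          [10.0, 2.0, 2.0, np.log(1e6), 5.0])
popt, _ = curve_fit(log_bnsl_one_break, log_x, mean_log_y,
                    p0=p0, bounds=bounds, max_nfev=5000)

log_x_big = np.log(1e7)                    # held-out, larger scale
pred = np.exp(log_bnsl_one_break(log_x_big, *popt))
target = np.exp(log_bnsl_one_break(log_x_big, **true))
print("fitted break location d:", np.exp(popt[3]))
print("extrapolated vs. true value at x = 1e7:", pred, target)
```

Averaging over seeds shrinks the noise on each fitted point by roughly the square root of the number of seeds, which is what lets the fit pick up a break whose signal in the fitted points is weak. In practice the initial guess and bounds matter a lot, and a sharper break (smaller f) would need many more seeds for the same quality of extrapolation.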
On the risk of collecting training runs close to the break
Michaël: “Isn't there a concern that if you were running things closer to the break you would already see the deceptive behavior, the bad behavior, and that you could get a deceptive AI just from running things close to the break?”
Ethan: “I mean it sounds plausible but it doesn't get dramatic until basically after the break has happened, if you get what I mean, the break is the transition from this slope to the next slope. Assuming you have a ton of compute to get a zillion seeds, you don't need any points from when the slope is at its full max.”
Modeling non-monotonic behavior
"[Broken Neural Scaling Laws can] model and extrapolate Double Descent. No one was even trying to model non-monotonic stuff with variants of power laws and scaling laws to the best of my knowledge.”
“Interpretability and controllability, those are two classic examples that you'd expect to be more interpretable and more controllable until it's beyond human comprehension. Because it gets smarter than humans. And at that point you'd expect the interpretability or controllability metric to start scaling in the opposite direction. So you want some kind of functional form that's able to express and extrapolate non-monotonic scaling and predict when it's about to happen.”
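As a toy illustration of how the BNSL form can express non-monotonic scaling, the sketch below uses a negative c_1 large enough to flip the sign of the log-log slope after the first break, and a positive c_2 that flips it back after the second. The parameter values are invented to produce a double-descent-like shape and are not fitted to any real data.

```python
import numpy as np

def bnsl(x, a, b, c0, breaks=()):
    """y = a + b * x**(-c0) * prod_i (1 + (x / d_i)**(1 / f_i))**(-c_i * f_i)"""
    x = np.asarray(x, dtype=float)
    y = b * x ** (-c0)
    for c_i, d_i, f_i in breaks:
        y = y * (1.0 + (x / d_i) ** (1.0 / f_i)) ** (-c_i * f_i)
    return a + y

# Log-log slopes are roughly -0.5, then -(0.5 - 1.0) = +0.5, then -(0.5 - 1.0 + 1.0) = -0.5.
breaks = [(-1.0, 1e3, 0.1),   # first break: the metric starts getting worse
          (1.0, 1e5, 0.1)]    # second break: the metric starts improving again

x = np.array([1e1, 1e2, 3e3, 3e4, 1e6, 1e7])
y = bnsl(x, a=0.0, b=1.0, c0=0.5, breaks=breaks)
for x_i, y_i in zip(x, y):
    print(f"x = {x_i:.0e}  y = {y_i:.4f}")
```

The same mechanism would cover a metric that improves with scale and then reverses direction only once: a single break with c_0 + c_1 of opposite sign to c_0 is enough.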
Takeoff Speeds and Deception
Distributed training is inefficient
“Currently [running distributed trainings] doesn't work that well if you're trying to use your compute pretty efficiently.”
Because of this inefficiency, if you had access to many more servers, you could run more inference in parallel, but your training runs would still proceed at the same speed as if you had only one server.
“I view it kind of as like Paul Christiano and Andy Jones have talked about, test time compute versus training compute. There it's almost like you dramatically increase the test time compute, but the training compute kind of stayed the same.”
“If Git Re-Basin is actually real, that one has big implications. It's basically you can train multiple separate models and then merge them together to get what each of them learned. So if in the limit of it's doing the most amazing things that it could possibly be doing, it would imply all the foundation model companies go bankrupt because you can just have a zillion people train small models and open source them and then fuse them together.”
Recursive self improvement won't happen before a sinister stumble
“I don't view that recursive self-improvement is happening as fast as you do. I don't buy that you'll get really, really fast recursive self improvement before a sinister stumble had happened.”
“The very first time, it's not going to be doing it perfectly. The very first time there are going to be some humans in the loop.”
“Even when people train a gigantic trillion parameter model, they're checking on it every few days because they're like, "We got to check in on it because it's super expensive." If something went wrong along the way and you had an out-of-memory error or the run diverged.”
“The hardware stuff's going to come after the software stuff probably, and the hardware stuff is where it's more unbounded but continuous, I agree the software can be a little bit scary because it's more discontinuous but it's more bounded also.”
"assuming you have enough training runs" admittedly does a lot of the work here, especially when the models get large, to the point of this work being of purely theoretical interest for large language models, as of November 2022.
If $f_i$ is extremely small (and nonnegative) then the number of seeds needed is extremely large, which makes it so expensive that it is basically intractable to extrapolate well if one can only use points before the break.
Thanks to Daniel Paleka, Alan Chan and Max Kaufmann for feedback.