Note that, in your example, if we do see double descent, it's because the best hypothesis was previously not in the class of hypotheses we were considering. Bayesian methods tend to do badly when the hypothesis class is misspecified.

Yep, that's exactly my model.

As a counterpoint though, you could see double descent even if your hypothesis class always contains the truth, because the "best" hypothesis need not be the truth.

If "best" here means test error, then presumably the truth should generalize at least as well as any other hypothesis.

That first stage is not just a "likelihood descent", it is a "likelihood + prior descent", since you are choosing hypotheses based on the posterior, not based on the likelihood.

True for the Bayesian case, though unclear in the ML case—I think it's quite plausible that current ML underweights the implicit prior of SGD relative to the maximizing the likelihood of the data (EDIT: which is another reason that better future ML might care more about inductive biases).

Inductive biases stick around

by evhub 2 min read18th Dec 201914 comments


Ω 23

Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

This post is a follow-up to Understanding “Deep Double Descent”.

I was talking to Rohin at NeurIPS about my post on double descent, and he asked the very reasonable question of why exactly I think double descent is so important. I realized that I hadn't fully explained that in my previous post, so the goal of this post is to further address the question of why you should care about double descent from an AI safety standpoint. This post assumes you've read my Understanding “Deep Double Descent” post, so you should read that first before reading this if you haven't already.

Specifically, I think double descent demonstrates the in my opinion very important yet counterintuitive result that larger models can actually be simpler than smaller models. On its face, this sounds somewhat crazy—how can a model with more parameters be simpler? But in fact I think this is just a very straightforward consequence of double descent: in the double descent paradigm, larger models with zero training error generalize better than smaller models with zero training error because they do better on SGD's inductive biases. And if you buy that SGD's inductive biases are approximately simplicity, that means that larger models with zero training error are simpler than smaller models with zero training error.

Obviously, larger models do have more parameters than smaller ones, so if that's your measure of simplicity, larger models will always be more complicated, but for other measures of simplicity that's not necessarily the case. For example, it could hypothetically be the case that larger models have lower Kolmogorov complexity. Though I don't actually think that's true in the case of K-complexity, I think that's only for the boring reason that model weights have a lot of noise. If you had a way of somehow only counting the “essential complexity,” I suspect larger models would actually have lower K-complexity.

Really, what I I'm trying to do here is dispel what I see as the myth that as ML models get more powerful simplicity will stop mattering for them. In a Bayesian setting, it is a fact that the impact of your prior on your posterior (for those regions where your prior is non-zero[1]) becomes negligible as you update on more and more data. I have sometimes heard it claimed that as a consequence of this result, as we move to doing machine learning with ever larger datasets and ever bigger models, the impact of our training processes' inductive biases will become negligible. However, I think that's quite wrong, and I think double descent does a good job of showing why, because all of the performance gains you get past the interpolation threshold are coming from your implicit prior.[2] Thus, if you suspect modern ML to mostly be in that regime, what will matter in terms of which techniques beat out other techniques is how good they are at compressing their data into the “actually simplest” model that fits it.

Furthermore, even just from the simple Bayesian perspective, I suspect you can still get double descent. For example, suppose your training process looks like the following: you have some hypothesis class that keeps getting larger as you train and at each time step you select the best a posteriori hypothesis. I think that this setup will naturally yield a double descent for noisy data: first you get a “likelihood descent” as you get hypotheses with greater and greater likelihood, but then you start overfitting to noise in your data as you get close to the interpolation threshold. Past the interpolation threshold, however, you get a second “prior descent” where you're selecting hypotheses with greater and greater prior probability rather than greater and greater likelihood. I think this is a good model for how modern machine learning works and what double descent is doing.

All of this is only for models with zero training error, however—before you reach zero training error larger models can certainly have more essential complexity than smaller ones. That being said, if you don't do very many steps of training then your inductive biases will also matter a lot because you haven't updated that much on your data yet. In the double descent framework, the only region where your inductive biases don't matter very much is right on the interpolation threshold—before the interpolation threshold or past it they should still be quite relevant.

Why does any of this matter from a safety perspective, though? Ever since I read Belkin et al. I've had double descent as part of my talk version of “Risks from Learned Optimization” because I think it addresses a pretty important part of the story for mesa-optimization. That is, mesa-optimizers are simple, compressed policies—but as ML moves to larger and larger models, why should that matter? The answer, I think, is that larger models can generalize better not just by fitting the data better, but also by being simpler.[3]

  1. Negating the impact of the prior not having support over some hypotheses requires realizability (see Embedded World-Models). ↩︎

  2. Note that double descent happens even without explicit regularization, so the prior we're talking about here is the implicit one imposed by the architecture you've chosen and the fact that you're training it via SGD. ↩︎

  3. Which is exactly what you should expect if you think Occam's razor is the right prior: if two hypotheses have the same likelihood but one generalizes better, according to Occam's razor it must be because it's simpler. ↩︎


Ω 23