Inductive biases stick around

evhub

by evhub

18th Dec 2019

AI Alignment Forum

This post is a follow-up to Understanding “Deep Double Descent”.

I was talking to Rohin at NeurIPS about my post on double descent, and he asked the very reasonable question of why exactly I think double descent is so important. I realized that I hadn't fully explained that in my previous post, so the goal of this post is to further address the question of why you should care about double descent from an AI safety standpoint. This post assumes you've read my Understanding “Deep Double Descent” post, so you should read that first before reading this if you haven't already.

Specifically, I think double descent demonstrates a result that is, in my opinion, very important yet counterintuitive: larger models can actually be simpler than smaller models. On its face, this sounds somewhat crazy: how can a model with more parameters be simpler? But in fact I think this is just a very straightforward consequence of double descent. In the double descent paradigm, larger models with zero training error generalize better than smaller models with zero training error because they do better according to SGD's inductive biases. And if you buy that SGD's inductive biases are approximately simplicity, that means that larger models with zero training error are simpler than smaller models with zero training error.
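One setting where a version of this claim can be checked directly is linear regression on nested features. The sketch below is an illustration, not something from the post. It assumes the L2 norm of the weights as the simplicity measure, and it uses the minimum-norm interpolating solution (which gradient descent from zero initialization converges to in linear least squares) as a stand-in for SGD's inductive bias. The data, the widths, and the function name are all hypothetical.

```python
import numpy as np

def min_norm_interpolators(n_train=20, widths=(30, 60, 120, 240), seed=0):
    """Fit minimum-norm interpolating linear models of increasing width.

    Returns a list of (width, weight_norm, max_train_residual) tuples.
    """
    rng = np.random.default_rng(seed)
    # Hypothetical data: random features and pure-noise targets.
    features = rng.standard_normal((n_train, max(widths)))
    targets = rng.standard_normal(n_train)

    results = []
    for width in widths:
        # Nested models: a wider model sees every feature a narrower one sees.
        phi = features[:, :width]
        # The pseudoinverse gives the minimum-L2-norm solution of phi @ w = targets.
        weights = np.linalg.pinv(phi) @ targets
        residual = np.max(np.abs(phi @ weights - targets))
        results.append((width, float(np.linalg.norm(weights)), float(residual)))
    return results

if __name__ == "__main__":
    for width, norm, residual in min_norm_interpolators():
        print(f"width={width:4d}  ||w||={norm:.4f}  max train residual={residual:.2e}")
```

Every interpolating solution available to a narrower model is also available to a wider one (with zeros on the extra features), so under this particular measure the zero-training-error solution can only stay the same or get simpler as the parameter count grows.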

Obviously, larger models do have more parameters than smaller ones, so if that's your measure of simplicity, larger models will always be more complicated, but for other measures of simplicity that's not necessarily the case. For example, it could hypothetically be the case that larger models have lower Kolmogorov complexity. Though I don't actually think that's true in the case of K-complexity, I think that's only for the boring reason that model weights have a lot of noise. If you had a way of somehow only counting the “essential complexity,” I suspect larger models would actually have lower K-complexity.

Really, what I'm trying to do here is dispel what I see as the myth that, as ML models get more powerful, simplicity will stop mattering for them. In a Bayesian setting, it is a fact that the impact of your prior on your posterior (for those regions where your prior is non-zero^[1]) becomes negligible as you update on more and more data. I have sometimes heard it claimed that, as a consequence of this result, the impact of our training processes' inductive biases will become negligible as we move to doing machine learning with ever larger datasets and ever bigger models. However, I think that's quite wrong, and I think double descent does a good job of showing why, because all of the performance gains you get past the interpolation threshold are coming from your implicit prior.^[2] Thus, if you suspect modern ML to mostly be in that regime, what will matter in terms of which techniques beat out other techniques is how good they are at compressing their data into the “actually simplest” model that fits it.
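The Bayesian fact this paragraph starts from can be seen in a small worked example. The sketch below is an illustration under assumed choices (a coin-flip model with two opposed Beta priors, and a hypothetical `true_rate`). It covers only the fixed-hypothesis-class case, not the overparameterized regime the paragraph goes on to discuss.

```python
def posterior_mean(prior_heads, prior_tails, heads, flips):
    """Posterior mean of a coin's bias under a Beta(prior_heads, prior_tails) prior."""
    return (prior_heads + heads) / (prior_heads + prior_tails + flips)

def prior_gap(flips, true_rate=0.7):
    """Disagreement between two opposed priors after seeing the same data."""
    heads = round(true_rate * flips)
    optimist = posterior_mean(9.0, 1.0, heads, flips)   # prior mean 0.9
    pessimist = posterior_mean(1.0, 9.0, heads, flips)  # prior mean 0.1
    return optimist - pessimist

for flips in (0, 10, 100, 1000, 10000):
    print(flips, round(prior_gap(flips), 4))
```

Here the gap works out to 8 / (10 + flips), so the two priors disagree by 0.8 before any data and by less than 0.001 after ten thousand flips. The model is held fixed while the data grows, which is exactly the condition that fails past the interpolation threshold.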

Furthermore, even just from the simple Bayesian perspective, I suspect you can still get double descent. For example, suppose your training process looks like the following: you have some hypothesis class that keeps getting larger as you train and at each time step you select the best a posteriori hypothesis. I think that this setup will naturally yield a double descent for noisy data: first you get a “likelihood descent” as you get hypotheses with greater and greater likelihood, but then you start overfitting to noise in your data as you get close to the interpolation threshold. Past the interpolation threshold, however, you get a second “prior descent” where you're selecting hypotheses with greater and greater prior probability rather than greater and greater likelihood. I think this is a good model for how modern machine learning works and what double descent is doing.
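The selection procedure described above can be sketched as a toy. Everything in it is made up for illustration: the hypothesis names, the log-prior and log-likelihood values, and the order in which hypotheses enter the class. It shows only which term drives the selection at each step and says nothing about test error.

```python
# Each hypothesis is (name, log_prior, log_likelihood); all numbers are made up.
HYPOTHESES = [
    ("h0", -1.0, -90.0),
    ("h1", -2.0, -70.0),
    ("h2", -3.0, -50.0),
    ("h3", -40.0, 0.0),   # first hypothesis that fits the noisy data exactly
    ("h4", -20.0, 0.0),   # also fits exactly, but has higher prior probability
    ("h5", -8.0, 0.0),
]

def map_trajectory(hypotheses):
    """MAP hypothesis at each step as the class grows by one hypothesis."""
    trajectory = []
    for step in range(1, len(hypotheses) + 1):
        available = hypotheses[:step]
        # Log posterior up to a constant: log prior + log likelihood.
        best = max(available, key=lambda h: h[1] + h[2])
        trajectory.append(best)
    return trajectory

for name, log_prior, log_lik in map_trajectory(HYPOTHESES):
    print(f"{name}: log_likelihood={log_lik:6.1f}  log_prior={log_prior:6.1f}")
```

With these numbers, the selected hypothesis first improves in likelihood until one fits the data exactly. After that the likelihood is pinned at its maximum, and every further improvement comes from the prior term.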

All of this is only for models with zero training error, however—before you reach zero training error larger models can certainly have more essential complexity than smaller ones. That being said, if you don't do very many steps of training then your inductive biases will also matter a lot because you haven't updated that much on your data yet. In the double descent framework, the only region where your inductive biases don't matter very much is right on the interpolation threshold—before the interpolation threshold or past it they should still be quite relevant.

Why does any of this matter from a safety perspective, though? Ever since I read Belkin et al. I've had double descent as part of my talk version of “Risks from Learned Optimization” because I think it addresses a pretty important part of the story for mesa-optimization. That is, mesa-optimizers are simple, compressed policies—but as ML moves to larger and larger models, why should that matter? The answer, I think, is that larger models can generalize better not just by fitting the data better, but also by being simpler.^[3]

[1] Negating the impact of the prior not having support over some hypotheses requires realizability (see Embedded World-Models).
[2] Note that double descent happens even without explicit regularization, so the prior we're talking about here is the implicit one imposed by the architecture you've chosen and the fact that you're training it via SGD.
[3] Which is exactly what you should expect if you think Occam's razor is the right prior: if two hypotheses have the same likelihood but one generalizes better, according to Occam's razor it must be because it's simpler.

samsamoa (6y)

One caveat worth noting about double descent – it only appears if you train far longer than necessary, i.e. "train forever".

If you regularize with early stopping (stop when the performance on some validation set stops improving), the effect is not present. Since we use early stopping in all realistic settings, performance always improves monotonically with more data / bigger models.

To rephrase, analyzing the weird point where models reach zero training loss will produce confusing results. The early stopping point exhibits no such weird non-monotonic behavior.
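The stopping rule described in this comment can be sketched as follows. This is a minimal illustration, assuming a hypothetical `patience` parameter and a made-up validation curve; real training loops differ in how they checkpoint and measure validation performance.

```python
def early_stopping_epoch(val_losses, patience=3):
    """Return the index of the checkpoint that early stopping would keep.

    Training halts once `patience` consecutive epochs fail to improve on the
    best validation loss seen so far.
    """
    best_epoch, best_loss, waited = 0, float("inf"), 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_epoch, best_loss, waited = epoch, loss, 0
        else:
            waited += 1
            if waited >= patience:
                break
    return best_epoch

# Hypothetical validation curve: improves, then starts getting worse.
curve = [1.0, 0.8, 0.7, 0.75, 0.9, 1.1]
print(early_stopping_epoch(curve, patience=3))
```

The model that gets evaluated is the one at the returned epoch, so training never runs far enough to reach zero training loss.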

samsamoa (6y)

Evan's response (copied from a direct message, before I was approved to post here):

It definitely makes sense to me that early stopping would remove the non-monotonicity. I think a broader point which is interesting re double descent, though, is what it says about why bigger models are better. That is, not only can bigger models fit larger datasets, according to the double descent story there's also a meaningful sense in which bigger models have better inductive biases.

The idea I'm objecting to is that there's a sharp change from one regime (larger family of models) to the other (better inductive bias). I'd say that both factors smoothly improve performance over the full range of model sizes. I don't fully understand this yet, and I think it would be interesting to understand how bigger models and better inductive bias (from SGD + early stopping) come together to produce this smooth improvement in performance.

Rohin Shah (6y)

On its face, this sounds somewhat crazy—how can a model with more parameters be simpler? But in fact I think this is just a very straightforward consequence of double descent

One man's modus ponens is another man's modus tollens: while you seem to go from "double descent is real" to "larger models are simpler", I go from "larger models are more complex" to "something crazy is happening with double descent".

Possible crux: I think the empirical evidence justifies "double descent is a real effect that occurs in some situations", but not "double descent is clearly real and happens in the vast majority of realistic settings".

John_Maxwell (6y)

I think we would benefit from tabooing the word "simple". It seems to me that when people use the word "simple" in the context of ML, they are usually referring to either smoothness/Lipschitzness or minimum description length. But it's easy to see that these metrics don't always coincide. A random walk is smooth, but its minimum description length is long. A tall square wave is not smooth, but its description length is short. L2 regularization makes a model smoother without reducing its description length. Quantization reduces a model's description length without making it smoother. I'm actually not aware of any argument that smoothness and description length are or should be related--it seems like this might be an unexamined premise.
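The random-walk and square-wave contrast can be checked numerically. The sketch below makes two assumptions that are not in the comment: it uses the largest jump between neighbouring samples as a discrete proxy for Lipschitzness, and zlib-compressed size as a crude stand-in for minimum description length. The signal lengths and amplitudes are arbitrary.

```python
import random
import struct
import zlib

def max_step(signal):
    """Largest jump between neighbouring samples (a discrete Lipschitz proxy)."""
    return max(abs(b - a) for a, b in zip(signal, signal[1:]))

def compressed_size(signal):
    """zlib-compressed size in bytes, a crude stand-in for description length."""
    raw = struct.pack(f"<{len(signal)}h", *signal)  # 16-bit signed integers
    return len(zlib.compress(raw, 9))

def random_walk(length, seed=0):
    """Walk that moves up or down by exactly one unit per step."""
    rng = random.Random(seed)
    position, walk = 0, []
    for _ in range(length):
        position += rng.choice((-1, 1))
        walk.append(position)
    return walk

def square_wave(length, amplitude=1000, half_period=32):
    """Wave that alternates between +amplitude and -amplitude."""
    return [amplitude if (i // half_period) % 2 == 0 else -amplitude
            for i in range(length)]

walk, wave = random_walk(4096), square_wave(4096)
print("random walk: max step", max_step(walk), "compressed bytes", compressed_size(walk))
print("square wave: max step", max_step(wave), "compressed bytes", compressed_size(wave))
```

The walk never jumps by more than one unit yet compresses poorly, while the square wave jumps by twice its amplitude and compresses to a small fraction of the walk's size.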

Based on your paper, the argument for mesa-optimizers seems to be about description length. But if SGD's inductive biases target smoothness, it's not clear why we should expect SGD to discover mesa-optimizers. Perhaps you think smooth functions tend to be more compressible than functions which aren't smooth. I don't think that's enough. Imagine a Venn diagram where compressible functions are a big circle. Mesa-optimizers are a subset, and the compressible functions discovered by SGD are another subset. The question is whether these two subsets are overlapping. Pointing out that they're both compressible is not a strong argument for overlap: "all cats are mammals, and all dogs are mammals, so therefore if you see a cat, it's also likely to be a dog".

When I read your paper, I get a sense that optimizers outperform by allowing one to collapse a lot of redundant functionality into a single general method. It seems like maybe it's the act of compression that gets you an agent, not the property of being compressible. If our model is a smooth function which could in principle be compressed using a single general method, I'm not seeing why the reapplication of that general method in a very novel context is something we should expect to happen.

BTW I actually do think minimum description length is something we'll have to contend with long term. It's just too useful as an inductive bias. (Eliminating redundancies in your cognition seems like a basic thing an AGI will need to do to stay competitive.) But I'm unconvinced SGD possesses the minimum description length inductive bias. Especially if e.g. the flat minima story is the one that's true (as opposed to e.g. the lottery ticket story).

Also, I'm less confident that what I wrote above applies to RNNs.

FactorialCode (6y)

Does anyone know if double descent happens when you look at the posterior predictive rather than just the output of SGD? I wouldn't be too surprised if it does, but before we start talking about the Bayesian perspective, I'd like to see evidence that this isn't just an artifact of using optimization instead of integration.

Rohin Shah (6y)

Planned summary for the Alignment newsletter:

This update to Evan's <@double descent post@>(@Understanding “Deep Double Descent”@) explains why he thinks double descent is important. Specifically, Evan argues that it shows that inductive biases matter even for large, deep models. In particular, double descent shows that larger models are _simpler_ than smaller models, at least in the overparameterized setting where models are past the interpolation threshold where they can get approximately zero training error. This makes the case for <@mesa optimization@>(@Risks from Learned Optimization in Advanced Machine Learning Systems@) stronger, since mesa optimizers are simple, compressed policies.

Planned opinion:

As you might have gathered last week, I'm not sold on double descent as a clear, always-present phenomenon, though it certainly is a real effect that occurs in at least some situations. So I tend not to believe counterintuitive conclusions like "larger models are simpler" that are premised on double descent.

Regardless, I expect that powerful AI systems are going to be severely underparameterized, and so I don't think it really matters that past the interpolation threshold larger models are simpler. I don't think the case for mesa optimization should depend on this; humans are certainly "underparameterized", but should count as mesa optimizers.

Rohin Shah (6y)

even just from the simple Bayesian perspective, I suspect you can still get double descent.

Note that, in your example, if we do see double descent, it's because the best hypothesis was previously not in the class of hypotheses we were considering. Bayesian methods tend to do badly when the hypothesis class is misspecified.

As a counterpoint though, you could see double descent even if your hypothesis class always contains the truth, because the "best" hypothesis need not be the truth. It could be that posterior(truth) < posterior(memorization hypothesis) < posterior(almost-right hypothesis that predicts the noise "by luck").

Overall, my guess is that while you could engineer this if you tried, it wouldn't happen "naturally" in synthetic examples (though it might happen for datasets like MNIST, because maybe there's some property of those datasets that causes double descent to happen).

first you get a “likelihood descent” as you get hypotheses with greater and greater likelihood, but then you start overfitting to noise in your data as you get close to the interpolation threshold. Past the interpolation threshold, however, you get a second “prior descent” where you're selecting hypotheses with greater and greater prior probability rather than greater and greater likelihood.

That first stage is not just a "likelihood descent", it is a "likelihood + prior descent", since you are choosing hypotheses based on the posterior, not based on the likelihood. And with Bayesian methods it's quite possible that you never get to perfect likelihood, because the prior contains enough information to weed out the memorization hypotheses.

evhub (6y)

Note that, in your example, if we do see double descent, it's because the best hypothesis was previously not in the class of hypotheses we were considering. Bayesian methods tend to do badly when the hypothesis class is misspecified.

Yep, that's exactly my model.

As a counterpoint though, you could see double descent even if your hypothesis class always contains the truth, because the "best" hypothesis need not be the truth.

If "best" here means test error, then presumably the truth should generalize at least as well as any other hypothesis.

That first stage is not just a "likelihood descent", it is a "likelihood + prior descent", since you are choosing hypotheses based on the posterior, not based on the likelihood.

True for the Bayesian case, though unclear in the ML case. I think it's quite plausible that current ML underweights the implicit prior of SGD relative to maximizing the likelihood of the data (EDIT: which is another reason that better future ML might care more about inductive biases).

Rohin Shah (6y)

If "best" here means test error, then presumably the truth should generalize at least as well as any other hypothesis.

Sorry, "best" meant "the one that was chosen", i.e. highest posterior, which need not be the truth. I agree that the truth generalizes at least as well as any other hypothesis.

True for the Bayesian case, though unclear in the ML case

I agree it's unclear for the ML case, just because double descent happens and I have no idea why and "the prior doesn't start affecting things until after interpolation" does explain that even though it itself needs explaining.

avturchin (6y)

I think that this setup will naturally yield a double descent for noisy data: first you get a “likelihood descent” as you get hypotheses with greater and greater likelihood, but then you start overfitting to noise in your data as you get close to the interpolation threshold. Past the interpolation threshold, however, you get a second “prior descent” where you're selecting hypotheses with greater and greater prior probability rather than greater and greater likelihood. I think this is a good model for how modern machine learning works and what double descent is doing.

This reminded me of the Ptolemaic system and the heliocentric system.

Rohin Shah (6y)

I have sometimes heard it claimed that as a consequence of this result, as we move to doing machine learning with ever larger datasets and ever bigger models, the impact of our training processes' inductive biases will become negligible.

I'm confused by this. As I see it, the consequence of that result is that as you move to larger datasets, holding model size fixed, the impact of inductive biases will decrease. This is consistent with double descent, where as you have larger and larger datasets, you get into the underparameterized regime, which follows the normal story of more data = better.

As you increase the size of your models, holding dataset size fixed, the impact of inductive biases should increase, since you need more information to pick out the right model, but the data provides exactly the same amount of information. And that's consistent with what happens in the overparameterized regime.

Jason Gross (3y)

If you had a way of somehow only counting the “essential complexity,” I suspect larger models would actually have lower K-complexity.

This seems like a match for cross-entropy; cf. Nate's recent post K-complexity is silly; use cross-entropy instead.

Matthew Barnett (6y)

but as ML moves to larger and larger models, why should that matter? The answer, I think, is that the reason larger models generalize better isn't because they're more complex, but because they're actually simpler.

On its face, this statement seems implausible to me. Are you saying, for instance, that the model of a dog that a human has in their head should be simpler than the model of a dog that an image classifier has?

evhub (6y)

I just edited the last sentence to be clearer in terms of what I actually mean by it.

evhub (6y)

What double descent definitely says is that for a fixed dataset, larger models with zero training error are simpler than smaller models with zero training error. I think it also says somewhat more than that, which is that larger models do have a real tendency towards being better at finding simpler models in general. That being said, the dataset on which the concept of a dog in your head was trained is presumably way larger than that of any ML model, so even if your brain is really good at implementing Occam's razor and finding simple models, your model is still probably going to be more complicated.

Fiora Starlight (8mo)

I've had double descent as part of my talk version of “Risks from Learned Optimization” because I think it addresses a pretty important part of the story for mesa-optimization. That is, mesa-optimizers are simple, compressed policies—but as ML moves to larger and larger models, why should that matter? The answer, I think, is that larger models can generalize better not just by fitting the data better, but also by being simpler.

Optimization algorithms are simple in the sense that they loop a lot, repeating the same basic cycle to yield multiple candidate outputs. However, this isn't necessarily a kind of simplicity that makes them more likely to emerge inside of neural networks. In basic MLPs, each iteration of an optimization loop has to be implemented individually, because each weight only gets used once per forward pass. In that setting, looping optimization algorithms actually aren't even remotely compressible.

Transformers have somewhat better access to looping algorithms, because all of their weights get applied to every token in context, effectively constituting a loop of weight application. However, these loops are fundamentally non-iterative. At no point in the repeated weight application process do you feed in the output of a previous weight application. Instead, the repeated weight application has to be 100% parallelizable. So you still don't have an easy way to implement fundamentally sequential optimization algorithms, such as iteratively refining a neural network across multiple gradient steps. Each sequential operation has to be implemented by different weights inside the network.

RNNs are a more full-blown exception. Their repeated weight applications are inherently sequential, in the way transformers' repeated weight applications are not.

Anyway, in any case where you need to implement each iteration of your looping algorithm individually, that's a strike against it appearing in the course of backpropagation-driven weight updates. This is because backprop updates weights locally, without regard for how any of the other weights are being altered by a given training example. For each location in the network where a loop has to appear, the network needs to have independently converged on implementing an iteration of the loop in that location. You can't just set up one set of weights at one location in the network and get an algorithm that gets applied repeatedly.
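The contrast between shared and per-layer weights can be made concrete with a small sketch. It is an illustration under assumed choices (a tanh update, a hypothetical state dimension and step count) and models no real architecture in detail.

```python
import numpy as np

def recurrent_refine(weight, state, steps):
    """Apply the same weight matrix `steps` times (RNN-style weight sharing)."""
    for _ in range(steps):
        state = np.tanh(weight @ state)
    return state

def feedforward_refine(layer_weights, state):
    """Apply one separately parameterized layer per iteration (MLP-style)."""
    for weight in layer_weights:
        state = np.tanh(weight @ state)
    return state

rng = np.random.default_rng(0)
dim, steps = 8, 5
shared = rng.standard_normal((dim, dim))
state = rng.standard_normal(dim)

# For the MLP to run the same loop, every layer must independently hold a copy.
layers = [shared.copy() for _ in range(steps)]

same_output = np.allclose(recurrent_refine(shared, state, steps),
                          feedforward_refine(layers, state))
print(same_output, shared.size, sum(w.size for w in layers))
```

Both versions compute the same function, but the feedforward version spends `steps` times as many parameters on it, and training has to arrive at each of those copies separately.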
