Thanks for the very comprehensive review of generalisation!
As a historical note / broader context, the worry about model class over-expressivity has been around since the early days of machine learning. There was a mistrust of large black-box models like random forests and SVMs, and of their unusually low test or even cross-validation loss, on the grounds that such models are able to fit noise. Breiman's frank commentary back in 2001, "Statistical Modeling: The Two Cultures", touches on this among other worries about ML models. The success of ML has turned this worry into the generalisation puzzle, with Zhang et al. (2017) serving as a call to arms once DL greatly exacerbated the scale and urgency of the problem.
I agree with the post that approximation, generalisation, and optimisation are very much entangled in this puzzle. The question should be why a model + training algorithm combo is likely to select generalising solutions from among the many other solutions with a good training fit, and not just whether the best-fit model generalises, whether a particular optimisation algorithm can find it, or how simple the model class is.
SLT gives the answer that models with very singular solutions + Bayesian training will generalise very well. This still leaves open the questions of why DL models on "natural" datasets have very singular solutions, and how much the Bayesian results tell us about DL models with SGD-type training.
Naive optimism: hopefully progress towards a strong resolution of the generalisation puzzle gives us enough understanding to gain control over what kinds of solutions are learned. And one day we can ask for more than generalisation, like "generalise and be safe".
> As a historical note / broader context, the worry about model class over-expressivity has been around since the early days of machine learning. There was a mistrust of large black-box models like random forests and SVMs, and of their unusually low test or even cross-validation loss, on the grounds that such models are able to fit noise. Breiman's frank commentary back in 2001, "Statistical Modeling: The Two Cultures", touches on this among other worries about ML models. The success of ML has turned this worry into the generalisation puzzle, with Zhang et al. (2017) serving as a call to arms once DL greatly exacerbated the scale and urgency of the problem.
Yeah, it surprises me that Zhang et al. (2017) has had the impact it did when, as you point out, the ideas have been around for so long. Deep learning theorists like Telgarsky point to it as a clear turning point.
> Naive optimism: hopefully progress towards a strong resolution of the generalisation puzzle gives us enough understanding to gain control over what kinds of solutions are learned. And one day we can ask for more than generalisation, like "generalise and be safe".
This I can stand behind.
On why neural networks generalize, part of the answer is that they don't generalize nearly as much as people think they do, and there are some fairly important limitations to their generalizability:
Faith and Fate is the paper I'd read, but I think there are other results, like Neural Networks and the Chomsky Hierarchy or Transformers Can't Learn to Solve Problems Recursively. The point is that neural networks are quite a bit overhyped in their ability to generalize from certain data, so some of the answer is that they don't generalize as much as people think.
It's worth noting that Jesse is mostly following the traditional "approximation, generalization, optimization" error decomposition from learning theory here - where "generalization" specifically refers to finite-sample generalization (gap between train/test loss), rather than something like OOD generalization. So e.g. a failure of transformers to solve recursive problems would be a failure of approximation, rather than a failure of generalization. Unless I misunderstood you?
Ok, I understand now. You haven't misunderstood me. I'm not sure what to do with my comment above now.
I'll cross post it soon.
I actually did it: https://www.lesswrong.com/posts/gq9GR6duzcuxyxZtD/?commentId=feuGTuRRAi6r6DRRK
Repeating a question I asked Jesse earlier, since others might be interested in the answer: how come we tend to hear more about PAC bounds than MAC bounds?
I think this mostly has to do with the fact that learning theory grew up in/next to computer science where the focus is usually worst-case performance (esp. in algorithmic complexity theory). This naturally led to the mindset of uniform bounds. That and there's a bit of historical contingency: people started doing it this way, and early approaches have a habit of sticking.
In 2017, Zhang et al. showed that deep neural networks can achieve perfect training loss on randomly labeled data.
This was a Big Deal.
It meant that existing generalization theory couldn't explain why deep neural networks generalize. That's because classical approaches to proving that a given model class (=neural network architecture) would generalize involved showing that it lacks the expressivity to fit noise. If a model class can fit noise arbitrarily well, the resulting bounds break.
So something needed to change.
Evidently, you can't prove tight generalization bounds for entire model classes, so theorists turned to studying generalization bounds for individual models within a model class. If you can empirically show that a model's performance doesn't change substantially when you perturb it (by adding noise to the inputs, weights, training samples, etc.), then you can theoretically prove that that model will generalize to new data.
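As a toy illustration of the kind of measurement involved (a minimal sketch, not any of the actual bound constructions from the literature; the model, data, and noise scale are all made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: a linear classifier on synthetic 2D data (purely illustrative).
X = rng.normal(size=(200, 2))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(float)

def loss(w):
    """Average logistic loss of the linear model parametrized by w."""
    margins = (2 * y - 1) * (X @ w)   # signed labels times the model's logits
    return np.mean(np.log1p(np.exp(-margins)))

w = np.array([1.0, 0.5])  # pretend these are the learned weights

# Measure how much the loss changes under random Gaussian weight perturbations.
# Flat (perturbation-robust) solutions are the ones these bounds reward.
sigma = 0.1  # perturbation scale (an arbitrary illustrative choice)
perturbed = [loss(w + sigma * rng.normal(size=w.shape)) for _ in range(100)]

print(f"loss at w:            {loss(w):.4f}")
print(f"mean perturbed loss:  {np.mean(perturbed):.4f}")
print(f"sharpness proxy:      {np.mean(perturbed) - loss(w):.4f}")
```

Roughly speaking, the smaller the gap between the perturbed and unperturbed loss, the "flatter" the solution, and the tighter the resulting bound tends to be.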
As a result, the bounds have gotten tighter, but they're still not exactly flattering.
What's really needed is a secret third thing. It's not about either model classes or individual models but about model subclasses. While the model class as a whole may be too complex to obtain tight generalization bounds, individual subclasses can achieve an optimal trade-off between accuracy and complexity. For singular Bayesian learning machines, this trade-off happens automatically.
This more or less answers why models are able to generalize but not how they do it.
Singular learning theory (SLT) provides one possible path towards understanding the how of generalization. This approach is grounded in the geometry of the loss landscape, which in turn is grounded in the symmetries of the model class and data. If this direction pans out, then learning theory is poised for a revolution analogous to the transition between thermodynamics and statistical physics.
The central aim of classical learning theory is to bound various kinds of error: in particular, the approximation error, generalization error, and optimization error.
In the previous post in this series, we looked at the approximation error, which measures the performance of a model class's hypothetically optimal model. If your model class isn't expressive enough to do a reasonable job of modeling the data, then it doesn't matter that your model generalizes to new data or that your learning process actually reaches that optimum in practice.
In this post, we'll look at bounding the generalization error, which measures how well a model parametrized by $w$ transfers from a finite training set to additional samples drawn from the same distribution. We won't cover the related question of out-of-distribution generalization, which asks how a model will perform on data from a different distribution.
Last time, we started to see that approximation and generalization cannot be separated in practice. Here, we'll see the same holds for generalization and optimization. SLT and many other strands of generalization theory point to a deep relation between the two, one that casts generalization in nearly dynamical terms, as changes to probability distributions.
In the next post in this sequence, we'll examine the optimization error, which measures how close learning processes get to global optima. Coming at the connection with generalization from the other direction, we'll view optimization in quasistatic terms as a process of selecting among qualitatively distinct generalization strategies.
This is mainly based on lecture notes by Telgarsky (2021), a recent monograph by Hellström et al. (2023), and Watanabe (2009, 2018).
This is a self-contained introduction to generalization theory, as developed in three different schools of learning theory: classical learning theory, deep learning theory, and singular learning theory. There's a lot to cover:
To recap, classical learning theory is primarily grounded in the paradigm of Empirical Risk Minimization (ERM). ERM is a story about two kinds of risk.
The empirical risk is what you might recognize as the "training loss",

$$\hat R_n(w) = \frac{1}{n} \sum_{i=1}^{n} \ell(f_w(x_i), y_i),$$

where the average is taken over the $n$ samples $(x_i, y_i)$ in your dataset $D_n$.[1] Learning theorists maintain a sensible distinction between the individual-sample "loss" function $\ell$ and the average-loss "risk" $\hat R_n$.
The population risk (or "true" risk) is the expectation of the loss over the true distribution $q(x, y)$ from which the dataset is drawn:

$$R(w) = \mathbb{E}_{(x, y) \sim q}\big[\ell(f_w(x), y)\big].$$
In its broadest form, generalization is the question of how a model transfers from a finite dataset to new data. In practice, learning theorists study the generalization error (or generalization gap) $G_n(w)$,

$$G_n(w) = R(w) - \hat R_n(w),$$

which addresses a narrower question: how far apart are the empirical risk and population risk?
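To make these definitions concrete, here's a minimal numerical sketch (the linear model and synthetic data are made up for illustration; the population risk is approximated by a large held-out sample):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample(n):
    """Draw n samples from the 'true' distribution q: y = x + Gaussian noise."""
    x = rng.normal(size=n)
    y = x + 0.5 * rng.normal(size=n)
    return x, y

def risk(w, x, y):
    """Average squared-error loss of the model f_w(x) = w * x."""
    return np.mean((w * x - y) ** 2)

# Empirical risk minimization on a small training set (least squares).
x_train, y_train = sample(20)
w_hat = np.sum(x_train * y_train) / np.sum(x_train ** 2)

empirical_risk = risk(w_hat, x_train, y_train)   # training loss
x_big, y_big = sample(1_000_000)                 # large sample standing in for q
population_risk = risk(w_hat, x_big, y_big)      # Monte Carlo estimate of R(w_hat)

print(f"empirical risk:      {empirical_risk:.4f}")
print(f"population risk:     {population_risk:.4f}")
print(f"generalization gap:  {population_risk - empirical_risk:.4f}")
```

With only 20 training samples, the fitted model's empirical risk typically understates its population risk; the difference is the generalization gap.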
The generalization error, $G_n(w)$, is a function of both the learned model $w$ and the dataset $D_n$, so by taking expectation values over data, weights, or both, we derive a family of related generalization metrics.
The relevant distributions include the true distribution $q(x, y)$ over samples (and with it the product distribution over datasets $D_n$), and the distribution over learned weights $w$ given a dataset, which is determined by the (possibly stochastic) learning algorithm.

From these distributions, we obtain, for example, the generalization error in expectation over datasets, in expectation over the weights selected by the learning algorithm, or in expectation over both.[2]
And so on.
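To give a flavor of what these derived metrics look like (a sketch in my own shorthand, not necessarily the post's exact list), taking expectations of $G_n$ over datasets, over the weights returned by the learning algorithm, or over both yields

$$\mathbb{E}_{D_n \sim q^n}\big[G_n\big], \qquad \mathbb{E}_{w \sim p(w \mid D_n)}\big[G_n\big], \qquad \mathbb{E}_{D_n \sim q^n}\,\mathbb{E}_{w \sim p(w \mid D_n)}\big[G_n\big],$$

respectively: the expected gap over draws of the training set, over the stochasticity of the learning algorithm, and over both.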
In a Bayesian setting, we replace the deterministic prediction $f_w(x)$ with a distribution $p(y \mid x, w)$, which induces a likelihood $p(D_n \mid w)$. Under certain assumptions, empirical risk minimization maps onto a corresponding Bayesian problem, in which the empirical risk is the negative log likelihood (up to a constant).[3]
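To spell out the mapping in the simplest case (a sketch, assuming the loss is the log loss, $\ell(f_w(x_i), y_i) = -\log p(y_i \mid x_i, w)$; other losses correspond to other likelihood families, and the input density contributes only a $w$-independent constant):

$$-\log p(D_n \mid w) = -\sum_{i=1}^{n} \log p(y_i \mid x_i, w) + \mathrm{const} = n\, \hat R_n(w) + \mathrm{const},$$

so minimizing the empirical risk is exactly maximizing the likelihood.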
The Bayesian learning algorithm is the posterior,

$$p(w \mid D_n) = \frac{p(D_n \mid w)\, \varphi(w)}{p(D_n)},$$

where $\varphi(w)$ is the prior and $p(D_n) = \int p(D_n \mid w)\, \varphi(w)\, dw$ is the marginal likelihood.
But we're not just interested in updating our beliefs over weights; we also want to use our new beliefs to make predictions. Theoretically, the optimal way to do this is to average novel predictions over the ensemble of weights, weighing each machine according to its posterior density. This yields the predictive distribution,

$$p(y \mid x, D_n) = \int p(y \mid x, w)\, p(w \mid D_n)\, dw.$$
In practice, evaluating these kinds of integrals is intractable. A more tractable alternative is Gibbs estimation, where we draw a particular choice of weights $w \sim p(w \mid D_n)$ and make predictions using the corresponding model $p(y \mid x, w)$. This procedure is closer to the kind of paradigm we're in with neural networks: the model we train is a single draw from the distribution of final weights over different initializations and learning schedules.
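Here's a toy numerical contrast between the two (a minimal sketch; the conjugate Gaussian model is an illustrative assumption chosen only so that the posterior is easy to sample from):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

# Toy conjugate model (an illustrative assumption, not the post's setup):
# data y_i ~ N(w, 1) with prior w ~ N(0, 1), so the posterior over w is Gaussian.
data = rng.normal(loc=1.0, scale=1.0, size=10)
n = len(data)
post_var = 1.0 / (n + 1.0)
post_mean = post_var * data.sum()

y_new = 0.5  # a new point whose density we want to predict

# Predictive distribution: average p(y_new | w) over many posterior draws of w.
w_samples = rng.normal(post_mean, np.sqrt(post_var), size=10_000)
bayes_density = norm.pdf(y_new, loc=w_samples, scale=1.0).mean()

# Gibbs estimation: commit to a single posterior draw and predict with that model.
w_gibbs = rng.normal(post_mean, np.sqrt(post_var))
gibbs_density = norm.pdf(y_new, loc=w_gibbs, scale=1.0)

print(f"posterior-averaged (predictive) density:  {bayes_density:.4f}")
print(f"single-draw (Gibbs) density:              {gibbs_density:.4f}")
```

Averaging over many posterior draws smooths out the idiosyncrasies of any single set of weights, which is part of why the two notions of generalization error below come apart.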
Predicting according to either the predictive distribution or Gibbs estimation respectively yields two different kinds of generalization error:
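These are the Bayes generalization error, which measures how far the posterior-averaged predictive distribution is from the true distribution, and the Gibbs generalization error, which measures the average error of a single model drawn from the posterior. Roughly (a sketch in notation of my own choosing, following Watanabe's conventions):

$$G_n^{\mathrm{Bayes}} = \mathbb{E}_{x}\Big[D_{\mathrm{KL}}\big(q(y \mid x)\,\big\|\,p(y \mid x, D_n)\big)\Big], \qquad G_n^{\mathrm{Gibbs}} = \mathbb{E}_{w \sim p(w \mid D_n)}\,\mathbb{E}_{x}\Big[D_{\mathrm{KL}}\big(q(y \mid x)\,\big\|\,p(y \mid x, w)\big)\Big].$$

By Jensen's inequality, the Bayes generalization error is never larger than the Gibbs generalization error.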