
TL;DR: This is the second main post of Distilling Singular Learning Theory, which is introduced in DSLT0. I explain how Watanabe's free energy formula accounts for why neural networks have the capacity to generalise well, since different regions of the loss landscape have different accuracy-complexity tradeoffs. I also provide some simple intuitive examples that visually demonstrate why true parameters (i.e. optimally accurate parameters) are preferred according to the RLCT as $n \to \infty$, and why non-true parameters can still be preferred at finite $n$ if they have lower RLCTs, due to the accuracy-complexity tradeoff. (The RLCT is introduced and explained in DSLT1.)

It is an amazing fact that deep neural networks seem to have an inductive bias towards "simple" models, suggesting that they obey a kind of Occam's Razor:

Plurality should not be posited without necessity.

or in modern parlance,

If two models of the world are similarly accurate, the simpler explanation should be preferred.

This allows them to achieve exceptionally low generalisation error, despite predictions from classical statistics that they should overfit the data:

[Figure: Double descent observed by OpenAI. Neural networks seem to obey a kind of double descent where bigger is better, defying the predictions of classical statistics. (Source: OpenAI's Double Descent blogpost.)]

This fact has come to be known as the generalisation problem and has been discussed at length in Zhang et al. 2017 (and a 2021 supplement), and in Bengio et al., amongst countless others.

Remarkably, Singular Learning Theory can help explain why neural networks, which are singular models, have the capacity to generalise so well.

The degeneracy of the Fisher information matrix is actually a feature of singular models, not a bug. This is because different regions of parameter space can have different complexities as measured by the RLCT $\lambda$, unlike regular models where the complexity is fixed to the total number of parameters in the model $d$. This is the implicit content of Watanabe's profound free energy formula, called the Widely Applicable Bayesian Information Criterion (WBIC), which quantifies a precise asymptotic tradeoff between inaccuracy and complexity,

$$F_n \approx nL_n(w_0) + \lambda \log n,$$

where $w_0$ denotes an optimal parameter, giving a mathematically rigorous realisation of Occam's Razor, since $\lambda \le \frac{d}{2}$ in singular models.
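Note that for a regular model the RLCT is exactly $\lambda = \frac{d}{2}$, so Watanabe's formula contains the regular asymptotics as a special case,

$$nL_n(w_0) + \lambda \log n \;\overset{\lambda = d/2}{\longrightarrow}\; nL_n(\hat{w}) + \frac{d}{2}\log n,$$

which is precisely the BIC we recall in the next section. The interesting behaviour is therefore in the singular case, where $\lambda$ can be strictly smaller than $\frac{d}{2}$ and can vary from region to region.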

In this post we will explore Watanabe's free energy formula and provide an intuitive example of why the RLCT matters so much. If you are new to statistical learning theory, I would recommend jumping straight to the examples and their related animations to gain the intuition first, and then return to the theory afterwards.

The four key points to take away are:

  1. As $n \to \infty$, true parameters with the best accuracy will always be preferred.
  2. As $n \to \infty$, if two true parameters are equally accurate but have different RLCTs, the parameter with the lower RLCT is preferred.
  3. For finite but large $n$, non-true parameters can be preferred by the posterior because of an accuracy-complexity tradeoff as measured by the WBIC (see the numerical sketch below).
  4. Parameters with low inaccuracy and small RLCTs $\lambda$ have low generalisation error (in a Bayesian sense), since the Bayes generalisation error $G_n$ is the "derivative" of the free energy, so $$\mathbb{E}[G_n] = \mathbb{E}[F_{n+1}] - \mathbb{E}[F_n] \approx L(w_0) + \frac{\lambda}{n}.$$
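To make points 1-3 concrete, here is a minimal numerical sketch in Python. The loss values and RLCTs below are hypothetical numbers chosen purely for illustration; the sketch just evaluates the free energy tradeoff $nL + \lambda \log n$ for two candidate regions of parameter space as $n$ grows.

```python
import numpy as np

def free_energy(n, L, rlct):
    """Asymptotic free energy of a region: n * L + rlct * log(n) (the WBIC tradeoff)."""
    return n * L + rlct * np.log(n)

# Two hypothetical regions of parameter space (numbers chosen purely for illustration):
#   "true"   -- optimal accuracy (relative loss L = 0) but a high RLCT,
#   "simple" -- slightly inaccurate (L = 0.005) but a much lower RLCT.
regions = {
    "true   (L = 0.000, rlct = 4.0)": (0.000, 4.0),
    "simple (L = 0.005, rlct = 0.5)": (0.005, 0.5),
}

for n in [10, 100, 1_000, 10_000]:
    F = {name: free_energy(n, L, lam) for name, (L, lam) in regions.items()}
    preferred = min(F, key=F.get)  # lower free energy = preferred by the posterior
    print(f"n = {n:>6}:  preferred region -> {preferred}")
```

For small $n$ the posterior prefers the inaccurate-but-simple region, but once $n$ is large enough the $nL$ term dominates the $\lambda \log n$ term and the true parameter wins, exactly as points 1 and 3 describe.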

Information Criteria Help Avoid Underfitting and Overfitting

In the last post, we derived the asymptotic free energy $F_n$ as $n \to \infty$ for regular models, called the Bayesian Information Criterion (BIC):

$$\text{BIC} = nL_n(\hat{w}) + \frac{d}{2}\log n,$$

where $n$ is the total number of datapoints in the dataset $D_n$, the optimal loss is $L_n(\hat{w})$ where $\hat{w}$ is a maximum likelihood estimate (i.e. $\hat{w} = \arg\min_{w \in W} L_n(w)$), and $d$ is the total dimension of parameter space $W$.

As a statistical practitioner, given some dataset $D_n$, your goal is to find a model that you hope will represent the truth, chosen from some candidate list. You only have access to the truth via your (training) dataset $D_n$, but you also want to ensure that your model generalises to data beyond the dataset. You can use the BIC to compare candidates across a set of model classes, since it captures a precise asymptotic tradeoff between inaccuracy $nL_n(\hat{w})$ and complexity $\frac{d}{2}\log n$. Under this paradigm, we should choose the model that achieves the lowest BIC, as it is the best option for avoiding both underfitting and overfitting the data. Let's consider a simple example in action:

Example 1: Suppose we have $n$ datapoints $D_n = \{(x_i, y_i)\}_{i=1}^n$ drawn from a quadratic with Gaussian noise, $y = f_0(x) + \varepsilon$ where $f_0$ is a fixed quadratic and $\varepsilon \sim \mathcal{N}(0, \sigma^2)$ [1], and where each input $x$ is drawn according to a uniform prior $q(x)$. After looking at our scatterplot of data $D_n$, we could try models across the following set of model classes:

| Name | Model | $d$ |
| --- | --- | --- |
| Linear | $f(x, w) = w_1 x + w_0$ | 2 |
| Quadratic | $f(x, w) = w_2 x^2 + w_1 x + w_0$ | 3 |
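To see how this comparison plays out numerically, here is a minimal sketch of Example 1 in Python. The particular quadratic, noise level $\sigma$, and sample size below are hypothetical stand-ins, since the text leaves them unspecified; the point is just that the BIC only needs the maximum likelihood loss and the parameter count of each model class.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical instance of the setup: y = x^2 + noise, with x uniform on [-1, 1].
# (The true quadratic, noise level, and sample size are illustrative assumptions.)
n, sigma = 100, 0.2
x = rng.uniform(-1.0, 1.0, size=n)
y = x**2 + rng.normal(0.0, sigma, size=n)

def bic(degree):
    """BIC = n * L_n(w_hat) + (d / 2) * log(n) for a polynomial model class."""
    d = degree + 1                     # dimension of the model class
    w_hat = np.polyfit(x, y, degree)   # least squares = maximum likelihood for Gaussian noise
    residuals = y - np.polyval(w_hat, x)
    # Average negative log-likelihood L_n(w_hat), with the noise scale sigma assumed known:
    L_n = np.mean(0.5 * (residuals / sigma) ** 2 + 0.5 * np.log(2 * np.pi * sigma**2))
    return n * L_n + (d / 2) * np.log(n)

for name, degree in [("Linear", 1), ("Quadratic", 2)]:
    print(f"{name:>9} (d = {degree + 1}): BIC = {bic(degree):.1f}")
```

The linear class pays a large accuracy penalty $nL_n(\hat{w})$ that its slightly smaller complexity term cannot compensate for, so the quadratic class achieves the lower BIC: it avoids underfitting without jumping to a needlessly complex model.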