TLDR; This is the second main post of Distilling Singular Learning Theory, which is introduced in DSLT0. I synthesise how Watanabe's free energy formula explains why neural networks have the capacity to generalise well, since different regions of the loss landscape have different accuracy-complexity tradeoffs. I also provide some simple intuitive examples that visually demonstrate why true parameters (i.e. optimally accurate parameters) are preferred according to the RLCT as n→∞, and why non-true parameters can still be preferred at finite n if they have lower RLCTs, due to the accuracy-complexity tradeoff. (The RLCT is introduced and explained in DSLT1.)
It is an amazing fact that deep neural networks seem to have an inductive bias towards "simple" models, suggesting that they obey a kind of Occam's Razor:
Plurality should not be posited without necessity.
or in modern parlance,
If two models of the world are similarly accurate, the simpler explanation should be preferred.
This allows them to achieve exceptionally low generalisation error, despite classical statistical predictions that they should overfit the data.
This fact has come to be known as the generalisation problem and has been discussed at length in Zhang et al. 2017 (and a 2021 supplement), and in Bengio et al., amongst countless others.
Remarkably, Singular Learning Theory can help explain why neural networks, which are singular models, have the capacity to generalise so well.
The degeneracy of the Fisher information matrix is actually a feature of singular models, not a bug. This is because different regions of parameter space can have different complexities as measured by the RLCT λ, unlike regular models where the complexity is fixed to the total number of parameters in the model d. This is the implicit content of Watanabe's profound free energy formula, called the Widely Applicable Bayesian Information Criterion (WBIC), which quantifies a precise asymptotic tradeoff between inaccuracy and complexity,
$$\mathrm{WBIC} = \underbrace{nL_n(w^{(0)})}_{\text{inaccuracy}} + \underbrace{\lambda}_{\text{complexity}} \log n,$$
giving a mathematically rigorous realisation of Occam's Razor, since λ ≤ d/2 in singular models.
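To see how this tradeoff plays out, here is a minimal Python sketch (not from the original post) comparing the free energy formula for two hypothetical regions of parameter space: one more accurate but with a higher RLCT, and one slightly less accurate but with a lower RLCT. The loss and RLCT values are invented purely for illustration.

```python
import numpy as np

# Hypothetical regions of parameter space (illustrative numbers only):
# region A is more accurate but more complex, region B is less accurate but simpler.
regions = {
    "A (accurate, complex)": {"loss": 0.50, "rlct": 3.0},
    "B (less accurate, simple)": {"loss": 0.52, "rlct": 1.0},
}

def free_energy(n, loss, rlct):
    """Asymptotic free energy formula: n * L_n(w) + lambda * log(n)."""
    return n * loss + rlct * np.log(n)

for n in [10, 100, 1_000, 10_000, 100_000]:
    values = {name: free_energy(n, r["loss"], r["rlct"]) for name, r in regions.items()}
    preferred = min(values, key=values.get)
    print(f"n = {n:>6}: preferred region -> {preferred}")

# For small n the simpler region B wins (its smaller lambda * log(n) term dominates),
# but as n grows the n * L_n term takes over and the more accurate region A wins.
```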
In this post we will explore Watanabe's free energy formula and provide an intuitive example of why the RLCT matters so much. If you are new to statistical learning theory, I would recommend jumping straight to the examples and their related animations to gain the intuition first, and then return to the theory afterwards.
The four key points to take away are:
- As n→∞, true parameters with the best accuracy will always be preferred.
- As n→∞, if two true parameters are equally accurate but have different RLCTs, the parameter with the lower RLCT is preferred.
- For finite but large n, non-true parameters can be preferred by the posterior because of an accuracy-complexity tradeoff as measured by the WBIC.
- Parameters with low inaccuracy and a small RLCT λ have low generalisation error (in a Bayesian sense), since the Bayes generalisation error Gn is the "derivative" of the free energy, so
$$G_n = L_n(w^{(0)}) + \frac{\lambda}{n} + o\left(\frac{1}{n}\right).$$
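To see why the "derivative" framing gives this formula, here is a heuristic (and non-rigorous) calculation, assuming the free energy follows the asymptotic formula above and treating Ln(w(0)) as roughly constant when one more datapoint is added: the Bayes generalisation error is the expected increase in free energy from observing one extra datapoint, so

$$\begin{aligned}
G_n &\approx F_{n+1} - F_n \\
&\approx \left[(n+1)L_n(w^{(0)}) + \lambda\log(n+1)\right] - \left[nL_n(w^{(0)}) + \lambda\log n\right] \\
&= L_n(w^{(0)}) + \lambda\log\left(1 + \tfrac{1}{n}\right) = L_n(w^{(0)}) + \frac{\lambda}{n} + o\!\left(\frac{1}{n}\right).
\end{aligned}$$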
Information Criteria Help Avoid Underfitting and Overfitting
In the last post, we derived the asymptotic free energy Fn as n→∞ for regular models, called the Bayesian Information Criterion (BIC):
$$\mathrm{BIC} = nL_n(w^{(0)}) + \frac{d}{2}\log n,$$
where n is the total number of datapoints in the dataset Dn, the optimal loss is Ln(w(0)) where w(0)∈W is a maximum likelihood estimate (i.e. w(0) = argmin over w∈W of Ln(w)), and d is the total dimension of the parameter space W⊆Rd.
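As a concrete illustration (not from the original post), here is a minimal Python sketch of how one might evaluate this BIC formula for polynomial regression models fit by least squares, assuming Gaussian noise of known scale. The polynomial model classes, the noise level, and the data below are illustrative choices that anticipate the example further down.

```python
import numpy as np

def bic(x, y, degree, sigma=0.15):
    """BIC = n * L_n(w^(0)) + (d/2) * log(n), where L_n is the average negative
    log-likelihood at the least-squares fit and d is the number of parameters."""
    n = len(x)
    d = degree + 1                          # number of polynomial coefficients
    coeffs = np.polyfit(x, y, degree)       # maximum likelihood fit under Gaussian noise
    residuals = y - np.polyval(coeffs, x)
    # average negative log-likelihood under N(f(x), sigma^2) noise
    L_n = np.mean(residuals**2 / (2 * sigma**2) + np.log(sigma * np.sqrt(2 * np.pi)))
    return n * L_n + (d / 2) * np.log(n)

# Illustrative data resembling the example below: y = x^2 + Gaussian noise.
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=61)
y = x**2 + rng.normal(0, 0.15, size=61)

for degree in [1, 2, 6]:
    print(f"degree {degree}: BIC = {bic(x, y, degree):.2f}")
# The quadratic (degree 2) typically attains the lowest BIC: the linear model underfits
# (high inaccuracy), while the degree-6 model pays a complexity penalty for little gain.
```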
As a statistical practitioner, given some dataset Dn, your goal is to find a model that you hope will represent the truth from some candidate list. You only have access to the truth via your (training) dataset Dn, but you also want to ensure that your model generalises to data beyond the dataset. You can use the BIC to compare candidates across whatever set of model classes you can think to try, since it captures a precise asymptotic tradeoff between inaccuracy Ln(w(0)) and complexity d/2. Under this paradigm, we should choose the model that achieves the lowest BIC, as it is the best option for avoiding both underfitting and overfitting the data. Let's consider a simple example in action:
Example 1: Suppose we have n=61 datapoints drawn from a quadratic with Gaussian noise, y = x² + ε where ε ∼ N(0, 0.15²)[1], and x is drawn according to a uniform prior q(x) = ½·1(x ∈ [−1,1]). After looking at our scatterplot of data Dn, we could try models across the following set of model classes: