There are some very nice resources to understand the intuition of Singular Learning Theory. However, I am quite unsatisfied with the current resources online explaining or approaching the subject, as I find them quite concise and brief - skipping many concepts that actually serve to strengthen the intuition to do research in this field, thus being confusing to me. While they are very nice to understand the subject overall, it is equally important for a resource to be there which aims to explain the field in detail. This is an attempt to change that, and I have tried to keep this sequence as comprehensive as possible. The material is directly adapted from the Watanabe Texts and Suzuki's WAIC and WBIC with python book, and solutions to some exercises as well as examples are given. I am giving out these explanations as I understand this subject, so all feedback is appreciated. We start with and do a good deal of the work with classical Bayesian framework first.
Guide: Please refer to this notebook for examples with code, some exercises and their solutions as well.
Introduction To Bayesian Statistics
We start with Bayesian Statistics. Watanabe’s theory is fundamentally based on generalizing classical results in Bayesian Statistics, so it is important to get a strong grip and understand this classical theory well before moving on. It also gives us the complete understanding of the framework we are working in, and is the first essential thing to master.
Connection with Machine Learning and Setup
Machine Learning Models are primarily consisting of two frameworks (or a combination of them): Frequentist and Bayesian.
The setup is that we have a true data generating distribution and consider a set of arbitrary samples . We take a statistical model (which is a parametric family of probability distributions) which aims to estimate the true distribution.
The likelihood function of our statistical model is defined as
The frequentist approach is to find the optimal which maximizes this likelihood function.
The KL divergence from probability distribution to is defined as
This is the main measure that we will use to associate similarity between probability distributions (even though it is not really a metric, it is clear that it is not even symmetric).
It can be easily seen that finding the optimal (called the maximum likelihood estimator) is equivalent to minimizing the KL divergence from the empirical true distribution to our statistical model, which is a function of An approximation to the local optimal parameter is often approached via (stochastic) gradient descent. This is also the case in neural networks, which are essentially function approximators. We use SGD to approximate to the local optimal parameter vector.
We will not delve into the frequentist approach more here (you may refer to Goodfellow et al). We will move on to the Bayesian approach here. Thus, when we refer to neural networks here, an important distinction is that now this is not the standard neural networks where SGD is used. Still, we gain many insights from this approach that also carry to the standard networks.
In the Bayesian approach, instead of considering just the optimal parameter, we consider a probability distribution over the space of parameters itself. Initially, this is called the prior function, and as we observe the data from the true distribution, we update this prior function to successively obtain a posterior function, which is an estimate over the entire parameter space to what generates the true distribution function.
Specifically, we consider an appropriate prior function and a statistical model p(x∣w). These are chosen by us, and this choice often determines what estimate our bayesian method will given us. We assume there is a true date generating distribution q(x), from which we draw N samples independently, . This sample induces a function which is the update of our prior function. This further induces , which is our estimate of the true distribution. This process goes on as we make more samples. As can be seen, this is more computationally intensive. However, this approach is superior in many cases, we will specifically see an example later on. We will now define everything mathematically. To summarize, here is the procedure:
Construct the universe and the mathematical laws between bayesian observables which hold for any arbitrary: true distribution, statistical model, and a prior.
Evaluate how appropriate the statistical model and the prior is using these laws.
Employ the most suitable pair.
Introduction to Bayesian Statistics
The posterior function is obtained through Bayes’ rule.
ut neither do we know the statistical model, nor do we know the prior. Thus a meaningful approach is to just start with something, evaluate how good it is, and then update it. The evaluation is done through the mathematical laws described above.
This gives rise to the estimated pdf of x, called the predictive distribution:
Expected: is appropriate for . We want to evaluate the tuple appropriateness without information about . We develop the machinery for that.
True Distribution
A realized value of in a trial is denoted . In practical applications, while we may not know q(x), we assume its existence.
Let us just revise the basics first as they will be important in the calculations that we make.
Let
Do observe that we are able to take the product here because of the independent sampling.
The average entropy of the true distribution is defined as:
The empirical entropy is defined as:
By definition, one can see that . I outline it nonetheless, so that you can get comfortable with the calculations.
Similarly, one can see that the variance of the empirical entropy is:
The average and empirical entropies of the true distribution which is a conditional distribution is defined similarly:
Model, Prior and Posterior
Let . Let be independent real random values subjected to . For an arbitrary pair , the posterior probability density is defined by
where is defined by
which is called the partition function/marginal likelihood/evidence.
Expected value over the posterior distribution is denoted . Do note that .
This expected value is a random variable as it depends on . (Better to say, it is the expected value over a conditional probability density and hence is a random variable).
The posterior gives rise to the predictive density function:
(estimate w from, estimate x from w, vary over all w).
If , the prior is called proper, because it is normalized so that . Even for an improper prior, posterior and predictive probability densities can be defined if is finite and well defined.
An Important Example - The Exponential Family
In many simple statistical models, the posterior converges to the normal distribution as . We see such a case in the example referred to below. However, even in some simple cases, and many others, this fails. This is the key problem resulting in the new theory.
At this point, I highly recommend referring to the example (given in the notebook link at the end).
We are now going to prove the formulae given in the example.
If the statistical model is of the form
where u is a real valued function (and the other two are vector valued), then this distribution is said to belong to the exponential family. Furthermore, if the distribution of the parameter θ∈Θ depends on some hyper-parameter ϕ, and can be written as
where z(ϕ) is the normalizing factor, then is said to be a conjugate prior distribution.
In the case when the distribution is of the form
we can take and . Thus it is of the exponential family.
Now, as we know, . Let us calculate the numerator.
Let us denote . Then we have the numerator is:
Let us get the for which we need to integrate the numerator with respect to θ. and here we use a nice hack. We know that the integral of the prior with respect to θ is 1, regardless of what is. So set and we see that the first term integrates out to 1, while the second term is a scalar number independent of .
Hence,
which is also from the exponential family!
Finally, the predictive probability is given by
One may notice that we are using a different formula for the predictive density, bypassing the integral definition. This comes directly from using the bayes rule in the given definition (check it yourself), and it is computationally more useful in some cases to use this instead.
For the example given at the start of the section, it is just a matter of inputting numbers into the formulae.
Estimation and Generalization
We need an objective measure which indicates the difference between true and estimated probability density to evaluate how accurate the predictive density is.
Let be a sample taken independently from and be a predictive density using a statistical model and a prior . We are going to make two definitions:
Notice how both of these quantities are random variables.
Thus
Thus, with equality iff . That is, the smalleris, the more precise our estimate is according to the KL divergence.
and are called generalization errors and training errors respectively.
An observation: As entropy does not depend on either a model / prior, smaller generalization error is equivalent to lower KL divergence.
Definition: Assume . Let be a set of random variables (leave one out).
Cross validation loss is defined by and is called the cross validation error.
We now prove an important theorem, which has three statements regarding the definitions that we made.
Theorem: Assume that is independent. Then the following holds.
Assume thatare finite values. Then
The cross validation loss satisfies the following:
For an arbitrary set of , , with equality iff is a const function of on .
Note: is just the integral with respect to the posterior distribution.
(1) Here is the proof of the first statement.
While it was not mentioned what the expectation is being taken over in the statement, the proof clarifies it. In any case, the answer to the clarification is the canonical and the most standard answer.
(2) We now prove the second statement:
Thus,
Call the integrand in the denominator . Then is:
We introduced cross validation as a measure to evaluate the accuracy of our estimation. However, there are two issues with cross validation:
1) Although the averages of are equal the variances need not be equal. However, we do have this relation:
2) In the second statement, if the average by the posterior is numerically approximated, then
is called the importance sampling cross validation loss. Importance sampling is the method of calculating an expectation more easily by writing it as a more manageable distribution.
is fundamentally different from the former. is expectation with respect to .
3) Let us prove the third statement now.
By Cauchy Schwarz. Equality holds iff as a function of , which implies that is a constant function of .
We introduce another measure now, and it is often better than the cross validation loss. There are also many cases where WAIC can be employed whereas cross validation cannot.
Definition: Letbe a set of random variables. The widely applicable information criterion (WAIC) is defined by
is called the WAIC error.
Here is a result: Ifis independent, WAIC is asymptotically equal to cross validation loss.
Remark: and WAIC can be employed to evaluate a stats model and a prior even if the prior is improper.
Just to summarize, we have introduced three instruments of measure:
Generalization Error:
Cross Validation Error:
WAIC Error:
In numerical experiments, we often care about minimizing errors instead of the loss itself due to the lower variance.
Marginal likelihood or Partition Function
If a prior satisfies , then the marginal likelihood (partition function) satisfies
We have slyly used Fubini Theorem above. Thus if the prior is proper.
can be thus be understood as an estimated probability density function of using a statistical model p(x∣w) and a prior . Thus it is sometimes written as . Thus this is an estimate of by looking at the KL divergence.
Definition: The Free energy, or the minus log marginal likelihood is defined by
We look at this quantity also as an estimate of .
Using the notation , firstly note that
Thus,
Then,
Smaller means KL divergence decreases (⇒KL≥0) hence is a better estimate for . Thus is the average KL divergence from to whereas is their sum.
We now prove yet another important theorem.
Theorem: Let. The average generalization loss is equal to the increase in free energy.
Thus .
Proof: For an arbitrary function, .
Now, .
Thus, .
Remark: As , the correspondence between free energy and marginal likelihood is one to one. However, in general, asymptotic order of the marginal likelihood as a random variable is not equal to its average, whereas for free energy that is the case. We have not proved this yet. Thus, for asymptotic statistics, free energy is a more convenient random variable.
We can illustrate the failure for the former: Let the marginal likelihood ratio for . Then . However, this is a result: in probability.
Meaning of Marginal Likelihood
Assume that is the prior distributions of a model p(x∣w) and a prior . Then .
By Bayes' Theorem, .
Thus if n is sufficiently large, maximizing is equivalent to maximizing .
Conditional independent cases
We will make the definitions for the conditionally independent case. They are quite similar.
Let us assume is dependent but is conditionally independent.
For an arbitrary function ,
which is a function of .
Everything else is defined similarly. But in this case, and .
For Example:
Regression problem for a fixed set are studied. Cross validation cannot be employed.
Consider the time series expressed by the relation
This can be understood as a regression problem
Thus is dependent, so CV cannot be employed. WAIC, however, can be. It is thus superior.
Exercises
I now refer you to the first set of exercises given in the notebook.
Further Steps
In the next post, we will introduce the concepts of realizability and regularity. We will discuss the main theorems of the regular statistical models. We will discuss MCMC methods that are a key tool for calculations, and we may do some other things as well.
Introduction
There are some very nice resources to understand the intuition of Singular Learning Theory. However, I am quite unsatisfied with the current resources online explaining or approaching the subject, as I find them quite concise and brief - skipping many concepts that actually serve to strengthen the intuition to do research in this field, thus being confusing to me. While they are very nice to understand the subject overall, it is equally important for a resource to be there which aims to explain the field in detail. This is an attempt to change that, and I have tried to keep this sequence as comprehensive as possible. The material is directly adapted from the Watanabe Texts and Suzuki's WAIC and WBIC with python book, and solutions to some exercises as well as examples are given. I am giving out these explanations as I understand this subject, so all feedback is appreciated. We start with and do a good deal of the work with classical Bayesian framework first.
Guide: Please refer to this notebook for examples with code, some exercises and their solutions as well.
Introduction To Bayesian Statistics
We start with Bayesian Statistics. Watanabe’s theory is fundamentally based on generalizing classical results in Bayesian Statistics, so it is important to get a strong grip and understand this classical theory well before moving on. It also gives us the complete understanding of the framework we are working in, and is the first essential thing to master.
Connection with Machine Learning and Setup
Machine Learning Models are primarily consisting of two frameworks (or a combination of them): Frequentist and Bayesian.
The setup is that we have a true data generating distribution and consider a set of arbitrary samples . We take a statistical model (which is a parametric family of probability distributions) which aims to estimate the true distribution.
The likelihood function of our statistical model is defined as
The frequentist approach is to find the optimal which maximizes this likelihood function.
The KL divergence from probability distribution to is defined as
This is the main measure that we will use to associate similarity between probability distributions (even though it is not really a metric, it is clear that it is not even symmetric).
It can be easily seen that finding the optimal (called the maximum likelihood estimator) is equivalent to minimizing the KL divergence from the empirical true distribution to our statistical model, which is a function of An approximation to the local optimal parameter is often approached via (stochastic) gradient descent. This is also the case in neural networks, which are essentially function approximators. We use SGD to approximate to the local optimal parameter vector.
We will not delve into the frequentist approach more here (you may refer to Goodfellow et al). We will move on to the Bayesian approach here. Thus, when we refer to neural networks here, an important distinction is that now this is not the standard neural networks where SGD is used. Still, we gain many insights from this approach that also carry to the standard networks.
In the Bayesian approach, instead of considering just the optimal parameter, we consider a probability distribution over the space of parameters itself. Initially, this is called the prior function, and as we observe the data from the true distribution, we update this prior function to successively obtain a posterior function, which is an estimate over the entire parameter space to what generates the true distribution function.
Specifically, we consider an appropriate prior function and a statistical model p(x∣w). These are chosen by us, and this choice often determines what estimate our bayesian method will given us. We assume there is a true date generating distribution q(x), from which we draw N samples independently, . This sample induces a function which is the update of our prior function. This further induces , which is our estimate of the true distribution. This process goes on as we make more samples. As can be seen, this is more computationally intensive. However, this approach is superior in many cases, we will specifically see an example later on. We will now define everything mathematically. To summarize, here is the procedure:
Introduction to Bayesian Statistics
The posterior function is obtained through Bayes’ rule.
ut neither do we know the statistical model, nor do we know the prior. Thus a meaningful approach is to just start with something, evaluate how good it is, and then update it. The evaluation is done through the mathematical laws described above.
This gives rise to the estimated pdf of x, called the predictive distribution:
Expected: is appropriate for . We want to evaluate the tuple appropriateness without information about . We develop the machinery for that.
True Distribution
A realized value of in a trial is denoted . In practical applications, while we may not know q(x), we assume its existence.
Let us just revise the basics first as they will be important in the calculations that we make.
Let
Do observe that we are able to take the product here because of the independent sampling.
The average entropy of the true distribution is defined as:
The empirical entropy is defined as:
By definition, one can see that . I outline it nonetheless, so that you can get comfortable with the calculations.
Similarly, one can see that the variance of the empirical entropy is:
The average and empirical entropies of the true distribution which is a conditional distribution is defined similarly:
Model, Prior and Posterior
Let . Let be independent real random values subjected to . For an arbitrary pair , the posterior probability density is defined by
where is defined by
which is called the partition function/marginal likelihood/evidence.
Expected value over the posterior distribution is denoted . Do note that .
This expected value is a random variable as it depends on . (Better to say, it is the expected value over a conditional probability density and hence is a random variable).
The posterior gives rise to the predictive density function:
(estimate w from , estimate x from w, vary over all w).
If , the prior is called proper, because it is normalized so that . Even for an improper prior, posterior and predictive probability densities can be defined if is finite and well defined.
An Important Example - The Exponential Family
In many simple statistical models, the posterior converges to the normal distribution as . We see such a case in the example referred to below. However, even in some simple cases, and many others, this fails. This is the key problem resulting in the new theory.
At this point, I highly recommend referring to the example (given in the notebook link at the end).
We are now going to prove the formulae given in the example.
If the statistical model is of the form
where u is a real valued function (and the other two are vector valued), then this distribution is said to belong to the exponential family. Furthermore, if the distribution of the parameter θ∈Θ depends on some hyper-parameter ϕ, and can be written as
where z(ϕ) is the normalizing factor, then is said to be a conjugate prior distribution.
In the case when the distribution is of the form
we can take and . Thus it is of the exponential family.
Now, as we know, . Let us calculate the numerator.
Let us denote . Then we have the numerator is:
Let us get the for which we need to integrate the numerator with respect to θ. and here we use a nice hack. We know that the integral of the prior with respect to θ is 1, regardless of what is. So set and we see that the first term integrates out to 1, while the second term is a scalar number independent of .
Hence,
which is also from the exponential family!
Finally, the predictive probability is given by
One may notice that we are using a different formula for the predictive density, bypassing the integral definition. This comes directly from using the bayes rule in the given definition (check it yourself), and it is computationally more useful in some cases to use this instead.
For the example given at the start of the section, it is just a matter of inputting numbers into the formulae.
Estimation and Generalization
We need an objective measure which indicates the difference between true and estimated probability density to evaluate how accurate the predictive density is.
Let be a sample taken independently from and be a predictive density using a statistical model and a prior . We are going to make two definitions:
Notice how both of these quantities are random variables.
Thus
Thus , with equality iff . That is, the smaller is, the more precise our estimate is according to the KL divergence.
An observation: As entropy does not depend on either a model / prior, smaller generalization error is equivalent to lower KL divergence.
Definition: Assume . Let be a set of random variables (leave one out).
Cross validation loss is defined by and is called the cross validation error.
We now prove an important theorem, which has three statements regarding the definitions that we made.
Note: is just the integral with respect to the posterior distribution.
(1) Here is the proof of the first statement.
While it was not mentioned what the expectation is being taken over in the statement, the proof clarifies it. In any case, the answer to the clarification is the canonical and the most standard answer.
(2) We now prove the second statement:
Thus,
Call the integrand in the denominator . Then is:
We introduced cross validation as a measure to evaluate the accuracy of our estimation. However, there are two issues with cross validation:
1) Although the averages of are equal the variances need not be equal. However, we do have this relation:
2) In the second statement, if the average by the posterior is numerically approximated, then
is called the importance sampling cross validation loss. Importance sampling is the method of calculating an expectation more easily by writing it as a more manageable distribution.
is fundamentally different from the former. is expectation with respect to .
3) Let us prove the third statement now.
By Cauchy Schwarz. Equality holds iff as a function of , which implies that is a constant function of .
We introduce another measure now, and it is often better than the cross validation loss. There are also many cases where WAIC can be employed whereas cross validation cannot.
Definition: Let be a set of random variables. The widely applicable information criterion (WAIC) is defined by
Here is a result: If is independent, WAIC is asymptotically equal to cross validation loss.
Remark: and WAIC can be employed to evaluate a stats model and a prior even if the prior is improper.
Just to summarize, we have introduced three instruments of measure:
In numerical experiments, we often care about minimizing errors instead of the loss itself due to the lower variance.
Marginal likelihood or Partition Function
If a prior satisfies , then the marginal likelihood (partition function) satisfies
We have slyly used Fubini Theorem above. Thus if the prior is proper.
Definition: The Free energy, or the minus log marginal likelihood is defined by
We look at this quantity also as an estimate of .
Using the notation , firstly note that
Thus,
Then,
Smaller means KL divergence decreases (⇒KL≥0) hence is a better estimate for . Thus is the average KL divergence from to whereas is their sum.
We now prove yet another important theorem.
Proof: For an arbitrary function, .
Now, .
Thus, .
Remark: As , the correspondence between free energy and marginal likelihood is one to one. However, in general, asymptotic order of the marginal likelihood as a random variable is not equal to its average, whereas for free energy that is the case. We have not proved this yet. Thus, for asymptotic statistics, free energy is a more convenient random variable.
We can illustrate the failure for the former: Let the marginal likelihood ratio for . Then . However, this is a result: in probability.
Meaning of Marginal Likelihood
Assume that is the prior distributions of a model p(x∣w) and a prior . Then .
By Bayes' Theorem, .
Thus if n is sufficiently large, maximizing is equivalent to maximizing .
Conditional independent cases
We will make the definitions for the conditionally independent case. They are quite similar.
Let us assume is dependent but is conditionally independent.
For an arbitrary function ,
which is a function of .
Everything else is defined similarly. But in this case, and .
For Example:
This can be understood as a regression problem
Thus is dependent, so CV cannot be employed. WAIC, however, can be. It is thus superior.
Exercises
I now refer you to the first set of exercises given in the notebook.
Further Steps
In the next post, we will introduce the concepts of realizability and regularity. We will discuss the main theorems of the regular statistical models. We will discuss MCMC methods that are a key tool for calculations, and we may do some other things as well.