Epistemic status: I wrote this post mostly for myself, in the process of understanding the theory behind Singular Learning Theory, based on the SLT low lecture series. This is a proof sketch with the thoroughness of an experimental physics lecture: enough to get the intuitions, but consult someone else for the details.

Background: Statistical Learning Theory

In the framework of statistical learning theory, we aim to establish a model that can predict an outcome $y$ given an input $x$ with a certain level of confidence. This prediction is typically governed by a set of parameters $w$, which need to be learned from the data. Let us begin by defining the primary components of this framework.

The Probabilistic Model


The probabilistic model $p(y\mid x,w)$ represents the likelihood of observing the outcome $y$ given the input $x$ and the parameters $w$. This model is parametric, meaning that the form of the probability distribution is predefined, and the specific behavior of the model is determined by the parameters $w$.
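As a concrete instance (my example, not from the lecture series; $f_w$ and $\sigma$ are notation introduced here): a regression network $f_w(x)$ with Gaussian output noise of width $\sigma$ corresponds to the model

$$p(y\mid x, w) = \frac{1}{\sqrt{2\pi\sigma^2}}\exp\!\left(-\frac{\left(y - f_w(x)\right)^2}{2\sigma^2}\right),$$

so maximizing this likelihood is the same as minimizing the squared error of $f_w$.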

The True Distribution


The true distribution $q(y\mid x)$ is the actual distribution from which the data points are sampled. In practice, this distribution is not known, and $p(y\mid x,w)$ attempts to approximate it through the learning process.

The Parameters

The parameters $w$ are the weights or coefficients within our model that determine its behavior. The goal of learning is to find the values of $w$ that make $p(y\mid x,w)$ as close as possible to the true distribution $q(y\mid x)$.

The Data

The data $D_n = \{(x_i, y_i)\}_{i=1}^{n}$ consist of pairs of input values $x_i$ and their corresponding outcomes $y_i$. These data points are used to train the model, adjusting the parameters $w$ to fit the observed outcomes.

The Negative Log-Likelihood

The negative log-likelihood is a measure of how well our model $p(y\mid x,w)$ fits the data $D_n$. It is defined as the negative sum of the logarithms of the model probabilities assigned to the true outcomes $y_i$:

$$L_n(w) = -\sum_{i=1}^{n} \log p(y_i \mid x_i, w),$$

where $n$ is the number of data points in $D_n$. Minimizing the negative log-likelihood is equivalent to maximizing the likelihood of the data under the model, which is a central objective in the training of probabilistic models.
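As a small sketch (my own toy example, not from the post), here is the negative log-likelihood $L_n(w)$ for a one-parameter Gaussian regression model $p(y\mid x,w) = \mathcal{N}(y;\, w x,\, 1)$:

```python
import numpy as np

# Minimal sketch (toy example): the negative log-likelihood L_n(w) for the
# Gaussian regression model p(y | x, w) = N(y; w * x, 1), i.e. f_w(x) = w * x.

def negative_log_likelihood(w, xs, ys):
    """L_n(w) = -sum_i log p(y_i | x_i, w) for a unit-variance Gaussian model."""
    residuals = ys - w * xs
    return np.sum(0.5 * residuals**2 + 0.5 * np.log(2.0 * np.pi))

rng = np.random.default_rng(0)
xs = rng.normal(size=100)
ys = 2.0 * xs + rng.normal(size=100)         # data generated with "true" parameter w = 2
print(negative_log_likelihood(1.0, xs, ys))  # worse fit, larger L_n
print(negative_log_likelihood(2.0, xs, ys))  # better fit, smaller L_n
```

The parameter that generated the data gives the smaller value, as expected.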

Prior over the Parameters

In a Bayesian framework, we incorporate our prior beliefs about the parameters $w$ before observing the data through a prior distribution $\phi(w)$. This prior distribution encapsulates our assumptions about the values that $w$ can take, based on prior knowledge or intuition. For instance, a common choice is a Gaussian distribution, which encodes a preference for smaller (in magnitude) parameter values, promoting smoother model functions.
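For instance (with $d$ the number of parameters and $\sigma$ a width hyperparameter, my notation), an isotropic Gaussian prior is

$$\phi(w) = \frac{1}{\left(2\pi\sigma^2\right)^{d/2}}\exp\!\left(-\frac{\lVert w\rVert^2}{2\sigma^2}\right),$$

which assigns more prior mass to parameter vectors of small norm.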

Posterior Distribution

Once we have observed the data $D_n$, we can update our belief about the parameters $w$ using Bayes' theorem. The updated belief is captured by the posterior distribution, which combines the likelihood of the data given the parameters with our prior beliefs about the parameters. The posterior distribution is defined as:

$$p(w \mid D_n) = \frac{1}{Z_n}\, e^{-L_n(w)}\,\phi(w),$$

where $L_n(w)$ is the negative log-likelihood of the parameters given the data, $\phi(w)$ is the prior over the parameters, and $Z_n$ is the normalization constant, also known as the partition function. The partition function ensures that the posterior distribution is a valid probability distribution by integrating (or summing) over all possible values of $w$:

$$Z_n = \int e^{-L_n(w)}\,\phi(w)\,dw.$$
The posterior distribution reflects how likely different parameter values are after taking into account the observed data and our prior beliefs.
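To make this concrete, here is a minimal numerical sketch (my own toy setup, continuing the Gaussian example above) that evaluates $e^{-L_n(w)}\phi(w)$ on a grid, normalizes it by $Z_n$, and reads off the posterior mean:

```python
import numpy as np

# Minimal sketch (toy setup): posterior over a single parameter w on a grid,
# p(w | D_n) = e^{-L_n(w)} * phi(w) / Z_n for the model p(y | x, w) = N(y; w * x, 1).

def negative_log_likelihood(w, xs, ys):
    residuals = ys - w * xs
    return np.sum(0.5 * residuals**2 + 0.5 * np.log(2.0 * np.pi))

rng = np.random.default_rng(0)
xs = rng.normal(size=100)
ys = 2.0 * xs + rng.normal(size=100)                    # data from "true" w = 2

ws = np.linspace(-2.0, 6.0, 2001)                       # grid over parameter space
log_phi = -0.5 * ws**2                                  # Gaussian prior phi(w), up to a constant
neg_L = np.array([-negative_log_likelihood(w, xs, ys) for w in ws])
unnormalized = np.exp(neg_L - neg_L.max() + log_phi)    # e^{-L_n(w)} * phi(w), rescaled for stability
Z_n = np.sum(unnormalized) * (ws[1] - ws[0])            # partition function (same rescaling, cancels below)
posterior = unnormalized / Z_n
print("posterior mean of w:", np.sum(ws * posterior) * (ws[1] - ws[0]))
```

The posterior mean lands close to the generating parameter, and the posterior tightens as more data points are added.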


Free Energy


This formulation looks like the Boltzmann distribution from statistical physics. That distribution also describes a probability: the probability that a system can be found in a particular microstate is $\propto e^{-E}$, with $E$ being the energy (here assuming that the temperature is 1). This formalism can be extended to macrostates by replacing the energy with the free energy, which takes into account that macrostates with many microscopic realizations are more likely.
We can do a similar step with the posterior over our model weights, by considering neighborhoods around local minima as macrostates. This raises the question of what the correct formula for the free energy of these macrostates is, so that we can make predictions about the behavior of learning machines.
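To make the analogy concrete (standard statistical mechanics at temperature 1, my notation):

$$p(\text{microstate } i) = \frac{e^{-E_i}}{Z}, \qquad p(\text{macrostate } A) = \frac{1}{Z}\sum_{i\in A} e^{-E_i} = \frac{e^{-F_A}}{Z}, \qquad F_A := -\log \sum_{i\in A} e^{-E_i}.$$

A macrostate can dominate either by having low energy or by containing many microstates; the free energy $F_A$ trades these off, which is exactly the trade-off we are about to see for neighborhoods in parameter space.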
First, let us consider the case of large datasets. In that case, we can replace the negative log-likelihood with $n$ times its average over the true distribution:

$$L_n(w) = -\sum_{i=1}^{n}\log p(y_i\mid x_i,w) \;\approx\; n\,\mathbb{E}_{(x,y)\sim q}\!\left[-\log p(y\mid x,w)\right] = n\,\underbrace{\mathbb{E}_{(x,y)\sim q}\!\left[\log\frac{q(y\mid x)}{p(y\mid x,w)}\right]}_{=:\,K(w)} \;+\; n\,\underbrace{\mathbb{E}_{(x,y)\sim q}\!\left[-\log q(y\mid x)\right]}_{=:\,S}$$
We see that we can decompose the negative log-likelihood into the number of data points $n$, the expected Kullback–Leibler divergence $K(w)$ of the model from the true distribution, and the entropy $S$ of the true distribution. As we see here, the entropy term does not depend on $w$ and therefore does not contribute to the posterior of the parameters:

$$p(w\mid D_n) \approx \frac{1}{Z_n}\, e^{-nK(w)-nS}\,\phi(w) \;\propto\; e^{-nK(w)}\,\phi(w),$$

since the constant factor $e^{-nS}$ cancels against the same factor in $Z_n$.
The macrostates, aka the regions in parameter space we are now interested in analyzing, are the vicinities of local minima in the loss landscape. Let us assume we decompose our parameter space into macrostates $W_\alpha$ and define for each of them:

$$F_\alpha := -\log \int_{W_\alpha} e^{-nK(w)}\,\phi(w)\,dw$$

We call this $F_\alpha$ the free energy of this region, as it serves the same function as the free energy in statistical physics. We have constructed $W_\alpha$ such that it only contains one minimum of the Kullback–Leibler divergence (which, as above, we abbreviate as $K(w)$). These minima don't have to be points but can be different kinds of sets $W_0$ within $W_\alpha$.

Examples of sets where $K(w)$ has a minimum. a) A point. b) A submanifold. c) An algebraic set with a singularity.

Since we assume that $K(w)$ is analytic, it can locally be approximated by its Taylor series at any point. What the leading-order terms of the Taylor series look like at a minimum depends on the geometry of the minimum. Let's look at the examples from the figure above. In a) we have a simple minimum point, so the Taylor series of $K$ at this point might look like $w_1^2 + w_2^2$, going up in each direction. In b) our minimum is a one-dimensional submanifold, and so the Taylor expansion of $K$ will look like $w_1^2$, increasing in one direction but staying flat when going along the other. The situation in c) looks almost the same, except at the center, where the line crosses itself. There the Taylor expansion will look like $w_1^2 w_2^2$, locally staying flat in either direction. Points like these, where the directions you can go in while staying on the set suddenly change, are called singularities.
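A quick way to see the difference between these three cases (a check I added, not from the lecture series) is that the Hessian of $K$ at the minimum becomes more and more degenerate:

```python
import sympy as sp

# Sketch (my own toy check): the Hessian at the minimum distinguishes a point
# minimum from the degenerate and singular cases of the figure above.
w1, w2 = sp.symbols('w1 w2', real=True)
examples = {
    'a) point':    w1**2 + w2**2,   # minimum at a single point
    'b) line':     w1**2,           # minimum along the submanifold {w1 = 0}
    'c) crossing': w1**2 * w2**2,   # minimum on the crossing lines {w1 = 0} and {w2 = 0}
}
for name, K in examples.items():
    H = sp.hessian(K, (w1, w2)).subs({w1: 0, w2: 0})
    print(name, 'Hessian at 0 =', H.tolist(), 'rank =', H.rank())
# a) has full rank, b) has rank 1, c) has rank 0: at the singular point the quadratic
# approximation carries no information, and the leading term is w1^2 * w2^2.
```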
Let us now calculate the free energy around a point $w^*$ where $K$ has a minimum. In general, in suitable local coordinates centered at $w^*$, the leading-order term of $K$ at a minimum looks like this:

$$K(w) = K(w^*) + w_1^{2k_1}\, w_2^{2k_2} \cdots w_d^{2k_d}$$

Here $K(w^*)$ is the value of the KL divergence at its minimum.
To get it into this form, we have to make a change of coordinates for $w$ (for example when we are at a singularity where the different parts of the minimum come in at an angle). A process called blowup guarantees that we are always able to find such coordinates. Since we are at a minimum, we know that all leading-order exponents are even.
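As a toy illustration (my own example, where a linear change of coordinates already suffices):

$$K(w_1,w_2) = (w_1-w_2)^2\,(w_1+w_2)^2$$

has its minimum on the two crossing lines $w_2 = \pm w_1$. In the rotated coordinates $u_1 = (w_1-w_2)/\sqrt{2}$, $u_2 = (w_1+w_2)/\sqrt{2}$ it becomes $K = 4\,u_1^2 u_2^2$, i.e. of the form $u_1^{2k_1} u_2^{2k_2}$ with $k_1 = k_2 = 1$ (the constant factor can be absorbed by rescaling the coordinates). For genuinely singular minima a linear change of coordinates is not enough, and one needs the blowup construction.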
Let's also approximate our prior over the weights $\phi(w)$ at this point by its leading-order term in the same coordinates:

$$\phi(w) \approx \phi_0\; w_1^{h_1}\, w_2^{h_2} \cdots w_d^{h_d}$$
For large $n$, the contribution to the free energy integral from any region far away from the minima gets exponentially suppressed. To see how much the minimal point $w^*$ contributes to the integral, we can integrate over its vicinity. Here we choose to integrate each coordinate from 0 to 1.

$$Z_{w^*}(n) := \int_0^1\!\cdots\!\int_0^1 e^{-n\left(K(w^*)\,+\,w_1^{2k_1}\cdots\, w_d^{2k_d}\right)}\;\phi_0\, w_1^{h_1}\cdots w_d^{h_d}\;dw_1\cdots dw_d$$
Why not from $-1$ to $1$?[1]

The first thing we can do is pull out the $e^{-nK(w^*)}$ term, as it does not depend on $w$:

$$Z_{w^*}(n) = e^{-nK(w^*)}\,\phi_0 \int_0^1\!\cdots\!\int_0^1 e^{-n\, w_1^{2k_1}\cdots\, w_d^{2k_d}}\; w_1^{h_1}\cdots w_d^{h_d}\;dw_1\cdots dw_d$$
To solve the remaining integral, we first do a Laplace and then a Mellin transform, as this brings the integral into a form that is easier to solve. Introducing the density of states $v(t)$ of the potential $w_1^{2k_1}\cdots w_d^{2k_d}$, the remaining integral is the Laplace transform of $v$, and the Mellin transform of $v$ can be computed explicitly:

$$v(t) := \int_{[0,1]^d} \delta\!\left(t - w_1^{2k_1}\cdots w_d^{2k_d}\right) w_1^{h_1}\cdots w_d^{h_d}\,dw, \qquad \int_{[0,1]^d} e^{-n\,w_1^{2k_1}\cdots\, w_d^{2k_d}}\, w_1^{h_1}\cdots w_d^{h_d}\,dw = \int_0^\infty e^{-nt}\, v(t)\,dt$$

$$\zeta(z) := \int_0^\infty t^{-z}\, v(t)\,dt = \int_{[0,1]^d} w_1^{h_1-2k_1 z}\cdots w_d^{h_d-2k_d z}\,dw = \prod_{i=1}^d \frac{1}{h_i + 1 - 2k_i z}$$
Next, we group the dimensions $w_i$ such that within each group $j$, the value $\lambda_i := \frac{h_i+1}{2k_i}$ is the same; we write $\lambda_j$ for the common value and $m_j$ for the size of the group. Later in the derivation, it will turn out that only the biggest of these contributions, the one coming from the smallest $\lambda_j$, survives for large $n$:

$$\zeta(z) = \left(\prod_{i=1}^d \frac{1}{2k_i}\right) \prod_j \frac{1}{(\lambda_j - z)^{m_j}}$$
Now we go backward, and reverse first the Mellin and then the Laplace transform. The inverse Mellin transform is a contour integral parallel to the imaginary axis, with the contour to the left of all poles $\lambda_j$:

$$v(t) = \frac{1}{2\pi i}\int_{c-i\infty}^{c+i\infty} t^{z-1}\,\zeta(z)\,dz, \qquad c < \min_j \lambda_j$$
We can solve this integral with the residue theorem, closing the integration path over the negative half of the complex plane if $t > 1$. Since all $\lambda_j$ have to be positive, when we close the contour over the negative half we enclose no poles, and the integral comes out as zero, so $v(t)$ vanishes for $t > 1$. Otherwise, for $0 < t < 1$, we close over the positive half, and each unique $\lambda_j$, which appears $m_j$ times, contributes a pole of order $m_j$:

$$v(t) \approx \sum_j \frac{a_j}{(m_j-1)!}\; t^{\lambda_j - 1}\left(-\log t\right)^{m_j-1} \qquad (0 < t < 1),$$

where $a_j$ collects the constant prefactors and terms with lower powers of $-\log t$ have been dropped.
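For a single pole of order $m$ at $z = \lambda$ (a small check I added, not from the lecture), the residue computation reads

$$\operatorname*{Res}_{z=\lambda}\left[\frac{t^{z-1}}{(\lambda-z)^{m}}\right] = \frac{(-1)^m}{(m-1)!}\left.\frac{d^{m-1}}{dz^{m-1}}\, t^{z-1}\right|_{z=\lambda} = \frac{(-1)^m\,(\log t)^{m-1}}{(m-1)!}\; t^{\lambda-1},$$

and since $0 < t < 1$ means $\log t < 0$, collecting the signs from the clockwise contour gives the positive term $\frac{(-\log t)^{m-1}}{(m-1)!}\, t^{\lambda-1}$ used above.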
When we now reverse the Laplace transform, we only have to integrate up to $t = 1$, since $v(t)$ vanishes beyond that:

$$\int_0^\infty e^{-nt}\, v(t)\,dt = \int_0^1 e^{-nt}\, v(t)\,dt \approx \sum_j \frac{a_j}{(m_j-1)!}\int_0^1 e^{-nt}\; t^{\lambda_j-1}\left(-\log t\right)^{m_j-1} dt$$
Each of these integrals, multiplied by $n^{\lambda_j}/(\log n)^{m_j-1}$, asymptotically goes towards some number as $n$ goes to infinity. So we can pick out the term that is of the highest order in $n$ as the one with the smallest $\lambda_j$ and its multiplicity $m_j$; we call these $\lambda$ and $m$:

$$\int_0^1 e^{-nt}\; t^{\lambda-1}\left(-\log t\right)^{m-1} dt \;\sim\; \Gamma(\lambda)\,\frac{(\log n)^{m-1}}{n^{\lambda}}, \qquad\text{so}\qquad Z_{w^*}(n) \;\approx\; C\, e^{-nK(w^*)}\;\frac{(\log n)^{m-1}}{n^{\lambda}}$$

for some constant $C > 0$.
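To see where this comes from (a step I am filling in, for the simplest case $m = 1$): substituting $u = nt$,

$$\int_0^1 e^{-nt}\, t^{\lambda-1}\,dt = \frac{1}{n^{\lambda}}\int_0^{n} e^{-u}\, u^{\lambda-1}\,du \;\longrightarrow\; \frac{\Gamma(\lambda)}{n^{\lambda}} \qquad (n\to\infty),$$

and for $m > 1$ the same substitution turns $(-\log t)^{m-1}$ into $(\log n - \log u)^{m-1}$, whose leading contribution is $(\log n)^{m-1}$.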
We can combine this into the formula for the free energy:

$$F_{w^*} = -\log Z_{w^*}(n) \;\approx\; n\,K(w^*) \;+\; \lambda\,\log n \;-\; (m-1)\,\log\log n \;+\; O(1)$$
This expression now tells us how much mass of the posterior distribution is concentrated around certain points in parameter space. The leading-order term is $n$ times the KL divergence between the true distribution and the distribution of our model; intuitively, this corresponds to our training loss. In the second term, the $\lambda$ scales inversely with the multiplicity of the minimum; intuitively, it corresponds to symmetries in the algorithm that the network implements. The third term scales with the number of parameters that attain a minimum of the same multiplicity; intuitively, this gives a set of parameters an extra advantage when multiple parameter sets implement algorithms that independently lead to minimal loss.
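As a numerical sanity check (my own, not from the post), we can verify the leading asymptotics on a toy potential where $\lambda$ and $m$ are known: for $K(w) = (w_1 w_2)^2$ on $[0,1]^2$ with a flat prior we have $k_1 = k_2 = 1$, $h_1 = h_2 = 0$, hence $\lambda = 1/2$ and $m = 2$. This is a sketch assuming SciPy's quadrature is accurate enough here:

```python
import numpy as np
from scipy import integrate, special

# Numerical sanity check of F(n) ~ n*K(w*) + lambda*log(n) - (m-1)*log(log(n)) + O(1)
# for K(w) = (w1 * w2)^2 on [0,1]^2 with a flat prior: K(w*) = 0, lambda = 1/2, m = 2.

def Z(n):
    # The inner integral over w2 has a closed form:
    #   int_0^1 exp(-n w1^2 w2^2) dw2 = sqrt(pi) * erf(sqrt(n) * w1) / (2 * sqrt(n) * w1)
    def inner(w1):
        if w1 == 0.0:
            return 1.0
        return np.sqrt(np.pi) * special.erf(np.sqrt(n) * w1) / (2.0 * np.sqrt(n) * w1)
    val, _ = integrate.quad(inner, 0.0, 1.0, limit=200, points=[1.0 / np.sqrt(n)])
    return val

for n in [10**2, 10**3, 10**4, 10**5, 10**6]:
    F = -np.log(Z(n))
    prediction = 0.5 * np.log(n) - (2 - 1) * np.log(np.log(n))
    print(n, round(F, 3), round(prediction, 3), round(F - prediction, 3))
```

The difference between $F$ and the predicted $\lambda \log n - (m-1)\log\log n$ should settle towards a constant, which is the $O(1)$ term.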


 

 

  1. ^

    I would have expected that we integrate over the whole vicinity:

    $$\int_{-1}^1\!\cdots\!\int_{-1}^1 e^{-n\left(K(w^*)\,+\,w_1^{2k_1}\cdots\, w_d^{2k_d}\right)}\;\phi(w)\;dw_1\cdots dw_d$$
    This, however, makes the derivation quite ugly:

    $$\int_{[-1,1]^d} e^{-n\, w_1^{2k_1}\cdots\, w_d^{2k_d}}\;\phi(w)\,dw = \sum_{s\in\{\pm1\}^d}\;\int_{[0,1]^d} e^{-n\, w_1^{2k_1}\cdots\, w_d^{2k_d}}\;\phi(s_1 w_1,\dots,s_d w_d)\,dw

    We can proceed with the rest of the calculation for each summand separately and just get a smaller prefactor $\phi_0$ for each. This does not change anything about the leading terms that are relevant for the Free Energy formula.

    I am still confused about why the lecture went from 0 to 1, and would be grateful for an explanation in the comments.

Comments

There was a sign error somewhere, you should be getting $+\lambda$ and $-(m-1)$. Regarding the integral from 0 to 1, since the powers involved are even you can do that and double it rather than $-1$ to $1$ (sorry if this doesn't map exactly onto your calculation, I didn't read all the details).

Ok, the sign error was just at the end, taking the $-\log$ of the result of the integral vs. taking the $\log$. Fixed it, thanks.

Thanks, I'll look for the sign error!

I agree that $K$ is symmetric around our point of integration, but the prior $\phi$ is not. We integrate over $e^{-nK}\,\phi$, which does not have to be symmetric, right?

Yes, good point, but if the prior is positive it drops out of the asymptotic as it doesn't contribute to the order of vanishing, so you can just ignore it from the start.

To see how much the minimal point $w^*$ contributes to the integral, we can integrate over its vicinity

 

I think you should be looking at the entire stable island, not just integrating from zero to one. I expect you could get a decent approximation with Lie transform perturbation theory, and this looks similar to the idea of macro-states in condensed matter physics, but I'm not knowledgeable in these areas.

$-\sum_{i=1}^{N}\log p(y_i\mid x_i,w)$

 

You have a typo, the equation after Free Energy should start with 

Also the third line should be a plus, not a minus.

Also, usually people use $\theta$ for model parameters (rather than $w$). I don't know the etymology, but game theorists use the same letter (for "types" = models of players).