DSLT 4. Phase Transitions in Neural Networks

Edmund Lau

Leon Lang

Liam Carroll


Thanks also for this post! I enjoy reading the sequence and look forward to post 5 on the connections to alignment :)

At some critical value θ=θc, we recognise a phase transition as being a discontinuous change in the free energy or one of its derivatives, for example the generalisation error Gn=E[Fn+1]−E[Fn].

"Discontinuity" might suggest that this happens fast. Yet, e.g. in work on grokking, it actually turns out that these "sudden changes" happen over a majority of the training time (often, the x-axis is on a logarithmic scale). Is this compatible, or would this suggest that phenomena like grokking aren't related to the phase transitions predicted by SLT?

There is, however, one fundamentally different kind of "phase transition" that we cannot explain easily with SLT: a phase transition of SGD in *time*, i.e. the number of gradient descent steps. The Bayesian framework of SLT does not really allow one to speak of time - the closest quantity is the number of datapoints n, but these are not equivalent. We leave this gap as one of the fundamental open questions of relating SLT to current deep learning practice.

As far as I know, modern transformers are often only trained once on each data sample, which should close the gap between SGD time and the number of data samples quite a bit. Do you agree with that perspective?

In general, it seems to me that we're probably most interested in phase transitions that happen across SGD time or with more data samples, whereas phase transitions related to other hyperparameters (for example, varying the truth as in your examples here) are maybe less crucial. Would you agree with that?

Would you expect that most phase transitions in SGD time or the number of data samples are first-order transitions (as is the case when there is a loss-complexity tradeoff), or can you conceive of second-order phase transitions that might be relevant in that context as well?

Which altered the posterior geometry, but not that of K(w) since p(w|Dn)≈e−nK(w) (up to a normalisation factor).

I didn't understand this footnote.

but the node-degeneracy and orientation-reversing symmetries only occur under precise configurations of the truth.

Hmm, I thought that these symmetries are about configurations of the parameter vector, irrespective of whether it is the "true" vector or not.

Are you maybe trying to say the following? *The truth determines which parameter vectors are preferred by the free energy, e.g. those close to the truth. For some truths, we will have more symmetries around the truth, and thus lower RLCT for regions preferred by the posterior*.

We will use the label *weight annihilation phase* to refer to the configuration of nodes such that the weights all point into the centre region and annihilate one another.

It seems to me that in the other phase, the weights also annihilate each other, so the "non-weight annihilation phase" is a somewhat weird terminology. Or did I miss something?

The weight annihilation phase is *never* preferred by the posterior

I think there is a typo and you meant .

"Discontinuity" might suggest that this happens fast. Yet, e.g. in work on grokking, it actually turns out that these "sudden changes" happen over a majority of the training time (often, the x-axis is on a logarithmic scale). Is this compatible, or would this suggest that phenomena like grokking aren't related to the phase transitions predicted by SLT?

This is a great question and something that came up at the recent summit. We would definitely say that the model is in two different phases before and after grokking (i.e. when the test error is flat), but it's an interesting question to consider what's going on over that long period of time where the error is slowly decreasing. I imagine that it is a relatively large model (from an SLT point of view, which means not very large at all from a normal ML point of view), meaning there would be a plethora of different singularities in the loss landscape. My best guess is that it is undergoing many phase transitions across that entire period, where it is finding regions of lower and lower RLCT but equal accuracy. I expect there to be some work done in the next few months applying SLT to the grokking work.

As far as I know, modern transformers are often only trained once on each data sample, which should close the gap between SGD time and the number of data samples quite a bit. Do you agree with that perspective?

This is a very interesting point. I broadly agree with this and think it is worth thinking more about, and could be a very useful simplifying assumption in considering the connection between SGD and SLT.

In general, it seems to me that we're probably most interested in phase transitions that happen across SGD time or with more data samples, whereas phase transitions related to other hyperparameters (for example, varying the truth as in your examples here) are maybe less crucial. Would you agree with that?

Broadly speaking, yes. With that said, hyperparameters in the model are probably interesting too (although maybe more from a capabilities standpoint). I think phase transitions in the truth are also probably interesting in the sense of dataset bias, i.e. what changes about a model's behaviour when we include or exclude certain data? Worth noting here that the Toy Models of Superposition work explicitly deals in phase transitions in the truth, so there's definitely a lot of value to be had from studying how variations in the truth induce phase transitions, and what these ramifications are in other things we care about.

Would you expect that most phase transitions in SGD time or the number of data samples are first-order transitions (as is the case when there is a loss-complexity tradeoff), or can you conceive of second-order phase transitions that might be relevant in that context as well?

At a first pass, one might say that second-order phase transitions correspond to something like the formation of circuits. I think there are definitely reasons to believe both happen during training.

Which altered the posterior geometry, but not that of K(w) since p(w|Dn)≈e−nK(w) (up to a normalisation factor).

I didn't understand this footnote.

I just mean that K(w) is not affected by n (even though of course Kn(w) or Ln(w) is), but the posterior is still affected by n. So the phase transition merely concerns the posterior and not the loss landscape.

but the node-degeneracy and orientation-reversing symmetries only occur under precise configurations of the truth.

Hmm, I thought that these symmetries are about configurations of the parameter vector, irrespective of whether it is the "true" vector or not.

My use of the word "symmetry" here is probably a bit confusing and a hangover from my thesis. What I mean is, these two configurations are *only in the set of true parameters in this setup* when the truth is configured in a particular way. In other words, they are always local minima of K(w), but not always global minima. (This is what PT1 shows when θ<π/2). Thanks for pointing this out.

It seems to me that in the other phase, the weights also annihilate each other, so the "non-weight annihilation phase" is a somewhat weird terminology. Or did I miss something?

Huh, I'd never really thought of this, but I now agree it is slightly weird terminology in some sense. I probably should have called them the weight-cancellation and non-weight-cancellation phases as I described in the reply to your DSLT3 comment. My bad. I think it's a bit too late to change now, though.

I think there is a typo and you meant .

Thanks! And thanks for reading all of the posts so thoroughly and helping clarify a few sloppy pieces of terminology and notation, I really appreciate it.

TLDR; This is the fourth main post of Distilling Singular Learning Theory which is introduced in DSLT0. I explain how to relate SLT to thermodynamics, and therefore how to think about phases and phase transitions in the posterior in statistical learning. I then provide intuitive examples of first and second order phase transitions in a simple K(w) loss function. Finally, I experimentally demonstrate phase transitions in two layer ReLU neural networks associated to the node-degeneracy and orientation-reversing phases established in DSLT3, which we can understand precisely through the lens of SLT.

In deep learning, the terms "phase" and "phase transition" are often used in an informal manner to refer to a steep change in a metric we care about, like the training or test loss, as a function of SGD steps, or alternatively some hyperparameter like the number of samples from the truth n.

But what exactly are the phases? And why do phase transitions even occur? SLT provides us with a solid theoretical framework for understanding phases and phase transitions in deep learning. In this post, we will argue that in the Bayesian setting,

The hyperparameter θ could be the number of samples from the truth n, some way of varying the model function f(x,w)=f(x,w;θ) or something about the true distribution Dn=Dn(θ), amongst other things. At some critical value θ=θc, we recognise a phase transition as being a discontinuous change in the free energy or one of its derivatives, for example the generalisation error Gn=E[Fn+1]−E[Fn].

In this post, we will present experiments that observe precise phase transitions in the toy neural network models we studied in DSLT3, for which we understand the set of true parameters W0 and therefore the phases. By the end of this post, you will have a framework for thinking about phase transitions in singular models and an intuition for why SLT predicts them to occur in learning.

## Phases Correspond to Singularities

## The Story Starts in Physics

This subsection is modelled on [Callen, Ch9], but it is only intended to be a high level discussion of the concepts grounded in some basic physics - don't get too bogged down in the details of the thermodynamics.

Fundamentally, a phase describes an aggregate state of a complex system of many interacting components, where the state retains particular qualities with variations in some hyperparameter. To explain the concept in detail, it is natural to start in physics (thermodynamics in particular), where these ideas originally arose. But there is a deeper reason to build from here: every human has an intuitive understanding of the phases of water and how they change with temperature^{[1]}, which serves as the base mental model for what a phase is.

One of the main goals of thermodynamics is to study how the equilibrium state of a system changes as a function of macroscopic parameters. In the case of a vessel of water at 1atm of pressure in constant contact with a thermal and pressure reservoir, the equilibrium state of the system corresponds to the state that minimises the Gibbs free energy F^{[2]}. The phases, then, are the equilibrium states, which describe qualitative physical properties of the system. The states of matter - solid, liquid, and gas - are all phases of water, which are characterised by variables like their volume and crystal structure. As anybody that has boiled water before knows, these phases undergo transitions as a function of temperature. Let's make this more precise.

## The Thermodynamic Setup

Consider a system of K water molecules moving in a 2D container, each with equal mass m. To each particle i∈[K]={1,…,K} we can associate a set of *microstates* describing its physical properties at a point in time, for example its position xi and its velocity vi. In our discussion we will simply focus on the position, which we will relabel w=(w1,…,wK) (for reasons that will become clear), so our configuration space W⊆R2K of possible microstates is

W = {w | (wi,x, wi,y) ∈ R2 for each i∈[K]}.

Since it is physically infeasible to know or model the positions of all molecules, we instead reason about the dynamics of the system by calculating *macroscopic* variables associated to a microstate, for example the temperature or total volume of the molecules. We will focus on the volume V(w) of a microstate w. Importantly, a macroscopic state is an aggregate over the system (for example, temperature being related to *average* squared velocity), meaning there are many possible configurations of microstates that result in the same macrostate. To this end, we can define regions of our configuration space according to their volume v,

Wv = {w ∈ W | V(w) = v} ⊆ W.

In our toy example, we want to study how the system changes as a function of temperature, which we will denote with θ. In a Gibbs ensemble, we can associate an energy functional, the Hamiltonian H(w;θ), to any given microstate w at temperature θ. The fundamental postulate of such a Gibbs ensemble is that the probability of the system being in a particular microstate w is determined by a Gibbs distribution

p(w;θ) = e^{−H(w;θ)}/Z, where Z = ∫_W e^{−H(w;θ)} dw.^{[3]}

This should look pretty familiar from our statistical learning setup! Indeed, we can then calculate the free energy of the ensemble for different volumes v at temperature θ,

Fθ(v) = −log(∫_{Wv} e^{−H(w;θ)} dw).

For a Gibbs ensemble, the equilibrium state of a given system is that state which minimises the free energy. In the context of bringing water to a boiling point, there are two minima of the free energy characterised by the liquid and gaseous states, which for ease we will characterise by their volumes v_liquid and v_gas. Then the equilibrium state changes at the critical temperature θc = 100°C:

W_{v_liquid} for 0°C < θ < 100°C, and W_{v_gas} for θ > 100°C.

Importantly, while small variations in the temperature away from θc will change the free energy of each state, it will

not change the configuration of these minima with respect to the free energy. In other words, the system will still be a liquid for any θ∈(0,100) - its qualitative properties are stable. This is the content of a phase.

## What is a phase?

A *phase* of a system is a region of configuration space Wv⊂W that minimises the free energy, and is invariant to small perturbations in a relevant hyperparameter θ. Typically, phases are distinguished by some macroscopic variable, in our case the volume V(w) distinguishing subsets Wv. More generally though, a phase describes some qualitative aggregate state of a system - like, as we've discussed in our example, the states of matter.

In some sense, you can define a phase to be *any* region that induces an equilibrium state with qualities you care about. But what makes phases a powerful concept is their relation to phase transitions - when there is a sudden jump in which state is preferred by the system.

## What is a phase transition?

Phase transitions are changes in the structure of the global minima of the free energy, and often arise as non-analyticities of Fn. This is a fancy way of saying they correspond to discontinuities in the free energy or one of its derivatives

^{[4]}.

A *first order phase transition* at a critical temperature θc corresponds to a reconfiguration of which phase is the global minimum of the free energy. As we discussed above, heating water to boiling point θc=100°C is a classic example of a first order phase transition.

Two examples of *second order phase transitions* are where:

- A *merge* transition occurs at θc when two phases that are initially disjoint for θ<θc merge to become the same state for θ≥θc, or;
- A *creation* transition occurs at θc when a local minimum exists for θ≥θc but does not exist for θ<θc. (If the directions are reversed, we call this a *destruction* transition.)

(Note that we have not given a full classification of phase transitions here, because to do so one needs to study the possible types of *catastrophes* that can occur, as presented in [Gilmore]).

## Phases in Statistical Learning

The notation and concepts in the previous section were not presented without reason. For starters, the Gibbs ensemble view of statistical learning is actually quite a rich analogy because, when the prior is uniform, the (random) Hamiltonian is equal to the empirical KL divergence^{[5]},

Hn(w) = nKn(w).

The configuration space of microstates of the physical system then corresponds to parameter space W with microstates given by different parameters w∈W. This means the posterior is equivalent to the Gibbs probability distribution of the system being in a certain microstate, meaning the definition of free energy is identical. So, what exactly are the phases then?

In statistical learning then,

To say that W minimises the free energy is equivalent to saying that it has non-negligible posterior mass. The reason for this, as we explored in DSLT2, is that the singularity structure of a most singular optimal point w(0)W∈Wopt dominates the behaviour of the free energy, because it minimises the loss L(w) and has the smallest RLCT λW.

You can, in principle, define a phase to be any region of W. But the analysis of phases in the posterior only gets interesting when you have a set of phases that have fundamentally different geometric properties. The free energy formula tells us that these geometric properties correspond to different accuracy-complexity tradeoffs.
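To see this tradeoff numerically, here is a minimal sketch (my own illustration with hypothetical constants, not the post's exact K(w)): we compute the free energy Fn(W) = −log ∫_W e^{−nK(w)} dw of a neighbourhood of each singularity by direct quadrature, for a toy K(w) with one accurate-but-complex minimum and one inaccurate-but-simpler minimum.

```python
import numpy as np

def free_energy(K, n, lo, hi, num=20001):
    """F_n(W) = -log integral_W exp(-n K(w)) dw, by Riemann sum with log-sum-exp."""
    w = np.linspace(lo, hi, num)
    log_f = -n * K(w)
    m = log_f.max()  # subtract the max before exponentiating, for stability
    dw = w[1] - w[0]
    return -(m + np.log(np.sum(np.exp(log_f - m)) * dw))

# Toy K(w) (hypothetical, in the spirit of the DSLT2 example): the singularity at
# w = -1 is perfectly accurate (K = 0) but locally quadratic (RLCT 1/2), while
# the minimum near w = +1 is inaccurate (K > 0) but locally quartic (RLCT 1/4).
C = 0.02
K = lambda w: (w + 1) ** 2 * ((w - 1) ** 4 + C)

F_accurate = lambda n: free_energy(K, n, -1.5, -0.5)   # phase around w = -1
F_simple   = lambda n: free_energy(K, n,  0.5,  1.5)   # phase around w = +1

# Small n: the simple-but-inaccurate phase has lower free energy.
assert F_simple(2) < F_accurate(2)
# Large n: the n*L term dominates and the accurate phase takes over.
assert F_accurate(500) < F_simple(500)
```

The only thing that changes between the two regimes is n; the geometry of K(w) is fixed, yet which region the posterior prefers flips.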

Consequently, in statistical learning, Watanabe states in [Wat18, §9.4] that

Our definitions of first and second order phase transitions carry over perfectly from the physics discussion above.

It's important to clarify here that phase transitions in deep learning have many flavours. If one believes that SGD is effectively just "sampling from the posterior", then the conception that phase transitions are related to changes in the geometry of the posterior carries over. There is, however, one fundamentally different kind of "phase transition" that we cannot explain easily with SLT: a phase transition of SGD in *time*, i.e. the number of gradient descent steps. The Bayesian framework of SLT does not really allow one to speak of time - the closest quantity is the number of datapoints n, but these are not equivalent. We leave this gap as one of the fundamental open questions of relating SLT to current deep learning practice.^{[6]}

The hyperparameter θ can affect any number of objects involved in the posterior. Remembering that the posterior is

p(w|Dn) = φ(w)e^{−nLn(w)}/Zn,

we could include hyperparameter θ dependence in any of:

## Intuitive Examples to Interpret Phase Transitions

In DSLT2 we studied an example of a very simple one-dimensional K(w) curve and got a feel for how the accuracy and complexity of a singularity affect the free energy of different neighbourhoods. Having now learned about phase transitions, we can cast new light on this example.

## Example 1: First Order Phase Transition in n

Example 4.1: Consider again a KL divergence given by

K(w) = (w+1)^2((w−(1+hC))^4 − kC),

where w(0)−1 = −1 and w(0)1 = 1 are the singularities, but the accuracy of w(0)1 is worse, K(w(0)1) = C > 0. Then we can identify two phases corresponding to the two singularities,

W−1 = B(w(0)−1, δ) and W1 = B(w(0)1, δ),

for some radius δ > 0 such that the accuracy of W−1 is better, but the complexity of W1 is smaller,

L(w(0)−1) < L(w(0)1), but λW−1 > λW1.

As the hyperparameter θ=n^{[7]} varies, we see a *first order phase transition* at the critical value of nc≈17 where the two free energy curves intersect, causing an exchange of which phase is the global minimum of the free energy. As we argued in that post, this is largely due to the accuracy-complexity tradeoff of the free energy. Notice also how the free energy of the global minimum is non-differentiable at nc, showing an example of the "non-analyticity" of Fn that we mentioned above.

## Example 2: Second Order Merge Phase Transition

Example 4.2: We can modify our example slightly to observe a second order phase transition. Let's consider

K(w;θ) = (w+(1−θ))^2 (w−(1−θ))^4,

where θ∈[0,1] is a hyperparameter that shifts the two singularities w(0)−1 = −1+θ and w(0)1 = 1−θ towards the origin. We will continue to label these phases W−1 and W1, noting their θ dependence.^{[8]}

Thus, at θc=1 the two phases will *merge* and the KL divergence will be

K(w;1) = w^6.

Therefore, at θ=1 the singularity w(0)0 = 0 will have an RLCT of λ0 = 1/6. There is a new most singular point caused by the merging of two phases! Again, we can visually depict this phase transition:
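Returning to Example 4.1, the leading terms of the free energy formula, Fn(W) ≈ nLn + λW log n, are enough to sketch a first order transition. The constants below are hypothetical (they are not the post's values, so the crossover differs from nc ≈ 17), but the mechanism is the same: the simpler phase wins at small n and the accurate phase wins at large n.

```python
import math

def free_energy_asymptotic(n, L, lam):
    """Leading terms of the free energy formula: F_n(W) ≈ n*L + λ*log(n)."""
    return n * L + lam * math.log(n)

# Hypothetical phase data (illustrative only, not the post's exact constants):
# W_-1 is perfectly accurate (L = 0) but more complex (λ = 1/2),
# W_1 is slightly inaccurate (L = C > 0) but simpler (λ = 1/4).
C = 0.05
phases = {"W_-1": (0.0, 0.5), "W_1": (C, 0.25)}

def preferred_phase(n):
    """The phase minimising the asymptotic free energy at sample size n."""
    return min(phases, key=lambda p: free_energy_asymptotic(n, *phases[p]))

# The simple-but-inaccurate phase wins at small n; once n*C outweighs the
# (λ_{W_-1} - λ_{W_1}) * log(n) complexity saving, the accurate phase takes over.
n_c = next(n for n in range(2, 10_000) if preferred_phase(n) == "W_-1")
# n_c == 13 for these illustrative constants
```

The critical nc depends entirely on C and the two RLCTs, which is exactly the accuracy-complexity tradeoff the free energy formula describes.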

Now that we have the basic intuitions of SLT and phase transitions down pat, let's apply these concepts to the case of two layer feedforward ReLU neural networks.

## Phase Transitions in Two Layer ReLU Neural Networks

The main claim of this sequence is that Singular Learning Theory is a solid theoretical framework for understanding phases and phase transitions in neural networks. It's now time to make good on that promise and bring all of the pieces together to understand an actual example of phase transitions in neural networks. The full details of these experiments are explained in my thesis, [Carroll, §5.2], but I will briefly outline some points here for the interested reader. All notation and terminology is explained in detail in DSLT3, so use that section as a reference.

If you are uninterested, just skip to the next subsection to see the results.

## Experimental Setup

We will consider a (model, truth) pair defined by the simple two layer feedforward ReLU neural network models we studied in DSLT3. Phase transitions will be induced by varying the true distribution by a hyperparameter θ, meaning Dn=Dn(θ). Since we have a full classification of W0 from DSLT3, we understand the phases of the system, and therefore we want to study how their differing geometries affect the posterior. As we explained in that post, the scaling and permutation symmetries are generic (they occur for all parameters w∈W), but the node-degeneracy and orientation-reversing symmetries only occur under precise configurations of the truth. Thus, we are interested in studying how the posterior changes as we vary the truth to induce these alternative true parameters - the phases of our setup.

The posterior sampling procedure uses an MCMC variant called HMC NUTS, which is brilliantly explained and interpreted here. Estimating precise nominal free energy values, and particularly those of the RLCT λ, using sampling methods is currently very challenging (as explained in [Wei22]). So, for these experiments, our inference about phases and phase transitions will be based on *visualising* the posterior and observing the posterior concentrations of different phases. With this in mind, the posteriors below are averaged over four trials, 20,000 samples each, for each fixed true distribution defined by θ. (Bayesian sampling is very computationally expensive, even in simple settings).

To isolate the phases we care about, we can use the fact that the scaling symmetry and permutation symmetries of our networks are generic. To this end we will normalise the weights by defining the *effective weight* ^wi = |qi|wi^{[9]}, which preserves functional equivalence f(x,w) = f(x,^w)^{[10]}. We will say a node is *degenerate* if ^wi = 0. We also project different node indices on to the same (^wi,1, ^wi,2) axes as follows:

The prior on inputs q(x) is uniform on the square [−1,1]^2, and the prior on parameters φ(w) is the standard multidimensional normal N(0,1).
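As a quick sanity check of this normalisation, here is a minimal sketch (my own, not the post's code) verifying that scaling (wi, bi) by |qi| and setting qi = 1 leaves the network function unchanged when the outgoing weights are positive - this is just the scaling symmetry q·ReLU(z) = ReLU(q·z) for q ≥ 0. Note that the bias is rescaled along with the weight.

```python
import numpy as np

relu = lambda z: np.maximum(z, 0.0)

def f(x, q, w, b, c):
    """Two layer ReLU network: f(x) = sum_i q_i ReLU(<w_i, x> + b_i) + c."""
    return (q * relu(x @ w.T + b)).sum(axis=-1) + c

rng = np.random.default_rng(0)
q = rng.uniform(0.5, 2.0, size=2)       # positive outgoing weights, as in the experiments
w = rng.normal(size=(2, 2))
b = rng.normal(size=2)
c = rng.normal()
x = rng.uniform(-1, 1, size=(100, 2))   # inputs on [-1,1]^2, as in the post

# Effective weights: scale (w_i, b_i) by |q_i| and set q_i = 1.
w_hat, b_hat = np.abs(q)[:, None] * w, np.abs(q) * b
assert np.allclose(f(x, q, w, b, c), f(x, np.ones(2), w_hat, b_hat, c))
```

For negative qi this equality fails, which is exactly the "white lie" flagged in the footnote: the normalisation is only a functional equivalence for qi ≥ 0.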

## Phase Transition 1 - Deforming to Degeneracy

In this experiment we will see a first order phase transition induced by deforming a true network from having no degenerate nodes to having (the possibility of) one degenerate node, as discussed in DSLT3 - Node Degeneracy. This example will reinforce the key messages of Watanabe's free energy formula: true parameters are preferred according to their RLCT, and at finite n non-true parameters can be preferred due to the accuracy-complexity tradeoff.

## Defining the Model, Truth, and Phases

We are going to consider a model network with d=2 nodes,

f(x,w) = ReLU(⟨^w1,x⟩ + ^b1) + ReLU(⟨^w2,x⟩ + ^b2) + c,

and a realisable true network f(x,w(0)) with m=2 nodes, which we will denote by f2(x,θ) := f(x,w(0)) to signify its hyperparameter θ dependence (and distinguish it from the next experiment),

f2(x,θ) = ReLU(⟨^w(0)1,x⟩ − 1/3) + ReLU(⟨^w(0)2,x⟩ − 1/3).

The true weights rotate towards one another by a hyperparameter θ∈[0,π/2], so

w(0)1 = (cosθ, sinθ), w(0)2 = (−cosθ, sinθ).^{[11]}

As we explained in DSLT3, we can depict the function and its activation boundaries pictorially:

At θ=π/2, the truth could be expressed by a network with only one node, m=1,

f2(x,π/2) = ReLU(x2 − 1/3) + ReLU(x2 − 1/3) = 2ReLU(x2 − 1/3).

This degeneracy is what we are interested in studying. The WBIC tells us to expect the posterior to prefer the one-degenerate-node configuration since it has fewer effective parameters.^{[12]}

To identify our phases, at θ=π/2 there are two possible configurations of the effective model weights that are true parameters:

- Both non-degenerate but sharing the same activation boundary: both ^w1, ^w2 ≠ 0 such that ^w1 + ^w2 = (0,2).
- One degenerate, one non-degenerate: either ^w1 = (0,0) and ^w2 = (0,2), or vice versa by permutation symmetry.

To study these configurations we thus define phases based on annuli in the plane centred on the circle of radius r with annuli radius of ε,

A(r,ε) = {(^w_,1, ^w_,2) ∈ R^2 | r−ε ≤ ∥(^w_,1, ^w_,2)∥ ≤ r+ε}.

Then we define the two phases containing the singularities of interest to be

ANonDegen = A(1,ε) × A(1,ε) and ADegen = (A(0,ε) × A(2,ε)) ∪ (A(2,ε) × A(0,ε)).

The union is due to the permutation symmetry - which precise node is degenerate doesn't matter. We will let Ac = (R^2)^2 ∖ (ANonDegen ∪ ADegen).
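To make the phase membership concrete, here is a small illustrative sketch (my own; the ε value is arbitrary) of how a posterior sample of effective weights would be assigned to one of these phases by annulus membership.

```python
import numpy as np

def in_annulus(v, r, eps):
    """Membership in A(r, eps) = {v in R^2 : r - eps <= ||v|| <= r + eps}."""
    return abs(np.linalg.norm(v) - r) <= eps

def classify(w1_hat, w2_hat, eps=0.25):
    """Assign a posterior sample (the effective weights of both nodes) to a phase."""
    if in_annulus(w1_hat, 1, eps) and in_annulus(w2_hat, 1, eps):
        return "NonDegen"
    if (in_annulus(w1_hat, 0, eps) and in_annulus(w2_hat, 2, eps)) or \
       (in_annulus(w1_hat, 2, eps) and in_annulus(w2_hat, 0, eps)):
        return "Degen"
    return "other"   # the complement A^c

# Both effective weights near the unit circle -> non-degenerate phase;
# one near zero and one of norm ~2 -> degenerate phase (either node, by symmetry).
assert classify(np.array([0.0, 1.0]), np.array([0.0, 1.1])) == "NonDegen"
assert classify(np.array([0.1, 0.0]), np.array([0.0, 2.0])) == "Degen"
```

Counting posterior samples per phase in this way is what the "posterior concentration" observations below amount to.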

There are two questions we seek to answer:

## Results

There is also a static facet grid of the key frames if you want a closer inspection.

The results of our experiments show:

It is unsurprising (yet satisfying) that the degenerate phase ADegen is preferred at θ=π/2, in line with what the WBIC tells us to expect. What might be more surprising, though, is that ANonDegen has extremely little posterior density at this θ value.^{[13]}

As we have argued throughout the sequence, the free energy formula suggests that first order phase transitions happen when there is a change in the accuracy-complexity tradeoff such that the posterior newly preferences one phase over the other. Here, the first order phase transition at θc≈1.26 can be understood in these terms with the following graph that depicts how the accuracy of ADegen improves with θ.

## A Complexity Measure for Non-Analytic ReLU Networks

One last thing to point out here is that since K(w) is not analytic for ReLU neural networks, the RLCT is not a well defined object. Nonetheless, Watanabe has recently proven in this paper that there is a *bound* on the free energy,

Fn ≤ nSn + λReLU log n,

where the complexity λReLU ∈ Q>0 is measured by the number of parameters in the smallest compressed network possible to represent the function, as a kind of 'pseudo'-RLCT. In our case the complexity is

2λReLU = 9 for 0 < θ < π/2, and 2λReLU = 5 for θ = π/2,

since there are five parameters required in the degenerate phase and nine in the non-degenerate phase^{[14]}. In this way, Watanabe's work predicts the results we see. This also shows us how the theory of SLT may be generalisable to the non-analytic setting and still give approximately the same essential insights into singular models.

## Phase Transition 2 - Orientation Reversing Symmetry

## Defining the Model, Truth, and Phases

This time we are going to consider a model network with d=3 nodes,

f(x,w) = c + ∑_{i=1}^{3} ReLU(⟨^wi,x⟩ + ^bi),

and a realisable true network f3(x,ϑ) with m=3 nodes,

f3(x,ϑ) = ∑_{i=1}^{3} ReLU(⟨^w(0)i,x⟩ − 1/3),

where the weights are defined by an order parameter ϑ∈[1,3] that scales one gradient,

w(0)1 = (cos π/3, sin π/3), w(0)2(ϑ) = ϑ(cos π, sin π), w(0)3 = (cos 5π/3, sin 5π/3).

At ϑ=1, the weights satisfy the weight annihilation property,

w(0)1 + w(0)2 + w(0)3 = (0,0),

meaning that reversing the orientation of the weights, w(0)i ↦ −w(0)i (which is equal to a rotation by π), will preserve the function as discussed in DSLT3 - Orientation Reversal. We will use the label *weight annihilation phase* to refer to the configuration of nodes such that the weights all point into the centre region and annihilate one another.^{[15]}

Our key question thus becomes: does the posterior prefer the weight annihilation phase, or the non-weight annihilation phase, at ϑ=1?

To depict the phases on the (wi,1, wi,2) plane, let R(θ) = (cosθ, sinθ), let B(x,ε) be the closed ball of radius ε centred at x∈R^2, and let S3 denote the permutation group of order 3. Then the two phases of interest are

ENonWA = ⋃_{σ∈S3} ∏_{k=0}^{2} B(R(π/3 + 2σ(k)π/3), ε) and EWA = ⋃_{σ∈S3} ∏_{k=0}^{2} B(R(2σ(k)π/3), ε).

Since w(0)2 is being scaled by ϑ, we will understand the centre of each ball corresponding to σ(k)=1 in ENonWA as being multiplied by the scalar ϑ. (It is easier to state this in words than to write it down in gory notation.)
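The identity behind the orientation-reversing symmetry can be checked numerically. The following sketch (my own, not the post's code) uses ReLU(z) = z + ReLU(−z): when the weights annihilate, summing over nodes leaves only the constant ∑i bi, so the orientation-reversed configuration (with biases negated and that residual constant absorbed into c) computes the same function.

```python
import numpy as np

relu = lambda z: np.maximum(z, 0.0)

def f(x, W, b, c=0.0):
    """f(x) = sum_i ReLU(<w_i, x> + b_i) + c"""
    return relu(x @ W.T + b).sum(axis=-1) + c

# The three true weights at vartheta = 1, which annihilate: sum_i w_i = (0, 0).
angles = np.array([np.pi / 3, np.pi, 5 * np.pi / 3])
W = np.stack([np.cos(angles), np.sin(angles)], axis=1)
b = np.full(3, -1 / 3)
assert np.allclose(W.sum(axis=0), 0.0)

# ReLU(z) = z + ReLU(-z) summed over nodes with sum_i w_i = 0 gives
#   f(x, W, b) = f(x, -W, -b, c = sum_i b_i),
# i.e. reversing the orientation preserves the function, with the biases
# flipped and the residual constant sum_i b_i = -1 absorbed into c.
x = np.random.default_rng(1).uniform(-1, 1, size=(100, 2))
assert np.allclose(f(x, W, b), f(x, -W, -b, c=b.sum()))
```

This is why the reversed configuration is a genuinely distinct true parameter: it is not related to the original by the generic scaling or permutation symmetries, only by the annihilation property.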

In this experiment our two questions are:

## Results

The results of this experiment show that:

The weight annihilation phase EWA is *never* preferred by the posterior, thus there is no first order phase transition. But there *is* a second order phase transition at ϑc≈2 where EWA is destroyed.

In [Carroll, §5.4.3], I perform a calculation on an even simpler orientation-reversing example which shows that the relative error of the inner cancellation region strongly dictates the preference of the two phases. This relative error can be made smaller by increasing the size of the prior q(x). That result suggests that the two phases may have the same RLCT, but differing lower order geometry. This is speculative though, and it would be interesting to better understand the RLCT of both phases.

The second order phase transition is unsurprising since we specifically deform the network so that EWA doesn't contain a true parameter for ϑ∈(1,3]. At ϑc, its inaccuracy is too highly penalised and the posterior contains no samples from the region.

## References

[Callen] - H. Callen, Thermodynamics and an Introduction to Thermostatistics, 1991

[Gilmore] - R. Gilmore, Catastrophe Theory for Scientists and Engineers, 1981

[Wat18] - S. Watanabe, Mathematical Theory of Bayesian Statistics, 2018

[Carroll] - L. Carroll, Phase Transitions in Neural Networks, 2021

[Wei22] - S. Wei, D. Murfet, et al., Deep learning is singular, and that’s good, 2022

^{^}At constant atmospheric pressure, that is.

^{^}Yes, in any physics or chemistry textbook you will see the Gibbs free energy denoted by G. I am writing F to keep it consistent with our later statistical learning discussion.

^{^}At this point, this is a slight abuse of the physics notions. Typically the probability distribution is proportional to e−βH(w) where β is the inverse temperature. In this case we are going to absorb the β into the H(w;θ) term and not get too caught up in the actual physics - we're just painting a conceptual picture to apply later on.

^{^}Which often correspond to the moments (mean, variance, etc.) of quantities like H(w).

^{^}More precisely, considering the tempered posterior at inverse temperature β>0, the Hamiltonian has the form

Hn(w) = nβLn(w) − log φ(w).

(Since Kn(w) = Ln(w) − Sn, the constant Sn in w is irrelevant.)

^{^}Note here that a phase transition of a dynamical system (i.e. SGD, which we can imagine as a particle moving subject to a potential well) is a slightly more subtle concept. One imagines the loss landscape to be fixed, and the "phase transition" corresponding to the particle moving from one particular phase in W to another. In this sense, there isn't exactly a phase *transition* in the general sense, but there is a change in which phase a system finds itself in.

^{^}Which altered the posterior geometry, but not that of K(w) since p(w|Dn) ≈ e^{−nK(w)} (up to a normalisation factor).

^{^}It is a little bit disingenuous to continue to call these phases when δ is very close to 1, as the singularity w(0)1 has a non-negligible effect on W−1, and vice-versa, meaning the phases lose their individual identities. Alternatively, one defines W0 to centre on w(0)0=0, and observes how the free energy changes with δ. But, I have kept the two "phases" W−1 and W1 in the animation below to illustrate the general idea with minimum fuss.

^{^}You might wonder why we still endow the model with the qi parameters in the first place if we just normalise them out after the fact. We assumed it was more important to let the sampling procedure take place on an earnest neural network model without restricting its parameter space, thus trying to keep it in line with neural networks actually used in practice. But, it is likely that these results would hold otherwise, too.

^{^}The astute observer will notice that this is a white lie - the functional equivalence is true as long as each qi ≥ 0. However, in our experiments, the true outgoing weights are qi = 1, meaning a good sample will only ever have positive weights, i.e. any sample with a negative q(k)i will be removed by the outlier validation.

^{^}Explicitly, the truth is defined by

f2(x,θ) = ReLU(cos(θ)x1 + sin(θ)x2 − 1/3) + ReLU(−cos(θ)x1 + sin(θ)x2 − 1/3).

^{^}Relatedly, the plot of the KL divergence in Example 3.3 tells us to expect that the degenerate phase may be preferred.

^{^}It is worth briefly mentioning the effect of the prior here. The free energy formula tells us that as n→∞, the effects of the prior on learning become negligible. But of course, we are only ever in the finite n regime, at which point the prior does have effects on the posterior. In our case, since the prior is a Gaussian centred at w=(0,0) with standard deviation 1, it is reasonable to say that it has some bearing on the degenerate phase being preferred. However, further experiments showed that this behaviour is still retained for a flatter prior with increased standard deviation. The problem, however, is that the Markov chains can become very unstable on these priors, producing posterior samples with very high loss, indicating that the chains aren't converging to the correct long-term distribution. In the interest of time, I decided not to continue to fine-tune the experiments on non-converging chains for a flatter prior, but it would be interesting to see to what extent the prior does affect these results.

^{^}In other words, the degenerate phase requires a truth with five parameters,

q(0)1 ReLU(w(0)1,1 x1 + w(0)1,2 x2 + b(0)1) + c,

whereas the non-degenerate phase requires nine,

q(0)1 ReLU(w(0)1,1 x1 + w(0)1,2 x2 + b(0)1) + q(0)2 ReLU(w(0)2,1 x1 + w(0)2,2 x2 + b(0)2) + c.

^{^}@Leon Lang correctly pointed out that this is slightly weird terminology to use. Instead these should really be referred to as weight-cancellation instead of weight-annihilation, since both initial configurations obey the weight-annihilation property as I defined it, whereas what I am really referring to is the fact that in one configuration all weights are active and cancel in a region. It's too late to change the terminology throughout, but do keep this in mind.