DSLT 3. Neural Networks are Singular


I'm trying to read through this more carefully this time: how load-bearing is the use of ReLU nonlinearities in the proof? This doesn't intuitively seem like it should be that important (e.g. a sigmoid/gelu/tanh network feels like it is probably singular, and it certainly has to be if SLT is going to tell us something important about NN behaviour, because changing the nonlinearity doesn't change how NNs behave that much imo), but it does seem to be an important part of the construction you use.

Good question! The proof of the exact symmetries of this setup, i.e. the *precise* form of W0, is highly dependent on the ReLU. However, the general phenomenon I am discussing is applicable well beyond ReLU to other non-linearities. I think there are two main components to this:

**Other non-linearities induce singular models.** As you note, other non-linear activation functions do lead to singular models. @mfar did some great work on this for tanh networks. Even though the activation function is important, note that the better intuition to have is that the hierarchical nature of a model (e.g. neural networks) is what makes them singular. Deep linear networks are still singular despite an identity activation function. Think of the activation as giving the model more expressiveness.

**Even if W0 is uninteresting, the loss landscape might be "nearly singular".** The ReLU has an analytic approximation, the Swish function swishβ(x) = xσ(βx), where σ is the sigmoid, which does *not* yield the same symmetries as discussed in this post. This is because the activation boundaries are no longer a sensible thing to study (the Swish function is "always active" on all subsets of the input domain), which breaks down a lot of the analysis used here.

Suppose, however, that we take a β that is so large that, from the point of view of your computer, swishβ(x) = ReLU(x) (i.e. their difference is within machine epsilon). Even though swishβ is now a very different object to ReLU on paper, the loss landscapes of the two will be approximately equal, meaning that the Bayesian posterior will be practically identical between the two functions and induce the same training dynamics.

So, whilst the *precise* functional equivalences might be very different across activation functions (differing W0), there might be many *approximate* functional equivalences. This is also the sense in which we can wave our arms about "well, SLT only applies to analytic functions, and ReLU isn't analytic, but who cares". Making precise mathematical statements about this "nearly singular" phenomenon - for example, how does the posterior change as you lower β in swishβ? - is under-explored at present (to the best of my knowledge), but it is certainly not something that discredits SLT, for all of the reasons I have just explained.
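To illustrate the machine-epsilon point numerically, here is a small numpy sketch (assuming the standard parameterisation swishβ(x) = xσ(βx); the grid and helper names are purely illustrative):

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def swish(x, beta):
    # Numerically stable sigmoid: clip the argument before exponentiating.
    z = np.clip(beta * x, -500, 500)
    return x / (1.0 + np.exp(-z))

# Grid that includes points very close to the kink at x = 0,
# where swish and ReLU differ the most.
x = np.concatenate([np.linspace(-2, 2, 4001),
                    np.logspace(-16, 0, 200),
                    -np.logspace(-16, 0, 200)])

for beta in [1.0, 1e2, 1e4, 1e8, 1e16]:
    gap = np.max(np.abs(swish(x, beta) - relu(x)))
    print(f"beta = {beta:.0e}   sup |swish - ReLU| on grid ~ {gap:.2e}")

# The gap shrinks like O(1/beta); once it is below float64 machine epsilon,
# any loss landscape K(w) computed with swish_beta agrees with the ReLU one
# to numerical precision on this grid.
```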

Yeah, I agree with everything you say; it's just that I was trying to remind myself of enough of SLT to give a 'five minute pitch' for SLT to other people, and I didn't like the idea that I'm hanging it off the ReLU.

I guess the intuition behind the hierarchical nature of the models leading to singularities is the permutation symmetry between the hidden channels, which is kind of an easy thing to understand.

I get and agree with your point about approximate equivalences, though I have to say that I think we should be careful! One reason I'm interested in SLT is that I spent a lot of time during my PhD on Bayesian approximations to NN posteriors. I think SLT is one reasonable explanation of why this never yielded great results, but I think hand-wavy intuitions about 'oh well, the posterior is probably-sorta-Gaussian' played a big role in its longevity as an idea.

Yeah, it's not totally clear what this 'nearly singular' thing would mean. Intuitively, it might be that there's a kind of 'hidden singularity' in the space of this model that affects the behaviour, like the singularity in a dynamical model with a phase transition. But I'm just guessing.

Thanks Liam also for this nice post! The explanations were quite clear.

The property of being singular is specific to a model class f(x,w), regardless of the underlying truth.

This holds for singularities that come from symmetries where the model doesn't change. However, is it correct that we need the "underlying truth" to study symmetries that come from other degeneracies of the Fisher information matrix? After all, this matrix involves the true distribution in its definition. The same holds for the Hessian of the KL divergence.

Both configurations, non-weight-annihilation (left) and weight-annihilation (right)

What do you mean with non-weight-annihilation here? Don't the weights annihilate in both pictures?

However, is it correct that we need the "underlying truth" to study symmetries that come from other degeneracies of the Fisher information matrix? After all, this matrix involves the true distribution in its definition. The same holds for the Hessian of the KL divergence.

The definition of the Fisher information matrix does not refer to the truth whatsoever. (Note that in the definition I provide, I am assuming the supervised learning case where we know the input distribution q(x), meaning the model is p(y|x,w)q(x), which is why the q(x) shows up in the formula I just linked to. The derivative terms do not explicitly include q(x) because it just vanishes in the derivative anyway, so it's irrelevant there. But remember, we are ultimately interested in modelling the conditional true distribution q(y|x).)

What do you mean with non-weight-annihilation here? Don't the weights annihilate in both pictures?

You're right, that's sloppy terminology from me. What I mean is: in the right-hand picture (which I originally labelled WA), there is a region in which all nodes are active but cancel out to give zero effective gradient, which is markedly different to the left-hand picture. I have edited this to NonWC and WC instead to clarify, thanks!

TL;DR: This is the third main post of Distilling Singular Learning Theory, which is introduced in DSLT0. I explain that neural networks are singular models because of the symmetries in parameter space that produce the same function, and introduce a toy two layer ReLU neural network setup where these symmetries can be perfectly classified. I provide motivating examples of each kind of symmetry, with particular emphasis on the non-generic node-degeneracy and orientation-reversing symmetries that give rise to interesting phases to be studied in DSLT4.

As we discussed in DSLT2, singular models have the capacity to generalise well because the effective dimension of a singular model, as measured by the RLCT, can be less than half the dimension of parameter space. With this in mind, it should be no surprise that neural networks are indeed singular models, but up until this point we have not exactly explained what feature they possess that *makes* them singular. In this post, we will explain that, in essence, it is the symmetries in parameter space that produce the same function. In the case where the model and truth are both defined by similar neural network architectures, this fact means that the set of true parameters W0 is non-trivial (i.e. bigger than the regular case where it is a single point), and often possesses many symmetries. This directly implies that neural networks are singular models.

The primary purpose of this post is to show with examples why neural networks are singular, and classify the set of true parameters W0 in the case where the model and truth are simple two layer feedforward ReLU networks. In doing so, we will lay the groundwork for understanding the phases present in the setup so that we can then study relevant phase transitions in DSLT4. Feel free to jump ahead to the slightly more exciting DSLT4 Phase Transitions in Neural Networks and refer back to this post as needed.

## Outline of Classification

To understand the different regions that minimise the free energy (and thus, as we'll see in DSLT4, the *phases*), one needs to first understand the singularities in the set of optimal parameters of K(w). In the realisable regression case with a model neural network f(x,w) and true neural network defined by f0(x)=f(x,w(0)) for some w(0)∈W, the set of true parameters has the form

W0 = {w∈W | f(x,w) = f0(x)}.^{[1]}

Thus, classifying the true parameters is a matter of establishing which parameters w∈W yield functional equivalence between the model and the truth, f(x,w)=f0(x). The property of being singular is specific to a model class f(x,w), regardless of the underlying truth. But classifying W0 in the realisable case is a convenient way of studying what functionally equivalent symmetries exist for a particular model class.

Neural networks have been shown to satisfy a number of different symmetries of functional equivalence across a range of activation functions and architectures, which we will elaborate on throughout the post. Unsurprisingly, the nonlinearity of the activation function plays a central role in governing these symmetries. In general, then, deep neural networks are *highly* singular.

In this post we are going to explore a *full characterisation* of the symmetries of W0 when the model is a two layer feedforward ReLU neural network with d hidden nodes, and the truth is the same architecture but with m≤d nodes. Though you would never use such a basic model in real deep learning, the simplicity of this class of network allows us to study W0 with full precision. We will see that:

- If the model and truth have the *same* number of nodes, m=d, there are three forms of symmetry of W0:
  - **Scaling symmetry** of the incoming and outgoing weights to any node.
  - **Permutation symmetry** of the hidden nodes in a layer.
  - **Orientation reversing symmetry** of the weights, only when some subset of weights sum to zero (i.e. "annihilate" one another).
- If the model has *more* nodes than the truth, m<d: without loss of generality, the first m nodes of the model must have the same symmetries as in the first case. Then each excess node i∈{m+1,…,d} is either:
  - **Degenerate**, meaning its total weight (gradient) is 0 (thus the node is always constant), or
  - has the **same activation boundary** as another node already in the model, such that the weights sum to the total gradient in a region^{[2]}.

In [Carroll, Chapter 4], I give rigorous proofs that in both cases, W0 is classified by these symmetries, and these symmetries alone. The purpose of this post is not to repeat these proofs, but to provide the intuition for each of these symmetries. I have included a sketch of the full proof in the appendix of this post if you are more mathematically inclined.

## Two layer Feedforward ReLU Neural Networks

Literature abounds on what neural networks are, so I will merely give the definition of the class we are going to study here and some related terminology for the discussion.

## Defining the Networks and Terminology

Let W⊆R^{4d+1} be a compact parameter space. We will let [d]={1,…,d} denote the set of hidden nodes in the first layer of our network, and ⟨wi,x⟩ denote the standard dot product between two vectors. Also recall that

ReLU(x) = { x if x≥0; 0 if x<0 }.

We let f:R²×W→R denote a *two layer feedforward ReLU neural network* with two inputs x1,x2 and one output y, defined by a parameter w∈W. The function is given by

f(x,w) = c + ∑_{i=1}^{d} qi ReLU(⟨wi,x⟩ + bi),

where for each i∈[d]:

- wi = (wi,1, wi,2) ∈ R² is the first layer weight vector into node i,
- bi ∈ R is the first layer bias of node i,
- qi ∈ R is the second layer weight out of node i, and
- c ∈ R is the total output bias.

These functions are simply piecewise affine functions (i.e. piecewise hyperplanes), and as such they have (relatively) easy topology to study. Before we give an example, we will briefly mention some key terminology.

Let fw(x)=f(x,w) be defined by a fixed w∈W. We say a particular node i∈[d] is *degenerate* in fw if either of its weights is zero, so wi=0 or qi=0.^{[3]}

We say a non-degenerate node i is *activated* in some linear domain^{[4]} U⊆R² when the ReLU is non-zero for all x∈U, that is,

⟨wi,x⟩ + bi = wi,1 x1 + wi,2 x2 + bi > 0.

The *activation boundary* associated to node i is thus the line

Hi = {x∈R² | ⟨wi,x⟩ + bi = 0}.

One of the key accounting tools in the symmetry classification is identifying the foldsets of fw (in the terminology of [PL19]), which are the regions where fw is non-differentiable in x, and noticing that these equate to the union of non-degenerate activation boundaries Hi. Two functionally equivalent networks must then have the same foldsets, since they define the same function, allowing us to compare the lines defined by Hi.

## Example - Feedforward ReLU Neural Networks are Piecewise Hyperplanes

**Example 3.1:** Consider the following two layer feedforward ReLU neural network:

fw(x) = ReLU(x1−1) + ReLU(x2−1) + ReLU(−x1−1) + ReLU(−x2−1),

defined by biases bi=−1 and c=0, second layer weights qi=1, and first layer weights

w1=(1,0), w2=(0,1), w3=(−1,0), w4=(0,−1).

Its graphical structure and activation boundaries in the (x1,x2) plane can be seen below:

Conceptually, it's helpful to notice that when anchored on its corresponding activation boundary, each weight vector wi "points" into its region of activation.
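To make the "piecewise hyperplane" picture concrete, here is a minimal numpy sketch of Example 3.1 (the helper `two_layer_relu` is an illustrative name, not from the post):

```python
import numpy as np

def two_layer_relu(x, W, b, q, c):
    """f(x,w) = c + sum_i q_i ReLU(<w_i, x> + b_i) for a batch of inputs x of shape (n, 2)."""
    pre = x @ W.T + b                      # shape (n, d): pre-activations <w_i, x> + b_i
    return c + np.maximum(pre, 0.0) @ q

# Example 3.1: d = 4 nodes, biases b_i = -1, c = 0, outgoing weights q_i = 1.
W = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]])
b = np.full(4, -1.0)
q = np.ones(4)
c = 0.0

# The activation boundary of node i is the line <w_i, x> + b_i = 0,
# e.g. x1 = 1 for node 1 and x2 = -1 for node 4.
pts = np.array([[0.0, 0.0], [2.0, 0.0], [0.0, -3.0], [2.0, 2.0]])
print(two_layer_relu(pts, W, b, q, c))     # piecewise affine values: [0. 1. 2. 2.]
```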

## The Symmetries of Two Layer Feedforward ReLU Neural Networks

In this section I am going to provide some motivating examples of each kind of symmetry exhibited in two layer feedforward ReLU neural networks. To prove that this is the *full* set of symmetries in generality requires a bit more work, which we relegate to the appendix.

## Scaling Inner and Outer Weights of a Node

The scaling symmetry of ReLU networks offers us our first window into why these models are singular. The key property is to notice that for any α>0, the ReLU satisfies a scale invariance

(1/α) ReLU(αx) = ReLU(x).^{[5]}

Say we had the simplest model possible with just one node:

f(x,w) = q1 ReLU(⟨w1,x⟩ + b1) + c.

Then we could define an alternative parameter w′ with

q′1 = q1/α,   w′1 = αw1,   b′1 = αb1,   c′ = c,

which gives *functional equivalence* because

f(x,w′) = q′1 ReLU(⟨w′1,x⟩ + b′1) + c′
        = (q1/α) ReLU(⟨αw1,x⟩ + αb1) + c
        = (q1/α) ReLU(α(⟨w1,x⟩ + b1)) + c
        = q1 ReLU(⟨w1,x⟩ + b1) + c
        = f(x,w).

For a model with d hidden nodes, the same scaling symmetry applies to each individual node i∈[d] with a set of scaling factors αi>0.
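A quick numerical sanity check of this scaling symmetry, as a minimal sketch with random parameters (helper names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x, W, b, q, c):
    return c + np.maximum(x @ W.T + b, 0.0) @ q

# A random two layer ReLU network with d = 3 hidden nodes and 2 inputs.
d = 3
W, b, q, c = rng.normal(size=(d, 2)), rng.normal(size=d), rng.normal(size=d), rng.normal()

# Rescale each node by its own alpha_i > 0: (w_i, b_i, q_i) -> (a_i w_i, a_i b_i, q_i / a_i).
alpha = rng.uniform(0.5, 2.0, size=d)
W2, b2, q2 = alpha[:, None] * W, alpha * b, q / alpha

x = rng.normal(size=(1000, 2))
print(np.max(np.abs(f(x, W, b, q, c) - f(x, W2, b2, q2, c))))   # ~1e-16: the same function
```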

The fact that we can define such a w′ for *any* set of positive scalars means that the Fisher information matrix of these models is degenerate at all points w∈W. We prove this in generality in Appendix 1, but I'll spell it out explicitly for a simple example here.

## Example - Scaling Symmetry Induces a Degenerate Fisher Information Matrix

**Example 3.2:** It is worth taking a moment to recognise how this scaling symmetry affects the geometry of the loss landscape K(w). The mental model to have here is that it results in *valleys* in K(w), where the set of true parameters W0 is like a river on the valley floor. To see this, say we defined a model with parameter w=(w,q) and truth as

f(x,w) = q ReLU(wx),   f0(x) = θ0 ReLU(x),

where θ0>0 is some fixed constant. If q(x) is uniform on [−√3,√3], then it is easy to calculate that when w,q≥0 we have

K(w) = (wq−θ0)²,   so   W0 = {(w,q) | wq = θ0}.

We can depict this valley and its effect on the posterior for θ0=15:

Looking at this K(w), it's easy to intuit that the Fisher information matrix I(w) is degenerate for all w. But, for clarity, let me spell this out for the true parameters in the case where θ0=1, so K(w)=(wq−1)2.

Remember that at true parameters the Fisher information matrix is just the Hessian, which in this case has the form

J(w) = [[2q², 4wq−2], [4wq−2, 2w²]].

In particular, let w(0)∈W0 be a fixed true parameter parameterised by a fixed α>0, so w(0)=(α, 1/α). Then the Fisher information matrix has the form

I(w(0)) = [[2/α², 2], [2, 2α²]].

Setting I1(w(0)) and I2(w(0)) to be the rows of the matrix, there is clearly a linear dependence relation

−α² I1(w(0)) + I2(w(0)) = 0,

and since α is arbitrary, this shows that all true parameters have degenerate Fisher information matrices and are thus singular.
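For the sceptical, here is a small numerical check of this dependence relation (a sketch; `hessian_K` just encodes the Hessian written above):

```python
import numpy as np

def hessian_K(w, q):
    # Hessian of K(w, q) = (wq - 1)^2, which equals the Fisher information at true parameters.
    return np.array([[2 * q**2,       4 * w * q - 2],
                     [4 * w * q - 2,  2 * w**2     ]])

for alpha in [0.5, 1.0, 3.0]:
    J = hessian_K(alpha, 1 / alpha)     # a true parameter (w, q) = (alpha, 1/alpha)
    # The rows satisfy -alpha^2 * row1 + row2 = 0, so det J = 0 and I(w0) is degenerate.
    print(alpha, np.linalg.det(J), -alpha**2 * J[0] + J[1])
```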

## Permutation of Nodes

This one is easy to see. If we have a model with d=2 nodes,

f(x,w) = q1 ReLU(⟨w1,x⟩ + b1) + q2 ReLU(⟨w2,x⟩ + b2) + c,

and we define a new model f(x,w′) where w′ is a permutation of the nodes in f(x,w),

(w′1,b′1,q′1) = (w2,b2,q2),   (w′2,b′2,q′2) = (w1,b1,q1),   and   c′ = c,

then

f(x,w′) = q′1 ReLU(⟨w′1,x⟩ + b′1) + q′2 ReLU(⟨w′2,x⟩ + b′2) + c′
        = q2 ReLU(⟨w2,x⟩ + b2) + q1 ReLU(⟨w1,x⟩ + b1) + c
        = f(x,w).

This easily generalises to d hidden nodes by taking any permutation σ in the permutation group Sd and letting each node i′ of f(x,w′) satisfy i′=σ(i), so

f(x,w) = c + ∑_{i=1}^{d} qi ReLU(⟨wi,x⟩ + bi) = c + ∑_{i=1}^{d} q_{σ(i)} ReLU(⟨w_{σ(i)},x⟩ + b_{σ(i)}) = f(x,w′).
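The same kind of numerical check works for permutations (again just an illustrative sketch):

```python
import numpy as np

rng = np.random.default_rng(1)

def f(x, W, b, q, c):
    return c + np.maximum(x @ W.T + b, 0.0) @ q

d = 4
W, b, q, c = rng.normal(size=(d, 2)), rng.normal(size=d), rng.normal(size=d), rng.normal()

sigma = rng.permutation(d)                 # a random permutation of the hidden nodes
x = rng.normal(size=(1000, 2))
print(np.max(np.abs(f(x, W, b, q, c) - f(x, W[sigma], b[sigma], q[sigma], c))))  # 0 up to float rounding
```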

## Orientation Reversal

This one is a bit trickier to observe as the symmetry depends on a very specific condition of weight annihilation. Let's look at a simple example first.

## Motivating Example

**Example 3.3:** Consider a true distribution defined by a (one-input) feedforward ReLU network given by

f0(x) = ReLU(x−1) + ReLU(−x−1) + 2 = { −x+1 for x≤−1;  2 for −1≤x≤1;  x+1 for x≥1 },

where w(0)1=1, w(0)2=−1, and the activation boundaries are H(0)1={x=1} and H(0)2={x=−1}.

Surprisingly, though it may appear our linear regions and activation boundaries must uniquely define the function (up to the scaling and permutation symmetries), there is a particular symmetry that arises by *reversing the orientation* of the weights and first layer biases, and adjusting the total bias accordingly. When we say reverse the orientation, we mean negating their direction,

w1 = −w(0)1 = −1   and   w2 = −w(0)2 = 1,

and ditto for the biases. If we adjust the total bias c accordingly, then the following function

f(x,w) = ReLU(−x+1) + ReLU(x+1)

gives the same functional output!

There is a very specific reason we can do this: in the middle region −1≤x≤1, both nodes are active and cancel out to give a constant function,

f(x,w) = (−x+1) + (x+1) = 2,

because the total gradients of the underlying truth sum to zero, w(0)1 + w(0)2 = 0.
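A minimal numerical check of Example 3.3 (sketch only; the grid is arbitrary):

```python
import numpy as np

relu = lambda z: np.maximum(z, 0.0)
x = np.linspace(-3, 3, 601)

f0 = relu(x - 1) + relu(-x - 1) + 2      # truth: weights (1, -1), biases (-1, -1), c = 2
f  = relu(-x + 1) + relu(x + 1)          # orientation reversed: weights (-1, 1), biases (1, 1), c = 0

print(np.max(np.abs(f - f0)))            # ~1e-16: the two parameters compute the same function
# This only works because the total gradients of the truth annihilate: 1 + (-1) = 0,
# so in the middle region both flipped nodes are active and sum to the constant 2.
```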

## General Case

Suppose the true network f0(x) is defined by a fixed w(0)=(w(0)1,…,b(0)1,…,q(0)1,…,c) for m nodes.

*If* there is a set F⊆[m] of total gradients that sum to 0,

∑_{i∈F} q(0)i w(0)i = 0,

*then* the model can produce functional equivalence by reversing the orientation of those particular weights (associated to those activation boundaries) and biases, and adjusting the total bias. In other words, modulo permutation and scaling symmetry, there is a functionally equivalent network to f0(x) where the weights of every i∈F satisfy

wi = −w(0)i.

We call the condition ∑_{i∈F} q(0)i w(0)i = 0 *weight annihilation*.

In [Carroll21, §4.5] we define m-*symmetric* networks where the weights are progressive rotations by the angle 2π/m, thus their total sum is zero. In DSLT4, we will study whether the posterior prefers configurations of weight annihilation or not. (The answer is: not.)^{[6]}

## Node Degeneracy

This is possibly the most important symmetry of all: neural network models can have more nodes than they need to represent a particular function. In essence, this degeneracy is the reason that different regions of the loss landscape K(w) of neural networks have fundamentally different accuracy-complexity tradeoffs. In other words, if the model has d nodes in the hidden layer available to it, then all possible subnetwork configurations with *fewer* than d nodes are *also* contained within the loss landscape. Thus, increasing the width of the network can only serve to increase the accuracy of these models, without sacrificing their ability to generalise, since the posterior will just prefer the number of hidden nodes with the best accuracy-complexity tradeoff.

## Motivating Example

**Example 3.4:** Suppose we had a (one-input) true network given by

f0(x) = ReLU(x),

and our model had d=2 nodes (with fixed biases b1=b2=c=0 and outgoing weights q1=q2=1),

f(x,w) = ReLU(w1 x) + ReLU(w2 x).

Since f0(x)=0 for x≤0, both weights must be positive, w1,w2≥0, to have any hope of being functionally equivalent. If f(x,w)=f0(x), we are in one of two configurations:

1. **One node is degenerate:** Either (w1,w2)=(1,0) or (w1,w2)=(0,1), meaning

   f(x,w) = ReLU(1x) + ReLU(0x) = ReLU(x) = f0(x).

2. **Both nodes are non-degenerate, but the total gradient is the same as the truth:** So long as the weights satisfy

   w1 + w2 = 1

   for w1,w2>0, we will have functional equivalence since, setting w2=1−w1,

   f(x,w) = ReLU(w1 x) + ReLU((1−w1)x) = { w1 x + (1−w1)x for x≥0; 0 for x≤0 } = ReLU(x) = f0(x).
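Here is a quick numerical check of both configurations (an illustrative sketch):

```python
import numpy as np

relu = lambda z: np.maximum(z, 0.0)
x = np.linspace(-3, 3, 601)
f0 = relu(x)                                          # truth: one ReLU node

for (w1, w2) in [(1.0, 0.0), (0.0, 1.0), (0.3, 0.7), (0.9, 0.1)]:
    f = relu(w1 * x) + relu(w2 * x)                   # model with d = 2 nodes
    print((w1, w2), np.max(np.abs(f - f0)))           # ~0 whenever w1 + w2 = 1 and w1, w2 >= 0
```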

## Node-degeneracies Correspond to Different Phases

We could of course encapsulate both of these configurations into the one statement that w1+w2=1 for w1,w2≥0, but there is a key reason we have delineated them: they represent two different *phases* and have different geometry on K(w). Intuitively, the degenerate phase is a simpler model with less complexity, so we expect it to have a lower RLCT^{[7]} and for the posterior to prefer it. In DSLT4 we will discuss phases in statistical learning more broadly, and display experimental evidence for this latter claim.

To foreshadow this, we can actually calculate K(w) for Example 3.4. Setting the prior q(x) to be uniform on [−√6,√6], we find

K(w1,w2) =
- (w1+w2−1)² for w1,w2≥0,
- w1² + (w2−1)² for w1≤0, w2≥0,
- (w1−1)² + w2² for w1≥0, w2≤0,
- (w1+w2)² + 1 for w1,w2≤0.

Notice how there are ever so slightly wider bowls at either end of the line w1+w2=1, thus suggesting the posterior has more density at the degenerate phase (w1,w2)=(0,1) (or vice versa). Intuitively, imagine a tiny ball with random kinetic energy rolling around the bottom of the surface - it will spend more time at the ends since there is more catchment area. (Don't take the physics analogy too seriously, though.)
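If you want to verify the piecewise formula yourself, here is a sketch that computes K numerically, assuming the convention K(w) = ∫(f(x,w)−f0(x))² q(x) dx (which reproduces the stated expression; the exact normalisation constant depends on the conventions set up in earlier posts):

```python
import numpy as np

relu = lambda z: np.maximum(z, 0.0)

a = np.sqrt(6.0)
x = np.linspace(-a, a, 100001)          # uniform grid over the support of q(x)

def K(w1, w2):
    # Average of the squared error against the uniform density on [-sqrt(6), sqrt(6)].
    diff = relu(w1 * x) + relu(w2 * x) - relu(x)
    return np.mean(diff ** 2)

def K_closed_form(w1, w2):
    if w1 >= 0 and w2 >= 0:  return (w1 + w2 - 1) ** 2
    if w1 <= 0 and w2 >= 0:  return w1 ** 2 + (w2 - 1) ** 2
    if w1 >= 0 and w2 <= 0:  return (w1 - 1) ** 2 + w2 ** 2
    return (w1 + w2) ** 2 + 1

for w in [(0.0, 1.0), (0.5, 0.5), (-0.4, 1.2), (0.7, -0.3), (-1.0, -0.5)]:
    print(w, K(*w), K_closed_form(*w))   # the two agree up to grid discretisation error
```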

We see once again that the *singularity structure* of the minima has a big impact on the geometry of K(w), and therefore the posterior.

## General Case

Suppose we have a truth f0(x) and model f(x,w) that are both defined by two layer feedforward ReLU neural networks, where the model has d nodes and the truth has m nodes (assumed to all be non-degenerate and with distinct activation boundaries) such that m<d. Then the model is overparameterised compared to the truth it is trying to model.

Performing the appropriate analysis (which we do in Appendix 2), one finds that each excess node i∈{m+1,…,d} is either:

- *degenerate*, so wi=0 or qi=0, or
- has the *same activation boundary* as one already in [m], such that the total gradients sum to the correct gradient in each region. (In our above example, this is saying that necessarily w1+w2=1, since these nodes share the same activation boundary.)

In DSLT4 we will test which of the phases in the above figure is preferred by the posterior for this simple two layer feedforward ReLU neural network setup.

## Node degeneracy is the same as lossless network compressibility

The fact that neural networks can contain these node-degeneracies is well known and often goes under the guise of *lossless network compressibility*. There are many notions of compressibility, but the one that makes the most sense in our setup is to say that if the model has d>m hidden nodes compared to the truth, then it can be compressed to a network with only m hidden nodes and still produce the same input-output map.

For an excellent introduction to lossless network compressibility, see Farrugia-Roberts' recent paper Computational Complexity of Detecting Proximity to Losslessly Compressible Neural Network Parameters, where he studies the problem for tanh networks.

## There are More True Parameters if the Input Domain is Bounded

Let me make an important remark here. In both of the above cases, we have considered the symmetries of W0 when the input domain of the model and the truth is all of R2. As we explain in Appendix 2, this allows us to compare the gradients and biases of hyperplanes, similar to comparing polynomial coefficients, to make our conclusions. However, if the domain of the input prior q(x) is restricted to some open bounded domain Z⊆R2, there could in principle be more degeneracies and symmetries of W0, since the functional equivalence only needs to be on Z.

For example, consider a true network defined by f0(x)=0 and a single-node single-input model f(x,w)=qReLU(⟨w,x⟩+b) defined on Z=(−a,a), so q(x) = (1/2a)·1(−a<x<a). If the activation boundary falls outside of Z and the vector w points away from Z, then any value of q,w,b satisfying these constraints would give f(x,w)=0, thus there is an entirely new class of symmetry in W0.

Whilst it is important to keep this in mind, we won't discuss this any further as it opens up an entirely different can of worms.

## Even if Singularities Occur with Probability Zero, they Affect Global Behaviour in Learning

I want to make a quick comment on the work of Phuong and Lampert in [PL19]. In this paper, they prove equivalent results to these for arbitrary depth feedforward ReLU neural networks (with non-increasing widths), but with a key distinction: they consider *general* models. They then show that almost all feedforward ReLU networks with this architecture are general, and that general networks only satisfy scaling and permutation symmetries, thus excluding our orientation-reversing and degenerate-node singularities since they occur on a set of measure zero. Importantly, this implies that almost all parameters w∈W have no degenerate nodes, or equivalently, no opportunity for lossless compression.

However, even though scaling and permutation symmetries may be the only generic symmetries (in the sense of measure theory) that occur with non-zero probability, SLT tells us that the singularities of K(w) have *global effects* on the loss landscape, as we discussed at length in DSLT2. If a parameter is near a non-generic singularity (i.e. one that occurs with probability zero), it computes a function that is almost identical to the one computed at that non-generic singularity. If we shift our language to that of compressibility of a network, SLT tells us that:

*Just because a particular point w∈W sampled from a posterior (or, notionally, obtained via running SGD) is not directly compressible itself, that doesn't mean that it isn't extremely close to one that is.*

In this sense, SLT tells us that to understand the geometry of the loss landscape, we need to consider singularities even though they are not generic points. As Watanabe says, *singularities contain knowledge*.

## Appendix 1 - Formal Proof that Neural Networks are Singular

If, like me, you are mathematically inclined, you probably want to see a proof that these neural networks are, indeed, singular models, to tie together the various concepts and intuitions that we have built in this sequence so far. So let's turn into math mode briefly.

Recall that the Fisher information matrix I(w) is degenerate if and only if the set

{ ∂f(x,w)/∂wj }_{j=1}^{D}

is linearly dependent. Here, ∂/∂wj refers to the partial derivative with respect to the jth component of the total parameter w∈W, not to be confused with the specific weight vector wj in the neural network definition. Thus, to prove that feedforward ReLU networks are singular, our task is to find this linear dependence relation. The scaling symmetry alone is enough for this.

**Theorem:** Given a two layer feedforward neural network f:R²×W→R with d hidden nodes, for any domain on which f is differentiable, f satisfies the differential equation for a fixed node i∈[d]:

{ wi,1 ∂/∂wi,1 + wi,2 ∂/∂wi,2 + bi ∂/∂bi − qi ∂/∂qi } f = 0.

**Proof:** Since (d/dx) ReLU(x) = 1(x>0), and letting ai = ⟨wi,x⟩ + bi, the set of derivatives with respect to our parameters is

∂f/∂wi,k = qi xk 1(ai>0),   ∂f/∂bi = qi 1(ai>0),   ∂f/∂qi = ReLU(ai),

and so, since we can write ReLU(ai) = ai 1(ai>0), we have

{ wi,1 ∂/∂wi,1 + wi,2 ∂/∂wi,2 + bi ∂/∂bi − qi ∂/∂qi } f
= qi wi,1 x1 1(ai>0) + qi wi,2 x2 1(ai>0) + qi bi 1(ai>0) − qi ReLU(ai)
= qi ReLU(⟨wi,x⟩ + bi) − qi ReLU(⟨wi,x⟩ + bi)
= 0. □

**Corollary:** Feedforward ReLU neural networks are *singular* models.

**Proof:** For the two layer case, for any fixed w0∈W, there is a linear dependence relation given by the above differential equation evaluated at w0, thus the Fisher information is degenerate at w0, so the model is singular. The equivalent proof for arbitrary depths and widths is given in Lemma A.1 of [Wei22], following from other work on functional equivalence in [PL19]. □
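As a numerical companion to the proof, the following sketch evaluates the analytic partial derivatives above at random parameters and checks that the combination vanishes (illustrative only; sample inputs land off the activation boundaries with probability one):

```python
import numpy as np

rng = np.random.default_rng(2)
d = 3
W, b, q, c = rng.normal(size=(d, 2)), rng.normal(size=d), rng.normal(size=d), rng.normal()

x = rng.normal(size=(5, 2))                      # a few sample inputs
a = x @ W.T + b                                  # pre-activations a_i = <w_i, x> + b_i
ind = (a > 0).astype(float)                      # indicator 1(a_i > 0)

for i in range(d):
    # Analytic partials from the proof:
    # df/dw_{i,k} = q_i x_k 1(a_i>0), df/db_i = q_i 1(a_i>0), df/dq_i = ReLU(a_i).
    df_dw = q[i] * x * ind[:, [i]]               # shape (5, 2)
    df_db = q[i] * ind[:, i]
    df_dq = np.maximum(a[:, i], 0.0)
    # The combination w_{i,1} d/dw_{i,1} + w_{i,2} d/dw_{i,2} + b_i d/db_i - q_i d/dq_i applied to f:
    combo = df_dw @ W[i] + b[i] * df_db - q[i] * df_dq
    print(i, np.max(np.abs(combo)))              # ~0 at every sample point
```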

The degenerate node symmetries also give rise to a degenerate Fisher information matrix, though I haven't formally written out this alternate proof yet. If you are interested, do it as an exercise and leave it as a comment!

## Appendix 2 - Proof Sketch for Fully Classifying W0 for Two Layer Feedforward ReLU Networks

This section is going to be slightly more technical, and in the grand scheme of the SLT story I am telling in this sequence, it may be seen as an unnecessary side-plot. But other readers, particularly those with a pure mathematical bent, may find it interesting to consider the process of fully classifying W0 and how one might understand all phases present, so I am providing a sketch of these proofs for completeness. Understanding the full form of W0 was a vital part of performing the phase transition experiments that we will see in DSLT4. These models are simple enough that we can perfectly classify *all* true parameters in W0. Thus, we can precisely understand all of its phases.

We are going to classify the symmetries of W0 when both the model f(x,w) and truth f0(x) are two-layer feedforward ReLU neural networks, with d and m hidden nodes respectively, giving

W0 = {w∈W | f(x,w) = f0(x)},

meaning the task is to classify functional equivalence of the two networks. To avoid some annoying fringe cases, we assume that the true network is:

- *minimal*, which means there is no network with fewer nodes that could also represent it (which also means every node is non-degenerate), and
- *activation-distinguished*, meaning every node of the truth corresponds to a unique activation boundary.

We will see that the set of symmetries explained above comprises *all* of the symmetries in W0 - there can be no more^{[8]}. This result rests mainly on the fact that the activation boundaries are the core piece of data that defines a neural network. The rest is then just performing accounting of the gradients and biases in each region.

This is a sketch of the proofs in Chapter 4 of my thesis, and all lemmas and theorems that are referenced in the following section come from there.

## Case 1: The model has the same number of nodes as the truth, m=d

Let f(x,w) be a two layer feedforward ReLU neural network model with d hidden nodes, and let f0(x)=f(x,w(0)) be the realisable true network with m hidden nodes defined by a fixed parameter w(0), denoted by

f0(x) = c(0) + ∑_{j=1}^{m} q(0)j ReLU(⟨w(0)j,x⟩ + b(0)j),

which we assume is minimal and activation-distinguished as explained above.

We start by comparing the foldsets, which are the activation boundaries [Lemma 4.1], between the truth and the model. Let Hi be the activation boundary of node i∈[d] in the model, and H(0)j be the activation boundary of node j∈[m] in the truth. Then by comparing the sets of linear lines in [Lemma 4.2], we can show that for every node i∈[d] of the model there exists a permutation σ∈Sm such that

Hi = H(0)σ(i).

By [Lemma 4.3], two activation boundaries H,H′ are equal if and only if there is some non-zero scalar α∈R∖{0} such that w=αw′ and b=αb′.

Using our relation Hi=H(0)σ(i), in [Lemma 4.4] we analyse how the gradients and biases change across each activation boundary, and what this means for the relation between weights and biases in the model versus the truth. We show that there exists a unique σ∈Sm, and for each i∈[d] an ϵi∈Z2 and αi∈R>0, such that

wi = (−1)^{ϵi} αi w(0)σ(i),   and   bi = (−1)^{ϵi} αi b(0)σ(i),   where   αi = q(0)σ(i)/qi,

meaning qi and q(0)σ(i) necessarily have the same sign.

However, there is a restriction on which weights can have reversed orientation, ϵi=1 (thus wi = −αi w(0)σ(i)). Letting E = {i∈[d] | ϵi=1}, we show in [Lemma 4.5] that the weights and biases of the true network must satisfy

∑_{i∈E} q(0)σ(i) w(0)σ(i) = 0   and   c(0) + ∑_{i∈E} q(0)σ(i) b(0)σ(i) = c.^{[9]}

The crux of this proof rests in comparing the gradients in regions either side of the activation boundary Hi.

In [Theorem 4.7] we show that these scaling, permutation and orientation reversing symmetries are the *only* such symmetries, by piecing together all of these aforementioned lemmas, with emphasis on the importance of the activation boundaries in defining the topology of f0(x).^{[10]}

## Case 2: The model has more nodes than the truth, m<d

We now suppose that the model is over-parameterised compared to the true network, so m<d.

The key piece of data is once again the foldsets defining the model and the truth. Since they must be equal, the model can only have m unique foldsets, and thus activation boundaries. Without loss of generality (i.e. up to permutation symmetry), the first [m]⊂[d] nodes in the model have the same activation boundaries as the truth, ⋃mi=1Hi=⋃mj=1H(0)j. Thus, these [m] nodes in the model must satisfy the same symmetries as in the m=d case.

By comparing the foldsets on each excess node in {m+1,…,d}, we must have

⋃_{i=m+1}^{d} {Hi | i is non-degenerate} ⊆ ⋃_{j=1}^{m} H(0)j.

In comparing linear lines again, this means there are two possible situations for each excess node: it is either degenerate, or it shares its activation boundary with one of the true boundaries H(0)j.

Let d′≥m be the number of non-degenerate nodes of the model. We can thus define a surjective finite set map

π : {1,…,m,m+1,…,d′} → {1,…,m}

relating the non-degenerate nodes in the model to those in the truth, which is a bijection (i.e. a permutation σ∈Sm) on the first [m]⊂[d′].

We can then compare the gradients and biases in each region to show that the total gradients calculated by each non-degenerate node at each unique activation boundary must sum to the gradient in the truth. Precisely, for each node j∈[m] of the truth, let Mj = {i∈[d′] | π(i)=j} be the set of nodes in the model that share the same activation boundary. Then for each i∈[d′] there exists an ϵi∈Z2 and αi∈R>0 such that

wi = (−1)^{ϵi} αi w(0)π(i),   bi = (−1)^{ϵi} αi b(0)π(i),

with the constraint that

∑_{i∈Mj} qi αi = q(0)j.

A similar orientation reversing symmetry also applies as in case 1, just by accounting for the nodes that share the same activation boundaries.

## Resources

[Carroll21] - L. Carroll, Phase Transitions in Neural Networks, 2021

[Wei22] - S. Wei, D. Murfet, et al., Deep Learning is Singular, and That's Good, 2022

[PL19] - M. Phuong, C. Lampert, Functional vs Parametric Equivalence of ReLU networks, 2019

^{^}Since K(w)=0 if and only if q(y|x)=p(y|x,w) for some w∈W.

^{^}e.g. ReLU(x)+ReLU(x)=ReLU(2x). ↩︎

^{^}For ease of classification, we exclude the case where qi≠0 and bi≠0 since we can just absorb the total bias contribution into c. ↩︎

^{^}A *linear domain* U⊆R² is just a connected open set where fw is a plane with constant gradient and bias when restricted to U, and U is the maximal such set for which that plane is defined. In other words, the set of linear domains is the set of different regions the piecewise affine function is carved up into. ↩︎

^{^}But don't forget, ReLU(−x)≠−ReLU(x), as the domain of activation is completely different. ↩︎

^{^}Though I have not been able to formally prove it, I believe that this symmetry on its own (i.e. modulo scaling symmetry) does not result in a degeneracy of the Fisher information matrix, at least in our simple case. This, I think, is because the weights must cancel out in the region where both nodes are active, *and* the gradients in the other regions must be retained. Feel free to prove me wrong, though! ↩︎

^{^}This statement is a bit disingenuous. Watanabe's free energy formula only applies to the case where K(w) is analytic, but ReLU neural networks are certainly not analytic, as we can see in the below example. With that said, Watanabe has recently proved a bound on the free energy for ReLU neural networks, showing that the complexity term is essentially related to the number of non-degenerate nodes in the truth, even if it isn't a true RLCT. We will look at this in more depth in DSLT4. ↩︎

^{^}Aside from the technical caveat discussed about the restricted input prior q(x) above.

^{^}Our convention is to take the empty sum to be 0, so all weight orientations being preserved, E=∅, is perfectly fine. ↩︎

^{^}The activation distinguished condition on the truth allows us to uniquely identify the permutation σ∈Sm relating activation boundaries, and ensures only one node changes across each boundary. ↩︎