DSLT 3. Neural Networks are Singular

Liam Carroll

Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

TLDR; This is the third main post of Distilling Singular Learning Theory which is introduced in DSLT0. I explain that neural networks are singular models because of the symmetries in parameter space that produce the same function, and introduce a toy two layer ReLU neural network setup where these symmetries can be perfectly classified. I provide motivating examples of each kind of symmetry, with particular emphasis on the non-generic node-degeneracy and orientation-reversing symmetries that give rise to interesting phases to be studied in DSLT4.

As we discussed in DSLT2, singular models have the capacity to generalise well because the effective dimension of a singular model, as measured by the RLCT, can be less than half the dimension of parameter space. With this in mind, it should be no surprise that neural networks are indeed singular models, but up until this point we have not exactly explained what feature they possess that makes them singular. In this post, we will explain that in essence:

Neural networks are singular because there are often ways to vary their parameters without changing the function they compute.

In the case where the model and truth are both defined by similar neural network architectures, this fact means that the set of true parameters is non-trivial (i.e. bigger than the regular case where it is a single point), and often possesses many symmetries. This directly implies that neural networks are singular models.

The primary purpose of this post is to show with examples why neural networks are singular, and classify the set of true parameters $W_{0}$ in the case where the model and truth are simple two layer feedforward ReLU networks. In doing so, we will lay the groundwork for understanding the phases present in the setup so that we can then study relevant phase transitions in DSLT4. Feel free to jump ahead to the slightly more exciting DSLT4 Phase Transitions in Neural Networks and refer back to this post as needed.

Outline of Classification

To understand the different regions that minimise the free energy (and thus, as we'll see in DSLT4, the phases), one needs to first understand the singularities in the set of optimal parameters of $K (w)$ .

In the realisable regression case with a model neural network $f (x, w)$ and true neural network defined by $f_{0} (x) = f (x, w^{(0)})$ for some $w^{(0)} \in W$ , the set of true parameters has the form ^[1]

W_{0} = \{ w \in W \mid f(x, w) = f_{0}(x) \} .

Thus, classifying the true parameters is a matter of establishing which parameters $w \in W$ yield functional equivalence between the model and the truth $f (x, w) = f_{0} (x)$ . The property of being singular is specific to a model class $f (x, w)$ , regardless of the underlying truth. But, classifying $W_{0}$ in the realisable case is a convenient way of studying what functionally equivalent symmetries exist for a particular model class.

Neural networks have been shown to satisfy a number of different symmetries of functional equivalence across a range of activation functions and architectures, which we will elaborate on throughout the post. Unsurprisingly, the nonlinearity of the activation function plays a central role in governing these symmetries. In general, then, deep neural networks are highly singular.

In this post we are going to explore a full characterisation of the symmetries of $W_{0}$ when the model is a two layer feedforward ReLU neural network with $d$ hidden nodes, and the truth is the same architecture but with $m \leq d$ nodes. Though you would never use such a basic model in real deep learning, the simplicity of this class of network allows us to study $W_{0}$ with full precision. We will see that:

If the model and truth have the same number of nodes, $m = d$ : There are three forms of symmetry of $W_{0}$ :
- Scaling symmetry of the incoming and outgoing weights to any node.
- Permutation symmetry of the hidden nodes in a layer.
- Orientation reversing symmetry of the weights, only when some subset of weights sum to zero (i.e. "annihilate" one another).
If the model has more nodes than the truth, $m < d$ : Without loss of generality, the first $m$ nodes of the model must have the same symmetries as in the first case. Then each excess node $i \in \{m + 1, \dots, d\}$ is either
- Degenerate, meaning its total weight (gradient) is 0 (thus the node is always constant).
- Or it has the same activation boundary as another already in the model such that the weights sum to the total gradient in a region ^[2].

In [Carroll, Chapter 4], I give rigorous proofs that in both cases, $W_{0}$ is classified by these symmetries, and these symmetries alone. The purpose of this post is not to repeat these proofs, but to provide the intuition for each of these symmetries. I have included a sketch of the full proof in the appendix of this post if you are more mathematically inclined.

Two layer Feedforward ReLU Neural Networks

Literature abounds on what neural networks are, so I will merely give the definition of the class we are going to study here and some related terminology for the discussion.

Defining the Networks and Terminology

Let $W \subseteq \mathbb{R}^{4 d + 1}$ be a compact parameter space. We will let $[d] = \{1, \dots, d\}$ denote the set of hidden nodes in the first layer of our network, and $⟨ w_{i}, x ⟩$ denote the standard dot product between two vectors. Also recall that

\mathrm{ReLU}(x) = \begin{cases} x & \text{if } x \geq 0 \\ 0 & \text{if } x < 0 \end{cases} .

We let $f : \mathbb{R}^{2} \times W \to \mathbb{R}$ denote a two layer feedforward ReLU neural network with two inputs $x_{1}, x_{2}$ and one output $y$ , defined by a parameter $w \in W$ . The function is given by

f(x, w) = c + \sum_{i = 1}^{d} q_{i} \mathrm{ReLU}(⟨ w_{i}, x ⟩ + b_{i})

where for each $i \in [d]$ :

- the first layer weights are $w_{i} \in \mathbb{R}^{2}$ and the biases are $b_{i} \in \mathbb{R}$
- the second layer weights are $q_{i} \in \mathbb{R}$ and the bias is $c \in \mathbb{R}$ .
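As a minimal sketch of this definition (not code from my thesis), the network can be written in a few lines of plain Python. The names `relu` and `two_layer_relu`, and the choice to pass the parameter as separate lists `ws`, `bs`, `qs` plus a scalar `c`, are illustrative assumptions of mine, as are the particular numbers used.

```python
def relu(z):
    """ReLU(z) = z if z >= 0, else 0."""
    return z if z >= 0 else 0.0


def two_layer_relu(x, ws, bs, qs, c):
    """Evaluate f(x, w) = c + sum_i q_i * ReLU(<w_i, x> + b_i).

    x  : input point (x1, x2)
    ws : list of d first-layer weight vectors (w_i1, w_i2)
    bs : list of d first-layer biases b_i
    qs : list of d second-layer weights q_i
    c  : second-layer bias
    """
    total = c
    for (w1, w2), b, q in zip(ws, bs, qs):
        total += q * relu(w1 * x[0] + w2 * x[1] + b)
    return total


# A hypothetical d = 2 network: 4 * 2 + 1 = 9 real parameters in total.
ws = [(1.0, 0.0), (0.0, -2.0)]
bs = [-1.0, 0.5]
qs = [3.0, -1.0]
c = 0.25

# At x = (2, 1) only the first node is active.
print(two_layer_relu((2.0, 1.0), ws, bs, qs, c))
```

Storing the parameter as per-node lists rather than one flat vector in $R^{4 d + 1}$ is purely for readability here.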

These functions are simply piecewise affine functions (i.e. piecewise hyperplanes), and as such they have (relatively) easy topology to study. Before we give an example, we will briefly mention some key terminology.

Let $f_{w} (x) = f (x, w)$ be defined by a fixed $w \in W$ . We say a particular node $i \in [d]$ is degenerate in $f_{w}$ if either of the weights are zero, so $w_{i} = 0$ or $q_{i} = 0$ . ^[3]

We say a non-degenerate node $i$ is activated in some linear domain ^[4] $U \subseteq \mathbb{R}^{2}$ when the ReLU is non-zero for all $x \in U$ , that is,

⟨ w_{i}, x ⟩ + b_{i} = w_{i, 1} x_{1} + w_{i, 2} x_{2} + b_{i} > 0 .

The activation boundary associated to node $i$ is thus the line

H_{i} = \{ x \in \mathbb{R}^{2} \mid ⟨ w_{i}, x ⟩ + b_{i} = 0 \} .

One of the key accounting tools in the symmetry classification is identifying the foldsets of $f_{w}$ (in the terminology of [PL19]), which are the regions where $f_{w}$ is non-differentiable in $x$ , and noticing that these equate to the union of non-degenerate activation boundaries $H_{i}$ . Two functionally equivalent networks must then have the same foldsets since they define the same function, allowing us to compare the lines defined by $H_{i}$ .

Example - Feedforward ReLU Neural Networks are Piecewise Hyperplanes

Example 3.1: Consider the following two layer feedforward ReLU neural network:

f_{w}(x) = \mathrm{ReLU}(x_{1} - 1) + \mathrm{ReLU}(x_{2} - 1) + \mathrm{ReLU}(- x_{1} - 1) + \mathrm{ReLU}(- x_{2} - 1)

defined by biases $b_{i} = - 1$ and $c = 0$ , second layer weights $q_{i} = 1$ , and first layer weights

w_{1} = \begin{pmatrix} 1 \\ 0 \end{pmatrix}, \quad w_{2} = \begin{pmatrix} 0 \\ 1 \end{pmatrix}, \quad w_{3} = \begin{pmatrix} - 1 \\ 0 \end{pmatrix}, \quad w_{4} = \begin{pmatrix} 0 \\ - 1 \end{pmatrix} .

Its graphical structure and activation boundaries in the $(x_{1}, x_{2})$ plane can be seen below:

Conceptually, it's helpful to notice that when anchored on its corresponding activation boundary, each weight vector $w_{i}$ "points" into its region of activation.

The Symmetries of Two Layer Feedforward ReLU Neural Networks

In this section I am going to provide some motivating examples of each kind of symmetry exhibited in two layer feedforward ReLU neural networks. To prove that this is the full set of symmetries in generality requires a bit more work, which we relegate to the appendix.

Scaling Inner and Outer Weights of a Node

The scaling symmetry of ReLU networks offers us our first window into why these models are singular. The key property is to notice that for any $α > 0$ , the ReLU satisfies a scale invariance ^[5]

\frac{1}{α} \mathrm{ReLU}(α x) = \mathrm{ReLU}(x) .

Say we had the simplest model possible with just one node:

f(x, w) = q_{1} \mathrm{ReLU}(⟨ w_{1}, x ⟩ + b_{1}) + c .

Then we could define an alternative parameter $w^{'}$ with

q_{1}^{'} = \frac{q_{1}}{α}, w_{1}^{'} = α w_{1}, b_{1}^{'} = α b_{1}, c^{'} = c,

which gives functional equivalence because,

\begin{aligned}
f(x, w^{'}) &= q_{1}^{'} \mathrm{ReLU}(⟨ w_{1}^{'}, x ⟩ + b_{1}^{'}) + c^{'} \\
&= \frac{q_{1}}{α} \mathrm{ReLU}(⟨ α w_{1}, x ⟩ + α b_{1}) + c \\
&= \frac{q_{1}}{α} \mathrm{ReLU}(α (⟨ w_{1}, x ⟩ + b_{1})) + c \\
&= q_{1} \mathrm{ReLU}(⟨ w_{1}, x ⟩ + b_{1}) + c \\
&= f(x, w) .
\end{aligned}

For a model with $d$ hidden nodes, the same scaling symmetry applies to each individual node $i \in [d]$ with a set of scaling factors $α_{i} > 0$ .
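Here is a small numerical sanity check of the scaling symmetry, as a sketch rather than anything rigorous. The helper names `f` and `rescale`, the particular weights, and the scaling factors are all made up for illustration. The code rescales each node by its own $α_{i} > 0$ and compares the two networks on random inputs.

```python
import random


def relu(z):
    return max(z, 0.0)


def f(x, ws, bs, qs, c):
    """Two layer ReLU network: c + sum_i q_i * ReLU(<w_i, x> + b_i)."""
    return c + sum(
        q * relu(w[0] * x[0] + w[1] * x[1] + b)
        for w, b, q in zip(ws, bs, qs)
    )


def rescale(ws, bs, qs, alphas):
    """Apply w_i -> a_i w_i, b_i -> a_i b_i, q_i -> q_i / a_i, each a_i > 0."""
    new_ws = [(a * w[0], a * w[1]) for w, a in zip(ws, alphas)]
    new_bs = [a * b for b, a in zip(bs, alphas)]
    new_qs = [q / a for q, a in zip(qs, alphas)]
    return new_ws, new_bs, new_qs


rng = random.Random(0)
ws = [(1.0, -2.0), (0.5, 0.3), (-1.5, 0.7)]  # hypothetical d = 3 network
bs = [0.2, -1.0, 0.4]
qs = [2.0, -1.0, 0.5]
c = 0.1
alphas = [0.5, 3.0, 10.0]  # arbitrary positive scaling factors

ws2, bs2, qs2 = rescale(ws, bs, qs, alphas)

# Largest discrepancy between the two networks over random inputs.
points = [(rng.uniform(-5, 5), rng.uniform(-5, 5)) for _ in range(1000)]
max_gap = max(
    abs(f(x, ws, bs, qs, c) - f(x, ws2, bs2, qs2, c)) for x in points
)
print(max_gap)  # zero up to floating point rounding
```

The parameters `(ws2, bs2, qs2)` are a different point of $W$ from `(ws, bs, qs)`, yet the two functions agree on every sampled input.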

The fact that we can define such a $w^{'}$ for any set of positive scalars means that the Fisher information matrix of these models is degenerate at all points $w \in W$ . We prove this in generality in Appendix 1, but I'll spell it out explicitly for a simple example here.

Example - Scaling Symmetry Induces a Degenerate Fisher Information Matrix

Example 3.2: It is worth taking a moment to recognise how this scaling symmetry affects the geometry of the loss landscape $K (w)$ . The mental model to have here is that it results in valleys in $K (w)$ , where the set of true parameters $W_{0}$ is like a river on the valley floor. To see this, say we defined a model with parameter $w = (w, q)$ and truth as:

f(x, w) = q \, \mathrm{ReLU}(w x), \quad f_{0}(x) = θ_{0} \mathrm{ReLU}(x),

where $θ_{0} > 0$ is some fixed constant. If $q (x)$ is uniform on $[- \sqrt{3}, \sqrt{3}]$ then it is easy to calculate that when $w, q \geq 0$ we have

K(w) = (w q - θ_{0})^{2}, \quad \text{so} \quad W_{0} = \{ (w, q) \mid w q = θ_{0} \} .

We can depict this valley and its effect on the posterior for $θ_{0} = \frac{1}{5}$ :

K(w) is a valley. Setting $θ_{0} = \frac{1}{5}$ , we see that $K (w)$ is a valley due to the scaling symmetry (left), thus there is no unique maximum a posteriori estimate (right). Remember that, up to a scaling factor, $e^{- n K_{n} (w)}$ is the posterior when the prior $φ (w)$ is uniform, and $e^{- n K_{n} (w)} \approx e^{- n K (w)}$ for large $n$ since $\mathbb{E} [K_{n} (w)] = K (w)$ .

Looking at this $K (w)$ , it's easy to intuit that the Fisher information matrix $I (w)$ is degenerate for all $w$ . But, for clarity, let me spell this out for the true parameters in the case where $θ_{0} = 1$ , so $K (w) = (w q - 1)^{2}$ .

Remember that at true parameters the Fisher information matrix is just the Hessian, which in this case has the form

J(w) = \begin{pmatrix} 2 q^{2} & 4 w q - 2 \\ 4 w q - 2 & 2 w^{2} \end{pmatrix} .

In particular, let $w^{(0)} \in W_{0}$ be a fixed true parameter parameterised by a fixed $α > 0$ , so $w^{(0)} = (α, \frac{1}{α})$ . Then the Fisher information matrix has the form

I(w^{(0)}) = \begin{pmatrix} \frac{2}{α^{2}} & 2 \\ 2 & 2 α^{2} \end{pmatrix} .

Setting $I_{1} (w^{(0)})$ and $I_{2} (w^{(0)})$ to be the rows of the matrix, there is clearly a linear dependence relation

- α^{2} I_{1} (w^{(0)}) + I_{2} (w^{(0)}) = 0

and since $α$ is arbitrary, this shows that all true parameters have degenerate Fisher information matrices and are thus singular.
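To see the degeneracy concretely, the following sketch evaluates the Hessian of $K (w) = (w q - 1)^{2}$ at several true parameters $(α, \frac{1}{α})$ and checks both the row relation and the vanishing determinant. The function names `hessian_K` and `det2` and the sampled values of $α$ are my own choices for illustration.

```python
def hessian_K(w, q):
    """Hessian of K(w, q) = (w q - 1)^2, as a 2x2 nested list."""
    return [
        [2 * q**2, 4 * w * q - 2],
        [4 * w * q - 2, 2 * w**2],
    ]


def det2(m):
    """Determinant of a 2x2 matrix."""
    return m[0][0] * m[1][1] - m[0][1] * m[1][0]


alphas = [0.25, 0.5, 1.0, 2.0, 4.0]  # powers of two keep the arithmetic exact
dets = []
relations = []
for alpha in alphas:
    fim = hessian_K(alpha, 1.0 / alpha)  # true parameter (alpha, 1 / alpha)
    # Row relation from the text: -alpha^2 * I_1 + I_2 = 0.
    relation = [-alpha**2 * fim[0][k] + fim[1][k] for k in range(2)]
    dets.append(det2(fim))
    relations.append(relation)
    print(alpha, relation, det2(fim))
```

A zero determinant at every sampled $α$ is exactly the statement that the matrix has rank less than two along the whole curve $w q = 1$.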

Permutation of Nodes

This one is easy to see. If we have a model with $d = 2$ nodes,

f(x, w) = q_{1} \mathrm{ReLU}(⟨ w_{1}, x ⟩ + b_{1}) + q_{2} \mathrm{ReLU}(⟨ w_{2}, x ⟩ + b_{2}) + c,

and we define a new model $f (x, w^{'})$ where $w^{'}$ is a permutation of the nodes in $f (x, w)$ ,

(w_{1}^{'}, b_{1}^{'}, q_{1}^{'}) = (w_{2}, b_{2}, q_{2}), (w_{2}^{'}, b_{2}^{'}, q_{2}^{'}) = (w_{1}, b_{1}, q_{1}), and c^{'} = c,

then

\begin{aligned}
f(x, w^{'}) &= q_{1}^{'} \mathrm{ReLU}(⟨ w_{1}^{'}, x ⟩ + b_{1}^{'}) + q_{2}^{'} \mathrm{ReLU}(⟨ w_{2}^{'}, x ⟩ + b_{2}^{'}) + c^{'} \\
&= q_{2} \mathrm{ReLU}(⟨ w_{2}, x ⟩ + b_{2}) + q_{1} \mathrm{ReLU}(⟨ w_{1}, x ⟩ + b_{1}) + c \\
&= f(x, w) .
\end{aligned}

This easily generalises to $d$ hidden nodes by taking any permutation $σ \in S_{d}$ in the permutation group $S_{d}$ and letting each node $i^{'}$ of $f (x, w^{'})$ satisfy $i^{'} = σ (i)$ , so

\begin{aligned}
f(x, w) &= c + \sum_{i = 1}^{d} q_{i} \mathrm{ReLU}(⟨ w_{i}, x ⟩ + b_{i}) \\
&= c + \sum_{i = 1}^{d} q_{σ (i)} \mathrm{ReLU}(⟨ w_{σ (i)}, x ⟩ + b_{σ (i)}) \\
&= f(x, w^{'}) .
\end{aligned}

Permutation symmetry — Permuting nodes induces functional equivalence, here depicted for $σ = (1, 3) (2, 4)$ .
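The permutation symmetry can also be checked by brute force. The sketch below stores each hidden node as a `(w_i, b_i, q_i)` triple (a representation I am assuming for convenience, with made-up outgoing weights) and confirms that all $4! = 24$ orderings of a four node network agree on a grid of inputs.

```python
import itertools


def relu(z):
    return max(z, 0.0)


def f(x, nodes, c):
    """nodes is a list of (w_i, b_i, q_i) triples, one per hidden node."""
    return c + sum(
        q * relu(w[0] * x[0] + w[1] * x[1] + b) for w, b, q in nodes
    )


# Hypothetical four node network with distinct outgoing weights.
nodes = [
    ((1.0, 0.0), -1.0, 1.0),
    ((0.0, 1.0), -1.0, 2.0),
    ((-1.0, 0.0), -1.0, -0.5),
    ((0.0, -1.0), -1.0, 3.0),
]
c = 0.0
grid = [(0.5 * i, 0.5 * j) for i in range(-8, 9) for j in range(-8, 9)]

# Compare every reordering of the hidden nodes against the original network.
gaps = []
for perm in itertools.permutations(range(len(nodes))):
    permuted = [nodes[k] for k in perm]
    gaps.append(max(abs(f(x, nodes, c) - f(x, permuted, c)) for x in grid))

print(len(gaps), max(gaps))
```

Each of these orderings is a distinct point of $W$, so a single function already corresponds to $d!$ isolated copies before any scaling is taken into account.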

Orientation Reversal

This one is a bit trickier to observe as the symmetry depends on a very specific condition of weight annihilation. Let's look at a simple example first.

Motivating Example

Example 3.3: Consider a true distribution defined by a (one-input) feedforward ReLU network given by

f_{0}(x) = \mathrm{ReLU}(x - 1) + \mathrm{ReLU}(- x - 1) + 2 = \begin{cases} - x + 1 & x \leq - 1 \\ 2 & - 1 \leq x \leq 1 \\ x + 1 & x \geq 1 \end{cases}

where $w_{1}^{(0)} = 1$ , $w_{2}^{(0)} = - 1$ , and the activation boundaries are $H_{1}^{(0)} = \{x = 1\}$ and $H_{2}^{(0)} = \{x = - 1\}$ .

Surprisingly, though it may appear our linear regions and activation boundaries must uniquely define the function (up to the scaling and permutation symmetries), there is a particular symmetry that arises by reversing the orientation of the weights and first layer biases, and adjusting the total bias accordingly. When we say reverse the orientation, we mean negating their direction,

w_{1} = - w_{1}^{(0)} = - 1 and w_{2} = - w_{2}^{(0)} = 1,

and ditto for the biases. If we adjust the total bias $c$ accordingly (here from $c = 2$ to $c = 0$ ), then the following function

f(x, w) = \mathrm{ReLU}(- x + 1) + \mathrm{ReLU}(x + 1)

gives the same functional output!

Weight annihilation in 1D — Reversing the orientation of the true weights preserves this function because the true weights annihilate one another.

There is a very specific reason we can do this: in the middle region $- 1 \leq x \leq 1$ , both nodes are active and cancel out to give a constant function,

f (x, w) = (- x + 1) + (x + 1) = 2,

because the total gradients of the underlying truth sum to zero, $w_{1}^{(0)} + w_{2}^{(0)} = 0$ .
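A quick way to convince yourself of this is to evaluate both networks on a grid. The following is a minimal sketch with function names of my own choosing (`f0`, `f_reversed`). It also checks that the adjusted total bias equals the original bias plus the sum of $q_{i} b_{i}$ over the reversed nodes, which is the bias condition stated in Appendix 2.

```python
def relu(z):
    return max(z, 0.0)


def f0(x):
    """Truth from Example 3.3: weights (1, -1), biases (-1, -1), c = 2."""
    return relu(x - 1) + relu(-x - 1) + 2


def f_reversed(x):
    """Weights and first-layer biases negated; total bias adjusted to c = 0."""
    return relu(-x + 1) + relu(x + 1) + 0


xs = [k / 8 for k in range(-40, 41)]  # grid on [-5, 5], exact in binary
max_gap = max(abs(f0(x) - f_reversed(x)) for x in xs)
print(max_gap)

# Adjusted total bias: original c plus the sum of q_i * b_i over both nodes.
c_new = 2 + (1 * -1) + (1 * -1)
print(c_new)
```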

General case

Suppose the true network $f_{0} (x)$ is defined by a fixed $w^{(0)} = (w_{1}^{(0)}, \dots, b_{1}^{(0)}, \dots q_{1}^{(0)}, \dots, c)$ for $m$ nodes. If there is a set $F \subseteq [m]$ of total gradients that sum to 0,

\sum_{i \in F} q_{i}^{(0)} w_{i}^{(0)} = 0

then the model can produce functional equivalence by reversing the orientation of those particular weights (associated to those activation boundaries), biases, and adjusting the total bias. In other words, modulo permutation and scaling symmetry, there is a functionally equivalent network to $f_{0} (x)$ where the weights of every $i \in F$ satisfy

w_{i} = - w_{i}^{(0)} .

We call the condition $\sum_{i \in F} q_{i}^{(0)} w_{i}^{(0)} = 0$ weight annihilation.
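The reason weight annihilation is exactly the right condition comes from the identity $ReLU (a) - ReLU (- a) = a$ : reversing a node changes the function by the affine term $q_{i} (⟨ w_{i}, x ⟩ + b_{i})$ , so the linear parts cancel precisely when $\sum_{i \in F} q_{i} w_{i} = 0$ , and the leftover constants are absorbed into the total bias. The sketch below checks this numerically on a hypothetical truth with three unit weight vectors rotated by $\frac{2 π}{3}$ ; the biases of $- 0.3$ and all helper names are assumptions for illustration. As a control, it also shows that the same trick fails on a subset of nodes whose total gradient is non-zero.

```python
import math
import random


def relu(z):
    return max(z, 0.0)


def f(x, nodes, c):
    """nodes is a list of (w_i, b_i, q_i) triples."""
    return c + sum(
        q * relu(w[0] * x[0] + w[1] * x[1] + b) for w, b, q in nodes
    )


def reverse(nodes, c):
    """Negate every (w_i, b_i) and shift the total bias by sum_i q_i b_i."""
    rev = [((-w[0], -w[1]), -b, q) for w, b, q in nodes]
    return rev, c + sum(q * b for w, b, q in nodes)


# Hypothetical truth: unit weights rotated by 2*pi/3, b_i = -0.3, q_i = 1.
m = 3
truth = [
    ((math.cos(2 * math.pi * k / m), math.sin(2 * math.pi * k / m)), -0.3, 1.0)
    for k in range(m)
]
c0 = 0.0
total_gradient = (
    sum(q * w[0] for w, b, q in truth),
    sum(q * w[1] for w, b, q in truth),
)

rng = random.Random(1)
points = [(rng.uniform(-3, 3), rng.uniform(-3, 3)) for _ in range(2000)]

# With annihilation (F = all three nodes) the reversed network agrees.
rev_nodes, c_new = reverse(truth, c0)
max_gap = max(abs(f(x, truth, c0) - f(x, rev_nodes, c_new)) for x in points)

# Control: the first two nodes alone do not annihilate, so reversal fails.
partial = truth[:2]
partial_rev, c_partial = reverse(partial, c0)
control_gap = max(
    abs(f(x, partial, c0) - f(x, partial_rev, c_partial)) for x in points
)
print(total_gradient, max_gap, control_gap)
```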

In [Carroll21, $§$ 4.5] we define $m$ -symmetric networks where the weights are progressive rotations by the angle $\frac{2 π}{m}$ , thus their total sum is zero. In DSLT4, we will study whether the posterior prefers configurations of weight-annihilation or not. (The answer is: not). ^[6]

An $m$ -symmetric network for $m = 3$ with $q_{i}^{(0)} = 1$ and $c^{(0)} = 0$ . Both configurations, non-weight-cancellation (left) and weight-cancellation (right), are functionally equivalent since $\sum_{i = 1}^{3} w_{i}^{(0)} = 0$ . Here, weight cancellation refers to the configuration where all three nodes are active in the central linear domain, but cancel to give an effective gradient of zero there.

Node Degeneracy

This is possibly the most important symmetry of all: neural network models can have more nodes than they need to represent a particular function. In essence, this degeneracy is the reason that different regions of the loss-landscape $K (w)$ of neural networks have fundamentally different accuracy-complexity tradeoffs. In other words, if the model has $d$ nodes in the hidden layer available to it, then all possible subnetwork configurations with fewer than $d$ nodes are also contained within the loss landscape. Thus, increasing the width of the network can only serve to increase the accuracy of these models, without sacrificing their ability to generalise, since the posterior will just prefer the number of hidden nodes with the best accuracy-complexity tradeoff.

Motivating Example

Example 3.4: Suppose we had a (one-input) true network given by

f_{0}(x) = \mathrm{ReLU}(x)

and our model had $d = 2$ nodes (with fixed biases $b_{1} = b_{2} = c = 0$ and outgoing weights $q_{1} = q_{2} = 1$ ),

f(x, w) = \mathrm{ReLU}(w_{1} x) + \mathrm{ReLU}(w_{2} x) .

Since $f_{0} (x) = 0$ for $x \leq 0$ , both weights must be non-negative, $w_{1}, w_{2} \geq 0$ , to have any hope of being functionally equivalent. If $f (x, w) = f_{0} (x)$ , we are in one of two configurations:

One node is degenerate: Either $(w_{1}, w_{2}) = (1, 0)$ or $(w_{1}, w_{2}) = (0, 1)$ , meaning

f(x, w) = \mathrm{ReLU}(1 \cdot x) + \mathrm{ReLU}(0 \cdot x) = \mathrm{ReLU}(x) = f_{0}(x) .

Both nodes are non-degenerate, but the total gradient is the same as the truth: So long as the weights satisfy

w_{1} + w_{2} = 1,

for $w_{1}, w_{2} > 0$ , we will have functional equivalence since, setting $w_{2} = 1 - w_{1}$ ,

\begin{aligned}
f(x, w) &= \mathrm{ReLU}(w_{1} x) + \mathrm{ReLU}((1 - w_{1}) x) \\
&= \begin{cases} w_{1} x + (1 - w_{1}) x & x \geq 0 \\ 0 & x \leq 0 \end{cases} \\
&= \mathrm{ReLU}(x) = f_{0}(x) .
\end{aligned}
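Both configurations are easy to check numerically. This sketch (helper name `max_gap` and the sample weights are mine) measures the largest deviation from the truth on a grid, for the degenerate configurations, for a few non-degenerate points on the line $w_{1} + w_{2} = 1$ , and for a point on that line with a negative weight, where equivalence fails.

```python
def relu(z):
    return max(z, 0.0)


def f0(x):
    return relu(x)


def f(x, w1, w2):
    """Model from Example 3.4 with b_1 = b_2 = c = 0 and q_1 = q_2 = 1."""
    return relu(w1 * x) + relu(w2 * x)


xs = [k / 4 for k in range(-20, 21)]  # grid on [-5, 5]


def max_gap(w1, w2):
    """Largest |f(x, w) - f_0(x)| over the grid."""
    return max(abs(f(x, w1, w2) - f0(x)) for x in xs)


# Degenerate configurations (one node switched off).
print(max_gap(1.0, 0.0), max_gap(0.0, 1.0))

# Non-degenerate configurations on the line w1 + w2 = 1 with w1, w2 > 0.
print([max_gap(t, 1.0 - t) for t in (0.25, 0.5, 0.75)])

# A negative weight breaks equivalence even though w1 + w2 = 1.
print(max_gap(1.5, -0.5))
```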

Node-degeneracies Correspond to Different Phases

We could of course encapsulate both of these configurations into the one statement that $w_{1} + w_{2} = 1$ for $w_{1}, w_{2} \geq 0$ , but there is a key reason we have delineated them: they represent two different phases and have different geometry on $K (w)$ . Intuitively, the degenerate phase is a simpler model with less complexity, thus we expect it has a lower RLCT ^[7], and for the posterior to prefer it. In DSLT4 we will discuss phases in statistical learning more broadly, and display experimental evidence for this latter claim.

To foreshadow this, we can actually calculate $K (w)$ for Example 3.4. Setting the prior $q (x)$ to be uniform on $[- \sqrt{6}, \sqrt{6}]$ we find

K(w_{1}, w_{2}) = \begin{cases} (w_{1} + w_{2} - 1)^{2} & w_{1}, w_{2} \geq 0 \\ w_{1}^{2} + (w_{2} - 1)^{2} & w_{1} \leq 0, \; w_{2} \geq 0 \\ (w_{1} - 1)^{2} + w_{2}^{2} & w_{1} \geq 0, \; w_{2} \leq 0 \\ (w_{1} + w_{2})^{2} + 1 & w_{1}, w_{2} \leq 0 \end{cases} .

K(w) for the simple example — $K (w)$ for the above example with slightly wider bowls at the degenerate-node phases.

Notice how there are ever so slightly wider bowls at either end of the line $w_{1} + w_{2} = 1$ , thus suggesting the posterior has more density at the degenerate phase $(w_{1}, w_{2}) = (0, 1)$ (or vice versa). Intuitively, imagine a tiny ball with random kinetic energy rolling around the bottom of the surface - it will spend more time in the ends since there is more catchment area. (Don't take the physics analogy too seriously, though).

We see once again that the singularity structure of the minima has a big impact on the geometry of $K (w)$ , and therefore the posterior.
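If you want to check the four-case closed form for $K (w_{1}, w_{2})$ above, here is a sketch that integrates the squared error numerically. One assumption to flag loudly: I take $K (w)$ to be the integral of $(f (x, w) - f_{0} (x))^{2}$ against $q (x)$ with no additional constant factor, which is the normalisation under which the quoted formula is reproduced (a different convention would rescale everything by a constant). The Simpson rule helper and the sample points are my own choices; splitting the integral at the kink $x = 0$ makes Simpson's rule exact, since the integrand is quadratic on each half.

```python
import math

A = math.sqrt(6.0)  # q(x) is uniform on [-A, A], with density 1 / (2A)


def relu(z):
    return max(z, 0.0)


def simpson(g, lo, hi, n=200):
    """Composite Simpson rule with n (even) subintervals."""
    h = (hi - lo) / n
    s = g(lo) + g(hi)
    s += 4 * sum(g(lo + (2 * k - 1) * h) for k in range(1, n // 2 + 1))
    s += 2 * sum(g(lo + 2 * k * h) for k in range(1, n // 2))
    return s * h / 3


def K_numeric(w1, w2):
    """Integral of (f(x, w) - f_0(x))^2 q(x) dx, split at the kink x = 0."""
    def integrand(x):
        diff = relu(w1 * x) + relu(w2 * x) - relu(x)
        return diff * diff / (2 * A)
    return simpson(integrand, -A, 0.0) + simpson(integrand, 0.0, A)


def K_closed(w1, w2):
    """The four-case closed form quoted in the text."""
    if w1 >= 0 and w2 >= 0:
        return (w1 + w2 - 1) ** 2
    if w1 <= 0 and w2 >= 0:
        return w1**2 + (w2 - 1) ** 2
    if w1 >= 0 and w2 <= 0:
        return (w1 - 1) ** 2 + w2**2
    return (w1 + w2) ** 2 + 1


# One or more sample points from each of the four quadrants.
samples = [(0.3, 0.7), (1.0, 0.0), (-0.5, 1.2), (0.8, -0.4), (-1.0, -0.5), (2.0, 1.0)]
max_err = max(abs(K_numeric(w1, w2) - K_closed(w1, w2)) for w1, w2 in samples)
print(max_err)
```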

General Case

Suppose we have a truth $f_{0} (x)$ and model $f (x, w)$ that are both defined by two layer feedforward ReLU neural networks, where the model has $d$ nodes and the truth has $m$ nodes (assumed to all be non-degenerate and with distinct activation boundaries) such that $m < d$ . Then the model is overparameterised compared to the truth it is trying to model.

Performing the appropriate analysis (which we do in Appendix 2), one finds that:

Without loss of generality (i.e. up to permutation symmetry), the first $m$ nodes of $[d]$ must have the same activation boundaries as those in $f_{0} (x)$ , and satisfy the same scaling, permutation and orientation reversing symmetries as discussed above.
For the remaining nodes $i \in \{m + 1, \dots, d\}$ of the model, each one either:
- Is degenerate, so $w_{i} = 0$ or $q_{i} = 0$ , or;
- Shares the same activation boundary as one already in $[m]$ such that the total gradients sum to the correct gradient in each region. (In our above example, this is saying that necessarily $w_{1} + w_{2} = 1$ since these nodes share the same activation boundary).

Degenerate contour plot: the function $f_{0} (x) = 2 \mathrm{ReLU}(x_{2} - \frac{1}{3})$ can also be represented by a two node model where both nodes share the same boundary, $f (x, w) = \mathrm{ReLU}(x_{2} - \frac{1}{3}) + \mathrm{ReLU}(x_{2} - \frac{1}{3})$ .

In DSLT4 we will test which of the phases in the above figure is preferred by the posterior for this simple two layer feedforward ReLU neural network setup.

Node degeneracy is the same as lossless network compressibility

The fact that neural networks can contain these node-degeneracies is well known and often goes under the guise of lossless network compressibility. There are many notions of compressibility, but the one that makes the most sense in our setup is to say that if the model has $d > m$ hidden nodes compared to the truth, then it can be compressed to a network with only $m$ hidden nodes and still produce the same input-output map.

For an excellent introduction to lossless network compressibility, see Farrugia-Roberts' recent paper Computational Complexity of Detecting Proximity to Losslessly Compressible Neural Network Parameters, where he studies the problem for $\tanh$ networks.

There are More True Parameters if the Input Domain is Bounded

Let me make an important remark here. In both of the above cases, we have considered the symmetries of $W_{0}$ when the input domain of the model and the truth is all of $\mathbb{R}^{2}$ . As we explain in Appendix 2, this allows us to compare the gradients and biases of hyperplanes, similar to comparing polynomial coefficients, to make our conclusions. However, if the domain of the input prior $q (x)$ is restricted to some open bounded domain $Z \subseteq \mathbb{R}^{2}$ , there could in principle be more degeneracies and symmetries of $W_{0}$ , since the functional equivalence only needs to be on $Z$ .

For example, consider a true network defined by $f_{0} (x) = 0$ and a single-node single-input model $f (x, w) = q \, \mathrm{ReLU}(⟨ w, x ⟩ + b)$ defined on $Z = (- a, a)$ , so $q (x) = \frac{1}{2 a} \mathbb{1} (- a < x < a)$ . If the activation boundary falls outside of $Z$ and the vector $w$ points away from $Z$ , then any value of $q, w, b$ satisfying these constraints would give $f (x, w) = 0$ , thus there is an entirely new class of symmetry in $W_{0}$ .
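As a small illustration of this remark, the sketch below takes $a = 1$ and a few hand-picked (hypothetical) parameters whose activation boundary $x = - b / w$ lies outside $Z$ with the node active only beyond it. None of them are related by scaling, yet all vanish identically on $Z$ . A final parameter with the same boundary but $w$ pointing towards $Z$ does not.

```python
def relu(z):
    return max(z, 0.0)


def f(x, q, w, b):
    """Single-node, single-input model q * ReLU(w x + b)."""
    return q * relu(w * x + b)


a = 1.0
# 400 midpoints strictly inside Z = (-a, a).
zs = [-a + 2 * a * (k + 0.5) / 400 for k in range(400)]


def vanishes_on_Z(q, w, b):
    """True if the model is identically zero at every sampled point of Z."""
    return all(f(x, q, w, b) == 0.0 for x in zs)


candidates = [
    (5.0, 1.0, -2.0),   # boundary at x = 2, active for x > 2
    (-3.0, 2.0, -2.5),  # boundary at x = 1.25, active for x > 1.25
    (0.7, -1.0, -1.5),  # boundary at x = -1.5, active for x < -1.5
]
results = [vanishes_on_Z(*p) for p in candidates]

# Boundary at x = 2 again, but w points towards Z: active on all of Z.
counterexample = vanishes_on_Z(5.0, -1.0, 2.0)
print(results, counterexample)
```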

Whilst it is important to keep this in mind, we won't discuss this any further as it opens up an entirely different can of worms.

Even if Singularities Occur with Probability Zero, they Affect Global Behaviour in Learning

I want to make a quick comment on the work of Phuong and Lampert in [PL19]. In this paper, they prove equivalent results to these for arbitrary depth feedforward ReLU neural networks (with non-increasing widths), but with a key distinction: they consider general models. In their words,

A sufficient condition for a network to be general with probability one is that the weights are sampled from a distribution with a density.

They then show that almost all feedforward ReLU networks with this architecture are general, and then show that general networks only satisfy scaling and permutation symmetries, thus excluding our orientation-reversing and degenerate node singularities since they occur on a set of measure zero. Importantly, this implies that almost all parameters $w \in W$ have no degenerate nodes, or equivalently, no opportunity for lossless compression.

However, even though scaling and permutation symmetries may be the only generic symmetries (in the sense of measure theory) that occur with non-zero probability, SLT tells us that the singularities of $K (w)$ have global effects on the loss landscape, as we discussed at length in DSLT2. If a parameter is near a non-generic singularity (i.e. one that occurs with probability zero), it computes a function that is almost identical to the one computed at that singularity. If we shift our language to that of compressibility of a network, SLT tells us that:

Just because a particular point $w \in W$ sampled from a posterior (or, notionally, obtained via running SGD) is not directly compressible itself, that doesn't mean that it isn't extremely close to one that is.

In this sense, SLT tells us that to understand the geometry of the loss landscape, we need to consider singularities even though they are not generic points. As Watanabe says, singularities contain knowledge.

Appendix 1 - Formal Proof that Neural Networks are Singular

If, like me, you are mathematically inclined, you probably want to see a proof that these neural networks are, indeed, singular models, to tie together the various concepts and intuitions that we have built in this sequence so far. So let's turn into math mode briefly.

Recall that the Fisher information matrix $I (w)$ is degenerate if and only if the set

\left\{ \frac{\partial}{\partial w_{j}} f(x, w) \right\}_{j = 1}^{D}

is linearly dependent. Here, $\frac{\partial}{\partial w_{j}}$ refers to the partial derivative with respect to the $j$ th component of the total parameter $w \in W$ , not to be confused with the specific weight vector $w_{j}$ in the neural network definition. Thus, to prove that feedforward ReLU networks are singular, our task is to find this linear dependence relation. The scaling symmetry alone is enough for this.

Theorem: Given a two layer feedforward ReLU neural network $f : \mathbb{R}^{2} \times W \to \mathbb{R}$ with $d$ hidden nodes, for any domain on which $f$ is differentiable, $f$ satisfies the following differential equation for each fixed node $i \in [d]$ :

\left\{ w_{i, 1} \frac{\partial}{\partial w_{i, 1}} + w_{i, 2} \frac{\partial}{\partial w_{i, 2}} + b_{i} \frac{\partial}{\partial b_{i}} - q_{i} \frac{\partial}{\partial q_{i}} \right\} f = 0 .

Proof: Since $\frac{d}{d x} \mathrm{ReLU}(x) = \mathbb{1}(x > 0)$ wherever the derivative exists, and letting $a_{i} = ⟨ w_{i}, x ⟩ + b_{i}$ , the derivatives with respect to the parameters of node $i$ are

\frac{\partial f}{\partial w_{i, k}} = q_{i} x_{k} \mathbb{1}(a_{i} > 0), \quad \frac{\partial f}{\partial b_{i}} = q_{i} \mathbb{1}(a_{i} > 0), \quad \frac{\partial f}{\partial q_{i}} = \mathrm{ReLU}(a_{i}),

and so since we can write $\mathrm{ReLU}(a_{i}) = a_{i} \mathbb{1}(a_{i} > 0)$ we have

\begin{aligned}
\left\{ w_{i, 1} \frac{\partial}{\partial w_{i, 1}} + w_{i, 2} \frac{\partial}{\partial w_{i, 2}} + b_{i} \frac{\partial}{\partial b_{i}} - q_{i} \frac{\partial}{\partial q_{i}} \right\} f &= q_{i} w_{i, 1} x_{1} \mathbb{1}(a_{i} > 0) + q_{i} w_{i, 2} x_{2} \mathbb{1}(a_{i} > 0) + q_{i} b_{i} \mathbb{1}(a_{i} > 0) - q_{i} \mathrm{ReLU}(a_{i}) \\
&= q_{i} a_{i} \mathbb{1}(a_{i} > 0) - q_{i} \mathrm{ReLU}(a_{i}) \\
&= q_{i} \mathrm{ReLU}(⟨ w_{i}, x ⟩ + b_{i}) - q_{i} \mathrm{ReLU}(⟨ w_{i}, x ⟩ + b_{i}) \\
&= 0 . \quad □
\end{aligned}
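The theorem can be sanity checked with finite differences, which is what the following sketch does. The flat parameter layout, the helper names (`partial`, `euler_residual`), the random parameters, and the step size are all my own assumptions. Points too close to an activation boundary are skipped, since the theorem only applies where $f$ is differentiable.

```python
import random


def relu(z):
    return max(z, 0.0)


def f(x, params):
    """params = [w_11, w_12, b_1, q_1, ..., w_d1, w_d2, b_d, q_d, c]."""
    d = (len(params) - 1) // 4
    out = params[-1]
    for i in range(d):
        w1, w2, b, q = params[4 * i : 4 * i + 4]
        out += q * relu(w1 * x[0] + w2 * x[1] + b)
    return out


def partial(x, params, j, h=1e-4):
    """Central finite difference of f with respect to the j-th parameter."""
    up = list(params)
    dn = list(params)
    up[j] += h
    dn[j] -= h
    return (f(x, up) - f(x, dn)) / (2 * h)


def euler_residual(x, params, i):
    """(w_i1 d/dw_i1 + w_i2 d/dw_i2 + b_i d/db_i - q_i d/dq_i) f at (x, params)."""
    k = 4 * i
    w1, w2, b, q = params[k : k + 4]
    return (
        w1 * partial(x, params, k)
        + w2 * partial(x, params, k + 1)
        + b * partial(x, params, k + 2)
        - q * partial(x, params, k + 3)
    )


rng = random.Random(2)
d = 3
params = [rng.uniform(-2, 2) for _ in range(4 * d + 1)]
worst, checked = 0.0, 0
for _ in range(500):
    x = (rng.uniform(-3, 3), rng.uniform(-3, 3))
    for i in range(d):
        a_i = params[4 * i] * x[0] + params[4 * i + 1] * x[1] + params[4 * i + 2]
        if abs(a_i) < 1e-3:
            continue  # too close to H_i, where f is not differentiable
        worst = max(worst, abs(euler_residual(x, params, i)))
        checked += 1
print(checked, worst)
```

Because $f$ is linear in each individual parameter away from the activation boundary, the central difference is exact up to rounding, so the residual should be tiny rather than merely small.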

Corollary: Feedforward ReLU neural networks are singular models.

Proof: For the two layer case, for any fixed $w^{0} \in W$ , there is a linear dependence relation given by the above differential equation evaluated at $w^{0}$ , thus the Fisher information is degenerate at $w^{0}$ , so the model is singular.

The equivalent proof for arbitrary depths and widths is given in Lemma A.1 of [Wei22], following from other work on functional equivalence in [PL19]. $□$

The degenerate node symmetries also give rise to a degenerate Fisher information matrix, though I haven't formally written out this alternate proof yet. If you are interested, do it as an exercise and leave it as a comment!

Appendix 2 - Proof Sketch for Fully Classifying $W_{0}$ for Two Layer Feedforward ReLU Networks

This section is going to be slightly more technical, and in the grand scheme of the SLT story I am telling in this sequence, this may be seen as an unnecessary side-plot. But, other readers, particularly those with a pure mathematical bent, may find it interesting to consider the process of fully classifying $W_{0}$ and how one might understand all phases present, so I am providing a sketch of these proofs for completeness. Understanding the full form of $W_{0}$ was a vital part of performing the phase transition experiments that we will see in DSLT4. These models are simple enough that we can perfectly classify all true parameters in $W_{0}$ . Thus, we can precisely understand all of its phases.

We are going to classify the symmetries of $W_{0}$ when both the model $f (x, w)$ and truth $f_{0} (x)$ are two-layer feedforward ReLU neural networks, with $d$ and $m$ hidden nodes respectively, giving

W_{0} = {w \in W | f (x, w) = f_{0} (x)},

meaning the task is to classify functional equivalence of the two networks. To avoid some annoying fringe cases, we assume that the true network is minimal, which means there is no network with fewer nodes that could also represent it (which also means every node is non-degenerate), and activation-distinguished, meaning every node of the truth corresponds to a unique activation boundary.

We will see that the set of symmetries explained above comprise all of the symmetries in $W_{0}$ - there can be no more ^[8]. This result rests mainly on the fact that the activation boundaries are the core piece of data that defines a neural network. The rest is then just performing accounting of the gradients and biases in each region.

This is a sketch of the proofs in Chapter 4 of my thesis, and all lemmas and theorems that are referenced in the following section come from here.

Case 1: The model has the same number of nodes as the truth, $m = d$

Let $f (x, w)$ be a two layer feedforward ReLU neural network model with $d$ hidden nodes, and let $f_{0} (x) = f (x, w^{(0)})$ be the realisable true network with $m$ hidden nodes defined by a fixed parameter $w^{(0)}$ , denoted by

f_{0}(x) = c^{(0)} + \sum_{j = 1}^{m} q_{j}^{(0)} \mathrm{ReLU}(⟨ w_{j}^{(0)}, x ⟩ + b_{j}^{(0)}),

which we assume is minimal and activation-distinguished as explained above.

We start by comparing the foldsets, which are the activation boundaries [Lemma 4.1], between the truth and the model. Let $H_{i}$ be the activation boundary of the node $i \in [d]$ in the model, and $H_{j}^{(0)}$ be the activation boundary of the node $j \in [m]$ in the truth. Then by comparing the sets of lines in [Lemma 4.2], we can show that there exists a permutation $σ \in S_{m}$ such that for every node $i \in [d]$ of the model

H_{i} = H_{σ (i)}^{(0)} .

By [Lemma 4.3], two activation boundaries $H, H^{'}$ are equal if and only if there is some non-zero scalar $α \in \mathbb{R} ∖ \{0\}$ such that $w = α w^{'}$ and $b = α b^{'}$ .

Using our relation $H_{i} = H_{σ (i)}^{(0)}$ , in [Lemma 4.4] we analyse how the gradients and biases change across each activation boundary, and what this means for the relation between weights and biases in the model versus the truth. We show that there exists a unique $σ \in S_{m}$ , and for each $i \in [d]$ an $ϵ_{i} \in \mathbb{Z}_{2}$ and $α_{i} \in \mathbb{R}_{> 0}$ such that

w_{i} = (- 1)^{ϵ_{i}} α_{i} w_{σ (i)}^{(0)}, \quad b_{i} = (- 1)^{ϵ_{i}} α_{i} b_{σ (i)}^{(0)}, \quad \text{where } α_{i} = \frac{q_{σ (i)}^{(0)}}{q_{i}},

meaning $q_{i}$ and $q_{σ (i)}^{(0)}$ necessarily have the same sign.

However, there is a restriction on which weights can have reversed orientation, $ϵ_{i} = 1$ (thus $w_{i} = - α_{i} w_{σ (i)}^{(0)}$ ). Letting $E = \{i \in [d] \mid ϵ_{i} = 1\}$ , we show in [Lemma 4.5] that the weights and biases of the true network must satisfy ^[9]

\sum_{i \in E} q_{σ (i)}^{(0)} w_{σ (i)}^{(0)} = 0 \quad \text{and} \quad c^{(0)} + \sum_{i \in E} q_{σ (i)}^{(0)} b_{σ (i)}^{(0)} = c .

The crux of this proof rests in comparing the gradients in regions either side of the activation boundary $H_{i}$ .
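One way to see why a constraint of this shape arises is the identity $ReLU (z) = z + ReLU (- z)$ : flipping a node's orientation leaves behind a linear term, and those leftover terms must cancel across the reversed nodes, with the leftover biases absorbed into $c$ . The sketch below checks this numerically on a hypothetical truth whose three weighted inner weights sum to zero, and shows that reversing a single node, which violates the constraint, changes the function. The numbers and helper names are illustrative assumptions, not the construction used in [Carroll21].

```python
import math
import random

def relu(z):
    return max(z, 0.0)

def two_layer(x, c, q, w, b):
    """c + sum_i q[i] * ReLU(<w[i], x> + b[i]) for an input vector x."""
    total = c
    for q_i, w_i, b_i in zip(q, w, b):
        pre = sum(w_ik * x_k for w_ik, x_k in zip(w_i, x)) + b_i
        total += q_i * relu(pre)
    return total

# Hypothetical truth: three nodes with sum_i q0[i] * w0[i] = 0.
s = math.sqrt(3) / 2
c0 = 0.1
q0 = [1.0, 1.0, 1.0]
w0 = [[1.0, 0.0], [-0.5, s], [-0.5, -s]]
b0 = [0.3, -0.2, 0.5]

# Reverse every node (E = all nodes), with positive scalings alpha_i.
alpha = [2.0, 0.5, 1.5]
q = [q0[i] / alpha[i] for i in range(3)]
w = [[-alpha[i] * v for v in w0[i]] for i in range(3)]
b = [-alpha[i] * b0[i] for i in range(3)]
c = c0 + sum(q0[i] * b0[i] for i in range(3))   # c = c0 + sum_{i in E} q0_i b0_i

# Reversing only node 0 breaks the constraint, since q0[0] * w0[0] != 0.
w_bad = [[-v for v in w0[0]], w0[1], w0[2]]
b_bad = [-b0[0], b0[1], b0[2]]
c_bad = c0 + q0[0] * b0[0]

def max_gap(c_m, q_m, w_m, b_m, n_points=1000, seed=0):
    """Largest |truth - model| over random inputs in [-5, 5]^2."""
    rng = random.Random(seed)
    gap = 0.0
    for _ in range(n_points):
        x = [rng.uniform(-5, 5), rng.uniform(-5, 5)]
        diff = two_layer(x, c0, q0, w0, b0) - two_layer(x, c_m, q_m, w_m, b_m)
        gap = max(gap, abs(diff))
    return gap

print(max_gap(c, q, w, b))              # agreement up to rounding
print(max_gap(c_bad, q0, w_bad, b_bad)) # clearly non-zero
```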

In [Theorem 4.7] we show that these scaling, permutation and orientation reversing symmetries are the only such symmetries by piecing together all of these aforementioned Lemmas, with emphasis on the importance of the activation boundaries in defining the topology of $f_{0} (x)$ . ^[10]

Case 2: The model has more nodes than the truth, $m < d$

We now suppose that the model is over-parameterised compared to the true network, so $m < d$ .

The key piece of data is once again the foldsets defining the model and the truth. Since they must be equal, the model can only have $m$ unique foldsets, and thus activation boundaries. Without loss of generality (i.e. up to permutation symmetry), the first $[m] \subset [d]$ nodes in the model have the same activation boundaries as the truth, $⋃_{i = 1}^{m} H_{i} = ⋃_{j = 1}^{m} H_{j}^{(0)}$ . Thus, these $[m]$ nodes in the model must satisfy the same symmetries as in the $m = d$ case.

By comparing the foldsets on each excess node in ${m + 1, \dots, d}$ , we must have

\bigcup_{i = m + 1}^{d} \{H_{i} \mid i \text{ is non-degenerate}\} \subseteq \bigcup_{j = 1}^{m} H_{j}^{(0)} .

In comparing linear lines again, this means there are two possible situations:

${H_{i} | i is non-degenerate}$ is empty, so node $i$ is degenerate, meaning $q_{i} = 0$ or $w_{i} = 0$ , or;
$H_{i} = H_{j}^{(0)}$ for some $j \in [m]$ , so node $i$ shares an activation boundary already in the first $[m]$ nodes of the model.

Let $d^{'} \geq m$ be the number of non-degenerate nodes of the model. We can thus define a surjective finite set map

π : {1, \dots, m, m + 1, \dots, d^{'}} \to {1, \dots, m}

relating the non-degenerate nodes in the model to those in the truth, which is a bijection (i.e. a permutation $σ \in S_{m}$ ) on the first $[m] \subset [d^{'}]$ .

We can then compare the gradients and biases in each region to show that the total gradients calculated by each non-degenerate node at each unique activation boundary must sum to the gradient in the truth. Precisely, for each node $j \in [m]$ of the truth, let $M_{j} = {i \in [d^{'}] | π (i) = j}$ be the set of nodes in the model that share the same activation boundary. Then for each $i \in [d^{'}]$ there exists an $ϵ_{i} \in Z_{2}$ and $α_{i} \in R_{> 0}$ such that

w_{i} = (- 1)^{ϵ_{i}} α_{i} w_{π (i)}^{(0)}, b_{i} = (- 1)^{ϵ_{i}} α_{i} b_{π (i)}^{(0)}

with the constraint that

\sum_{i \in M_{j}} q_{i} α_{i} = q_{j}^{(0)} .

A similar orientation reversing symmetry also applies as in case 1, just by accounting for the nodes that share the same activation boundaries.
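The over-parameterised case can be checked numerically in the same spirit. The sketch below, with all orientations preserved, takes a hypothetical truth with $m = 2$ nodes and a model with $d = 5$ nodes: four non-degenerate nodes that share the two true activation boundaries in pairs, and one degenerate node with $q_{i} = 0$ . The outer weights are chosen so that $\sum_{i \in M_{j}} q_{i} α_{i} = q_{j}^{(0)}$ holds for each $j$ . The specific numbers and the names `pi`, `alpha` and `max_gap` are assumptions made for the example.

```python
import random

def relu(z):
    return max(z, 0.0)

def two_layer(x, c, q, w, b):
    """c + sum_i q[i] * ReLU(<w[i], x> + b[i]) for an input vector x."""
    total = c
    for q_i, w_i, b_i in zip(q, w, b):
        pre = sum(w_ik * x_k for w_ik, x_k in zip(w_i, x)) + b_i
        total += q_i * relu(pre)
    return total

# Hypothetical truth with m = 2 nodes on 2-dimensional inputs.
c0 = -0.3
q0 = [1.0, -2.0]
w0 = [[1.0, 0.5], [-0.3, 1.2]]
b0 = [0.1, -0.4]

# Non-degenerate model nodes (d' = 4): pi[i] is the true node whose
# activation boundary model node i shares; alpha[i] > 0 is its scaling.
pi = [0, 1, 0, 1]
alpha = [2.0, 0.5, 1.0, 3.0]
q = [0.3, 4.0, None, None]
# Solve sum_{i in M_j} q_i * alpha_i = q0_j for the last node in each M_j.
q[2] = (q0[0] - q[0] * alpha[0]) / alpha[2]
q[3] = (q0[1] - q[1] * alpha[1]) / alpha[3]
w = [[alpha[i] * v for v in w0[pi[i]]] for i in range(4)]
b = [alpha[i] * b0[pi[i]] for i in range(4)]

# Degenerate fifth node: q = 0, so its inner weight and bias are free.
q.append(0.0)
w.append([7.0, -3.0])
b.append(2.0)

def max_gap(n_points=1000, seed=0):
    """Largest |truth - model| over random inputs in [-5, 5]^2."""
    rng = random.Random(seed)
    gap = 0.0
    for _ in range(n_points):
        x = [rng.uniform(-5, 5), rng.uniform(-5, 5)]
        diff = two_layer(x, c0, q0, w0, b0) - two_layer(x, c0, q, w, b)
        gap = max(gap, abs(diff))
    return gap

print(max_gap())  # agreement up to floating point rounding
```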

Resources

[Carroll21] - L. Carroll, Phase Transitions in Neural Networks, 2021

[Wei22] - S. Wei, D. Murfet, et al., Deep Learning is Singular, and That's Good, 2022

[PL19] - M. Phuong, C. Lampert, Functional vs Parametric Equivalence of ReLU networks, 2019

^[1] Since $K (w) = 0$ if and only if $q (y | x) = p (y | x, w)$ for some $w \in W$ .
^[2] e.g. $ReLU (x) + ReLU (x) = ReLU (2 x)$ .
^[3] For ease of classification, we exclude the case where $q_{i} \neq 0$ and $b_{i} \neq 0$ since we can just absorb the total bias contribution into $c$ .
^[4] A linear domain $U \subseteq R^{2}$ is just a connected open set where $f_{w}$ is a plane with constant gradient and bias when restricted to $U$ , and $U$ is the maximal such set for which that plane is defined. In other words, the set of linear domains is the set of regions that the piecewise affine function is carved up into.
^[5] But don't forget, $ReLU (- x) \neq - ReLU (x)$ as the domain of activation is completely different.
^[6] Though I have not been able to formally prove it, I believe that this symmetry on its own (i.e. modulo scaling symmetry) does not result in a degeneracy of the Fisher information matrix, at least in our simple case. This, I think, is because the weights must cancel out in the region where both nodes are active, and the gradients in the other regions must be retained. Feel free to prove me wrong, though!
^[7] This statement is a bit disingenuous. Watanabe's free energy formula only applies to the case where $K (w)$ is analytic, but ReLU neural networks are certainly not analytic, as we can see in the below example. With that said, Watanabe has recently proved a bound on the free energy for ReLU neural networks, showing that the complexity term is essentially related to the number of non-degenerate nodes in the truth, even if it isn't a true RLCT. We will look at this in more depth in DSLT4.
^[8] Aside from the technical caveat discussed about the restricted input prior $q (x)$ above.
^[9] Our convention is to take the empty sum to be 0, so all weight orientations being preserved, $E = \emptyset$ , is perfectly fine.
^[10] The activation distinguished condition on the truth allows us to uniquely identify the permutation $σ \in S_{m}$ relating activation boundaries, and ensures only one node changes across each boundary.

lsgos, 5mo

I'm trying to read through this more carefully this time: how load-bearing is the use of ReLU nonlinearities in the proof? This doesn't intuitively seem like it should be that important (e.g a sigmoid/gelu/tanh network feels like it is probably singular, and it certainly has to be if SLT is going to tell us something important about NN behaviour because changing the nonlinearity doesn't change how NNs behave that much imo), but it does seem to be an important part of the construction you use.

Liam Carroll, 5mo

Good question! The proof of the exact symmetries of this setup, i.e. the precise form of $W_{0}$ , is highly dependent on the ReLU. However, the general phenomenon I am discussing is applicable well beyond ReLU to other non-linearities. I think there are two main components to this:

Other non-linearities induce singular models. As you note, other non-linear activation functions do lead to singular models. @mfar did some great work on this for tanh networks. Even though the activation function is important, note that the better intuition to have is that the hierarchical nature of a model (e.g. neural networks) is what makes them singular. Deep linear networks are still singular despite an identity activation function. Think of the activation as giving the model more expressiveness.
Even if $W_{0}$ is uninteresting, the loss landscape might be "nearly singular". The ReLU has an analytic approximation, the Swish function $σ_{β} (x) = \frac{x}{1 + e^{- β x}}$ , where ${lim}_{β \to \infty} σ_{β} (x) = R e L U (x)$ , which does not yield the same symmetries as discussed in this post. This is because the activation boundaries are no longer a sensible thing to study (the swish function is "always active" in all subsets of the input domain), which breaks down a lot of the analysis used here.

Suppose, however, that we take a $β_{0}$ that is so large that from the point of view of your computer, $σ_{β_{0}} (x) = R e L U (x)$ (i.e. their difference is within machine-epsilon). Even though $W_{0}^{swish}$ is now a very different object to $W_{0}^{ReLU}$ on paper, the loss landscape will be approximately equal $L_{swish} (w) \approx L_{ReLU} (w)$ , meaning that the Bayesian posterior will be practically identical between the two functions and induce the same training dynamics.
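To get a feel for how quickly the two activations approach one another, here is a minimal sketch that measures the largest gap between $σ_{β}$ and the ReLU on a grid. The grid range, resolution and the helper names `swish` and `sup_gap` are arbitrary choices made for illustration. On this grid the gap shrinks roughly in proportion to $1 / β$ , so the two functions only become indistinguishable at floating point precision for extremely large $β$ .

```python
import math

def relu(x):
    return max(x, 0.0)

def swish(x, beta):
    """x / (1 + exp(-beta * x)), arranged to avoid overflow in exp."""
    t = beta * x
    if t >= 0:
        return x / (1.0 + math.exp(-t))
    e = math.exp(t)  # t < 0 here, so exp cannot overflow
    return x * e / (1.0 + e)

def sup_gap(beta, lo=-3.0, hi=3.0, n=60001):
    """Largest |swish_beta(x) - ReLU(x)| over an evenly spaced grid on [lo, hi]."""
    step = (hi - lo) / (n - 1)
    return max(
        abs(swish(lo + k * step, beta) - relu(lo + k * step)) for k in range(n)
    )

for beta in (1.0, 10.0, 100.0, 1000.0):
    gap = sup_gap(beta)
    print(beta, gap, beta * gap)  # beta * gap stays roughly constant
```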

So, whilst the precise functional-equivalences might be very different across activation functions (differing $W_{0}$ ), there might be many approximate functional equivalences. This is also the sense in which we can wave our arms about "well, SLT only applies to analytic functions, and ReLU isn't analytic, but who cares". Making precise mathematical statements about this "nearly singular" phenomenon - for example, how does the posterior change as you lower $β$ in $σ_{β} (x)$ ? - is under-explored at present (to the best of my knowledge), but it is certainly not something that discredits SLT for all of the reasons I have just explained.

Yeah I agree with everything you say; it's just I was trying to remind myself of enough of SLT to give a 'five minute pitch' for SLT to other people, and I didn't like the idea that I'm hanging it off the ReLU.

I guess the intuition behind the hierarchical nature of the models leading to singularities is the permutation symmetry between the hidden channels, which is kind of an easy thing to understand.

I get and agree with your point about approximate equivalences, though I have to say that I think we should be careful! One reason I'm interested in SLT is I spent a lot of time during my PhD on Bayesian approximations to NN posteriors. I think SLT is one reasonable explanation of why this never yielded great results, but I think hand-wavy intuitions about 'oh well the posterior is probably-sorta-gaussian' played a big role in its longevity as an idea.

Yeah, it's not totally clear what this 'nearly singular' thing would mean? Intuitively, it might be that there's a kind of 'hidden singularity' in the space of this model that might affect the behaviour, like the singularity in a dynamic model with a phase transition. But I'm just guessing.

Leon Lang, 10mo

Thanks Liam also for this nice post! The explanations were quite clear.

The property of being singular is specific to a model class, regardless of the underlying truth.

This holds for singularities that come from symmetries where the model doesn't change. However, is it correct that we need the "underlying truth" to study symmetries that come from other degeneracies of the Fisher information matrix? After all, this matrix involves the true distribution in its definition. The same holds for the Hessian of the KL divergence.

Both configurations, non-weight-annihilation (left) and weight-annihilation (right)

What do you mean with non-weight-annihilation here? Don't the weights annihilate in both pictures?

Liam Carroll, 10mo

However, is it correct that we need the "underlying truth" to study symmetries that come from other degeneracies of the Fisher information matrix? After all, this matrix involves the true distribution in its definition. The same holds for the Hessian of the KL divergence.

The definition of the Fisher information matrix does not refer to the truth whatsoever. (Note that in the definition I provide I am assuming the supervised learning case where we know the input distribution $q (x)$ , meaning the model is $p (y, x | w) = p (y | x, w) q (x)$ , which is why the $q (x)$ shows up in the formula I just linked to. The derivative terms do not explicitly include $q (x)$ because it just vanishes in the $w_{j}$ derivative anyway, so it's irrelevant there. But remember, we are ultimately interested in modelling the conditional true distribution $q (y | x)$ in $q (y, x) = q (y | x) q (x)$ .)

You're right, that's sloppy terminology from me. What I mean is, in the right hand picture (that I originally labelled WA), there is a region in which all nodes are active, but cancel out to give zero effective gradient, which is markedly different to the left hand picture. I have edited this to NonWC and WC instead to clarify, thanks!
