Liam Carroll

DSLT 0. Distilling Singular Learning Theory

Ah! Thanks for that - it seems the general playlist organising them has splintered a bit, so here is the channel containing the lectures, the structure of which is explained here. I'll update this post accordingly.

Replying to My Criticism of Singular Learning Theory

Thanks for writing this out Joar, it is a good exercise of clarification for all of us.

Perhaps a boring comment, but I do want to push back on the title ever so slightly: imo it should be My Criticism of SLT Proponents, i.e. people (like me) who have interpreted some aspects in perhaps an erroneous fashion (according to you).

Sumio Watanabe is incredibly careful to provide highly precise mathematical statements with rigorous proofs and at no point does he make claims about the kind of "real world deep learning" phenomena being discussed here. The only sense in which it seems you critique the theory of SLT itself is that perhaps it isn't as... (read more)

Replying to My Criticism of Singular Learning Theory

I would argue that the title is sufficiently ambiguous as to what is being claimed, and actually the point of contention in (ii) was discussed in the comments there too. I could have changed it to Why Neural Networks can obey Occam's Razor, but I think this obscures the main point. Regular linear regression could also obey Occam's razor (i.e. "simpler" models are possible) if you set high-order coefficients to 0, but the posterior of such models does not concentrate on those points in parameter space.

At the time of writing, basically nobody knew anything about SLT, so I think it was warranted to err on the side of grabbing attention in the... (read more)

Replying to DSLT 3. Neural Networks are Singular

Good question! The proof of the exact symmetries of this setup, i.e. the precise form of $W_{0}$ , is highly dependent on the ReLU. However, the general phenomenon I am discussing is applicable well beyond ReLU to other non-linearities. I think there are two main components to this:

Other non-linearities induce singular models. As you note, other non-linear activation functions do lead to singular models. @mfar did some great work on this for tanh networks. Even though the activation function is important, note that the better intuition to have is that the hierarchical nature of a model (e.g. neural networks) is what makes them singular. Deep linear networks are still singular despite an identity activation
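To make the deep linear network remark concrete, here is a minimal sketch (not taken from the post) for the smallest such network, $f(x) = a b x$. It assumes a regression model with unit-variance Gaussian noise and inputs with $E[x^2] = 1$; the function name and the parameter values are hypothetical. Under those assumptions the Fisher information matrix is $E[x^2] \, g g^T$ with $g = (b, a)$, which has rank one at every parameter, so the model is singular even though the activation is the identity.

```python
import numpy as np

def fisher_information(a, b, x_second_moment=1.0):
    """Fisher information of y = a*b*x + unit Gaussian noise, at (a, b).

    Assumes unit noise variance and E[x^2] = x_second_moment (illustrative choices).
    """
    # Gradient of the network output a*b*x with respect to (a, b), per unit x.
    grad = np.array([b, a])
    # With unit-variance Gaussian noise, the FIM is E[x^2] * grad grad^T.
    return x_second_moment * np.outer(grad, grad)

a, b = 1.5, -0.7  # arbitrary illustrative parameter values
fim = fisher_information(a, b)
fim_rank = np.linalg.matrix_rank(fim)

# The direction (a, -b) is tangent to the rescaling symmetry (a, b) -> (c*a, b/c),
# which leaves the function a*b*x unchanged, so the FIM annihilates it.
null_direction = np.array([a, -b])
fim_times_null = fim @ null_direction

print("FIM:\n", fim)
print("rank:", fim_rank)
print("FIM @ (a, -b):", fim_times_null)
```

The degenerate direction comes from the hierarchical (product) structure of the parameters rather than from any non-linearity.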

Liam Carroll, Edmund Lau

TLDR: This post distills Dynamical and Bayesian Phase Transitions in a Toy Model of Superposition by Chen et al. (2023), where they study developmental stages of the Toy Model of Superposition, understanding growth and form from the perspective of Singular Learning Theory (SLT).

Ernst Haeckel's *Kunstformen der Natur* (1904), plate 1: Phaeodaria

Where do the bewildering and intricate structures of Nature come from? What purpose do they serve? In his famous 1917 book "On Growth and Form" the Scottish biologist and mathematician D'Arcy Wentworth Thompson wrote the following about the geometric forms of Phaeodaria, shown above:

Great efforts have been made to attach a "biological meaning" to these elaborate structures and "to justify the hope

... (read 4101 more words →)

Replying to DSLT 0. Distilling Singular Learning Theory

Edit: Originally the sequence was going to contain a post about SLT for Alignment, but this can now be found here instead, where a new research agenda, Developmental Interpretability, is introduced. I have also now included references to the lectures from the recent SLT for Alignment Workshop in June 2023.

Replying to DSLT 1. The RLCT Measures the Effective Dimension of Neural Networks

Only in the illegal ways, unfortunately. Perhaps your university has access?

Replying to DSLT 4. Phase Transitions in Neural Networks

"Discontinuity" might suggest that this happens fast. Yet, e.g. in work on grokking, it actually turns out that these "sudden changes" happen over a majority of the training time (often, the x-axis is on a logarithmic scale). Is this compatible, or would this suggest that phenomena like grokking aren't related to the phase transitions predicted by SLT?

This is a great question and something that came up at the recent summit. We would definitely say that the model is in two different phases before and after grokking (i.e. when the test error is flat), but it's an interesting question to consider what's going on over that long period of time where the error... (read 663 more words →)

Replying to DSLT 3. Neural Networks are Singular

However, is it correct that we need the "underlying truth" to study symmetries that come from other degeneracies of the Fisher information matrix? After all, this matrix involves the true distribution in its definition. The same holds for the Hessian of the KL divergence.

The definition of the Fisher information matrix does not refer to the truth $q (y, x)$ whatsoever. (Note that in the definition I provide I am assuming the supervised learning case where we know the input distribution $q (x)$ , meaning the model is $p (y, x | w) = p (y | x, w) q (x)$ , which is why the $q (x)$ shows up in the formula I just linked to. The derivative terms do not explicitly include $q (x)$ because it just vanishes in the $w_{j}$ derivative anyway, so it's irrelevant there. But remember, we... (read more)

Replying to DSLT 2. Why Neural Networks obey Occam's Razor

Now, for the KL-divergence, the situation seems more extreme: The zero's are also, at the same time, the minima of $K$ , and thus, the derivative disappears at every point in the set $W_{0}$ . This suggests every point in $W_{0}$ is singular. Is this correct?

Correct! So, the point is that things get interesting when $W_{0}$ is more than just a single point (a single point being the regular case). In essence, singularities are local minima of $K (w)$ . In the non-realisable case this means they are zeroes of the minimum-loss level set. In fact we can abuse notation a bit and really just refer to any local minimum of $K (w)$ as a singularity. The TLDR of this is:

$$\text{singularities of } K (w) = \text{critical points of } K (w)$$

So far, I thought "being

DSLT 2. Why Neural Networks obey Occam's Razor

Can you tell more about why it is a measure of posterior concentration.
...
Are you claiming that most of that work happens very localized in a small parameter region?

Given a small neighbourhood $\mathcal{W} \subset W$ , the free energy is $F_{n} (\mathcal{W}) = - \log Z_{n} (\mathcal{W})$ and $Z_{n} (\mathcal{W})$ measures the posterior concentration in $\mathcal{W}$ since

$$Z_{n} (\mathcal{W}) = \int_{\mathcal{W}} e^{- n L_{n} (w)} \varphi (w) \, d w$$

where the integrand is the posterior, modulo its normalisation constant $Z_{n}$ . The key here is that if we are comparing different regions $\mathcal{W}$ of parameter space, then the free energy doesn't care about that normalisation constant as it is just a shift in $F_{n} (\mathcal{W})$ by a constant. So the free energy gives you a tool for comparing different regions of the posterior. (To make this comparison rigorous, I suppose one would want to make sure that these... (read 641 more words →)
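As a rough numerical illustration of comparing regions by their free energy, here is a minimal sketch. Everything in it is an assumption for the sake of the example: a one-dimensional toy loss standing in for $L_{n} (w)$ with a regular (quadratic) minimum at $w = -1$ and a degenerate (quartic) minimum at $w = +1$, a uniform prior on $[-2, 2]$, and the hypothetical function names. It evaluates the integral for $Z_{n}$ over a neighbourhood of each minimum and compares $- \log Z_{n}$.

```python
import numpy as np

def toy_loss(w):
    # Hypothetical stand-in for L_n(w): zero at w = -1 (quadratic) and w = +1 (quartic).
    return (w + 1.0) ** 2 * (w - 1.0) ** 4

def local_free_energy(n, lo, hi, prior_density=0.25, num_points=20001):
    """Return -log of the integral of exp(-n * loss) * prior over [lo, hi].

    prior_density=0.25 corresponds to a uniform prior on [-2, 2] (an assumption).
    """
    w = np.linspace(lo, hi, num_points)
    integrand = np.exp(-n * toy_loss(w)) * prior_density
    # Trapezoidal rule written out by hand to avoid NumPy version differences.
    z = np.sum(0.5 * (integrand[1:] + integrand[:-1]) * np.diff(w))
    return -np.log(z)

n = 1000
f_regular = local_free_energy(n, -1.5, -0.5)    # neighbourhood of the quadratic minimum
f_degenerate = local_free_energy(n, 0.5, 1.5)   # neighbourhood of the quartic minimum

print("free energy near w = -1:", f_regular)
print("free energy near w = +1:", f_degenerate)
```

Both minima have the same loss, so the lower free energy around the quartic minimum reflects only that more posterior mass sits in its flatter neighbourhood. The global normalisation constant would shift both numbers equally and so drops out of the comparison.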

DSLT 4. Phase Transitions in Neural Networks

DSLT 3. Neural Networks are Singular

TLDR; This is the fourth main post of Distilling Singular Learning Theory which is introduced in DSLT0. I explain how to relate SLT to thermodynamics, and therefore how to think about phases and phase transitions in the posterior in statistical learning. I then provide intuitive examples of first and second order phase transitions in a simple $K (w)$ loss function. Finally, I experimentally demonstrate phase transitions in two-layer ReLU neural networks associated to the node-degeneracy and orientation-reversing phases established in DSLT3, which we can understand precisely through the lens of SLT.

In deep learning, the terms "phase" and "phase transition" are often used in an informal manner to refer to a steep change in a... (read 4661 more words →)

DSLT 2. Why Neural Networks obey Occam's Razor

TLDR; This is the third main post of Distilling Singular Learning Theory which is introduced in DSLT0. I explain that neural networks are singular models because of the symmetries in parameter space that produce the same function, and introduce a toy two-layer ReLU neural network setup where these symmetries can be perfectly classified. I provide motivating examples of each kind of symmetry, with particular emphasis on the non-generic node-degeneracy and orientation-reversing symmetries that give rise to interesting phases to be studied in DSLT4.

As we discussed in DSLT2, singular models have the capacity to generalise well because the effective dimension of a singular model, as measured by the RLCT, can be less... (read 5478 more words →)

DSLT 0. Distilling Singular Learning Theory

TLDR; This is the second main post of Distilling Singular Learning Theory which is introduced in DSLT0. I synthesise how Watanabe's free energy formula explains why neural networks have the capacity to generalise well, since different regions of the loss landscape have different accuracy-complexity tradeoffs. I also provide some simple intuitive examples that visually demonstrate why true parameters (i.e. optimally accurate parameters) are preferred according to the RLCT as $n \to \infty$ , and why non-true parameters can still be preferred at finite $n$ if they have lower RLCTs, due to the accuracy-complexity tradeoff. (The RLCT is introduced and explained in DSLT1).
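The accuracy-complexity tradeoff can be sketched with a little arithmetic. The snippet below is an illustration, not the post's own example: it keeps only the leading terms of the free energy formula, $F_{n} \approx n L + \lambda \log n$, dropping the $\log \log n$ and $O_{p} (1)$ terms, and the two "phases" with their loss and RLCT values are made-up numbers chosen to show the crossover.

```python
import math

def leading_free_energy(n, loss, rlct):
    """Leading-order free energy n * loss + rlct * log(n); lower-order terms are dropped."""
    return n * loss + rlct * math.log(n)

# Hypothetical phases: (loss at the phase's best parameter, RLCT of the phase).
phases = {
    "true": (0.0, 2.0),     # optimally accurate but more complex
    "simple": (0.01, 0.5),  # slightly inaccurate but lower RLCT
}

def preferred_phase(n):
    # The posterior prefers the phase with the lowest free energy.
    return min(phases, key=lambda name: leading_free_energy(n, *phases[name]))

for n in (100, 1000, 10000):
    energies = {name: round(leading_free_energy(n, *p), 3) for name, p in phases.items()}
    print(n, energies, "->", preferred_phase(n))
```

With these illustrative numbers the lower-RLCT phase wins at small $n$, because the $\lambda \log n$ complexity term dominates, while the accurate phase wins once the $n L$ accuracy term takes over.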

It is an amazing fact that deep neural networks seem to have an inductive bias towards "simple"... (read 5006 more words →)

DSLT 1. The RLCT Measures the Effective Dimension of Neural Networks

TLDR; In this sequence I distill Sumio Watanabe's Singular Learning Theory (SLT) by explaining the essence of its main theorem - Watanabe's Free Energy Formula for Singular Models - and illustrating its implications with intuition-building examples. I then show why neural networks are singular models, and demonstrate how SLT provides a framework for understanding phases and phase transitions in neural networks.

Epistemic status: The core theorems of Singular Learning Theory have been rigorously proven and published by Sumio Watanabe across 20 years of research. Precisely what it says about modern deep learning, and its potential application to alignment, is still speculative.

Acknowledgements: This sequence has been produced with the support of a grant from... (read 1497 more words →)