All of Liam Carroll's Comments + Replies

Ah! Thanks for that - it seems the general playlist organising them has splintered a bit, so here is the channel containing the lectures, the structure of which is explained here. I'll update this post accordingly.

Thanks for writing this out Joar, it is a good exercise of clarification for all of us.

Perhaps a boring comment, but I do want to push back on the title ever so slightly: imo it should be My Criticism of SLT Proponents, i.e. people (like me) who have interpreted some aspects in perhaps an erroneous fashion (according to you).

Sumio Watanabe is incredibly careful to provide highly precise mathematical statements with rigorous proofs and at no point does he make claims about the kind of "real world deep learning" phenomena being discussed here. The only sense...

1 · Joar Skalse · 3mo
Yes, I completely agree. The theorems that have been proven by Watanabe are of course true and non-trivial facts of mathematics; I do not mean to dispute this. What I do criticise is the magnitude of the significance of these results for the problem of understanding the behaviour of deep learning systems.

I would argue that the title is sufficiently ambiguous as to what is being claimed, and actually the point of contention in (ii) was discussed in the comments there too. I could have changed it to Why Neural Networks can obey Occam's Razor, but I think this obscures the main point. Regular linear regression could also obey Occam's razor (i.e. "simpler" models are possible) if you set high-order coefficients to 0, but the posterior of such models does not concentrate on those points in parameter space. 

At the time of writing, basically nobody knew anyt...

2 · Joar Skalse · 3mo
I think even this would be somewhat inaccurate (in my opinion). If a given parametric Bayesian learning machine does obey (some version of) Occam's razor, then this must be because of some facts related to its prior, and because of some facts related to its parameter-function map. SLT does not say very much about either of these two things. What the post is about is primarily the relationship between the RLCT and posterior probability, and how this relationship can be used to reason about training dynamics. To connect this to Occam's razor (or inductive bias more broadly), further assumptions and claims would be required.

Yes, thank you so much for taking the time to write those posts! They were very helpful for me to learn the basics of SLT.

I'm very glad to hear that! :)

Yes, I also believe that! The polynomial example is definitely pathological, and I do think that low λ almost certainly is correlated with simplicity in the case of neural networks. My point is more that the mathematics of SLT does not explain generalisation, and that additional assumptions definitely will be needed to derive specific claims about the inductive bias of neural networks.

Good question! The proof of the exact symmetries of this setup, i.e. the precise form of , is highly dependent on the ReLU. However, the general phenomenon I am discussing is applicable well beyond ReLU to other non-linearities. I think there are two main components to this:

  1. Other non-linearities induce singular models. As you note, other non-linear activation functions do lead to singular models. @mfar did some great work on this for tanh networks. Even though the activation function is important, note that the better intuition to have is that
...
Yeah, I agree with everything you say; it's just that I was trying to remind myself of enough of SLT to give a 'five minute pitch' for SLT to other people, and I didn't like the idea that I'm hanging it off the ReLU.

I guess the intuition behind the hierarchical nature of the models leading to singularities is the permutation symmetry between the hidden channels, which is kind of an easy thing to understand.

I get and agree with your point about approximate equivalences, though I have to say that I think we should be careful! One reason I'm interested in SLT is that I spent a lot of time during my PhD on Bayesian approximations to NN posteriors. I think SLT is one reasonable explanation of why this never yielded great results, but I think hand-wavy intuitions about 'oh well the posterior is probably-sorta-gaussian' played a big role in its longevity as an idea.

Yeah, it's not totally clear what this 'nearly singular' thing would mean. Intuitively, it might be that there's a kind of 'hidden singularity' in the space of this model that might affect the behaviour, like the singularity in a dynamic model with a phase transition. But I'm just guessing.

Edit: Originally the sequence was going to contain a post about SLT for Alignment, but this can now be found here instead, where a new research agenda, Developmental Interpretability, is introduced. I have also now included references to the lectures from the recent SLT for Alignment Workshop in June 2023. 

Only in the illegal ways, unfortunately. Perhaps your university has access?

"Discontinuity" might suggest that this happens fast. Yet, e.g. in work on grokking, it actually turns out that these "sudden changes" happen over a majority of the training time (often, the x-axis is on a logarithmic scale). Is this compatible, or would this suggest that phenomena like grokking aren't related to the phase transitions predicted by SLT?

This is a great question and something that came up at the recent summit. We would definitely say that the model is in two different phases before and after grokking (i.e. when the test error is flat), but it...

However, is it correct that we need the "underlying truth" to study symmetries that come from other degeneracies of the Fisher information matrix? After all, this matrix involves the true distribution in its definition. The same holds for the Hessian of the KL divergence.

The definition of the Fisher information matrix does not refer to the truth  whatsoever. (Note that in the definition I provide I am assuming the supervised learning case where we know the input distribution , meaning the model is , which ...

Now, for the KL-divergence, the situation seems more extreme: The zeros are also, at the same time, the minima of $K$, and thus, the derivative disappears at every point in the set $W_0$. This suggests every point in $W_0$ is singular. Is this correct?

Correct! So, the point is that things get interesting when $W_0$ is more than just a single point (which is the regular case). In essence, singularities are local minima of $K$. In the non-realisable case this means they are zeroes of the minimum-loss level set. In fact we can abuse ...

2 · Leon Lang · 7mo
Thanks for the reply! So if $K(w) = w^2$, then $w = 0$ is a singularity but not a strict singularity, do you agree? It still feels like somewhat bad terminology to me, but maybe it's justified from the algebraic-geometry perspective.

Can you tell more about why it is a measure of posterior concentration?


Are you claiming that most of that work happens very localized in a small parameter region?

Given a small neighbourhood , the free energy is  and  measures the posterior concentration in  since 

where the inner term is the posterior, modulo its normalisation constant . The key here is that if we are comparing different regions of parameter space , then the free energy doesn't care about t...
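For reference, the local free energy argument in this reply can be sketched as follows. The notation is an assumption made for illustration: $\mathcal{W}$ denotes the small neighbourhood, while $L_n$, $\varphi$ and $Z_n$ are the empirical loss, prior and global normalisation constant that appear elsewhere in this thread.

```latex
Z_n(\mathcal{W}) = \int_{\mathcal{W}} e^{-n L_n(w)} \, \varphi(w) \, dw,
\qquad
F_n(\mathcal{W}) = -\log Z_n(\mathcal{W}),
\qquad
\int_{\mathcal{W}} p(w \mid D_n) \, dw = \frac{Z_n(\mathcal{W})}{Z_n}.
```

Under these definitions the global constant $Z_n$ cancels when two regions are compared, so a region with lower local free energy $F_n(\mathcal{W})$ holds more posterior mass.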

2 · Leon Lang · 8mo
Thanks for the answer! I think my first question was confused because I didn't realize you were talking about local free energies instead of the global one :)

Oh, I actually may have missed that aspect of your answer back then. I'm confused by that: in algebraic geometry, the zeros of a set of polynomials are not necessarily already singularities. E.g., in $f(x,y) = xy$, the zero set consists of the two axes, which form an algebraic variety, but only at $(0,0)$ is there a singularity because the derivative disappears. Now, for the KL-divergence, the situation seems more extreme: The zeros are also, at the same time, the minima of $K$, and thus, the derivative disappears at every point in the set $W_0$. This suggests every point in $W_0$ is singular. Is this correct?

So far, I thought "being singular" means the effective number of parameters around the singularity is lower than the full number of parameters. Also, I thought that it's about the rank of the Hessian, not the vanishing of the derivative. Both perspectives contradict the interpretation in the preceding paragraph, which leaves me confused.

I vaguely remember that there is a part in the MDL book by Grünwald where he explains how using a good prior such as the Jeffreys prior somewhat changes asymptotic behavior for $n \to \infty$, but I'm not certain of that.
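The two notions in this comment (vanishing derivative versus rank of the Hessian) can be compared on a toy case. The sketch below is an illustration only: it takes Leon's example $f(x,y) = xy$ and, as an assumption not made in the thread, uses $K = f^2$ as a stand-in for a non-negative loss with the same zero set. Because $K \ge 0$ attains its minimum on the whole zero set, its gradient vanishes there, while the gradient of $f$ vanishes only at the origin.

```python
# Toy illustration (an assumed example, not from the thread):
# f(x, y) = x*y versus the non-negative function K = f**2.

def grad_f(x, y):
    """Gradient of f(x, y) = x*y."""
    return (y, x)

def grad_K(x, y):
    """Gradient of K(x, y) = (x*y)**2."""
    return (2 * x * y**2, 2 * x**2 * y)

def hessian_K(x, y):
    """Hessian of K(x, y) = (x*y)**2 as a 2x2 nested tuple."""
    return ((2 * y**2, 4 * x * y),
            (4 * x * y, 2 * x**2))

def rank_2x2(m, tol=1e-12):
    """Rank of a 2x2 matrix, from its determinant and largest entry."""
    (a, b), (c, d) = m
    if abs(a * d - b * c) > tol:
        return 2
    if max(abs(a), abs(b), abs(c), abs(d)) > tol:
        return 1
    return 0

# Points on the shared zero set {x*y = 0}: one on each axis, plus the origin.
points = [(1.0, 0.0), (0.0, 2.0), (0.0, 0.0)]
report = {p: (grad_f(*p), grad_K(*p), rank_2x2(hessian_K(*p))) for p in points}

for p, (gf, gK, r) in report.items():
    print(p, "grad f =", gf, "| grad K =", gK, "| rank Hess K =", r)
```

In this toy case the gradient of $K$ is zero at all three points, and the Hessian of $K$ is rank-deficient everywhere on the zero set (rank 1 on the axes, rank 0 at the origin), so the two perspectives agree here rather than conflict.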

Thanks for the comment Leon! Indeed, in writing a post like this, there are always tradeoffs in which pieces of technicality to dive into and which to leave sufficiently vague so as to not distract from the main points. But these are all absolutely fair questions so I will do my best to answer them (and make some clarifying edits to the post, too). In general I would refer you to my thesis where the setup is more rigorously explained.

Should I think of this as being equal to , and would you call this quantity ? I was a bit confu

...
1 · Leon Lang · 8mo
Thanks for the answer Liam! I especially liked the further context on the connection between Bayesian posteriors and SGD. Below a few more comments on some of your answers:

I think I still disagree. I think everything in these formulas needs to be conditioned on the $X$-part of the dataset. In particular, I think the notation $p(D_n)$ is slightly misleading, but maybe I'm missing something here.

I'll walk you through my reasoning: When I write $(X_i)$ or $(Y_i)$, I mean the whole vectors, e.g., $(X_i)_{i=1,\dots,n}$. Then I think the posterior computation works as follows:

$$p(w \mid D_n) = p(w \mid (Y_i), (X_i)) = \frac{p((Y_i) \mid (X_i), w) \cdot p(w \mid (X_i))}{p((Y_i) \mid (X_i))}.$$

That is just Bayes' rule, conditioned on $(X_i)$ in every term. Then, $p(w \mid (X_i)) = \varphi(w)$ because from $X$ alone you don't get any new information about the conditional $q(Y \mid X)$ (a more formal way to see this is to write down the Bayesian network of the model and to see that $w$ and $X_i$ are d-separated). Also, conditioned on $w$, $p$ is independent over data points, and so we obtain

$$p(w \mid D_n) = \frac{1}{p((Y_i) \mid (X_i))} \cdot e^{-n L_n(w)} \cdot \varphi(w).$$

So, comparing with your equations, we must have $Z_n = p((Y_i) \mid (X_i))$. Do you think this is correct?

Btw., I still don't think this "factors over $i$". I think that $Z_n \neq \prod_{i=1}^{n} p(Y_i \mid X_i)$. The reason is that old data points should inform the parameter $w$, which should have an influence on future updates. I think the independence assumption only holds for the true distribution and the model conditioned on $w$.

Right, that makes sense, thank you! (I think you missed a factor of $n/2$, but that doesn't change the conclusion.)

Thanks also for the corrected volume formula, it makes sense now :)
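The claim that the evidence does not factor over data points can be checked on a toy model. The sketch below is an assumed example, not the setup from the post: it drops the inputs $X_i$ entirely and uses $y \sim \mathrm{Bernoulli}(w)$ with a uniform prior on $w$, for which the marginal likelihood has a closed form. The function names are hypothetical.

```python
from fractions import Fraction
from math import factorial

def joint_evidence(ys):
    """Marginal likelihood of binary data under y ~ Bernoulli(w), w ~ Uniform(0, 1).

    Integrating w**k * (1 - w)**(n - k) over [0, 1] gives k! (n - k)! / (n + 1)!.
    """
    n, k = len(ys), sum(ys)
    return Fraction(factorial(k) * factorial(n - k), factorial(n + 1))

def product_of_single_evidences(ys):
    """Product of the one-point marginals, i.e. what 'factoring over i' would give."""
    result = Fraction(1)
    for y in ys:
        result *= joint_evidence([y])
    return result

ys = [1, 1]
z_joint = joint_evidence(ys)                   # 2! 0! / 3! = 1/3
z_factored = product_of_single_evidences(ys)   # (1/2) * (1/2) = 1/4
print(z_joint, z_factored)
```

Seeing the first $y = 1$ shifts the posterior on $w$ upwards, which makes a second $y = 1$ more likely than under the prior; this is why the joint evidence exceeds the product of the single-point marginals in this toy case.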

Thanks for writing that, I look forward to reading. 

As for nomenclature, I did not define it - the sequence is called Distilling SLT, and this is the definition offered by Watanabe. But to add some weight to it, the point is that in the Bayesian setting, the predictive distribution is a reasonable object to study from the point of view of generalisation, because it says: "what is the probability of this output given this input and given the data of the posterior". The Bayes training loss  (which I haven't delved into in this post) is the e...

With all due respect, I think you are misrepresenting what I am saying here. The sentence after your quote ends is

its relation to SGD dynamics is certainly an open question.

What is proven by Watanabe is that the Bayesian generalisation error, as I described in detail in the post, strongly depends on the singularity structure of the minima of $K$, as measured by the RLCT $\lambda$. This fact is proven in [Wat13] and explained in more detail in [Wat18]. As I elaborate on in the post, translating this statement into the SGD / frequentist setting is a...

I said I'd write a post, and I wrote a post. I think calling that "Bayes generalisation error" is where you went wrong. I see no good basis for saying that's true in the sense people normally mean "generalization". I understand some things about a Free Energy Formula are proved, but I don't think you've shown anything about low RLCT points tending to be the sort of useful solutions which neural networks find.

What is your understanding? It is indeed a deep mathematical theory, but it is not convoluted. Watanabe proves the FEF, and shows the RLCT is the natural generalisation of complexity in this setting. There is a long history of deep/complicated mathematics, with natural (and beautiful) theorems at the core, being pivotal to describing real world phenomena. 
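For readers unfamiliar with the result, the FEF is usually stated in roughly the following form. This is a sketch under assumed notation: $w_0$ is an optimal parameter, $\lambda$ the RLCT, $m$ its multiplicity, $d$ the number of parameters, and $G_n$ the Bayes generalisation error measured as the KL divergence from the truth to the Bayes predictive distribution in the realisable case.

```latex
F_n = n L_n(w_0) + \lambda \log n - (m - 1) \log\log n + O_p(1),
\qquad
\mathbb{E}[G_n] = \frac{\lambda}{n} + o\!\left(\frac{1}{n}\right).
```

For regular models $\lambda = d/2$ and $m = 1$, which recovers the familiar $\frac{d}{2} \log n$ penalty of the BIC; in singular models $\lambda$ can be smaller than $d/2$.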

The point of the posts is not to argue that we can prove why particular architectures perform better than others (yet). This field has had, comparatively, very little work done to it yet within...


The way I've structured the sequence means these points are interspersed throughout the broader narrative, but it's a great question so I'll provide a brief summary here, and as they are released I will link to the relevant sections in this comment.

  • In regular model classes, the set of true parameters $W_0$ that minimise the loss $K(w)$ is a single point. In singular model classes, it can be significantly more than a single point. Generally, it is a higher-dimensional structure. See here in DSLT1.
  • In regular model classes, every point on 
...

If a model is singular, then Watanabe’s Free Energy Formula (FEF) can have big implications for the geometry of the loss landscape. Whether or not a particular neural network model is singular does indeed depend on its activation function, amongst other structures in its architecture.

In DSLT3 I will outline the ways simple two-layer feedforward ReLU neural networks are singular models (i.e. I will show the symmetries in parameter space that produce the same input-output function), which is generalisable to deeper feedforward ReLU networks. There I will also ...
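As a rough illustration of what "symmetries in parameter space that produce the same input-output function" means, the sketch below checks two standard symmetries of a two-layer ReLU network with scalar input: permuting hidden nodes, and rescaling one node's incoming weights by $a > 0$ while dividing its outgoing weight by $a$. The network form, the parameter values and the function names are assumptions chosen for illustration; whether these are exactly the symmetries treated in DSLT3 is not stated here.

```python
def relu(z):
    return max(0.0, z)

def two_layer_relu(x, hidden, output_bias=0.0):
    """f(x) = c + sum_i q_i * relu(w_i * x + b_i) for scalar input x.

    `hidden` is a list of (w_i, b_i, q_i) triples, one per hidden node.
    """
    return output_bias + sum(q * relu(w * x + b) for (w, b, q) in hidden)

# Hypothetical parameter values, chosen only for illustration.
theta = [(1.5, -0.5, 2.0), (-0.75, 1.0, 0.5)]

# Permutation symmetry: reorder the hidden nodes.
theta_permuted = [theta[1], theta[0]]

# Scaling symmetry: for a > 0, relu(a*z) = a*relu(z), so scaling (w, b) by a
# and q by 1/a leaves that node's contribution unchanged.
a = 4.0
w0, b0, q0 = theta[0]
theta_scaled = [(a * w0, a * b0, q0 / a), theta[1]]

# Compare outputs of the original and transformed parameters on a few inputs.
xs = [-2.0, -0.5, 0.0, 0.3, 1.0, 3.0]
max_gap = max(
    max(abs(two_layer_relu(x, theta) - two_layer_relu(x, t)) for x in xs)
    for t in (theta_permuted, theta_scaled)
)
print("largest output difference:", max_gap)
```

Distinct parameter vectors that give the same function are what make the parameter-to-function map non-injective, which is the property the comment above ties to singular models.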

This is all pretty complicated compared to my understanding of why neural networks generalize, and I'm not sure why I should prefer it. Does this complex and detailed theory have any concrete predictions about NN design or performance in different circumstances? Can you accurately predict which activation functions work well? My view is that this "singularity" of networks - which I don't think is a good term, it's already overloaded with far too many meanings - is applicable to convergence properties but not to generalization ability.