## LessWrong

Liam Carroll

Mathematician, musician, hiking guide.

Website: lemmykc.github.io

# Sequences

Distilling Singular Learning Theory

# Wiki Contributions

Thanks for writing this out Joar, it is a good exercise of clarification for all of us.

Perhaps a boring comment, but I do want to push back on the title ever so slightly: imo it should be My Criticism of SLT Proponents, i.e. people (like me) who have interpreted some aspects in perhaps an erroneous fashion (according to you).

Sumio Watanabe is incredibly careful to provide highly precise mathematical statements with rigorous proofs and at no point does he make claims about the kind of "real world deep learning" phenomena being discussed here. The only sense in which it seems you critique the theory of SLT itself is that perhaps it isn't as interesting as the number of pages taken to prove its main theorems suggests it should be, but even then it seems you agree that these are non-trivial statements.

I think it is important for people coming to the field for the first time to understand that the mathematical theory is incredibly solid, whilst its interpretation and applicability to broader "real world" problems is still an open question that we are actively working on.

I would argue that the title is sufficiently ambiguous as to what is being claimed, and actually the point of contention in (ii) was discussed in the comments there too. I could have changed it to Why Neural Networks can obey Occam's Razor, but I think this obscures the main point. Regular linear regression could also obey Occam's razor (i.e. "simpler" models are possible) if you set high-order coefficients to 0, but the posterior of such models does not concentrate on those points in parameter space.

At the time of writing, basically nobody knew anything about SLT, so I think it was warranted to err on the side of grabbing attention in the introductory paragraphs and then explaining in detail further on with "we can now understand why singular models have the capacity to generalise well", instead of caveating the whole topic out of existence before the reader knows what is going on.

As we discussed at Berkeley, I do like the polynomial example you give and this whole discussion has made me think more carefully about various aspects of the story, so thanks for that. My inclination is that the polynomial example is actually quite pathological and that there is a reasonable correlation between the RLCT and Kolmogorov complexity in practice (e.g. the one-node subnetwork preferred by the posterior compared to the two-node network in DSLT4), but I don't know enough about Kolmogorov complexity to say much more than that.

Good question! The proof of the exact symmetries of this setup, i.e. the precise form of $W_0$ (the set of true parameters), is highly dependent on the ReLU. However, the general phenomenon I am discussing applies well beyond ReLU to other non-linearities. I think there are two main components to this:

1. Other non-linearities induce singular models. As you note, other non-linear activation functions do lead to singular models. @mfar did some great work on this for tanh networks. Even though the activation function is important, note that the better intuition to have is that the hierarchical nature of a model (e.g. neural networks) is what makes them singular. Deep linear networks are still singular despite an identity activation function. Think of the activation as giving the model more expressiveness.
2. Even if $W_0$ is uninteresting, the loss landscape might be "nearly singular". The ReLU has an analytic approximation, the Swish function $\sigma_\beta(x) = \frac{x}{1 + e^{-\beta x}}$, where $\lim_{\beta \to \infty} \sigma_\beta(x) = \mathrm{ReLU}(x)$, which does not yield the same symmetries as discussed in this post. This is because the activation boundaries are no longer a sensible thing to study (the Swish function is "always active" in all subsets of the input domain), which breaks down a lot of the analysis used here.

Suppose, however, that we take a Swish parameter $\beta$ that is so large that from the point of view of your computer, $\sigma_\beta(x) = \mathrm{ReLU}(x)$ (i.e. their difference is within machine-epsilon). Even though $W_0^{\text{Swish}}$ is now a very different object to $W_0^{\text{ReLU}}$ on paper, the loss landscape will be approximately equal, $K^{\text{Swish}}(w) \approx K^{\text{ReLU}}(w)$, meaning that the Bayesian posterior will be practically identical between the two functions and induce the same training dynamics.

So, whilst the precise functional equivalences might be very different across activation functions (differing $W_0$), there might be many approximate functional equivalences. This is also the sense in which we can wave our arms about "well, SLT only applies to analytic functions, and ReLU isn't analytic, but who cares". Making precise mathematical statements about this "nearly singular" phenomenon - for example, how does the posterior change as you lower $\beta$ in $\sigma_\beta$? - is under-explored at present (to the best of my knowledge), but it is certainly not something that discredits SLT for all of the reasons I have just explained.
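As a rough numerical illustration of the machine-epsilon remark, the sketch below compares the Swish function with the ReLU on a grid of inputs for a few values of $\beta$. The input range, the grid, and the chosen values of $\beta$ are assumptions made purely for illustration and are not taken from the post.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def swish(x, beta):
    # x * sigmoid(beta * x); the tanh form of the sigmoid avoids overflow for large |beta * x|
    return x * 0.5 * (1.0 + np.tanh(0.5 * beta * x))

def max_gap(beta, xs):
    """Largest pointwise difference |swish - ReLU| over the grid xs."""
    return float(np.max(np.abs(swish(xs, beta) - relu(xs))))

xs = np.linspace(-3.0, 3.0, 600_001)   # hypothetical input range
betas = (1.0, 10.0, 100.0, 1000.0)     # hypothetical values of beta
gaps = {beta: max_gap(beta, xs) for beta in betas}

for beta, gap in gaps.items():
    print(f"beta={beta:g}: max gap={gap:.3e}, beta * gap={beta * gap:.4f}")
```

On this grid the largest gap shrinks like roughly $0.28 / \beta$, so under these assumptions it would take $\beta$ of order $10^{15}$ before the two activations agree to float64 machine-epsilon (about $2.2 \times 10^{-16}$) everywhere.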

Edit: Originally the sequence was going to contain a post about SLT for Alignment, but this can now be found here instead, where a new research agenda, Developmental Interpretability, is introduced. I have also now included references to the lectures from the recent SLT for Alignment Workshop in June 2023.

Only in the illegal ways, unfortunately. Perhaps your university has access?

"Discontinuity" might suggest that this happens fast. Yet, e.g. in work on grokking, it actually turns out that these "sudden changes" happen over a majority of the training time (often, the x-axis is on a logarithmic scale). Is this compatible, or would this suggest that phenomena like grokking aren't related to the phase transitions predicted by SLT?

This is a great question and something that came up at the recent summit. We would definitely say that the model is in two different phases before and after grokking (i.e. when the test error is flat), but it's an interesting question to consider what's going on over that long period of time where the error is slowly decreasing. I imagine that it is a relatively large model (from an SLT point of view, which means not very large at all from a normal ML point of view), meaning there would be a plethora of different singularities in the loss landscape. My best guess is that it is undergoing many phase transitions across that entire period, where it is finding regions of lower and lower RLCT but equal accuracy. I expect there to be some work done in the next few months applying SLT to the grokking work.

As far as I know, modern transformers are often only trained once on each data sample, which should close the gap between SGD time and the number of data samples quite a bit. Do you agree with that perspective?

This is a very interesting point. I broadly agree with this and think it is worth thinking more about, and could be a very useful simplifying assumption in considering the connection between SGD and SLT.

In general, it seems to me that we're probably most interested in phase transitions that happen across SGD time or with more data samples, whereas phase transitions related to other hyperparameters (for example, varying the truth as in your examples here) are maybe less crucial. Would you agree with that?

Broadly speaking, yes. With that said, hyperparameters in the model are probably interesting too (although maybe more from a capabilities standpoint). I think phase transitions in the truth are also probably interesting in the sense of dataset bias, i.e. what changes about a model's behaviour when we include or exclude certain data? Worth noting here that the Toy Models of Superposition work explicitly deals in phase transitions in the truth, so there's definitely a lot of value to be had from studying how variations in the truth induce phase transitions, and what these ramifications are in other things we care about.

Would you expect that most phase transitions in SGD time or the number of data samples are first-order transitions (as is the case when there is a loss-complexity tradeoff), or can you conceive of second-order phase transitions that might be relevant in that context as well?

At a first pass, one might say that second-order phase transitions correspond to something like the formation of circuits. I think there are definitely reasons to believe both happen during training.

Which altered the posterior geometry, but not that of  since  (up to a normalisation factor).

I didn't understand this footnote.

I just mean that  is not affected by  (even though of course  or  is), but the posterior is still affected by . So the phase transition merely concerns the posterior and not the loss landscape.

but the node-degeneracy and orientation-reversing symmetries only occur under precise configurations of the truth.

Hhm, I thought that these symmetries are about configurations of the parameter vector, irrespective of whether it is the "true" vector or not.

My use of the word "symmetry" here is probably a bit confusing and a hangover from my thesis. What I mean is, these two configurations are only in the set of true parameters in this setup when the truth is configured in a particular way. In other words, they are always local minima of , but not always global minima. (This is what PT1 shows when ). Thanks for pointing this out.

It seems to me that in the other phase, the weights also annihilate each other, so the "non-weight annihilation phase" is a somewhat weird terminology. Or did I miss something?

Huh, I'd never really thought of this, but I now agree it is slightly weird terminology in some sense. I probably should have called them the weight-cancellation and non-weight-cancellation phases as I described in the reply to your DSLT3 comment. My bad. I think it's a bit too late to change now, though.

I think there is a typo and you meant .

Thanks! And thanks for reading all of the posts so thoroughly and helping clarify a few sloppy pieces of terminology and notation, I really appreciate it.

However, is it correct that we need the "underlying truth" to study symmetries that come from other degeneracies of the Fisher information matrix? After all, this matrix involves the true distribution in its definition. The same holds for the Hessian of the KL divergence.

The definition of the Fisher information matrix does not refer to the truth $q(y|x)$ whatsoever. (Note that in the definition I provide I am assuming the supervised learning case where we know the input distribution $q(x)$, meaning the model is $p(y, x | w) = p(y | x, w) q(x)$, which is why the $q(x)$ shows up in the formula I just linked to. The derivative terms do not explicitly include $q(x)$ because it just vanishes in the $w$ derivative anyway, so it's irrelevant there. But remember, we are ultimately interested in modelling the conditional true distribution $q(y|x)$ in $q(y, x) = q(y|x) q(x)$.)

What do you mean with non-weight-annihilation here? Don't the weights annihilate in both pictures?

You're right, that's sloppy terminology from me. What I mean is, in the right hand picture (that I originally labelled WA), there is a region in which all nodes are active, but cancel out to give zero effective gradient, which is markedly different to the left hand picture. I have edited this to NonWC and WC instead to clarify, thanks!

Now, for the KL-divergence, the situation seems more extreme: The zeroes are also, at the same time, the minima of $K(w)$, and thus, the derivative disappears at every point in the set $W_0$. This suggests every point in $W_0$ is singular. Is this correct?

Correct! So, the point is that things get interesting when $W_0$ is more than just a single point (which is the regular case). In essence, singularities are local minima of $K(w)$. In the non-realisable case this means they are zeroes of the minimum-loss level set. In fact we can abuse notation a bit and really just refer to any local minimum of $K(w)$ as a singularity. The TLDR of this is:

So far, I thought "being singular" means the effective number of parameters around the singularity is lower than the full number of parameters. Also, I thought that it's about the rank of the Hessian, not the vanishing of the derivative. Both perspectives contradict the interpretation in the preceding paragraph, which leaves me confused.

As I show in the examples in DSLT1, having degenerate Fisher information (i.e. degenerate Hessian at zeroes) comes in two essential flavours: having rank-deficiency, and having vanishing second-derivative (i.e. $K(w) = w^4$). Precisely, suppose $d$ is the number of parameters, then you are in the regular case if $K(w)$ can be expressed as a full-rank quadratic form near each singularity,

$$K(w) = \sum_{i=1}^{d} w_i^2.$$

Anything less than this is a strictly singular case.
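To make the two flavours of degeneracy concrete, here is a minimal sketch that estimates the rank of the Hessian of $K$ at the origin by finite differences. The three two-parameter potentials, the step size `h`, and the rank tolerance `tol` are hypothetical choices for illustration and are not the examples from DSLT1.

```python
import numpy as np

def hessian(K, w, h=1e-3):
    """Central finite-difference Hessian of K at the point w."""
    w = np.asarray(w, dtype=float)
    d = w.size
    H = np.zeros((d, d))
    for i in range(d):
        for j in range(d):
            e_i = np.zeros(d)
            e_j = np.zeros(d)
            e_i[i] = h
            e_j[j] = h
            H[i, j] = (K(w + e_i + e_j) - K(w + e_i - e_j)
                       - K(w - e_i + e_j) + K(w - e_i - e_j)) / (4.0 * h * h)
    return H

def hessian_rank(K, w, tol=1e-3):
    # Singular values below tol are treated as zero
    return int(np.linalg.matrix_rank(hessian(K, w), tol=tol))

origin = [0.0, 0.0]
examples = {
    "regular: w1^2 + w2^2": lambda w: w[0] ** 2 + w[1] ** 2,
    "rank-deficient: w1^2": lambda w: w[0] ** 2,
    "vanishing second derivative: w1^4 + w2^4": lambda w: w[0] ** 4 + w[1] ** 4,
}
ranks = {name: hessian_rank(K, origin) for name, K in examples.items()}

for name, rank in ranks.items():
    print(f"{name}: Hessian rank {rank} of 2")
```

Only the first potential has a full-rank Hessian at its minimum. The other two fall short in the two different ways described above.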

I vaguely remember that there is a part in the MDL book by Grünwald where he explains how using a good prior such as Jeffrey's prior somewhat changes asymptotic behavior for , but I'm not certain of that.

Watanabe has an interesting little section in the grey book [Remark 7.4, Theorem 7.4, Wat09] talking about the Jeffrey's prior. I haven't studied it in detail but to the best of my reading he is basically saying "from the point of view of SLT, the Jeffrey's prior is zero at singularities anyway, its coordinate-free nature makes it inappropriate for statistical learning, and the RLCT can only be  if the Jeffrey's prior is employed." (The last statement is the content of the theorem where he studies the poles of the zeta function when the Jeffrey's prior is employed).

Can you tell more about why it is a measure of posterior concentration.

...

Are you claiming that most of that work happens very localized in a small parameter region?

Given a small neighbourhood $\mathcal{W} \subset W$, the free energy is $F_n(\mathcal{W}) = -\log Z_n(\mathcal{W})$ and $Z_n(\mathcal{W})$ measures the posterior concentration in $\mathcal{W}$ since

$$Z_n(\mathcal{W}) = \int_{\mathcal{W}} e^{-n L_n(w)} \varphi(w) \, dw,$$

where the inner term is the posterior, modulo its normalisation constant $Z_n$. The key here is that if we are comparing different regions of parameter space $\mathcal{W}$, then the free energy doesn't care about that normalisation constant as it is just a shift in $F_n(\mathcal{W})$ by a constant. So the free energy gives you a tool for comparing different regions of the posterior. (To make this comparison rigorous, I suppose one would want to make sure that these regions $\mathcal{W}$ are the same "size". Another perspective, and really the main SLT perspective, is that if they are sufficiently small and localised around different singularities then this size problem isn't really relevant, and the free energy is telling you something about the structure of the singularity and the local geometry of $K(w)$ around the singularity).
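The comparison of local free energies can be sketched numerically. The example below is an illustration under stated assumptions rather than the setup of the post: it uses a hypothetical one-dimensional potential $K(w) = (w+1)^2 (w-1)^4$ in place of the empirical loss $L_n$, a flat prior, and two equal-width neighbourhoods around the minima at $w = -1$ and $w = +1$.

```python
import numpy as np

def K(w):
    # Hypothetical toy potential: quadratic minimum at w = -1, quartic minimum at w = +1
    return (w + 1.0) ** 2 * (w - 1.0) ** 4

def local_free_energy(n, lo, hi, num=200_001):
    """F_n(W) = -log Z_n(W), with Z_n(W) the integral of exp(-n K(w)) over W = [lo, hi].

    Assumptions: flat prior on W, and K standing in for the empirical loss L_n.
    """
    ws = np.linspace(lo, hi, num)
    dw = ws[1] - ws[0]
    Z = np.sum(np.exp(-n * K(ws))) * dw   # simple Riemann sum
    return -np.log(Z)

ns = np.array([1e3, 1e4, 1e5])
F_left = np.array([local_free_energy(n, -1.5, -0.5) for n in ns])   # around w = -1
F_right = np.array([local_free_energy(n, 0.5, 1.5) for n in ns])    # around w = +1

# The slope of F_n against log n estimates the coefficient of the log n term
slope_left = np.polyfit(np.log(ns), F_left, 1)[0]
slope_right = np.polyfit(np.log(ns), F_right, 1)[0]

print("F_n around w = -1:", F_left, "slope:", slope_left)
print("F_n around w = +1:", F_right, "slope:", slope_right)
```

In this toy setting both minima have zero loss, yet the neighbourhood of the quartic minimum has the lower free energy, and the fitted slopes come out close to $\frac{1}{2}$ and $\frac{1}{4}$ respectively.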

I am not quite sure what you mean with "it tells us something about the information geometry of the posterior"

This is sloppily written by me, apologies. I merely mean to say "the free energy tells us what models the posterior likes".

I didn't find a definition of the left expression.

I mean, the relation between  and  tells you that this is a sensible thing to write down, and if you reconstructed the left side from the right side you would simply find some definition in terms of the predictive distribution restricted to  (instead of  in the integral).

Purposefully naive question: can I just choose a region  that contains all singularities? Then it surely wins, but this doesn't help us because this region can be very large.

Yes - and as you say, this would be very uninteresting (and in general you wouldn't know what to pick necessarily [although we did in the phase transition DSLT4 because of the classification of $W_0$ in DSLT3]). The point is that at no point are you just magically "choosing" a $\mathcal{W}$ anyway. If you really want to calculate the free energy of some model setup then you would have a reason to choose different phases to analyse. Otherwise, the premise of this section of the post is to show that the geometry of $K(w)$ depends on the singularity structure and this varies across parameter space.

Possible correction: one of those points isn't a singularity, but a regular loss-minimizing point (as you also clarify further below).

As discussed in the comment in your DSLT1 question, they are both singularities of  since they are both critical points (local minima). But they are not both true parameters, nor are they both regular points with RLCT

How do you think about this? What are sensible choices of priors (or network initializations) from the SLT perspective?

I think the question of sensible priors has an interesting and an uninteresting angle to it. The interesting answer might involve something along the lines of reformulating the Jeffreys prior, as well as noticing that a Gaussian prior gives you a "regularisation" term (and can be thought of as adding the "simple harmonic oscillator" part to the story). The uninteresting answer is that SLT doesn't care about the prior (other than its regularity conditions) since it is irrelevant in the $n \to \infty$ limit. Also if you were concerned with the requirement for $W$ to be compact, you can just define it to be compact on the space of "numbers that my computer can deal with".

Also, I find it curious that in the second example, the posterior will converge to the lowest loss, but SGD would not since it wouldn't "manage to get out of the right valley", I assume. This seems to suggest that the Bayesian view of SGD can at most be true in high dimensions, but not for very low-dimensional neural networks. Would you agree with that, or what is your perspective?

Yes! We are thinking very much about this at the moment and I think this is the correct intuition to have. If one runs SGD on the potential wells , you find that it just gets stuck in the basin it was closest to. So, what's going on in high dimensions? It seems something about the way higher dimensional spaces are different from lower ones is relevant here, but it's very much an open problem.
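The "stuck in the nearest basin" behaviour is easy to reproduce in one dimension. The sketch below is only an illustration: it uses a hypothetical double-well potential $K(w) = (w+1)^2 (w-1)^4$, which may differ from the potential wells referred to above, and plain noise-free gradient descent rather than SGD. The learning rate, step count, and starting points are likewise assumptions.

```python
def grad_K(w):
    # Derivative of the hypothetical toy potential K(w) = (w + 1)^2 (w - 1)^4
    return 2.0 * (w + 1.0) * (w - 1.0) ** 3 * (3.0 * w + 1.0)

def gradient_descent(w0, lr=0.01, steps=20_000):
    """Plain (noise-free) gradient descent on K, started at w0."""
    w = w0
    for _ in range(steps):
        w -= lr * grad_K(w)
    return w

# The barrier between the two wells sits at w = -1/3
w_from_left = gradient_descent(-0.5)   # starts left of the barrier
w_from_right = gradient_descent(0.0)   # starts right of the barrier

print("start -0.5 ->", w_from_left)
print("start  0.0 ->", w_from_right)
```

Each run settles in the well on its own side of the barrier, even though a posterior over this potential would put most of its mass near the flatter minimum at $w = +1$ for large $n$.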

Thanks for the comment Leon! Indeed, in writing a post like this, there are always tradeoffs in which pieces of technicality to dive into and which to leave sufficiently vague so as to not distract from the main points. But these are all absolutely fair questions so I will do my best to answer them (and make some clarifying edits to the post, too). In general I would refer you to my thesis where the setup is more rigorously explained.

Should I think of this as being equal to , and would you call this quantity ? I was a bit confused since it seems like we're not interested in the data likelihood, but only the conditional data likelihood under model .

The partition function is equal to the model evidence , yep. It isn’t equal to (I assume  is fixed here?) but is instead expressed in terms of the model likelihood and prior (and can simply be thought of as the “normalising constant” of the posterior),

and then under this supervised learning setup where we know , we have . Also note that this does “factor over ” (if I’m interpreting you correctly) since the data is independent and identically distributed.

But the free energy does not depend on the parameter, so how should I interpret this claim? Are you already one step ahead and thinking about the singular case where the loss landscape decomposes into different "phases" with their own free energy?

Yep, you caught me - I was one step ahead. The free energy over the whole space $W$ is still a very useful quantity as it tells you “how good” the best model in the model class is. But $F_n$ by itself doesn’t tell you much about what else is going on in the loss landscape. For that, you need to localise to smaller regions and analyse their phase structure, as presented in DSLT2.

I think the first expression should either be an expectation over , or have the conditional entropy  within the parentheses.

Ah, yes, you are right - this is a notational hangover from my thesis where I defined  to be equal to expectation with respect to the true distribution . (Things get a little bit sloppy when you have this known  floating around everywhere - you eventually just make a few calls on how to write the cleanest notation, but I agree that in the context of this post it’s a little confusing so I apologise).

I briefly tried showing this and somehow failed. I didn't quite manage to get rid of the integral over . Is this simple? (You don't need to show me how it's done, but maybe mentioning the key idea could be useful)

See Lemma A.2 in my thesis. One uses a fairly standard argument involving the first central moment of a Gaussian.

The rest of the article seems to mainly focus on the case of the Fisher information matrix. In particular, you didn't show an example of a non-regular model where the Fisher information matrix is positive definite everywhere.

Is it correct to assume models which are merely non-regular because the map from parameters to distributions is non-injective aren't that interesting, and so you maybe don't even want to call them singular?

Yep, the rest of the article does focus on the case where the Fisher information matrix is degenerate because it is far more interesting and gives rise to an interesting singularity structure (i.e. most of the time it will yield an RLCT ). Unless my topology is horrendously mistaken, if one has a singular model class for which every parameter has a positive definite Fisher information, then this implies the non-identifiability condition simply means you have a set of isolated points  that all have the same RLCT . Thus the free energy will only depend on their inaccuracy , meaning every optimal parameter has the same free energy - not particularly interesting! An example of this would be something like the permutation symmetry of ReLU neural networks that I discuss in DSLT3

I found this slightly ambiguous, also because under your definitions further down, it seems like "singular" (degenerate Fisher information matrix) is a stronger condition than "strictly singular" (degenerate Fisher information matrix OR non-injective map from parameters to distributions).

I have clarified the terminology in the section where they are defined - thanks for picking me up on that. In particular, a singular model class can be either strictly singular or regular - Watanabe’s results hold regardless of identifiability or the degeneracy of the Fisher information. (Sometimes I might accidentally use the word "singular" to emphasise a model which "has non-regular points" - the context should make it relatively clear).

What is  in this formula? Is it fixed? Or do we average the derivatives over the input distribution?

Refer to Theorem 3.1 and Lemma 3.2 in my thesis. The Fisher information involves an integral wrt , so the Fisher information is degenerate iff that set is dependent as a function of , in other words, for all  values in the domain specified by  (well, more precisely, for all non-measure-zero regions as specified by ).

Hhm, I thought having a singular model just means that some singularities are degenerate.

Typo - thanks for that.

One unrelated conceptual question: when I see people draw singularities in the loss landscape, for example in Jesse's post, they often "look singular": i.e., the set of minimal points in the loss landscape crosses itself. However, this doesn't seem to actually be the case: a perfectly smooth curve of loss-minimizing points will consist of singularities because in the direction of the curve, the derivative does not change. Is this correct?

Correct! When we use the word “singularity”, we are specifically referring to singularities of $K(w)$ in the sense of algebraic geometry, so they are zeroes (or zeroes of a level set), and critical points with $\nabla K(w) = 0$. So, even in regular models, the single optimal parameter is a singularity of $K(w)$ - it is just a really, really uninteresting one. In SLT, every singularity needs to be put into normal crossing form via the resolution of singularities, regardless of whether it is a singularity in the sense that you describe (drawing self-intersecting curves, looking at cusps, etc.). But for cartoon purposes, those sorts of curves are good visualisation tools.

I think you forgot a  in the term of degree 1.

Typo - thanks.

Could you explain why that is? I may have missed some assumption on  or not paid attention to something.

If you expand that term out you find that

because the second integral is the first central moment of a Gaussian. The derivative of the prior is irrelevant.

Hhm. Is the claim that if the loss of the function does not change along some curve in the parameter space, then the function itself remains invariant? Why is that?

This is a fair question. When concerning the zeroes, by the formula for  when the truth is realisable one shows that

so any path in the set of true parameters (i.e. in this case the set ) will indeed produce the same input-output function. In general (away from the zeroes of ), I don’t think this is necessarily true but I’d have to think a bit harder about it. In this pathological case it is, but I wouldn’t get bogged down in it - I’m just saying “ tells us one parameter can literally be thrown out without changing anything about the model”. (Note here that  is literally a free parameter across all of ).

Are you sure this is the correct formula? When I tried computing this by hand it resulted in , but maybe I made a mistake.

Ah! Another typo - thank you very much. It should be

General unrelated question: is the following a good intuition for the correspondence of the volume with the effective number of parameters around a singularity? The larger the number of effective parameters  around , the more  blows up around  in all directions because we get variation in all directions, and so the smaller the region where  is below . So  contributes to this volume. This is in fact what it does in the formulas, by being an exponent for small

I think that's a very reasonable intuition to have, yep! Moreover, if one wants to compare the "flatness" between  versus , the point is that within a small neighbourhood of the singularity, a higher exponent (RLCTs of  and  respectively here) is "much flatter" than a low coefficient (the ). This is what the RLCT is picking up.
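The volume intuition can be checked numerically in one dimension. The sketch below is an illustration under assumptions: the potentials $w^2$ and $w^4$, the interval $[-1, 1]$, and the range of thresholds are hypothetical choices. It measures the volume of the region where $K(w)$ is below a threshold $\epsilon$ and fits the exponent with which that volume scales in $\epsilon$.

```python
import numpy as np

def sublevel_volume(K, eps, ws):
    """Approximate length of the set {w : K(w) < eps} on the grid ws."""
    dw = ws[1] - ws[0]
    return np.count_nonzero(K(ws) < eps) * dw

def volume_exponent(K, eps_values, ws):
    """Slope of log-volume against log-eps, i.e. the exponent in V(eps) ~ eps^lambda."""
    vols = np.array([sublevel_volume(K, eps, ws) for eps in eps_values])
    slope, _ = np.polyfit(np.log(eps_values), np.log(vols), 1)
    return slope

ws = np.linspace(-1.0, 1.0, 2_000_001)
eps_values = np.logspace(-6, -2, 9)

lam_quadratic = volume_exponent(lambda w: w ** 2, eps_values, ws)
lam_quartic = volume_exponent(lambda w: w ** 4, eps_values, ws)

print("exponent for w^2:", lam_quadratic)
print("exponent for w^4:", lam_quartic)
```

The fitted exponents come out close to $\frac{1}{2}$ for $w^2$ and $\frac{1}{4}$ for $w^4$: the flatter potential keeps more volume below small thresholds, which is what a smaller exponent expresses.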

Do you currently expect that gradient descent will do something similar, where the parameters will move toward singularities with low RLCT? What's the state of the theory regarding this?

We do expect that SGD is roughly equivalent to sampling from the Bayesian posterior and therefore that it moves towards regions of low RLCT, yes! But this is nonetheless just a postulate for the moment. If one treats  as a Hamiltonian energy function, then you can apply a full-throated physics lens to this entire setup (see DSLT4) and see that the critical points of  strongly affect the trajectories of the particles. Then the connection between SGD and SLT is really just the extent to which SGD is “acting like a particle subject to a Hamiltonian potential”. (A variant called SGLD seems to be just that, so maybe the question is under what conditions / to what extent does SGD = SGLD?). Running experiments that test whether variants of SGD end up in low RLCT regions of  is definitely a fruitful path forward.
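For readers who have not seen SGLD, here is a minimal sketch of the Langevin update it is built on, run on a hypothetical regular potential $K(w) = w^2$. It uses the exact gradient rather than a minibatch estimate, so it is Langevin dynamics rather than SGLD proper. The step size, chain count, value of $n$, and the potential are all assumptions for illustration.

```python
import numpy as np

def langevin_samples(grad_K, n, step=1e-4, num_steps=2_000, num_chains=4_000, seed=0):
    """Run independent Langevin chains targeting a density proportional to exp(-n K(w)).

    Each update is w <- w - (step / 2) * n * grad_K(w) + Normal(0, step) noise.
    SGLD uses the same update with the gradient estimated from a minibatch;
    here the gradient is exact to keep the sketch short.
    """
    rng = np.random.default_rng(seed)
    w = np.zeros(num_chains)
    for _ in range(num_steps):
        noise = rng.normal(0.0, np.sqrt(step), size=num_chains)
        w = w - 0.5 * step * n * grad_K(w) + noise
    return w

n = 100
samples = langevin_samples(lambda w: 2.0 * w, n)   # K(w) = w^2, so grad K = 2w

# For K(w) = w^2 the target exp(-n w^2) is Gaussian with variance 1 / (2n)
print("sample variance:", samples.var(), "target:", 1.0 / (2 * n))
```

In this regular case the chains recover the posterior variance. Whether SGD itself behaves like this sampler, and in which regimes, is the open question raised above.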