Liam Carroll

Mathematician, musician, hiking guide.

Website: lemmykc.github.io

3mo64

Thanks for writing this out, Joar; it is a good exercise in clarification for all of us.

Perhaps a boring comment, but I do want to push back on the title ever so slightly: imo it should be My Criticism of SLT *Proponents*, i.e. people (like me) who have interpreted some aspects in perhaps an erroneous fashion (according to you).

Sumio Watanabe is incredibly careful to provide highly precise mathematical statements with rigorous proofs and at no point does he make claims about the kind of "real world deep learning" phenomena being discussed here. The only sense in which it seems you critique the theory of SLT itself is that perhaps it isn't as interesting as the number of pages taken to prove its main theorems suggests it should be, but even then it seems you agree that these are non-trivial statements.

I think it is important for people coming to the field for the first time to understand that the mathematical theory is incredibly solid, whilst its *interpretation* and applicability to broader "real world" problems is still an open question that we are actively working on.

I would argue that the title is sufficiently ambiguous as to what is being claimed, and actually the point of contention in (ii) was discussed in the comments there too. I could have changed it to Why Neural Networks *can* obey Occam's Razor, but I think this obscures the main point. Regular linear regression could also obey Occam's razor (i.e. "simpler" models are possible) if you set high-order coefficients to 0, but the posterior of such models does not concentrate on those points in parameter space.

At the time of writing, basically *nobody* knew anything about SLT, so I think it was warranted to err on the side of grabbing attention in the introductory paragraphs and then explaining in detail further on with "we can now understand why singular models have the capacity to generalise well", instead of caveating the whole topic out of existence before the reader knows what is going on.

As we discussed at Berkeley, I do like the polynomial example you give and this whole discussion has made me think more carefully about various aspects of the story, so thanks for that. My inclination is that the polynomial example is actually quite pathological and that there is a reasonable correlation between the RLCT and Kolmogorov complexity in practice (e.g. the one-node subnetwork preferred by the posterior compared to the two-node network in DSLT4), but I don't know enough about Kolmogorov complexity to say much more than that.

Good question! The proof of the exact symmetries of this setup, i.e. the *precise* form of , is highly dependent on the ReLU. However, the general phenomenon I am discussing is applicable well beyond ReLU to other non-linearities. I think there are two main components to this:

**Other non-linearities induce singular models.** As you note, other non-linear activation functions do lead to singular models. @mfar did some great work on this for tanh networks. Even though the activation function is important, note that the better intuition to have is that the hierarchical nature of a model (e.g. neural networks) is what makes them singular. Deep linear networks are still singular despite an identity activation function. Think of the activation as giving the model more expressiveness.

**Even if** **is uninteresting, the loss landscape might be "nearly singular".** The ReLU has an analytic approximation, the Swish function , where , which does *not* yield the same symmetries as discussed in this post. This is because the activation boundaries are no longer a sensible thing to study (the swish function is "always active" in all subsets of the input domain), which breaks down a lot of the analysis used here.

Suppose, however, that we take a that is so large that from the point of view of your computer, (i.e. their difference is within machine-epsilon). Even though is now a very different object to on paper, the loss landscape will be approximately equal , meaning that the Bayesian posterior will be practically identical between the two functions and induce the same training dynamics.
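The convergence described here can be checked numerically. The following is a minimal sketch, assuming the standard Swish parameterisation `x * sigmoid(beta * x)` (the formula is not reproduced in the text above, so this form and the name `beta` are assumptions). It measures the largest gap between Swish and ReLU on a grid as `beta` grows.

```python
import numpy as np


def relu(x):
    return np.maximum(x, 0.0)


def swish(x, beta):
    # Assumed form: x * sigmoid(beta * x).
    # sigmoid(z) = 0.5 * (1 + tanh(z / 2)) avoids overflow for large |beta * x|.
    return 0.5 * x * (1.0 + np.tanh(0.5 * beta * x))


# Evaluate both activations on a fine grid around the ReLU kink at 0.
xs = np.linspace(-5.0, 5.0, 100001)

# Largest pointwise difference |swish - relu| for each beta.
gaps = {
    beta: float(np.max(np.abs(swish(xs, beta) - relu(xs))))
    for beta in (1.0, 10.0, 100.0, 1000.0)
}

for beta, gap in gaps.items():
    print(f"beta = {beta:7.1f}   max |swish - relu| = {gap:.3e}")
```

Under this parameterisation the gap shrinks in proportion to `1 / beta`, which is the sense in which a large enough `beta` makes the two activations indistinguishable at finite precision even though they are different objects on paper.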

So, whilst the *precise* functional-equivalences might be very different across activation functions (differing ), there might be many *approximate* functional equivalences. This is also the sense in which we can wave our arms about "well, SLT only applies to analytic functions, and ReLU isn't analytic, but who cares". Making precise mathematical statements about this "nearly singular" phenomenon - for example, how does the posterior change as you lower in ? - is under-explored at present (to the best of my knowledge), but it is certainly not something that discredits SLT for all of the reasons I have just explained.

8mo20

**Edit**: Originally the sequence was going to contain a post about SLT for Alignment, but this can now be found here instead, where a new research agenda, Developmental Interpretability, is introduced. I have also now included references to the lectures from the recent SLT for Alignment Workshop in June 2023.

"Discontinuity" might suggest that this happens fast. Yet, e.g. in work on grokking, it actually turns out that these "sudden changes" happen over a majority of the training time (often, the x-axis is on a logarithmic scale). Is this compatible, or would this suggest that phenomena like grokking aren't related to the phase transitions predicted by SLT?

This is a great question and something that came up at the recent summit. We would definitely say that the model is in two different phases before and after grokking (i.e. when the test error is flat), but it's an interesting question to consider what's going on over that long period of time where the error is slowly decreasing. I imagine that it is a relatively large model (from an SLT point of view, which means not very large at all from a normal ML point of view), meaning there would be a plethora of different singularities in the loss landscape. My best guess is that it is undergoing many phase transitions across that entire period, where it is finding regions of lower and lower RLCT but equal accuracy. I expect there to be some work done in the next few months applying SLT to the grokking work.

As far as I know, modern transformers are often only trained once on each data sample, which should close the gap between SGD time and the number of data samples quite a bit. Do you agree with that perspective?

This is a very interesting point. I broadly agree with this and think it is worth thinking more about, and could be a very useful simplifying assumption in considering the connection between SGD and SLT.

In general, it seems to me that we're probably most interested in phase transitions that happen across SGD time or with more data samples, whereas phase transitions related to other hyperparameters (for example, varying the truth as in your examples here) are maybe less crucial. Would you agree with that?

Broadly speaking, yes. With that said, hyperparameters in the model are probably interesting too (although maybe more from a capabilities standpoint). I think phase transitions in the truth are also probably interesting in the sense of dataset bias, i.e. what changes about a model's behaviour when we include or exclude certain data? Worth noting here that the Toy Models of Superposition work explicitly deals in phase transitions in the truth, so there's definitely a lot of value to be had from studying how variations in the truth induce phase transitions, and what these ramifications are in other things we care about.

Would you expect that most phase transitions in SGD time or the number of data samples are first-order transitions (as is the case when there is a loss-complexity tradeoff), or can you conceive of second-order phase transitions that might be relevant in that context as well?

At a first pass, one might say that second-order phase transitions correspond to something like the formation of circuits. I think there are definitely reasons to believe both happen during training.

Which altered the posterior geometry, but not that of since (up to a normalisation factor).

I didn't understand this footnote.

I just mean that is not affected by (even though of course or is), but the posterior is still affected by . So the phase transition merely concerns the posterior and not the loss landscape.

but the node-degeneracy and orientation-reversing symmetries only occur under precise configurations of the truth.

Hhm, I thought that these symmetries are about configurations of the parameter vector, irrespective of whether it is the "true" vector or not.

My use of the word "symmetry" here is probably a bit confusing and a hangover from my thesis. What I mean is, these two configurations are *only in the set of true parameters in this setup* when the truth is configured in a particular way. In other words, they are always local minima of , but not always global minima. (This is what PT1 shows when ). Thanks for pointing this out.

It seems to me that in the other phase, the weights also annihilate each other, so the "non-weight annihilation phase" is a somewhat weird terminology. Or did I miss something?

Huh, I'd never really thought of this, but I now agree it is slightly weird terminology in some sense. I probably should have called them the weight-cancellation and non-weight-cancellation phases as I described in the reply to your DSLT3 comment. My bad. I think it's a bit too late to change now, though.

I think there is a typo and you meant .

Thanks! And thanks for reading all of the posts so thoroughly and helping clarify a few sloppy pieces of terminology and notation, I really appreciate it.

However, is it correct that we need the "underlying truth" to study symmetries that come from other degeneracies of the Fisher information matrix? After all, this matrix involves the true distribution in its definition. The same holds for the Hessian of the KL divergence.

The definition of the Fisher information matrix does not refer to the truth whatsoever. (Note that in the definition I provide I am assuming the supervised learning case where we know the input distribution , meaning the model is , which is why the shows up in the formula I just linked to. The derivative terms do not explicitly include because it just vanishes in the derivative anyway, so it's irrelevant there. But remember, we are ultimately interested in modelling the conditional true distribution in .)

What do you mean with non-weight-annihilation here? Don't the weights annihilate in both pictures?

You're right, that's sloppy terminology from me. What I mean is, in the right-hand picture (that I originally labelled WA), there is a region in which all nodes are active, but cancel out to give zero effective gradient, which is markedly different to the left-hand picture. I have edited this to NonWC and WC instead to clarify, thanks!

Now, for the KL-divergence, the situation seems more extreme: The zeros are also, at the same time, the minima of , and thus, the derivative disappears at every point in the set . This suggests every point in is singular. Is this correct?

Correct! So, the point is that things get interesting when is more than just a single point (which is the regular case). In essence, singularities are local minima of . In the non-realisable case this means they are zeroes of the minimum-loss level set. In fact we can abuse notation a bit and really just refer to any local minimum of as a singularity. The TLDR of this is:

So far, I thought "being singular" means the effective number of parameters around the singularity is lower than the full number of parameters. Also, I thought that it's about the rank of the Hessian, not the vanishing of the derivative. Both perspectives contradict the interpretation in the preceding paragraph, which leaves me confused.

As I show in the examples in DSLT1, having degenerate Fisher information (i.e. degenerate Hessian at zeroes) comes in two essential flavours: having rank-deficiency, and having vanishing second-derivative (i.e. ). Precisely, suppose is the number of parameters, then you are in the regular case if can be expressed as a full-rank quadratic form near each singularity,

Anything less than this is a strictly singular case.
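As a hedged illustration of the regular case and the two singular flavours, here are three toy forms in assumed notation: $K(w)$ for the KL divergence in local coordinates around a minimum at the origin, $d$ for the number of parameters, and $\lambda$ for the RLCT. These toy forms are chosen for illustration and are not quoted from the posts.

$$
\begin{aligned}
&\text{Regular:} && K(w) = \sum_{i=1}^{d} w_i^2, && \lambda = \tfrac{d}{2}, \\
&\text{Rank-deficient } (d = 2)\text{:} && K(w) = w_1^2, && \lambda = \tfrac{1}{2} < \tfrac{d}{2} = 1, \\
&\text{Vanishing second derivative } (d = 1)\text{:} && K(w) = w^4, && \lambda = \tfrac{1}{4} < \tfrac{d}{2} = \tfrac{1}{2}.
\end{aligned}
$$

In the second form the Hessian has rank 1 rather than 2, and in the third the Hessian vanishes entirely at the minimum; in both cases the RLCT falls below $d/2$.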

I vaguely remember that there is a part in the MDL book by Grünwald where he explains how using a good prior such as Jeffrey's prior somewhat changes asymptotic behavior for , but I'm not certain of that.

Watanabe has an interesting little section in the grey book [Remark 7.4, Theorem 7.4, Wat09] talking about the Jeffreys prior. I haven't studied it in detail, but to the best of my reading he is basically saying "from the point of view of SLT, the Jeffreys prior is zero at singularities anyway, its coordinate-free nature makes it inappropriate for statistical learning, and the RLCT can only be if the Jeffreys prior is employed." (The last statement is the content of the theorem where he studies the poles of the zeta function when the Jeffreys prior is employed).

Can you tell more about why it is a measure of posterior concentration.

...

Are you claiming that most of that work happens very localized in a small parameter region?

Given a small neighbourhood , the free energy is and measures the posterior concentration in since

where the inner term *is* the posterior, modulo its normalisation constant . The key here is that if we are *comparing* different regions of parameter space , then the free energy doesn't care about that normalisation constant as it is just a shift in by a constant. So the free energy gives you a tool for comparing different regions of the posterior. (To make this comparison rigorous, I suppose one would want to make sure that these regions are the same "size". Another perspective, and really the main SLT perspective, is that if they are sufficiently small and localised around different singularities then this size problem isn't really relevant, and the free energy is telling you something about the structure of the singularity and the local geometry of around the singularity).
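For readers following along, the relationship can be written out as a sketch in assumed notation (the symbols are not shown in the text above, so these are the conventional SLT choices rather than a quotation): $\mathcal{W}$ is the neighbourhood, $\varphi(w)$ the prior, $L_n(w)$ the empirical negative log likelihood, and $Z_n$ the normalisation constant of the posterior.

$$
\begin{aligned}
F_n(\mathcal{W}) &= -\log \int_{\mathcal{W}} \varphi(w)\, e^{-n L_n(w)}\, dw, \\
p(w \mid D_n) &= \frac{1}{Z_n}\, \varphi(w)\, e^{-n L_n(w)}, \\
-\log \int_{\mathcal{W}} p(w \mid D_n)\, dw &= F_n(\mathcal{W}) + \log Z_n .
\end{aligned}
$$

Since $\log Z_n$ is the same for every region, comparing $F_n(\mathcal{W}_1)$ with $F_n(\mathcal{W}_2)$ is the same as comparing the posterior mass assigned to the two regions: the lower free energy corresponds to the higher mass.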

I am not quite sure what you mean with "it tells us something about the information geometry of the posterior"

This is sloppily written by me, apologies. I merely mean to say "the free energy tells us what models the posterior likes".

I didn't find a definition of the left expression.

I mean, the relation between and tells you that this is a sensible thing to write down, and if you reconstructed the left side from the right side you would simply find some definition in terms of the predictive distribution restricted to (instead of in the integral).

Purposefully naive question: can I just choose a region that contains all singularities? Then it surely wins, but this doesn't help us because this region can be very large.

Yes - and as you say, this would be very uninteresting (and in general you wouldn't *know* what to pick necessarily [although we did in the phase transition DSLT4 because of the classification of in DSLT3]). The point is that at no point are you just magically "choosing" a anyway. If you really want to calculate the free energy of some model setup then you would have a reason to choose different phases to analyse. Otherwise, the premise of this section of the post is to show that the geometry depends on the singularity structure and this varies across parameter space.
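To make the dependence on singularity structure concrete, here is a minimal numerical sketch. It assumes a hypothetical one-dimensional toy potential `K(w) = (w + 1)^2 (w - 1)^4` with a flat prior, chosen purely for illustration: the zero at `w = -1` is quadratic (regular) and the zero at `w = 1` is quartic (more degenerate). The helper `local_free_energy` and its parameters are likewise hypothetical names.

```python
import numpy as np


def K(w):
    # Hypothetical toy potential: quadratic zero at w = -1, quartic zero at w = +1.
    return (w + 1.0) ** 2 * (w - 1.0) ** 4


def local_free_energy(center, n, radius=0.5, num=200001):
    # F_n(W) = -log of the integral of exp(-n K(w)) over W = [center - radius, center + radius],
    # using a flat prior and a simple Riemann sum.
    w = np.linspace(center - radius, center + radius, num)
    dw = w[1] - w[0]
    return -np.log(np.sum(np.exp(-n * K(w))) * dw)


n_small, n_large = 1e3, 1e5

# Free energy of equally sized neighbourhoods around each zero of K.
F_reg = {n: local_free_energy(-1.0, n) for n in (n_small, n_large)}
F_sing = {n: local_free_energy(+1.0, n) for n in (n_small, n_large)}

# Estimate the coefficient of log(n) in each free energy from the two sample sizes.
log_ratio = np.log(n_large / n_small)
slope_reg = (F_reg[n_large] - F_reg[n_small]) / log_ratio
slope_sing = (F_sing[n_large] - F_sing[n_small]) / log_ratio

print(f"log n coefficient near w = -1: {slope_reg:.3f}")
print(f"log n coefficient near w = +1: {slope_sing:.3f}")
print(f"free energies at n = {n_large:.0e}: {F_reg[n_large]:.3f} vs {F_sing[n_large]:.3f}")
```

Both neighbourhoods have the same size and both contain a zero of `K`, yet the coefficient of `log n` differs between them, so the posterior increasingly prefers the more degenerate region as `n` grows.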

Possible correction: one of those points isn't a singularity, but a regular loss-minimizing point (as you also clarify further below).

As discussed in the comment in your DSLT1 question, they are both singularities of since they are both critical points (local minima). But they are not both true parameters, nor are they both regular points with RLCT .

How do you think about this? What are sensible choices of priors (or network initializations) from the SLT perspective?

I think the question of sensible priors has both an interesting and an uninteresting answer. The interesting answer might involve something along the lines of reformulating the Jeffreys prior, as well as noticing that a Gaussian prior gives you a "regularisation" term (and can be thought of as adding the "simple harmonic oscillator" part to the story). The uninteresting answer is that SLT doesn't care about the prior (other than its regularity conditions) since it is irrelevant in the limit. Also, if you were concerned with the requirement for to be compact, you can just define it to be compact on the space of "numbers that my computer can deal with".

Also, I find it curious that in the second example, the posterior will converge to the lowest loss, but SGD would not since it wouldn't "manage to get out of the right valley", I assume. This seems to suggest that the Bayesian view of SGD can at most be true in high dimensions, but not for very low-dimensional neural networks. Would you agree with that, or what is your perspective?

Yes! We are thinking very much about this at the moment and I think this is the correct intuition to have. If one runs SGD on the potential wells , you find that it just gets stuck in the basin it was closest to. So, what's going on in high dimensions? It seems something about the way higher dimensional spaces are different from lower ones is relevant here, but it's very much an open problem.
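The low-dimensional behaviour described here can be reproduced with a small experiment. This is a sketch under stated assumptions: the potential in the text is not shown, so `loss` below is a hypothetical tilted double well whose left minimum is lower than its right one, and `noisy_gd` is a stand-in for SGD that adds Gaussian noise to the exact gradient.

```python
import numpy as np


def loss(w):
    # Hypothetical tilted double well: minima near w = -1.04 (lower) and w = 0.96 (higher).
    return (w**2 - 1.0) ** 2 + 0.3 * w


def grad(w):
    return 4.0 * w * (w**2 - 1.0) + 0.3


def noisy_gd(w0, lr=0.01, noise_std=1.0, steps=5000, seed=0):
    # Gradient descent with additive Gaussian gradient noise as a crude SGD proxy.
    rng = np.random.default_rng(seed)
    w = w0
    for _ in range(steps):
        w -= lr * (grad(w) + noise_std * rng.normal())
    return w


# Start in the right-hand (higher-loss) basin.
w_final = noisy_gd(w0=0.9)

# Bayesian posterior on a uniform grid with a flat prior at "sample size" n.
grid = np.linspace(-2.0, 2.0, 4001)
n = 50
unnorm = np.exp(-n * (loss(grid) - loss(grid).min()))
left_mass = unnorm[grid < 0].sum() / unnorm.sum()

print(f"noisy GD ends at w = {w_final:.3f} with loss {loss(w_final):.3f}")
print(f"posterior mass on the left basin (w < 0): {left_mass:.6f}")
```

With these settings the optimiser stays in the basin it started in, while the posterior puts almost all of its mass on the other, lower-loss basin. Whether and how this gap closes in high dimensions is exactly the open problem mentioned above.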

Ah! Thanks for that - it seems the general playlist organising them has splintered a bit, so here is the channel containing the lectures, the structure of which is explained here. I'll update this post accordingly.