Trained neural networks only generalize at all with certain types of activation functions. So if your theory doesn't consider activation functions it's probably wrong.
If a model is singular, then Watanabe’s Free Energy Formula (FEF) can have big implications for the geometry of the loss landscape. Whether or not a particular neural network model is singular does indeed depend on its activation function, amongst other structures in its architecture.
In DSLT3 I will outline the ways simple two-layer feedforward ReLU neural networks are singular models (i.e. I will show the symmetries in parameter space that produce the same input-output function), which generalises to deeper feedforward ReLU networks. There I will also discuss similar results for tanh networks, alluding to the fact that many (though not all) activation functions produce these symmetries, making neural networks with those activation functions singular models, and hence making the content and interpretation of Watanabe's free energy formula applicable.
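To give a flavour of the kind of symmetry I mean, here is a tiny numerical sketch (my own illustration, not code from the sequence): for ReLU, scaling a hidden unit's incoming weights and bias by any α > 0 and its outgoing weight by 1/α leaves the network function unchanged, so the map from parameters to functions is highly non-injective.

```python
import numpy as np

def relu_net(x, W1, b1, w2):
    # One-hidden-layer ReLU network: f(x) = w2 . relu(W1 @ x + b1)
    return np.maximum(W1 @ x + b1, 0.0) @ w2

rng = np.random.default_rng(0)
W1, b1, w2 = rng.normal(size=(4, 2)), rng.normal(size=4), rng.normal(size=4)

# Rescale hidden unit 0: incoming weights/bias by alpha, outgoing weight by 1/alpha.
alpha = 3.7
W1s, b1s, w2s = W1.copy(), b1.copy(), w2.copy()
W1s[0] *= alpha
b1s[0] *= alpha
w2s[0] /= alpha

x = rng.normal(size=2)
print(relu_net(x, W1, b1, w2), relu_net(x, W1s, b1s, w2s))  # identical outputs
```

This positive-homogeneity is one of the ingredients that makes the parameter-to-function map non-identifiable, and hence the model singular.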
This is all pretty complicated compared to my understanding of why neural networks generalize, and I'm not sure why I should prefer it. Does this complex and detailed theory have any concrete predictions about NN design or performance in different circumstances? Can you accurately predict which activation functions work well?
My view is that this "singularity" of networks (which I don't think is a good term; it's already overloaded with far too many meanings) is applicable to convergence properties but not to generalization ability.
What is your understanding? It is indeed a deep mathematical theory, but it is not convoluted. Watanabe proves the FEF, and shows the RLCT is the natural generalisation of complexity in this setting. There is a long history of deep/complicated mathematics, with natural (and beautiful) theorems at the core, being pivotal to describing real world phenomena.
The point of the posts is not to argue that we can prove why particular architectures perform better than others (yet). This field has had comparatively little work done on it within AI research, and these sorts of facts are where SLT might take us (modulo AI capabilities concerns). The point is to demonstrate the key insights of the theory and signpost the fact that "hey, there might be something very meaningful here." What we can predict with the theory is why certain phase transitions happen, in particular for the two-layer feedforward ReLU nets I will show in DSLT4. This is a seed from which to generalise to deeper nets and more intricate architectures, the natural way of doing good mathematics.
As to the “singularity” problem, you will have to take that up with the algebraic geometers who have been studying singularities for over 50 years. The fact is that optimal parameters are singularities of K(w) in non-trivial neural networks - hence, singular learning theory.
What do “convergence properties” and “generalisation ability” mean to you, precisely? It is an indisputable fact that Watanabe proves (and I elaborate on in the post) that this singularity structure plays a central role in Bayesian generalisation. As I say, its relation to SGD dynamics is certainly an open question. But in the Bayesian setting, the case is really quite closed.
What is your understanding?
I guess I'll write a post.
It is an indisputable fact that Watanabe proves (and I elaborate on in the post) that this singularity structure plays a central role in Bayesian generalisation
No. What was proven is that there are some points which can be represented by lots of possible configurations, more so than other points. There is no proof or even evidence that those are the points reached by NN training via SGD, or that those points represent good solutions to problems. As far as I can tell, you're just assuming that because it seems to you like a logical reason for NN generalization.
With all due respect, I think you are misrepresenting what I am saying here. The sentence after your quote ends is:
its relation to SGD dynamics is certainly an open question.
What is proven by Watanabe is that the Bayesian generalisation error, as I described in detail in the post, strongly depends on the singularity structure of the minima of K(w), as measured by the RLCT λ. This fact is proven in [Wat13] and explained in more detail in [Wat18]. As I elaborate on in the post, translating this statement into the SGD / frequentist setting is an interesting and important open problem, if it can be done at all.
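For reference, stated loosely and suppressing constants, lower-order terms and the precise hypotheses (see the posts and [Wat18] for the careful statements; notation may differ slightly), the two results in question are

$$F_n \approx n L_n(w_0) + \lambda \log n, \qquad \mathbb{E}[G_n] \approx L(w_0) + \frac{\lambda}{n},$$

where $w_0$ is an optimal parameter, $L$ the population loss, $F_n$ the free energy, $G_n$ the Bayes generalisation loss, and $\lambda$ the RLCT.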
I said I'd write a post, and I wrote a post.
the Bayes generalisation error Gn is the "derivative" of the free energy
I think calling that "Bayes generalisation error" is where you went wrong. I see no good basis for saying that's true in the sense people normally mean "generalization".
I understand some things about a Free Energy Formula are proved, but I don't think you've shown anything about low RLCT points tending to be the sort of useful solutions which neural networks find.
Thanks for writing that, I look forward to reading.
As for nomenclature, I did not define it: the sequence is called Distilling SLT, and this is the definition offered by Watanabe. But to add some weight to it, the point is that in the Bayesian setting, the predictive distribution is a reasonable object to study from the point of view of generalisation, because it says: "what is the probability of this output given this input and given the data (via the posterior)". The Bayes training loss (which I haven't delved into in this post) is the empirical counterpart to the Bayes generalisation loss,

$$T_n = -\frac{1}{n}\sum_{i=1}^n \log p(X_i \mid D_n),$$

and so it adds up the "entropy" of the predictive distribution over the training datapoints: if the predictive distribution is certain about all training datapoints, then the training loss will be 0. (Admittedly, this is an object that I am still getting my head around.)
The Bayes generalisation loss satisfies $G_n = -\mathbb{E}_X[\log p(X \mid D_n)]$ (with the expectation taken over the true distribution), and therefore it averages this training loss over the whole space of inputs and outputs. This is the sense in which it is reasonable to call it "generalisation". As I say in the post, there are other ways you can think of generalisation in the Bayesian setting, like leave-one-out cross-validation, or other methods of extracting data from the posterior like Gibbs sampling. Watanabe shows (see pg236 of [Wat18]) that the RLCT is a central object in all of these alternative conceptions.
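(The precise sense in which G_n is the "derivative" of the free energy, which is what ties generalisation to the RLCT, is the identity

$$\mathbb{E}[G_n] = \mathbb{E}[F_{n+1}] - \mathbb{E}[F_n],$$

which follows from $Z_{n+1} = Z_n \cdot p(X_{n+1} \mid D_n)$ after taking $-\log$ and expectations. Combined with the free energy formula, this is where the $\lambda / n$ behaviour of the generalisation loss comes from.)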
As for the last point, neural networks do not "find" anything themselves - either an optimisation method like SGD does, or the Bayesian posterior "finds" regions of high posterior concentration (i.e. it contains this information). SLT tells us that the Bayesian posterior of singular models does "find" low RLCT points (as long as they are sufficiently accurate). Neural networks are singular models (as I will explain in DSLT3), so the posterior of neural networks "finds" low RLCT points.
Does this mean SGD does? We don't know yet! And whether your intuition says this is a fruitful line of research to investigate is completely personal to your own mental model of the world, I suppose.
Thanks for this nice post! I found it slightly more vague than the first post, but I guess that is hard to avoid when trying to distill highly technical topics. I got a lot out of it.
Fundamentally, we care about the free energy because it is a measure of posterior concentration, and as we showed with the BIC calculation in DSLT1, it tells us something about the information geometry of the posterior.
Can you tell more about why it is a measure of posterior concentration (It gets a bit clearer further below, but I state my question nonetheless to express that this statement isn't locally clear to me here)? I may lack some background in Bayesian statistics here. In the first post, you wrote the posterior as

$$p(w \mid D_n) = \frac{\varphi(w)\prod_{i=1}^n p(X_i \mid w)}{Z_n},$$

and it seems like you want to say that if free energy is low, then the posterior is more concentrated. If I look at this formula, then low free energy corresponds to high Z_n, meaning the prior and likelihood have to "work quite a bit" to ensure that this expression overall integrates to 1. Are you claiming that most of that work happens very localized in a small parameter region?
Additionally, I am not quite sure what you mean by "it tells us something about the information geometry of the posterior", or even what you mean by "information geometry" here. I guess one answer is that you showed in post 1 that the Fisher information matrix appears in the formula for the free energy, which contains geometric information about the loss landscape. But then in the proof, you regarded that as a constant that you ignored in the final BIC formula, so I'm not sure if that's what you are referring to here. More explicit references would be useful to me.
Since there is a correspondence

$$\text{low free energy of a region } W' \iff \text{high posterior concentration in } W',$$

we say the posterior prefers a region W' when it has low free energy relative to other regions of W.
Note to other readers (as this wasn't clear to me immediately): That correspondence holds because one can show that

$$P(w \in W' \mid D_n) = \int_{W'} p(w \mid D_n)\, dw = \frac{e^{-F_n(W')}}{Z_n}.$$

Here, Z_n is the global partition function.
The Bayes generalisation loss is then given by
I believe the first expression should be an expectation over .
It follows immediately that the generalisation loss of a region W' ⊆ W is

$$G_n(W') = F_{n+1}(W') - F_n(W').$$
I didn't find a definition of the left expression.
So, the region in W that minimises the free energy has the best accuracy-complexity tradeoff. This is the sense in which singular models obey Occam's Razor: if two regions are equally accurate, then they are preferred according to which is the simpler model.
Purposefully naive question: can I just choose a region that contains all singularities? Then it surely wins, but this doesn't help us because this region can be very large.
So I guess you also want to choose small regions. You hinted at that already by saying that W should be compact. But now I of course wonder if sometimes just all of W lies within a compact set.
There are two singularities in the set of true parameters,
which we will label as and respectively.
Possible correction: one of those points isn't a singularity, but a regular loss-minimizing point (as you also clarify further below).
Let's consider a one parameter model with KL divergence defined by
on the region with uniform prior
The prior seems to do some work here: if it doesn't properly support the region with low RLCT, then the posterior cannot converge there. I guess a similar story might a priori hold for SGD, where how you initialize your neural network might matter for convergence.
How do you think about this? What are sensible choices of priors (or network initializations) from the SLT perspective?
Also, I find it curious that in the second example, the posterior will converge to the lowest loss, but SGD would not since it wouldn't "manage to get out of the right valley", I assume. This seems to suggest that the Bayesian view of SGD can at most be true in high dimensions, but not for very low-dimensional neural networks. Would you agree with that, or what is your perspective?
Can you tell more about why it is a measure of posterior concentration.
...
Are you claiming that most of that work happens very localized in a small parameter region?
Given a small neighbourhood W' ⊂ W, the free energy of W' is F_n(W') = -log Z_n(W') and measures the posterior concentration in W' since

$$e^{-F_n(W')} = Z_n(W') = \int_{W'} \varphi(w)\prod_{i=1}^n p(X_i \mid w)\, dw,$$

where the inner term is the posterior, modulo its normalisation constant Z_n. The key here is that if we are comparing different regions of parameter space W, then the free energy doesn't care about that normalisation constant as it is just a shift in F_n(W') by a constant (namely log Z_n). So the free energy gives you a tool for comparing different regions of the posterior. (To make this comparison rigorous, I suppose one would want to make sure that these regions are the same "size". Another perspective, and really the main SLT perspective, is that if they are sufficiently small and localised around different singularities then this size problem isn't really relevant, and the free energy is telling you something about the structure of the singularity and the local geometry of K(w) around the singularity).
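If a crude numerical picture helps, here is a toy sketch (my own, not from the post): I use exp(-nK(w)) with a uniform prior as a stand-in for the unnormalised posterior (so ignoring sampling fluctuations), with one locally-quadratic zero and one locally-quartic zero, and compare the local free energies of small regions around each.

```python
import numpy as np
from scipy.integrate import quad

# Toy population loss with two zero-loss points:
#   w = -1 is locally quadratic (RLCT 1/2), w = +1 is locally quartic (RLCT 1/4).
K = lambda w: (w + 1)**2 * (w - 1)**4

def local_free_energy(n, a, b):
    # F_n(W') = -log ∫_{W'} exp(-n K(w)) dw ; the uniform prior only adds a constant.
    Z_region, _ = quad(lambda w: np.exp(-n * K(w)), a, b)
    return -np.log(Z_region)

for n in [10, 100, 1_000, 10_000]:
    F_left = local_free_energy(n, -1.5, -0.5)   # region around the quadratic zero
    F_right = local_free_energy(n, 0.5, 1.5)    # region around the quartic zero
    print(n, round(F_left, 2), round(F_right, 2))
# F_left grows like (1/2) log n while F_right grows like (1/4) log n, so the
# posterior increasingly prefers the more degenerate (lower-RLCT) region.
```

The point is only qualitative: the λ log n scaling of local free energies is what makes the posterior prefer the more singular region for large n.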
I am not quite sure what you mean with "it tells us something about the information geometry of the posterior"
This is sloppily written by me, apologies. I merely mean to say "the free energy tells us what models the posterior likes".
I didn't find a definition of the left expression.
I mean, the relation between G_n and F_n tells you that this is a sensible thing to write down, and if you reconstructed the left side from the right side you would simply find some definition in terms of the predictive distribution restricted to W' (instead of W in the integral).
Purposefully naive question: can I just choose a region that contains all singularities? Then it surely wins, but this doesn't help us because this region can be very large.
Yes, and as you say, this would be very uninteresting (and in general you wouldn't know what to pick necessarily [although we did in the phase transition example of DSLT4 because of the classification of W_0 in DSLT3]). The point is that at no point are you just magically "choosing" a W' anyway. If you really want to calculate the free energy of some model setup then you would have a reason to choose different phases to analyse. Otherwise, the premise of this section of the post is to show that the geometry depends on the singularity structure, and this varies across parameter space.
Possible correction: one of those points isn't a singularity, but a regular loss-minimizing point (as you also clarify further below).
As discussed in the comment in your DSLT1 question, they are both singularities of K(w) since they are both critical points (local minima). But they are not both true parameters, nor are they both regular points with RLCT 1/2.
How do you think about this? What are sensible choices of priors (or network initializations) from the SLT perspective?
I think the question of sensible priors has an interesting and an uninteresting angle to it. The interesting answer might involve something along the lines of reformulating the Jeffreys prior, as well as noticing that a Gaussian prior gives you a "regularisation" term (and can be thought of as adding the "simple harmonic oscillator" part to the story). The uninteresting answer is that SLT doesn't care about the prior (other than its regularity conditions) since it is irrelevant in the n → ∞ limit. Also, if you were concerned with the requirement for W to be compact, you can just define it to be compact on the space of "numbers that my computer can deal with".
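Concretely (just unpacking the Gaussian remark, in the notation of the posts as I understand it): with a Gaussian prior $\varphi(w) \propto \exp(-\|w\|^2 / 2\sigma^2)$, the unnormalised posterior is

$$\varphi(w)\prod_{i=1}^n p(X_i \mid w) \;\propto\; \exp\!\Big(-n L_n(w) - \frac{\|w\|^2}{2\sigma^2}\Big),$$

so the prior contributes exactly an L2 / "harmonic oscillator" term to the exponent, which is dominated by the $n L_n(w)$ term as $n \to \infty$.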
Also, I find it curious that in the second example, the posterior will converge to the lowest loss, but SGD would not since it wouldn't "manage to get out of the right valley", I assume. This seems to suggest that the Bayesian view of SGD can at most be true in high dimensions, but not for very low-dimensional neural networks. Would you agree with that, or what is your perspective?
Yes! We are thinking very much about this at the moment and I think this is the correct intuition to have. If one runs SGD on potential wells like these, you find that it just gets stuck in the basin it started closest to. So, what's going on in high dimensions? It seems that something about the way higher-dimensional spaces differ from lower-dimensional ones is relevant here, but it's very much an open problem.
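To illustrate the contrast with a 1-d toy of my own (one quadratic and one quartic minimum, not the specific wells from the post): plain gradient descent just rolls into whichever basin it starts in, regardless of which region the posterior would prefer at large n.

```python
# Toy double-well-style potential: minima at w = -1 (quadratic) and w = +1 (quartic).
K = lambda w: (w + 1)**2 * (w - 1)**4
dK = lambda w: 2*(w + 1)*(w - 1)**4 + 4*(w + 1)**2*(w - 1)**3

def gradient_descent(w0, lr=1e-3, steps=50_000):
    w = w0
    for _ in range(steps):
        w -= lr * dK(w)
    return w

print(gradient_descent(-1.8))  # converges to ~ -1, the sharper (higher-RLCT) minimum
print(gradient_descent(+0.2))  # crawls towards +1, the flatter (lower-RLCT) minimum
```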
Thanks for the answer! I think my first question was confused because I didn't realize you were talking about local free energies instead of the global one :)
As discussed in the comment in your DSLT1 question, they are both singularities of K(w) since they are both critical points (local minima).
Oh, I actually may have missed that aspect of your answer back then. I'm confused by that: in algebraic geometry, the zeros of a set of polynomials are not necessarily already singularities. E.g., for the polynomial xy, the zero set consists of the two axes, which form an algebraic variety, but only at the origin is there a singularity, because that is where the derivative vanishes.
Now, for the KL-divergence, the situation seems more extreme: the zeros are also, at the same time, the minima of K(w), and thus the derivative vanishes at every point in the set W_0. This suggests every point in W_0 is singular. Is this correct?
So far, I thought "being singular" means the effective number of parameters around the singularity is lower than the full number of parameters. Also, I thought that it's about the rank of the Hessian, not the vanishing of the derivative. Both perspectives contradict the interpretation in the preceding paragraph, which leaves me confused.
The uninteresting answer is that SLT doesn't care about the prior (other than its regularity conditions) since it is irrelevant in the n → ∞ limit.
I vaguely remember that there is a part in the MDL book by Grünwald where he explains how using a good prior such as the Jeffreys prior somewhat changes the asymptotic behavior for n → ∞, but I'm not certain of that.
Now, for the KL-divergence, the situation seems more extreme: the zeros are also, at the same time, the minima of K(w), and thus the derivative vanishes at every point in the set W_0. This suggests every point in W_0 is singular. Is this correct?
Correct! So, the point is that things get interesting when W_0 is more than just a single point (which is the regular case). In essence, singularities are local minima of K(w). In the non-realisable case this means they are zeroes of K(w) - min K(w), i.e. points of the minimum-loss level set. In fact we can abuse notation a bit and really just refer to any local minimum of K(w) as a singularity. The TLDR of this is: every local minimum of K(w) counts as a singularity in this sense, though some are "more singular" than others.
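To connect this directly to your xy example: take the two-parameter toy case (my example, not from the post)

$$K(w_1, w_2) = w_1^2 w_2^2, \qquad W_0 = \{w_1 w_2 = 0\} = \text{the two axes}.$$

Then $\nabla K = (2 w_1 w_2^2,\; 2 w_1^2 w_2)$ vanishes at every point of $W_0$, so every point of $W_0$ is a critical point of K (a "singularity" in the sense above), even though $W_0$, viewed as a variety, is singular only at the origin.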
So far, I thought "being singular" means the effective number of parameters around the singularity is lower than the full number of parameters. Also, I thought that it's about the rank of the Hessian, not the vanishing of the derivative. Both perspectives contradict the interpretation in the preceding paragraph, which leaves me confused.
As I show in the examples in DSLT1, having degenerate Fisher information (i.e. a degenerate Hessian at zeroes) comes in two essential flavours: having rank-deficiency, and having a vanishing second derivative (i.e. K''(w_0) = 0). Precisely, suppose d is the number of parameters; then you are in the regular case if K(w) can be expressed as a full-rank quadratic form near each singularity (in suitable local coordinates),

$$K(w) = w_1^2 + \dots + w_d^2.$$

Anything less than this is a strictly singular case.
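To illustrate the two flavours with standard toy examples (not taken from the posts), with d = 2:

$$K(w) = w_1^2 + w_2^2 \ (\text{regular},\ \lambda = 1), \qquad K(w) = w_1^2 \ (\text{rank-deficient},\ \lambda = \tfrac{1}{2}), \qquad K(w) = w_1^4 + w_2^2 \ (\text{vanishing second derivative},\ \lambda = \tfrac{3}{4}).$$

In the last two cases $\lambda < d/2 = 1$, which is exactly what being strictly singular buys you in the free energy formula.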
I vaguely remember that there is a part in the MDL book by Grünwald where he explains how using a good prior such as the Jeffreys prior somewhat changes the asymptotic behavior for n → ∞, but I'm not certain of that.
Watanabe has an interesting little section in the grey book [Remark 7.4, Theorem 7.4, Wat09] talking about the Jeffrey's prior. I haven't studied it in detail but to the best of my reading he is basically saying "from the point of view of SLT, the Jeffrey's prior is zero at singularities anyway, its coordinate-free nature makes it inappropriate for statistical learning, and the RLCT can only be if the Jeffrey's prior is employed." (The last statement is the content of the theorem where he studies the poles of the zeta function when the Jeffrey's prior is employed).
Thanks for the reply!
As I show in the examples in DSLT1, having degenerate Fisher information (i.e. a degenerate Hessian at zeroes) comes in two essential flavours: having rank-deficiency, and having a vanishing second derivative (i.e. K''(w_0) = 0). Precisely, suppose d is the number of parameters; then you are in the regular case if K(w) can be expressed as a full-rank quadratic form near each singularity (in suitable local coordinates),

$$K(w) = w_1^2 + \dots + w_d^2.$$

Anything less than this is a strictly singular case.
So if K(w) can be expressed as a full-rank quadratic form near a minimum w_0, then w_0 is a singularity but not a strict singularity, do you agree? It still feels like somewhat bad terminology to me, but maybe it's justified from the algebraic-geometry perspective.