## LESSWRONGLW

Leon Lang

I'm a PhD student at the University of Amsterdam. I have research experience in multivariate information theory and equivariant deep learning and recently got very interested into AI alignment. https://langleon.github.io/

# Wiki Contributions

(Fwiw, I don’t remember problems with stipend payout at seri mats in the winter program. I was a winter scholar 2022/23.)

This is very helpful, thanks! Actually, the post includes several sections, including in the appendix, that might be more interesting to many readers than the grant recommendations themselves. Maybe it would be good to change the title a bit so that people also expect other updates.

As I show in the examples in DSLT1, having degenerate Fisher information (i.e. degenerate Hessian at zeroes) comes in two essential flavours: having rank-deficiency, and having vanishing second-derivative (i.e. ). Precisely, suppose  is the number of parameters, then you are in the regular case if  can be expressed as a full-rank quadratic form near each singularity,

Anything less than this is a strictly singular case.

So if , then  is a singularity but not a strict singularity, do you agree? It still feels like somewhat bad terminology to me, but maybe it's justified from the algebraic-geometry--perspective.

# Zeta Functions in Singular Learning Theory

In this shortform, I very briefly explain my understanding of how zeta functions play a role in the derivation of the free energy in singular learning theory. This is entirely based on slide 14 of the SLT low 4 talk of the recent summit on SLT and Alignment, so feel free to ignore this shortform and simply watch the video.

The story is this: we have a prior , a model , and there is an unknown true distribution . For model selection, we are interested in the evidence of our model for a data set , which is given by

where  is the empirical KL divergence. In fact, we are interested in selecting the model that maximizes the average of this quantity over all data sets. The average is then given by

where  is the Kullback-Leibler divergence.

But now we have a problem: how do we compute this integral? Computing this integral is what the free energy formula is about

The answer: by computing a different integral. So now, I'll explain the connection to different integrals we can draw.

Let

which is called the state density function. Here,  is the Dirac delta function.  For different , it measures the density of states (= parameter vectors) that have . It is thus a measure for the "size" of different level sets. This state density function is connected to two different things.

## Laplace Transform to the Evidence

First of all, it is connected to the evidence above. Namely, let  be the Laplace transform of . It is a function  given by

In first step, we changed the order of integration, and in the second step we used the defining property of the Dirac delta. Great, so this tells us that ! So this means we essentially just need to understand .

## Mellin Transform to the Zeta Function

But how do we compute ? By using another transform. Let  be the Mellin transform of . It is a function  (or maybe only defined on part of ?) given by

Again, we used a change in the order of integration and then the defining property of the Dirac delta. This is called a Zeta function.

# What's this useful for?

The Mellin transform has an inverse. Thus, if we can compute the zeta function, we can also compute the original evidence as

Thus, we essentially changed our problem to the problem of studying the zeta function  To compute the integral of the zeta function, it is then useful to perform blowups to resolve the singularities in the set of minima of , which is where algebraic geometry enters the picture. For more on all of this, I refer, again, to the excellent SLT low 4 talk of the recent summit on singular learning theory.

Thanks for the answer! I think my first question was confused because I didn't realize you were talking about local free energies instead of the global one :)

As discussed in the comment in your DSLT1 question, they are both singularities of  since they are both critical points (local minima).

Oh, I actually may have missed that aspect of your answer back then. I'm confused by that: in algebraic geometry, the zero's of a set of polynomials are not necessarily already singularities. E.g., in , the zero set consists of the two axes, which form an algebraic variety, but only at  is there a singularity because the derivative disappears.
Now, for the KL-divergence, the situation seems more extreme: The zero's are also, at the same time, the minima of , and thus, the derivative disappears at every point in the set . This suggests every point in  is singular. Is this correct?

So far, I thought "being singular" means the effective number of parameters around the singularity is lower than the full number of parameters. Also, I thought that it's about the rank of the Hessian, not the vanishing of the derivative. Both perspectives contradict the interpretation in the preceding paragraph, which leaves me confused.

The uninteresting answer is that SLT doesn't care about the prior (other than its regularity conditions) since it is irrelevant in the  limit.

I vaguely remember that there is a part in the MDL book by Grünwald where he explains how using a good prior such as Jeffrey's prior somewhat changes asymptotic behavior for , but I'm not certain of that.

Thanks also for this post! I enjoy reading the sequence and look forward to post 5 on the connections to alignment :)

At some critical value , we recognise a phase transition as being a discontinuous change in the free energy or one of its derivatives, for example the generalisation error .

"Discontinuity" might suggest that this happens fast. Yet, e.g. in work on grokking, it actually turns out that these "sudden changes" happen over a majority of the training time (often, the x-axis is on a logarithmic scale). Is this compatible, or would this suggest that phenomena like grokking aren't related to the phase transitions predicted by SLT?

There is, however, one fundamentally different kind of "phase transition" that we cannot explain easily with SLT: a phase transition of SGD in time, i.e. the number gradient descent steps. The Bayesian framework of SLT does not really allow one to speak of time - the closest quantity is the number of datapoints , but these are not equivalent. We leave this gap as one of the fundamental open questions of relating SLT to current deep learning practice.

As far as I know, modern transformers are often only trained once on each data sample, which should close the gap between SGD time and the number of data samples quite a bit. Do you agree with that perspective?

In general, it seems to me that we're probably most interested in phase transitions that happen across SGD time or with more data samples, whereas phase transitions related to other hyperparameters (for example, varying the truth as in your examples here) are maybe less crucial. Would you agree with that?

Would you expect that most phase transitions in SGD time or the number of data samples are first-order transitions (as is the case when there is a loss-complexity tradeoff), or can you conceive of second-order phase transitions that might be relevant in that context as well?

Which altered the posterior geometry, but not that of  since  (up to a normalisation factor).

I didn't understand this footnote.

but the node-degeneracy and orientation-reversing symmetries only occur under precise configurations of the truth.

Hhm, I thought that these symmetries are about configurations of the parameter vector, irrespective of whether it is the "true" vector or not.
Are you maybe trying to say the following? The truth determines which parameter vectors are preferred by the free energy, e.g. those close to the truth. For some truths, we will have more symmetries around the truth, and thus lower RLCT for regions preferred by the posterior

We will use the label weight annihilation phase to refer to the configuration of nodes such that the weights all point into the centre region and annihilate one another.

It seems to me that in the other phase, the weights also annihilate each other, so the "non-weight annihilation phase" is a somewhat weird terminology. Or did I miss something?

The weight annihilation phase  is never preferred by the posterior

I think there is a typo and you meant .

Thanks Liam also for this nice post! The explanations were quite clear.

The property of being singular is specific to a model class , regardless of the underlying truth.

This holds for singularities that come from symmetries where the model doesn't change. However, is it correct that we need the "underlying truth" to study symmetries that come from other degeneracies of the Fisher information matrix? After all, this matrix involves the true distribution in its definition. The same holds for the Hessian of the KL divergence.

Both configurations, non-weight-annihilation (left) and weight-annihilation (right)

What do you mean with non-weight-annihilation here? Don't the weights annihilate in both pictures?

In particular, it is the singularities of these minimum-loss sets — points at which the tangent is ill-defined — that determine generalization performance.

To clarify: there is not necessarily a problem with the tangent, right? E.g., the function  has a singularity at  because the second derivative vanishes there, but the tangent is define. I think for the same reason, some of the pictures may be misleading to some readers.

• A model, parametrized by weights , where  is compact;

Why do we want compactness? Neural networks are parameterized in a non-compact set. (Though I guess usually, if things go well, the weights don't blow up. So in that sense it can maybe be modeled to be compact)

The empirical Kullback-Leibler divergence is just a rescaled and shifted version of the negative log likelihood.

I think it is only shifted, and not also rescaled, if I'm not missing something.

But these predictions of "generalization error" are actually a contrived kind of theoretical device that isn't what we mean by "generalization error" in the typical ML setting.

Why is that? I.e., in what way is the generalization error different from what ML people care about? Because real ML models don't predict using an updated posterior over the parameter space? (I was just wondering if there is a different reason I'm missing)

Yeah I remember also struggling to parse this statement when I first saw it. Liam answered but in case it's still not clear and/or someone doesn't want to follow up in Liam's thesis,  is a free variable, and the condition is talking about linear dependence of functions of .

Consider a toy example (not a real model) to help spell out the mathematical structure involved: Let  so that  and . Then let  and  be functions such that  and .. Then the set of functions  is a linearly dependent set of functions because .

Thanks! Apparently the proof of the thing I was wondering about can be found in Lemma 3.4 in Liam's thesis. Also thanks for your other comments!

Thanks for the answer Liam! I especially liked the further context on the connection between Bayesian posteriors and SGD. Below a few more comments on some of your answers:

The partition function is equal to the model evidence , yep. It isn’t equal to (I assume  is fixed here?) but is instead expressed in terms of the model likelihood and prior (and can simply be thought of as the “normalising constant” of the posterior),

and then under this supervised learning setup where we know , we have . Also note that this does “factor over ” (if I’m interpreting you correctly) since the data is independent and identically distributed.

I think I still disagree. I think everything in these formulas needs to be conditioned on the -part of the dataset. In particular, I think the notation  is slightly misleading, but maybe I'm missing something here.

I'll walk you through my reasoning: When I write  or , I mean the whole vectors, e.g., . Then I think the posterior compuation works as follows:

That is just Bayes rule, conditioned on  in every term. Then,  because from alone you don't get any new information about the conditional  (A more formal way to see this is to write down the Bayesian network of the model and to see that  and  are d-separated). Also, conditioned on  is independent over data points, and so we obtain

So, comparing with your equations, we must have  Do you think this is correct?

Btw., I still don't think this "factors over ". I think that

The reason is that old data points should inform the parameter , which should have an influence on future updates. I think the independence assumption only holds for the true distribution and the model conditioned on

If you expand that term out you find that

because the second integral is the first central moment of a Gaussian. The derivative of the prior is irrelevant.

Right. that makes sense, thank you! (I think you missed a factor of , but that doesn't change the conclusion)

Thanks also for the corrected volume formula, it makes sense now :)