Leon Lang

I'm a PhD student at the University of Amsterdam. I have research experience in multivariate information theory and equivariant deep learning and recently got very interested into AI alignment. https://langleon.github.io/

Wiki Contributions

Comments

I do all of these except 3, and implementing a system like 3 is among my deprioritized things in my ToDo-list. Maybe I should prioritize it.

I really enjoyed reading this post! It's quite well-written. Thanks for writing it.

The only critique is that I would have appreciated more details on how the linear regression parameters are trained and what exactly the projection is doing. John's thread is a bit clarifying on this.

One question: If you optimize the representation in the residual stream such that it corresponds to a particular chosen belief state, does the transformer than predict the next token as if in that belief state? I.e., does the transformer use the belief state for making predictions?

Leon Lang1mo127

MATS mentorships are often weekly, but only for limited time, unlike PhD programs that offer mentorship for several years. These years are probably often necessary to develop good research taste.

(Fwiw, I don’t remember problems with stipend payout at seri mats in the winter program. I was a winter scholar 2022/23.)

This is very helpful, thanks! Actually, the post includes several sections, including in the appendix, that might be more interesting to many readers than the grant recommendations themselves. Maybe it would be good to change the title a bit so that people also expect other updates.

Thanks for the reply!

As I show in the examples in DSLT1, having degenerate Fisher information (i.e. degenerate Hessian at zeroes) comes in two essential flavours: having rank-deficiency, and having vanishing second-derivative (i.e. ). Precisely, suppose  is the number of parameters, then you are in the regular case if  can be expressed as a full-rank quadratic form near each singularity, 

Anything less than this is a strictly singular case. 

So if , then  is a singularity but not a strict singularity, do you agree? It still feels like somewhat bad terminology to me, but maybe it's justified from the algebraic-geometry--perspective. 

Leon Lang10mo50

Zeta Functions in Singular Learning Theory

In this shortform, I very briefly explain my understanding of how zeta functions play a role in the derivation of the free energy in singular learning theory. This is entirely based on slide 14 of the SLT low 4 talk of the recent summit on SLT and Alignment, so feel free to ignore this shortform and simply watch the video.

The story is this: we have a prior , a model , and there is an unknown true distribution . For model selection, we are interested in the evidence of our model for a data set , which is given by

where  is the empirical KL divergence. In fact, we are interested in selecting the model that maximizes the average of this quantity over all data sets. The average is then given by

where  is the Kullback-Leibler divergence. 

But now we have a problem: how do we compute this integral? Computing this integral is what the free energy formula is about

The answer: by computing a different integral. So now, I'll explain the connection to different integrals we can draw. 

Let

which is called the state density function. Here,  is the Dirac delta function.  For different , it measures the density of states (= parameter vectors) that have . It is thus a measure for the "size" of different level sets. This state density function is connected to two different things. 

Laplace Transform to the Evidence

First of all, it is connected to the evidence above. Namely, let  be the Laplace transform of . It is a function  given by

In first step, we changed the order of integration, and in the second step we used the defining property of the Dirac delta. Great, so this tells us that ! So this means we essentially just need to understand .

Mellin Transform to the Zeta Function

But how do we compute ? By using another transform. Let  be the Mellin transform of . It is a function  (or maybe only defined on part of ?) given by

Again, we used a change in the order of integration and then the defining property of the Dirac delta. This is called a Zeta function. 

What's this useful for?

The Mellin transform has an inverse. Thus, if we can compute the zeta function, we can also compute the original evidence as

Thus, we essentially changed our problem to the problem of studying the zeta function  To compute the integral of the zeta function, it is then useful to perform blowups to resolve the singularities in the set of minima of , which is where algebraic geometry enters the picture. For more on all of this, I refer, again, to the excellent SLT low 4 talk of the recent summit on singular learning theory. 

Leon Lang10mo20

Thanks for the answer! I think my first question was confused because I didn't realize you were talking about local free energies instead of the global one :) 

As discussed in the comment in your DSLT1 question, they are both singularities of  since they are both critical points (local minima).

Oh, I actually may have missed that aspect of your answer back then. I'm confused by that: in algebraic geometry, the zero's of a set of polynomials are not necessarily already singularities. E.g., in , the zero set consists of the two axes, which form an algebraic variety, but only at  is there a singularity because the derivative disappears.
Now, for the KL-divergence, the situation seems more extreme: The zero's are also, at the same time, the minima of , and thus, the derivative disappears at every point in the set . This suggests every point in  is singular. Is this correct?

So far, I thought "being singular" means the effective number of parameters around the singularity is lower than the full number of parameters. Also, I thought that it's about the rank of the Hessian, not the vanishing of the derivative. Both perspectives contradict the interpretation in the preceding paragraph, which leaves me confused. 

The uninteresting answer is that SLT doesn't care about the prior (other than its regularity conditions) since it is irrelevant in the  limit.

I vaguely remember that there is a part in the MDL book by Grünwald where he explains how using a good prior such as Jeffrey's prior somewhat changes asymptotic behavior for , but I'm not certain of that. 

Leon Lang10mo10

Thanks also for this post! I enjoy reading the sequence and look forward to post 5 on the connections to alignment :) 

At some critical value , we recognise a phase transition as being a discontinuous change in the free energy or one of its derivatives, for example the generalisation error .

"Discontinuity" might suggest that this happens fast. Yet, e.g. in work on grokking, it actually turns out that these "sudden changes" happen over a majority of the training time (often, the x-axis is on a logarithmic scale). Is this compatible, or would this suggest that phenomena like grokking aren't related to the phase transitions predicted by SLT?

There is, however, one fundamentally different kind of "phase transition" that we cannot explain easily with SLT: a phase transition of SGD in time, i.e. the number gradient descent steps. The Bayesian framework of SLT does not really allow one to speak of time - the closest quantity is the number of datapoints , but these are not equivalent. We leave this gap as one of the fundamental open questions of relating SLT to current deep learning practice.

As far as I know, modern transformers are often only trained once on each data sample, which should close the gap between SGD time and the number of data samples quite a bit. Do you agree with that perspective?

In general, it seems to me that we're probably most interested in phase transitions that happen across SGD time or with more data samples, whereas phase transitions related to other hyperparameters (for example, varying the truth as in your examples here) are maybe less crucial. Would you agree with that?

Would you expect that most phase transitions in SGD time or the number of data samples are first-order transitions (as is the case when there is a loss-complexity tradeoff), or can you conceive of second-order phase transitions that might be relevant in that context as well?

Which altered the posterior geometry, but not that of  since  (up to a normalisation factor).

I didn't understand this footnote. 

but the node-degeneracy and orientation-reversing symmetries only occur under precise configurations of the truth.

Hhm, I thought that these symmetries are about configurations of the parameter vector, irrespective of whether it is the "true" vector or not.
Are you maybe trying to say the following? The truth determines which parameter vectors are preferred by the free energy, e.g. those close to the truth. For some truths, we will have more symmetries around the truth, and thus lower RLCT for regions preferred by the posterior

We will use the label weight annihilation phase to refer to the configuration of nodes such that the weights all point into the centre region and annihilate one another.

It seems to me that in the other phase, the weights also annihilate each other, so the "non-weight annihilation phase" is a somewhat weird terminology. Or did I miss something?

The weight annihilation phase  is never preferred by the posterior

I think there is a typo and you meant .

Load More