Alexander Gietelink Oldenziel

(...) the term technical is a red flag for me, as it is many times used not for the routine business of implementing ideas but for the parts, ideas and all, which are just hard to understand and many times contain the main novelties.
- Saharon Shelah


As a true-born Dutchman I endorse Crocker's rules.

For most of my writing, see my shortforms (new shortform, old shortform).

Twitter: @FellowHominid

Personal website:


Singular Learning Theory

Wiki Contributions


A comment by Patrick Foré, professor at the University of Amsterdam

I’m a bit puzzled by this post.
The RLCT is a function of (q,p,m): the sampling distribution q, the prior distribution p and the parameter-to-distribution map m of the statistical model. So it actually takes the parameter-to-distribution map m into account.
However, it was criticised that:
“3. SLT abstracts away from both the prior and the parameter-function map.”
Also, in the chapter “Why the SLT Answer Fails” a rather complex parameterization of a model is constructed, where SLT then supposedly fails, by pointing to a rather complex (in contrast to a simple) model. But this is, of course, because SLT takes the (here complicated) parameter-to-distribution map m into account.
So it is unclear if the criticism now is that SLT actually takes m into account (but in a complicated way) or that it doesn’t …
It was also said:
“4. Hence, SLT is at its core unable to explain generalisation behaviour.”
“SLT does not explain generalisation in neural networks.”
It was shown in Watanabe’s grey book (but there still under a bit more restrictive assumptions), see Remark 6.7 (3), that the Bayes generalization error of Bayesian learning is asymptotically, for sample size n to infinity, given by:
E[Bg] := E[KL(q(x)||p(x|Dn))] ~ RLCT/n
So SLT actually does say something about generalization in the Bayesian learning setting and it is a very satisfying answer imho (similar to what VC-dimension says about binary classification, but where RLCT is defined much more generally and does not just depend on the function class, but on the whole triple (q,p,m) and says something for the average case and not just for the worst case scenario).
Of course, usually people don’t do proper Bayesian deep learning (they usually do MLE/MAP estimation with SGD) and they also plot a different type of generalization error and are interested in different aspects (e.g. finite sample generalization, double-descent behaviour, etc.)
But this gap could be mentioned in the very beginning of the post (maybe even in a table ‘what we want’ vs ‘what SLT currently says’) and then it would be less surprising that SLT says something about something else than what (most/some) people are interested in.
Certainly, what is written under “The Actual Solution” is closer to how deep learning is done in practice. However, this is also an investigation into learning theory for singular models (that is not focused on RLCT), so can also be considered a part of SLT. Furthermore, nothing prevents us from investigating if and how it relates to quantities like RLCT, singular fluctuation, etc. (e.g. if it is providing upper or lower bounds on such quantities).
Maybe the title of the post “My Criticism of Singular Learning Theory” should be replaced by “The deep learning community is interested in something else than what Singular Learning Theory currently provides”
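
For reference, the asymptotic relation quoted in the comment above can be written out as follows. This is only a restatement under the assumptions of Watanabe's book; the symbols λ (for the RLCT) and d (for the number of parameters) are notation introduced here, not taken from the comment. The regular-model line is the standard special case in which the RLCT equals half the parameter count.

```latex
% Bayes generalization error, asymptotically in the sample size n
\mathbb{E}[B_g] := \mathbb{E}\big[\mathrm{KL}\big(q(x) \,\|\, p(x \mid D_n)\big)\big]
  \sim \frac{\lambda}{n}, \qquad n \to \infty, \quad \lambda = \mathrm{RLCT}(q, p, m).

% Special case: a regular model with d parameters has \lambda = d/2, so
\mathbb{E}[B_g] \sim \frac{d}{2n}.

% For singular models \lambda \le d/2, so the leading term can be smaller
% than the parameter count alone would suggest.
```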

Pockets of Deep Expertise 

Why am I so bullish on academic outreach? Why do I keep hammering on 'getting the adults in the room'? 

It's not that I think academics are all Super Smart. 

I think rationalists/alignment people correctly ascertain that most professors don't have much useful to say about alignment & deep learning and often say silly things. They correctly see that much of AI progress is fueled by labs and scale, not ML academia. I am bullish on non-ML academia, especially mathematics, physics and, to a lesser extent, theoretical CS, neuroscience and some parts of ML/AI academia. This is because, while I think 95% of academia is bad and/or useless, there are Pockets of Deep Expertise. Most questions in alignment are close to existing work in academia in some sense, but we have to make the connection!

A good example is 'sparse coding' and 'compressed sensing'. Lots of mech interp has been rediscovering some of the basic ideas of sparse coding. But there is vast expertise in academia about these topics. We should leverage it!

Other examples are singular learning theory, computational mechanics, etc.

re: 1. I agree these are very difficult conceptual puzzles and we're running out of time.

On the other hand, from my pov progress on these questions from within the LW community (and MIRI-adjacent researchers specifically) has been remarkable. Personally, the remarkable breakthrough of Logical Induction first convinced me that these people were actually doing interesting serious things.

I also feel that the number of serious researchers working seriously on these questions is currently small and may be scaled substantially. 

re: metacognition. I am mildly excited about Vanessa's metacognitive agent framework & the work following from Payor's lemma. The theory-practice gap is still huge but real progress is being made rapidly. On the question of metacognition the alignment community could really benefit from trying to engage with academia more: similar questions have been investigated and there are likely Pockets of Deep Expertise to be found.

The linked post you wrote about classical learning theory states that the bounds PAC gives are far looser than what we see in practice for neural networks. In the post you sketch some directions in which tighter bounds may be proven. It is my understanding that these directions have not been pursued further.

Given all that, "Fully adequate account of generalization" seems like an overstatement, wouldn't you agree?

At best we can say that PAC gives a nice toy model for thinking about notions like generalization and learnability, as far as I can tell. Maybe I'm wrong (I'm not familiar with the literature), and I'd love to know more about what PAC & classical learning theory can tell us about neural networks.

Abnormalised sampling?
Probability theory talks about sampling for probability distributions, i.e. normalized measures. However, non-normalized measures abound: weighted automata, infra-stuff, uniform priors on noncompact spaces, wealth in logical-inductor-esque math, quantum stuff?? etc.

Most constructions in probability theory go through for arbitrary measures and don't need the normalization assumption. Except, crucially, sampling.

What does it even mean to sample from a non-normalized measure? What is unnormalized abnormal sampling?

I don't know. 

Infra-sampling has an interpretation of sampling from a distribution made by a demonic choice. I don't have good interpretations for other unnormalized measures. 


Concrete question: is there a law of large numbers for unnormalized measures? 

Let f be a measurable function and m a measure. Then the expectation value is defined as E_m[f] = ∫ f dm. A law of large numbers for unnormalized measures would have to say something about repeated abnormal sampling.


I have no real ideas. Curious to learn more. 
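
For the finite-mass case only, one naive reading of the question above is: normalize the measure, sample as usual, and multiply the sample mean by the total mass. The sketch below illustrates that reading for a discrete measure. It is not an answer to the general question (infinite or infra measures are not covered), and the function names `integral_estimate` and `exact_integral` are hypothetical helpers made up for this illustration.

```python
import random


def exact_integral(points, weights, f):
    """Integral of f against the discrete measure m = sum_i weights[i] * delta(points[i])."""
    return sum(w * f(x) for x, w in zip(points, weights))


def integral_estimate(points, weights, f, n_samples, seed=0):
    """Monte Carlo estimate of the same integral.

    Assumption: the weights are nonnegative with finite, positive total mass.
    They do NOT need to sum to 1.
    """
    total_mass = sum(weights)  # Z = m(X)
    rng = random.Random(seed)
    # Sample from the normalized measure m / Z (random.choices accepts
    # relative weights, so no explicit division is needed).
    draws = rng.choices(points, weights=weights, k=n_samples)
    sample_mean = sum(f(x) for x in draws) / n_samples
    # The ordinary law of large numbers gives sample_mean -> (1/Z) * integral,
    # so rescaling by Z recovers the unnormalized integral.
    return total_mass * sample_mean


if __name__ == "__main__":
    points = [0, 1, 2]
    weights = [0.5, 2.0, 1.5]  # total mass 4.0, not a probability distribution

    def square(x):
        return x * x

    print(exact_integral(points, weights, square))
    print(integral_estimate(points, weights, square, n_samples=100_000))
```

In this reading all the "abnormality" is pushed into the single scalar Z, which is exactly what stops working once the total mass is infinite or the measure is not a single distribution.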

From reading your post it seems that classical VC theory gives vacuous bounds for NN learning behaviour. Correct me if I'm wrong. You say that the PAC formalism can be improved to be more realistic and suggest more non-vacuous bounds may be proved. Do you have a reference where non-vacuous bounds are proved?

"My point there was mainly that theorems about generalisation in the infinite-data limit are likely to end up being weaker versions of more general results from statistical and computational learning theory."

What general results from statistical and computational learning theory are you referring to here exactly?

I notice I am confused by this. It seems implausible that an LLM can execute a devious x-risk plan in a single forward pass based on a wrong prompt.

Alignment work mostly looks like standard academic science in practice. Young people in regular academia are paid a PhD stipend, not a Bay Area programmer salary...
