Daniel Murfet

Except nobody wants to hear about it at parties.

You seem to do OK...

If they only would take the time to explain things simply you would understand.

This is an interesting one. I field this comment quite often from undergraduates, and it's hard to carve out enough quiet space in a conversation to explain what they're doing wrong. In a way the proliferation of math on YouTube might be exacerbating this hard step from tourist to troubadour.

As a supervisor of numerous MSc and PhD students in mathematics, I see that when someone finishes a math degree and considers a job, the tradeoffs are usually between meaning, income, freedom, evil, etc., with some of the obvious choices being high/low along (relatively?) obvious axes. It's extremely striking to see young talented people with math or physics (or CS) backgrounds going into technical AI alignment roles in big labs, apparently maximising along many (or all) of these axes!

Especially in light of recent events I suspect that this phenomenon, which appears too good to be true, actually is.

Please develop this question as a documentary special, for lapsed-Starcraft player homeschooling dads everywhere.

I don't understand the strong link between Kolmogorov complexity and generalisation you're suggesting here. I think by "generalisation" you must mean something more than "low test error". Do you mean something like "out of distribution" generalisation (whatever that means)?

Well neural networks do obey Occam's razor, at least according to the formalisation of that statement that is contained in the post (namely, neural networks when formulated in the context of Bayesian learning obey the free energy formula, a generalisation of the BIC which is often thought of as a formalisation of Occam's razor).
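For reference, the free energy formula mentioned above can be sketched as follows. The notation is an assumption, following Watanabe's usual conventions rather than anything fixed in this thread: $F_n$ is the Bayesian free energy (negative log marginal likelihood) on $n$ samples, $L_n$ is the empirical negative log likelihood per sample, $w_0$ is an optimal parameter, $\hat{w}$ is the maximum likelihood estimate, $\lambda$ is the learning coefficient (RLCT), $m$ is its multiplicity and $d$ is the number of parameters.

```latex
% Free energy asymptotics for singular models (assumed notation)
F_n = n L_n(w_0) + \lambda \log n - (m - 1) \log\log n + O_p(1)

% BIC, recovered in the regular case where \lambda = d/2 and m = 1
\mathrm{BIC} = n L_n(\hat{w}) + \frac{d}{2} \log n
```

In both expressions the $\log n$ term penalises complexity, which is the sense in which the formula is read as a formalisation of Occam's razor. In singular models $\lambda \le d/2$, so the penalty can be much smaller than parameter counting suggests.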

I think that expression of Jesse's is also correct, in context.

However, I accept your broader point, which I take to be: readers of these posts may naturally draw the conclusion that SLT currently says something profound about (ii) from my other post, and the use of terms like "generalisation" in broad terms in the more expository parts (as opposed to the technical parts) arguably doesn't make enough effort to prevent them from drawing these inferences.

I have noticed people at the Berkeley meeting and elsewhere believing (ii) was somehow resolved by SLT, or just in a vague sense thinking SLT says something more than it does. While there are hard tradeoffs to make in writing expository work, I think your criticism of this aspect of the messaging around SLT on LW is fair and to the extent it misleads people it is doing a disservice to the ongoing scientific work on this important subject.

I'm often critical of the folklore-driven nature of the ML literature and what I view as its low scientific standards, and especially in the context of technical AI safety I think we need to aim higher, in both our technical and more public-facing work. So I'm grateful for the chance to have this conversation (and to anybody reading this who sees other areas where they think we're falling short, read this as an invitation to let me know, either privately or in posts like this).

I'll discuss the generalisation topic further with the authors of those posts. I don't want to pre-empt their point of view, but it seems likely we may go back and add some context on (i) vs (ii) in those posts or in comments, or we may just refer people to this post for additional context. Does that sound reasonable?

At least right now, the value proposition I see of SLT lies not in explaining the "generalisation puzzle" but in understanding phase transitions and emergent structure; that might end up circling back to say something about generalisation, eventually.

However, I do think that there is another angle of attack on this problem that (to me) seems to get us much closer to a solution (namely, to investigate the properties of the parameter-function map)

Seems reasonable to me!

Re: the articles you link to. I think the second one by Carroll is quite careful to say things like "we can now understand why singular models have the capacity to generalise well" which seems to me uncontroversial, given the definitions of the terms involved and the surrounding discussion.

I agree that Jesse's post has a title "Neural networks generalize because of this one weird trick" which is clickbaity, since SLT does *not* in fact yet explain why neural networks appear to generalise well on many natural datasets. However the actual article is more nuanced, saying things like "SLT seems like a promising route to develop a better understanding of generalization and the limiting dynamics of training". Jesse gives a long list of obstacles to walking this route. I can't find anything in the post itself to object to. Maybe you think its optimism is misplaced, and fair enough.

So I don't really understand which claims about inductive bias or generalisation behaviour in these posts you think are invalid?

I think that what would probably be the most important thing to understand about neural networks is their inductive bias and generalisation behaviour, on a fine-grained level, and I don't think SLT can tell you very much about that. I assume that our disagreement must be about one of those two claims?

That seems probable. Maybe it's useful for me to lay out a more or less complete picture of what I think SLT *does* say about generalisation in deep learning in its current form, so that we're on the same page. When people refer to the "generalisation puzzle" in deep learning I think they mean two related but distinct things:

(i) the *general* question about how it is *possible* for overparametrised models to have good generalisation error, despite classical interpretations of Occam's razor like the BIC

(ii) the *specific* question of why neural networks, among all possible overparametrised models, actually have good generalisation error in practice (saying this is *possible* is much weaker than actually explaining why it happens).

In my mind SLT comes close to resolving (i), modulo a bunch of questions which include: whether the asymptotic limit taking the dataset size to infinity is appropriate in practice, the relationship between Bayesian generalisation error and test error in the ML sense (comes down largely to Bayesian posterior vs SGD), and whether hypotheses like relative finite variance are appropriate in the settings we care about. If all those points were treated in a mathematically satisfactory way, I would feel that the *general* question is completely resolved by SLT.
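As a sketch of the asymptotic statement behind (i), in assumed notation: $B_g$ denotes the Bayes generalisation error at inverse temperature 1, measured as excess over the loss of an optimal parameter, $n$ is the dataset size, $\lambda$ is the learning coefficient and $d$ is the number of parameters.

```latex
% Expected Bayes generalisation error (assumed notation)
\mathbb{E}[B_g] = \frac{\lambda}{n} + o\!\left(\frac{1}{n}\right),
\qquad \lambda \le \frac{d}{2}
```

Equality $\lambda = d/2$ holds for regular models, so in the singular case a large parameter count $d$ does not by itself force a large generalisation error.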

Informally, knowing SLT just dispels the mystery of (i) sufficiently that I don't feel personally motivated to resolve all these points, although I hope people work on them. One technical note on this: there are some brief notes in SLT6 arguing that "test error" as a model selection principle in ML, presuming some relation between the Bayesian posterior and SGD, is similar to selecting models based on what Watanabe calls the *Gibbs* generalisation error, which is determined by both the RLCT and the singular fluctuation. Since I don't think it's crucial to our discussion I'll just elide the difference between Gibbs generalisation error in the Bayesian framework and test error in ML, but we can return to that if it actually contains important disagreement.

Anyway I'm guessing you're probably willing to grant (i), based on SLT or your own views, and would agree the real bone of contention lies with (ii).

Any theoretical resolution to (ii) has to involve some nontrivial ingredient that actually talks about *neural networks*, as opposed to general singular statistical models. The only specific results about neural networks and generalisation in SLT are the old results about RLCTs of tanh networks, more recent bounds on shallow ReLU networks, and Aoyagi's upcoming results on RLCTs of deep linear networks (particularly that the RLCT is bounded above even when you take the depth to infinity).

As I currently understand them, these results are far from resolving (ii). In its current form SLT doesn't supply any deep reason for why neural networks *in particular* are often observed to generalise well when you train them on a range of what we consider "natural" datasets. We don't understand what distinguishes neural networks from generic singular models, nor what we mean by "natural". These seem like hard problems, and at present it looks like one has to tackle them in some form to really answer (ii).

Maybe that has significant overlap with the critique of SLT you're making?

Nonetheless I think SLT reduces the problem in a way that seems nontrivial. If we boil the "ML in-practice model selection" story down to "choose the model with the best test error given fixed training steps" and allow some hand-waving in the connection between training steps and number of samples, Gibbs generalisation error and test error etc, and use Watanabe's theorems (see Appendix B.1 of the quantifying degeneracy paper for a local formulation) to write the Gibbs generalisation error as

$$\mathbb{E}[G_g] = L_0 + \frac{\lambda + \nu}{n} + o\left(\frac{1}{n}\right)$$

where $\lambda$ is the learning coefficient, $\nu$ is the singular fluctuation and $L_0$ is roughly the loss (the quantity that we can estimate from samples is actually slightly different, I'll elide this), then (ii), which asks why neural networks on natural datasets have low generalisation error, is at least reduced to the question of **why neural networks on natural datasets have low $\lambda + \nu$.**

I don't know much about this question, and agree it is important and outstanding.

Again, I think this reduction is not trivial since the link between $\lambda$, $\nu$ and generalisation error is nontrivial. Maybe at the end of the day this is the main thing we in fact disagree on :)

You might reconstruct your sacred Jeffreys prior with a more refined notion of model identity, which incorporates derivatives (jets on the geometric/statistical side and more of the algorithm behind the model on the logical side).