My name is pronounced "YOO-ar SKULL-se".
I'm a DPhil Scholar at the Future of Humanity Institute in Oxford.
What I'm suggesting is that volume in high-dimensions can concentrate on the boundary.
Yes. I imagine this is why overtraining doesn't make a huge difference.
Falsifiable Hypothesis: Compare SGD with overtaining to the random sampling algorithm. You will see that functions that are unlikely to be generated by random sampling will be more likely under SGD with overtraining. Moreover, functions that are more likely with random sampling will be become less likely under SGD with overtraining.
See e.g., page 47 in the main paper.
(Maybe you weren't disagreeing with Zach and were just saying the same thing a different way?)
I'm honestly not sure, I just wasn't really sure what he meant when he said that the Bayesian and the Kolmogorov complexity stuff were "distractions from the main point".
This feels similar to:Saying that MLK was a "criminal" is one way of saying that MLK thought and acted as though he had a moral responsibility to break unjust laws and to take direct action.(This is an exaggeration but I think it is directionally correct. Certainly when I read the title "neural networks are fundamentally Bayesian" I was thinking of something very different.)
This feels similar to:
Saying that MLK was a "criminal" is one way of saying that MLK thought and acted as though he had a moral responsibility to break unjust laws and to take direct action.
(This is an exaggeration but I think it is directionally correct. Certainly when I read the title "neural networks are fundamentally Bayesian" I was thinking of something very different.)
Haha. That's obviously not what we're trying to do here, but I do see what you mean. I originally wanted to express these ideas in more geometric language, rather than probability-theoretic language, but in the end we decided to go for more probability-theoretic language anyway.
I agree that this arguably could be mildly misleading. For example, the correspondence between SGD and Bayesian sampling only really holds for some initialisation distributions. If you deterministically initialise your neural network to the origin (i.e., all zero weights) then SGD will most certainly not behave like Bayesian sampling with the initialisation distribution as its prior. Then again, the probability-theoretic formulation might make other things more intuitive.
I agree with your summary. I'm mainly just clarifying what my view is of the strength and overall role of the Algorithmic Information Theory arguments, since you said you found them unconvincing.
I do however disagree that those arguments can be applied to "literally any machine learning algorithm", although they certainly do apply to a much larger class of ML algorithms than just neural networks. However, I also don't think this is necessarily a bad thing. The picture that the AIT arguments give makes it reasonably unsurprising that you would get the double-descent phenomenon as you increase the size of a model (at small sizes VC-dimensionality mechanisms dominate, but at larger sizes the overparameterisation starts to induce a simplicity bias, which eventually starts to dominate). Since you get double descent in the model size for both neural networks and eg random forests, you should expect there to be some mechanism in common between them (even if the details of course differ from case to case).
I have a few comments on this:
Fundamentally the point here is that generalization performance is explained much more by the neural network architecture, rather than the structure of stochastic gradient descent, since we can see that stochastic gradient descent tends to behave similarly to (an approximation of) random sampling. The paper talks a bunch about things like SGD being (almost) Bayesian and the neural network prior having low Kolmogorov complexity; I found these to be distractions from the main point.
The main point, as I see it, is essentially that functions with good generalisation correspond to large volumes in parameter-space, and that SGD finds functions with a probability roughly proportional to their volume. Saying that SGD is “Bayesian” is one way of saying the latter, and the Kolmogorov complexity stuff is a way to formalise some intuitions around the former.
Beyond that, approximating the random sampling probability with a Gaussian process is a fairly delicate affair and I have concerns about the applicability to real neural networks.
This has been done with real neural networks! See this, for example -- they use Gaussian Processes on stuff like Mobilenetv2, Densenet121, and Resnet50. It seems to work well.
One way that SGD could differ from random sampling is that SGD will typically only reach the boundary of a region with zero training error, whereas random sampling will sample uniformly within the region. However, in high dimensional settings, most of the volume is near the boundary, so this is not a big deal. I'm not aware of any work that claims SGD uniformly samples from this boundary, but it's worth considering that possibility if the experimental results hold up.
We have done overtraining, which should allow SGD to penetrate into the region. This doesn’t seem to make much difference for the probabilities we get.
Rohin’s opinion: [...]
I basically agree with what you say here.
This part of the argument is indeed quite general, but it’s not vacuous. For the argument to apply you need it to be the case that the function space is overparameterised, that the parameter-function map has low complexity relative to the functions in the function space, and that parameter-function map is biased in some direction. This will not be the case for all learning algorithms.
But, I do agree that the Levin bound argument doesn’t really capture the “mechanism” of what’s going on here. I can of course only speak for myself (and not the other people involved with this work), but I think of the Levin bound argument as essentially a more formal way to state this intuition. I.e., it is a loose argument for why we needn't be surprised that simple functions have larger measures in parameter-space, even if the argument doesn’t identify all the precise details of how this works in any particular case. For example, while the argument doesn’t apply to all ML algorithms, it does apply to all neural network architectures in exactly the same way, but clearly there are important differences in the inductive bias of eg CNNs and RNNs.
(Also: I don’t think the notion of “low effective parameterisation” really captures what’s going on here, but for the reason you point out yourself.)
Like Rohin, I'm not impressed with the information theoretic side of this work.Specifically, I'm wary of the focus on measuring complexity for functions between finite sets, such as binary functions.Mostly, we care about NN generalization on problems where the input space is continuous, generally R^n. The authors argue that the finite-set results are relevant to these problems, because one can always discretize R^n to get a finite set. I don't think this captures the kinds of function complexity we care about for NNs.
Like Rohin, I'm not impressed with the information theoretic side of this work.
Specifically, I'm wary of the focus on measuring complexity for functions between finite sets, such as binary functions.
Mostly, we care about NN generalization on problems where the input space is continuous, generally R^n. The authors argue that the finite-set results are relevant to these problems, because one can always discretize R^n to get a finite set. I don't think this captures the kinds of function complexity we care about for NNs.
We’re not saying that discrete complexity measures fully capture what we care about for NNs! We do however think that they are sufficiently relevant to be informative for the bigger picture, even if just as a proxy for what we actually care about.
Most complexity measures give roughly similar values for the (relative) complexity of most objects, so our assumption is that if something is the case for a bunch of different tractable complexity measures, then this is also likely to be the case for whatever the “ideal” complexity measure would be in the relevant case. In particular, if P(f)≃2−K(x)+C regardless of whether K represents Boolean complexity, or LZ complexity, etc, then this is also likely to be true for the “ideal” complexity measure for neural networks.
Also: since we’re estimating various probabilities by sampling, we basically need to discretise the function space. If you have any concrete suggestions for how to get around this then we’re all ears!
As for the rest of your comment -- what you’re saying here seems true to me, but I’m not sure I see how any of this is a counterpoint to anything we’re saying?
Yes, it does of course apply in that sense.
I guess the question then basically is which level of abstraction we think would be the most informative or useful for understanding what's going on here. I mean, we could for example also choose to take into account the fact that any actual computer program runs on a physical computer, which is governed by the laws of electromagnetism (in which case the parameter-space might count as continuous again).
I'm not sure if accounting for the floating-point implementation is informative or not in this case.
There is a proof of it in "An Introduction to Kolmogorov Complexity and Its Applications" by Ming Li & Paul Vitanyi.
Ah, I certainly agree with this.
I do not wish to claim that all functions with low Kolmogorov complexity have large volumes in the parameter-space of a sufficiently large neural network. In fact, I can point to several concrete counterexamples to this claim. To give one example, the identity function certainly has a low Kolmogorov complexity, but it's very difficult for a (fully connected feed-forward) neural network to learn this function (if the input and output is represented in binary form by a bit string). If you try to learn this function by training on only odd numbers then the network will not robustly generalise to even numbers (or vice versa). Similarly, if you train using only number in a certain range then the network will not robustly generalise outside this range. This is because a pattern such as "the n'th input neuron is equal to the n'th output neuron" lacks a simple representation in a neural network (and hence this function has a small parameter-space volume, even though it has low Kolmogorov complexity). The same goes for the function that recognises palindromes, and etc.
So, I agree that there are certain functions with low Kolmogorov complexity that a neural network normally cannot "see" properly. I also think one could frame a lot of the research on developing new neural network architectures as being about making neural networks able to "see" more kinds of functions. For example, NALUs give neural networks the ability to "see" arithmetic relations more easily. I hence certainly think it's a very relevant question which complexity measure best describes the bias in neural networks (and I think this actually matters for practical problems). Note that the identity function is very smooth.
This is a bit of a tangent, but the Levin bound is actually about Kolmogorov complexity. It's a fairly simple theorem; the proof is constructive, and basically shows that a given function f which corresponds to many parameters in the parameter-space cannot be too complex, by constructing a simple program which computes f. Very roughly, if the parameter-space is finite and discrete, then we could construct a Huffman code for the function space (where the distribution over the function-space is the distribution that corresponds to the uniform distribution over the parameter-space). We can then make a computer program that computes f by concatenating the Huffman code of f and the parameter-function map m (which gives an upper bound on the Kolmogorov complexity of functions with large volumes). Of course, this theorem does not per se actually apply to neural networks, since it assumes that the parameter-space is finite and discrete, so in this context it's essentially just an intuition pump.
I'm not sure I agree with this -- I think Kolmogorov complexity is a relevant notion of complexity in this context.