
...There is an extensive discussion about feature learning in relation to the aforementioned Mingard et al. result in the comments of this post. The conclusion of the discussion was that feature learning is uncoupled from inductive bias for infinite-width (and, under further conditions, finite-width) neural networks when trained by a random-sampling process (essentially how NNGPs work).
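To make the "random sampling = NNGP" correspondence concrete, here is a minimal sketch (my own illustration, not from the linked discussion) of exact inference with the order-1 arc-cosine kernel, which is the NNGP kernel of an infinitely wide single-hidden-layer ReLU net with unit weight variance and no biases. "Training" is just Bayesian conditioning on the data:

```python
import numpy as np

def relu_arccos_kernel(X1, X2):
    """Order-1 arc-cosine kernel: the NNGP kernel of an infinitely wide
    single-hidden-layer ReLU network (unit weight variance, no biases)."""
    norms1 = np.linalg.norm(X1, axis=1)
    norms2 = np.linalg.norm(X2, axis=1)
    cos = np.clip((X1 @ X2.T) / np.outer(norms1, norms2), -1.0, 1.0)
    theta = np.arccos(cos)
    return np.outer(norms1, norms2) * (np.sin(theta) + (np.pi - theta) * np.cos(theta)) / (2 * np.pi)

def nngp_posterior_mean(X_train, y_train, X_test, noise=1e-6):
    """GP regression posterior mean -- the infinite-width analogue of
    conditioning a randomly sampled net on the training data."""
    K = relu_arccos_kernel(X_train, X_train) + noise * np.eye(len(X_train))
    K_star = relu_arccos_kernel(X_test, X_train)
    return K_star @ np.linalg.solve(K, y_train)

# Tiny regression problem: the posterior mean interpolates the training data.
X = np.array([[1.0, 0.2], [0.5, -1.0], [-0.8, 0.3]])
y = np.array([1.0, -1.0, 0.5])
print(nngp_posterior_mean(X, y, X, noise=1e-8))  # close to [1.0, -1.0, 0.5]
```

Note that no parameters are ever "learned" here: the hidden-layer representation is baked into the fixed kernel, which is the kernel-level statement of the no-feature-learning point below.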

The open question is whether the probability distribution over functions after each layer is the same whether you train with SGD or random sampling. Given how the posteriors of

Good guess ;)

Haha some things are pretty obvious - it's always really nice to get a very different perspective on an idea, thank you for continuing the conversation!

I see -- so you're saying that even though the distribution of output functions learned by an infinitely-wide randomly-sampled net is unchanged if you freeze everything but the last layer, the distribution of intermediate functions might change. If true, this would mean that feature learning and inductive bias are 'uncoupled' for infinite-width randomly-sampled nets

That is exactly what I'm saying. I ...


I just came across this paper which derives an expression for the posterior
distribution of the weights in each layer in the infinite-width limit. The
result: the distribution is unchanged from the prior in every layer but the
last. So it indeed seems that there is no feature learning in this limit.
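Relatedly, one can check numerically that plain gradient descent also moves the early-layer weights less and less as width grows. This is a toy sketch under assumptions I chose (NTK-style parameterization with the output scaled by 1/sqrt(width), a made-up regression task, arbitrary hyperparameters), not the paper's derivation:

```python
import numpy as np

def first_layer_movement(width, steps=500, lr=0.01, seed=0):
    """Relative change in the first-layer ('feature') weights after training a
    one-hidden-layer ReLU net with full-batch gradient descent.
    NTK-style parameterization: the output is scaled by 1/sqrt(width)."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((8, 3))          # 8 training points, 3 input dims
    y = rng.standard_normal(8)
    W1 = rng.standard_normal((3, width))     # first layer -- the 'features'
    w2 = rng.standard_normal(width)
    W1_init = W1.copy()
    for _ in range(steps):
        h = np.maximum(X @ W1, 0.0)                 # hidden activations
        err = h @ w2 / np.sqrt(width) - y           # residual for 0.5*sum-of-squares loss
        grad_w2 = h.T @ err / np.sqrt(width)
        grad_W1 = X.T @ (np.outer(err, w2 / np.sqrt(width)) * (h > 0))
        W1 -= lr * grad_W1
        w2 -= lr * grad_w2
    return np.linalg.norm(W1 - W1_init) / np.linalg.norm(W1_init)

# The relative movement of the feature weights shrinks as width grows.
for width in [64, 1024, 16384]:
    print(width, first_layer_movement(width))
```

Under this parameterization the relative first-layer movement scales roughly like 1/sqrt(width), consistent with the features staying at their prior in the infinite-width limit.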

By hypothesis, all three methods will let us fit the target function. You seem to be saying [I think, correct me if I'm wrong] that all three methods should have the same inductive bias as well.

Not exactly the same - it is known that there is a width dependence on inductive biases. I believe that typically wide networks are better, although I know of some counterexamples.

They're clearly different in some respects -- (C) can do transfer learning but (A) cannot

I think this is the main source of our disagreement. First of all, while the posterior of an N...


Good guess ;)
I see -- so you're saying that even though the distribution of output functions
learned by an infinitely-wide randomly-sampled net is unchanged if you freeze
everything but the last layer, the distribution of intermediate functions might
change. If true, this would mean that feature learning and inductive bias are
'uncoupled' for infinite-width randomly-sampled nets. I think this is false,
however -- that is, I think it's provable that the distribution of intermediate
functions does not change in the infinite-width limit when you condition on the
training data, even when conditioning over all layers. I can't find a reference
offhand though, I'll report back if I find anything resolving this one way or
another.

[Advance apologies if I haven't explained stuff well enough here. I think the important theme here is that we should maintain a way of thinking about the random sampling picture that is distinct from NNGPs.]

Right, this is an even better argument that NNGPs/random-sampled nets don't learn features.

Ah I see I need to explain myself further - the following is very counterintuitive but I think it's right. Learning features involves the movement of weights in the early layers, by definition. The claim I am making is that the reason why feature learning is good ...


Yes, I think so. Let's go over the 'thin network' example -- we want to learn
some function which can be represented by a thin network. But let's say a
randomly-initialized thin network's intermediate functions won't be able to fit
the function -- that is (with high probability over the random initialization)
we won't be able to fit the function just by changing the parameters of the last
layer. It seems there are a few ways we can alter the network to make fitting
possible:
(A) Expand the network's width until (with high probability) it's possible to
fit the function by only altering the last layer
(B) Keeping the width the same, re-sample the parameters in all layers until we
find a setting that can fit the function
(C) Keeping the width the same, train the network with SGD
By hypothesis, all three methods will let us fit the target function. You seem
to be saying [I think, correct me if I'm wrong] that all three methods should
have the same inductive bias as well. I just don't see any reason this should be
the case -- on the face of it, I would guess that all three have different
inductive biases (though A and B might be similar). They're clearly different in
some respects -- (C) can do transfer learning but (A) cannot (B is unclear).
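For concreteness, here's a toy sketch of (B) versus (C) that I put together -- all details (a width-3 tanh net with hidden biases, four XOR-labeled points, rejection sampling as the 'random sampling', and full-batch gradient descent standing in for SGD) are my own assumptions, not anything from the thread:

```python
import numpy as np

rng = np.random.default_rng(0)
# Four points with an XOR-like labeling that a width-3 tanh net can represent.
X = np.array([[1, 1], [1, -1], [-1, 1], [-1, -1]], dtype=float)
y = np.array([1.0, -1.0, -1.0, 1.0])

def forward(W1, b1, w2, X):
    return np.tanh(X @ W1 + b1) @ w2

# (B) Keep the width fixed; re-sample ALL parameters until the signs fit.
def fit_by_sampling(max_draws=200_000):
    for trial in range(1, max_draws + 1):
        W1 = rng.standard_normal((2, 3))
        b1 = rng.standard_normal(3)
        w2 = rng.standard_normal(3)
        if np.all(np.sign(forward(W1, b1, w2, X)) == y):
            return (W1, b1, w2), trial
    raise RuntimeError("no fitting net found")

# (C) Keep the width fixed; run full-batch gradient descent on squared error.
def fit_by_gd(steps=20_000, lr=0.02):
    g = np.random.default_rng(1)
    W1, b1, w2 = g.standard_normal((2, 3)), g.standard_normal(3), g.standard_normal(3)
    mse0 = np.mean((forward(W1, b1, w2, X) - y) ** 2)
    for _ in range(steps):
        h = np.tanh(X @ W1 + b1)
        err = h @ w2 - y
        grad_pre = np.outer(err, w2) * (1 - h ** 2)   # backprop through tanh
        grad_w2 = h.T @ err
        W1 -= lr * (X.T @ grad_pre)
        b1 -= lr * grad_pre.sum(axis=0)
        w2 -= lr * grad_w2
    return (W1, b1, w2), mse0, np.mean((forward(W1, b1, w2, X) - y) ** 2)

paramsB, trials = fit_by_sampling()
print("(B) fits the four points after", trials, "random draws")
paramsC, mse0, mse = fit_by_gd()
print(f"(C) gradient descent: training MSE {mse0:.3f} -> {mse:.3f}")
```

Both procedures produce a net consistent with the data, but there's no obvious reason the two distributions over fitted parameters (and hence over behavior off the training set) should coincide -- which is exactly the open question.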
My intuition here is that SGD-trained nets can learn functions non-linearly
while NTK/GP can only do so linearly. So in the car detector example, SGD is
able to develop a neuron detecting cars through some as-yet unclear 'feature
learning' mechanism. The NTK/GP can do so as well, sort of, since they're
universal function approximators. However, the way they do this is by taking a
giant linear combination of random functions which is able to function
identically to a car detector on the data points given. It seems like this might
be more fragile/generalize worse than the neurons produced by SGD. Though that
is admittedly somewhat conjectural at this stage, since we don't really have a
great understanding of how feature learning in

I 100% agree that Kolmogorov complexity is not the best measure of complexity here - and I would refer anyone to your and Joar's comments at https://www.lesswrong.com/posts/YSFJosoHYFyXjoYWa/why-neural-networks-generalise-and-why-they-are-kind-of for an excellent discussion of this. I am aware that Kolmogorov complexity is defined wrt a UTM, and I should have offered clarification in the blog that a lot of steps were used to make the link between Kolmogorov complexity and these types of input-output maps, and state that we only talk about Kolmogorov comple...

[First thank you for your comments and observations - it's always interesting to read pushback]

First, I think my point about using the GP to measure the volume occupied by functions locally to where SGD-trained networks are initialised is important. We are not really comparing NNs to NNGPs (well, technically we are, but we are interpreting what the NNGP does differently). We are trying to argue that SGD acts as a random sampler - it will find functions with probability proportional to the volume of those functions local to where the optimiser is in parameter-...


And thanks for engaging with my random blog comments! TBC, I think you guys are
definitely on the right track in trying to relate SGD to function simplicity,
and the empirical work you've done fleshing out that picture is great. I just
think it could be even better if it was based around a better SGD scaling limit
;)
Right, this is an even better argument that NNGPs/random-sampled nets don't
learn features.
I think this only applies to NNGP/random-sampled nets, not SGD-trained nets. To
apply to SGD-trained nets, you'd need to show that the new features learned by
SGD have the same distribution as the features found in an infinitely-wide
random net, but I don't think this is the case. By illustration, some
SGD-trained nets can develop expressive neurons like 'car detector', enabling
them to fit the data with a relatively small number of such neurons. If you used
an NNGP to learn the same thing, you wouldn't get a single 'car detector'
neuron, but rather some huge linear combination of high-frequency features that
can approximate the cars seen in the dataset. I think this would probably
generalize worse than the network with an actual 'car detector' (this isn't
empirical evidence of course, but I think what we know about SGD-trained nets
and the NNGP strongly suggests a picture like this).
Interesting, haven't seen this before. Just skimming the paper, it sounds like
the very small learning rate + added white noise might result in different
limiting behavior from usual SGD. Generally it seems that there are a lot of
different possible limits one can take; empirically SGD-trained nets do seem to
have 'feature learning' so I'm skeptical of limits that don't have that (I assume
they don't have them for theoretical reasons, anyway. Would be interesting to
actually examine the features found in networks trained like this, and to see if
they can do transfer learning at all). Re: 'colored noise', not sure to what
extent this matters. I think a more likely source of discrepancy

Check out https://arxiv.org/pdf/1909.11522.pdf where we do some similar analysis of perceptrons but in higher dimensions. Theorem 4.1 shows that there is an anti-entropy bias - in other words, functions with either mostly 0s or mostly 1s are exponentially more likely to show up than expected under a uniform prior - which holds for perceptrons of any dimension. This proves a (fairly trivial) bias towards simple functions, although it doesn't say anything about why a function like 010101010101... appears more frequently than other functions in the maximum-entropy class.
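As a quick empirical illustration of that anti-entropy bias (a sketch I put together, not the paper's Theorem 4.1 itself -- the sample size and Gaussian parameter prior are my choices), one can sample random perceptrons on {0,1}^5 and histogram how many of the 32 inputs map to 1:

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(0)
n = 5
inputs = np.array(list(product([0, 1], repeat=n)), dtype=float)  # all 32 inputs

samples = 100_000
W = rng.standard_normal((samples, n))      # one random Gaussian perceptron per row
b = rng.standard_normal(samples)
truth_tables = (inputs @ W.T + b) > 0      # shape (32, samples)
num_ones = truth_tables.sum(axis=0)        # entropy class of each sampled function
counts = np.bincount(num_ones, minlength=2 ** n + 1)

# Under a uniform prior over all 2^32 Boolean functions on 5 bits, the number
# of ones would be Binomial(32, 1/2): mass ~2^-32 at 0 or 32, peaked at 16.
for k in [0, 8, 16, 24, 32]:
    print(f"#ones = {k:2d}: fraction of sampled perceptrons = {counts[k] / samples:.4f}")
```

The all-0s and all-1s classes, which a uniform prior over functions would hit with probability ~2^-32 each, show up at a rate that's orders of magnitude higher -- the anti-entropy bias in miniature.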

I agree that "large volume --> simple" is what is shown by the evidence in the papers, as opposed to "simple --> large volume", which is in fact not a claim we make anywhere (if we do accidentally, please let me know and I will fix it) - see https://arxiv.org/abs/1910.00971 for more detail on this, or Joar Skalse's comments on https://www.alignmentforum.org/posts/YSFJosoHYFyXjoYWa/why-neural-networks-generalise-and-why-they-are-kind-of, where he discusses functions which don't obey this rule - such as the identity function, which has small volume a...

32y

Yeah, I didn't mean to imply that you guys said 'simple --> large volume'
anywhere. I just think it's a point worth emphasizing, especially around here
where I think people will imagine "Solomonoff Induction-like" when they hear
about a "bias towards simple functions".
But in the infinite-width setting, Bayesian inference in general is given by a
GP limit, right? Initialization doesn't matter. This means that the arguments
for lack of feature learning still go through. It's technically possible that
there could be feature learning in finite-width randomly-sampled networks, but
it seems strange that finiteness would help here (and any such learning would be
experimentally inaccessible). This is a major reason that I'm skeptical of the
"SGD as a random sampler" picture.

I think a lot of the points you raise here have good answers at https://www.alignmentforum.org/posts/YSFJosoHYFyXjoYWa/why-neural-networks-generalise-and-why-they-are-kind-of - see in particular replies by Joar Skalse (the author of that post). You say that you don't think it surprising that the posteriors of NNs are similar to NNGPs on the data on which they were trained to fit - I think this statement is only unsurprising if you assume that SGD is not playing a particularly big role in the inductive bias (for small/medium scale datasets and architectures...


I think we basically agree on the state of the empirical evidence -- the
question is just whether NTK/GP/random-sampling methods will continue to match
the performance of SGD-trained nets on more complex problems, or if they'll
break down, ultimately being a first-order approximation to some more complex
dynamics. I think the latter is more likely, mostly based on the lack of feature
learning in NTK/GP/random limits.
re: the architecture being the source of inductive bias -- I certainly think
this is true in the sense that architecture choice will have a bigger effect on
generalization than hyperparameters, or the choice of which local optimizer to
use. But I do think that using a local optimizer at all, as opposed to randomly
sampling parameters, is likely to have a large effect.

AGD can train any architecture, dataset and batch size combination (as far as we have tested), out-of-the-box. I would argue that this is a qualitative change to the current methods, where you have to find the right learning rate for every batch size, architecture and dataset combination, in order to converge in an optimal or near-optimal time. I think this is a reasonable interpretation of "train ImageNet without hyperparameters". That said, there is a stronger sense of "hyperparameter-free" where the optimum batch size and architecture size would decide ...