interstice

Comments

Why did no LessWrong discourse on gain of function research develop in 2013/2014?

I actually agree with you there: there was always discussion of GCR along with extinction risks (though I think Eliezer in particular was more focused on extinction risks). However, they're still distinct categories: even the deadliest of pandemics is unlikely to cause extinction.

Why did no LessWrong discourse on gain of function research develop in 2013/2014?

Killing 90% of the human population would not be enough to cause extinction. That would put us at a population of roughly 800 million, still higher than the world population in 1700.
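For concreteness (assuming a present-day population of roughly 8 billion, and the commonly cited estimate of around 0.6 billion people alive in 1700):

$$8 \times 10^9 \times (1 - 0.9) = 8 \times 10^8 > 6 \times 10^8.$$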

Reply to Nate Soares on Dolphins

It could be considered an essence, but physical rather than metaphysical.

The Homunculus Problem

This feels related to metaphilosophy, in the sense that (to me) it seems that one of the core difficulties of metaphilosophy is that in coming up with a 'model' agent you need to create an agent that is not only capable of thinking about its own structure, but capable of being confused about what that structure is (and presumably, of becoming un-confused). Bayesian etc. approaches can model agents being confused about object-level things, but it's hard to even imagine what a model of an agent confused about ontology would look like.

SGD's Bias

Another example of this sort of thing: least-rattling feedback in driven systems.

Parsing Chris Mingard on Neural Networks

Perhaps this is a physicist vs mathematician type of thinking though

Good guess ;)

This is not the same as saying that an extremely wide trained-by-random-sampling neural network would not learn features—there is a possibility that the first time you reach 100% training accuracy corresponds to effectively randomly initialised initial layers + trained last layer, but in expectation all the layers should be distinct from an entirely random initialisation.

I see -- so you're saying that even though the distribution of output functions learned by an infinitely-wide randomly-sampled net is unchanged if you freeze everything but the last layer, the distribution of intermediate functions might change. If true, this would mean that feature learning and inductive bias are 'uncoupled' for infinite-width randomly-sampled nets. I think this is false, however -- that is, I think it's provable that the distribution of intermediate functions does not change in the infinite-width limit when you condition on the training data, even when conditioning over all layers. I can't find a reference offhand, though; I'll report back if I find anything resolving this one way or another.
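To make the question concrete, here's a toy numerical check one could run (entirely my own sketch, at small finite width, so it can't settle the infinite-width question; the dataset, width, and tolerance are all just illustrative): rejection-sample small random nets conditioned on approximately fitting a tiny dataset, and see whether the distribution of a hidden-layer activation shifts relative to the prior.

```python
import numpy as np

rng = np.random.default_rng(0)

# Tiny dataset: two 1-D inputs with real-valued targets
X = np.array([[-1.0], [1.0]])
y = np.array([0.5, -0.5])

width = 32
n_samples = 100_000
tol = 0.3  # accept a random net if both outputs land within tol of the targets

prior_acts, posterior_acts = [], []

for _ in range(n_samples):
    # One-hidden-layer tanh net, standard Gaussian / (1/width)-scaled weights
    W1 = rng.normal(0.0, 1.0, size=(1, width))
    b1 = rng.normal(0.0, 1.0, size=width)
    W2 = rng.normal(0.0, 1.0 / np.sqrt(width), size=width)
    b2 = rng.normal(0.0, 1.0)

    H = np.tanh(X @ W1 + b1)   # hidden activations, shape (2, width)
    out = H @ W2 + b2          # network outputs at the two inputs

    stat = H[0, 0]             # track one hidden unit's activation on the first input
    prior_acts.append(stat)
    if np.all(np.abs(out - y) < tol):
        posterior_acts.append(stat)

print(f"acceptance rate: {len(posterior_acts) / n_samples:.4f}")
print(f"prior     std of hidden activation: {np.std(prior_acts):.3f}")
print(f"posterior std of hidden activation: {np.std(posterior_acts):.3f}")
```

Comparing the prior vs. conditioned statistics (or full histograms) of the hidden activations gives a crude finite-width handle on whether conditioning on the data moves the intermediate layers at all.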

Parsing Chris Mingard on Neural Networks

The claim I am making is that the reason why feature learning is good is not because it improves inductive bias—it is because it allows the network to be compressed. That is probably at the core of our disagreement.

Yes, I think so. Let's go over the 'thin network' example -- we want to learn some function which can be represented by a thin network. But let's say a randomly-initialized thin network's intermediate functions won't be able to fit the function -- that is (with high probability over the random initialization) we won't be able to fit the function just by changing the parameters of the last layer. It seems there are a few ways we can alter the network to make fitting possible:

(A) Expand the network's width until (with high probability) it's possible to fit the function by only altering the last layer

(B) Keeping the width the same, re-sample the parameters in all layers until we find a setting that can fit the function

(C) Keeping the width the same, train the network with SGD

By hypothesis, all three methods will let us fit the target function. You seem to be saying [I think, correct me if I'm wrong] that all three methods should have the same inductive bias as well. I just don't see any reason this should be the case -- on the face of it, I would guess that all three have different inductive biases (though A and B might be similar). They're clearly different in some respects -- (C) can do transfer learning but (A) cannot (B is unclear).
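To make the three options concrete, here's a rough sketch (my own toy setup in PyTorch, not anything from the papers under discussion; the task, widths, budgets, and hyperparameters are purely illustrative):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy regression task that a "thin" net could in principle represent
X = torch.linspace(-2, 2, 50).unsqueeze(1)
y = torch.sin(2 * X)

def make_net(width):
    return nn.Sequential(nn.Linear(1, width), nn.Tanh(), nn.Linear(width, 1))

# (A) Widen the net and fit only the last layer on frozen random features
wide = make_net(5000)
with torch.no_grad():
    feats = wide[1](wide[0](X))                     # (50, 5000) random features
    K = feats @ feats.T + 1e-4 * torch.eye(len(X))  # Gram matrix + jitter
    w_last = feats.T @ torch.linalg.solve(K, y)     # readout weights fitting y
    loss_A = ((feats @ w_last - y) ** 2).mean().item()

# (B) Keep the width thin and re-sample all parameters, keeping the best fit found.
# (True rejection sampling -- accepting only exact fits -- is astronomically
# unlikely here, so this just reports how close re-sampling gets in a fixed budget.)
with torch.no_grad():
    loss_B = min(((make_net(16)(X) - y) ** 2).mean().item() for _ in range(5000))

# (C) Keep the width thin and train with SGD
net_C = make_net(16)
opt = torch.optim.SGD(net_C.parameters(), lr=0.1)
for _ in range(10_000):
    opt.zero_grad()
    loss_C = ((net_C(X) - y) ** 2).mean()
    loss_C.backward()
    opt.step()

print(f"(A) wide net, last layer only : {loss_A:.4f}")
print(f"(B) best random re-sample     : {loss_B:.4f}")
print(f"(C) thin net trained with SGD : {loss_C.item():.4f}")
```

The exact numbers don't matter; the point is just that these are three genuinely different procedures for producing a function that fits the data, so there's no obvious reason they should share an inductive bias.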

What do we know about SGD-trained nets that suggests this?

My intuition here is that SGD-trained nets can learn functions non-linearly while NTK/GP can only do so linearly. So in the car detector example, SGD is able to develop a neuron detecting cars through some as-yet unclear 'feature learning' mechanism. The NTK/GP can do so as well, sort of, since they're universal function approximators. However, the way they do this is by taking a giant linear combination of random functions which is able to function identically to a car detector on the data points given. It seems like this might be more fragile/generalize worse than the neurons produced by SGD. Though that is admittedly somewhat conjectural at this stage, since we don't really have a great understanding of how feature learning in SGD works.

I’ve read the new feature learning paper! We’re big fans of his work, although again I don’t think it contradicts anything I’ve just said.

ETA: Let me elaborate upon what I see as the significance of the 'feature learning in infinite nets' paper. We know that NNGP/NTK models can't learn features, but SGD can: I think this provides strong evidence that they are learning using different mechanisms, and likely have substantially different inductive biases. The question is whether randomly sampled finite nets can learn features as well. Since they are equivalent to NNGP/NTK at infinite width, any feature learning they do can only come from finiteness. In contrast, in the case of SGD, it's possible to do feature learning even in the infinite-width limit. This suggests that even if randomly-sampled finite nets can do feature learning, the mechanism by which they do so is different from SGD, and hence their inductive bias is likely to be different as well.
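One crude way to operationalise 'feature learning' here (my own sketch in PyTorch on a toy task, not the definition used in the paper) is to track how much the hidden-layer kernel moves over SGD training; in the NNGP/NTK 'lazy' limit it shouldn't move at all:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
X = torch.linspace(-2, 2, 40).unsqueeze(1)
y = torch.sin(2 * X)

net = nn.Sequential(nn.Linear(1, 256), nn.Tanh(), nn.Linear(256, 1))

def hidden_kernel(model, inputs):
    # Empirical kernel of the hidden-layer features: K = H H^T / width
    with torch.no_grad():
        H = model[1](model[0](inputs))
        return H @ H.T / H.shape[1]

K_init = hidden_kernel(net, X)

opt = torch.optim.SGD(net.parameters(), lr=0.1)
for _ in range(5000):
    opt.zero_grad()
    loss = ((net(X) - y) ** 2).mean()
    loss.backward()
    opt.step()

K_final = hidden_kernel(net, X)
drift = ((K_final - K_init).norm() / K_init.norm()).item()
print(f"relative change in hidden-layer kernel: {drift:.3f}")
# Zero drift would mean the intermediate features never moved ("lazy" training);
# nonzero drift is one rough signature of feature learning at finite width.
```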

Parsing Chris Mingard on Neural Networks

First, thank you for your comments and observations—it’s always interesting to read pushback.

And thanks for engaging with my random blog comments! TBC, I think you guys are definitely on the right track in trying to relate SGD to function simplicity, and the empirical work you've done fleshing out that picture is great. I just think it could be even better if it were based around a better SGD scaling limit ;)

Therefore, if an optimiser samples functions proportional to their volume, you won’t get any difference in performance if you learn features (optimise the whole network) or do not learn features (randomly initialise and freeze all but the last layer and then train just the last).

Right, this is an even better argument that NNGPs/random-sampled nets don't learn features.

Given therefore that the posteriors are the same, it implies that feature learning is not aiding inductive bias—rather, feature learning is important for expressivity reasons

I think this only applies to NNGP/random-sampled nets, not SGD-trained nets. To apply to SGD-trained nets, you'd need to show that the new features learned by SGD have the same distribution as the features found in an infinitely-wide random net, but I don't think this is the case. As an illustration, some SGD-trained nets can develop expressive neurons like 'car detector', enabling them to fit the data with a relatively small number of such neurons. If you used an NNGP to learn the same thing, you wouldn't get a single 'car detector' neuron, but rather some huge linear combination of high-frequency features that can approximate the cars seen in the dataset. I think this would probably generalize worse than the network with an actual 'car detector' (this isn't empirical evidence of course, but I think what we know about SGD-trained nets and the NNGP strongly suggests a picture like this).

Furthermore (and on a slightly different note), it is known that infinitesimal GD converges to the Boltzmann distribution for any DNN (very similar to random sampling)

Interesting, I haven't seen this before. Just skimming the paper, it sounds like the very small learning rate + added white noise might result in different limiting behavior from usual SGD. Generally it seems that there are a lot of different possible limits one can take; empirically, SGD-trained nets do seem to do 'feature learning', so I'm skeptical of limits that don't have that (I assume these limits lack it for theoretical reasons, anyway. It would be interesting to actually examine the features found in networks trained like this, and to see if they can do transfer learning at all).

Re: 'colored noise', I'm not sure to what extent this matters. I think a more likely source of discrepancy is the lack of white noise in normal training (I guess this counts as 'colored noise' in a sense) and the larger learning rate.
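For intuition on the 'gradient descent + white noise converges to Boltzmann' claim, here's a generic 1-D Langevin toy (my own sketch, not the paper's actual setup; the double-well 'loss', step size, and temperature are made up):

```python
import numpy as np

rng = np.random.default_rng(0)

def loss(theta):                     # toy 1-D "training loss" with two minima
    return (theta ** 2 - 1.0) ** 2

def grad(theta):
    return 4.0 * theta * (theta ** 2 - 1.0)

lr, T = 2e-3, 0.3                    # small step size + isotropic ("white") noise
theta, samples = 0.0, []
for step in range(500_000):
    theta += -lr * grad(theta) + np.sqrt(2 * lr * T) * rng.normal()
    if step % 10 == 0:
        samples.append(theta)

# If the chain has equilibrated, the empirical distribution should roughly match
# the Boltzmann density p(theta) proportional to exp(-loss(theta) / T).
hist, edges = np.histogram(samples, bins=60, range=(-2, 2), density=True)
centers = 0.5 * (edges[:-1] + edges[1:])
boltz = np.exp(-loss(centers) / T)
boltz /= boltz.sum() * (centers[1] - centers[0])   # normalise on the grid
print(f"max abs deviation from Boltzmann density: {np.max(np.abs(hist - boltz)):.3f}")
```

Whether plain SGD (no injected noise, larger learning rate, minibatch 'colored' noise) inherits anything like this stationary distribution is exactly the part I'm unsure about.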

if anyone can point out why this line of argument is not correct, or can steelman a case for SGD inductive bias appearing at larger scales, I would be very interested to hear it.

Not to be a broken record, but I strongly recommend checking out Greg Yang's work. He clearly shows that there exist infinite-width limits of SGD that can do feature/transfer learning.

Parsing Chris Mingard on Neural Networks

I think we basically agree on the state of the empirical evidence -- the question is just whether NTK/GP/random-sampling methods will continue to match the performance of SGD-trained nets on more complex problems, or if they'll break down, ultimately being a first-order approximation to some more complex dynamics. I think the latter is more likely, mostly based on the lack of feature learning in NTK/GP/random limits.

re: the architecture being the source of inductive bias -- I certainly think this is true in the sense that architecture choice will have a bigger effect on generalization than hyperparameters, or the choice of which local optimizer to use. But I do think that using a local optimizer at all, as opposed to randomly sampling parameters, is likely to have a large effect.

Parsing Chris Mingard on Neural Networks

Yeah, I didn't mean to imply that you guys said 'simple --> large volume' anywhere. I just think it's a point worth emphasizing, especially around here, where I think people will imagine something "Solomonoff Induction-like" when they hear about a "bias towards simple functions".

Also, very briefly on your comment on feature learning—the GP limit is used to calculate the volume of functions locally to the initialization. The fact that kernel methods do not learn features should not be relevant given this interpretation

But in the infinite-width setting, Bayesian inference in general is given by a GP limit, right? Initialization doesn't matter. This means that the arguments for lack of feature learning still go through. It's technically possible that there could be feature learning in finite-width randomly-sampled networks, but it seems strange that finiteness would help here (and any such learning would be experimentally inaccessible). This is a major reason that I'm skeptical of the "SGD as a random sampler" picture.
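To spell out what I mean by 'given by a GP limit': in the infinite-width limit, exact Bayesian inference over all the weights collapses to GP regression with the NNGP kernel, which depends only on the architecture and prior, not on any particular initialization. A rough numpy sketch for a one-hidden-layer tanh net (my own toy parameterization; the Monte Carlo kernel estimate, jitter, and data are purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def nngp_kernel(Xa, Xb, n_mc=200_000, sigma_w=1.0, sigma_b=1.0):
    """Monte Carlo estimate of the NNGP kernel of a 1-hidden-layer tanh net:
    K(x, x') = sigma_b^2 + sigma_w^2 * E_{w,b}[tanh(w*x + b) * tanh(w*x' + b)]."""
    w = rng.normal(0.0, 1.0, size=n_mc)       # first-layer weights (1-D inputs)
    b = rng.normal(0.0, 1.0, size=n_mc)       # first-layer biases
    Fa = np.tanh(np.outer(Xa, w) + b)         # shape (len(Xa), n_mc)
    Fb = np.tanh(np.outer(Xb, w) + b)
    return sigma_b ** 2 + sigma_w ** 2 * (Fa @ Fb.T) / n_mc

X_train = np.array([-1.5, -0.5, 0.5, 1.5])
y_train = np.sin(2 * X_train)
X_test = np.linspace(-2, 2, 9)

K = nngp_kernel(X_train, X_train)
K_star = nngp_kernel(X_test, X_train)
jitter = 1e-4

# GP posterior mean: the exact Bayesian predictor for an infinitely wide
# randomly-sampled net of this architecture. No initialization appears anywhere,
# which is why the no-feature-learning arguments carry over.
post_mean = K_star @ np.linalg.solve(K + jitter * np.eye(len(X_train)), y_train)
print(np.round(post_mean, 3))
```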
