Updating the Lottery Ticket Hypothesis

by johnswentworth2 min read18th Apr 202140 comments

74

Ω 40

Lottery Ticket HypothesisMachine LearningAI
Frontpage
Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

Epistemic status: not confident enough to bet against someone who’s likely to understand this stuff.

The lottery ticket hypothesis of neural network learning (as aptly described by Daniel Kokotajlo) roughly says:

When the network is randomly initialized, there is a sub-network that is already decent at the task. Then, when training happens, that sub-network is reinforced and all other sub-networks are dampened so as to not interfere.

This is a very simple, intuitive, and useful picture to have in mind, and the original paper presents interesting evidence for at least some form of the hypothesis. Unfortunately, the strongest forms of the hypothesis do not seem plausible - e.g. I doubt that today’s neural networks already contain dog-recognizing subcircuits at initialization. Modern neural networks are big, but not that big. (See this comment for some clarification of this claim.)

Meanwhile, a cluster of research has shown that large neural networks approximate certain Bayesian models, involving phrases like “neural tangent kernel (NTK)” or “Gaussian process (GP)”. Mingard et al. show that these models explain the large majority of the good performance we see from large neural networks in practice. This view also implies a version of the lottery ticket hypothesis, but it has different implications for what the “lottery tickets” are. They’re not subcircuits of the initial net, but rather subcircuits of the parameter tangent space of the initial net.

This post will sketch out what that means.

Let’s start with the jargon: what’s the “parameter tangent space” of a neural net? Think of the network as a function  with two kinds of inputs: parameters , and data inputs . During training, we try to adjust the parameters so that the function sends each data input  to the corresponding data output  - i.e. find  for which , for all . Each data point gives an equation which  must satisfy, in order for that data input to be exactly mapped to its target output. If our initial parameters  happen to be close enough to a solution to those equations, then we can (approximately) solve this using a linear approximation: we look for  such that 

The right-hand-side of that equation is essentially the parameter tangent space. More precisely, (what I'm calling) the parameter tangent space at  is the the set of functions  of the form

… for some .

In other words: the parameter tangent space is the set of functions which can be written as linear approximations (with respect to the parameters) of the network.

The main empirical finding which led to the NTK/GP/Mingard et al picture of neural nets is that, in practice, that linear approximation works quite well. As neural networks get large, their parameters change by only a very small amount during training, so the overall  found during training is actually nearly a solution to the linearly-approximated equations.

Major upshot of all this: the space-of-models “searched over” during training is approximately just the parameter tangent space.

At initialization, we randomly choose , and that determines the parameter tangent space - that’s our set of “lottery tickets”. The SGD training process then solves the equations - it picks out the lottery tickets which perfectly match the data. In practice, there will be many such lottery tickets - many solutions to the equations - because modern nets are extremely overparameterized. SGD effectively picks one of them at random (that’s one of the main results of the Mingard et al work).

Summary:

  • The “parameter tangent space” of a network is the set of functions which can be written as linear approximations (with respect to the parameters) of the network.
  • The parameter tangent space at the network’s randomly-chosen initial parameters is roughly the set of “lottery tickets”.
  • SGD (effectively) throws out any lottery tickets which don’t perfectly match the data, then randomly picks one of the remaining tickets.

Of course this brushes some things under the rug - e.g. different "lottery tickets" don't have exactly the same weight, and different architectures may have different type signatures. But if you find the original lottery ticket hypothesis to be a useful mental model, than I expect this to generally be an upgrade to that mental model. It maintains most of the conceptual functionality, but is probably more realistic.

Thankyou to Evan, Ajeya, Rohin, Edouard, and TurnTrout for a discussion which led to this post.

74

Ω 40

40 comments, sorted by Highlighting new comments since Today at 6:29 PM
New Comment

IIUC, here's a simple way to test this hypothesis: initialize a random neural network, and then find the minimal loss point in the tangent space. Since the tangent space is linear, this is easy to do (i.e. doesn't require heuristic gradient descent): for square loss it's just solving a large linear system once, for many other losses it should amount to convex optimization for which we have provable efficient algorithms. And, I guess it's underdetermined so you add some regularization. Is the result about as good as normal gradient descent in the actual parameter space? I'm guessing some of the linked papers might have done something like this?

Yup, people have done this(taking the infinite-width limit at the same time): see here, here. Generally the kernels do worse than the original networks, but not by a lot. On the other hand, they're usually applied to problems that aren't super-hard, where non-neural-net classifiers already worked pretty well. And these models definitely can't explain feature learning, since the functions computed by individual neurons don't change at all during training.

This basically matches my current understanding. (Though I'm not strongly confident in my current understanding.) I believe the GP results are basically equivalent to this, but I haven't read up on the topic enough to be sure.

Thanks! I'm afraid I don't understand the math yet but I'll keep trying. In the meantime:

I doubt that today’s neural networks already contain dog-recognizing subcircuits at initialization. Modern neural networks are big, but not that big.

Can you say more about why? It's not obvious to me that they are not big enough. Would you agree they probably contain edge detectors, circle detectors, etc. at initialization? Also, it seems that some subnetworks/tickets are already decent at the task at initialization, see e.g. this paper. Is that not "dog-recognizing subcircuits at initialization?" Or something similar?

The problem is what we mean by e.g. "dog recognizing subcircuit". The simplest version would be something like "at initialization, there's already one neuron which lights up in response to dogs" or something like that. (And that's basically the version which would be needed in order for a gradient descent process to actually pick out that lottery ticket.) That's the version which I'd call implausible: function space is superexponentially large, circuit space is smaller but still superexponential, so no neural network is ever going to be large enough to have neurons which light up to match most functions/circuits. I would argue that dog-detectors are a lot more special than random circuits even a priori, but not so much more special that the space-of-functions-that-special is less than exponentially large. (For very small circuits like edge detectors, it's more plausible that some neurons implement that function right from the start.)

The thing in the paper you linked is doing something different from that. At initialization, the neurons in the subcircuits they're finding would not light up in recognition of a dog, because they're still connected to a bunch of other stuff that's not in the subcircuit - the subcircuit only detects dogs once the other stuff is disconnected. And, IIUC, SGD should not reliably "find" those tickets: because no neurons in the subcircuit are significantly correlated with dogs, SGD doesn't have any reason to upweight them for dog-recognition. So what's going on in that paper is different from what's going on in normal SGD-trained nets (or at least not the full story). 

I was in a discussion yesterday that made it seem pretty plausible that you're wrong -- this paper suggests that the over-parameterization needed to ensure that some circuit is (approximately) present at the beginning is not that large.

function space is superexponentially large, circuit space is smaller but still superexponential, so no neural network is ever going to be large enough to have neurons which light up to match most functions/circuits.

I haven't actually read the paper I'm referencing, but my understanding is that this argument doesn't work out because the number of possible circuits of size N is balanced by the high number of subgraphs in a graph of size M (where M is only logarithmically larger than N).

That being said, I don't know whether "present at the beginning" is the same as "easily found by gradient descent".

See my comment here. The paper you link (like others I've seen in this vein) requires pruning, which will change the functional behavior of the nodes themselves. As I currently understand it, it is consistent with not having any neuron which lights up to match most functions/circuits.

Ah, I should have read comments more carefully.

I agree with your comment that the claim "I doubt that today’s neural networks already contain dog-recognizing subcircuits at initialization" is ambiguous -- "contains" can mean "under pruning".

Obviously, this is an important distinction: if the would-be dog-recognizing circuit behaves extremely differently due to intersection with a lot of other circuits, it could be much harder to find. But why is "a single neuron lighting up" where you draw the line?

It seems clear that at least some relaxation of that requirement is tenable. For example, if no one neuron lights up in the correct pattern, but there's a linear combination of neurons (before the output layer) which does, then it seems we're good to go: GD could find that pretty easily.

I guess this is where the tangent space model comes in; if in practice (for large networks) we stay in the tangent space, then a linear combination of neurons is basically exactly as much as we can relax your requirement.

But without the tangent-space hypothesis, it's unclear where to draw the line, and your claim that an existing neuron already behaving in the desired way is "what would be necessary for the lottery ticket intuition" isn't clear to me. (Is there a more obvious argument for this, according to you?)

Yeah, I agree that something more general than one neuron but less general than (or at least different from) pruning might be appropriate. I'm not particularly worried about where that line "should" be drawn a priori, because the tangent space indeed seems like the right place to draw the line empirically.

Wait... so:

  1. The tangent-space hypothesis implies something close to "gd finds a solution if and only if there's already a dog detecting neuron" (for large networks, that is) -- specifically it seems to imply something pretty close to "there's already a feature", where "feature" means a linear combination of existing neurons within a single layer
  2. gd in fact trains NNs to recognize dogs
  3. Therefore, we're still in the territory of "there's already a dog detector"

...yeah?

The tangent-space hypothesis implies something close to this but not quite -- instead of 'dog-detecting neuron', it's 'parameter such that the partial derivative of the output with respect to that parameter, as a function of the input, implements a dog-detector'. This would include (the partial derivative w.r.t.) neurons via their bias.

Not quite. The linear expansion isn't just over the parameters associated with one layer, it's over all the parameters in the whole net.

One important thing to note here is that the LTH paper doesn't demonstrate that SGD "finds" a ticket: just that the subnetwork you get by training and pruning could be trained alone in isolation to higher accuracy. That doesn't mean that the weights in the original training are the same when the network is trained in isolation!

So in particular I basically disagree with the opening summary of the content of the "lottery ticket hypothesis". I think a better summary is found in the abstract of the original paper:

dense, randomly-initialized, feed-forward networks contain subnetworks (winning tickets) that—when trained in isolation—reach test accuracy comparable to the original network in a similar number of iterations

Yup, I agree that that quote says something which is probably true, given current evidence. I don't think "picking a winning lottery ticket" is a good way analogy for what that implies, though; see this comment.

Yup, I agree that that quote says something which is probably true, given current evidence.

I don't know what the referent of 'that quote' is. If you mean the passage I quoted from the original lottery ticket hypothesis ("LTH") paper, then I highly recommend reading a follow-up paper which describes how and why it's wrong for large networks. The abstract of the paper I'm citing here:

We study whether a neural network optimizes to the same, linearly connected minimum under different samples of SGD noise (e.g., random data order and augmentation). We find that standard vision models become stable to SGD noise in this way early in training. From then on, the outcome of optimization is determined to a linearly connected region. We use this technique to study iterative magnitude pruning (IMP), the procedure used by work on the lottery ticket hypothesis to identify subnetworks that could have trained in isolation to full accuracy. We find that these subnetworks only reach full accuracy when they are stable to SGD noise, which either occurs at initialization for small-scale settings (MNIST) or early in training for large-scale settings (ResNet-50 and Inception-v3 on ImageNet).


I don't think "picking a winning lottery ticket" is a good way analogy for what that implies

Again, assuming "that" refers to the claim in the original LTH paper, I also don't think it's a good analogy. But by default I think that claim is what "the lottery ticket hypothesis" refers to, given that it's a widely cited paper that has spawned a large number of follow-up works.

Oh that's definitely interesting. Thanks for the link.

The way I think of the Lottery Ticket Hypothesis is in terms of the procedure for actually finding a lottery ticket: start with a random initialised network, then train it, then look at the weights of the network that are the largest after training (say keep only 10% or 1% of the weights), go back to the initial random network and drop all the weights that won't end up being the largest, now train that highly sparse network and you'll end up with close to the same end performance, even though you're using a much smaller network. 

This seems to mean that all those weights we dropped don't really have any effect on the training, we might have imagined that weird nonlinear effects might mean that getting to a good final solution would require the presence of weights that in the end will be useless, those weights might have been "catalysts", helping the network arrive at its final solution, but not themselves useful for prediction. But no, it seems that if weights are going to end up useless, then they're useless from the beginning.

Right, that's basically the picture/experiment from the original paper.

Oh, I wasn't claiming originality, just trying to give some background to people who might have stumbled here.

Unfortunately, the strongest forms of the hypothesis do not seem plausible - e.g. I doubt that today’s neural networks already contain dog-recognizing subcircuits at initialization.

I think there are papers showing exactly this, like Deconstructing Lottery Tickets and What is the Best Multi-Stage Architecture for Object Recognition?. Another paper, describing the second paper:

We also compare to random, untrained weights because Jarrett et al. (2009) showed — quite strikingly — that the combination of random convolutional filters, rectification, pooling, and local normalization can work almost as well as learned features. They reported this result on relatively small networks of two or three learned layers and on the smaller Caltech-101 dataset (Fei-Fei et al., 2004). It is natural to ask whether or not the nearly optimal performance of random filters they report carries over to a deeper network trained on a larger dataset.

(My interpretation of their results is 'yeah actually randomly initialized convs do pretty well on imagenet'; I remember coming across a paper that answer that question more exactly and getting a clearer 'yes' answer but I can't find it at the moment; I remember them freezing a conv architecture and then only training the fully connected net at the end.)

Why do you doubt this? Are you seeing a bunch of evidence that I'm not? Or are you imagining new architectures that people haven't done these tests for yet / have done these tests and the new architectures fail?

[Maybe your standards are higher than mine--in the DLT paper, they're able to get 65% performance on CIFAR-10 by just optimizing a binary mask on the randomly initialized parameters, which is ok but not good.]

In hindsight, I probably should have explained this more carefully. "Today’s neural networks already contain dog-recognizing subcircuits at initialization" was not an accurate summary of exactly what I think is implausible.

Here's a more careful version of the claim:

  • I do not find it plausible that a random network contains a neuron which acts as a reliable dog-detector. This is the sense in which it's not plausible that networks contain dog-recognizing subcircuits at initialization. But this is what would be necessary for the "lottery ticket" intuition (i.e. training just picks out some pre-existing useful functionality) to work.
  • The original lottery ticket paper (and subsequent work) found subcircuits which detect dogs (or something comparable) after disconnecting them from the rest of the network (i.e. after pruning). The "lottery ticket" story is not actually a good intuition for what's going on there; the pruning step itself does a whole bunch of optimization work, and "it's just upweighting a pre-existing function" is not an accurate intuition for that optimization.

(The paper on randomized filters is orthogonal to this - it sounds like they're finding that simple features like edge detection do show up in randomly initialized nets, which is totally plausible. But they still need optimization for the deeper/higher-level parts of the net; they're just using random initialization for the first few layers, assuming I'm understanding it correctly.)

If there were a robust dog-detection neuron at initialization, and SGD just learned to use that neuron's output, then the "lottery ticket" intuition would be a very good description of what's going on. But pruning is a very different operation, one which fundamentally changes what a circuit is computing. The first paper you link uses the phrase "masking is training", which seems like a good way to think about it. A masking operation isn't just "picking a winning lottery ticket", it's changing the functional behavior of the nodes.

But this is what would be necessary for the "lottery ticket" intuition (i.e. training just picks out some pre-existing useful functionality) to work.

I don't think I agree, because of the many-to-many relationship between neurons and subcircuits.  Or, like, I think the standard of 'reliability' for this is very low. I don't have a great explanation / picture for this intuition, and so probably I should refine the picture to make sure it's real before leaning on it too much?

To be clear, I think I agree with your refinement as a more detailed picture of what's going on; I guess I just think you're overselling how wrong the naive version is?

Plausible.

Here's intuition pump to consider: suppose our net is a complete multigraph: not only is there an edge between every pair of nodes, there's multiple edges with base-2-exponentially-spaced weights, so we can always pick out a subset of them to get any total weight we please between the two nodes. Clearly, masking can turn this into any circuit we please (with the same number of nodes). But it seems wrong to say that this initial circuit has anything useful in it at all.

That seems right, but also reminds me of the point that you need to randomly initialize your neural nets for gradient descent to work (because otherwise the gradients everywhere are the same). Like, in the randomly initialized net, each edge is going to be part of many subcircuits, both good and bad, and the gradient is basically "what's your relative contribution to good subcircuits vs. bad subcircuits?"

The main empirical finding which led to the NTK/GP/Mingard et al picture of neural nets is that, in practice, that linear approximation works quite well. As neural networks get large, their parameters change by only a very small amount during training, so the overall  found during training is actually nearly a solution to the linearly-approximated equations.

Trying to check if I'm understanding correctly: does that mean that despite SGD doing a lot of successive changes that use the gradient at the successive parameter values, these "even out" such that they end up equivalent to a single update from the initial parameters?

Sort of. They end up equivalent to a single Newton step, not a single gradient step (or at least that's what this model says). In general, a set of linear equations is not solved by one gradient step, but is solved by one Newton step. It generally takes many gradient steps to solve a set of linear equations.

(Caveat to this: if you directly attempt a Newton step on this sort of system, you'll probably get an error, because the system is underdetermined. Actually making Newton steps work for NN training would probably be a huge pain in the ass, since the underdetermination would cause numerical issues.)

By Newton step, do you mean one step of Newton's method?

I don't remember the exact quote, but some sculptor described their art as "the statue is already there inside the stone, I just remove the extra pieces".

And, I guess we can agree that this statement is in some technical sense true, but of course completely misses the point (intentionally, to sound deep). More precisely, it could be said that the essence of the statue -- the information that makes it different from a random piece of stone or whatever material -- was all added by the act of "removing the extra pieces", and none of it was there at the beginning (except for the trivial constraints, such as that the original stone must be larger than the intended statue).

My question is, how much "the network is already there, ML just removes the extra pieces" is a statement of this type?

I've heard it attributed to Michelangelo, but that may be apocryphal.

Assuming that the "start with lots of models then remove the bad ones" picture is accurate, the fact that NNs work in practice implies that the initial set of models matters a lot. The vast majority of models consistent with the data would not generalize well to future data points, unless the models are biased towards models-which-generalize-well (simple models, for instance) from the start.

Maybe I am misunderstanding something, but I don’t think the parameter tangent hypothesis can be generally correct.

Let’s say we have 1 datapoint A to be mapped to -1 and 100 datapoint B to be mapped to +1. The model is randomly initialised. Now the parameter tangent space is just the current model + the gradient over the dataset * delta. The gradient over the entire dataset is the sum of the gradients for each datapoint. Therefore the gradient points a hundred times more towards the solution for the input B than towards the solution for input A. If we search solely in the tangent space, we will either solve B and get a miniscule improvement for A or solve A and massively overshoot the best parameters for B. 

I.e. to reach the final parameters with a single update, the computed gradient has to be balanced between all possible skills the model is supposed to be exhibiting after training. Otherwise the gradient will not point towards a global-ish minimum but a frequency-of-problem-weighted minimum.

SGD solves this problem because the gradient for B shrinks, the more updates into the direction of B-solution have been made. So in our examples the gradient for B would shrink to zero during training and A would get its time in the sun. 

Inutitively, the parameter tangent space would be correct for MNIST and other well balanced small datasets. But for large language models to pick a random example, it is not clear what „well balanced“ in the above sense even means. 

I think you're confusing one-update-of-gradient-descent with one-update-of-Newton's-method. I.e.:

to reach the final parameters with a single update, the computed gradient has to be balanced between...

In general, a single gradient-descent update is not sufficient to solve a system of linear equations. (Your reasoning for why that is is basically correct, though the usual explanation is more general.) Indeed, SGD does not converge in one step in practice.

One step of Newton's method is sufficient to solve a system of linear equations, but that's equivalent to many steps of gradient descent.

No, when I say single update, I just mean that the final model can in principle be reached by a single update with the initial gradient. I'm aware that in practice you need more steps to compute the correct delta. 

My argument is solely about the initial gradient. It does not point to the minimum SGD would reach, because the initial gradient tries harder to solve common problems, but the SGD-minimum (ideally) solves even rare problems. SGD manages to do this because common problems do not influence later gradients, because they will already be solved.  

No, when I say single update,I just mean that the final model can in principle be reached by a single update with the initial gradient.

The final model cannot be reached by a single update with the initial gradient, because in a system of linear equations (i.e. objective is squared error in the system, or anything along those lines), the gradient does not point straight toward the solution. It's not just a question of computing the correct delta; the initial gradient doesn't even point in the right direction.

Ok, I thought your  was one update step of the gradient of  times  away from . I guess then I just don't understand the equation.

Ah, I guess I understand now. I was always thinking about an updating of the parameters. But you are talking about adding to the function output.

Ok, I should walk through this, there's multiple people confused about it.

The training process effectively solves the set of (linear) equations

There's one equation for each data point, and one unknown for each dimension of .

Key point: solving sets of equations is not what gradient descent does. Gradient descent minimizes a scalar function. In order for gradient descent to solve the problem, we need to transform "solve the set of equations" into "minimize a scalar function". Typically, we do that by choosing an objective like "minimize the sum of squared errors" or something along those lines - e.g. in this case we'd probably use:

This is a quadratic function of  (i.e. it's second-order in ). If we compute the gradient of this function, then we'll find that it does not point toward the minimum of the function - the gradient only uses first-order (i.e. linear) information. The gradient does not point toward the solution of the original system of equations. Thus: gradient descent will not solve the set of equations in one step.

In order to solve the set of equations in one step, we could use a second-order minimization method like Newton's (gradient descent is first-order). Key concept: minimizing an objective function means solving the set of equations ; minimization is equivalent to system-solving with one more derivative. So, a first-order method for solving a system is equivalent to a second-order method for minimizing an objective (both are Newton's method).

Does this make more sense?

Yes, definitely, thank you! 

Though I was originally confused on a much more basic level, due to superficial reading, jumping to conclusions and not having touched much calculus notation in the last 15 years.