Consider first the more basic question: why is simple SGD on over-parameterized ANNs an effective global optimizer? This is the first great mystery of ANNs from classical ML theory: they should get stuck in various local minima and or overfit, but generally they don't (with a few tweaks) and just work better and better with scale. Many other techniques generally don't have this property.

A large oversized ANN can encode not just a single circuit solution, but an entire ensemble of candidates circuits (which dropout makes more explicit), and SGD then explores an entire ensemble of solutions in parallel automatically reusing all shared subcomputations resulting in an exponential speedup vs explicitly evaluating every candidate individually (as in most program search methods).

Pruning and lottery tickets are then obvious - after training the ensemble you can always extract one or more of the sparser best candidate circuits. Its also well known that you can compress and reduce bit precision extensively, so the first bit is by far the most important, and just retaining that bit and retraining the rest (masking) should reduce most of the retraining work.

SGD also has an obvious inductive bias, simply because it updates the weights according to , ie the direction that maximizes loss reduction per unit weight change. That's not the direction that moves directly to the lowest loss region (higher order optimizer find that direction); SGD instead moves in a direction that maximizes loss reduction per bit/unit of weight (and thus complexity) gain.

Reply

[-]johnswentworth3y92

One might expect to be a more expressive equation than its linear approximation, but it appears that the parameters of very large neural nets change only by a small amount during training, which means that the overall $Δ θ$ found during training is nearly a solution to the linearly-approximated equations.

Note that this has changed over time, as network architectures change; I doubt that it applies to e.g. the latest LLMs. The thing about pruning doing a whole bunch of optimization does still apply independent of whether net training is linear-ish (though I don't know if anyone's repro'd the lottery ticket hypothesis-driven pruning experiments on the past couple years' worth of LLMs).

Reply

[-]Zach Furman3y107

A bit of a side note, but I don't even think you need to appeal to new architectures - it looks like the NTK approximation performs substantially worse even with just regular MLPs (see this paper, among others).

Reply

Moderation Log

LESSWRONG
is fundraising!
LW

LESSWRONG
is fundraising!
LW

58

Exploring the Lottery Ticket Hypothesis

58

58

The Lottery Ticket Hypothesis, explained in one minute

Relevance to alignment

Phase changes

Deep double descent

The scaling hypothesis

Pruning is all you need

Discussion

John Wentworth's update to the Lottery Ticket Hypothesis

Discussion

The Elastic Lottery Ticket Hypothesis

Discussion

Summary