[ Question ]

Does the lottery ticket hypothesis suggest the scaling hypothesis?

by Daniel Kokotajlo, 1 min read, 28th Jul 2020, 2 comments


Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

The lottery ticket hypothesis, as I (vaguely) understand it, is that artificial neural networks tend to work in the following way: When the network is randomly initialized, there is a sub-network that is already decent at the task. Then, when training happens, that sub-network is reinforced and all other sub-networks are dampened so as to not interfere.

By the scaling hypothesis I mean that in the next five years, many other architectures besides the transformer will also be shown to get substantially better as they get bigger. I'm also interested in defining it differently, as whatever Gwern is talking about.

1 Answer

The implication depends on the distribution of lottery tickets. If there is a short-tailed distribution, then the rewards of scaling will be relatively small; bigger would still get better, but very slowly. A long-tailed distribution, on the other hand, would suggest continued returns to getting more lottery tickets.
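The contrast above can be made concrete with a rough sketch. It assumes, purely for illustration, that each lottery ticket's quality is an independent draw from a fixed distribution, and it approximates the typical best of N tickets by the (1 - 1/N) quantile. The standard normal stands in for a short-tailed distribution and a Pareto distribution with shape `alpha = 2` for a long-tailed one; both choices, and the function names, are hypothetical and not taken from the lottery ticket literature.

```python
from statistics import NormalDist


def typical_best_normal(n_tickets: int) -> float:
    """Approximate best-of-n ticket quality under a standard normal (short tail).

    Uses the (1 - 1/n) quantile as a stand-in for the expected maximum.
    """
    return NormalDist().inv_cdf(1.0 - 1.0 / n_tickets)


def typical_best_pareto(n_tickets: int, alpha: float = 2.0) -> float:
    """Approximate best-of-n ticket quality under Pareto(alpha), minimum 1 (long tail).

    The (1 - 1/n) quantile of this distribution is n ** (1 / alpha).
    """
    return n_tickets ** (1.0 / alpha)


# Compare how the best ticket improves as the number of tickets grows.
for n in (10**3, 10**4, 10**5, 10**6):
    short = typical_best_normal(n)
    long_ = typical_best_pareto(n)
    print(f"{n:>8} tickets   short-tailed best: {short:5.2f}   long-tailed best: {long_:8.1f}")
```

Under these assumptions, the short-tailed best ticket creeps up by a roughly constant amount per tenfold increase in tickets, while the long-tailed best ticket keeps growing by a constant factor.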

I ask a question here about what's true in practice.

1 Related Question

[2] romeostevensit, 4mo: One related question is which sub-tasks of GPT-3 showed surprise jackpots vs. GPT-2.
[9] Answer by dsj, 4mo: One assumption that I think might be implicit in your question is that the number of lottery tickets is linear in model size. But it seems plausible to me that it's exponential in network depth.
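One hypothetical way to see how something could grow exponentially with depth while the parameter count grows only linearly is to count input-to-output paths in a toy network. The sketch below assumes a fully connected network with a single scalar input, a single scalar output, and `depth` hidden layers of `width` units each. Treating paths as a proxy for lottery tickets is an assumption made here for illustration, not something the answer above states.

```python
def parameter_count(width: int, depth: int) -> int:
    """Weights (no biases) in the toy network.

    width weights into the first hidden layer, width * width between each
    pair of adjacent hidden layers, and width weights into the output.
    """
    return 2 * width + (depth - 1) * width * width


def path_count(width: int, depth: int) -> int:
    """Input-to-output paths that pass through exactly one unit per hidden layer."""
    return width ** depth


# Doubling depth roughly doubles the weights but squares the number of paths.
for depth in (3, 6, 12):
    print(f"depth {depth:>2}: {parameter_count(4, depth):>4} weights, {path_count(4, depth):>9} paths")
```

If tickets scale more like paths than like weights, deeper models would hold far more tickets than their parameter counts suggest.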
1 comment

I wouldn't say the scaling hypothesis is purely about Transformers. Quite a few of my examples are RNNs, and it's unclear how much of a difference there is between RNNs and Transformers anyway. Transformers just appear to be a sweet spot in terms of power while being still efficiently optimizable on contemporary GPUs. CNNs for classification definitely get better with scale and do things like disentangle & transfer & become more robust as they get bigger (example from today), but whether they start exhibiting any meta-learning specifically I don't know.