Daniel Kokotajlo asks whether the lottery ticket hypothesis implies the scaling hypothesis.

The way I see it, this depends on the distribution that "lottery tickets" are drawn from.

- If the quality of lottery tickets follows a normal distribution, then after your neural network is large enough to sample decent tickets, it will get better rather slowly as you scale it -- you have to sample a whole lot of tickets to get a really good one.
- If the quality of tickets has a long upward tail, then you'll see better scaling.
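
As a rough illustration of the two cases above, here is a minimal simulation sketch. It assumes a toy model in which a network of size n simply draws n independent tickets and keeps the best one. The function names and both distributions (a standard normal, and a Pareto with tail index 2) are illustrative assumptions, not claims about real networks.

```python
import random
import statistics

def mean_best_ticket(sample_quality, n_tickets, trials=200, seed=0):
    """Average quality of the best of n_tickets draws, over many trials."""
    rng = random.Random(seed)
    return statistics.fmean(
        max(sample_quality(rng) for _ in range(n_tickets))
        for _ in range(trials)
    )

# Assumed toy distributions: neither is claimed to match real networks.
def normal_ticket(rng):
    return rng.gauss(0.0, 1.0)

def heavy_tailed_ticket(rng):
    return rng.paretovariate(2.0)  # Pareto with tail index 2

sizes = [10, 100, 1000]  # stand-in for network size: number of tickets drawn
normal_best = {n: mean_best_ticket(normal_ticket, n) for n in sizes}
heavy_best = {n: mean_best_ticket(heavy_tailed_ticket, n) for n in sizes}

for n in sizes:
    print(f"n={n:5d}  normal best={normal_best[n]:6.2f}  "
          f"heavy-tailed best={heavy_best[n]:8.2f}")
```

In this toy model, the best normal ticket improves only modestly across a 100x increase in ticket count, while the best heavy-tailed ticket keeps growing by a large factor.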

However, a long tail also suggests to me that variance in results would continue to be relatively high as a network is scaled: bigger networks are hitting bigger jackpots, but since even bigger jackpots are within reach, the payoff of scaling remains chaotic.
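
The variance intuition can be checked in the same kind of toy model. The sketch below measures the run-to-run spread (interquartile range) of the best ticket at two network sizes. As before, the "best of n independent draws" model, the helper name, and the two distributions are assumptions made purely for illustration.

```python
import random
import statistics

def best_ticket_iqr(sample_quality, n_tickets, trials=400, seed=0):
    """Interquartile range of the best of n_tickets draws across repeated runs."""
    rng = random.Random(seed)
    best = [
        max(sample_quality(rng) for _ in range(n_tickets))
        for _ in range(trials)
    ]
    q1, _, q3 = statistics.quantiles(best, n=4)  # the three quartile cut points
    return q3 - q1

# Assumed toy distributions (hypothetical, for illustration only).
def normal_ticket(rng):
    return rng.gauss(0.0, 1.0)

def heavy_tailed_ticket(rng):
    return rng.paretovariate(2.0)  # Pareto with tail index 2

sizes = [10, 1000]
normal_spread = {n: best_ticket_iqr(normal_ticket, n) for n in sizes}
heavy_spread = {n: best_ticket_iqr(heavy_tailed_ticket, n) for n in sizes}

for n in sizes:
    print(f"n={n:5d}  normal IQR={normal_spread[n]:.2f}  "
          f"heavy-tailed IQR={heavy_spread[n]:.2f}")
```

Under these assumptions the spread of the best normal ticket narrows as n grows, while the spread of the best heavy-tailed ticket widens, which matches the "payoff remains chaotic" picture.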

(This could all benefit from a more mathematical treatment.)
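
As a first step toward that, here is what standard extreme-value results say under a strong simplifying assumption: a network of size $n$ draws $n$ independent tickets and its quality is the maximum $M_n$. The independence assumption and the symbols $n$, $M_n$, and $\alpha$ are mine, introduced only for this sketch.

$$
\text{standard normal tickets:}\quad \mathbb{E}[M_n] \approx \sqrt{2 \ln n}
$$

$$
\text{Pareto tickets (minimum 1, tail index } \alpha > 1\text{):}\quad \mathbb{E}[M_n] \approx \Gamma\!\left(1 - \tfrac{1}{\alpha}\right) n^{1/\alpha}
$$

So in the normal case, quality grows like the square root of the logarithm of size, while in the Pareto case it grows as a power of size. Whether real lottery tickets behave like independent draws at all is a separate question.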

So: what do we know about NN training? Does it suggest we are living in Extremistan (heavy-tailed ticket quality) or Mediocristan (thin-tailed ticket quality)?

Note: a major conceptual difficulty in answering this question is representing NN quality in the right units. For example, an accuracy metric -- which necessarily falls between 0% and 100% -- *must* yield "diminishing returns", and *cannot* be host to a "long-tailed distribution". Take that same metric and send it through an inverse sigmoid, and now you might *not* have diminishing returns, and *could* have a long-tailed distribution. But we can transform data all day. The analysis shouldn't be too ad-hoc. So it's not immediately clear how to measure this.
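
To make the units problem concrete, here is a small sketch using made-up accuracy values (not real benchmark results). The inverse sigmoid is the logit function, and the name `logit` is my own choice for the helper.

```python
import math

def logit(p):
    """Inverse sigmoid: maps an accuracy in (0, 1) onto the whole real line."""
    return math.log(p / (1.0 - p))

# Hypothetical accuracies for three successively larger models.
accuracies = [0.90, 0.99, 0.999]

acc_gains = [b - a for a, b in zip(accuracies, accuracies[1:])]
logit_gains = [logit(b) - logit(a) for a, b in zip(accuracies, accuracies[1:])]

print("accuracy gains:", [round(g, 3) for g in acc_gains])
print("logit gains:   ", [round(g, 3) for g in logit_gains])
```

Measured as accuracy, the second step looks ten times smaller than the first. Measured in logit units, the two steps are roughly the same size. Same data, opposite conclusions about diminishing returns.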

One related question is which sub-tasks showed surprise jackpots in GPT-3 relative to GPT-2.