This is a linkpost for work done as part of MATS 9.0 under the mentorship of Richard Ngo.
Loss scaling laws are among the most important empirical findings in deep learning. This post synthesises evidence that, though important in practice, loss scaling per se is a straightforward consequence of very low-order properties of natural data. The covariance spectrum of natural data generally follows a power-law decay: the marginal value of representing the next feature decays only gradually, rather than falling off a cliff after a small handful of the most important features (as tends to be the case in image compression). Indeed, we can generate trivial synthetic data with this property and train random feature models that exhibit loss scaling.
This is not to say scaling laws have not 'worked' - whatever GPT-2 had, adding OOMs gave GPT-3 more of it. Scaling laws are a necessary but not sufficient part of this story. I want to convince you that the mystery of 'the miracle of deep learning' abides.
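To make the claim above concrete, here is a minimal sketch of the kind of experiment described: Gaussian data whose covariance eigenvalues decay as a power law, a fixed linear teacher as the target, and random ReLU feature models of increasing width fit with a ridge readout. All the specific values (dimension, decay exponent, widths, ridge strength) are illustrative choices, not the ones used in the linked work.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data with a power-law covariance spectrum:
# the i-th eigenvalue decays as i^(-alpha) (illustrative alpha = 1.5).
d, n_train, n_test, alpha = 256, 2048, 1024, 1.5
spectrum = np.arange(1, d + 1, dtype=float) ** (-alpha)

def sample(n):
    # Gaussian inputs whose covariance is diag(spectrum).
    return rng.standard_normal((n, d)) * np.sqrt(spectrum)

# A fixed linear "teacher" defines the target function.
teacher = rng.standard_normal(d)
X_tr, X_te = sample(n_train), sample(n_test)
y_tr, y_te = X_tr @ teacher, X_te @ teacher

def random_feature_loss(k, ridge=1e-3):
    # Random ReLU features of width k, with a ridge-regression readout.
    W = rng.standard_normal((d, k)) / np.sqrt(d)
    F_tr, F_te = np.maximum(X_tr @ W, 0), np.maximum(X_te @ W, 0)
    coef = np.linalg.solve(F_tr.T @ F_tr + ridge * np.eye(k), F_tr.T @ y_tr)
    return np.mean((F_te @ coef - y_te) ** 2)

# Test loss should fall smoothly (roughly as a power law) in width,
# rather than plateauing after a few features are captured.
widths = [8, 32, 128, 512]
losses = [random_feature_loss(k) for k in widths]
for k, loss in zip(widths, losses):
    print(f"width {k:4d}: test loss {loss:.4f}")
```

Plotting `losses` against `widths` on log-log axes gives the familiar roughly-straight scaling curve; steeper spectral decay (larger `alpha`) produces faster loss scaling, which is the sense in which the scaling law is inherited from the data's spectrum.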