I agree with Zach above about the main point of the paper. One other thing I’d note is that SGD can’t have literally the same outcomes as random sampling, since random sampling wouldn’t display phenomena like double descent (AN #77).

Would you mind explaining why this is? It seems to me like random sampling would display double descent. For example, as you increase model size, at first you get more and more parameters that let you approximate the data better... but then you get too many parameters and just start memorizing the data... but then when you get even more parameters, you have enough functions available that simpler ones win out... Doesn't this story work just as well for random sampling as it does for SGD?

Reply

[-]Rohin Shah4yΩ440

Hmm, I think you're right. I'm not sure what I was thinking when I wrote that. (Though I give it like 50% that if past-me could explain his reasons, I'd agree with him.)

Possibly I was thinking of epochal double descent, but that shouldn't matter because we're comparing the final outcome of SGD to random sampling, so epochal double descent doesn't come into the picture.

Reply

[-]Daniel Kokotajlo4yΩ220

OK, thanks!

Reply

[-]Charlie Steiner5yΩ120

I normally don't think of most functions as polynomials at all - in fact, I think of most real-world functions as going to zero for large values. E.g. the function "dogness" vs. "nose size" cannot be any polynomial, because polynomials (or their inverses) blow up unrealistically for large (or small) nose sizes.

I guess the hope is that you always learn even polynomials, oriented in such a way that the extremes appear unappealing?

Reply

[-]johnswentworth5yΩ440

I believe the paper says that log densities are (approximately) polynomial - e.g. a Gaussian would satisfy this, since the log density of a Gaussian is quadratic.

Reply

[-]Rohin Shah5yΩ220

What John said. To elaborate, it's specifically talking about the case where there is some concept from which some probabilistic generative model creates observations tied to the concept, and claiming that the log probabilities follow a polynomial.

Suppose the most dog-like nose size is K. One function you could use is y = exp(-(x - K)^d) for some positive integer d. That's a function whose maximum value is 0 (where higher values = more "dogness") and doesn't blow up unreasonably anywhere.

(Really you should be talking about probabilities, in which case you use the same sort of function but then normalize, which transforms the exp into a softmax, as the paper suggests)

Reply

Moderation Log

LESSWRONG
LW

LESSWRONG
LW

26

[AN #139]: How the simplicity of reality explains the success of neural nets

26

Ω 17

26

Ω 17

HIGHLIGHTS

TECHNICAL AI ALIGNMENT

MESA OPTIMIZATION

OTHER PROGRESS IN AI

DEEP LEARNING

META LEARNING

FEEDBACK

PODCAST