Gradient surfing: the hidden role of regularization

LawrenceC (3y)

> In particular, can we use noise to make a model grok even in the absence of regularization (which is currently a requirement to make models grok with SGD)?

Worth noting: in some cases you can get grokking without explicit regularization using full-batch gradient descent, provided you use an adaptive optimizer, due to the slingshot mechanism: https://arxiv.org/abs/2206.04817

Unfortunately, reproducing slingshots reliably was pretty challenging for me; I could consistently get it to happen with 2+ layer transformers but not reliably on 1-layer transformers (and not at all on 1-layer MLPs).

(As an aside, I also think grokking is not very interesting to study -- if you want a generalization phenomenon to study, I'd just study a task without grokking, where you can get immediate generalization or memorization depending on hyperparameters.)
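To make the "adaptive optimizer, no explicit regularization" setup concrete, here is a minimal sketch of a single Adam update in plain Python. The function name `adam_step` and its defaults are illustrative assumptions, not code from the slingshot paper, and the optional `weight_decay` term is the coupled L2 form (added to the gradient), included only to show where explicit regularization would enter. The sketch does not reproduce a slingshot; it only shows the update rule that full-batch training would apply at every step.

```python
import math

def adam_step(w, g, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
              eps=1e-8, weight_decay=0.0):
    """One Adam update on a flat list of weights.

    w, g, m, v: lists of weights, full-batch gradients, and the two
    moment estimates. t is the 1-based step count.
    weight_decay=0.0 means no explicit regularization (hypothetical knob).
    """
    new_w, new_m, new_v = [], [], []
    for wi, gi, mi, vi in zip(w, g, m, v):
        gi = gi + weight_decay * wi               # zero extra term when weight_decay == 0
        mi = beta1 * mi + (1 - beta1) * gi        # first-moment estimate
        vi = beta2 * vi + (1 - beta2) * gi * gi   # second-moment estimate
        m_hat = mi / (1 - beta1 ** t)             # bias correction
        v_hat = vi / (1 - beta2 ** t)
        new_w.append(wi - lr * m_hat / (math.sqrt(v_hat) + eps))
        new_m.append(mi)
        new_v.append(vi)
    return new_w, new_m, new_v
```

With `weight_decay=0.0`, nothing in the update pulls the weights toward zero, which is the sense of "without explicit regularization" above.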

LawrenceC (3y)

As for other forms of noise inducing grokking: we do see grokking with dropout! So there's some reason to think noise -> grokking.

(Source: Figure 28 from https://arxiv.org/abs/2301.05217)

Also worth noting that grokking is pretty hyperparameter sensitive -- it's possible you just haven't found the right size/form of noise yet!
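For readers unfamiliar with dropout as a noise source, a minimal sketch of standard inverted dropout in plain Python follows. The function name `dropout` and the explicit `rng` argument are assumptions made for illustration; this is not the implementation used in the linked paper.

```python
import random

def dropout(activations, p, rng):
    """Inverted dropout: zero each unit with probability p and rescale
    the survivors by 1 / (1 - p) so the expected value is unchanged."""
    if not 0.0 <= p < 1.0:
        raise ValueError("p must be in [0, 1)")
    keep = 1.0 - p
    return [a / keep if rng.random() >= p else 0.0 for a in activations]

# Usage: a fresh random mask is drawn on every call, which is the noise.
rng = random.Random(0)
noisy = dropout([1.0, 2.0, 3.0, 4.0], p=0.5, rng=rng)
```

The drop probability `p` plays the role of the noise scale, so it is one of the hyperparameters a sweep would cover.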

Jesse Hoogland (3y)

Thanks Lawrence! I had missed the slingshot mechanism paper, so this is great!

> (As an aside, I also think grokking is not very interesting to study -- if you want a generalization phenomenon to study, I'd just study a task without grokking, where you can get immediate generalization or memorization depending on hyperparameters.)

I totally agree on there being much more interesting tasks than grokking with modulo arithmetic, but it seemed like an easy way to test the premise.

> Also worth noting that grokking is pretty hyperparameter sensitive -- it's possible you just haven't found the right size/form of noise yet!

I will continue the exploration!
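For reference, the modular arithmetic task mentioned above can be set up in a few lines. This is a minimal sketch; the function name `modular_addition_split` and the defaults for `p`, `train_frac`, and `seed` are illustrative assumptions, not values taken from this thread.

```python
import random

def modular_addition_split(p=113, train_frac=0.3, seed=0):
    """Enumerate every (a, b, (a + b) mod p) triple and split at random.

    The held-out triples are what a memorizing model gets wrong and a
    generalizing model gets right.
    """
    pairs = [(a, b, (a + b) % p) for a in range(p) for b in range(p)]
    rng = random.Random(seed)
    rng.shuffle(pairs)
    n_train = int(train_frac * len(pairs))
    return pairs[:n_train], pairs[n_train:]

train, test = modular_addition_split()
```

Because the full input space is only `p * p` examples, the training fraction is itself a hyperparameter that could be varied alongside the noise scale.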

gwern (2y)

> Unfortunately, reproducing slingshots reliably was pretty challenging for me; I could consistently get it to happen with 2+ layer transformers but not reliably on 1-layer transformers (and not at all on 1-layer MLPs).

Shallow/wide NNs seem to be bad in a lot of ways. Have you tried instead 'skinny' NNs with a bias towards depth, which ought to have inductive biases towards more algorithmic, less memorization-heavy solutions? (Particularly for MLPs, which are notorious for overfitting due to their power.)
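One way to compare "wide" and "skinny" shapes fairly is to hold the parameter count fixed. The sketch below counts weights and biases for a fully connected network; the helper name `mlp_param_count` and both layer-width lists are hypothetical examples chosen so that the two shapes come out to the same size (263,296 parameters each).

```python
def mlp_param_count(widths):
    """Weights plus biases of a fully connected net with these layer widths."""
    return sum(n_in * n_out + n_out for n_in, n_out in zip(widths, widths[1:]))

# Hypothetical shapes with 128-dimensional input and output.
wide = [128, 1024, 128]              # one hidden layer of width 1024
skinny = [128] + [256] * 4 + [128]   # four hidden layers of width 256

print(mlp_param_count(wide), mlp_param_count(skinny))
```

Matching the budget this way separates the effect of depth from the effect of simply having more parameters.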

LawrenceC (2y)

> Have you tried instead 'skinny' NNs with a bias towards depth,

I haven't -- the problem with skinny NNs is that stacking MLP layers quickly makes things uninterpretable, and my attempts to reproduce slingshot -> grokking were done with the hope of interpreting the model before/after the slingshots.

That being said, you're probably correct that having more layers does seem related to slingshots.

> (Particularly for MLPs, which are notorious for overfitting due to their power.)

What do you mean by power here?

gwern (2y)

> What do you mean by power here?

Just a handwavy term for VC dimension, expressivity, number of unique models, or whatever your favorite technical reification of "can be real smart and learn complicated stuff" is.

LawrenceC (3y)

> The negative result tells us that the strong form of the claim "regularization = navigability" is probably wrong. Having a smaller weight norm actually is good for generalization (just as the learning theorists would have you believe). You'll have better luck moving along the set of minimum loss weights in the way that minimizes the norm than in any other way.

Have you seen the Omnigrok work? It argues that weight norm is directly related to grokking.

Figure 7 from https://arxiv.org/abs/2301.05217 makes a similar point, though less strongly.

LawrenceC (3y)

That being said, it's possible that both group composition tasks (like the mod add stuff) and MNIST are pretty special datasets, in that generalizing solutions have small weight norm and memorization solutions have large weight norm. It might be worth constructing tasks where generalizing solutions have large weight norm, and seeing what happens.
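To pin down what "weight norm" means here, the sketch below computes a global L2 norm and applies it to a toy redundant model in which two weight settings compute exactly the same function but differ greatly in norm. The model `f(x) = (w1 + w2) * x` and the names `global_l2_norm` and `predict` are hypothetical and chosen only for illustration; the example says nothing about which solution would generalize on a real task.

```python
import math

def global_l2_norm(params):
    """L2 norm over all weights, treating every tensor as one flat vector."""
    return math.sqrt(sum(w * w for tensor in params for w in tensor))

def predict(params, x):
    """Hypothetical redundant model: f(x) = (w1 + w2) * x."""
    w1, w2 = params[0]
    return (w1 + w2) * x

small = [[0.5, 0.5]]     # low-norm weights with w1 + w2 = 1
large = [[10.0, -9.0]]   # high-norm weights with w1 + w2 = 1

# Both fit y = x exactly, so they sit on the same set of minimum-loss weights.
same_fit = predict(small, 3.0) == predict(large, 3.0)
norms = (global_l2_norm(small), global_l2_norm(large))
```

A task of the kind proposed above would need the reverse relationship on held-out data: the higher-norm region of weight space would have to be the one that generalizes.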

Jesse Hoogland (3y)

I think Omnigrok looked at enough tasks (MNIST, group composition, IMDb reviews, molecule polarizability) to suggest that the weight norm is an important ingredient and not just a special case / cherry-picking.

That said, I still think there's a good chance it isn't the whole story. I'd love to explore a task that generalizes at large weight norms, but it isn't obvious to me that you can straightforwardly construct such a task.