A Simple Method for Accelerating Grokking
> TL;DR: Letting a model overfit first, then applying Frobenius norm regularization, achieves grokking in roughly half the steps of Grokfast on modular arithmetic.

I learned about grokking fairly recently, and thought it was quite interesting. It sort of shook up how I thought about training. Overfitting to your training data was a cardinal sin for decades, but we're finding it may not be so bad?

I had a pretty poor understanding of what was going on here, so I decided to dig deeper. The intuition from the literature seemed to be that grokking occurs because the model first overfits, and then, as weight decay slowly forces it to compress, it begins to find the minimal solution on the training set... and that minimal solution seems to be a good proxy for generalization.

That gave me a pretty simple idea: what if we just let the model overfit, and then forced it to compress directly through its loss function?

First Success

All of the benchmarks for grokking seem to be built around modular arithmetic operations, so naturally, I went with that.

At first I tried taking the SVD of the weights and adding the nuclear norm to the loss function. To my surprise, the model converged in fewer steps! Whoa! But... each step was 258x slower. Computing the nuclear norm is O(n³), so I didn't think it was worth it, but I was still excited about the prospect of grokking faster.

I did some research into cheaper ways of measuring the size of the model inside its loss function and ended up at... L2 Regularization... a technique that has been around since the 1940s. I was a bit embarrassed, but continued on nonetheless. The embarrassment was quickly offset by the fact that L2 Regularization applied after overfitting worked pretty well with very little trouble!

I also found it interesting that if I scale the compression up, by bumping up the lambda or switching to a log-det penalty, I can get models with effective ranks as low as 20! I think this is still worth exploring.
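To make the recipe concrete, here's a minimal sketch of the two-phase training loop, assuming a PyTorch setup. The `model`, `train_loader`, and every hyperparameter below are placeholders, not the exact values from my runs; the point is just the shape of the idea, train normally until the model has memorized the training set, then add a squared-Frobenius-norm penalty on the weights to the loss.

```python
import torch
import torch.nn as nn


def frobenius_penalty(model: nn.Module) -> torch.Tensor:
    # Sum of squared Frobenius norms over the weight matrices.
    return sum((w ** 2).sum() for w in model.parameters() if w.ndim >= 2)


def train(model, train_loader, total_steps, overfit_steps, lam=1e-3, lr=1e-3):
    opt = torch.optim.AdamW(model.parameters(), lr=lr, weight_decay=0.0)
    loss_fn = nn.CrossEntropyLoss()
    step = 0
    while step < total_steps:
        for x, y in train_loader:
            loss = loss_fn(model(x), y)
            # Phase 1: plain training until the model overfits.
            # Phase 2: add the compression pressure to the loss.
            if step >= overfit_steps:
                loss = loss + lam * frobenius_penalty(model)
            opt.zero_grad()
            loss.backward()
            opt.step()
            step += 1
            if step >= total_steps:
                break
    return model
```

Instead of a fixed `overfit_steps` cutoff, you could just as easily switch phases once training accuracy hits ~100%; the sketch uses a step count only to keep it short.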
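For comparison, the first (slow) attempt looked roughly like the function below: a nuclear norm penalty built from the singular values of each 2-D weight matrix, dropped into the loss in place of `frobenius_penalty` above. Again, this is a sketch rather than my exact code.

```python
import torch
import torch.nn as nn


def nuclear_penalty(model: nn.Module) -> torch.Tensor:
    # Sum of nuclear norms (sums of singular values) of the 2-D weights.
    # svdvals supports autograd, so this can be added straight to the loss,
    # but the SVD makes it roughly O(n^3) per matrix per step.
    return sum(torch.linalg.svdvals(w).sum()
               for w in model.parameters() if w.ndim == 2)
```

Usage is the same as before: `loss = loss_fn(model(x), y) + lam * nuclear_penalty(model)`. That per-step SVD is exactly where the slowdown came from.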
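Finally, two pieces around the "effective rank as low as 20" observation. I'm assuming the standard entropy-based definition of effective rank here, and the specific log-det form below (log det(WᵀW + εI) per weight matrix, which pushes singular values toward zero) is one reasonable choice of penalty, not necessarily the exact one to use; both `eps` values are placeholders.

```python
import torch
import torch.nn as nn


@torch.no_grad()
def effective_rank(w: torch.Tensor, eps: float = 1e-12) -> float:
    # Entropy-based effective rank: exp of the entropy of the normalized
    # singular value distribution of a weight matrix.
    s = torch.linalg.svdvals(w)
    p = s / (s.sum() + eps)
    return float(torch.exp(-(p * torch.log(p + eps)).sum()))


def logdet_penalty(model: nn.Module, eps: float = 1e-3) -> torch.Tensor:
    # Smooth log-det surrogate for rank, computed on the smaller Gram
    # matrix of each 2-D weight: log det(W^T W + eps*I) = sum_i log(s_i^2 + eps).
    terms = []
    for w in model.parameters():
        if w.ndim != 2:
            continue
        gram = w.T @ w if w.shape[0] >= w.shape[1] else w @ w.T
        eye = torch.eye(gram.shape[0], device=w.device, dtype=w.dtype)
        terms.append(torch.logdet(gram + eps * eye))
    return sum(terms)
```

`effective_rank` is just for measurement (hence `no_grad`), while `logdet_penalty` can be swapped in for the Frobenius penalty when you want to push compression harder.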