Why Bigger Models Generalize Better

PapersToAGI

There is still a lingering belief from classical machine learning that bigger models overfit and therefore don't generalize well. The bias-variance trade-off formalizes this intuition, but it does not describe heavily overparameterized models well. Empirically, this shows up in phenomena like double descent: test error rises as model complexity approaches the point where the model can just barely fit the training data, then falls again as complexity grows beyond it, so very large models can outperform smaller ones. Why this happens remains counterintuitive for most people, so I aim to address it here:
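As a rough illustration of double descent, here is a minimal sketch using random ReLU features and minimum-norm least squares on synthetic data. Everything in it (the data model, the feature map, the helper name `double_descent_curve`, and all sizes) is an assumption chosen for illustration. In this kind of setup, test error typically peaks when the number of features is close to the number of training points and drops again as the model gets wider.

```python
import numpy as np

def double_descent_curve(widths, n_train=40, n_test=500, d=5,
                         noise=0.5, n_trials=20, seed=0):
    """Mean train/test MSE of least squares on random ReLU features.

    Hypothetical toy setup: linear ground truth with noisy training labels.
    """
    rng = np.random.default_rng(seed)
    train_err = {p: [] for p in widths}
    test_err = {p: [] for p in widths}
    for _ in range(n_trials):
        beta = rng.normal(size=d)                      # true linear signal
        X_tr = rng.normal(size=(n_train, d))
        X_te = rng.normal(size=(n_test, d))
        y_tr = X_tr @ beta + noise * rng.normal(size=n_train)
        y_te = X_te @ beta                             # noiseless test targets
        for p in widths:
            W = rng.normal(size=(d, p)) / np.sqrt(d)   # random, untrained layer
            F_tr = np.maximum(X_tr @ W, 0.0)
            F_te = np.maximum(X_te @ W, 0.0)
            # pinv gives the least-squares fit when p < n_train and the
            # minimum-norm interpolating fit when p >= n_train.
            w = np.linalg.pinv(F_tr) @ y_tr
            train_err[p].append(np.mean((F_tr @ w - y_tr) ** 2))
            test_err[p].append(np.mean((F_te @ w - y_te) ** 2))
    mean = lambda errs: {p: float(np.mean(v)) for p, v in errs.items()}
    return mean(train_err), mean(test_err)

widths = [5, 10, 20, 40, 80, 400]   # 40 matches n_train (the threshold)
train_mse, test_mse = double_descent_curve(widths)
for p in widths:
    print(f"features={p:4d}  train MSE={train_mse[p]:.4f}  test MSE={test_mse[p]:.4f}")
```

The exact numbers depend on the seed and the sizes, so treat the printed curve as a qualitative picture rather than a benchmark.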

1. Capacity Theory: When a model is much larger than its training set, it has spare capacity beyond what memorization needs and can use it to explore different internal structures. Some of these structures fit the data with a simpler rule than memorizing every example, and those tend to generalize better. Because regularization penalizes complexity, the model favors these simpler structures over memorization. Essentially, it has the room it needs to experiment with 'compressing' the data.
2. High-Dimensional Loss Landscape: This concept is a bit trickier to imagine, so let's start with a simple case: a model with only one weight, plotted on a 2D graph with the weight value on the x-axis and the loss on the y-axis. The goal is to reach the lowest point on the curve (the global minimum). However, the curve has valleys where gradient descent can get stuck: local minima that are not the true global minimum. Now imagine we add another weight, increasing the dimension of the graph by one. The graph is now three-dimensional, and the loss is a surface over two weights. The local minimum you were stuck in now has another direction attached to it. If the loss slopes downward along that new direction, the point is no longer a minimum but a saddle point, and you can escape it via the newly added dimension.
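The one-weight versus two-weight picture above can be made concrete with a toy loss. The function below is a hypothetical example constructed for illustration, not taken from any real model: with the second weight frozen at 0, gradient descent gets stuck in a local minimum, while letting the second weight move turns that same point into a saddle and lets gradient descent reach a lower loss.

```python
import math

def loss(w1, w2):
    # Toy loss (an assumption for illustration): a double well in w1
    # whose tilt depends on the second weight w2.
    return (w1**2 - 1) ** 2 + 0.3 * w1 * math.cos(w2)

def grad(w1, w2):
    d_w1 = 4 * w1 * (w1**2 - 1) + 0.3 * math.cos(w2)
    d_w2 = -0.3 * w1 * math.sin(w2)
    return d_w1, d_w2

def gradient_descent(w1, w2, lr=0.05, steps=5000, train_w2=True):
    """Plain gradient descent; train_w2=False freezes the second weight."""
    for _ in range(steps):
        g1, g2 = grad(w1, w2)
        w1 -= lr * g1
        if train_w2:
            w2 -= lr * g2
    return w1, w2

# One weight: w2 is frozen at 0, so only the 1D slice of the loss is visible.
stuck = gradient_descent(1.5, 0.0, train_w2=False)
# Two weights: same start for w1, with w2 free to move (small nonzero start).
escaped = gradient_descent(1.5, 0.1, train_w2=True)

print("one weight :", stuck, "loss =", round(loss(*stuck), 4))
print("two weights:", escaped, "loss =", round(loss(*escaped), 4))
```

At the point where the one-weight run stops, the second derivative of this loss with respect to `w2` is negative, which is exactly the "sloping downward" direction described above.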

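In general, the more dimensions you add, the more likely it is that a point that looks like a local minimum is actually a saddle point: to be a true local minimum, the loss has to curve upward in every direction at once. With many dimensions, there will likely be at least one direction that slopes downward, allowing gradient descent to escape to lower loss.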
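A quick way to get a feel for this is to model the curvature at a critical point as a random symmetric matrix and count how often every direction curves upward. This is a simplifying assumption for illustration (Hessians of real neural networks are not Gaussian random matrices), and `fraction_true_minima` is a hypothetical helper name.

```python
import numpy as np

def fraction_true_minima(dim, n_samples=2000, seed=0):
    """Fraction of random symmetric 'Hessians' with all eigenvalues positive.

    A critical point is a true local minimum only if the loss curves upward
    in every direction, i.e. every eigenvalue of the Hessian is positive.
    """
    rng = np.random.default_rng(seed)
    count = 0
    for _ in range(n_samples):
        A = rng.normal(size=(dim, dim))
        H = (A + A.T) / 2                  # symmetrize to get a valid Hessian
        if np.all(np.linalg.eigvalsh(H) > 0):
            count += 1
    return count / n_samples

for dim in [1, 2, 3, 5, 10]:
    print(f"dim={dim:2d}  fraction that are true minima: {fraction_true_minima(dim):.4f}")
```

With a single dimension, roughly half of the sampled points are minima. The fraction shrinks quickly as `dim` grows, because each extra dimension is one more chance for a downward direction.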

Now, points 1 and 2 are not disconnected; they are two sides of the same coin. While the model is trying out different structures that don't affect its loss (point 1), gradient descent is roaming around a flat region near a local minimum without changing the loss (point 2). At some point, it may find a way out by discovering a dimension that slopes downward, a 'dimensional alleyway' out of the local minimum, so to speak. This traversal to a lower point corresponds to the model finding a simpler solution, i.e., the generalized structure.

(Even if the generalized structure does not reduce the training loss itself, the regularization penalty added on top of the loss surface gives the simpler, generalized structure a lower total loss than memorization.)
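The role of the penalty can be sketched with an overparameterized linear model, where many weight vectors fit the training data exactly. The example assumes an L2 (weight decay) penalty and uses the weight norm as a stand-in for "simplicity". Both are assumptions made for illustration, as are the variable names and the value of `lam`.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_weights = 10, 50          # far more weights than data points
X = rng.normal(size=(n_samples, n_weights))
y = rng.normal(size=n_samples)

def data_loss(w):
    return float(np.mean((X @ w - y) ** 2))

def total_loss(w, lam=0.01):
    # Data loss plus an L2 penalty (the assumed form of regularization).
    return data_loss(w) + lam * float(np.sum(w ** 2))

# Simplest exact fit: the minimum-norm solution.
X_pinv = np.linalg.pinv(X)
w_simple = X_pinv @ y

# A second exact fit: add a direction that X cannot see (its null space),
# so predictions on the training data do not change at all.
z = rng.normal(size=n_weights)
null_part = z - X_pinv @ (X @ z)
w_complex = w_simple + null_part

print("data loss :", data_loss(w_simple), data_loss(w_complex))
print("total loss:", total_loss(w_simple), total_loss(w_complex))
```

Both solutions have essentially zero data loss, so the data alone cannot tell them apart. Only the penalty term separates them, and it favors the lower-norm solution.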