I am a Lecturer in Data Science (the equivalent of an Assistant Professor) at the University of Melbourne, thinking about how to understand the empirical successes of deep learning, and the alignment problem, from the perspective of statistics.
To progress our understanding in learning theory, I feel it is important to establish some form of "hierarchy" of the key factors in deep learning methodology, ordered from most critical for good performance to least critical. I believe this hierarchy might help to identify the strengths of one theory over another.
My proposed order of importance is as follows:
Initialization Scheme
The initialization scheme appears to be the most impactful property of neural network training: if the initialization is not set appropriately, training will either fail outright or fail to generalize. In fact, it is so critical that very few attempts are made to alter it. The He initialization paper ("Delving Deep into Rectifiers", by the same authors as ResNet) is so frequently cited not necessarily for the PReLU activation it introduced, but because of the ubiquity of the initialization scheme proposed therein. PyTorch defaults to the He scheme, and few researchers I have spoken to have dared to alter it.
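To make the claim concrete: the He scheme draws weights with variance 2/fan_in, which compensates for ReLU zeroing half of the signal and keeps activation magnitudes roughly constant with depth. A minimal numpy sketch (illustrative, not tied to any particular framework):

```python
import numpy as np

def init(fan_in, fan_out, gain, rng):
    """Draw weights from N(0, gain / fan_in); gain = 2 is the He scheme."""
    return rng.normal(0.0, np.sqrt(gain / fan_in), size=(fan_in, fan_out))

rng = np.random.default_rng(0)
x = rng.normal(size=(1000, 512))       # unit-variance inputs

h_he, h_naive = x, x
for _ in range(10):                    # a stack of ten ReLU layers
    h_he = np.maximum(h_he @ init(512, 512, gain=2.0, rng=rng), 0.0)
    h_naive = np.maximum(h_naive @ init(512, 512, gain=1.0, rng=rng), 0.0)

# The gain of 2 compensates for ReLU discarding half the signal, so
# activation magnitudes stay O(1) with depth; gain = 1 halves the second
# moment at every layer, and the signal decays geometrically.
print(np.mean(h_he**2), np.mean(h_naive**2))
```

With ten layers the mis-scaled stack has already lost roughly three orders of magnitude of signal; at the depths of modern networks this is exactly the "training does not succeed" failure mode described above.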
Optimizer
Here I refer to the choice of optimizer (e.g. SGD, momentum, Adam, etc.), as well as its learning rate and batch size (though not weight decay). Several architectures do not train well without an appropriate optimizer (e.g. transformers require Adam, SignSGD, or Lion). Clearly, the learning rate and batch size must also be appropriately chosen to ensure that training converges at all.
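The sensitivity to learning rate, and the difference between optimizers, can already be seen on a two-dimensional quadratic. The following numpy sketch writes out the SGD and Adam update rules explicitly (illustrative code, not any library's internals):

```python
import numpy as np

def sgd_step(w, g, lr):
    return w - lr * g

def adam_step(w, g, state, lr, b1=0.9, b2=0.999, eps=1e-8):
    m, v, t = state
    t += 1
    m = b1 * m + (1 - b1) * g                        # EMA of the gradient
    v = b2 * v + (1 - b2) * g * g                    # EMA of its square
    m_hat, v_hat = m / (1 - b1**t), v / (1 - b2**t)  # bias correction
    return w - lr * m_hat / (np.sqrt(v_hat) + eps), (m, v, t)

# Badly conditioned quadratic loss 0.5 * w^T H w, so grad = H @ w.
H = np.diag([100.0, 1.0])

# SGD below the stability threshold (lr < 2 / lambda_max = 0.02) converges...
w = np.array([1.0, 1.0])
for _ in range(500):
    w = sgd_step(w, H @ w, lr=0.015)

# ...while just above it, the high-curvature direction blows up.
w_bad = np.array([1.0, 1.0])
for _ in range(50):
    w_bad = sgd_step(w_bad, H @ w_bad, lr=0.03)

# Adam's per-coordinate normalization tolerates the poor conditioning.
w_adam, state = np.array([1.0, 1.0]), (np.zeros(2), np.zeros(2), 0)
for _ in range(500):
    w_adam, state = adam_step(w_adam, H @ w_adam, state, lr=0.1)
```

On a quadratic the "it either trains, or it doesn't" behaviour is exact: SGD is stable if and only if the learning rate is below 2 divided by the largest curvature. Adam's per-coordinate rescaling is one commonly cited reason it copes better with the poorly conditioned loss surfaces of transformers.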
Model Architecture
By architecture, here I refer to the design of the neural network itself, and not necessarily its model class. The architecture is certainly important, but perhaps secondary to the optimizer that one opts to use.
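One concrete architectural choice with an outsized effect is the residual connection: when the weights are small (at initialization), a residual block is close to the identity map, so the signal survives arbitrary depth. A hedged numpy sketch (the function names are mine):

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def plain_block(x, W1, W2):
    """A two-layer feed-forward block."""
    return relu(W2 @ relu(W1 @ x))

def residual_block(x, W1, W2):
    """The same compute, plus a skip connection."""
    return x + relu(W2 @ relu(W1 @ x))

x = np.array([1.0, -2.0, 3.0, 0.5])
Z = np.zeros((4, 4))  # stand-in for near-zero initial weights

# With (near-)zero weights the plain block destroys the signal,
# while the residual block passes it through unchanged.
out_plain = plain_block(x, Z, Z)
out_res = residual_block(x, Z, Z)
```

This identity-at-initialization property is one reason architecture and initialization are hard to rank independently: the residual connection is, in part, an initialization device.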
Loss Function / Objective / Likelihood
Most practitioners use MSE or cross-entropy loss for standard tasks, but for more esoteric tasks (e.g. PINNs), an inappropriate choice of loss function can be a major bottleneck. For noisy data, robust loss functions can be important. For standard tasks at scale, the choice seems to matter much less: there is evidence to suggest that the Brier score is just as viable as the cross-entropy loss for classification.
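The Brier-score claim is easy to state concretely: like cross-entropy, the Brier score is a proper scoring rule, minimized when the predicted probabilities match the true class distribution. A small numpy sketch (variable names are illustrative):

```python
import numpy as np

def cross_entropy(p, y):
    """Mean negative log-probability assigned to the true class."""
    return -np.mean(np.log(p[np.arange(len(y)), y]))

def brier(p, y):
    """Mean squared distance between predicted probabilities and one-hot labels."""
    onehot = np.eye(p.shape[1])[y]
    return np.mean(np.sum((p - onehot) ** 2, axis=1))

y = np.array([0, 1])
p_sharp = np.array([[0.9, 0.05, 0.05],   # confident, correct predictions
                    [0.1, 0.80, 0.10]])
p_flat = np.array([[0.4, 0.3, 0.3],      # barely informative predictions
                   [0.3, 0.4, 0.3]])

# Both losses rank the sharper (better) predictions as preferable.
```

The two differ in their gradients (cross-entropy penalizes confident mistakes much more harshly), which is where any practical difference at scale would have to come from.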
Data Augmentation
Data augmentation strategies have yielded large improvements in performance, particularly on out-of-distribution tasks; see e.g. DINO. Data augmentation can also induce invariances in the model, and in some cases, without these invariances, there is little hope of generalization.
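A minimal example of the invariance point: duplicating each training image with its horizontal mirror, under the same label, pushes the learned function towards flip invariance without touching the architecture. A numpy sketch, assuming images are stored as 2-D arrays:

```python
import numpy as np

def augment_with_flips(images, labels):
    """Append the horizontal mirror of every image, reusing its label."""
    flipped = images[:, :, ::-1]
    return (np.concatenate([images, flipped], axis=0),
            np.concatenate([labels, labels], axis=0))

batch = np.arange(2 * 4 * 4, dtype=float).reshape(2, 4, 4)
labels = np.array([0, 1])
aug_x, aug_y = augment_with_flips(batch, labels)
# The batch doubles in size; each mirrored copy keeps its original label.
```

Richer pipelines (random crops, colour jitter, the multi-crop strategy of DINO) follow the same pattern: they encode which transformations the task's labels should be invariant to.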
Training Strategy
This refers to the choice of learning rate (and learning rate schedule), batch size, and other associated hyperparameters. Clearly, it is possible to break an entire training run by choosing these parameters poorly, but the outcome is rarely as sensitive to them as newcomers might assume. You are not going to get massive improvements in accuracy by rerunning with slightly different hyperparameters: it either trains, or it doesn't.
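For concreteness, a common instantiation of "learning rate plus schedule" is linear warmup followed by cosine decay. A self-contained sketch (the function and its defaults are illustrative, not a recommendation):

```python
import math

def lr_at(step, total_steps, base_lr, warmup_steps=100):
    """Linear warmup to base_lr, then cosine decay towards zero."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

schedule = [lr_at(s, total_steps=1000, base_lr=0.1) for s in range(1000)]
# Ramps from ~0 up to base_lr over the warmup, then decays smoothly to ~0.
```

Consistent with the point above: the exact shape of this curve matters far less than staying inside the regime where training converges at all.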
Explicit Regularization
The most common explicit regularizer is the weight decay parameter of a PyTorch optimizer. More generally, explicit regularizers rarely seem to have much effect, as the implicit bias of the model and optimizer often dominates any other regularizer in place.
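To pin down what that weight decay parameter does: with plain SGD, the classical L2 penalty and "decoupled" weight decay coincide; under Adam's preconditioning they differ, which is why PyTorch ships AdamW as a separate optimizer. A numpy sketch of the two update rules (function names are mine):

```python
import numpy as np

def sgd_l2_step(w, g, lr, wd):
    """L2 penalty: its gradient wd * w is simply added to the loss gradient."""
    return w - lr * (g + wd * w)

def sgd_decoupled_step(w, g, lr, wd):
    """Decoupled decay (AdamW-style): shrink w directly, then take the step."""
    return (1.0 - lr * wd) * w - lr * g

w = np.array([1.0, -2.0])
g = np.zeros(2)          # zero gradient isolates the effect of the decay
w_l2 = sgd_l2_step(w, g, lr=0.1, wd=0.01)
w_dec = sgd_decoupled_step(w, g, lr=0.1, wd=0.01)

# For vanilla SGD the two rules coincide; they only diverge once the
# gradient is preconditioned (as in Adam), where the L2 term gets
# rescaled per-coordinate along with everything else.
```

Even here, the observation above stands: the multiplicative shrinkage is tiny per step, and whatever implicit bias the optimizer already has tends to dominate it.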