Architecture-aware optimisation: train ImageNet and more without hyperparameters
A deep learning system comprises many interrelated components: architecture, data, loss function and gradients. There is structure in how these components interact, yet the most popular optimisers (e.g. Adam and SGD) do not exploit this information. This means there are leftover degrees of freedom...
Apr 22, 2023