It seems gradient descent methods haven't been using the relevant mathematical bounds so far. Google has released AutoBound as an open-source library.

Here is what I consider the money shot of the article (notice it's a log plot):

Performance of SafeRate when used to train a single-hidden-layer neural network on a subset of the MNIST dataset, in the full-batch setting.

Hopefully, they are just overfitting on MNIST. Otherwise, it pattern-matches to a huge advance. Their repo implies that with float64 this scales to larger neural networks. LLMs seem to reliably get new capabilities with lower loss, at least.

What do you think?

Here are related technical details:

Optimizers that use upper bounds in this way are called majorization-minimization (MM) optimizers. Applied to one-dimensional logistic regression, AutoBound rederives an MM optimizer first published in 2009. Applied to more complex problems, AutoBound derives novel MM optimizers that would be difficult to derive by hand.
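To spell out the MM step for myself (my own gloss, not from the blog post): the bound touches the loss at the current iterate, so minimizing the bound can never increase the loss.

```latex
% Generic majorization-minimization step (standard textbook form, nothing AutoBound-specific):
% choose U with U(x; x_t) \ge f(x) for all x, and U(x_t; x_t) = f(x_t); then
x_{t+1} = \operatorname*{arg\,min}_x \, U(x;\, x_t),
\qquad
f(x_{t+1}) \le U(x_{t+1};\, x_t) \le U(x_t;\, x_t) = f(x_t).
```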

We can use a similar idea to take an existing optimizer such as Adam and convert it to a hyperparameter-free optimizer that is guaranteed to monotonically reduce the loss (in the full-batch setting). The resulting optimizer uses the same update direction as the original optimizer, but modifies the learning rate by minimizing a one-dimensional quadratic upper bound derived by AutoBound. We refer to the resulting meta-optimizer as SafeRate.

Using SafeRate, we can create more robust variants of existing optimizers, at the cost of a single additional forward pass that increases the wall time for each step by a small factor (about 2x slower in the example above).
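As I read it, SafeRate is just the one-dimensional version of that MM step, taken along the base optimizer's update direction. Here is a minimal sketch of how I picture it, assuming a hypothetical `quad_bound` routine that returns the coefficients of AutoBound's quadratic upper bound along the ray (the name and signature are mine, not the library's API):

```python
def saferate_step(quad_bound, params, direction):
    """Toy SafeRate-style update (my reconstruction, not the library's code).

    `quad_bound(params, direction)` is assumed to return (a, b) such that
        loss(params + eta * direction) <= a * eta**2 + b * eta + loss(params)
    holds for eta in the relevant range, with equality at eta = 0.
    """
    a, b = quad_bound(params, direction)
    if a > 0 and b < 0:
        eta = -b / (2.0 * a)   # learning rate minimizing the 1-D quadratic upper bound
    else:
        eta = 0.0              # bound guarantees no improvement; take no step
    # Since the quadratic majorizes the loss and matches it at eta = 0,
    # the loss cannot increase (in the full-batch setting).
    return params + eta * direction, eta
```

I assume the "single additional forward pass" in the quoted text is what it costs to compute those bound coefficients along the ray, which would explain the ~2x wall time per step rather than an extra forward/backward pair.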

This seems novel for neural network training, or am I missing something that the Bayesian neural net people have already been doing?

9 comments

These kinds of "twist on a known optimizer" papers are pretty common, and they mostly don't amount to much. E.g., the only difference between Adam and "SafeRate[Adam direction]" is that they used their second-order method to automatically tune Adam's learning rate. Such automatic hyperparameter tuning has been a thing for a long time; e.g., here's a paper from ~30 years ago.

Also note that Adam pretty much keeps up with SafeRate in the above plot until the loss drops to ~, which is extremely low, and very far beyond what any plausible AGI training run will reach. SafeRate's advantage isn't supposed to be 'make loss go down harder', it's supposed to be 'more stable optimization process', which is exactly what you see in the plot above.

That's not to say SafeRate is worthless. The fact that they can do second-order hyperparameter tuning with only a second forward pass, rather than another pair of forward and backward passes, is somewhat interesting. It may also make large language model training more stable, which I understand to be a real issue when tuning such training processes. However, it's extremely unlikely IMO to be some "multiple OOM jump" in training efficiency.

in the full-batch setting.

uh, yeah, no shit Adam hits a floor on the loss in this context. The entire point of Adam is to compute the running variance of gradients and scale the learning rate to take constant-ish step sizes. What this means in the full-batch setting is that once Adam gets close to a local minimum, it will just oscillate around that minimum, never going further down because it insists on scaling the learning rate by the inverse gradient variance. None of this matters for networks of practical size because they never actually get close to anything like a local minimum.
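(For concreteness, here is the textbook Adam update, my own sketch rather than anything from the repo; the point is that the m/sqrt(v) ratio stays order-1 even as the gradients shrink.)

```python
import numpy as np

def adam_step(params, grad, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """Textbook Adam update, written out to illustrate the 'constant-ish step size' point."""
    m = b1 * m + (1 - b1) * grad               # running mean of gradients
    v = b2 * v + (1 - b2) * grad**2            # running (uncentered) second moment
    m_hat = m / (1 - b1**t)                    # bias correction
    v_hat = v / (1 - b2**t)
    # Near a full-batch minimum, m_hat and sqrt(v_hat) shrink together, so their ratio
    # stays roughly order-1: Adam keeps taking ~lr-sized steps and oscillates around
    # the minimum instead of settling into it.
    step = lr * m_hat / (np.sqrt(v_hat) + eps)
    return params - step, m, v
```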

Hopefully, they are just overfitting on MNIST. Otherwise, it pattern-matches to a huge advance

famous last words

(Note: I hid this post from logged out users since it seemed capabilities-y)

oh nice! I'm not sure that logged in is enough to make me feel comfy, but it's certainly better than nothing. a karma threshold or something might make sense?

What is the purpose, beyond mere symbolism, of hiding this post to logged out users when the relevant data is available, in far more detail, on Google's official AI blog?

I just don't want to be the ones helping things like this go viral. I would post more news here if I had a solid sense of who was benefiting from my news-gathering. I'd like to be able to make posts visible only to some specific group; I still wouldn't be posting anything that isn't already public, and my taste is somewhat iffy, but for related reasons I haven't done more newsposts of this kind than I already have.

Symbolism is coordination. Not contributing to destroying the world with your own hands, even if you can't stop others from doing it, is a good norm. It's about iterating the norm of doing concerning things at least a little bit less than others do.

Obviously fine. I posted here to get something better than my own single point estimate of what's up with this thing.