My amateur understanding of neural networks is that they almost always train using stochastic gradient descent. The quality of a neural network comes from its size, shape, and training data, but not from the training function, which is always simple gradient descent.

This is a bit unintuitive to me because gradient descent is only guaranteed to find the global minimum of a function if that function is convex, and I wouldn't expect typical ML problems (e.g., "find the dog in this picture" or "continue this writing prompt") to have convex cost functions. So why does gradient descent always work?
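To make the worry concrete, here is a minimal sketch of plain gradient descent on a small non-convex function. The function `x**4 - 3*x**2 + x`, the learning rate, and the starting points are all illustrative assumptions of mine, not anything taken from a real network. It just shows that where you end up depends on where you start.

```python
def loss(x):
    # Hypothetical non-convex "cost function" with two separate minima.
    return x**4 - 3 * x**2 + x


def grad(x):
    # Derivative of loss(x), worked out by hand.
    return 4 * x**3 - 6 * x + 1


def gradient_descent(x, lr=0.01, steps=1000):
    # Plain (non-stochastic) gradient descent: repeatedly step downhill.
    for _ in range(steps):
        x -= lr * grad(x)
    return x


left = gradient_descent(-2.0)   # starts left of both minima
right = gradient_descent(2.0)   # starts right of both minima

print("from -2.0:", left, "loss:", loss(left))
print("from +2.0:", right, "loss:", loss(right))
```

Starting from -2.0 the run settles near x = -1.30, which is the global minimum; starting from 2.0 it settles near x = 1.13, which is only a local minimum with a higher loss. Both runs "worked" in the sense of going downhill until the slope vanished, but only one found the best answer.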

One explanation I can think of: it *doesn't* work if your goal is to find the optimal answer, but we hardly ever need the optimal answer; a good-enough answer will do. For example, if a NN is trained to play Go, it doesn't have to find the best move, it just has to find a winning move. Not sure if this explanation makes sense though.

Wait, how is it possible for there to be no optimum at all? There's only a finite number of possible settings of the 175 billion parameters; there has to be at least one setting such that no other setting has lower loss. (I don't know much math, I'm probably misunderstanding what optimum means.)

With finite-precision floating-point numbers, there must be an optimum.

In the real numbers, there are infinitely many possible settings, so it's possible to have an infinite sequence of ever-better solutions. (The trivial example is when a bigger parameter always gives better results: for any setting, it's possible to add 1 and get a better setting.)

It's possible for the score to behave like the blue line, getting better and better without bound. It's also possible for the score to behave like the orange line, approaching some finite value that it never reaches.
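Here is a small sketch of both cases, written as losses (lower is better). The two functions are stand-ins I picked for illustration, not the actual curves from the plot: `unbounded_loss` improves forever, and `asymptotic_loss` approaches 0 without reaching it in the real numbers. The last line ties back to the finite-precision point: in 64-bit floats the second function does eventually hit exactly 0.0, because the true value underflows.

```python
import math


def unbounded_loss(x):
    # Hypothetical "blue line" case: a bigger parameter is always better.
    return -x


def asymptotic_loss(x):
    # Hypothetical "orange line" case: approaches 0 but, over the reals,
    # never equals it.
    return math.exp(-x)


settings = [1.0, 10.0, 100.0]
unbounded = [unbounded_loss(x) for x in settings]
asymptotic = [asymptotic_loss(x) for x in settings]

print("unbounded: ", unbounded)    # keeps dropping, no floor
print("asymptotic:", asymptotic)   # keeps dropping, but stays above 0

# In floating point the infimum is actually attained: the true value of
# exp(-1000) is far too small to represent, so it rounds to exactly 0.0.
underflowed = asymptotic_loss(1000.0)
print("asymptotic_loss(1000.0) ==", underflowed)
```

So over the reals neither function has a minimizer, while on a real computer the second one does, simply because there are only finitely many representable values.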