Several people have noted that with enough piecewise-linear regions a ReLU network can approximate any smooth target function to arbitrary precision, so your model is already behaving like a smooth function on a (dense) domain of interest. The whole point is what's of interest.
There are a number of polynomial approximation theorems here, but you can quickly convince yourself that the error between a C^2 function and a piecewise-linear mesh (akin to ReLUs) under an L_p norm ought to be on the order of the mesh size squared. The standard linear-interpolation theorems are useful here: for a piecewise-linear interpolant on a mesh of size h, the approximation error is bounded by a constant times h**2 (the classic sup-norm bound is (h**2 / 8) * max|f''|).
https://www.cs.ubc.ca/~ascher/ag_2012/ag_slides/chap11.pdf
Take a look at slide 14.
https://arxiv.org/abs/1610.01145
This has some ReLU approximation results as well.
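To see the h**2 rate concretely, here is a quick numpy check (my own toy example, not taken from either link): linearly interpolate sin(x) on progressively finer meshes and the max error drops by roughly 4x every time h is halved, right alongside the classic h**2/8 bound.

```python
import numpy as np

# Piecewise-linear interpolation of a C^2 function: the error should scale like h**2.
f = np.sin
x_fine = np.linspace(0.0, np.pi, 10001)          # dense grid for measuring the error

for n in [5, 10, 20, 40, 80]:
    knots = np.linspace(0.0, np.pi, n + 1)       # mesh of size h = pi / n
    h = np.pi / n
    approx = np.interp(x_fine, knots, f(knots))  # piecewise-linear interpolant
    err = np.max(np.abs(approx - f(x_fine)))
    # classic bound: err <= (h**2 / 8) * max|f''|, and max|f''| = 1 for sin
    print(f"h = {h:.4f}   max error = {err:.2e}   h**2 / 8 = {h**2 / 8:.2e}")
```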
NNs are just approximators: choose your approximation settings. The universal approximation theorem guarantees the existence (not the convergence) of an arbitrarily good approximation, representable with an arbitrarily large parameterization. This is simply to say that under some pretty simple settings, with enough layers, enough time to fiddle (and the guarantees of the scaling papers), your network will eventually look quite smooth and quite accurate.
"so the loss landscape and output landscape are kind of just like a bunch of flat facets"
This is only the case if the output landscape does not in fact focus on the area of interest. But it is NOT true that if the output landscape is flat, the loss landscape is flat: it can be both highly curved and quite uninteresting to an optimizer.
Let your loss function be l(x, y) = (x - y)**2; clearly, even with x fixed, the loss is both smooth and curved as a function of y.
Even though each region of a ReLU network is affine in the input (zero Hessian), the loss as a function of the parameters is piecewise quadratic (for MSE) or piecewise smooth (for CE). Crossing a single activation wall changes the quadratic piece, so the parameter-space Hessian is generically full-rank on each region and can be highly curved.
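To make that concrete, here is a toy sketch (my own construction, with arbitrary numbers): a single ReLU unit f(x) = w * relu(a*x + b) is affine in x on its active region, yet the MSE loss has a plainly nonzero Hessian in the parameters (w, a, b).

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def loss(params, x=1.0, y=2.0):
    # single ReLU unit f(x) = w * relu(a*x + b) with MSE loss against target y
    w, a, b = params
    return (w * relu(a * x + b) - y) ** 2

p = np.array([0.5, 1.0, 0.3])   # a point where the unit is active (a*x + b > 0)
eps = 1e-4

# numerical Hessian of the loss with respect to (w, a, b)
H = np.zeros((3, 3))
for i in range(3):
    for j in range(3):
        pp = p.copy(); pp[i] += eps; pp[j] += eps
        pm = p.copy(); pm[i] += eps; pm[j] -= eps
        mp = p.copy(); mp[i] -= eps; mp[j] += eps
        mm = p.copy(); mm[i] -= eps; mm[j] -= eps
        H[i, j] = (loss(pp) - loss(pm) - loss(mp) + loss(mm)) / (4 * eps ** 2)

print(H)   # nonzero curvature in parameter space, while the output is affine in x here
```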
A really simple way to see this:
Try to approximate even a sign function with incredibly simple MLPs: as you interpolate more points and allow the MLP to grow in parameters, you will see the fit becoming quite smooth, but it won't start that way.
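Something like this sketch shows it (assuming PyTorch is available; the widths and step counts are arbitrary choices of mine): fit sign(x) with one-hidden-layer MLPs of growing width, and the narrow net gives a few crude facets while the wide one looks like a smooth ramp through the jump at zero.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.linspace(-1, 1, 200).unsqueeze(1)   # interpolation points
y = torch.sign(x)                             # target: the sign function

def fit_mlp(width, steps=2000):
    net = nn.Sequential(nn.Linear(1, width), nn.ReLU(), nn.Linear(width, 1))
    opt = torch.optim.Adam(net.parameters(), lr=1e-2)   # Adam on a non-convex ReLU objective
    for _ in range(steps):
        opt.zero_grad()
        loss = ((net(x) - y) ** 2).mean()
        loss.backward()
        opt.step()
    return net

for width in [2, 8, 64]:
    net = fit_mlp(width)
    with torch.no_grad():
        err = ((net(x) - y) ** 2).mean().item()
    print(f"width = {width:3d}   final MSE = {err:.4f}")
# plotting net(x) for each width shows the small net as a few crude facets
# and the wide net as a smooth-looking ramp through the discontinuity at zero
```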
"1) Optimizers with momentum like Adam really only make sense when you have something that's locally like a smooth convex problem."
Adam does not require convexity at all; convergence analyses in the non-convex setting only require Lipschitz gradients. ReLU nets are Lipschitz and differentiable (smooth) on a dense set, so we are perfectly fine here.
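The sign-function sketch above already runs Adam on a non-convex, piecewise-linear ReLU objective without issue; if you want the non-convexity point in isolation, a toy like this (again my own example) does the job:

```python
import torch

# Adam on a plainly non-convex scalar objective: f(w) = w**4 - 3*w**2 + w
w = torch.tensor(2.0, requires_grad=True)
opt = torch.optim.Adam([w], lr=0.05)
for _ in range(500):
    opt.zero_grad()
    f = w ** 4 - 3 * w ** 2 + w
    f.backward()
    opt.step()
print(w.item(), f.item())   # settles into one of the two local minima; no convexity needed
```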
"2) The core thing in SLT is like the learning coefficient, which is related to the curvature of the network. And it seems like people have managed to tie that to interesting high level behaviors."
Yes...yes they have. But Adam has been around for quite a while, and when you go to NIPS, you'll notice SLT is not the largest component of work there.
As far as NNs being combinatorics: sure, that's a way to view it; they combine objects in different ways and output results. But so does any high-dimensional function. A combinatorial estimator and a smooth function are not so different in the limit, as you noted.