Neural networks (NNs) do not output all functions with equal probability, but seem to be biased towards functions of certain types; heuristically, towards 'simple' functions. In [VPCL18], [MSVP+19], and [MVPSL20], evidence is given that functions output by NNs tend to have low information-theoretic complexity; nice summaries are given on LessWrong here and here, and elsewhere by the author. However, the converse is not true: some functions with low information-theoretic complexity (such as simple periodic functions) are not so readily output by NNs; this is discussed extensively in the comments to the above posts. Understanding this kind of problem better is widely considered relevant for AI alignment; see e.g. here.

To try to understand better the biases of NNs, we consider more geometric measures of simplicity. In particular, for a neural network with ReLU (or other piecewise-linear activation function), one measure of complexity is the measure (size) of the set of points at which the function is not linear. For functions defined on small domains we might call a function 'simple' if it has a small set of points of non-linearity. For a function on a larger or unbounded domain, we might consider a function 'simple' if the points of non-linearity are clustered together (so that the function is 'mostly linear').

In this short note we explore this heuristic in the simple case of a NN with one hidden layer, outputting functions on a 1-dimensional domain. We choose the parameters of our neural network uniformly at random, and compute the shape of the resulting distribution of points of non-linearity.
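To make the setup concrete, here is a minimal sketch in Python (assuming, purely for illustration, the domain $[0,1]$ and hidden units computing $\mathrm{ReLU}(w_i x + b_i)$; the helper name `kink_locations` is ours, not from the note) that computes the points of non-linearity of a width-$n$ hidden layer:

```python
import numpy as np

def kink_locations(w, b, domain=(0.0, 1.0)):
    """Points of non-linearity, inside the domain, of a one-hidden-layer
    ReLU network: unit i computes ReLU(w[i] * x + b[i]), which has a single
    kink at x = -b[i] / w[i] (when w[i] != 0)."""
    lo, hi = domain
    w = np.asarray(w, dtype=float)
    b = np.asarray(b, dtype=float)
    nonzero = w != 0
    x = -b[nonzero] / w[nonzero]
    return x[(x >= lo) & (x <= hi)]

rng = np.random.default_rng(0)
n = 10  # hidden width: at most n points of non-linearity
w = rng.uniform(-1.0, 1.0, size=n)
b = rng.uniform(-1.0, 1.0, size=n)
print(len(kink_locations(w, b)))
```

For generic output weights the kinks of the individual units neither merge nor cancel, so only `w` and `b` determine the (at most $n$) points of non-linearity.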

Perhaps surprisingly, we find that the distribution of the points of non-linearity does not depend on the size of the region from which the parameters of the neural network are chosen, but does depend heavily on its shape. The remainder of this post summarises these results.

Summary of results

The results in the note are more general, but here we summarise the outcome in the case of functions defined on a fixed interval in the real numbers. A NN with one hidden activation layer of width $n$ produces a piecewise-linear (PL) function with at most $n$ points of non-linearity.

Weights and biases chosen uniformly at random in an interval

We consider first the case where both the biases and the weights of our NN are chosen uniformly at random in some fixed interval. Then we find that

  • for any $k \le n$, the number of points of non-linearity follows a binomial distribution: the probability of a function generated by the NN having exactly $k$ points of non-linearity is $\binom{n}{k} p^k (1-p)^{n-k}$, where the per-neuron probability $p$ is an explicit constant that does not depend on the size of the interval;

  • the expected number of points of non-linearity is a fixed constant fraction of $n$, independent of the size of the interval. In particular, the average function has significantly fewer than $n$ points of non-linearity.
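The scale-invariance here can be checked numerically. The sketch below estimates the probability that a single unit's kink lands in the domain, for several scales; the domain $[0,1]$ and the symmetric interval $[-C, C]$ for both weights and biases are our own illustrative assumptions (under them, a short symmetry argument gives a per-unit probability of exactly $1/4$):

```python
import numpy as np

def kink_fraction(scale, trials=200_000, rng=None):
    """Empirical probability that a unit ReLU(w*x + b), with
    w, b ~ Uniform[-scale, scale], has its kink -b/w inside [0, 1]."""
    rng = np.random.default_rng(0) if rng is None else rng
    w = rng.uniform(-scale, scale, size=trials)
    b = rng.uniform(-scale, scale, size=trials)
    x = -b / w  # w == 0 occurs with probability 0
    return np.mean((x >= 0.0) & (x <= 1.0))

# The estimate is the same for every scale: rescaling the interval rescales
# w and b together and leaves the kink location -b/w distributed identically.
rng = np.random.default_rng(1)
for scale in (0.1, 1.0, 10.0):
    print(scale, kink_fraction(scale, rng=rng))
```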

Weights chosen uniformly in a sphere

For our second example, we again choose the biases of the neurons uniformly at random in some interval. But we choose the vector of weights of the neurons uniformly in an $n$-dimensional ball of some fixed radius. In this situation it seems more difficult to compute the exact probabilities, but we obtain closed-form expressions for the expected number of points of non-linearity, one for $n$ even and one for $n$ odd. For large $n$, both expressions asymptotically approach the same value, which is much less than $n$. In other words, such a network has a much stronger bias towards functions with few points of non-linearity ('simple' functions) than in the case where all parameters were chosen uniformly at random.
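The spherical case can likewise be probed by simulation. In the sketch below, the domain $[0,1]$, biases uniform in $[-1,1]$, and a unit $n$-ball are our illustrative choices, and both helper names are ours; the average kink count for width $n = 64$ comes out far below $n$:

```python
import numpy as np

def sample_ball(n, radius=1.0, rng=None):
    """Uniform sample from the n-dimensional ball: a uniform direction
    (a normalised Gaussian vector) scaled by radius * U^(1/n)."""
    rng = np.random.default_rng(0) if rng is None else rng
    g = rng.standard_normal(n)
    return (radius * rng.uniform() ** (1.0 / n)) * g / np.linalg.norm(g)

def expected_kinks_spherical(n, trials=2_000, rng=None):
    """Monte Carlo estimate of the expected number of kinks in [0, 1] when
    the weight vector is uniform in the unit n-ball and each bias is
    Uniform[-1, 1]."""
    rng = np.random.default_rng(1) if rng is None else rng
    total = 0
    for _ in range(trials):
        w = sample_ball(n, rng=rng)
        b = rng.uniform(-1.0, 1.0, size=n)
        x = -b / w  # coordinates of a ball sample are nonzero almost surely
        total += int(np.count_nonzero((x >= 0.0) & (x <= 1.0)))
    return total / trials

print(expected_kinks_spherical(64))
```

Each coordinate of a uniform ball sample has typical size $1/\sqrt{n}$, which is why the kink count grows much more slowly than $n$ here.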

In particular, we see that the network will require quite large width (larger still in the spherical case than in the uniform case) to easily approximate a simple periodic function, and such an approximation seems unlikely to generalise well outside its domain of definition.

Limitations and potential generalisations

Of course there are many; we list a few that seem particularly important to us.

  1. We only consider one-dimensional domains. The higher dimensional case will be more involved, but should be amenable to a similar analysis.
  2. We do not have any training data. It would be much better to restrict to functions agreeing with some finite set of training data. We expect that a similar analysis will again apply, as long as the number of training data points is substantially less than $n$. Making precise the meaning of 'substantially less' seems interesting.
  3. We consider only the case of a single activation layer. Again we expect that similar results will hold in general, but the analysis clearly becomes more complicated.

But before thinking about these problems I hoped to get some feedback on whether these results are new/interesting, what people consider to be the main gaps, etc. So comments are very welcome!

2 comments

Neat! Just to be double-sure, the second process was choosing the weight in a ball (so total L2 norm of weights was <= 1), rather than on a sphere (total norm == 1), right? Is initializing weights that way actually a thing people do?

If training large neural networks only moves the parameters a small distance (citation needed), do you still think there's something interesting to say about the effect of training in this lens of looking at the density of nonlinearities?

I'm reminded of a recent post about LayerNorm. LayerNorm seems like it squeezes the function back down closer to the unit interval, increasing the density of nonlinearities.

Thanks Charlie.

Just to be double-sure, the second process was choosing the weight in a ball (so total L2 norm of weights was <= 1), rather than on a sphere (total norm == 1), right?

Yes, exactly (though a ball of radius $R$ for some constant $R$, which may not be $1$, but this turns out not to matter).

Is initializing weights that way actually a thing people do?

Not sure (I would like to know). But what I had in mind was initialising a network with small weights, then doing a random walk ('undirected SGD'), and then looking at the resulting distribution. Of course this will be more complicated than the distributions I use above, but I think the shape may depend quite a bit on the details of the SGD. For example, I suspect that the result of something like adaptive gradient descent may tend towards more spherical distributions, but I haven't thought about this carefully.
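For what it's worth, that experiment can be sketched as follows; all the specifics here (Gaussian initialisation, Gaussian steps, the step size) are our own illustrative stand-ins for 'undirected SGD':

```python
import numpy as np

def random_walk_kinks(n=32, steps=1_000, step_size=0.01, rng=None):
    """Initialise a width-n ReLU layer with small parameters, random-walk
    all parameters, and record the kink count in [0, 1] at each step."""
    rng = np.random.default_rng(2) if rng is None else rng
    w = rng.normal(0.0, 0.01, size=n)
    b = rng.normal(0.0, 0.01, size=n)
    counts = []
    for _ in range(steps):
        w += step_size * rng.standard_normal(n)
        b += step_size * rng.standard_normal(n)
        x = -b / np.where(w == 0.0, np.nan, w)  # kink locations; guard w == 0
        counts.append(int(np.count_nonzero((x >= 0.0) & (x <= 1.0))))
    return counts

counts = random_walk_kinks()
print(counts[0], counts[-1])
```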

If training large neural networks only moves the parameters a small distance (citation needed), do you still think there's something interesting to say about the effect of training in this lens of looking at the density of nonlinearities?

I hope so! I would want to understand what norm the movements are 'small' in ($L^2$, $L^\infty$, ...).

LayerNorm looks interesting, I'll take a look.