Nonlinear limitations of ReLUs

jacob_cannell


Is the inability to approximate periodic functions of a single variable important?

Periodic functions are already used as an important encoding in SOTA ANNs, from transformer LLMs to NeRFs in graphics. From the instant-ngp paper:

For neural networks, input encodings have proven useful in the attention components of recurrent architectures [Gehring et al. 2017] and, subsequently, transformers [Vaswani et al. 2017], where they help the neural network to identify the location it is currently processing. Vaswani et al. [2017] encode scalar positions x ∈ ℝ as a multiresolution sequence of L ∈ ℕ sine and cosine functions

enc(x) = (sin(2^0 x), sin(2^1 x), …, sin(2^{L−1} x), cos(2^0 x), cos(2^1 x), …, cos(2^{L−1} x)).  (1)

This has been adopted in computer graphics to encode the spatio-directionally varying light field and volume density in the NeRF algorithm [Mildenhall et al. 2020].
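A minimal sketch of the multiresolution encoding from the quoted equation (the function name and the example values are my own; only the formula itself comes from the paper):

```python
import numpy as np

def sinusoidal_encoding(x, L):
    """Encode a scalar position x as the 2L-dimensional vector
    (sin(2^0 x), ..., sin(2^(L-1) x), cos(2^0 x), ..., cos(2^(L-1) x))."""
    freqs = 2.0 ** np.arange(L)  # 2^0, 2^1, ..., 2^(L-1)
    return np.concatenate([np.sin(freqs * x), np.cos(freqs * x)])

enc = sinusoidal_encoding(0.5, L=4)
print(enc.shape)  # (8,) -- 2L components
```

Each component is periodic in x, which is exactly what a plain ReLU stack cannot produce on its own.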

A neural net using rectified linear unit activation functions of any size is unable to approximate the function sin(x) outside a compact interval.

I am reasonably confident that I can prove that any NN with ReLU activations computes (not merely approximates) a piecewise linear function. I believe the number of linear pieces that can be achieved is bounded by at most 2^(L*D), where L is the number of nodes per layer and D is the number of layers.
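A quick numerical illustration of the piecewise-linearity claim (the random weights and the piece-counting-by-dense-sampling approach are my own sketch, not a proof):

```python
import numpy as np

rng = np.random.default_rng(0)

# Small random ReLU MLP (1 -> H -> H -> 1); weights are arbitrary --
# this illustrates piecewise linearity, not a trained network.
H, D = 8, 2  # nodes per layer, number of hidden layers
Ws = [rng.normal(size=(H, 1)), rng.normal(size=(H, H)), rng.normal(size=(1, H))]
bs = [rng.normal(size=H), rng.normal(size=H), rng.normal(size=1)]

def relu_net(x):
    h = np.array([x])
    for W, b in zip(Ws[:-1], bs[:-1]):
        h = np.maximum(W @ h + b, 0.0)  # ReLU hidden layers
    return (Ws[-1] @ h + bs[-1])[0]     # affine output layer

# Estimate the number of linear pieces on [-5, 5] by counting
# distinct slopes between densely sampled points.
xs = np.linspace(-5, 5, 20001)
ys = np.array([relu_net(x) for x in xs])
slopes = np.diff(ys) / np.diff(xs)
pieces = 1 + int(np.sum(np.abs(np.diff(slopes)) > 1e-6))
print(pieces, "linear pieces; bound 2^(L*D) =", 2 ** (H * D))
```

The sampled piece count stays far below the 2^(L*D) bound here; the point is just that the slope is constant between a finite set of breakpoints, so the function can never oscillate indefinitely like sin(x).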

This leads me to two questions:

Regarding (2a): empirically, while approximating sin(x) with small NNs in scikit-learn, I found that increasing the width of the network caused catastrophic failure of learning (starting at approximately L=10 with D=4, L=30 with D=8, and L=50 with D=50).
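A sketch of that experiment with scikit-learn's MLPRegressor (the L=10, D=4 configuration matches the smallest setting mentioned above, but the training data, interval, and optimizer settings are my assumptions, so this will not necessarily reproduce the failure):

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

# Fit sin(x) on a compact interval with a small ReLU MLP, then evaluate
# far outside it, where a piecewise-linear model must extrapolate
# linearly rather than keep oscillating.
rng = np.random.default_rng(0)
X_train = rng.uniform(-np.pi, np.pi, size=(2000, 1))
y_train = np.sin(X_train).ravel()

net = MLPRegressor(hidden_layer_sizes=(10,) * 4, activation="relu",
                   max_iter=5000, random_state=0)
net.fit(X_train, y_train)

x_in = np.linspace(-np.pi, np.pi, 100)
inside = net.predict(x_in.reshape(-1, 1))
outside = net.predict(np.linspace(10 * np.pi, 12 * np.pi, 100).reshape(-1, 1))
print("in-range max error:", np.max(np.abs(inside - np.sin(x_in))))
print("out-of-range spread:", outside.max() - outside.min())
```

Sweeping `hidden_layer_sizes` over the widths and depths listed above would be the way to probe where training starts to fail.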

Regarding (1), naively this seems relevant to questions of out-of-distribution performance and especially the problem of what it means for an input to be out-of-distribution in large input spaces.