Machine learning algorithms such as neural networks are supposed to have some sort of universal uniform approximation theorem that shows that they can (at least in principle) learn any possible data set without simply overfitting to the training data.
The standard universal approximation theorem applies to shallow neural networks with arbitrary continuous non-polynomial activation functions. There are also plenty of polynomial approximation results in mathematics, so one should at least in principle be able to train a real multivariate polynomial to model an arbitrary continuous function arbitrarily well using a multi-layered machine learning model. On the other hand, anyone modestly familiar with complex analysis knows that not every continuous function from a compact subset of $\mathbb{C}$ to $\mathbb{C}$ can be approximated by polynomials.
In this post, we shall eventually produce a work-around that allows us to approximate arbitrary continuous functions using complex polynomials.
Why use polynomials instead of deep neural networks?
I personally have many issues with neural networks. To me, neural networks are clumsy to study mathematically. For example, a deep neural network with tanh activation is far more complicated than the function $\tanh$ itself, but even the function $\tanh$ is awkward to work with. For example, $\tanh$ can be written in terms of the exponential function since $\tanh(z)=\frac{e^{z}-e^{-z}}{e^{z}+e^{-z}}$, but this means that a deep neural network with tanh activation is written in terms of an iterated exponential. I personally would rather not use iterated exponentials. To make things worse, $\tanh$ extends to a meromorphic function on $\mathbb{C}$ with poles at the points $\pi i(k+1/2)$ for $k\in\mathbb{Z}$, so a deep neural network with tanh activation has a complicated singularity set. Since $\tanh(iz)=i\tan(z)$, the unbiased neural networks with tanh activation essentially become neural networks with tan activation when fed purely imaginary inputs, and neural networks with tangent activations are untrainable and pathological. Neural networks with ReLU activations do not have this pathology that we see with tanh activation, but they have their own pathologies.
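As a quick sanity check (a minimal numpy sketch of mine, not from the original discussion), one can verify numerically that $\tanh$ blows up near its pole at $i\pi/2$ and turns into the tangent function on the imaginary axis:

```python
import numpy as np

# tanh extends to a meromorphic function with poles at i*pi*(k + 1/2);
# approaching the pole at i*pi/2, |tanh| blows up.
z = 1j * (np.pi / 2 - 1e-3)
pole_mag = abs(np.tanh(z))
print(pole_mag)  # roughly 1000

# On the imaginary axis, tanh turns into tan: tanh(i*y) = i*tan(y).
y = 0.7
print(np.tanh(1j * y), 1j * np.tan(y))  # the two values agree
```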
It seems like the pathological attributes and behavior of neural networks are part of the reason they are so difficult to interpret and study mathematically, so an inherently interpretable alternative to neural networks may help us solve the problems related to AI interpretability and safety. Of course, inherently interpretable AI should ideally match or at least complement the performance of deep neural networks since we need inherently interpretable AI to be relevant to AI safety.
My computer experiments indicate that machine learning models that compute real or complex polynomials do not always contain the pathologies that we see with neural networks. These alternative machine learning models computing polynomial functions may still have multiple layers so they are sophisticated, but they are not as sophisticated as deep neural networks. Keep in mind that people have devoted a lot of resources to training deep neural networks, so we need to take this into consideration when evaluating the performance of alternative machine learning models. In this post, we shall state polynomial uniform approximation theorems in order to help justify the idea that polynomials may have capabilities comparable to neural networks.
Real polynomial approximation:
Real polynomials have no problem uniformly approximating arbitrary continuous functions on compact sets.
If $X,Y$ are topological spaces, then let $C(X,Y)$ denote the set of all continuous functions from $X$ to $Y$.
If $X$ is a compact Hausdorff space, then $C(X,\mathbb{R})$ (and likewise $C(X,\mathbb{C})$) can be endowed with a norm $\|\cdot\|$ defined by $\|f\|=\max_{x\in X}|f(x)|$, and this norm generates a topology on $C(X,\mathbb{R})$ known as the topology of uniform convergence.
Let $K$ be a topological field. We say that $A$ is a subalgebra of $C(X,K)$ if $f,g\in A$ and $\lambda\in K$ implies that $f+g$, $f\cdot g$, $\lambda\cdot f\in A$ as well.
Theorem: (Stone-Weierstrass approximation theorem) Suppose that $X$ is a compact Hausdorff space and $A$ is a subalgebra of $C(X,\mathbb{R})$ that contains all constant functions and where if $x,y\in X$ and $x\neq y$, then there is some $f\in A$ with $f(x)\neq f(y)$. Then $A$ is dense in $C(X,\mathbb{R})$.
As a consequence, polynomials can approximate arbitrary continuous functions uniformly on compact sets.
Corollary: Suppose that $n\geq 1$, $X\subseteq\mathbb{R}^n$ is compact, and $f:X\to\mathbb{R}$ is continuous. Then for each $\epsilon>0$, there is a polynomial $p$ in $n$ variables such that $\|f-p\|<\epsilon$.
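As a numerical illustration of this corollary (a minimal numpy sketch of mine; the target $|x|$ and the degrees are arbitrary choices), least-squares fits in the well-conditioned Chebyshev basis drive the sup-norm error down as the degree grows:

```python
import numpy as np

# Uniformly approximate the continuous (non-smooth) function |x| on [-1, 1]
# by polynomials of increasing degree, fit by least squares in the
# Chebyshev basis.
x = np.linspace(-1.0, 1.0, 2001)
f = np.abs(x)

errs = {}
for deg in (5, 20, 80):
    p = np.polynomial.Chebyshev.fit(x, f, deg)
    errs[deg] = np.max(np.abs(f - p(x)))
    print(deg, errs[deg])  # the sup-norm error shrinks as the degree grows
```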
Complex polynomial approximation:
Unlike the case with the real numbers, complex polynomials cannot in general approximate continuous functions uniformly on compact sets.
We say that a subalgebra $A$ of $C(X,\mathbb{C})$ is a $*$-subalgebra if $f\in A$ implies that $\overline{f}\in A$.
Theorem: (Complex Stone-Weierstrass approximation theorem) Suppose that $X$ is a compact Hausdorff space and $A$ is a $*$-subalgebra of $C(X,\mathbb{C})$ that contains all constant functions and where if $x,y\in X$ and $x\neq y$, then there is some $f\in A$ with $f(x)\neq f(y)$. Then $A$ is dense in $C(X,\mathbb{C})$.
If the algebra $A$ is not closed under complex conjugation, then $A$ generally cannot approximate arbitrary complex-valued continuous functions.
Suppose that $U$ is an open subset of $\mathbb{C}$ and $f_n:U\to\mathbb{C}$ is holomorphic for each $n$; if $f:U\to\mathbb{C}$ and $f_n\to f$ uniformly on compact sets, then $f$ is also holomorphic. As a consequence, a function that is uniformly approximable by complex polynomials must be holomorphic on the interior of its domain. Fortunately, Mergelyan's theorem allows us to uniformly approximate functions on compact subsets of the complex plane using polynomials and minimal assumptions, so things are not as bad as they first seem in the case of 1 complex variable.
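One can see this obstruction numerically (a small numpy sketch of mine): on the unit circle, polynomials in $z$ fit a holomorphic target such as $z^2$ essentially exactly, while the best least-squares polynomial fit to the continuous target $\overline{z}$ is the zero polynomial, leaving an error of $1$:

```python
import numpy as np

# Least-squares fits by complex polynomials (in z alone) on the unit circle.
theta = np.linspace(0, 2 * np.pi, 256, endpoint=False)
z = np.exp(1j * theta)
deg = 10
A = np.vander(z, deg + 1, increasing=True)  # columns 1, z, ..., z^deg

# A holomorphic target such as z^2 is fit essentially exactly...
c, *_ = np.linalg.lstsq(A, z ** 2, rcond=None)
err_holo = np.max(np.abs(A @ c - z ** 2))
print(err_holo)  # ~ 0

# ...but conj(z) is orthogonal to every power of z on the circle, so the
# best fit is the zero polynomial and the uniform error is 1.
c, *_ = np.linalg.lstsq(A, np.conj(z), rcond=None)
err_conj = np.max(np.abs(A @ c - np.conj(z)))
print(err_conj)  # ~ 1
```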
Theorem: (Mergelyan) Suppose that $X\subseteq\mathbb{C}$ is compact. Suppose that $\mathbb{C}\setminus X$ has no bounded component. Suppose that $f:X\to\mathbb{C}$ is continuous and holomorphic on the interior of $X$. Then for all $\epsilon>0$, there is some complex polynomial $p$ where $\|f-p\|<\epsilon$.
Real and complex polynomial quantum approximation:
We shall now use polynomials of several real or complex variables to approximate arbitrary continuous functions using a work-around from quantum information theory.
Let $K$ denote either the field of real or complex numbers. Let $V$ be a finite dimensional vector space over the field $K$. The projective space $\mathbb{P}(V)$ is the quotient space $(V\setminus\{0\})/\simeq$ where we set $u\simeq v$ precisely when $u=\lambda v$ for some (necessarily non-zero) scalar $\lambda$. Observe that $\mathbb{P}(V)$ is a manifold over $K$. If $V$ is an inner product space, then we can associate $\mathbb{P}(V)$ with the set of all rank-$1$ trace $1$ positive semidefinite operators from $V$ to $V$; if $u\in V\setminus\{0\}$, then we equate the equivalence class $[u]$ with the operator $\frac{uu^*}{\langle u,u\rangle}$ (here $uu^*$ denotes the operator $v\mapsto\langle v,u\rangle u$). Let $D(V)$ denote the set of all trace $1$ positive semidefinite operators from $V$ to $V$. In quantum information theory, $\mathbb{P}(V)$ is the set of all pure quantum states while $D(V)$ contains all quantum states including all pure and mixed states.
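As a quick numerical illustration (my own numpy sketch), the operator associated with a nonzero vector $u$ is trace-$1$, rank-$1$, Hermitian positive semidefinite, and depends only on the equivalence class $[u]$:

```python
import numpy as np

# Identify a nonzero vector u (really its class [u] in P(V)) with the
# rank-1, trace-1 positive semidefinite operator u u* / <u, u>.
rng = np.random.default_rng(0)
u = rng.normal(size=4) + 1j * rng.normal(size=4)
P = np.outer(u, u.conj()) / np.vdot(u, u)

print(np.trace(P))                 # trace 1
print(np.linalg.matrix_rank(P))    # rank 1
print(np.allclose(P, P.conj().T))  # Hermitian

# Scaling u by any nonzero scalar gives the same operator,
# so P depends only on the equivalence class [u].
v = 3j * u
print(np.allclose(np.outer(v, v.conj()) / np.vdot(v, v), P))
```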
Suppose that $n\geq 1$ and $X\subseteq K^n$ is compact. Let $f:X\to D(V)$ be a continuous function. Our goal is to approximate $f$ uniformly with polynomials followed by normalization and the partial trace operation that takes pure states to mixed states.
Suppose that $V,W$ are finite dimensional inner product spaces over $K$. Then define the partial trace $\operatorname{Tr}_W:L(V\otimes W)\to L(V)$ as the unique linear mapping where $\operatorname{Tr}_W(A\otimes B)=\operatorname{Tr}(B)\,A$ whenever $A\in L(V),B\in L(W)$. If $R\in D(V\otimes W)$, then $\operatorname{Tr}_W(R)\in D(V)$. In quantum information theory, if $R\in D(V\otimes W)$, then $\operatorname{Tr}_W(R)$ is the resulting quantum state that we obtain from $R$ when we lose access to the subsystem $W$.
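For concreteness, the defining property $\operatorname{Tr}_W(A\otimes B)=\operatorname{Tr}(B)\,A$ can be checked numerically (a small numpy sketch; the reshape/einsum implementation is my own, not from the post):

```python
import numpy as np

# Partial trace over W, for operators on V ⊗ W represented via np.kron:
# reshape to (dV, dW, dV, dW) and trace out the two W indices.
def partial_trace_W(M, dV, dW):
    return np.einsum('ijkj->ik', M.reshape(dV, dW, dV, dW))

rng = np.random.default_rng(1)
A = rng.normal(size=(2, 2)) + 1j * rng.normal(size=(2, 2))
B = rng.normal(size=(3, 3)) + 1j * rng.normal(size=(3, 3))

# Defining property: Tr_W(A ⊗ B) = Tr(B) · A.
ok = np.allclose(partial_trace_W(np.kron(A, B), 2, 3), np.trace(B) * A)
print(ok)  # True
```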
Coming up with the proof of the following result was not too hard for me (but it was not completely trivial either); coming up with the statement of the result was more difficult than the proof. The following result is likely an immediate consequence of known facts from quantum information theory and quantum computation, so let me know if you have a reference.
Theorem: (J. Van Name) Suppose that $n\geq 1$ and $X\subseteq K^n$ is compact. Let $f:X\to D(V)$ be a continuous function and let $\epsilon>0$. Then there is an inner product space $W$ over $K$ and a polynomial function $p:K^n\to V\otimes W$ where $\left\|\operatorname{Tr}_W\left(\frac{p(x)p(x)^*}{\langle p(x),p(x)\rangle}\right)-f(x)\right\|<\epsilon$ for each $x\in X$.
Proof outline: By Tietze's extension theorem, we can extend the function $f$ to a continuous function $F:C\to D(V)$ where $C$ is a compact subset of $K^n$ with $X\subseteq C$. Let $E$ be a finite subset of $C$ that forms a sufficiently fine net for $X$.

For each $a\in E$, define a polynomial $r_a:K^n\to K$ so that, for $x\in X$, the quantity $|r_a(x)|$ is maximized over $a\in E$ precisely when the distance from $x$ to $a$ is minimized. If $K=\mathbb{R}$, then one may for instance take $r_a(x)=M-\|x-a\|^2$ for a sufficiently large constant $M$; if $K=\mathbb{C}$, then one defines an analogous polynomial.

Let $N$ be a positive integer. Let $W_0,W_1$ be inner product spaces over $K$ where each $F(a)$ admits a purification in $V\otimes W_0$ and where $W_1$ has orthonormal basis $(e_a)_{a\in E}$. For each $a\in E$, let $u_a\in V\otimes W_0$ be a unit vector with $\operatorname{Tr}_{W_0}(u_au_a^*)=F(a)$. Then set $W=W_0\otimes W_1$ and $p(x)=\sum_{a\in E}r_a(x)^N\,u_a\otimes e_a$. I claim that we can choose the net $E$ and a natural number $N$ so that $\left\|\operatorname{Tr}_W\left(\frac{p(x)p(x)^*}{\langle p(x),p(x)\rangle}\right)-f(x)\right\|<\epsilon$ for each $x\in X$.

Then
$$p(x)p(x)^*=\sum_{a,b\in E}r_a(x)^N\,\overline{r_b(x)^N}\,(u_a\otimes e_a)(u_b\otimes e_b)^*.$$
Therefore, since tracing out $W_1$ annihilates the cross terms where $a\neq b$,
$$\operatorname{Tr}_{W_1}(p(x)p(x)^*)=\sum_{a\in E}|r_a(x)|^{2N}\,u_au_a^*.$$
If $x\in X$ and $r_a(x)\neq 0$ for all $a\in E$ (the case when $r_a(x)=0$ for some $a\in E$ is mostly similar), then
$$\operatorname{Tr}_W\left(\frac{p(x)p(x)^*}{\langle p(x),p(x)\rangle}\right)=\sum_{a\in E}\frac{|r_a(x)|^{2N}}{\sum_{b\in E}|r_b(x)|^{2N}}\,F(a).$$
In particular, $\operatorname{Tr}_W\left(\frac{p(x)p(x)^*}{\langle p(x),p(x)\rangle}\right)$ is a quantum state, namely a convex combination $\sum_{a\in E}\lambda_a F(a)$ for some weights $\lambda_a\geq 0$ with $\sum_{a\in E}\lambda_a=1$, but for sufficiently large $N$, all the terms in this sum are negligible except for those $a\in E$ for which $\|x-a\|$ is minimized. But if $\|x-a\|$ is minimized, then $a$ is close to $x$, so by uniform continuity $F(a)$ is close to $F(x)=f(x)$, provided that the net $E$ is sufficiently fine.
Q.E.D.
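To make the construction in the proof outline concrete, here is a small numerical sketch (my own illustration, with hypothetical choices: one real variable, $V=\mathbb{R}^2$, a pure-state target so no purification is needed, and peak polynomials $r_a(x)=2-(x-a)^2$); the mixing weights are evaluated in log-space purely for numerical stability:

```python
import numpy as np

# Target: the curve of pure states f(x) = |u(x)><u(x)| on V = R^2, to be
# approximated by Tr_W(p(x)p(x)* / <p(x),p(x)>) where
# p(x) = sum_a r_a(x)^N u(a) ⊗ e_a and r_a(x) = 2 - (x - a)^2 peaks at x = a.
def u(x):
    return np.array([np.cos(x), np.sin(x)])

def f(x):
    return np.outer(u(x), u(x))

grid = np.linspace(0.0, 1.0, 50)   # the finite net E
N = 100_000                        # the power applied to each r_a

def approx(x):
    # Mixing weights |r_a(x)|^(2N) / sum_b |r_b(x)|^(2N), computed in
    # log-space since (2 - (x - a)^2)^(2N) overflows floats.
    logw = 2 * N * np.log(2.0 - (x - grid) ** 2)
    w = np.exp(logw - logw.max())
    w /= w.sum()
    # Tracing out W leaves the convex combination sum_a w_a |u(a)><u(a)|.
    return sum(wa * f(a) for wa, a in zip(w, grid))

err = max(np.linalg.norm(approx(x) - f(x)) for x in np.linspace(0, 1, 201))
print(err)  # small: the weights concentrate on the net point nearest to x
```

For large $N$ the weights collapse onto the nearest net point, so the output mixed state tracks the target pure state to within roughly the net spacing.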
The above result suggests that it is easier to approximate continuous functions using $\operatorname{Tr}_W\left(\frac{p(x)p(x)^*}{\langle p(x),p(x)\rangle}\right)$ rather than using the polynomial $p$ directly. Furthermore, since $\operatorname{Tr}_W\left(\frac{p(x)p(x)^*}{\langle p(x),p(x)\rangle}\right)$ is a mixed state for all inputs $x$, this object behaves probabilistically, which is desirable in many machine learning contexts. The expression $\operatorname{Tr}_W\left(\frac{p(x)p(x)^*}{\langle p(x),p(x)\rangle}\right)$, when used in machine learning models, also seems to be useful in avoiding the pathologies that hamper inherent interpretability. For these reasons, I find the use of $\operatorname{Tr}_W\left(\frac{p(x)p(x)^*}{\langle p(x),p(x)\rangle}\right)$ favorable in machine learning.