Machine learning algorithms such as neural networks are supposed to have some sort of universal uniform approximation theorem that shows that they can (at least in principle) learn any possible data set without simply overfitting to the training data.
The standard universal approximation theorem applies to shallow neural networks with arbitrary continuous non-polynomial activation functions. There are also plenty of polynomial approximation results in mathematics, so one should at least in principle be able to train a real multivariate polynomial to model an arbitrary continuous function arbitrarily well using a multi-layered machine learning model. On the other hand, anyone modestly familiar with complex analysis knows that not every continuous function from a compact subset of $\mathbb{C}$ to $\mathbb{C}$ can be approximated by polynomials.
In this post, we shall eventually produce a work-around that allows us to approximate arbitrary continuous functions using complex polynomials.
Why use polynomials instead of deep neural networks?
I personally have many issues with neural networks. To me, neural networks are clumsy to study mathematically. For example, a deep neural network with tanh activation is far more complicated than the function $\tanh$ itself, but even the function $\tanh$ is awkward to work with. For example, $\tanh$ can be written in terms of the exponential function since $\tanh(z)=\frac{e^{z}-e^{-z}}{e^{z}+e^{-z}}$, but this means that a deep neural network with tanh activation is written in terms of an iterated exponential. I personally would rather not use iterated exponentials. To make things worse, $\tanh$ extends to a meromorphic function on $\mathbb{C}$ with poles at the points $\pi i(k+1/2)$ for $k\in\mathbb{Z}$, so a deep neural network with tanh activation has a complicated singularity set. Since $\tanh(iz)=i\tan(z)$, the unbiased neural networks with tanh activation essentially become neural networks with tan activation when fed purely imaginary inputs, and neural networks with tangent activations are untrainable and pathological. Neural networks with ReLU activations do not have this pathology that we see with tanh activation, but they have their own pathologies.
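As a quick sanity check (a minimal numpy sketch of mine, not from the original discussion), one can verify numerically that $\tanh$ blows up near its pole at $i\pi/2$ and turns into the tangent function on the imaginary axis:

```python
import numpy as np

# tanh extends to a meromorphic function with poles at i*pi*(k + 1/2);
# approaching the pole at i*pi/2, |tanh| blows up.
z = 1j * (np.pi / 2 - 1e-3)
pole_mag = abs(np.tanh(z))
print(pole_mag)  # roughly 1000

# On the imaginary axis, tanh turns into tan: tanh(i*y) = i*tan(y).
y = 0.7
print(np.tanh(1j * y), 1j * np.tan(y))  # the two values agree
```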
It seems like the pathological attributes and behavior of neural networks are part of the reason they are so difficult to interpret and study mathematically, so an inherently interpretable alternative to neural networks may help us solve the problems related to AI interpretability and safety. Of course, inherently interpretable AI should ideally match or at least complement the performance of deep neural networks since we need inherently interpretable AI to be relevant to AI safety.
My computer experiments indicate that machine learning models that compute real or complex polynomials do not always contain the pathologies that we see with neural networks. These alternative machine learning models computing polynomial functions may still have multiple layers so they are sophisticated, but they are not as sophisticated as deep neural networks. Keep in mind that people have devoted a lot of resources to training deep neural networks, so we need to take this into consideration when evaluating the performance of alternative machine learning models. In this post, we shall state polynomial uniform approximation theorems in order to help justify the idea that polynomials may have capabilities comparable to neural networks.
Real polynomial approximation:
Real polynomials have no problem uniformly approximating arbitrary continuous functions on compact sets.
If $X,Y$ are topological spaces, then let $C(X,Y)$ denote the set of all continuous functions from $X$ to $Y$.
If $X$ is a compact Hausdorff space, then $C(X,\mathbb{R})$ (and likewise $C(X,\mathbb{C})$) can be endowed with a norm $\|\cdot\|$ defined by $\|f\|=\max_{x\in X}|f(x)|$, and this norm generates a topology on $C(X,\mathbb{R})$ known as the topology of uniform convergence.
Let $K$ be a topological field. We say that $A$ is a subalgebra of $C(X,K)$ if $f,g\in A$ and $\lambda\in K$ implies that $f+g$, $f\cdot g$, $\lambda\cdot f\in A$ as well.
Theorem: (Stone-Weierstrass approximation theorem) Suppose that $X$ is a compact Hausdorff space and $A$ is a subalgebra of $C(X,\mathbb{R})$ that contains all constant functions and where if $x,y\in X$ and $x\neq y$, then there is some $f\in A$ with $f(x)\neq f(y)$. Then $A$ is dense in $C(X,\mathbb{R})$.
As a consequence, polynomials can approximate arbitrary continuous functions uniformly on compact sets.
Corollary: Suppose that $n\geq 1$, $X\subseteq\mathbb{R}^n$ is compact, and $f:X\to\mathbb{R}$ is continuous. Then for each $\epsilon>0$, there is a polynomial $p$ in $n$ variables such that $\|f-p\|<\epsilon$.
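As a numerical illustration of this corollary (a minimal numpy sketch of mine; the target $|x|$ and the degrees are arbitrary choices), least-squares fits in the well-conditioned Chebyshev basis drive the sup-norm error down as the degree grows:

```python
import numpy as np

# Uniformly approximate the continuous (non-smooth) function |x| on [-1, 1]
# by polynomials of increasing degree, fit by least squares in the
# Chebyshev basis.
x = np.linspace(-1.0, 1.0, 2001)
f = np.abs(x)

errs = {}
for deg in (5, 20, 80):
    p = np.polynomial.Chebyshev.fit(x, f, deg)
    errs[deg] = np.max(np.abs(f - p(x)))
    print(deg, errs[deg])  # the sup-norm error shrinks as the degree grows
```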
Complex polynomial approximation:
Unlike the case with the real numbers, complex polynomials cannot in general approximate continuous functions uniformly on compact sets.
We say that a subalgebra $A$ of $C(X,\mathbb{C})$ is a $*$-subalgebra if $f\in A$ implies that $\overline{f}\in A$.
Theorem: (Complex Stone-Weierstrass approximation theorem) Suppose that $X$ is a compact Hausdorff space and $A$ is a $*$-subalgebra of $C(X,\mathbb{C})$ that contains all constant functions and where if $x,y\in X$ and $x\neq y$, then there is some $f\in A$ with $f(x)\neq f(y)$. Then $A$ is dense in $C(X,\mathbb{C})$.
If the algebra $A$ is not closed under complex conjugation, then $A$ generally cannot approximate arbitrary complex-valued continuous functions.
Suppose that $U$ is an open subset of $\mathbb{C}$ and $f_n:U\to\mathbb{C}$ is holomorphic for each $n$; if $f:U\to\mathbb{C}$ and $f_n\to f$ uniformly on compact sets, then $f$ is also holomorphic. As a consequence, a function that is uniformly approximable by complex polynomials must be holomorphic on the interior of its domain. Fortunately, Mergelyan's theorem allows us to uniformly approximate functions on compact subsets of the complex plane using polynomials and minimal assumptions, so things are not as bad as they first seem in the case of 1 complex variable.
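One can see this obstruction numerically (a small numpy sketch of mine): on the unit circle, polynomials in $z$ fit a holomorphic target such as $z^2$ essentially exactly, while the best least-squares polynomial fit to the continuous target $\overline{z}$ is the zero polynomial, leaving an error of $1$:

```python
import numpy as np

# Least-squares fits by complex polynomials (in z alone) on the unit circle.
theta = np.linspace(0, 2 * np.pi, 256, endpoint=False)
z = np.exp(1j * theta)
deg = 10
A = np.vander(z, deg + 1, increasing=True)  # columns 1, z, ..., z^deg

# A holomorphic target such as z^2 is fit essentially exactly...
c, *_ = np.linalg.lstsq(A, z ** 2, rcond=None)
err_holo = np.max(np.abs(A @ c - z ** 2))
print(err_holo)  # ~ 0

# ...but conj(z) is orthogonal to every power of z on the circle, so the
# best fit is the zero polynomial and the uniform error is 1.
c, *_ = np.linalg.lstsq(A, np.conj(z), rcond=None)
err_conj = np.max(np.abs(A @ c - np.conj(z)))
print(err_conj)  # ~ 1
```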
Theorem: (Mergelyan) Suppose that $X\subseteq\mathbb{C}$ is compact. Suppose that $\mathbb{C}\setminus X$ has no bounded component. Suppose that $f:X\to\mathbb{C}$ is continuous and holomorphic on the interior of $X$. Then for all $\epsilon>0$, there is some complex polynomial $p$ where $\|f-p\|<\epsilon$.
Real and complex polynomial quantum approximation:
We shall now use polynomials of several real or complex variables to approximate arbitrary continuous functions using a work-around from quantum information theory.
Let $K$ denote either the field of real or complex numbers. Let $V$ be a finite dimensional vector space over the field $K$. The projective space $\mathbb{P}(V)$ is the quotient space $(V\setminus\{0\})/\simeq$ where we set $u\simeq v$ precisely when $u=\lambda v$ for some (necessarily non-zero) scalar $\lambda$. Observe that $\mathbb{P}(V)$ is a manifold over $K$. If $V$ is an inner product space, then we can associate $\mathbb{P}(V)$ with the set of all rank-$1$ trace $1$ positive semidefinite operators from $V$ to $V$; if $u\in V\setminus\{0\}$, then we equate the equivalence class $[u]$ with the operator $\frac{uu^*}{\langle u,u\rangle}$ (here $uu^*$ denotes the operator $v\mapsto\langle v,u\rangle u$). Let $D(V)$ denote the set of all trace $1$ positive semidefinite operators from $V$ to $V$. In quantum information theory, $\mathbb{P}(V)$ is the set of all pure quantum states while $D(V)$ contains all quantum states including all pure and mixed states.
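As a quick numerical illustration (my own numpy sketch), the operator associated with a nonzero vector $u$ is trace-$1$, rank-$1$, Hermitian positive semidefinite, and depends only on the equivalence class $[u]$:

```python
import numpy as np

# Identify a nonzero vector u (really its class [u] in P(V)) with the
# rank-1, trace-1 positive semidefinite operator u u* / <u, u>.
rng = np.random.default_rng(0)
u = rng.normal(size=4) + 1j * rng.normal(size=4)
P = np.outer(u, u.conj()) / np.vdot(u, u)

print(np.trace(P))                 # trace 1
print(np.linalg.matrix_rank(P))    # rank 1
print(np.allclose(P, P.conj().T))  # Hermitian

# Scaling u by any nonzero scalar gives the same operator,
# so P depends only on the equivalence class [u].
v = 3j * u
print(np.allclose(np.outer(v, v.conj()) / np.vdot(v, v), P))
```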
Suppose that $n\geq 1$ and $X\subseteq K^n$ is compact. Let $f:X\to D(V)$ be a continuous function. Our goal is to approximate $f$ uniformly with polynomials followed by normalization and the partial trace operation that takes pure states to mixed states.
Suppose that $V,W$ are finite dimensional inner product spaces over $K$. Then define the partial trace $\operatorname{Tr}_W:L(V\otimes W)\to L(V)$ as the unique linear mapping where $\operatorname{Tr}_W(A\otimes B)=\operatorname{Tr}(B)\,A$ whenever $A\in L(V),B\in L(W)$. If $R\in D(V\otimes W)$, then $\operatorname{Tr}_W(R)\in D(V)$. In quantum information theory, if $R\in D(V\otimes W)$, then $\operatorname{Tr}_W(R)$ is the resulting quantum state that we obtain from $R$ when we lose access to the subsystem $W$.
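For concreteness, the defining property $\operatorname{Tr}_W(A\otimes B)=\operatorname{Tr}(B)\,A$ can be checked numerically (a small numpy sketch; the reshape/einsum implementation is my own, not from the post):

```python
import numpy as np

# Partial trace over W, for operators on V ⊗ W represented via np.kron:
# reshape to (dV, dW, dV, dW) and trace out the two W indices.
def partial_trace_W(M, dV, dW):
    return np.einsum('ijkj->ik', M.reshape(dV, dW, dV, dW))

rng = np.random.default_rng(1)
A = rng.normal(size=(2, 2)) + 1j * rng.normal(size=(2, 2))
B = rng.normal(size=(3, 3)) + 1j * rng.normal(size=(3, 3))

# Defining property: Tr_W(A ⊗ B) = Tr(B) · A.
ok = np.allclose(partial_trace_W(np.kron(A, B), 2, 3), np.trace(B) * A)
print(ok)  # True
```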
Coming up with the proof of the following result was not too hard for me (but it was not completely trivial either); coming up with the statement of the result was more difficult than the proof. The following result is likely an immediate consequence of known facts from quantum information theory and quantum computation, so let me know if you have a reference.
Theorem: (J. Van Name) Suppose that $n\geq 1$ and $X\subseteq K^n$ is compact. Let $f:X\to D(V)$ be a continuous function and let $\epsilon>0$. Then there is an inner product space $W$ over $K$ and a polynomial function $p:K^n\to V\otimes W$ where $\left\|\operatorname{Tr}_W\left(\frac{p(x)p(x)^*}{\langle p(x),p(x)\rangle}\right)-f(x)\right\|<\epsilon$ for each $x\in X$.
Proof outline: By Tietze's extension theorem, we can extend the function $f$ to a continuous function $F:C\to D(V)$ where $C$ is a compact subset of $K^n$ with $X\subseteq C$. Let $E$ be a finite subset of $C$ that forms a sufficiently fine net for $X$.

For each $a\in E$, define a polynomial $r_a:K^n\to K$ so that, for $x\in X$, the quantity $|r_a(x)|$ is maximized over $a\in E$ precisely when the distance from $x$ to $a$ is minimized. If $K=\mathbb{R}$, then one may for instance take $r_a(x)=M-\|x-a\|^2$ for a sufficiently large constant $M$; if $K=\mathbb{C}$, then one defines an analogous polynomial.

Let $N$ be a positive integer. Let $W_0,W_1$ be inner product spaces over $K$ where each $F(a)$ admits a purification in $V\otimes W_0$ and where $W_1$ has orthonormal basis $(e_a)_{a\in E}$. For each $a\in E$, let $u_a\in V\otimes W_0$ be a unit vector with $\operatorname{Tr}_{W_0}(u_au_a^*)=F(a)$. Then set $W=W_0\otimes W_1$ and $p(x)=\sum_{a\in E}r_a(x)^N\,u_a\otimes e_a$. I claim that we can choose the net $E$ and a natural number $N$ so that $\left\|\operatorname{Tr}_W\left(\frac{p(x)p(x)^*}{\langle p(x),p(x)\rangle}\right)-f(x)\right\|<\epsilon$ for each $x\in X$.

Then
$$p(x)p(x)^*=\sum_{a,b\in E}r_a(x)^N\,\overline{r_b(x)^N}\,(u_a\otimes e_a)(u_b\otimes e_b)^*.$$
Therefore, since tracing out $W_1$ annihilates the cross terms where $a\neq b$,
$$\operatorname{Tr}_{W_1}(p(x)p(x)^*)=\sum_{a\in E}|r_a(x)|^{2N}\,u_au_a^*.$$
If $x\in X$ and $r_a(x)\neq 0$ for all $a\in E$ (the case when $r_a(x)=0$ for some $a\in E$ is mostly similar), then
$$\operatorname{Tr}_W\left(\frac{p(x)p(x)^*}{\langle p(x),p(x)\rangle}\right)=\sum_{a\in E}\frac{|r_a(x)|^{2N}}{\sum_{b\in E}|r_b(x)|^{2N}}\,F(a).$$
In particular, $\operatorname{Tr}_W\left(\frac{p(x)p(x)^*}{\langle p(x),p(x)\rangle}\right)$ is a quantum state, namely a convex combination $\sum_{a\in E}\lambda_a F(a)$ for some weights $\lambda_a\geq 0$ with $\sum_{a\in E}\lambda_a=1$, but for sufficiently large $N$, all the terms in this sum are negligible except for those $a\in E$ for which $\|x-a\|$ is minimized. But if $\|x-a\|$ is minimized, then $a$ is close to $x$, so by uniform continuity $F(a)$ is close to $F(x)=f(x)$, provided that the net $E$ is sufficiently fine.
Q.E.D.
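To make the construction in the proof outline concrete, here is a small numerical sketch (my own illustration, with hypothetical choices: one real variable, $V=\mathbb{R}^2$, a pure-state target so no purification is needed, and peak polynomials $r_a(x)=2-(x-a)^2$); the mixing weights are evaluated in log-space purely for numerical stability:

```python
import numpy as np

# Target: the curve of pure states f(x) = |u(x)><u(x)| on V = R^2, to be
# approximated by Tr_W(p(x)p(x)* / <p(x),p(x)>) where
# p(x) = sum_a r_a(x)^N u(a) ⊗ e_a and r_a(x) = 2 - (x - a)^2 peaks at x = a.
def u(x):
    return np.array([np.cos(x), np.sin(x)])

def f(x):
    return np.outer(u(x), u(x))

grid = np.linspace(0.0, 1.0, 50)   # the finite net E
N = 100_000                        # the power applied to each r_a

def approx(x):
    # Mixing weights |r_a(x)|^(2N) / sum_b |r_b(x)|^(2N), computed in
    # log-space since (2 - (x - a)^2)^(2N) overflows floats.
    logw = 2 * N * np.log(2.0 - (x - grid) ** 2)
    w = np.exp(logw - logw.max())
    w /= w.sum()
    # Tracing out W leaves the convex combination sum_a w_a |u(a)><u(a)|.
    return sum(wa * f(a) for wa, a in zip(w, grid))

err = max(np.linalg.norm(approx(x) - f(x)) for x in np.linspace(0, 1, 201))
print(err)  # small: the weights concentrate on the net point nearest to x
```

For large $N$ the weights collapse onto the nearest net point, so the output mixed state tracks the target pure state to within roughly the net spacing.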
The above result suggests that it is easier to approximate continuous functions using $\operatorname{Tr}_W\left(\frac{p(x)p(x)^*}{\langle p(x),p(x)\rangle}\right)$ rather than using the polynomial $p$ directly. Furthermore, since $\operatorname{Tr}_W\left(\frac{p(x)p(x)^*}{\langle p(x),p(x)\rangle}\right)$ is a mixed state for all inputs $x$, this object behaves probabilistically, which is desirable in many machine learning contexts. The expression $\operatorname{Tr}_W\left(\frac{p(x)p(x)^*}{\langle p(x),p(x)\rangle}\right)$, when used in machine learning models, also seems to be useful in avoiding the pathologies that hamper inherent interpretability. For these reasons, I find the use of $\operatorname{Tr}_W\left(\frac{p(x)p(x)^*}{\langle p(x),p(x)\rangle}\right)$ favorable in machine learning.