In this post, we shall define my new dimensionality reduction for tensors in V⊗n where n≥3, and we shall make an empirical observation about the structure of the dimensionality reduction. There are various simple ways of adapting this dimensionality reduction algorithm to tensors in V1⊗⋯⊗Vn and even mixed quantum states (mixed states are just positive semidefinite operators on V1⊗⋯⊗Vn with trace 1), but that will be a topic for another post.
This dimensionality reduction shall represent tensors in V⊗n as tuples of matrices A1,…,Ar. Computer experiments indicate that, in many cases, we have Tr(Ai1⋯Aim) = 0 whenever m ≠ 0 mod n.
If X is a matrix, then the spectral radius of X is the value
ρ(X) = max{|λ| : λ is an eigenvalue of X} = lim_{n→∞} ∥X^n∥^(1/n).
If X is a matrix, then define the conjugate matrix X̄ = (X*)^T = (X^T)*; this is the matrix obtained from X by replacing each entry with its complex conjugate.
If (X1,…,Xr) is a tuple of real or complex matrices, then define the L2-spectral radius by setting
ρ2(X1,…,Xr) = ρ(X1⊗X̄1 + ⋯ + Xr⊗X̄r)^(1/2).
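Here is a minimal numpy sketch of these two quantities (the helper names spectral_radius and l2_spectral_radius are my own):

```python
import numpy as np

def spectral_radius(X):
    # rho(X): largest absolute value of an eigenvalue of X.
    return np.max(np.abs(np.linalg.eigvals(X)))

def l2_spectral_radius(mats):
    # rho_2(X1,...,Xr) = rho(X1 (x) conj(X1) + ... + Xr (x) conj(Xr))^(1/2).
    M = sum(np.kron(X, X.conj()) for X in mats)
    return spectral_radius(M) ** 0.5

# Example with three random complex 4x4 matrices.
rng = np.random.default_rng(0)
Xs = [rng.normal(size=(4, 4)) + 1j * rng.normal(size=(4, 4)) for _ in range(3)]
print(spectral_radius(Xs[0]), l2_spectral_radius(Xs))
```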
Suppose that K is either the field of real numbers or the field of complex numbers. Suppose that p(x1,…,xr) is a homogeneous non-commutative polynomial of degree n with coefficients in K (it is easier to define the dimensionality reduction in terms of homogeneous non-commutative polynomials than in terms of tensors).
Then define a fitness function Mp : Md(K)^r → [0,∞) by setting
Mp(A1,…,Ar) = ρ(p(A1,…,Ar))^(1/n) / ρ2(A1,…,Ar).
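To make the fitness function concrete, here is a sketch that evaluates Mp for an example homogeneous cubic polynomial; the encoding of p as a dictionary from words to coefficients is just my own illustration and not a claim about how the experiments were implemented:

```python
import numpy as np
from itertools import product

def spectral_radius(X):
    return np.max(np.abs(np.linalg.eigvals(X)))

def eval_poly(coeffs, mats):
    # Evaluate a homogeneous non-commutative polynomial at the matrices mats.
    # coeffs maps words (tuples of variable indices) to scalar coefficients.
    d = mats[0].shape[0]
    total = np.zeros((d, d), dtype=complex)
    for word, c in coeffs.items():
        term = np.eye(d, dtype=complex)
        for i in word:
            term = term @ mats[i]
        total += c * term
    return total

def fitness(coeffs, mats, n):
    # M_p(A1,...,Ar) = rho(p(A1,...,Ar))^(1/n) / rho_2(A1,...,Ar).
    rho2 = spectral_radius(sum(np.kron(A, A.conj()) for A in mats)) ** 0.5
    return spectral_radius(eval_poly(coeffs, mats)) ** (1.0 / n) / rho2

# Example: a random homogeneous cubic polynomial in r = 2 variables (all 8 words of length 3),
# evaluated at random 4x4 complex matrices.
rng = np.random.default_rng(1)
r, n, d = 2, 3, 4
coeffs = {w: rng.normal() + 1j * rng.normal() for w in product(range(r), repeat=n)}
As = [rng.normal(size=(d, d)) + 1j * rng.normal(size=(d, d)) for _ in range(r)]
print(fitness(coeffs, As, n))
```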
This function Mp is bounded and attains a maximum value, but proving that the maximum is attained requires the machinery of quantum channels.
We shall call a tuple (A1,…,Ar) at which Mp(A1,…,Ar) is maximized an
L2,d-spectral radius dimensionality reduction (LSRDR) of the non-commutative polynomial p(x1,…,xr). The motivation behind the notion of an LSRDR is that it is easier to work with the variables x1,…,xr represented as the matrices A1,…,Ar than it is to work with the non-commutative polynomial p(x1,…,xr) itself. The d×d matrices A1,…,Ar have d²·r parameters, while the non-commutative polynomial could have up to r^n parameters where n is the degree of p(x1,…,xr).
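Here is a rough sketch of how one might search for an LSRDR numerically by gradient ascent on log Mp using PyTorch. This is only an illustration under my own parametrization (real and imaginary parts stored separately); it is not necessarily the training procedure used for the experiments in this post, and the spectral radius is only differentiable where the dominant eigenvalue is simple.

```python
import torch
from itertools import product

def rho(X):
    # Spectral radius; differentiable a.e. when the dominant eigenvalue is simple.
    return torch.abs(torch.linalg.eigvals(X)).max()

def eval_poly(coeffs, mats):
    d = mats[0].shape[0]
    total = torch.zeros((d, d), dtype=torch.cfloat)
    for word, c in coeffs.items():
        term = torch.eye(d, dtype=torch.cfloat)
        for i in word:
            term = term @ mats[i]
        total = total + c * term
    return total

def log_fitness(coeffs, mats, n):
    # log M_p = (1/n) log rho(p(A1,...,Ar)) - (1/2) log rho(sum_i A_i (x) conj(A_i)).
    M = sum(torch.kron(A, A.conj()) for A in mats)
    return torch.log(rho(eval_poly(coeffs, mats))) / n - torch.log(rho(M)) / 2

# Hypothetical example: random cubic polynomial in r = 2 variables, reduced to d = 3.
torch.manual_seed(0)
r, n, d = 2, 3, 3
coeffs = {w: complex(torch.randn(()).item(), torch.randn(()).item())
          for w in product(range(r), repeat=n)}
params = [torch.randn(2, d, d, requires_grad=True) for _ in range(r)]  # real and imaginary parts
opt = torch.optim.Adam(params, lr=1e-2)
for step in range(1000):
    mats = [torch.complex(P[0], P[1]) for P in params]
    loss = -log_fitness(coeffs, mats, n)  # gradient ascent on log M_p
    opt.zero_grad()
    loss.backward()
    opt.step()
print("approximate maximum of M_p:", float(torch.exp(-loss)))
```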
We observe that if p is a quadratic non-commutative homogeneous polynomial, then max Mp(A1,…,Ar) = ∥p∥2^(1/2), where ∥⋅∥2 refers to the Frobenius norm. In other words, we already have a well-developed theory of matrices, and LSRDRs do not improve the theory of matrices, but LSRDRs do help us analyze tensors of order at least 3 in several different ways.
Given square matrices A1,…,Ar ∈ Md(K), define a completely positive superoperator Φ(A1,…,Ar) : Md(K) → Md(K) by setting Φ(A1,…,Ar)(X) = A1XA1* + ⋯ + ArXAr*. The operator Φ(A1,…,Ar) is similar to the matrix A1⊗Ā1 + ⋯ + Ar⊗Ār.
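This correspondence is easy to check numerically: with numpy's row-major flattening of matrices into vectors, Φ(A1,…,Ar) is represented exactly by the matrix A1⊗Ā1 + ⋯ + Ar⊗Ār (and in particular has the same spectrum). Here is a quick check:

```python
import numpy as np

def phi(mats, X):
    # Phi(A1,...,Ar)(X) = A1 X A1* + ... + Ar X Ar*.
    return sum(A @ X @ A.conj().T for A in mats)

rng = np.random.default_rng(2)
d, r = 3, 2
As = [rng.normal(size=(d, d)) + 1j * rng.normal(size=(d, d)) for _ in range(r)]
X = rng.normal(size=(d, d)) + 1j * rng.normal(size=(d, d))

# Matrix of Phi with respect to row-major vectorization.
M = sum(np.kron(A, A.conj()) for A in As)
print(np.allclose(M @ X.reshape(-1), phi(As, X).reshape(-1)))  # True
```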
Observation: Suppose that p is a non-commutative homogeneous polynomial of degree n with random complex coefficients. Let (A1,…,Ar) be an L2,d-spectral radius dimensionality reduction of p. Then we often have Tr(q(A1,…,Ar)) = 0 whenever q is a homogeneous non-commutative polynomial of degree m where m mod n ≠ 0. Furthermore, the set of eigenvalues of Φ(A1,…,Ar) is invariant under rotation by the angle 2π/n. Said differently, Tr(Φ(A1,…,Ar)^m) = 0 whenever m mod n ≠ 0.
I currently do not have an adequately developed explanation for why Tr(q(A1,…,Ar)) = 0 and Tr(Φ(A1,…,Ar)^m) = 0 so often (more experimentation is needed), but such an explanation is probably within reach. The vanishing traces do not occur 100 percent of the time; they only show up when the conditions are right.
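One way to test the observation numerically is to compute the largest |Tr(Ai1⋯Aim)| over all words of a given length m; the helper below is a hypothetical sketch, where mats would be the tuple of matrices obtained from a trained LSRDR (for instance from the gradient-ascent sketch above):

```python
import numpy as np
from itertools import product
from functools import reduce

def max_word_trace(mats, m):
    # Largest |Tr(A_{i1} ... A_{im})| over all words i1,...,im of length m.
    return max(abs(np.trace(reduce(np.matmul, [mats[i] for i in w])))
               for w in product(range(len(mats)), repeat=m))

# Hypothetical usage with matrices from an LSRDR of a degree-3 polynomial:
# for m in range(1, 7):
#     print(m, max_word_trace(mats, m))  # expected to be near zero unless m is a multiple of 3
```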
If A1,…,Ar∈Md(K), then
Tr(Φ(A1,…,Ar)) = |Tr(A1)|² + ⋯ + |Tr(Ar)|². Therefore, Tr(Φ(A1,…,Ar)) = 0 precisely when Tr(Aj) = 0 for 1 ≤ j ≤ r. Furthermore,
Tr(Φ(A1,…,Ar)^m) = Σ_{i1,…,im} |Tr(Ai1⋯Aim)|², so Tr(Φ(A1,…,Ar)^m) = 0 precisely when Tr(Ai1⋯Aim) = 0 whenever i1,…,im ∈ {1,…,r}.
More generally, Tr(Φ(A1,1,…,A1,r1)⋯Φ(As,1,…,As,rs)) = Σ_{i1∈{1,…,r1},…,is∈{1,…,rs}} |Tr(A1,i1⋯As,is)|². Therefore, Tr(Φ(A1,1,…,A1,r1)⋯Φ(As,1,…,As,rs)) = 0 precisely when Tr(A1,i1⋯As,is) = 0 whenever i1 ∈ {1,…,r1},…,is ∈ {1,…,rs}.
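These identities are easy to verify numerically; here is a sketch that checks the second one for random complex matrices:

```python
import numpy as np
from itertools import product
from functools import reduce

rng = np.random.default_rng(3)
d, r, m = 3, 2, 4
As = [rng.normal(size=(d, d)) + 1j * rng.normal(size=(d, d)) for _ in range(r)]

# Left-hand side: Tr(Phi^m), computed from the matrix representation of Phi.
M = sum(np.kron(A, A.conj()) for A in As)
lhs = np.trace(np.linalg.matrix_power(M, m))

# Right-hand side: sum over all words of length m of |Tr(A_{i1} ... A_{im})|^2.
rhs = sum(abs(np.trace(reduce(np.matmul, [As[i] for i in w]))) ** 2
          for w in product(range(r), repeat=m))
print(np.allclose(lhs, rhs))  # True
```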
Remark:
LSRDRs of tensors are well-behaved in other ways besides the trace-zero behavior described above. For example, if we train two LSRDRs (A1,…,Ar), (B1,…,Br) of the same tensor starting from different random initializations, then we typically have Mp(A1,…,Ar) = Mp(B1,…,Br) (but this does not happen 100 percent of the time either). After training, the resulting LSRDR therefore does not retain any random information left over from the initialization or the training, and any random information present in an LSRDR was originally in the tensor itself.
Remark:
We have some room to modify our fitness function while still retaining the properties of LSRDRs of tensors. For example, suppose that p is a homogeneous non-commutative polynomial of degree n, and define Mp,s : Md(K)^r → [0,∞) by setting
Mp,s(A1,…,Ar) = ρ(p(A1,…,Ar))^(1/n) / ∥A1A1* + ⋯ + ArAr*∥s^(1/2). Here 1 < s ≤ ∞ and ∥⋅∥s denotes the Schatten norm, ∥X∥s = (Tr((XX*)^(s/2)))^(1/s), which is the ℓs norm of the singular values of X. If p is a random homogeneous non-commutative complex polynomial and (A1,…,Ar) maximizes Mp,s(A1,…,Ar), then (if everything works out right) we would still have Tr(Ai1⋯Aim) = 0 whenever m ≠ 0 mod n.
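For completeness, here is a small sketch of this modified fitness function; schatten_norm and fitness_s are my own illustration names, and pA stands for the already-evaluated matrix p(A1,…,Ar) (for instance computed with eval_poly from the earlier sketch):

```python
import numpy as np

def schatten_norm(X, s):
    # Schatten s-norm: the l^s norm of the singular values of X (s = np.inf gives the operator norm).
    sv = np.linalg.svd(X, compute_uv=False)
    return np.max(sv) if np.isinf(s) else np.sum(sv ** s) ** (1.0 / s)

def fitness_s(pA, mats, n, s):
    # M_{p,s}(A1,...,Ar) = rho(p(A1,...,Ar))^(1/n) / ||A1 A1* + ... + Ar Ar*||_s^(1/2).
    rho = np.max(np.abs(np.linalg.eigvals(pA)))
    den = schatten_norm(sum(A @ A.conj().T for A in mats), s) ** 0.5
    return rho ** (1.0 / n) / den
```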
Conclusion:
Since LSRDRs of tensors do not leave behind any random information that is not already present in the tensors themselves, we should expect LSRDRs to be much more interpretable than machine learning systems like neural networks, which do retain a great deal of random information left over from the initialization. Since LSRDRs of tensors give us so many trace-zero operators, one should consider LSRDRs of tensors to be very well-behaved systems, and well-behaved systems should be much more interpretable than poorly behaved systems.
I look forward to using LSRDRs of tensors to interpret machine learning models and to produce new, highly interpretable machine learning models. I do not see LSRDRs of tensors replacing deep learning, but LSRDRs have properties that are hard to reproduce using deep learning, so I look forward to exploring the possibilities with LSRDRs of tensors. I will make more posts about LSRDRs of tensors and about other objects produced with similar objective functions.
Edits: (10/12/2023) I originally claimed that my dimensionality reduction does not work well for tensors in V1⊗⋯⊗Vn, but after re-running the experiments, I was able to reduce random tensors in V1⊗⋯⊗Vn to matrices, and this dimensionality reduction performed well.