Joseph Van Name's Shortform

Joseph Van Name


For machine learning, it is desirable for the trained model to have absolutely no random information left over from the initialization; in this short post, I will mathematically prove an interesting (to me) but simple consequence of this desirable behavior.

This post is a result of some research that I am doing for machine learning algorithms related to my investigation of cryptographic functions for the cryptocurrency that I launched (to discuss crypto, leave me a personal message so we can discuss this off this site).

This post shall be about linear machine learning models. Actually, we are using quantum operators, so they are more sophisticated than your logistic regression models, but they are still linear so it is really easy to train a neural network that can solve more sophisticated problems than these linear models can. But the kinds of results that you find in this post can also extend to some non-linear models with multiple layers and stronger capabilities. It is just easier to understand what is going on with the linear models, and even with the linear models, we still obtain some interesting mathematics.

We say that a machine learning model trained by gradient ascent/descent is pseudodeterministically trained (or just pseudodeterministic for short) if the fitness/loss function has precisely one local optimum. As a result, the trained model will have absolutely no information left over from the initialization. As another consequence, the trained model will attain the global optimum rather than a suboptimal local optimum. The results in this post will actually hold whenever the global optimum is unique. But I need to bring up pseudodeterminism since pseudodeterminism implies that we can actually find the unique global optimum instead of always getting stuck at a suboptimal local optimum.

If a machine learning model globally optimizes an objective function, the model should be considered an inherently interpretable model rather than a high performance model, since the model contains no random information independent of the objective function itself and since one can only find the global optima of sufficiently easy objective functions. The global optimum is also more interpretable because it inherits the symmetry of the objective function, which depends on the training data. In this post, we shall show that if the training data has some symmetry, then the quantum operator that we train will also have that symmetry.

This post is mathematical and contains mathematical proofs. Fortunately, the proofs are not difficult, so they should be easy for readers to follow. The main thrust of this post is that these mathematical proofs are backed up by experimental results. The main bottleneck to understanding this post is therefore getting through all the technical definitions. I might follow up this short post with a more general post, so you should read this one before going through the more general post.

Let $U$ be a finite dimensional complex inner product space. If $B \subseteq U^{2}$ , then define sets $B^{*}, B^{⊤}, \overline{B}$ by setting

$B^{*} = \{(y, x) : (x, y) \in B\}, \overline{B} = \{(\overline{x}, \overline{y}) : (x, y) \in B\}$

$B^{⊤} = \{(\overline{y}, \overline{x}) : (x, y) \in B\}$ .

Let $μ$ be a probability measure on $U^{2}$ . Here, $μ$ is the probability distribution for the training data. Define new measures $μ^{*}, \overline{μ}, μ^{⊤}$ by setting

$μ^{*} (B) = μ (B^{*}), \overline{μ} (B) = μ (\overline{B}), μ^{⊤} (B) = μ (B^{⊤})$ .

Let $L (U)$ denote the collection of linear operators from $U$ to $U$ . If $A_{1}, \dots, A_{r} \in L (U)$ , then define an operator $Φ (A_{1}, \dots, A_{r}) : L (U) \to L (U)$ by setting $Φ (A_{1}, \dots, A_{r}) (X) = A_{1} X A_{1}^{*} + \dots + A_{r} X A_{r}^{*}$ . The operators of the form $Φ (A_{1}, \dots, A_{r})$ are the completely positive superoperators of Choi rank at most $r$ . Recall that $L (U)$ is an inner product space with the Frobenius inner product. It is easy to show that the Hermitian adjoint $Φ (A_{1}, \dots, A_{r})^{*}$ is just $Φ (A_{1}^{*}, \dots, A_{r}^{*})$ . If $E$ is a completely positive superoperator, then define $\overline{E}$ by setting

$\overline{E} (X) = (E (X^{* ⊤}))^{* ⊤}$ . Define $E^{⊤} = \overline{E}^{*} = \overline{E^{*}}$ . Then it is easy to show that $\overline{Φ (A_{1}, \dots, A_{r})} = Φ (\overline{A_{1}}, \dots, \overline{A_{r}})$ and

$Φ (A_{1}, \dots, A_{r})^{⊤} = Φ (A_{1}^{⊤}, \dots, A_{r}^{⊤})$ .
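The adjoint and conjugate identities above can be sanity-checked numerically. The following is a minimal NumPy sketch under stated assumptions: the helper names `rand_c`, `phi`, and `frob` are illustrative, the test matrices are random, and the Frobenius inner product is taken to be $⟨ X, Y ⟩ = tr (X Y^{*})$ .

```python
import numpy as np

rng = np.random.default_rng(0)

def rand_c(*shape):
    """Random complex array with Gaussian real and imaginary parts (illustrative data)."""
    return rng.standard_normal(shape) + 1j * rng.standard_normal(shape)

def phi(As, X):
    """Apply Phi(A_1, ..., A_r) to X, i.e. compute sum_j A_j X A_j^*."""
    return sum(A @ X @ A.conj().T for A in As)

def frob(X, Y):
    """Frobenius inner product, assumed here to be <X, Y> = tr(X Y^*)."""
    return np.trace(X @ Y.conj().T)

d, r = 3, 2
As = rand_c(r, d, d)
X, Y = rand_c(d, d), rand_c(d, d)

# Adjoint identity: <Phi(A)(X), Y> = <X, Phi(A^*)(Y)>.
As_star = [A.conj().T for A in As]
adjoint_gap = abs(frob(phi(As, X), Y) - frob(X, phi(As_star, Y)))

# Conjugate identity: conj(Phi(A)(conj X)) = Phi(conj A)(X).
conj_gap = np.max(np.abs(phi(As, X.conj()).conj() - phi(As.conj(), X)))

print("adjoint identity gap:", adjoint_gap)
print("conjugate identity gap:", conj_gap)
```

Both gaps should be at the level of floating point rounding. The transpose identity follows by combining the two.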

We say that a norm $∥ * ∥$ on $L (L (U))$ is Hermitian adjoint preserving (resp. conjugate preserving, transpose preserving) if $∥ E ∥ = ∥ E^{*} ∥$ (resp. $∥ E ∥ = ∥ \overline{E} ∥$ , $∥ E ∥ = ∥ E^{⊤} ∥$ ).

The domain of the fitness function $F_{μ, n, ∥ * ∥}$ is the set of all non-zero completely positive superoperators $E : L (U) \to L (U)$ of Choi rank at most $n$ with $∥ E ∥ = 1$ . We define the fitness function $F_{μ, n, ∥ * ∥}$ by setting

$F_{μ, n, ∥ * ∥} (E) = \int \log (⟨ E (x x^{*}), y y^{*} ⟩) \, d μ (x, y) = \int \log (⟨ E (x x^{*}), y y^{*} ⟩) \, d μ (x, y) - \log (∥ E ∥)$ , where the second equality holds because $∥ E ∥ = 1$ on the domain. Observe that we also have $F_{μ, n, ∥ * ∥} (E) = \int \log (y^{*} E (x x^{*}) y) \, d μ (x, y)$ .

Experimental result (pseudodeterminism): Computer experiments show that the function $F_{μ, n, ∥ * ∥}$ typically has only one local maximum in the sense that we cannot find any other local maximum.

Define a function $F_{μ}$ whose domain is the set of all completely positive superoperators $E : L (U) \to L (U)$ by setting

$F_{μ} (E) = \int \log (⟨ E (x x^{*}), y y^{*} ⟩) \, d μ (x, y)$ which is equivalent to

$F_{μ} (E) = \int \log (y^{*} E (x x^{*}) y) \, d μ (x, y)$ . We write $F (μ, E)$ for $F_{μ} (E)$ to reduce the use of subscripts.

Lemma: $F (μ, E) = F (μ^{*}, E^{*}) = F (\overline{μ}, \overline{E}) = F (μ^{⊤}, E^{⊤})$ .

Proof: $F (μ, E) = \int \log (⟨ E (x x^{*}), y y^{*} ⟩) \, d μ (x, y)$

$= \int \log (⟨ x x^{*}, E^{*} (y y^{*}) ⟩) \, d μ (x, y)$

$= \int \log (⟨ E^{*} (y y^{*}), x x^{*} ⟩) \, d μ (x, y)$

$= \int \log (⟨ E^{*} (x x^{*}), y y^{*} ⟩) \, d μ^{*} (x, y) = F (μ^{*}, E^{*}) .$

Likewise,

$F (μ, E) = \int \log (y^{*} E (x x^{*}) y) \, d μ (x, y)$

$= \int \log (\overline{y}^{*} \cdot \overline{E (x x^{*})} \cdot \overline{y}) \, d μ (x, y)$

$= \int \log (\overline{y}^{*} \cdot \overline{E} (\overline{x x^{*}}) \cdot \overline{y}) \, d μ (x, y)$

$= \int \log (y^{*} \cdot \overline{E} (x x^{*}) \cdot y) \, d \overline{μ} (x, y) = F (\overline{μ}, \overline{E}) .$

As a consequence, $F (μ, E) = F (μ^{*}, E^{*}) = F (\overline{μ}^{*}, \overline{E}^{*}) = F (μ^{⊤}, E^{⊤})$ . Q.E.D.
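The lemma can be checked numerically when $μ$ is a finite convex combination of point masses, so that the integral is a finite sum. The sketch below assumes $μ = \sum_{k} w_{k} δ_{(x_{k}, y_{k})}$ and $E = Φ (A_{1}, \dots, A_{r})$ , and uses the identity $⟨ E (x x^{*}), y y^{*} ⟩ = \sum_{j} | y^{*} A_{j} x |^{2}$ . The function name `fitness` and the random test data are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

def rand_c(*shape):
    """Random complex array (illustrative test data only)."""
    return rng.standard_normal(shape) + 1j * rng.standard_normal(shape)

def fitness(As, xs, ys, ws):
    """F(mu, Phi(A_1..A_r)) for mu = sum_k ws[k] * delta_{(xs[k], ys[k])}."""
    # amps[k, j] = y_k^* A_j x_k
    amps = np.einsum('ki,jil,kl->kj', ys.conj(), As, xs)
    return float(np.sum(ws * np.log(np.sum(np.abs(amps) ** 2, axis=1))))

d, r, K = 3, 2, 5
As = rand_c(r, d, d)
xs, ys = rand_c(K, d), rand_c(K, d)
ws = np.full(K, 1.0 / K)

f = fitness(As, xs, ys, ws)
# mu^* swaps the roles of x and y; E^* = Phi(A_1^*, ..., A_r^*).
f_star = fitness(As.conj().transpose(0, 2, 1), ys, xs, ws)
# conj(mu) conjugates every sample; conj(E) = Phi(conj A_1, ..., conj A_r).
f_bar = fitness(As.conj(), xs.conj(), ys.conj(), ws)
# mu^T combines both operations; E^T = Phi(A_1^T, ..., A_r^T).
f_top = fitness(As.transpose(0, 2, 1), ys.conj(), xs.conj(), ws)

print(f, f_star, f_bar, f_top)
```

All four printed values should agree up to rounding error.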

Theorem: Suppose that $F_{μ, n, ∥ * ∥}$ has a unique global maximum $(E, F_{μ, n, ∥ * ∥} (E))$ .

1. If $∥ * ∥$ is Hermitian adjoint preserving and $μ = μ^{*}$ , then $E = E^{*}$ .
2. If $∥ * ∥$ is conjugate preserving and $μ = \overline{μ}$ , then $E = \overline{E}$ .
3. If $∥ * ∥$ is transpose preserving and $μ = μ^{⊤}$ , then $E = E^{⊤}$ .

Proof: The proofs of 2 and 3 are similar, so we shall only prove 1. For 1, assuming the premises, both $E$ and $E^{*}$ belong to the domain of $F_{μ, n, ∥ * ∥}$ . But

$F_{μ, n, ∥ * ∥} (E) = F (μ, E)$

$= F (μ^{*}, E^{*}) = F (μ, E^{*}) = F_{μ, n, ∥ * ∥} (E^{*})$ , where the first equality is the lemma and the second uses $μ = μ^{*}$ . Since $F_{μ, n, ∥ * ∥}$ has only one global maximum, we conclude that $E = E^{*}$ . Q.E.D.

From the above result, we conclude that the global maximum $(E, F_{μ, n, ∥ * ∥} (E))$ inherits any symmetry that the measure $μ$ has.

I would really like to build these inherently interpretable models so that they can solve some really interesting problems (or at least be a few layers in solving them), but I am still stuck attempting to communicate with people about linear models. Having a unique global optimum or more generally pseudodeterminism seems to be the best way to develop inherently interpretable and safe AI, but I have a hard time communicating with anyone about this.

Mitchell_Porter (4mo)

"Having a unique global optimum or more generally pseudodeterminism seems to be the best way to develop inherently interpretable and safe AI"

Hopefully this will draw some attention! But are you sacrificing something else, for the sake of these desirable properties?

Joseph Van Name (4mo)

Yes. It seems like to get pseudodeterministic AI, we will need to rebuild AI from the very beginning, and I am not sure that it will all work. For example, pseudodeterminism is harder to attain with stochastic or mini-batch gradient descent, so one might need to use all the training data whenever one updates the weights. I have so far been able to get pseudodeterministic multi-layered models for solving classification problems, word embeddings for NLP, models that measure the security of block ciphers such as the Advanced Encryption Standard (the models evaluating the AES are very easy to train), and other things. I have not been able to make pseudodeterministic versions of convolutional networks, transformers, GANs, etc. We can use pseudodeterminism for narrow AI or the first few layers of a deep neural network right now though. There is also a funding and exposure issue since not very many people are talking about pseudodeterminism. I have more posts planned about this though.

A trade of performance in exchange for interpretability is exactly what we want for AI safety.

kbear (4mo)

"Experimental result (pseudodeterminism): Computer experiments show that the function $F_{μ, n, ∥ * ∥}$ typically has only one local maximum in the sense that we cannot find any other local maximum."

a lot hinges on this. i would be interested to learn about the experimental setup.

Joseph Van Name (4mo)

For experiments, I just used a convex combination of point mass measures for $μ$ , where the point masses are generated uniformly at random (though I might get something more complicated if I tried evaluating the integrals). I then attempted to find multiple local maxima by the usual gradient ascent. If I always end up with the same local maximum, I presume that there is only one local maximum even though I have no mathematical proof that this is the case.
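An experiment of this shape can be sketched as follows. This is a hypothetical reconstruction, not the code used for the experiments described above: the samples are Gaussian rather than uniform, the norm is assumed to be $∥ Φ (A_{1}, \dots, A_{n}) ∥ = \sum_{j} ∥ A_{j} ∥_{F}^{2}$ (the trace of the Choi matrix), and the step size, step count, and all function names are arbitrary illustrative choices.

```python
import numpy as np

def rand_c(rng, *shape):
    """Random complex array (assumed sampling scheme, for illustration)."""
    return rng.standard_normal(shape) + 1j * rng.standard_normal(shape)

def fitness(As, xs, ys, ws):
    """sum_k w_k log(sum_j |y_k^* A_j x_k|^2) - log(sum_j ||A_j||_F^2)."""
    amps = np.einsum('ki,jil,kl->kj', ys.conj(), As, xs)
    S = np.sum(np.abs(amps) ** 2, axis=1)
    return float(ws @ np.log(S) - np.log(np.sum(np.abs(As) ** 2)))

def gradient(As, xs, ys, ws):
    """Wirtinger gradient dF/d(conj A_j), which points in an ascent direction."""
    amps = np.einsum('ki,jil,kl->kj', ys.conj(), As, xs)
    S = np.sum(np.abs(amps) ** 2, axis=1)
    data = np.einsum('k,kj,ki,kl->jil', ws / S, amps, ys, xs.conj())
    return data - As / np.sum(np.abs(As) ** 2)

def train(As, xs, ys, ws, steps=2000, lr=0.05):
    """Plain full-batch gradient ascent, renormalising so that ||E|| = 1."""
    for _ in range(steps):
        As = As + lr * gradient(As, xs, ys, ws)
        As = As / np.sqrt(np.sum(np.abs(As) ** 2))
    return As

def superoperator(As):
    """Matrix of X -> sum_j A_j X A_j^*; it does not depend on the choice of the A_j."""
    return sum(np.kron(A, A.conj()) for A in As)

data_rng = np.random.default_rng(0)
d, n, K = 3, 2, 20
xs, ys = rand_c(data_rng, K, d), rand_c(data_rng, K, d)
ws = np.full(K, 1.0 / K)  # equal-weight convex combination of point masses

runs = []
for seed in (1, 2):  # two independent initialisations
    init = rand_c(np.random.default_rng(seed), n, d, d)
    init = init / np.sqrt(np.sum(np.abs(init) ** 2))
    runs.append((init, train(init, xs, ys, ws)))

gains = [fitness(final, xs, ys, ws) - fitness(init, xs, ys, ws) for init, final in runs]
gap = np.linalg.norm(superoperator(runs[0][1]) - superoperator(runs[1][1]))
print("fitness gains:", gains)
print("distance between the two trained superoperators:", gap)
```

The trained models are compared as superoperators rather than as tuples $(A_{1}, \dots, A_{n})$ because different tuples can represent the same $E$ . A small distance is consistent with pseudodeterminism for this particular $μ$ but proves nothing in general.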

I am redoing the experiments, and the only way I can get pseudodeterminism to fail is by using real inner product spaces instead of complex inner product spaces and by setting $n = 1$ (in this case, pseudodeterminism fails because the set of all points where the fitness function returns a real number instead of negative infinity has multiple components). When pseudodeterminism fails, it does not even fail that badly. The distribution of all models that we get has low collision entropy $- \log (P (X = Y))$ , so $P (X = Y)$ is still high when $X, Y$ are trained models with different initializations.

Pseudodeterminism does not seem to be rare, but the problem in machine learning is to pseudodeterministically train machine learning models that can solve interesting and challenging problems; I have been working on this in my spare time (without anyone's help), but since people don't seem to be interested in this, progress has been slow.

kbear (4mo)

"a convex combination of point mass measures for $μ$ , where the point masses are generated uniformly at random"

two questions:

1. this seems to include an assumption regarding the points between training samples. if we take the point masses as the known values, then with this step we're adding some interpolation between those. (that is, if we were "really training" these things, then the integral would look like a sum, since it would only have support at those (x,y) that represent points in our training data.)
2. have you tried adversarially constructing $μ$ such that the integral has multiple maxima? if so, what did you run into? i worry that this could be a case where "random" examples mostly have this property, but some important subset does not (similar to how random functions are nowhere differentiable, but we can still do calculus.)

i'm interested in this! but i've only been introduced to these ideas recently, by your post! please read these questions as my own attempts to understand how you're thinking about it.

Joseph Van Name (4mo)

Yes. When we take convex combinations of finitely many point mass measures, the integral is just a sum. I use the sum of finitely many elements for ease of calculations, but to prove theorems, I should use measures for full generality.

The idea of finding an object $A$ along with distinct local optima $G_{A} (x_{1}), \dots, G_{A} (x_{n})$ with $n$ maximized looks like an interesting problem to work on. I have not worked on this kind of objective before, but I can certainly try this, as I have a few ideas of how to do this. This might work better for discrete optimization problems though since I cannot think of a good way to use gradient updates to produce new local optima. In this case, I will need to use either evolutionary computation or hill climbing instead. I do not think that this will result in natural looking objects $A$ though, so I don't think I can learn much from this endeavor.

I have not thought much about finding measures $μ$ where $F_{μ, n, ∥ * ∥}$ has many local maxima because I have many higher priorities. These days, people are focused on the more complicated machine learning systems such as large language models, and in order to catch up, I also need to increase the performance, capabilities, and efficiency of my pseudodeterministic machine learning models. For the more complicated multi-layered models, it seems more difficult to obtain and retain pseudodeterminism. Pseudodeterminism is a robust property for simple objective functions such as when we are training a linear model or performing convex optimization, but pseudodeterminism becomes increasingly fragile as we increase the sophistication of our objective functions. This means that it is trivial to violate pseudodeterminism for the sophisticated models that I want to work more on, but it is difficult to retain pseudodeterminism.

I am not at all worried about any strange case of non-pseudodeterminism when optimizing $F_{μ, n, ∥ * ∥}$ for measures I have not thought about yet since this problem is not even close to being non-pseudodeterministic. For example, if $E_{0}, F_{0}$ are norm 1 completely positive superoperators of the same Choi rank $\leq N$ and if $(x_{n}, y_{n}) \in (U ∖ \{0\})^{2}$ for all $n$ and $(E_{n})_{n}, (F_{n})_{n}$ are sequences where $E_{n + 1}$ is obtained by moving from $E_{n}$ in the direction of the gradient (with possible momentum) of $\log (| ⟨ E_{n} x_{n}, y_{n} ⟩ |^{2}) - \log (∥ E_{n} ∥)$ and $F_{n + 1}$ is obtained from $F_{n}$ the same way with the same rate, then my experiments show that $\lim_{n \to \infty} ∥ E_{n} - F_{n} ∥ = 0$ regardless of what each $(x_{n}, y_{n})$ is. In other words, even if $(E_{n})_{n}$ does not converge, the sequences $(E_{n})_{n}, (F_{n})_{n}$ uniformly approximate each other as $n \to \infty$ . This is a much stronger form of pseudodeterminism that is hard to violate, so it is not a high priority to find particular instances of non-pseudodeterminism especially if those instances do not coincide with real-world data.

I kind of expect the fitness function $F_{μ, n, ∥ * ∥}$ to have just one or a few local maxima because its closest relatives are the linear models and those linear models are obtained by optimizing an objective function with one local optimum. And I also expect $F_{μ, n, ∥ * ∥}$ to have one or very few local maxima because $F_{μ, n, ∥ * ∥}$ is similar to many other objective functions that I have constructed, each with one or a few local optima. And since $F_{μ, n, ∥ * ∥}$ is simpler than other objective functions I have looked at with few local optima, $F_{μ, n, ∥ * ∥}$ should also have very few local optima. And the function $F_{μ}$ is concave, so there is only one local maximum value $\{F_{μ} (E) : E \in Q\}$ whenever $Q$ is a convex set (such possible convex sets of interest include all quantum channels and all unital channels). The restriction of our attention to completely positive operators of low Choi rank and in the boundary of the unit ball means that when we maximize $F_{μ, n, ∥ * ∥}$ , we cannot use convexity to prove that there is only one local maximum, but convexity still suggests that there should be just one especially when $n$ is large. When $n$ is small, we cannot use convexity to make conclusions though since I did a Hessian calculation, and the Hessian of $F (μ, Φ (A_{1}, \dots, A_{r}))$ with respect to $(A_{1}, \dots, A_{r})$ generally has plenty of both positive and negative eigenvalues. I do not consider it a major problem if $F_{μ, n, ∥ * ∥}$ has multiple local maxima, since that probably just means that we need to increase the value of $n$ until these local maxima merge.

Joseph Van Name (3y)

Every entry in a matrix counts for the $L_{2}$ -spectral radius similarity. Suppose that $A_{1}, \dots, A_{r}, B_{1}, \dots, B_{r}$ are real $n \times n$ -matrices. Set $A^{\otimes 2} = A \otimes A$ . Define the $L_{2}$ -spectral radius similarity between $(A_{1}, \dots, A_{r})$ and $(B_{1}, \dots, B_{r})$ to be the number

$\frac{ρ (A_{1} \otimes B_{1} + \dots + A_{r} \otimes B_{r})}{ρ (A_{1}^{\otimes 2} + \dots + A_{r}^{\otimes 2})^{1 / 2} ρ (B_{1}^{\otimes 2} + \dots + B_{r}^{\otimes 2})^{1 / 2}}$ , where $ρ$ denotes the spectral radius. Then the $L_{2}$ -spectral radius similarity is always a real number in the interval $[0, 1]$ , so one can think of the $L_{2}$ -spectral radius similarity as a generalization of the value $\frac{| ⟨ u, v ⟩ |}{∥ u ∥ \cdot ∥ v ∥}$ where $u, v$ are real or complex vectors. It turns out experimentally that if $A_{1}, \dots, A_{r}$ are random real matrices, and each $B_{j}$ is obtained from $A_{j}$ by replacing each entry of $A_{j}$ with $0$ with probability $1 - α$ , then the $L_{2}$ -spectral radius similarity between $(A_{1}, \dots, A_{r})$ and $(B_{1}, \dots, B_{r})$ will be about $\sqrt{α}$ . If $u = (A_{1}, \dots, A_{r}), v = (B_{1}, \dots, B_{r})$ are regarded as vectors, then observe that $\frac{| ⟨ u, v ⟩ |}{∥ u ∥ \cdot ∥ v ∥} \approx \sqrt{α}$ as well.

Suppose now that $A_{1}, \dots, A_{r}$ are random real $n \times n$ matrices and $C_{1}, \dots, C_{r}$ are the $m \times m$ submatrices of $A_{1}, \dots, A_{r}$ respectively obtained by only looking at the first $m$ rows and columns of $A_{1}, \dots, A_{r}$ . Then the $L_{2}$ -spectral radius similarity between $A_{1}, \dots, A_{r}$ and $C_{1}, \dots, C_{r}$ will be about $\sqrt{m / n}$ . We can therefore conclude that in some sense $C_{1}, \dots, C_{r}$ is a simplified version of $A_{1}, \dots, A_{r}$ that more efficiently captures the behavior of $A_{1}, \dots, A_{r}$ than $B_{1}, \dots, B_{r}$ does.

If $A_{1}, \dots, A_{r}, B_{1}, \dots, B_{r}$ are independent random matrices with standard Gaussian entries, then the $L_{2}$ -spectral radius similarity between $(A_{1}, \dots, A_{r})$ and $(B_{1}, \dots, B_{r})$ will be about $1 / \sqrt{r}$ with small variance. If $u, v$ are random Gaussian vectors of length $r$ , then $\frac{| ⟨ u, v ⟩ |}{∥ u ∥ \cdot ∥ v ∥}$ will on average be about $c / \sqrt{r}$ for some constant $c$ , but $\frac{| ⟨ u, v ⟩ |}{∥ u ∥ \cdot ∥ v ∥}$ will have a high variance.
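The similarity and the two experiments above (zeroing entries and taking corner submatrices) can be sketched in NumPy. The function names, matrix sizes, and random seed below are illustrative assumptions. The second Kronecker factor is conjugated so that the same function also covers complex matrices; for real matrices this changes nothing.

```python
import numpy as np

def spectral_radius(M):
    """Largest absolute value of an eigenvalue of M."""
    return float(np.max(np.abs(np.linalg.eigvals(M))))

def l2_similarity(As, Bs):
    """L_2-spectral radius similarity between two equally long tuples of square matrices.

    The A_j may have a different size from the B_j.
    """
    cross = sum(np.kron(A, B.conj()) for A, B in zip(As, Bs))
    self_a = sum(np.kron(A, A.conj()) for A in As)
    self_b = sum(np.kron(B, B.conj()) for B in Bs)
    return spectral_radius(cross) / np.sqrt(spectral_radius(self_a) * spectral_radius(self_b))

rng = np.random.default_rng(0)
r, n, m, alpha = 4, 12, 6, 0.5
As = rng.standard_normal((r, n, n))

# Keep each entry with probability alpha and zero it otherwise.
Bs = As * (rng.random(As.shape) < alpha)
# Top-left m x m corners of each A_j.
Cs = As[:, :m, :m]

print("sparsified:", l2_similarity(As, Bs), "compare sqrt(alpha) =", np.sqrt(alpha))
print("corner:    ", l2_similarity(As, Cs), "compare sqrt(m/n)  =", np.sqrt(m / n))
print("self:      ", l2_similarity(As, As))
```

With matrices this small, the printed values should only roughly track $\sqrt{α}$ and $\sqrt{m / n}$ ; the similarity of a tuple with itself is exactly $1$ .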

These are some simple observations that I have made about the spectral radius during my research for evaluating cryptographic functions for cryptocurrency technologies.

Algon (3y)

Your notation is confusing me. If r is the size of the list of matrices, then how can you have a probability of 1-r for r>=2? Maybe you mean 1-1/r and sqrt{1/r} instead of 1-r and sqrt{r} respectively?

Joseph Van Name (2y)

Thanks for pointing that out. I have corrected the typo. I simply used the symbol $r$ for two different quantities, but now the probability is denoted by the symbol $α$ .

Joseph Van Name (1y)

In this post, I will share some observations that I have made about the octonions. These observations demonstrate that the machine learning algorithms that I have been looking at recently behave mathematically, and such machine learning algorithms seem to be highly interpretable. The good behavior of these machine learning algorithms is in part due to the mathematical nature of the octonions and also the compatibility between the octonions and the machine learning algorithm. To be specific, one should think of the octonions as encoding a mixed unitary quantum channel that looks very close to the completely depolarizing channel, but my machine learning algorithms work well with those sorts of quantum channels and similar objects.

Suppose that $K$ is either the field of real numbers, complex numbers, or quaternions.

If $A_{1}, \dots, A_{r} \in M_{m} (K), B_{1}, \dots, B_{r} \in M_{n} (K)$ are matrices, then define a superoperator $Γ (A_{1}, \dots, A_{r}; B_{1}, \dots, B_{r}) : M_{m, n} (K) \to M_{m, n} (K)$

by setting $Γ (A_{1}, \dots, A_{r}; B_{1}, \dots, B_{r}) (X) = A_{1} X B_{1}^{*} + \dots + A_{r} X B_{r}^{*}$

(the domain and range of $Γ (A_{1}, \dots, A_{r}; B_{1}, \dots, B_{r})$ are both $M_{m, n} (K)$ ) and define $Φ (A_{1}, \dots, A_{r}) = Γ (A_{1}, \dots, A_{r}; A_{1}, \dots, A_{r})$ . Define the $L_{2}$ -spectral radius similarity $∥ (A_{1}, \dots, A_{r}) ≃ (B_{1}, \dots, B_{r}) ∥_{2}$ by setting

$∥ (A_{1}, \dots, A_{r}) ≃ (B_{1}, \dots, B_{r}) ∥_{2}$

$= \frac{ρ (Γ (A_{1}, \dots, A_{r}; B_{1}, \dots, B_{r}))}{ρ (Φ (A_{1}, \dots, A_{r}))^{1 / 2} ρ (Φ (B_{1}, \dots, B_{r}))^{1 / 2}}$ where $ρ$ denotes the spectral radius.

Recall that the octonions are the unique (up-to-isomorphism) 8 dimensional real inner product space $V$ together with a bilinear binary operation $*$ such that $∥ x * y ∥ = ∥ x ∥ \cdot ∥ y ∥$ and $1 * x = x * 1 = x$ for all $x, y \in V$ .

Suppose that $e_{1}, \dots, e_{8}$ is an orthonormal basis for $V$ . Define operators $(A_{1}, \dots, A_{8})$ by setting $A_{i} v = e_{i} * v$ . Now, define operators $(B_{1}, \dots, B_{64})$ up to reordering by setting $\{B_{1}, \dots, B_{64}\} = \{A_{i} \otimes A_{j} : i, j \in \{1, \dots, 8\}\}$ .
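One concrete way to build these operators is through the Cayley-Dickson construction, sketched below in NumPy. This is an assumption made for illustration: the text above does not say which basis or multiplication table is used, and the function names `cd_conj` and `cd_mul` are hypothetical. Here $e_{1}$ is taken to be the unit $1$ .

```python
import numpy as np

def cd_conj(x):
    """Conjugation in the Cayley-Dickson algebra of dimension len(x)."""
    if len(x) == 1:
        return x.copy()
    h = len(x) // 2
    return np.concatenate([cd_conj(x[:h]), -x[h:]])

def cd_mul(x, y):
    """Cayley-Dickson product (a, b)(c, d) = (ac - d^* b, da + b c^*)."""
    if len(x) == 1:
        return x * y
    h = len(x) // 2
    a, b, c, d = x[:h], x[h:], y[:h], y[h:]
    return np.concatenate([
        cd_mul(a, c) - cd_mul(cd_conj(d), b),
        cd_mul(d, a) + cd_mul(b, cd_conj(c)),
    ])

basis = np.eye(8)  # rows are e_1 = 1, e_2, ..., e_8

# A[i] is the matrix of left multiplication v -> e_i * v (column k is e_i * e_k).
A = np.stack([
    np.column_stack([cd_mul(basis[i], basis[k]) for k in range(8)])
    for i in range(8)
])

# The 64 operators B_1, ..., B_64 are the Kronecker products of pairs of the A_i.
B = np.stack([np.kron(A[i], A[j]) for i in range(8) for j in range(8)])

# Spot check of the defining property ||x * y|| = ||x|| ||y||.
rng = np.random.default_rng(0)
x, y = rng.standard_normal(8), rng.standard_normal(8)
print(np.linalg.norm(cd_mul(x, y)), np.linalg.norm(x) * np.linalg.norm(y))
```

Because the norm is multiplicative, each $A_{i}$ is an orthogonal matrix, which fits the description of the octonions as encoding a mixed unitary channel.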

Let $d$ be a positive integer. Then the goal is to find complex symmetric $d \times d$ -matrices $(X_{1}, \dots, X_{64})$ where $∥ (B_{1}, \dots, B_{64}) ≃ (X_{1}, \dots, X_{64}) ∥_{2}$ is locally maximized. We achieve this goal through gradient ascent optimization. Since we are using gradient ascent, I consider this to be a machine learning algorithm, but the function mapping $B_{j}$ to $X_{j}$ is a linear transformation, so we are training linear models here (we can generalize this fitness function to one where we train non-linear models, but that takes a lot of work if we want the generalized fitness functions to still behave mathematically).

Experimental Observation: If $1 \leq d \leq 8$ , then we can easily find complex symmetric matrices $(X_{1}, \dots, X_{64})$ where $∥ (B_{1}, \dots, B_{64}) ≃ (X_{1}, \dots, X_{64}) ∥_{2}$ is locally maximized and where $∥ (B_{1}, \dots, B_{64}) ≃ (X_{1}, \dots, X_{64}) ∥_{2}^{2} = (2 d + 6) / 64 = (d + 3) / 32.$

If $7 \leq d \leq 16$ , then we can easily find complex symmetric matrices $(X_{1}, \dots, X_{64})$ where $∥ (B_{1}, \dots, B_{64}) ≃ (X_{1}, \dots, X_{64}) ∥_{2}$ is locally maximized and where $∥ (B_{1}, \dots, B_{64}) ≃ (X_{1}, \dots, X_{64}) ∥_{2}^{2} = (2 d + 4) / 64 = (d + 2) / 32.$

Joseph Van Name (1y)

It is time for us to interpret some linear machine learning models that I have been working on. These models are linear, but I can generalize these algorithms to produce multilinear models which have stronger capabilities while still behaving mathematically. Since one can stack the layers to make non-linear models, these types of machine learning algorithms seem to have enough performance to be more relevant for AI safety.

Our goal is to transform a list of $n \times n$ -matrices $(A_{1}, \dots, A_{r})$ into a new and simplified list of $d \times d$ -matrices $(X_{1}, \dots, X_{r})$ . There are several ways in which we would like to simplify the matrices. For example, we would sometimes simply like to have $d < n$ , but in other cases, we would like the matrices $X_{j}$ to all be real symmetric, complex symmetric, real Hermitian, complex Hermitian, complex anti-symmetric, etc.

We measure similarity between tuples of matrices using spectral radii. Suppose that $(A_{1}, \dots, A_{r})$ are $n \times n$ -matrices and $(X_{1}, \dots, X_{r})$ are $d \times d$ -matrices. Then define an operator $Γ (A_{1}, \dots, A_{r}; X_{1}, \dots, X_{r})$ mapping $n \times d$ -matrices to $n \times d$ -matrices by setting $Γ (A_{1}, \dots, A_{r}; X_{1}, \dots, X_{r}) (X) = A_{1} X X_{1}^{*} + \dots + A_{r} X X_{r}^{*}$ . Then define $Φ (X_{1}, \dots, X_{r}) = Γ (X_{1}, \dots, X_{r}; X_{1}, \dots, X_{r})$ . Define the similarity between $(A_{1}, \dots, A_{r})$ and $(X_{1}, \dots, X_{r})$ by setting

$∥ (A_{1}, \dots, A_{r}) ≃ (X_{1}, \dots, X_{r}) ∥_{2}$

$= \frac{ρ (Γ (A_{1}, \dots, A_{r}; X_{1}, \dots, X_{r}))}{ρ (Φ (A_{1}, \dots, A_{r}))^{1 / 2} ρ (Φ (X_{1}, \dots, X_{r}))^{1 / 2}}$

where $ρ$ denotes the spectral radius. Here, $∥ (A_{1}, \dots, A_{r}) ≃ (X_{1}, \dots, X_{r}) ∥_{2}$ should be thought of as a generalization of the cosine similarity to tuples of matrices. And $∥ (A_{1}, \dots, A_{r}) ≃ (X_{1}, \dots, X_{r}) ∥_{2}$ is always a real number in $[0, 1]$ , so this is a sensible notion of similarity.

Suppose that $K$ is either the field of real or complex numbers. Let $M_{n} (K)$ denote the set of $n$ by $n$ matrices over $K$ .

Let $n, d$ be positive integers. Let $T : M_{d} (K) \to M_{d} (K)$ denote a projection operator. Here, $T$ is a real-linear operator, but if $K$ is complex, then $T$ is not necessarily complex linear. Here are a few examples of such linear operators $T$ that work:

$K = C : T_{1} (X) = (X + X^{T}) / 2$ (Complex symmetric)

$K = C : T_{2} (X) = (X - X^{T}) / 2$ (Complex anti-symmetric)

$K = C : T_{3} (X) = (X + X^{*}) / 2$ (Complex Hermitian)

$K = C : T_{4} (X) = Re (X)$ (real, the real part taken elementwise).

$K = R : T_{5} (X) = (X + X^{T}) / 2$ (Real symmetric)

$K = R : T_{6} (X) = (X - X^{T}) / 2$ (Real anti-symmetric)

$K = C : T_{7} (X) = (Re (X) + Re (X)^{T}) / 2$ (real symmetric)

$K = C : T_{8} (X) = (Re (X) - Re (X)^{T}) / 2$ (real anti-symmetric)

Caution: These are special projection operators on spaces of matrices. The following algorithms do not behave well for general projection operators; they mainly behave well for $T_{1}, \dots, T_{8}$ along with operators that I have forgotten about.
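The projection property $T (T (X)) = T (X)$ is easy to check numerically. The sketch below implements the complex cases as plain Python functions; the dictionary name `projections` is illustrative, and $T_{7}, T_{8}$ are taken with the factor $1 / 2$ that makes them idempotent (an assumption where the formulas above are read without it). $T_{5}$ and $T_{6}$ are the restrictions of $T_{1}$ and $T_{2}$ to real matrices.

```python
import numpy as np

projections = {
    "T1": lambda X: (X + X.T) / 2,                          # complex symmetric
    "T2": lambda X: (X - X.T) / 2,                          # complex anti-symmetric
    "T3": lambda X: (X + X.conj().T) / 2,                   # complex Hermitian
    "T4": lambda X: X.real.astype(complex),                 # real part, elementwise
    "T7": lambda X: ((X.real + X.real.T) / 2).astype(complex),  # real symmetric
    "T8": lambda X: ((X.real - X.real.T) / 2).astype(complex),  # real anti-symmetric
}

rng = np.random.default_rng(0)
X = rng.standard_normal((4, 4)) + 1j * rng.standard_normal((4, 4))

for name, T in projections.items():
    idempotent = np.allclose(T(T(X)), T(X))
    complex_linear = np.allclose(T(1j * X), 1j * T(X))
    print(name, "idempotent:", idempotent, "complex linear on this X:", complex_linear)
```

The second column illustrates the remark about linearity: $T_{1}$ and $T_{2}$ commute with multiplication by $i$ , while $T_{3}$ , $T_{4}$ , $T_{7}$ , and $T_{8}$ are only real-linear.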

We are now ready to describe our machine learning algorithm's input and objective.

Input: $r$ matrices $A_{1}, \dots, A_{r} \in M_{n} (K)$

Objective: Our goal is to obtain matrices $X_{1}, \dots, X_{r} \in M_{d} (K)$ where $T (X_{j}) = X_{j}$ for all $j$ and which locally maximize the similarity $∥ (A_{1}, \dots, A_{r}) ≃ (X_{1}, \dots, X_{r}) ∥_{2}$ .

In this case, we shall call $(X_{1}, \dots, X_{r})$ an $L_{2, d}$ -spectral radius dimensionality reduction (LSRDR) along the subspace $im (T) .$

LSRDRs along subspaces often perform tricks and are very well-behaved.

If $(X_{1}, \dots, X_{r}), (Y_{1}, \dots, Y_{r})$ are LSRDRs along subspaces, then there are typically some $λ, C$ where $Y_{j} = λ C X_{j} C^{- 1}$ for all $j$ . Furthermore, if $(X_{1}, \dots, X_{r})$ is an LSRDR along a subspace, then we can typically find some matrices $R, S$ where $X_{j} = T (R A_{j} S)$ for all $j$ .

The model $(X_{1}, \dots, X_{r})$ simplifies since it is encoded into the matrices $R, S$ , but this also means that the model $(X_{1}, \dots, X_{r})$ is a linear model. I have just made these observations about the LSRDRs along subspaces, but they seem to behave mathematically enough for me especially since the matrices $R, S$ tend to have mathematical properties that I can't explain and am still exploring.

Joseph Van Name (8mo)

Bitcoin mining is a real-world example of a goal that people spend an enormous amount of resources to attain, but this goal is useless or at least horribly inefficient.

Recall that the orthogonality thesis states that it is possible for an intelligent entity to have bad or dumb goals and that it is also possible for a not-so-intelligent entity to have good goals. I would therefore consider Bitcoin mining to be a prominent real-world example of the orthogonality thesis, as it is in a sense a dumb goal attained intelligently (though this example is imperfect).

Bitcoin's mining algorithm consists of computing many SHA-256 hashes relentlessly. The Bitcoin miners are rewarded whenever they compute a suitable SHA-256 hash that is lower than the target. These SHA-256 hashes establish decentralized consensus about the state of the blockchain, and they distribute newly minted bitcoins. But besides this, computing so many SHA-256 hashes is nearly useless. Computing so many SHA-256 hashes consumes large quantities of energy and creates electronic waste.
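The search for a hash below the target can be illustrated with a toy proof-of-work loop. This is a simplified sketch: the header bytes, the very easy target, and the function names are assumptions made for illustration, and a real Bitcoin block header has a fixed 80-byte layout that is not reproduced here. The double SHA-256 and the little-endian comparison follow Bitcoin's convention.

```python
import hashlib

def double_sha256(data: bytes) -> bytes:
    """SHA-256 applied twice."""
    return hashlib.sha256(hashlib.sha256(data).digest()).digest()

def mine(header: bytes, target: int, max_nonce: int = 2**32):
    """Try nonces in order until the hash, read as an integer, is below the target."""
    for nonce in range(max_nonce):
        digest = double_sha256(header + nonce.to_bytes(4, "little"))
        if int.from_bytes(digest, "little") < target:
            return nonce, digest
    return None

# Toy difficulty: roughly one hash in 4096 qualifies (real targets are far smaller).
target = 1 << 244
result = mine(b"toy block header", target)
if result is not None:
    nonce, digest = result
    print("nonce:", nonce)
    print("hash: ", digest[::-1].hex())  # displayed in the usual big-endian form
```

Apart from proving that work was done, the hashes tried along the way have no further use, which is the waste described above.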

So what are some of the possible alternatives to Bitcoin mining? It seems like the best alternative that does not significantly change the nature of Bitcoin mining would be to replace SHA-256 mining with some other mining algorithm that serves some scientific purpose.

This is more difficult than it seems because Bitcoin mining must satisfy a list of cryptographic properties. If the mining algorithm did not satisfy these cryptographic properties, then it might not be feasible for newly minted bitcoins to be dispersed every 10 minutes, and we may enter a scenario where a single entity with a secret algorithm or slightly faster hardware puts all the blocks on the blockchain.

Since Bitcoin mining must satisfy a list of cryptographic properties, it is difficult to come up with a more scientifically useful mining algorithm that satisfies these cryptographic properties. But in science, if there is a difficult problem, people should perform research on this scientific problem. While finding a useful cryptocurrency mining algorithm has its challenges, cryptocurrency mining algorithms are easy to produce since they can be made from cryptographic hash functions without requiring public key encryption or other advanced cryptographic algorithms, so difficulty seems more like an excuse rather than a legitimate reason not to investigate useful cryptocurrency mining algorithms. The cryptocurrency sector does not want to perform this research. I can think of several reasons why people refuse to support this sort of endeavor despite the great effort that people put into Bitcoin mining, but none of these reasons justify the lack of interest in useful cryptocurrency mining.

The diminishing quality of cryptocurrency users:

It seems like when altcoins were first being developed around 2014, people were much more interested in developing scientifically useful mining algorithms. But around 2017 when cryptocurrency really started to become popular, people simply wanted to make money from cryptocurrencies, yet they were not very interested in understanding how cryptocurrencies work or how to improve them.

Mining algorithms with questionable scientific use:

Some cryptocurrencies and proposals such as Primecoin and Gapcoin have more scientific mining algorithms, but these mining algorithms still have questionable usefulness. For example, the objective in Primecoin mining is to find suitable Cunningham chains. A Cunningham chain of the first kind is a sequence of prime numbers $p_{1}, \dots, p_{n}$ where $p_{j + 1} = 2 p_{j} + 1$ whenever $1 \leq j < n$ . The most interesting thing about Cunningham chains is that they can be used in cryptocurrency mining algorithms, but they are otherwise of minor importance to mathematics.
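For concreteness, here is a small sketch that extends a Cunningham chain of the first kind forward from a starting prime. It uses trial division, which is only practical for small numbers, and the function names are illustrative; it is not how Primecoin searches for chains.

```python
def is_prime(n: int) -> bool:
    """Trial-division primality test, adequate for small n only."""
    if n < 2:
        return False
    i = 2
    while i * i <= n:
        if n % i == 0:
            return False
        i += 1
    return True

def cunningham_chain(p: int) -> list:
    """Primes p, 2p + 1, 2(2p + 1) + 1, ... until the first composite (empty if p is not prime)."""
    chain = []
    while is_prime(p):
        chain.append(p)
        p = 2 * p + 1
    return chain

print(cunningham_chain(2))  # [2, 5, 11, 23, 47]
```

The chain starting at $2$ stops after $47$ because $2 \cdot 47 + 1 = 95 = 5 \cdot 19$ is composite.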

These questionable mining algorithms are supposed to steer the cryptocurrency community into a more scientific direction, but in reality, they have just steered the cryptocurrency community towards using mining to perform mathematical calculations that not even mathematicians care that much about.

Alternative solutions to the energy waste problem:

Many people just want to do away with cryptocurrency mining in an altcoin by replacing it with proof-of-stake or some other consensus mechanism. This solution is attractive to the cryptocurrency creators since they want complete control over all the coins at the beginning of the project, and they just use the energy usage of cryptocurrency as a marketing strategy to get people interested in their project. But this solution should not be appealing to anyone who wants to use the cryptocurrency even if a cryptocurrency is better funded without much mining (of course, if mining is replaced with another consensus mechanism after all the coins have been created, then this objection does not stand). After all, Satoshi Nakamoto did not fund Bitcoin by selling bitcoins. There are other ways to fund a cryptocurrency project without alternate consensus mechanisms.

Hostility against cryptocurrency technologies:

It seems like many members of society are hostile against cryptocurrency technologies and hate people who own or are in any way interested in cryptocurrency. This sort of hostility is a very good reason to conduct as many transactions using just cryptocurrency since I do not want to deal with all of those Karens. But this hostility may have turned people away from researching useful cryptocurrency mining algorithms even though the usefulness would probably not benefit the cryptocurrency directly.

Hardcore Bitcoiners:

If Bitcoin mining were magically replaced with a useful mining algorithm, barely anything about Bitcoin would change. But in my experience, Bitcoiners do not see it this way. They are so stuck in their ways that they reject all altcoins.

Conclusion:

While cryptocurrencies have a lot of monetary value, they are not exactly powerhouses of innovation, nor do I find them extremely interesting on their own. But a good scientific mining algorithm would make them much more innovative and interesting.

[-]ChristianKl8mo31

But this solution should not be appealing to anyone who wants to use the cryptocurrency even if a cryptocurrency is better funded without much mining (of course, if mining is replaced with another consensus mechanism after all the coins have been created, then this objection does not stand). After all, Satoshi Nakamoto did not fund Bitcoin by selling bitcoins. There are other ways to fund a cryptocurrency project without alternate consensus mechanisms.

I don't understand why that would be an argument against just using proof of stake. Proof of stake has a bunch of different benefits. It solves the energy problem.

It also increases the amount of writes that the blockchain can do per minute which is very important for usability.

[-]Joseph Van Name8mo10

Proof-of-stake is still wasteful since it promotes pump and dump scams and causes people to waste their money on scam projects. If the creators are able to get their reward at the very beginning of a project, they will be more interested in short-term gains rather than a long-term token that will last. Humans are not psychologically/socially equipped to invest in proof-of-stake cryptocurrencies since they tend to get scammed.

[-]ChristianKl8mo30

Proof-of-stake is just technology that can be used in different ways. It can be used for pump and dump scams but also for different purposes.

If you are building a product that's actually set up to create long-term value, it's useful to use proof-of-stake as it allows you to provide more value because you have higher throughput while using less energy.

If the value of the project rises as features are built out, there's an incentive to build out the project. There's no good reason for anyone who wants to build a system that actually creates value to use technology that burns more energy and provides less performance.

[-]Joseph Van Name2mo*30

I am going to apply my own dimensionality reduction algebra to a quantum channel (or matrices) obtained from the Okubo algebra in order to demonstrate the compatibility between my dimensionality reduction and the Okubo algebra.

TL-DR version: I trained my own machine learning algorithms on Okubo algebras and the squares of the fitness levels of the local maxima were usually either rational numbers or quadratic algebraic numbers. This suggests that my machine learning algorithm behaves mathematically.

Origin of algorithm: I have originally created this dimensionality reduction algorithm to analyze the cryptographic security of block ciphers for the cryptocurrency that I have created. If you want to discuss cryptocurrency technologies, please contact me privately off this site since I really do not feel comfortable talking about that stuff here.

After obtaining the dimensionality reduction algorithm, I noticed that such algorithms behaved mathematically for reasons that I still can't explain, and I have concluded that such mathematical behavior is needed to construct inherently interpretable and safe machine learning algorithms. Of course, if we want inherently interpretable and safe AI, we need machine learning algorithms that we can use to train models with many layers that can solve sophisticated tasks, but I am well on my way towards creating these algorithms too despite a complete and total lack of support.

Mathematics: The Okubo algebra^[1] is a close cousin to the octonions and satisfies many similar properties to the octonions.

The underlying set of the Okubo algebra is the set of all $3 \times 3$ -complex Hermitian matrices with trace 0. Observe that the set of all $3 \times 3$ -complex Hermitian matrices forms a real vector space of dimension $9$ . Therefore, the Okubo algebra's underlying set has dimension $8$ . Let be the complex numbers with . Then up to complex conjugation. The Okubo algebra is endowed with a bilinear operation defined by (I scaled the operation by a factor of so that the norm on the Okubo algebra is just the Frobenius norm). The operation satisfies the property where refers to the Frobenius norm and .

Let be an isomorphism between inner product spaces. Then define an operation on by setting . Then define orthogonal matrices by where is the standard basis for real Euclidean space.

If are -complex matrices and are -complex matrices, then define the -spectral radius similarity between and by

Computational results: The following facts are suggested by computer experiments but have not been rigorously proven. To run the computer experiments, I used gradient ascent to locally maximize the $L_{2}$ -spectral radius similarity. By maximizing the $L_{2}$ -spectral radius similarity, we reduce the dimensions of a tuple of matrices, and I call this dimensionality reduction the $L_{2}$ -spectral radius dimensionality reduction (LSRDR).

The maximum value of among the real -matrices is . Let be the maximum value of among the -complex,real symmetric,complex symmetric, complex anti-symmetric, complex Hermitian matrices. Then

Similar facts seem to hold for the other values (but I have not completely performed the calculations due to numerical instabilities that I do not want to fix). For example, and for .

The fitness levels that I have are simple but they are not too simple. This indicates that LSRDRs of Okubo algebras are interesting mathematically.

^[1] Alberto Elduque, Okubo algebras: automorphisms, derivations and idempotents, 2013, https://api.semanticscholar.org/CorpusID:119713330

[-]Joseph Van Name1y30

I am going to share an algorithm that I came up with that tends to produce the same result when we run it multiple times with different initializations. The iteration is not even guaranteed to converge since we are not using gradient ascent, but it typically converges as long as the algorithm is given a reasonable input. This suggests that the algorithm behaves mathematically and may be useful for things such as quantum error correction. After analyzing the algorithm, I shall use the algorithm to solve a computational problem.

We say that an algorithm is pseudodeterministic if it tends to return the same output even if the computation leading to that output is non-deterministic (due to a random initialization). I believe that we should focus a lot more on pseudodeterministic machine learning algorithms for AI safety and interpretability since pseudodeterministic algorithms are inherently interpretable.

Define for all complex numbers $z$ . Then $f (0) = 0, f (1) = 1, f^{'} (0) = f^{'} (1) = 0$ , and there are neighborhoods $U, V$ of $0, 1$ respectively where if $x \in U$ , then $f^{N} (x) \to 0$ quickly and if $y \in V$ , then $f^{N} (y) \to 1$ quickly. Set $f^{\infty} = \lim_{N \to \infty} f^{N}$ . The function $f^{\infty}$ serves as error correction for projection matrices since if $Q$ is nearly a projection matrix, then $f^{\infty} (Q)$ will be a projection matrix.
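The formula for $f$ did not survive in the text above, so the following NumPy sketch assumes the polynomial $f (z) = 3 z^{2} - 2 z^{3}$ , which does satisfy $f (0) = 0, f (1) = 1, f^{'} (0) = f^{'} (1) = 0$ ; the original choice of $f$ may differ. It illustrates the error-correction claim by perturbing an exact projection and iterating $f$ on the result. The names `f`, `f_inf`, and the step count are hypothetical.

```python
import numpy as np

def f(Q):
    """Assumed polynomial f(z) = 3z^2 - 2z^3, applied to a square matrix."""
    Q2 = Q @ Q
    return 3 * Q2 - 2 * Q2 @ Q

def f_inf(Q, steps=40):
    """Approximate the limit of f^N(Q) by iterating f a fixed number of times."""
    for _ in range(steps):
        Q = f(Q)
    return Q

rng = np.random.default_rng(0)
n, d = 6, 2

# Exact rank-d orthogonal projection onto a random d-dimensional subspace.
basis, _ = np.linalg.qr(rng.standard_normal((n, d)))
P = basis @ basis.T

# Small symmetric perturbation, so Q is only nearly a projection.
noise = 0.01 * rng.standard_normal((n, n))
Q = P + (noise + noise.T) / 2

R = f_inf(Q)
print(np.linalg.norm(Q @ Q - Q))  # visibly nonzero: Q is not a projection
print(np.linalg.norm(R @ R - R))  # tiny: R is a projection up to rounding
```

Because the fixed points $0$ and $1$ have zero derivative under this $f$ , the eigenvalues of $Q$ are pulled to $0$ or $1$ very quickly, and $R$ stays close to the unperturbed $P$ .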

Suppose that $K$ is either the field of real numbers, complex numbers or quaternions. Let $Z (K)$ denote the center of $K$ . In particular, $Z (R) = R, Z (C) = C, Z (H) = R$ .

If $A_{1}, \dots, A_{r}$ are $m \times n$ -matrices, then define $Φ (A_{1}, \dots, A_{r}) : M_{n} (K) \to M_{m} (K)$ by setting $Φ (A_{1}, \dots, A_{r}) (X) = \sum_{k = 1}^{r} A_{k} X A_{k}^{*}$ . Then we say that an operator of the form $Φ (A_{1}, \dots, A_{r})$ is completely positive. We say that a $Z (K)$ -linear operator $E : M_{n} (K) \to M_{m} (K)$ is Hermitian preserving if $E (X)$ is Hermitian whenever $X$ is Hermitian. Every completely positive operator is Hermitian preserving.

Suppose that $E : M_{n} (K) \to M_{n} (K)$ is $Z (K)$ -linear. Let $t > 0$ . Let $P_{0} \in M_{n} (K)$ be a random orthogonal projection matrix of rank $d$ . Set $P_{N + 1} = f^{\infty} (P_{N} + t \cdot E (P_{N}))$ for all $N$ . Then if everything goes well, the sequence $(P_{N})_{N}$ will converge to a projection matrix $P$ of rank $d$ , and the projection matrix $P$ will typically be unique in the sense that if we run the experiment again, we will typically obtain the exact same projection matrix $P$ . If $E$ is Hermitian preserving, then the projection matrix $P$ will typically be an orthogonal projection. This experiment performs well especially when $E$ is completely positive or at least Hermitian preserving or nearly so. The projection matrix $P$ will satisfy the equation $P \cdot E (P) = E (P) \cdot P = P \cdot E (P) \cdot P$ .

In the case when $E$ is a quantum channel, we can easily explain what the projection $P$ does. The operator $P$ is a projection onto a subspace of complex Euclidean space that is particularly well preserved by the channel $E$ . In particular, the image $Im (P)$ is spanned by the top $d$ eigenvectors of $E (P)$ . This means that if we send the completely mixed state $P / d$ through the quantum channel $E$ and we measure the state $E (P / d)$ with respect to the projective measurement $(P, I - P)$ , then there is an unusually high probability that this measurement will land on $P$ instead of $I - P$ .

Let us now use the algorithm that obtains $P$ from $E$ to solve a problem in many cases.

If $x$ is a vector, then let $Diag (x)$ denote the diagonal matrix where $x$ is the vector of diagonal entries, and if $X$ is a square matrix, then let $Diag (X)$ denote the diagonal of $X$ . If $x$ is a length $n$ vector, then $Diag (x)$ is an $n \times n$ -matrix, and if $X$ is an $n \times n$ -matrix, then $Diag (X)$ is a length $n$ vector.

Problem Input: An $n \times n$ -square matrix $A$ with non-negative real entries and a natural number $d$ with $1 \leq d < n$ .

Objective: Find a subset $B \subseteq \{1, \dots, n\}$ with $| B | = d$ and where if $x = A \cdot χ_{B}$ , then the $d$ largest entries in $x$ are the values $x [b]$ for $b \in B$ .

Algorithm: Let $E$ be the completely positive operator defined by setting $E (X) = Diag (A \cdot Diag (X))$ . Then we run the iteration using $E$ to produce an orthogonal projection $P$ with rank $d$ . In this case, the projection $P$ will be a diagonal projection matrix with rank $d$ where $Diag (P) = χ_{B}$ and where $B$ is our desired subset of $\{1, \dots, n\}$ .
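The problem and the iteration can be sketched in NumPy as follows. Two things here are assumptions rather than part of the description above: for a Hermitian matrix whose eigenvalues sit near $0$ and $1$ , $f^{\infty}$ is computed by rounding each eigenvalue to $0$ or $1$ (with $1 / 2$ as the cutoff), and the step size $t$ is chosen small enough that the eigenvalues stay in those two clusters. The function names and the example matrix are hypothetical.

```python
import numpy as np

def round_to_projection(M):
    """Stand-in for f^infinity on a symmetric M: eigenvalues above 1/2 go to 1, the rest to 0."""
    w, V = np.linalg.eigh(M)
    keep = V[:, w > 0.5]
    return keep @ keep.T

def E(A, X):
    """E(X) = Diag(A . Diag(X)): a diagonal matrix built from the diagonal of X."""
    return np.diag(A @ np.diag(X))

def is_valid_subset(A, B):
    """Objective check: the |B| largest entries of x = A . chi_B are exactly those indexed by B."""
    n = A.shape[0]
    chi = np.zeros(n)
    chi[list(B)] = 1.0
    x = A @ chi
    return min(x[b] for b in B) >= max(x[i] for i in range(n) if i not in B)

def run_iteration(A, d, rounds=500, seed=0):
    """Iterate P <- f^infinity(P + t E(P)) from a random rank-d orthogonal projection."""
    rng = np.random.default_rng(seed)
    n = A.shape[0]
    basis, _ = np.linalg.qr(rng.standard_normal((n, d)))
    P = basis @ basis.T
    t = 0.2 / A.sum(axis=1).max()  # keeps the eigenvalues of P + t E(P) near 0 and 1
    for _ in range(rounds):
        P = round_to_projection(P + t * E(A, P))
    return P

A = np.array([[2.0, 1.0, 0.0],
              [1.0, 2.0, 3.0],
              [0.0, 0.0, 1.0]])
P = run_iteration(A, d=2)
print(np.round(np.diag(P), 3))         # diagonal of the final projection
print(is_valid_subset(A, {0, 1}))      # True: A . chi_B = (3, 3, 0)
print(is_valid_subset(A, {0, 2}))      # False: A . chi_B = (2, 4, 1)
```

When the returned $P$ is diagonal, $B$ can be read off from its nonzero diagonal entries and checked with `is_valid_subset`. Convergence to a diagonal projection is the typical behavior reported above, and this sketch does not guarantee it.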

While the operator $P$ is just a linear operator, the pseudodeterminism of the algorithm that produces the operator $P$ generalizes to other pseudodeterministic algorithms that return models that are more like deep neural networks.

[-]Joseph Van Name1y30

This post gives an example of some calculations that I did using my own machine learning algorithm. These calculations work out nicely which indicates that the machine learning algorithm I am using is interpretable (and it is much more interpretable than any neural network would be). These calculations show that one can begin with old mathematical structures and produce new mathematical structures, and it seems feasible to completely automate this process to continue to produce more mathematical structures. The machine learning models that I use are linear, but it seems like we can get highly non-trivial results simply by iterating the procedure of obtaining new structures from old using machine learning.

I made a similar post to this one about 7 months ago, but I decided to revisit this experiment with more general algorithms and I have obtained experimental results which I think look nice.

To illustrate how this works, we start off with the octonions. The octonions consist of an 8-dimensional inner product space $V$ together with a bilinear operation $*$ and a unit $1 \in V$ where $1 * v = v * 1 = v$ for all $v \in V$ and where $∥ u * v ∥ = ∥ u ∥ \cdot ∥ v ∥$ for all $u, v \in V$ . The octonions are uniquely determined up to isomorphism from these properties. The operation $*$ is non-associative, but it is closely related to the quaternions and complex numbers. If we take a single element $v \in V ∖ Span (1)$ , then $\{1, v\}$ generates a subalgebra of $(V, *)$ isomorphic to the field of complex numbers, and if $u, v \in V$ and $\{1, u, v\}$ is linearly independent, then $\{1, u, v, u * v\}$ spans a subalgebra of $V$ isomorphic to the division ring of quaternions. For this reason, one commonly thinks of the octonions as the best way to extend the division ring of quaternions to a larger algebraic structure in the same way that the quaternions extend the field of complex numbers. But since the octonions are non-associative, they cannot be used to construct matrices, so they are not as well-known as the quaternions (and the construction of the octonions is more complicated too).

Suppose now that $\{e_{0}, e_{1}, \dots, e_{7}\}$ is an orthonormal basis for the division algebra of octonions with $e_{0} = 1$ . Then define matrices $A_{0}, \dots, A_{7} : V \to V$ by setting $A_{j} v = e_{j} * v$ for all $j$ . Our goal is to transform $(A_{0}, \dots, A_{7})$ into other tuples of matrices that satisfy similar properties.

If $(A_{1}, \dots, A_{r}), (B_{1}, \dots, B_{r})$ are tuples of matrices, then define the $L_{2}$ -spectral radius similarity between $(A_{1}, \dots, A_{r})$ and $(B_{1}, \dots, B_{r})$ as

$∥ (A_{1}, \dots, A_{r}) ≃ (B_{1}, \dots, B_{r}) ∥_{2} =$

$\frac{ρ (A_{1} \otimes \overline{B_{1}} + \dots + A_{r} \otimes \overline{B_{r}})}{ρ (A_{1} \otimes \overline{A_{1}} + \dots + A_{r} \otimes \overline{A_{r}})^{1 / 2} \cdot ρ (B_{1} \otimes \overline{B_{1}} + \dots + B_{r} \otimes \overline{B_{r}})^{1 / 2}}$

where $ρ$ denotes the spectral radius, $\otimes$ is the tensor product, and $\overline{X}$ is the complex conjugate of $X$ applied elementwise.
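For readers who want to experiment, here is a small NumPy sketch of this similarity, with the numerator built from $A_{k} \otimes \overline{B_{k}}$ and each factor of the denominator built from a tuple paired with its own conjugate. The function names are hypothetical, and the random test matrices are only there to exercise two properties that follow directly from the formula: a tuple has similarity $1$ with itself and with any nonzero scalar multiple of itself.

```python
import numpy as np

def spectral_radius(M):
    """Largest absolute value among the eigenvalues of M."""
    return np.abs(np.linalg.eigvals(M)).max()

def l2_similarity(As, Bs):
    """L_2-spectral radius similarity between two equal-length tuples of square matrices."""
    cross = sum(np.kron(A, B.conj()) for A, B in zip(As, Bs))
    aa = sum(np.kron(A, A.conj()) for A in As)
    bb = sum(np.kron(B, B.conj()) for B in Bs)
    return spectral_radius(cross) / np.sqrt(spectral_radius(aa) * spectral_radius(bb))

rng = np.random.default_rng(0)
As = [rng.standard_normal((3, 3)) for _ in range(4)]

self_similarity = l2_similarity(As, As)
scaled_similarity = l2_similarity(As, [(2 - 1j) * A for A in As])
print(self_similarity, scaled_similarity)  # both equal 1 up to rounding
```

The two tuples do not need to have the same matrix size, which is what makes the quantity usable as a fitness function for dimensionality reduction: the $X_{j}$ can be $d \times d$ while the $A_{j}$ are $8 \times 8$ .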

Let $d \in {1, \dots, 8}$ , and let $F_{d}, G_{d}, H_{d}$ denote the maximum value of the fitness level $8 \cdot ∥ (A_{0}, \dots, A_{7}) ≃ (X_{0}, \dots, X_{7}) ∥^{2}$ such that each $X_{j}$ is a complex $d \times d$ anti-symmetric matrix ( $X = - X^{T}$ ), a complex $d \times d$ symmetric matrix ( $X = X^{T}$ ), and a complex $d \times d$ -Hermitian matrix ( $X = X^{*}$ ) respectively.

The following calculations were obtained through gradient ascent, so I have no mathematical proof that the values obtained are actually correct.

$G_{1} = 2$ , $H_{1} = 1$

$G_{2} = 3$ , $H_{2} = 3$

$F_{3} = 1 + \sqrt{3}$ , $G_{3} = 3.5$ , $H_{3} = 1 + 2 \sqrt{2}$

$F_{4} = 4$ , $G_{4} = 4$ , $H_{4} = 1 + 3 \sqrt{2}$

$F_{5} = (5 + \sqrt{13}) / 2$ , $G_{5} = 4.5$ , $H_{5} \approx 5.27155841$

$F_{6} = 5$ , $G_{6} = 5$ , $H_{6} = 3 + 2 \sqrt{2}$

$F_{7} = 6$ , $G_{7} = 2 + 2 \sqrt{3} \approx 5.4641$ , $H_{7} = 1 + 2 \sqrt{7}$

$F_{8} = 7$ , $G_{8} = 6$ , $H_{8} = 7$

Observe that with at most one exception, all of these values $F_{d}, G_{d}, H_{d}$ are algebraic half integers. This indicates that the fitness function that we maximize to produce $F_{d}, G_{d}, H_{d}$ behaves mathematically and can be used to produce new tuples $(X_{1}, \dots, X_{r})$ from old ones $(A_{1}, \dots, A_{r})$ . Furthermore, an AI can determine whether something notable is going on with the new tuple $(X_{1}, \dots, X_{r})$ in several ways. For example, if $∥ (A_{1}, \dots, A_{r}) ≃ (X_{1}, \dots, X_{r}) ∥^{2}$ has low algebraic degree at the local maximum, then $(X_{1}, \dots, X_{r})$ is likely notable and likely behaves mathematically (and is probably quite interpretable too).

The good behavior of $F_{d}, G_{d}, H_{d}$ demonstrates that the octonions are compatible with the $L_{2}$ -spectral radius similarity. The operators $(A_{0}, \dots, A_{7})$ are all orthogonal, and one can take the tuple $(A_{0}, \dots, A_{7})$ as a mixed unitary quantum channel that is very similar to the completely depolarizing channel. The completely depolarizing channel completely mixes every quantum state while the mixture of orthogonal mappings $(A_{0}, \dots, A_{7})$ completely mixes every real state. The $L_{2}$ -spectral radius similarity works very well with the completely depolarizing channel, so one should expect for the $L_{2}$ -spectral radius similarity to also behave well with the octonions.

[-]Joseph Van Name1y*30

Since AI interpretability is a big issue for AI safety, let's completely interpret the results of evolutionary computation.

Disclaimer: This interpretation of the results of AI does not generalize to interpreting deep neural networks. This is a result for interpreting a solution to a very specific problem that is far less complicated than deep learning, and by interpreting, I mean that we iterate a mathematical operation hundreds of times to get an object that is simpler than our original object, so don't get your hopes up too much.

A basis matroid is a pair $(X, M)$ where $X$ is a finite set and $M \subseteq P (X)$ , where $P (X)$ denotes the power set of $X$ , that satisfies the following two properties:

If $A, B \in M$ , then $| A | = | B |$ .
If $A, B \in M, A \neq B, a \in A ∖ B$ , then there is some $b \in B ∖ A$ with $(A ∖ \{a\}) \cup \{b\} \in M$ (the basis exchange property).
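The two properties translate directly into a brute-force check. This is a minimal sketch with a hypothetical function name, intended only for small examples.

```python
from itertools import combinations

def is_basis_matroid(M):
    """Check the equal-size and basis exchange properties for a family of sets."""
    bases = {frozenset(B) for B in M}
    # Property 1: all members have the same size.
    if len({len(B) for B in bases}) > 1:
        return False
    # Property 2: for A != B and a in A \ B, some b in B \ A gives another member.
    for A in bases:
        for B in bases:
            for a in A - B:
                if not any((A - {a}) | {b} in bases for b in B - A):
                    return False
    return True

# All 2-element subsets of {1, 2, 3, 4} form a (uniform) matroid.
uniform = [set(c) for c in combinations(range(1, 5), 2)]
print(is_basis_matroid(uniform))            # True
print(is_basis_matroid([{1, 2}, {3, 4}]))   # False: removing 1 from {1, 2} cannot be repaired
```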

I ran a computer experiment where I obtained, through evolutionary computation, a matroid $(X, M)$ where $| X | = 9$ , $| M | = 68$ and where each element in $M$ has size $4$ ; the population size was kept so low that this evolutionary computation mimicked hill climbing algorithms. Now we need to interpret the matroid $(X, M)$ .

The notion of a matroid has many dualities. Our strategy is to apply one of these dualities to the matroid $(X, M)$ so that the dual object is much smaller than the original object $(X, M)$ . One may formulate the notion of a matroid in terms of closure systems (flats), hyperplanes, closure operators, lattices, a rank function, independent sets, bases, and circuits. If this seems too complicated, many of these dualities are special cases of other dualities common with ordered sets. For example, the duality between closure systems, closure operators, and ordered sets applies to contexts unrelated to matroids such as in general and point-free topology. And the duality between the bases, the circuits, and the hyperplanes may be characterized in terms of rowmotion.

If $(Z, \leq)$ is a partially ordered set, then a subset $A \subseteq Z$ is said to be an antichain if whenever $a, b \in A, a \leq b$ , then $a = b$ . In other words, an antichain is a subset $A$ of $Z$ where the restriction of $\leq$ to $A$ is equality. We say that a subset $L$ of $Z$ is downwards closed if whenever $x \leq y$ and $y \in L$ , then $x \in L$ as well. If $A \subseteq Z$ , then let $L (A)$ denote the smallest downwards closed subset of $Z$ containing $A$ . Suppose that $Z$ is a finite poset. If $A$ is an antichain in $Z$ , then let $A^{'}$ denote the set of all minimal elements in $Z ∖ L (A)$ . Then $A^{'}$ is an antichain as well, and the mapping $A \mapsto A^{'}$ is a bijection from the set of all antichains in $Z$ to itself. This means that if $A$ is an antichain, then we may define $A^{(n)}$ for all integers $n$ by setting $A^{(0)} = A, A^{(n + 1)} = (A^{(n)})^{'}$ .
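Here is a brute-force sketch of the map $A \mapsto A^{'}$ for the special case where $Z$ is the power set of a small finite set ordered by inclusion. The function names are hypothetical, and enumerating every subset is only practical for small ground sets. A set $S$ is minimal in $Z ∖ L (A)$ exactly when $S$ is not below any member of $A$ but every set obtained by deleting one element of $S$ is.

```python
from itertools import combinations

def rowmotion(X, A):
    """Send an antichain A of subsets of X to the minimal subsets outside its downward closure."""
    ground = sorted(X)
    antichain = {frozenset(a) for a in A}

    def in_downset(S):
        return any(S <= a for a in antichain)

    result = set()
    for size in range(len(ground) + 1):
        for combo in combinations(ground, size):
            S = frozenset(combo)
            if not in_downset(S) and all(in_downset(S - {x}) for x in S):
                result.add(S)
    return result

def rowmotion_order(X, A):
    """Least positive n with A^(n) = A; terminates because rowmotion is a bijection."""
    start = {frozenset(a) for a in A}
    current, n = rowmotion(X, start), 1
    while current != start:
        current, n = rowmotion(X, current), n + 1
    return n

X = {1, 2, 3}
M = [{1, 2}, {1, 3}, {2, 3}]     # bases: all 2-element subsets of X
print(rowmotion(X, M))           # {frozenset({1, 2, 3})}
print(rowmotion_order(X, M))     # 5
```

In this tiny example the orbit is: the three 2-element sets, then the single set $\{1, 2, 3\}$ , then the empty antichain, then the antichain containing only the empty set, then the three singletons, and back to the start.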

If $(X, M)$ is a basis matroid, then $M$ is an antichain in $P (X)$ , so we may apply rowmotion, and we say that $(X, M^{(n)})$ is an $n$ -matroid. In this case, the $1$ -matroids are the circuit matroids while the $- 1$ -matroids are the hyperplane matroids. Unfortunately, the $n$ -matroids have not been characterized for $| n | > 1$ . We say that the rowmotion order of $(X, M)$ is the least positive integer $n$ where $M^{(n)} = M$ . If $(X, M)$ is a matroid of order $n$ , then my computer experiments indicate that $gcd (| X | + 2, n) > 1$ , which lends support to the idea that the rowmotion of a matroid is a sensible mathematical notion that may be studied mathematically. The notion of rowmotion of a matroid also appears to be a sensible mathematical notion for other reasons; if we iteratively apply a bijective operation $g$ (such as a reversible cellular automaton) to a finite object $x$ , then that bijective operation will often increase the entropy in the sense that if $x$ has low entropy, then $g^{(n)} (x)$ will typically have a high amount of entropy and look like noise. But this is not the case with matroids as $n$ -matroids do not appear substantially more complicated than basis matroids. Unless and until there is a mundane explanation for this behavior of the rowmotion of matroids, I must consider the notion of rowmotion of matroids to be a mathematically interesting notion even though it is currently not understood by anyone.

With the matroid $(X, M)$ obtained from evolutionary computation, I found that $(X, M)$ has order $1958$ which factorizes as $1958 = 2 \cdot 11 \cdot 89$ . Set $X = \{1, \dots, 9\}$ . By applying rowmotion to this matroid, I found that $M^{(342)}$ ={{1, 8, 9},{2, 3, 6, 8},{2, 3, 7, 9},{4, 5},{4, 6, 9},{4, 7, 8},{5, 6, 9},{5, 7, 8}}. If $(X, M^{(m)})$ is a basis matroid, then $M^{(m)} = M$ , so the set $M^{(342)}$ is associated with a unique basis matroid. This is the smallest way to represent $(X, M)$ in terms of rowmotion since if $| M^{(n)} | \leq 8$ , then $M^{(n)} = M^{(342)}$ .

I consider this a somewhat satisfactory interpretation of the matroid $(X, M)$ that I have obtained through evolutionary computation, but there is still work to do because nobody has researched the rowmotion operation on matroids and because it would be better to simplify a matroid without needing to go through hundreds of layers of rowmotion. And even if we were to understand matroid rowmotion better, this would not help us too much with AI safety since this interpretation of the result of evolutionary computation does not generalize to other AIs, and it certainly does not apply to deep neural networks.

I made a video here where one may see the rowmotion of this matroid and that video is only slightly interpretable.

Deep matroid duality visualization: Rowmotion of a matroid

It turns out that evolutionary computation is not even necessary to construct matroids since Donald Knuth has produced an algorithm that can be used to construct an arbitrary matroid in his 1975 paper on random matroids. I applied rowmotion to the matroid in his paper, and the resulting 10835-matroid is $B^{(10835)}$ ={{1, 2, 4, 5},{1, 2, 6, 10},{1, 3, 4, 6},{1, 3, 4, 7, 9},{1, 3, 6, 7, 9},{1, 4, 6, 7},{1, 4, 6, 9},{1, 4, 8, 10},{2, 3, 4, 5, 6, 7, 8, 9, 10}}. It looks like the rowmotion operation is good for simplifying matroids in general. We can uniquely recover the basis matroid from the 10835-matroid since $B^{(m)}$ is not a basis matroid whenever $0 < m \leq 10835$ .

[-]Joseph Van Name2y30

I originally developed a machine learning notion which I call an LSRDR ( $L_{2}$ -spectral radius dimensionality reduction), and LSRDRs (and similar machine learning models) behave mathematically and they have a high level of interpretability which should be good for AI safety. Here, I am giving an example of how LSRDRs behave mathematically and how one can get the most out of interpreting an LSRDR.

Suppose that $n$ is a natural number. Let $N$ denote the quantum channel that takes an $n$ qubit quantum state, selects one of those qubits at random, and sends that qubit through the completely depolarizing channel (the completely depolarizing channel takes a state as input and returns the completely mixed state as an output).

If $A_{1}, \dots, A_{r}, B_{1}, \dots, B_{r}$ are complex matrices, then define superoperators $Φ (A_{1}, \dots, A_{r})$ and $Γ (A_{1}, \dots, A_{r}; B_{1}, \dots, B_{r})$ by setting

$Φ (A_{1}, \dots, A_{r}) (X) = \sum_{k = 1}^{r} A_{k} X A_{k}^{*}$ and $Γ (A_{1}, \dots, A_{r}; B_{1}, \dots, B_{r}) (X) = \sum_{k = 1}^{r} A_{k} X B_{k}^{*}$ for all $X$ .

Given tuples of matrices $(A_{1}, \dots, A_{r}), (B_{1}, \dots, B_{r})$ , define the $L_{2}$ -spectral radius similarity between these tuples of matrices by setting

$∥ (A_{1}, \dots, A_{r}) ≃ (B_{1}, \dots, B_{r}) ∥_{2}$

$= \frac{ρ (Γ (A_{1}, \dots, A_{r}; B_{1}, \dots, B_{r}))}{ρ (Φ (A_{1}, \dots, A_{r}))^{1 / 2} ρ (Φ (B_{1}, \dots, B_{r}))^{1 / 2}}$ .

Suppose now that $A_{1}, \dots, A_{4 n}$ are matrices where $N = Φ (A_{1}, \dots, A_{4 n})$ . Let $1 \leq d \leq n$ . We say that a tuple of complex $d$ by $d$ matrices $(X_{1}, \dots, X_{4 n})$ is an LSRDR of $A_{1}, \dots, A_{4 n}$ if the quantity $∥ (A_{1}, \dots, A_{4 n}) ≃ (X_{1}, \dots, X_{4 n}) ∥_{2}$ is locally maximized.

Suppose now that $X_{1}, \dots, X_{4 n}$ are complex $2 \times 2$ -matrices and $(X_{1}, \dots, X_{4 n})$ is an LSRDR of $A_{1}, \dots, A_{4 n}$ . Then my computer experiments indicate that there will be some constant $λ$ where $λ Γ (A_{1}, \dots, A_{4 n}; X_{1}, \dots, X_{4 n})$ is similar to a positive semidefinite operator with eigenvalues $\{0, \dots, n + 1\}$ and where the eigenvalue $j$ has multiplicity $3 \cdot C (n - 1, j) + C (n - 1, j - 2)$ where $C (\cdot, \cdot)$ denotes the binomial coefficient. I have not had a chance to try to mathematically prove this. Hooray. We have interpreted the LSRDR $(X_{1}, \dots, X_{4 n})$ of $(A_{1}, \dots, A_{4 n})$ , and I have plenty of other examples of interpreted LSRDRs.

We also have a similar pattern for the spectrum of $Φ (A_{1}, \dots, A_{4 n})$ . My computer experiments indicate that there is some constant $λ$ where $λ \cdot Φ (A_{1}, \dots, A_{4 n})$ has spectrum $\{0, \dots, n\}$ where the eigenvalue $j$ has multiplicity $3^{n - j} \cdot C (n, j)$ .
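As a sanity check on the two multiplicity patterns, the multiplicities should add up to the dimension of the space that each operator acts on. This check reads the index in the first multiplicity formula as the eigenvalue $j$ and uses the convention that $C (a, b) = 0$ when $b < 0$ or $b > a$ ; both are assumptions made for this check. The operator $Φ (A_{1}, \dots, A_{4 n})$ acts on $2^{n} \times 2^{n}$ matrices and $Γ (A_{1}, \dots, A_{4 n}; X_{1}, \dots, X_{4 n})$ acts on $2^{n} \times 2$ matrices, and the sums do match these dimensions:

$\sum_{j = 0}^{n} 3^{n - j} \cdot C (n, j) = (3 + 1)^{n} = 4^{n} = (2^{n})^{2}$

$\sum_{j = 0}^{n + 1} (3 \cdot C (n - 1, j) + C (n - 1, j - 2)) = 3 \cdot 2^{n - 1} + 2^{n - 1} = 2^{n + 1} = 2^{n} \cdot 2$

This does not prove either pattern, but it rules out a simple counting mistake.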

[-]Joseph Van Name2y*30

Linear regression is an interesting machine learning algorithm in the sense that it can be studied mathematically, but it is quite limited since most relations are non-linear. In particular, linear regression does not give us any notion of uncertainty in the output.

One way to extend the notion of the linear regression to encapsulate uncertainty in the outputs is to regress a function not to a linear transformation mapping vectors to vectors, but to regress the function to a transformation that maps vectors to mixed states. And the notion of a quantum channel is an appropriate transformation that maps vectors to mixed states. One can optimize this quantum channel using gradient ascent.

For this post, I will only go through some basic facts about quantum information theory. The reader is referred to the book The Theory of Quantum Information by John Watrous for all the missing details.

Let $V$ be a complex Euclidean space. Let $L (V)$ denote the vector space of linear operators from $V$ to $V$ . Given complex Euclidean spaces $V, W$ , we say that a linear operator $E$ from $L (V)$ to $L (W)$ is trace preserving if $Tr (E (X)) = Tr (X)$ for all $X$ , and we say that $E$ is completely positive if there are linear transformations $A_{1}, \dots, A_{r}$ where $E (X) = A_{1} X A_{1}^{*} + \dots + A_{r} X A_{r}^{*}$ for all $X$ ; the value $r$ is known as the Choi rank of $E$ . A completely positive trace preserving operator is known as a quantum channel.

The collection of quantum channels from $L (V)$ to $L (W)$ is compact and convex.

If $W$ is a complex Euclidean space, then let $D_{p} (W)$ denote the collection of pure states in $W$ . $D_{p} (W)$ can be defined either as the set of unit vectors in $W$ modulo linear dependence, or $D_{p} (W)$ can also be defined as the collection of positive semidefinite rank- $1$ operators on $W$ with trace $1$ .

Given complex Euclidean spaces $U, V$ and a (multi) set of $n$ distinct ordered pairs of unit vectors $f = \{(u_{1}, v_{1}), \dots, (u_{n}, v_{n})\} \subseteq U \times V$ , and given a quantum channel $E : L (U) \to L (V)$ , we define the fitness level $F (f, E) = \sum_{k = 1}^{n} \log (⟨ E (u_{k} u_{k}^{*}) v_{k}, v_{k} ⟩)$ and the loss level $L (f, E) = \sum_{k = 1}^{n} - \log (⟨ E (u_{k} u_{k}^{*}) v_{k}, v_{k} ⟩)$ .
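The loss level can be evaluated directly once a channel is written in terms of operators $A_{1}, \dots, A_{r}$ with $E (X) = \sum_{k} A_{k} X A_{k}^{*}$ . The NumPy sketch below uses hypothetical function names and two standard single-qubit channels as examples: the identity channel and the completely depolarizing channel (built from the four Pauli matrices divided by $2$ ). The trace preservation test uses the standard condition $\sum_{k} A_{k}^{*} A_{k} = I$ .

```python
import numpy as np

def apply_channel(kraus, X):
    """E(X) = sum_k A_k X A_k^* for a list of operators A_k."""
    return sum(A @ X @ A.conj().T for A in kraus)

def is_trace_preserving(kraus, tol=1e-10):
    """Trace preservation is equivalent to sum_k A_k^* A_k = I."""
    dim = kraus[0].shape[1]
    return np.allclose(sum(A.conj().T @ A for A in kraus), np.eye(dim), atol=tol)

def loss(pairs, kraus):
    """L(f, E) = sum over pairs (u, v) of -log <E(u u^*) v, v>."""
    total = 0.0
    for u, v in pairs:
        rho = apply_channel(kraus, np.outer(u, u.conj()))
        total -= np.log(np.real(np.vdot(v, rho @ v)))
    return total

e0 = np.array([1.0, 0.0])
e1 = np.array([0.0, 1.0])

identity_channel = [np.eye(2)]
paulis = [np.eye(2),
          np.array([[0, 1], [1, 0]]),
          np.array([[0, -1j], [1j, 0]]),
          np.array([[1, 0], [0, -1]])]
depolarizing = [p / 2 for p in paulis]

print(loss([(e0, e0)], identity_channel))        # 0: the output matches the target exactly
print(loss([(e0, e0), (e0, e1)], depolarizing))  # 2 * log(2): every overlap equals 1/2
```

Gradient descent on this loss over channels of bounded Choi rank is not shown here; the sketch only evaluates the objective.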

We may locally optimize $E$ to minimize its loss level using gradient descent, but there is a slight problem. The set of quantum channels spans the space $L (L (U), L (V))$ which has dimension $Dim (U)^{2} \cdot Dim (V)^{2}$ . Due to the large dimension, any locally optimal $E$ will contain $Dim (U)^{2} \cdot Dim (V)^{2}$ many parameters, and this is a large quantity of parameters for what is supposed to be just a glorified version of a linear regression. Fortunately, instead of taking all quantum channels into consideration, we can limit the scope to the quantum channels of limited Choi rank.

Empirical Observation: Suppose that $U, V$ are complex Euclidean spaces, $f \subseteq U \times V$ is finite and $r$ is a positive integer. Then computer experiments indicate that there is typically only one quantum channel $E : L (U) \to L (V)$ of Choi rank at most $r$ where $L (f, E)$ is locally minimized. More formally, if we run the experiment twice and produce two quantum channels $E_{1}, E_{2}$ where $L (f, E_{j})$ is locally minimized for $j \in {1, 2}$ , then we would typically have $E_{1} = E_{2}$ . We therefore say that when $L (f, E)$ is minimized, $E$ is the best Choi rank $r$ quantum channel approximation to $f$ .

Suppose now that $f = {(u_{1}, v_{1}), \dots, (u_{n}, v_{n})} \subseteq D_{p} (U) \times D_{p} (V)$ is a multiset. Then we would ideally like to approximate the function $f$ better by alternating between the best Choi rank r quantum channel approximation and a non-linear mapping. An ideal choice of a non-linear but partial mapping is the function $DE$ that maps a positive semidefinite matrix $P$ to its (equivalence class of) unit dominant eigenvector.

Empirical observation: If $f = {(u_{1}, v_{1}), \dots, (u_{n}, v_{n})} \subseteq D_{p} (U) \times D_{p} (V)$ and $E$ is the best Choi rank $r$ quantum channel approximation to $f$ , then let $u_{j}^{♯} = DE (E (u_{j} u_{j}^{*}))$ for all $j$ , and define $f^{♯} = {(u_{1}^{♯}, v_{1}), \dots, (u_{n}^{♯}, v_{n})}$ . Let $U$ be a small open neighborhood of $f^{♯}$ . Let $g \in U$ . Then we typically have $g^{♯ ♯} = g^{♯}$ . More generally, the best Choi rank $r$ quantum channel approximation to $g$ is typically the identity function.

From the above observation, we see that the vector $u_{j}^{♯}$ is an approximation of $v_{j}$ that cannot be improved upon. The mapping $DE \circ E : D_{p} (U) \to D_{p} (V)$ is therefore a trainable approximation to the mapping $f$ , and since $D_{p} (U), D_{p} (V)$ are not even linear spaces (these are complex projective spaces with non-trivial homology groups), the mapping $DE \circ E$ is a non-linear model for the function $f$ .

I have been investigating machine learning models similar to $DE \circ E$ for cryptocurrency research and development as these sorts of machine learning models seem to be useful for evaluating the cryptographic security of some proof-of-work problems and other cryptographic functions like block ciphers and hash functions. I have seen other machine learning models that behave about as mathematically as $DE \circ E$ .

I admit that machine learning models like $DE \circ E$ are currently far from being as powerful as deep neural networks, but since $DE \circ E$ behaves mathematically, the model $DE \circ E$ should be considered a safer and more interpretable AI model. The goal is therefore to develop models that are mathematical like $DE \circ E$ but which can perform more and more machine learning tasks.

(Edited 8/14/2024)

[-]Joseph Van Name2y10

Here is an example of what might happen. Suppose that for each , we select a orthonormal basis $e_{j, 1}, \dots, e_{j, s}$ of unit vectors for $V$ . Let $R = {(u_{j}, e_{j, k}) : 1 \leq j \leq n, 1 \leq k \leq s}$ . Then

Then for each quantum channel $E$ , by the concavity of the logarithm function (which is the arithmetic-geometric mean inequality), we have

$L (R, E) = \sum_{j = 1}^{n} \sum_{k = 1}^{s} - log (⟨ E (u_{j} u_{j}^{*}) e_{j, k}, e_{j, k} ⟩)$

$\geq \sum_{j = 1}^{n} - s \cdot log (\frac{1}{s} \sum_{k = 1}^{s} ⟨ E (u_{j} u_{j}^{*}) e_{j, k}, e_{j, k} ⟩)$

$= \sum_{j = 1}^{n} - s \cdot log (Tr (E (u_{j} u_{j}^{*})) / s)$ . Here, equality is reached if and only if

$⟨ E (u_{j} u_{j}^{*}) e_{j, k}, e_{j, k} ⟩ = ⟨ E (u_{j} u_{j}^{*}) e_{j, l}, e_{j, l} ⟩$ for each $j, k, l$ , but this equality can be achieved by the channel

defined by $E (X) = Tr (X) \cdot I / s$ which is known as the completely depolarizing channel. This is the channel that always takes a quantum state and returns the completely mixed state. On the other hand, the channel $E$ has maximum Choi rank since the Choi representation of $E$ is just the identity function divided by the rank. This example is not unexpected since for each input of $R$ the possible outputs span the entire space $V$ evenly, so one does not have any information about the output from any particular input except that we know that the output could be anything. This example shows that the channels that locally minimize the loss function $L (R, E)$ are the channels that give us a sort of linear regression of $R$ but where this linear regression takes into consideration uncertainty in the output, so the regression of an output of a state is a mixed state rather than a pure state.
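The example above can be checked numerically. The following is a minimal sketch, not the code used for the experiments: it assumes the $u_{j}$ are unit vectors, takes $dim (U) = dim (V) = s$ so that a unitary channel is available for comparison, and the names `random_unitary`, `loss`, `depolarizing` and `unitary_channel` are hypothetical helpers introduced only for illustration. It evaluates $L (R, E)$ for the completely depolarizing channel and for a random unitary channel.

```python
import numpy as np

rng = np.random.default_rng(0)
n, s = 5, 3  # n inputs; s = dim V (assumption: dim U = s as well)

def random_unitary(k):
    """Random k x k unitary obtained from a QR decomposition (illustrative helper)."""
    q, _ = np.linalg.qr(rng.standard_normal((k, k)) + 1j * rng.standard_normal((k, k)))
    return q

us = [random_unitary(s)[:, 0] for _ in range(n)]   # unit vectors u_j
bases = [random_unitary(s) for _ in range(n)]      # columns are e_{j,1}, ..., e_{j,s}

def loss(channel):
    """L(R, E) = sum over j, k of -log <E(u_j u_j^*) e_{j,k}, e_{j,k}>."""
    total = 0.0
    for u, basis in zip(us, bases):
        out = channel(np.outer(u, u.conj()))
        for k in range(s):
            e = basis[:, k]
            total += -np.log(np.real(e.conj() @ out @ e))
    return total

def depolarizing(X):
    """Completely depolarizing channel E(X) = Tr(X) * I / s."""
    return np.trace(X) * np.eye(s) / s

U = random_unitary(s)

def unitary_channel(X):
    """A unitary channel X -> U X U^*, used only as a comparison point."""
    return U @ X @ U.conj().T

loss_depolarizing = loss(depolarizing)
loss_unitary = loss(unitary_channel)
print(loss_depolarizing, n * s * np.log(s), loss_unitary)
```

For unit vectors $u_{j}$ , every term for the depolarizing channel is $log (s)$ , so its loss is $n \cdot s \cdot log (s)$ , and the unitary channel should not do better.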

[-]Joseph Van Name2y30

We can use the spectral radius similarity to measure more complicated similarities between data sets.

Suppose that $A_{1}, \dots, A_{r}$ are $m \times m$ -real matrices and $B_{1}, \dots, B_{r}$ are $n \times n$ -real matrices. Let $ρ (A)$ denote the spectral radius of $A$ and let $A \otimes B$ denote the tensor product of $A$ with $B$ . Define the $L_{2}$ -spectral radius by setting $ρ_{2} (A_{1}, \dots, A_{r}) = ρ (A_{1} \otimes A_{1} + \dots + A_{r} \otimes A_{r})^{1 / 2}$ . Define the $L_{2}$ -spectral radius similarity between $A_{1}, \dots, A_{r}$ and $B_{1}, \dots, B_{r}$ as

$∥ (A_{1}, \dots, A_{r}) ≃ (B_{1}, \dots, B_{r}) ∥_{2} = \frac{ρ (A_{1} \otimes B_{1} + \dots + A_{r} \otimes B_{r})}{ρ_{2} (A_{1}, \dots, A_{r}) ρ_{2} (B_{1}, \dots, B_{r})}$ .

We observe that if $C$ is invertible and $λ$ is a non-zero constant, then

$∥ (A_{1}, \dots, A_{r}) ≃ (λ C A_{1} C^{- 1}, \dots, λ C A_{r} C^{- 1}) ∥_{2} = 1.$

Therefore, the $L_{2}$ -spectral radius similarity is able to detect and measure symmetry that is normally hidden.
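The definitions above translate directly into a few lines of NumPy. This is a small sketch for real matrices, with hypothetical function names (`spectral_radius`, `l2_spectral_radius`, `similarity`) that are not taken from any existing library. It compares a tuple of random matrices with a rescaled and conjugated copy of itself, and then with an unrelated random tuple.

```python
import numpy as np

def spectral_radius(M):
    """Largest absolute value of an eigenvalue of the square matrix M."""
    return np.max(np.abs(np.linalg.eigvals(M)))

def l2_spectral_radius(As):
    """rho_2(A_1, ..., A_r) = rho(sum_j A_j (x) A_j)^(1/2) for real matrices."""
    return np.sqrt(spectral_radius(sum(np.kron(A, A) for A in As)))

def similarity(As, Bs):
    """L_2-spectral radius similarity between two tuples of real matrices."""
    cross = spectral_radius(sum(np.kron(A, B) for A, B in zip(As, Bs)))
    return cross / (l2_spectral_radius(As) * l2_spectral_radius(Bs))

rng = np.random.default_rng(0)
r, m = 3, 4
As = [rng.standard_normal((m, m)) for _ in range(r)]

# A well-conditioned change of basis C and a non-zero scalar (both arbitrary choices).
C = np.eye(m) + 0.3 * rng.standard_normal((m, m))
lam = 2.5
C_inv = np.linalg.inv(C)
conjugated = [lam * C @ A @ C_inv for A in As]
unrelated = [rng.standard_normal((m, m)) for _ in range(r)]

sim_conjugated = similarity(As, conjugated)   # expected to be 1 up to rounding
sim_unrelated = similarity(As, unrelated)     # expected to lie in [0, 1]
print(sim_conjugated, sim_unrelated)
```

The Kronecker sums here have size $m n \times m n$ , so this direct approach is only practical for small matrices.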

Example: Suppose that $u_{1}, \dots, u_{r}; v_{1}, \dots, v_{r}$ are vectors of possibly different dimensions. Suppose that we would like to determine how close we are to obtaining an affine transformation $T$ with $T (u_{j}) = v_{j}$ for all $j$ (or a slightly different notion of similarity). We should first normalize these vectors to obtain vectors $x_{1}, \dots, x_{r}; y_{1}, \dots, y_{r}$ with mean zero and where the covariance matrix is the identity matrix (we may not need to do this depending on our notion of similarity). Then $∥ (x_{1} x_{1}^{*}, \dots, x_{r} x_{r}^{*}) ≃ (y_{1} y_{1}^{*}, \dots, y_{r} y_{r}^{*}) ∥_{2}$ is a measure of how close we are to obtaining such an affine transformation $T$ . We may be able to apply this notion to determining the distance between machine learning models. For example, suppose that $M, N$ are each the first few layers of a (typically different) neural network. Suppose that $a_{1}, \dots, a_{r}$ is a set of data points. Then if $u_{j} = M (a_{j})$ and $v_{j} = N (a_{j})$ , then $∥ (x_{1} x_{1}^{*}, \dots, x_{r} x_{r}^{*}) ≃ (y_{1} y_{1}^{*}, \dots, y_{r} y_{r}^{*}) ∥_{2}$ is a measure of the similarity between $M$ and $N$ .

I have actually used this example to see if there is any similarity between two different neural networks trained on the same data set. For my experiment, I chose a random collection $S \subseteq \{0, 1\}^{32} \times \{0, 1\}^{32}$ of ordered pairs and I trained the neural networks $M, N$ to minimize the expected losses $E (∥ N (a) - b ∥^{2} : (a, b) \in S), E (∥ M (a) - b ∥^{2} : (a, b) \in S)$ . In my experiment, each $a_{j}$ was a random vector of length 32 whose entries were 0's and 1's. In my experiment, the similarity $∥ (x_{1} x_{1}^{*}, \dots, x_{r} x_{r}^{*}) ≃ (y_{1} y_{1}^{*}, \dots, y_{r} y_{r}^{*}) ∥_{2}$ was worse than if $x_{1}, \dots, x_{r}, y_{1}, \dots, y_{r}$ were just random vectors.

This simple experiment suggests that trained neural networks retain too much random or pseudorandom data and are far too messy for anyone to develop a good understanding or interpretation of these networks. In my personal opinion, neural networks should be avoided in favor of other AI systems, but we need to develop these alternative AI systems so that they eventually outperform neural networks. I have personally used the $L_{2}$ -spectral radius similarity to develop such non-messy AI systems including LSRDRs, but these non-neural non-messy AI systems currently do not perform as well as neural networks for most tasks. For example, I currently cannot train LSRDR-like structures to do any more NLP than just a word embedding, but I can train LSRDRs to do tasks that I have not seen neural networks perform (such as a tensor dimensionality reduction).

[-]Joseph Van Name9mo20

In this post, we shall go over a way to produce mostly linear machine learning classification models that output probabilities for each possible label. These mostly linear models are pseudodeterministically trained (or pseudodeterministic for short) in the sense that if we train them multiple times with different initializations, we will typically get the same trained model (up to symmetry and minuscule floating point differences).

The algorithms that I am mentioning in this post generalize to more complicated multi-layered algorithms in the sense that the multi-layered algorithms remain pseudodeterministic, but for simplicity, we shall stick to just linear operators here.

Let $K$ denote either the field of real numbers, the field of complex numbers, or the division ring of quaternions. Let $U$ be a finite dimensional inner product space over $K$ . The training data is a set $D$ of pairs $(u, v)$ where $u \in U$ and $v \in \{1, \dots, n\}$ where $u$ is the machine learning model input and $v$ is the label. The machine learning model is trained to predict the label $v$ when given the input $u$ . The trained model is a function $f$ that maps $U$ to the set of all probability vectors of length $n$ , so the trained model actually gives the probabilities for each possible label.

Suppose that $V_{i}$ is a finite dimensional inner product space over $K$ for each $i \in {1, \dots, n}$ . Then the domain of the fitness function consists of tuples $(A_{1}, \dots, A_{n})$ where each $A_{i}$ is a linear operator from $U$ to $V_{i}$ . Let $p \in (0, 1), r \in (0, \infty), q \in (1, \infty)$ , and let $λ \geq 0$ . The parameter $p$ is the exponent while $λ$ is the regularization parameter. Define (almost total) functions $G, R, F : L (U, V_{1}) \times \dots \times L (U, V_{n}) \to R$ by setting

$G (A_{1}, \dots, A_{n}) = \sum_{(u, v) \in D} (\frac{∥ A_{v} u ∥^{r}}{∥ A_{1} u ∥^{r} + \dots + ∥ A_{n} u ∥^{r}})^{p} / | D |$

$R (A_{1}, \dots, A_{n}) = (\sum_{(u, v) \in D} λ \cdot log (∥ A_{v} u ∥) / | D |)$

$- λ \cdot (log (∥ A_{1} ∥_{q}) + \dots + log (∥ A_{n} ∥_{q})) / n$ .

Here, $∥ * ∥_{q}$ denotes the Schatten $q$ -norm which can be defined by setting

$∥ A ∥_{q} = Tr ((A A^{*})^{q / 2})^{1 / q}$ .

Set $F = G + R$ . Here, $F$ denotes our fitness function. The function $G$ is what we really want to maximize, but unfortunately, $G$ is typically non-pseudodeterministic, so we need to add the regularization term $R$ to obtain pseudodeterminism. The regularization term $R$ also has the added effect of making $∥ A_{v} u ∥$ relatively large compared to the norm $∥ A_{v} ∥_{q}$ for training data points $(u, v)$ . This may be useful in determining whether a pair should belong to either the training or test data in the first place.

We observe that $F$ is $0$ -homogeneous in the sense that $F (A_{1}, \dots, A_{n}) = F (c A_{1}, \dots, c A_{n})$ for each non-zero scalar $c$ (in the quaternionic case, the scalars are just the real numbers).

Suppose now that we have obtained a tuple $(A_{1}, \dots, A_{n})$ that maximizes the fitness $F (A_{1}, \dots, A_{n})$ . Let $P V (n)$ denote the set of all probability vectors of length $n$ . Then define an almost total function $f : U \to P V (n)$ by setting

$f (u) = \frac{(∥ A_{1} u ∥^{r (1 - p)}, \dots, ∥ A_{n} u ∥^{r (1 - p)})}{∥ A_{1} u ∥^{r (1 - p)} + \dots + ∥ A_{n} u ∥^{r (1 - p)}} .$

If $(u, v)$ belongs to the training data set, then the $i$ -th entry of $f (u)$ is the machine learning model's estimate of the probability that $i = v$ . I will let the reader justify this calculation of the probabilities.
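To make the definitions of $F$ and $f$ concrete, here is a short NumPy sketch over the real numbers. It is only an illustration under stated assumptions: labels are 0-based rather than 1-based, the Schatten $q$ -norm is computed as the $ℓ_{q}$ -norm of the singular values, the operators are random rather than trained, and the function names (`schatten_norm`, `fitness`, `predict`) are hypothetical. It evaluates the fitness, checks the $0$ -homogeneity numerically, and computes one probability vector.

```python
import numpy as np

def schatten_norm(A, q):
    """Schatten q-norm: the l_q norm of the singular values of A."""
    s = np.linalg.svd(A, compute_uv=False)
    return np.sum(s ** q) ** (1.0 / q)

def fitness(As, data, p, r, q, lam):
    """F = G + R for operators As[0..n-1] and data = [(u, v), ...] with v in 0..n-1."""
    n = len(As)
    G = 0.0
    R = 0.0
    for u, v in data:
        norms = np.array([np.linalg.norm(A @ u) for A in As])
        G += (norms[v] ** r / np.sum(norms ** r)) ** p
        R += lam * np.log(norms[v])
    G /= len(data)
    R /= len(data)
    R -= lam * sum(np.log(schatten_norm(A, q)) for A in As) / n
    return G + R

def predict(As, u, p, r):
    """Probability vector f(u) with entries proportional to ||A_i u||^(r(1-p))."""
    w = np.array([np.linalg.norm(A @ u) for A in As]) ** (r * (1.0 - p))
    return w / w.sum()

rng = np.random.default_rng(0)
n_labels, dim_u, dim_v = 3, 4, 2
As = [rng.standard_normal((dim_v, dim_u)) for _ in range(n_labels)]
data = [(rng.standard_normal(dim_u), int(rng.integers(n_labels))) for _ in range(10)]
p, r, q, lam = 0.5, 2.0, 2.0, 0.1

base = fitness(As, data, p, r, q, lam)
scaled = fitness([3.7 * A for A in As], data, p, r, q, lam)  # same value by 0-homogeneity
probs = predict(As, data[0][0], p, r)
print(base, scaled, probs)
```

Training would then consist of gradient ascent on `fitness` with respect to the entries of the operators, which this sketch does not attempt.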

We can generalize the function $f$ to pseudodeterministically trained machine learning models with multiple layers by replacing the linear operators $A_{1}, \dots, A_{n}$ with some non-linear or multi-linear operators. Actually, there are quite a few ways of generalizing the fitness function $F$ , and I have taken some liberty in the exact formulation for $F$ .

In addition to being pseudodeterministic, the fitness function $F$ has other notable desirable properties. For example, when maximizing $F$ using gradient ascent, one tends to converge to the local maximum at an exponential rate without needing to decay the learning rate.

[-]Joseph Van Name2y*20

So in my research into machine learning algorithms that I can use to evaluate small block ciphers for cryptocurrency technologies, I have just stumbled upon a dimensionality reduction for tensors in tensor products of inner product spaces that according to my computer experiments exists, is unique, and which reduces a real tensor to another real tensor even when the underlying field is the field of complex numbers. I would not be too surprised if someone else came up with this tensor dimensionality reduction before since it has a rather simple description and it is in a sense a canonical tensor dimensionality reduction when we consider tensors as homogeneous non-commutative polynomials. But even if this tensor dimensionality reduction is not new, this dimensionality reduction algorithm belongs to a broader class of new algorithms that I have been studying recently such as LSRDRs.

Suppose that $K$ is either the field of real numbers or the field of complex numbers. Let $V_{1}, \dots, V_{n}$ be finite dimensional inner product spaces over $K$ with dimensions $d_{1}, \dots, d_{n}$ respectively. Suppose that $V_{i}$ has orthonormal basis $e_{i, 1}, \dots, e_{i, d_{i}}$ . Given $v \in V_{1} \otimes \dots \otimes V_{n}$ , we would sometimes want to approximate the tensor $v$ with a tensor that has fewer parameters. Suppose that $(m_{0}, \dots, m_{n})$ is a sequence of natural numbers with $m_{0} = m_{n} = 1$ . Suppose that $X_{i, j}$ is a $m_{i - 1} \times m_{i}$ matrix over the field $K$ for $1 \leq i \leq n$ and $1 \leq j \leq d_{i}$ . From the system of matrices $(X_{i, j})_{i, j}$ , we obtain a tensor $T ((X_{i, j})_{i, j}) = \sum_{i_{1}, \dots, i_{n}} e_{1, i_{1}} \otimes \dots \otimes e_{n, i_{n}} \cdot X_{1, i_{1}} \dots X_{n, i_{n}}$ . If the system of matrices $(X_{i, j})_{i, j}$ locally minimizes the distance $∥ v - T ((X_{i, j})_{i, j}) ∥$ , then the tensor $T ((X_{i, j})_{i, j})$ is a dimensionality reduction of $v$ which we shall denote by $u$ .
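The construction of $T ((X_{i, j})_{i, j})$ from the matrices can be sketched in NumPy as follows. This covers only the map from matrices to tensor coordinates in the chosen bases and not the minimization of $∥ v - T ((X_{i, j})_{i, j}) ∥$ ; indices are 0-based, and the name `tensor_from_matrices` and the example dimensions are assumptions made for illustration.

```python
import numpy as np

def tensor_from_matrices(X):
    """X[i][j] is an m_i x m_{i+1} matrix (0-based i) with m_0 = m_n = 1.

    Returns the array T with T[i_1, ..., i_n] equal to the single entry of the
    1 x 1 matrix product X[0][i_1] @ X[1][i_2] @ ... @ X[n-1][i_n].
    """
    partial = np.ones((1,))  # the trailing axis always has the current size m_i
    for factors in X:
        stacked = np.stack(factors)  # shape (d_i, m_{i-1}, m_i)
        # Contract the running m_{i-1} axis with the row axis of every factor.
        partial = np.tensordot(partial, stacked, axes=([partial.ndim - 1], [1]))
    return partial[..., 0]  # drop the final axis of size m_n = 1

rng = np.random.default_rng(0)
d = (2, 3, 2)      # dimensions d_1, d_2, d_3 of V_1, V_2, V_3
m = (1, 2, 3, 1)   # the sequence m_0, ..., m_n
X = [[rng.standard_normal((m[i], m[i + 1])) for _ in range(d[i])]
     for i in range(len(d))]

T = tensor_from_matrices(X)
n_params = sum(d[i] * m[i] * m[i + 1] for i in range(len(d)))
print(T.shape, n_params)
```

Each entry of `T` is one matrix product, so the sketch makes it easy to compare the number of parameters in the matrices with the number of entries of the full tensor.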

Intuition: One can associate the tensor product $V_{1} \otimes \dots \otimes V_{n}$ with the set of all degree $n$ homogeneous non-commutative polynomials that consist of linear combinations of the monomials of the form $x_{1, i_{1}} \dots x_{n, i_{n}}$ . Given our matrices $X_{i, j}$ , we can define a linear functional $ϕ : V_{1} \otimes \dots \otimes V_{n} \to K$ by setting $ϕ (p) = p ((X_{i, j})_{i, j})$ . But by the Riesz representation theorem, the linear functional $ϕ$ is dual to some tensor in $V_{1} \otimes \dots \otimes V_{n}$ . More specifically, $ϕ$ is dual to $T ((X_{i, j})_{i, j})$ . The tensors of the form $T ((X_{i, j})_{i, j})$ are therefore the

Advantages:

In my computer experiments, the reduced dimension tensor $u$ is often (but not always) unique in the sense that if we calculate the tensor $u$ twice, then we will get the same tensor. At least, the distribution of reduced dimension tensors $u$ will have low Renyi entropy. I personally consider the partial uniqueness of the reduced dimension tensor to be advantageous over total uniqueness since this partial uniqueness signals whether one should use this tensor dimensionality reduction in the first place. If the reduced tensor is far from being unique, then one should not use this tensor dimensionality reduction algorithm. If the reduced tensor is unique or at least has low Renyi entropy, then this dimensionality reduction works well for the tensor $v$ .
This dimensionality reduction does not depend on the choice of orthonormal basis $e_{i, 1}, \dots, e_{i, d_{i}}$ . If we chose a different basis for each $V_{i}$ , then the resulting tensor $u$ of reduced dimensionality will remain the same (the proof is given below).
If $K$ is the field of complex numbers, but all the entries in the tensor $v$ happen to be real numbers, then all the entries in the tensor $u$ will also be real numbers.
This dimensionality reduction algorithm is intuitive when tensors are considered as homogeneous non-commutative polynomials.

Disadvantages:

This dimensionality reduction depends on a canonical cyclic ordering of the inner product spaces $V_{1}, \dots, V_{n}$ .
Other notions of dimensionality reduction for tensors such as the CP tensor dimensionality reduction and the Tucker decompositions are more well-established, and they are obviously attempted generalizations of the singular value decomposition to higher dimensions, so they may be more intuitive to some.
The tensors of reduced dimensionality $T ((X_{i, j})_{i, j})$ have a more complicated description than the tensors in the CP tensor dimensionality reduction.

Proposition: The set of tensors of the form $\sum_{i_{1}, \dots, i_{n}} e_{1, i_{1}} \otimes \dots \otimes e_{n, i_{n}} X_{1, i_{1}} \dots X_{n, i_{n}}$ does not depend on the choice of bases $(e_{i, 1}, \dots, e_{i, d_{i}})_{i}$ .

Proof: For each $i$ , let $f_{i, 1}, \dots, f_{i, d_{i}}$ be an alternative basis for $V_{i}$ . Then suppose that $e_{i, j} = \sum_{k} u_{i, j, k} f_{i, k}$ for each $i, j$ . Then

$\sum_{i_{1}, \dots, i_{n}} e_{1, i_{1}} \otimes \dots \otimes e_{n, i_{n}} X_{1, i_{1}} \dots X_{n, i_{n}}$

$= \sum_{i_{1}, \dots, i_{n}} (\sum_{k_{1}} u_{1, i_{1}, k_{1}} f_{1, k_{1}}) \otimes \dots \otimes (\sum_{k_{n}} u_{n, i_{n}, k_{n}} f_{n, k_{n}}) X_{1, i_{1}} \dots X_{n, i_{n}}$

$= \sum_{k_{1}, \dots, k_{n}} f_{1, k_{1}} \otimes \dots \otimes f_{n, k_{n}} \sum_{i_{1}, \dots, i_{n}} u_{1, i_{1}, k_{1}} \dots u_{n, i_{n}, k_{n}} X_{1, i_{1}} \dots X_{n, i_{n}}$

$= \sum_{k_{1}, \dots, k_{n}} f_{1, k_{1}} \otimes \dots \otimes f_{n, k_{n}} (\sum_{i_{1}} u_{1, i_{1}, k_{1}} X_{1, i_{1}}) \dots (\sum_{i_{n}} u_{n, i_{n}, k_{n}} X_{n, i_{n}})$ . Q.E.D.

A failed generalization: An astute reader may have observed that if we drop the requirement that $m_{n} = 1$ , then we get a linear functional defined by letting

$ϕ (p) = Tr (p ((X_{i, j})_{i, j}))$ . This is indeed a linear functional, and we can try to approximate $v$ using the dual to $ϕ$ , but this approach does not work as well.

[-]Joseph Van Name6d10

In this post, we shall compute average loss/fitness level for a linear dimensionality reduction.

The purpose of these calculations is to demonstrate that such a linear dimensionality reduction behaves mathematically and should be used as a simple model for what your loss/fitness functions should look like in AI/ML if you want your AI/ML to be well-behaved and interpretable.

Suppose that is either the field of real numbers, the field of complex numbers, or the division ring of quaternions. Suppose that is a -dimensional inner product space over the field

Suppose that is a measure over the unit sphere in . Then the objective is to find an optimal -dimensional subspace of for the measure . Let be a function. Therefore, define a function mapping the set of all -dimensional orthogonal projection matrices to by setting . The goal is to find an orthogonal projection that maximize/minimizes .

Let . Then, let be independent random variables each following the standard normal distribution on one real-variable. Then observe that follows the Chi-squared distribution with degrees of freedom. If follows the Chi-square distribution with degrees of freedom, then where is the digamma function. Let be a probability measure on the unit sphere of , and let be the uniform probability measure on the set of all orthogonal projections from to of rank . Then

where the random variable follows the F-distribution with and degrees of freedom. From standard facts about the F-distribution, we know that if and is a positive integer, then

. Observe that precisely when

, so in this case when , then

, and

diverges whenever .

Here is the digamma function where For integers and half-integers, the digamma function can be evaluated as where is the Euler-Mascheroni constant, and which is a harmonic number. Thus, in the case where both are even (which includes the complex and quaternionic case), we have

[-]Joseph Van Name14d10

I was able to completely interpret a simple machine learning model trained on some cryptographic input. This objective is a special case of something I call an LSRDR which is a machine learning algorithm that I created in order to analyze block ciphers for cryptocurrency mining.

Set Let denote the finite field with elements. For each , let be the function defined by . Let denote the standard irreducible representation of . Here, can be represented somewhat inconveniently as an -matrix. Then our objective is to find a unit vector such that the spectral radius is locally maximized. Sometimes I obtain a bad local maximum, but sometimes I obtain a good one. Whenever I obtain a good local maximum, it is always the same thing. And in this case, for the good local maximum, after multiplying by -1 for positivity, I can always find positive constants such that whenever , whenever and .

Here, , .

The scenario where we obtain an overly perfect and complete interpretation of the local optimum happens all the time with these sorts of optimization algorithms that I have been working on, so if we want to develop more interpretable machine learning, it seems like this is the right direction to go. Of course, my trained model is very simple, so we need to do a substantial amount of work to generalize this sort of machine learning algorithm to something like a deep neural network. I am making progress, but it takes more computational power than I have to make progress with inherently interpretable deep learning.

[-]Joseph Van Name22d*10

This post will be about my machine learning algorithm where quadratic algebraic numbers including the golden ratio appear in the trained models. This demonstrates that these machine learning models behave mathematically which is exactly the kind of thing that we want for AI interpretability and AI safety.

This post will be about particular examples of -spectral radius dimensionality reductions (LSRDRs). I originally developed the notion of an LSRDR to evaluate the cryptographic security of block ciphers for cryptocurrency mining, but let's talk about machine learning instead of cryptocurrency technologies here.

Also, the results in this post have been obtained experimentally. I have not proven these results rigorously.

Dimensionality reduction: Let denote either the field of real or complex numbers. Suppose that are -matrices over and are -matrices over . Then define the operation by setting . Define the operator .

Define the -spectral radius similarity by setting

Here, the spectral radius is analogous to a dot product, and is analogous to the cosine similarity.

If are fixed matrices and , then we say that is an -SRDR if the similarity is locally maximized. Informally, the LSRDR is a collection of smaller matrices that approximates the collection of bigger matrices.

Lie algebras: A Lie algebra is a vector space over a field together with a bilinear operation that satisfies the identities:

for all
for all .

For example, if is an associative bilinear operation, then one can check that the commutator operation defined by is a Lie-bracket, and a Lie algebra should be thought of as a vector space with an abstract commutator operation.

Let denote the Lie algebra of -anti-symmetric matrices over where the Lie algebra operation is just the commutator For the rest of this post, we shall set . Then is a Lie algebra of dimension

Set and let be an orthonormal basis for . Use the standard orthonormal basis if you want, but it does not matter which basis you choose.

An observation about the spectrum: Let be the linear operators defined by setting for each . Let be an -SRDR of . It turns out that the spectrum eventually stabilizes in the sense that if we keep constant and set greater than around or so, then does not depend on whenever Therefore, let denote the multiset for sufficiently large Then is the multi-set

multiplied by a constant scaling factor. Here, the notion means that the eigenvalue has multiplicity .

The general pattern:

So if we want to get interesting experimental results about LSRDRs, then we just need to do the following. We first select a finite dimensional inner product space with an interesting bilinear operation , but make sure that is not associative. We then select an orthonormal basis of and define linear operators by . Then take an LSRDR of and then the operators will have interesting spectra.

Testing if a number is quadratic:

After evaluating the spectra, I needed to first normalize the spectrum and then try to figure out the exact values of the eigenvalues from their floating point approximations. This is easy to do for quadratic algebraic numbers. You just take the continued fraction representation of the number that you want to test. If the continued fraction representation terminates, then you have a rational number. And the continued fraction of a positive irrational number is eventually periodic if and only if the number is a solution to a quadratic equation with integer coefficients, and it is easy to find those coefficients from the continued fraction representation.
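The continued fraction test described above can be sketched in plain Python. This is a rough illustration and not the code used for the experiments: it works on floating point input, so only the first few partial quotients are trustworthy, and the cutoffs (`terms=10`, the `1e-9` tolerance) and the function names are arbitrary assumptions. A quadratic irrational with a long period would be reported as "unknown" by this sketch.

```python
import math

def continued_fraction(x, terms=10, tol=1e-9):
    """First partial quotients of x > 0, plus a flag saying whether the
    expansion appeared to terminate (i.e. x looks rational)."""
    cf = []
    for _ in range(terms):
        a = math.floor(x)
        cf.append(a)
        frac = x - a
        if frac < tol:
            return cf, True
        x = 1.0 / frac
    return cf, False

def find_period(cf, min_repeats=2):
    """Return (preperiod, period) if the tail of cf repeats a block at least
    min_repeats times, otherwise None."""
    n = len(cf)
    for start in range(n):
        tail = cf[start:]
        for length in range(1, len(tail) // min_repeats + 1):
            block = tail[:length]
            if all(tail[i] == block[i % length] for i in range(len(tail))):
                return cf[:start], block
    return None

def classify(x):
    """Heuristic label: 'rational', 'quadratic', or 'unknown'."""
    cf, terminated = continued_fraction(x)
    if terminated:
        return "rational"
    if find_period(cf) is not None:
        return "quadratic"
    return "unknown"

golden = (1 + 5 ** 0.5) / 2
print(classify(golden), find_period(continued_fraction(golden)[0]))
print(classify(2 ** 0.5), classify(0.75))
```

A detected periodic block is evidence rather than proof, since finitely many partial quotients of a floating point number can never certify periodicity.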

Are LSRDRs relevant to deep learning?

LSRDRs are linear models without all the layers that deep neural networks have. But I have been generalizing LSRDRs to deeper machine learning models that retain some but not all of the interesting mathematical properties of LSRDRs. I would therefore consider these investigations into LSRDRs as relevant to deep learning.

[-]Joseph Van Name1y10

In this post, we shall describe 3 related fitness functions with discrete domains where the process of maximizing these functions is pseudodeterministic in the sense that if we locally maximize the fitness function multiple times, then we typically attain the same local maximum; this appears to be an important aspect of AI safety. These fitness functions are my own. While these functions are far from deep neural networks, I think they are still related to AI safety since they are closely related to other fitness functions that are locally maximized pseudodeterministically that more closely resemble deep neural networks.

Let $K$ denote a finite dimensional algebra over the field of real numbers together with an adjoint operation $*$ (the operation $*$ is a linear involution with $(x y)^{*} = y^{*} x^{*}$ ). For example, $K$ could be the field of real numbers, complex numbers, quaternions, or a matrix ring over the reals, complex numbers, or quaternions. We can extend the adjoint $*$ to the matrix ring $M_{r} (K)$ by setting $(x_{i, j})_{i, j}^{*} = (x_{j, i}^{*})_{i, j}$ .

Let $n$ be a natural number. If $A_{1}, \dots, A_{r} \in M_{n} (K), X_{1}, \dots, X_{r} \in M_{d} (K)$ , then define

$Γ (A_{1}, \dots, A_{r}; X_{1}, \dots, X_{r}) : M_{n, d} (K) \to M_{n, d} (K)$ by setting $Γ (A_{1}, \dots, A_{r}; X_{1}, \dots, X_{r}) (X) = A_{1} X X_{1}^{*} + \dots + A_{r} X X_{r}^{*}$ .

Suppose now that $1 \leq d < n$ . Then let $S_{d} \subseteq M_{n, n} (K)$ be the set of all $0, 1$ -diagonal matrices with $d$ many $1$ 's on the diagonal. We observe that each element in $S_{d}$ is an orthogonal projection. Define fitness functions $F_{d}, G_{d}, H_{d} : S_{d} \to R$ by setting

$F_{d} (P) = ρ (Γ (A_{1}, \dots, A_{r}; P A_{1} P, \dots, P A_{r} P))$ ,

$G_{d} (P) = ρ (Γ (P A_{1} P, \dots, P A_{r} P; P A_{1} P, \dots, P A_{r} P))$ , and

$H_{d} (P) = \frac{F_{d} (P)^{2}}{G_{d} (P)}$ . Here, $ρ$ denotes the spectral radius.

$F_{d} (P)$ is typically slightly larger than $G_{d} (P)$ , so these three fitness functions are closely related.

If $P, Q \in S_{d}$ , then we say that $Q$ is in the neighborhood of $P$ if $Q$ differs from $P$ by at most 2 entries. If $F$ is a fitness function with domain $S_{d}$ , then we say that $(P, F (P))$ is a local maximum of the function $F$ if $F (P) \geq F (Q)$ whenever $Q$ is in the neighborhood of $P$ .

The path from initialization to a local maximum $(P_{s}, F (P_{s}))$ for $F$ will be a sequence $(P_{0}, \dots, P_{s})$ where $P_{0}$ is generated uniformly at random, where $P_{j}$ is always in the neighborhood of $P_{j - 1}$ , and where $F (P_{j}) \geq F (P_{j - 1})$ for all $j$ . The length of the path is $s$ .

Empirical observation: Suppose that $F \in {F_{d}, G_{d}, H_{d}}$ . If we compute a path from initialization to local maximum for $F$ , then such a path will typically have length less than $n$ . Furthermore, if we locally maximize $F$ multiple times, we will typically obtain the same local maximum each time. Moreover, if $P_{F}, P_{G}, P_{H}$ are the computed local maxima of $F_{d}, G_{d}, H_{d}$ respectively, then $P_{F}, P_{G}, P_{H}$ will either be identical or differ by relatively few diagonal entries.
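The local search described above can be sketched as follows for $F_{d}$ with real matrices. This is an illustrative hill-climbing loop and not the code behind the empirical observation: it assumes $K$ is the field of real numbers, keeps $P A_{j} P$ as an $n \times n$ matrix padded with zeros (which does not change the spectral radius), always moves to the best improving neighbor, and uses hypothetical names (`fitness_F`, `local_maximize`). Nothing in the sketch guarantees that different initializations reach the same local maximum.

```python
import numpy as np
from itertools import product

def spectral_radius(M):
    """Largest absolute value of an eigenvalue of M."""
    return np.max(np.abs(np.linalg.eigvals(M)))

def fitness_F(As, subset):
    """F_d(P) = rho(Gamma(A_1..A_r; P A_1 P, ..., P A_r P)) for real A_j, where P
    is the 0/1 diagonal projection onto the coordinates listed in `subset`."""
    n = As[0].shape[0]
    P = np.zeros((n, n))
    idx = sorted(subset)
    P[idx, idx] = 1.0
    # For real matrices, Gamma(A; X) is represented by sum_j A_j (x) X_j.
    return spectral_radius(sum(np.kron(A, P @ A @ P) for A in As))

def local_maximize(fitness, n, d, rng):
    """Hill-climb over d-element subsets of {0..n-1}; a neighbor swaps one index."""
    current = frozenset(rng.choice(n, size=d, replace=False).tolist())
    value = fitness(current)
    path_length = 0
    while True:
        best, best_value = None, value
        for out_i, in_i in product(current, set(range(n)) - current):
            cand = (current - {out_i}) | {in_i}
            v = fitness(cand)
            if v > best_value + 1e-12:
                best, best_value = cand, v
        if best is None:
            return current, value, path_length
        current, value, path_length = best, best_value, path_length + 1

rng = np.random.default_rng(1)
n, d, r = 6, 3, 3
As = [rng.standard_normal((n, n)) for _ in range(r)]

def F(subset):
    return fitness_F(As, subset)

best_subset, best_value, steps = local_maximize(F, n, d, rng)
print(sorted(best_subset), best_value, steps)
```

Running `local_maximize` several times with different seeds and comparing the returned subsets is one way to probe the pseudodeterminism claim on small instances.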

I have not done the experiments yet, but one should be able to generalize the above empirical observation to matroids. Suppose that $M$ is a basis matroid with underlying set ${1, \dots, n}$ and where $| A | = d$ for each $A \in M$ . Then one should be able to make the same observation about the fitness functions $F_{d} |_{M}, G_{d} |_{M}, H_{d} |_{M}$ as well.

We observe that the problems of maximizing $F_{d}, G_{d}, H_{d}$ are all NP-hard since the clique problem can be reduced to special cases of maximizing $F_{d}, G_{d}, H_{d}$ . This means that the problems of maximizing $F_{d}, G_{d}, H_{d}$ can be sophisticated problems, but this also means that we should not expect it to be easy to find the global maxima for $F_{d}, G_{d}, H_{d}$ in some cases.

[-]Joseph Van Name1y10

This is a post about some of the machine learning algorithms that I have been doing experiments with. These machine learning models behave quite mathematically which seems to be very helpful for AI interpretability and AI safety.

Sequences of matrices generally cannot be approximated by sequences of Hermitian matrices.

Suppose that $A_{1}, \dots, A_{r}$ are $n \times n$ -complex matrices and $X_{1}, \dots, X_{r}$ are $d \times d$ -complex matrices. Then define a mapping $Γ (A_{1}, \dots, A_{r}; X_{1}, \dots, X_{r}) : M_{n, d} (C) \to M_{n, d} (C)$ by $Γ (A_{1}, \dots, A_{r}; X_{1}, \dots, X_{r}) (X) = A_{1} X X_{1}^{*} + \dots + A_{r} X X_{r}^{*}$ for all $X$ . Define

$Φ (A_{1}, \dots, A_{r}) = Γ (A_{1}, \dots, A_{r}; A_{1}, \dots, A_{r})$ .

Define the $L_{2}$ -spectral radius by setting $ρ_{2} (A_{1}, \dots, A_{r}) = ρ (Φ (A_{1}, \dots, A_{r}))^{1 / 2}$ . Define the $L_{2}$ -spectral radius similarity between $(A_{1}, \dots, A_{r})$ and $(X_{1}, \dots, X_{r})$ by

$∥ (A_{1}, \dots, A_{r}) ≃ (X_{1}, \dots, X_{r}) ∥_{2}$

$= \frac{ρ (Γ (A_{1}, \dots, A_{r}; X_{1}, \dots, X_{r}))}{ρ_{2} (A_{1}, \dots, A_{r}) ρ_{2} (X_{1}, \dots, X_{r})}$ .

The $L_{2}$ -spectral radius similarity is always in the interval $[0, 1]$ . If $n = d$ , $A_{1}, \dots, A_{r}$ generate the algebra of $n \times n$ -complex matrices, and $X_{1}, \dots, X_{r}$ also generate the algebra of $n \times n$ -complex matrices, then $∥ (A_{1}, \dots, A_{r}) ≃ (X_{1}, \dots, X_{r}) ∥_{2} = 1$ if and only if there are $C, λ$ with $A_{j} = λ C X_{j} C^{- 1}$ for all $j$ .

Define $ρ_{2, d}^{H} (A_{1}, \dots, A_{r})$ to be the supremum of

$ρ_{2} (A_{1}, \dots, A_{r}) \cdot ∥ (A_{1}, \dots, A_{r}) ≃ (X_{1}, \dots, X_{r}) ∥_{2}$

where $X_{1}, \dots, X_{r}$ are $d \times d$ -Hermitian matrices.

One can get lower bounds for $ρ_{2, d}^{H} (A_{1}, \dots, A_{r})$ simply by locally maximizing $∥ (A_{1}, \dots, A_{r}) ≃ (X_{1}, \dots, X_{r}) ∥_{2}$ using gradient ascent, and if one locally maximizes this quantity twice, one typically gets the same fitness level.

Empirical observation/conjecture: If $(A_{1}, \dots, A_{r})$ are $n \times n$ -complex matrices, then $ρ_{2, n}^{H} (A_{1}, \dots, A_{r}) = ρ_{2, d}^{H} (A_{1}, \dots, A_{r})$ whenever $d \geq n$ .

The above observation means that sequences of $n \times n$ -matrices $(A_{1}, \dots, A_{r})$ are fundamentally non-Hermitian. In this case, we cannot get better models of $(A_{1}, \dots, A_{r})$ using Hermitian matrices larger than the matrices $(A_{1}, \dots, A_{r})$ themselves; I kind of want the behavior to be more complex instead of doing the same thing whenever $d \geq n$ , but the purpose of modeling $(A_{1}, \dots, A_{r})$ as Hermitian matrices is generally to use smaller matrices and not larger matrices.

This means that the function $ρ_{2, d}^{H}$ behaves mathematically.

Now, the model $(X_{1}, \dots, X_{r})$ is a linear model of $(A_{1}, \dots, A_{r})$ since the mapping $A_{j} \mapsto X_{j}$ is the restriction of a linear mapping, so such a linear model should be good for a limited number of tasks, but the mathematical behavior of the model $(X_{1}, \dots, X_{r})$ generalizes to multi-layered machine learning models.

[-]Joseph Van Name1y10

Here are some observations about the kind of fitness functions that I have been running experiments on for AI interpretability. The phenomena that I state in this post are determined experimentally without a rigorous mathematical proof and they only occur some of the time.

Suppose that $F : X \to [- \infty, \infty)$ is a continuous fitness function. In an ideal universe, we would like for the function $F$ to have just one local maximum. If $F$ has just one local maximum, we say that $F$ is maximized pseudodeterministically (or simply pseudodeterministic). At the very least, we would like for there to be just one real number of the form $F (x)$ for a local maximum $(x, F (x))$ . In this case, all local maxima will typically be related by some sort of symmetry. Pseudodeterministic fitness functions seem to be quite interpretable to me. If there are many local maximum values and the local maximum value that we attain after training depends on things such as the initialization, then the local maximum will contain random/pseudorandom information independent of the training data, and the local maximum will be difficult to interpret. A fitness function with a single local maximum value behaves more mathematically than a fitness function with many local maximum values, and such mathematical behavior should help with interpretability; the only reason I have been able to interpret pseudodeterministic fitness functions before is that they behave mathematically and have a unique local maximum value.

Set $O = F^{- 1} [(- \infty, \infty)] = X ∖ F^{- 1} [\{- \infty\}]$ . If the set $O$ is disconnected (in a topological sense) and if $F$ behaves differently on each of the components of $O$ , then we have literally shattered the possibility of having a unique local maximum, but in this post, we shall explore a case where each component of $O$ still has a unique local maximum value.

Let $m_{0}, \dots, m_{n}$ be positive integers with $m_{0} = m_{n} = 1$ . Let $r_{0}, \dots, r_{n - 1}$ be other natural numbers. The set $X$ is the collection of all tuples $A = (A_{i, j})_{i, j}$ where each $A_{i, j}$ is a real $m_{i + 1} \times m_{i}$ -matrix, where the indices range over $i \in \{0, \dots, n - 1\}, j \in \{1, \dots, r_{i}\}$ , and where for each $i$ the tuple $(A_{i, j})_{j}$ is not identically zero.

The training data is a set $Σ$ that consists of input/label pairs $(u, v)$ where $v \in \{- 1, 1\}$ and where $u = (u_{0}, \dots, u_{n - 1})$ such that each $u_{i}$ is a subset of $\{1, \dots, r_{i}\}$ (i.e. $Σ$ is the training set for a binary classification task where $u$ is the encoded network input and $v$ is the label).

Define $W (u, A) = (\sum_{j \in u_{n - 1}} A_{n - 1, j}) \dots (\sum_{j \in u_{0}} A_{0, j})$ . Now, we define our fitness level by setting

$F (A) = \sum_{(u, v) \in Σ} \log (| W (u, A) |) / | Σ | - \sum_{i} \log (∥ \sum_{j} A_{i, j} A_{i, j}^{*} ∥_{p}) / 2$

$= E (\log (| W (u, A) |)) - \sum_{i} \log (∥ \sum_{j} A_{i, j} A_{i, j}^{*} ∥_{p}) / 2$ where the expected value is with respect to selecting an element $(u, v) \in Σ$ uniformly at random. Here, $∥ \cdot ∥_{p}$ is a Schatten $p$ -norm which is just the $ℓ_{p}$ -norm of the singular values of the matrix. Observe that the fitness function $F$ only depends on the list $(u : (u, v) \in Σ)$ , so $F$ does not depend on the training data labels.
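The definitions of $W$ and $F$ can be sketched in NumPy as follows. This is only an illustration under stated assumptions: the names `W`, `schatten_norm`, and `fitness`, the 0-based indexing, and the storage of layer $i$ as an array of shape `(r_i, m_{i+1}, m_i)` are choices made here, not anything taken from the post.

```python
import numpy as np

def W(u, A):
    """(sum_{j in u_{n-1}} A_{n-1,j}) ... (sum_{j in u_0} A_{0,j}) as a float."""
    prod = np.eye(1)  # m_0 = 1, so the product starts as a 1x1 matrix
    for A_i, u_i in zip(A, u):
        layer = A_i[sorted(u_i)].sum(axis=0)  # shape (m_{i+1}, m_i)
        prod = layer @ prod
    return float(prod[0, 0])  # m_n = 1, so the result is 1x1

def schatten_norm(M, p):
    """l_p norm of the singular values of M."""
    s = np.linalg.svd(M, compute_uv=False)
    return float(np.sum(s ** p) ** (1.0 / p))

def fitness(A, inputs, p=2.0):
    """F(A) = E[log|W(u, A)|] - sum_i log(||sum_j A_ij A_ij^T||_p) / 2."""
    data_term = np.mean([np.log(abs(W(u, A))) for u in inputs])
    reg_term = sum(
        np.log(schatten_norm(np.einsum('jab,jcb->ac', A_i, A_i), p))
        for A_i in A
    ) / 2
    return float(data_term - reg_term)

# Hypothetical toy instance: n = 3 layers, inputs are tuples of index sets.
rng = np.random.default_rng(0)
m = [1, 3, 3, 1]   # m_0, ..., m_n
r = [4, 4, 4]      # r_0, ..., r_{n-1}
A = [rng.standard_normal((r[i], m[i + 1], m[i])) for i in range(3)]
inputs = [({0, 1}, {2}, {0, 3}), ({1}, {0, 1, 2}, {3})]
print(fitness(A, inputs))
```

One property visible in this sketch is that rescaling any single layer by a non-zero constant leaves the fitness unchanged, since the data term and the regularization term shift by the same amount.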

Observe that $O = X ∖ ⋃_{(u, v) \in Σ} \{A \in X : W (u, A) = 0\}$ which is a disconnected open set. Define a function $f : O \to \{- 1, 1\}^{Σ}$ by setting $f (A) = (W (u, A) / | W (u, A) |)_{(u, v) \in Σ}$ . Observe that if $x, y$ belong to the same component of $O$ , then $f (x) = f (y)$ .

While the fitness function $F$ has many local maximum values, the function $F$ seems to typically have at most one local maximum value per component. More specifically, for each $(α_{i})_{i \in Σ}$ , the set $f^{- 1} [\{(α_{i})_{i \in Σ}\}]$ seems to typically be a connected open set where $F$ has just one local maximum value (maybe the other local maxima are hard to find, but if they are hard to find, they are irrelevant).

Let $Ω = f^{- 1} [\{(v)_{(u, v) \in Σ}\}]$ . Then $Ω$ is a (possibly empty) open subset of $O$ , and there tends to be a unique (up-to-symmetry) $A_{0} \in Ω$ where $F (A_{0})$ is locally maximized. This unique $A_{0}$ is the machine learning model that we obtain when training on the data set $Σ$ . To obtain $A_{0}$ , we first perform an optimization that works well enough to get inside the open set $Ω$ . For example, to get inside $Ω$ , we could try to maximize the fitness function $\sum_{(u, v) \in Σ} \arctan (v \cdot W (u, A))$ . We then maximize $F$ inside the open set $Ω$ to obtain our local maximum.
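The membership test for $Ω$ and the arctan objective used to get inside $Ω$ can be sketched as below. The helper names `sign_pattern`, `in_omega`, and `arctan_surrogate`, the 0-based indices, and the array layout `(r_i, m_{i+1}, m_i)` per layer are assumptions of this illustration; the optimizer itself is left out.

```python
import numpy as np

def W(u, A):
    """Product of the selected layer sums, returned as a float."""
    prod = np.eye(1)
    for A_i, u_i in zip(A, u):
        prod = A_i[sorted(u_i)].sum(axis=0) @ prod
    return float(prod[0, 0])

def sign_pattern(A, data):
    """The sign vector (W(u, A) / |W(u, A)|) over the training pairs."""
    return tuple(int(np.sign(W(u, A))) for u, _ in data)

def in_omega(A, data):
    """True when the sign of W(u, A) matches the label v on every pair."""
    return all(v * W(u, A) > 0 for u, v in data)

def arctan_surrogate(A, data):
    """sum over (u, v) of arctan(v * W(u, A)), maximized to reach Omega."""
    return float(sum(np.arctan(v * W(u, A)) for u, v in data))

# Hypothetical toy instance with labels chosen to match the current signs.
rng = np.random.default_rng(1)
m = [1, 2, 2, 1]
A = [rng.standard_normal((3, m[i + 1], m[i])) for i in range(3)]
inputs = [({0}, {1, 2}, {0}), ({1, 2}, {0}, {2}), ({0, 2}, {1}, {1})]
data = [(u, int(np.sign(W(u, A)))) for u in inputs]
print(in_omega(A, data), arctan_surrogate(A, data))
```

A two-phase loop would first ascend `arctan_surrogate` until `in_omega` returns `True`, then switch to ascending $F$ with step sizes small enough to keep every sign fixed.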

After training, we obtain a function $f$ defined by $f (u) = W (u, A_{0})$ . Observe that the function $f$ is a multi-linear function. The function $f$ is highly regularized, so if we want better performance, we should tone down the amount of regularization, but this can be done without compromising pseudodeterminism. The function $f$ has been trained so that $f (u) / | f (u) | = v$ for each $(u, v) \in Σ$ but also so that $| f (u) |$ is large compared to what we might expect whenever $(u, v) \in Σ$ . In other words, $f$ is helpful in determining whether $(u, v)$ belongs to $Σ$ or not since one can examine the magnitude and sign of $f (u)$ .

In order to maximize AI safety, I want to produce inherently interpretable AI algorithms that perform well on difficult tasks. Right now, the function $f$ (and other functions that I have designed) can do some machine learning tasks, but they are not ready to replace neural networks. I do have a few ideas about how to improve my AI algorithms' performance without compromising pseudodeterminism. I do not believe that pseudodeterministic machine learning will increase AI risks too much because when designing these pseudodeterministic algorithms, we are trading some (but hopefully not too much) performance for increased interpretability, and this tradeoff is good for safety since it increases interpretability without increasing performance.

[-]Joseph Van Name2y10

In this note, I will continue to demonstrate not only the ways in which LSRDRs ( $L_{2, d}$ -spectral radius dimensionality reductions) are mathematical but also how one can get the most out of LSRDRs. LSRDRs are one of the types of machine learning that I have been working on, and LSRDRs have characteristics that tell us that LSRDRs are often inherently interpretable which should be good for AI safety.

Suppose that $N$ is the quantum channel that maps an $n$ qubit state to an $n$ qubit state where we select one of the $n$ qubits at random and send it through the completely depolarizing channel (the completely depolarizing channel takes a state as an input and returns the completely mixed state as an output). Suppose that $A_{1}, \dots, A_{4 n}$ are $2^{n}$ by $2^{n}$ matrices where $N$ has the Kraus representation $N (X) = \sum_{k = 1}^{4 n} A_{k} X A_{k}^{*}$ .

The objective is to locally maximize the fitness level $ρ (\sum_{k = 1}^{4 n} z_{k} A_{k}) / ∥ (z_{1}, \dots, z_{4 n}) ∥$ where the norm in question is the Euclidean norm and where $ρ$ denotes the spectral radius. This is a 1 dimensional case of an LSRDR of the channel $N$ .

Let $A = \sum_{k = 1}^{4 n} z_{k} A_{k}$ when $(z_{1}, \dots, z_{4 n})$ is selected to locally maximize the fitness level. Then my empirical calculations show that there is some $λ$ where $λ \sum_{k = 1}^{4 n} z_{k} A_{k}$ is positive semidefinite with eigenvalues $\{0, \dots, n\}$ and where the eigenvalue $k$ has multiplicity $\binom{n}{k}$ , which is the binomial coefficient. But these are empirical calculations for select values $λ$ ; I have not been able to mathematically prove that this is always the case for all local maxima for the fitness level (I have not tried to come up with a proof).
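To make the setup concrete, here is a small NumPy sketch that builds one Kraus representation of the channel and evaluates the fitness level at a hand-picked coefficient vector. The choice of Kraus operators (single-qubit Pauli matrices scaled by $1 / (2 \sqrt{n})$) is an assumption of this sketch, since Kraus representations are not unique, and the function names are illustrative. The hand-picked $z$ merely has the spectrum reported above; the sketch does not check that it is a local maximum.

```python
import numpy as np
from functools import reduce

PAULIS = [
    np.eye(2, dtype=complex),
    np.array([[0, 1], [1, 0]], dtype=complex),
    np.array([[0, -1j], [1j, 0]], dtype=complex),
    np.array([[1, 0], [0, -1]], dtype=complex),
]

def kraus_operators(n):
    """4n operators: each Pauli on each qubit, scaled by 1 / (2 sqrt(n))."""
    ops = []
    for qubit in range(n):
        for sigma in PAULIS:
            factors = [np.eye(2, dtype=complex)] * n
            factors[qubit] = sigma
            ops.append(reduce(np.kron, factors) / (2 * np.sqrt(n)))
    return ops

def spectral_radius(M):
    return float(np.max(np.abs(np.linalg.eigvals(M))))

def fitness(z, ops):
    """rho(sum_k z_k A_k) / ||z|| with the Euclidean norm."""
    M = sum(zk * Ak for zk, Ak in zip(z, ops))
    return spectral_radius(M) / float(np.linalg.norm(z))

n = 3
ops = kraus_operators(n)

# Candidate: weight 1 on the identity and Pauli-Z operator of every qubit.
z = np.zeros(4 * n)
z[0::4] = 1.0
z[3::4] = 1.0
M = sum(zk * Ak for zk, Ak in zip(z, ops))
eigs = np.sort(np.linalg.eigvalsh(np.sqrt(n) * M))  # here lambda = sqrt(n)
print(eigs, fitness(z, ops))
```

For this candidate, $\sqrt{n} \cdot \sum_{k} z_{k} A_{k}$ is the sum of the single-qubit projections $(I + Z) / 2$ , so its eigenvalues count how many qubits are in the state $| 0 ⟩$ , which gives the binomial multiplicities.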

Here, we have obtained a complete characterization of $A$ up-to-unitary equivalence due to the spectral theorem, so we are quite close to completely interpreting the local maximum for our fitness function.

I made a few YouTube videos showcasing the process of maximizing the fitness level here.

Spectra of 1 dimensional LSRDRs of 6 qubit noise channel during training

Spectra of 1 dimensional LSRDRs of 7 qubit noise channel during training

Spectra of 1 dimensional LSRDRs of 8 qubit noise channel during training

I will make another post soon about more LSRDRs of a higher dimension of the same channel $N$ .

[-]Joseph Van Name2y*10

I personally like my machine learning algorithms to behave mathematically especially when I give them mathematical data. For example, a fitness function with apparently one local maximum value is a mathematical fitness function. It is even more mathematical if one can prove mathematical theorems about such a fitness function or if one can completely describe the local maxima of such a fitness function. It seems like fitness functions that satisfy these mathematical properties are more interpretable than the fitness functions which do not, so people should investigate such functions for AI safety purposes.

My notion of an LSRDR is a notion that satisfies these mathematical properties. To demonstrate the mathematical behavior of LSRDRs, let's see what happens when we take an LSRDR of the octonions.

Let $K$ denote either the field of real numbers or the field of complex numbers ( $K$ could also be the division ring of quaternions, but for simplicity, let's not go there). If $A_{1}, \dots, A_{r}$ are $n \times n$ -matrices over $K$ , then an LSRDR ( $L_{2, d}$ -spectral radius dimensionality reduction) of $A_{1}, \dots, A_{r}$ is a collection $X_{1}, \dots, X_{r}$ of $d \times d$ -matrices that locally maximizes the fitness level

$\frac{ρ (A_{1} \otimes \overline{X_{1}} + \dots + A_{r} \otimes \overline{X_{r}})}{ρ (X_{1} \otimes \overline{X_{1}} + \dots + X_{r} \otimes \overline{X_{r}})^{1 / 2}}$ . Here, $ρ$ denotes the spectral radius function while $\otimes$ denotes the tensor product and $\overline{Z}$ denotes the matrix obtained from $Z$ by replacing each entry with its complex conjugate. We shall call the maximum fitness level the $L_{2, d}$ -spectral radius of $A_{1}, \dots, A_{r}$ over the field $K$ , and we shall write $ρ_{2, d}^{K} (A_{1}, \dots, A_{r})$ for this spectral radius.

Define the linear superoperator $Γ (A_{1}, \dots, A_{r}; X_{1}, \dots, X_{r})$ by setting

$Γ (A_{1}, \dots, A_{r}; X_{1}, \dots, X_{r}) (X) = A_{1} X X_{1}^{*} + \dots + A_{r} X X_{r}^{*}$ and set $Φ (X_{1}, \dots, X_{r}) = Γ (X_{1}, \dots, X_{r}; X_{1}, \dots, X_{r})$ . Then the fitness level of $X_{1}, \dots, X_{r}$ is $\frac{ρ (Γ (A_{1}, \dots, A_{r}; X_{1}, \dots, X_{r}))}{ρ (Φ (X_{1}, \dots, X_{r}))^{1 / 2}}$ .
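The fitness level can be evaluated directly from Kronecker products, because with row-major vectorization the superoperator $Y \mapsto \sum_{j} A_{j} Y X_{j}^{*}$ is represented by the matrix $\sum_{j} A_{j} \otimes \overline{X_{j}}$ . The NumPy sketch below only evaluates the fitness; the function names are illustrative and no optimizer is included.

```python
import numpy as np

def spectral_radius(M):
    return float(np.max(np.abs(np.linalg.eigvals(M))))

def lsrdr_fitness(As, Xs):
    """rho(sum_j A_j (x) conj(X_j)) / rho(sum_j X_j (x) conj(X_j))^(1/2)."""
    num = sum(np.kron(A, X.conj()) for A, X in zip(As, Xs))
    den = sum(np.kron(X, X.conj()) for X in Xs)
    return spectral_radius(num) / np.sqrt(spectral_radius(den))

def gamma(As, Xs, Y):
    """The superoperator Gamma(A_1..A_r; X_1..X_r) applied to Y."""
    return sum(A @ Y @ X.conj().T for A, X in zip(As, Xs))

# Hypothetical random instance: r = 3 matrices, n = 4, d = 2.
rng = np.random.default_rng(0)
def rand(k):
    return rng.standard_normal((k, k)) + 1j * rng.standard_normal((k, k))
As = [rand(4) for _ in range(3)]
Xs = [rand(2) for _ in range(3)]
Y = rng.standard_normal((4, 2)) + 1j * rng.standard_normal((4, 2))

# The Kronecker matrix acts on the row-major flattening of Y like gamma.
K = sum(np.kron(A, X.conj()) for A, X in zip(As, Xs))
print(np.allclose((K @ Y.reshape(-1)).reshape(Y.shape), gamma(As, Xs, Y)))
print(lsrdr_fitness(As, Xs))
```

The fitness is unchanged when every $X_{j}$ is multiplied by the same non-zero scalar or conjugated by the same invertible matrix, so a maximizer can only be unique up to these symmetries.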

Suppose that $V$ is an $8$ -dimensional real inner product space. Then the octonionic multiplication operation is the unique up-to-isomorphism bilinear binary operation $*$ on $V$ together with a unit $1$ such that $∥ x * y ∥ = ∥ x ∥ \cdot ∥ y ∥$ and $1 * x = x * 1 = x$ for all $x, y \in V$ . If we drop the condition that the octonions have a unit, then we do not quite have this uniqueness result.

We say that an octonion-like algebra is an $8$ -dimensional real inner product space $V$ together with a bilinear operation $*$ such that $∥ x * y ∥ = ∥ x ∥ \cdot ∥ y ∥$ for all $x, y \in V$ .

Let $V$ be a specific octonion-like algebra.

Suppose now that $e_{1}, \dots, e_{8}$ is an orthonormal basis for $V$ (this does not need to be the standard basis). Then for each $j \in \{1, \dots, 8\}$ , let $A_{j}$ be the linear operator from $V$ to $V$ defined by setting $A_{j} v = e_{j} * v$ for all vectors $v$ . All non-zero linear combinations of $A_{1}, \dots, A_{8}$ are conformal mappings (this means that they preserve angles). Now that we have turned the octonion-like algebra into matrices, we can take an LSRDR of the octonion-like algebras, but when taking the LSRDR of octonion-like algebras, we should not worry about the choice of orthonormal basis $e_{1}, \dots, e_{8}$ since I could formulate everything in a coordinate-free manner.
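As one concrete instance, the matrices $A_{1}, \dots, A_{8}$ can be generated for the octonions themselves using the Cayley-Dickson doubling formula $(a, b) (c, d) = (a c - \overline{d} b, d a + b \overline{c})$ . The sketch below assumes that particular sign convention and the standard basis; the post allows any octonion-like algebra and any orthonormal basis, and the names `cd_conj` and `cd_mul` are illustrative.

```python
import numpy as np

def cd_conj(x):
    """Cayley-Dickson conjugate: (a, b)* = (a*, -b)."""
    if len(x) == 1:
        return x.copy()
    h = len(x) // 2
    return np.concatenate([cd_conj(x[:h]), -x[h:]])

def cd_mul(x, y):
    """Cayley-Dickson product: (a, b)(c, d) = (ac - d* b, da + b c*)."""
    if len(x) == 1:
        return x * y
    h = len(x) // 2
    a, b, c, d = x[:h], x[h:], y[:h], y[h:]
    return np.concatenate([
        cd_mul(a, c) - cd_mul(cd_conj(d), b),
        cd_mul(d, a) + cd_mul(b, cd_conj(c)),
    ])

# A[j] is the matrix of left multiplication v -> e_j * v in the standard basis.
E = np.eye(8)
A = [np.column_stack([cd_mul(E[j], E[k]) for k in range(8)]) for j in range(8)]

rng = np.random.default_rng(0)
x, y, c = rng.standard_normal(8), rng.standard_normal(8), rng.standard_normal(8)
M = sum(cj * Aj for cj, Aj in zip(c, A))
print(np.linalg.norm(cd_mul(x, y)), np.linalg.norm(x) * np.linalg.norm(y))
print(np.allclose(M.T @ M, np.dot(c, c) * np.eye(8)))
```

The last line checks the conformality claim: every linear combination $M = \sum_{j} c_{j} A_{j}$ satisfies $M^{T} M = ∥ c ∥^{2} I$ , so $M$ is a scalar multiple of an orthogonal matrix.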

Empirical Observation from computer calculations: Suppose that $1 \leq d \leq 8$ and $K$ is the field of real numbers. Then the following are equivalent.

1. The $d \times d$ matrices $X_{1}, \dots, X_{8}$ are an LSRDR of $A_{1}, \dots, A_{8}$ over $K$ where $A_{1} \otimes X_{1} + \dots + A_{8} \otimes X_{8}$ has a unique real dominant eigenvalue.
2. There exist matrices $R, S$ where $X_{j} = R A_{j} S$ for all $j$ and where $S R$ is an orthogonal projection matrix.

In this case, $ρ_{2, d}^{K} (A_{1}, \dots, A_{8}) = \sqrt{d}$ and this fitness level is reached by the matrices $X_{1}, \dots, X_{8}$ in the above equivalent statements. Here $P$ denotes the projection matrix $S R$ . Observe that the superoperator $Γ (A_{1}, \dots, A_{8}; P A_{1} P, \dots, P A_{8} P)$ is similar to a direct sum of $Γ (A_{1}, \dots, A_{8}; X_{1}, \dots, X_{8})$ and a zero matrix. But the projection matrix $P$ is a dominant eigenvector of $Γ (A_{1}, \dots, A_{8}; P A_{1} P, \dots, P A_{8} P)$ and of $Φ (P A_{1} P, \dots, P A_{8} P)$ as well.

I have no mathematical proof of the above fact though.

Now suppose that $K = C$ . Then my computer calculations yield the following complex $L_{2, d}$ -spectral radii: $(ρ_{2, j}^{K} (A_{1}, \dots, A_{8}))_{j = 1}^{8}$

$= (2, 4, 2 + \sqrt{8}, 5.4676355784..., 6.1977259251..., 4 + \sqrt{8}, 7.2628726081..., 8)$

Each time that I have trained a complex LSRDR of $A_{1}, \dots, A_{8}$ , I was able to find a fitness level that is not just a local optimum but also a global optimum.

In the case of the real LSRDRs, I have a complete description of the LSRDRs of $(A_{1}, \dots, A_{8})$ . This demonstrates that the octonion-like algebras are elegant mathematical structures and that LSRDRs behave mathematically in a manner that is compatible with the structure of the octonion-like algebras.

I have made a few YouTube videos that animate the process of gradient ascent to maximize the fitness level.

Edit: I have made some corrections to this post on 9/22/2024.

Fitness levels of complex LSRDRs of the octonions (youtube.com)

[-]Joseph Van Name2y10

There are some cases where we have a complete description for the local optima for an optimization problem. This is a case of such an optimization problem.

Such optimization problems are useful for AI safety since a loss/fitness function where we have a complete description of all local or global optima is a highly interpretable loss/fitness function, and so one should consider using these loss/fitness functions to construct AI algorithms.

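Theorem: Suppose that $U$ is an invertible real, complex, or quaternionic $n \times n$ -matrix that minimizes the quantity $∥ U ∥_{2} + ∥ U^{- 1} ∥_{2}$ , where $∥ \cdot ∥_{2}$ is the Frobenius norm (the Schatten $2$ -norm), which is the norm that the proof below uses. Then $U$ is unitary.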

Proof: The real case is a special case of a complex case, and by representing each $n \times n$ -quaternionic matrix as a complex $2 n \times 2 n$ -matrix, we may assume that $U$ is a complex matrix.

By the Schur decomposition, we know that $U = V T V^{*}$ where $V$ is a unitary matrix and $T$ is upper triangular. But we know that $∥ U ∥_{2} = ∥ T ∥_{2}$ . Furthermore, $U^{- 1} = V T^{- 1} V^{*}$ , so $∥ U^{- 1} ∥_{2} = ∥ T^{- 1} ∥_{2}$ . Let $D$ denote the diagonal matrix whose diagonal entries are the same as those of $T$ . Then $∥ T ∥_{2} \geq ∥ D ∥_{2}$ and $∥ T^{- 1} ∥_{2} \geq ∥ D^{- 1} ∥_{2}$ . Furthermore, $∥ T ∥_{2} = ∥ D ∥_{2}$ iff $T$ is diagonal and $∥ T^{- 1} ∥_{2} = ∥ D^{- 1} ∥_{2}$ iff $T^{- 1}$ is diagonal. Therefore, since $∥ U ∥_{2} + ∥ U^{- 1} ∥_{2} = ∥ T ∥_{2} + ∥ T^{- 1} ∥_{2}$ and $∥ T ∥_{2} + ∥ T^{- 1} ∥_{2}$ is minimized, we can conclude that $T = D$ , so $T$ is a diagonal matrix. Suppose that $T$ has diagonal entries $(z_{1}, \dots, z_{n})$ . By the arithmetic-geometric mean inequality and the Cauchy-Schwarz inequality, we know that $\frac{1}{2} \cdot (∥ (z_{1}, \dots, z_{n}) ∥_{2} + ∥ (z_{1}^{- 1}, \dots, z_{n}^{- 1}) ∥_{2}) \geq (∥ (| z_{1} |, \dots, | z_{n} |) ∥_{2} \cdot ∥ (| z_{1}^{- 1} |, \dots, | z_{n}^{- 1} |) ∥_{2})^{1 / 2}$

$\geq ⟨ (| z_{1} |, \dots, | z_{n} |), (| z_{1}^{- 1} |, \dots, | z_{n}^{- 1} |) ⟩^{1 / 2} = \sqrt{n} .$

Here, the equalities hold if and only if $| z_{j} | = 1$ for all $j$ , but this implies that $U$ is unitary. Q.E.D.
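A quick numerical sanity check of the theorem, reading $∥ \cdot ∥_{2}$ as the Frobenius norm (an assumption that matches the proof): unitary matrices should attain the value $2 \sqrt{n}$ , and no other invertible matrix should fall below it. The sampling scheme here is arbitrary and only illustrative.

```python
import numpy as np

def objective(U):
    """||U||_F + ||U^{-1}||_F."""
    return float(np.linalg.norm(U, 'fro') + np.linalg.norm(np.linalg.inv(U), 'fro'))

rng = np.random.default_rng(0)
n = 5

# A unitary matrix from the QR factorization of a random complex matrix.
Q, _ = np.linalg.qr(rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n)))

# Objective values at random (almost surely invertible) real matrices.
random_values = [objective(rng.standard_normal((n, n))) for _ in range(100)]

print(objective(Q), 2 * np.sqrt(n), min(random_values))
```

The lower bound can also be seen from singular values: if $σ_{1}, \dots, σ_{n}$ are the singular values of $U$ , then $∥ U ∥_{2}^{2} ∥ U^{- 1} ∥_{2}^{2} = (\sum_{i} σ_{i}^{2}) (\sum_{i} σ_{i}^{- 2}) \geq n^{2}$ by the Cauchy-Schwarz inequality.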

[-]Joseph Van Name2y10

The $L_{2}$ -spectral radius similarity is not transitive. Suppose that $A_{1}, \dots, A_{r}$ are real $m \times m$ -matrices and $B_{1}, \dots, B_{r}$ are real $n \times n$ -matrices. Then define $ρ_{2} (A_{1}, \dots, A_{r}) = ρ (A_{1} \otimes A_{1} + \dots + A_{r} \otimes A_{r})^{1 / 2}$ . Then the generalized Cauchy-Schwarz inequality is satisfied:

$ρ (A_{1} \otimes B_{1} + \dots + A_{r} \otimes B_{r}) \leq ρ_{2} (A_{1}, \dots, A_{r}) ρ_{2} (B_{1}, \dots, B_{r})$ .

We therefore define the $L_{2}$ -spectral radius similarity between $(A_{1}, \dots, A_{r})$ and $(B_{1}, \dots, B_{r})$ as $∥ (A_{1}, \dots, A_{r}) ≃ (B_{1}, \dots, B_{r}) ∥ = \frac{ρ (A_{1} \otimes B_{1} + \dots + A_{r} \otimes B_{r})}{ρ_{2} (A_{1}, \dots, A_{r}) ρ_{2} (B_{1}, \dots, B_{r})}$ . One should think of the $L_{2}$ -spectral radius similarity as a generalization of the cosine similarity $\frac{⟨ u, v ⟩}{∥ u ∥ \cdot ∥ v ∥}$ between vectors $u, v$ . I have been using the $L_{2}$ -spectral radius similarity to develop AI systems that seem to be very interpretable. The $L_{2}$ -spectral radius similarity is not transitive.

Indeed, if the tuples are rescaled so that $ρ_{2} (A_{1}, \dots, A_{r}) = ρ_{2} (B_{1}, \dots, B_{r})$ , then $∥ (A_{1}, \dots, A_{r}) ≃ (A_{1} \oplus B_{1}, \dots, A_{r} \oplus B_{r}) ∥ = 1$ and

$∥ (B_{1}, \dots, B_{r}) ≃ (A_{1} \oplus B_{1}, \dots, A_{r} \oplus B_{r}) ∥ = 1$ , but $∥ (A_{1}, \dots, A_{r}) ≃ (B_{1}, \dots, B_{r}) ∥$ can take any value in the interval $[0, 1]$ .

We should therefore think of the $L_{2}$ -spectral radius similarity as a sort of least upper bound of $[0, 1]$ -valued equivalence relations rather than as a $[0, 1]$ -valued equivalence relation. We need to consider this as a least upper bound because matrices have multiple dimensions.

Notation: $ρ (A) = \lim_{n \to \infty} ∥ A^{n} ∥^{1 / n}$ is the spectral radius. The spectral radius of $A$ is the largest magnitude of an eigenvalue of the matrix $A$ . Here the choice of norm does not matter because we are taking the limit. $A \oplus B$ is the direct sum of matrices while $A \otimes B$ denotes the Kronecker product of matrices.
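The failure of transitivity can be observed numerically with the sketch below. It assumes real matrices and rescales both tuples so that $ρ_{2} (A_{1}, \dots, A_{r}) = ρ_{2} (B_{1}, \dots, B_{r}) = 1$ before forming the direct sums; the function names are illustrative.

```python
import numpy as np

def spectral_radius(M):
    return float(np.max(np.abs(np.linalg.eigvals(M))))

def rho2(As):
    """rho(A_1 (x) A_1 + ... + A_r (x) A_r)^(1/2) for real matrices."""
    return np.sqrt(spectral_radius(sum(np.kron(A, A) for A in As)))

def similarity(As, Bs):
    """The L_2-spectral radius similarity between two tuples of matrices."""
    cross = spectral_radius(sum(np.kron(A, B) for A, B in zip(As, Bs)))
    return cross / (rho2(As) * rho2(Bs))

def direct_sum(A, B):
    out = np.zeros((A.shape[0] + B.shape[0], A.shape[1] + B.shape[1]))
    out[:A.shape[0], :A.shape[1]] = A
    out[A.shape[0]:, A.shape[1]:] = B
    return out

rng = np.random.default_rng(0)
As = [rng.standard_normal((2, 2)) for _ in range(3)]
Bs = [rng.standard_normal((3, 3)) for _ in range(3)]

# Rescale so that both tuples have rho_2 equal to 1.
scale_a, scale_b = rho2(As), rho2(Bs)
As = [A / scale_a for A in As]
Bs = [B / scale_b for B in Bs]
Cs = [direct_sum(A, B) for A, B in zip(As, Bs)]

print(similarity(As, Cs), similarity(Bs, Cs), similarity(As, Bs))
```

The first two printed values should be numerically equal to 1, while the third is whatever similarity the two random tuples happen to have.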

[-]Joseph Van Name2y10

Let's compute some inner products and gradients.

Set up: Let $K$ denote either the field of real or the field of complex numbers. Suppose that $d_{1}, \dots, d_{n}$ are positive integers. Let $m_{0}, \dots, m_{n}$ be a sequence of positive integers with $m_{0} = m_{n} = 1$ . Suppose that $X_{i, j}$ is an $m_{i - 1} \times m_{i}$ -matrix over $K$ whenever $1 \leq i \leq n$ and $1 \leq j \leq d_{i}$ . Then from the matrices $X_{i, j}$ , we can define a $d_{1} \times \dots \times d_{n}$ -tensor $T ((X_{i, j})_{i, j}) = (X_{1, i_{1}} \dots X_{n, i_{n}})_{i_{1}, \dots, i_{n}}$ . I have been doing computer experiments where I use this tensor to approximate other tensors by minimizing the $ℓ_{2}$ -distance. I have not seen this tensor approximation algorithm elsewhere, but perhaps someone else has produced this tensor approximation construction before. In previous shortform posts on this site, I have given evidence that the tensor dimensionality reduction behaves well, and in this post, we will focus on ways to compute with the tensors $T ((X_{i, j})_{i, j})$ , namely the inner product of such tensors and the gradient of the inner product with respect to the matrices $(X_{i, j})_{i, j}$ .

Notation: If $A_{1}, \dots, A_{r}, B_{1}, \dots, B_{r}$ are matrices, then let $Γ (A_{1}, \dots, A_{r}; B_{1}, \dots, B_{r})$ denote the superoperator defined by letting $Γ (A_{1}, \dots, A_{r}; B_{1}, \dots, B_{r}) (X) = A_{1} X B_{1}^{*} + \dots + A_{r} X B_{r}^{*}$ . Let $Φ (A_{1}, \dots, A_{r}) = Γ (A_{1}, \dots, A_{r}; A_{1}, \dots, A_{r})$ .

Inner product: Here is the computation of the inner product of our tensors.

$⟨ T ((A_{i, j})_{i, j}), T ((B_{i, j})_{i, j}) ⟩$

$= ⟨ (A_{1, i_{1}} \dots A_{n, i_{n}})_{i_{1}, \dots, i_{n}}, (B_{1, i_{1}} \dots B_{n, i_{n}})_{i_{1}, \dots, i_{n}} ⟩$

$= \sum_{i_{1}, \dots, i_{n}} A_{1, i_{1}} \dots A_{n, i_{n}} (B_{1, i_{1}} \dots B_{n, i_{n}})^{*}$

$= \sum_{i_{1}, \dots, i_{n}} A_{1, i_{1}} \dots A_{n, i_{n}} B_{n, i_{n}}^{*} \dots B_{1, i_{1}}^{*}$

$= Γ (A_{1, 1}, \dots, A_{1, d_{1}}; B_{1, 1}, \dots, B_{1, d_{1}}) \dots Γ (A_{n, 1}, \dots, A_{n, d_{n}}; B_{n, 1}, \dots, B_{n, d_{n}})$ .

In particular, $∥ T ((A_{i, j})_{i, j}) ∥^{2} = Φ (A_{1, 1}, \dots, A_{1, d_{1}}) \dots Φ (A_{n, 1}, \dots, A_{n, d_{n}})$ .
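The inner product identity can be checked numerically. In the sketch below, layer $i$ is stored as an array of shape `(d_i, m_{i-1}, m_i)` , the tensor is built entry by entry, and the chain of superoperators is applied to the $1 \times 1$ matrix $(1)$ from the right-most factor inwards. The function names and the storage layout are assumptions of this illustration.

```python
import itertools
import numpy as np

def tensor_from_matrices(mats):
    """T((X_ij)): entry (i_1, ..., i_n) is the 1x1 product X_{1,i_1} ... X_{n,i_n}."""
    dims = [M.shape[0] for M in mats]
    T = np.zeros(dims, dtype=complex)
    for idx in itertools.product(*(range(d) for d in dims)):
        prod = np.eye(1, dtype=complex)
        for M, i in zip(mats, idx):
            prod = prod @ M[i]
        T[idx] = prod[0, 0]
    return T

def inner_via_superoperators(A, B):
    """Gamma(A_1; B_1) ... Gamma(A_n; B_n) applied to the 1x1 matrix (1)."""
    Y = np.ones((1, 1), dtype=complex)
    for A_i, B_i in zip(reversed(A), reversed(B)):
        # Y <- sum_j A_ij Y B_ij^*
        Y = np.einsum('jab,bc,jdc->ad', A_i, Y, B_i.conj())
    return Y[0, 0]

rng = np.random.default_rng(0)
m = [1, 2, 3, 1]
d = [2, 3, 2]
def table():
    return [rng.standard_normal((d[i], m[i], m[i + 1]))
            + 1j * rng.standard_normal((d[i], m[i], m[i + 1])) for i in range(3)]
A, B = table(), table()

direct = np.sum(tensor_from_matrices(A) * tensor_from_matrices(B).conj())
print(direct, inner_via_superoperators(A, B))
```

The direct computation touches all $d_{1} \cdots d_{n}$ entries, while the superoperator form needs only $n$ small matrix contractions, which is what makes the formula useful in practice.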

Gradient: Observe that $\nabla_{X} Tr (A X) = A^{T}$ . We will see shortly that the cyclicity of the trace is useful for calculating the gradient. And here is my manual calculation of the gradient of the inner product of our tensors.

$\nabla_{X_{α, β}} ⟨ T ((X_{i, j})_{i, j}), T ((A_{i, j})_{i, j}) ⟩$

$= \nabla_{X_{α, β}} \sum_{i_{1}, \dots, i_{n}} X_{1, i_{1}} \dots X_{n, i_{n}} A_{n, i_{n}}^{*} \dots A_{1, i_{1}}^{*}$

$= \nabla_{X_{α, β}} Tr (\sum_{i_{1}, \dots, i_{n}} X_{1, i_{1}} \dots X_{n, i_{n}} A_{n, i_{n}}^{*} \dots A_{1, i_{1}}^{*})$

$= \nabla_{X_{α, β}} Tr (\sum_{i_{1}, \dots, i_{n}} X_{α, i_{α}} \dots X_{n, i_{n}} A_{n, i_{n}}^{*} \dots A_{α + 1, i_{α + 1}}^{*} A_{α, i_{α}}^{*} A_{α - 1, i_{α - 1}}^{*} \dots A_{1, i_{1}}^{*} X_{1, i_{1}} \dots X_{α - 1, i_{α - 1}})$

$= \nabla_{X_{α, β}} Tr (\sum_{i_{α + 1}, \dots, i_{n}, i_{1}, \dots, i_{α - 1}} X_{α, β} X_{α + 1, i_{α + 1}} \dots X_{n, i_{n}} A_{n, i_{n}}^{*} \dots A_{α + 1, i_{α + 1}}^{*} A_{α, β}^{*} A_{α - 1, i_{α - 1}}^{*} \dots A_{1, i_{1}}^{*} X_{1, i_{1}} \dots X_{α - 1, i_{α - 1}})$

$= (\sum_{i_{α + 1}, \dots, i_{n}, i_{1}, \dots, i_{α - 1}} X_{α + 1, i_{α + 1}} \dots X_{n, i_{n}} A_{n, i_{n}}^{*} \dots A_{α + 1, i_{α + 1}}^{*} A_{α, β}^{*} A_{α - 1, i_{α - 1}}^{*} \dots A_{1, i_{1}}^{*} X_{1, i_{1}} \dots X_{α - 1, i_{α - 1}})^{T}$

$= [(Γ (X_{α + 1, 1}, \dots, X_{α + 1, d_{α + 1}}; A_{α + 1, 1}, \dots, A_{α + 1, d_{α + 1}}) \dots Γ (X_{n, 1}, \dots, X_{n, d_{n}}; A_{n, 1}, \dots, A_{n, d_{n}}) (1))$

$A_{α, β}^{*}$

$(Γ (A_{α - 1, 1}^{*}, \dots, A_{α - 1, d_{α - 1}}^{*}; X_{α - 1, 1}^{*}, \dots, X_{α - 1, d_{α - 1}}^{*}) \dots Γ (A_{1, 1}^{*}, \dots, A_{1, d_{1}}^{*}; X_{1, 1}^{*}, \dots, X_{1, d_{1}}^{*}) (1))]^{T}$ .
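The closed form for the gradient can be compared against central finite differences. The sketch below restricts to real matrices (so that $*$ is just the transpose), uses 0-based layer indices with layer $i$ stored as an array of shape `(d_i, m_i, m_{i+1})` , and computes the inner product by brute force so that the check does not rely on the superoperator identity. All names are illustrative.

```python
import itertools
import numpy as np

def inner_brute(X, A):
    """<T(X), T(A)> summed entry by entry (real case)."""
    dims = [M.shape[0] for M in X]
    total = 0.0
    for idx in itertools.product(*(range(d) for d in dims)):
        px, pa = np.eye(1), np.eye(1)
        for X_i, A_i, i in zip(X, A, idx):
            px, pa = px @ X_i[i], pa @ A_i[i]
        total += px[0, 0] * pa[0, 0]
    return total

def grad_formula(X, A, alpha, beta):
    """[left . A_{alpha,beta}^T . right]^T from the derivation above."""
    left = np.ones((1, 1))   # Gamma(X_{a+1}; A_{a+1}) ... Gamma(X_n; A_n)(1)
    for X_i, A_i in zip(reversed(X[alpha + 1:]), reversed(A[alpha + 1:])):
        left = np.einsum('jab,bc,jdc->ad', X_i, left, A_i)
    right = np.ones((1, 1))  # Gamma(A_{a-1}^T; X_{a-1}^T) ... Gamma(A_1^T; X_1^T)(1)
    for X_i, A_i in zip(X[:alpha], A[:alpha]):
        right = np.einsum('jba,bc,jcd->ad', A_i, right, X_i)
    return (left @ A[alpha][beta].T @ right).T

def grad_numeric(X, A, alpha, beta, eps=1e-6):
    """Central finite differences with respect to the entries of X_{alpha,beta}."""
    G = np.zeros_like(X[alpha][beta])
    for p, q in np.ndindex(*G.shape):
        Xp = [M.copy() for M in X]
        Xm = [M.copy() for M in X]
        Xp[alpha][beta, p, q] += eps
        Xm[alpha][beta, p, q] -= eps
        G[p, q] = (inner_brute(Xp, A) - inner_brute(Xm, A)) / (2 * eps)
    return G

rng = np.random.default_rng(0)
m = [1, 2, 3, 2, 1]
d = [2, 3, 2, 2]
X = [rng.standard_normal((d[i], m[i], m[i + 1])) for i in range(4)]
A = [rng.standard_normal((d[i], m[i], m[i + 1])) for i in range(4)]
print(np.max(np.abs(grad_formula(X, A, 1, 2) - grad_numeric(X, A, 1, 2))))
```

Because the inner product is linear in each individual matrix $X_{α, β}$ , the central difference is exact up to rounding, so the two gradients should agree to many digits.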

[-]Joseph Van Name2y10

So in my research into machine learning algorithms, I have stumbled upon a dimensionality reduction algorithm for tensors, and my computer experiments have so far yielded interesting results. I am not sure that this dimensionality reduction is new, but I plan on generalizing this dimensionality reduction to more complicated constructions that I am pretty sure are new and am confident would work well.

Suppose that $K$ is either the field of real numbers or the field of complex numbers. Suppose that $d_{1}, \dots, d_{n}$ are positive integers and $(m_{0}, \dots, m_{n})$ is a sequence of positive integers with $m_{0} = m_{n} = 1$ . Suppose that $X_{i, j}$ is an $m_{i - 1} \times m_{i}$ -matrix whenever $1 \leq j \leq d_{i}$ . Then define a tensor $T ((X_{i, j})_{i, j}) = (X_{1, i_{1}} \dots X_{n, i_{n}})_{i_{1}, \dots, i_{n}} \in K^{d_{1}} \otimes \dots \otimes K^{d_{n}}$ .

If $v \in K^{d_{1}} \otimes \dots \otimes K^{d_{n}}$ , and $(X_{i, j})_{i, j}$ is a system of matrices that minimizes the value $∥ v - T ((X_{i, j})_{i, j}) ∥$ , then $T ((X_{i, j})_{i, j})$ is a dimensionality reduction of $v$ , and we shall let $u$ denote the tensor of reduced dimension $T ((X_{i, j})_{i, j})$ . We shall call $u$ a matrix table to tensor dimensionality reduction of $v$ of type $(m_{0}, \dots, m_{n})$ .
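In the special case $n = 2$ with type $(1, m_{1}, 1)$ , each $X_{1, i}$ is a row vector and each $X_{2, j}$ is a column vector, so $T ((X_{i, j})_{i, j})$ is a $d_{1} \times d_{2}$ matrix of rank at most $m_{1}$ , and a minimizer is given by the truncated singular value decomposition (the Eckart-Young theorem). The sketch below covers only this special case and is not the optimization procedure used in the post; for $n > 2$ some iterative method would presumably be needed.

```python
import numpy as np

rng = np.random.default_rng(0)
d1, d2, m1 = 6, 5, 2
v = rng.standard_normal((d1, d2))  # the tensor to be approximated (n = 2)

# Truncated SVD gives the best rank-m1 approximation in the l_2 (Frobenius) norm.
U, s, Vh = np.linalg.svd(v, full_matrices=False)
X1 = (U[:, :m1] * s[:m1])[:, None, :]  # X_{1,i}: d1 matrices of shape (1, m1)
X2 = Vh[:m1, :].T[:, :, None]          # X_{2,j}: d2 matrices of shape (m1, 1)

# u = T((X_ij)): entry (i, j) is the 1x1 product X_{1,i} X_{2,j}.
u = np.array([[(X1[i] @ X2[j])[0, 0] for j in range(d2)] for i in range(d1)])

residual = np.linalg.norm(v - u)
print(residual, np.sqrt(np.sum(s[m1:] ** 2)))
```

The residual equals the $ℓ_{2}$ -norm of the discarded singular values, which is the smallest error achievable by any matrix of rank at most $m_{1}$ .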

Observation 1: (Sparsity) If $v$ is sparse in the sense that most entries in the tensor $v$ are zero, then the tensor $u$ will tend to have plenty of zero entries, but as expected, $u$ will be less sparse than $v$ .

Observation 2: (Repeated entries) If $v$ is sparse and $v = (x_{i_{1}, \dots, i_{n}})_{i_{1}, \dots, i_{n}}$ and the set $\{x_{i_{1}, \dots, i_{n}} : i_{1}, \dots, i_{n}\}$ has small cardinality, then the tensor $u$ will contain plenty of repeated non-zero entries.

Observation 3: (Tensor decomposition) Let $v$ be a tensor. Then we can often find a matrix table to tensor dimensionality reduction $u$ of type $(m_{0}, \dots, m_{n})$ so that $v - u$ is its own matrix table to tensor dimensionality reduction.

Observation 4: (Rational reduction) Suppose that $v$ is sparse and the entries in $v$ are all integers. Then the value $∥ u - v ∥^{2}$ is often a positive integer in both the case when $u$ has only integer entries and in the case when $u$ has non-integer entries.

Observation 5: (Multiple lines) Let $m$ be a fixed positive even number. Suppose that $v$ is sparse and the entries in $v$ are all of the form $r \cdot e^{2 π i n / m}$ for some integer $n$ and $r \geq 0$ . Then the entries in $u$ are often exclusively of the form $r \cdot e^{2 π i n / m}$ as well.

Observation 6: (Rational reductions) I have observed a sparse tensor $v$ all of whose entries are integers along with matrix table to tensor dimensionality reductions $u_{1}, u_{2}$ of $v$ where $∥ v - u_{1} ∥ = 3, ∥ v - u_{2} ∥ = 2, ∥ u_{2} - u_{1} ∥ = 5$ .

This is not an exhaustive list of all the observations that I have made about the matrix table to tensor dimensionality reduction.

From these observations, one should conclude that the matrix table to tensor dimensionality reduction is a well-behaved machine learning algorithm. I hope and expect this machine learning algorithm and many similar ones to be used to both interpret the AI models that we have and will have and also to construct more interpretable and safer AI models in the future.

[-]Joseph Van Name3y10

Suppose that $q, r, d$ are natural numbers. Let $1 < p < \infty$ . Let $z_{i, j}$ be a complex number whenever $1 \leq i \leq q, 1 \leq j \leq r$ . Let $L : M_{d} (C)^{r} ∖ \{0\} \to [- \infty, \infty)$ be the fitness function defined by letting $L (X_{1}, \dots, X_{r})$ $= (\sum_{i = 1}^{q} \log (ρ (\sum_{j = 1}^{r} z_{i, j} X_{j})) / q) - \log (∥ \sum_{j = 1}^{r} X_{j} X_{j}^{*} ∥_{p}) / 2$ . Here, $ρ (X)$ denotes the spectral radius of a matrix $X$ while $∥ X ∥_{p}$ denotes the Schatten $p$ -norm of $X$ .

Now suppose that $(A_{1}, \dots, A_{r}) \in M_{d} (C)^{r} ∖ \{0\}$ is a tuple that maximizes $L (A_{1}, \dots, A_{r})$ . Let $M : C^{r} ∖ \{0\} \to [- \infty, \infty)$ be the fitness function defined by letting $M (w_{1}, \dots, w_{r}) = \log (ρ (w_{1} A_{1} + \dots + w_{r} A_{r})) - \log (∥ (w_{1}, \dots, w_{r}) ∥_{2})$ . Then suppose that $(v_{1}, \dots, v_{r}) \in C^{r} ∖ \{0\}$ is a tuple that maximizes $M (v_{1}, \dots, v_{r})$ . Then we will likely be able to find an $ℓ \in \{1, \dots, q\}$ and a non-zero complex number $α$ where $(v_{1}, \dots, v_{r}) = α \cdot (z_{ℓ, 1}, \dots, z_{ℓ, r})$ .

In this case, $(z_{i, j})_{i, j}$ represents the training data while the matrices $A_{1}, \dots, A_{r}$ are our learned machine learning model. In this case, we are able to recover some original data values from the learned machine learning model $A_{1}, \dots, A_{r}$ without any distortion to the data values.
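For readers who want to experiment, the two fitness functions can be evaluated as in the NumPy sketch below. Only the evaluation is shown; the maximization and the recovery of a row of $(z_{i, j})_{i, j}$ are not reproduced here. The names `L_fitness` and `M_fitness` and the array layouts (`z` of shape `(q, r)` , matrices of shape `(r, d, d)` ) are assumptions of this illustration.

```python
import numpy as np

def spectral_radius(M):
    return float(np.max(np.abs(np.linalg.eigvals(M))))

def schatten_norm(M, p):
    """l_p norm of the singular values of M."""
    s = np.linalg.svd(M, compute_uv=False)
    return float(np.sum(s ** p) ** (1.0 / p))

def L_fitness(Xs, z, p=2.0):
    """mean_i log rho(sum_j z_ij X_j) - log(||sum_j X_j X_j^*||_p) / 2."""
    combos = np.einsum('ij,jab->iab', z, Xs)
    data_term = np.mean([np.log(spectral_radius(C)) for C in combos])
    gram = np.einsum('jab,jcb->ac', Xs, Xs.conj())  # sum_j X_j X_j^*
    return float(data_term - np.log(schatten_norm(gram, p)) / 2)

def M_fitness(w, As):
    """log rho(w_1 A_1 + ... + w_r A_r) - log ||w||_2."""
    combo = np.einsum('j,jab->ab', w, As)
    return float(np.log(spectral_radius(combo)) - np.log(np.linalg.norm(w)))

# Hypothetical random instance with q = 4 data rows, r = 3, d = 2.
rng = np.random.default_rng(0)
q, r, d = 4, 3, 2
z = rng.standard_normal((q, r)) + 1j * rng.standard_normal((q, r))
Xs = rng.standard_normal((r, d, d)) + 1j * rng.standard_normal((r, d, d))
w = rng.standard_normal(r) + 1j * rng.standard_normal(r)
print(L_fitness(Xs, z), M_fitness(w, Xs))
```

Both functions are invariant under multiplying their argument by a non-zero complex scalar, which is why the recovered vector $(v_{1}, \dots, v_{r})$ can only match a data row up to the factor $α$ .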

I have just made this observation, so I am still exploring the implications of this observation. But this is an example of how mathematical spectral machine learning algorithms can behave, and more mathematical machine learning models are more likely to be interpretable and they are more likely to have a robust mathematical/empirical theory behind them.

[-]Joseph Van Name3y30

I think that all that happened here was the matrices just ended up being diagonal matrices. This means that this is probably an uninteresting observation in this case, but I need to do more tests before commenting any further.
