Joseph Van Name - LessWrong

The intersection of machine learning and generalizations of Laver tables seems to have a lot of potential, so your question about this is extremely interesting to me.

Machine learning models from generalized Laver tables?

I have not been able to find any machine learning models that are based on generalized Laver tables that are used for solving problems unrelated to Laver tables. I currently do not directly see any new kinds of deep learning models that one can produce from generalizations of Laver tables, but I would not be surprised if there were a non-standard machine learning model from Laver like algebras that we could use for something like NLP.

But I have to be skeptical about applying machine learning to generalized Laver tables. Generalizations of Laver tables have intricate structure and complexity but this complexity is ordered. Furthermore, there is a very large supply of Laver-like algebras (even with 2 generators) for potentially any situation. Laver-like algebras are also very easy to compute. While backtracking may potentially take an exponential amount of time, when I have computed Laver-like algebras, the backtracking algorithm seems to always terminate quickly unless it is producing a large quantity or Laver-like algebras or larger Laver-like algebras. But with all this being said, Laver-like algebras seem to lack the steerability that we need to apply these algebras to solving real-world problems. For example, try interpreting (in a model theoretic sense) modular arithmetic modulo 13 in a Laver-like algebra. That is quite hard to do because Laver-like algebras are not built for that sort of thing.

Here are a couple of avenues that I see as most promising for producing new machine learning algorithms from Laver-like algebras and other structures.

Let $X$ be a set, and let $*$ be a binary operation on $X$ that satisfies the self-distributivity identity $x * (y * z) = (x * y) * (x * z)$ . Define the right powers $x^{[n]}$ for $n \geq 1$ inductively by setting $x^{[1]} = x$ and $x^{[n + 1]} = x * x^{[n]}$ . We say that $(X, *, 1)$ is a reduced nilpotent self-distributive algebra if $1$ satisfies the identities $1 * x = x, x * 1 = 1$ for $x \in X$ and if for all $x \in X$ there is an $n \geq 1$ with $x^{[n]} = 1$ . A reduced Laver-like algebra is a reduced nilpotent self-distributive algebra $(X, *)$ where if $x_{n} \in X$ for each $n \geq 0$ , then $x_{0} * \dots * x_{N} = 1$ for some $N$ . Here we make the convention to put the implied parentheses on the left so that $a * b * c * d = ((a * b) * c) * d$ .

Reduced nilpotent self-distributive algebras have most of the things that one would expect. Reduced nilpotent self-distributive algebras are often equipped with a composition operation. Reduced nilpotent self-distributive algebras have a notion of a critical point, and if our reduced nilpotent self-distributive algebra is endowed with a composition operation, the set of all critical points in the algebra forms a Heyting algebra. If $(X, *)$ is a self-distributive algebra, then we can define $x *_{0} y = y, x *_{n + 1} y = x *_{n} (x * y) = x * (x *_{n} y)$ and $x *_{\infty} y = {lim}_{n \to \infty} x *_{n} y$ in the discrete topology (this limit always exists for nilpotent self-distributive algebras). We define $x ⪯ y$ precisely when $x *_{\infty} y = 1$ and $x ≃ y$ precisely when $x ⪯ y ⪯ x$ . We can define $crit (x)$ to be the equivalence class of $x$ with respect to $≃$ and $crit [X] = X / ≃$ . The operations on $crit [X]$ are $crit (x) \to crit (y) = crit (x *_{\infty} y)$ and $crit (x) \land crit (y) = crit (x \circ y)$ whenever we have our composition operation $\circ$ . One can use computer calculations to add another critical point to a reduced nilpotent self-distributive algebra and obtain a new reduced nilpotent self-distributive algebra with the same number of generators but with another critical point on top (and this new self-distributive algebra will be sub-directly irreducible). Therefore, by taking subdirect products and adding new critical points, we have an endless supply of reduced nilpotent self-distributive algebras with only 2 generators. I also know how to expand reduced nilpotent self-distributive algebras horizontally in the sense that given a finite reduced nilpotent self-distributive algebra $(X, *)$ that is not a Laver-like algebra, we can obtain another finite reduced nilpotent self-distributive algebra $(Y, *)$ where $X, Y$ are both generated by the same number of elements and these algebras both have the same implication algebra of critical points but $| Y | >> | X |$ but there is a surjective homomorphism $ϕ : Y \to X$ . The point is that we have techniques for producing new nilpotent self-distributive algebras from old ones, and going deep into those new nilpotent self-distributive algebras.

Since reduced nilpotent self-distributive algebras are closed under finite subdirect products, subalgebras, and quotients, and since we have techniques for producing many nilpotent self-distributive algebras, perhaps one can make a ML model from these algebraic structures. On the other hand, there seems to be less of a connection between reduced nilpotent self-distributive algebras and large cardinals and non-commutative polynomials than with Laver-like algebras, so perhaps it is sufficient to stick to Laver-like algebras as a platform for constructing ML models.

From Laver-like algebras, we can also produce non-commutative polynomials. If $(X, *)$ is a Laver-like algebra, and $(x_{a})_{a \in A}$ is a generating set for $X$ , then for each non-maximal critical point $α \in crit [X]$ , we can define a non-commutative polynomial $p_{α} ((x_{a})_{a \in A})$ to be the sum of all non-commutative monomials of the form $x_{a_{1}} \dots x_{a_{n}}$ where $a_{1} \dots a_{n} \in A^{*}$ and $crit (a_{1} * \dots * a_{n}) = α$ but where $crit (a_{1} * \dots * a_{m}) < α$ for $1 \leq m < n$ . These non-commutative polynomials capture all the information behind the Laver-like algebras since one can reconstruct the entire Laver-like algebra up to critical equivalence from these non-commutative polynomials. These non-commutative polynomials don't work as well for nilpotent self-distributive algebras; for nilpotent self-distributive algebras, these non-commutative polynomials will instead be rational expressions and one will not be able to recover the entire nilpotent self-distributive algebra from these rational expressions.

I have used these non-commutative polynomials to construct the infinite product formula

$\dots p_{r} (x_{1}, \dots, x_{n}) \dots p_{0} (x_{1}, \dots, x_{n}) = \frac{1}{1 - (x_{1} + \dots + x_{n})}$ where the polynomials $p_{n} (x_{1}, \dots, x_{n})$ are obtained from $n$ different non-trivial rank-into-rank embeddings $j_{1}, \dots, j_{n} : V_{λ} \to V_{λ}$ . This means that the non-commutative ring operations $+, \cdot$ in the rings of non-commutative polynomials are meaningful for Laver-like algebras.

If I wanted to interpret and make sense of a Laver-like algebra, I would use these non-commutative polynomials. Since non-commutative polynomials are closely related to strings (for NLP) and since the variables in these non-commutative polynomials can be substituted with matrices, I would not be surprised if one can turn Laver-like algebras into a full ML model using these non-commutative polynomials in order to solve a problem unrelated to Laver-like algebras.

Perhaps, one can also use strong large cardinal hypotheses and logic to produce ML models from Laver-like algebras. Since the existence of a non-trivial rank-into-rank embedding is among the strongest of all large cardinal hypotheses, and since one can produce useful theorems about natural looking finite algebraic structures from these large cardinal hypotheses, perhaps the interplay between the logic using large cardinal hypotheses and finite algebraic structures may be able to produce ML models? For example, one can automate the process of using elementarity to prove the existence of finite algebraic structures. After one has proven the existence of these structures using large cardinal hypotheses, one can do a search using backtracking in order to actually find candidates for these algebraic structures (or if the backtracking algorithm is exhaustive and turns up nothing, then we have a contradiction in the large cardinal hierarchy or a programming error. Mathematicians need to expend a lot of effort into finding an inconsistency in the large cardinal hierarchy this way because I am confident that they won't find such an inconsistency they will instead find plenty of near misses.). We can automate the process of producing examples of Laver-like algebras that satisfy conditions that were first established using large cardinal hypotheses, and we can perhaps completely describe these Laver-like algebras (up-to-critical equivalence) using our exhaustive backtracking process. This approach can be used to produce a lot of data about rank-into-rank embeddings, but do not see a clear path that allows us to apply this data outside of set theory and Laver-like algebras.

Using deep learning to investigate generalized Laver tables

We can certainly apply existing deep learning and machine learning techniques to classical and multigenic Laver tables.

The n-th classical Laver table is the unique self-distributive algebraic structure $A_{n} = ({1, \dots, 2^{n}}, *_{n})$ where $x *_{n} 1 = x + 1 mod 2^{n}$ whenever $x \in {1, \dots, 2^{n}}$ . The classical Laver tables are up-to-isomorphism the only nilpotent self-distributive algebras generated by a single element.

Problem: Compute the $n$ -th classical Laver table for as large $n$ as we can achieve.

Solution: Randall Dougherty in his 1995 paper Critical points in an algebra of elementary embeddings II gave an algorithm for computing the 48-th classical Laver table $A_{48}$ and with machine learning, I have personally gotten up to $A_{768}$ (I think I have a couple of errors, but those are not too significant), and I have some of the computation of $A_{1024}$ .

Suppose now that $2^{N} \leq n \leq 3 \cdot 2^{N}$ . Then Dougherty has shown that one can easily recover the entire function $*_{n}$ from the restriction $L_{N, n} : {1, \dots, 2^{2^{N}}} \times {1, \dots, 2^{n}} \to {1, \dots, 2^{n}}$ where $L_{N, n} (x, y) = x *_{n} y$ for all $x, y$ . Suppose now that $1 \leq x \leq 2^{n}$ . Then let $o_{n} (x)$ denote the least natural number $m$ such that $x *_{n} 2^{m} = 2^{n}$ . Then there is a number $θ_{n + 1} (x)$ called the threshold of $x$ at $A_{n + 1}$ such that $0 \leq θ_{n + 1} (x) \leq 2^{o_{n} (x)}$ and $θ_{n + 1} (2^{n}) = 0$ and such that for every $y$ with $1 \leq y \leq 2^{n}$ , we have $x *_{n + 1} y = x *_{n} y$ whenever $1 \leq y \leq θ_{n + 1} (x)$ and $x *_{n + 1} y = (x *_{n} y) + 2^{n}$ for $θ_{n + 1} (x) < y \leq 2^{o_{n} (x)}$ . The classical Laver table $A_{n + 1}$ can be easily recovered from the table $A_{n}$ and the threshold function $θ_{n + 1}$ . To go beyond Dougherty's calculation of $A_{48}$ to $A_{768}$ and beyond, we exploit a couple of observations about the threshold function $θ_{n + 1}$ :

i: in most cases, $θ_{n + 2} (x) = θ_{n + 1} (x)$ , and

ii: in most cases, $θ_{n + 1} (x) = 2^{i} - 2^{j}$ for some $i, j$ .

Let $E (n) = {x \in {1, \dots, 2^{n - 2}} : θ_{n - 1} (x) \neq θ_{n} (x)}$ , and let $E^{♯} (n) = E (n) \cap {1, \dots, 2^{N} - 2}$ where $N$ is the least natural number where $n \leq 3 \cdot 2^{N}$ and $T_{n} = {(x, θ_{n} (x)) : x \in E^{♯} (n)}$ . Then one can easily compute $A_{n}$ from $A_{n - 1}$ and the data $T_{n}$ by using Dougherty's algorithm. Now the only thing that we need to do is compute $T_{n}$ in the first place. It turns out that computing $T_{n}$ is quite easy except when $n$ is of the form $2^{r} + 1$ or $3 \cdot 2^{r} + 1$ for some $r$ . In the case, when $n = 2^{r} + 1$ or $n = 3 \cdot 2^{r} + 1$ , we initially set $T_{n}^{temp} = \emptyset .$ We then repeatedly modify $T_{n}^{temp}$ until we can no longer find any contradiction in the statement $T_{n} = T_{n}^{temp}$ . Computing $T_{n}$ essentially amounts to finding new elements in the set $E^{♯} (n)$ and we find new elements in $E^{♯} (n)$ using some form of machine learning.

In my calculation of $A_{768}$ , I used operations like bitwise AND, OR, and the bitwise majority operation to combine old elements in $E^{♯} (n)$ to find new elements in $E^{♯} (n)$ , and I also employed other techniques similar to this. I did not use any neural networks though, but using neural networks seems to be a reasonable strategy, but we need to use the right kinds of neural networks.

My idea is to use something similar to a CNN where the linear layers are not as general as possible, but the linear layers are structured in a way that reduces the number of weights and so that the structure of the linear layers is compatible with the structure in the data. The elements in $E^{♯} (n)$ belong to ${1, \dots, 2^{N}}$ and ${1, \dots, 2^{N}}$ has a natural tensor product structure. One can therefore consider $E^{♯} (n)$ to be a subset of $(R^{2})^{\otimes n}$ . Furthermore, a tuple of $r$ elements in $E^{♯} (n)$ can be considered as a subset of $(R^{2}))^{\otimes n} \otimes R^{r}$ . This means that the linear layers from $(R^{2}))^{\otimes n} \otimes R^{r}$ to $(R^{2}))^{\otimes n} \otimes R^{r}$ should be Kronecker products $A_{1} \otimes \dots \otimes A_{n + 1}$ or Kronecker sums $A_{1} \oplus \dots \oplus A_{n + 1}$ (it seems like Kronecker sums with residual layers would work better than Kronecker products). Recall that the Kronecker product and Kronecker sum of matrices $A, B$ are defined by setting $(A \otimes B) (u \otimes v) = A u \otimes B v$ and $A \oplus B = (A \otimes 1) + (1 \otimes B)$ respectively. Perhaps neural networks can be used to generate new elements in $E^{♯} (n)$ . These generative neural networks do not have to be too large nor do they have to be too accurate. A neural network that is 50 percent accurate will be better than a neural network that is 90 percent accurate but takes 10 times as long to compute and is much harder to retrain once most of the elements in $E^{♯} (n)$ have already been obtained and when the neural network has trouble finding new elements. I also favor smaller neural networks because one can compute $A_{768}$ without using complicated techniques in the first place. I still need to figure out the rest of the details for the generative neural network that finds new elements of $E^{♯} (n)$ since I want it to be simple and easy to compute so that it is competitive with things like bitwise operations.

Problem: Translate Laver-like algebras into data (like sequences of vectors or matrices) into a form that machine learning models can work with.

Solution: The non-commutative polynomials $p_{0}, \dots, p_{n}$ are already in a form that is easier for machine learning algorithms to use. I was able to apply an algorithm that I originally developed for analyzing block ciphers in order to turn the non-commutative polynomials $p_{0}, \dots, p_{n}$ into unique sequences of real matrices $A_{0}, \dots, A_{n}$ (these real matrices do not seem to depend on the initialization since the gradient ascent always seems to converge to the same local maximum). One can then feed this sequence of real matrices $A_{0}, \dots, A_{n}$ into a transformer to solve whatever problem one wants to solve about Laver-like algebras. One can also represent a Laver-like algebra as a 2-dimensional image and then pass that image through a CNN to learn about the Laver-like algebra. This technique may only be used for Laver-like algebras that are small enough for CNNs to fully see, so I do not recommend CNNs for Laver-like algebras, but this is at least a proof-of-concept.

Problem: Estimate the number of small enough Laver-like algebras (up-to-critical equivalence) that satisfy certain properties.

Solution: The set of all multigenic Laver tables over an alphabet $A$ forms a rooted tree. We can therefore apply the technique for estimating the number of nodes in a rooted tree described in the 1975 paper Estimating the Efficiency of Backtrack Programs by Donald Knuth. This estimator is unbiased (the mean of the estimated value is actually the number of nodes in the tree), but it will have a very high variance for estimating the number of multigenic Laver tables. In order to reduce the variance for this estimator, we need to assign better probabilities when finding a random path from the root to the leaves, and we should be able to use transformers to assign these probabilities.

I am pretty sure that there are plenty of other ways to use machine learning to investigate Laver-like algebras.

Now, nilpotent self-distributive algebras may be applicable in cryptography (they seem like suitable platforms for the Kalka-Teicher key exchange but nobody has researched this). Perhaps a transformer that learns about Laver-like algebras would do better on some unrelated task than a transformer that was never trained using Laver-like algebras. But I do not see any other practical applications of nilpotent Laver-like algebras at the moment. This means that Laver-like algebras are currently a relatively safe problem for investigating machine learning algorithms, so Laver-like algebras are perhaps applicable to AI safety since I currently do not see any way of misaligning an AI for solving a problem related to Laver-like algebras.

Joseph Van Name's Shortform

Joseph Van Name2mo10

There are some cases where we have a complete description for the local optima for an optimization problem. This is a case of such an optimization problem.

Such optimization problems are useful for AI safety since a loss/fitness function where we have a complete description of all local or global optima is a highly interpretable loss/fitness function, and so one should consider using these loss/fitness functions to construct AI algorithms.

Theorem: Suppose that is a real,complex, or quaternionic $n \times n$ -matrix that minimizes the quantity $∥ U ∥_{2} + ∥ U^{- 1} ∥_{2}$ . Then $U$ is unitary.

Proof: The real case is a special case of a complex case, and by representing each $n \times n$ -quaternionic matrix as a complex $2 n \times 2 n$ -matrix, we may assume that $U$ is a complex matrix.

By the Schur decomposition, we know that $U = V T V^{*}$ where $V$ is a unitary matrix and $T$ is upper triangular. But we know that $∥ U ∥_{2} = ∥ T ∥_{2}$ . Furthermore, $U^{- 1} = V T^{- 1} V^{*}$ , so $∥ U^{- 1} ∥_{2} = ∥ T^{- 1} ∥_{2}$ . Let $D$ denote the diagonal matrix whose diagonal entries are the same as $T$ . Then $∥ T ∥_{2} \geq ∥ D ∥_{2}$ and $∥ T^{- 1} ∥_{2} \geq ∥ D^{- 1} ∥_{2}$ . Furthermore, $∥ T ∥_{2} = ∥ D ∥_{2}$ iff T is diagonal and $∥ T^{- 1} ∥_{2} = ∥ D^{- 1} ∥_{2}$ iff $D$ is diagonal. Therefore, since $∥ U ∥_{2} + ∥ U^{- 1} ∥_{2} = ∥ T ∥_{2} + ∥ T^{- 1} ∥_{2}$ and $∥ T ∥_{2} + ∥ T^{- 1} ∥_{2}$ is minimized, we can conclude that $T = D$ , so $T$ is a diagonal matrix. Suppose that $T$ has diagonal entries $(z_{1}, \dots, z_{n})$ . By the arithmetic-geometric mean equality and the Cauchy-Schwarz inequality, we know that $\frac{1}{2} \cdot (∥ (z_{1}, \dots, z_{n}) ∥_{2} + ∥ (z_{1}^{- 1}, \dots, z_{n}^{- 1}) ∥_{2}) \geq ∥ (| z_{1} |, \dots, | z_{n} |) ∥_{2} \cdot ∥ (| z_{1}^{- 1} |, \dots, | z_{n}^{- 1}) | ∥_{2}$

$\geq ⟨ (| z_{1} |, \dots, | z_{n} |), (| z_{1}^{- 1} |, \dots, | z_{n}^{- 1}) | ⟩ = \sqrt{n} .$

Here, the equalities hold if and only if $| z_{j} | = 1$ for all $j$ , but this implies that $U$ is unitary. Q.E.D.

Apologizing is a Core Rationalist Skill

Joseph Van Name4mo120

I do not care to share much more of my reasoning because I have shared enough and also because there is a reason that I have vowed to no longer discuss except possibly with lots of obfuscation. This discussion that we are having is just convincing me more that the entities here are not the entities I want to have around me at all. It does not do much good to say that the community here is acting well or to question my judgment about this community. It will do good for the people here to act better so that I will naturally have a positive judgment about this community.

Apologizing is a Core Rationalist Skill

Joseph Van Name4mo-20

You are judging my reasoning without knowing all that went into my reasoning. That is not good.

Apologizing is a Core Rationalist Skill

Joseph Van Name4mo3-1

I will work with whatever data I have, and I will make a value judgment based on the information that I have. The fact that Karma relies on very small amounts of information is a testament to a fault of Karma, and that is further evidence of how the people on this site do not want to deal with mathematics. And the information that I have indicates that there are many people here who are likely to fall for more scams like FTX. Not all of the people here are so bad, but I am making a judgment based on the general atmosphere here. If you do not like my judgment, then the best thing would be to try to do better. If this site has made a mediocre impression on me, then I am not at fault for the mediocrity here.

Apologizing is a Core Rationalist Skill

Joseph Van Name4mo100

Let's see whether the notions that I have talked about are sensible mathematical notions for machine learning.

Tensor product-Sometimes data in a neural network has tensor structure. In this case, the weight matrices should be tensor products or tensor sums. Regarding the structure of the data works well with convolutional neural networks, and it should also work well for data with tensor structure to it.

Trace-The trace of a matrix measures how much the matrix maps vectors onto themselves since

where $v$ follows the multivariate normal distribution.

Spectral radius-Suppose that we are iterating a smooth function $f$ . Suppose furthermore that $f (v) = v$ and $u_{0}$ is near $v$ and $u_{n + 1} = f (u_{n})$ . We would like to determine whether ${lim}_{n \to \infty} u_{n} = v$ or not. If the Jacobian of $f$ at $v$ has spectral radius less than $1$ , then ${lim}_{n \to \infty} u_{n} = v$ ,. If the Jacobian of $f$ at $v$ has spectral radius greater than $1$ , then this limit does not converge.

The notions that I have been talking about are sensible and arise in machine learning. And understanding these notions is far easier than trying to interpret very large networks like GPT-4 without using these notions. Many people on this site just act like clowns. Karma is only a good metric when the people on the network value substance over fluff. And the only way to convince me otherwise will be for the people here to value posts that involve basic notions like the trace, eigenvalues, and spectral radius of matrices.

P.S. I can make the trace, determinant, and spectral radius even simpler. These operations are what you get when you take the sum, product, and the maximum absolute value of the eigenvalues. Yes. Those are just the basic eigenvalue operations.

Apologizing is a Core Rationalist Skill

Joseph Van Name4mo41

Talking about whining and my loss of status is a good way to get me to dislike the LW community and consider them to be anti-intellectuals who fall for garbage like FTX. Do you honestly think the people here should try to interpret large sections of LLMs while simultaneously being afraid of quaternions?

It is better to comment on threads where we are interacting in a more positive manner.

I thought apologizing and recognizing inadequacies was a core rationalist skill. And I thought rationalists were supposed to like mathematics. The lack of mathematical appreciation is one of these inadequacies of the LW community. But instead of acknowledging this deficiency, the community here blasts me as talking about something off topic. How ironic!

An Introduction To The Mandelbrot Set That Doesn't Mention Complex Numbers

Joseph Van Name4mo50

I usually think of the field of complex numbers algebraically, but one can also think of the real numbers, complex numbers, and quaternions geometrically. The real numbers are good with dealing with 1 dimensional space, and the complex numbers are good for dealing with 2 dimensional space geometrically. While the division ring of quaternions is a 4 dimensional algebra over the field of real numbers, the quaternions are best used for dealing with 3 dimensional space geometrically.

For example, if are open subsets of some Euclidean space, then a function $f : U \to V$ is said to be a conformal mapping when it preserves angles and the orientation. We can associate the 2-dimensional Euclidean space with the field of complex numbers, and the conformal mappings between open subsets of 2-dimensional spaces are just the complex differentiable mappings. For the Mandelbrot set, we need this conformality because we want the Mandelbrot set to look pretty. If the complex differentiable maps were not conformal, then the functions that we iterate in complex dynamics would stretch subsets of the complex plane in one dimension and expand them in the other dimension and this would result in a fractal that looks quite stretched in one real dimension and squashed in another dimension (the fractals would look like spaghetti; oh wait, I just looked at a 3D fractal and it looks like some vegetable like broccoli). This stretching and squashing is illustrated by 3D fractals that try to mimic the Mandelbrot set but without any conformality. The conformality is why the Julia sets are sensible (mathematicians have proven theorems about these sets) for any complex polynomial of degree 2 or greater.

For the quaternions, it is well-known that the dot product and the cross product operations on 3 dimensional space can be described in terms of the quaternionic multiplication operation between purely imaginary quaternions.

Has anyone actually tried to convince Terry Tao or other top mathematicians to work on alignment?

Answer by Joseph Van NameJan 12, 2024100

Um. If you want to convince a mathematician like Terry Tao to be interested in AI alignment, you will need to present yourself as a reasonably competent mathematician or related expert and actually formulate an AI problem in such a way so that someone like Terry Tao would be interested in it. If you yourself are not interested in the problem, then Terry Tao will not be interested in it either.

Terry Tao is interested in random matrix theory (he wrote the book on it), and random matrix theory is somewhat related to my approach to AI interpretability and alignment. If you are going to send these problems to a mathematician, please inform me about this before you do so.

My approach to alignment: Given matrices , define a superoperator $Γ (A_{1}, \dots, A_{r}; B_{1}, \dots, B_{r})$ by setting

$Γ (A_{1}, \dots, A_{r}; B_{1}, \dots, B_{r}) (X) = \sum_{k = 1}^{r} A_{k} X B_{k}^{*}$ , and define $Φ (A_{1}, \dots, A_{r}) = Γ (A_{1}, \dots, A_{r}; A_{1}, \dots, A_{r})$ . Define the $L_{2}$ -spectral radius of $A_{1}, \dots, A_{r}$ as $ρ_{2} (A_{1}, \dots, A_{r}) = ρ (Φ (A_{1}, \dots, A_{r}))^{1 / 2}$ . Here, $ρ (A) = {lim}_{n \to \infty} ∥ A^{n} ∥^{1 / n}$ is the usual spectral radius.

Define $ρ_{2, d}^{K} (A_{1}, \dots, A_{r}) = max {\frac{ρ (Γ (A_{1}, \dots, A_{r}; X_{1}, \dots, X_{r}))}{ρ_{2} (X_{1}, \dots, X_{r})} : X_{1}, \dots, X_{r} \in M_{d} (K)}$ . Here, $K$ is either the field of reals, field of complex numbers, or division ring of quaternions.

Given matrices $A_{1}, \dots, A_{r}; B_{1}, \dots, B_{r}$ , define

$∥ (A_{1}, \dots, A_{r}) ≃ (B_{1}, \dots, B_{r}) ∥ = \frac{Γ (A_{1}, \dots, A_{r}; B_{1}, \dots, B_{r})}{ρ_{2} (A_{1}, \dots, A_{r}) ρ_{2} (B_{1}, \dots, B_{r})}$ . The value $∥ (A_{1}, \dots, A_{r}) ≃ (B_{1}, \dots, B_{r}) ∥$ is always a real number in the interval $[0, 1]$ that is a measure of how jointly similar the tuples $(A_{1}, \dots, A_{r}), (B_{1}, \dots, B_{r})$ are. The motivation behind $ρ_{2, d}^{K} (A_{1}, \dots, A_{r})$ is that $\frac{ρ_{2, d}^{K} (A_{1}, \dots, A_{r})}{ρ_{2} (A_{1}, \dots, A_{r})}$ is always a real number in $[0, 1]$ (well except when the denominator is zero) that measures how well $A_{1}, \dots, A_{r}$ can be approximated by $d \times d$ -matrices. Informally, $\frac{ρ_{2, d}^{K} (A_{1}, \dots, A_{r})}{ρ_{2} (A_{1}, \dots, A_{r})}$ measures how random $A_{1}, \dots, A_{r}$ are where a lower value of $\frac{ρ_{2, d}^{K} (A_{1}, \dots, A_{r})}{ρ_{2} (A_{1}, \dots, A_{r})}$ indicates a lower degree of randomness.

A better theoretical understanding of $ρ_{2, d}^{K} (A_{1}, \dots, A_{r})$ would be great. If $X_{1}, \dots, X_{r} \in M_{d} (K)$ and $\frac{ρ (Γ (A_{1}, \dots, A_{r}; X_{1}, \dots, X_{r}))}{ϕ_{2} (X_{1}, \dots, X_{r})}$ is locally maximized, then we say that $(X_{1}, \dots, X_{r})$ is an LSRDR of $(A_{1}, \dots, A_{r})$ . Said differently, $(X_{1}, \dots, X_{r}) \in M_{d} (K)$ is an LSRDR of $(A_{1}, \dots, A_{r})$ if the similarity $∥ (A_{1}, \dots, A_{r}) ≃ (X_{1}, \dots, X_{r}) ∥$ is maximized.

Here, the notion of an LSRDR is a machine learning notion that seems to be much more interpretable and much less subject to noise than many other machine learning notions. But a solid mathematical theory behind LSRDRs would help us understand not just what LSRDRs do, but the mathematical theory would help us understand why they do it.

Problems in random matrix theory concerning LSRDRs:

Suppose that $U_{1}, \dots, U_{r}$ are random matrices (according to some distribution). Then what are some bounds for $ρ_{2, d}^{K} (U_{1}, \dots, U_{r})$ .
Suppose that $U_{1}, \dots, U_{r}$ are random matrices and $A_{1}, \dots, A_{r}$ are non-random matrices. What can we say about the spectrum of $Γ (A_{1}, \dots, A_{r}; U_{1}, \dots, U_{r})$ ? My computer experiments indicate that this spectrum satisfies the circular law, and the radius of the disc for this circular law is proportional to $ρ_{2} (A_{1}, \dots, A_{r})$ , but a proof of this circular law would be nice.
Tensors can be naturally associated with collections of matrices. Suppose now that $U_{1}, \dots, U_{r}$ are the matrices associated with a random tensor. Then what are some bounds for $ρ_{2, d}^{K} (U_{1}, \dots, U_{r})$ .

P.S. By massively downvoting my posts where I talk about mathematics that is clearly applicable to AI interpretability and alignment, the people on this site are simply demonstrating that they need to do a lot of soul searching before they annoy people like Terry Tao with their lack of mathematical expertise.

P.P.S. Instead of trying to get a high profile mathematician like Terry Tao to be interested in problems, it may be better to search for a specific mathematician who is an expert in a specific area related to AI alignment since it may be easier to contact a lower profile mathematician, and a lower profile mathematician may have more specific things to say and contribute. You are lucky that Terry Tao is interested in random matrix theory, but this does not mean that Terry Tao is interested in anything in the intersection between alignment and random matrix theory. It may be better to search harder for mathematicians who are interested in your specific problems.

P.P.P.S. To get more mathematicians interested in alignment, it may be a good idea to develop AI systems that behave much more mathematically. Neural networks currently do not behave very mathematically since they look like the things that engineers would come up with instead of mathematicians.

P.P.P.P.S. I have developed the notion of an LSRDR for cryptocurrency research because I am using this to evaluate the cryptographic security of cryptographic functions.

Joseph Van Name's Shortform

Joseph Van Name4mo30

We can use the spectral radius similarity to measure more complicated similarities between data sets.

Suppose that $A_{1}, \dots, A_{r}$ are $m \times m$ -real matrices and $B_{1}, \dots, B_{r}$ are $n \times n$ -real matrices. Let $ρ (A)$ denote the spectral radius of $A$ and let $A \otimes B$ denote the tensor product of $A$ with $B$ . Define the $L_{2}$ -spectral radius by setting $ρ_{2} (A_{1}, \dots, A_{r}) = ρ (A_{1} \otimes A_{1} + \dots + A_{r} \otimes A_{r})^{1 / 2}$ , Define the $L_{2}$ -spectral radius similarity between $A_{1}, \dots, A_{r}$ and $B_{1}, \dots, B_{r}$ as

$∥ (A_{1}, \dots, A_{r}) ≃ (B_{1}, \dots, B_{r}) ∥_{2} = \frac{ρ (A_{1} \otimes B_{1} + \dots + A_{r} \otimes B_{r})}{ρ_{2} (A_{1}, \dots, A_{r}) ρ_{2} (B_{1}, \dots, B_{r})}$ .

We observe that if $C$ is invertible and $λ$ is a constant, then

$∥ (A_{1}, \dots, A_{r}) ≃ (λ C B_{1} C^{- 1}, \dots, λ C B_{r} C^{- 1}) ∥_{2} = 1.$

Therefore, the $L_{2}$ -spectral radius is able to detect and measure symmetry that is normally hidden.

Example: Suppose that $u_{1}, \dots, u_{r}; v_{1}, \dots, v_{r}$ are vectors of possibly different dimensions. Suppose that we would like to determine how close we are to obtaining an affine transformation $T$ with $T (u_{j}) = v_{j}$ for all $j$ (or a slightly different notion of similarity). We first of all should normalize these vectors to obtain vectors $x_{1}, \dots, x_{r}; y_{1}, \dots, y_{r}$ with mean zero and where the covariance matrix is the identity matrix (we may not need to do this depending on our notion of similarity). Then $∥ (x_{1} x_{1}^{*}, \dots, x_{r} x_{r}^{*}) ≃ (y_{1} y_{1}^{*}, \dots, y_{r} y_{r}^{*}) ∥_{2}$ is a measure of low close we are to obtaining such an affine transformation $T$ . We may be able to apply this notion to determining the distance between machine learning models. For example, suppose that $M, N$ are both the first few layers in a (typically different) neural network. Suppose that $a_{1}, \dots, a_{r}$ is a set of data points. Then if $u_{j} = M (a_{j})$ and $v_{j} = M (a_{j})$ , then $∥ (x_{1} x_{1}^{*}, \dots, x_{r} x_{r}^{*}) ≃ (y_{1} y_{1}^{*}, \dots, y_{r} y_{r}^{*}) ∥_{2}$ is a measure of the similarity between $M$ and $N$ .

I have actually used this example to see if there is any similarity between two different neural networks trained on the same data set. For my experiment, I chose a random collection of $S \subseteq {0, 1}^{32} \times {0, 1}^{32}$ of ordered pairs and I trained the neural networks $M, N$ to minimize the expected losses $E (∥ N (a) - b ∥^{2} : (a, b) \in S), E (∥ M (a) - b ∥^{2} : (a, b) \in S)$ . In my experiment, each $a_{j}$ was a random vector of length 32 whose entries were 0's and 1's. In my experiment, the similarity $∥ (x_{1} x_{1}^{*}, \dots, x_{r} x_{r}^{*}) ≃ (y_{1} y_{1}^{*}, \dots, y_{r} y_{r}^{*}) ∥_{2}$ was worse than if $x_{1}, \dots, x_{r}, y_{1}, \dots, y_{r}$ were just random vectors.

This simple experiment suggests that trained neural networks retain too much random or pseudorandom data and are way too messy in order for anyone to develop a good understanding or interpretation of these networks. In my personal opinion, neural networks should be avoided in favor of other AI systems, but we need to develop these alternative AI systems so that they eventually outperform neural networks. I have personally used the $L_{2}$ -spectral radius similarity to develop such non-messy AI systems including LSRDRs, but these non-neural non-messy AI systems currently do not perform as well as neural networks for most tasks. For example, I currently cannot train LSRDR-like structures to do any more NLP than just a word embedding, but I can train LSRDRs to do tasks that I have not seen neural networks perform (such as a tensor dimensionality reduction).

LESSWRONG
LW

Posts

Wiki Contributions

Comments