Interpreting a dimensionality reduction of a collection of matrices as two positive semidefinite block diagonal matrices

Joseph Van Name

Here are some empirical observations that I have made on August 14, 2023 to August 19, 2023 that are characteristics of the interpretability of my own matrix dimensionality reduction algorithm. These phenomena that we observe do not occur on all inputs (they sometimes occur partially); and it would be nice if there were a more complete mathematical theory with proofs that explains these empirical phenomena.

Given a (possibly directed and possibly weighed) graph with nodes represented as a collection of $n \times n$ -matrices $A_{1}, \dots, A_{r}$ , we will observe that a dimensionality reduction $(X_{1}, \dots, X_{r})$ of $(A_{1}, \dots, A_{r})$ where each $X_{i}$ is a $d \times d$ -matrix (I call this dimensionality reduction an LSRDR) is in many cases the optimal solution to a combinatorial problem for the graph. In this case, we have a complete interpretation of what the dimensionality reduction algorithm is doing.

For this post, let $K$ denote either the field of real numbers or the field of complex numbers (everything also works well when $K$ is the division ring of quaternions).

Notation: $ρ (A) = {lim}_{n \to \infty} ∥ A^{n} ∥^{1 / n} = max {| λ | : λ is an eigenvalue of A}$ is the spectral radius of the matrix $A$ . $A^{T}$ denotes the transpose of $A$ while $A^{*}$ denotes the adjoint of $A$ . We say that a tuples of matrices $(A_{1}, \dots, A_{r})$ is jointly similar to $(B_{1}, \dots, B_{r})$ if there is an invertible $C$ with $B_{j} = C A_{j} C^{- 1}$ for $1 \leq j \leq r$ . $A \otimes B$ denotes the tensor product of $A$ with $B$ .

Let $1 \leq d < n$ . Suppose that $A_{1}, \dots, A_{r}$ are $n \times n$ -matrices with entries in $K$ . Then we say that a collection $(X_{1}, \dots, X_{r})$ of $d \times d$ -matrices with entries in $K$ is an $L_{2, d}$ -spectral radius dimensionality reduction (abbreviated LSRDR) of $A_{1}, \dots, A_{r}$ if the following quantity is locally maximized: $\frac{ρ (A_{1} \otimes (X_{1}^{*})^{T} + \dots + A_{r} \otimes (X_{r}^{*})^{T})}{ρ (X_{1} \otimes (X_{1}^{*})^{T} + \dots + X_{r} \otimes (X_{r}^{*})^{T})^{1 / 2}}$ . LSRDRs may be computed using a variation of gradient ascent.

If $(X_{1}, \dots, X_{r})$ is an LSRDR of $(A_{1}, \dots, A_{r})$ , then one will typically be able to find matrices $R, S, P$ and some $λ \in K$ where $X_{j} = R A_{j} S$ for $1 \leq j \leq r$ and where $S R = λ P$ and $P^{2} = P$ . We shall call $P$ a $L_{2, d}$ -SRDR projection operator of $(A_{1}, \dots, A_{r})$ . $(X_{1} \oplus 0_{n - d}, \dots, X_{r} \oplus 0_{n - d})$ is jointly similar to $(λ P A_{1} P, \dots, λ P A_{r} P)$ where $0_{n - d}$ is the $(n - d) \times (n - d)$ -zero matrix. The $L_{2, d}$ -SRDR projection operator $P$ is typically unique in the sense that if we run the gradient ascent to obtain another $L_{2, d}$ -SRDR projection operator, then we will obtain the same $L_{2, d}$ -SRDR projection operator that we originally had. If $B_{1}, \dots, B_{r}$ are $n \times n$ -matrices, then let $Φ (B_{1}, \dots, B_{r}) : M_{n} (K) \to M_{n} (K)$ denote the completely positive linear operator defined by $Φ (B_{1}, \dots, B_{r}) (X) = B_{1} X B_{1}^{*} + \dots + B_{r} X B_{r}^{*}$ . Let $H$ denote the dominant eigenvector of $Φ (B_{1}, \dots, B_{r})$ with $Tr (H) = 1$ , and let $G$ denote the dominant eigenvector of $Φ (B_{1}, \dots, B_{r})^{*}$ with $Tr (G) = 1$ . Then the eigenvectors $H, G$ will typically be positive semidefinite matrices.

Suppose now that $V_{1}, \dots, V_{q}$ are finite dimensional $K$ -inner product spaces. Let $V = V_{1} \oplus \dots \oplus V_{q}$ . Let $A_{1}, \dots, A_{r} : V \to V$ be linear transformations. Suppose that for $1 \leq j \leq r$ , there are $u_{j}, v_{j} \in {1, \dots, q}$ where if $x \in V_{u}, y \in V_{v}$ and $⟨ A x, y ⟩ \neq 0$ , then $u = u_{j}, v = v_{j}$ . Suppose that $P : V \to V$ is a $L_{2, d}$ -SRDR projection operator for $(A_{1}, \dots, A_{r})$ . Then we will typically be able to find linear operators $P_{j} : V_{j} \to V_{j}$ for $1 \leq j \leq q$ where $P = P_{1} \oplus \dots \oplus P_{q}$ . Since $P = P^{2}$ , we observe that $P_{j} = P_{j}^{2}$ for all $j$ . As a consequence, there will be positive semidefinite operators $G_{j}, H_{j} : V_{j} \to V_{j}$ for $1 \leq j \leq q$ where $G = G_{1} \oplus \dots \oplus G_{q}, H = H_{1} \oplus \dots \oplus H_{q}$ .

Application: Weighed graph/digraph dominant clustering.

Let $V = {1, \dots, n}$ be a vertex set. Suppose that $1 \leq d < n$ . Let $f : V \times V \to K$ be a function. For example, the function $f$ could denote a weight matrix of a graph or neural network. For each $i, j$ , let $A_{i, j}$ be the $n \times n$ -matrix where the $(i, j)$ -th entry is $f (i, j)$ and all the other entries are zero. Then we will typically be able to find an $L_{2, d} -$ SRDR projection operator $P$ of $(A_{i, j})_{i, j}$ along with a set $T \subseteq {1, \dots, n}$ where $P$ is the diagonal matrix where the $i$ -th diagonal entry is $1$ for $i \in T$ and $0$ otherwise. The set $T$ represents a dominant cluster of size $d$ in the set $V$ . Let $A = (a_{i, j})_{i, j}$ be the $n \times n$ -matrix where $a_{i, j} = | f (i, j) |^{2}$ for all $i, j$ . If $S \subseteq V$ , then set

$A |_{S} = (a_{i, j} \cdot χ_{S} (i) \cdot χ_{S} (j))_{1 \leq i \leq n, 1 \leq j \leq n}$ where $χ_{S} (i) = 1$ whenever $i \in S$ and $χ_{S} (i) = 0$ otherwise. In other words, if $A |_{s} = (b_{i, j})_{i, j}$ , then $b_{i, j} = a_{i, j}$ whenever $i, j \in S$ and $b_{i, j} = 0$ otherwise. Then the dominant cluster $T$ will typically be the subset of $V$ of size $d$ where the spectral radius $ρ (A |_{T})$ is maximized.

We say that a square matrix $C$ with non-negative real entries is a primitive matrix if there is some $n$ where each entry in $C^{n}$ is positive. Suppose now that $C$ is the direct sum of a primitive matrix and a zero matrix. Then the spectral radius $ρ (C)$ is the dominant eigenvalue of $C$ , and the root $ρ (C)$ of the characteristic polynomial of $C$ has multiplicity $1$ . Furthermore, there is an vector $v$ with non-negative real entries with $C v = ρ (C) v$ . We shall call $v$ the Perron-Frobenius eigenvector of $C$ .

For $T \subseteq {1, \dots, n}$ , let $v_{T}$ be the dominant eigenvector of $A |_{T}$ where the sum of the entries in $v_{T}$ is $1$ , and let $w_{T}$ be the dominant eigenvector of $(A |_{T})^{*}$ where the sum of the entries in $w_{T}$ is $1$ . If $u$ is a vector, then let $diag (u)$ denote the matrix where $u$ is the list of diagonal entries in $u$ . Then $H = diag (v_{T}), G = diag (w_{T})$ .

The problem of maximizing $ρ (A |_{T})$ is a natural problem that is meaningful for adjacency matrices of (weighted) graphs/digraphs and Markov chains. If $f$ is the adjacency matrix of a graph or a digraph $G$ , then the value $ρ (A |_{T})$ is a measure of how internally connected the underlying graph is, and if the graph is undirected and simple, then $ρ (A |_{T})$ is maximized when $T$ is a clique (recall that a subset of an simple undirected graph is a clique if all of the pairs of distinct nodes are connected to each other). More specifically, the number of paths in the induced subgraph $G [T]$ with $m$ edges will be about $ρ (A |_{T})^{m}$ . To make this statement precise, if there are $t_{m}$ paths in the induced subgraph $G [T]$ of length $m$ , then ${lim}_{m \to \infty} t_{m}^{1 / m} = ρ (A |_{T})$ . Therefore, the set $T$ maximizes the number of paths in the induced subgraph $G [T]$ of length $m$ for large $m$ .

The problem of finding a clique of size $d$ in a graph is an NP-complete problem, so we should not expect for there to be an algorithm that always solves this problem efficiently. On the other hand, for many NP-complete problems, there are plenty of heuristic algorithms that give decent solutions in most cases. The use of LSRDRs to find the clique $T$ is another kind of heuristic algorithm that can be used to find the largest clique in a graph and solve more general problems. But the NP-completeness of the problem of finding a clique of size $d$ in a graph also indicates that LSRDRs most likely are unable to find produce cliques in exceptionally difficult graphs.

If $A$ is the transition matrix of an irreducible and aperiodic Markov chain $(X_{m})_{m \geq 0}$ , then the probability that $X_{m}, \dots, X_{m + k} \in T$ will be approximately $ρ (A |_{T})^{k}$ . More precisely, ${lim}_{k \to \infty} P (X_{0}, \dots, X_{k} \in T)^{1 / k} = ρ (A |_{T})$ . In this case, the set $T$ maximizes the probability $P (X_{0}, \dots, X_{k} \in T)$ for large values $k$ .

Maximizing the total weight of edges of an induced subgraph:

If $Y$ is a matrix, then let $\sum Y$ denote the sum of all the entries in $Y$ .

Proposition: Suppose that $M_{d}$ is the $d \times d$ matrix where each entry of $M_{d}$ is $1 / d$ . Let $X$ be a real-valued matrix. Then $_{δ \to 0}^{lim} \frac{ρ (M_{d} + δ X) - ρ (M_{d})}{δ} = \sum X / d$ .

Let $M_{n, d}$ be the $n \times n$ -matrix where each entry of $M_{n, d}$ is $1 / d$ . Let $B$ be a real $n \times n$ -matrix. For simplicity, assume that the value $\sum B |_{T}$ is distinct for each $T \subseteq {1, \dots, n}$ . Let $A = M_{n, d} + δ B$ . Then for sufficiently small $δ$ , the spectral radius $ρ (A |_{T})$ is maximized (subject to the condition that $| T | = d$ ) precisely when the sum $\sum B |_{T}$ is maximized. LSRDRs may therefore be used to find the subset $T \subseteq {1, \dots, n}$ with $| T | = d$ that maximizes $\sum B |_{T}$ .

Why do LSRDRs behave this way?

Suppose that $(A_{1}, \dots, A_{r})$ are complex matrices that generate the algebra $M_{n} (C)$ . Then there is some invertible $B$ and constant $λ$ where the operator $Φ (λ B A_{1} B^{- 1}, \dots, λ B A_{r} B^{- 1})$ is a quantum channel (by a quantum channel, I mean a completely positive trace preserving superoperator), so LSRDRs should be considered to be dimensionality reductions of quantum channels $E : M_{n} (C) \to M_{n} (C)$ . Primitive matrices can be associated with stochastic matrices in the same way; if $A$ is a primitive matrix, then there is a diagonal matrix $D$ and a constant $λ$ where $λ D^{- 1} A D$ is a stochastic matrix. One should consider the LSRDR of $(A_{i, j})_{i, j}$ to be a dimensionality reduction for Markov chains. The most sensible way to take a dimensionality reduction of an $n$ -state Markov chain is to select $d$ states so that those $d$ states make a subset that is in some sense optimal. And, for LSRDRs, the best choice of a $d$ element subset $T$ of ${1, \dots, n}$ is the option that maximizes $ρ (A |_{T})$ .

Conclusions:

The LSRDRs of $(A_{i, j})_{i, j}$ have a notable combination of features of interpretability; these LSRDRs tend to converge to the same local maxima (up-to-joint similarity and a constant factor) regardless of the initialization, and we are able to give an explicit expression for these local maxima. We also have a duality between the problem of computing the LSRDR of $(A_{i, j})_{i, j}$ and the problem of maximizing $ρ (A |_{T})$ where $| T | = d$ . With this duality, the LSRDR of $(A_{i, j})_{i, j}$ is fully interpretable as a solution to a combinatorial optimization problem.

I hope to make more posts about some of my highly interpretable machine learning algorithms together with some of my tools that we can use to interpret AI.

Edited: 1/10/2024

[-]Mitchell_Porter8mo100

A meta-comment: You have an original research program, and as far as I know you don't have a paid research position. Is there a summary somewhere of the aims and methods of your research program, and what kind of feedback you're hoping for (e.g. collaborators, employers, investors)?

[-]Joseph Van Name8mo30

I have originally developed LSRDRs to investigate the cryptographic security of the mining algorithm for the cryptocurrency Circcash and compare Circcash mining with similar mining algorithms. I launched the cryptocurrency Circcash so that Circcash mining accelerates the development of reversible computing hardware. But to do this, I needed to construct my own cryptographic algorithm. Unfortunately, I was unable to thoroughly investigate the cryptographic security of mining algorithms before launching Circcash. I decided that it was best to develop some of my own techniques for evaluating the cryptographic security of Circcash including LSRDRs, normal forms, and other techniques, but I still have not completely investigated Circcash using LSRDRs since I need more computational power to reduce the dimension of collections of 2^32-1 by 2^32-1 matrices.

But it looks like LSRDRs and similar algorithms have other uses (such as producing word embeddings, graph/hypergraph embeddings, etc), and I expect to expand the capabilities of algorithms like LSRDRs.

Here is a post that I have made about how we still need to calculate LSRDRs of cryptographic functions to evaluate the cryptographic security of these cryptographic functions:

https://github.com/jvanname/circcash/issues/10

Since Circcash mining will accelerate the development of more advanced AI, it is a good idea to use the knowledge that I have produced by trying to investigate the cryptographic security of Circcash to try to get reversible computing to be used for good. Here is a post about how Circcash needs to be used to advance AI safety instead of just computational power.

https://github.com/jvanname/circcash/issues/13

Yes. I am looking for investors in Circcash, and I am willing to consult on the use of LSRDRs and similar algorithms in machine learning.

LESSWRONG
LW

Interpreting a dimensionality reduction of a collection of matrices as two positive semidefinite block diagonal matrices

15

New to LessWrong?

15