Here are some empirical observations that I made from August 14 to August 19, 2023 on the interpretability of my own matrix dimensionality reduction algorithm. The phenomena described here do not occur on all inputs (they sometimes occur only partially), and it would be nice to have a more complete mathematical theory, with proofs, that explains these empirical phenomena.

Given a (possibly directed and possibly weighted) graph with $n$ nodes represented as a collection of $n\times n$-matrices $A_1,\dots,A_r$, we will observe that a dimensionality reduction $B_1,\dots,B_r$ of $A_1,\dots,A_r$, where each $B_j$ is a $d\times d$-matrix (I call this dimensionality reduction an LSRDR), is in many cases the optimal solution to a combinatorial problem for the graph. In this case, we have a complete interpretation of what the dimensionality reduction algorithm is doing.

For this post, let $K$ denote either the field of real numbers or the field of complex numbers (everything also works well when $K$ is the division ring of quaternions).

Notation: $\rho(A)$ is the spectral radius of the matrix $A$. $A^T$ denotes the transpose of $A$, while $A^*$ denotes the adjoint of $A$. We say that a tuple of matrices $(A_1,\dots,A_r)$ is jointly similar to $(B_1,\dots,B_r)$ if there is an invertible matrix $C$ with $B_j = C A_j C^{-1}$ for $1\le j\le r$. $A\otimes B$ denotes the tensor product of $A$ with $B$.

Let $d < n$. Suppose that $A_1,\dots,A_r$ are $n\times n$-matrices with entries in $K$. Then we say that a collection $B_1,\dots,B_r$ of $d\times d$-matrices with entries in $K$ is an $L_{2,d}$-spectral radius dimensionality reduction (abbreviated LSRDR) of $A_1,\dots,A_r$ if the quantity $$\frac{\rho(A_1\otimes\overline{B_1}+\dots+A_r\otimes\overline{B_r})}{\rho(B_1\otimes\overline{B_1}+\dots+B_r\otimes\overline{B_r})^{1/2}}$$ is locally maximized. LSRDRs may be computed using a variation of gradient ascent.
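As a concrete sketch, the fitness quantity above can be evaluated in a few lines of numpy (the function names are mine; the actual algorithm maximizes this quantity with a tuned variant of gradient ascent, which is omitted here). Note that the ratio is invariant both under rescaling all the $B_j$ by a common constant and under joint similarity.

```python
import numpy as np

def spectral_radius(M):
    """Largest absolute value of an eigenvalue of M."""
    return np.max(np.abs(np.linalg.eigvals(M)))

def lsrdr_fitness(As, Bs):
    """L_{2,d}-spectral radius ratio:
    rho(sum_j A_j (x) conj(B_j)) / rho(sum_j B_j (x) conj(B_j))**(1/2).
    Invariant under B_j -> c*B_j and under joint similarity of the B_j."""
    num = spectral_radius(sum(np.kron(A, np.conj(B)) for A, B in zip(As, Bs)))
    den = spectral_radius(sum(np.kron(B, np.conj(B)) for B in Bs))
    return num / np.sqrt(den)
```

The two invariances are easy to check numerically and explain why an LSRDR is only determined up to joint similarity and a constant factor.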

If $B_1,\dots,B_r$ is an LSRDR of $A_1,\dots,A_r$, then one will typically be able to find matrices $R,S$ and some constant $\lambda$ where $B_j = \lambda R A_j S$ for $1\le j\le r$ and where $SR = 1_d$ and $P = RS$. We shall call $P$ a $d$-SRDR projection operator of $(A_1,\dots,A_r)$; $(\lambda A_1 P,\dots,\lambda A_r P)$ is jointly similar to $(B_1\oplus 0,\dots,B_r\oplus 0)$ where $0$ is the $(n-d)\times(n-d)$-zero matrix. The $d$-SRDR projection operator $P$ is typically unique in the sense that if we run the gradient ascent to obtain another $d$-SRDR projection operator, then we will obtain the same $d$-SRDR projection operator that we originally had. If $X_1,\dots,X_r$ are $n\times n$-matrices, then let $\Gamma(X_1,\dots,X_r)$ denote the completely positive linear operator defined by $\Gamma(X_1,\dots,X_r)(Y) = X_1 Y X_1^* + \dots + X_r Y X_r^*$. Let $U$ denote the dominant eigenvector of $\Gamma(A_1 P,\dots,A_r P)$ with $\Gamma(A_1 P,\dots,A_r P)(U) = \rho(\Gamma(A_1 P,\dots,A_r P))\cdot U$, and let $V$ denote the dominant eigenvector of the adjoint operator $\Gamma(A_1 P,\dots,A_r P)^*$ with $\Gamma(A_1 P,\dots,A_r P)^*(V) = \rho(\Gamma(A_1 P,\dots,A_r P))\cdot V$. Then the eigenvectors $U,V$ will typically be positive semidefinite matrices.

Suppose now that  are finite dimensional -inner product spaces. Let . Let  be linear transformations. Suppose that for , there are  where if  and , then. Suppose that  is a -SRDR projection operator for . Then we will typically be able to find linear operators  for  where . Since , we observe that  for all . As a consequence, there will be positive semidefinite operators  for  where .

Application: Weighted graph/digraph dominant clustering.

Let $V=\{1,\dots,n\}$ be a vertex set. Suppose that $d\le n$. Let $a:V\times V\to K$ be a function. For example, the function $a$ could denote a weight matrix of a graph or neural network. For each $(i,j)\in V\times V$, let $A_{(i,j)}$ be the $n\times n$-matrix where the $(i,j)$-th entry is $a(i,j)$ and all the other entries are zero. Then we will typically be able to find a $d$-SRDR projection operator $P$ of $(A_{(i,j)})_{(i,j)\in V\times V}$ along with a set $S\subseteq V$ with $|S|=d$ where $P$ is the diagonal matrix where the $i$-th diagonal entry is $1$ for $i\in S$ and $0$ otherwise. The set $S$ represents a dominant cluster of size $d$ in the set $V$. Let $A$ be the $n\times n$-matrix where $A_{i,j}=a(i,j)$ for all $i,j$. If $S\subseteq V$, then set

$A[S]$ to be the $n\times n$-matrix where $A[S]_{i,j} = a(i,j)$ whenever $i,j\in S$ and $A[S]_{i,j} = 0$ otherwise. In other words, if $B = A[S]$, then $B_{i,j} = A_{i,j}$ whenever $i,j\in S$ and $B_{i,j} = 0$ otherwise. Then the dominant cluster $S$ will typically be the subset of $V$ of size $d$ where the spectral radius $\rho(A[S])$ is maximized.
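The combinatorial problem that the LSRDR typically solves here can be sketched by brute force (my own illustrative helpers, not the gradient-ascent algorithm itself): enumerate all size-$d$ subsets and pick the one maximizing $\rho(A[S])$.

```python
import itertools
import numpy as np

def spectral_radius(M):
    return np.max(np.abs(np.linalg.eigvals(M)))

def dominant_cluster(A, d):
    """Return the size-d subset S of vertices maximizing rho(A[S]),
    where A[S] zeroes out every entry outside S x S."""
    n = A.shape[0]
    def restricted(S):
        B = np.zeros_like(A)
        idx = np.array(S)
        B[np.ix_(idx, idx)] = A[np.ix_(idx, idx)]
        return B
    return max(itertools.combinations(range(n), d),
               key=lambda S: spectral_radius(restricted(S)))
```

For example, on a graph consisting of a triangle plus a disjoint edge, the dominant cluster of size 3 is the triangle, since $\rho$ of a triangle is $2$ while any other size-3 subset induces at most one edge ($\rho = 1$).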

We say that a square matrix $A$ with non-negative real entries is a primitive matrix if there is some $m$ where each entry in $A^m$ is positive. Suppose now that $A$ is the direct sum of a primitive matrix and a zero matrix. Then the spectral radius $\rho(A)$ is the dominant eigenvalue of $A$, and the root $\rho(A)$ of the characteristic polynomial of $A$ has multiplicity $1$. Furthermore, there is a vector $u$ with non-negative real entries with $Au = \rho(A)u$. We shall call $u$ the Perron-Frobenius eigenvector of $A$.

For $S\subseteq V$, let $u$ be the Perron-Frobenius eigenvector of $A[S]$ where the sum of the entries in $u$ is $1$, and let $v$ be the Perron-Frobenius eigenvector of $A[S]^T$ where the sum of the entries in $v$ is $1$. If $w$ is a vector, then let $\mathrm{Diag}(w)$ denote the diagonal matrix where $w$ is the list of diagonal entries in $\mathrm{Diag}(w)$. Then $U = \mathrm{Diag}(u)$ and $V = \mathrm{Diag}(v)$.

The problem of maximizing $\rho(A[S])$ is a natural problem that is meaningful for adjacency matrices of (weighted) graphs/digraphs and Markov chains. If $A$ is the adjacency matrix of a graph or a digraph $G$, then the value $\rho(A[S])$ is a measure of how internally connected the underlying induced subgraph is, and if the graph is undirected and simple, then $\rho(A[S])$ is maximized when $S$ is a clique (recall that a subset of a simple undirected graph is a clique if all of the pairs of distinct nodes are connected to each other). More specifically, the number of paths in the induced subgraph $G[S]$ with $k$ edges will be about $\rho(A[S])^k$. To make this statement precise, if there are $p_k$ paths in the induced subgraph $G[S]$ of length $k$, then $\lim_{k\to\infty} p_k^{1/k} = \rho(A[S])$. Therefore, the set $S$ maximizes the number of paths in the induced subgraph $G[S]$ of length $k$ for large $k$.
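The path-counting claim is easy to illustrate numerically (a sketch with my own function name): the number of length-$k$ walks inside a subgraph is the sum of the entries of the $k$-th power of its adjacency matrix, and the $k$-th root of that count approaches the spectral radius.

```python
import numpy as np

def walk_count(A, k):
    """Number of walks that traverse k edges: the sum of the entries of A^k."""
    return np.linalg.matrix_power(A, k).sum()

# Adjacency matrix of the complete graph K_4: its spectral radius is 3,
# and K_4 contains exactly 4 * 3^k walks of length k (4 choices of start,
# then 3 choices at each step).
A = np.ones((4, 4), dtype=np.int64) - np.eye(4, dtype=np.int64)
```

So for the clique $K_4$, $p_k^{1/k} = (4\cdot 3^k)^{1/k} \to 3 = \rho(A)$ as $k$ grows.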

The problem of finding a clique of size $d$ in a graph is an NP-complete problem, so we should not expect there to be an algorithm that always solves this problem efficiently. On the other hand, for many NP-complete problems, there are plenty of heuristic algorithms that give decent solutions in most cases. The use of LSRDRs to find the clique $S$ is another kind of heuristic algorithm that can be used to find the largest clique in a graph and to solve more general problems. But the NP-completeness of the problem of finding a clique of size $d$ in a graph also indicates that LSRDRs are most likely unable to produce cliques in exceptionally difficult graphs.

If $A$ is the transition matrix of an irreducible and aperiodic Markov chain $(X_t)_t$, then the probability that $X_0,\dots,X_k\in S$ will be approximately $\rho(A[S])^k$. More precisely, $\lim_{k\to\infty} P(X_0,\dots,X_k\in S)^{1/k} = \rho(A[S])$. In this case, the set $S$ maximizes the probability $P(X_0,\dots,X_k\in S)$ for large values $k$.
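For a chain started from the uniform distribution, this staying probability can be computed exactly from powers of the restricted transition matrix (a sketch; the helper name is mine). The chain remains in $S$ for $k$ steps precisely when every transition uses the block of $A$ indexed by $S\times S$.

```python
import numpy as np

def stay_probability(T, S, k):
    """P(X_0, ..., X_k all lie in S) for a Markov chain with row-stochastic
    transition matrix T started from the uniform distribution: the chain
    stays in S exactly when every step uses the restricted block T[S, S]."""
    n = T.shape[0]
    idx = np.array(sorted(S))
    T_S = T[np.ix_(idx, idx)]        # transitions that remain inside S
    mu = np.full(n, 1.0 / n)[idx]    # initial mass placed inside S
    return mu @ np.linalg.matrix_power(T_S, k) @ np.ones(len(idx))
```

The $k$-th root of this probability converges to $\rho(T[S])$, matching the limit stated above.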

Maximizing the total weight of edges of an induced subgraph: 

If $A$ is a matrix, then let $\mathrm{sum}(A)$ denote the sum of all the entries in $A$.

Proposition: Suppose that $J$ is the $n\times n$ matrix where each entry of $J$ is $1$. Let $A$ be a real-valued $n\times n$-matrix. Then $\frac{d}{d\epsilon}\rho(J+\epsilon A)\big|_{\epsilon=0} = \frac{\mathrm{sum}(A)}{n}$.

Let $J$ be the $n\times n$-matrix where each entry of $J$ is $1$. Let $A$ be a real $n\times n$-matrix. For simplicity, assume that the value $\mathrm{sum}(A[S])$ is distinct for each $S\subseteq V$ with $|S|=d$. Let $B = J + \epsilon A$. Then for sufficiently small $\epsilon > 0$, the spectral radius $\rho(B[S])$ is maximized (subject to the condition that $|S|=d$) precisely when the sum $\mathrm{sum}(A[S])$ is maximized. LSRDRs may therefore be used to find the subset $S$ with $|S|=d$ that maximizes $\mathrm{sum}(A[S])$.
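This agreement is easy to verify numerically (my own illustrative helpers): plant a heavy block in $A$, then check that for small $\epsilon$ the subset maximizing $\rho((J+\epsilon A)[S])$ coincides with the subset maximizing $\mathrm{sum}(A[S])$.

```python
import itertools
import numpy as np

def spectral_radius(M):
    return np.max(np.abs(np.linalg.eigvals(M)))

def best_subset(M, d, score):
    """Size-d subset S maximizing score of the submatrix M[S, S]."""
    n = M.shape[0]
    return max(itertools.combinations(range(n), d),
               key=lambda S: score(M[np.ix_(S, S)]))

# Plant a heavy 3x3 block of A on the vertices {1, 3, 4}; for small eps,
# maximizing rho((J + eps*A)[S]) should agree with maximizing sum(A[S]).
n, d, eps = 6, 3, 1e-4
A = np.zeros((n, n))
A[np.ix_([1, 3, 4], [1, 3, 4])] = 1.0
B = np.ones((n, n)) + eps * A
```

Here $\rho(B[S]) \approx d + \epsilon\,\mathrm{sum}(A[S])/d$ to first order, so the two maximizers coincide once the $O(\epsilon^2)$ terms are negligible.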

Why do LSRDRs behave this way?

Suppose that $A_1,\dots,A_r$ are complex matrices that generate the algebra $M_n(\mathbb{C})$. Then there is some invertible $C$ and constant $\lambda$ where the operator $Y\mapsto\lambda\sum_{j=1}^r (CA_jC^{-1})Y(CA_jC^{-1})^*$ is a quantum channel (by a quantum channel, I mean a completely positive trace preserving superoperator), so LSRDRs should be considered to be dimensionality reductions of quantum channels. Primitive matrices can be associated with stochastic matrices in the same way; if $A$ is a primitive matrix, then there is a diagonal matrix $D$ and a constant $\lambda$ where $\lambda D^{-1}AD$ is a stochastic matrix. One should consider the LSRDR of $(A_{(i,j)})_{(i,j)\in V\times V}$ to be a dimensionality reduction for Markov chains. The most sensible way to take a dimensionality reduction of an $n$-state Markov chain is to select $d$ states so that those $d$ states make a subset that is in some sense optimal. And, for LSRDRs, the best choice of a $d$ element subset $S$ of $V$ is the option that maximizes $\rho(A[S])$.
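The primitive-matrix normalization can be sketched as follows (helper name is mine): conjugate by the diagonal matrix built from the Perron-Frobenius eigenvector and rescale by the spectral radius, which makes every row sum to one.

```python
import numpy as np

def to_stochastic(A):
    """Given a primitive matrix A with Perron-Frobenius eigenvector u and
    spectral radius rho, return the row-stochastic matrix
    P[i, j] = A[i, j] * u[j] / (rho * u[i]), i.e. (1/rho) * D^{-1} A D."""
    vals, vecs = np.linalg.eig(A)
    k = np.argmax(np.abs(vals))
    rho = vals[k].real
    u = np.abs(vecs[:, k].real)   # Perron eigenvector, entrywise positive
    D = np.diag(u)
    return (np.linalg.inv(D) @ A @ D) / rho

A = np.array([[1.0, 2.0],
              [3.0, 1.0]])        # primitive: all entries already positive
P = to_stochastic(A)
```

The row sums work out because $\sum_j A_{ij}u_j = \rho u_i$, which is exactly the Perron-Frobenius eigenvector equation.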

Conclusions:

The LSRDRs of $(A_{(i,j)})_{(i,j)\in V\times V}$ have a notable combination of interpretability features; these LSRDRs tend to converge to the same local maximum (up-to-joint similarity and a constant factor) regardless of the initialization, and we are able to give an explicit expression for these local maxima. We also have a duality between the problem of computing the LSRDR of $(A_{(i,j)})_{(i,j)\in V\times V}$ and the problem of maximizing $\rho(A[S])$ where $|S|=d$. With this duality, the LSRDR of $(A_{(i,j)})_{(i,j)\in V\times V}$ is fully interpretable as a solution to a combinatorial optimization problem.

I hope to make more posts about some of my highly interpretable machine learning algorithms together with some of the tools that we can use to interpret AI.

Edited: 1/10/2024


2 comments

A meta-comment: You have an original research program, and as far as I know you don't have a paid research position. Is there a summary somewhere of the aims and methods of your research program, and what kind of feedback you're hoping for (e.g. collaborators, employers, investors)? 

I originally developed LSRDRs to investigate the cryptographic security of the mining algorithm for the cryptocurrency Circcash and to compare Circcash mining with similar mining algorithms. I launched Circcash so that Circcash mining accelerates the development of reversible computing hardware, but to do this, I needed to construct my own cryptographic algorithm. Unfortunately, I was unable to thoroughly investigate the cryptographic security of mining algorithms before launching Circcash. I decided that it was best to develop some of my own techniques for evaluating the cryptographic security of Circcash, including LSRDRs, normal forms, and other techniques, but I still have not completely investigated Circcash using LSRDRs since I need more computational power to reduce the dimension of collections of 2^32-1 by 2^32-1 matrices.

But it looks like LSRDRs and similar algorithms have other uses (such as producing word embeddings, graph/hypergraph embeddings, etc), and I expect to expand the capabilities of algorithms like LSRDRs.

Here is a post that I have made about how we still need to calculate LSRDRs of cryptographic functions to evaluate their cryptographic security:

https://github.com/jvanname/circcash/issues/10 

Since Circcash mining will accelerate the development of more advanced AI, it is a good idea to use the knowledge that I have produced by investigating the cryptographic security of Circcash to try to get reversible computing to be used for good. Here is a post about how Circcash needs to be used to advance AI safety and not just raw computational power.

https://github.com/jvanname/circcash/issues/13

Yes. I am looking for investors in Circcash, and I am willing to consult on the use of LSRDRs and similar algorithms in machine learning.