Spectral radii dimensionality reduction computed without gradient calculations

by Joseph Van Name
28th May 2025
7 min read
4 comments, sorted by top scoring
Logan Riggs (3mo)

Is the LSRDR a proposed alternative to NN’s in general?

What interpretability do you gain from it?

Could you show a comparison between a transformer embedding and your method with both performance and interpretability? Even MNIST would be useful.

Also, I found it very difficult to understand your post (e.g. you didn’t explain your acronym! I had to infer it). You can use the “request feedback” feature on LW in the future; they typically give feedback quite quickly.

Joseph Van Name (3mo)

In this post, the existence of a non-gradient-based algorithm for computing LSRDRs is a sign that LSRDRs behave mathematically and are quite interpretable. Gradient ascent is a general-purpose optimization algorithm that works even when there is no other way to solve the optimization problem; when there are multiple ways of obtaining a solution to an optimization problem, the optimization problem is behaving in a way that should appeal to mathematicians.

LSRDRs and similar algorithms are pseudodeterministic in the sense that if we train the model multiple times on the same data, we typically get identical models. Pseudodeterminism is a signal of interpretability for several reasons that I will go into in more detail in a future post:

  1. Pseudodeterministic models do not contain any extra random or even pseudorandom information that is not contained in the training data already. This means that when interpreting these models, one does not have to interpret random information.
  2. Pseudodeterministic models inherit the symmetry of their training data. For example, if we train a real LSRDR using real symmetric matrices, then the projection P will itself be a symmetric matrix.
  3. In mathematics, a well-posed problem is a problem where there exists a unique solution to the problem. Well-posed problems behave better than ill-posed problems in the sense that it is easier to prove results about well-posed problems than it is to prove results about ill-posed problems.

In addition to pseudodeterminism, in my experience, LSRDRs are quite interpretable since I have interpreted LSRDRs already in a few posts:

Interpreting a dimensionality reduction of a collection of matrices as two positive semidefinite block diagonal matrices — LessWrong

When performing a dimensionality reduction on tensors, the trace is often zero. — LessWrong

I have generalized LSRDRs so that they are starting to behave like deeper neural networks. I am trying to expand the capabilities of generalized LSRDRs so that they behave more like deep neural networks, but I still have some work to do to expand their capabilities while retaining pseudodeterminism. In the meantime, generalized LSRDRs may still function as narrow AI for specific problems and also as layers in AI systems.

Of course, if we want to compare capabilities, we should also compare NNs to LSRDRs at tasks such as evaluating the cryptographic security of block ciphers, solving NP-complete problems in the average case, etc.

As for the difficulty of this post, it seems like that is the result of the post being mathematical. But going through this kind of mathematics so that we obtain inherently interpretable AI should be the easier portion of AI interpretability. I would much rather communicate about the actual mathematics than about how difficult the mathematics is.

Logan Riggs (3mo)

That does clarify a lot of things for me, thanks!

Looking at your posts, there’s no hooks or trying to sell your work, which is a shame cause LSRDRs seem useful. Since they are useful, you should be able to show it.

For example, you trained an LSRDR for text embedding, which you could show at the beginning of the post. Then show the cool properties of pseudo-determinism & lack of noise compared to NNs. THEN all the maths. So the math folks know if the post is worth their time, and the non-math folks can upvote and share with their mathy friends.
 

I am assuming that you care about [engagement, useful feedback, connections to other work, possible collaborators] here. If not, then sorry for the unwanted advice!

I’m still a little fuzzy on your work, but possible related papers that come to mind are on tensor networks.

  1. Compositionality Unlocks Deep Interpretable Models - they efficiently train tensor networks on [harder MNIST], showing approximately equivalent loss to NN’s, and show the inherent interpretability in their model.
  2. Tensorization is [Cool essentially] - https://arxiv.org/pdf/2505.20132 - mostly a position and theoretical paper arguing why tensorization is great and what its limitations are.

I’m pretty sure both sets of authors here read LW as well.

Joseph Van Name (3mo)

I would have thought that a fitness function that is maximized using something other than gradient ascent, and which can solve NP-complete problems at least in the average case, would be worth reading about, since that means it can perform well on some tasks while also behaving mathematically in a way that is needed for interpretability. The quality of the content is inversely proportional to the number of views since people don't think the same way as I do.

Wheels on the Bus | @CoComelon Nursery Rhymes & Kids Songs

Stuff that is popular is usually garbage.

But here is my post about the word embedding.

Interpreting a matrix-valued word embedding with a mathematically proven characterization of all optima — LessWrong

And I really do not want to collaborate with people who are not willing to read the post. This is especially true of people in academia since universities promote violence and refuse to acknowledge any wrongdoing. Universities are the absolute worst.

Instead of engaging with the actual topic, people tend to just criticize stupid stuff simply because they only want to read about what they already know or what is recommended by their buddies; that is a very good way not to learn anything new or insightful. For this reason, even the simplest concepts are lost on most people.


In this post, I shall describe a fitness function that can be locally maximized without gradient computations. This fitness function is my own. I initially developed it in order to evaluate block ciphers for cryptocurrency technologies, but I later found that it may be used to solve other problems, such as the clique problem (which is NP-complete) in the average case, and some natural language processing tasks as well. After describing algorithms for locally maximizing this fitness function, I conclude that such a fitness function is inherently interpretable and mathematical, which is what we need for AI safety.

Let K denote either the field of real numbers, the field of complex numbers, or the division ring of quaternions. Given (A_1,…,A_r)∈M_n(K)^r and 1≤d<n, the goal is to find a tuple (X_1,…,X_r)∈M_d(K)^r most similar to (A_1,…,A_r). In other words, we want to approximate the n×n-matrices (A_1,…,A_r) with d×d-matrices.

Suppose that A_1,…,A_r∈M_n(K) and X_1,…,X_r∈M_d(K). Define a function Γ(A_1,…,A_r;X_1,…,X_r):M_{n,d}(K)→M_{n,d}(K) by setting

Γ(A_1,…,A_r;X_1,…,X_r)(X) = ∑_j A_j X X_j^*,

and define Φ(A_1,…,A_r) = Γ(A_1,…,A_r;A_1,…,A_r).

If X is a matrix, then let ρ(X) denote the spectral radius of X.

Define the L_2-spectral radius similarity between (A_1,…,A_r) and (X_1,…,X_r) by setting

∥(A_1,…,A_r) ≃ (X_1,…,X_r)∥_2 = ρ(Γ(A_1,…,A_r;X_1,…,X_r)) / (ρ(Φ(A_1,…,A_r))^{1/2}⋅ρ(Φ(X_1,…,X_r))^{1/2}).

The quantity ∥(A_1,…,A_r) ≃ (X_1,…,X_r)∥_2 is always a real number in the interval [0,1] (the proof is a generalization of the Cauchy-Schwarz inequality).
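To make these definitions concrete, here is a minimal numpy sketch (my own illustration, not code from the post) that computes the L_2-spectral radius similarity over the real or complex numbers by representing Γ as an nd×nd matrix via Kronecker products; the function names are mine.

```python
import numpy as np

def gamma_matrix(As, Xs):
    """nd×nd matrix of Γ(A_1,...,A_r;X_1,...,X_r): X ↦ Σ_j A_j X X_j^*,
    acting on column-stacked vectorizations of n×d matrices."""
    n, d = As[0].shape[0], Xs[0].shape[0]
    M = np.zeros((n * d, n * d), dtype=complex)
    for A, X in zip(As, Xs):
        M += np.kron(X.conj(), A)  # vec(A Y X^*) = (conj(X) ⊗ A) vec(Y)
    return M

def spectral_radius(M):
    return np.max(np.abs(np.linalg.eigvals(M)))

def lsr_similarity(As, Xs):
    """ρ(Γ(A;X)) / (ρ(Φ(A))^(1/2) · ρ(Φ(X))^(1/2))."""
    num = spectral_radius(gamma_matrix(As, Xs))
    den = np.sqrt(spectral_radius(gamma_matrix(As, As)) *
                  spectral_radius(gamma_matrix(Xs, Xs)))
    return num / den

# Sanity checks: self-similarity is 1, and the similarity lies in [0, 1].
rng = np.random.default_rng(0)
As = [rng.standard_normal((6, 6)) for _ in range(3)]
Xs = [rng.standard_normal((2, 2)) for _ in range(3)]
print(lsr_similarity(As, As))  # ≈ 1.0
print(lsr_similarity(As, Xs))  # some value in [0, 1]
```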

If 1≤d<n and A_1,…,A_r∈M_n(K), then we say that (X_1,…,X_r)∈M_d(K)^r is an L_{2,d}-spectral radius dimensionality reduction (LSRDR) of (A_1,…,A_r) if the similarity ∥(A_1,…,A_r) ≃ (X_1,…,X_r)∥_2 is locally maximized.

One can produce an LSRDR of (A_1,…,A_r) simply by locally maximizing the similarity with respect to (X_1,…,X_r) using gradient ascent. But there is another way to obtain LSRDRs, since they behave mathematically.

Let Z(K) denote the center of the algebra K. In particular, Z(R)=R, Z(C)=C, and Z(H)=R.

Empirical observation: If (X_1,…,X_r)∈M_d(K)^r is an LSRDR of (A_1,…,A_r), then there typically exist λ∈Z(K) and matrices R,S with X_j=λRA_jS for all j and where, if P=SR, then P is a (typically non-orthogonal) projection matrix. The projection matrix P is typically unique in the sense that if we train an LSRDR of (A_1,…,A_r) with rank d multiple times, then we generally end up with the same projection matrix P. If the projection matrix P is unique, then we shall call P the canonical LSRDR projection matrix of rank d. Let H denote the dominant eigenvector of Γ(A_1,…,A_r;PA_1P,…,PA_rP) with Tr(H)=1 (in the quaternionic case, we define the trace as the real part of the sum of the diagonal entries). Let G denote the dominant eigenvector of Γ(A_1,…,A_r;PA_1P,…,PA_rP)^* with Tr(G)=1. Then G,H are positive semidefinite matrices with Im(G)=Im(P^*)=ker(P)^⊥ and Im(H)=Im(P)=ker(P^*)^⊥.

Said differently, P, G, H together with the eigenvalue λ are a solution to the following system of equations:

  1. P^2 = P.
  2. Γ(A_1,…,A_r;PA_1P,…,PA_rP)(H) = λH.
  3. Γ(A_1,…,A_r;PA_1P,…,PA_rP)^*(G) = λ̄G, where λ̄ is the conjugate of λ.
  4. Im(H) = Im(P).
  5. Im(G) = Im(P^*).
  6. Tr(G) = Tr(H) = 1.

One may often solve an equation or system of equations using iteration. For example, if (X,d) is a complete metric space, α<1, and f:X→X is a function with d(f(x),f(y)) ≤ α⋅d(x,y) for all x,y∈X, then for any x_0∈X the limit y_0 = lim_{n→∞} f^n(x_0) exists and satisfies f(y_0)=y_0 by the contraction mapping theorem. We shall also apply this idea to produce the P,G,H in an LSRDR, since P,G,H satisfy a system of equations.

We need a function f that can be easily applied to matrices but where each projection matrix P is an attractive fixed point for f. Consider the function f(x) = 3x^2 − 2x^3. We observe that f(0)=0, f(1)=1, and f′(0)=0=f′(1). Furthermore, if x∈[0,1/2), then the sequence f^n(x) converges to 0 very quickly, and if x∈(1/2,1], then f^n(x) converges to 1 very quickly. In fact, there are neighborhoods U of 0 and V of 1 in the complex plane such that if w∈U, then f^n(w) converges to 0 quickly, and if z∈V, then f^n(z) converges to 1 quickly. As a consequence, if P is a projection matrix, then there is a neighborhood O of P such that if Q∈O, then f^n(Q) converges to some projection matrix very quickly. Define a partial function f^∞ by setting f^∞(Q) = lim_{n→∞} f^n(Q).
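As a quick numerical illustration of this claim (my own sketch), iterating f on a small perturbation of a projection matrix recovers a projection:

```python
import numpy as np

def f(Q):
    """One step of Q ↦ 3Q² − 2Q³; eigenvalues near 0 or 1 are pushed toward 0 or 1."""
    Q2 = Q @ Q
    return 3 * Q2 - 2 * Q2 @ Q

rng = np.random.default_rng(0)
P = np.diag([1.0, 1.0, 0.0, 0.0, 0.0])       # a rank-2 projection
Q = P + 0.02 * rng.standard_normal(P.shape)  # small perturbation of P
for _ in range(10):
    Q = f(Q)
print(np.linalg.norm(Q @ Q - Q))  # ≈ 0, so f^n(Q) has converged to a projection matrix
```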

Iterative algorithm for computing LSRDRs: Let t>0 be sufficiently small. Let s∈(N∪{∞})∖{0}. Let P_0 be a random orthogonal projection of rank d. Let G_0,H_0 be random matrices in M_n(K). Define G_N,H_N,P_N,G_N^♯,H_N^♯ for all N≥0 recursively by setting

  1. G_N^♯ = Γ(A_1,…,A_r; P_NA_1P_N,…,P_NA_rP_N)^*(G_N),
  2. H_N^♯ = Γ(A_1,…,A_r; P_NA_1P_N,…,P_NA_rP_N)(H_N),
  3. G_{N+1} = G_N^♯ / Tr(G_N^♯),
  4. H_{N+1} = H_N^♯ / Tr(H_N^♯), and
  5. P_{N+1} = f^s(P_N + t⋅(P_NG_N^* + H_NP_N)).

Then (P_N)_N typically converges to a matrix P of rank d at an exponential rate. Furthermore, the matrix P typically does not depend on the initialization G_0,H_0,P_0, but P does depend on t and s. Therefore, set P = Pro_s(A_1,…,A_r) whenever P does not depend on the initialization (just assume that the limit converges). If s=∞, then P is typically the canonical rank-d LSRDR projection operator.
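Here is a minimal numpy sketch of this iteration over the real or complex numbers (my own reading of the steps above; the helper names, step size t, exponent s, and iteration count are placeholder assumptions, and no attempt is made at efficiency):

```python
import numpy as np

def Gamma(As, Xs, Y):
    """Apply Γ(A_1,...,A_r;X_1,...,X_r) to Y:  Y ↦ Σ_j A_j Y X_j^*."""
    return sum(A @ Y @ X.conj().T for A, X in zip(As, Xs))

def Gamma_adj(As, Xs, Y):
    """Apply the adjoint Γ(A_1,...,A_r;X_1,...,X_r)^* to Y:  Y ↦ Σ_j A_j^* Y X_j."""
    return sum(A.conj().T @ Y @ X for A, X in zip(As, Xs))

def f(Q):
    Q2 = Q @ Q
    return 3 * Q2 - 2 * Q2 @ Q

def lsrdr_projection(As, d, t=0.01, s=20, iters=3000, seed=0):
    """Sketch of Pro_s(A_1,...,A_r): returns a rank-d projection matrix P."""
    rng = np.random.default_rng(seed)
    n = As[0].shape[0]
    Q, _ = np.linalg.qr(rng.standard_normal((n, d)))
    P = Q @ Q.conj().T                        # random rank-d orthogonal projection P_0
    G = rng.standard_normal((n, n))
    H = rng.standard_normal((n, n))
    for _ in range(iters):
        PAP = [P @ A @ P for A in As]
        G_sharp = Gamma_adj(As, PAP, G)       # step 1
        H_sharp = Gamma(As, PAP, H)           # step 2
        G_next = G_sharp / np.trace(G_sharp)  # step 3
        H_next = H_sharp / np.trace(H_sharp)  # step 4
        P = P + t * (P @ G.conj().T + H @ P)  # step 5: nudge P using G_N, H_N ...
        for _ in range(s):                    # ... then apply f^s to push P back
            P = f(P)                          #     toward a projection matrix
        G, H = G_next, H_next
    return P
```

In my reading, taking s large plays the role of s=∞, so the returned P approximates the canonical rank-d LSRDR projection when the iteration converges.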

The iterative algorithm for computing LSRDRs is a proper extension of the notion of an LSRDR in some cases. For example, if we simply count parameters, the matrices R,S have 2⋅n⋅d parameters while the matrices (X_1,…,X_r) have d^2⋅r parameters. Therefore, if d⋅r ≪ 2⋅n, then the matrices R,S (and the projection matrix P) have more parameters than (X_1,…,X_r). This means that we cannot obtain a unique canonical LSRDR projection of dimension d from (X_1,…,X_r) alone when d⋅r ≪ 2⋅n. On the other hand, even when d⋅r ≪ 2⋅n, the matrix Pro_∞(A_1,…,A_r) typically exists, and if we set P = Pro_∞(A_1,…,A_r), then there are R,S with P=SR for which (RA_1S,…,RA_rS) is an LSRDR of (A_1,…,A_r). In this sense, the iterative algorithm gives more information than an LSRDR alone: it also gives a projection operator.

Interpretability: Since there are several ways of computing LSRDRs, LSRDRs behave mathematically. A machine learning algorithm that typically produces the same output is more interpretable than one that does not, in part because a unique output depends only on the input and has no other source of random information contributing to it. Since we typically (but not always) attain the same local maximum when training LSRDRs multiple times, LSRDRs are both interpretable and mathematical. This sort of mathematical behavior is what we need to make sense of the inner workings of LSRDRs and other machine learning algorithms. There are several ways to generalize the notion of an LSRDR, and these generalized machine learning algorithms still tend to behave mathematically; they still tend to produce the exact same trained model after training multiple times.

Capabilities: Trained LSRDRs can solve NP-complete problems such as the clique problem. I have also trained LSRDRs to produce word embeddings for natural language processing and to analyze the octonions. I can generalize LSRDRs so that they behave more like deep neural networks, but we still have a way to go until these generalized LSRDRs perform as well as our modern AI algorithms, and I do not know how far we can push the performance of generalized LSRDRs while retaining their inherent interpretability and mathematical behavior. 

If mathematical generalized LSRDRs can compete in performance with neural networks, then we are closer to solving the problem of general AI interpretability. But if not, then mathematical generalized LSRDRs could possibly be used as a narrow AI tool or even as a component of general AI; in this case, generalized LSRDRs could still improve general AI interpretability a little bit. An increased usage of narrow AI will (slightly) decrease the need for general AI.

Added 5/28/2025

When d≪n, the iterative algorithm for computing LSRDRs produces a low-rank projection matrix, but a matrix P of rank d can be factored as SR where S∈M_{n,d}(K) and R∈M_{d,n}(K). To save computational resources, we may just work with S,R during the training.

Suppose that P=SR with S∈M_{n,d}(K) and R∈M_{d,n}(K). Then there are several ways to easily factor f(P) as the product of an n×d-matrix with a d×n-matrix, including the following factorizations:

f(P) = 3(SR)^2 − 2(SR)^3 = 3SRSR − 2SRSRSR = S⋅(3RSR − 2RSRSR)
= (3SRS − 2SRSRS)⋅R = SRS⋅(3R − 2RSR) = (3S − 2SRS)⋅RSR.

Therefore, define

f_1(S,R) = (S, 3RSR − 2RSRSR),  f_2(S,R) = (3SRS − 2SRSRS, R),
f_3(S,R) = (SRS, 3R − 2RSR),  f_4(S,R) = (3S − 2SRS, RSR).

In particular, if f_i(S,R) = (S_1,R_1), then S_1R_1 = f(SR).

Let i_N∈{1,2,3,4} for all N, and define F(S,R) = lim_{N→∞} (f_{i_N}∘⋯∘f_{i_1})(S,R).
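A quick numerical check (my own sketch) that each factorized map reproduces f(SR):

```python
import numpy as np

def f(P):
    P2 = P @ P
    return 3 * P2 - 2 * P2 @ P

# The four factorized versions of f acting on a pair (S, R) with P = SR.
def f1(S, R): return S, 3 * R @ S @ R - 2 * R @ S @ R @ S @ R
def f2(S, R): return 3 * S @ R @ S - 2 * S @ R @ S @ R @ S, R
def f3(S, R): return S @ R @ S, 3 * R - 2 * R @ S @ R
def f4(S, R): return 3 * S - 2 * S @ R @ S, R @ S @ R

rng = np.random.default_rng(1)
S = rng.standard_normal((6, 2))   # S ∈ M_{n,d}
R = rng.standard_normal((2, 6))   # R ∈ M_{d,n}
for fi in (f1, f2, f3, f4):
    S1, R1 = fi(S, R)
    print(np.allclose(S1 @ R1, f(S @ R)))   # True: S_1 R_1 = f(SR)
```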

 

In the following algorithm, we will sometimes need to replace a pair (S,R) with a new pair (S_1,R_1) such that S_1R_1 = SR but where (S_1,R_1) has smaller norm than (S,R); if we do not do this, then after much training the norm of (S,R) will grow very large even though SR is just a projection matrix. To do this, we decrease the norm of (S,R) using gradient descent. In particular, we observe that SR = S(I+X)(I+X)^{-1}R. We take the gradient at X=0, and SR ≈ S(I+X)(I−X)R when X is small. We want to move in the direction that minimizes the sum ∥S(I+X)∥_2^2 + ∥(I−X)R∥_2^2, and

∇_X(∥S(I+X)∥_2^2 + ∥(I−X)R∥_2^2)|_{X=0} = 2(S^*S − RR^*).
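A finite-difference check of this gradient formula in the real case (my own sketch; the L_2 norms are taken to be Frobenius norms):

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 6, 2
S = rng.standard_normal((n, d))
R = rng.standard_normal((d, n))

def cost(X):
    """‖S(I+X)‖² + ‖(I−X)R‖² for a d×d matrix X (Frobenius norms)."""
    I = np.eye(d)
    return np.linalg.norm(S @ (I + X))**2 + np.linalg.norm((I - X) @ R)**2

# Numerical gradient at X = 0 versus the closed form 2(SᵀS − RRᵀ).
eps = 1e-6
num_grad = np.zeros((d, d))
for i in range(d):
    for j in range(d):
        E = np.zeros((d, d)); E[i, j] = eps
        num_grad[i, j] = (cost(E) - cost(-E)) / (2 * eps)
print(np.allclose(num_grad, 2 * (S.T @ S - R @ R.T), atol=1e-4))  # True
```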

 

Iterative algorithm for computing LSRDRs with low-rank factorization:

Let t>0 be sufficiently small. Let r>0. Set S_0∈M_{n,d}(K) and R_0∈M_{d,n}(K). Let G_0,H_0 be random rank-d matrices. Define G_N,H_N,G_N^♯,H_N^♯,S_N,R_N,S_N^♯,R_N^♯,T_N recursively for all N≥0 by setting

  1. G_N^♯ = Γ(A_1,…,A_r; S_NR_NA_1S_NR_N,…,S_NR_NA_rS_NR_N)^*(G_N),
  2. H_N^♯ = Γ(A_1,…,A_r; S_NR_NA_1S_NR_N,…,S_NR_NA_rS_NR_N)(H_N),
  3. G_{N+1} = G_N^♯ / Tr(G_N^♯),
  4. H_{N+1} = H_N^♯ / Tr(H_N^♯),
  5. (S_N^♯, R_N^♯) = F(S_N + t⋅H_N⋅S_N, R_N + t⋅R_N⋅G_N^*),
  6. T_N = (S_N^♯)^*S_N^♯ − R_N^♯⋅(R_N^♯)^*,
  7. R_{N+1} = (I + r⋅t⋅T_N)⋅R_N^♯, and
  8. S_{N+1} = S_N^♯⋅(I + r⋅t⋅T_N)^{-1}.

Then, if everything goes right, S_NR_N will converge to Pro_∞(A_1,…,A_r) at an exponential rate.
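Here is a minimal numpy sketch of the low-rank iteration (again my own reading; F is replaced by a fixed number of rounds of the f_i maps, the hyperparameters are placeholders, and S_0, R_0 are initialized so that S_0R_0 is an orthogonal projection, which seems like the natural choice):

```python
import numpy as np

def Gamma(As, Xs, Y):
    return sum(A @ Y @ X.conj().T for A, X in zip(As, Xs))

def Gamma_adj(As, Xs, Y):
    return sum(A.conj().T @ Y @ X for A, X in zip(As, Xs))

def f1(S, R): return S, 3 * R @ S @ R - 2 * R @ S @ R @ S @ R
def f2(S, R): return 3 * S @ R @ S - 2 * S @ R @ S @ R @ S, R

def F(S, R, rounds=10):
    """Stand-in for F: apply a fixed number of the factorized f_i maps."""
    for _ in range(rounds):
        S, R = f1(S, R)
        S, R = f2(S, R)
    return S, R

def lsrdr_low_rank(As, d, t=0.01, r=1.0, iters=3000, seed=0):
    """Sketch of the low-rank LSRDR iteration; S_N R_N should approach Pro_∞(A_1,...,A_r)."""
    rng = np.random.default_rng(seed)
    n = As[0].shape[0]
    Q, _ = np.linalg.qr(rng.standard_normal((n, d)))
    S, R = Q.copy(), Q.conj().T.copy()        # S_0 R_0 is a rank-d orthogonal projection
    G = rng.standard_normal((n, n))
    H = rng.standard_normal((n, n))
    I = np.eye(d)
    for _ in range(iters):
        P = S @ R
        PAP = [P @ A @ P for A in As]
        G_sharp = Gamma_adj(As, PAP, G)                          # step 1
        H_sharp = Gamma(As, PAP, H)                              # step 2
        G_next = G_sharp / np.trace(G_sharp)                     # step 3
        H_next = H_sharp / np.trace(H_sharp)                     # step 4
        S_sh, R_sh = F(S + t * H @ S, R + t * R @ G.conj().T)    # step 5
        T = S_sh.conj().T @ S_sh - R_sh @ R_sh.conj().T          # step 6
        R = (I + r * t * T) @ R_sh                               # step 7
        S = S_sh @ np.linalg.inv(I + r * t * T)                  # step 8
        G, H = G_next, H_next
    return S, R
```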

Added 5/31/2025

The iterative algorithm for computing LSRDRs can be written in terms of completely positive superoperators. We define a completely positive superoperator as an operator of the form Φ(A_1,…,A_r) where A_1,…,A_r are square matrices over K. Completely positive operators are typically defined over K=C in quantum information theory, but we are using a more general context here. If E = Φ(A_1,…,A_r), then

Γ(A_1,…,A_r; P_NA_1P_N,…,P_NA_rP_N)^*(G_N) = E^*(G_N⋅P_N)⋅P_N and

Γ(A_1,…,A_r; P_NA_1P_N,…,P_NA_rP_N)(H_N) = E(H_N⋅P_N^*)⋅P_N^*.
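These identities are easy to verify numerically (my own sketch; they hold for any square matrices P, G, H):

```python
import numpy as np

def Phi(As, Y):       # E = Φ(A_1,...,A_r): Y ↦ Σ_j A_j Y A_j^*
    return sum(A @ Y @ A.conj().T for A in As)

def Phi_adj(As, Y):   # E^*: Y ↦ Σ_j A_j^* Y A_j
    return sum(A.conj().T @ Y @ A for A in As)

def Gamma(As, Xs, Y):
    return sum(A @ Y @ X.conj().T for A, X in zip(As, Xs))

def Gamma_adj(As, Xs, Y):
    return sum(A.conj().T @ Y @ X for A, X in zip(As, Xs))

rng = np.random.default_rng(3)
As = [rng.standard_normal((5, 5)) for _ in range(3)]
P, G, H = (rng.standard_normal((5, 5)) for _ in range(3))
PAP = [P @ A @ P for A in As]
print(np.allclose(Gamma_adj(As, PAP, G), Phi_adj(As, G @ P) @ P))            # True
print(np.allclose(Gamma(As, PAP, H), Phi(As, H @ P.conj().T) @ P.conj().T))  # True
```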

Added 6/7/2025

The iterative algorithm for computing LSRDRs applies not just to completely positive superoperators; it also applies to positive superoperators that are not completely positive, as long as the superoperator does not stray too far from complete positivity. A linear superoperator E:M_m(C)→M_n(C) is said to be positive if E(P) is positive semidefinite whenever P is positive semidefinite. Clearly, every completely positive superoperator is positive, but not every positive superoperator is completely positive. For example, the transpose map T defined by T(P)=P^⊤ is positive but not completely positive whenever m=n>1. The difference of two completely positive superoperators from M_m(C) to M_n(C) is known as a Hermitian preserving map, and every positive map is Hermitian preserving. If E:M_m(C)→M_n(C) is linear but not too far from positivity, then, using the iterative algorithm for computing LSRDRs, we typically get the same results every time we train, and if E is Hermitian preserving, then the operators G_N,H_N will be positive semidefinite after training.
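For instance, by Choi's theorem a superoperator is completely positive exactly when its Choi matrix is positive semidefinite, and a quick check (my own sketch) shows that the Choi matrix of the transpose map on M_2(C) has a negative eigenvalue, while the map itself clearly sends positive semidefinite matrices to positive semidefinite matrices:

```python
import numpy as np

# Choi matrix of the transpose map T(P) = Pᵀ on M_n(C): C = Σ_{ij} E_ij ⊗ T(E_ij).
n = 2
C = np.zeros((n * n, n * n))
for i in range(n):
    for j in range(n):
        E = np.zeros((n, n)); E[i, j] = 1.0
        C += np.kron(E, E.T)
print(np.linalg.eigvalsh(C))  # [-1, 1, 1, 1]: not PSD, so T is positive but not completely positive
```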