Is the LSRDR a proposed alternative to NN’s in general?
What interpretability do you gain from it?
Could you show a comparison between a transformer embedding and your method with both performance and interpretability? Even MNIST would be useful.
Also, I found it very difficult to understand your post (e.g. you didn't explain your acronym! I had to infer it). You can use the "request feedback" feature on LW in the future; they typically give feedback quite quickly.
In this post, the existence of a non-gradient-based algorithm for computing LSRDRs is a sign that LSRDRs behave mathematically and are quite interpretable. Gradient ascent is a general-purpose optimization algorithm that works when there is no other way to solve the optimization problem, but when there are multiple ways of obtaining a solution to an optimization problem, the optimization problem is behaving in a way that should be appealing to mathematicians.
LSRDRs and similar algorithms are pseudodeterministic in the sense that if we train the model multiple times on the same data, we typically get identical models. Pseudodeterminism is a signal of interpretability for several reasons that I will go into in more detail in a future post.
In addition to pseudodeterminism, in my experience, LSRDRs are quite interpretable since I have interpreted LSRDRs already in a few posts:
When performing a dimensionality reduction on tensors, the trace is often zero. — LessWrong
I have generalized LSRDRs so that they are starting to behave like deeper neural networks. I am trying to expand the capabilities of generalized LSRDRs so they behave more like deep neural networks, but I still have some work to do to expand their capabilities while retaining pseudodeterminism. In the meantime, generalized LSRDRs may still function as narrow AI for specific problems and also as layers in AI systems.
Of course, if we want to compare capabilities, we should also compare NNs to LSRDRs at tasks such as evaluating the cryptographic security of block ciphers, solving NP-complete problems in the average case, etc.
As for the difficulty of this post, it seems like that is the result of the post being mathematical. But going through this kind of mathematics so that we obtain inherently interpretable AI should be the easier portion of AI interpretability. I would much rather communicate about the actual mathematics than about how difficult the mathematics is.
That does clarify a lot of things for me, thanks!
Looking at your posts, there are no hooks or attempts to sell your work, which is a shame because LSRDRs seem useful. Since they are useful, you should be able to show it.
For example, you trained an LSRDR for text embedding, which you could show at the beginning of the post. Then show the cool properties of pseudo-determinism & lack of noise compared to NNs. THEN all the maths. That way the math folks know if the post is worth their time, and the non-math folks can upvote and share with their mathy friends.
I am assuming that you care about [engagement, useful feedback, connections to other work, possible collaborators] here. If not, then sorry for the unwanted advice!
I’m still a little fuzzy on your work, but possible related papers that come to mind are on tensor networks.
Tensorization is [Cool essentially] - https://arxiv.org/pdf/2505.20132 - mostly a position and theoretical paper arguing why tensorization is great and what its limitations are.
I'm pretty sure both sets of authors here read LW as well.
I would have thought that a fitness function that is maximized using something other than gradient ascent and that can solve NP-complete problems at least in the average case would be worth reading about, since that means it can perform well on some tasks while also behaving mathematically in a way that is needed for interpretability. The quality of the content is inversely proportional to the number of views since people don't think the same way as I do.
Wheels on the Bus | @CoComelon Nursery Rhymes & Kids Songs
Stuff that is popular is usually garbage.
But here is my post about the word embedding.
And I really do not want to collaborate with people who are not willing to read the post. This is especially true of people in academia since universities promote violence and refuse to acknowledge any wrongdoing. Universities are the absolute worst.
Instead of engaging with the actual topic, people tend to just criticize stupid stuff simply because they only want to read about what they already know or what is recommended by their buddies; that is a very good way not to learn anything new or insightful. For this reason, even the simplest concepts are lost on most people.
In this post, I shall describe a fitness function that can be locally maximized without gradient computations. This fitness function is my own. I initially developed this fitness function in order to evaluate block ciphers for cryptocurrency technologies, but I later found that this fitness function may be used to solve other problems such as the clique problem (which is NP-complete) in the average case, and some natural language processing tasks as well. After describing algorithms for locally maximizing this fitness function, I conclude that such a fitness function is inherently interpretable and mathematical, which is what we need for AI safety.
Let $K$ denote either the field of real numbers, the field of complex numbers, or the division ring of quaternions. Given $A_1,\dots,A_r \in M_n(K)$ and $1 \leq d \leq n$, the goal is to find a tuple $(X_1,\dots,X_r) \in M_d(K)^r$ most similar to $(A_1,\dots,A_r)$. In other words, we want to approximate the $n \times n$-matrices $A_1,\dots,A_r$ with $d \times d$-matrices $X_1,\dots,X_r$.
Suppose that $A_1,\dots,A_r \in M_n(K)$ and $B_1,\dots,B_r \in M_m(K)$. Define a function $\Gamma(A_1,\dots,A_r;B_1,\dots,B_r): M_{n,m}(K) \rightarrow M_{n,m}(K)$ by setting
$$\Gamma(A_1,\dots,A_r;B_1,\dots,B_r)(X) = A_1XB_1^* + \dots + A_rXB_r^*,$$
and define $\Phi(A_1,\dots,A_r) = \Gamma(A_1,\dots,A_r;A_1,\dots,A_r)$.
If $A$ is a matrix (or, more generally, a linear operator), then let $\rho(A)$ denote the spectral radius of $A$.
Define the $L_2$-spectral radius similarity between $(A_1,\dots,A_r)$ and $(B_1,\dots,B_r)$ by setting
$$\|(A_1,\dots,A_r)\simeq(B_1,\dots,B_r)\| = \frac{\rho(\Gamma(A_1,\dots,A_r;B_1,\dots,B_r))}{\rho(\Phi(A_1,\dots,A_r))^{1/2}\,\rho(\Phi(B_1,\dots,B_r))^{1/2}}.$$
The quantity $\|(A_1,\dots,A_r)\simeq(B_1,\dots,B_r)\|$ is always a real number in the interval $[0,1]$ (the proof is a generalization of the Cauchy-Schwarz inequality).
If $A_1,\dots,A_r \in M_n(K)$ and $1 \leq d \leq n$, then we say that $(X_1,\dots,X_r) \in M_d(K)^r$ is an $L_{2,d}$-spectral radius dimensionality reduction (LSRDR) of $(A_1,\dots,A_r)$ if the similarity $\|(A_1,\dots,A_r)\simeq(X_1,\dots,X_r)\|$ is locally maximized.
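To make the definition concrete, here is a minimal numpy sketch of the similarity for the real and complex cases (quaternions omitted). It represents each superoperator by a matrix built from Kronecker products under the column-stacking convention; the function names are mine, not the author's.

```python
import numpy as np

def spectral_radius(M):
    """Spectral radius: the largest absolute value of an eigenvalue of M."""
    return np.abs(np.linalg.eigvals(M)).max()

def lsrdr_similarity(As, Xs):
    """L_2-spectral radius similarity between (A_1,...,A_r) and (X_1,...,X_r)
    for real or complex matrices.  The superoperator Y -> sum_j A_j Y X_j^*
    is represented (column-stacking convention) by the matrix
    sum_j conj(X_j) kron A_j, and we take spectral radii of those matrices."""
    gamma_AX = sum(np.kron(X.conj(), A) for A, X in zip(As, Xs))
    phi_AA = sum(np.kron(A.conj(), A) for A in As)
    phi_XX = sum(np.kron(X.conj(), X) for X in Xs)
    denom = np.sqrt(spectral_radius(phi_AA) * spectral_radius(phi_XX))
    return spectral_radius(gamma_AX) / denom

# Hypothetical example: compare three random 6x6 matrices with three 2x2 matrices.
rng = np.random.default_rng(0)
As = [rng.standard_normal((6, 6)) for _ in range(3)]
Xs = [rng.standard_normal((2, 2)) for _ in range(3)]
print(lsrdr_similarity(As, Xs))   # a real number in [0, 1]
print(lsrdr_similarity(As, As))   # the similarity of a tuple with itself is 1
```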
One can produce an LSRDR of $(A_1,\dots,A_r)$ simply by locally maximizing the similarity $\|(A_1,\dots,A_r)\simeq(X_1,\dots,X_r)\|$ using gradient ascent. But there is another way to obtain LSRDRs since they behave mathematically.
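As a rough illustration of the gradient-ascent route (a toy sketch, not the author's implementation), the following uses finite-difference gradients in the real case; for self-containedness the similarity helper is repeated in compact form.

```python
import numpy as np

def spectral_radius(M):
    return np.abs(np.linalg.eigvals(M)).max()

def similarity(As, Xs):
    # Compact real-case version of the L_2-spectral radius similarity above.
    num = spectral_radius(sum(np.kron(X, A) for A, X in zip(As, Xs)))
    dA = spectral_radius(sum(np.kron(A, A) for A in As))
    dX = spectral_radius(sum(np.kron(X, X) for X in Xs))
    return num / np.sqrt(dA * dX)

def lsrdr_by_gradient_ascent(As, d, steps=300, lr=0.05, h=1e-5, seed=0):
    """Crude local maximization of the similarity over (X_1, ..., X_r) in
    M_d(R)^r using finite-difference gradients (a toy stand-in for gradient
    ascent; the similarity is scale invariant, so no normalization is needed)."""
    rng = np.random.default_rng(seed)
    Xs = rng.standard_normal((len(As), d, d))
    for _ in range(steps):
        base = similarity(As, list(Xs))
        grad = np.zeros_like(Xs)
        for idx in np.ndindex(*Xs.shape):        # finite-difference gradient
            bumped = Xs.copy()
            bumped[idx] += h
            grad[idx] = (similarity(As, list(bumped)) - base) / h
        Xs += lr * grad                          # ascent step
    return list(Xs), similarity(As, list(Xs))

# Hypothetical toy run: reduce three 5x5 matrices to 2x2 matrices.
rng = np.random.default_rng(1)
As = [rng.standard_normal((5, 5)) for _ in range(3)]
Xs, fitness = lsrdr_by_gradient_ascent(As, d=2)
print(fitness)   # close to a local maximum of the similarity; at most 1
```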
Let $Z(K)$ denote the center of the algebra $K$. In particular, $Z(\mathbb{R}) = \mathbb{R}$, $Z(\mathbb{C}) = \mathbb{C}$, and $Z(\mathbb{H}) = \mathbb{R}$.
Empirical observation: If $(X_1,\dots,X_r)$ is an LSRDR of $(A_1,\dots,A_r)$, then typically some $\lambda \in Z(K)$ exists along with matrices $R \in M_{d,n}(K)$, $S \in M_{n,d}(K)$ with $X_j = \lambda R A_j S$ for all $j$ and where if $P = SR$, then $P$ is a (typically non-orthogonal) projection matrix. The projection matrix $P$ is typically unique in the sense that if we train an LSRDR of $(A_1,\dots,A_r)$ with rank $d$ multiple times, then we generally end up with the same projection matrix $P$. If the projection matrix $P$ is unique, then we shall call $P$ the canonical LSRDR projection matrix of rank $d$. Let $U$ denote the dominant eigenvector of the corresponding completely positive superoperator with $\operatorname{tr}(U) = 1$ (in the quaternionic case, we should define the trace as the real part of the sum of diagonal entries in the matrix). Let $V$ denote the dominant eigenvector of its adjoint with $\operatorname{tr}(V) = 1$. Then $U, V$ are positive semidefinite matrices.
Said differently, $P$, along with the corresponding dominant eigenvalue, is a solution to a system of equations.
One may often solve an equation or system of equations using iteration. For example, if $(X,\delta)$ is a complete metric space, and $f: X \rightarrow X$ is a function with $\delta(f(x),f(y)) \leq c\,\delta(x,y)$ for all $x,y \in X$ and some constant $0 \leq c < 1$, and $f(x_0) = x_0$, then $\lim_{k\rightarrow\infty} f^k(x) = x_0$ for all $x \in X$ by the contraction mapping theorem. We shall also apply this idea to produce the projection matrix $P$ in an LSRDR since $P$ satisfies a system of equations.
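As a tiny illustration of solving an equation by iteration: the map $x \mapsto \cos x$ is a contraction near its fixed point, so iterating it solves $\cos x = x$.

```python
import math

# Solve cos(x) = x by iterating the contraction f(x) = cos(x).
x = 1.0
for _ in range(100):
    x = math.cos(x)
print(x)                      # ~0.739085, the unique fixed point of cos
print(abs(math.cos(x) - x))   # the residual is essentially zero
```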
We need a function that can be easily applied to matrices but where each projection matrix is an attractive fixed point. Consider a function $f$ for which $0$ and $1$ are attracting fixed points; one such choice is the cubic $f(x) = 3x^2 - 2x^3$, which satisfies $f(0)=0$, $f(1)=1$, and $f(1/2)=1/2$. If $x \in [0,1/2)$, then the sequence $(f^k(x))_k$ converges to $0$ very quickly, and if $x \in (1/2,1]$, then $(f^k(x))_k$ converges to $1$ very quickly. Actually, there are neighborhoods $U$ of $0$ and $V$ of $1$ in the complex plane where if $x \in U$, then $(f^k(x))_k$ converges to $0$ quickly, and if $x \in V$, then $(f^k(x))_k$ converges to $1$ quickly. As a consequence, if $P$ is a projection matrix, then there is a neighborhood $W$ of $P$ where if $X \in W$, then $(f^k(X))_k$ converges to some projection matrix very quickly. Let $D$ denote the set of all matrices $X$ for which $(f^k(X))_k$ converges to a projection matrix. Define a partial function $\varpi$ by setting $\varpi(X) = \lim_{k\rightarrow\infty} f^k(X)$ for $X \in D$.
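Here is a small numpy sketch of this rounding step. The cubic $f(x) = 3x^2 - 2x^3$ and the toy matrices are my assumptions for the illustration; any function with attracting fixed points at $0$ and $1$ plays the same role.

```python
import numpy as np

def f(X):
    """Matrix version of the cubic f(x) = 3x^2 - 2x^3 (assumed rounding
    function for this illustration); 0 and 1 are attracting fixed points."""
    X2 = X @ X
    return 3 * X2 - 2 * X2 @ X

rng = np.random.default_rng(0)
n, d = 6, 2
S = rng.standard_normal((n, d))
R = rng.standard_normal((d, n))
P = S @ np.linalg.inv(R @ S) @ R              # a non-orthogonal rank-d projection
X = P + 0.01 * rng.standard_normal((n, n))    # a small perturbation of P

for _ in range(8):                            # a few rounding iterations
    X = f(X)

print(np.linalg.norm(X @ X - X))   # ~0: X has been rounded back to a projection
print(np.linalg.matrix_rank(X))    # 2
```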
Iterative algorithm for computing LSRDRs: Let $\epsilon > 0$, but set $\epsilon$ to a sufficiently small value. Let $A_1,\dots,A_r \in M_n(K)$ and $1 \leq d \leq n$. Let $P_0$ be a random orthonormal projection of rank $d$, and choose random matrices in $M_n(K)$ for the initialization. Define $P_k$ for all $k > 0$ recursively from $P_{k-1}$.
Then $(P_k)_k$ typically converges to a matrix $P_\infty$ of rank $d$ at an exponential rate. Furthermore, the matrix $P_\infty$ typically does not depend on the initialization $P_0$, but it does depend on $A_1,\dots,A_r$ and $d$. Therefore, set $P_\infty = \lim_{k\rightarrow\infty} P_k$ whenever this limit does not depend on the initialization (just assume that the limit converges). If $rd \geq 2n$, then $P_\infty$ is typically the canonical rank-$d$ LSRDR projection operator.
The iterative algorithm for computing LSRDRs is a proper extension of the notion of an LSRDR in some cases. For example, if we simply count the number of parameters, the matrices $X_1,\dots,X_r$ have $rd^2$ parameters while the matrices $R,S$ have $2nd$ parameters. Therefore, if $2n > rd$, then the matrices $R,S$ (and the projection matrix $P = SR$) have more parameters than $(X_1,\dots,X_r)$. This means that we cannot obtain a unique canonical LSRDR projection of dimension $d$ when $2n > rd$. On the other hand, even when $2n > rd$, the matrix $P_\infty$ typically exists, and if we set $P = P_\infty$, there are $R \in M_{d,n}(K)$, $S \in M_{n,d}(K)$ where $P = SR$ and where $(RA_1S,\dots,RA_rS)$ is an LSRDR of $(A_1,\dots,A_r)$. This means that the iterative algorithm for computing LSRDRs gives more information than simply an LSRDR. The iterative algorithm for computing LSRDRs gives a projection operator.
Interpretability: Since there are several ways of computing LSRDRs, LSRDRs behave mathematically. A machine learning algorithm that typically produces the same output is more interpretable than one that does not, for a few reasons, including that when an algorithm has only one possible output, that output depends only on the input and has no other source of random information contributing to it. Since we typically (but not always) attain the same local maximum when training LSRDRs multiple times, LSRDRs are both interpretable and mathematical. This sort of mathematical behavior is what we need to make sense of the inner workings of LSRDRs and other machine learning algorithms. There are several ways to generalize the notion of an LSRDR, but these generalized machine learning algorithms still tend to behave mathematically; they still tend to produce the exact same trained model after training multiple times.
Capabilities: Trained LSRDRs can solve NP-complete problems such as the clique problem. I have also trained LSRDRs to produce word embeddings for natural language processing and to analyze the octonions. I can generalize LSRDRs so that they behave more like deep neural networks, but we still have a way to go until these generalized LSRDRs perform as well as our modern AI algorithms, and I do not know how far we can push the performance of generalized LSRDRs while retaining their inherent interpretability and mathematical behavior.
If mathematical generalized LSRDRs can compete in performance with neural networks, then we are closer to solving the problem of general AI interpretability. But if not, then mathematical generalized LSRDRs could possibly be used as a narrow AI tool or even as a component of general AI; in this case, generalized LSRDRs could still improve general AI interpretability a little bit. An increased usage of narrow AI will (slightly) decrease the need for general AI.
Added 5/28/2025
When $d$ is much smaller than $n$, the iterative algorithm for computing LSRDRs produces a low rank projection matrix, but a matrix $P$ of rank $d$ can be factored as $P = SR$ where $S \in M_{n,d}(K)$ and $R \in M_{d,n}(K)$. To save computational resources, we may just work with the pair $(S,R)$ during the training.
Suppose that $P \in M_n(K)$ has rank $d$. Then there are several ways to easily factor $P$ as the product of an $n \times d$-matrix with a $d \times n$-matrix, and the algorithm below works with such a factorization directly.
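As one concrete way to produce such a factorization (my own choice via a truncated SVD, not necessarily the factorizations referred to above), here is a short numpy sketch:

```python
import numpy as np

def factor_rank_d(P, d):
    """Factor an (approximately) rank-d matrix P as P = S @ R with S of
    shape (n, d) and R of shape (d, n), using a truncated SVD."""
    U, sing, Vh = np.linalg.svd(P)
    S = U[:, :d] * np.sqrt(sing[:d])              # n x d
    R = np.sqrt(sing[:d])[:, None] * Vh[:d, :]    # d x n
    return S, R

# Hypothetical example: factor a 6x6 rank-2 (non-orthogonal) projection.
rng = np.random.default_rng(0)
n, d = 6, 2
A = rng.standard_normal((n, d))
B = rng.standard_normal((d, n))
P = A @ np.linalg.inv(B @ A) @ B
S, R = factor_rank_d(P, d)
print(np.linalg.norm(S @ R - P))   # ~0: the factorization recovers P
print(S.shape, R.shape)            # (6, 2) (2, 6)
```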
In the following algorithm, we will sometimes need to replace a pair $(S,R)$ with a new pair $(S',R')$ such that $S'R' = SR$ but where $(S',R')$ has smaller norm than $(S,R)$, because if we do not do this, then after much training the norms of $S$ and $R$ will grow very large even though $SR$ is just a projection matrix. To do this, we decrease the norm of the pair using gradient descent: we want to move in the direction that minimizes the sum $\|S\|^2 + \|R\|^2$ while leaving the product $SR$ unchanged.
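As a simple stand-in for this rebalancing step (a scalar rescaling rather than the gradient-descent step described above), the following sketch keeps the product $SR$ fixed while shrinking $\|S\|_F^2 + \|R\|_F^2$:

```python
import numpy as np

def balance_pair(S, R):
    """Replace (S, R) by (c*S, R/c) with c = sqrt(||R||_F / ||S||_F).
    The product S @ R is unchanged, and this c minimizes
    ||c*S||_F^2 + ||R/c||_F^2 over all scalar rescalings."""
    c = np.sqrt(np.linalg.norm(R) / np.linalg.norm(S))
    return c * S, R / c

# Example: a badly scaled pair whose product we must not change.
rng = np.random.default_rng(0)
S = 1e3 * rng.standard_normal((6, 2))
R = 1e-3 * rng.standard_normal((2, 6))
S2, R2 = balance_pair(S, R)
print(np.allclose(S @ R, S2 @ R2))                    # True: same product
print(np.linalg.norm(S)**2 + np.linalg.norm(R)**2)    # huge before balancing
print(np.linalg.norm(S2)**2 + np.linalg.norm(R2)**2)  # much smaller afterwards
```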
Iterative algorithm for computing LSRDRs with low rank factorization:
Let $\epsilon > 0$, but $\epsilon$ needs to be sufficiently small. Let $A_1,\dots,A_r \in M_n(K)$ and $1 \leq d \leq n$. Let $S_0 \in M_{n,d}(K)$ and $R_0 \in M_{d,n}(K)$ be random rank-$d$ matrices. Define $(S_k,R_k)$ recursively for all $k > 0$ from $(S_{k-1},R_{k-1})$.
Then, if everything goes right, $S_kR_k$ would converge to the projection matrix $P$ at an exponential rate.
Added 5/31/2025
The iterative algorithm for computing LSRDRs can be written in terms of completely positive operators. We define a completely positive superoperator as an operator of the form $\Phi(X) = A_1XA_1^* + \dots + A_rXA_r^*$ where $A_1,\dots,A_r$ are square matrices over $K$. Completely positive operators are typically defined when $K = \mathbb{C}$ for quantum information theory, but we are using a more general context here.
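A minimal numpy sketch of a completely positive superoperator in this sense, together with its matrix representation (the matrices and names here are hypothetical examples):

```python
import numpy as np

def completely_positive_map(kraus_ops):
    """The superoperator Phi(X) = sum_j A_j X A_j^* determined by the given
    square matrices A_1, ..., A_r."""
    def phi(X):
        return sum(A @ X @ A.conj().T for A in kraus_ops)
    return phi

# Hypothetical example data: three random 4x4 complex matrices A_j.
rng = np.random.default_rng(0)
n, r = 4, 3
kraus = [rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))
         for _ in range(r)]
phi = completely_positive_map(kraus)

# A completely positive map sends positive semidefinite matrices to
# positive semidefinite matrices.
B = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))
X = B @ B.conj().T                                  # a random PSD matrix
print(np.linalg.eigvalsh(phi(X)).min() >= -1e-9)    # True

# As a linear operator on M_n, Phi is represented by the n^2 x n^2 matrix
# sum_j conj(A_j) kron A_j acting on vec(X) (column-stacking convention).
M = sum(np.kron(A.conj(), A) for A in kraus)
vecX = X.flatten(order="F")
print(np.allclose(M @ vecX, phi(X).flatten(order="F")))   # True
```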
Added 6/7/2025
The iterative algorithm for computing LSRDRs applies not just to completely positive superoperators; it also applies to positive superoperators that are not completely positive, as long as the superoperators do not stray too far away from complete positivity. A linear superoperator $\Phi: M_n(K) \rightarrow M_n(K)$ is said to be positive if $\Phi(X)$ is positive semidefinite whenever $X$ is positive semidefinite. Clearly, every completely positive superoperator is positive, but not every positive operator is completely positive. For example, the transpose map $T: M_n(K) \rightarrow M_n(K)$ defined by $T(X) = X^T$ is always positive but not completely positive whenever $n \geq 2$. The difference of two completely positive superoperators from $M_n(K)$ to $M_n(K)$ is known as a Hermitian preserving map. Every positive map is Hermitian preserving. If $\Phi$ is linear but not too far from positivity, then using the iterative algorithm for computing LSRDRs, we typically get the same results every time we train, and if $\Phi$ is Hermitian preserving, then the operators obtained from the training will be positive semidefinite.
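To illustrate the transpose-map example, the sketch below checks positivity on a random positive semidefinite input and uses the Choi-matrix criterion (Choi's theorem: a map is completely positive iff its Choi matrix is positive semidefinite) to confirm that the transpose map on $M_2(\mathbb{C})$ is not completely positive.

```python
import numpy as np

def choi_matrix(phi, n):
    """Choi matrix C = sum_{i,j} E_ij kron phi(E_ij) of a superoperator
    phi: M_n -> M_n; by Choi's theorem, phi is completely positive
    if and only if C is positive semidefinite."""
    C = np.zeros((n * n, n * n), dtype=complex)
    for i in range(n):
        for j in range(n):
            E_ij = np.zeros((n, n), dtype=complex)
            E_ij[i, j] = 1.0
            C += np.kron(E_ij, phi(E_ij))
    return C

transpose_map = lambda X: X.T
n = 2

# The transpose of a positive semidefinite matrix is positive semidefinite,
# so the transpose map is a positive superoperator:
rng = np.random.default_rng(0)
B = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))
X = B @ B.conj().T
print(np.linalg.eigvalsh(transpose_map(X)).min() >= -1e-12)        # True

# But its Choi matrix (the swap operator) has a negative eigenvalue,
# so the transpose map is not completely positive:
print(np.linalg.eigvalsh(choi_matrix(transpose_map, n)).min())     # -1.0
```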