Joseph Van Name

I am not expecting any worldwide regulation on AI that prohibits people from using or training unaligned systems (I am just expecting a usual level of regulation). I am mainly hoping for spectral techniques to develop to the point where AI groups will want to use these spectral techniques (or some other method) more and more until they are competitive with neural networks at general tasks or at least complement the deficiencies of neural networks. I also hope that these spectral techniques will remain interpretable and aligned.

Right now, there are several kinds of tasks in which I would rather use spectral techniques than neural networks. I have been evaluating the cryptographic security of block ciphers with small message size and very small key size (for cryptocurrency research), and it seems like the spectral techniques that I have developed give consistent measures of security for such block ciphers (I am not done with the training yet) and these spectral techniques are better for cryptanalysis than neural networks. I have been able to use these spectral techniques for other problems such as the problem of finding the largest clique in a graph (this is not something that I would have expected before I did it), and right now these spectral techniques are the only way I know how to transform a non-commutative polynomial into something that other machine learning models can work with better.

Right now, I do not know how to use spectral techniques to replace deep neural networks. I do not know how to use spectral techniques to approximate a universal function and I do not know how to use spectral techniques to make machine learning models with many layers. I hope to be able to solve these problems of spectral techniques, but I agree that there will be a tradeoff between performance and interpretability. The goal is to make this tradeoff favor interpretable, aligned, and safe machine learning models.

I agree that interpretability research is risky, and that one should carefully consider whether it is worth it to perform this interpretability research. I propose that a safer alternative would be to develop machine learning models that are

- quite unrelated to the highest performing machine learning models (this is so that capability gains in these safer models do not translate very well to capability gains for the high performing models),
- as inherently interpretable as one can make these models,
- more limited in functionality than the highest performing machine learning models, but
- still functional enough to be useful in practice.

These characteristics are correlated with each other, so it is likely that one would get all of these characteristics at once. For example, one will most likely not get interpretability for free but instead trade interpretability for performance. Being unrelated to the highest performing machine learning models will also likely reduce the performance of the machine learning models.

It seems like spectral methods such as PageRank or the eigenvectors of the Laplacian of a graph satisfy some of these characteristics, but it would be good if there were more research into developing spectral methods further to increase their functionality.

Perhaps interpretability research has yielded such few results because we need to first develop inherently interpretable models.

I do not see any evidence that large language models are equipped to understand the structure behind prime numbers. But transformers along with other machine learning tools should be well-equipped to investigate other mathematical structures. In particular, I am thinking about the mathematical structures called Laver-like algebras that I have been researching on and off since about 2015.

I have developed an algorithm that is capable of producing new Laver-like algebras from old ones. From every Laver-like algebra, one can generate a sequence of non-commutative polynomials. From these non-commutative polynomials, one can recover the entire Laver-like algebra up to critical equivalence. Therefore, to use deep learning to investigate the structure of Laver-like algebras, we need a few things.

- We need a way of transforming the non-commutative polynomials into something that further layers of AI can work with. I suggest using my notion of an LSRDR that I developed for cryptographic purposes to transform these non-commutative polynomials into collections of matrices. These matrices are often (but not always) unique up-to-a constant scalar factor and orthogonal/unitary equivalence. But this is simply what I have come up with using my own methods. There are likely other ways that transform non-commutative polynomials into vectors that I have simply not thought of.
- We need a way of solving a machine learning problem from a sequence of vectors or matrices as input. This is where we can use transformers.
- We need a problem or problems to solve along with an algorithm for deciding which Laver-like algebras to go through next. One such problem would be to estimate the number of Laver-like algebras exist that satisfy certain conditions. Another such problem would be to determine whether an algebra arises from the rank-into-rank embeddings or not. Since transformers are good at working with a variety of problems, we may be able to train transformers to do a variety of tasks related to Laver-like algebras.

Let me know if you want more detailed information about this proposal.

While I believe that this is a sensible proposal, I do not believe there will be too much of a market (in the near future) for it for societal reasons. Our current society unfortunately does not have a sufficient appreciation for mathematics and mathematical theorems for this to have much of a market cap. To see why this is the case, we can compare this proposal to the proposal for cryptocurrency mining algorithms that are designed to advance science. I propose that a scientific cryptocurrency mining algorithm should attract a much larger market capitalization than a theorem-proof market, but since scientific cryptocurrencies do not have much of a market, one should not expect for shares of theorems to attract much of a market either.

A market for mathematical conjectures requires participants to not only actively understand the mathematics behind the conjectures, but to also understand these theorems in their complete formalism. Perhaps AI systems may be developed to translate the formalism into English, but such a system may be flawed and may be subject to hacking in the sense that the statement in English may not be exactly the same as the formal statement, and it may be really easy to prove the formal statement. Since such a market requires a substantial amount of effort from the users, it is unlikely that there will be many users who are willing to put in the effort into understanding the mathematics along with the formalism.

It is also likely for the market to favor simple propositions that are easy to understand such as the bounded gaps between primes conjecture. For example, Beal's conjecture states that there are no positive integers with and and where have no common factor. This conjecture was formulated by the billionaire Michael Beal. Beal attached a million dollar prize to this conjecture. This conjecture is a generalization of Fermat's last theorem. Beal's conjecture is a reasonably interesting conjecture, but it seems like Beal's conjecture is the very best mathematics problem that amateur mathematicians can come up with and fund with one's own money. This means that the market for mathematical theorems will be small and will mainly for theorems that anyone can understand such as the Twin Prime conjecture. It would be more satisfying if the market for mathematical theorems better represents the type of problems and questions that your typical mathematician were interested in.

Cryptocurrency mining algorithms can be used to solve scientific problems, but the users of the cryptocurrency do not even have to care very much about the underlying scientific problem. This is because a cryptocurrency is much more than its mining algorithm, and the details about the mining algorithm are relatively unimportant to the functionality of the cryptocurrency. If chosen correctly, the underlying scientific/mathematical problem does not harm or interfere very much with the cryptocurrency, so there is really no excuse for the cryptocurrency community to waste all those resources on cryptocurrencies on non-scientific problems because the cryptocurrency community does not have to. The cryptocurrency community needs to have a desire to not waste energy along with the philosophy that solving the most important scientific problem is profitable (this would be a self-fulfilling prophecy). The cryptocurrency community should also have a basic awareness of mining algorithms that are designed to solve a scientific problem (they do not need to know the specific details). This basic awareness should be part of the due diligence that every cryptocurrency enthusiast should partake in order to try to maximize profit, so this understanding should be natural for the cryptocurrency sector. Since there is still only one dominant minable cryptocurrency (namely Bitcoin), one only needs one scientific mining algorithm, so the cryptocurrency community does not even need to understand multiple scientific mining algorithms. They just need to know one. It is easier to know about one cryptocurrency mining algorithm than it is to know the market value of many theorems.

Right now Bitcoin has a market capitalization of about 500 billion dollars, and if Bitcoin was set up correctly, it could have had a mining algorithm that was designed to solve a very important scientific problem without the cryptocurrency community even thinking about this. Unfortunately, Bitcoin's mining algorithm was never designed to advance science, and at this point, the cryptocurrency community is not very interested (due to anti-intellectualism) in using cryptocurrency technologies to solve any scientific problem. The cryptocurrency community to a great extent has no concept or understanding of Bitcoin or what it means for a cryptocurrency mining algorithm to solve a scientific problem. And those who think they understand what it means for a cryptocurrency mining algorithm to be designed to solve a scientific problem are quite antagonistic to cryptocurrency mining being used for any scientific purpose. Since the cryptocurrency community hates using cryptocurrency mining for a scientific purpose so much, one should expect for any market for proofs of theorems to be very small and undeveloped.

Nobody should expect for there to be a sizable theorem/proof market when that requires ~100 times as much effort from users than a market for cryptocurrencies with scientific mining algorithms and when there is no sizable scientific mining algorithm market.

With everything being said, we make a tradeoff with cryptocurrencies with scientific mining algorithms. Cryptocurrency mining algorithms such as SHA-256 must satisfy stringent conditions, so there are limited options in designing a cryptocurrency with a mining algorithm to advance science, and it is hard to fork a cryptocurrency to replace its scientific problem.

I would rather see people work on making scientific cryptocurrency mining more of a reality than imagining a market that is unlikely to exist any time soon because scientific cryptocurrencies are more feasible with the current state of the world.

You claim that Terry Tao says that Benford's law lack's a satisfactory explanation. That is not correct. Terry Tao actually gave an explanation for Benford's law. And even if he didn't, if you are moderately familiar with mathematics, an explanation for Benford's law would be obvious or at least a standard undergraduate exercise.

Let be a random variable that is uniformly distributed on the interval for some . Let denote the first digit of . Then show that .

For all , let be a uniformly distributed random variable on the interval . Let denote the first digit of . Suppose that . Prove that .

Terry Tao in that blog post lists the characteristics that imply Zipf's law here:

"These laws govern the asymptotic distribution of many statistics which

- (i) take values as positive numbers;
- (ii) range over many different orders of magnitude;
- (iiii) arise from a complicated combination of largely independent factors (with different samples of arising from different independent factors); and
- (iv) have not been artificially rounded, truncated, or otherwise constrained in size."

Terry Tao merely stated that Benford's law cannot be proven the same way mathematical theorems are proven; "Being empirically observed phenomena rather than abstract mathematical facts, Benford’s law, Zipf’s law, and the Pareto distribution cannot be “proved” the same way a mathematical theorem can be proved. However, one can still *support* these laws mathematically in a number of ways, for instance showing how these laws are compatible with each other, and with other plausible hypotheses on the source of the data."

I have originally developed LSRDRs to investigate the cryptographic security of the mining algorithm for the cryptocurrency Circcash and compare Circcash mining with similar mining algorithms. I launched the cryptocurrency Circcash so that Circcash mining accelerates the development of reversible computing hardware. But to do this, I needed to construct my own cryptographic algorithm. Unfortunately, I was unable to thoroughly investigate the cryptographic security of mining algorithms before launching Circcash. I decided that it was best to develop some of my own techniques for evaluating the cryptographic security of Circcash including LSRDRs, normal forms, and other techniques, but I still have not completely investigated Circcash using LSRDRs since I need more computational power to reduce the dimension of collections of 2^32-1 by 2^32-1 matrices.

But it looks like LSRDRs and similar algorithms have other uses (such as producing word embeddings, graph/hypergraph embeddings, etc), and I expect to expand the capabilities of algorithms like LSRDRs.

Here is a post that I have made about how we still need to calculate LSRDRs of cryptographic functions to evaluate the cryptographic security of these cryptographic functions:

https://github.com/jvanname/circcash/issues/10

Since Circcash mining will accelerate the development of more advanced AI, it is a good idea to use the knowledge that I have produced by trying to investigate the cryptographic security of Circcash to try to get reversible computing to be used for good. Here is a post about how Circcash needs to be used to advance AI safety instead of just computational power.

https://github.com/jvanname/circcash/issues/13

Yes. I am looking for investors in Circcash, and I am willing to consult on the use of LSRDRs and similar algorithms in machine learning.

Massively downvoting mathematics without commenting at all just shows that the people on this site are very low quality specimens who do not care at all about rationality at all but who just want to pretend to be smart.

The intersection of machine learning and generalizations of Laver tables seems to have a lot of potential, so your question about this is extremely interesting to me.

Machine learning models from generalized Laver tables?I have not been able to find any machine learning models that are based on generalized Laver tables that are used for solving problems unrelated to Laver tables. I currently do not directly see any new kinds of deep learning models that one can produce from generalizations of Laver tables, but I would not be surprised if there were a non-standard machine learning model from Laver like algebras that we could use for something like NLP.

But I have to be skeptical about applying machine learning to generalized Laver tables. Generalizations of Laver tables have intricate structure and complexity but this complexity is ordered. Furthermore, there is a very large supply of Laver-like algebras (even with 2 generators) for potentially any situation. Laver-like algebras are also very easy to compute. While backtracking may potentially take an exponential amount of time, when I have computed Laver-like algebras, the backtracking algorithm seems to always terminate quickly unless it is producing a large quantity or Laver-like algebras or larger Laver-like algebras. But with all this being said, Laver-like algebras seem to lack the steerability that we need to apply these algebras to solving real-world problems. For example, try interpreting (in a model theoretic sense) modular arithmetic modulo 13 in a Laver-like algebra. That is quite hard to do because Laver-like algebras are not built for that sort of thing.

Here are a couple of avenues that I see as most promising for producing new machine learning algorithms from Laver-like algebras and other structures.

Let X be a set, and let ∗ be a binary operation on X that satisfies the self-distributivity identity x∗(y∗z)=(x∗y)∗(x∗z). Define the right powers x[n] for n≥1 inductively by setting x[1]=x and x[n+1]=x∗x[n]. We say that (X,∗,1) is a reduced nilpotent self-distributive algebra if 1 satisfies the identities 1∗x=x,x∗1=1 for x∈X and if for all x∈X there is an n≥1 with x[n]=1. A reduced Laver-like algebra is a reduced nilpotent self-distributive algebra (X,∗) where if xn∈X for each n≥0, then x0∗⋯∗xN=1 for some N. Here we make the convention to put the implied parentheses on the left so that a∗b∗c∗d=((a∗b)∗c)∗d.

Reduced nilpotent self-distributive algebras have most of the things that one would expect. Reduced nilpotent self-distributive algebras are often equipped with a composition operation. Reduced nilpotent self-distributive algebras have a notion of a critical point, and if our reduced nilpotent self-distributive algebra is endowed with a composition operation, the set of all critical points in the algebra forms a Heyting algebra. If (X,∗) is a self-distributive algebra, then we can define x∗0y=y,x∗n+1y=x∗n(x∗y)=x∗(x∗ny) andx∗∞y=limn→∞x∗ny in the discrete topology (this limit always exists for nilpotent self-distributive algebras). We define x⪯y precisely when x∗∞y=1 and x≃y precisely when x⪯y⪯x. We can define crit(x) to be the equivalence class of x with respect to ≃ and crit[X]=X/≃. The operations on crit[X] arecrit(x)→crit(y)=crit(x∗∞y) and crit(x)∧crit(y)=crit(x∘y) whenever we have our composition operation ∘. One can use computer calculations to add another critical point to a reduced nilpotent self-distributive algebra and obtain a new reduced nilpotent self-distributive algebra with the same number of generators but with another critical point on top (and this new self-distributive algebra will be sub-directly irreducible). Therefore, by taking subdirect products and adding new critical points, we have an endless supply of reduced nilpotent self-distributive algebras with only 2 generators. I also know how to expand reduced nilpotent self-distributive algebras horizontally in the sense that given a finite reduced nilpotent self-distributive algebra (X,∗) that is not a Laver-like algebra, we can obtain another finite reduced nilpotent self-distributive algebra (Y,∗) where X,Y are both generated by the same number of elements and these algebras both have the same implication algebra of critical points but |Y|>>|X| but there is a surjective homomorphism ϕ:Y→X. The point is that we have techniques for producing new nilpotent self-distributive algebras from old ones, and going deep into those new nilpotent self-distributive algebras.

Since reduced nilpotent self-distributive algebras are closed under finite subdirect products, subalgebras, and quotients, and since we have techniques for producing many nilpotent self-distributive algebras, perhaps one can make a ML model from these algebraic structures. On the other hand, there seems to be less of a connection between reduced nilpotent self-distributive algebras and large cardinals and non-commutative polynomials than with Laver-like algebras, so perhaps it is sufficient to stick to Laver-like algebras as a platform for constructing ML models.

From Laver-like algebras, we can also produce non-commutative polynomials. If (X,∗) is a Laver-like algebra, and (xa)a∈A is a generating set for X, then for each non-maximal critical point α∈crit[X], we can define a non-commutative polynomial pα((xa)a∈A) to be the sum of all non-commutative monomials of the form xa1…xan where a1…an∈A∗ and crit(a1∗⋯∗an)=α but where crit(a1∗⋯∗am)<α for 1≤m<n. These non-commutative polynomials capture all the information behind the Laver-like algebras since one can reconstruct the entire Laver-like algebra up to critical equivalence from these non-commutative polynomials. These non-commutative polynomials don't work as well for nilpotent self-distributive algebras; for nilpotent self-distributive algebras, these non-commutative polynomials will instead be rational expressions and one will not be able to recover the entire nilpotent self-distributive algebra from these rational expressions.

I have used these non-commutative polynomials to construct the infinite product formula

…pr(x1,…,xn)…p0(x1,…,xn)=11−(x1+⋯+xn) where the polynomials pn(x1,…,xn) are obtained from n different non-trivial rank-into-rank embeddings j1,…,jn:Vλ→Vλ. This means that the non-commutative ring operations +,⋅ in the rings of non-commutative polynomials are meaningful for Laver-like algebras.

If I wanted to interpret and make sense of a Laver-like algebra, I would use these non-commutative polynomials. Since non-commutative polynomials are closely related to strings (for NLP) and since the variables in these non-commutative polynomials can be substituted with matrices, I would not be surprised if one can turn Laver-like algebras into a full ML model using these non-commutative polynomials in order to solve a problem unrelated to Laver-like algebras.

Perhaps, one can also use strong large cardinal hypotheses and logic to produce ML models from Laver-like algebras. Since the existence of a non-trivial rank-into-rank embedding is among the strongest of all large cardinal hypotheses, and since one can produce useful theorems about natural looking finite algebraic structures from these large cardinal hypotheses, perhaps the interplay between the logic using large cardinal hypotheses and finite algebraic structures may be able to produce ML models? For example, one can automate the process of using elementarity to prove the existence of finite algebraic structures. After one has proven the existence of these structures using large cardinal hypotheses, one can do a search using backtracking in order to actually find candidates for these algebraic structures (or if the backtracking algorithm is exhaustive and turns up nothing, then we have a contradiction in the large cardinal hierarchy or a programming error. Mathematicians need to expend a lot of effort into finding an inconsistency in the large cardinal hierarchy this way because I am confident that they won't find such an inconsistency they will instead find plenty of near misses.). We can automate the process of producing examples of Laver-like algebras that satisfy conditions that were first established using large cardinal hypotheses, and we can perhaps completely describe these Laver-like algebras (up-to-critical equivalence) using our exhaustive backtracking process. This approach can be used to produce a lot of data about rank-into-rank embeddings, but do not see a clear path that allows us to apply this data outside of set theory and Laver-like algebras.

Using deep learning to investigate generalized Laver tablesWe can certainly apply existing deep learning and machine learning techniques to classical and multigenic Laver tables.

The n-th classical Laver table is the unique self-distributive algebraic structure An=({1,…,2n},∗n) wherex∗n1=x+1mod2n whenever x∈{1,…,2n}. The classical Laver tables are up-to-isomorphism the only nilpotent self-distributive algebras generated by a single element.

Problem: Compute the n-th classical Laver table for as large n as we can achieve.

Solution: Randall Dougherty in his 1995 paper Critical points in an algebra of elementary embeddings II gave an algorithm for computing the 48-th classical Laver table A48 and with machine learning, I have personally gotten up to A768 (I think I have a couple of errors, but those are not too significant), and I have some of the computation of A1024.

Suppose now that 2N≤n≤3⋅2N. Then Dougherty has shown that one can easily recover the entire function ∗n from the restriction LN,n:{1,…,22N}×{1,…,2n}→{1,…,2n} where LN,n(x,y)=x∗ny for all x,y. Suppose now that 1≤x≤2n. Then let on(x) denote the least natural number m such that x∗n2m=2n. Then there is a number θn+1(x) called the threshold of x at An+1 such that 0≤θn+1(x)≤2on(x) and θn+1(2n)=0 and such that for every y with 1≤y≤2n, we have x∗n+1y=x∗ny whenever 1≤y≤θn+1(x) andx∗n+1y=(x∗ny)+2n for θn+1(x)<y≤2on(x). The classical Laver table An+1 can be easily recovered from the table An and the threshold function θn+1. To go beyond Dougherty's calculation of A48 to A768 and beyond, we exploit a couple of observations about the threshold function θn+1:

i: in most cases, θn+2(x)=θn+1(x), and

ii: in most cases, θn+1(x)=2i−2j for some i,j.

Let E(n)={x∈{1,…,2n−2}:θn−1(x)≠θn(x)}, and let E♯(n)=E(n)∩{1,…,2N−2} where N is the least natural number where n≤3⋅2N and Tn={(x,θn(x)):x∈E♯(n)}. Then one can easily compute An from An−1 and the data Tn by using Dougherty's algorithm. Now the only thing that we need to do is compute Tn in the first place. It turns out that computing Tn is quite easy except when n is of the form 2r+1 or 3⋅2r+1 for some r. In the case, when n=2r+1 or n=3⋅2r+1, we initially set Ttempn=∅. We then repeatedly modify Ttempn until we can no longer find any contradiction in the statement Tn=Ttempn. Computing Tn essentially amounts to finding new elements in the set E♯(n) and we find new elements in E♯(n) using some form of machine learning.

In my calculation of A768, I used operations like bitwise AND, OR, and the bitwise majority operation to combine old elements in E♯(n) to find new elements in E♯(n), and I also employed other techniques similar to this. I did not use any neural networks though, but using neural networks seems to be a reasonable strategy, but we need to use the right kinds of neural networks.

My idea is to use something similar to a CNN where the linear layers are not as general as possible, but the linear layers are structured in a way that reduces the number of weights and so that the structure of the linear layers is compatible with the structure in the data. The elements in E♯(n) belong to {1,…,2N} and {1,…,2N} has a natural tensor product structure. One can therefore consider E♯(n) to be a subset of (R2)⊗n. Furthermore, a tuple of r elements in E♯(n) can be considered as a subset of (R2))⊗n⊗Rr. This means that the linear layers from (R2))⊗n⊗Rr to (R2))⊗n⊗Rr should be Kronecker products A1⊗⋯⊗An+1 or Kronecker sums A1⊕⋯⊕An+1 (it seems like Kronecker sums with residual layers would work better than Kronecker products). Recall that the Kronecker product and Kronecker sum of matrices A,B are defined by setting (A⊗B)(u⊗v)=Au⊗Bv and A⊕B=(A⊗1)+(1⊗B) respectively. Perhaps neural networks can be used to generate new elements in E♯(n). These generative neural networks do not have to be too large nor do they have to be too accurate. A neural network that is 50 percent accurate will be better than a neural network that is 90 percent accurate but takes 10 times as long to compute and is much harder to retrain once most of the elements in E♯(n) have already been obtained and when the neural network has trouble finding new elements. I also favor smaller neural networks because one can compute A768 without using complicated techniques in the first place. I still need to figure out the rest of the details for the generative neural network that finds new elements of E♯(n) since I want it to be simple and easy to compute so that it is competitive with things like bitwise operations.

Problem: Translate Laver-like algebras into data (like sequences of vectors or matrices) into a form that machine learning models can work with.

Solution: The non-commutative polynomials p0,…,pn are already in a form that is easier for machine learning algorithms to use. I was able to apply an algorithm that I originally developed for analyzing block ciphers in order to turn the non-commutative polynomials p0,…,pn into unique sequences of real matrices A0,…,An (these real matrices do not seem to depend on the initialization since the gradient ascent always seems to converge to the same local maximum). One can then feed this sequence of real matrices A0,…,An into a transformer to solve whatever problem one wants to solve about Laver-like algebras. One can also represent a Laver-like algebra as a 2-dimensional image and then pass that image through a CNN to learn about the Laver-like algebra. This technique may only be used for Laver-like algebras that are small enough for CNNs to fully see, so I do not recommend CNNs for Laver-like algebras, but this is at least a proof-of-concept.

Problem: Estimate the number of small enough Laver-like algebras (up-to-critical equivalence) that satisfy certain properties.

Solution: The set of all multigenic Laver tables over an alphabet A forms a rooted tree. We can therefore apply the technique for estimating the number of nodes in a rooted tree described in the 1975 paper Estimating the Efficiency of Backtrack Programs by Donald Knuth. This estimator is unbiased (the mean of the estimated value is actually the number of nodes in the tree), but it will have a very high variance for estimating the number of multigenic Laver tables. In order to reduce the variance for this estimator, we need to assign better probabilities when finding a random path from the root to the leaves, and we should be able to use transformers to assign these probabilities.

I am pretty sure that there are plenty of other ways to use machine learning to investigate Laver-like algebras.

Now, nilpotent self-distributive algebras may be applicable in cryptography (they seem like suitable platforms for the Kalka-Teicher key exchange but nobody has researched this). Perhaps a transformer that learns about Laver-like algebras would do better on some unrelated task than a transformer that was never trained using Laver-like algebras. But I do not see any other practical applications of nilpotent Laver-like algebras at the moment. This means that Laver-like algebras are currently a relatively safe problem for investigating machine learning algorithms, so Laver-like algebras are perhaps applicable to AI safety since I currently do not see any way of misaligning an AI for solving a problem related to Laver-like algebras.