Joseph Van Name

# Comments

The intersection of machine learning and generalizations of Laver tables seems to have a lot of potential, so your question about this is extremely interesting to me.

Machine learning models from generalized Laver tables?

I have not been able to find any machine learning models that are based on generalized Laver tables that are used for solving problems unrelated to Laver tables. I currently do not directly see any new kinds of deep learning models that one can produce from generalizations of Laver tables, but I would not be surprised if there were a non-standard machine learning model from Laver like algebras that we could use for something like NLP.

But I have to be skeptical about applying machine learning to generalized Laver tables. Generalizations of Laver tables have intricate structure and complexity but this complexity is ordered. Furthermore, there is a very large supply of Laver-like algebras (even with 2 generators) for potentially any situation. Laver-like algebras are also very easy to compute. While backtracking may potentially take an exponential amount of time, when I have computed Laver-like algebras, the backtracking algorithm seems to always terminate quickly unless it is producing a large quantity or Laver-like algebras or larger Laver-like algebras. But with all this being said, Laver-like algebras seem to lack the steerability that we need to apply these algebras to solving real-world problems. For example, try interpreting (in a model theoretic sense) modular arithmetic modulo 13 in a Laver-like algebra. That is quite hard to do because Laver-like algebras are not built for that sort of thing.

Here are a couple of avenues that I see as most promising for producing new machine learning algorithms from Laver-like algebras and other structures.

Let  be a set, and let  be a binary operation on  that satisfies the self-distributivity identity . Define the right powers  for  inductively by setting  and . We say that  is a reduced nilpotent self-distributive algebra if  satisfies the identities  for  and if for all  there is an  with . A reduced Laver-like algebra is a reduced nilpotent self-distributive algebra  where if  for each , then  for some . Here we make the convention to put the implied parentheses on the left so that .

Reduced nilpotent self-distributive algebras have most of the things that one would expect. Reduced nilpotent self-distributive algebras are often equipped with a composition operation. Reduced nilpotent self-distributive algebras have a notion of a critical point, and if our reduced nilpotent self-distributive algebra is endowed with a composition operation, the set of all critical points in the algebra forms a Heyting algebra. If  is a self-distributive algebra, then we can define  and in the discrete topology (this limit always exists for nilpotent self-distributive algebras). We define  precisely when  and  precisely when . We can define  to be the equivalence class of  with respect to  and . The operations on  are and  whenever we have our composition operation . One can use computer calculations to add another critical point to a reduced nilpotent self-distributive algebra and obtain a new reduced nilpotent self-distributive algebra with the same number of generators but with another critical point on top (and this new self-distributive algebra will be sub-directly irreducible). Therefore, by taking subdirect products and adding new critical points, we have an endless supply of reduced nilpotent self-distributive algebras with only 2 generators. I also know how to expand reduced nilpotent self-distributive algebras horizontally in the sense that given a finite reduced nilpotent self-distributive algebra  that is not a Laver-like algebra, we can obtain another finite reduced nilpotent self-distributive algebra  where  are both generated by the same number of elements and these algebras both have the same implication algebra of critical points but  but there is a surjective homomorphism . The point is that we have techniques for producing new nilpotent self-distributive algebras from old ones, and going deep into those new nilpotent self-distributive algebras.

Since reduced nilpotent self-distributive algebras are closed under finite subdirect products, subalgebras, and quotients, and since we have techniques for producing many nilpotent self-distributive algebras, perhaps one can make a ML model from these algebraic structures. On the other hand, there seems to be less of a connection between reduced nilpotent self-distributive algebras and large cardinals and non-commutative polynomials than with Laver-like algebras, so perhaps it is sufficient to stick to Laver-like algebras as a platform for constructing ML models.

From Laver-like algebras, we can also produce non-commutative polynomials. If  is a Laver-like algebra, and  is a generating set for , then for each non-maximal critical point , we can define a non-commutative polynomial  to be the sum of all non-commutative monomials of the form  where  and  but where  for . These non-commutative polynomials capture all the information behind the Laver-like algebras since one can reconstruct the entire Laver-like algebra up to critical equivalence from these non-commutative polynomials. These non-commutative polynomials don't work as well for nilpotent self-distributive algebras; for nilpotent self-distributive algebras, these non-commutative polynomials will instead be rational expressions and one will not be able to recover the entire nilpotent self-distributive algebra from these rational expressions.

I have used these non-commutative polynomials to construct the infinite product formula

where the polynomials  are obtained from  different non-trivial rank-into-rank embeddings . This means that the non-commutative ring operations  in the rings of non-commutative polynomials are meaningful for Laver-like algebras.

If I wanted to interpret and make sense of a Laver-like algebra, I would use these non-commutative polynomials. Since non-commutative polynomials are closely related to strings (for NLP) and since the variables in these non-commutative polynomials can be substituted with matrices, I would not be surprised if one can turn Laver-like algebras into a full ML model using these non-commutative polynomials in order to solve a problem unrelated to Laver-like algebras.

Perhaps, one can also use strong large cardinal hypotheses and logic to produce ML models from Laver-like algebras. Since the existence of a non-trivial rank-into-rank embedding is among the strongest of all large cardinal hypotheses, and since one can produce useful theorems about natural looking finite algebraic structures from these large cardinal hypotheses, perhaps the interplay between the logic using large cardinal hypotheses and finite algebraic structures may be able to produce ML models? For example, one can automate the process of using elementarity to prove the existence of finite algebraic structures. After one has proven the existence of these structures using large cardinal hypotheses, one can do a search using backtracking in order to actually find candidates for these algebraic structures (or if the backtracking algorithm is exhaustive and turns up nothing, then we have a contradiction in the large cardinal hierarchy or a programming error. Mathematicians need to expend a lot of effort into finding an inconsistency in the large cardinal hierarchy this way because I am confident that they won't find such an inconsistency they will instead find plenty of near misses.). We can automate the process of producing examples of Laver-like algebras that satisfy conditions that were first established using large cardinal hypotheses, and we can perhaps completely describe these Laver-like algebras (up-to-critical equivalence) using our exhaustive backtracking process. This approach can be used to produce a lot of data about rank-into-rank embeddings, but do not see a clear path that allows us to apply this data outside of set theory and Laver-like algebras.

Using deep learning to investigate generalized Laver tables

We can certainly apply existing deep learning and machine learning techniques to classical and multigenic Laver tables.

The n-th classical Laver table is the unique self-distributive algebraic structure  where whenever . The classical Laver tables are up-to-isomorphism the only nilpotent self-distributive algebras generated by a single element.

Problem: Compute the -th classical Laver table for as large  as we can achieve.

Solution: Randall Dougherty in his 1995 paper Critical points in an algebra of elementary embeddings II gave an algorithm for computing the 48-th classical Laver table  and with machine learning, I have personally gotten up to  (I think I have a couple of errors, but those are not too significant), and I have some of the computation of .

Suppose now that . Then Dougherty has shown that one can easily recover the entire function  from the restriction  where  for all . Suppose now that . Then let  denote the least natural number  such that . Then there is a number  called the threshold of  at  such that  and  and such that for every  with , we have  whenever  and for . The classical Laver table  can be easily recovered from the table  and the threshold function . To go beyond Dougherty's calculation of  to  and beyond, we exploit a couple of observations about the threshold function :

i: in most cases, , and

ii: in most cases,  for some .

Let , and let  where  is the least natural number where  and . Then one can easily compute  from  and the data  by using Dougherty's algorithm. Now the only thing that we need to do is compute  in the first place. It turns out that computing  is quite easy except when  is of the form  or  for some . In the case, when  or , we initially set  We then repeatedly modify  until we can no longer find any contradiction in the statement . Computing  essentially amounts to finding new elements in the set  and we find new elements in  using some form of machine learning.

In my calculation of , I used operations like bitwise AND, OR, and the bitwise majority operation to combine old elements in  to find new elements in , and I also employed other techniques similar to this. I did not use any neural networks though, but using neural networks seems to be a reasonable strategy, but we need to use the right kinds of neural networks.

My idea is to use something similar to a CNN where the linear layers are not as general as possible, but the linear layers are structured in a way that reduces the number of weights and so that the structure of the linear layers is compatible with the structure in the data. The elements in  belong to  and  has a natural tensor product structure. One can therefore consider  to be a subset of . Furthermore, a tuple of  elements in  can be considered as a subset of . This means that the linear layers from  to  should be Kronecker products  or Kronecker sums  (it seems like Kronecker sums with residual layers would work better than Kronecker products). Recall that the Kronecker product and Kronecker sum of matrices  are defined by setting  and  respectively. Perhaps neural networks can be used to generate new elements in . These generative neural networks do not have to be too large nor do they have to be too accurate. A neural network that is 50 percent accurate will be better than a neural network that is 90 percent accurate but takes 10 times as long to compute and is much harder to retrain once most of the elements in  have already been obtained and when the neural network has trouble finding new elements. I also favor smaller neural networks because one can compute  without using complicated techniques in the first place. I still need to figure out the rest of the details for the generative neural network that finds new elements of  since I want it to be simple and easy to compute so that it is competitive with things like bitwise operations.

Problem: Translate Laver-like algebras into data (like sequences of vectors or matrices) into a form that machine learning models can work with.

Solution: The non-commutative polynomials  are already in a form that is easier for machine learning algorithms to use. I was able to apply an algorithm that I originally developed for analyzing block ciphers in order to turn the non-commutative polynomials  into unique sequences of real matrices  (these real matrices do not seem to depend on the initialization since the gradient ascent always seems to converge to the same local maximum). One can then feed this sequence of real matrices  into a transformer to solve whatever problem one wants to solve about Laver-like algebras. One can also represent a Laver-like algebra as a 2-dimensional image and then pass that image through a CNN to learn about the Laver-like algebra. This technique may only be used for Laver-like algebras that are small enough for CNNs to fully see, so I do not recommend CNNs for Laver-like algebras, but this is at least a proof-of-concept.

Problem: Estimate the number of small enough Laver-like algebras (up-to-critical equivalence) that satisfy certain properties.

Solution: The set of all multigenic Laver tables over an alphabet  forms a rooted tree. We can therefore apply the technique for estimating the number of nodes in a rooted tree described in the 1975 paper Estimating the Efficiency of Backtrack Programs by Donald Knuth. This estimator is unbiased (the mean of the estimated value is actually the number of nodes in the tree), but it will have a very high variance for estimating the number of multigenic Laver tables. In order to reduce the variance for this estimator, we need to assign better probabilities when finding a random path from the root to the leaves, and we should be able to use transformers to assign these probabilities.

I am pretty sure that there are plenty of other ways to use machine learning to investigate Laver-like algebras.

Now, nilpotent self-distributive algebras may be applicable in cryptography (they seem like suitable platforms for the Kalka-Teicher key exchange but nobody has researched this). Perhaps a transformer that learns about Laver-like algebras would do better on some unrelated task than a transformer that was never trained using Laver-like algebras. But I do not see any other practical applications of nilpotent Laver-like algebras at the moment. This means that Laver-like algebras are currently a relatively safe problem for investigating machine learning algorithms, so Laver-like algebras are perhaps applicable to AI safety since I currently do not see any way of misaligning an AI for solving a problem related to Laver-like algebras.

I think that all that happened here was the matrices  just ended up being diagonal matrices. This means that this is probably an uninteresting observation in this case, but I need to do more tests before commenting any further.

Suppose that  are natural numbers. Let . Let  be a complex number whenever . Let  be the fitness function defined by letting . Here,  denotes the spectral radius of a matrix  while  denotes the Schatten -norm of .

Now suppose that  is a tuple that maximizes . Let  be the fitness function defined by letting . Then suppose that  is a tuple that maximizes . Then we will likely be able to find an  and a non-zero complex number  where

In this case,  represents the training data while the matrices  is our learned machine learning model. In this case, we are able to recover some original data values from the learned machine learning model  without any distortion to the data values.

I have just made this observation, so I am still exploring the implications of this observation. But this is an example of how mathematical spectral machine learning algorithms can behave, and more mathematical machine learning models are more likely to be interpretable and they are more likely to have a robust mathematical/empirical theory behind them.

I forgot to mention another source of difficulty in getting the energy efficiency of the computation down to Landauer's limit at the CMB temperature.

Recall that the Stefan Boltzmann equation states that the power being emitted from an object by thermal radiation is equal to . Here,  stands for power,  is the surface area of the object,  is the emissivity of the object ( is a real number with ), is the temperature, and  is the Stefan-Boltzmann constant. Here,

Suppose therefore that we want a Dyson sphere with radius  that maintains a temperature of 4 K which is slightly above the CMB temperature. To simplify the calculations, I am going to ignore the energy that the Dyson sphere receives from the CMB so that I obtain a lower bound for the size of our Dyson sphere. Let us assume that our Dyson sphere is a perfect emitter of thermal radiation so that

Earth's surface has a temperature of about . In order to have a temperature of , our Dyson sphere needs to receive  the energy per unit of area. This means that the Dyson sphere needs to have a radius of about  astronomical units (recall that the distance from Earth to the sun is 1 astronomical unit).

Let us do more precise calculations to get a more exact radius of our Dyson sphere.

, so  which is about 15 percent of a light-year. Since the nearest star is 4 light years away, by the time that we are able to construct a Dyson sphere with a radius that is about 15 percent of a light year, I think that we will be able to harness energy from other stars such as Alpha Centauri.

The fourth power in the Stefan Boltzmann equation makes it hard for cold objects to radiate heat.

This post uses the highly questionable assumption that we will be able to produce a Dyson sphere that can maintain a temperature at the level of the cosmic microwave background before we will be able to use energy efficient reversible computation to perform operations that cost much less than  energy. And this post also makes the assumption that we will achieve computation at the level of about  per bit deletion before we will be able to achieve reversible computation. And it gets difficult to overcome thermal noise at an energy level well above  regardless of the type of hardware that one uses. At best, this post is an approximation for the computational power of a Dyson sphere that may be off by some orders of magnitude.

Let $$X,Y$$ be topological spaces. Then a function $$f:X\rightarrow Y$$ is continuous if and only if whenever $$(x_d)_{d\in D}$$ is a net that converges to the point $$x$$, the net $$(f(x_d))_{d\in D}$$ also converges to the point $$f(x)$$. This is not very hard to prove. This means that we do not have to discuss as to whether continuity should be defined in terms of open sets instead of limits because both notions apply to all topological spaces. If anything, one should define continuity in terms of closed sets instead of open sets since closed generalize slightly better to objects known as closure systems (which are like topological spaces, but we do not require the union of two closed sets to be closed). For example, the collection of all subgroups of a group is a closure system, but the complements of the subgroups of a group have little importance, so if we want the definition that makes sense in the most general context, closed sets behave better than open sets. And as a bonus, the definition of continuity works well when we are taking the inverse image of closed sets and when we are taking the closure of the image of a set.

With that being said, the good thing about continuity is that it has enough characterizations so that at least one of these characterizations is satisfying (and general topology texts should give all of these characterizations even in the context of closure systems so that the reader can obtain such satisfaction with the characterization of his or her choosing).

I have heard of filters and ultrafilters, but I have never heard of anyone calling any sort of filter a hyperfilter. Perhaps it is because the ultrafilters are used to make fields of hyperreal numbers, so we can blame this on the terminology. Similarly, the uniform spaces where the hyperspace is complete are called supercomplete instead of hypercomplete.

But the reason why we need to use a filter instead of a collection of sets is that we need to obtain an equivalence relation.

Suppose that  is an index set and  is a set with  for . Then let  be a collection of subsets of . Define a relation  on  by setting    if and only if . Then in order for  to be an equivalence relation,  must be reflexive, symmetric, and transitive. Observe that  is always symmetric, and  is reflexive precisely when .

Proposition: The relation  is transitive if and only if  is a filter.

Proof:

Suppose that  is a filter. Then whenever , we have

, so since

, we conclude that   as well. Therefore, .

. Suppose now that . Then let let  where  denotes the characteristic function. Then  and . Therefore,, so by transitivity,  as well, hence .

Suppose now that  and . Let  and set .

Observe that  and . Therefore, . Thus, by transitivity, we know that . Therefore, . We conclude that  is closed under taking supersets. Therefore,  is a filter.

Q.E.D.

Yes. We have 2=[(2,2,2,...)]. But we can compare 2 with (1,3,1,3,1,3,...) since (1,3,1,3,1,3,1,3,...)=1 (this happens when the set of all even natural numbers is in your ultrafilter) or (1,3,1,3,1,3,1,3,...)=3 (this happens when the set of all odd natural numbers is in your ultrafilter). Your partially ordered set is actually a linear ordering because whenever we have two sequences , one of the sets

is in your ultrafilter (you can think of an ultrafilter as a thing that selects one block out of every partition of the natural numbers into finitely many pieces), and if your ultrafilter contains

, then .

I trained a (plain) neural network on a couple of occasions to predict the output of the function  where  are bits and  denotes the XOR operation. The neural network was hopelessly confused despite the fact that neural networks usually do not have any trouble memorizing large quantities of random information. This time the neural network could not even memorize the truth table for XOR. While the operation  is linear over the field , it is quite non-linear over . The inability for a simple neural network to learn this function indicates that neural networks are better at learning when they are not required to stray too far away from linearity.

Load More