The intersection of machine learning and generalizations of Laver tables seems to have a lot of potential, so your question about this is extremely interesting to me.
Machine learning models from generalized Laver tables?
I have not been able to find any machine learning models that are based on generalized Laver tables and that are used for solving problems unrelated to Laver tables. I currently do not directly see any new kinds of deep learning models that one can produce from generalizations of Laver tables, but I would not be surprised if there were a non-standard machine learning model based on Laver-like algebras that we could use for something like NLP.
But I have to be skeptical about applying machine learning to generalized Laver tables. Generalizations of Laver tables have intricate structure and complexity, but this complexity is ordered. Furthermore, there is a very large supply of Laver-like algebras (even with 2 generators) for potentially any situation. Laver-like algebras are also very easy to compute. While backtracking may potentially take an exponential amount of time, when I have computed Laver-like algebras, the backtracking algorithm seems to always terminate quickly unless it is producing a large quantity of Laver-like algebras or larger Laver-like algebras. But with all this being said, Laver-like algebras seem to lack the steerability that we need to apply these algebras to solving real-world problems. For example, try interpreting (in a model theoretic sense) modular arithmetic modulo 13 in a Laver-like algebra. That is quite hard to do because Laver-like algebras are not built for that sort of thing.
Here are a couple of avenues that I see as most promising for producing new machine learning algorithms from Laver-like algebras and other structures.
Let $X$ be a set, and let $*$ be a binary operation on $X$ that satisfies the self-distributivity identity $x*(y*z)=(x*y)*(x*z)$. Define the right powers $x^{[n]}$ for $n\geq 1$ inductively by setting $x^{[1]}=x$ and $x^{[n+1]}=x*x^{[n]}$. We say that $(X,*)$ is a reduced nilpotent self-distributive algebra if it satisfies the appropriate reducedness identities and if for all $x,y\in X$ there is an $n$ with $x^{[n]}*y=y$. A reduced Laver-like algebra is a reduced nilpotent self-distributive algebra where, if $x_n\in X$ for each $n\in\omega$, then $x_1*\dots*x_n$ acts as a left identity for some $n$. Here we make the convention to put the implied parentheses on the left so that $x_1*\dots*x_n=(\dots((x_1*x_2)*x_3)\dots)*x_n$.
Reduced nilpotent self-distributive algebras have most of the features that one would expect. They are often equipped with a composition operation, and they have a notion of a critical point; if our reduced nilpotent self-distributive algebra is endowed with a composition operation, the set of all critical points in the algebra forms a Heyting algebra. If $(X,*)$ is a self-distributive algebra, then we can take limits of repeated applications of $*$ in the discrete topology (these limits always exist for nilpotent self-distributive algebras), and these limits give a way of comparing critical points: we write $\mathrm{crit}(x)\leq\mathrm{crit}(y)$ for this comparison and $\mathrm{crit}(x)=\mathrm{crit}(y)$ when the comparison holds in both directions. We can then define $\mathrm{crit}(x)$ to be the equivalence class of $x$ with respect to this equivalence, and the operations on the set of critical points are the ones induced by $*$ and, whenever we have our composition operation, by $\circ$. One can use computer calculations to add another critical point to a reduced nilpotent self-distributive algebra and obtain a new reduced nilpotent self-distributive algebra with the same number of generators but with another critical point on top (and this new self-distributive algebra will be subdirectly irreducible). Therefore, by taking subdirect products and adding new critical points, we have an endless supply of reduced nilpotent self-distributive algebras with only 2 generators. I also know how to expand reduced nilpotent self-distributive algebras horizontally: given a finite reduced nilpotent self-distributive algebra $X$ that is not a Laver-like algebra, we can obtain another finite reduced nilpotent self-distributive algebra $Y$ where $X$ and $Y$ are generated by the same number of elements and have the same implication algebra of critical points, but where there is a surjective homomorphism from $Y$ onto $X$. The point is that we have techniques for producing new nilpotent self-distributive algebras from old ones and for exploring those new algebras in depth.
Since reduced nilpotent self-distributive algebras are closed under finite subdirect products, subalgebras, and quotients, and since we have techniques for producing many nilpotent self-distributive algebras, perhaps one can make an ML model from these algebraic structures. On the other hand, there seems to be less of a connection between reduced nilpotent self-distributive algebras and large cardinals and non-commutative polynomials than there is with Laver-like algebras, so perhaps it is sufficient to stick to Laver-like algebras as a platform for constructing ML models.
From Laver-like algebras, we can also produce non-commutative polynomials. If $X$ is a Laver-like algebra and $\{a_1,\dots,a_r\}$ is a generating set for $X$, then for each non-maximal critical point $\gamma$ we can define a non-commutative polynomial $p_\gamma(x_1,\dots,x_r)$ to be the sum of all non-commutative monomials of the form $x_{i_1}\cdots x_{i_n}$ where the product $a_{i_1}*\dots*a_{i_n}$ has critical point above $\gamma$ but where no proper initial product $a_{i_1}*\dots*a_{i_m}$ with $m<n$ does. These non-commutative polynomials capture all the information behind the Laver-like algebras since one can reconstruct the entire Laver-like algebra up to critical equivalence from them. These non-commutative polynomials don't work as well for nilpotent self-distributive algebras; for nilpotent self-distributive algebras one instead obtains rational expressions, and one cannot recover the entire nilpotent self-distributive algebra from those rational expressions.
I have used these non-commutative polynomials to construct an infinite product formula in which the factors are the polynomials obtained from different non-trivial rank-into-rank embeddings $j:V_\lambda\rightarrow V_\lambda$. This means that the non-commutative ring operations in the rings of non-commutative polynomials are meaningful for Laver-like algebras.
If I wanted to interpret and make sense of a Laver-like algebra, I would use these non-commutative polynomials. Since non-commutative polynomials are closely related to strings (for NLP) and since the variables in these non-commutative polynomials can be substituted with matrices, I would not be surprised if one can turn Laver-like algebras into a full ML model using these non-commutative polynomials in order to solve a problem unrelated to Laver-like algebras.
Perhaps one can also use strong large cardinal hypotheses and logic to produce ML models from Laver-like algebras. Since the existence of a non-trivial rank-into-rank embedding is among the strongest of all large cardinal hypotheses, and since one can produce useful theorems about natural-looking finite algebraic structures from these large cardinal hypotheses, perhaps the interplay between logic using large cardinal hypotheses and finite algebraic structures may be able to produce ML models. For example, one can automate the process of using elementarity to prove the existence of finite algebraic structures. After one has proven the existence of these structures using large cardinal hypotheses, one can do a search using backtracking in order to actually find candidates for these algebraic structures (and if the backtracking algorithm is exhaustive and turns up nothing, then we have either a contradiction in the large cardinal hierarchy or a programming error; mathematicians would need to expend a lot of effort to find an inconsistency in the large cardinal hierarchy this way, and I am confident that they won't find such an inconsistency but will instead find plenty of near misses). We can automate the process of producing examples of Laver-like algebras that satisfy conditions that were first established using large cardinal hypotheses, and we can perhaps completely describe these Laver-like algebras (up to critical equivalence) using our exhaustive backtracking process. This approach can be used to produce a lot of data about rank-into-rank embeddings, but I do not see a clear path that allows us to apply this data outside of set theory and Laver-like algebras.
Using deep learning to investigate generalized Laver tables
We can certainly apply existing deep learning and machine learning techniques to classical and multigenic Laver tables.
The $n$-th classical Laver table is the unique self-distributive algebraic structure $A_n=(\{1,\dots,2^n\},*_n)$ where $x*_n 1=x+1\bmod 2^n$ whenever $x\in\{1,\dots,2^n\}$. The classical Laver tables are, up to isomorphism, the only nilpotent self-distributive algebras generated by a single element.
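As a concrete illustration, here is a minimal sketch that computes $A_n$ directly from the defining identities: the row of $2^n$ is the identity row, and self-distributivity gives $x*(y+1)=(x*y)*(x+1)$, so the remaining rows can be filled from the bottom up.

```python
def laver_table(n):
    """Compute the classical Laver table A_n on {1, ..., 2**n}."""
    size = 2 ** n
    table = [[0] * (size + 1) for _ in range(size + 1)]  # 1-indexed
    table[size] = list(range(size + 1))                  # 2**n acts as the identity
    for x in range(size - 1, 0, -1):
        table[x][1] = x + 1                              # x * 1 = x + 1
        for y in range(1, size):
            # Self-distributivity gives x * (y + 1) = (x * y) * (x + 1),
            # and the row x * y (> x) has already been filled in.
            table[x][y + 1] = table[table[x][y]][x + 1]
    return table

print(laver_table(3)[1][1:])   # row 1 of A_3: [2, 4, 6, 8, 2, 4, 6, 8]
```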
Problem: Compute the $n$-th classical Laver table $A_n$ for $n$ as large as we can achieve.
Solution: Randall Dougherty, in his 1995 paper Critical points in an algebra of elementary embeddings II, gave an algorithm for computing the 48th classical Laver table $A_{48}$. With machine learning, I have personally gotten further than that (I think I have a couple of errors, but those are not too significant), and I have part of the computation of a still larger table.
Dougherty has shown that one can easily recover the entire operation $*_n$ from its restriction to a much smaller set together with a threshold function: for suitable inputs there is a number called the threshold, which records where the behavior of the corresponding row changes, and the classical Laver table $A_n$ can be easily recovered from a smaller table and this threshold function. To go beyond Dougherty's calculation of $A_{48}$, we exploit a couple of observations about the threshold function:
i: in most cases, the threshold takes its default value, and
ii: in most cases, the threshold takes one of a small number of special forms.
Given these observations, one can easily compute the table from the threshold data by using Dougherty's algorithm, so the only thing that we need to do is compute the threshold data in the first place. It turns out that this is quite easy except for inputs of a couple of special forms. In those exceptional cases, we start from an initial guess and then repeatedly modify it until we can no longer find any contradiction. Computing the threshold data essentially amounts to finding new elements in a certain set, and we find new elements of that set using some form of machine learning.
In my calculations, I used operations like bitwise AND, OR, and the bitwise majority operation to combine old elements of this set in order to find new elements, and I also employed other techniques along these lines. I did not use any neural networks, but using neural networks seems to be a reasonable strategy, provided we use the right kinds of neural networks.
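As a small illustration of this recombination step (candidate generation only; verifying a candidate against the actual table data is a separate step that is omitted here), with elements stored as Python integers treated as bit strings:

```python
def bitwise_majority(a, b, c):
    # Each output bit is 1 exactly when at least two of the input bits are 1.
    return (a & b) | (a & c) | (b & c)

def propose_candidates(known):
    # `known` is a list of integers already verified to be in the set;
    # cheap bitwise recombinations propose new candidates to verify.
    a, b, c = known[0], known[1], known[2]
    return {a & b, a | b, bitwise_majority(a, b, c)}
```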
My idea is to use something similar to a CNN where the linear layers are not as general as possible, but are instead structured in a way that reduces the number of weights and is compatible with the structure of the data. The elements we are searching for are bit strings of length $2^n$, and the ambient space $\mathbb{R}^{2^n}\cong(\mathbb{R}^2)^{\otimes n}$ has a natural tensor product structure; a tuple of such elements likewise lives in a tensor product of these spaces. This means that the linear layers should be Kronecker products or Kronecker sums (it seems like Kronecker sums with residual layers would work better than Kronecker products). Recall that the Kronecker product and Kronecker sum of matrices are defined by setting $A\otimes B=(a_{i,j}B)_{i,j}$ and $A\oplus B=A\otimes I+I\otimes B$ respectively. Perhaps neural networks can be used to generate new elements of the set. These generative neural networks do not have to be too large, nor do they have to be too accurate. A neural network that is 50 percent accurate will be better than a neural network that is 90 percent accurate but takes 10 times as long to compute and is much harder to retrain once most of the elements have already been obtained and the network has trouble finding new ones. I also favor smaller neural networks because one can already perform these computations without complicated techniques in the first place. I still need to figure out the rest of the details for the generative neural network that finds new elements, since I want it to be simple and cheap enough to be competitive with things like bitwise operations.
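Here is a minimal sketch of a Kronecker-sum linear layer with a residual connection, in the spirit described above; the factor sizes, the initialization, and the placement of the residual connection are my own choices rather than anything specified in the text.

```python
import torch

class KroneckerSumLayer(torch.nn.Module):
    """Linear layer whose weight is the Kronecker sum A ⊗ I + I ⊗ B."""
    def __init__(self, m, n):
        super().__init__()
        self.A = torch.nn.Parameter(torch.randn(m, m) / m ** 0.5)
        self.B = torch.nn.Parameter(torch.randn(n, n) / n ** 0.5)

    def forward(self, x):
        # x has shape (batch, m * n); reshape so that A acts on the first
        # tensor factor and B acts on the second.
        batch = x.shape[0]
        X = x.view(batch, self.A.shape[0], self.B.shape[0])
        out = torch.einsum('ij,bjk->bik', self.A, X) \
            + torch.einsum('kl,bjl->bjk', self.B, X)
        return x + out.reshape(batch, -1)        # residual connection

layer = KroneckerSumLayer(4, 8)
y = layer(torch.randn(2, 32))                    # two vectors of length 2**5
```

The point of the constraint is the parameter count: a dense layer on a length-$2^n$ vector has $4^n$ weights, while a Kronecker-sum layer built from factors of sizes $m$ and $2^n/m$ has only $m^2+(2^n/m)^2$.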
Problem: Translate Laver-like algebras into data (like sequences of vectors or matrices) that machine learning models can work with.
Solution: The non-commutative polynomials are already in a form that is easier for machine learning algorithms to use. I was able to apply an algorithm that I originally developed for analyzing block ciphers in order to turn the non-commutative polynomials into unique sequences of real matrices (these real matrices do not seem to depend on the initialization since the gradient ascent always seems to converge to the same local maximum). One can then feed this sequence of real matrices into a transformer to solve whatever problem one wants to solve about Laver-like algebras. One can also represent a Laver-like algebra as a 2-dimensional image and then pass that image through a CNN to learn about the Laver-like algebra. This technique may only be used for Laver-like algebras that are small enough for CNNs to fully see, so I do not recommend CNNs for Laver-like algebras, but this is at least a proof-of-concept.
Problem: Estimate the number of small enough Laver-like algebras (up-to-critical equivalence) that satisfy certain properties.
Solution: The set of all multigenic Laver tables over an alphabet $A$ forms a rooted tree. We can therefore apply the technique for estimating the number of nodes in a rooted tree described in the 1975 paper Estimating the Efficiency of Backtrack Programs by Donald Knuth. This estimator is unbiased (the mean of the estimated value is the actual number of nodes in the tree), but it will have a very high variance when estimating the number of multigenic Laver tables. In order to reduce the variance of this estimator, we need to assign better probabilities when choosing a random path from the root to a leaf, and we should be able to use transformers to assign these probabilities.
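Here is a short sketch of Knuth's estimator with uniform child probabilities; `children` is a stand-in for whatever enumerates the children of a node in the tree of multigenic Laver tables (the toy example below just uses a complete binary tree).

```python
import random

def knuth_estimate(root, children):
    # Walk one random root-to-leaf path; the product of the branching
    # factors seen so far is an unbiased estimate of the number of nodes
    # at the corresponding depth.
    estimate, factor, node = 1, 1, root
    while True:
        kids = children(node)
        if not kids:
            return estimate
        factor *= len(kids)            # d children contribute a factor of d
        estimate += factor
        node = random.choice(kids)     # uniform choice; a better proposal
                                       # distribution reduces the variance

def average_estimate(root, children, runs=10000):
    return sum(knuth_estimate(root, children) for _ in range(runs)) / runs

# Toy example: a complete binary tree of depth 10 has 2**11 - 1 = 2047 nodes.
children = lambda node: [] if len(node) >= 10 else [node + '0', node + '1']
print(average_estimate('', children))
```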
I am pretty sure that there are plenty of other ways to use machine learning to investigate Laver-like algebras.
Now, nilpotent self-distributive algebras may be applicable in cryptography (they seem like suitable platforms for the Kalka-Teicher key exchange, but nobody has researched this). Perhaps a transformer that learns about Laver-like algebras would do better on some unrelated task than a transformer that was never trained using Laver-like algebras. But I do not see any other practical applications of nilpotent Laver-like algebras at the moment. This means that Laver-like algebras are currently a relatively safe problem domain for investigating machine learning algorithms, so Laver-like algebras are perhaps applicable to AI safety, since I currently do not see any way of misaligning an AI by having it solve problems related to Laver-like algebras.
I had not heard about the IBM paper until now. This is inspired by my personal experiments training (obviously classical) machine learning models.
Suppose that $V_0,\dots,V_n$ are real or complex finite dimensional inner product spaces. Suppose that the training data consists of tuples of input vectors together with a label. For each layer we choose a bilinear mapping between the appropriate spaces, and the model is obtained by composing these bilinear mappings; roughly, each layer combines the output of the previous layer with the next input vector through a bilinear map. In other words, my machine learning models are just compositions of bilinear mappings. In addition to wanting to approximate the label, we also include regularization that makes the machine learning model pseudodeterministically trained, so that if we train it twice with different initializations, we end up with the same trained model. Here the machine learning model has several layers, but the addition of extra layers gives us diminishing returns since bilinearity is close to linearity, so I still want to figure out how to improve the performance of such a machine learning model to match deep neural networks (if that is even feasible).
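Here is a minimal sketch of one way to realize "compositions of bilinear mappings". The layer widths and the way the inputs are fed in are my assumptions rather than the exact construction described above, and the regularization that produces pseudodeterminism is omitted.

```python
import torch

class BilinearChain(torch.nn.Module):
    def __init__(self, dims):
        super().__init__()
        # dims = [d0, d1, ..., dn]; layer j is a bilinear map
        # V_{j-1} x V_j -> V_j implemented with torch.nn.Bilinear.
        self.layers = torch.nn.ModuleList(
            torch.nn.Bilinear(dims[j - 1], dims[j], dims[j], bias=False)
            for j in range(1, len(dims))
        )

    def forward(self, xs):
        # xs is a list of input vectors, one per space V_0, ..., V_n.
        state = xs[0]
        for layer, x in zip(self.layers, xs[1:]):
            state = layer(state, x)
        return state

model = BilinearChain([5, 4, 3])
out = model([torch.randn(2, 5), torch.randn(2, 4), torch.randn(2, 3)])
```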
I use quantum information theory for my experiments mainly because quantum information theory behaves well, unlike neural networks.
Proof-of-stake is still wasteful since it promotes pump-and-dump scams and causes people to waste their money on scam projects. If the creators are able to get their reward at the very beginning of a project, they will be more interested in short-term gains than in a long-term token that will last. Humans are not psychologically or socially equipped to invest in proof-of-stake cryptocurrencies since they tend to get scammed.
Bitcoin mining is a real-world example of a goal that people spend an enormous amount of resources to attain, but this goal is useless or at least horribly inefficient.
Recall that the orthogonality thesis states that it is possible for an intelligent entity to have bad or dumb goals and that it is also possible for a not-so-intelligent entity to have good goals. I would therefore consider Bitcoin mining to be a prominent real-world example of the orthogonality thesis, as it is in a sense a dumb goal attained intelligently (though this example is imperfect).
Bitcoin's mining algorithm consists of computing many SHA-256 hashes relentlessly. The Bitcoin miners are rewarded whenever they compute a suitable SHA-256 hash that is lower than the target. These SHA-256 hashes establish decentralized consensus about the state of the blockchain, and they distribute newly minted bitcoins. But besides this, computing so many SHA-256 hashes is nearly useless. Computing so many SHA-256 hashes consumes large quantities of energy and creates electronic waste.
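For concreteness, here is a toy version of the proof-of-work check. This is only an illustration: real Bitcoin block headers have a fixed 80-byte format and encode the target in compact form, while here the "header" is an arbitrary byte string and the target is chosen so the search finishes quickly.

```python
import hashlib

def double_sha256(data: bytes) -> int:
    return int.from_bytes(
        hashlib.sha256(hashlib.sha256(data).digest()).digest(), "big")

def mine(header: bytes, target: int, max_nonce: int = 10**7):
    # Try nonces until the double SHA-256 of header+nonce falls below target.
    for nonce in range(max_nonce):
        if double_sha256(header + nonce.to_bytes(8, "big")) < target:
            return nonce
    return None

print(mine(b"toy block header", target=2**236))   # ~one in a million hashes succeeds
```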
So what are some of the possible alternatives to Bitcoin mining? It seems like the best alternative that does not significantly change the nature of Bitcoin mining would be to replace SHA-256 mining with some other mining algorithm that serves some scientific purpose.
This is more difficult than it seems because Bitcoin mining must satisfy a list of cryptographic properties. If the mining algorithm did not satisfy these cryptographic properties, then it might not be feasible for newly minted bitcoins to be dispersed every 10 minutes, and we might enter a scenario where a single entity with a secret algorithm or slightly faster hardware could put all the blocks on the blockchain.
Since Bitcoin mining must satisfy a list of cryptographic properties, it is difficult to come up with a more scientifically useful mining algorithm that satisfies these properties. But in science, a difficult problem is a reason to do research, not to give up. While finding a useful cryptocurrency mining algorithm has its challenges, cryptocurrency mining algorithms are easy to produce in general, since they can be made from cryptographic hash functions without requiring public-key encryption or other advanced cryptographic primitives, so difficulty seems more like an excuse than a legitimate reason not to investigate useful cryptocurrency mining algorithms. The cryptocurrency sector does not want to perform this research. I can think of several reasons why people refuse to support this sort of endeavor despite the great effort that people put into Bitcoin mining, but none of these reasons justify the lack of interest in useful cryptocurrency mining.
The diminishing quality of cryptocurrency users:
It seems like when altcoins were first being developed around 2014, people were much more interested in developing scientifically useful mining algorithms. But around 2017 when cryptocurrency really started to become popular, people simply wanted to make money from cryptocurrencies, yet they were not very interested in understanding how cryptocurrencies work or how to improve them.
Mining algorithms with questionable scientific use:
Some cryptocurrencies and proposals such as Primecoin and Gapcoin have more scientific mining algorithms, but these mining algorithms still have questionable usefulness. For example, the objective in Primecoin mining is to find suitable Cunningham chains. A Cunningham chain of the first kind is a sequence $p_1,\dots,p_k$ of prime numbers where $p_{i+1}=2p_i+1$ whenever $1\leq i<k$. The most interesting thing about Cunningham chains is that they can be used in cryptocurrency mining algorithms, but they are otherwise of minor importance to mathematics.
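For example, here is a small self-contained search for Cunningham chains of the first kind; trial division is perfectly adequate at this scale.

```python
def is_prime(n):
    if n < 2:
        return False
    d = 2
    while d * d <= n:
        if n % d == 0:
            return False
        d += 1
    return True

def cunningham_chains(bound, length):
    # Chains p, 2p+1, 4p+3, ... of primes, of length >= `length`,
    # with starting prime below `bound`.
    chains = []
    for p in range(2, bound):
        if not is_prime(p):
            continue
        chain, q = [p], p
        while is_prime(2 * q + 1):
            q = 2 * q + 1
            chain.append(q)
        if len(chain) >= length:
            chains.append(chain)
    return chains

print(cunningham_chains(100, 5))   # the chains starting at 2 and at 89
```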
These questionable mining algorithms are supposed to steer the cryptocurrency community into a more scientific direction, but in reality, they have just steered the cryptocurrency community towards using mining to perform mathematical calculations that not even mathematicians care that much about.
Alternative solutions to the energy waste problem:
Many people just want to do away with cryptocurrency mining in an altcoin by replacing it with proof-of-stake or some other consensus mechanism. This solution is attractive to the cryptocurrency creators since they want complete control over all the coins at the beginning of the project, and they just use the energy usage of cryptocurrency as a marketing strategy to get people interested in their project. But this solution should not be appealing to anyone who wants to use the cryptocurrency even if a cryptocurrency is better funded without much mining (of course, if mining is replaced with another consensus mechanism after all the coins have been created, then this objection does not stand). After all, Satoshi Nakamoto did not fund Bitcoin by selling bitcoins. There are other ways to fund a cryptocurrency project without alternate consensus mechanisms.
Hostility against cryptocurrency technologies:
It seems like many members of society are hostile toward cryptocurrency technologies and hate people who own or are in any way interested in cryptocurrency. This sort of hostility is a very good reason to conduct as many transactions as possible using cryptocurrency alone, since I do not want to deal with all of those Karens. But this hostility may have turned people away from researching useful cryptocurrency mining algorithms, even though the usefulness would probably not benefit the cryptocurrency directly.
Hardcore Bitcoiners:
If Bitcoin mining were magically replaced with a useful mining algorithm, barely anything about Bitcoin would change. But in my experience, Bitcoiners do not see it this way. They are so stuck in their ways that they reject all altcoins.
Conclusion:
While cryptocurrencies have a lot of monetary value, they are not exactly powerhouses of innovation, nor do I find them extremely interesting on their own. But a good scientific mining algorithm would make them much more innovative and interesting.
In this post, we shall go over a way to produce mostly linear machine learning classification models that output probabilities for each possible label. These mostly linear models are pseudodeterministically trained (or pseudodeterministic for short) in the sense that if we train them multiple times with different initializations, we will typically get the same trained model (up to symmetry and minuscule floating point differences).
The algorithms that I am mentioning in this post generalize to more complicated multi-layered algorithms in the sense that the multi-layered algorithms remain pseudodeterministic, but for simplicity, we shall stick to just linear operators here.
Let $K$ denote either the field of real numbers, the field of complex numbers, or the division ring of quaternions. Let $V$ be a finite dimensional inner product space over $K$. The training data is a set of pairs $(x,y)$ where $x\in V$ is the machine learning model input and $y\in\{1,\dots,N\}$ is the label. The machine learning model is trained to predict the label $y$ when given the input $x$. The trained model is a function that maps $V$ to the set of all probability vectors of length $N$, so the trained model actually gives the probabilities for each possible label.
Suppose that $V_j$ is a finite dimensional inner product space over $K$ for each label $j\in\{1,\dots,N\}$. Then the domain of the fitness function consists of tuples $(A_1,\dots,A_N)$ where each $A_j$ is a linear operator from $V$ to $V_j$. Fix parameters $p$ and $\lambda$; the parameter $p$ is the exponent while $\lambda$ is the regularization parameter. From the training data we define (almost total) functions that measure how well such a tuple separates the labeled inputs.
Here, $\|A\|_p$ denotes the Schatten $p$-norm, which can be defined by setting $\|A\|_p=\big(\operatorname{Tr}((A^{\ast}A)^{p/2})\big)^{1/p}$; equivalently, it is the $\ell^p$-norm of the singular values of $A$.
Our fitness function is then the data-fit term together with a Schatten-norm regularization term weighted by $\lambda$. The data-fit term is what we really want to maximize, but unfortunately it is typically non-pseudodeterministic on its own, so we need to add the regularization term to obtain pseudodeterminism. The regularization term also has the added effect of making the norms $\|A_jx\|$ relatively large compared to $\|x\|$ for training data points $x$. This may be useful in determining whether a pair should belong to the training data or the test data in the first place.
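For reference, the Schatten $p$-norm appearing in the regularization term can be computed from singular values; the snippet below only illustrates the norm itself, not the full fitness function.

```python
import numpy as np

def schatten_norm(A, p):
    # The Schatten p-norm is the l^p norm of the singular values of A.
    return np.linalg.norm(np.linalg.svd(A, compute_uv=False), ord=p)

A = np.random.randn(5, 3)
print(schatten_norm(A, 1))                        # nuclear / trace norm
print(np.isclose(schatten_norm(A, 2),             # p = 2 is the Frobenius norm
                 np.linalg.norm(A, 'fro')))
```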
We observe that the fitness function is homogeneous in the sense that rescaling every operator in the tuple by a common non-zero scalar rescales the fitness by a corresponding fixed power of that scalar's absolute value (in the quaternionic case, the scalars are just the real numbers).
Suppose now that we have obtained a tuple $(A_1,\dots,A_N)$ that maximizes the fitness. Let $\Delta_N$ denote the set of all probability vectors of length $N$. Then from this tuple we define an almost total function from $V$ to $\Delta_N$: if $(x,y)$ belongs to the training data set, then the $j$-th entry of the output at $x$ is the machine learning model's estimate of the probability that $y=j$. I will let the reader justify this calculation of the probabilities.
We can generalize this construction to pseudodeterministically trained machine learning models with multiple layers by replacing the linear operators with non-linear or multi-linear operators. Actually, there are quite a few ways of generalizing the fitness function, and I have taken some liberty in its exact formulation here.
In addition to being pseudodeterministic, the fitness function has other notable desirable properties. For example, when maximizing it using gradient ascent, one tends to converge to the local maximum at an exponential rate without needing to decay the learning rate.
Whether one should take a cold shower depends on a lot of factors, including whether one exercises, one's health, one's personal preferences, the air temperature, the cold water temperature, the humidity level, and the hardness of the shower water. But it seems like most people can't fathom taking a cold shower simply because they are cold intolerant, even though cold showers have many benefits.
In addition to the practical benefits of cold showers, cold showers may also offer health benefits.
Cold showers could improve one's immune system (though we should be cautious about how strong the evidence is).
The Effect of Cold Showering on Health and Work: A Randomized Controlled Trial - PMC
Cold showers may boost mood or alleviate depression.
Scientific Evidence-Based Effects of Hydrotherapy on Various Systems of the Body - PMC
Adapted cold shower as a potential treatment for depression - ScienceDirect
Cold showers could also improve circulation and metabolism.
Cold showers also offer other benefits.
I always use the exhaust fan. It is never powerful enough to reduce the humidity faster than a warm shower increases the humidity. I also lock the door when taking a shower, and I do not know why anyone would take a shower without locking the door. Opening the door while showering just makes the rest of the home humid as well, and we can't have that.
I exercise daily, so out of habit I always take a shower after I exercise, and most of my showers are after exercise. Even if I spend a few minutes cooling down after exercise, I need the shower to cool down even more; a warm shower does not cool me down as effectively, so I end up sweating after the shower. I sometimes take my temperature after exercising and showering, and even after the shower I tend to have a mouth temperature of 99.0 to 99.5 degrees Fahrenheit. I doubt that people who barely need to take a shower after exercising are doing much exercise, or perhaps they are doing weights instead of cardio, which produces less sweat; but in any case, I have never exercised and thought that I do not need a shower, regardless of whether I am doing cardio, weights, or whatever.
Soap scum left over after taking a cold shower seems to be a problem for you and for you only.
Added 8/20/2025: And taking a hot shower produces all the condensation that helps all that mirror bacteria grow. Biological risk from the mirror world — LessWrong
Instead of not taking showers, we should all take cold showers for many reasons.
I am going to share an algorithm that I came up with that tends to produce the same result when we run it multiple times with different initializations. The iteration is not even guaranteed to converge since we are not using gradient ascent, but it typically converges as long as the algorithm is given a reasonable input. This suggests that the algorithm behaves mathematically and may be useful for things such as quantum error correction. After analyzing the algorithm, I shall use it to solve a computational problem.
We say that an algorithm is pseudodeterministic if it tends to return the same output even if the computation leading to that output is non-deterministic (due to a random initialization). I believe that we should focus a lot more on pseudodeterministic machine learning algorithms for AI safety and interpretability, since pseudodeterministic algorithms are inherently interpretable.
Define $f(z)=3z^2-2z^3$ for all complex numbers $z$ (the cubic with attracting fixed points at $0$ and $1$). Then $f(0)=0$ and $f(1)=1$, and there are neighborhoods $U_0,U_1$ of $0,1$ respectively where if $z\in U_0$, then $f^n(z)\rightarrow 0$ quickly, and if $z\in U_1$, then $f^n(z)\rightarrow 1$ quickly. Set $f^\infty(z)=\lim_{n\rightarrow\infty}f^n(z)$. The function $f^\infty$ serves as error correction for projection matrices, since if $P$ is nearly a projection matrix, then $f^\infty(P)$ will be a projection matrix.
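A quick numerical illustration of this error-correction behavior (the matrix size and noise level are arbitrary choices): perturb an exact projection slightly, iterate the cubic, and check that the result is again a projection.

```python
import numpy as np

def f(M):
    # The cubic 3 M^2 - 2 M^3 pushes eigenvalues near 0 to 0 and
    # eigenvalues near 1 to 1.
    return 3 * M @ M - 2 * M @ M @ M

rng = np.random.default_rng(0)
Q, _ = np.linalg.qr(rng.standard_normal((6, 2)))
P = Q @ Q.T                                    # exact rank-2 orthogonal projection
noise = rng.standard_normal((6, 6))
M = P + 0.05 * (noise + noise.T) / 2           # symmetric perturbation

for _ in range(30):
    M = f(M)
print(np.allclose(M @ M, M, atol=1e-8))        # True: M is a projection again
```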
Suppose that $K$ is either the field of real numbers, the field of complex numbers, or the division ring of quaternions. Let $Z(K)$ denote the center of $K$. In particular, $Z(\mathbb{R})=\mathbb{R}$, $Z(\mathbb{C})=\mathbb{C}$, and $Z(\mathbb{H})=\mathbb{R}$.
If $A_1,\dots,A_r$ are $n\times n$ matrices over $K$, then define $\Phi(A_1,\dots,A_r):M_n(K)\rightarrow M_n(K)$ by setting $\Phi(A_1,\dots,A_r)(X)=A_1XA_1^{\ast}+\dots+A_rXA_r^{\ast}$. Then we say that an operator of the form $\Phi(A_1,\dots,A_r)$ is completely positive. We say that a $Z(K)$-linear operator $E:M_n(K)\rightarrow M_n(K)$ is Hermitian preserving if $E(X)$ is Hermitian whenever $X$ is Hermitian. Every completely positive operator is Hermitian preserving.
Suppose that $E:M_n(K)\rightarrow M_n(K)$ is $Z(K)$-linear. Let $1\leq d\leq n$. Let $P_0$ be a random orthogonal projection matrix of rank $d$. For each $k$, obtain $P_{k+1}$ from $P_k$ by applying $E$, rescaling, and then applying the error correction $f^\infty$. Then if everything goes well, the sequence $(P_k)_k$ will converge to a projection matrix $P$ of rank $d$, and the projection matrix $P$ will typically be unique in the sense that if we run the experiment again, we will typically obtain the exact same projection matrix $P$. If $E$ is Hermitian preserving, then the projection matrix $P$ will typically be an orthogonal projection. This experiment performs well especially when $E$ is completely positive or at least Hermitian preserving or nearly so. The projection matrix $P$ will satisfy the corresponding fixed-point equation.
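Here is a minimal sketch of this iteration. Two caveats: the rescaling used to keep the rank at $d$ (dividing by the sum of the $d$-th and $(d+1)$-th largest eigenvalues, so that the spectral gap straddles $1/2$) is my own assumption rather than something specified above, and for numerical stability the error correction $f^\infty$ is evaluated spectrally instead of by iterating the cubic.

```python
import numpy as np

def f_infinity(M):
    # Spectral evaluation of f^infinity on a Hermitian matrix:
    # eigenvalues above 1/2 are sent to 1, the rest to 0.
    vals, vecs = np.linalg.eigh(M)
    return (vecs * (vals > 0.5)) @ vecs.conj().T

def find_projection(E, n, d, iters=200, seed=0):
    rng = np.random.default_rng(seed)
    Q, _ = np.linalg.qr(rng.standard_normal((n, d)))
    P = Q @ Q.T                                  # random rank-d projection
    for _ in range(iters):
        M = E(P)
        M = 0.5 * (M + M.conj().T)               # symmetrize (Hermitian-preserving case)
        vals = np.sort(np.linalg.eigvalsh(M))[::-1]
        M = M / (vals[d - 1] + vals[d])          # assumed rescaling: the gap between the
        P = f_infinity(M)                        # d-th and (d+1)-th eigenvalues straddles 1/2
    return P

# Example with a completely positive map E(X) = sum_i A_i X A_i^*.
n, d = 8, 3
rng = np.random.default_rng(1)
As = [rng.standard_normal((n, n)) / n for _ in range(3)]
E = lambda X: sum(A @ X @ A.T for A in As)
P = find_projection(E, n, d)
print(np.allclose(P @ P, P), round(float(np.trace(P)), 6))   # a projection of rank d
```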
In the case when $E$ is a quantum channel, we can easily explain what the projection $P$ does. The operator $P$ is a projection onto a subspace of complex Euclidean space that is particularly well preserved by the channel $E$; in particular, the image of $P$ is spanned by top eigenvectors of $E(P)$. This means that if we send the completely mixed state through the quantum channel $E$ and measure the resulting state with respect to the projective measurement $\{P,I-P\}$, then there is an unusually high probability that this measurement will land on $P$ instead of $I-P$.
Let us now use the algorithm that obtains the projection $P$ from the operator $E$ to solve a problem, at least in many cases.
If $v$ is a vector, then let $\mathrm{Diag}(v)$ denote the diagonal matrix whose vector of diagonal entries is $v$, and if $A$ is a square matrix, then let $\mathrm{diag}(A)$ denote the diagonal of $A$. If $v$ is a length-$n$ vector, then $\mathrm{Diag}(v)$ is an $n\times n$ matrix, and if $A$ is an $n\times n$ matrix, then $\mathrm{diag}(A)$ is a length-$n$ vector.
Problem Input: An $n\times n$ matrix $A$ with non-negative real entries and a natural number $d$ with $1\leq d\leq n$.
Objective: Find a subset $S\subseteq\{1,\dots,n\}$ with $|S|=d$ such that, if $x$ is the indicator vector of $S$, then the $d$ largest entries of $Ax$ are the values $(Ax)_i$ for $i\in S$.
Algorithm: Let $E$ be the completely positive operator defined by setting $E(X)=\mathrm{Diag}(A\,\mathrm{diag}(X))$. Then we run the iteration using $E$ to produce an orthogonal projection $P$ of rank $d$. In this case, the projection $P$ will be a diagonal projection matrix of rank $d$ with $P=\mathrm{Diag}(x)$, where $x$ is the indicator vector of our desired subset $S$ of $\{1,\dots,n\}$.
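Reusing the `find_projection` sketch from above, the whole procedure is only a few lines; the operator `E` here is built directly from `A`, and reading the set off the diagonal of `P` is the final step.

```python
import numpy as np

n, d = 10, 4
rng = np.random.default_rng(2)
A = rng.random((n, n))                      # non-negative entries
E = lambda X: np.diag(A @ np.diag(X))       # E(X) = Diag(A diag(X))
P = find_projection(E, n, d)                # sketch defined earlier
S = {i for i in range(n) if P[i, i] > 0.5}
print(S)
```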
While the operator $E$ is just a linear operator, the pseudodeterminism of the algorithm that produces the projection $P$ generalizes to other pseudodeterministic algorithms that return models that are more like deep neural networks.
I would have thought that a fitness function which is maximized using something other than gradient ascent and which can solve NP-complete problems, at least in the average case, would be worth reading about, since that means it can perform well on some tasks while also behaving mathematically in a way that is needed for interpretability. The quality of the content is inversely proportional to the number of views since people don't think the same way as I do.
Wheels on the Bus | @CoComelon Nursery Rhymes & Kids Songs
Stuff that is popular is usually garbage.
But here is my post about the word embedding.
And I really do not want to collaborate with people who are not willing to read the post. This is especially true of people in academia since universities promote violence and refuse to acknowledge any wrongdoing. Universities are the absolute worst.
Instead of engaging with the actual topic, people tend to just criticize stupid stuff simply because they only want to read about what they already know or what is recommended by their buddies; that is a very good way not to learn anything new or insightful. For this reason, even the simplest concepts are lost on most people.