I'm confused. Isn't one of the standard justifications for the Solomonoff prior that you can get it without talking about K-complexity, just by assuming a uniform prior over programs of length $n$ on a universal monotone Turing machine and letting $n$ tend to infinity?[1] How is that different from your ? It's got to be different, right, since you say that is not equivalent to the Solomonoff prior.
See e.g. An Introduction to Universal Artificial Intelligence, pages 145 and 146.
What you describe is not the Solomonoff prior on hypotheses, but the Solomonoff a priori distribution on sequences/histories! This is the distribution I call in my post. It can then be written as a mixture of LSCSMs, with the weights given either by the Solomonoff prior (involving Kolmogorov complexity) or the a priori prior in my work. Those priors are not the same.
This post has been written in collaboration with Iliad in service of one of Iliad's longer-term goals of understanding the simplicity bias of learning machines.
Solomonoff induction is a general theoretical ideal for how to predict sequences that were generated by a computable process. It posits that in order to predict the continuation of our world, we should weight hypotheses by their simplicity ("Occam's razor'') and update the weighting using Bayesian reasoning.
In this post, I introduce Solomonoff induction at a technical level, hoping to be more elegant and satisfying than introductions you might find elsewhere. In particular, the introduction avoids Kolmogorov complexity: Instead of making predictions with the Solomonoff prior probability of hypotheses — given as two to the negative of their shortest encodings, i.e., Kolmogorov complexities — we directly use the a priori probability that a hypothesis arises from program code sampled uniformly at random. This is closely related to the alt-complexity sketched by Nate Soares before. The approach leads to more elegant results that are simpler to prove than in typical expositions. Yet, since only the prior over hypotheses is changed, the introduction remains compatible with other expositions and should be broadly valuable to many readers.
Later, we will see that this choice has both mathematical advantages and disadvantages compared to an approach based on Kolmogorov complexity, and the latter is likely a more powerful basis to build more theory on. These drawbacks notwithstanding, I hope that this introduction is philosophically satisfying, by touching on questions like "Why should we expect our world to be learnable or predictable in the first place?" and "What is a universally good way to predict?", and on connections to Occam's razor and why one should favor simple hypotheses when predicting the world. In the FAQ at the end, I will touch on several anticipated questions, including briefly on how the theory might be relevant for neural networks.
Audience. I try not to strictly assume much prior knowledge of Turing machines or related concepts, but prior exposure will still be very useful. Some amount of mathematical or scientific maturity can probably replace exposure to these topics.
Acknowledgments. This post benefitted most from discussions with Alexander Oldenziel, and it was caused by his inquiry about the potential simplicity of learned neural network functions based on analogies with algorithmic information theory (AIT). This made me want to think more carefully about a "degeneracy-first" perspective on AIT itself, from which this post emerged. I'm also grateful to Cole Wyeth for discussions and his feedback on an earlier draft that led to some changes. He also explained to me the argument for why the Solomonoff prior gives the optimal prediction bound among all lower semicomputable priors. Thanks also to Matt Dellago for discussions on the content of this post, and to CEEALAR for hosting me when I finished writing this post.
If we make a sequence of observations in the world, e.g., a sequence of numbers
or a sequence of physical measurements, we are often able to continue the sequence in a way that coincides with its true continuation "in the wild". This seems to broadly work in much of the real world, meaning the world is learnable — we can learn from past experience to predict the future.
Why is that? This question has puzzled humanity for thousands of years. For example, Sextus Empiricus expressed that any rule inferred from a finite amount of data may always be contradicted by more data, and that it is impossible to take into account all data:
When they propose to establish the universal from the particulars by means of induction, they will affect this by a review of either all or some of the particulars. But if they review some, the induction will be insecure, since some of the particulars omitted in the induction may contravene the universal; while if they are to review all, they will be toiling at the impossible, since the particulars are infinite and indefinite.
— Sextus Empiricus (160-210)
René Descartes notices that our minds are so fallible that we cannot even be sure of being awake:
How often, asleep at night, am I convinced of just such familiar events — that I am here in my dressing-gown, sitting by the fire — when in fact I am lying undressed in bed! Yet at the moment my eyes are certainly wide awake when I look at this piece of paper; I shake my head and it is not asleep; as I stretch out and feel my hand I do so deliberately, and I know what I am doing. All this would not happen with such distinctness to someone asleep. Indeed! As if I did not remember other occasions when I have been tricked by exactly similar thoughts while asleep! As I think about this more carefully, I see plainly that there are never any sure signs by means of which being awake can be distinguished from being asleep. The result is that I begin to feel dazed, and this very feeling only reinforces the notion that I may be asleep.
— René Descartes (1596-1650)
And in the following passage, David Hume seems to imply that while we can make future inferences about cause and effect from analogizing about past patterns, there is no principle that can be discovered in nature itself that would justify doing so:
All our reasonings concerning matter of fact are founded on a species of Analogy, which leads us to expect from any cause the same events, which we have observed to result from similar causes. [...] It is impossible, that this inference of the animal can be founded on any process of argument or reasoning, by which he concludes, that like events must follow like objects, and that the course of nature will always be regular in its operations. For if there be in reality any arguments of this nature, they surely lie too abstruse for the observation of such imperfect understandings; since it may well employ the utmost care and attention of a philosophic genius to discover and observe them.
— David Hume (1711-1776)
Similar reasoning is expressed by the no free lunch theorems, which roughly state that a priori, no learning algorithm should perform better than any other.
We conclude that without placing any assumptions on the world, learning and prediction should indeed be impossible: If the world were truly (uniformly) random, then any way to use past data to predict the future would indeed be doomed. So why is it possible to learn?
One pre-formal answer is Occam's razor, which exists in various formulations and states that among all explanations that are compatible with our observations, we should choose the one making the least amount of assumptions. There are two quotes often attributed to Occam to this effect:
"Plurality must never be posited without necessity."
"It is futile to do with more things that which can be done with fewer."
— William of Ockham (1287-1347)
Occam's razor is a useful starting point, but it is not formal. Additionally, it would be nice to be able to justify Occam's razor from some minimal assumption that does not simply assume that we should focus on simple hypotheses in our reasoning. How can one do this?
In this post, I motivate the answer by starting with the minimal assumption that our world is computed. Not knowing what it is computed by, we assume that we are generated by running a uniformly randomly sampled program code through a universal Turing machine (Section link).
A consequence of this assumption will be that we can also think of our world as being sampled in a two-stage process: First by sampling a semicomputable (perhaps probabilistic) universe from a prior distribution, and then by generating our history from said universe. As it turns out, the prior on universes — which I call the a priori prior — will automatically have a simplicity-weighting built in, thus justifying Occam's razor (Section link).
Repurposing this two-stage distribution for making predictions then leads to Solomonoff induction, which has remarkable predictive power no matter what computation truly underlies our universe. I.e., even if the program code underlying our true universe was in fact not sampled randomly, but chosen adversarially to make our prediction task hard, we would eventually, for a sufficiently long observed history, converge to perfect predictions (Section link).
I will then discuss drawbacks of my choice of an a priori prior. In particular, the weights are likely not lower semicomputable, and the a priori prior is not invariant up to a constant under a change of the universal Turing machine, unlike the classical Kolmogorov-based Solomonoff prior. Additionally, the Solomonoff prior satisfies an optimality property for sequence prediction that does not exist in similar form for the a priori prior. This also means that my choice of prior — which can be regarded as being based on the probability, or degeneracy, of hypotheses — is meaningfully different from an approach based on description length (Section link).
This shows that the maxim that description length equals degeneracy is not universal in algorithmic information theory. Relatedly, in a footnote I discuss the relationship to Nate Soares's alt-complexity and will explain that many commenters under that post were probably wrong about K-complexity and alt-complexity being identical up to a constant. These problems notwithstanding, I do think my choice of prior is a priori easier to justify, depends on fewer choices, and leads to some results being more elegant.
I then conclude and anticipate questions in my FAQ, focusing briefly on how to think about Solomonoff induction for agents observing only part of the universe's history, on practical usefulness, on the potential applicability of Solomonoff induction as a theoretical ideal for understanding neural networks, on whether Solomonoff induction can be considered a theory of everything, and on the sensitivity of codelengths to the choice of Turing machine model. The answers are very preliminary, and much more could be said or studied about them.
Observing our world, it seems like the future is determined by the past with precise rules. We capture this by the minimal assumption that our world is computed. Without a priori knowledge of the program that generated us, we simply assume that our history emerged by sampling a uniformly random computer program and piping it through a universal programming language that outputs sequences.
In this section I formalize what this means. Let $\{0,1\}$ be our alphabet of bits. $\{0,1\}^*$ is the set of binary strings of arbitrary finite length (where the length-$0$ string is explicitly allowed!). $\{0,1\}^\infty$ is the set of infinite binary sequences. And $\{0,1\}^* \cup \{0,1\}^\infty$ is the set of finite or infinite binary sequences. For two sequences $x, y$, we write $xy$ for their concatenation. $x$ is then said to be a prefix of $xy$, written $x \sqsubseteq xy$.
How do we capture that we're "computed"? One idea is to assume that we are generated one "frame" at a time by a program that never changes past frames. This leads to the formalization of a monotone Turing machine in Definition 4.5.3 of Li and Vitányi.[1] Formally, a monotone Turing machine is a computable function $T\colon \{0,1\}^* \to \{0,1\}^*$ with the monotonicity property: For all $x, y \in \{0,1\}^*$ with $x \sqsubseteq y$, we have $T(x) \sqsubseteq T(y)$.[2] There is a lot to unpack here, and I hope that the following semi-formal explanations give enough intuitions to read the rest of the post:
Not all monotone Turing machines are equally interesting. For example, imagine a machine which simply outputs an empty sequence on any input: This satisfies all properties stated above, but it is very boring.
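To make the monotonicity property concrete, here is a minimal Python sketch of two toy monotone machines: the trivial one just mentioned, and a slightly less boring one that copies every second input bit. These are my own illustrative examples, not machines used later in the post.

```python
import random

def trivial_machine(x: str) -> str:
    """The 'boring' monotone machine: empty output on every input."""
    return ""

def every_second_bit(x: str) -> str:
    """A monotone machine that outputs every second input bit.
    Reading more input can only extend the output, never rewrite it."""
    return x[1::2]

def is_prefix(a: str, b: str) -> bool:
    return b.startswith(a)

# Spot-check monotonicity: T(x) is a prefix of T(y) whenever x is a prefix of y.
for _ in range(1000):
    y = "".join(random.choice("01") for _ in range(random.randint(0, 12)))
    x = y[: random.randint(0, len(y))]          # x is a prefix of y
    for T in (trivial_machine, every_second_bit):
        assert is_prefix(T(x), T(y))
```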
The other extreme is a universal monotone Turing machine, which can simulate all others. It does this by operating on codes that first describe an arbitrary monotone Turing machine using a binary code $d$, followed by the input $x$ to said machine. One tricky detail is that we cannot simply concatenate $d$ and $x$ to $dx$, since then we may forget where the description $d$ ends and the input starts.
A solution is to use the following prefix-free encoding $\bar{d}$ of descriptions $d$:
$$\bar{d} := 1^{\ell(b(\ell(d)))}\, 0\, b(\ell(d))\, d.$$
Let's unpack this: $\ell(d)$ is the length of $d$. $b(n)$, for a natural number $n$, is an encoding of $n$ as a binary sequence, usually by ordering binary sequences by length and then lexicographically. $\bar{d}$ then first contains a sequence of $1$'s followed by a zero. The number of $1$'s is precisely the length of $b(\ell(d))$. After reading that, one knows how many of the following bits encode $\ell(d)$. After reading those bits, one knows the length of $d$, and so after reading so many further bits, one knows $d$. Thus, it is algorithmically possible to disentangle a binary sequence of the form $\bar{d} x$ into $\bar{d}$ and $x$, since $\bar{d}$ itself tells you when $d$ will end.
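Here is a small Python sketch of this kind of prefix-free encoding. For simplicity, I encode the length as an ordinary binary numeral rather than via the length-lexicographic enumeration mentioned above, so the exact codewords differ from the post's convention, but the self-delimiting mechanism is the same.

```python
def bar(d: str) -> str:
    """Self-delimiting encoding of a description d: a run of 1s announcing how many
    bits encode len(d), then a separating 0, then the bits of len(d), then d itself."""
    length_bits = bin(len(d))[2:]        # binary numeral for len(d), e.g. 6 -> '110'
    return "1" * len(length_bits) + "0" + length_bits + d

def unbar(stream: str) -> tuple[str, str]:
    """Recover (d, rest) from a string that starts with bar(d)."""
    k = 0
    while stream[k] == "1":              # the run of 1s says how many length bits follow
        k += 1
    length = int(stream[k + 1 : k + 1 + k], 2)   # decode len(d)
    start = k + 1 + k
    return stream[start : start + length], stream[start + length :]

d = "010110"
assert unbar(bar(d) + "1110") == (d, "1110")
```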
Then, a universal monotone Turing machine $U$ only produces non-trivial outputs for inputs of the form $\bar{d} x$. Additionally, for every $d$, the function $x \mapsto U(\bar{d} x)$ is itself a monotone Turing machine, and every monotone Turing machine is of this form for at least one $d$. Note that $U$ will never produce output bits before reading all of $\bar{d}$, since $\bar{d}$ only encodes a monotone Turing machine. Thus, for any strict prefix $p$ of $\bar{d}$, $U(p)$ is simply the empty string.
This is a nice definition, but does such a universal machine exist? The answer is yes, and it is proved by enumerating the set of monotone Turing machines in a computable way, which Li and Vitányi[1] state to be "clearly" possible right before Definition 4.5.5 —without giving a proof. Intuitively, this is believable by thinking about universal programming languages: Any "monotone Turing machine" can be implemented by a finite code in any universal programming language like Python.
I fix a universal monotone Turing machine $U$ for the rest of the post. In a later section we will see that the theory is quite sensitive to the choice of this machine, but for now, we will not worry about this.
A monotone Turing machine $T$ also induces a function on infinite inputs (by abuse of notation denoted with the same symbol), whose output on an infinite input $z \in \{0,1\}^\infty$ is defined to be the limit of the outputs $T(x)$ over finite prefixes $x \sqsubseteq z$.[3]
Now, let $\lambda$ be the uniform distribution on $\{0,1\}^\infty$, which samples infinite binary sequences one bit at a time, each bit independently being $0$ or $1$ with probability 50%. Let $z \sim \lambda$ be a random input sampled from $\lambda$, and let $U(z)$ be its output (possibly infinite). For finite $x$, let $M(x)$ be the resulting probability that a history starts with $x$. Formally:
Definition 1 (Universal a priori distribution). Let $U$ be a universal monotone Turing machine and $\lambda$ be the uniform distribution on infinite binary codes. The universal a priori probability that a history starts with $x \in \{0,1\}^*$ is defined as
$$M(x) := \lambda\big(\{z \in \{0,1\}^\infty : x \sqsubseteq U(z)\}\big).$$
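Definition 1 can also be read operationally, as the following Python sketch tries to convey. It assumes a hypothetical function run_utm(code, max_steps) standing in for $U$ with a bounded step budget (no such function is provided here); both the truncation of the code and the step budget make this a heuristic caricature of the incomputable quantity $M(x)$, not a faithful algorithm for it.

```python
import random

def estimate_M(x: str, run_utm, n_samples: int = 10_000,
               code_len: int = 64, max_steps: int = 10_000) -> float:
    """Crude Monte-Carlo estimate of the universal a priori probability M(x):
    sample uniformly random finite code prefixes, run the (hypothetical)
    universal monotone machine on them with a step budget, and count how
    often the produced output starts with x."""
    hits = 0
    for _ in range(n_samples):
        code = "".join(random.choice("01") for _ in range(code_len))
        output = run_utm(code, max_steps)    # hypothetical call, returns a finite bit string
        if output.startswith(x):
            hits += 1
    return hits / n_samples
```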
I invite you to pause for a moment and consider what a profound perspective shift this is compared to the historical puzzlement I emphasized in the introduction: When incorrectly assuming that our history is sampled uniformly, we cannot predict the future, as the skeptical views expressed by Empiricus, Descartes, and Hume make clear. However, the universal a priori distribution does not assume a uniformly sampled history. Instead, it assumes a uniformly sampled code, from which the history is generated with the fixed, deterministic universal monotone Turing machine $U$. In other words, we take our universe to have a uniformly sampled description instead of a uniformly sampled history.
Should we expect any regularities in the histories sampled from $M$? And if so, does this regularity give us a hint about how to predict sequence continuations in our actual universe? These questions are answered in the next two sections.
We now state and prove an equivalent description of the universal a priori distribution $M$, which will show that $M$ has a simplicity-weighting built in. Intuitively, it will turn out that sampling histories from $M$ is as if you were to first sample "probabilistic physical laws" in a way that obeys Occam's razor, and then use those laws to sample the history.
We formalize "probabilistic physical laws" by the notion of a semimeasure:
Definition 2 ((Lower semicomputable) semimeasure). A semimeasure is a function $\nu\colon \{0,1\}^* \to [0,1]$ with $\nu(\epsilon) = 1$[4] and $\nu(x) \geq \nu(x0) + \nu(x1)$ for all $x \in \{0,1\}^*$, where $\epsilon$ is the empty binary string. $\nu$ is called lower semicomputable if it can be computably approximated from below; This means there is an algorithm which, on input $x$, produces progressively better approximations of $\nu(x)$ from below. I abbreviate "lower semicomputable semimeasure" with LSCSM from now on.
For an LSCSM $\nu$, $\nu(x)$ is interpreted as the probability, under $\nu$, that an infinite sequence starts with $x$. We can have $\nu(x) > \nu(x0) + \nu(x1)$ since the sequence may not continue after starting with $x$. This is akin to the possibility that the universe can end at any moment.
As an aside, LSCSMs also include deterministic sequences: Let $\alpha \in \{0,1\}^\infty$ be a fixed sequence that can be generated by a computer program. Define $\nu(x) = 1$ for all prefixes $x$ of $\alpha$, and $\nu(x) = 0$ else. Then for a given $x$, we can compute $\nu(x)$ by computing the first $\ell(x)$ bits of $\alpha$ and determining whether $x$ is a prefix. This shows that $\nu$ is an LSCSM that generates the sequence $\alpha$.
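As a concrete toy instance of this aside, here is a Python sketch of the deterministic semimeasure for the computable sequence $\alpha = 010101\ldots$ (my own choice of example sequence):

```python
from itertools import product

def alpha_prefix(n: int) -> str:
    """First n bits of the computable sequence alpha = 010101..."""
    return ("01" * n)[:n]

def nu_alpha(x: str) -> float:
    """Semimeasure concentrated on alpha: probability 1 on each of its prefixes,
    probability 0 on every other finite string."""
    return 1.0 if x == alpha_prefix(len(x)) else 0.0

# Check the semimeasure conditions on all strings up to length 6.
assert nu_alpha("") == 1.0
for n in range(6):
    for bits in product("01", repeat=n):
        x = "".join(bits)
        assert nu_alpha(x) >= nu_alpha(x + "0") + nu_alpha(x + "1")
```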
How can we obtain LSCSMs? Well, for one, we already know one instance: $M$ is a semimeasure since
$$M(\epsilon) = \lambda\big(\{z \in \{0,1\}^\infty : \epsilon \sqsubseteq U(z)\}\big) = 1,$$
and for every $x \in \{0,1\}^*$:
$$\{z : x0 \sqsubseteq U(z)\} \,\cup\, \{z : x1 \sqsubseteq U(z)\} \subseteq \{z : x \sqsubseteq U(z)\}, \quad \text{with the two sets on the left disjoint,}$$
leading to $M(x) \geq M(x0) + M(x1)$. $M$ is also lower semicomputable. Indeed, to approximate $M(x)$ from below, simply enumerate all finite binary sequences $p$, compute $U(p)$ over all $p$ in parallel, and record whether $x \sqsubseteq U(p)$. If this is the case, then for all infinite extensions $z$ of $p$, we have $x \sqsubseteq U(z)$, and they together have probability $2^{-\ell(p)}$, with $\ell(p)$ the length of $p$. Thus, we can add $2^{-\ell(p)}$ to our estimate of $M(x)$ and then skip any extensions of $p$ in our algorithmic procedure. The estimate then progressively approximates $M(x)$ since for every $z$ with $x \sqsubseteq U(z)$, there is a finite prefix $p$ of $z$ with $x \sqsubseteq U(p)$, meaning we will find the contribution eventually in our approximation.
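The following Python sketch spells out that approximation procedure. It assumes a hypothetical total function U(p) returning the finite output of the fixed universal monotone machine on a finite input p; since no such function is given here and the full enumeration is unbounded, the cutoff max_len only produces one member of the increasing sequence of lower bounds.

```python
from itertools import product

def approximate_M_from_below(x: str, U, max_len: int = 16) -> float:
    """Lower bound on M(x): enumerate finite inputs p by increasing length,
    add 2^-len(p) whenever U(p) starts with x, and skip all extensions of
    inputs that were already counted. Increasing max_len gives a
    non-decreasing sequence of lower bounds converging to M(x)."""
    estimate = 0.0
    counted = []                           # minimal inputs already accounted for
    for n in range(max_len + 1):
        for bits in product("01", repeat=n):
            p = "".join(bits)
            if any(p.startswith(q) for q in counted):
                continue                   # a prefix of p was already counted
            if U(p).startswith(x):         # hypothetical call to the universal machine
                estimate += 2.0 ** (-len(p))
                counted.append(p)
    return estimate
```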
What other ways are there to obtain LSCSMs? Well, you may have noticed that I did not use the universality of $U$ in the explanation above. In other words, every monotone Turing machine $T$ gives rise to an LSCSM $\lambda_T$ by piping uniformly random noise through $T$:
$$\lambda_T(x) := \lambda\big(\{z \in \{0,1\}^\infty : x \sqsubseteq T(z)\}\big).$$
And this is all: All LSCSMs emerge from a monotone Turing machine in this way, see Theorem 4.5.2 in Li and Vitányi.[1][5]
Now, for $d \in \{0,1\}^*$, define the function $U_d\colon \{0,1\}^* \to \{0,1\}^*$ by
$$U_d(x) := U(\bar{d} x),$$
which is itself a monotone Turing machine. Define the LSCSM $\nu_d := \lambda_{U_d}$.
Definition 3 (A priori prior). The a priori prior is given for any LSCSM $\nu$ by
$$w(\nu) := \sum_{d \,:\, \nu_d = \nu} 2^{-\ell(\bar{d})},$$
which is the probability[6] that a uniformly random infinite sequence starts with an encoding $\bar{d}$ that gives rise to the LSCSM $\nu$.
The reason I call this the "a priori prior" is that it simply emerged from our definitions without further assumptions, which distinguishes it from the Kolmogorov-based Solomonoff prior that I will investigate later (Section link).
Finally, let $\mathcal{M}_{\mathrm{sol}}$ be the set of all LSCSMs, where "sol" is shorthand for "Solomonoff". We obtain:
Theorem 4 (Mixture Equivalence). The universal distribution $M$ is a universal mixture over all LSCSMs with weights $w$:
$$M(x) = \sum_{\nu \in \mathcal{M}_{\mathrm{sol}}} w(\nu)\, \nu(x) \quad \text{for all } x \in \{0,1\}^*.$$
Proof. Since $U$ only has non-empty outputs for inputs of the form $\bar{d} z$ with $d \in \{0,1\}^*$ and $z \in \{0,1\}^\infty$, we obtain
$$M(x) = \lambda\big(\{z : x \sqsubseteq U(z)\}\big) = \sum_{d \in \{0,1\}^*} 2^{-\ell(\bar{d})}\, \lambda\big(\{z : x \sqsubseteq U(\bar{d} z)\}\big) = \sum_{d \in \{0,1\}^*} 2^{-\ell(\bar{d})}\, \nu_d(x).$$
We obtain
$$M(x) = \sum_{\nu \in \mathcal{M}_{\mathrm{sol}}} \sum_{d \,:\, \nu_d = \nu} 2^{-\ell(\bar{d})}\, \nu(x) = \sum_{\nu \in \mathcal{M}_{\mathrm{sol}}} w(\nu)\, \nu(x).$$
That proves the claim.
This means that if our history was sampled from $M$, we can also think of it as being sampled by first sampling an LSCSM $\nu$ with probability $w(\nu)$, and then sampling the history successively from $\nu$.
If an LSCSM $\nu$ has an encoding $d$ (meaning $\nu = \nu_d$), then $w(\nu) \geq 2^{-\ell(\bar{d})}$. This gives a simplicity bias: Universes with a simple (i.e., short) description are more likely under this a priori prior. This can be interpreted as a form of Occam's razor, in that hypotheses with a short description receive exponentially more weight.[7]
So far, I have started with the assumption that our universe is the result of sampling a uniformly random program, and arrived at the conclusion that this is equivalent to first sampling an LSCSM from a prior that gives more weight to simple hypotheses, and then sampling our history from it. Thus, under our assumption, we should expect the world to be more predictable than the skeptical paradoxes suggest: We can concentrate our reasoning on simple universes and should have a chance to predict the future better than chance. We will make this case in mathematical detail in this section.
How should we make predictions in light of these thoughts? The answer: We predict using the Solomonoff mixture distribution $M$ itself, predicting sequence continuations $y$ via the familiar conditional probabilities $M(y \mid x) := M(xy)/M(x)$ upon observing an initial history $x$. This is the most parsimonious way to predict, given that a priori we only assume having emerged from a program sampled uniformly at random.
We now explain in more detail how to predict using $M$ by relating the conditionals to Bayesian updating. For any LSCSM $\nu$, define the conditional probability $\nu(y \mid x) := \nu(xy)/\nu(x)$. We obtain:
$$M(y \mid x) = \frac{M(xy)}{M(x)} = \sum_{\nu \in \mathcal{M}_{\mathrm{sol}}} \frac{w(\nu)\,\nu(x)}{M(x)} \cdot \frac{\nu(xy)}{\nu(x)} = \sum_{\nu \in \mathcal{M}_{\mathrm{sol}}} w(\nu \mid x)\, \nu(y \mid x),$$
where I defined $w(\nu \mid x) := \frac{w(\nu)\,\nu(x)}{M(x)}$ in the suggested way. This is actually a Bayesian posterior in the conventional sense; One can notationally see this more clearly by writing $\nu(x) = P(x \mid \nu)$, the likelihood of history $x$ conditional on the hypothesis $\nu$. Then the inner formula becomes:
$$w(\nu \mid x) = \frac{w(\nu)\, P(x \mid \nu)}{\sum_{\nu' \in \mathcal{M}_{\mathrm{sol}}} w(\nu')\, P(x \mid \nu')}.$$
This is the classical Bayes theorem.
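As a computable toy stand-in for this procedure (not Solomonoff induction itself, which mixes over all LSCSMs), the Python sketch below runs the same kind of Bayesian mixture prediction over a small, hand-picked family of hypotheses; the particular hypotheses and the uniform prior are arbitrary choices for illustration.

```python
def bernoulli(theta):
    """Hypothesis: i.i.d. coin flips; P(next bit = 1 | history) = theta."""
    return lambda history: theta

def alternator(history):
    """Hypothesis: the deterministic sequence 0101...; predicts the next bit accordingly."""
    return 1.0 if "01"[len(history) % 2] == "1" else 0.0

hypotheses = {"fair coin": bernoulli(0.5), "biased coin": bernoulli(0.9), "alternator": alternator}
prior = {name: 1 / len(hypotheses) for name in hypotheses}   # stand-in for w(nu)

def predict_next(history: str):
    """Posterior over hypotheses after seeing history, and the mixture probability
    that the next bit is 1 (the analogue of M(1 | history))."""
    likelihood = {}
    for name, h in hypotheses.items():
        p = 1.0
        for t, bit in enumerate(history):
            p1 = h(history[:t])                      # P(bit_t = 1 | first t bits)
            p *= p1 if bit == "1" else 1.0 - p1
        likelihood[name] = p
    evidence = sum(prior[n] * likelihood[n] for n in hypotheses)
    posterior = {n: prior[n] * likelihood[n] / evidence for n in hypotheses}
    prob_one = sum(posterior[n] * hypotheses[n](history) for n in hypotheses)
    return prob_one, posterior

print(predict_next("0101010101"))   # the posterior concentrates on the alternator
```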
Thus, we arrive at a very satisfying description of Solomonoff induction:
This is very appealing:
One often-noted drawback is that it is incomputable, since we can't precisely determine the posterior $w(\nu \mid x)$ or the conditional $M(y \mid x)$. However:
None of this would be useful if it did not result in actual predictability of the world. After all, our question was why our actual universe is predictable. Fortunately, it turns out that Solomonoff induction is actually an exquisite predictor under the reasonable assumption that our own universe is generated by a lower semicomputable measure, instead of just a semimeasure.[8]
Theorem 5 (Solomonoff Bound). Let $\mu$ be any lower semicomputable measure on infinite sequences, and assume it generates our actual universe. Define the cumulative expected prediction error of predicting sequences sampled by $\mu$ via $M$ as:
Then we have
In other words, the prediction error is upper-bounded by the cross entropy between the Dirac measure on our true universe $\mu$ and our a priori prior, i.e., by $-\log w(\mu)$. In particular, for increasingly long histories, the total remaining prediction error goes to zero.[9]
Proof. We have (with explanations after the computation):
Step 1 is the definition. Step 2 is a standard inequality that holds whenever $\mu$ is a measure, see Corollary 2.8.8 in Hutter et al.[10] (whereas $M$ is allowed to be only a semimeasure). Step 3 uses the definition of the KL divergence and changes the infinite series to a limit of partial sums. Step 4 substitutes in the definition of the conditional KL divergence.
Step 5 is a crucial step. Similar to how Shannon entropy satisfies the well-known chain rule $H(X, Y) = H(X) + H(Y \mid X)$, KL divergence also satisfies the chain rule $D_{\mathrm{KL}}(p_{X,Y} \,\|\, q_{X,Y}) = D_{\mathrm{KL}}(p_X \,\|\, q_X) + D_{\mathrm{KL}}(p_{Y \mid X} \,\|\, q_{Y \mid X})$, where the conditional term is averaged over $p_X$. Intuitively, the divergence of $p$ and $q$ measured in the variable $X$ plus the divergence in $Y$ once $X$ is already known equals the total divergence of $p$ and $q$ over $(X, Y)$. Telescoping this chain rule from $t = 1$ till $t = n$ gives precisely the result of step 5. A full proof of this telescope property can be found in Lemma 3.2.4 in Hutter et al.[10]. Step 6 is just the definition of the KL divergence again.
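For concreteness, the telescoped identity from step 5 reads as follows in placeholder notation, writing $\mu_{1:n}$ and $M_{1:n}$ for the distributions of the first $n$ bits (my paraphrase of the step, not a quotation of the proof):

```latex
\[
D_{\mathrm{KL}}\!\left(\mu_{1:n} \,\middle\|\, M_{1:n}\right)
  = \sum_{t=1}^{n} \mathbb{E}_{x_{<t} \sim \mu}\!\left[
      D_{\mathrm{KL}}\!\left(\mu(\cdot \mid x_{<t}) \,\middle\|\, M(\cdot \mid x_{<t})\right)
    \right],
\]
```

obtained by applying the chain rule to one bit at a time and summing.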
Now, note that by Theorem 4, we have
$$M(x) \geq w(\mu)\, \mu(x) \quad \text{for all } x \in \{0,1\}^*.$$
With this inequality, we obtain
That finishes the proof.
Thus, adding to our earlier investigations, not only do we have reason to believe our universe is predictable, there also is an amazing predictor as long as our universe is generated from a lower semicomputable measure; This resolves the skeptical paradoxes from the introduction. Furthermore, this predictor is given by a blank-slate probability distribution that assumes no prior knowledge of what our universe is. While the prediction error is lower for universes that are more likely under $w$ (which includes universes with a short description), I note that it goes to zero for long histories regardless of $\mu$.
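To get a computable feel for this kind of guarantee, here is a toy Python demonstration with a finite Bayesian mixture in place of $M$ (my own illustration, not the theorem itself). Pathwise, the excess log loss of the mixture over the true hypothesis never exceeds the negative log prior weight of that hypothesis, mirroring how $-\log w(\mu)$ bounds the total prediction error above.

```python
import math, random

random.seed(1)
thetas = [0.1, 0.3, 0.5, 0.7, 0.9]        # finite hypothesis class: Bernoulli coins
prior = [1 / len(thetas)] * len(thetas)   # uniform prior weights
true_theta = 0.7                          # the "true universe" is one of the hypotheses

def log_loss_gap(n_steps: int) -> float:
    """Sample a length-n history from the true coin and return the cumulative
    log loss of the Bayesian mixture minus that of the true hypothesis."""
    log_mix, log_true = 0.0, 0.0
    weights = prior[:]                    # posterior weights, updated online
    for _ in range(n_steps):
        bit = 1 if random.random() < true_theta else 0
        p_mix = sum(w * (t if bit else 1 - t) for w, t in zip(weights, thetas))
        p_true = true_theta if bit else 1 - true_theta
        log_mix += math.log(p_mix)
        log_true += math.log(p_true)
        # Bayesian update; dividing by p_mix renormalizes the posterior.
        weights = [w * (t if bit else 1 - t) / p_mix for w, t in zip(weights, thetas)]
    return log_true - log_mix

bound = math.log(len(thetas))             # = -log(prior weight of the true coin)
print(log_loss_gap(10_000), "<=", bound)  # the gap never exceeds the bound
```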
This section compares our construction of Solomonoff induction to the more familiar construction where the prior $w$ is replaced by the K-complexity-based Solomonoff prior. If you are not interested in this section, read on with the conclusion. This section assumes more prior knowledge than previous sections.
Assume we have a computable enumeration $(\nu_i)_{i \in \mathbb{N}}$ of LSCSMs. We can, for example, obtain this by mapping a natural number $i$ to its binary representation — a code $d_i$ — and then defining $\nu_i := \nu_{d_i}$ as above. Furthermore, let a separate universal prefix Turing machine be fixed, which is not directly related to the universal monotone Turing machine $U$. Define the Kolmogorov complexity $K(i)$ of a natural number $i$ as the length of the shortest input to this prefix machine which computes a binary encoding of $i$. The Solomonoff prior is then given by assigning the weight $2^{-K(i)}$ to the hypothesis $\nu_i$.
One can then prove that approximately (i.e., up to at most a constant multiplicative error), we have
$$M(x) \;\approx\; \sum_{i \in \mathbb{N}} 2^{-K(i)}\, \nu_i(x),$$
see Li and Vitányi,[1] Theorem 4.5.3. I will prove this statement at the start of the second subsection to reveal an advantage of .[11]
The Solomonoff prior has some distinct advantages that are missing in our definition:
Lower semicomputability. The weights $2^{-K(i)}$ are lower semicomputable, which our weights $w(\nu)$ are probably not. The reason is that to approximate $w(\nu)$ from below, we would need to be able to algorithmically determine for any binary string $d$ whether $\nu_d = \nu$. There seems to be no method to do this. See our investigation in a footnote for more details on missing lower semicomputability in the related case of Nate's alt-complexity.[12]
Invariance under a change of the UTM. The weights are, up to a multiplicative constant, invariant under a change of . In contrast, is not invariant, not even up to a multiplicative constant, under a change of the universal monotone machine . I learned the following argument for this from Cole Wyeth. Namely, take your universal monotone Turing machine . Define a second machine from it, , which is given by , and with empty output on any other input. We can easily see that this is again a universal monotone Turing machine. With suggestive notation, we obtain:
Thus, if is very small, then becomes extremely small, and there can be no constant factor with . We note that in the approximate inequality, we use , which only holds up to a logarithmic error term that probably does not change the conclusion substantially.
Besides, this also means that the a priori prior has no simple relationship to the K-complexity-based Solomonoff prior that can hold independently of the choices of the involved universal Turing machines. This may be surprising given that the coding theorem shows that the negative logarithm of the a priori probability of a string is, up to a constant, identical to its K-complexity. This observation is related to my belief that the alt-complexity and K-complexity in Nate's post are not identical up to a constant, contrary to what several commenters claimed. I discuss this in more detail in a footnote.[12]
Optimality of the Solomonoff prior. Assume we remain in the realm of lower semicomputable semiprobability mass functions , of which is an example, and consider mixtures There is a sense in which the Solomonoff prior is optimal under all these choices for . Namely, let be an enumeration of all lower semicomputable semiprobability mass functions. Then for each under consideration, there is a with . By the coding theorem (Theorem 4.3.3 in Li and Vitányi[1]), we have
up to the multiplicative constant in the last step. In other words, has "more weight" on all indices than . Consequently, when adapting Theorem 5 to general lower semicomputable mixtures in place of , provides a better bound than (up to an additive constant):
There seems not to be any analogous result for our a priori prior $w$. Indeed, it would be surprising if such a result existed, since this prior is very non-invariant under a change of the universal monotone Turing machine. On the flip side, the fact that $w$ is likely not lower semicomputable means that the Solomonoff prior may not beat $w$ either.
Dependence on fewer choices. The definition of the Solomonoff prior depends on the choice of a universal prefix machine, on top of the choice of $U$ that we already made use of before. The a priori prior $w$ has no such further dependence.
$w$ is platonic, the Solomonoff prior is not. Theorem 4 is very natural: The proof is essentially revealed from the definition of $M$ itself, giving the definition intrinsic justification due to the beautiful result. In contrast, the analogous statement that
only very loosely depends on the definitions of $M$ and the Solomonoff prior at all, meaning we could have chosen a very different prior without many negative consequences. To explain this, I'm stating the proof of the result: $M$ itself is an LSCSM, meaning that $M = \nu_{i_0}$ for an index $i_0$. This results in
which already establishes the inequality (up to a multiplicative constant) in one direction. For the other direction, first note that is lower semicomputable since is lower semicomputable and each is lower semicomputable. Furthermore, we have
This almost establishes as an LSCSM, except that it doesn't satisfy our assumption that . Thus, define to be identical to except that . Then this is an LSCSM and we obtain
which is precisely the desired inequality in the other direction. As we can see, this theorem only depended on the fact that both and are (essentially) LSCSMs which are mixtures of all LSCSMs; the specifics of and are irrelevant, and the same comparison would hold true for any other lower semicomputable mixture of all LSCSMs. It then appears to me that the Solomonoff prior is ultimately arbitrary, and has no a priori justification.
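In placeholder notation, writing $\xi(x) := \sum_i 2^{-K(i)}\,\nu_i(x)$ for the Kolmogorov-weighted mixture and $\xi'$ for its modification with $\xi'(\epsilon) = 1$ (these symbols are mine, chosen to summarize the argument just given), the two directions read:

```latex
\begin{align*}
\xi(x) &= \sum_i 2^{-K(i)}\,\nu_i(x) \;\ge\; 2^{-K(i_0)}\, M(x)
  && \text{since } M = \nu_{i_0} \text{ is itself an LSCSM,} \\
M(x) &= \sum_{\nu \in \mathcal{M}_{\mathrm{sol}}} w(\nu)\,\nu(x) \;\ge\; w(\xi')\,\xi'(x) \;=\; w(\xi')\,\xi(x)
  && \text{for every } x \neq \epsilon .
\end{align*}
```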
$w$ leads to fewer constants. Related to the previous points, the representation of $M$ as a mixture with Solomonoff-prior weights holds only up to a multiplicative constant, whereas Theorem 4 was an exact equality. Theorem 3.8.8 in Hutter et al.[10] claims that with a smart choice of the two universal Turing machines involved, an exact equality can be achieved, but I am unsure if such a choice can be justified "a priori" in a way that feels satisfying.
In this post, I have introduced Solomonoff induction without using K-complexity: A hypothesis for our universe, modeled as a lower semicomputable semimeasure (LSCSM), is considered to be as likely "as it truly is" when sampling uniformly random program codes and piping them through a universal monotone Turing machine. Almost tautologically, this means that the probability of sampling a history with the universal machine equals the probability of sampling the history from an LSCSM after first sampling an LSCSM from the a priori prior (Theorem 4).
I then defined Solomonoff induction as the process of predicting sequence continuations in a universe under the assumption that they were generated from a uniformly random computer program. Applying Theorem 4 then leads to an equivalent description in terms of the a priori prior: To continue a sequence, first form the posterior over hypotheses with Bayesian reasoning, and then use the mixture over the posterior to predict the continuation. Since the prior over hypotheses has a simplicity-bias (hypotheses with a shorter encoding are more likely), this can be considered a formalization of Occam's razor.
In Theorem 5, we saw that this leads to the well-known Solomonoff bound. It shows that the cumulative error of predicting continuations in any computed universe is upper bounded by the negative logarithm of the probability of the universe under the a priori prior, i.e., by a cross-entropy loss. This resolves the historical puzzlements in the introduction and explains how it is possible to predict at all, using just the reasonable assumption that our universe is computed.
I then compared the a priori prior to the Solomonoff prior that is based on K-complexity, and found that the latter has some distinct mathematical advantages: The prior is lower semicomputable, invariant under a change of the universal prefix Turing machine it is based on, and it leads to optimal Solomonoff bounds when compared to other priors that are lower semicomputable (which, however, is not necessarily an advantage against the bound in Theorem 5 since the a priori prior is likely not lower semicomputable). This means that the Solomonoff prior is probably the better mathematical basis for developing further theory.
Yet, the a priori prior seems ultimately more natural, in that it does not depend on an additional choice of a universal prefix machine, and the mixture description of the universal distribution (Theorem 4) is tautologically true — rather than being achieved by a post-hoc analysis involving a further constant multiplicative error, as is the case for the Solomonoff prior. As mentioned to me by Cole Wyeth, perhaps it is possible to unify the two perspectives with a construction akin to a Friedberg numbering by finding a unique encoding of each LSCSM. This might then salvage the coding theorem (aka the correspondence between degeneracy and simplicity, or between alt-complexity and K-complexity) in this setting.
Here I answer some questions that I have frequently anticipated.
Q: In our universe, we don't observe the entire history up until this point in time. Instead, we observe a small part of the universe for a finite amount of time. How is sequence prediction in reality then compatible with Solomonoff induction?
A: The hypothesis in reality is not only a simulator of the universe, but a simulator of your precise observations in our universe. Thus, it contains a description of the universe and a description of your location in time and space. One further complication is that our perceptions are "qualia", which we do not know how to relate to the formalism of symbol strings; see Wei Dai's post of open problems.
Q: Should we try to predict the real world using Solomonoff induction?
A: That's fairly intractable, so we have to use alternatives. The question is then whether it would be a useful target to try to approximate. My guess is that this depends on the timescales: The shorter the history of our observations, the more we should trust the priors ingrained into our skulls by evolution, for they might be better than the Solomonoff prior and more adapted to our actual universe. On large time horizons, perhaps it becomes increasingly useful to actually approximate Solomonoff induction.
Q: Do neural networks do Solomonoff induction in any meaningful sense?
A: There is an interesting analogy between neural networks and Solomonoff induction: Namely, in the same way that our version of Solomonoff induction predicts sequence continuations by piping random noise through a universal Turing machine, a neural network function is sampled at initialization by arranging random parameters in an architecture. If it additionally turns out that gradient descent is akin to Bayesian updating, the analogy might become quite close. If there turns out to be a sufficiently strong connection between probability and K-complexity in the Solomonoff context (as in the coding theorem for a specific case), one might then reason that neural networks also have a bias toward learned functions with short descriptions, perhaps explaining their generalization capabilities. See also here, here, and here.
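To illustrate the "random parameters induce a non-uniform prior over functions" part of this analogy, here is a small self-contained Python experiment (my own toy setup; the architecture and sample sizes are arbitrary choices): it samples random weights for a tiny MLP on 3-bit inputs, thresholds the outputs to get Boolean functions, and tabulates how often each truth table appears. In such experiments the induced distribution is typically far from uniform, with intuitively simple functions (e.g., constant ones) receiving much more mass.

```python
import numpy as np
from collections import Counter

rng = np.random.default_rng(0)
# All 8 binary inputs of a 3-bit Boolean function, as a matrix of shape (8, 3).
inputs = np.array([[(b >> i) & 1 for i in range(3)] for b in range(8)], dtype=float)

def random_boolean_function(hidden: int = 16):
    """Sample a small MLP with Gaussian parameters and threshold its output,
    yielding a Boolean function {0,1}^3 -> {0,1} represented by its truth table."""
    W1 = rng.normal(size=(3, hidden)); b1 = rng.normal(size=hidden)
    W2 = rng.normal(size=hidden);      b2 = rng.normal()
    h = np.tanh(inputs @ W1 + b1)
    return tuple((h @ W2 + b2 > 0).astype(int))

counts = Counter(random_boolean_function() for _ in range(100_000))
for table, count in counts.most_common(5):    # the empirical "a priori prior" over functions
    print(table, count / 100_000)
```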
Q: Is the Solomonoff a priori distribution just a useful model, or do you actually expect our world to be sampled from it? Is it a theory of everything?
A: My intuition is that it's just a useful model. It depends on the universal monotone Turing machine , which feels unnatural. I'm also uncertain whether our universe is really discrete and computed. Perhaps we live in a continuous mathematical structure instead, and time just emerges in our experience? I have not engaged much with the precise overlap and correspondence between mathematical structures and Turing machines and would be happy for someone to say more in the comments.
Q: How is the complexity of a string under a universal monotone Turing machine related to the one under a different Turing machine model, which may be more akin to programming languages in which we typically write code?
A: There are several bounds that closely relate different codelengths to each other, typically up to at most logarithmic error terms. For an example, see Theorem 3.8.7 in Hutter et al.[10]
M. Li and P. Vitányi. An Introduction to Kolmogorov Complexity and Its Applications. Springer Cham, 4th edition, 2019. ISBN 9783030112974., DOI: 10.1007/978-3-030-11298-1.
I found some ambiguity in the literature on whether monotone Turing machines are only partial functions, i.e., only producing an output for some of their inputs, or total functions. I settled on total functions since this makes things conceptually simpler and seems to be the most common view.
Formally, one can define this limit as a supremum over the prefix-relation.
In the literature, one typically encounters the definition $\nu(\epsilon) \leq 1$; However, for our purpose, equality is more natural and will later ensure that all lower semicomputable semimeasures can be represented by a monotone Turing machine.
Confusingly, Li and Vitányi use a definition of LSCSMs where (different from our definition where ), and yet claim that all of them are covered by monotone Turing machines. Perhaps they use a somewhat different definition of monotone Turing machines in which it is allowed for outputs to be entirely undefined (thus, not even the empty string), which means that , whereas our monotone Turing machines are total functions. This paper thinks Li and Vitányi are making a mistake and writes: "It is equivalent to Theorem 4.5.2 in [LV08] (page 301) with a small correction: for any by construction, but may not be 1, so this case must be excluded."
$w$ is actually a proper probability mass function, meaning that $\sum_{\nu} w(\nu) = 1$. The reason is that almost every $z \in \{0,1\}^\infty$ (outside a set of probability zero) is of the form $\bar{d} z'$ for a $d \in \{0,1\}^*$ and $z' \in \{0,1\}^\infty$, with all $z$ with the same prefix $\bar{d}$ having weight $2^{-\ell(\bar{d})}$ together. Thus, $\sum_{\nu} w(\nu) = \sum_{d \in \{0,1\}^*} 2^{-\ell(\bar{d})} = 1$.
The reverse may not necessarily hold: A hypothesis can receive a lot of weight simply by having exponentially many (longer) descriptions, with a priori no guarantee for there to then be a short description. For gaining intuitions on this perspective, I can recommend Nate's post on alt-complexity and the last footnote.[12]
In other words, the theorem assumes that our universe cannot simply end at any finite time, in contrast to semimeasures, which can always stop a sequence continuation. This seems restrictive at first, and would exclude possibilities like a big crunch in which the universe eventually ends. However, I think such scenarios can likely be rescued even in the true measure formalism: For a sequence that ends, simply continue it with zeros to an infinite sequence. With this method, it should be possible to assign a measure to any semimeasure, which hopefully has a similar a priori probability. If this were the case, then we would still be able to predict our actual universe effectively up until the moment when it ends, when predicting it correctly does not matter anymore. I did not check these claims in detail, but perhaps Lemma 1 in this post already contains the answer.
Note that we could have also written this error as the expected value
which makes conceptually clearer that it is an expected cumulative prediction error.
Marcus Hutter and David Quarel and Elliot Catt. An Introduction to Universal Artificial Intelligence. Chapman & Hall, 2024. ISBN: 9781032607023, DOI: 10.1201/9781003460299.
The reader may wonder why we go through the hassle of using a separate universal prefix machine at all. With , another option would be to define as the weight for . The issue with this definition is, perhaps paradoxically, that is then computable, which by Lemma 4.3.1 in Li and Vitányi[1] leads to not being universal, i.e., not dominating all other computable (let alone semicomputable) semimeasures. This, in turn, means that this prior will not lead to the same optimal bounds that we will explain the prior to have.
Nate assumes a universal programming language that receives input codes and turns them into output functions. An example is code for "quicksort" or "bubblesort", both of which lead to the same function that receives a list as input and outputs its sorted version. He then defines the K-complexity of a function as the length of its shortest code, and the alt-complexity as the negative base-2 logarithm of the total weight $2^{-\mathrm{length}}$ summed over all codes producing that function.
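In placeholder notation, with $L$ for the universal programming language, $p$ ranging over program codes, and $\ell(p)$ for code length (Nate's own symbols may differ), the two quantities can be written as:

```latex
\begin{align*}
K_L(f) &:= \min_{p \,:\, L(p) = f} \ell(p), \\
K^{\mathrm{alt}}_L(f) &:= -\log_2 \sum_{p \,:\, L(p) = f} 2^{-\ell(p)}.
\end{align*}
```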
Many commenters claimed that these differ only up to an additive constant, but I believe this to be wrong.
At first sight it seems true, since it seems identical to the coding theorem, which is about the equality (up to an additive constant) of the prefix Kolmogorov complexity $K(x)$ and $-\log_2 m(x)$ for a universal prefix machine, with $x$ being a binary string and $m(x)$ the total weight $2^{-\ell(p)}$ of all programs $p$ that the machine sends to $x$. The proof of the coding theorem crucially uses the lower semicomputability of $m$, which is achieved by dovetailing through all programs $p$ and checking whether the universal prefix machine sends them to $x$, and increasing the estimate of $m(x)$ by $2^{-\ell(p)}$ when this happens. These estimates are crucial to be able to find codewords for $x$ that get progressively shorter and approximate $-\log_2 m(x)$ in length, which then establishes the approximate equality to $K(x)$.
The same proof strategy does not work for Nate's alt-complexity since there can, to my knowledge, not exist an algorithm that checks for any program $p$ and function $f$ whether $L(p) = f$; after all, both sides of this equation are themselves functions that can in principle receive infinitely many inputs, over which we cannot iterate exhaustively (see Rice's theorem). Thus, the lower semicomputability of the alt-complexity weights seems to fail, and I don't know of a method to construct codewords whose lengths eventually approximate the alt-complexity. The coding theorem thus likely fails in this context, unless there is an entirely different proof strategy that succeeds.