An idea I discussed with @Leon Lang that did not make it into the post is to look for a Friedberg numbering of the l.s.c. semimeasures, which could be used to control the weights P_ap. A Friedberg numbering of the recursively enumerable sets / partial recursive functions is (just) an effective numbering without repetitions, which surprisingly enough does exist (though it is not admissible, meaning it cannot be translated to the standard numbering based on reasonable machine descriptions). If such a numbering existed, and U were chosen to implement it, then each l.s.c. semimeasure \nu would get prior weight (P_ap) exactly 2^(-l(p*)), where p* is the shortest program such that U(p*, _) generates the distribution of \nu.
l(p*) is one intuitive (but highly uncomputable) definition of K(\nu), but the usual definition is K(i), where i is the index of \nu (i.e., \nu = \nu_i) in some computable enumeration of l.s.c. semimeasures. Therefore, P_ap would still not match 2^-K(\nu_i) := 2^-K(i), even up to a constant, as Leon was initially hoping. With an additional step, we can make this true. Define a new UTM U' that takes a code q and runs it to obtain an index i, then treats i as an index into the Friedberg numbering and simulates that machine (in other words, U' decompresses a code for U). Now P_ap(\nu) for U' matches 2^-K(\nu) up to a constant by the coding theorem.
Again: this entire argument depends on the unverified guess that there is a Friedberg numbering of l.s.c. semimeasures, which Marcus Hutter once suggested to me (but I think he does not believe it).
I'm confused. Isn't one of the standard justifications for the Solomonoff prior that you can get it without talking about K-complexity, just by assuming a uniform prior over programs of length n on a universal monotone Turing machine and letting n tend to infinity?[1] How is that different from your P_ap? It's got to be different, right, since you say that P_ap is not equivalent to the Solomonoff prior.
See e.g. An Introduction to Universal Artificial Intelligence, pages 145 and 146.
I'm confused. Isn't one of the standard justifications for the Solomonoff prior that you can get it without talking about K-complexity, just by assuming a uniform prior over programs of length n on a universal monotone Turing machine and letting n tend to infinity?
What you describe is not the Solomonoff prior on hypotheses, but the Solomonoff a priori distribution on sequences/histories! This is the distribution I call M in my post. It can then be written as a mixture of LSCSMs, with the weights given either by the Solomonoff prior (involving Kolmogorov complexity) or by the a priori prior from my work. Those priors are not the same.
If they have the same prior on sequences/histories, then in what relevant sense are they not the same prior on hypotheses? If they both sum to M, how can their predictions come to differ?
Well, their induced mixture distributions are the same up to a constant, but the priors on hypotheses are different. I'm not sure if you consider the difference "relevant", perhaps you only care about the induced mixture distribution?
To make a simple example: Assume there were only three Turing machines T0, T1, and T2. Assume that T0 and T1 are arbitrary, and that T2 first reads one bit of its input and then simulates T0 or T1, respectively. Let ν0, ν1, and ν2 be the LSCSMs induced by T0, T1, and T2. Notice that ν2 is a mixture of ν0 and ν1: ν2 = ½ν0 + ½ν1.
Let ξ be the mixture distribution given as ξ := ⅓(ν0 + ν1 + ν2). Then clearly, ξ is also represented as ξ = ½ν0 + ½ν1. My viewpoint is that the prior distribution giving weight ⅓ to each of the three hypotheses is different from the one giving weight ½ to each of ν0 and ν1, even if their mixture distributions are exactly the same.
And this is exactly the situation we're in with the true mixture distribution M from the post. Some of the LSCSMs in the mixture are given by a separate universal monotone Turing machine, which means that such an LSCSM is itself a mixture of all LSCSMs. Any such mixtures among the LSCSMs allow us to redistribute the prior weight from this LSCSM to all others, without affecting the mixture in any way.
This is also related to what makes a prior based on Kolmogorov complexity ultimately so arbitrary: we could have chosen just about anything and it would still essentially sum to M. A posteriori, the Kolmogorov complexity then has some mathematical advantages, as outlined in the post.
My viewpoint is that the prior distribution giving weight ⅓ to each of the three hypotheses is different from the one giving weight ½ to each of ν0 and ν1, even if their mixture distributions are exactly the same.
That's pretty unintuitive to me. What does it matter whether we happen to write out our belief state one way or the other? So long as the predictions come out the same, what we do and don't choose to call our 'hypotheses' doesn't seem particularly relevant for anything?
We made our choice when we settled on M as the prior. Everything past that point just seems like different choices of notation to me? If our induction procedure turned out to be wrong or suboptimal, it'd be because M was a bad prior to pick, not because we happened to write M down in a weird way, right?
I answered in the parallel thread, which is probably going down to the crux now. To add a few more points:
It's also useful to emphasize why, even if the mixtures are the same, having different priors can make a practical difference. E.g., imagine that in the example above we had one prior giving 100% weight to ν2, and another prior giving 50% weight to each of ν0 and ν1. They give the same mixture, but the first prior can't update, and the second prior can!
... Wait, are you saying we're not propagating updates into ν2 to change the mass it puts on inputs for ν0 vs. ν1?
Okay, I think I overstated the extent to which the difference in priors matters in the previous comments and crossed out "practical".
Basically, I was right that the prior that gives 100% on ν2 cannot update: it gives all its weight to ν2 no matter how much data comes in. However, ν2 itself can update with more data and shift between ν0 and ν1.
I can see that this feels perhaps very syntactic, but in my mind the two priors still feel different. One of them is saying "The world first samples a bit indicating whether the world will continue with world 0 or world 1", and the other one is saying "I am uncertain on whether we live in world 0 or world 1".
The difference is not a “practical” one as long as you only use the posterior predictive distribution, but in some AIXI variants (KSA, certain safety proposals) the posterior weights themselves are accessed and the form may matter. Arguably this is a defect of those variants.
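To make the corrected claim concrete, here is a minimal sketch, assuming two deterministic worlds ν0 (all zeros) and ν1 (all ones) and their 50/50 mixture ν2 treated as a single hypothesis (the names and setup are illustrative, not from the post):

```python
from fractions import Fraction

# Two deterministic hypotheses over binary histories:
# nu0 generates 000..., nu1 generates 111...
def nu0(x): return Fraction(1) if set(x) <= {"0"} else Fraction(0)
def nu1(x): return Fraction(1) if set(x) <= {"1"} else Fraction(0)

# nu2 is the 50/50 mixture of nu0 and nu1, treated as ONE hypothesis.
def nu2(x): return (nu0(x) + nu1(x)) / 2

def posterior(prior, x):
    """Bayesian posterior over hypotheses after observing history x."""
    weights = {name: w * nu(x) for name, (w, nu) in prior.items()}
    total = sum(weights.values())
    return {name: w / total for name, w in weights.items()}

def predict(prior, x, bit):
    """Mixture probability that the next bit after history x is `bit`."""
    num = sum(w * nu(x + bit) for w, nu in prior.values())
    den = sum(w * nu(x) for w, nu in prior.values())
    return num / den

prior_A = {"nu2": (Fraction(1), nu2)}                  # 100% on the mixture
prior_B = {"nu0": (Fraction(1, 2), nu0), "nu1": (Fraction(1, 2), nu1)}

x = "000"
# The posteriors over hypotheses differ: prior_A stays put...
assert posterior(prior_A, x) == {"nu2": Fraction(1)}
# ...while prior_B concentrates on nu0.
assert posterior(prior_B, x) == {"nu0": Fraction(1), "nu1": Fraction(0)}
# But the predictive distributions agree: both are now certain of 0.
assert predict(prior_A, x, "0") == predict(prior_B, x, "0") == Fraction(1)
```

prior_A never moves weight between hypotheses; all the "updating" happens inside ν2's conditionals, which is exactly the distinction discussed above.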
Might be worth more explicitly noting in the post that P_sol and P_ap in fact define the same semimeasure over strings (up to a multiplicative factor). From a skim, I was confused about this point: "wait, is he saying that not only are alt-complexity and K-complexity different, but that they even define different probability distributions? That seems to contradict the universality of P_sol, doesn't it...?"
Good idea, I now added the following to the opening paragraphs of the section doing the comparisons:
Importantly, due to Theorem 4, this means that the Solomonoff prior and the a priori prior lead, up to a constant, to the same predictions on sequences. The advantages of the priors that we analyze are thus not statements about their induced predictive distributions.
Now, let P_u be the uniform distribution on B∞, which samples infinite binary sequences one bit at a time, each with probability 50% to be 0 or 1.
P_u as defined here can't be a proper/classical probability distribution over B∞ because it assigns zero probability to every individual sequence s∈B∞: P_u({s}) = 0.
Or am I missing something?
It’s a continuous probability measure, meaning it has no atoms, but it does assign positive probability to all cylinder sets. If you take the binary representation of reals in the [0,1] interval, P_u comes from the Lebesgue measure (a uniform distribution).
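A minimal numerical illustration of this point (cylinder probabilities are 2^{-l(x)} because the bits are i.i.d. fair coin flips):

```python
# P_u assigns to the cylinder set {s : x is a prefix of s} the probability
# 2^{-l(x)}, since each of the l(x) bits must come out right independently.
def cylinder_prob(x):
    return 2.0 ** (-len(x))

assert cylinder_prob("") == 1.0       # the whole space B^infinity
assert cylinder_prob("010") == 0.125  # positive for every finite prefix
# No atoms: pinning down ever longer prefixes of a single sequence
# drives the probability to 0, so P_u({s}) = 0 for each fixed s.
probs = [cylinder_prob("0" * n) for n in range(30)]
assert all(a == 2 * b for a, b in zip(probs, probs[1:]))
```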
The coding theorem thus likely fails in this context, unless there is an entirely different proof strategy that succeeds.
Indeed, alt-complexity and K-complexity are known to differ asymptotically in the infinite string case.
Yes. There are lots of different settings one could consider, e.g.:
For all of these cases, one can compare different notions of complexity (plain K-complexity, prefix complexity, monotone complexity, if applicable) with algorithmic probability. My sense is that the correspondence is only exact for universal prefix machines and finite strings, but I didn't consider all settings.
I think the proof in the Gács paper can be adapted to LSCSMs and functions, but I haven't checked super carefully.
This post has been written in collaboration with Iliad in service of one of Iliad's longer-term goals of understanding the simplicity bias of learning machines.
Solomonoff induction is a general theoretical ideal for how to predict sequences that were generated by a computable process. It posits that in order to predict the continuation of our world, we should weight hypotheses by their simplicity ("Occam's razor") and update the weighting using Bayesian reasoning.
In this post, I introduce Solomonoff induction at a technical level, hoping to be more elegant and satisfying than introductions you might find elsewhere. In particular, the introduction avoids Kolmogorov complexity: instead of making predictions with the Solomonoff prior probability of hypotheses — given as two to the power of minus the length of their shortest encoding, i.e., their Kolmogorov complexity — we directly use the a priori probability that a hypothesis is generated from program code sampled uniformly at random. This is closely related to the alt-complexity sketched by Nate Soares before. The approach leads to more elegant results that are simpler to prove than in typical expositions. Yet, since only the prior over hypotheses is changed, the introduction remains compatible with other expositions and should be broadly valuable to many readers.
Later, we will see that this choice has both mathematical advantages and disadvantages compared to an approach based on Kolmogorov complexity, and the latter is likely a more powerful basis on which to build further theory. These drawbacks notwithstanding, I hope that this introduction is philosophically satisfying, by touching on questions like "Why should we expect our world to be learnable or predictable in the first place?" and "What is a universally good way to predict?", and on connections to Occam's razor and why one should favor simple hypotheses when predicting the world. In the FAQ at the end, I will touch on several anticipated questions, including briefly on how the theory might be relevant for neural networks.
Audience. I try not to strictly assume much prior knowledge of Turing machines or related concepts, but prior exposure will still be very useful. Some amount of mathematical or scientific maturity can probably replace exposure to these topics.
Acknowledgments. This post benefitted most from discussions with Alexander Oldenziel, and it was caused by his inquiry about the potential simplicity of learned neural network functions based on analogies with algorithmic information theory (AIT). This made me want to think more carefully about a "degeneracy-first" perspective on AIT itself, from which this post emerged. I'm also grateful to Cole Wyeth for discussions and his feedback on an earlier draft that led to some changes. He also explained to me the argument for why the Solomonoff prior gives the optimal prediction bound among all lower semicomputable priors. Thanks also to Matt Dellago for discussions on the content of this post, and to CEEALAR for hosting me when I finished writing this post.
Introduction
If we make a sequence of observations in the world, e.g., a sequence of numbers
1, 1, 2, 3, 5, 8, 13, … or a sequence of physical measurements, we are often able to continue the sequence in a way that coincides with its true continuation "in the wild". This seems to broadly work in much of the real world, meaning the world is learnable — we can learn from past experience to predict the future.
Why is that? This question has puzzled humanity for thousands of years. For example, Sextus Empiricus expressed that any rule inferred from a finite amount of data may always be contradicted by more data, and that it is impossible to take into account all data:
René Descartes noticed that our minds are so fallible that we cannot even be sure of being awake:
And in the following passage, David Hume seems to imply that while we can make future inferences about cause and effect from analogizing about past patterns, there is no principle that can be discovered in nature itself that would justify doing so:
Similar reasoning is expressed by the no free lunch theorems, which roughly state that, a priori, no learning algorithm should perform better than any other.
We conclude that without placing any assumptions on the world, learning and prediction should indeed be impossible: If the world were truly (uniformly) random, then any way to use past data to predict the future would indeed be doomed. So why is it possible to learn?
One pre-formal answer is Occam's razor, which exists in various formulations and states that among all explanations that are compatible with our observations, we should choose the one making the fewest assumptions. There are two quotes often attributed to Occam to this effect:
Occam's razor is a useful starting point, but it is not formal. Additionally, it would be nice to be able to justify Occam's razor from some minimal assumption that does not simply assume that we should focus on simple hypotheses in our reasoning. How can one do this?
In this post, I motivate the answer by starting with the minimal assumption that our world is computed. Not knowing what it is computed by, we assume that we are generated by running a uniformly randomly sampled program code through a universal Turing machine (Section link).
A consequence of this assumption will be that we can also think of our world as being sampled in a two-stage process: first by sampling a semicomputable (perhaps probabilistic) universe from a prior distribution, and then by generating our history from said universe. As it turns out, the prior on universes — which I call the a priori prior — will automatically have a simplicity-weighting built in, thus justifying Occam's razor (Section link).
Repurposing this two-stage distribution for making predictions then leads to Solomonoff induction, which has remarkable predictive power no matter what computation truly underlies our universe. That is, even if the program code underlying our true universe was in fact not sampled randomly, but chosen adversarially to make our prediction task hard, we would eventually, for a sufficiently long observed history, converge to perfect predictions (Section link).
I will then discuss drawbacks of my choice of an a priori prior. In particular, the weights are likely not lower semicomputable, and the a priori prior is not invariant up to a constant under a change of the universal Turing machine, unlike the classical Kolmogorov-based Solomonoff prior. Additionally, the Solomonoff prior satisfies an optimality property for sequence prediction that does not exist in similar form for the a priori prior. This also means that my choice of prior — which can be regarded as being based on the probability, or degeneracy, of hypotheses — is meaningfully different from an approach based on description length (Section link).
This shows that the maxim that description length equals degeneracy is not universal in algorithmic information theory. Relatedly, in a footnote I discuss the relationship to Nate Soares's alt-complexity and explain that many commenters under that post were probably wrong about K-complexity and alt-complexity being identical up to a constant. These problems notwithstanding, I do think my choice of prior is a priori easier to justify, depends on fewer choices, and leads to some results being more elegant.
I then conclude and anticipate questions in my FAQ, focusing briefly on how to think about Solomonoff induction in the context of agents observing only part of the universe's history, on practical usefulness, on the potential applicability of Solomonoff induction as a theoretical ideal for understanding neural networks, on the question of whether Solomonoff induction can be considered a theory of everything, and on the sensitivity of codelength to the choice of Turing machine model. The answers are very preliminary, and much more could be said or studied about them.
On being generated by a random computer program
Observing our world, it seems like the future is determined by the past with precise rules. We capture this by the minimal assumption that our world is computed. Without a priori knowledge of the program that generated us, we simply assume that our history emerged by sampling a uniformly random computer program and piping it through a universal programming language that outputs sequences.
In this section I formalize what this means. Let B = {0,1} be our alphabet of bits. B∗ = ⋃_{n≥0} Bn is the set of binary strings of arbitrary finite length (where the length-0 string ϵ is explicitly allowed!). B∞ = {0,1}∞ is the set of infinite binary sequences. And B≤∞ = B∗ ∪ B∞ is the set of finite or infinite binary sequences. For two sequences x∈B∗, y∈B≤∞, we write xy for their concatenation. x is then said to be a prefix of xy, written x ⪯ xy.
Monotone Turing machines
How do we capture that we're "computed"? One idea is to assume that we are generated one "frame" at a time by a program that never changes past frames. This leads to the formalization of a monotone Turing machine T in Definition 4.5.3 of Li and Vitányi.[1] Formally, a monotone Turing machine is a computable function T:B∗→B≤∞ with the monotonicity property: For all p,q∈B∗ with p⪯q, we have T(p)⪯T(q).[2] There is a lot to unpack here, and I hope that the following semi-formal explanations give enough intuitions to read the rest of the post:
Universal monotone Turing machines
Not all monotone Turing machines T are equally interesting. For example, imagine a machine T which simply outputs an empty sequence on any input: This satisfies all properties stated above, but it is very boring.
The other extreme is a universal monotone Turing machine, which can simulate all others. It does this by operating on codes that first describe an arbitrary monotone Turing machine using a binary code p, followed by the input q to said machine. One tricky detail is that we cannot simply concatenate p and q to pq since then we may forget where the description of p ends and the input q starts.
A solution is to use the following prefix-free encoding of descriptions p∈B∗:
p′ := 1^{l(bin(l(p)))} 0 bin(l(p)) p ∈ B∗.

Let's unpack this: l(p) is the length of p. bin(n), for a natural number n, is an encoding of n as a binary sequence, usually by ordering binary sequences by length and then lexicographically. p′ then first contains a sequence of 1's followed by a zero. The number of 1's is precisely the length of bin(l(p)). After reading that, one knows how many of the following bits encode l(p). After reading those bits, one knows the length of p, and so after reading that many further bits, one knows p. Thus, it is algorithmically possible to disentangle a binary sequence of the form p′q into p and q, since p′ itself tells you when p will end.
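As a sanity check of this scheme, here is a sketch in Python; `nat_to_str` is my name for the length-then-lexicographic bijection bin:

```python
def nat_to_str(n):
    # Bijection N -> binary strings, ordered by length then lexicographically:
    # 0 -> "", 1 -> "0", 2 -> "1", 3 -> "00", ...
    return format(n + 1, "b")[1:]

def encode(p):
    """Prefix-free code p' = 1^{l(bin(l(p)))} 0 bin(l(p)) p."""
    b = nat_to_str(len(p))
    return "1" * len(b) + "0" + b + p

def decode(s):
    """Split s = p'q back into (p, q), inverting `encode` on the p part."""
    k = s.index("0")              # number of leading 1s = l(bin(l(p)))
    b = s[k + 1 : k + 1 + k]      # the bits of bin(l(p))
    lp = int("1" + b, 2) - 1      # invert nat_to_str to recover l(p)
    start = k + 1 + k
    return s[start : start + lp], s[start + lp :]

p, q = "10110", "001"
assert decode(encode(p) + q) == (p, q)   # p' tells us where p ends
```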
Then, a universal monotone Turing machine U:B∗→B≤∞ only produces non-trivial outputs for inputs of the form p′q. Additionally, for every p∈B∗, Tp:q↦Tp(q):=U(p′q) is itself a monotone Turing machine, and every monotone Turing machine T is of the form T=Tp for at least one p∈B∗. Note that U will never produce output bits before reading q since p′ only encodes a monotone Turing machine T. Thus, for any prefix r of p′, U(r) is simply the empty string.
This is a nice definition, but does such a universal machine exist? The answer is yes, and it is proved by enumerating the set of monotone Turing machines in a computable way, which Li and Vitányi[1] state to be "clearly" possible right before Definition 4.5.5 — without giving a proof. Intuitively, this is believable by thinking about universal programming languages: any "monotone Turing machine" can be implemented by a finite code in any universal programming language like Python.
I fix a universal monotone Turing machine U for the rest of the post. In a later section we will see that the theory is quite sensitive to the choice of this machine, but for now, we will not worry about this.
The universal distribution
A monotone Turing machine T:B∗→B≤∞ induces another function (denoted, by abuse of notation, with the same symbol) T:B∞→B≤∞ whose output on an infinite input is defined to be the limit of the outputs of finite prefixes:[3]
∀s∈B∞:  T(s) := lim_{p∈B∗, p⪯s} T(p).

Now, let Pu be the uniform distribution on B∞, which samples infinite binary sequences one bit at a time, each with probability 50% to be 0 or 1. Let s∈B∞ be a random input sampled from Pu, and let x=U(s)∈B≤∞ be its output (possibly infinite). For finite x∈B∗, let M(x) be the resulting probability that a history starts with x. Formally:
Definition 1 (Universal a priori distribution). Let U be a universal monotone Turing machine and Pu be the uniform distribution on infinite binary codes. The universal a priori probability that a history starts with x∈B∗ is defined as
M(x) := Pu({s∈B∞ ∣ x⪯U(s)}).

I invite you to pause for a moment to consider what a profound perspective shift this is compared to the historical puzzlement I emphasized in the introduction: when incorrectly assuming that our history x is sampled uniformly, we cannot predict the future, as the skeptical views expressed by Empiricus, Descartes, and Hume make clear. However, the universal a priori distribution M does not assume a uniformly sampled history. Instead, it assumes a uniformly sampled code, from which the history is generated with the fixed, deterministic universal monotone Turing machine U. In other words, we take our universe to have a uniformly sampled description instead of a uniformly sampled history.
Should we expect any regularities in the histories x sampled from M? And if so, does this regularity give us a hint about how to predict sequence continuations in our actual universe? These questions are answered in the next two sections.
M is a universal mixture distribution
We now state and prove an equivalent description of the universal a priori distribution M, which will show that M has some simplicity-weighting. Intuitively, it will turn out that sampling histories from M is as if you were to first sample "probabilistic physical laws" in a way that obeys Occam's razor, and then use those laws to sample the history.
Lower semicomputable semimeasures
We formalize "probabilistic physical laws" by the notion of a semimeasure:
Definition 2 ((Lower semicomputable) semimeasure). A semimeasure is a function ν:B∗→[0,1] with ν(ϵ)=1[4] and ν(x) ≥ ν(x0)+ν(x1) for all x∈B∗, where ϵ∈B∗ is the empty binary string. ν is called lower semicomputable if it can be computably approximated from below: this means there is an algorithm which, on input x, produces progressively better approximations of ν(x) from below. I abbreviate "lower semicomputable semimeasure" as LSCSM from now on.
For an LSCSM ν, ν(x) is interpreted as the probability, under ν, that an infinite sequence starts with x. We can have ν(x) > ν(x0)+ν(x1) since the sequence may not continue after starting with x. This is akin to the possibility that the universe can end at any moment.
As an aside, LSCSMs also include deterministic sequences: let s∈B∞ be a fixed sequence that can be generated by a computer program. Define ν(x)=1 for all prefixes x of s, and ν(x)=0 otherwise. Then for a given x, we can compute ν(x) by computing s and determining whether x is a prefix. This shows that ν is an LSCSM that generates the sequence s.
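As a sketch of this construction (the generator and helper names are mine):

```python
def make_deterministic_lscsm(gen):
    """LSCSM for the computable sequence whose t-th bit is gen(t):
    nu(x) = 1 if x is a prefix of the sequence, else 0."""
    def nu(x):
        return 1 if all(x[t] == gen(t) for t in range(len(x))) else 0
    return nu

# Example: the alternating sequence 010101...
nu = make_deterministic_lscsm(lambda t: "01"[t % 2])
assert nu("") == 1 and nu("0101") == 1 and nu("011") == 0
# The semimeasure inequality holds with equality along the sequence:
assert nu("01") == nu("010") + nu("011")
```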
How can we obtain LSCSMs? Well, for one, we already know one instance: M is a semimeasure since
M(ϵ) = Pu({s ∣ ϵ⪯U(s)}) = Pu(B∞) = 1,

and for every x∈B∗:
{s ∣ x⪯U(s)} = {s ∣ x0⪯U(s)} ˙∪ {s ∣ x1⪯U(s)} ˙∪ {s ∣ x=U(s)},

leading to M(x) ≥ M(x0)+M(x1). M is also lower semicomputable. Indeed, to approximate M(x) from below, simply enumerate all finite binary sequences p, compute U(p) over all p in parallel, and record whether x⪯U(p). If this is the case, then for all infinite extensions s of p, we have x⪯U(s), and they together have probability 2−l(p), with l(p) the length of p. Thus, we can add 2−l(p) to our estimate of M(x) and then skip any extensions of p in our algorithmic procedure. The estimate then progressively approximates M(x) since for every s with x⪯U(s), there is a finite prefix p⪯s with x⪯U(p), meaning we will find the contribution eventually in our approximation.
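The enumeration procedure just described can be sketched in code. As an assumption, a trivial monotone machine (the identity) stands in for U; the skipping logic is the same for any monotone machine:

```python
from itertools import product

def T(p):
    # Toy monotone machine: the identity, T(p) = p. (Assumption: a stand-in
    # for U, chosen so the exact answer is known in closed form.)
    return p

def approx_nu(T, x, max_len):
    """Lower approximation of nu_T(x) = P_u({s : x is a prefix of T(s)}),
    enumerating programs by increasing length, adding 2^{-l(p)} for each
    minimal p with x a prefix of T(p), and skipping extensions of p."""
    total = 0.0
    counted = []   # minimal programs already accounted for
    for n in range(max_len + 1):
        for bits in product("01", repeat=n):
            p = "".join(bits)
            if any(p.startswith(c) for c in counted):
                continue   # covered by a shorter program's cylinder
            if T(p).startswith(x):
                counted.append(p)
                total += 2.0 ** (-len(p))
    return total

# For the identity machine, nu_T(x) = 2^{-l(x)} exactly, and the
# approximation reaches it once programs of length l(x) are enumerated.
assert approx_nu(T, "010", 5) == 2.0 ** -3
```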
What other ways are there to obtain LSCSMs? Well, you may have noticed that I did not use the universality of U in the explanation above. In other words, every monotone Turing machine T gives rise to an LSCSM νT by piping uniformly random noise through T:
νT(x) := Pu({s∈B∞ ∣ x⪯T(s)}).

And this is all: all LSCSMs emerge from a monotone Turing machine in this way, see Theorem 4.5.2 in Li and Vitányi.[1][5]
The mixture equivalence
Now, for p∈B∗, define the function Tp:B∗→B≤∞ by
Tp(q) := U(p′q),

which is itself a monotone Turing machine. Define the LSCSM νp := νTp.
Definition 3 (A priori prior). The a priori prior is given for any LSCSM ν by
Pap(ν) := ∑_{p∈B∗: νp=ν} 2−l(p′),

which is the probability[6] that a uniformly random infinite sequence s∈B∞ starts with an encoding p′ that gives rise to the LSCSM νp=ν.
The reason I call this the "a priori prior" is that it simply emerged from our definitions without further assumptions, which distinguishes it from the Kolmogorov-based Solomonoff prior that I will investigate later (Section link).
Finally, let Msol be the set of all LSCSMs, where "sol" is shorthand for "Solomonoff". We obtain:
Theorem 4 (Mixture Equivalence). The universal distribution M is a universal mixture over all LSCSMs with weights Pap(ν):
M(x) = ∑_{ν∈Msol} Pap(ν) ν(x).

Proof. Since U only has outputs for inputs of the form p′s for p∈B∗, we obtain
{s∈B∞ ∣ x⪯U(s)} = ˙⋃_{p∈B∗} {p′}×{s∈B∞ ∣ x⪯U(p′s)=Tp(s)}.

We obtain
M(x) = Pu({s∈B∞ ∣ x⪯U(s)})
= Pu(˙⋃_{p∈B∗} {p′}×{s∈B∞ ∣ x⪯U(p′s)=Tp(s)})
= ∑_{p∈B∗} Pu(p′) ⋅ Pu({s∈B∞ ∣ x⪯Tp(s)})
= ∑_{p∈B∗} 2−l(p′) ⋅ νp(x)
= ∑_{ν∈Msol} (∑_{p∈B∗: νp=ν} 2−l(p′)) ⋅ ν(x)
= ∑_{ν∈Msol} Pap(ν) ⋅ ν(x).

That proves the claim. □
This means that if our history x was sampled from M, we can also think of it as being sampled by first sampling an LSCSM ν with probability Pap(ν), and then sampling the history successively from ν.
If an LSCSM ν has an encoding ν=νp, then Pap(ν)≥2−l(p′). This gives a simplicity bias: Universes with a simple (i.e., short) description p are more likely under this a priori prior. This can be interpreted as a form of Occam's razor, in that hypotheses with a short description receive exponentially more weight.[7]
Solomonoff induction
So far, I have started with the assumption that our universe is the result of sampling a uniformly random program, and arrived at the conclusion that this is equivalent to first sampling an LSCSM with a prior that gives more weight to simple hypotheses, and then sampling our history from it. Thus, under our assumption, we should expect that the world is more predictable than sceptical paradoxes suggest: we can concentrate our reasoning on simple universes and should have a chance to predict the future better than chance. We will make this case in mathematical detail in this section.
Universal sequence prediction
How should we make predictions in light of these thoughts? The answer: we predict using the Solomonoff mixture distribution M(x) = ∑_{ν∈Msol} Pap(ν) ν(x) itself, predicting sequence continuations via familiar conditional probabilities M(xt∣x<t) := M(x≤t)/M(x<t) upon observing an initial history x<t = x1,…,xt−1 ∈ Bt−1. This is the most parsimonious way to predict, given that our only a priori assumption is that we emerged from a uniformly randomly sampled program.
We now explain in more detail how to predict using M by relating the conditionals to Bayesian updating. For any LSCSM ν, define the conditional probability ν(x∣y):=ν(yx)/ν(y). We obtain:
M(xt∣x<t) = M(x≤t)/M(x<t)
= ∑_{ν∈Msol} Pap(ν) ν(x≤t) / ∑_{ν′∈Msol} Pap(ν′) ν′(x<t)
= ∑_{ν∈Msol} [Pap(ν) ν(x<t) / ∑_{ν′∈Msol} Pap(ν′) ν′(x<t)] ⋅ ν(xt∣x<t)
=: ∑_{ν∈Msol} Pap(ν∣x<t) ⋅ ν(xt∣x<t),

where I defined Pap(ν∣x<t) in the suggested way. This is actually a Bayesian posterior in the conventional sense; one can see this more clearly notationally by writing ν(x<t) = P(x<t∣ν), the likelihood of history x<t conditional on the hypothesis ν. Then the inner formula becomes:
Pap(ν∣x<t) = Pap(ν) ⋅ P(x<t∣ν) / ∑_{ν′} Pap(ν′) ⋅ P(x<t∣ν′) = Pap(ν) ⋅ P(x<t∣ν) / P(x<t).

This is the classical Bayes theorem.
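The Bayesian rewriting above can be checked numerically in a finite toy setting. A minimal sketch, under the assumption that the hypothesis class is just three Bernoulli measures rather than all of Msol (all names are mine):

```python
from fractions import Fraction

# Finite stand-in for Msol (assumption: three i.i.d. Bernoulli measures;
# the real mixture runs over all LSCSMs).
thetas = [Fraction(1, 4), Fraction(1, 2), Fraction(3, 4)]
prior = [Fraction(1, 3)] * 3

def nu(theta, x):
    """Bernoulli(theta) probability that a sequence starts with x."""
    out = Fraction(1)
    for bit in x:
        out *= theta if bit == "1" else 1 - theta
    return out

def M(x):
    return sum(w * nu(t, x) for w, t in zip(prior, thetas))

history, next_bit = "1101", "1"

# Left-hand side: the conditional of the mixture.
lhs = M(history + next_bit) / M(history)

# Right-hand side: posterior-weighted conditionals of the components.
post = [w * nu(t, history) / M(history) for w, t in zip(prior, thetas)]
rhs = sum(p * nu(t, history + next_bit) / nu(t, history)
          for p, t in zip(post, thetas))

assert sum(post) == 1      # the posterior is a probability distribution
assert lhs == rhs          # M's conditional equals the posterior mixture
```

Exact rational arithmetic makes the identity hold with equality, not just approximately.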
Thus, we arrive at a very satisfying description of Solomonoff induction:
This is very appealing:
One often-noted drawback is that it is incomputable since we can't precisely determine the posterior Pap(ν∣x<t) or the conditional ν(xt∣x<t). However:
The Solomonoff bound
None of this would be useful if it did not result in actual predictability of the world. After all, our question was why our actual universe is predictable. Fortunately, it turns out that Solomonoff induction is an exquisite predictor under the reasonable assumption that our own universe is a lower semicomputable measure, instead of just a semimeasure.[8]
Theorem 5 (Solomonoff Bound). Let μ be any lower semicomputable measure on infinite sequences, and assume it generates our actual universe. Define the cumulative expected prediction error of predicting sequences sampled by μ via M as:
Sμ∞ := ∑_{t=1}^∞ ∑_{x<t∈B∗} μ(x<t) ∑_{xt∈B} (M(xt∣x<t) − μ(xt∣x<t))².

Then we have
Sμ∞ ≤ −ln Pap(μ).

In other words, the prediction error is upper-bounded by the cross entropy between the Dirac measure on our true universe and our a priori prior probability Pap(μ). In particular, for increasingly long histories, the total remaining prediction error goes to zero.[9]
Proof. We have (with explanations after the computation):
Sμ∞ = ∑_{t=1}^∞ ∑_{x<t∈B∗} μ(x<t) ∑_{xt∈B} (M(xt∣x<t) − μ(xt∣x<t))²
≤ ∑_{t=1}^∞ ∑_{x<t∈B∗} μ(x<t) ∑_{xt∈B} μ(xt∣x<t) ln [μ(xt∣x<t) / M(xt∣x<t)]
= lim_{n→∞} ∑_{t=1}^n ∑_{x<t} μ(x<t) DKL(μ(Xt∣x<t) ∥ M(Xt∣x<t))
= lim_{n→∞} ∑_{t=1}^n DKL(μ(Xt∣X<t) ∥ M(Xt∣X<t))
= lim_{n→∞} DKL(μ(X1,…,Xn) ∥ M(X1,…,Xn))
= lim_{n→∞} ∑_{x≤n} μ(x≤n) ln [μ(x≤n) / M(x≤n)].

Step 1 is the definition. Step 2 is a standard inequality that holds whenever μ is a measure, see Corollary 2.8.8 in Hutter et al.[10] (whereas M is allowed to be only a semimeasure). Step 3 uses the definition of the KL divergence and changes the infinite series to a limit of partial sums. Step 4 substitutes in the definition of the conditional KL divergence.
Step 5 is a crucial step. Similar to how Shannon entropy H satisfies the well-known chain rule H(X)+H(Y∣X)=H(X,Y), KL divergence also satisfies the chain rule DKL(P(X) ∥ Q(X)) + DKL(P(Y∣X) ∥ Q(Y∣X)) = DKL(P(X,Y) ∥ Q(X,Y)). Intuitively, the divergence of P and Q measured in the variable X, plus the divergence in Y once X is already known, equals the total divergence of P and Q over (X,Y). Telescoping this chain rule from 1 to n gives precisely the result of step 5. A full proof of this telescoping property can be found in Lemma 3.2.4 in Hutter et al.[10] Step 6 is just the definition of the KL divergence again.
Now, note that by Theorem 4, we have
M(x≤n) = ∑_{ν∈Msol} Pap(ν) ν(x≤n) ≥ Pap(μ) μ(x≤n).

With this inequality, we obtain
Sμ∞ ≤ lim_{n→∞} ∑_{x≤n} μ(x≤n) ln [1/Pap(μ)] = −ln Pap(μ) ⋅ lim_{n→∞} ∑_{x≤n} μ(x≤n) = −ln Pap(μ).

That finishes the proof. □
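The bound can be sanity-checked in a finite analogue. A sketch under the assumption that the mixture runs over just three Bernoulli measures rather than all LSCSMs; the derivation above only uses M(x) ≥ Pap(μ)·μ(x), which this toy mixture satisfies with weight 1/3:

```python
import math
from itertools import product

# Finite stand-in for the mixture (assumption: three Bernoulli measures
# with uniform prior weight 1/3 each).
thetas = [0.25, 0.5, 0.75]
prior = dict.fromkeys(thetas, 1 / 3)

def nu(theta, x):
    p = 1.0
    for bit in x:
        p *= theta if bit == "1" else 1 - theta
    return p

def M(x):
    return sum(w * nu(t, x) for t, w in prior.items())

mu_theta = 0.75                      # the "true universe"
def mu(x): return nu(mu_theta, x)

def cumulative_error(n):
    """S_n = sum_{t<=n} E_{x_<t ~ mu} sum_b (M(b|x_<t) - mu(b|x_<t))^2."""
    S = 0.0
    for t in range(1, n + 1):
        for bits in product("01", repeat=t - 1):
            x = "".join(bits)
            err = sum((M(x + b) / M(x) - mu(x + b) / mu(x)) ** 2
                      for b in "01")
            S += mu(x) * err
    return S

bound = -math.log(prior[mu_theta])   # -ln Pap(mu) = ln 3 here
for n in (1, 4, 8):
    assert cumulative_error(n) <= bound
```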
Thus, adding to our earlier investigations, not only do we have reason to believe our universe is predictable, there also is an amazing predictor as long as our universe is generated from a lower semicomputable measure; this resolves the skeptical paradoxes from the introduction. Furthermore, this predictor is given by a blank-slate probability distribution M that assumes no prior knowledge of what our universe is. While the prediction error is lower for universes μ that are more likely under Pap (which includes universes with a short description), I note that it goes to zero for long histories regardless of μ.
Comparison to Solomonoff induction with K-complexity
This section compares our construction of Solomonoff induction to the more familiar construction where the prior Pap is replaced by the Solomonoff prior Psol. If you are not interested in this section, skip ahead to the conclusion. This section assumes more prior knowledge than previous sections.
Assume we have a computable enumeration ν1,ν2,… of LSCSMs. We can, for example, obtain this by mapping a natural number i to its binary representation — a code p — and then defining νi:=νp as above. Furthermore, let Upr be a separate universal prefix Turing machine, which is not directly related to the universal monotone Turing machine U. Define the Kolmogorov complexity K(i) of a natural number i as the length of the shortest input to Upr which computes a binary encoding of i. The Solomonoff prior is then given by
$$P_{\mathrm{sol}}(i) := 2^{-K(i)}.$$

One can then prove that, approximately (i.e., up to at most a constant multiplicative error), we have
$$M(x) \approx \sum_{i\in\mathbb{N}} P_{\mathrm{sol}}(i)\,\nu_i(x);$$

see Li and Vitányi,[1] Theorem 4.5.3. I will prove this statement at the start of the second subsection to reveal an advantage of Pap.[11] Importantly, due to Theorem 4, this means that the Solomonoff prior Psol and the a priori prior Pap lead, up to a constant, to the same predictions on sequences. The advantages of the priors that we analyze are thus not statements about their induced predictive distributions.
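Since K is uncomputable, no real code can evaluate Psol; but a toy stand-in can illustrate the definitions. Here the "prefix machine" is a hypothetical finite dict from programs to indices (an assumption purely for illustration), and K is found by brute force:

```python
# Hypothetical prefix machine: a finite, prefix-free dict from programs to
# the index i they output. A real U_pr would make K uncomputable; this toy
# version just illustrates the definitions of K(i) and P_sol(i) = 2^-K(i).
machine = {"0": 1, "10": 2, "110": 1, "1110": 3}

def K(i):
    """Length of the shortest program printing index i (brute force)."""
    return min(len(p) for p, out in machine.items() if out == i)

P_sol = {i: 2.0 ** -K(i) for i in (1, 2, 3)}

# Because the set of programs is prefix-free, Kraft's inequality gives
# sum_i 2^-K(i) <= 1, i.e. P_sol is a semiprobability mass function.
assert sum(P_sol.values()) <= 1.0
assert K(1) == 1 and P_sol[1] == 0.5
```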
Advantages of Psol over Pap
The Solomonoff prior has some distinct advantages that are missing in our definition:
Lower semicomputability. The weights 2−K(i) are lower semicomputable, which our weights Pap(ν) are probably not. The reason is that to approximate Pap(ν), we would need to be able to algorithmically determine for any binary string p whether νp=ν. There seems to be no method to do this. See our investigation in a footnote for more details on missing lower semicomputability in the related case of Nate's alt-complexity.[12]
Invariance under a change of the UTM. The weights 2−K(i) are, up to a multiplicative constant, invariant under a change of Upr. In contrast, Pap(ν) is not invariant, not even up to a multiplicative constant, under a change of the universal monotone machine U. I learned the following argument for this from Cole Wyeth. Namely, take your universal monotone Turing machine U. Define a second machine from it, W, which is given by W((pp)′q):=U(p′q)=Tp(q), and with empty output on any other input. We can easily see that this is again a universal monotone Turing machine. With suggestive notation, we obtain:
$$P^W_{\mathrm{ap}}(\nu) = \sum_{r\in\mathbb{B}^*:\,\nu^W_r=\nu} 2^{-l(r')} = \sum_{p\in\mathbb{B}^*:\,\nu_p=\nu} 2^{-l((pp)')} \lesssim \Bigl(\sum_{p\in\mathbb{B}^*:\,\nu_p=\nu} 2^{-l(p')}\Bigr)^2 = P_{\mathrm{ap}}(\nu)^2.$$

Thus, if Pap(ν) is very small, then PWap(ν) becomes extremely small, and there can be no constant factor c with PWap(ν)≥c⋅Pap(ν). We note that in the approximate inequality, we use l((pp)′)≈2l(p′), which holds only up to a logarithmic error term that probably does not change the conclusion substantially.
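The core of the inequality is simply that, for nonnegative weights, a sum of squares is at most the square of the sum. A quick numeric check with hypothetical codes (ignoring the prefix-coding overhead, i.e. pretending l((pp)') = 2·l(p')):

```python
# Made-up codes p with nu_p = nu; lengths are arbitrary.
codes = ["0010", "110110", "11111010"]

P_ap = sum(2.0 ** -len(p) for p in codes)           # weight of nu under U
P_ap_W = sum(2.0 ** -(2 * len(p)) for p in codes)   # weight under W via pp

# Sum of squares <= square of the sum, so P_ap^W(nu) <~ P_ap(nu)^2:
assert P_ap_W <= P_ap ** 2
```

Once P_ap is small, P_ap² (and hence P_ap^W) is far smaller, which is why no constant c can give PWap(ν) ≥ c·Pap(ν).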
Besides, this also means that the a priori prior Pap has no simple relationship to the K-complexity-based Solomonoff prior Psol that can hold independently of the choices of the involved universal Turing machines. This may be surprising given that the coding theorem shows that the a priori probability of a string is, up to a multiplicative constant, equal to 2−K(x). This observation is related to my belief that the alt-complexity and K-complexity in Nate's post are not identical up to a constant, contrary to what several commenters claimed. I discuss this in more detail in a footnote.[12]
Optimality of the Solomonoff prior. Assume we remain in the realm of lower semicomputable semiprobability mass functions ω(i), of which Psol(i) is an example, and consider mixtures ∑iω(i)νi. There is a sense in which the Solomonoff prior Psol(i)=2−K(i) is optimal among all these choices of ω. Namely, let P1,P2,… be an enumeration of all lower semicomputable semiprobability mass functions. Then for each ω under consideration, there is a j∗ with ω=Pj∗. By the coding theorem (Theorem 4.3.3 in Li and Vitányi[1]), we have
$$2^{-K(i)} \approx \sum_{j\ge 1} 2^{-K(j)}\,P_j(i) \ge 2^{-K(j^*)}\,P_{j^*}(i) \approx \omega(i),$$

up to the multiplicative constant 2−K(j∗) in the last step. In other words, Psol has "more weight" on all indices i than ω. Consequently, when adapting Theorem 5 to general lower semicomputable mixtures in place of M, Psol provides a better bound than ω (up to an additive constant):
$$-\log P_{\mathrm{sol}}(i) = K(i) \le -\log\omega(i).$$

There seems to be no analogous result for our a priori prior Pap(ν). Indeed, it would be surprising if such a result existed, since this prior is very non-invariant under a change of the universal monotone Turing machine. On the flip side, the fact that Pap is likely not lower semicomputable means that Psol does not necessarily beat Pap.
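The optimality argument above uses only that a mixture dominates each of its weighted components. A minimal numeric sketch, with made-up weights standing in for 2−K(j) and toy mass functions standing in for the Pj:

```python
# w[j] plays the role of 2^-K(j); P[j] plays the role of P_j.
w = [0.5, 0.25, 0.125]                 # hypothetical weights (sum <= 1)
P = [[0.7, 0.2, 0.1],
     [0.1, 0.8, 0.1],
     [0.3, 0.3, 0.4]]                  # toy (semi)probability mass functions

# Mixture xi(i) = sum_j w[j] * P[j][i].
xi = [sum(w[j] * P[j][i] for j in range(3)) for i in range(3)]

# The mixture beats every component up to the constant w[j*], mirroring
# 2^-K(i) >~ 2^-K(j*) * omega(i) in the argument above.
for j in range(3):
    for i in range(3):
        assert xi[i] >= w[j] * P[j][i]
```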
Advantages of Pap over Psol
Dependence on fewer choices. The definition of Psol depends on the choice of a universal prefix machine Upr, on top of the choice of U that we already made use of before. Pap has no such further dependence.
Pap is platonic, Psol is not. Theorem 4 is very natural: The proof is essentially revealed from the definition of Pap itself, giving the definition intrinsic justification due to the beautiful result. In contrast, the analogous statement that
$$M(x) \approx \sum_{i=1}^{\infty} 2^{-K(i)}\,\nu_i(x) =: \xi_{U_{\mathrm{pr}}}(x)$$

depends only very loosely on the definitions of M and the Solomonoff prior Psol(i)=2−K(i), meaning we could have chosen a very different prior without many negative consequences. To explain this, I state the proof of the result: M itself is an LSCSM, meaning that M=νi0 for some i0∈N. This results in
$$\xi_{U_{\mathrm{pr}}}(x) \ge 2^{-K(i_0)}\,\nu_{i_0}(x) = c\cdot M(x),$$

which already establishes the inequality (up to a multiplicative constant) in one direction. For the other direction, first note that ξUpr is lower semicomputable, since 2−K(i) is lower semicomputable and each νi is lower semicomputable. Furthermore, we have
$$\xi_{U_{\mathrm{pr}}}(x) = \sum_{i=1}^{\infty} 2^{-K(i)}\,\nu_i(x) \ge \sum_{i=1}^{\infty} 2^{-K(i)}\bigl(\nu_i(x0)+\nu_i(x1)\bigr) = \xi_{U_{\mathrm{pr}}}(x0) + \xi_{U_{\mathrm{pr}}}(x1).$$

This almost establishes ξUpr as an LSCSM, except that it does not satisfy our assumption that ξUpr(ϵ)=1. Thus, define ~ξUpr to be identical to ξUpr except that ~ξUpr(ϵ)=1. Then this is an LSCSM, and we obtain
$$M(x) = \sum_{\nu\in\mathcal{M}_{\mathrm{sol}}} P_{\mathrm{ap}}(\nu)\,\nu(x) \ge P_{\mathrm{ap}}(\tilde\xi_{U_{\mathrm{pr}}})\,\tilde\xi_{U_{\mathrm{pr}}}(x) \ge P_{\mathrm{ap}}(\tilde\xi_{U_{\mathrm{pr}}})\,\xi_{U_{\mathrm{pr}}}(x),$$

which is precisely the desired inequality in the other direction. As we can see, this theorem depends only on the fact that both M and ξUpr are (essentially) LSCSMs that are mixtures of all LSCSMs; the specifics of M and ξUpr are irrelevant, and the same comparison would hold for any other lower semicomputable mixture of all LSCSMs. It thus appears to me that the Solomonoff prior Psol(i)=2−K(i) is ultimately arbitrary and has no a priori justification.
Pap leads to fewer constants. Related to the previous points, the representation of M(x) as ∑i2−K(i)νi(x) holds only up to a multiplicative constant, whereas Theorem 4 was an exact equality. Theorem 3.8.8 in Hutter et al.[10] claims that with a smart choice of the two universal Turing machines U and Upr, an exact equality can be achieved, but I am unsure if such a choice can be justified "a priori" in a way that feels satisfying.
Conclusion
In this post, I have introduced Solomonoff induction without using K-complexity: A hypothesis for our universe, modeled as a lower semicomputable semimeasure (LSCSM), is considered to be as likely "as it truly is" when sampling uniformly random program codes and piping them through a universal monotone Turing machine. Almost tautologically, this means that the probability of sampling a history x with the universal machine equals the probability of sampling the history from an LSCSM after first sampling an LSCSM from the a priori prior (Theorem 4).
I then defined Solomonoff induction as the process of predicting sequence continuations in a universe under the assumption that they were generated from a uniformly random computer program. Applying Theorem 4 then leads to an equivalent description in terms of the a priori prior: To continue a sequence, first form the posterior over hypotheses with Bayesian reasoning, and then use the mixture over the posterior to predict the continuation. Since the prior over hypotheses has a simplicity-bias (hypotheses with a shorter encoding are more likely), this can be considered a formalization of Occam's razor.
In Theorem 5, we saw that this leads to the well-known Solomonoff bound. It shows that the cumulative error of predicting continuations in any computed universe is upper bounded by the negative logarithm of the probability of the universe under the a priori prior, i.e., by a cross-entropy loss. This resolves the historical puzzlements in the introduction and explains how it is possible to predict at all, using just the reasonable assumption that our universe is computed.
I then compared the a priori prior to the Solomonoff prior that is based on K-complexity, and found that the latter has some distinct mathematical advantages: The prior is lower semicomputable, invariant under a change of the universal prefix Turing machine it is based on, and it leads to optimal Solomonoff bounds when compared to other priors that are lower semicomputable (which, however, is not necessarily an advantage against the bound in Theorem 5 since the a priori prior is likely not lower semicomputable). This means that the Solomonoff prior is probably the better mathematical basis for developing further theory.
Yet, the a priori prior seems ultimately more natural, in that it does not depend on an additional choice of a universal prefix machine, and the mixture description of the universal distribution (Theorem 4) is tautologically true — rather than being achieved by a post-hoc analysis involving a further constant multiplicative error, as is the case for the Solomonoff prior. As mentioned to me by Cole Wyeth, perhaps it is possible to unify the two perspectives with a construction akin to a Friedberg numbering by finding a unique encoding of each LSCSM. This might then salvage the coding theorem (aka the correspondence between degeneracy and simplicity, or between alt-complexity and K-complexity) in this setting.
FAQ (Frequently anticipated questions)
Here I answer some questions that I have frequently anticipated.
Q: In our universe, we don't observe the entire history x1,…,xt up until this point in time. Instead, we observe a small part of the universe for a finite amount of time. How is sequence prediction in reality then compatible with Solomonoff induction?
A: The hypothesis ν in reality is not only the simulator of the universe, but instead the simulator of your precise observations in our universe. Thus, ν then contains a description of the universe and a description of your location in time and space. One further complication is that our perceptions are "qualia", which we do not know how to relate to the formalism of symbol strings; see Wei Dai's post of open problems.
Q: Should we try to predict the real world using Solomonoff induction?
A: That's fairly intractable, so we have to use alternatives. The question is then whether it would be a useful target to try to approximate. My guess is that this depends on the timescales: The shorter the history of our observations, the more we should trust the priors ingrained into our skulls by evolution, for they might be better than the Solomonoff prior and more adapted to our actual universe. On large time horizons, perhaps it becomes increasingly useful to actually approximate Solomonoff induction.
Q: Do neural networks do Solomonoff induction in any meaningful sense?
A: There is an interesting analogy between neural networks and Solomonoff induction: Namely, in the same way that our version of Solomonoff induction predicts sequence continuations by piping random noise through a universal Turing machine, a neural network function is at initialization sampled by arranging random parameters in an architecture. If it additionally turns out that gradient descent is akin to Bayesian updating, the analogy might become quite close. If there turns out to be a sufficiently strong connection between probability and K-complexity in the Solomonoff context (as in the coding theorem for a specific case), one might then reason that neural networks also have a bias toward learned functions with short descriptions, perhaps explaining their generalization capabilities. See also here, here, and here.
Q: Is the Solomonoff a priori distribution M just a useful model, or do you actually expect our world to be sampled from it? Is it a theory of everything?
A: My intuition is that it's just a useful model. It depends on the universal monotone Turing machine U, which feels unnatural. I'm also uncertain whether our universe is really discrete and computed. Perhaps we live in a continuous mathematical structure instead, and time just emerges in our experience? I have not engaged much with the precise overlap and correspondence between mathematical structures and Turing machines and would be happy for someone to say more in the comments.
Q: How is the complexity of a string under a universal monotone Turing machine related to the one under a different Turing machine model, which may be more akin to programming languages in which we typically write code?
A: There are several bounds that closely relate different codelengths to each other, typically up to at most logarithmic error terms. For an example, see Theorem 3.8.7 in Hutter et al.[10]
M. Li and P. Vitányi. An Introduction to Kolmogorov Complexity and Its Applications. Springer Cham, 4th edition, 2019. ISBN 9783030112974, DOI: 10.1007/978-3-030-11298-1.
I found some ambiguity in the literature on whether monotone Turing machines are only partial functions, i.e., only producing an output for some of their inputs, or total functions. I settled on total functions since this makes things conceptually simpler and seems to be the most common view.
Formally, one can define this limit as a supremum over the prefix relation.
In the literature, one typically encounters the definition ν(ϵ)≤1; however, for our purpose, equality is more natural and will later ensure that all lower semicomputable semimeasures can be represented by a monotone Turing machine.
Confusingly, Li and Vitányi use a definition of LSCSMs where ν(ϵ)≤1 (different from our definition where ν(ϵ)=1), and yet claim that all of them are covered by monotone Turing machines. Perhaps they use a somewhat different definition of monotone Turing machines T in which it is allowed for outputs to be entirely undefined (thus, not even the empty string), which means that νT(ϵ)<1, whereas our monotone Turing machines are total functions. This paper thinks Li and Vitányi are making a mistake and writes: "It is equivalent to Theorem 4.5.2 in [LV08] (page 301) with a small correction: λT(ϵ)=1 for any T by construction, but μ(ϵ) may not be 1, so this case must be excluded."
Pap is actually a proper probability mass function, meaning that ∑νPap(ν)=1. The reason is that each s∈B∞ is of the form s=p′t for a p∈B∗ and t∈B∞, with all s with the same prefix p′ having weight 2−l(p′) together. Thus, 1=P(B∞)=∑p∈B∗2−l(p′)=∑νPap(ν).
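This Kraft-style sum can be checked numerically for one concrete self-delimiting encoding, p' = 1^{l(p)} 0 p, so that l(p') = 2·l(p) + 1. This is an illustrative choice of p', not necessarily the encoding used in the post:

```python
# There are 2^n binary strings p of length n, each contributing
# 2^-(2n+1), so the Kraft sum telescopes:
#   sum_n 2^n * 2^-(2n+1) = sum_n 2^-(n+1) = 1.
total = sum((2 ** n) * 2.0 ** -(2 * n + 1) for n in range(60))
assert abs(total - 1.0) < 1e-12
```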
The reverse may not necessarily hold: A hypothesis can receive a lot of weight simply by having exponentially many (longer) descriptions, with a priori no guarantee for there to then be a short description. For gaining intuitions on this perspective, I can recommend Nate's post on alt-complexity and the last footnote.[12]
In other words, the theorem assumes that our universe cannot simply end at any finite time, compared to semimeasures which can always stop a sequence continuation. This seems restrictive at first, and would exclude possibilities like a big crunch in which the universe eventually ends. However, I think such scenarios can likely be rescued even in the true measure formalism: For a sequence that ends, simply continue it with 000… to an infinite sequence. With this method, it should be possible to assign a measure to any semimeasure, which hopefully has a similar a priori probability. If this were the case, then we would still be able to predict our actual universe effectively up until the moment when it ends, when predicting it correctly does not matter anymore. I did not check these claims in detail, but perhaps Lemma 1 in this post already contains the answer.
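The rescue described here, routing each halting deficit to the all-zeros continuation, can be sketched on a toy semimeasure. The numbers are arbitrary, and this is only one way to formalize the "continue with 000…" idea:

```python
# Toy semimeasure nu on strings of length <= 2 (dict keyed by strings).
nu = {"": 1.0, "0": 0.4, "1": 0.3,
      "00": 0.2, "01": 0.1, "10": 0.3, "11": 0.0}

def mu(x):
    """Measure obtained from nu by continuing halted sequences with 0s."""
    if x == "":
        return 1.0
    parent, b = x[:-1], x[-1]
    # Mass that nu loses at `parent` (sequences halting there) is routed
    # to the child that appends a 0.
    deficit = mu(parent) - nu.get(parent + "0", 0.0) - nu.get(parent + "1", 0.0)
    return nu.get(x, 0.0) + (deficit if b == "0" else 0.0)

# mu is a proper measure: the two children always sum to the parent ...
for x in ["", "0", "1"]:
    assert abs(mu(x + "0") + mu(x + "1") - mu(x)) < 1e-12
# ... and mu dominates nu, so prediction bounds for mu transfer to nu.
for x in nu:
    assert mu(x) >= nu[x]
```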
Note that we could have also written this error as the expected value
$$S^\mu_\infty = \mathbb{E}_{s\sim\mu}\Bigl[\sum_{t=1}^{\infty}\sum_{x_t\in\mathbb{B}}\bigl(M(x_t\mid s_{<t})-\mu(x_t\mid s_{<t})\bigr)^2\Bigr],$$

which makes it conceptually clearer that this is an expected cumulative prediction error.
Marcus Hutter, David Quarel, and Elliot Catt. An Introduction to Universal Artificial Intelligence. Chapman & Hall, 2024. ISBN 9781032607023, DOI: 10.1201/9781003460299.
The reader may wonder why we go through the hassle of using a separate universal prefix machine at all. With p=bin(i), another option would be to define P(i):=2−l(p′) as the weight for νi=νp. The issue with this definition is, perhaps paradoxically, that P is then computable, which by Lemma 4.3.1 in Li and Vitányi[1] means that P is not universal, i.e., does not dominate all other computable (let alone semicomputable) semimeasures. This, in turn, means that this prior does not lead to the same optimal bounds that the prior Psol(i)=2−K(i) enjoys.
Nate assumes a universal programming language U that receives input codes p, which U evaluates to output functions f. An example for p is code for "quicksort" or "bubblesort", both of which lead to the same function f that receives a list as input and outputs its sorted version. He then defines the K-complexity
$$K^U(f) := \min\{\,l(p) \mid \mathrm{eval}_U(p) = f\,\},$$

and the alt-complexity
$$K^U_{\mathrm{alt}}(f) := -\log_2 \sum_{p:\,\mathrm{eval}_U(p)=f} 2^{-l(p)} =: -\log_2 P^U(f).$$

Many commenters claimed that these differ only up to an additive constant, but I believe this to be wrong.
At first sight this seems true, since the claim looks identical to the coding theorem, which states the equality (up to an additive constant) of K(x) and −log2P(x) defined for a universal prefix machine, with x a binary string. The proof of the coding theorem crucially uses the lower semicomputability of P(x), which is achieved by dovetailing through all p, checking whether the universal prefix machine sends them to x, and increasing the estimate of P(x) by 2−l(p) whenever this happens. These increasing estimates are crucial for finding codewords for x that get progressively shorter and approach −log2P(x) in length, which then establishes the approximate equality to K(x).
The same proof strategy does not work for Nate's alt-complexity KUalt, since, to my knowledge, there cannot exist an algorithm that checks for arbitrary program p and function f whether evalU(p)=f; after all, both sides of this equation are themselves functions that can in principle receive infinitely many inputs, and we cannot iterate over them all (cf. Rice's theorem). Thus, the lower semicomputability of PU seems to fail, and I know of no method to construct codewords that eventually approximate the length −log2PU(f). The coding theorem thus likely fails in this context, unless an entirely different proof strategy succeeds.
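For contrast, here is a sketch of the dovetailing construction that does work for a prefix machine on strings, with a toy finite "machine" (a dict from programs to output and halting time, purely illustrative) standing in for a real UTM:

```python
# Toy "prefix machine": each program maps to (output, steps until it halts).
machine = {"0": ("a", 3), "10": ("b", 1), "110": ("a", 7), "111": ("b", 2)}

def approximations(x, max_stage):
    """Yield increasing lower bounds on P(x) = sum_{p: U(p)=x} 2^-l(p):
    at stage s, run every program for s steps and credit 2^-l(p) for each
    program that has halted with output x by then."""
    for s in range(1, max_stage + 1):
        yield sum(2.0 ** -len(p)
                  for p, (out, steps) in machine.items()
                  if steps <= s and out == x)

approx = list(approximations("a", 10))
assert approx == sorted(approx)              # the estimates only grow
assert approx[-1] == 2.0 ** -1 + 2.0 ** -3   # the true P("a") on this toy machine
```

For string outputs, equality of outputs is decidable, so this enumeration witnesses lower semicomputability; for function outputs, as argued above, the check `out == x` is exactly the step that becomes undecidable.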