An idea I discussed with @Leon Lang that did not make it into the post is to look for a Friedberg numbering of the l.s.c. semimeasures, which could be used to control the weights P_ap. A Friedberg numbering of the recursively enumerable sets / partial recursive functions is (just) an effective numbering without repetitions, which surprisingly enough does exist (though it is not admissible, meaning it cannot be translated to the standard numbering based on reasonable machine descriptions). If such a numbering existed, and U were chosen to implement it, then each l.s.c. semimeasure \nu would get prior weight (P_ap) exactly 2^(-l(p*)), where p* is the shortest program such that U(p*, _) generates the distribution of \nu.
l(p*) is one intuitive (but highly uncomputable) definition of K(\nu), but the usual definition is K(i), where i is the index of \nu (i.e., \nu = \nu_i) in some computable enumeration of l.s.c. semimeasures. Therefore, P_ap would still not match 2^-K(\nu_i) := 2^-K(i), even up to a constant, as Leon was initially hoping. With an additional step, we can make this true. Define a new UTM U' that takes a code q and runs it to obtain an index i, then treats i as an index into the Friedberg numbering and simulates that machine (in other words, U' decompresses a code for U). Now P_ap(\nu) for U' matches 2^-K(\nu) up to a constant by the coding theorem.
Again: this entire argument depends on the unverified guess that there is a Friedberg numbering of l.s.c. semimeasures, which Marcus Hutter once suggested to me (but I think he does not believe it).
I'm confused. Isn't one of the standard justifications for the Solomonoff prior that you can get it without talking about K-complexity, just by assuming a uniform prior over programs of length n on a universal monotone Turing machine and letting n tend to infinity?[1] How is that different from your P_ap? It's got to be different, right, since you say that P_ap is not equivalent to the Solomonoff prior.
See e.g. An Introduction to Universal Artificial Intelligence, pages 145 and 146.
I'm confused. Isn't one of the standard justifications for the Solomonoff prior that you can get it without talking about K-complexity, just by assuming a uniform prior over programs of length n on a universal monotone Turing machine and letting n tend to infinity?
What you describe is not the Solomonoff prior on hypotheses, but the Solomonoff a priori distribution on sequences/histories! This is the distribution I call M in my post. It can then be written as a mixture of LSCSMs, with the weights given either by the Solomonoff prior (involving Kolmogorov complexity) or by the a priori prior from my work. Those priors are not the same.
If they have the same prior on sequences/histories, then in what relevant sense are they not the same prior on hypotheses? If they both sum to M, how can their predictions come to differ?
Well, their induced mixture distributions are the same up to a constant, but the priors on hypotheses are different. I'm not sure if you consider the difference "relevant", perhaps you only care about the induced mixture distribution?
To make a simple example: Assume there were only three Turing machines T0, T1, and T2. Assume that T0 and T1 are arbitrary, and that T2 first reads one bit of its input and then simulates T0 or T1, respectively. Let ν0, ν1, and ν2 be the LSCSMs induced by T0, T1, and T2. Notice that ν2 is a mixture of ν0 and ν1: ν2 = ½ν0 + ½ν1.
Let ξ be the mixture distribution given as ξ := ⅓(ν0 + ν1 + ν2). Then clearly, ξ is also represented as ξ = ½ν0 + ½ν1. My viewpoint is that the prior distribution giving weight ⅓ to each of the three hypotheses is different from the one giving weight ½ to each of ν0 and ν1, even if their mixture distributions are exactly the same.
And this is exactly the situation we're in with the true mixture distribution M from the post. Some of the LSCSMs in the mixture are given by a separate universal monotone Turing machine, which means that such an LSCSM is itself a mixture of all LSCSMs. Any such mixtures among the LSCSMs allow us to redistribute the prior weight from this LSCSM to all others, without affecting the mixture in any way.
This is also related to what makes a prior based on Kolmogorov complexity ultimately so arbitrary: we could have chosen just about anything and it would still essentially sum to M. A posteriori, the Kolmogorov complexity then has some mathematical advantages, as outlined in the post.
My viewpoint is that the prior distribution giving weight ⅓ to each of the three hypotheses is different from the one giving weight ½ to each of ν0 and ν1, even if their mixture distributions are exactly the same.
That's pretty unintuitive to me. What does it matter whether we happen to write out our belief state one way or the other? So long as the predictions come out the same, what we do and don't choose to call our 'hypotheses' doesn't seem particularly relevant for anything?
We made our choice when we settled on M as the prior. Everything past that point just seems like different choices of notation to me? If our induction procedure turned out to be wrong or suboptimal, it'd be because M was a bad prior to pick, not because we happened to write M down in a weird way, right?
I answered in the parallel thread, which is probably going down to the crux now. To add a few more points:
It's also useful to emphasize why, even if the mixtures are the same, having different priors can make a practical difference. E.g., imagine that in the example above we had one prior giving 100% weight to ν2, and another prior giving 50% weight to each of ν0 and ν1. They give the same mixture, but the first prior can't update, and the second prior can!
... Wait, are you saying we're not propagating updates into ν2 to change the mass it puts on inputs for ν0 vs. ν1?
Okay, I think I overstated the extent to which the difference in priors matters in the previous comments and crossed out "practical".
Basically, I was right that the prior that gives 100% on ν2 cannot update: it gives all its weight to ν2 no matter how much data comes in. However, ν2 itself can update with more data and shift between ν0 and ν1.
I can see that this feels perhaps very syntactic, but in my mind the two priors still feel different. One of them is saying "The world first samples a bit indicating whether the world will continue with world 0 or world 1", and the other one is saying "I am uncertain on whether we live in world 0 or world 1".
The difference is not a “practical” one as long as you only use the posterior predictive distribution, but in some AIXI variants (KSA, certain safety proposals) the posterior weights themselves are accessed and the form may matter. Arguably this is a defect of those variants.
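To make the corrected claim concrete, here is a minimal sketch, assuming two deterministic worlds ν0 (all zeros) and ν1 (all ones) and their 50/50 mixture ν2 treated as a single hypothesis (the names and setup are illustrative, not from the post):

```python
from fractions import Fraction

# Two deterministic hypotheses over binary histories:
# nu0 generates 000..., nu1 generates 111...
def nu0(x): return Fraction(1) if set(x) <= {"0"} else Fraction(0)
def nu1(x): return Fraction(1) if set(x) <= {"1"} else Fraction(0)

# nu2 is the 50/50 mixture of nu0 and nu1, treated as ONE hypothesis.
def nu2(x): return (nu0(x) + nu1(x)) / 2

def posterior(prior, x):
    """Bayesian posterior over hypotheses after observing history x."""
    weights = {name: w * nu(x) for name, (w, nu) in prior.items()}
    total = sum(weights.values())
    return {name: w / total for name, w in weights.items()}

def predict(prior, x, bit):
    """Mixture probability that the next bit after history x is `bit`."""
    num = sum(w * nu(x + bit) for w, nu in prior.values())
    den = sum(w * nu(x) for w, nu in prior.values())
    return num / den

prior_A = {"nu2": (Fraction(1), nu2)}                  # 100% on the mixture
prior_B = {"nu0": (Fraction(1, 2), nu0), "nu1": (Fraction(1, 2), nu1)}

x = "000"
# The posteriors over hypotheses differ: prior_A stays put...
assert posterior(prior_A, x) == {"nu2": Fraction(1)}
# ...while prior_B concentrates on nu0.
assert posterior(prior_B, x) == {"nu0": Fraction(1), "nu1": Fraction(0)}
# But the predictive distributions agree: both are now certain of 0.
assert predict(prior_A, x, "0") == predict(prior_B, x, "0") == Fraction(1)
```

prior_A never moves weight between hypotheses; all the "updating" happens inside ν2's conditionals, which is exactly the distinction discussed above.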
Might be worth more explicitly noting in the post that P_sol and P_ap in fact define the same semimeasure over strings (up to a multiplicative factor). From a skim, I was confused about this point: "wait, is he saying that not only are alt-complexity and K-complexity different, but that they even define different probability distributions? That seems to contradict the universality of P_sol, doesn't it...?"
Good idea, I now added the following to the opening paragraphs of the section doing the comparisons:
Importantly, due to Theorem 4, this means that the Solomonoff prior and the a priori prior lead, up to a constant, to the same predictions on sequences. The advantages of the priors that we analyze are thus not statements about their induced predictive distributions.
Now, let P_u be the uniform distribution on B∞, which samples infinite binary sequences one bit at a time, each with probability 50% to be 0 or 1.
P_u as defined here can't be a proper/classical probability distribution over B∞ because it assigns zero probability to every individual sequence s∈B∞: P_u({s}) = 0.
Or am I missing something?
It’s a continuous probability measure, meaning it has no atoms, but it does assign positive probability to all cylinder sets. If you take the binary representation of reals in the [0,1] interval, P_u comes from the Lebesgue measure (a uniform distribution).
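A minimal numerical illustration of this point (cylinder probabilities are 2^{-l(x)} because the bits are i.i.d. fair coin flips):

```python
# P_u assigns to the cylinder set {s : x is a prefix of s} the probability
# 2^{-l(x)}, since each of the l(x) bits must come out right independently.
def cylinder_prob(x):
    return 2.0 ** (-len(x))

assert cylinder_prob("") == 1.0       # the whole space B^infinity
assert cylinder_prob("010") == 0.125  # positive for every finite prefix
# No atoms: pinning down ever longer prefixes of a single sequence
# drives the probability to 0, so P_u({s}) = 0 for each fixed s.
probs = [cylinder_prob("0" * n) for n in range(30)]
assert all(a == 2 * b for a, b in zip(probs, probs[1:]))
```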
The coding theorem thus likely fails in this context, unless there is an entirely different proof strategy that succeeds.
Indeed, alt-complexity and K-complexity are known to differ asymptotically in the infinite string case.
Yes. There are lots of different settings one could consider, e.g.:
For all of these cases, one can compare different notions of complexity (plain K-complexity, prefix complexity, monotone complexity, if applicable) with algorithmic probability. My sense is that the correspondence is only exact for universal prefix machines and finite strings, but I didn't consider all settings.
I think the proof in the Gács paper can be adapted to LSCSMs and functions, but I haven't checked super carefully.
This post has been written in collaboration with Iliad in service of one of Iliad's longer-term goals of understanding the simplicity bias of learning machines.
Solomonoff induction is a general theoretical ideal for how to predict sequences that were generated by a computable process. It posits that in order to predict the continuation of our world, we should weight hypotheses by their simplicity ("Occam's razor") and update the weighting using Bayesian reasoning.
In this post, I introduce Solomonoff induction at a technical level, hoping to be more elegant and satisfying than introductions you might find elsewhere. In particular, the introduction avoids Kolmogorov complexity: instead of making predictions with the Solomonoff prior probability of hypotheses — given as two to the power of minus the length of their shortest encoding, i.e., their Kolmogorov complexity — we directly use the a priori probability that a hypothesis is generated from program code sampled uniformly at random. This is closely related to the alt-complexity sketched by Nate Soares before. The approach leads to more elegant results that are simpler to prove than in typical expositions. Yet, since only the prior over hypotheses is changed, the introduction remains compatible with other expositions and should be broadly valuable to many readers.
Later, we will see that this choice has both mathematical advantages and disadvantages compared to an approach based on Kolmogorov complexity, and the latter is likely a more powerful basis on which to build further theory. These drawbacks notwithstanding, I hope that this introduction is philosophically satisfying, by touching on questions like "Why should we expect our world to be learnable or predictable in the first place?" and "What is a universally good way to predict?", and on connections to Occam's razor and why one should favor simple hypotheses when predicting the world. In the FAQ at the end, I will touch on several anticipated questions, including briefly on how the theory might be relevant for neural networks.
Audience. I try not to strictly assume much prior knowledge of Turing machines or related concepts, but prior exposure will still be very useful. Some amount of mathematical or scientific maturity can probably replace exposure to these topics.
Acknowledgments. This post benefitted most from discussions with Alexander Oldenziel, and it was caused by his inquiry about the potential simplicity of learned neural network functions based on analogies with algorithmic information theory (AIT). This made me want to think more carefully about a "degeneracy-first" perspective on AIT itself, from which this post emerged. I'm also grateful to Cole Wyeth for discussions and his feedback on an earlier draft that led to some changes. He also explained to me the argument for why the Solomonoff prior gives the optimal prediction bound among all lower semicomputable priors. Thanks also to Matt Dellago for discussions on the content of this post, and to CEEALAR for hosting me when I finished writing this post.
Introduction
If we make a sequence of observations in the world, e.g., a sequence of numbers
1, 1, 2, 3, 5, 8, 13, … or a sequence of physical measurements, we are often able to continue the sequence in a way that coincides with its true continuation "in the wild". This seems to broadly work in much of the real world, meaning the world is learnable — we can learn from past experience to predict the future.
Why is that? This question has puzzled humanity for thousands of years. For example, Sextus Empiricus expressed that any rule inferred from a finite amount of data may always be contradicted by more data, and that it is impossible to take into account all data:
René Descartes noticed that our minds are so fallible that we cannot even be sure of being awake:
And in the following passage, David Hume seems to imply that while we can make future inferences about cause and effect from analogizing about past patterns, there is no principle that can be discovered in nature itself that would justify doing so:
Similar reasoning is expressed by the no free lunch theorems, which roughly state that, a priori, no learning algorithm should perform better than any other.
We conclude that without placing any assumptions on the world, learning and prediction should indeed be impossible: If the world were truly (uniformly) random, then any way to use past data to predict the future would indeed be doomed. So why is it possible to learn?
One pre-formal answer is Occam's razor, which exists in various formulations and states that among all explanations that are compatible with our observations, we should choose the one making the fewest assumptions. There are two quotes often attributed to Occam to this effect:
Occam's razor is a useful starting point, but it is not formal. Additionally, it would be nice to be able to justify Occam's razor from some minimal assumption that does not simply assume that we should focus on simple hypotheses in our reasoning. How can one do this?
In this post, I motivate the answer by starting with the minimal assumption that our world is computed. Not knowing what it is computed by, we assume that we are generated by running a uniformly randomly sampled program code through a universal Turing machine (Section link).
A consequence of this assumption will be that we can also think of our world as being sampled in a two-stage process: first by sampling a semicomputable (perhaps probabilistic) universe from a prior distribution, and then by generating our history from said universe. As it turns out, the prior on universes — which I call the a priori prior — will automatically have a simplicity-weighting built in, thus justifying Occam's razor (Section link).
Repurposing this two-stage distribution for making predictions then leads to Solomonoff induction, which has remarkable predictive power no matter what computation truly underlies our universe. That is, even if the program code underlying our true universe was in fact not sampled randomly, but chosen adversarially to make our prediction task hard, we would eventually, for a sufficiently long observed history, converge to perfect predictions (Section link).
I will then discuss drawbacks of my choice of an a priori prior. In particular, the weights are likely not lower semicomputable, and the a priori prior is not invariant up to a constant under a change of the universal Turing machine, unlike the classical Kolmogorov-based Solomonoff prior. Additionally, the Solomonoff prior satisfies an optimality property for sequence prediction that does not exist in similar form for the a priori prior. This also means that my choice of prior — which can be regarded as being based on the probability, or degeneracy, of hypotheses — is meaningfully different from an approach based on description length (Section link).
This shows that the maxim that description length equals degeneracy is not universal in algorithmic information theory. Relatedly, in a footnote I discuss the relationship to Nate Soares's alt-complexity and explain that many commenters under that post were probably wrong about K-complexity and alt-complexity being identical up to a constant. These problems notwithstanding, I do think my choice of prior is a priori easier to justify, depends on fewer choices, and leads to some results being more elegant.
I then conclude and anticipate questions in my FAQ, focusing briefly on how to think about Solomonoff induction in the context of agents observing only part of the universe's history, on practical usefulness, on the potential applicability of Solomonoff induction as a theoretical ideal for understanding neural networks, on the question of whether Solomonoff induction can be considered a theory of everything, and on the sensitivity of codelength to the choice of Turing machine model. The answers are very preliminary, and much more could be said or studied about them.
On being generated by a random computer program
Observing our world, it seems like the future is determined by the past with precise rules. We capture this by the minimal assumption that our world is computed. Without a priori knowledge of the program that generated us, we simply assume that our history emerged by sampling a uniformly random computer program and piping it through a universal programming language that outputs sequences.
In this section I formalize what this means. Let B = {0,1} be our alphabet of bits. B∗ = ⋃_{n≥0} Bn is the set of binary strings of arbitrary finite length (where the length-0 string ϵ is explicitly allowed!). B∞ = {0,1}∞ is the set of infinite binary sequences. And B≤∞ = B∗ ∪ B∞ is the set of finite or infinite binary sequences. For two sequences x∈B∗, y∈B≤∞, we write xy for their concatenation. x is then said to be a prefix of xy, written x ⪯ xy.
Monotone Turing machines
How do we capture that we're "computed"? One idea is to assume that we are generated one "frame" at a time by a program that never changes past frames. This leads to the formalization of a monotone Turing machine T in Definition 4.5.3 of Li and Vitányi.[1] Formally, a monotone Turing machine is a computable function T:B∗→B≤∞ with the monotonicity property: For all p,q∈B∗ with p⪯q, we have T(p)⪯T(q).[2] There is a lot to unpack here, and I hope that the following semi-formal explanations give enough intuitions to read the rest of the post:
Universal monotone Turing machines
Not all monotone Turing machines T are equally interesting. For example, imagine a machine T which simply outputs an empty sequence on any input: This satisfies all properties stated above, but it is very boring.
The other extreme is a universal monotone Turing machine, which can simulate all others. It does this by operating on codes that first describe an arbitrary monotone Turing machine using a binary code p, followed by the input q to said machine. One tricky detail is that we cannot simply concatenate p and q to pq since then we may forget where the description of p ends and the input q starts.
A solution is to use the following prefix-free encoding of descriptions p∈B∗:
p′ := 1^{l(bin(l(p)))} 0 bin(l(p)) p ∈ B∗.

Let's unpack this: l(p) is the length of p. bin(n), for a natural number n, is an encoding of n as a binary sequence, usually by ordering binary sequences by length and then lexicographically. p′ then first contains a sequence of 1's followed by a zero. The number of 1's is precisely the length of bin(l(p)). After reading that, one knows how many of the following bits encode l(p). After reading those bits, one knows the length of p, and so after reading that many further bits, one knows p. Thus, it is algorithmically possible to disentangle a binary sequence of the form p′q into p and q, since p′ itself tells you when p will end.
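As a sanity check of this scheme, here is a sketch in Python; `nat_to_str` is my name for the length-then-lexicographic bijection bin:

```python
def nat_to_str(n):
    # Bijection N -> binary strings, ordered by length then lexicographically:
    # 0 -> "", 1 -> "0", 2 -> "1", 3 -> "00", ...
    return format(n + 1, "b")[1:]

def encode(p):
    """Prefix-free code p' = 1^{l(bin(l(p)))} 0 bin(l(p)) p."""
    b = nat_to_str(len(p))
    return "1" * len(b) + "0" + b + p

def decode(s):
    """Split s = p'q back into (p, q), inverting `encode` on the p part."""
    k = s.index("0")              # number of leading 1s = l(bin(l(p)))
    b = s[k + 1 : k + 1 + k]      # the bits of bin(l(p))
    lp = int("1" + b, 2) - 1      # invert nat_to_str to recover l(p)
    start = k + 1 + k
    return s[start : start + lp], s[start + lp :]

p, q = "10110", "001"
assert decode(encode(p) + q) == (p, q)   # p' tells us where p ends
```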
Then, a universal monotone Turing machine U:B∗→B≤∞ only produces non-trivial outputs for inputs of the form p′q. Additionally, for every p∈B∗, Tp:q↦Tp(q):=U(p′q) is itself a monotone Turing machine, and every monotone Turing machine T is of the form T=Tp for at least one p∈B∗. Note that U will never produce output bits before reading q since p′ only encodes a monotone Turing machine T. Thus, for any prefix r of p′, U(r) is simply the empty string.
This is a nice definition, but does such a universal machine exist? The answer is yes, and it is proved by enumerating the set of monotone Turing machines in a computable way, which Li and Vitányi[1] state to be "clearly" possible right before Definition 4.5.5 — without giving a proof. Intuitively, this is believable by thinking about universal programming languages: any "monotone Turing machine" can be implemented by a finite code in any universal programming language like Python.
I fix a universal monotone Turing machine U for the rest of the post. In a later section we will see that the theory is quite sensitive to the choice of this machine, but for now, we will not worry about this.
The universal distribution
A monotone Turing machine T:B∗→B≤∞ induces another function (denoted, by abuse of notation, with the same symbol) T:B∞→B≤∞ whose output on an infinite input is defined to be the limit of the outputs of finite prefixes:[3]
∀s∈B∞:  T(s) := lim_{p∈B∗, p⪯s} T(p).

Now, let Pu be the uniform distribution on B∞, which samples infinite binary sequences one bit at a time, each with probability 50% to be 0 or 1. Let s∈B∞ be a random input sampled from Pu, and let x=U(s)∈B≤∞ be its output (possibly infinite). For finite x∈B∗, let M(x) be the resulting probability that a history starts with x. Formally:
Definition 1 (Universal a priori distribution). Let U be a universal monotone Turing machine and Pu be the uniform distribution on infinite binary codes. The universal a priori probability that a history starts with x∈B∗ is defined as
M(x) := Pu({s∈B∞ ∣ x⪯U(s)}).

I invite you to pause for a moment to consider what a profound perspective shift this is compared to the historical puzzlement I emphasized in the introduction: when incorrectly assuming that our history x is sampled uniformly, we cannot predict the future, as the skeptical views expressed by Empiricus, Descartes, and Hume make clear. However, the universal a priori distribution M does not assume a uniformly sampled history. Instead, it assumes a uniformly sampled code, from which the history is generated with the fixed, deterministic universal monotone Turing machine U. In other words, we take our universe to have a uniformly sampled description instead of a uniformly sampled history.
Should we expect any regularities in the histories x sampled from M? And if so, does this regularity give us a hint about how to predict sequence continuations in our actual universe? These questions are answered in the next two sections.
M is a universal mixture distribution
We now state and prove an equivalent description of the universal a priori distribution M, which will show that M has some simplicity-weighting. Intuitively, it will turn out that sampling histories from M is as if you were to first sample "probabilistic physical laws" in a way that obeys Occam's razor, and then use those laws to sample the history.
Lower semicomputable semimeasures
We formalize "probabilistic physical laws" by the notion of a semimeasure:
Definition 2 ((Lower semicomputable) semimeasure). A semimeasure is a function ν:B∗→[0,1] with ν(ϵ)=1[4] and ν(x) ≥ ν(x0)+ν(x1) for all x∈B∗, where ϵ∈B∗ is the empty binary string. ν is called lower semicomputable if it can be computably approximated from below: this means there is an algorithm which, on input x, produces progressively better approximations of ν(x) from below. I abbreviate "lower semicomputable semimeasure" as LSCSM from now on.
For an LSCSM ν, ν(x) is interpreted as the probability, under ν, that an infinite sequence starts with x. We can have ν(x) > ν(x0)+ν(x1) since the sequence may not continue after starting with x. This is akin to the possibility that the universe can end at any moment.
As an aside, LSCSMs also include deterministic sequences: let s∈B∞ be a fixed sequence that can be generated by a computer program. Define ν(x)=1 for all prefixes x of s, and ν(x)=0 otherwise. Then for a given x, we can compute ν(x) by computing s and determining whether x is a prefix. This shows that ν is an LSCSM that generates the sequence s.
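As a sketch of this construction (the generator and helper names are mine):

```python
def make_deterministic_lscsm(gen):
    """LSCSM for the computable sequence whose t-th bit is gen(t):
    nu(x) = 1 if x is a prefix of the sequence, else 0."""
    def nu(x):
        return 1 if all(x[t] == gen(t) for t in range(len(x))) else 0
    return nu

# Example: the alternating sequence 010101...
nu = make_deterministic_lscsm(lambda t: "01"[t % 2])
assert nu("") == 1 and nu("0101") == 1 and nu("011") == 0
# The semimeasure inequality holds with equality along the sequence:
assert nu("01") == nu("010") + nu("011")
```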
How can we obtain LSCSMs? Well, for one, we already know one instance: M is a semimeasure since
M(ϵ) = Pu({s ∣ ϵ⪯U(s)}) = Pu(B∞) = 1,

and for every x∈B∗:
{s ∣ x⪯U(s)} = {s ∣ x0⪯U(s)} ˙∪ {s ∣ x1⪯U(s)} ˙∪ {s ∣ x=U(s)},

leading to M(x) ≥ M(x0)+M(x1). M is also lower semicomputable. Indeed, to approximate M(x) from below, simply enumerate all finite binary sequences p, compute U(p) over all p in parallel, and record whether x⪯U(p). If this is the case, then for all infinite extensions s of p, we have x⪯U(s), and they together have probability 2−l(p), with l(p) the length of p. Thus, we can add 2−l(p) to our estimate of M(x) and then skip any extensions of p in our algorithmic procedure. The estimate then progressively approximates M(x) since for every s with x⪯U(s), there is a finite prefix p⪯s with x⪯U(p), meaning we will find the contribution eventually in our approximation.
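The enumeration procedure just described can be sketched in code. As an assumption, a trivial monotone machine (the identity) stands in for U; the skipping logic is the same for any monotone machine:

```python
from itertools import product

def T(p):
    # Toy monotone machine: the identity, T(p) = p. (Assumption: a stand-in
    # for U, chosen so the exact answer is known in closed form.)
    return p

def approx_nu(T, x, max_len):
    """Lower approximation of nu_T(x) = P_u({s : x is a prefix of T(s)}),
    enumerating programs by increasing length, adding 2^{-l(p)} for each
    minimal p with x a prefix of T(p), and skipping extensions of p."""
    total = 0.0
    counted = []   # minimal programs already accounted for
    for n in range(max_len + 1):
        for bits in product("01", repeat=n):
            p = "".join(bits)
            if any(p.startswith(c) for c in counted):
                continue   # covered by a shorter program's cylinder
            if T(p).startswith(x):
                counted.append(p)
                total += 2.0 ** (-len(p))
    return total

# For the identity machine, nu_T(x) = 2^{-l(x)} exactly, and the
# approximation reaches it once programs of length l(x) are enumerated.
assert approx_nu(T, "010", 5) == 2.0 ** -3
```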
What other ways are there to obtain LSCSMs? Well, you may have noticed that I did not use the universality of U in the explanation above. In other words, every monotone Turing machine T gives rise to an LSCSM νT by piping uniformly random noise through T:
νT(x) := Pu({s∈B∞ ∣ x⪯T(s)}).

And this is all: all LSCSMs emerge from a monotone Turing machine in this way, see Theorem 4.5.2 in Li and Vitányi.[1][5]
The mixture equivalence
Now, for p∈B∗, define the function Tp:B∗→B≤∞ by
Tp(q) := U(p′q),

which is itself a monotone Turing machine. Define the LSCSM νp := νTp.
Definition 3 (A priori prior). The a priori prior is given for any LSCSM ν by
Pap(ν) := ∑_{p∈B∗: νp=ν} 2−l(p′),

which is the probability[6] that a uniformly random infinite sequence s∈B∞ starts with an encoding p′ that gives rise to the LSCSM νp=ν.
The reason I call this the "a priori prior" is that it simply emerged from our definitions without further assumptions, which distinguishes it from the Kolmogorov-based Solomonoff prior that I will investigate later (Section link).
Finally, let Msol be the set of all LSCSMs, where "sol" is shorthand for "Solomonoff". We obtain:
Theorem 4 (Mixture Equivalence). The universal distribution M is a universal mixture over all LSCSMs with weights Pap(ν):
M(x) = ∑_{ν∈Msol} Pap(ν) ν(x).

Proof. Since U only has outputs for inputs of the form p′s for p∈B∗, we obtain
{s∈B∞ ∣ x⪯U(s)} = ˙⋃_{p∈B∗} {p′}×{s∈B∞ ∣ x⪯U(p′s)=Tp(s)}.

We obtain
M(x) = Pu({s∈B∞ ∣ x⪯U(s)})
= Pu(˙⋃_{p∈B∗} {p′}×{s∈B∞ ∣ x⪯U(p′s)=Tp(s)})
= ∑_{p∈B∗} Pu(p′) ⋅ Pu({s∈B∞ ∣ x⪯Tp(s)})
= ∑_{p∈B∗} 2−l(p′) ⋅ νp(x)
= ∑_{ν∈Msol} (∑_{p∈B∗: νp=ν} 2−l(p′)) ⋅ ν(x)
= ∑_{ν∈Msol} Pap(ν) ⋅ ν(x).

That proves the claim. □
This means that if our history x was sampled from M, we can also think of it as being sampled by first sampling an LSCSM ν with probability Pap(ν), and then sampling the history successively from ν.
If an LSCSM ν has an encoding ν=νp, then Pap(ν)≥2−l(p′). This gives a simplicity bias: Universes with a simple (i.e., short) description p are more likely under this a priori prior. This can be interpreted as a form of Occam's razor, in that hypotheses with a short description receive exponentially more weight.[7]
Solomonoff induction
So far, I have started with the assumption that our universe is the result of sampling a uniformly random program, and arrived at the conclusion that this is equivalent to first sampling an LSCSM with a prior that gives more weight to simple hypotheses, and then sampling our history from it. Thus, under our assumption, we should expect that the world is more predictable than sceptical paradoxes suggest: we can concentrate our reasoning on simple universes and should have a chance to predict the future better than chance. We will make this case in mathematical detail in this section.
Universal sequence prediction
How should we make predictions in light of these thoughts? The answer: we predict using the Solomonoff mixture distribution M(x) = ∑_{ν∈Msol} Pap(ν) ν(x) itself, predicting sequence continuations via familiar conditional probabilities M(xt∣x<t) := M(x≤t)/M(x<t) upon observing an initial history x<t = x1,…,xt−1 ∈ Bt−1. This is the most parsimonious way to predict, given that our only a priori assumption is that we emerged from a uniformly randomly sampled program.
We now explain in more detail how to predict using M by relating the conditionals to Bayesian updating. For any LSCSM ν, define the conditional probability ν(x∣y):=ν(yx)/ν(y). We obtain:
M(xt∣x<t) = M(x≤t)/M(x<t)
= ∑_{ν∈Msol} Pap(ν) ν(x≤t) / ∑_{ν′∈Msol} Pap(ν′) ν′(x<t)
= ∑_{ν∈Msol} [Pap(ν) ν(x<t) / ∑_{ν′∈Msol} Pap(ν′) ν′(x<t)] ⋅ ν(xt∣x<t)
=: ∑_{ν∈Msol} Pap(ν∣x<t) ⋅ ν(xt∣x<t),

where I defined Pap(ν∣x<t) in the suggested way. This is actually a Bayesian posterior in the conventional sense; one can see this more clearly notationally by writing ν(x<t) = P(x<t∣ν), the likelihood of history x<t conditional on the hypothesis ν. Then the inner formula becomes:
Pap(ν∣x<t) = Pap(ν) ⋅ P(x<t∣ν) / ∑_{ν′} Pap(ν′) ⋅ P(x<t∣ν′) = Pap(ν) ⋅ P(x<t∣ν) / P(x<t).

This is the classical Bayes theorem.
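The Bayesian rewriting above can be checked numerically in a finite toy setting. A minimal sketch, under the assumption that the hypothesis class is just three Bernoulli measures rather than all of Msol (all names are mine):

```python
from fractions import Fraction

# Finite stand-in for Msol (assumption: three i.i.d. Bernoulli measures;
# the real mixture runs over all LSCSMs).
thetas = [Fraction(1, 4), Fraction(1, 2), Fraction(3, 4)]
prior = [Fraction(1, 3)] * 3

def nu(theta, x):
    """Bernoulli(theta) probability that a sequence starts with x."""
    out = Fraction(1)
    for bit in x:
        out *= theta if bit == "1" else 1 - theta
    return out

def M(x):
    return sum(w * nu(t, x) for w, t in zip(prior, thetas))

history, next_bit = "1101", "1"

# Left-hand side: the conditional of the mixture.
lhs = M(history + next_bit) / M(history)

# Right-hand side: posterior-weighted conditionals of the components.
post = [w * nu(t, history) / M(history) for w, t in zip(prior, thetas)]
rhs = sum(p * nu(t, history + next_bit) / nu(t, history)
          for p, t in zip(post, thetas))

assert sum(post) == 1      # the posterior is a probability distribution
assert lhs == rhs          # M's conditional equals the posterior mixture
```

Exact rational arithmetic makes the identity hold with equality, not just approximately.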
Thus, we arrive at a very satisfying description of Solomonoff induction:
This is very appealing:
One often-noted drawback is that it is incomputable since we can't precisely determine the posterior Pap(ν∣x<t) or the conditional ν(xt∣x<t). However:
The Solomonoff bound
None of this would be useful if it did not result in actual predictability of the world. After all, our question was why our actual universe is predictable. Fortunately, it turns out that Solomonoff induction is an exquisite predictor under the reasonable assumption that our own universe is a lower semicomputable measure, instead of just a semimeasure.[8]
Theorem 5 (Solomonoff Bound). Let μ be any lower semicomputable measure on infinite sequences, and assume it generates our actual universe. Define the cumulative expected prediction error of predicting sequences sampled by μ via M as:
Sμ∞ := ∑_{t=1}^∞ ∑_{x<t∈B∗} μ(x<t) ∑_{xt∈B} (M(xt∣x<t) − μ(xt∣x<t))².

Then we have
Sμ∞ ≤ −ln Pap(μ).

In other words, the prediction error is upper-bounded by the cross entropy between the Dirac measure on our true universe and our a priori prior probability Pap(μ). In particular, for increasingly long histories, the total remaining prediction error goes to zero.[9]
Proof. We have (with explanations after the computation):
Sμ∞ = ∑_{t=1}^∞ ∑_{x<t∈B∗} μ(x<t) ∑_{xt∈B} (M(xt∣x<t) − μ(xt∣x<t))²
≤ ∑_{t=1}^∞ ∑_{x<t∈B∗} μ(x<t) ∑_{xt∈B} μ(xt∣x<t) ln [μ(xt∣x<t) / M(xt∣x<t)]
= lim_{n→∞} ∑_{t=1}^n ∑_{x<t} μ(x<t) DKL(μ(Xt∣x<t) ∥ M(Xt∣x<t))
= lim_{n→∞} ∑_{t=1}^n DKL(μ(Xt∣X<t) ∥ M(Xt∣X<t))
= lim_{n→∞} DKL(μ(X1,…,Xn) ∥ M(X1,…,Xn))
= lim_{n→∞} ∑_{x≤n} μ(x≤n) ln [μ(x≤n) / M(x≤n)].

Step 1 is the definition. Step 2 is a standard inequality that holds whenever μ is a measure, see Corollary 2.8.8 in Hutter et al.[10] (whereas M is allowed to be only a semimeasure). Step 3 uses the definition of the KL divergence and changes the infinite series to a limit of partial sums. Step 4 substitutes in the definition of the conditional KL divergence.
Step 5 is a crucial step. Similar to how Shannon entropy H satisfies the well-known chain rule H(X)+H(Y∣X)=H(X,Y), KL divergence also satisfies the chain rule DKL(P(X) ∥ Q(X)) + DKL(P(Y∣X) ∥ Q(Y∣X)) = DKL(P(X,Y) ∥ Q(X,Y)). Intuitively, the divergence of P and Q measured in the variable X, plus the divergence in Y once X is already known, equals the total divergence of P and Q over (X,Y). Telescoping this chain rule from 1 to n gives precisely the result of step 5. A full proof of this telescoping property can be found in Lemma 3.2.4 in Hutter et al.[10] Step 6 is just the definition of the KL divergence again.
Now, note that by Theorem 4, we have
M(x≤n) = ∑_{ν∈Msol} Pap(ν) ν(x≤n) ≥ Pap(μ) μ(x≤n).

With this inequality, we obtain
Sμ∞ ≤ lim_{n→∞} ∑_{x≤n} μ(x≤n) ln [1/Pap(μ)] = −ln Pap(μ) ⋅ lim_{n→∞} ∑_{x≤n} μ(x≤n) = −ln Pap(μ).

That finishes the proof. □
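The bound can be sanity-checked in a finite analogue. A sketch under the assumption that the mixture runs over just three Bernoulli measures rather than all LSCSMs; the derivation above only uses M(x) ≥ Pap(μ)·μ(x), which this toy mixture satisfies with weight 1/3:

```python
import math
from itertools import product

# Finite stand-in for the mixture (assumption: three Bernoulli measures
# with uniform prior weight 1/3 each).
thetas = [0.25, 0.5, 0.75]
prior = dict.fromkeys(thetas, 1 / 3)

def nu(theta, x):
    p = 1.0
    for bit in x:
        p *= theta if bit == "1" else 1 - theta
    return p

def M(x):
    return sum(w * nu(t, x) for t, w in prior.items())

mu_theta = 0.75                      # the "true universe"
def mu(x): return nu(mu_theta, x)

def cumulative_error(n):
    """S_n = sum_{t<=n} E_{x_<t ~ mu} sum_b (M(b|x_<t) - mu(b|x_<t))^2."""
    S = 0.0
    for t in range(1, n + 1):
        for bits in product("01", repeat=t - 1):
            x = "".join(bits)
            err = sum((M(x + b) / M(x) - mu(x + b) / mu(x)) ** 2
                      for b in "01")
            S += mu(x) * err
    return S

bound = -math.log(prior[mu_theta])   # -ln Pap(mu) = ln 3 here
for n in (1, 4, 8):
    assert cumulative_error(n) <= bound
```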
Thus, adding to our earlier investigations, not only do we have reason to believe our universe is predictable, there also is an amazing predictor as long as our universe is generated from a lower semicomputable measure; this resolves the skeptical paradoxes from the introduction. Furthermore, this predictor is given by a blank-slate probability distribution M that assumes no prior knowledge of what our universe is. While the prediction error is lower for universes μ that are more likely under Pap (which includes universes with a short description), I note that it goes to zero for long histories regardless of μ.
Comparison to Solomonoff induction with K-complexity
This section compares our construction of Solomonoff induction to the more familiar construction where the prior Pap is replaced by the Solomonoff prior Psol. If you are not interested in this section, skip ahead to the conclusion. This section assumes more prior knowledge than previous sections.
Assume we have a computable enumeration ν1,ν2,… of LSCSMs. We can, for example, obtain this by mapping a natural number i to its binary representation — a code p — and then defining νi:=νp as above. Furthermore, let Upr be a separate universal prefix Turing machine, which is not directly related to the universal monotone Turing machine U. Define the Kolmogorov complexity K(i) of a natural number i as the length of the shortest input to Upr which computes a binary encoding of i. The Solomonoff prior is then given by
$$P_{\mathrm{sol}}(i) := 2^{-K(i)}.$$

One can then prove that, approximately (i.e., up to at most a constant multiplicative error), we have
$$M(x) \approx \sum_{i\in\mathbb{N}} P_{\mathrm{sol}}(i)\,\nu_i(x);$$

see Li and Vitányi,[1] Theorem 4.5.3. I will prove this statement at the start of the second subsection to reveal an advantage of Pap.[11] Importantly, due to Theorem 4, this means that the Solomonoff prior Psol and the a priori prior Pap lead, up to a constant, to the same predictions on sequences. The advantages of the priors that we analyze are thus not statements about their induced predictive distributions.
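Since K is uncomputable, no real code can evaluate Psol; but a toy stand-in can illustrate the definitions. Here the "prefix machine" is a hypothetical finite dict from programs to indices (an assumption purely for illustration), and K is found by brute force:

```python
# Hypothetical prefix machine: a finite, prefix-free dict from programs to
# the index i they output. A real U_pr would make K uncomputable; this toy
# version just illustrates the definitions of K(i) and P_sol(i) = 2^-K(i).
machine = {"0": 1, "10": 2, "110": 1, "1110": 3}

def K(i):
    """Length of the shortest program printing index i (brute force)."""
    return min(len(p) for p, out in machine.items() if out == i)

P_sol = {i: 2.0 ** -K(i) for i in (1, 2, 3)}

# Because the set of programs is prefix-free, Kraft's inequality gives
# sum_i 2^-K(i) <= 1, i.e. P_sol is a semiprobability mass function.
assert sum(P_sol.values()) <= 1.0
assert K(1) == 1 and P_sol[1] == 0.5
```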
Advantages of Psol over Pap
The Solomonoff prior has some distinct advantages that are missing in our definition:
Lower semicomputability. The weights 2−K(i) are lower semicomputable, which our weights Pap(ν) are probably not. The reason is that to approximate Pap(ν), we would need to be able to algorithmically determine for any binary string p whether νp=ν. There seems to be no method to do this. See our investigation in a footnote for more details on missing lower semicomputability in the related case of Nate's alt-complexity.[12]
Invariance under a change of the UTM. The weights 2−K(i) are, up to a multiplicative constant, invariant under a change of Upr. In contrast, Pap(ν) is not invariant, not even up to a multiplicative constant, under a change of the universal monotone machine U. I learned the following argument for this from Cole Wyeth. Namely, take your universal monotone Turing machine U. Define a second machine from it, W, which is given by W((pp)′q):=U(p′q)=Tp(q), and with empty output on any other input. We can easily see that this is again a universal monotone Turing machine. With suggestive notation, we obtain:
$$P^W_{\mathrm{ap}}(\nu) = \sum_{r\in\mathbb{B}^*:\,\nu^W_r=\nu} 2^{-l(r')} = \sum_{p\in\mathbb{B}^*:\,\nu_p=\nu} 2^{-l((pp)')} \lesssim \Bigl(\sum_{p\in\mathbb{B}^*:\,\nu_p=\nu} 2^{-l(p')}\Bigr)^2 = P_{\mathrm{ap}}(\nu)^2.$$

Thus, if Pap(ν) is very small, then PWap(ν) becomes extremely small, and there can be no constant factor c with PWap(ν)≥c⋅Pap(ν). We note that in the approximate inequality, we use l((pp)′)≈2l(p′), which holds only up to a logarithmic error term that probably does not change the conclusion substantially.
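The core of the inequality is simply that, for nonnegative weights, a sum of squares is at most the square of the sum. A quick numeric check with hypothetical codes (ignoring the prefix-coding overhead, i.e. pretending l((pp)') = 2·l(p')):

```python
# Made-up codes p with nu_p = nu; lengths are arbitrary.
codes = ["0010", "110110", "11111010"]

P_ap = sum(2.0 ** -len(p) for p in codes)           # weight of nu under U
P_ap_W = sum(2.0 ** -(2 * len(p)) for p in codes)   # weight under W via pp

# Sum of squares <= square of the sum, so P_ap^W(nu) <~ P_ap(nu)^2:
assert P_ap_W <= P_ap ** 2
```

Once P_ap is small, P_ap² (and hence P_ap^W) is far smaller, which is why no constant c can give PWap(ν) ≥ c·Pap(ν).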
Besides, this also means that the a priori prior Pap has no simple relationship to the K-complexity-based Solomonoff prior Psol that can hold independently of the choices of the involved universal Turing machines. This may be surprising given that the coding theorem shows that the a priori probability of a string is, up to a multiplicative constant, equal to 2−K(x). This observation is related to my belief that the alt-complexity and K-complexity in Nate's post are not identical up to a constant, contrary to what several commenters claimed. I discuss this in more detail in a footnote.[12]
Optimality of the Solomonoff prior. Assume we remain in the realm of lower semicomputable semiprobability mass functions ω(i), of which Psol(i) is an example, and consider mixtures ∑iω(i)νi. There is a sense in which the Solomonoff prior Psol(i)=2−K(i) is optimal among all these choices of ω. Namely, let P1,P2,… be an enumeration of all lower semicomputable semiprobability mass functions. Then for each ω under consideration, there is a j∗ with ω=Pj∗. By the coding theorem (Theorem 4.3.3 in Li and Vitányi[1]), we have
$$2^{-K(i)} \approx \sum_{j\ge 1} 2^{-K(j)}\,P_j(i) \ge 2^{-K(j^*)}\,P_{j^*}(i) \approx \omega(i),$$

up to the multiplicative constant 2−K(j∗) in the last step. In other words, Psol has "more weight" on all indices i than ω. Consequently, when adapting Theorem 5 to general lower semicomputable mixtures in place of M, Psol provides a better bound than ω (up to an additive constant):
$$-\log P_{\mathrm{sol}}(i) = K(i) \le -\log\omega(i).$$

There seems to be no analogous result for our a priori prior Pap(ν). Indeed, it would be surprising if such a result existed, since this prior is very non-invariant under a change of the universal monotone Turing machine. On the flip side, the fact that Pap is likely not lower semicomputable means that Psol does not necessarily beat Pap.
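The optimality argument above uses only that a mixture dominates each of its weighted components. A minimal numeric sketch, with made-up weights standing in for 2−K(j) and toy mass functions standing in for the Pj:

```python
# w[j] plays the role of 2^-K(j); P[j] plays the role of P_j.
w = [0.5, 0.25, 0.125]                 # hypothetical weights (sum <= 1)
P = [[0.7, 0.2, 0.1],
     [0.1, 0.8, 0.1],
     [0.3, 0.3, 0.4]]                  # toy (semi)probability mass functions

# Mixture xi(i) = sum_j w[j] * P[j][i].
xi = [sum(w[j] * P[j][i] for j in range(3)) for i in range(3)]

# The mixture beats every component up to the constant w[j*], mirroring
# 2^-K(i) >~ 2^-K(j*) * omega(i) in the argument above.
for j in range(3):
    for i in range(3):
        assert xi[i] >= w[j] * P[j][i]
```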
Advantages of Pap over Psol
Dependence on fewer choices. The definition of Psol depends on the choice of a universal prefix machine Upr, on top of the choice of U that we already made use of before. Pap has no such further dependence.
Pap is platonic, Psol is not. Theorem 4 is very natural: The proof is essentially revealed from the definition of Pap itself, giving the definition intrinsic justification due to the beautiful result. In contrast, the analogous statement that
$$M(x) \approx \sum_{i=1}^{\infty} 2^{-K(i)}\,\nu_i(x) =: \xi_{U_{\mathrm{pr}}}(x)$$

depends only very loosely on the definitions of M and the Solomonoff prior Psol(i)=2−K(i), meaning we could have chosen a very different prior without many negative consequences. To explain this, I state the proof of the result: M itself is an LSCSM, meaning that M=νi0 for some i0∈N. This results in
$$\xi_{U_{\mathrm{pr}}}(x) \ge 2^{-K(i_0)}\,\nu_{i_0}(x) = c\cdot M(x),$$

which already establishes the inequality (up to a multiplicative constant) in one direction. For the other direction, first note that ξUpr is lower semicomputable, since 2−K(i) is lower semicomputable and each νi is lower semicomputable. Furthermore, we have
$$\xi_{U_{\mathrm{pr}}}(x) = \sum_{i=1}^{\infty} 2^{-K(i)}\,\nu_i(x) \ge \sum_{i=1}^{\infty} 2^{-K(i)}\bigl(\nu_i(x0)+\nu_i(x1)\bigr) = \xi_{U_{\mathrm{pr}}}(x0) + \xi_{U_{\mathrm{pr}}}(x1).$$

This almost establishes ξUpr as an LSCSM, except that it does not satisfy our assumption that ξUpr(ϵ)=1. Thus, define ~ξUpr to be identical to ξUpr except that ~ξUpr(ϵ)=1. Then this is an LSCSM, and we obtain
$$M(x) = \sum_{\nu\in\mathcal{M}_{\mathrm{sol}}} P_{\mathrm{ap}}(\nu)\,\nu(x) \ge P_{\mathrm{ap}}(\tilde\xi_{U_{\mathrm{pr}}})\,\tilde\xi_{U_{\mathrm{pr}}}(x) \ge P_{\mathrm{ap}}(\tilde\xi_{U_{\mathrm{pr}}})\,\xi_{U_{\mathrm{pr}}}(x),$$

which is precisely the desired inequality in the other direction. As we can see, this theorem depends only on the fact that both M and ξUpr are (essentially) LSCSMs that are mixtures of all LSCSMs; the specifics of M and ξUpr are irrelevant, and the same comparison would hold for any other lower semicomputable mixture of all LSCSMs. It thus appears to me that the Solomonoff prior Psol(i)=2−K(i) is ultimately arbitrary and has no a priori justification.
Pap leads to fewer constants. Related to the previous points, the representation of M(x) as ∑i2−K(i)νi(x) holds only up to a multiplicative constant, whereas Theorem 4 was an exact equality. Theorem 3.8.8 in Hutter et al.[10] claims that with a smart choice of the two universal Turing machines U and Upr, an exact equality can be achieved, but I am unsure if such a choice can be justified "a priori" in a way that feels satisfying.
Conclusion
In this post, I have introduced Solomonoff induction without using K-complexity: A hypothesis for our universe, modeled as a lower semicomputable semimeasure (LSCSM), is considered to be as likely "as it truly is" when sampling uniformly random program codes and piping them through a universal monotone Turing machine. Almost tautologically, this means that the probability of sampling a history x with the universal machine equals the probability of sampling the history from an LSCSM after first sampling an LSCSM from the a priori prior (Theorem 4).
I then defined Solomonoff induction as the process of predicting sequence continuations in a universe under the assumption that they were generated from a uniformly random computer program. Applying Theorem 4 then leads to an equivalent description in terms of the a priori prior: To continue a sequence, first form the posterior over hypotheses with Bayesian reasoning, and then use the mixture over the posterior to predict the continuation. Since the prior over hypotheses has a simplicity-bias (hypotheses with a shorter encoding are more likely), this can be considered a formalization of Occam's razor.
In Theorem 5, we saw that this leads to the well-known Solomonoff bound. It shows that the cumulative error of predicting continuations in any computed universe is upper bounded by the negative logarithm of the probability of the universe under the a priori prior, i.e., by a cross-entropy loss. This resolves the historical puzzlements in the introduction and explains how it is possible to predict at all, using just the reasonable assumption that our universe is computed.
I then compared the a priori prior to the Solomonoff prior that is based on K-complexity, and found that the latter has some distinct mathematical advantages: The prior is lower semicomputable, invariant under a change of the universal prefix Turing machine it is based on, and it leads to optimal Solomonoff bounds when compared to other priors that are lower semicomputable (which, however, is not necessarily an advantage against the bound in Theorem 5 since the a priori prior is likely not lower semicomputable). This means that the Solomonoff prior is probably the better mathematical basis for developing further theory.
Yet, the a priori prior seems ultimately more natural, in that it does not depend on an additional choice of a universal prefix machine, and the mixture description of the universal distribution (Theorem 4) is tautologically true — rather than being achieved by a post-hoc analysis involving a further constant multiplicative error, as is the case for the Solomonoff prior. As mentioned to me by Cole Wyeth, perhaps it is possible to unify the two perspectives with a construction akin to a Friedberg numbering by finding a unique encoding of each LSCSM. This might then salvage the coding theorem (aka the correspondence between degeneracy and simplicity, or between alt-complexity and K-complexity) in this setting.
FAQ (Frequently anticipated questions)
Here I answer some questions that I have frequently anticipated.
Q: In our universe, we don't observe the entire history x1,…,xt up until this point in time. Instead, we observe a small part of the universe for a finite amount of time. How is sequence prediction in reality then compatible with Solomonoff induction?
A: The hypothesis ν in reality is not only the simulator of the universe, but instead the simulator of your precise observations in our universe. Thus, ν then contains a description of the universe and a description of your location in time and space. One further complication is that our perceptions are "qualia", which we do not know how to relate to the formalism of symbol strings; see Wei Dai's post of open problems.
Q: Should we try to predict the real world using Solomonoff induction?
A: That's fairly intractable, so we have to use alternatives. The question is then whether it would be a useful target to try to approximate. My guess is that this depends on the timescales: The shorter the history of our observations, the more we should trust the priors ingrained into our skulls by evolution, for they might be better than the Solomonoff prior and more adapted to our actual universe. On large time horizons, perhaps it becomes increasingly useful to actually approximate Solomonoff induction.
Q: Do neural networks do Solomonoff induction in any meaningful sense?
A: There is an interesting analogy between neural networks and Solomonoff induction: Namely, in the same way that our version of Solomonoff induction predicts sequence continuations by piping random noise through a universal Turing machine, a neural network function is at initialization sampled by arranging random parameters in an architecture. If it additionally turns out that gradient descent is akin to Bayesian updating, the analogy might become quite close. If there turns out to be a sufficiently strong connection between probability and K-complexity in the Solomonoff context (as in the coding theorem for a specific case), one might then reason that neural networks also have a bias toward learned functions with short descriptions, perhaps explaining their generalization capabilities. See also here, here, and here.
Q: Is the Solomonoff a priori distribution M just a useful model, or do you actually expect our world to be sampled from it? Is it a theory of everything?
A: My intuition is that it's just a useful model. It depends on the universal monotone Turing machine U, which feels unnatural. I'm also uncertain whether our universe is really discrete and computed. Perhaps we live in a continuous mathematical structure instead, and time just emerges in our experience? I have not engaged much with the precise overlap and correspondence between mathematical structures and Turing machines and would be happy for someone to say more in the comments.
Q: How is the complexity of a string under a universal monotone Turing machine related to the one under a different Turing machine model, which may be more akin to programming languages in which we typically write code?
A: There are several bounds that closely relate different codelengths to each other, typically up to at most logarithmic error terms. For an example, see Theorem 3.8.7 in Hutter et al.[10]
M. Li and P. Vitányi. An Introduction to Kolmogorov Complexity and Its Applications. Springer Cham, 4th edition, 2019. ISBN 9783030112974, DOI: 10.1007/978-3-030-11298-1.
I found some ambiguity in the literature on whether monotone Turing machines are only partial functions, i.e., only producing an output for some of their inputs, or total functions. I settled on total functions since this makes things conceptually simpler and seems to be the most common view.
Formally, one can define this limit as a supremum over the prefix relation.
In the literature, one typically encounters the definition ν(ϵ)≤1; however, for our purpose, equality is more natural and will later ensure that all lower semicomputable semimeasures can be represented by a monotone Turing machine.
Confusingly, Li and Vitányi use a definition of LSCSMs where ν(ϵ)≤1 (different from our definition where ν(ϵ)=1), and yet claim that all of them are covered by monotone Turing machines. Perhaps they use a somewhat different definition of monotone Turing machines T in which it is allowed for outputs to be entirely undefined (thus, not even the empty string), which means that νT(ϵ)<1, whereas our monotone Turing machines are total functions. This paper thinks Li and Vitányi are making a mistake and writes: "It is equivalent to Theorem 4.5.2 in [LV08] (page 301) with a small correction: λT(ϵ)=1 for any T by construction, but μ(ϵ) may not be 1, so this case must be excluded."
Pap is actually a proper probability mass function, meaning that ∑νPap(ν)=1. The reason is that each s∈B∞ is of the form s=p′t for a p∈B∗ and t∈B∞, with all s with the same prefix p′ having weight 2−l(p′) together. Thus, 1=P(B∞)=∑p∈B∗2−l(p′)=∑νPap(ν).
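This Kraft-style sum can be checked numerically for one concrete self-delimiting encoding, p' = 1^{l(p)} 0 p, so that l(p') = 2·l(p) + 1. This is an illustrative choice of p', not necessarily the encoding used in the post:

```python
# There are 2^n binary strings p of length n, each contributing
# 2^-(2n+1), so the Kraft sum telescopes:
#   sum_n 2^n * 2^-(2n+1) = sum_n 2^-(n+1) = 1.
total = sum((2 ** n) * 2.0 ** -(2 * n + 1) for n in range(60))
assert abs(total - 1.0) < 1e-12
```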
The reverse may not necessarily hold: A hypothesis can receive a lot of weight simply by having exponentially many (longer) descriptions, with a priori no guarantee for there to then be a short description. For gaining intuitions on this perspective, I can recommend Nate's post on alt-complexity and the last footnote.[12]
In other words, the theorem assumes that our universe cannot simply end at any finite time, compared to semimeasures which can always stop a sequence continuation. This seems restrictive at first, and would exclude possibilities like a big crunch in which the universe eventually ends. However, I think such scenarios can likely be rescued even in the true measure formalism: For a sequence that ends, simply continue it with 000… to an infinite sequence. With this method, it should be possible to assign a measure to any semimeasure, which hopefully has a similar a priori probability. If this were the case, then we would still be able to predict our actual universe effectively up until the moment when it ends, when predicting it correctly does not matter anymore. I did not check these claims in detail, but perhaps Lemma 1 in this post already contains the answer.
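The rescue described here, routing each halting deficit to the all-zeros continuation, can be sketched on a toy semimeasure. The numbers are arbitrary, and this is only one way to formalize the "continue with 000…" idea:

```python
# Toy semimeasure nu on strings of length <= 2 (dict keyed by strings).
nu = {"": 1.0, "0": 0.4, "1": 0.3,
      "00": 0.2, "01": 0.1, "10": 0.3, "11": 0.0}

def mu(x):
    """Measure obtained from nu by continuing halted sequences with 0s."""
    if x == "":
        return 1.0
    parent, b = x[:-1], x[-1]
    # Mass that nu loses at `parent` (sequences halting there) is routed
    # to the child that appends a 0.
    deficit = mu(parent) - nu.get(parent + "0", 0.0) - nu.get(parent + "1", 0.0)
    return nu.get(x, 0.0) + (deficit if b == "0" else 0.0)

# mu is a proper measure: the two children always sum to the parent ...
for x in ["", "0", "1"]:
    assert abs(mu(x + "0") + mu(x + "1") - mu(x)) < 1e-12
# ... and mu dominates nu, so prediction bounds for mu transfer to nu.
for x in nu:
    assert mu(x) >= nu[x]
```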
Note that we could have also written this error as the expected value
$$S^\mu_\infty = \mathbb{E}_{s\sim\mu}\Bigl[\sum_{t=1}^{\infty}\sum_{x_t\in\mathbb{B}}\bigl(M(x_t\mid s_{<t})-\mu(x_t\mid s_{<t})\bigr)^2\Bigr],$$

which makes it conceptually clearer that this is an expected cumulative prediction error.
Marcus Hutter, David Quarel, and Elliot Catt. An Introduction to Universal Artificial Intelligence. Chapman & Hall, 2024. ISBN 9781032607023, DOI: 10.1201/9781003460299.
The reader may wonder why we go through the hassle of using a separate universal prefix machine at all. With p=bin(i), another option would be to define P(i):=2−l(p′) as the weight for νi=νp. The issue with this definition is, perhaps paradoxically, that P is then computable, which by Lemma 4.3.1 in Li and Vitányi[1] means that P is not universal, i.e., does not dominate all other computable (let alone semicomputable) semimeasures. This, in turn, means that this prior does not lead to the same optimal bounds that the prior Psol(i)=2−K(i) enjoys.
Nate assumes a universal programming language U that receives input codes p, which U evaluates to output functions f. An example for p is code for "quicksort" or "bubblesort", both of which lead to the same function f that receives a list as input and outputs its sorted version. He then defines the K-complexity
$$K^U(f) := \min\{\,l(p) \mid \mathrm{eval}_U(p) = f\,\},$$

and the alt-complexity
$$K^U_{\mathrm{alt}}(f) := -\log_2 \sum_{p:\,\mathrm{eval}_U(p)=f} 2^{-l(p)} =: -\log_2 P^U(f).$$

Many commenters claimed that these differ only up to an additive constant, but I believe this to be wrong.
At first sight this seems true, since the claim looks identical to the coding theorem, which states the equality (up to an additive constant) of K(x) and −log2P(x) defined for a universal prefix machine, with x a binary string. The proof of the coding theorem crucially uses the lower semicomputability of P(x), which is achieved by dovetailing through all p, checking whether the universal prefix machine sends them to x, and increasing the estimate of P(x) by 2−l(p) whenever this happens. These increasing estimates are crucial for finding codewords for x that get progressively shorter and approach −log2P(x) in length, which then establishes the approximate equality to K(x).
The same proof strategy does not work for Nate's alt-complexity KUalt, since, to my knowledge, there cannot exist an algorithm that checks for arbitrary program p and function f whether evalU(p)=f; after all, both sides of this equation are themselves functions that can in principle receive infinitely many inputs, and we cannot iterate over them all (cf. Rice's theorem). Thus, the lower semicomputability of PU seems to fail, and I know of no method to construct codewords that eventually approximate the length −log2PU(f). The coding theorem thus likely fails in this context, unless an entirely different proof strategy succeeds.
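For contrast, here is a sketch of the dovetailing construction that does work for a prefix machine on strings, with a toy finite "machine" (a dict from programs to output and halting time, purely illustrative) standing in for a real UTM:

```python
# Toy "prefix machine": each program maps to (output, steps until it halts).
machine = {"0": ("a", 3), "10": ("b", 1), "110": ("a", 7), "111": ("b", 2)}

def approximations(x, max_stage):
    """Yield increasing lower bounds on P(x) = sum_{p: U(p)=x} 2^-l(p):
    at stage s, run every program for s steps and credit 2^-l(p) for each
    program that has halted with output x by then."""
    for s in range(1, max_stage + 1):
        yield sum(2.0 ** -len(p)
                  for p, (out, steps) in machine.items()
                  if steps <= s and out == x)

approx = list(approximations("a", 10))
assert approx == sorted(approx)              # the estimates only grow
assert approx[-1] == 2.0 ** -1 + 2.0 ** -3   # the true P("a") on this toy machine
```

For string outputs, equality of outputs is decidable, so this enumeration witnesses lower semicomputability; for function outputs, as argued above, the check `out == x` is exactly the step that becomes undecidable.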