How Bayes' theorem is consistent with Solomonoff induction

Alex_Altair

You've read the introduction to Bayes' theorem. You've read the introduction to Solomonoff induction. Both describe fundamental theories of epistemic rationality. But how do they fit together?

It turns out that it’s pretty simple. Let’s take a look at Bayes’ theorem.

$P(H|E)=\frac{P(E|H)P(H)}{\sum_{i}P(E|H_i)P(H_i)}$

For a review:

H is the particular hypothesis in question.
E is the evidence, or data that we have observed.
P(H) is the “prior” probability of the hypothesis alone.
P(E|H) is the probability of seeing the evidence given that the hypothesis is true.
P(H|E) is the probability that the hypothesis is true, given that you’ve seen the evidence.
H_i is an arbitrary element in the set of all hypotheses.

In terms of Solomonoff induction:

H is the set of all binary sequences which we input to the universal Turing machine.
E is the binary sequence of data that we are given to match.
P(H) is $2^{-l(H)}$ where l(H) is the length of the binary sequence H. This is a basic premise of Solomonoff induction.
P(E|H) is the probability that we will see data sequence E, given that we run program H on the universal Turing machine. Because this is deterministic it is either 1, if H outputs E, or 0, if H does not output E.
P(H|E) is still what we’re looking for. Of course, if H does not output E, then P(E|H) = 0, which means P(H|E) = 0. This makes sense; if a program does not output the data you have, it cannot be the true program which output the data you have.
H_i is an arbitrary binary sequence.

The denominator is the same meaning as the numerator, except as a sum for every possible hypothesis. This essentially normalizes the probability in the numerators. Any hypotheses that do not match the data E exactly will cause P(E|H_i) = 0, and therefore that term will contribute nothing to the sum. If the hypothesis does output E exactly, then P(E|H_i) = 1, and the matching hypothesis contributes its weight to the renormalizing sum in the denominator.

Let's see an example with these things substituted. Here, the set of H_i is the set of hypotheses that match.

$P(H|E)=\frac{2^{-l(H)}}{\sum_{i}2^{-l(H_i)}}$

In summary; Bayes’ theorem says that once we find all matching hypotheses, we can find their individual probability by dividing their individual weight of $2^{-l(H)}$ by the weights of all the matching hypotheses.

This is intuitive, and matches Bayes’ theorem both mathematically and philosophically. Updating will occur when you get more bits of evidence E. This will eliminate some of the hypotheses H_i, which will cause the renormalization in the denominator to get smaller.

Huh? All of that applies to any choice of priors whatsoever, not just Solomonoff's. Or am I missing something?

I'm saying that Solomonoff induction doesn't contradict Bayes' theorem. The purpose of Solomonoff induction was to find an objective prior, but then after they discovered it, it included a way of updating too. Bayes' theorem turned out to be redundant. But since we're pretty sure Bayes' theorem is correct, it's nice to see that they don't contradict.

Solomonoff induction as opposed to what? Is there any choice of priors which does contradict Bayes' theorem?

Solomonoff induction is more than a choice of priors. It's also a method of finding all possible hypotheses, and a method of computing likelihoods. It's an entire system of reasoning.

Worth also noting possible misunderstanding from 0 and 1 are not probabilities .

I guess I made conversational assumption that when Bayes name is used rather than 'Aristotelian logic', it speaks of non-binary probabilities rather than the limit in which Bayes does not contradict Aristotelian logic of the form 'if hypothesis does not match data exactly, hypothesis is wrong'.

A clarification if I might:

"is the probability that we will see data sequence E, given that we run program H on the universal Turing machine."

I think it'll be helpful to word it as "output begins with the data sequence E", as it is generally a very common misconception that it suffices to see E somewhere within the output; that it suffices that the H "explains" the data (the original article used "explains").

When thinking of e.g. the universe, the "explains" is typically taken to mean "the universe contains me somewhere" and a form of anthropic reasoning, which can lead to substantially different concept than Solomonoff induction.

As a side note, one can obtain a type of anthropic reasoning prior by including some self-description on extra tape that can be read; then the code can search for instances of itself within the models for only a constant cost, but still needs to be predictive, i.e. output string that begins with the observed data. This seems no different (up to a constant) from simply including the self description as part of the data sequence E . edit: on second thought, extra tape is different in major fallible way: the self description on extra tape, if sufficiently complete, can allow to construct the god in your own image for 'goddidit' . One should just add self description as part of the data sequence E . It is still no-different-up-to-a-constant though.

This should also be in main.