Stephen_Jordan

William,

By considering models in the first place, one is already using Occam's razor. With no preference for simplicity in the priors at all, one would start with uniform priors for all possible data sequences, not finite-parameter models of data sequences. If you formalize models as being programs for Turing machines which have a separate tape for inputting the program, and your prior is a uniform distribution over possible inputs on that tape, you exactly recover the 2^-k Occam's razor law, where k is the number of program bits that the Turing machine reads during its execution (i.e. the Kolmogorov complexity).

Interestingly, the degree to which one can outperform this distribution is provably bounded. Suppose you are considering some distribution of priors and you want to see how much it outperforms the 2^-k distribution. It can do so if its prior for the true model is higher than 2^-k. So you compute the ratio of this prior to 2^-k for every model. If I remember correctly, there is a theorem which says that this set of ratios will have a finite supremum for any computable prior distribution (which is not obvious since the set of models is infinite). Of course, in some respects, this is an unfair comparison since the 2^-k distribution is itself uncomputable. Hopefully I am not misstating the theorem. I think it is stated and proved in the book by Li and Vitanyi.

To see why ignoring Occam's razor is unthinkable (and to be amused), consider the following joke.

An astronaut visits a planet with intelligent aliens. These aliens believe in the reverse induction principle. That is, whatever has happened in the past is unlikely to be what will happen in the future. That the sun has risen every previous day is to them not evidence that it will rise tomorrow, but the contrary. Unsurprisingly, this causes all sorts of problems for the aliens, and they are starving and miserable. The astronaut asks them, "Why are you clinging to this belief, given that it obviously causes you so much suffering?" The aliens respond, "Well...it's never worked well for us in the past!"

Venkat: I think there is a very good reason to mention PAC learning. Namely, Kolmogorov complexity is uncomputable, so Solomonoff induction is not possible even in principle. Thus one must use approximate methods instead such as PAC learning.

In Solomonoff induction it is important to use a two-tape Turing machine where one tape is for the program and one is for the input and work space. The program tape is an infinite random string, but the program length is defined to be the number of bits that the Turing machine actually reads during its execution. This way the set of possible programs becomes a prefix free set. It follows that the prior probabilities will add up to one when you weight by 2^(-l) where l is program length. (I believe this was realized by Leonid Levin. In Solomonoff's original scheme the prior probabilities did not add to one.) This also allows the beautiful interpretation that the program tape is assigned by independent coin flips for each bit, and the 2^-l weighting arises naturally rather than as an artificial assumption. I believe this is discussed in the information theory book by Cover and Thomas.

It seems to me that the cancellation is an artifact of the particular example, and that it would be easy to come up with an example in which the cancellation does not occur. For example, maybe you have previous experience with the mugger. He has mugged you before about minor things and sometimes you have paid him and sometimes not. In all cases he has been true to his word. This would seem to tip the probabilities at least slightly in favor of him being truthful about his current much larger threat.