All of Stephen_Jordan's Comments + Replies

It seems to me that the cancellation is an artifact of the particular example, and that it would be easy to come up with an example in which the cancellation does not occur. For example, maybe you have previous experience with the mugger. He has mugged you before about minor things and sometimes you have paid him and sometimes not. In all cases he has been true to his word. This would seem to tip the probabilities at least slightly in favor of him being truthful about his current much larger threat.

Even in that case I would assign enormously higher probability to the hypothesis that my deadbeat pal has caught some sort of brain disease that results in compulsive lying, than that such a person has somehow acquired reality-breaking powers but still has nothing better to do than hit me up for spare change.


By considering models in the first place, one is already using Occam's razor. With no preference for simplicity in the priors at all, one would start with uniform priors for all possible data sequences, not finite-parameter models of data sequences. If you formalize models as being programs for Turing machines which have a separate tape for inputting the program, and your prior is a uniform distribution over possible inputs on that tape, you exactly recover the 2^-k Occam's razor law, where k is the number of program bits that the Turing machine re... (read more)

Venkat: I think there is a very good reason to mention PAC learning. Namely, Kolmogorov complexity is uncomputable, so Solomonoff induction is not possible even in principle. Thus one must use approximate methods instead such as PAC learning.

In Solomonoff induction it is important to use a two-tape Turing machine where one tape is for the program and one is for the input and work space. The program tape is an infinite random string, but the program length is defined to be the number of bits that the Turing machine actually reads during its execution. This way the set of possible programs becomes a prefix free set. It follows that the prior probabilities will add up to one when you weight by 2^(-l) where l is program length. (I believe this was realized by Leonid Levin. In Solomonoff's original... (read more)