Open Thread: September 2009

AllanCrossman

I declare this Open Thread open for discussion of Less Wrong topics that have not appeared in recent posts.

Just found this note in Shalizi's notebooks which casts an interesting shadow on the Solomonoff prior:

The technical results say that a classification rule is simple if it has a short description, measured in bits. (That is, we are in minimum description length land, or very close to it.) The shorter the description, the tighter the bound on the generalization error. I am happy to agree that this is a reasonable (if language-dependent) way of defining "simplicity" for classifier rules. However, so far as I can tell, this really isn't what makes the proofs work.

What actually makes the proofs go is the fact that there are not very many binary strings of length m for small m. Finding a classifier with a short description means that you have searched over a small space. It's the restriction of the search space which really does all the work. It's not the simplicity of the rules which matters, but the fact that simple rules are scarce in the space of all possible rules. If confines oneself to elaborate, baroque rules, but sticks to a particular style of baroque elaboration, one obtains the same effect.

For context, see the Wikipedia page on PAC learning or this lecture by Scott Aaronson.

Looks like I almost missed a very interesting discussion. Hope I'm not too late in joining it.

As far as I can tell, you still need the Solomonoff prior for decision making. Kevin T. Kelly's Ockham Efficiency Theorem says that by using Ockham's Razor you minimize reversals of opinion prior to finding the true theory, but that seems irrelevant when you have to bet on something, especially since even after you've found the truth using Ockham's Razor, you don't know that you've found it.

Also, I think there's a flaw in Shalizi's argument:

Suppose one of these

... (read more)

4Johnicholas17y

Here's my understanding of the dialog, which (as I read it) is not particularly critical of the Solomonoff prior, if that is what cousin_it meant by "casts a shadow". (Background knowledge) Shalizi understands "Occam's Razor" to be something like "In order to reach the truth, among the theories compatible with the evidence, chose the simplest". There is a claim that he wishes to refute. The claim is that a certain result is an explanation or proof of Occam's Razor. The result says that if one finds a simple classification rule which works well in-sample, then it is highly probable that it will continue to work well out-of-sample. This is a failure of relevance. Occam's Razor, as Shalizi understands it, is a way of obtaining TRUTH, but the proof only concludes something about GENERALIZATION PERFORMANCE. To illustrate the difference, he points to an example where, in order to increase generalization performance, one might decrease truth. Shalizi contrasts the algorithmic information theory proof with Kevin T. Kelly's Ockham Efficiency Theorem, which seems to Shalizi more productive. In particular, Kevin T. Kelly's formalization does talk about truth rather than generalization performance. Finally, Shalizi provides an alternative ending to the algorithmic information theory proof. If instead of choosing the simplest classification rule, one chose the simplest rule within a sparse random subset of rules (even a non-computable random subset), then you could still conclude a bound on generalization performance. By providing an alternative ending, he has constructed an alternative proof. Presumably this alternative proof does NOT seem like it is a formalization of Occam's Razor. Therefore, the interpretation of the original proof as demonstrating some version of Occam's Razor must also be mistaken. To sum up, Shalizi is arguing that a certain rhetorical/motivational/interpretational notion which often occurs near a specific proof is wrong. I don't think he's concludi

1DanB17y

The point of Kolmogorov complexity is that there is some limit to how baroque your rules can ever become, or how baroque you can make a fundamentally simple rule by choosing a really weird prior. In algorithmic information theory, the problem of choosing a prior is equivalent to choosing a particular universal Turing machine. If you pick a really weird Turing machine, you will end up assigning low prior probability to things that "normal" people would consider simple - like a sine wave, for example. But because of universal computation, there's a limit to how low a probability you can assign. Your weird machine is still universal, so somehow or other it's got to be able to produce a sine wave, after whatever translation-prefix machinations you have to perform to get it to act like a normal programming language. Another way of viewing this is just to eschew the notion of generalization and state that the goal of learning is compression. If you do this, you end up doing all the same kinds of work, with many fewer philosophical headaches. Now the great deep problem of picking a prior boils down to the rather more quotidian one of picking a data format to use to transmit/encode data.