Related to: Occam's Razor

If the Razor is defined as, “On average, a simpler hypothesis should be assigned a higher prior probability than a more complex hypothesis,” or stated in another way, "As the complexity of your hypotheses goes to infinity, their probability goes to zero," then it can be proven from a few assumptions.

1)      The hypotheses are described by a language that has a finite number of different words, and each hypothesis is expressed by a finite number of these words. That this allows for natural languages such as English, but also for computer programming languages and so on. The proof in this post is valid for all cases.

2)      A complexity measure is assigned to hypotheses in such a way that there are or may be some hypotheses which are as simple as possible, and these are assigned the complexity measure of 1, while hypotheses considered to be more complex are assigned higher integer values such as 2, 3, 4, and so on. Note that apart from this, we can define the complexity measure in any way we like, for example as the number of words used by the hypothesis, or in another way, as the shortest program which can output the hypothesis in a given programming language (e.g. the language of the hypotheses might be English but their simplicity measured according to a programming language; Eliezer Yudkowsky follows this way in the linked article.) Many other definitions would be possible. The proof is valid for all definitions that follow the conditions laid out.

3)      The complexity measure should also be defined in such a way that there are a finite number of hypotheses given the measure of 1, a finite number given the measure of 2, a finite number given the measure of 3, and so on. Note that this condition is not difficult to satisfy; it would be satisfied by either of the definitions mentioned in condition 2, and in fact by any reasonable definition of simplicity and complexity. The proof would not be valid without this condition precisely because if simplicity were understood in such a way as to allow for an infinite number of hypotheses with minimum simplicity, the Razor would not be valid for that understanding of simplicity.

The Razor follows of necessity from these three conditions. To explain any data, there will be in general infinitely many mutually exclusive hypotheses which could fit the data. Suppose we assign prior probabilities for all of these hypotheses. Given condition 3, it will be possible to find the average probability for hypotheses of complexity 1 (call it x1), the average probability for hypotheses of complexity 2 (call it x2), the average probability for hypotheses of complexity 3 (call it x3), and so on. Now consider the infinite sum “x1 + x2 + x3” Since all of these values are positive (and non-zero, since zero is not a probability), either the sum converges to a positive value, or it diverges to positive infinity. In fact, it will converge to a value less than 1, since if we had multiplied each term of the series by the number of hypotheses with the corresponding complexity, it would have converged to exactly 1—because probability theory demands that the sum of all the probabilities of all our mutually exclusive hypotheses should be exactly 1.

Now, x1 is a finite real number. So in order for this series to converge, there must be only a finite number of later terms in the series equal to or greater than x1. There will therefore be some complexity value, y1, such that all hypotheses with a complexity value greater than y1 have an average probability of less than x1. Likewise for x2: there will be some complexity value y2 such that all hypotheses with a complexity value greater than y2 have an average probability of less than x2. Leaving the derivation for the reader, it would also follow that there is some complexity value z1 such that all hypotheses with a complexity value greater than z1 have a lower probability than any hypothesis with a complexity value of 1, some other complexity value z2 such that all hypotheses with a complexity value greater than z2 have a lower probability than any hypothesis of complexity value 2, and so on.

From this it is clear that on average, or as the complexity tends to infinity, hypotheses with a greater complexity value have a lower prior probability, which was our definition of the Razor.

N.B. I have edited the beginning and end of the post to clarify the meaning of the theorem, according to some of the comments. However, I didn't remove anything because it would make the comments difficult to understand for later readers.

New Comment
139 comments, sorted by Click to highlight new comments since:
Some comments are truncated due to high volume. (⌘F to expand all)Change truncation settings

Another way of putting this is that all probability distributions over an infinite enumerated set are biased towards elements that occur earlier, regardless of the principle of enumeration.

Certainly not regardless of the probability distribution though. Do you have a particular probability distribution in mind?
Regardless of the probability distribution. If one has any assignment of probabilities to an infinite series of mutually exclusive hypotheses H1, H2, ..., then for every epsilon > 0 there is an N such that every hypothesis after the Nth has probability less than epsilon. In fact, there is an N such that the sum of the probabilities of all the hypotheses after the Nth is less than epsilon.
But N could be 3^^^3. Which does an injury to the term early in my book. E.g. I could swap the probabilities of p (x3^^^3) and p(x1).
Indeed you could, but that problem is already present in the definition of Kolmogorov complexity. It's only defined up to an arbitrary additive constant determined by (in one formulation) the choice of a universal Turing machine. The Kolmogorov complexity of a string is the size of the shortest input for that UTM that produces that string as output, but there's nothing in the definition to prevent the UTM from having any finite set of arbitrary strings on speed-dial.
Kelly deals with this by looking at complexity from other angles. For example, a complex world can give you a long sequence of observations persuading you that it's a simple world and then suddenly "change its mind", but a simple world cannot pretend that it's complex.
Why not? It would look almost exactly like the complex worlds imitating it, wouldn't it?
Hmm... maybe I was reading your claim as stronger than you intended. I was imagining you were claiming that property would hold for any finite enumerated subset, which clearly isn't what you meant.
If the sum of every term in a sequence after the Nth one is less than epsilon, then the sum of every term in any subsequence after the Nth one is also less than epsilon.
Right, but that isn't what I meant - it is not necessarily the case that for every n, every hypothesis after the nth has probability less than that of the the nth hypothesis. Obviously - which is why I should have noticed my confusion and not misread in the first place.
Yes, regardless of the distribution.
And yet another way is this: every computable enumeration of all programs is equivalent to a UTM, and therefore generates the same measure of Kolmogorov complexity, up to the usual additive constant. Proof: Computability of the enumeration means that there is a program which takes any integer N and returns the Nth in that enumeration. Represent this program as a Turing machine and cascade it with a second Turing machine that executes the program returned by the first TM. Is this a novel observation? Does it go any distance towards answering Shalizi's question at the end of this note?
This assumes countable additivity. Otherwise, each basic event could have probability zero, though the whole has probability 1. I'm inclined to use countable additivity, but I couldn't give you arguments for it, other than by sketching models that violate it and pointing out that they are weird.
Yes, exactly.
This is the best way of putting it.

Your proof can't be right because it doesn't use the concept of "complexity" in any non-trivial manner. If we replace "complexity" with "flubbity", so that there's only a finite number of hypotheses for any given "flubbity", your proof will still go through.

For some actual work on justifying Occam's Razor, see Kevin T. Kelly's Ockham Efficiency results. Kelly unpacks "complexity" in a nontrivial way that isn't just description length.

ETA: see this earlier discussion about learning and simplicity, it's very relevant to the topic of your interest.

6Wei Dai
Consider a creature with a prior of the type described in this post. Then there is some concept of "foobity", such that hypotheses with higher "foobity" are assigned smaller weights. The creature will find that it follows Occam's Razor, if it does not have a separate concept of "complexity" such that "complexity" != "foobity". But why would it, unless there was some reason for that to be the case? "Typically" there would be no reason for the creature to have an explicit concept of complexity that's different from the concept of complexity implicit in its prior. Assuming that's the case, its concept of complexity may still seem very strange and convoluted to us or to some other creature, but to that creature it will appear to be perfectly natural, since there's nothing else to judge it by. Could we just be such a creature? Intuitively, the answer seems to be no. Our concept of complexity seems to be natural in some absolute, objective sense, and not just relative to itself. But why is it so hard to pin that down?
Circular/anthropic arguments are seductive, but invariably turn out to be flawed because they predict less order and more narrowly-averted chaotic doom than we actually observe. Compared to our value system, which is genuinely a product of many evolutionary accidents, our concept of complexity is too simple because it can be captured (albeit imperfectly) by Turing machines. In other words, a creature using a randomly-evolved concept of "foobity" wouldn't be able to approach it with simple math, as we do. I think i'ts a mistake to reach for the anthropic hammer whenever you see a confusing problem. The most extreme example was when Nesov produced a "proof" that particle physics is a Darwinian construct, after which (thankfully) the absurdity heuristic finally kicked in.
2Wei Dai
But what do you mean by "simple" math? Simple according to what, if not "foobity"? ETA: I looked at Nesov's comment about particle physics, and didn't understand it. Can you explain?
Re ETA: Nesov said that particle physics are this way because you only care about the worlds where they are this way. Just like your explanation of probabilities. :-)
A correction (though I mixed that up in comments too): what we anticipate is not necessarily linked to what we care about. Particle physics is this way because we anticipate worlds in which it's this way, but we may well care about other worlds in which it isn't. Anticipation is about what we can control (as evolution saw the possibility, based on the past in the same world), not what we want to happen. Since evolution is causal, we don't anticipate acausal control, but we can care about acausal control. The useful conclusion seems to be that the concept of anticipation (and hence, reality/particle physics) is not fundamental in the decision theory sense, it's more like the concept of hunger: something we can feel, can have accurate theories about, but doesn't answer questions about the nature of goodness.
Don't know about you, but I anticipate acausal control, to a degree. I have a draft post titled "Taking UDT Seriously" featuring such shining examples as: if a bully attacks you, you should try to do maximum damage while disregarding any harm to yourself, because it's good for you to be predicted as such a person. UDT is seriously scary when applied to daily life, even without superintelligences.
4Wei Dai
I don't think UDT (or a variant of UDT that applies to humans that nobody has really formulated yet, because the original UDT assumed that one has access to one's source code) implies this, because the difference between P(bully predicts me as causing a lot of damaged | I try to cause max­i­mum dam­age) and P(bully predicts me as causing a lot of damaged | I don't try to cause max­i­mum dam­age) seems quite small (because the bully can't see or predict my source code and also can't do a very good job of simulating or predicting my decisions), while the negative consequences of trying to cause maximum damage seems quite high if the bully fails to be preemptively dissuaded (e.g., being arrested or sued or disciplined or retaliated against). (Not sure if you still endorse this comment, 9 years later, but I sometimes see what I consider to be overly enthusiastic applications of UDT, and as the person most associated with UDT I feel an obligation to push against that.)
Can you post this in the discussion area?
You seem to be mixing up ambient control within a single possible world with assignment of probability measure to the set of possible worlds (which anticipation is all about). You control the bully by being expected (credibly threatening) to retaliate within a single possible world. Acausal control is about controlling one possible world from another, while ambient (logical) control is about deciding the way your possible world will turn out (what you discussed in the recent posts). More generally, logical control can be used to determine an arbitrary concept, including that of utility of all possible worlds considered together, or of all mathematical structures. Acausal control is just a specific way in which logical control can happen.
Yep. I can't seem to memorize the correct use of our new terminology (acausal/ambient/logical/etc), so I just use "acausal" as an informal umbrella term for all kinds of winning behavior that don't seem to be recommended by CDT from the agent's narrow point of view. Like one-boxing in Newcomb's Problem, or being ready to fight in order to release yet-undiscovered pheromones or something.
"Correct" is too strong a descriptor, it's mostly just me pushing standardization of terminology, based on how it seems to have been used in the past.
Sorry, I misparsed your comment and gave a wrong answer, which I then deleted. Your original comment was trivially correct, and my reply missed the point. We can never justify our concept of complexity by thinking like that - linguistically - because this would be like trying to justify our prior with our prior, "a priori". If my prior is based on complexity and Bob's prior is based on foobity (religion, or whatever), we will find each other's priors weird. So if you ask whether all imaginable creatures have to use our concept of complexity, the easy answer is no. Instead we look at the outside world and note that our brand of razor seems to work. When it doesn't (religion, or whatever), we update it. Is there any other aspect to your question that I missed?
0Wei Dai
Let's call our brand of razor together with the algorithm we use to update it (using what we see from the outside world) our "meta-razor". Now is this "meta-razor" just a kind of "foobity", i.e., an arbitrary notion that we just happen to have, or is there something objective about it?
I spent some time thinking about your question and cannot give an answer until I understand better what you mean by objective vs arbitrary. The concept of complexity looks objective enough in the mathematical sense. Then, if I understand you correctly, you take a step back and say that mathematics itself (including logic, I presume?) is a random concept, so other beings could have wildly different "foomatics" that they find completely clear and intuitive. With the standards thus raised, what kind of argument could ever show you that something is "objective"? This isn't even the problem of induction, this is... I'm at a loss for words. Why do you even bother with Tegmark's multiverse then? Why not say instead that "existence" is a random insular human concept, and our crystalloid friends could have a completely different concept of "fooistence"? Where's the ground floor? Here's a question to condense the issue somewhat. What do you think about Bayesian updating? Is it "objective" enough?
1Wei Dai
Perhaps asking that question wasn't the best way to make my point. Let me try to be more explicit. Intuitively, "complexity" seems to be an absolute, objective concept. But all of the formalizations we have of it so far contain a relativized core. In Bayesian updating, it's the prior. In Kolmogorov complexity, it's the universal Turing machine. If we use "simple math", it would be the language we use to talk about math. This failure to pin down an objective notion of complexity causes me to question the intuition that complexity is objective. I'd probably change my mind if someone came up with a "reasonable" formalization that's not "relative to something."
Implementable on a machine during my lifetime. That's got to be an objective property, at least I'm wondering how you will spin it to sound subjective :-) Edit: whoops, sorry, this is wrong. Don't bother answering.
Short computer programs, compared to the ones that would encode our concepts of "beauty" or "fairness", say.
The fact that the proof uses only very weak properties of the notion of "complexity" does not show that the proof is invalid. In fact, it suggests (not with certainty) the opposite: that an even stronger result could be proven.
Thanks and apologies, I was being careless when I said the proof "couldn't be right". It may be formally right, but fail to explain why we should use Occam's Razor instead of the Flubby Razor.
Yes a statement like "on average, X holds" needs to be quantitative in order to be useful in practice. Still, this is a valuable argument, especially as far as it clarifies what needs to be done to reach a quantitative refinement.
The proof does allow for different Razors depending on your specific definition of complexity. It's true that this makes the statement fairly weak. But it is also in a way a virtue, because it corresponds with our actual use of the Razor. There are in fact different definitions of complexity and we use different ones in different contexts; we learn from experience which contexts require which Razors. For example, if you want to describe the path of a ball moving through the air, you expect to be able to describe it with some fairly simple mathematics, and you think this more likely than complicated mathematical descriptions. On the other hand, if you see a precise human-shaped footprint in the mud in your yard, saying "a human did it" is very complicated mathematically, since the mathematical description would include the description of a human being. So in this way, "Wind and weather did it" should be simpler, in the mathematical way. Nonetheless, you use a different Razor and say that with this Razor, it is simpler to say that a human being did it.
The last paragraph feels wrong to me. This is the explanation for that feeling that came to mind first, but I'm not positive it's my true objection: If your only observation were the footprint, sure--but your observations also include that you live on a planet with 7 billion people who leave footprints like that--so despite the comparitive algorithmic simplicity of the weather model, it doesn't match the observations as well.
Yes, it is likely that a stronger result is possible. However, it is difficult to make it stronger while satisfying my conditions (true in all possible worlds; true according to every logically consistent assignment of priors).
Interestingly, flubbity will correlate with complexity, regardless of how you define it. This is for pretty much the same reason as the inverse correlation of complexity and probability.
I don't understand your objection. Yes, it will apply to other attributes as well. I don't see how that prevents it from applying to complexity. If the objection is that I didn't describe complexity in detail, this doesn't matter, precisely because the proof will still go through regardless of how much or little is added.
It does prevent the proof from applying to complexity. Which hypothesis would you choose: the simpler one, or the less flubby one?
On average you should choose the simpler of two hypotheses compared according to simplicity, and again, on average you should choose the less flubby of two hypotheses compared according to flubbiness. Or to put this in more real terms: the Razor will be true if it is understood to mean that you should choose the hypothesis that can be described by the shorter program, and it will also be true that you should choose the hypothesis that can be described by a shorter English description. Yes, these can on particular occasions be opposed to one another, and you would have to ask which rule is better: choose the shorter English description, or choose the shorter program? My proof does not answer this question, but it doesn't have to, because both rules measure some kind of complexity, and the Razor is true whether it is taken in one way or the other.
Flubbity may not have much to do with complexity. In fact it can be opposed to complexity, except in the limit for extremely complex/flubby hypotheses. For example, you may say that flubbity=1000000-complexity for complexity<1000000, and flubbity=complexity elsewhere. Your proof will go through just fine, but in our world (which probably doesn't need such huge hypotheses) it will lead to the opposite of Occam's Razor. You don't always have the luxury of letting your parameter go to infinity.
By Occam's razor?
The proof follows on average. Naturally you can construct artificial examples that make fun of it, but there can be no proof of the Razor which is not based on averages, since in fact, it happens on various occasions that the more complex hypothesis is more correct than the simpler hypothesis.
I don't object to the formal correctness of the proof, but the statement it proves is way too weak. Ideally we'd want something that works for complexity but not flubbity. For any Occamian prior you care to build, I can take the first few hypotheses that comprise 99% of its weight, build a new prior that assigns them a weight of 1e-20 combined, and claim it's just as good as yours by Occamian lights. If we removed the words "on average" from the formulation of your theorem, we'd have a stronger and more useful statement. Kelly's work shows an approach to proving it not just "on average", but for all possible hypothesis lengths. ETA: I apologize for not objecting to the formal side of things. I just read the proof once again and failed to understand what it even means by "on average".
I started reading some of Kelly's work, and it isn't trying to prove that the less complex hypothesis is more likely to be true, but that by starting from it you converge on the truth more quickly. I'm sure this is right but it isn't what I was looking for.
Yes, the statement is weak. But this is partly because I wanted a proof which would be 1) valid in all possible worlds ; 2) valid according to every logically consistent assignment of priors. It may be that even with these conditions, a stronger proof is possible. But I'm skeptical that a much stronger proof is possible, because it seems to be logically consistent for someone to say that he assigns a probability of 99% to a hypothesis that has a complexity of 1,000,000, and distributes the remaining 1% among the remaining hypotheses. This is also why I said "on average." I couldn't remove the words "on average" and assert that a more complex statement is always less probable without imposing a condition on the choice of prior which does not seem to be logically necessary. The meaning of "on average" in the statement of the Razor is that in the limit, as the complexity tends to infinity, the probability necessarily tends to zero; given any probability x, say 0.000001 or whatever , there will be some complexity value z such that all statements equal or greater than that complexity value z have a probability less than x. I will read the article you linked to.
Why do you want the theorem to hold for every logically consistent prior? This looks backwards. Occamian reasoning should show why some prior distributions work better than others, not say they're all equally good. For example, the Solomonoff prior is one possible formalization of Occam's Razor.
Because for every logically consistent prior, there should be a logically possible world where that prior works well. If there isn't, and you can prove this to me, then I would exclude priors that don't work well in any possible world. I want it to apply to every possible world because if we understand the Razor in such a way that it doesn't apply in every possible world, then the fact that Razor works well is a contingent fact. If this is the case there can't be any conclusive proof of it, nor does it seem that there can be any ultimate reason why the Razor works well except "we happen to be in one of the possible worlds where it works well." Yes, there could be many interpretations which are more practical in our actual world, but I was more interested in an interpretation which is necessary in principle.
This is even more backwards. There are logically possible worlds where an overseer god punishes everyone who uses Bayesian updating. Does this mean we should stop doing science? Looking for "non-contingent" facts and "ultimate" reasons strikes me as a very unfruitful area of research.
Different people have different interests.
How do you know when that happens?
My point is that if someone has a higher greater for the more complex hypothesis which turns out to be correct, you cannot object to his prior, saying "How did you know that you should use a higher prior," since people do not justify their priors. Otherwise they wouldn't be priors.
A major use (if not the whole point,) of occam's razor is to have a rational basis for priors. If people don't have to justify their priors, then why have a process for generating them at all? If I create an encoding with 'God' as a low complexity explanation, would you say I am being rational? But the point of my question above was that you find out that the more complex hypothesis is correct when you get evidence for it. Juggling your priors is not the way to do it. (in fact it probably invites accidentally counting evidence twice.
This is spot-on. Furthermore, such a lax definition indicates that certain hypotheses will have "probability" zero. If our language is English, the explanation, "PUJF;FDAS!;FDS?" could be assigned probability zero. While this does not guarantee that the set of possible explanations is finite, neither does the author prove that the set of nonzero possibilities is infinite, and if it is not this proof is largely useless. Also, interestingly, the rules do not address the, "A wizard did it!" problem with complexity, though that is likely far beyond the attempted scope.

This really doesn't seem to have formally proven anything, and so at least the title should not be "a proof of Occam's razor". And it doesn't seem like you've done anything more interesting or useful here than, say, Solomonoff induction, which can already be seen as a formalization of Occam's razor.

Maybe if you made your proof more rigorous and explained why you think the statement you're trying to prove is relevant to anyone (or should properly be called Occam's razor) then that would help.

To expand on cousin_it's point, this is way too weak because it allows for arbitrary encodings, like a complexity value of 1 for "The lady down the street is a witch; she did it."

And if you start with that complexity prior, it's difficult to see how evidence would rescue you from it.

I don't see how you can exclude that encoding without additional stipulations that seem arbitrary to me. The fact that you assign something a complexity value of 1 doesn't mean you assign it a high probability. I allow for the possibility that some more complex hypotheses will be more probable than some less complex hypotheses. And even if the hypothesis of complexity 1 is the most probable, it may only have a probability of 75% or something. You could easily be rescued from this by ordinary Bayesian evidence. In fact this will still be true even if you assign it a probability of 99.99%. In order to put yourself beyond the possibility of being rescued by evidence, you would have to try very hard.
It's true that other hypotheses better predicting the data would still overcome it in time. But a rigged encoding would still allow for theories which add an epiphenomenal entity to count as less complex; and we definitely don't want Occam's Razor to do that. I know there's no way to guarantee that, say, the universal Turing machine complexity is "fair", but you can't dodge the necessity of putting some restrictions on the encoding.
As I said to cousin it, the proof allows for different Razors depending on the specific definition of complexity. This is actually a good thing because we can learn from experience which definition to use in various contexts. Of course experience doesn't directly say whether or not epiphenomenal entities exist, but we do discover by experience that the "number of entities" understanding of the Razor often works well, that is, that hypotheses with less entities posited are more probable than ones with more. This would lead us to adopt a Razor that does not count the addition of epiphenomena as less complex. Would you agree that even if such restrictions are desirable in practice, there is no way to add such restrictions without adding something which is not logically necessary? The original proof is intended to be logically necessary.
At which point, all the difficulty in deciding between theories is transferred to deciding between encodings. Yes we can learn over time which encodings work well in different domains, but at that point, you're really just putting chunks of theory into your encoding choice. There is a place for this kind of shorthand: I shouldn't have to model people in detail to form a hypothesis explaining the presence of a building in a feild. But this is because I know people exist from vast amounts of other data, so I have a justification for the shorthand of 'a human did it' in my encoding. It should have a high complexity cost, but it comes packaged with tons of evidence so in most cases it's practically free.

I think the treatment of complexity is OK in this argument. However, I don't think it proves Occam's Razor. Other than possible exception within set theory, the complexity of an explanation of any phenomenon is always finite. (Counterexample if you disagree?) There is no finite integer which is 'almost' infinite.

So, given xn - the average probability among explanations of complexity n - it's quite possible for a (finite) positive number of integers m > n for which xm > xn.

Yes, I agree with the possibility you mention. I don't think that means it doesn't prove Occam's Razor because I don't think the Razor means that every simpler hypothesis is more probable than every more complex hypothesis. Other things might affect your priors besides how complex the hypothesis is.
I don't think the Razor means that every simpler hypothesis is more probable than every more complex hypothesis. I don't think you've stated the theorem you are trying to prove precisely. I think that would help.
As the complexity of your hypotheses tends to infinity, their probability tends to zero. Still, that doesn't mean that each and every increase of complexity decreases the probability.
I upvoted the article, because I like this definition (i.e., it corresponds well to how I think of Occam's Razor), and I think you've proven it. However, I agree with Oscar Cunningham that the statement at the top of your article is different, and you didn't prove it. In particular, I don't like the phrase 'on average'.
I edited the article.
State that this is what you're trying to prove at the top of your main post. People are downvoting you because you haven't proved what they see as Occam's razor.
I edited the article.

It seems like what you're saying is: "If we have an infinite number of hypotheses, some of them must have infinetisimally small prior probability". That doesn't require a proof.

(I upvoted this article in spite of not learning much from it, since I love this kind of discussion. Why not read some papers about PAC learning, MDL, Algorithmic Info. Theory, or VC theory and compare/contrast for us the view of Occam implicit in each development?)

Clearly as complexity goes to infinity the probability goes to 0. (i.e. For any epsilon>0 there exists n such that all hypotheses with complexity>n have probability less than epsilon. (since otherwise there would be infinitely many hypotheses with probability >epsilon, taking 1/epsilon of them would produce a collection of hypotheses with probability>1))

So as the post says, "on average" Occam's razor must be true. (As must any other Razor using a different measure of complexity, or flubbity, or whatever. However, message length seems ... (read more)

Well taking this to be true, it seems like we could even construct an ordering where we enumerate algorithms based on something that maps to inverse complexity, so this would prove the opposite of Occam's razor. But I can't think of any such ordering, and I've got a proof kicking around in my head that there can be no such ordering, so this hunch is probably wrong.
For any given inverse complexity, infinitely many algorithms have a lower inverse complexity. This seems to break any attempt to enumerate in that order, or assign decreasing nonzero probabilities that sum to 1.
Indeed there is no such ordering.

The part about there being sets of infinitely many mutually exclusive hypotheses needs to be marked as an assumption, since I can construct languages where it does not hold. Also, you need to strengthen this assumption by saying that every hypothesis is part of at least one such set.

You also need to somehow get from "the probability of a statement decreases monotonically with its complexity" to "the probability of a statement decreases exponentially with its complexity".

Maybe this has already been considered but has anyone noted that even without introducing complexity at all, you can at least say that some programs are more likely simply because there is no uniform distribution on a countably infinite set, no matter which probability measure or which way you label the set there must be at most a finite number of programs with the same probability.

Hopefully there's someone still here after 10 years XD.

Oh look, if we definitely the complexity as "the date when hypothesis was published", then I can say that the prior probability that our earth stands on top of a whale, on top of a turtle on top of an elephant is the highest, because this hypothesis is the oldest. And the Occam's razor becomes "don't propose new hypotheses". Trinitrotrololol)

I find it funny, that it works even in continuous case: suppose that we have probability density defined in R^n (or any other set). Then whatever bijection F:R <--> R^n we apply, the integral of probability density on that path should converge, therefore p(F(x)) goes to zero faster than 1/x. :)

Also, look: suppose the "real" universe is a random point x from some infinite set X. Let's say we are considering finite set of hypotheses "H". Probability that random hypothesis h € H is closest to x is 1/|H|. So the larger H is, the less likely it is that a

... (read more)

I don't think this proves anything at a practical level, given that it requires us to deal with hypotheses of arbitrarily high complexity. It breaks down when dealing with finite values.

Suppose there is a finite complexity value W, beyond which no more complex hypotheses can actually be imagined by any being in our universe. Physics implies that there is such a W. Nothing in your argument implies that z1<W, though.

As I said to cousin it, I wanted a proof which would be valid in all possible worlds and valid according to every logically consistent assignment of priors, and for this reason, yes, in some possible worlds things would be as you say.

Now consider the infinite sum “x1 + x2 + x3…” Since all of these values are positive (and non-zero, since zero is not a probability), either the sum converges to a positive value, or it diverges to positive infinity. In fact, it will converge to a value less than 1, since if we had multiplied each term of the series by the number of hypotheses with the corresponding complexity, it would have converged to exactly 1—because probability theory demands that the sum of all the probabilities of all our mutually exclusive hypotheses should be exactly 1.

x1, x2,... (read more)

There does not appear to be any a priori ordering of probability by complexity value, and unless you assume this the conclusion is improperly drawn. If you assume it, the argument is circular.

If the issue I'm missing is that the conclusion should read:

hypotheses with a greater complexity value tend to have a lower prior probability

Then that would explain my confusion. It does make the conclusion pretty weak, as it's unclear if infinite possibilities even exist. If they don't, the proof collapses.

I'd written out more reasoning, but I think this works ... (read more)


Hmmm... I'd say that the problem is formulated in a way that allows for some maths, but in the process loses some of its essence.

For example, the 'proof' only talks about the length of the hypothesis, because that can be easily quantified. However, that does not rule out very short pseudo-explanations like 'god did it' or 'it's magic'. To rule out those, in practice one would need to introduce some non-mathematical language about what language is acceptable -- but that would make it very hard to reason about it mathematically.

The usual solution to this problem is actually just the opposite: introduce an even more mathematical language, which avoids bringing in pre-cached complex-but-simple-seeming notions like "god" and "magic" so that the message length is more meaningful. Turing machine instructions are a popular choice. More on this here.

As some commentators didn't like or couldn't understand the original phrasing of the theorem, I have added a phrase interpreting this. I didn't remove the phrase because it would make the previous comments difficult to understand.

Occams's Razor is based on empirical observations.

You can't prove it - indeed, you can easily imagine worlds where it is not true.


Define the "simplicity" of a hypothesis as 1/"complexity". If you can construct a probability distribution based on the "simplicity" of a hypothesis, then, by the same argument as above, you can prove that you should prefer the hypothesis with less "simplicity".

There are a finite number of hypotheses less complicated than any given hypothesis, and an infinite number of hypotheses more complicated than any given hypothesis. If there were only a finite number of hypotheses more complicated than any given hypothesis, the theorem would fail.