Occam's Razor

by Eliezer Yudkowsky4 min read26th Sep 200756 comments

64

Occam's RazorPrinciples
Frontpage

The more complex an explanation is, the more evidence you need just to find it in belief-space. (In Traditional Rationality this is often phrased misleadingly, as “The more complex a proposition is, the more evidence is required to argue for it.”) How can we measure the complexity of an explanation? How can we determine how much evidence is required?

Occam’s Razor is often phrased as “The simplest explanation that fits the facts.” Robert Heinlein replied that the simplest explanation is “The lady down the street is a witch; she did it.”

One observes that the length of an English sentence is not a good way to measure “complexity.” And “fitting” the facts by merely failing to prohibit them is insufficient.

Why, exactly, is the length of an English sentence a poor measure of complexity? Because when you speak a sentence aloud, you are using labels for concepts that the listener shares—the receiver has already stored the complexity in them. Suppose we abbreviated Heinlein’s whole sentence as “Tldtsiawsdi!” so that the entire explanation can be conveyed in one word; better yet, we’ll give it a short arbitrary label like “Fnord!” Does this reduce the complexity? No, because you have to tell the listener in advance that “Tldtsiawsdi!” stands for “The lady down the street is a witch; she did it.” “Witch,” itself, is a label for some extraordinary assertions—just because we all know what it means doesn’t mean the concept is simple.

An enormous bolt of electricity comes out of the sky and hits something, and the Norse tribesfolk say, “Maybe a really powerful agent was angry and threw a lightning bolt.” The human brain is the most complex artifact in the known universe. If anger seems simple, it’s because we don’t see all the neural circuitry that’s implementing the emotion. (Imagine trying to explain why Saturday Night Live is funny, to an alien species with no sense of humor. But don’t feel superior; you yourself have no sense of fnord.) The complexity of anger, and indeed the complexity of intelligence, was glossed over by the humans who hypothesized Thor the thunder-agent.

To a human, Maxwell’s equations take much longer to explain than Thor. Humans don’t have a built-in vocabulary for calculus the way we have a built-in vocabulary for anger. You’ve got to explain your language, and the language behind the language, and the very concept of mathematics, before you can start on electricity.

And yet it seems that there should be some sense in which Maxwell’s equations are simpler than a human brain, or Thor the thunder-agent.

There is. It’s enormously easier (as it turns out) to write a computer program that simulates Maxwell’s equations, compared to a computer program that simulates an intelligent emotional mind like Thor.

The formalism of Solomonoff induction measures the “complexity of a description” by the length of the shortest computer program which produces that description as an output. To talk about the “shortest computer program” that does something, you need to specify a space of computer programs, which requires a language and interpreter. Solomonoff induction uses Turing machines, or rather, bitstrings that specify Turing machines. What if you don’t like Turing machines? Then there’s only a constant complexity penalty to design your own universal Turing machine that interprets whatever code you give it in whatever programming language you like. Different inductive formalisms are penalized by a worst-case constant factor relative to each other, corresponding to the size of a universal interpreter for that formalism.

In the better (in my humble opinion) versions of Solomonoff induction, the computer program does not produce a deterministic prediction, but assigns probabilities to strings. For example, we could write a program to explain a fair coin by writing a program that assigns equal probabilities to all 2N strings of length N. This is Solomonoff induction’s approach to fitting the observed data. The higher the probability a program assigns to the observed data, the better that program fits the data. And probabilities must sum to 1, so for a program to better “fit” one possibility, it must steal probability mass from some other possibility which will then “fit” much more poorly. There is no superfair coin that assigns 100% probability to heads and 100% probability to tails.

How do we trade off the fit to the data, against the complexity of the program? If you ignore complexity penalties, and think only about fit, then you will always prefer programs that claim to deterministically predict the data, assign it 100% probability. If the coin shows htthht, then the program that claims that the coin was fixed to show htthht fits the observed data 64 times better than the program which claims the coin is fair. Conversely, if you ignore fit, and consider only complexity, then the “fair coin” hypothesis will always seem simpler than any other hypothesis. Even if the coin turns up hthhthhhthhhhthhhhht  . . .

Indeed, the fair coin is simpler and it fits this data exactly as well as it fits any other string of 20 coinflips—no more, no less—but we see another hypothesis, seeming not too complicated, that fits the data much better.

If you let a program store one more binary bit of information, it will be able to cut down a space of possibilities by half, and hence assign twice as much probability to all the points in the remaining space. This suggests that one bit of program complexity should cost at least a “factor of two gain” in the fit. If you try to design a computer program that explicitly stores an outcome like htthht, the six bits that you lose in complexity must destroy all plausibility gained by a 64-fold improvement in fit. Otherwise, you will sooner or later decide that all fair coins are fixed.

Unless your program is being smart, and compressing the data, it should do no good just to move one bit from the data into the program description.

The way Solomonoff induction works to predict sequences is that you sum up over all allowed computer programs—if every program is allowed, Solomonoff induction becomes uncomputable—with each program having a prior probability of 1/2 to the power of its code length in bits, and each program is further weighted by its fit to all data observed so far. This gives you a weighted mixture of experts that can predict future bits.

The Minimum Message Length formalism is nearly equivalent to Solomonoff induction. You send a string describing a code, and then you send a string describing the data in that code. Whichever explanation leads to the shortest total message is the best. If you think of the set of allowable codes as a space of computer programs, and the code description language as a universal machine, then Minimum Message Length is nearly equivalent to Solomonoff induction.1

This lets us see clearly the problem with using “The lady down the street is a witch; she did it” to explain the pattern in the sequence 0101010101. If you’re sending a message to a friend, trying to describe the sequence you observed, you would have to say: “The lady down the street is a witch; she made the sequence come out 0101010101.” Your accusation of witchcraft wouldn’t let you shorten the rest of the message; you would still have to describe, in full detail, the data which her witchery caused.

Witchcraft may fit our observations in the sense of qualitatively permitting them; but this is because witchcraft permits everything , like saying “Phlogiston!” So, even after you say “witch,” you still have to describe all the observed data in full detail. You have not compressed the total length of the message describing your observations by transmitting the message about witchcraft; you have simply added a useless prologue, increasing the total length.

The real sneakiness was concealed in the word “it” of “A witch did it.” A witch did what?

Of course, thanks to hindsight bias and anchoring and fake explanations and fake causality and positive bias and motivated cognition, it may seem all too obvious that if a woman is a witch, of course she would make the coin come up 0101010101. But I’ll get to that soon enough. . .

1 Nearly, because it chooses the shortest program, rather than summing up over all programs.

64

55 comments, sorted by Highlighting new comments since Today at 11:35 PM
New Comment

The Vapnik Chernovenkis Dimension also offers a way of filling in the detail of the the concept of "simple" appropriate to Occam's Razor. I've read about it in the context of statistical learning theory, specifically "probably approximately correct learning".

Having successfully tuned the parameters of your model to fit the data, how likely is it to fit new data, that is, how well does it generalise. The VC dimension comes with formulae that tell you. I've not been able to follow the field, but I suspect that VC dimension leads to worst case estimates whose usefulness is harmed by their pessimism.

"Your accusation of witchcraft wouldn't let you shorten the rest of the message; you would still have to describe, in full detail, the data which her witchery caused."

My model of witches, if I had one, would produce a given simple sequence like 01010101 with greater probability than a given random sequence like 00011011. Wouldn't yours? I might agree if you said "in nearly full detail".

Steven, that means you have to transmit the accusation of witchcraft, followed by a computer program, followed by the coded data. Why not just transmit the computer program followed by the coded data? I don't expect my own environment to be random noise, but that has nothing to do with witchcraft...

Alan, I agree that VC dimension is an important conceptually different way of thinking about "complexity". One of its primary selling points is that, for example, it doesn't attach infinite complexity to a model class that contains one real-valued parameter, if that model class isn't very flexible (i.e., it says only "the data points are greater than R"). But VC complexity doesn't plug into standard probability theory as easily as Solomonoff induction.

In Solomonoff induction it is important to use a two-tape Turing machine where one tape is for the program and one is for the input and work space. The program tape is an infinite random string, but the program length is defined to be the number of bits that the Turing machine actually reads during its execution. This way the set of possible programs becomes a prefix free set. It follows that the prior probabilities will add up to one when you weight by 2^(-l) where l is program length. (I believe this was realized by Leonid Levin. In Solomonoff's original scheme the prior probabilities did not add to one.) This also allows the beautiful interpretation that the program tape is assigned by independent coin flips for each bit, and the 2^-l weighting arises naturally rather than as an artificial assumption. I believe this is discussed in the information theory book by Cover and Thomas.

Eliezer,

"I don't expect my own environment to be random noise, but that has nothing to do with witchcraft..."

I think I misinterpreted the math and now see what you're getting at. Would it be an accurate translation to human language to say, "a sequence like 10101010 may favor witchcraft over the hypothesis that nothing weird is going on (i.e. the coinflips are random), but it will never favor witchcraft over the simpler hypothesis that something weird is going on that isn't witchcraft"?

I find it awkward to think of "witchcraft" as just a content-free word; what "witchcraft" means to me is something like the possibility that reality includes human-mind-like things with personalities and with preferences that they achieve through unknown nonstandard causal means. If you coded that up, it would probably no longer be content-free; it would allow shortening the rest of the program generating the sequences in some cases and require lengthening it in some other cases. In all realistic cases the resulting program would still be longer than necessary.

Good comments, all!

Steven, yes. Stephen, also yes.

Eli, you said:

An enormous bolt of electricity comes out of the sky and hits something, and the Norse tribesfolk say, "Maybe a really powerful agent was angry and threw a lightning bolt." The human brain is the most complex artifact in the known universe. If anger seems simple, it's because we don't see all the neural circuitry that's implementing the emotion. (Imagine trying to explain why Saturday Night Live is funny, to an alien species with no sense of humor. But don't feel superior; you yourself have no sense of fnord.) The complexity of anger, and indeed the complexity of intelligence, was glossed over by the humans who hypothesized Thor the thunder-agent.

I think it's worth noting that Norse tribesfolk already knew about human beings, so whatever model of the universe they made had to include angry agents in it somewhere.

I agree. I feel like the post is poking a bit of fun at hokey religion, and in so doing falls into an error. The Norse would do quite badly in life if they switched to a prior based on description lengths in Turing machines rather than a description length in their own language, because their language embodies useful bias concerning their environment. Similarly, English description lengths contain useful bias for our environment. The formalism of Solomonoff induction does not tell us which universal language to use, and English is a fine choice. The "thunder god" theory is not bad because of Occam's razor, but because it doesn't hold up when we investigate empirically! Similarly, if the Norse believed that earthquakes were caused by giant animals moving under the earth, it would not be such a bad theory given what evidence they had (even though animals are complex from a Turing-machine perspective); animals caused many things in their environment. We just know it is wrong today, based on what we know now.

What you are talking about in terms of Solmonoff induction is usually called algorithmic information theory and the shortest-program-to-produce-a-bit-string is usually called Kolmogorov-Chaitin information. I am sure you know this. Which begs the question, why didn't you mention this? I agree, it is the neatest way to think about Occam's razor. I am not sure why some are raising PAC theory and VC-dimension. I don't quite see how they illuminate Occam. Minimalist inductive learning is hardly the simplest "explanation" in the Occam sense, and is actually closer to Shannon entropy in spirit, in being more of a raw measure. Gregory Chaitin's 'Meta Math: The Search for Omega', which I did a review summary of is a pretty neat look at this stuff.

Venkat: I think there is a very good reason to mention PAC learning. Namely, Kolmogorov complexity is uncomputable, so Solomonoff induction is not possible even in principle. Thus one must use approximate methods instead such as PAC learning.

Occam’s razor is not conclusive and it’s not science. It is not unscientific but I would say that it fits into the category of philosophy. In science you do not get two theories, take the facts you know, and then conclude based on the simplest theory. If you’re doing this, you need to do better experiments to determine the facts. Occam’s razor can be a useful heuristic to suggest what experiments should be done. Just like mathematical elegance, Occam’s razor suggests that something is on the right track but it is not decisive. To look back at the facts and then interpret it through Occam’s razor is just an exercise in hindsight bias.

Your analogy with Norse tribesfolk reminds me of the NRA slogan, “Guns don’t kill people, people kill people”. There are many different levels of causation. The gun can be said to be the secondary cause of why someone died. The person pulling the trigger would be the primary cause. The secondary cause of thunder is nature but the first cause that brought things into existence and created the system is God. Nature cannot be its own cause.

The rest of what you wrote sounds like you're pulling numbers out of your arse. The last sentence should be read in your best Norse tribesfolk accent.

Science is just a method of filtering hypothesis. Which is exactly what Occam's razor is. Occam's razor is not a philosophy, it is a statistical prediction. To claim that Occam's razor is not a science would be to claim that statistics is not a science.

Example: You leave a bowl with milk in it over night, you wake up in the morning and its gone. Two possibly theories, are one, your cat drank it, or two, someone broke into your house, and drank it, then left.

Well, we know that cats like milk, and you have a cat, so you know the probability of there being a cat is 1:1, and you also know your cat likes to steal food when your sleeping, so based on past experience you might say the probability of the cat stealing the milk is 1:2, so you know theres two high probabilities. But when we consider the burglar hypothesis, we know that its extremely rare for someone to break into our house, thus the probability for that situation, while being physically possible, is very low say 1 in 10,000. We know that burglars tend to break into houses to steal expensive things, not milk from a bowl, thus the probability of that happening is say 1 in a million.

This is Occams razor at work, its 1/1 1/2 vs 1/10,000 1/1,000,000. Its statistics, and its science. Nothing I described here would be inaccessible to experimentation and control groups.

I think that the God reference and foul language used in Cure_of_Ars comment have misdirected an important criticism to this article, which I for one would like to hear your responses to, so please for those who downvoted and saved the criticism for his comment, I would like to hear your thoughts and have it explained to me; for me, it is not trivial that he has no point in his first paragraph.

But to clarify, I'd restate my open questions on the subject which were partly described by his comment.

The original formulation of this principle is: "Entities should not be multiplied without necessity." This formulation is not that clear to me; what I can understand from it is that one shouldn't add unnecessary complexity to a theory unless he has to.

A clear example where Occam's razor may be used as intended is as following: assume I have a program that takes a single number as an input and returns a number. Now, if we observe the following sequence: f(1) = 2, f(4) = 16 and f(10) = 1024, we might be tempted say f(x) = 2^x. But this is not the only option; we could have: f(x) = {x > 0 -> 2^x, x <= 0 -> 10239999999} or even f(x) = {1 -> 2, 4 ->16, 10->1024, [ANY OTHER INPUT TO OUTPUT]}.

Since these examples all make the same predictions in all experimental tests so far, it follows we should choose the simplest one, being 2^x [and if more experimental tests will follow in the future, we could have chosen in advance similarly complex alternatives that would have predicted correct observations as 2^x for these tests just as well. In fact, we can only make a finite amount of experimental tests, and as such there are an infinite amount of hypotheses that would correctly predict these tests and have an additional, useless, layer of complexion added to them.]

What exactly entities mean, or how multiplication of them is defined, I could only guess based on my understanding of these concepts and the popular interpretations of this principle, such as: "Occam's razor says that when presented with competing hypotheses that make the same predictions, one should select the solution with the fewest assumptions"

In any case, I sense (after reading multiple sources that emphasize this) that there is an emphasis here that isn't properly addressed in this article and skipped over in these replies, and it is that Occam's razor is not meant to be a way of choosing between hypotheses that make different predictions.

In the article, the question of how to weight simplicity over precision arises; if we have two theorems, T1 and T2, which have different precision (say T1 has 90% success rate where T2 has 82%) and different complexity (but T1 is more complex than T2) how can we decide between the two?

From my understanding, and this is where I would like to hear your thoughts, this question cannot be solved by Occam’s razor. That being said, I think this question is even more interesting and important than the one that Occam's razor attempts at solving. And to answer that question, it appears that Occam's razor has been generalized, to something like: "The explanation requiring the fewest assumptions is most likely to be correct." These generalizations are even given a different name (the law of parsimony, or the rule of simplicity) to stress they are not the same as Occam's razor.

But that is neither the original purpose of the principle, nor is it a proven fact. The following quote stresses this issue: "The principle of simplicity works as a heuristic rule of thumb, but some people quote it as if it were an axiom of physics, which it is not. [...] The law of parsimony is no substitute for insight, logic and the scientific method.  It should never be relied upon to make or defend a conclusion.  As arbiters of correctness, only logical consistency and empirical evidence are absolute."

A usage of this principle that does appeal to my logic is to get rid of hypothetical absurdities, esp. if they cannot be tested using the scientific method. This has been done in the field of physics, and this quote illustrates my point:

"In physics we use the razor to shave away metaphysical concepts. [...] The principle has also been used to justify uncertainty in quantum mechanics.  Heisenberg deduced his uncertainty principle from the quantum nature of light and the effect of measurement.

Stephen Hawking writes in A Brief History of Time:
"We could still imagine that there is a set of laws that determines events completely for some supernatural being, who could observe the present state of the universe without disturbing it.  However, such models of the universe are not of much interest to us mortals.  It seems better to employ the principle known as Occam's razor and cut out all the features of the theory that cannot be observed.""

My point here is not to disagree with the rule of simplicity (and surely not the original razor) but to stress why it is somewhat philosophical (after all, it was invented in the 14th century, much before the scientific method,) or at least, that it isn't proven that this law is right for all cases; there are strong cases in history that support it, but that is not the same as being proven.

I think that this law is a very good heuristic. Especially when we try to locate our belief in belief-space. But I believe this razor is wielded with less care than it should be - please let me know if and why you disagree.

Additionally, I do not think I have gained a practical tool to evaluate precision vs. simplicity. Solomonoff's induction seems highly impossible to use in real life, esp. when evaluating theories outside of the laboratories (in our actual life!) I do understand it's a very hard problem, but Rationality's purpose is all about using our brains, with all its weaknesses and biases, to the best of our abilities, in order to have the maximum chance to reach Truth. This implies practical, however imperfect they may be (hopefully, as least imperfect as possible,) tools to deal with these kinds of problems in our private lives. I do not think that Solomonoff's induction is such a tool, and I do think we could use some heuristic to help us in this task.

To dudeicus: one cannot argue a theory by an example to it and then conclude by saying "if it would be tested with proper research, it will be proven." This is not the scientific method at work. What I do take from your comment is only that this has not been formally proven - thus relating to the philosophy discussion again.

In science you do not get two theories

You're right - there are an infinite number of theories consistent with any set of observations. Any set. All observed facts are technically consistent with the prediction that gravity will reverse in one hour, but nobody believes that because of... Occam's Razor!

I don't think this is what's actually going on in the brains of most humans.

Suppose there were ten random people who each told you that gravity would be suddenly reversing soon, but each one predicted a different month. For simplicity, person 1 predicts the gravity reversal will come in 1 month, person 2 predicts it will come in 2 months, etc.

Now you wait a month, and there's no gravity reversal, so clearly person 1 is wrong. You wait another month, and clearly person 2 is wrong. Then person 3 is proved wrong, as is person 4 and then 5 and then 6 and 7 and 8 and 9. And so when you approach the 10-month mark, you probably aren't going to be expecting a gravity-reversal.

Now, do you not suspect the gravity-reversal at month ten simply because it's not as simple as saying "there will never a be a gravity reversal," or is your dismissal substantially motivated by the fact that the claim type-matches nine other claims that have already been disproven? I think that in practice most people end up adopting the latter approach.

The rest of what you wrote sounds like you're pulling numbers out of your arse.

Cure of Ars, I should prefer it if you no longer commented on my posts. There may be a place on Overcoming Bias for Catholics; but none for those who despise math they don't understand.

MIT Press has just published Peter Grünwald's The Minimum Description Length Principle. His Preface, Chapter 1, and Chapter 17 are available at that link. Chapter 17 is a comparison of different conceptions of induction.

I don't know this area well enough to judge Peter's wok, but it is certainly informative. Many of his points echo Eliezer's. If you find this topic interesting, Peter's book is definitely worth checking out.

"Different inductive formalisms are penalized by a worst-case constant factor relative to each other"

You mean a constant term; it's additive, not multiplicative.

That depends on whether you're thinking of the length or the probability. Since the length is the log-probability, it works out.

There is: It's enormously easier (as it turns out) to write a computer program that simulates Maxwell's Equations, compared to a computer program that simulates an intelligent emotional mind like Thor.

Coming back to this post, I finally noticed that "emotional" is a necessary word in the quoted sentence. If we leave it out, the sentence might just become false! That is, if you believe there's some sort of simple mathematical "key" to intelligence (my impression is that you do believe that), then you also ought to believe that the Solomonoff prior makes an intelligent god quite probable apriori. Maybe even more probable than the currently most elegant known formulations of physical laws, which include a whole zoo of elementary particles etc. Of course, if we take into account the evidence we've seen so far, it looks like our universe is based on physics rather than a "god".

What sort of specification for Thor are you thinking of that could possibly be simpler than Maxwell's equations? A description of macroscopic electrical phenomena is more complex, as is "a being that wants to simulate Maxwell's equations."

If you're thinking of comparing all "god-like" hypotheses to Maxwell's equations, sure. But that comparison is a bit false - you should really be comparing all "god-like" hypotheses to all "natural law-like" hypotheses, in which case I confidently predict that the "natural law-like" hypotheses will win handily.

Yeah, I agree. The shortest god-programs are probably longer than the shortest physics-programs, just not "enormously" longer.

Probably enormously longer if you want it to produce a god that would cause the world to act in a way as if basic EM held.

ie, you don't just need a mind, you need to specify the sort of mind that would want to cause the world to be in a specific way...

Is there a nice way to quantify how fast does these theoretical priors drop off with the length of something? By how much should I favor simple explanation X over only mediumly more complicated explanation Y.

Interesting question. If you have a countable infinity of mutually exclusive explanations (e.g. they are all finite strings using letters from some finite alphabet), then your only constraint is that the infinite sum of all their prior probabilities must converge to 1. Otherwise you're free to choose. You could make the convergence really fast (say, by making the prior of a hypothesis inversely proportional to the exponent of the exponent of its length), or slower if you wish to. A very natural and popular choice is restricting the hypotheses to form a "prefix-free set" (no hypothesis can begin with another shorter hypothesis) and then assigning every hypothesis of N bits a prior of 2^-N, which makes the sum converge by Kraft's inequality.

What is the reasoning behind using a prefix-free set?

Apart from giving a simple formula for the prior, it comes in handy in other theoretical constructions. For example, if you have a "universal Turing machine" (a computer than can execute arbitrary programs) and feed it an infinite input stream of bits, perhaps coming from a random source because you intend to "execute a random program"... then it needs to know where the program ends. You could introduce an end-of-program marker, but a more general solution is to make valid programs form a prefix-free set, so that when the machine has finished reading a valid program, it knows that reading more bits won't result in a longer but still valid program. (Note that adding an end-of-program marker is one of the ways to make your set of programs prefix-free!)

Overall this is a nice example of an idea that "just smells good" to a mathematician's intuition.

Ah! I must have had a brain-stnank - this makes total sense in math / theoretical CS terms, I was substituting an incorrect interpretation of "hypothesis" when reading the comment out of context. Thanks :)

And, in particular, we're looking at god-programs that produce the output we've observed, which seems to cut out a lot of them (and specifically a lot of simple ones).

Occam's Razor is "entities must not be multiplied beyond necessity" (entia non sunt multiplicanda praeter necessitatem)

NOT "The simplest explanation that fits the facts."

Now thats just purely definition. I think both are true. I think there are problems with both. The problem with Occams razor, is that yes its true, however, it doesn't cover all the bases. There is a deeper underlying principle that makes Occams razor true, which is the one you described in the article. However summing up your article as "The simplest explanation that fits the facts" is also misleading as in, while it does seem to cover all the bases, it only does so if you use a very specific definition of simple which really doesnt fit with everyday language.

Example: Stonehenge, let me suggest two theories, 1. it was built by ancient humans, 2. it fell together through purely random geological process. Both theories fit with the facts, we know that both are physically possible (yes 2. is vastly less probable, ill get to that in a second). Occams razor suggest 2. as the answer, and "The simplest explanation" appears to be 2. also. Both seem to be failing. The real underlying principle as to why Occams razor is true, is statistics, not simplicity. Now dont get me wrong, I understand why "The simplest explanation that fit the facts" actually points to 1., but then you have to go through this long process of what you actually mean by simplest, which basically just ends up being a long explanation of how "simple" actually means "probable".

Anyways, I'm just arguing over semantics, I do in fact agree with everything you said. I just wish there was no Occams razor, it should just be "The theory which is the most statistically probable, is usually the right one." This is what people actually mean to say when they say "The simplest explanation that fits the facts."

In statistics generally the model that has the least variables and is the most statistically probable is the one used. See things like AIC or Bayesian Information Criterion on how to choose a good model. This means that Occam's razor is accurate. Given that is is possible to continuously add variables to a model and get a perfect fit but have the model be blown apart with the addition of an additional observation that is not otherwise influential, then, unless you are defining probability to include an Information Criterion, your formulation is less useful.

Occam's Razor is "entities must not be multiplied beyond necessity" (entia non sunt multiplicanda praeter necessitatem)

NOT "The simplest explanation that fits the facts."

The form you list it in is the historical form of Occam's Razor, but it isn't the form that the Razor has been applied in for a fairly long time. Among other problems, defining what one means by distinct entities is problematic. And we really do want to prefer simpler explanations to more complicated ones. Indeed, the most general form of the razor doesn't even need to have an explanatory element (I in general prefer a low degree polynomial to interpolate some data to a high degree polynomial even if I have no explanation attached to why I should expect the actual phenomenon to fit a linear or quadratic polynomial.)

I may be missing something here -- Occam's Razor is "entities must not be multiplied beyond necessity" (entia non sunt multiplicanda praeter necessitatem)

NOT "The simplest explanation that fits the facts."

-- but isn't the post using the first definition anyway? So even if he explicitly wrote the second definition instead of the first, he was clearly aware of the first since that's what corresponds with his argument.

Witchcraft may fit our observations in the sense of qualitatively permitting them; but this is because witchcraft permits everything

I think replacing witchcraft with godhood is also a common mistake

What I don't understand is so much insistence that Occam's Razor applies only to explanations you address to God. Or else how do you avoid the observation that the simplicity of an explanation is a function of whom you are explaining to ? In the post, you actually touch on the issue, only to observe that there are difficulties interpreting Occam's Razor in the frame of explaining things to humans (in their own natural language), so let's transpose to a situation where humans are completely removed from the picture. Curiously enough, where the same issue occurs in the context of machine languages it is quickly "solved". Makes one wonder what Occam - who had no access to Turing machines - himself had in might.

Also, if you deal in practice with shortening code length of actual programs, at some point you have exploited all the low lying fruit; further progress can come after a moment of contemplation made you observe that distinct paths of control through the code have "something in common" that you may try to enhance to the point where you can factor it out. This "enhancing" follows from the quest for minimal "complexity", but it drives you to do locally, on the code, just the contrary of what you did during the "low-lying fruit" phase, you "complexify" rather than "simplify" two distinct areas of the code to make them resemble each other (and the target result emerges during the process, fun). What I mean to say, I guess, is that even the frame proposed by Chaitin-Kolmogorov complexity gives only fake reasons to neglect bias (from shared background or the equivalent).

"each program is further weighted by its fit to all data observed so far. This gives you a weighted mixture of experts that can predict future bits."

I don't see it explained anywhere what algorithm is used to weight the experts for this measure. Does it matter? And how are the "fit" probabilities and "complexity" probabilities combined? Multiply and normalize?

I don't see it explained anywhere what algorithm is used to weight the experts for this measure.

bayes theorem

What I find fascinating is that Solomonoff Induction (and the related concepts from Kolmogorov complexity) very elegantly solves the classical philosophical problem of induction, as well as resolving a lot of other problems:

  1. What is the correct "prior" in Bayesian inference, and isn't the choice of prior all subjective?
  2. What does Occam's razor really mean, and what is a "simple" theory?
  3. Why do physicists insist that their theories are "simple" when only they can understand them?

Despite this, it is almost unheard of in the general philosophical (analytic philosophy) community. I've read literally dozens of top-grade philosophers discussing these topics, with the implication that these are still big unsolved problems, and in complete ignorance that there is a very rich mathematical theory in this area. And the theory's not exactly new either... dates back to the 1960s.

Anyone got an explanation for the disconnect?

Anyone got an explanation for the disconnect?

Philosophers don't read those things. If that explanation seems lacking, I feel like referring to Feynman.

I don't think Solomonoff Induction solves any of those three things. I really hope it does, and I can see how it kinda goes half of the way there to solving them, but I just don't see it going all the way yet. (Mostly I'm concerned with #1. The other two I'm less sure about, but they are also less important.)

I don't know why the philosophical community seems to be ignoring Solomonoff Induction etc. though. It does seem relevant. Maybe the philosophers are just more cynical than we are about Solomonoff Induction's chances of eventually being able to solve 1, 2, and 3.

Possibly because .Solomonoff induction isnt very suitable to answering the kinds of questions philosophers want answered, questions of fundamental ontology.. It can tell you what programme would generate observed data, but it doesn't tell you what the programme is running on..the laws of physics, Gods mind, .or a giant simulation. OTOH, traditional Occams razor can exclude a range of ontological hypotheses.

There is also the problem that there is no absolute measure of the complexity of a programme: a programming language is still a language, and some languages can express some things more concisely than others, as explained in kokotajlods other comment. http://lesswrong.com/lw/jhm/understanding_and_justifying_solomonoff_induction/ady8

If you let a program store one more binary bit of information, it will be able to cut down a space of possibilities by half, and hence assign twice as much probability to all the points in the remaining space. This suggests that one bit of program complexity should cost at least a "factor of two gain" in the fit. If you try to design a computer program that explicitly stores an outcome like "HTTHHT", the six bits that you lose in complexity must destroy all plausibility gained by a 64-fold improvement in fit. Otherwise, you will sooner or later decide that all fair coins are fixed.

I found this paragraph confusing. How about

If you let a program store one more binary bit of information, it will be able to cut down a space of possibilities by half, and hence assign twice as much probability to all the points in the remaining space. This suggests that one bit of program complexity should always buy at least a "factor of two gain" in the fit. If you try to design a computer program that explicitly stores an outcome like "HTTHHT", the six bits that you pay in complexity must get you at least a 64-fold improvement in fit. Otherwise, you will sooner or later decide that all fair coins are fixed.

Does that mean the same thing?

Upcoming formal philosophy conference on the foundations of Occam's razor here. Abstracts included.

Hello,

I need some help understanding the article after Unless your program is being smart, and compressing the data, it should do no good just to move one bit from the data into the program description.

How is the connection being made from complexity and fit to data and program description?

Thanks in advance! :)

Complexity, as defined in Solomonoff Induction, means program description - that is, code length in bits.

Sidenote: thank you for reminding me that Eliezer was talking about better versions of SI in 2007, before starting his quantum mechanics sequence.

I found a reference to a very nice overview for the mathematical motivations of Occam's Razor on wikipedia.

It's Chapter 28: Model Comparison and Occam's Razor; from (page 355 of) Information Theory, Inference, and Learning Algorithms (legally free to read pdf) by David J. C. MacKay.

The Solomonoff Induction stuff went over my head, but this overview's talk of trade-offs between communicating increasing numbers of model parameters vs having to communicate less residuals (ie. offsets from real data); was very informative.

My own way of thinking of Occam's Razor is through model selection. Suppose you have two competing statements (the which did it) and (it was chance or possibly something other than a which caused it ()) and some observations (the sequence came up 0101010101). Then the preferred statement is whichever is more probable calculated as

this is simply Bayes rule where

and the model is parametrized by some parameters .

Now all this is just the mathematical way of writing that a hypothesis that has more parameters (or more specifically more possible values that it predicts), will not be as strong a statement that predicts a smaller state of outcomes.

In the witch example this would be:

  • There exist an advanced intelligent being (at least not much less than human intelligence) that can do things beyond what has ever been reproduced in a scientific way that for some reason chooses to live on our street and act mostly as a human that will choose to influence my sequence coin tosses to end up in some seemingly looking pattern
  • The coin toss is ruled by chance and might end up in the set of possible outcomes that seem to form a pattern ()
  • The coin toss ended up as
  • The way I stated the hypotheses

Now what remains is to estimate the priors and the the fraction of outcomes that look like a pattern. We can skip as we are interested in .

Now comparing the amount of conditionals in the hypotheses and how surprised I am by them I would roughly estimate a ratio of the priors as something like in favor to chance, as the witch hypothesis goes against many of my formed beliefs of the world collected over many years, it includes weird choices of living for this hypothetical alien entity, it picks out me as a possible agent of many in the neighborhood, it singles out an arbitrary action of mine and an arbitrary set of outcomes.

For the sake of completeness. The fraction of outcomes that look like a pattern is kind of hard to estimate exactly. However, my way of thinking about it is how soon in the sequence would I postulate the specific sequence that it ended up in. After 0101, I think that the sequence 0101010101 is the most obvious pattern to continue it in. So roughly this is six bits of evidence.

In conclusion, I would say that the probability of the witch hypothesis is lacking around 94 bits of evidence for me to believe it as much as the chance hypothesis.

The downside of this approach to the Solomonoff induction and the minimum message length is that it is clunkier to use and it might be easy to forget to include conditionals or complexity in the priors the same way they can be lost in the English language. The upside is that as a model it is simpler, less ad hoc and builds directly on the product rule in probability and that probabilities sum to one and should thus be preferred by Occam's Razor ;).