The more complex an explanation is, the more evidence you need just to find it in belief-space. (In Traditional Rationality this is often phrased misleadingly, as “The more complex a proposition is, the more evidence is required to argue for it.”) How can we measure the complexity of an explanation? How can we determine how much evidence is required?

Occam’s Razor is often phrased as “The simplest explanation that fits the facts.” Robert Heinlein replied that the simplest explanation is “The lady down the street is a witch; she did it.”

One observes that the length of an English sentence is not a good way to measure “complexity.” And “fitting” the facts by merely *failing to prohibit* them is insufficient.

Why, exactly, is the length of an English sentence a poor measure of complexity? Because when you speak a sentence aloud, you are using *labels* for concepts that the listener shares—the receiver has already stored the complexity in them. Suppose we abbreviated Heinlein’s whole sentence as “Tldtsiawsdi!” so that the entire explanation can be conveyed in one word; better yet, we’ll give it a short arbitrary label like “Fnord!” Does this reduce the complexity? No, because you have to tell the listener in advance that “Tldtsiawsdi!” stands for “The lady down the street is a witch; she did it.” “Witch,” itself, is a label for some extraordinary assertions—just because we all know what it means doesn’t mean the concept is simple.

An enormous bolt of electricity comes out of the sky and hits something, and the Norse tribesfolk say, “Maybe a really powerful agent was angry and threw a lightning bolt.” The human brain is the most complex artifact in the known universe. If *anger* seems simple, it’s because we don’t see all the neural circuitry that’s implementing the emotion. (Imagine trying to explain why *Saturday Night Live* is funny, to an alien species with no sense of humor. But don’t feel superior; you yourself have no sense of fnord.) The complexity of anger, and indeed the complexity of intelligence, was glossed over by the humans who hypothesized Thor the thunder-agent.

*To a human,* Maxwell’s equations take much longer to explain than Thor. Humans don’t have a built-in vocabulary for calculus the way we have a built-in vocabulary for anger. You’ve got to explain your language, and the language behind the language, and the very concept of mathematics, before you can start on electricity.

And yet it seems that there should be some sense in which Maxwell’s equations are *simpler* than a human brain, or Thor the thunder-agent.

There is. It’s *enormously* easier (as it turns out) to write a computer program that simulates Maxwell’s equations, compared to a computer program that simulates an intelligent emotional mind like Thor.

The formalism of Solomonoff induction measures the “complexity of a description” by the length of the shortest computer program which produces that description as an output. To talk about the “shortest computer program” that does something, you need to specify a space of computer programs, which requires a language and interpreter. Solomonoff induction uses Turing machines, or rather, bitstrings that specify Turing machines. What if you don’t like Turing machines? Then there’s only a constant complexity penalty to design your own universal Turing machine that interprets whatever code you give it in whatever programming language you like. Different inductive formalisms are penalized by a worst-case constant factor relative to each other, corresponding to the size of a universal interpreter for that formalism.

In the better (in my humble opinion) versions of Solomonoff induction, the computer program does not produce a deterministic prediction, but assigns probabilities to strings. For example, we could write a program to explain a fair coin by writing a program that assigns equal probabilities to all 2^{N} strings of length N. This is Solomonoff induction’s approach to *fitting* the observed data. The higher the probability a program assigns to the observed data, the better that program *fits* the data. And probabilities must sum to 1, so for a program to better “fit” one possibility, it must steal probability mass from some other possibility which will then “fit” much more poorly. There is no superfair coin that assigns 100% probability to heads and 100% probability to tails.

How do we trade off the fit to the data, against the complexity of the program? If you ignore complexity penalties, and think *only* about fit, then you will always prefer programs that claim to deterministically predict the data, assign it 100% probability. If the coin shows HTTHHT, then the program that claims that the coin was fixed to show HTTHHT fits the observed data 64 times better than the program which claims the coin is fair. Conversely, if you ignore fit, and consider *only* complexity, then the “fair coin” hypothesis will always seem simpler than any other hypothesis. Even if the coin turns up HTHHTHHHTHHHHTHHHHHT . . .

Indeed, the fair coin *is* simpler and it fits this data exactly as well as it fits any other string of 20 coinflips—no more, no less—but we see another hypothesis, seeming not too complicated, that fits the data much better.

If you let a program store one more binary bit of information, it will be able to cut down a space of possibilities by half, and hence assign twice as much probability to all the points in the remaining space. This suggests that one bit of program complexity should cost *at least* a “factor of two gain” in the fit. If you try to design a computer program that explicitly stores an outcome like HTTHHT, the six bits that you lose in complexity must destroy all plausibility gained by a 64-fold improvement in fit. Otherwise, you will sooner or later decide that all fair coins are fixed.

Unless your program is being smart, and *compressing* the data, it should do no good just to move one bit from the data into the program description.

The way Solomonoff induction works to predict sequences is that you sum up over all allowed computer programs—if every program is allowed, Solomonoff induction becomes uncomputable—with each program having a prior probability of 1/2 to the power of its code length in bits, and each program is further weighted by its fit to all data observed so far. This gives you a weighted mixture of experts that can predict future bits.

The Minimum Message Length formalism is nearly equivalent to Solomonoff induction. You send a string describing a code, and then you send a string describing the data in that code. Whichever explanation leads to the shortest *total* message is the best. If you think of the set of allowable codes as a space of computer programs, and the code description language as a universal machine, then Minimum Message Length is nearly equivalent to Solomonoff induction.^{1}

This lets us see clearly the problem with using “The lady down the street is a witch; she did it” to explain the pattern in the sequence 0101010101. If you’re sending a message to a friend, trying to describe the sequence you observed, you would have to say: “The lady down the street is a witch; she made the sequence come out 0101010101.” Your accusation of witchcraft wouldn’t let you *shorten* the rest of the message; you would still have to describe, in full detail, the data which her witchery caused.

Witchcraft may fit our observations in the sense of qualitatively *permitting* them; but this is because witchcraft permits *everything* , like saying “Phlogiston!” So, even after you say “witch,” you still have to describe all the observed data in full detail. You have not *compressed the total length of the message describing your observations* by transmitting the message about witchcraft; you have simply added a useless prologue, increasing the total length.

The real sneakiness was concealed in the word “it” of “A witch did it.” A witch did *what*?

Of course, thanks to hindsight bias and anchoring and fake explanations and fake causality and positive bias and motivated cognition, it may seem all too obvious that if a woman is a witch, of *course* she would make the coin come up 0101010101. But I’ll get to that soon enough. . .

^{1} Nearly, because it chooses the *shortest* program, rather than summing up over all programs.

The Vapnik Chernovenkis Dimension also offers a way of filling in the detail of the the concept of "simple" appropriate to Occam's Razor. I've read about it in the context of statistical learning theory, specifically "probably approximately correct learning".

Having successfully tuned the parameters of your model to fit the data, how likely is it to fit new data, that is, how well does it generalise. The VC dimension comes with formulae that tell you. I've not been able to follow the field, but I suspect that VC dimension leads to worst case estimates whose usefulness is harmed by their pessimism.

Great post!

"Your accusation of witchcraft wouldn't let you shorten the rest of the message; you would still have to describe, in full detail, the data which her witchery caused."

My model of witches, if I had one, would produce a given simple sequence like 01010101 with greater probability than a given random sequence like 00011011. Wouldn't yours? I might agree if you said "in nearly full detail".

Steven, that means you have to transmit the accusation of witchcraft, followed by a computer program, followed by the coded data. Why not just transmit the computer program followed by the coded data? I don't expect my own environment to be random noise, but that has nothing to do with witchcraft...

Alan, I agree that VC dimension is an important conceptually different way of thinking about "complexity". One of its primary selling points is that, for example, it doesn't attach infinite complexity to a model class that contains one real-valued parameter, if that model class isn't very flexible (i.e., it says only "the data points are greater than R"). But VC complexity doesn't plug into standard probability theory as easily as Solomonoff induction.

In Solomonoff induction it is important to use a two-tape Turing machine where one tape is for the program and one is for the input and work space. The program tape is an infinite random string, but the program length is defined to be the number of bits that the Turing machine actually reads during its execution. This way the set of possible programs becomes a prefix free set. It follows that the prior probabilities will add up to one when you weight by 2^(-l) where l is program length. (I believe this was realized by Leonid Levin. In Solomonoff's original scheme the prior probabilities did not add to one.) This also allows the beautiful interpretation that the program tape is assigned by independent coin flips for each bit, and the 2^-l weighting arises naturally rather than as an artificial assumption. I believe this is discussed in the information theory book by Cover and Thomas.

Eliezer,

"I don't expect my own environment to be random noise, but that has nothing to do with witchcraft..."

I think I misinterpreted the math and now see what you're getting at. Would it be an accurate translation to human language to say, "a sequence like 10101010 may favor witchcraft over the hypothesis that nothing weird is going on (i.e. the coinflips are random), but it will never favor witchcraft over the simpler hypothesis that something weird is going on that isn't witchcraft"?

I find it awkward to think of "witchcraft" as just a content-free word; what "witchcraft" means to me is something like the possibility that reality includes human-mind-like things with personalities and with preferences that they achieve through unknown nonstandard causal means. If you coded that up, it would probably no longer be content-free; it would allow shortening the rest of the program generating the sequences in some cases and require lengthening it in some other cases. In all realistic cases the resulting program would still be longer than necessary.

Good comments, all!

Steven, yes. Stephen, also yes.

Eli, you said:

An enormous bolt of electricity comes out of the sky and hits something, and the Norse tribesfolk say, "Maybe a really powerful agent was angry and threw a lightning bolt." The human brain is the most complex artifact in the known universe. If anger seems simple, it's because we don't see all the neural circuitry that's implementing the emotion. (Imagine trying to explain why Saturday Night Live is funny, to an alien species with no sense of humor. But don't feel superior; you yourself have no sense of fnord.) The complexity of anger, and indeed the complexity of intelligence, was glossed over by the humans who hypothesized Thor the thunder-agent.I think it's worth noting that Norse tribesfolk already knew about human beings, so whatever model of the universe they made had to include angry agents in it somewhere.

I agree. I feel like the post is poking a bit of fun at hokey religion, and in so doing falls into an error. The Norse would do quite badly in life if they switched to a prior based on description lengths in Turing machines rather than a description length in their own language, because their language embodies useful bias concerning their environment. Similarly, English description lengths contain useful bias for our environment. The formalism of Solomonoff induction does not tell us which universal language to use, and English is a fine choice. The "thunder god" theory is not bad because of Occam's razor, but because it doesn't hold up when we investigate empirically! Similarly, if the Norse believed that earthquakes were caused by giant animals moving under the earth, it would not be such a bad theory given what evidence they had (even though animals are complex from a Turing-machine perspective); animals caused many things in their environment. We just know it is wrong today, based on what we know now.

What you are talking about in terms of Solmonoff induction is usually called algorithmic information theory and the shortest-program-to-produce-a-bit-string is usually called Kolmogorov-Chaitin information. I am sure you know this. Which begs the question, why didn't you mention this? I agree, it is the neatest way to think about Occam's razor. I am not sure why some are raising PAC theory and VC-dimension. I don't quite see how they illuminate Occam. Minimalist inductive learning is hardly the simplest "explanation" in the Occam sense, and is actually closer to Shannon entropy in spirit, in being more of a raw measure. Gregory Chaitin's 'Meta Math: The Search for Omega', which I did a review summary of is a pretty neat look at this stuff.

Venkat: I think there is a very good reason to mention PAC learning. Namely, Kolmogorov complexity is uncomputable, so Solomonoff induction is not possible even in principle. Thus one must use approximate methods instead such as PAC learning.

In science you do not get two theoriesYou're right - there are an infinite number of theories consistent with any set of observations.

Anyset. All observed facts are technically consistent with the prediction that gravity will reverse in one hour, but nobody believes that because of... Occam's Razor!I don't think this is what's actually going on in the brains of most humans.

Suppose there were ten random people who each told you that gravity would be suddenly reversing soon, but each one predicted a different month. For simplicity, person 1 predicts the gravity reversal will come in 1 month, person 2 predicts it will come in 2 months, etc.

Now you wait a month, and there's no gravity reversal, so clearly person 1 is wrong. You wait another month, and clearly person 2 is wrong. Then person 3 is proved wrong, as is person 4 and then 5 and then 6 and 7 and 8 and 9. And so when you approach the 10-month mark, you probably aren't going to be expecting a gravity-reversal.

Now, do you not suspect the gravity-reversal at month ten simply because it's not as simple as saying "there will never a be a gravity reversal," or is your dismissal substantially motivated by the fact that the claim type-matches nine other claims that have already been disproven? I think that in practice most people end up adopting the latter approach.

The rest of what you wrote sounds like you're pulling numbers out of your arse.Cure of Ars, I should prefer it if you no longer commented on my posts. There may be a place on Overcoming Bias for Catholics; but none for those who despise math they don't understand.

MIT Press has just published Peter Grünwald's The Minimum Description Length Principle. His Preface, Chapter 1, and Chapter 17 are available at that link. Chapter 17 is a comparison of different conceptions of induction.

I don't know this area well enough to judge Peter's wok, but it is certainly informative. Many of his points echo Eliezer's. If you find this topic interesting, Peter's book is definitely worth checking out.

"Different inductive formalisms are penalized by a worst-case constant factor relative to each other"

You mean a constant term; it's additive, not multiplicative.

That depends on whether you're thinking of the length or the probability. Since the length is the log-probability, it works out.

Coming back to this post, I finally noticed that "emotional" is a necessary word in the quoted sentence. If we leave it out, the sentence might just become false! That is, if you believe there's some sort of simple mathematical "key" to intelligence (my impression is that you do believe that), then you also ought to believe that the Solomonoff prior makes an intelligent god quite probable apriori. Maybe even more probable than the currently most elegant known formulations of physical laws, which include a whole zoo of elementary particles etc. Of course, if we take into account the evidence we've seen so far, it looks like

ouruniverse is based on physics rather than a "god".What sort of specification for Thor are you thinking of that could possibly be simpler than Maxwell's equations? A description of macroscopic electrical phenomena is more complex, as is "a being that wants to simulate Maxwell's equations."

If you're thinking of comparing all "god-like" hypotheses to Maxwell's equations, sure. But that comparison is a bit false - you should really be comparing all "god-like" hypotheses to all "natural law-like" hypotheses, in which case I confidently predict that the "natural law-like" hypotheses will win handily.

Yeah, I agree. The shortest god-programs are probably longer than the shortest physics-programs, just not "enormously" longer.

Probably enormously longer if you want it to produce a god that would cause the world to act in a way

as ifbasic EM held.ie, you don't just need a mind, you need to specify the sort of mind that would want to cause the world to be in a specific way...

Is there a nice way to quantify how fast does these theoretical priors drop off with the length of something? By how much should I favor simple explanation X over only mediumly more complicated explanation Y.

Interesting question. If you have a countable infinity of mutually exclusive explanations (e.g. they are all finite strings using letters from some finite alphabet), then your only constraint is that the infinite sum of all their prior probabilities must converge to 1. Otherwise you're free to choose. You could make the convergence really fast (say, by making the prior of a hypothesis inversely proportional to the exponent

of the exponentof its length), or slower if you wish to. A very natural and popular choice is restricting the hypotheses to form a "prefix-free set" (no hypothesis can begin with another shorter hypothesis) and then assigning every hypothesis of N bits a prior of 2^-N, which makes the sum converge by Kraft's inequality.What is the reasoning behind using a prefix-free set?

Apart from giving a simple formula for the prior, it comes in handy in other theoretical constructions. For example, if you have a "universal Turing machine" (a computer than can execute arbitrary programs) and feed it an infinite input stream of bits, perhaps coming from a random source because you intend to "execute a random program"... then it needs to know where the program ends. You could introduce an end-of-program marker, but a more general solution is to make valid programs form a prefix-free set, so that when the machine has finished reading a valid program, it knows that reading more bits won't result in a longer but still valid program. (Note that adding an end-of-program marker is one of the ways to make your set of programs prefix-free!)

Overall this is a nice example of an idea that "just smells good" to a mathematician's intuition.

Ah! I must have had a brain-stnank - this makes total sense in math / theoretical CS terms, I was substituting an incorrect interpretation of "hypothesis" when reading the comment out of context. Thanks :)

And, in particular, we're looking at god-programs that produce the output we've observed, which seems to cut out a lot of them (and specifically a lot of simple ones).

Occam's Razor is "entities must not be multiplied beyond necessity" (entia non sunt multiplicanda praeter necessitatem)

NOT "The simplest explanation that fits the facts."

Now thats just purely definition. I think both are true. I think there are problems with both. The problem with Occams razor, is that yes its true, however, it doesn't cover all the bases. There is a deeper underlying principle that makes Occams razor true, which is the one you described in the article. However summing up your article as "The simplest explanation that fits the facts" is also misleading as in, while it does seem to cover all the bases, it only does so if you use a very specific definition of simple which really doesnt fit with everyday language.

Example: Stonehenge, let me suggest two theories, 1. it was built by ancient humans, 2. it fell together through purely random geological process. Both theories fit with the facts, we know that both are physically possible (yes 2. is vastly less probable, ill get to that in a second). Occams razor suggest 2. as the answer, and "The simplest explanation" appears to be 2. also. Both seem to be failing. The real underlying principle as to why Occams razor is true, is statistics, not simplicity. Now dont get me wrong, I understand why "The simplest explanation that fit the facts" actually points to 1., but then you have to go through this long process of what you actually mean by simplest, which basically just ends up being a long explanation of how "simple" actually means "probable".

Anyways, I'm just arguing over semantics, I do in fact agree with everything you said. I just wish there was no Occams razor, it should just be "The theory which is the most statistically probable, is usually the right one." This is what people actually mean to say when they say "The simplest explanation that fits the facts."

In statistics generally the model that has the least variables and is the most statistically probable is the one used. See things like AIC or Bayesian Information Criterion on how to choose a good model. This means that Occam's razor is accurate. Given that is is possible to continuously add variables to a model and get a perfect fit but have the model be blown apart with the addition of an additional observation that is not otherwise influential, then, unless you are defining probability to include an Information Criterion, your formulation is less useful.

The form you list it in is the historical form of Occam's Razor, but it isn't the form that the Razor has been applied in for a fairly long time. Among other problems, defining what one means by distinct entities is problematic. And we really do want to prefer simpler explanations to more complicated ones. Indeed, the most general form of the razor doesn't even need to have an explanatory element (I in general prefer a low degree polynomial to interpolate some data to a high degree polynomial even if I have no explanation attached to why I should expect the actual phenomenon to fit a linear or quadratic polynomial.)

I may be missing something here -- Occam's Razor is "entities must not be multiplied beyond necessity" (entia non sunt multiplicanda praeter necessitatem)

-- but isn't the post using the first definition anyway? So even if he explicitly wrote the second definition instead of the first, he was clearly aware of the first since that's what corresponds with his argument.

I think replacing

witchcraftwithgodhoodis also a common mistakeWhat I don't understand is so much insistence that Occam's Razor applies only to explanations you address to God. Or else how do you avoid the observation that the simplicity of an explanation is a function of whom you are explaining to ? In the post, you actually touch on the issue, only to observe that there are difficulties interpreting Occam's Razor in the frame of explaining things to humans (in their own natural language), so let's transpose to a situation where humans are completely removed from the picture. Curiously enough, where the same issue occurs in the context of machine languages it is quickly "solved". Makes one wonder what Occam - who had no access to Turing machines - himself had in might.

Also, if you deal in practice with shortening code length of actual programs, at some point you have exploited all the low lying fruit; further progress can come after a moment of contemplation made you observe that distinct paths of control through the code have "something in common" that you may try to enhance to the point where you can factor it out. This "enhancing" follows from the quest for minimal "complexity", but it drives you to do locally, on the code, just the contrary of what you did during the "low-lying fruit" phase, you "complexify" rather than "simplify" two distinct areas of the code to make them resemble each other (and the target result emerges during the process, fun). What I mean to say, I guess, is that even the frame proposed by Chaitin-Kolmogorov complexity gives only fake reasons to neglect bias (from shared background or the equivalent).

"each program is further weighted by its fit to all data observed so far. This gives you a weighted mixture of experts that can predict future bits."

I don't see it explained anywhere what algorithm is used to weight the experts for this measure. Does it matter? And how are the "fit" probabilities and "complexity" probabilities combined? Multiply and normalize?

bayes theorem

What I find fascinating is that Solomonoff Induction (and the related concepts from Kolmogorov complexity) very elegantly solves the classical philosophical problem of induction, as well as resolving a lot of other problems:

Despite this, it is almost unheard of in the general philosophical (analytic philosophy) community. I've read literally dozens of top-grade philosophers discussing these topics, with the implication that these are still big unsolved problems, and in complete ignorance that there is a very rich mathematical theory in this area. And the theory's not exactly new either... dates back to the 1960s.

Anyone got an explanation for the disconnect?

Philosophers don't read those things. If that explanation seems lacking, I feel like referring to Feynman.

I don't think Solomonoff Induction solves any of those three things. I really hope it does, and I can see how it kinda goes half of the way there to solving them, but I just don't see it going all the way yet. (Mostly I'm concerned with #1. The other two I'm less sure about, but they are also less important.)

I don't know why the philosophical community seems to be ignoring Solomonoff Induction etc. though. It does seem relevant. Maybe the philosophers are just more cynical than we are about Solomonoff Induction's chances of eventually being able to solve 1, 2, and 3.

Possibly because .Solomonoff induction isnt very suitable to answering the kinds of questions philosophers want answered, questions of fundamental ontology.. It can tell you what programme would generate observed data, but it doesn't tell you what the programme is running on..the laws of physics, Gods mind, .or a giant simulation. OTOH, traditional Occams razor can exclude a range of ontological hypotheses.

There is also the problem that there is no absolute measure of the complexity of a programme: a programming language is still a language, and some languages can express some things more concisely than others, as explained in kokotajlods other comment. http://lesswrong.com/lw/jhm/understanding_and_justifying_solomonoff_induction/ady8

I found this paragraph confusing. How about

Does that mean the same thing?

Upcoming formal philosophy conference on the foundations of Occam's razor here. Abstracts included.

I'll be there!

Hello,

I need some help understanding the article after

Unless your program is being smart, and compressing the data, it should do no good just to move one bit from the data into the program description.How is the connection being made from complexity and fit to data and program description?

Thanks in advance! :)

Complexity, as defined in Solomonoff Induction, means program description - that is, code length in bits.

Sidenote: thank you for reminding me that Eliezer was talking about better versions of SI in 2007, before starting his quantum mechanics sequence.

I found a reference to a very nice overview for the mathematical motivations of Occam's Razor on wikipedia.

It's Chapter 28: Model Comparison and Occam's Razor; from (page 355 of) Information Theory, Inference, and Learning Algorithms (legally free to read pdf) by David J. C. MacKay.

The Solomonoff Induction stuff went over my head, but this overview's talk of trade-offs between communicating increasing numbers of model parameters vs having to communicate less residuals (ie. offsets from real data); was very informative.

My own way of thinking of Occam's Razor is through model selection. Suppose you have two competing statements H1 (the which did it) and H2 (it was chance or possibly something other than a which caused it (H2=¬H1)) and some observations D (the sequence came up 0101010101). Then the preferred statement is whichever is more probable calculated as

this is simply Bayes rule where

and the model is parametrized by some parameters θ.

Now all this is just the mathematical way of writing that a hypothesis that has more parameters (or more specifically more possible values that it predicts), will not be as strong a statement that predicts a smaller state of outcomes.

In the witch example this would be:

Now what remains is to estimate the priors and the the fraction of outcomes that look like a pattern. We can skip p(D) as we are interested in p(H1|D):p(H2|D).

Now comparing the amount of conditionals in the hypotheses and how surprised I am by them I would roughly estimate a ratio of the priors as something like 2100 in favor to chance, as the witch hypothesis goes against many of my formed beliefs of the world collected over many years, it includes weird choices of living for this hypothetical alien entity, it picks out me as a possible agent of many in the neighborhood, it singles out an arbitrary action of mine and an arbitrary set of outcomes.

For the sake of completeness. The fraction of outcomes that look like a pattern is kind of hard to estimate exactly. However, my way of thinking about it is how soon in the sequence would I postulate the specific sequence that it ended up in. After 0101, I think that the sequence 0101010101 is the most obvious pattern to continue it in. So roughly this is six bits of evidence.

In conclusion, I would say that the probability of the witch hypothesis is lacking around 94 bits of evidence for me to believe it as much as the chance hypothesis.

The downside of this approach to the Solomonoff induction and the minimum message length is that it is clunkier to use and it might be easy to forget to include conditionals or complexity in the priors the same way they can be lost in the English language. The upside is that as a model it is simpler, less ad hoc and builds directly on the product rule in probability and that probabilities sum to one and should thus be preferred by Occam's Razor ;).