Rationality: A-Z

[-]Alan_Crowe18y40 Response to previous version

The Vapnik Chernovenkis Dimension also offers a way of filling in the detail of the the concept of "simple" appropriate to Occam's Razor. I've read about it in the context of statistical learning theory, specifically "probably approximately correct learning".

Having successfully tuned the parameters of your model to fit the data, how likely is it to fit new data, that is, how well does it generalise. The VC dimension comes with formulae that tell you. I've not been able to follow the field, but I suspect that VC dimension leads to worst case estimates whose usefulness is harmed by their pessimism.

[-]Hopefully_Anonymous18y00 Response to previous version

Great post!

[-]steven18y-10 Response to previous version

"Your accusation of witchcraft wouldn't let you shorten the rest of the message; you would still have to describe, in full detail, the data which her witchery caused."

My model of witches, if I had one, would produce a given simple sequence like 01010101 with greater probability than a given random sequence like 00011011. Wouldn't yours? I might agree if you said "in nearly full detail".

[-]Eliezer Yudkowsky18y50 Response to previous version

Steven, that means you have to transmit the accusation of witchcraft, followed by a computer program, followed by the coded data. Why not just transmit the computer program followed by the coded data? I don't expect my own environment to be random noise, but that has nothing to do with witchcraft...

Alan, I agree that VC dimension is an important conceptually different way of thinking about "complexity". One of its primary selling points is that, for example, it doesn't attach infinite complexity to a model class that contains one real-valued parameter, if that model class isn't very flexible (i.e., it says only "the data points are greater than R"). But VC complexity doesn't plug into standard probability theory as easily as Solomonoff induction.

[-]Stephen_Jordan18y60 Response to previous version

In Solomonoff induction it is important to use a two-tape Turing machine where one tape is for the program and one is for the input and work space. The program tape is an infinite random string, but the program length is defined to be the number of bits that the Turing machine actually reads during its execution. This way the set of possible programs becomes a prefix free set. It follows that the prior probabilities will add up to one when you weight by 2^(-l) where l is program length. (I believe this was realized by Leonid Levin. In Solomonoff's original scheme the prior probabilities did not add to one.) This also allows the beautiful interpretation that the program tape is assigned by independent coin flips for each bit, and the 2^-l weighting arises naturally rather than as an artificial assumption. I believe this is discussed in the information theory book by Cover and Thomas.

[-]steven18y80 Response to previous version

Eliezer,

"I don't expect my own environment to be random noise, but that has nothing to do with witchcraft..."

I think I misinterpreted the math and now see what you're getting at. Would it be an accurate translation to human language to say, "a sequence like 10101010 may favor witchcraft over the hypothesis that nothing weird is going on (i.e. the coinflips are random), but it will never favor witchcraft over the simpler hypothesis that something weird is going on that isn't witchcraft"?

I find it awkward to think of "witchcraft" as just a content-free word; what "witchcraft" means to me is something like the possibility that reality includes human-mind-like things with personalities and with preferences that they achieve through unknown nonstandard causal means. If you coded that up, it would probably no longer be content-free; it would allow shortening the rest of the program generating the sequences in some cases and require lengthening it in some other cases. In all realistic cases the resulting program would still be longer than necessary.

[-]Eliezer Yudkowsky18y10 Response to previous version

Good comments, all!

Steven, yes. Stephen, also yes.

[-]Peter_de_Blanc18y200 Response to previous version

Eli, you said:

An enormous bolt of electricity comes out of the sky and hits something, and the Norse tribesfolk say, "Maybe a really powerful agent was angry and threw a lightning bolt." The human brain is the most complex artifact in the known universe. If anger seems simple, it's because we don't see all the neural circuitry that's implementing the emotion. (Imagine trying to explain why Saturday Night Live is funny, to an alien species with no sense of humor. But don't feel superior; you yourself have no sense of fnord.) The complexity of anger, and indeed the complexity of intelligence, was glossed over by the humans who hypothesized Thor the thunder-agent.

I think it's worth noting that Norse tribesfolk already knew about human beings, so whatever model of the universe they made had to include angry agents in it somewhere.

[-]abramdemski14y230 Response to previous version

I agree. I feel like the post is poking a bit of fun at hokey religion, and in so doing falls into an error. The Norse would do quite badly in life if they switched to a prior based on description lengths in Turing machines rather than a description length in their own language, because their language embodies useful bias concerning their environment. Similarly, English description lengths contain useful bias for our environment. The formalism of Solomonoff induction does not tell us which universal language to use, and English is a fine choice. The "thunder god" theory is not bad because of Occam's razor, but because it doesn't hold up when we investigate empirically! Similarly, if the Norse believed that earthquakes were caused by giant animals moving under the earth, it would not be such a bad theory given what evidence they had (even though animals are complex from a Turing-machine perspective); animals caused many things in their environment. We just know it is wrong today, based on what we know now.

[-]Venkat18y40 Response to previous version

What you are talking about in terms of Solmonoff induction is usually called algorithmic information theory and the shortest-program-to-produce-a-bit-string is usually called Kolmogorov-Chaitin information. I am sure you know this. Which begs the question, why didn't you mention this? I agree, it is the neatest way to think about Occam's razor. I am not sure why some are raising PAC theory and VC-dimension. I don't quite see how they illuminate Occam. Minimalist inductive learning is hardly the simplest "explanation" in the Occam sense, and is actually closer to Shannon entropy in spirit, in being more of a raw measure. Gregory Chaitin's 'Meta Math: The Search for Omega', which I did a review summary of is a pretty neat look at this stuff.

[-]Stephen_Jordan18y10 Response to previous version

Venkat: I think there is a very good reason to mention PAC learning. Namely, Kolmogorov complexity is uncomputable, so Solomonoff induction is not possible even in principle. Thus one must use approximate methods instead such as PAC learning.

[+]Cure_of_Ars18y-60 Response to previous version

[-]Nick_Tarleton18y90 Response to previous version

In science you do not get two theories

You're right - there are an infinite number of theories consistent with any set of observations. Any set. All observed facts are technically consistent with the prediction that gravity will reverse in one hour, but nobody believes that because of... Occam's Razor!

[-]Zaq11y-10 Response to previous version

I don't think this is what's actually going on in the brains of most humans.

Suppose there were ten random people who each told you that gravity would be suddenly reversing soon, but each one predicted a different month. For simplicity, person 1 predicts the gravity reversal will come in 1 month, person 2 predicts it will come in 2 months, etc.

Now you wait a month, and there's no gravity reversal, so clearly person 1 is wrong. You wait another month, and clearly person 2 is wrong. Then person 3 is proved wrong, as is person 4 and then 5 and then 6 and 7 and 8 and 9. And so when you approach the 10-month mark, you probably aren't going to be expecting a gravity-reversal.

Now, do you not suspect the gravity-reversal at month ten simply because it's not as simple as saying "there will never a be a gravity reversal," or is your dismissal substantially motivated by the fact that the claim type-matches nine other claims that have already been disproven? I think that in practice most people end up adopting the latter approach.

[-]Eliezer Yudkowsky18y80 Response to previous version

The rest of what you wrote sounds like you're pulling numbers out of your arse.

Cure of Ars, I should prefer it if you no longer commented on my posts. There may be a place on Overcoming Bias for Catholics; but none for those who despise math they don't understand.

[-]Jed_Harris18y10 Response to previous version

MIT Press has just published Peter Grünwald's The Minimum Description Length Principle. His Preface, Chapter 1, and Chapter 17 are available at that link. Chapter 17 is a comparison of different conceptions of induction.

I don't know this area well enough to judge Peter's wok, but it is certainly informative. Many of his points echo Eliezer's. If you find this topic interesting, Peter's book is definitely worth checking out.

[-]ksvanhorn15y50 Response to previous version

"Different inductive formalisms are penalized by a worst-case constant factor relative to each other"

You mean a constant term; it's additive, not multiplicative.

[-]abramdemski14y40 Response to previous version

That depends on whether you're thinking of the length or the probability. Since the length is the log-probability, it works out.

[+]shanerg15y-90 Response to previous version

[-]cousin_it15y120 Response to previous version

There is: It's enormously easier (as it turns out) to write a computer program that simulates Maxwell's Equations, compared to a computer program that simulates an intelligent emotional mind like Thor.

Coming back to this post, I finally noticed that "emotional" is a necessary word in the quoted sentence. If we leave it out, the sentence might just become false! That is, if you believe there's some sort of simple mathematical "key" to intelligence (my impression is that you do believe that), then you also ought to believe that the Solomonoff prior makes an intelligent god quite probable apriori. Maybe even more probable than the currently most elegant known formulations of physical laws, which include a whole zoo of elementary particles etc. Of course, if we take into account the evidence we've seen so far, it looks like our universe is based on physics rather than a "god".

[-]Manfred15y40 Response to previous version

What sort of specification for Thor are you thinking of that could possibly be simpler than Maxwell's equations? A description of macroscopic electrical phenomena is more complex, as is "a being that wants to simulate Maxwell's equations."

If you're thinking of comparing all "god-like" hypotheses to Maxwell's equations, sure. But that comparison is a bit false - you should really be comparing all "god-like" hypotheses to all "natural law-like" hypotheses, in which case I confidently predict that the "natural law-like" hypotheses will win handily.

[-]cousin_it15y30 Response to previous version

Yeah, I agree. The shortest god-programs are probably longer than the shortest physics-programs, just not "enormously" longer.

[-]Psy-Kosh15y70 Response to previous version

Probably enormously longer if you want it to produce a god that would cause the world to act in a way as if basic EM held.

ie, you don't just need a mind, you need to specify the sort of mind that would want to cause the world to be in a specific way...

[-][anonymous]15y00 Response to previous version

Is there a nice way to quantify how fast does these theoretical priors drop off with the length of something? By how much should I favor simple explanation X over only mediumly more complicated explanation Y.

[-]cousin_it15y50 Response to previous version

Interesting question. If you have a countable infinity of mutually exclusive explanations (e.g. they are all finite strings using letters from some finite alphabet), then your only constraint is that the infinite sum of all their prior probabilities must converge to 1. Otherwise you're free to choose. You could make the convergence really fast (say, by making the prior of a hypothesis inversely proportional to the exponent of the exponent of its length), or slower if you wish to. A very natural and popular choice is restricting the hypotheses to form a "prefix-free set" (no hypothesis can begin with another shorter hypothesis) and then assigning every hypothesis of N bits a prior of 2^-N, which makes the sum converge by Kraft's inequality.

[-]CuSithBell15y20 Response to previous version

What is the reasoning behind using a prefix-free set?

[-]cousin_it15y40 Response to previous version

Apart from giving a simple formula for the prior, it comes in handy in other theoretical constructions. For example, if you have a "universal Turing machine" (a computer than can execute arbitrary programs) and feed it an infinite input stream of bits, perhaps coming from a random source because you intend to "execute a random program"... then it needs to know where the program ends. You could introduce an end-of-program marker, but a more general solution is to make valid programs form a prefix-free set, so that when the machine has finished reading a valid program, it knows that reading more bits won't result in a longer but still valid program. (Note that adding an end-of-program marker is one of the ways to make your set of programs prefix-free!)

Overall this is a nice example of an idea that "just smells good" to a mathematician's intuition.

[-]CuSithBell15y00 Response to previous version

Ah! I must have had a brain-stnank - this makes total sense in math / theoretical CS terms, I was substituting an incorrect interpretation of "hypothesis" when reading the comment out of context. Thanks :)

[-][anonymous]15y00 Response to previous version

And, in particular, we're looking at god-programs that produce the output we've observed, which seems to cut out a lot of them (and specifically a lot of simple ones).

[-]dudeicus15y50 Response to previous version

Occam's Razor is "entities must not be multiplied beyond necessity" (entia non sunt multiplicanda praeter necessitatem)

NOT "The simplest explanation that fits the facts."

Now thats just purely definition. I think both are true. I think there are problems with both. The problem with Occams razor, is that yes its true, however, it doesn't cover all the bases. There is a deeper underlying principle that makes Occams razor true, which is the one you described in the article. However summing up your article as "The simplest explanation that fits the facts" is also misleading as in, while it does seem to cover all the bases, it only does so if you use a very specific definition of simple which really doesnt fit with everyday language.

Example: Stonehenge, let me suggest two theories, 1. it was built by ancient humans, 2. it fell together through purely random geological process. Both theories fit with the facts, we know that both are physically possible (yes 2. is vastly less probable, ill get to that in a second). Occams razor suggest 2. as the answer, and "The simplest explanation" appears to be 2. also. Both seem to be failing. The real underlying principle as to why Occams razor is true, is statistics, not simplicity. Now dont get me wrong, I understand why "The simplest explanation that fit the facts" actually points to 1., but then you have to go through this long process of what you actually mean by simplest, which basically just ends up being a long explanation of how "simple" actually means "probable".

Anyways, I'm just arguing over semantics, I do in fact agree with everything you said. I just wish there was no Occams razor, it should just be "The theory which is the most statistically probable, is usually the right one." This is what people actually mean to say when they say "The simplest explanation that fits the facts."

[-]JohnH14y40 Response to previous version

In statistics generally the model that has the least variables and is the most statistically probable is the one used. See things like AIC or Bayesian Information Criterion on how to choose a good model. This means that Occam's razor is accurate. Given that is is possible to continuously add variables to a model and get a perfect fit but have the model be blown apart with the addition of an additional observation that is not otherwise influential, then, unless you are defining probability to include an Information Criterion, your formulation is less useful.

[-]JoshuaZ14y50 Response to previous version

Occam's Razor is "entities must not be multiplied beyond necessity" (entia non sunt multiplicanda praeter necessitatem)

NOT "The simplest explanation that fits the facts."

The form you list it in is the historical form of Occam's Razor, but it isn't the form that the Razor has been applied in for a fairly long time. Among other problems, defining what one means by distinct entities is problematic. And we really do want to prefer simpler explanations to more complicated ones. Indeed, the most general form of the razor doesn't even need to have an explanatory element (I in general prefer a low degree polynomial to interpolate some data to a high degree polynomial even if I have no explanation attached to why I should expect the actual phenomenon to fit a linear or quadratic polynomial.)

[-]adamisom14y00 Response to previous version

I may be missing something here -- Occam's Razor is "entities must not be multiplied beyond necessity" (entia non sunt multiplicanda praeter necessitatem)

NOT "The simplest explanation that fits the facts."

-- but isn't the post using the first definition anyway? So even if he explicitly wrote the second definition instead of the first, he was clearly aware of the first since that's what corresponds with his argument.

[-]Dr_Manhattan14y20 Response to previous version

Witchcraft may fit our observations in the sense of qualitatively permitting them; but this is because witchcraft permits everything

I think replacing witchcraft with godhood is also a common mistake

[-]k4ntico14y20 Response to previous version

What I don't understand is so much insistence that Occam's Razor applies only to explanations you address to God. Or else how do you avoid the observation that the simplicity of an explanation is a function of whom you are explaining to ? In the post, you actually touch on the issue, only to observe that there are difficulties interpreting Occam's Razor in the frame of explaining things to humans (in their own natural language), so let's transpose to a situation where humans are completely removed from the picture. Curiously enough, where the same issue occurs in the context of machine languages it is quickly "solved". Makes one wonder what Occam - who had no access to Turing machines - himself had in might.

Also, if you deal in practice with shortening code length of actual programs, at some point you have exploited all the low lying fruit; further progress can come after a moment of contemplation made you observe that distinct paths of control through the code have "something in common" that you may try to enhance to the point where you can factor it out. This "enhancing" follows from the quest for minimal "complexity", but it drives you to do locally, on the code, just the contrary of what you did during the "low-lying fruit" phase, you "complexify" rather than "simplify" two distinct areas of the code to make them resemble each other (and the target result emerges during the process, fun). What I mean to say, I guess, is that even the frame proposed by Chaitin-Kolmogorov complexity gives only fake reasons to neglect bias (from shared background or the equivalent).

[-]quintopia14y00 Response to previous version

"each program is further weighted by its fit to all data observed so far. This gives you a weighted mixture of experts that can predict future bits."

I don't see it explained anywhere what algorithm is used to weight the experts for this measure. Does it matter? And how are the "fit" probabilities and "complexity" probabilities combined? Multiply and normalize?

[-][anonymous]14y30 Response to previous version

I don't see it explained anywhere what algorithm is used to weight the experts for this measure.

bayes theorem

[-]drnickbone14y70 Response to previous version

What I find fascinating is that Solomonoff Induction (and the related concepts from Kolmogorov complexity) very elegantly solves the classical philosophical problem of induction, as well as resolving a lot of other problems:

What is the correct "prior" in Bayesian inference, and isn't the choice of prior all subjective?
What does Occam's razor really mean, and what is a "simple" theory?
Why do physicists insist that their theories are "simple" when only they can understand them?

Despite this, it is almost unheard of in the general philosophical (analytic philosophy) community. I've read literally dozens of top-grade philosophers discussing these topics, with the implication that these are still big unsolved problems, and in complete ignorance that there is a very rich mathematical theory in this area. And the theory's not exactly new either... dates back to the 1960s.

Anyone got an explanation for the disconnect?

[-]thomblake14y20 Response to previous version

Anyone got an explanation for the disconnect?

Philosophers don't read those things. If that explanation seems lacking, I feel like referring to Feynman.

[-]kokotajlod13y00 Response to previous version

I don't think Solomonoff Induction solves any of those three things. I really hope it does, and I can see how it kinda goes half of the way there to solving them, but I just don't see it going all the way yet. (Mostly I'm concerned with #1. The other two I'm less sure about, but they are also less important.)

I don't know why the philosophical community seems to be ignoring Solomonoff Induction etc. though. It does seem relevant. Maybe the philosophers are just more cynical than we are about Solomonoff Induction's chances of eventually being able to solve 1, 2, and 3.

[-]TheAncientGeek10y10 Response to previous version

Possibly because .Solomonoff induction isnt very suitable to answering the kinds of questions philosophers want answered, questions of fundamental ontology.. It can tell you what programme would generate observed data, but it doesn't tell you what the programme is running on..the laws of physics, Gods mind, .or a giant simulation. OTOH, traditional Occams razor can exclude a range of ontological hypotheses.

There is also the problem that there is no absolute measure of the complexity of a programme: a programming language is still a language, and some languages can express some things more concisely than others, as explained in kokotajlods other comment. http://lesswrong.com/lw/jhm/understanding_and_justifying_solomonoff_induction/ady8

[-]John_Maxwell13y00 Response to previous version

If you let a program store one more binary bit of information, it will be able to cut down a space of possibilities by half, and hence assign twice as much probability to all the points in the remaining space. This suggests that one bit of program complexity should cost at least a "factor of two gain" in the fit. If you try to design a computer program that explicitly stores an outcome like "HTTHHT", the six bits that you lose in complexity must destroy all plausibility gained by a 64-fold improvement in fit. Otherwise, you will sooner or later decide that all fair coins are fixed.

I found this paragraph confusing. How about

If you let a program store one more binary bit of information, it will be able to cut down a space of possibilities by half, and hence assign twice as much probability to all the points in the remaining space. This suggests that one bit of program complexity should always buy at least a "factor of two gain" in the fit. If you try to design a computer program that explicitly stores an outcome like "HTTHHT", the six bits that you pay in complexity must get you at least a 64-fold improvement in fit. Otherwise, you will sooner or later decide that all fair coins are fixed.

Does that mean the same thing?

[-]lukeprog13y40 Response to previous version

Upcoming formal philosophy conference on the foundations of Occam's razor here. Abstracts included.

[-]vallinder13y20 Response to previous version

I'll be there!

[+]aceofspades13y-90 Response to previous version

[-]snishar9y20 Response to previous version

Hello,

I need some help understanding the article after Unless your program is being smart, and compressing the data, it should do no good just to move one bit from the data into the program description.

How is the connection being made from complexity and fit to data and program description?

Thanks in advance! :)

[-]hairyfigment9y00 Response to previous version

Complexity, as defined in Solomonoff Induction, means program description - that is, code length in bits.

Sidenote: thank you for reminding me that Eliezer was talking about better versions of SI in 2007, before starting his quantum mechanics sequence.

[-]Viktor Riabtsev7y30 Response to previous version

I found a reference to a very nice overview for the mathematical motivations of Occam's Razor on wikipedia.

It's Chapter 28: Model Comparison and Occam's Razor; from (page 355 of) Information Theory, Inference, and Learning Algorithms (legally free to read pdf) by David J. C. MacKay.

The Solomonoff Induction stuff went over my head, but this overview's talk of trade-offs between communicating increasing numbers of model parameters vs having to communicate less residuals (ie. offsets from real data); was very informative.

[-]Viktor Rehnberg5y100

My own way of thinking of Occam's Razor is through model selection. Suppose you have two competing statements $H_{1}$ (the which did it) and $H_{2}$ (it was chance or possibly something other than a which caused it ( $H_{2} = \neg H_{1}$ )) and some observations $D$ (the sequence came up 0101010101). Then the preferred statement is whichever is more probable calculated as

p (H | D) = \frac{p (H) p (D | H)}{p (D)}

this is simply Bayes rule where

p (D | H) = \int_{θ} p (D | θ, H) p (θ | H) d θ

and the model is parametrized by some parameters $θ$ .

Now all this is just the mathematical way of writing that a hypothesis that has more parameters (or more specifically more possible values that it predicts), will not be as strong a statement that predicts a smaller state of outcomes.

In the witch example this would be:

$H_{1} =$ There exist an advanced intelligent being (at least not much less than human intelligence) that can do things beyond what has ever been reproduced in a scientific way that for some reason chooses to live on our street and act mostly as a human that will choose to influence my sequence coin tosses to end up in some seemingly looking pattern
$H_{2} =$ The coin toss is ruled by chance and might end up in the set of possible outcomes that seem to form a pattern ( ${0101010101, 1111100000, 1100110011, \dots}$ )
$D =$ The coin toss ended up as $0101010101$
The way I stated the hypotheses $p (D | H_{1}) = p (D | H_{2}) \cdot Fraction of outcomes that look like a pattern$

Now what remains is to estimate the priors and the the fraction of outcomes that look like a pattern. We can skip $p (D)$ as we are interested in $p (H_{1} | D) : p (H_{2} | D)$ .

Now comparing the amount of conditionals in the hypotheses and how surprised I am by them I would roughly estimate a ratio of the priors as something like $2^{100}$ in favor to chance, as the witch hypothesis goes against many of my formed beliefs of the world collected over many years, it includes weird choices of living for this hypothetical alien entity, it picks out me as a possible agent of many in the neighborhood, it singles out an arbitrary action of mine and an arbitrary set of outcomes.

For the sake of completeness. The fraction of outcomes that look like a pattern is kind of hard to estimate exactly. However, my way of thinking about it is how soon in the sequence would I postulate the specific sequence that it ended up in. After 0101, I think that the sequence 0101010101 is the most obvious pattern to continue it in. So roughly this is six bits of evidence.

In conclusion, I would say that the probability of the witch hypothesis is lacking around 94 bits of evidence for me to believe it as much as the chance hypothesis.

The downside of this approach to the Solomonoff induction and the minimum message length is that it is clunkier to use and it might be easy to forget to include conditionals or complexity in the priors the same way they can be lost in the English language. The upside is that as a model it is simpler, less ad hoc and builds directly on the product rule in probability and that probabilities sum to one and should thus be preferred by Occam's Razor ;).

LESSWRONG
Rationality: A-Z
LW

LESSWRONG
Rationality: A-Z
LW

143

Occam's Razor

143

143