Many thanks to Unknowns for inventing the scenario that led to this post, and to Wei Dai for helpful discussion.

Imagine you subscribe to the universal prior. Roughly, this means you assign credence 2^-k to each program of length k whose output matches your sensory inputs so far, and 0 to all programs that failed to match. Does this imply you should assign credence 2^-m to any statement about the universe ("hypothesis") that has length m? or maybe Kolmogorov complexity m?

The answer is no. Consider the following examples:

1. The complexity of "A and B and C and D" is roughly equal to the complexity of "A or B or C or D", but we know for certain that the former hypothesis can never be more probable than the latter, no matter what A, B, C and D are.

2. The hypothesis "the correct theory of everything is the lexicographically least algorithm with K-complexity 3^^^^3" is quite short, but the universal prior for it is astronomically low.

3. The hypothesis "if my brother's wife's first son's best friend flips a coin, it will fall heads" has quite high complexity, but should be assigned credence 0.5, just like its negation.

Instead, the right way to derive a prior over hypotheses from a prior over predictors should be to construct the set of all predictors (world-algorithms) that "match" the hypothesis, and see how "wide" or "narrow" that set is. There's no connection to the complexity of the hypothesis itself.

An exception is if the hypothesis gives an explicit way to construct a predictor that satisfies it. In that case the correct prior for the hypothesis is bounded from below by the "naive" prior implied by length, so it can't be too low. This isn't true for many interesting hypotheses, though. For example the words "Islam is true", even expanded into the complete meanings of these words as encoded in human minds, don't offer you a way to implement or predict an omnipotent Allah, so the correct prior value for the Islam hypothesis is not obvious.

This idea may or may not defuse Pascal's Mugging - I'm not sure yet. Sorry, I was wrong about that, see Spurlock's comment and my reply.

New Comment
69 comments, sorted by Click to highlight new comments since:

In other words, universal prior assigns probability to points, not events. Probability of an event is sum over its elements, which is not related to the complexity of the event specification (and hypotheses are usually events, not points, especially in the context of Bayesian updating, which is why high-complexity hypotheses can have high probability according to universal prior).

For an explicit example of a high-complexity event with high prior, let one-element event S be one of very high Kolmogorov complexity, and therefore contain a single program that is assigned very low universal prior. Then, not-S is an event with about the same K-complexity as S, but which is assigned near-certain probability by universal prior (because it includes all possible programs, except for the one high-complexity program included in S).



The problem with your example is that 'not-S' is not an event, it's a huge set of events. It's like talking about the integer 452 and the 'integer' not-452.

The simplest beliefs about the events within that set can be extremely simple. For example, if S is described by the bits 1001001, the beliefs corresponding to non-S would be, "the first bit is 0', or "the second bit is 1", or, "the third bit is 1", and so forth.

Events aka hypotheses are sets of programs.


Sorry, you're right, I was totally confused by your (and cousin_it's) choice of words.


Wow, that was by far my worst comment on Less Wrong. That was really, really dumb.

It's OK, the post itself was pretty stupid and I guess it was infectious. Here's my attempt to fix things.

I'm new here, so maybe I'm just being stupid, but I'm massively confused by this (and previous discussion hasn't cleared it up for me):

The complexity of "A and B and C and D" is roughly equal to the complexity of "A or B or C or D"

That to me seems to be an utterly unfounded assertion. Of course the complexity of the two statements is more or less the same, but the the complexity of the state of the world given by A&B&C&D would be the length of the shortest description of a state of the world satisfying A and B and C and D, whereas the complexity of A|B|C|D would be the length of the shortest description of a state of the world satisfying at least one of the conditions. The latter one is potentially much smaller than the former, is it not?

You're right. In the post I refer only to "surface" complexities of statements, making what amounts to an obvious point.

Your example 2, "the correct theory of everything is the lexicographically least algorithm with K-complexity 3^^^^3", has the same problem as the "islam is true" example. K-complexity is uncomputable.

However, "the correct theory of everything is the lexicographically least algorithm in with length 3^^^^3" is computable - and the universal prior probability for it is low, but reasonable given the length of the sentence - it's not related to the magnitude of 3^^^^3.

Here at LessWrong, we're used to using "3^^^^3" as a way to point to a generic very big number, and certainly if you replaced "3^^^^3" with an explicit random number of similar magnitude, then the prior would be very low. But 3^^^^3 is different than most numbers of that size - it has a very short description.

If there was evidence suggesting that there was a very large constant somewhere in our physics, and someone showed that the last few digits of that constant matched the last few digits of 3^^^^3, then we would jump towards the hypothesis that in fact, the number is exactly 3^^^^3.

You haven't convinced me that Occam's prior is false.

I would claim that example 1 is an invalid criticism because Occam's prior only claims to work on sets of mutually exclusive hypotheses, example 2 is an invalid criticism because K-complexity is uncomputable, so the quoted hypothesis is outlawed (which may be unsatisfactory, I don't know the solution to that), and example 3 is an invalid criticism because in the case of P(Hypothesis|Prior Information) Occam's prior should only care about the stuff to the left of the "|".

The other problem with disregarding Occam's prior is that it has a proof. The proof is at least good enough to refute claims that there's no connection with complexity at all.

Re-reading your post I think that your definition of hypothesis is wider than mine, and when I say hypothesis I mean what you call a "predictor". If you apply Occam's prior to predictors (instead of the "wide"/"narrow" thing) do our positions become the same?

Cannot parse your first three paragraphs. Completely agree with the fourth.

The fourth one is the important one, and matches what Vladimir_Nesov was saying. In my second paragraph I'm explaining why each of your examples aren't things that you should apply Occam's prior to, even though they might be called "hypotheses". I can't really see how my paragraphs 1 and 3 could be unparsable.

Good post, but I object to the title. On average, more complex hypotheses must have lower prior probabilities, just for reasons of summability to 1.

I think a more proper title would be "The prior of a hypothesis is not a function of its complexity".

Huh? All hypotheses of a given complexity don't have to sum to 1, because they aren't mutually exclusive.

The math of it isn't as neat as I'd like, but what I mean is that there are only finitely many hypotheses of each complexity, exponentially many as the complexity goes up, and that most of them differ from each other at some point, so ordinary summability criteria "almost" apply. I don't think that the number of epiphenomenal hypotheses could be enough to keep the average up.

A hypothesis has approximately the same complexity as its negation, so we have many pairs that sum to 1 each.

People sometimes casually use "is a function of" to mean "depends on", so "is not determined by its complexity" might be clearer.

Can you elaborate on how it might defuse Pascal's Mugging? It seems the problem there is that, no matter how low your prior, the mugger can just increase the number of victims until the expected utility of paying up overwhelms that of not paying. Hypothesis complexity doesn't seem to play in, and even if I were using it to assign a low prior, this could still be overcome.

That said, any solution to the problem (Robin's of course being a good start) is more than welcome.

Not sure. I must've gone crazy for a minute there, thinking something like "being able to influence 3^^^^3 people is a huge conjunction of statements, thus low probability" - but of course the universal prior doesn't work like that. Struck that part.

It still seems relevant to me, since like in the "if my brother's wife's first son's best friend flips a coin, it will fall heads" example, the prior probability actually comes from opening up the statement and looking inside, in a way that would also differentiate just fine between stubbing 3 toes and stubbing 3^3^3 toes.

Question: can we construct a low-complexity event that has universal prior much lower than is implied by its complexity, in other words it describes a relatively small set of programs, each of which has high complexity? Clearly it can't just describe one program, but maybe with a whole set of them it's possible. Naturally, the programs must be still hard-to-locate given the event.


See example 2 in the post. I think you can use Rice's theorem to easily construct hypotheses with hard-to-locate predictors, but I'm not sure about the K-complexity of the resulting predictors.

K-complexity of the program defined by that criterion is about as low as that of the criterion, I'm afraid, so example 2 is invalid ("complexity" that is not K-complexity shouldn't be relevant). The universal prior for that theory is not astronomically low.

Edit: This is wrong, in particular because the criterion doesn't present an algorithm for finding the program, and because the program must by definition have high K-complexity.

Um, what? Can you exhibit a low-complexity algorithm that predicts sensory inputs in accordance with the theory from example 2? That's what it would mean for the universal prior to not be low. Or am I missing something?

You are right, see updated comment.


Yes, forgot about that. So, just crossing the meta-levels once is enough to create a gap in complexity, even if the event only has one element.

Pascal's mugging shouldn't come into it if you are using the universal prior. Or at least following an AIXI-ish version. Because AIXI does not intrinsically allow people to tell it the utility of actions, it figures them out for itself empirically. It treats the utility as another part of the world it is trying to predict.


Isn't the solution the same as for Pascal's Wager? That is, just as Muslim Heaven/Hell cancels out Christian Heaven/Hell, the possibility that hell is triggered if you give in to the mugger cancels out the possibility that the mugger is telling the truth.

It doesn't apply in quite the same way. You would have to be able to assert that there was an equal or greater chance that the mugger would do the opposite of what he says.

If there is a 99% (obviously it's much higher, but you see the idea) chance he's lying and won't do anything, that still doesn't cancel out that 1% chance he's telling the truth, because the expected utility multiplied by 3^^^3 people (or whatever) still overwhelms. Now if you could say it was equally likely that he would torture those people only if you DID pay him, that would nullify it. But it's not clear that you can do this because most muggers are not playing tricksy opposite-day games when they threaten you. And if the guy is really evil enough to set up a trick like that, it seems like he'd just go ahead and torture the people without consulting you.

Evidence on the actual tendencies of omnipotent muggers is lacking, but you can at least see why it's not clear that these cancel out.

I think your general conclusion is correct: "does subscription to the universal prior imply assigning probability 2^-m to any hypothesis of length/complexity m? .. ahh no". But perhaps obviously so?

The key is the universal prior deals with programs that are complete predictors of an entire sequence of observations. They are not just simple statements.

If you want to compare to simple statements, you need to quantify their predictive power. The universal prior is a secondary classification, a way of ranking a set of algorithms/hypotheses that all are 100% perfectly accurate and specific. Its like a secondary weighting procedure you use when you have many equally perfect choices given your current data. ( given my understanding refreshed by your description)

That clearly can't apply to statements in general, because statements in general do not in general perfectly predict the whole sequence.

For #1, what are A,B,C,D supposed to be? Complete world predictor programs?

Simple statements? If A&B&C&D is a complete predictor, and so is A|B|C|D, then the universal prior will not choose either: it will choose the simplest of A,B,C,D.

Otherwise, if A&B&C&D is required for a full predictor, then A|B|C|D can not be a full predictor. There is no case then where A&B&C&D has less predictive power than A|B|C|D. The former is more specific and thus has strictly more predictive power, but of course yes is less intrinsically probable.

I think this wrong, but I'm not sure how to directly counter it. so just some notes:

  1. When are "A and B" and "A or B" both predictors that fit the existing data?

  2. If it turns out that the laws of physics are themselves the result of a simple algorithm, wouldn't that seem obvious in retrospect?

  3. The added complexity is not about explanation. Your arrival at .5 is based on a ton of evidence about coins.

The complexity of "A&B&C&D" is roughly equal to the complexity of "A||B||C||D", but we know for certain that the former hypothesis can never be more probable than the latter,

I think you are trying to make formal arguments based on informal definitions, leading to confusion. KC refers to program lengths, but if A,B,C are programs, then what is A||B||C? More careful definitions would lead to a more traditional conclusion.

For one way to formalize the A/B/C/D point, imagine you are trying to solve the UCI Census Income problem, where one attempts to predict whether a person's income is greater than 50k based on various other factors. Let A/B/C/D refer to binary predictors such as (is-married?), (is-religious?), (has-college-degree?), and (is-military?). The goal is to find a combination of these that predicts income well. Now it is of course true that (A||B||C||D) is more likely than (A&B&C&D) for a person. But there is no prior reason to believe that the OR-hypothesis has more predictive power than the AND-hypothesis, just as a KC-style analysis would indicate.

I think you are trying to make formal arguments based on informal definitions, leading to confusion. KC refers to program lengths, but if A,B,C are programs, then what is A||B||C?

Whaa? I found that paragraph more confused than anything I wrote :-) Kolmogorov complexity is defined for arbitrary bit strings, which can be not programs but (say) statements about the world expressed in some formal language. Yeah, the definition of KC quantifies over programs that generate those bit strings; so what.

Cannot parse your second paragraph because I don't understand why "predictive power" should be related to prior probability of a hypothesis.

why "predictive power" should be related to prior probability of a hypothesis.

To solve the learning problem as described, you must ask: what is the prior probability that a given rule/hypothesis will correctly predict income? It is trivially true that the OR-hypothesis selects a larger number of people, but there is no reason to believe it will more accurately predict income.

Since you don't buy the KC idea, do you also refuse to accept the more general idea of capacity control/regularization/MDL as a (the) way to prevent overfitting and achieve generalization? In the standard setting of the learning problem, it seems inevitable that some method of penalizing complexity is necessary for generalization.

I thought about it some more and it seems you're right. In learning problems we need some weighting of hypotheses to prevent overfitting, description length has no obvious downsides, so we can just use it and be happy. Now I just need to shoehorn Islam into the "learning problem" framework, to understand why our prior for it should be low...

This isn't about prior though.

I'm not convinced that example 1 works. "A and B and C" and "A or B or C" do have roughly the same complexity. It is then after we look at the content that we then update our estimates based on the logical form of the claims to give P(A and B and C) < P(A or B or C).

A prior - especially one as fundamental as the universal prior - must be coherent from the start. Inspecting your hypotheses more closely without looking at the universe should never cause "updating".


To be fair, would you agree that this is a big weakness of probability theory? It can be really difficult to sort out some of those boolean combinations.

Edit: Why the downvote?

I'm pretty sure it doesn't break probability theory per se, only limits its applicability. For example, Wei Dai wants UDT to use a "mathematical intuition module" that assigns degrees of belief to math statements, while I'm skeptical of the idea.


I argued here that it means that an AI can't compute its own prior. That argument is about all mathematical statements. If we just restrict to boolean combinations on a finite sample space then I think a similar argument shows that an AI can't compute its own prior in polynomial time, unless P = NP.

Edit: Wait, that last might be boneheaded. For given inputs we can compute Boolean expressions pretty efficiently, I think we just can't efficiently determine whether they're tautologies.

That depends on what statements are "expressible" by the AI. If it can use quantification ("this boolean formula is true for some/all inputs"), computing the prior becomes NP-hard. An even trickier case: imagine you program the AI with ZFC and ask it for its prior about the continuum hypothesis? On the other hand, you may use clever programming to avoid evaluating prior values that are difficult or impossible to evaluate, like Monte Carlo AIXI does.


Far out.


Sorry, my reply was off track, so I deleted it once again.



Your argument looks correct: being able to compute the value of the prior for any sentence you can represent is a very strong condition. In the boolean satisfiability setting this corresponds to P=NP, and in stronger settings it corresponds to incompleteness (e.g. you believe in ZFC, what's your prior for CH?) On the other hand, you may cleverly maneuver so that all prior values you end up calculating in practice will be tractable, like in Monte Carlo AIXI.

OP was talking about priors, which is the part before we look at the content. OP is saying, I think, that the prior should be the fraction of all possible worlds in which the proposition is true.

Whether you accept this example depends on how you interpret the phrase "possible worlds". If "all possible worlds" is all possible values of a binary string, and "possible worlds in which P is true" is all binary strings in which position P = 1, then A or B or C gets a higher prior than A and B and C.

However, if you compute your "prior" not when you are a Cartesian blank slate with no evolutionary history, but when you already know something about the world, then "all possible worlds" could mean "all possible values of A, B, and C that are consistent with previous information". In that case, I would expect that "A and B and C" and "A or B or C" would have more similar priors, provided that the set {A, B, C} was chosen deliberately. A person reasoning about A, B, and C probably chose them due to some perceived commonality, meaning there is some underlying rule about things in this set, and the hypothesis being tested is really "is this an 'OR' rule or an 'AND' rule?"

Consider that the complexity of "A and B" is a lot less than "(A and B) or (A and C) or (A and D)", but P(A and B) < P((A and B) or (A and C) or (A and D))


Nice. Can you give some examples related to your last two paragraphs?

Cut some now irrelevant nonsense about Pascal's mugging.


There's one rather big example. In the comments to Unknowns's scenario, Wei Dai asked us to calculate the probability of Islam being true, conditional on many people believing in it. Most of us failed for non-obvious reasons, and my own failure was especially embarrassing because many people upvoted it. This post is me trying to make progress towards that goal, namely trying to work out what our prior for Islam should be.


I need more help than that. Can you expand "the hypothesis gives an explicit way to construct a predictor that satisfies it" in case hypothesis = Islam?

I don't know whether we can construct a predictor that satisfies Islam. What sensory inputs are consistent with the existence of Allah? Muslims would claim that their sensory inputs are just fine... So the question with Islam is difficult and I don't feel I've solved it yet. All I know is that the argument from complexity ("burdensome details") certainly fails because it just assigns a credence based on hypothesis length, which can't work in general.


I asked a harder question than I meant to, because I don't know what's going on. Tell me what a "predictor that satisfies X" is. If X = Islam is too hard, tell me for an X contrived so as to make saying so easy.

Edit: Perplexed called me disingenuous yesterday, maybe for questions like these? But I am serious.

Imagine you're debugging a program that crashes. Your friend Bob is smarter than you and has already figured it out, so for any input data he can tell you in advance whether it will crash or not. You formulate a hypothesis: "the crash happens within this function". It does not yet allow you to predict crashes accurately, because there's a variety of things that could go wrong within this particular function. Now you tell Bob, "I think it crashes somewhere in this function", and Bob nods, "Yeah". This means Bob belongs to the subset of possible predictors that satisfy your hypothesis :-)

But what predictions Bob should make to satisfy Islam seems to be a more difficult question.

Edit: I don't think you're disingenuous. Asking the right questions can help all participants achieve clarity where there was none. It's an art that I'm slowly learning here on LW - most of the results in my posts come out of discussions.


Sorry for deleting my comment. I noticed it was wrong before I saw your reply...


You seem to completely misunderstand Kolmogorov prior, confusing hypotheses with events.

Probability of events "A and B and C and D" is sum of Kolmogorov-derived probabilities of every hypothesis implying "A and B and C and D" that's also consistent with data.

Probability of events "A or B or C or D" is sum of Kolmogorov-derived probabilities of every hypothesis implying "A or B or C or D" that's also consistent with data.

Hypotheses in first set won't be more complex than hypotheses in second set - the second set is simply far far larger.

Second probability consists of things like: (1/2)^length("A") + (1/2)^length("B") + (1/2)^length("A and B") + (1/2)^length("A or B") + (1/2)^length("A and C") + ...

In the post, the word "hypothesis" is used to refer to events (which is the usual meaning in the context of Bayesian updating). They are not confused, you just didn't use the intended interpretation.


In the post, the word "hypothesis" is used to refer to events

Then title "complexity of a hypothesis" makes no sense. Events ("hypotheses" in post) have no complexity. Descriptions ("hypotheses" in my comment) have complexity.

Connection is between complexity of descriptions and probability of events, and it has none of the problems described in the post.


Even after adjusting for terminology, I can't understand the point of your comment. Maybe you could flesh it out more formally in your own preferred terms?

For example, imagine we're receiving a sequence of integers. Consider the statements A = "all are even", B = "all are perfect squares", C = A and B, D = A or B. In your formalism, which of these are "hypotheses"? Which are "events"? Which "have complexity", and which don't?

I think you simply point out that I'm using incorrect terminology: I should've used "descriptions of events" in place of "hypotheses", and "hypotheses" in place of "predictors" (or "algorithms", or "worlds"). Okay, I can live with that.

NB: onlookers, please don't downvote parent. Upvoted back to 0.

He essentially argued about definitions, but presented that as a disagreement about content. This is a standard error (whether by oversight or not), hence should be downvoted.


Your post still wouldn't make any sense. Events/world have prior probability, but not complexity. Descriptions/algorithms/predictors have complexity, but not prior probability.

Infinite number of descriptions of different lengths predicts same event.

Does this imply you should assign credence 2^-m to any statement about the universe ("hypothesis") that has length m? [...]

This confuses "A and B and C and D" as description (which has complexity but no prior probability) and "A and B and C and D" as subset of possible worlds (which has prior probability but no complexity).

Speaking about Kolmogorov complexity requires a lot of precision - we can select toy world but there's just no way to use anything less than Turing complete description language.

Except if world can described in m bits, 2^-m serves as a lower bound on probability, as long as it's consistent with data.

  1. The hypothesis "the correct theory of everything is the lexicographically least algorithm with K-complexity 3^^^^3" is quite short, but the universal prior for it is astronomically low.

Is it? If you try to encode informal statements formally, they get very big. And doesn't "lexicographically least algorithm with K-complexity 3^^^^3" violate Rice theorem? If it wasn't for these problems, 2^-m would still be proper lower bound of prior probability of this.

Any statement about the world ("description") does have a prior probability, which is equal to the sum of prior probabilities of worlds satisfying that statement. This is kind of obvious.

The prior probability of a statement is not connected to its "complexity" (whatever that means). This is also kind of obvious, but many people have managed to miss it, so I wrote the post.

A formal encoding of example 2 (as a predicate on algorithms expressed in ZFC, say) isn't very long or difficult. The notion of K-complexity is easily definable, though not computable. Just like BB numbers - also easily definable, but not computable. What does Rice's theorem have to do with it? Maybe I'm missing something?


By complexity I mean program length.

Any statement about the world ("description") does have a prior probability, which is equal to the sum of prior probabilities of worlds satisfying that statement. This is kind of obvious.

Programs used in Kolmogorov priors have to describe everything about the world. So it's about programs like "print(2,4,6);" and "while(true){print(42);}" and their lengths. World "2 4 6" has probability induced by all programs that generate "2 4 6".

You don't normally have world-testing programs, just world-generating programs.

By complexity I mean program length.

Then why did you mention things like length("A and B") in your first comment? You can't take the conjunction of two programs, can you? I'm confused.


I shouldn't be replying that late at night, as my clarity suffer a lot, and this subject demands a lot of clarity.

There are meaningful ways to describe Kolmogorov complexity of world-testing programs and their combinations, and there is meaningful equivalent of conditional probability. But you don't use them in a prior directly. When you want to define prior with world-testing programs you need to do a moral equivalent of padding every program with enough entropy to fully describe the world. So program "numbers are 2, 4, 6" converts without entropy padding, program "numbers are even" needs a lot of extra entropy, and "numbers are whatever Jesus wants them to be" requires as much entropy as an empty program.

My excuse is that when you sum over every possible padding, as in (1/2)^length("A and B" + padding bits 0000) + (1/2)^length("A and B" + padding bits 0001) + ... what you get will be simply (1/2)^length("A and B") times a small language-dependent constant for (1/2)^length(program to convert entropy bits and world-testing program into world-generating program).

So you have:

  • Length of world-testing program
  • Entropy needed to get world-generating program from world-testing program
  • Probability that world-testing program matches the world

For complete descriptions (no extra entropy needed) of world that don't have any shorter descriptions, complexity of description is very closely related to Kolmogorov complexity - "sequence is 50 times 42" is shorter than "sequence is 9, 0, 2, 9, 3, 2, 7, 5, 8, 1, 4, 9, 8, 7, 1, 0, 6, 2, 3, 4, 4, 8, 5, 7, 4, 4, 7, 0, 1, 2, 8, 3, 9, 0, 9, 0, 3, 2, 8, 4, 6, 0, 4, 7, 8, 3, 4, 2, 3, 6, 3", both are unambiguous and about maximally concise (I generated the second randomly) - so first is more likely than second.

For incomplete descriptions it's more complicated, but mathematics should be straightforward enough.

Thanks, this angle looks interesting, and new to me. Could you explain some more how it's supposed to work? For example, if the input is an infinite stream of numbers and my hypothesis is something like "next number will be 42" or "prime-numbered members will be prime", should I pad it with infinite additional entropy?


Normally streams are assumed to be finitely long and have symbols from finite set to keep math neat.

Extension to infinitely long streams with finite complexity are conceptually fairly straightforward using trick very similar to non-normalizable Bayesian prior (just like uniform prior over real numbers is well defined even if it sums to infinity so it's not really probability, or entropy of a molecule when atom locations are unbound continuous variables), but math can get considerably messier.

Finite or not, streams must be countable.

So just select some ordering and for stream-tester T and number N world generator is "i=0; for(w in streams) { if(T(w)) i += 1; if(i == N) { print(w); break; } }". There are multiple ways to choose iteration order and encoding for N (simplest would be fixed bit size number, but you need more complicated encoding for infinite number of possibilities obviously), but they're just a constant away from each other, so you can ignore this detail.


Ah, sorry. I deleted my other reply, will reply here.