I read this comment, and after a bit of rambling I realized I was as confused as the poster. A bit more thinking later, I ended up with the “definition” of probability under the next heading. It’s not anything groundbreaking, just a distillation (specifically, mine) of things discussed here over time. It’s just what my brain thinks when I hear the word.

But I was surprised and intrigued when I actually put it in writing and read it back and thought about it. I don’t remember seeing it stated like that (but I probably read some similar things).

It probably won’t teach anyone anything, but it might trigger a similar “distillation” of “mind pictures” in others, and I’m curious to see that.

### What “probability” is...

Or, more exactly, what is the answer to “what’s the probability of X?”

Well, I don’t actually know, and it probably depends on who asks. But here’s the skeleton of the answer procedure:

1. Take the set of all (logically) possible universes. Assign to each universe a finite, real value *m* (see below).
2. Eliminate from the set those that are inconsistent with your experiences. Call the remaining set E.
3. Construct T, the subset of E where X happens (or happened, or is true).
4. Assign to each universe *u* in E a value *p*, such that *p*(*u*) is inversely proportional to *m*(*u*), and the integral of *p* over E is 1.
5. Calculate the integral of *p* over T. The result is called “the probability of X”, and is the answer to the question.
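For a finite toy collection of universes, the five steps can be sketched in a few lines of code (all names and numbers below are my own illustration, not anything from the schema itself; the “integrals” collapse to sums):

```python
def probability_of_x(universes):
    # Step 2: keep only universes consistent with experience -> E
    E = [u for u in universes if u["consistent"]]
    # Step 4: p(u) inversely proportional to m(u), normalized over E
    total = sum(1.0 / u["m"] for u in E)
    # Steps 3 and 5: "integrate" p over the subset T where X holds
    return sum(1.0 / u["m"] for u in E if u["x_holds"]) / total

universes = [
    {"m": 1.0, "consistent": True,  "x_holds": True},
    {"m": 1.0, "consistent": True,  "x_holds": False},
    {"m": 2.0, "consistent": True,  "x_holds": False},
    {"m": 1.0, "consistent": False, "x_holds": True},  # eliminated in step 2
]
print(probability_of_x(universes))  # 1.0 / (1.0 + 1.0 + 0.5) = 0.4
```

In the finite case the normalization in step 4 and the integral in step 5 are just a weighted count; the hard part of the schema is everything the toy version assumes away.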

I’m aware that this isn’t quite a definition; in fact, it leaves more unsaid (undefined) than it explains. Nevertheless, to me it seems that the *structure* itself is *right*: people might have different interpretations for the details (and, like me, be uncertain about them), but those differences would *still* be mentally structured like above.

In the next section I explain a bit where each piece comes from and what it means, and in the one after I’m going to ramble a bit.

### Clarifications

About (logically possible) **universes**: We don’t actually know what our universe *is*; as such, “other possible universes” isn’t quite a well-defined concept. For generality, the only constraint I put above is that they be logically possible, for the simple reason that the description is (vaguely) mathematical and I have no idea what math without logic would mean. (I might be missing something, though.)

Note that by “universe” I really mean an entire universe, not just “until now”. E.g., if it so happens that your experiences allow for a single possible past (i.e., you *know* the entire history), but your universe is not deterministic, there are still many universes in E (one for each possible future); if it’s deterministic, then E contains just one universe. (And your calculations are a lot easier...)

Before you get too scared or excited by the concept of “all possible universes”, remember that not all of them are actually used in the rest of the procedure. We actually need only those **consistent with experience**. That’s still a lot when you think about it, but my mind seems to reel in panic whenever I forget this point. (Lest this note make you too comfortable, I must also mention that the possibility that experience is (even partly) simulated explodes the size of E.)

About that **real value m** I was talking about: “m” comes from “measure”, but that’s a consequence of how I arrived at the schema above. Even now I’m not quite sure it belongs there, because it depends on what you think “possible universes” means. If you just set it to 1 for all universes, everything works.

But, for example, you might consider the set U to be countable, encode each universe as a bit-string using a well-defined rule, and use the Kolmogorov complexity of that bit-string as the universe’s measure. (Given step 4 above, this would mean that you think simpler universes are more probable; except it doesn’t quite mean that, because “probable” is defined only after you’ve picked your “m”. It’s probably closer to “things that happen in simpler universes are more probable”; more in the musings section.)
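Kolmogorov complexity itself is uncomputable, so any concrete version of this has to use a proxy. A crude sketch, using zlib-compressed length as a stand-in for the complexity of a universe’s (entirely made-up) bit-string encoding:

```python
# Crude stand-in for the idea above: Kolmogorov complexity is
# uncomputable, so this uses compressed length as a rough proxy for the
# complexity of a "universe encoding". The encodings are made up.
import zlib

def m(universe_encoding: bytes) -> float:
    """Proxy measure: compressed length of the universe's encoding."""
    return float(len(zlib.compress(universe_encoding)))

simple = b"01" * 500                                  # very regular
messy = bytes((i * 197) % 256 for i in range(1000))   # far less regular

# Under step 4 (p inversely proportional to m), the more compressible
# "universe" gets the larger probability weight.
print(m(simple) < m(messy))  # True
```

This only illustrates the direction of the bias (simpler encodings get smaller *m*); a compressed length is not the Kolmogorov complexity, and the encoding rule matters, as discussed below.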

A bit about the **math**: I used some math terms a bit loosely in the schema above. Depending on exactly what you mean by “possible universes”, the set of them might be finite, countably infinite, uncountable, or might be a proper class rather than a set. Depending on that, “integrating” becomes a different operation. If you can’t (mathematically, not physically) do such an operation on your collection of possible universes (actually, on those in E), then you have to define your own concept of probability :-P

With regards to **computability**, note that the series of steps above is *not* an algorithm, it’s just the definition. It doesn’t feel intuitive to me that there is any possible universe where you can actually follow the steps above, but math surprises me in that regard sometimes. But note that you don’t really need *p*(X): you just need a good-enough approximation, and you’re free to use any trick you want.

### Musings

If the above didn’t interest you, the rest probably won’t, either. I’ve put here the most interesting consequences of the schema above. It’s kind of rambling, and I apologize; as in the last section, I’ll **bold** keywords, so you can just skim for paragraphs that might interest you.

I found it interesting (but not surprising) to note that **Bayesian** statistics corresponds well to the schema above. As far as I can tell, the Bayesian **prior** for (any) X is the number assigned in step 5; Bayesian updating is just going back to step 2 whenever you have new experiences. The interesting part is that my description *smells* **frequentist**. I wasn’t that surprised, because the main difference (in my head) between the two is the use of priors: frequentist statistics ignores prior knowledge. If you just do frequentist statistics on *every possible event in every possible universe* (for some value of possible), then there is no “prior knowledge” left to ignore.

The schema above describes only true/false-type problems. For **non-binary problems** you just split E in step 3 into several subsets, one for each possible answer. If the problem is **real-valued** you need to split E into uncountably many sets, but I’ve abused set-theory terms enough today that I’m not very concerned. Anyway, in practice (in our universe) it’s usually enough to split the domain of the value into countably many intervals, according to the precision you need, and split the universes in E according to which interval they fall in. That is, you don’t actually need to know the probability that a value is, say, sqrt(2), just that it’s closer to sqrt(2) than you can measure.
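The interval-splitting trick can be sketched directly (uniform measure, so each interval’s probability is just a normalized count; the set E and the value function here are illustrative):

```python
# Sketch of the real-valued case: bin the universes in E by which
# interval the quantity falls into, with uniform measure (m = 1 for all),
# so each interval's probability is a normalized count. Names are made up.
from collections import Counter
import math

def interval_probabilities(E, value_of, width):
    bins = Counter(math.floor(value_of(u) / width) for u in E)
    return {k * width: count / len(E) for k, count in sorted(bins.items())}

# Toy E: each "universe" is identified by the value the quantity takes in it.
E = [1.38, 1.41, 1.42, 2.71, 2.72, 3.14]
probs = interval_probabilities(E, lambda u: u, width=0.5)
print(probs)  # {1.0: 0.5, 2.5: 0.333..., 3.0: 0.166...}
```

Narrower intervals refine the answer at the cost of more splitting; the choice of `width` is exactly the “precision you need” above.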

With regard to past discussions about a **rationale for rationality**, observe that it’s possible to apply the procedure above to evaluate what is the “rational way”, supposing you define it by “the rational guy *plays to win*”: instead of step (3) generate the set of decision procedures that are applicable in all E, call it D; for each *d* in D, split E into universes where you win and those where you lose (don't win), and call these W(*d*) and L(*d*); instead of step 4, for each decision procedure *d*, calculate the “winningness” of *d* as the integral of *p* over W(*d*) divided by the integral over L(*d*) (with *p* defined like above); instead of step 5, pick a decision *d _{0}* such that it's “winningness” is maximal (no other has a larger value).

Note that I’ve no idea if doing this actually picks the decision procedure above, nor what exactly it would mean if it doesn’t... Of course, if it does, it’s still circular, like any “reason for reasoning”. The procedure might also give different results for people with different E. I found it interesting to contemplate that it might be “possible” for someone in another universe (one much friendlier to applied calculus than ours) to calculate *exactly* the solution of the procedure for *my* E, but at the same time for the best procedure for approximating it in *my* universe to give a different answer. They can’t, of course, communicate this to me (since then they’re *not* in a different universe in the sense used above).

If your ontology implies a **computable universe** (so you only need to consider computable universes in E), you might want to use **Kolmogorov complexity** as a measure for the universes. I’ve no idea which encoding you should use to calculate it; there are theorems saying that the difference between two encodings is *bounded* by a constant, but I don’t see why certain encodings can’t be biased in ways that have systematic effects on your probability calculations. (Other than “it’s kind of a long shot”.) You might use the above procedure for deciding on decision procedures, of course :-P

There’s also a theorem that says you can’t actually write a program that computes the KC of an arbitrary bit-string. There might be a universe-to-bit-string encoding that generates only bit-strings for which such a program exists, but that’s also kind of a long shot.

If your ontology implies **quantum mechanics** then I *think* the measure of the universes (*m*(*u*) in step 1) must involve wave functions somehow, but my understanding of QM doesn’t allow me to think it through much.

The schema above illuminated a bit something that puzzled me in that comment I was talking about at the beginning: say you are suddenly sent to the planet Progsta and a Sillpruk comes and asks you whether the game of Doldun will be won by the team Strigli; what’s your prior for the answer? What puzzled me was that *the very fact that you were asked that question* communicates an enormous amount of information — see this comment of mine for examples — and yet I couldn’t actually see how that should affect my priors. Of course, the information content of the question hugely restricts the universes in my E. But there were so many to start with that E is still huge; more importantly, it *restricts the universes along boundaries that I’ve not previously explored*, and I don’t have ready heuristics to estimate that little *p* above:

If I throw a fair die, I can split the universes into six approximately equal parts on vague symmetry justifications, and just estimate the probability of each side as 1/6. If someone on the street asks me to bet on *his* die, I can split the universes into those where I win and those where I lose, and estimate (using a kind of Monte Carlo integration over the various scenarios I can think of) that I’ll probably lose. If I encounter an alien named Sillpruk, I’ve no idea how to split the universes to estimate the result of a Doldun match. But if I were to encounter lots of aliens with strange first questions for a while, I might develop some such simple heuristics by trial and error.
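The street-bet heuristic can be sketched as literal Monte Carlo sampling over imagined scenarios (the scenarios and the weights attached to them are entirely made up for illustration):

```python
# Sketch of the "Monte Carlo over imagined scenarios" heuristic: sample
# scenario-universes according to made-up weights and estimate the
# probability of losing as the fraction of lose-samples.
import random

random.seed(0)
scenarios = [
    # (weight I give this scenario, do I lose the bet in it?)
    (0.70, True),    # his die is loaded
    (0.25, True),    # fair die, but the bet's terms favor him
    (0.05, False),   # fair die, fair bet, I get lucky
]
samples = random.choices(
    [lose for _, lose in scenarios],
    weights=[w for w, _ in scenarios],
    k=10_000,
)
p_lose = sum(samples) / len(samples)
print(round(p_lose, 2))  # ≈ 0.95
```

The Sillpruk case is precisely the one where this fails: there are no scenarios to sample from, and no idea what weights to give them.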

### PS.

I’m sorry if this was too long or just stupid. In the former case I welcome constructive criticism — don’t hesitate to tell me what you think should have been cut. I hereby subject myself to Crocker’s Rules. In the latter case... well, sorry :-)

The set of all possible worlds is a confusing subject, and is related to the so-called "ontology problem" in FAI, where we need to define preference for Friendly AI based on human preference, while human preference may be defined in terms of incomplete and/or incorrect understanding of the world. Furthermore, it doesn't seem possible to rigorously define the collection of all possible worlds, for formal preference to talk about. I currently run a blog sequence on this topic, with two posts so far:

(The concept of *preference* is assumed in these posts; refer to the previous posts as needed.)

It certainly is. That is why the schema above fascinated me when I made it explicit: although the concepts involved are not rigorously defined, the relationships between them (as expressed in the schema) *feel* rigorously correct. (ETA: A bit like noticing that elementary particles parallel a precise mathematical structure, but not yet knowing what particles are and why they do that.)

In a manner of speaking, the schema explicitly moves confusion about “what probability is” to “what possible worlds to consider”. This is implicitly done in many problems of probability — which reference class to pick for the outside view, or how to count possible outcomes — but the schema makes this much clearer. (To me, at least; I published this in case it has a similar effect on others.)

I’m not sure I agree. The schema doesn’t say what collection of universes to use, but I don’t see why you couldn’t just rigorously define one of your choosing and use it. (If I’m missing something here please give me a link.) Note that the one you picked *can* be “wrong” in some sense, and thus the probabilities you obtain may not be helpful, but I don’t see a reason why you can’t do it *even if* it’s wrong.

Interestingly, the schema *does* theoretically provide a potential way of noticing that you picked a bad collection of worlds: if you end up with an empty E (the subset of worlds agreeing with your experiences), then you *certainly* picked badly.

I’m a bit fuzzy about what it means when your experiences “consistently” happen to be improbable (but never impossible) according to your calculation. In Bayesian terms, you *correctly* update on every experience, but your predictions keep being wrong. The schema seems correct, so either you picked a bad collection of possible worlds, or you just happen to be in a world that’s unpredictable (in the sense that even if you pick the best decision procedure possible, you still lose; the world just doesn’t allow you to win). In the latter case it’s unsurprising — you can’t win, so it’s “normal” that you can’t find a winning strategy — but the former case should allow you to “escape” by finding the correct set of worlds.

Note that I intentionally didn’t mention preferences anywhere in the post. (Actually, I meant to make it explicit that they’re orthogonal to the problem; I just forgot.)

The question of preferences seems to me perfectly orthogonal to the schema. That is, if you pick a set of possible worlds but you still can’t define preferences well, then you’re confused about preferences. If, say, you have a class of “possible world sets”, and you *can* rigorously define your preferences within each such set, but you can’t pick which set to use, then you’re confused about your ontology.

In other words, the schema allows the subdivision of the confusion into two separate sources. It only helps in the sense that it transforms a larger problem into two smaller problems; it’s not its fault that they’re still hard.

There’s a related but more subtle point that fascinated me: *even if* you don’t like the schema because it’s not helpful enough, it *still* seems correct. No matter how else you specify your problem, if you do it correctly it will still be a special case of the schema, and you’re still going to have to face it. In a sense, how well you can solve the schema above *is a bound* on how rational you are.

For me it was a bit like finding out that a certain problem is NP-complete: once you know that, you *can* find special cases that are easier to solve but useful, and make do with them; but until your problem-solving is NP-strong, you *know* that you can’t solve the general case. (This doesn’t prevent CS researchers from investigating some properties of those and even harder problems.) ETA: And the reason it was fascinating was that seeing the schema gave me a much clearer notion of *how hard* it is.

If it’s wrong, then it’s not clear what exactly we are doing. If you run out of sample space, there is no way to correct this mistake, because that’s what the sample space means: the options still available.

The problem in choice of sample space for Bayesian updating is the same problem as in finding the formalism for encoding a solution to the ontology problem (a preference that is no longer in danger of being defined in terms of misconceptions).

(And if there is no way to define a correct "sample space", the framework itself is bogus, though the solution seems to be that we shouldn't seek a set of all possible worlds, but rather a set of all possible thoughts...)

As I see it, if it’s wrong, then you’re calculating probabilities for a different “metaverse”. As I said, it seems to me that if you’re wrong then you should be able to notice: if you do something many times, and you calculate each time a probability of 70% for result X, but you never get X, it’s obvious you’re doing things wrong. And since, as far as I can tell, the structure of the calculation is *really* correct, then it must be your premise that’s wrong, i.e. which world-set (and measure) you picked to calculate in. The case where you run out of sample space is just the more obvious one: it’s the world-set you used that’s wrong (because you ran out of worlds before you used the measure).

Again, my interest in this is not that it seems like a good way of calculating probabilities. It’s that it seems like the *only possible* way, hard or impossible as it may be. Part of the reason for publishing this was to (hopefully) find through the comments another “formalism” that (a) seems right, (b) seems general, but (c) isn’t reducible to that schema.


Upvoted for tackling the issue, at least.

This looks like a mistake to me. QM says a lot about E, but there are logically possible universes where quantum mechanics is false. Presumably we want to be able to assign probabilities to the truth of quantum mechanics. Ontological questions in general seem like delineators of E and not of *m*. Thus I’m confused by that as well. Obviously QM and physics generally are entangled with information theory and probability in all kinds of important ways, and hopefully an eventual theory of physics will clarify all this. For the time being, it seems more worthwhile to describe the mental operation of assigning subjective probabilities (and thus understanding possible worlds in this sense) and avoid conflating rationality with physics.

Unless you have something else motivating that part of the discussion that I'm not familiar with.

Oh, another thing: you *can* do approximations, as long as you’re aware that you’re doing it. So, upon noticing that your world looks like QM, you can try to just do all your calculations with a QM world. That is, you say that your observations of QM *exclude* non-QM worlds from E, even though that’s not strictly true (it may be just a simulation, for example). If your approximation was a bad one, you’ll notice.

AFAICT, *all* discussions about probability are just such approximations. When you explain the solution to a balls-in-urns problem by saying “look, there are X possible outcomes, N red and P blue”, what you’re saying is “there are X possible universes; in N of them the ball is red and in P blue”, and then you proceed to do exactly the “integral” above, with each universe having equal measure. If the problem says the balls’ colors are monochromatic, of wavelength uniformly distributed in the spectrum, but the question uses human color words, you might end up with a measure based on the perceptual width of the “colors” in wavelength terms. Even though the problem describes urns and balls and, presumably, you picking things up via EM interactions of your and the balls’ atoms, you approximate by saying “that’s just irrelevant detail”. In terms of the schema, this just means that “other details will add the same factor to the cardinalities of each color class”.

I misread the computability bit, it makes sense.

I’m still confused about what you’re trying to say about QM, though. I start with *m*. Then Omega tells me “QM is true”. Now I have E, which is the set of worlds in which QM is true. But you’re saying QM being true affects *m*(*u*), and that’s what I don’t grok.

I think it was a mistake to define *m* on U (the set of possible worlds, before restricting it to E); it can work even if you do it like that, but it’s less intuitive.

Try this: suppose your U can be partitioned into two classes of universes: those in Q are quantum, and those in K are not. That is, U is the union of Q and K, and the intersection of Q and K is empty.

A reasonable strategy for defining *m* (in the sense that that’s what I was aiming for) could go like this: You define *m*(*u*) as a product of *g*(*u*) with *s*(*u*). *s* is a measure specific to the class of universes *u* is part of. For (“logically possible”) universes that come from the Q class, *s* might depend on the amplitude of *u*’s wave-function. For those that come from K, it’s a completely different function, e.g. the Kolmogorov complexity of the bit-string defining *u*. Note that the two “branches” of *s* are completely independent: the wave-function doesn’t make *any* sense for K, nor does Kolmogorov complexity for those in Q (supposing that Q implies *exact* real-valued functions).

The *g*(*u*) part of the measure reflects everything that’s not related to this particular partitioning of U. It may be just a constant 1, or it may be a complex function based on other possible partitionings (e.g. finite/infinite universes).

The important part is that your *m* includes *from the start* a term for the QM-related measure of universes, but *only* for those where it makes sense. When Omega comes, you just remove K from U, so your E is a subset of Q. As a result, the rest of the calculation is *never* dependent on the *s* branch for K, and *always* depends on the *s* branch for Q. The effect is not that Omega changed the measure; it just made part of it irrelevant to the rest of the calculation.

As I said in the first sentence of this answer, you can just define *m* only on E. But E is much more complex than U (it depends on every experience you had, thus it’s harder to specify, even if it’s “smaller”), so it’s harder to define a function *only* for its values. Conceptually it’s easier to pick a somewhat vague *m* on U, and then “fill in details” as your E becomes more exact. But, to be clear, this is just because we’re handwaving without actually being able to do the calculations; since the calculations seem impossible in the general case, it’s kind of a moot point which is “really” easier.

My motivation for those comments was observing that you don’t need the measure except for worlds in E, those that are compatible with observation.

Say you have in the original “possible worlds” both some that are computable (e.g. Turing machine outputs) and some that are not (continuous, real-valued space and time coordinates). Now, suppose it’s possible to distinguish between those two, and one of your observations eliminates one possibility (e.g., it’s certainly not computable). Then you can certainly use a measure that only makes sense for the remaining class (in this example, the non-computable worlds).

There might be other tricks. Suppose again, as above, that you have two classes of possible worlds, and you have an idea how to assign measures within each class but not for the whole “possible worlds” set. Now, if you do the rest of the calculations in both classes, you’ll obtain two “probabilities”, one for each class of possible worlds. If the two agree on a value (within “measurement error”), you’re set: use that. If they don’t, then it means you have a way to test between those two classes of worlds.
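The two-branch measure described above can be sketched directly (the universe representations and both *s* branches are hypothetical stand-ins for the hand-waving):

```python
# Sketch of the two-branch measure m(u) = g(u) * s(u): s is
# class-specific (Q vs K), g covers everything else.

def m(u, g, s_quantum, s_classical):
    s = s_quantum(u) if u["class"] == "Q" else s_classical(u)
    return g(u) * s

g = lambda u: 1.0                              # nothing else to weigh
s_quantum = lambda u: u["amplitude"] ** 2      # QM-flavored branch
s_classical = lambda u: float(len(u["bits"]))  # complexity-flavored branch

u_q = {"class": "Q", "amplitude": 0.5}
u_k = {"class": "K", "bits": b"0101"}
print(m(u_q, g, s_quantum, s_classical))  # 0.25
print(m(u_k, g, s_quantum, s_classical))  # 4.0
```

Removing K from U then simply means `s_classical` is never called; nothing about the measure itself changes.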

I certainly didn’t intend the schema as an actual *algorithm*. I’m not sure what your comment about subjective probabilities means. (I can parse it both as “this is a potentially useful model of what brains ‘have in mind’ when thinking of probabilities, but not really useful for computation”, and as “this is not a useful model in general; try to come up with one based on how brains assign probability”.)

What is interesting for me is that I couldn’t find *any* model of probability that doesn’t match that schema after formalizing it. So I conjectured that *any* formalization of probability, if general enough* to apply to real life, will be an instance of it. Of course, it may just be that I’m not imaginative enough, or maybe I’m just the guy with a new hammer seeing nails everywhere.

(*: by this, I mean not just claiming, say, that coins fall with equal probability on each side. You can do a lot of probability calculation with that, but it’s not useful at all for determining which alien team wins the game, what face a die falls on, or even how real coins work.)

I’m really curious to see a model of probability that seems reasonable and general, and that I *can’t* reduce to the shape above.