Musings on probability

The set of all possible worlds is a confusing subject, and is related to the so-called "ontology problem" in FAI, where we need to define preference for Friendly AI based on human preference, while human preference may be defined in terms of incomplete and/or incorrect understanding of the world. Furthermore, it doesn't seem possible to rigorously define the collection of all possible worlds, for formal preference to talk about. I currently run a blog sequence on this topic, with two posts so far:

(The concept of *preference* is assumed in these posts, refer to the previous posts as needed.)

> The set of all possible worlds is a confusing subject,

It certainly is. That is why the schema above fascinated me when I made it explicit: although the concepts involved are not rigorously defined, the relationships between them (as expressed in the schema) *feel* rigorously correct. (ETA: A bit like noticing that elementary particles parallel a precise mathematical structure, but not yet knowing what particles are and why they do that.)

In a manner of speaking, the schema explicitly moves confusion about “what probability is” to “what possible worlds to consider”. This is done implicitly in many probability problems (which reference class to pick for the outside view, or how to count possible outcomes), but the schema makes it much clearer. (To me, at least; I published this in case it has a similar effect on others.)

> it doesn't seem possible to rigorously define the collection of all possible worlds, for formal preference to talk about

I’m not sure I agree. The schema doesn’t say what collection of universes to use, but I don’t see why you couldn’t just rigorously define one of your choosing and use it. (If I’m missing something here please give me a link.) Note that the one you picked *can* be “wrong” in some sense, and thus the probabilities you obtain may not be helpful, but I don’t see a reason why you can’t do it *even if it’s wrong*.

Interestingly, the schema *does* theoretically provide a potential way of noticing you picked a bad collection of worlds: if you end up with an empty E (the subset of worlds agreeing with your experiences), then you *certainly* picked badly.

I’m a bit fuzzy about what it means when your experiences “consistently” happen to be improbable (but never impossible) according to your calculation. In Bayesian terms, you *correctly* update on every experience, but your predictions keep being wrong. The schema seems correct, so either you picked a bad collection of possible worlds, or you just happen to be in a world that’s just unpredictable (in the sense that even if you pick the best decision procedure possible, you still lose; the world just doesn’t allow you to win). In the latter case it’s unsurprising (you can’t win, so it’s “normal” that you can’t find a winning strategy), but the former case should allow you to “escape” by finding the correct set of worlds.

> is related to the so-called "ontology problem" in FAI, where we need to define preference for Friendly AI based on human preference, while human preference may be defined in terms of incomplete and/or incorrect understanding of the world

Note that I intentionally didn’t mention preferences anywhere in the post. (Actually, I meant to make it explicit that they’re orthogonal to the problem, I just forgot.)

The question of preferences seems to me perfectly orthogonal to the schema. That is, if you pick a set of possible worlds, but you still can’t define preferences well, then you’re confused about preferences. If, say, you have a class of “possible world sets”, and you *can* rigorously define your preferences within each such set, but you can’t pick which set to use, then you’re confused about your ontology.

In other words, the schema allows the subdivision of confusion into two separate sources of confusion. It only helps in the sense that it transforms a larger problem into two smaller problems; it’s not its fault that they’re still hard.

There’s a related but more subtle point that fascinated me: *even if* you don’t like the schema because it’s not helpful enough, it *still* seems correct. No matter how else you specify your problem, if you do it correctly it will still be a special case of the schema, and you’re still going to have to face it. In a sense, how well you can solve the schema above *is a bound* on how rational you are.

For me it was a bit like finding out that a certain problem is NP-complete: once you know that, you *can* find special cases that are easier to solve but useful, and make do with them; but until your problem-solving is NP-strong, you *know* that you can’t solve the general case. (This doesn’t prevent CS researchers from investigating some properties of those and even harder problems.)

ETA: And the reason it was fascinating was that seeing the schema gave me a much clearer notion of *how hard* it is.

> I don’t see a reason why you can’t do it even if it’s wrong.

If it's wrong, then it's not clear what exactly we are doing. If you run out of sample space, there is no way to correct this mistake, because that's what the sample space means: the options still available.

The problem of choosing a sample space for Bayesian updating is the same problem as finding the formalism for encoding a solution to the ontology problem (a preference that is no longer in danger of being defined in terms of misconceptions).

(And if there is no way to define a correct "sample space", the framework itself is bogus, though the solution seems to be that we shouldn't seek a set of all possible worlds, but rather a set of all possible thoughts...)

> I don’t see a reason why you can’t do it even if it’s wrong.
>
> If it's wrong, then it's not clear what exactly we are doing. If you run out of sample space, there is no way to correct this mistake, because that's what the sample space means: the options still available.

As I see it, if it’s wrong, then you’re calculating probabilities for a different “metaverse”. As I said, it seems to me that if you’re wrong then *you should be able to notice*: if you do something many times, and each time you calculate a 70% probability of result X, but you never get that X, it’s obvious you’re doing things wrong. And since, *as far as I can tell*, the structure of the calculation is *really* correct, it must be your premise that’s wrong, i.e. which world-set (and measure) you picked to calculate in.
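
To put a rough number on “you should be able to notice”, here is a minimal sketch. It assumes the repeated attempts are independent, which the comment does not state, and the function name `prob_never_observed` is made up for this illustration.

```python
def prob_never_observed(p_predicted: float, trials: int) -> float:
    """Chance that an outcome predicted with probability p_predicted
    fails to occur in every one of `trials` attempts.

    Assumption (not from the discussion): the attempts are independent.
    """
    if not 0.0 <= p_predicted <= 1.0:
        raise ValueError("p_predicted must be a probability")
    return (1.0 - p_predicted) ** trials


# A 70% prediction that fails ten times in a row.
surprise = prob_never_observed(0.7, 10)
print(f"{surprise:.2e}")
```

With a 70% prediction and ten straight misses the value is 0.3 to the tenth power, roughly six in a million, so under the chosen world-set and measure the observed history is itself very improbable. That is the signal that the premise, not the arithmetic, deserves suspicion.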

The case where you run out of space is just the more obvious one: it’s the world-set you used that’s wrong (because you ran out of worlds before you used the measure).

Again, my interest in this is not that it seems like a good way of calculating probabilities. It’s that it seems like the *only possible way*, hard or impossible as it may be.

Part of the reason for publishing this was to (hopefully) find through the comments another “formalism” that (a) seems right, (b) seems general, but (c) isn’t reducible to that schema.

Upvoted for tackling the issue, at least.

> If your ontology implies quantum mechanics then I think the measure of the universes (m(u) in step 1) must involve wave functions somehow, but my understanding of QM doesn’t allow me to think it through much.

This looks like a mistake to me. QM says a lot about E but there are logically possible universes where quantum mechanics is false. Presumably we want to be able to assign probabilities to the truth of quantum mechanics. Ontological questions in general seem like delineators of E and not *m*. Thus I'm confused by

> If your ontology implies a computable universe (thus you only need to consider those in E)

as well. Obviously QM and physics generally are entangled with information theory and probability in all kinds of important ways, and hopefully an eventual theory of physics will clarify all this. For the time being, it seems more worthwhile to describe the mental operation of assigning subjective probabilities (and thus understanding possible worlds in this sense) and avoid conflating rationality with physics.

Unless you have something else motivating that part of the discussion that I'm not familiar with.

Oh, another thing: you *can* do approximations, as long as you’re aware that you’re doing it. So, upon noticing that your world looks like QM, you can try to just do all your calculations with a QM world. That is, you say that your observations of QM *exclude* non-QM worlds from E, even though it’s not strictly true (it may be just a simulation, for example).

If your approximation was a bad one you’ll notice. AFAICT, *all* the discussions about probability are just such approximations. When you explain the solution to a balls-in-urns problem by saying “look, there are X possible outcomes, N red and P blue”, what you’re saying is “there are X possible universes; in N of them the ball is red and in P it is blue”, then you proceed to do exactly the “integral” above, with each universe having equal measure. If the problem says the balls’ colors are monochromatic, of wavelength uniformly distributed in the spectrum, but the question uses human color words, you might end up with a measure based on the perceptual width of “colors” in wavelength terms.

Even though the problem describes urns and balls and, presumably, you picking things up via EM interactions of your and the balls’ atoms, you approximate by saying “that’s just irrelevant detail”. In the terms of the schema, this just means that “other details will add the same factor to the cardinalities of each color class”.
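
A small sketch of that last point, under the assumption that every ball comes with the same number of “irrelevant detail” variants and that all universes have equal measure. The function `p_red` and its parameters are invented for this illustration.

```python
from fractions import Fraction
from itertools import product


def p_red(n_red: int, n_blue: int, n_details: int = 1) -> Fraction:
    """Probability of drawing red, counted over equally weighted universes.

    Each universe is a (drawn ball colour, irrelevant detail) pair. The
    detail stands in for everything the urn problem ignores, and it is
    assumed to multiply every colour class by the same factor.
    """
    balls = ["red"] * n_red + ["blue"] * n_blue
    universes = list(product(balls, range(n_details)))
    red_universes = [u for u in universes if u[0] == "red"]
    return Fraction(len(red_universes), len(universes))


# The detail factor cancels out of the ratio.
assert p_red(3, 5) == p_red(3, 5, n_details=1000)
```

The common factor multiplies both the count of red universes and the total, so the ratio is unchanged, which is why dropping the detail is a safe approximation here. It stops being safe as soon as the detail is correlated with the colour.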

I misread the computability bit, it makes sense.

I'm still confused about what you're trying to say about QM, though. I start with *m*. Then Omega tells me "QM is true". Now, I have E which is the set of worlds in which QM is true. But you're saying QM being true affects *m*(u) and that's what I don't grok.

I think it was a mistake to define *m* on U (the set of possible worlds, before restricting it to E); it can work even if you do it like that, but it’s less intuitive.

Try this: suppose your U can be partitioned into two classes of universes: those in Q are quantum, and those in K are not. That is, U is the union of Q and K, and the intersection of Q and K is empty.

A reasonable strategy for defining *m* (in the sense that that’s what I was aiming for) could go like this:

You define *m*(*u*) as a product of *g*(*u*) with *s*(*u*).

*s* is a measure specific for the class of universes *u* is part of. For (“logically possible”) universes that come from the Q class, *s* might depend on the amplitude of *u*’s wave-function. For those that come from K, it’s a completely different function, e.g. the Kolmogorov complexity of the bit-string defining *u*. Note that the two “branches” of *s* are completely independent: the wave-function doesn’t make *any* sense for K, nor does Kolmogorov complexity for those in Q (supposing that Q implies *exact* real-valued functions).

The *g*(*u*) part of the measure reflects everything that’s not related to this particular partitioning of U. It may be just a constant 1, or it may be a complex function based on other possible partitionings (e.g. finite/infinite universes).

The important part is that your *m* includes *from the start* a term for QM-related measure of universes, but *only* for those where it makes sense.

When Omega comes, you just remove K from U, so your E is a subset of Q. As a result, the rest of the calculation *never* depends on the *s* branch for K, and *always* depends on the *s* branch for Q. The effect is not that Omega changed *m*; it just made part of it irrelevant to the rest of the calculation.
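
Here is a toy sketch of that construction. Everything concrete in it is an assumption made for illustration: the `Universe` record, the integer `weight` standing in for a wave-function-based term, and the description length standing in for Kolmogorov complexity (which cannot be computed in general).

```python
from dataclasses import dataclass
from fractions import Fraction


@dataclass(frozen=True)
class Universe:
    kind: str         # "Q" (quantum) or "K" (not quantum)
    description: str  # toy stand-in for a full specification
    weight: int = 1   # toy number used only by the Q branch of s


calls = {"Q": 0, "K": 0}  # how often each branch of s gets evaluated


def s(u: Universe) -> int:
    calls[u.kind] += 1
    if u.kind == "Q":
        return u.weight          # hypothetical wave-function-based term
    return len(u.description)    # crude stand-in for Kolmogorov complexity


def g(u: Universe) -> int:
    return 1  # everything unrelated to the Q/K partition; constant here


def m(u: Universe) -> int:
    return g(u) * s(u)


def p_over(E):
    """p(u) inversely proportional to m(u), normalised to sum to 1 over E."""
    inverse = {u: Fraction(1, m(u)) for u in E}
    total = sum(inverse.values())
    return {u: w / total for u, w in inverse.items()}


U = [
    Universe("Q", "q1", 1),
    Universe("Q", "q2", 3),
    Universe("K", "0110"),
    Universe("K", "01"),
]
E = [u for u in U if u.kind == "Q"]  # Omega says "QM is true": drop K
p = p_over(E)
```

After the restriction, `calls["K"]` is still zero: *m* was defined on all of U from the start, but the K branch never enters the calculation.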

As I said in the first sentence of this answer, you can just define *m* only on E. But E is much more complex than U (it depends on every experience you had, thus it’s harder to specify, even if it’s “smaller”), so it’s harder to define a function *only* for its values.

Conceptually it’s easier to pick a somewhat vague *m* on U, and then “fill in details” as your E becomes more exact. But, to be clear, this is just because we’re handwaving about without actually being able to do the calculations; since the calculations seem impossible in the general case it’s kind of a moot point which is “really” easier.

My motivation for those comments was observing that you don’t need the measure except for worlds in E, those that are compatible with observation.

Say you have in the original “possible worlds” both some that are computable (e.g. Turing machine outputs) and some that are not (continuous, real-valued space and time coordinates). Now, suppose it’s possible to distinguish between those two, and one of your observations eliminates one possibility (e.g., the non-computable one). Then you can certainly use a measure that only makes sense for the remaining one (computable things).

There might be other tricks. Suppose again, as above, you have two classes of possible worlds, and you have an idea how to assign measures within each class but not for the whole “possible worlds” set. Now, if you do the rest of the calculations in both classes, you’ll obtain two “probabilities”, one for each class of possible worlds. If the two agree on a value (within “measurement error”), you’re set, use that. If they don’t, then it means you have a way to test those two worlds.
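
A sketch of that trick for finite classes. The helper names, the tolerance, and the choice to report the midpoint when the classes agree are all assumptions made for illustration; the comment only says to “use that” value.

```python
from fractions import Fraction


def class_probability(worlds, measure, x_holds) -> Fraction:
    """Probability of X inside one class, with p(w) proportional to 1/measure(w)."""
    inverse = [(w, Fraction(1) / Fraction(measure(w))) for w in worlds]
    total = sum(v for _, v in inverse)
    return sum(v for w, v in inverse if x_holds(w)) / total


def reconcile(p_a: Fraction, p_b: Fraction, tolerance=Fraction(1, 100)):
    """Return a shared estimate if the two classes agree, else None.

    None means X is an experiment that can tell the two classes apart.
    """
    if abs(p_a - p_b) <= tolerance:
        return (p_a + p_b) / 2  # arbitrary choice: midpoint of the two values
    return None


# Two toy classes with their own (here uniform) measures.
p_a = class_probability(["a1", "a2", "a3"], lambda w: 1, lambda w: w == "a1")
p_b = class_probability(range(1, 7), lambda w: 1, lambda w: w <= 2)
estimate = reconcile(p_a, p_b)
```

In this toy case both classes give one third, so no measure across the two classes is needed. When `reconcile` returns `None`, observing X or not-X is evidence about which class you are in.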

> For the time being, it seems more worthwhile to describe the mental operation of assigning subjective probabilities (and thus understanding possible worlds in this sense) and avoid conflating rationality with physics.

I certainly didn’t intend the schema as an actual *algorithm*. I’m not sure what your comment about *subjective* probabilities means. (I can parse it both as “*this is a potentially useful model of what brains ‘have in mind’ when thinking of probabilities, but not really useful for computation*”, and as “*this is not a useful model in general, try to come up with one based on how brains assign probability*”.)

What is interesting for me is that *I couldn’t* find *any* model of probability that doesn’t match that schema after formalizing it. So I conjectured that *any* formalization of probability, if general enough* to apply to real life, will be an instance of it. Of course, it may just be that I’m not imaginative enough, or maybe I’m just the guy with a new hammer seeing nails everywhere.

(*: by this, I mean not just claiming, say, that coins fall with equal probability on each side. You can do a lot of probability calculation with that, but it’s not useful at all for which alien team wins the game, what face a die lands on, or even how real coins work.)

I’m really curious to see a model of probability that seems reasonable and general, and that I *can’t* reduce to the shape above.

I read this comment, and after a bit of rambling I realized I was as confused as the poster. A bit more thinking later I ended up with the “definition” of probability under the next heading. It’s not anything groundbreaking, just a distillation (specifically, mine) of things discussed here over time. It’s just what my brain thinks when I hear the word.

But I was surprised and intrigued when I actually put it in writing and read it back and thought about it. I don’t remember seeing it stated like that (but I probably read some similar things).

It probably won’t teach anyone anything, but it might trigger a similar “distillation” of “mind pictures” in others, and I’m curious to see that.

## What “probability” is...

Or, more exactly, what is the answer to “what’s the probability of X?”

Well, I don’t actually know, and it probably depends on who asks. But here’s the skeleton of the answer procedure:

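1. Take the set U of all (logically possible) universes, and assign to each universe *u* a real value *m*(*u*) (see below).
2. Eliminate from U the universes that are inconsistent with your experiences; call the remaining set E.
3. Construct T, the subset of E in which X is true.
4. Assign to each universe *u* in set E a value *p*, such that *p*(*u*) is inversely proportional to *m*(*u*), and the integral of *p* over set E is 1.
5. Calculate the integral of *p* over the set T. The result is called “the probability of X”, and is the answer to the question.

I’m aware that this isn’t quite a definition; in fact, it leaves more unsaid (undefined) than it explains. Nevertheless, to me it seems that the *structure* itself is *right*: people might have different interpretations for the details (and, like me, be uncertain about them), but those differences would *still* be mentally structured like above.

In the next section I explain a bit where each piece comes from and what it means, and in the one after I’m going to ramble a bit.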
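
For a finite collection of universes the schema can be written down directly. This is only a sketch under that finiteness assumption (the “integral” becomes a sum); the function name `probability` and its parameters are invented here, and the general case is not claimed to be computable.

```python
from fractions import Fraction


def probability(universes, m, consistent, x_holds) -> Fraction:
    """Finite-set sketch of the schema.

    universes  -- the candidate universes (the set U)
    m          -- gives each universe a positive measure
    consistent -- predicate: is this universe compatible with experience?
    x_holds    -- predicate: is X true in this universe?
    """
    # Keep only the universes consistent with experience: this is E.
    E = [u for u in universes if consistent(u)]
    if not E:
        raise ValueError("E is empty: the chosen collection of worlds is wrong")
    # p(u) is inversely proportional to m(u); dividing by the total
    # normalises it so that p sums to 1 over E.
    inverse = [(u, Fraction(1) / Fraction(m(u))) for u in E]
    total = sum(w for _, w in inverse)
    # Sum p over T, the part of E where X holds.
    return sum(w for u, w in inverse if x_holds(u)) / total


# Toy use: six "universes", one per face of a die, all with measure 1.
faces = range(1, 7)
prior = probability(faces, lambda u: 1, lambda u: True, lambda u: u == 6)
# New experience ("the face is even") shrinks E; the rest is unchanged.
updated = probability(faces, lambda u: 1, lambda u: u % 2 == 0, lambda u: u == 6)
```

Here `prior` is 1/6 and `updated` is 1/3. Exact fractions are used so that the normalisation over E can be checked without rounding, and the empty-E error corresponds to the case where the chosen collection of worlds cannot account for your experiences at all.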

## Clarifications

About (logically possible) **universes**: We don’t actually know what our universe *is*; as such, “other possible universes” isn’t quite a well-defined concept. For generality, the only constraint I put above is that they be logically possible, for the only reason that the description is (vaguely) mathematical and I don’t have any idea what math without logic means. (I might be missing something, though.)

Note that by “universe” I really mean an entire universe, not just “until now”. E.g., if it so happens your experiences allow for a single possible past (i.e., you *know* the entire history), but your universe is not deterministic, there are still many universes in E (one for each possible future); if it’s deterministic, then E contains just one universe. (And your calculations are a lot easier...)

Before you get too scared or excited by the concept of “all possible universes” remember that not all of them are actually used in the rest of the procedure. We actually need only those *consistent with experience*. That’s still a lot when you think about it, but my mind seems to reel in panic more often when I forget this point. (Lest this note makes you too comfortable, I must also mention that the possibility that experience is (even partly) simulated explodes the size of E.)

About that **real value** I was talking about: “*m*” comes from “measure”, but that’s a consequence of how I arrived at the schema above. Even now I’m not quite sure it belongs there, because it depends on what you think “possible universes” means. If you just set it to 1 for all universes, everything works.

But, for example, you might consider that the set U is countable, encoding them all as numbers using a well-defined rule, and use the Kolmogorov complexity of the bit-string encoding a universe for that universe’s measure. (Given step 4 above, this would mean that you think simpler universes are more probable; except it doesn’t quite mean that, because “probable” is defined only after you picked your “*m*”. It’s probably closer to “things that happen in simpler universes are more probable”; more in the ramblings section.)

A bit about the **math**: I used some math terms a bit loosely in the schema above. Depending on exactly what you mean by “possible universes”, the set of them might be finite, countably infinite, not countable, or might be a proper class rather than a set. Depending on that, “integrating” might become a different operation. If you can’t (mathematically, not physically) do such an operation on your collection of possible universes (actually, on those in E) then you have to define your own concept of probability :-P

With regards to **computability**, note that the series of steps above is *not* an algorithm, it’s just the definition. It doesn’t feel intuitive to me that there is any possible universe where you can actually follow the steps above, but math surprises me in that regard sometimes. But note that you don’t really need *p*(X): you just need a good-enough approximation, and you’re free to use any trick you want.

## Musings

If the above didn’t interest you, the rest probably won’t, either. I’ve put here the most interesting consequences of the schema above. It’s kind of rambling, and I apologize; as in the last section, I’ll **bold** keywords, so you might just skim it for paragraphs that might interest you.

I found it interesting (but not surprising) to note that **Bayesian** statistics correspond well to the schema above. As far as I can tell, the Bayesian *prior* for (any) X is the number assigned in step 5; Bayesian updating is just going back to step 2 whenever you have new experiences. The interesting part is that my description *smells* frequentist. I wasn’t that surprised because the main difference (in my head) between the two is the use of priors; frequentist statistics ignore prior knowledge. If you just do frequentist statistics on *every possible event in every possible universe* (for some value of possible), then there is no “prior knowledge” left to ignore.

The schema above describes only true/false–type problems. For **non-binary problems** you just split E in step 3 into several subsets, one for each possible answer. If the problem is **real-valued** you need to split E into an uncountably infinite number of sets, but I’ve abused set theory terms enough today that I’m not very concerned. Anyway, in practice (in our universe) it’s usually enough to just split the domain of the value into countably many intervals, according to the precision you need, and split the universes in E according to which interval they fall in. That is, you don’t actually need to know the probability that a value is, say, sqrt(2), just that it’s closer to sqrt(2) than you can measure it.

With regard to past discussions about a **rationale for rationality**, observe that it’s possible to apply the procedure above to evaluate what is the “rational way”, supposing you define it by “the rational guy *plays to win*”: instead of step 3, generate the set of decision procedures that are applicable in all E, call it D; for each *d* in D, split E into universes where you win and those where you lose (don't win), and call these W(*d*) and L(*d*); instead of step 4, for each decision procedure *d*, calculate the “winningness” of *d* as the integral of *p* over W(*d*) divided by the integral over L(*d*) (with *p* defined like above); instead of step 5, pick a decision *d*₀ such that its “winningness” is maximal (no other has a larger value).

Note that I’ve no idea if doing this actually picks the decision procedure above, nor what exactly it would mean if it doesn’t... Of course, if it does, it’s still circular, like any “reason for reasoning”. The procedure might also give different results for people with different E. I found it interesting to contemplate that it might be “possible” for someone in another universe (one much friendlier to applied calculus than ours) to calculate *exactly* the solution of the procedure for *my* E, but at the same time for the best procedure for approximating it in *my* universe to give a different answer. They can’t, of course, communicate this to me (since then they’re *not* in a different universe in the sense used above).

If your ontology implies a **computable universe** (thus you only need to consider those in E), you might want to use **Kolmogorov complexity** as a measure for the universes. I’ve no idea which encoding you should use to calculate it; there are theorems that say the difference between two encodings is *bounded* by a constant, but I don't see why certain encodings can't be biased to have systematic effects on your probability calculations. (Other than “it's kind of a long shot”.) You might use the above procedure for deciding on decision procedures, of course :-P

There’s also a theorem that says you can’t actually make a program to compute the KC of an arbitrary bit-string. There might be a universe–to–bit-string encoding that generates only bit-strings for which there is such a program, but that’s also kind of a long shot.

If your ontology implies **quantum mechanics** then I *think* the measure of the universes (*m*(*u*) in step 1) must involve wave functions somehow, but my understanding of QM doesn’t allow me to think it through much.

The schema above illuminated a bit something that puzzled me in that comment I was talking about at the beginning: say you are suddenly sent to the planet Progsta and a Sillpruk comes and asks you whether the game of Doldun will be won by the team Strigli; what’s your prior for the answer? What puzzled me was that *the very fact that you were asked that question* communicates an enormous amount of information (see this comment of mine for examples), and yet I couldn’t actually see how that should affect my priors. Of course, the information content of the question hugely restricts the universes in my E. But there were so many there that it’s still huge; more importantly, it *restricts the universes along boundaries that I’ve not previously explored*, and I don’t have ready heuristics to estimate that little *p* above:

If I throw a (fair) die, I can split the universes into six approximately equal parts on vague symmetry justifications, and just estimate the probability of each side as 1/6. If someone on the street asks me to bet him on *his* die, I can split the universes into those where I win and those where I lose and estimate (using a kind of Monte Carlo integration with various scenarios I can think of) that I’ll probably lose. If I encounter an alien named Sillpruk I’ve no idea how to split the universes to estimate the result of a Doldun match. But if I were to encounter lots of aliens with strange first-questions for a while, I might develop some such simple heuristics based on simple trial and error.

## PS.

I’m sorry if this was too long or just stupid. In the former case I welcome constructive criticism — don’t hesitate to tell me what you think should have been cut. I hereby subject myself to Crocker’s Rules. In the latter case... well, sorry :-)