
I often like to think of my epistemic probability assignments in terms of probabilities-of-probabilities, or meta-probabilities. In other words, what probability would I assign that my probability estimate is accurate? Am I very confident, am I only mildly confident, or do I only have a vague clue?

I often think of it as a sort of bell curve, with the x-axis being possible probability estimates and the y-axis being my confidence in those estimates. So if I have very low confidence in my estimate then the height of the bell will be very low, and if I have high confidence it'll be pretty high.

Here are a few issues and insights that have come up when discussing or thinking about this:

What would a meta-probability actually mean?

There's two ways I have for thinking about it:

2) I know that I'm not even close to being an ideal Bayesian agent, and that my best shots at a probability estimate are fuzzy, imprecise, and likely mistaken anyway. The meta-probability is my prediction for what an ideal Bayesian agent would assign as the probability for the question at hand.

What's the point?

Primarily it's just useful for conveying how sure I am of the probability estimate I'm assigning. It's a way of conveying that a coin flip is 50% heads in a very different sense than me saying "I have not the slightest clue whether it'll rain tomorrow on the other side of the world, and if I need to bet on it I'd give it ~50% odds". I've seen other people convey related sentiments by saying things like, "well 90% is probably too low an estimate, and 99% is probably too high, so somewhere between those". I'd just view the 90% and 99% figures as maybe 95% confidence bounds on a bell curve.

Why not keep going and say how confident you are about your confidence estimates?

True, I could do this, and I sometimes will do this if needed by visualizing a bit of fuzziness in my bell curve. But in any case it's usually enough for my purposes.

Is there any use for such a view in terms of instrumental or utilitarian calculations?

Not sure. I've seen some relevant discussion by Scott Alexander and Holden Karnofsky, but I'm not sure I followed everything there. I also suspect that if you view it as a prediction of how your views might change if you learned more about the subject, then this might imply that it's useful in deciding how much time to invest in further research.

Thoughts?

[Note 1: I discussed this topic about a year ago on LessWrong, and got some insightful responses then. Some commenters disagreed with me then and I'll predict that they'll do so again here - I'd give it, oh, say an 80% chance, moderate confidence ;).]

[Note 2: If you could try to avoid complicated math in your responses that would be appreciated. I'm still on the precalculus level here.]

[Note 3: As I finished writing this I dug up some interesting LessWrong posts on the subject, with links to yet more relevant posts.]

Isn't this just being Bayesian? Bayesians assign distributions to model parameters, and update those. See e.g. here:

http://research.microsoft.com/en-us/um/people/heckerman/tutorial.pdf

If I remember correctly, Jaynes discusses this in Probability Theory and arrives at the conclusion that if a reasoning robot assigned a probability to its changing its mind a certain way, then it should update its belief now. Of course, the general caveat applies here: humans are not robots, and they don't perfectly adhere to either formal logic or plausible reasoning.

If I remember correctly, Jaynes discusses this in Probability Theory

He does; it's in the chapter about the Ap distribution, which is basically a meta-probability. Or better, the Ap distribution is the probability assigned to receiving future evidence that will put the probability of A at p. Formally, P(A|Ap) = p.
From this you can show that P(A) is the expected value of the Ap distribution.
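A minimal sketch of that last claim, assuming a made-up discrete Ap distribution (the three candidate values and their weights are hypothetical, not taken from Jaynes):

```python
# Hypothetical discrete Ap distribution: each key is a candidate value p
# for the probability of A, each value is the weight assigned to Ap.
ap_distribution = {0.2: 0.25, 0.5: 0.50, 0.8: 0.25}


def marginal_probability(ap):
    """Return P(A) = sum over p of P(A|Ap) * P(Ap), using P(A|Ap) = p."""
    total_weight = sum(ap.values())
    if abs(total_weight - 1.0) > 1e-9:
        raise ValueError("Ap weights must sum to 1")
    return sum(p * weight for p, weight in ap.items())


p_a = marginal_probability(ap_distribution)
print(f"P(A) = {p_a:.3f}")
```

The spread of the weights does not show up in P(A) itself; it only matters once new evidence arrives and the weights are updated.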

The Chapter is "Inner and Outer Robots", available here:

http://www-biba.inrialpes.fr/Jaynes/cc18i.pdf

The outer robot, thinking about the real world, uses Aristotelian propositions referring to that world. The inner robot, thinking about the activities of the outer robot, uses propositions that are not Aristotelian in reference to the outer world; but they are still Aristotelian in its context, in reference to the thinking of the outer robot; so of course the same rules of probability theory will apply to them. The term `probability of a probability' misses the point, since the two probabilities are at different levels.

This always seemed like a real promising idea to me. Alas, I have a day job, and it isn't as a Prof.

Ideally, your current probability should include the probability-weighted average of all possible future evidence. This is required for consistency of probability across those evidence-producing timelines. Collectively, the set of probabilities of future experiences is your prior.

But this article isn't talking about belief or decision-making, it's talking about communication (and perhaps encoding in a limited storage mechanism like a brain). You really don't have the power to do that calculation well, nor to communicate in this level of detail. The idea of a probability range or probability curve is one reasonable (IMO) way to summarize a large set of partly-correlated future evidence.

You do not need a probability distribution on your probability distribution to represent uncertainty. The uncertainty is captured by the spread (variance) of your prior. I think you are confusing the map and the map of the map.

First, I think you should think about whether the thing you are interested in knowing the truth about is a true/false proposition or something that can have more than two possible values.

Let's imagine you want to know the true value of a number X between negative and positive infinity. Scientist 1 tells you "My prior is represented by a normal distribution with mean 0 and standard deviation 1". Scientist 2 says the same thing, except his standard deviation is 10.

These two scientists have the same belief about the most likely value of X, but they have different certainties. This difference will be reflected in how they respond to data: Scientist 2 will always adjust his beliefs more in response to any new evidence. The point is that you are able to reflect the uncertainty of the beliefs in the prior itself.
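A rough sketch of the two scientists' updates, assuming a single observation with known Gaussian measurement noise (the observation of 5.0 and the noise standard deviation of 1.0 are hypothetical values chosen only for illustration):

```python
def update_normal_prior(prior_mean, prior_sd, observation, noise_sd):
    """Conjugate update of a normal prior given one noisy observation.

    Assumes the observation is the true value plus Gaussian noise
    with a known standard deviation (an assumption of this sketch).
    """
    prior_precision = 1.0 / prior_sd ** 2
    data_precision = 1.0 / noise_sd ** 2
    post_precision = prior_precision + data_precision
    post_mean = (prior_precision * prior_mean
                 + data_precision * observation) / post_precision
    post_sd = post_precision ** -0.5
    return post_mean, post_sd


observation, noise_sd = 5.0, 1.0  # hypothetical data point and noise level

mean_1, sd_1 = update_normal_prior(0.0, 1.0, observation, noise_sd)
mean_2, sd_2 = update_normal_prior(0.0, 10.0, observation, noise_sd)

print(f"Scientist 1 posterior: mean={mean_1:.3f}, sd={sd_1:.3f}")
print(f"Scientist 2 posterior: mean={mean_2:.3f}, sd={sd_2:.3f}")
```

Both start at mean 0, but the wider prior moves much further toward the observation, which is the sense in which the spread of the prior already encodes confidence.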

Next, let's imagine you are interested in a true/false statement. Since there are only two possibilities (law of excluded middle) you can represent your beliefs with a Bernoulli distribution. This distribution has only one parameter, p, and its variance is equal to p(1-p). Therefore, your estimate p tells me everything there is to know about how certain you are.

If you claim "I believe the statement is true with probability 50%" you have committed yourself to updating that probability only by the likelihood ratio associated with future evidence, which depends only on the probability of the outcome given the hypothesis. This likelihood ratio simply cannot depend on how certain you are about the hypothesis.

The only meaningful interpretation of a probability on a probability is if you are unsure about what you actually believe. In other words, you are trying to make a map of your map. For example, you can say that "I believe with probability 1/4 that I believe that p=0.40, and I believe with probability 3/4 that I believe that p=0.60." This however logically implies that you believe the statement is true with p=0.55 (that is, 0.25 × 0.40 + 0.75 × 0.60), which is the only thing that determines how you update your beliefs in response to new evidence.
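The collapse described here is just a weighted average. A small sketch, where the `(weight, p)` pair representation is an assumption made for illustration:

```python
def collapse(meta_belief):
    """Collapse a belief-about-beliefs into a single probability.

    meta_belief is a list of (weight, p) pairs: "with probability
    `weight` I believe the statement has probability `p`".
    """
    total_weight = sum(weight for weight, _ in meta_belief)
    if abs(total_weight - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    return sum(weight * p for weight, p in meta_belief)


meta_belief = [(0.25, 0.40), (0.75, 0.60)]
collapsed = collapse(meta_belief)  # 0.25 * 0.40 + 0.75 * 0.60
print(f"collapsed probability = {collapsed:.2f}")
```

Any other set of weights with the same average collapses to the same single number, so on this view the extra level carries no information about the statement itself.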

Of course, if you obtain new information about what you truly believe (which is independent of whether the statement is true), you could update your prior on your prior. However, I fail to see what this represents or why this idea would be useful.

You do not need a probability distribution on your probability distribution to represent uncertainty.

I think I do.

First, my prior is a probability distribution, isn't it? Second, some but not all uncertainty is captured by the variance of my prior. For example, I could be uncertain about the shape of the distribution -- say, it might be skewed but I'm not sure whether it actually is. Or I don't know whether I'm looking at a Student's-t (which e.g. has a defined mean) or I'm looking at Cauchy (which doesn't). How will I express that uncertainty?

The only meaningful interpretation of a probability on a probability, is if you are unsure about what you actually believe.

So, what's wrong with that? Of course I am unsure of what I actually believe -- say, I have some prior about the future values of X, but my confidence in my prior is not 100%, it's quite possible that my prior is wrong. You basically want to collapse all the meta-levels into a single prior, and I think that having one or more meta-levels is actually useful for thinking about the situation.

gjm

I suggest that:

• When it isn't, the other things it involves may not be best thought of in terms of your probability distribution for anything.

Example: you are looking at the results of a scientific experiment. You have two rival theories for what's going on. One predicts that the frobulator will show an average reading of 11.3, with variance of 3 units and something very close to a normal distribution. One predicts the same average and variance, but expects a geometric distribution. And it's also possible that neither existing theory is right, in which case almost anything could be, though earlier experiments suggest that readings less than 0.4 or more than 21 are extremely unlikely.

I suggest that in this case your uncertainty about the next frobulator reading is reasonably well captured by the following structure:

• You assign, let's say, p=0.6 that Theory A is basically right, in which case the next reading will be roughly normally distributed if measured correctly.
• You assign, let's say, p=0.37 that Theory B is basically right, in which case the next reading will be roughly exponentially distributed if measured correctly.
• You assign p=0.03 that neither theory is correct, conditional on which you have a largely atheoretical prior that maybe looks roughly normal but with larger variance than either Theory A or Theory B.
• You are aware that sometimes measurements are messed up, so you expect that with some quite small probability the measured result will be corrupted in some way you could probably write down a crude distribution for (obtained by reflecting on the kinds of mistakes people make, or past experience of measurement cockups, or something).

So you have uncertainties about things other than the next frobulator reading, but it would be misleading to describe them as uncertainties about your probability distribution; e.g. the sort of thing that would change your prior would be discovering evidence from some other source that favours Theory B, or learning that the person taking the measurements is a hopeless klutz whose mistakes have caused trouble in the past.

Including my Knightian uncertainty?

You assign, let's say, p=0.6 that Theory A is basically right

How do I express my uncertainty about that 0.6 number?

I don't know about that. I am uncertain about the next frobulator reading. I'm treating this reading as a random variable arising out of an unobserved process (=some unobserved distribution). This unobserved process/distribution has a set of parameters theta. I am uncertain about these parameters. Would you describe the uncertainty about these parameters as "uncertainties about [my] probability distribution"?

gjm

I don't really believe in "Knightian uncertainty" as a fundamental notion, but in so far as you have it I'm not sure you can properly be said to have a prior at all.

Your "uncertainty about that 0.6 number" is a meaningful notion only when there's something in (your model of) the world for it to be about. For instance, perhaps your opinion that Theory A is a bit more likely than not is the result of your having read a speculative paper by someone you think is an expert; but if you think there's a 10% chance she's a charlatan, maybe it would be useful to represent that as p=0.9 of (65% Theory A, 32% Theory B, 3% neither) plus p=0.1 of some other probability distribution over theories. (If that's the only impact of learning that the author is or isn't a charlatan, this doesn't buy you anything relative to just figuring out the overall probabilities for A, B, and Neither; but e.g. perhaps if the author is a charlatan then your ideas about how things might look if A and B are both wrong will change.)

(It might be fruitful in practice to put probability distributions on such things -- it might be easier and almost as accurate as figuring out all the intricate evidential structures that I'm suggesting are the "real" underpinnings of the kind of uncertainty that makes it feel like a good thing to do. But I think that's a heuristic technique and I'm not convinced that there's a way to make it rigorous that doesn't cash it out in terms of the kind of thing I've been describing.)

Would you describe the uncertainty about these parameters as "uncertainties about [my] probability distribution"?

No. I think you're making a type error again. The unobserved process is, or describes, some physical thing within the world; its parameters, whatever they may be, are facts about the world. You are (of course) uncertain about them; that uncertainty is part of your current probability distribution over ways-the-world-could-be. (You may also be uncertain about whether the actual process is the sort you think it is; again, that's represented by your probability distribution over how the world is.)

None of that involves making your probability assignment apply to itself.

Now, having said all that: you are part of the world, and you may in fact be uncertain about various aspects of your mind, including your probability assignments. So if you are trying to predict your own future behaviour or something, then for that purpose you may want to introduce something like uncertainty about your probability distribution. But I think you shouldn't identify your model of your probability distribution, as here, with the probability distribution you're using for calculation, as in the previous paragraphs. (In particular, I suspect that assuming they match may lead you into inconsistencies.)

Let me express my approach in a slightly different way.

Let's say I have a sample of some numbers and I'm interested in the properties of future numbers coming out of the same underlying process.

The simplest approach (say, Level 1) is to have a point estimate. Here is my expected value for the future numbers.

But wait! There is uncertainty. At Level 2 I specify a distribution, say, a Gaussian with a particular mean and standard deviation (note that it implies e.g. very specific "hard" probabilities of seeing particular future numbers).

But wait! There is more uncertainty! At Level 3 I specify that the mean of that Gaussian is actually uncertain, too, and has a standard error -- in effect it is a distribution (meaning your "hard" probabilities from the previous level just became "soft"). And the variance is uncertain, too, and has parameters of its own.
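A small sketch of the first three levels, using a hypothetical sample of five numbers and only the standard library:

```python
import math
import statistics

# Hypothetical sample from the underlying process.
sample = [9.8, 11.1, 10.4, 12.0, 10.7]

# Level 1: a point estimate of future values.
point_estimate = statistics.mean(sample)

# Level 2: a Gaussian described by a mean and a standard deviation.
sample_sd = statistics.stdev(sample)

# Level 3: the mean itself is uncertain; its standard error shrinks
# as the sample grows, while the Level 2 spread does not.
standard_error = sample_sd / math.sqrt(len(sample))

print(f"Level 1: expected value = {point_estimate:.2f}")
print(f"Level 2: Normal(mean={point_estimate:.2f}, sd={sample_sd:.2f})")
print(f"Level 3: mean = {point_estimate:.2f} +/- {standard_error:.2f}")
```

With only five observations the Level 3 uncertainty about the mean is a sizable fraction of the Level 2 spread, which is when keeping the levels separate is most informative.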

But wait! You can dive deeper and find yet more turtles down there.

but in so far as you have it I'm not sure you can properly be said to have a prior at all.

I have an uncertain prior. I find that notion intuitive, it seems that you don't.

Your "uncertainty about that 0.6 number" is a meaningful notion only when there's something in (your model of) the world for it to be about.

It is uncertainty about the probability that the theory A is correct. I find the idea of "uncertainty about the probability" meaningful and useful.

I think that in a large number of cases you just do not have enough data for "figuring out all the intricate evidential structures" and the "heuristic technique" is all you can do. As for being rigorous, I'll be happy if in the limit it converges to the right values.

that's represented by your probability distribution over how the world is

But I don't have one. I'm not Omega -- the world is too large for me to have a probability distribution over it. I'm building models all of which are wrong but some of which are useful (hat tip to George Box). It is useful to me to have multilayered models which involve probabilities of probabilities.

I think we are basically talking about whether to collapse all the meta-levels into one (your and Anders_H's position) or not collapse them (my position).

What would a meta-probability actually mean?

That was a really good question, and 1) seems like a pretty good answer, but I recommend you look into Jaynes, who went into a little detail on this idea.

See comments below, "Inner and Outer Robots".
http://lesswrong.com/r/discussion/lw/msd/some_thoughts_on_metaprobabilties/crnx

If you made no approximations, the normatively correct approach is to carry around your current probability estimate p, and a table which contains what p would be updated to under all possible pieces of evidence you could receive. For example, I might say "I know very little about sports, so I'll assign probability 50% that the Dallas Cowboys will win their next game, but if my friend who follows football tells me they will, I'll assign probability 75%, and if I see a bookie's odds, I'll adopt the implied probability estimate." (This is, of course, an incomplete list--there are many, many other pieces of evidence I could see.) Obviously, these updates should follow the laws of probability on pain of paradox.

Why is this necessary to do things correctly? You can work out that my friend's prediction, because it moved me from 1:1 odds to 3:1 odds, has a likelihood ratio of three. But where did 3 come from? It's the interaction between my knowledge and my friend's knowledge. If the same friend makes the same prediction again, then I shouldn't update my probability a second time, because the first time they gave me useful info, and the second time they don't give me any new info. If a second friend also predicts that the Cowboys will win, then I need to estimate how correlated their predictions are in order to determine how to update.
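The odds arithmetic in this example can be sketched as follows (the function names are illustrative, not from any library):

```python
def prob_to_odds(p):
    """Convert a probability to odds in favour (p : 1 - p)."""
    return p / (1.0 - p)


def odds_to_prob(odds):
    """Convert odds in favour back to a probability."""
    return odds / (1.0 + odds)


def implied_likelihood_ratio(prior, posterior):
    """Likelihood ratio that would move `prior` to `posterior`."""
    return prob_to_odds(posterior) / prob_to_odds(prior)


def update(prior, likelihood_ratio):
    """Bayes' rule in odds form: posterior odds = prior odds * LR."""
    return odds_to_prob(prob_to_odds(prior) * likelihood_ratio)


ratio = implied_likelihood_ratio(0.50, 0.75)
naive_repeat = update(0.75, ratio)      # treats the repeat as fresh evidence
correct_repeat = update(0.75, 1.0)      # a repeat carries no new information

print(f"likelihood ratio of the friend's prediction: {ratio:.1f}")
print(f"naive double-count: {naive_repeat:.2f}, correct: {correct_repeat:.2f}")
```

Applying the ratio of 3 a second time would push the estimate to 0.9, which is exactly the double-counting the table of conditional updates is meant to prevent.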

The hyperparameter approach is the clean way to do this in cases where the likelihood of incoming evidence given existing evidence is easy to determine. If I've got a coin flipped in a random fashion (but weighted in an unknown way), then I think that successive flips are independent and equally indicative of the underlying propensity of the coin to land heads when flipped randomly. But if I've got a coin flipped in a precisely controlled deterministic fashion, then I don't think that successive flips are independent and equally indicative of the underlying propensity of the coin to land heads, because that "propensity" is no longer a useful node in my model.
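For the randomly flipped coin, one common way to set up the hyperparameters is a Beta prior over the coin's propensity. This is a sketch under that assumption; the specific priors and the 8-heads, 2-tails data are hypothetical:

```python
def beta_update(alpha, beta, heads, tails):
    """Update Beta(alpha, beta) hyperparameters with independent flips."""
    return alpha + heads, beta + tails


def beta_mean(alpha, beta):
    """Expected propensity to land heads under Beta(alpha, beta)."""
    return alpha / (alpha + beta)


# Two priors with the same 50% estimate but very different confidence.
vague_prior = (1, 1)        # "not the slightest clue"
confident_prior = (50, 50)  # "pretty sure this coin is fair"

heads, tails = 8, 2  # hypothetical observed flips

vague_post = beta_update(*vague_prior, heads, tails)
confident_post = beta_update(*confident_prior, heads, tails)

print(f"vague:     0.50 -> {beta_mean(*vague_post):.3f}")
print(f"confident: 0.50 -> {beta_mean(*confident_post):.3f}")
```

Both priors give 50% for the next flip, yet the same ten flips move them by very different amounts. The sketch relies on the flips being independent given the propensity, so it does not carry over to the deterministically flipped coin.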