A Bayesian Aggregation Paradox

The framing of this issue that makes the most sense to me is " $P (E | B \cup C)$ is a function of $P (B) : P (C)$ ".

When I look at it this way, I disagree with the claim (in "Mennen's ABC example") that "[Bayesian updating] is not invariant when we aggregate outcomes" -- I think it's clearer to say that Bayesian updating is not well-defined when we aggregate outcomes.

Additionally, in "Interpreting Bayesian Networks", the framing seems to make it clearer that the problem is that you used $e_{1, 2} + e_{1, 3}$ for $P (E | B \cup C)$ -- but they're not the same thing! In essence, you're taking the sum where you should be taking the average...

With this focus on (mis)calculating $P (E | B \cup C)$ , the issue seems to me more like "a common error in applying Bayesian updates", rather than a fundamental paradox in Bayesian updating itself. I agree with the takeaway "be careful when grouping together outcomes of a variable" -- because grouping exposes one to committing this error -- but I'm not sure I'm seeing the thing that makes you describe it as unintuitive?
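To make the sum-versus-average point concrete, here is a minimal sketch. The likelihood values and prior ratios are made up for illustration, and `likelihood_of_union` is a hypothetical helper name; it just applies the law of total probability to mutually exclusive $B$ and $C$.

```python
from fractions import Fraction


def likelihood_of_union(e_b, e_c, p_b, p_c):
    """P(E | B or C) for mutually exclusive B and C.

    This is a prior-weighted average of e_b = P(E|B) and e_c = P(E|C);
    only the ratio p_b : p_c matters.
    """
    return (e_b * p_b + e_c * p_c) / (p_b + p_c)


# Hypothetical likelihoods P(E|B) and P(E|C).
e_b, e_c = Fraction(9, 10), Fraction(3, 10)

# Same likelihoods, three different prior ratios P(B) : P(C).
mostly_b = likelihood_of_union(e_b, e_c, Fraction(9, 10), Fraction(1, 10))
mostly_c = likelihood_of_union(e_b, e_c, Fraction(1, 10), Fraction(9, 10))
even = likelihood_of_union(e_b, e_c, Fraction(1, 2), Fraction(1, 2))

# The naive sum is not even a probability here.
naive_sum = e_b + e_c

print(mostly_b, mostly_c, even, naive_sum)
```

With these numbers the aggregated likelihood moves between the two individual likelihoods as the prior ratio changes, while the naive sum exceeds 1.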

[-]Jsevillamol4y40

I like this framing.

This seems to imply that summarizing beliefs and summarizing updates are two distinct operations.

For summarizing beliefs we can still resort to summing:

But for summarizing updates we need to use an average - which in the absence of prior information will be a simple average:

$\underbrace{\begin{pmatrix} e_{1} \\ e_{2} \\ e_{3} \end{pmatrix}}_{\text{Update}} \to \underbrace{\begin{pmatrix} e_{1} \\ \frac{e_{2} + e_{3}}{2} \end{pmatrix}}_{\text{Summarized update}}$

Annoyingly, and as you point out, this is not a perfect summary - we are definitely losing information here, and subsequent updates will not be as exact as if we were working with the disaggregated odds.

I still find it quite disturbing that the update after summarizing depends on prior information - but I can't see how to do better than this, pragmatically speaking.

[-]rossry4y30

Right, I agree that for aggregating the update, the average $\frac{e_{2} + e_{3}}{2}$ is better than $e_{2} + e_{3}$ (but still lossy). And the thing that $p_{2} : p_{3}$ affects is the weighting in the average -- so if $e_{2} = e_{3}$ then the $p$ s don't matter! (which is a possible answer to your question of "how much aggregation/disaggregation can you do?")

But yeah if $e_{2}$ is very different from $e_{3}$ then I don't think there's any way around it, because the effective $e_{i}$ could be one or the other depending on what the $p_{i}$ are.

[-]AlexMennen4y70

(Possibly a bit of a tangent) It occurred to me while reading this that perhaps average log odds could make sense in the context in which there is a uniform prior, and the probabilities provided by experts differ because the experts disagree on how to interpret evidence that brings them away from the uniform prior. This has some intuitive appeal:

1) Perhaps, when picking questions to ask forecasters, people have a tendency to pick questions for which they believe the probability that the answer is yes is approximately 50%, because that offers the most opportunity to update in response to the beliefs of the forecasters. If average log odds is an appropriate pooling method to use if you have a uniform prior, then this would explain its good empirical performance. I think I mentioned in our discussion on your EA forum post that if there is a tendency for more knowledgeable forecasters to give more extreme probabilities, then this would explain good performance by average log odds, which weights extreme predictions heavily. A tendency for the questions asked to have priors of near 50% according to the typical unknowledgeable person would explain why more knowledgeable forecasters would assign more extreme probabilities on average: it takes more expertise to justifiably bring their probabilities further from 50%.

2) It excuses the incoherent behavior of average log odds on my ABC example as well. If A, B, and C are mutually exclusive, then they can't all have 50% prior probability, so a pooling method that implicitly assumes that they do will not give coherent results.

Ultimately, though, I don't think this is actually true. Consider the example of forecasting a continuous variable x by soliciting probability density functions $p_{1} (x)$ and $p_{2} (x)$ from two experts, and pooling them to get the pdf proportional to $\sqrt{p_{1} (x) p_{2} (x)}$ (renormalized so it integrates to 1). You could also consider forecasting the variable $y = f (x)$ for some differentiable, strictly increasing function f. Then your experts give you pdfs $q_{1} (y)$ and $q_{2} (y)$ satisfying $p_{i} (x) = f^{'} (x) q_{i} (f (x))$ , and you pool them to get the pdf proportional to $\sqrt{q_{1} (y) q_{2} (y)}$ . I claim that, if what we're doing implicitly depends on a uniform prior in a sneaky way, the first thing should be the appropriate thing to do if x has a uniform prior, and the second thing should be appropriate if y has a uniform prior. If f is nonlinear, then a uniform prior on x induces a non-uniform prior on y, and vice-versa, so we should get incompatible results from each way of doing this, as we were implicitly using different priors each time. But let's try it: $\sqrt{p_{1} (x) p_{2} (x)} = \sqrt{f^{'} (x) q_{1} (f (x)) f^{'} (x) q_{2} (f (x))} = f^{'} (x) \sqrt{q_{1} (f (x)) q_{2} (f (x))}$ . Thus, given that both experts provided pdfs satisfying the formula $p_{i} (x) = f^{'} (x) q_{i} (f (x))$ , making their probability distributions on x and y compatible with $y = f (x)$ , our pooled pdfs also satisfy that formula, and are also compatible with $y = f (x)$ . That is, if we pool using beliefs about x, and then find the implied beliefs about y, we get the same thing as if we directly pooled using beliefs about y. Different implicit priors don't appear to be ruining anything.

I conclude that the incoherent results in my ABC example cannot be blamed on switching between the uniform prior on {A,B,C} and the uniform prior on {A, $\neg$ A}, and, instead, should be blamed entirely on the experts having different beliefs conditional on $\neg$ A, which is taken into account in the calculation using A,B,C, but not in the calculation using A, $\neg$ A.
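The reparameterization argument above can be spot-checked numerically. The following is only a sketch under assumptions that are not in the comment: $f (x) = x^{2}$ on $x > 0$, and the two experts' pdfs on y are exponentials with rates 1 and 3. It compares the unnormalized pooled density over x against $f^{'} (x)$ times the unnormalized pooled density over y.

```python
import math


# Assumed change of variables y = f(x) = x^2 on x > 0 (strictly increasing there).
def f(x):
    return x * x


def f_prime(x):
    return 2 * x


# Assumed expert pdfs over y: exponentials with rates 1 and 3.
def q1(y):
    return math.exp(-y)


def q2(y):
    return 3.0 * math.exp(-3.0 * y)


# Implied pdfs over x via p_i(x) = f'(x) q_i(f(x)).
def p1(x):
    return f_prime(x) * q1(f(x))


def p2(x):
    return f_prime(x) * q2(f(x))


def pooled_x(x):
    """Unnormalized geometric pool of the experts' pdfs over x."""
    return math.sqrt(p1(x) * p2(x))


def pooled_y(y):
    """Unnormalized geometric pool of the experts' pdfs over y."""
    return math.sqrt(q1(y) * q2(y))


# Largest pointwise gap between pooling over x and transforming the pool over y.
xs = [0.1, 0.5, 1.0, 2.0]
max_gap = max(abs(pooled_x(x) - f_prime(x) * pooled_y(f(x))) for x in xs)
print(max_gap)
```

The two unnormalized densities agree up to floating-point error, and since they are related by the same change of variables their normalizing constants coincide as well.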

[-]Jsevillamol4y10

average log odds could make sense in the context in which there is a uniform prior

This is something I have heard from other people too, and I still cannot make sense of it. Why would questions where uninformed forecasters produce uniform priors make logodds averaging work better?

A tendency for the questions asked to have priors of near 50% according to the typical unknowledgeable person would explain why more knowledgeable forecasters would assign more extreme probabilities on average: it takes more expertise to justifiably bring their probabilities further from 50%.

I don't understand your point. Why would forecasters care about what other people would do? They only want to maximize their own score.

If A, B, and C are mutually exclusive, then they can't all have 50% prior probability, so a pooling method that implicitly assumes that they do will not give coherent results.

This also doesn't make much sense to me, though it might be because I still don't understand the point about needing uniform priors for logodd pooling.

Different implicit priors don't appear to be ruining anything.

Neat!

I conclude that the incoherent results in my ABC example cannot be blamed on switching between the uniform prior on {A,B,C} and the uniform prior on {A, $\neg$ A}, and, instead, should be blamed entirely on the experts having different beliefs conditional on $\neg$ A, which is taken into account in the calculation using A,B,C, but not in the calculation using A, $\neg$ A.

I agree with this.

[-]AlexMennen4y30

Why would questions where uninformed forecasters produce uniform priors make logodds averaging work better?

Because it produces situations where more extreme probability estimates correlate with more expertise (assuming all forecasters are well-calibrated).

I don't understand your point. Why would forecasters care about what other people would do? They only want to maximize their own score.

They wouldn't. But if both would have started with priors around 50% before they acquired any of their expertise, and it's their expertise that updates them away from 50%, then more expertise is required to get more extreme odds. If the probability is a martingale that starts at 50%, and the time axis is taken to be expertise, then more extreme probabilities will on average be sampled from later in the martingale; i.e. with more expertise.

This also doesn't make much sense to me, though it might be because I still don't understand the point about needing uniform priors for logodd pooling.

If logodd pooling implicitly assumes a uniform prior, then logodd pooling on A vs $\neg$ A assumes A has prior probability 1/2, and logodd pooling on A vs B vs C assumes A has a prior of 1/3, which, if the implicit prior actually was important, could explain the different results.
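For readers who want to see the "different results" concretely, here is a small sketch. The two expert forecasts are invented for illustration (they are not the numbers from the original ABC example), and `geometric_pool` is a hypothetical helper implementing equal-weight normalized geometric-mean pooling, which reduces to averaging log odds in the binary case.

```python
import math


def geometric_pool(*forecasts):
    """Equal-weight normalized geometric mean of probability vectors."""
    n = len(forecasts)
    raw = [math.prod(ps) ** (1 / n) for ps in zip(*forecasts)]
    total = sum(raw)
    return [r / total for r in raw]


# Hypothetical experts over mutually exclusive outcomes (A, B, C).
# They agree on P(A) but disagree sharply conditional on not-A.
expert_1 = [0.5, 0.49, 0.01]
expert_2 = [0.5, 0.01, 0.49]

# Pool on the fine partition {A, B, C}, then read off P(A).
p_a_fine = geometric_pool(expert_1, expert_2)[0]

# Coarsen each forecast to {A, not-A} first, then pool.
coarse_1 = [expert_1[0], expert_1[1] + expert_1[2]]
coarse_2 = [expert_2[0], expert_2[1] + expert_2[2]]
p_a_coarse = geometric_pool(coarse_1, coarse_2)[0]

print(p_a_fine, p_a_coarse)
```

Both experts say 50% for A, and pooling on {A, not-A} returns 50%, but pooling on {A, B, C} pushes the pooled probability of A to roughly 78%, because the disagreement about B versus C shrinks the pooled mass on not-A.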

[-]Unnamed4y70

I think I've followed the basic argument here? Let me try a couple examples, first a toy problem and then a more realistic one.

Example 1: Dice. A person rolls some fair 20-sided dice and then tells you the highest number that appeared on any of the dice. They either rolled 1 die (and told you the number on it), or 5 dice (and told you the highest of the 5 numbers), or 6 dice (and told you the highest of the 6 numbers).

For some reason you care a lot about whether there were exactly 5 dice, so you could break this down into two hypotheses:

H1: They rolled 5 dice
H2: They rolled 1 or 6 dice

Let's say they roll and tell you that the highest number rolled was 20. This favors 5 dice over 1 die, and to a lesser degree it favors 6 dice over 5 dice. So if you started with equal (1/3) probabilities on the 3 possibilities, you'll update in favor of H1. Someone who also started with a 1/3 chance on H1, but who thought that 1 die was more likely than 6 dice, would update even more in favor of H1. And someone whose prior was that 6 dice was more likely than 1 die would update less in favor of H1, or even in the other direction if it was lopsided enough.

Relatedly, if you repeated this experiment many times and got lots of 20s, that would eventually become evidence against H1. If the 100th roll is 20, then that favors 6 dice over 5, and by that point the possibility of there being only 1 die is negligible (if the first 99 rolls were large enough) so it basically doesn't matter that the 20 also favors 5 dice over 1. This seems like another angle on the same phenomenon, since your posterior after 99 rolls is your prior for the 100th roll (and the evidence from the first 99 rolls has made it lopsided enough so that the 20 counts as evidence against H1).
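Example 1 can be checked with exact arithmetic. This is a sketch; the two non-uniform priors are made-up numbers chosen only so that P(5 dice) stays at 1/3, and the function names are hypothetical. It uses the fact that the highest of n fair d20 rolls is 20 with probability $1 - (19/20)^{n}$.

```python
from fractions import Fraction


def p_max_is_20(n_dice):
    """Probability that the highest of n fair d20 rolls is 20."""
    return 1 - Fraction(19, 20) ** n_dice


LIKELIHOOD = {n: p_max_is_20(n) for n in (1, 5, 6)}


def posterior_h1(prior):
    """P(5 dice | highest roll was 20), given a prior over the dice count."""
    joint = {n: prior[n] * LIKELIHOOD[n] for n in prior}
    return joint[5] / sum(joint.values())


third = Fraction(1, 3)

# All three priors put 1/3 on H1 (5 dice); they differ only inside H2.
uniform = {1: third, 5: third, 6: third}
one_die_heavy = {1: Fraction(3, 5), 5: third, 6: Fraction(1, 15)}
six_dice_heavy = {1: Fraction(1, 150), 5: third, 6: Fraction(33, 50)}

for name, prior in [("uniform", uniform),
                    ("one die heavy", one_die_heavy),
                    ("six dice heavy", six_dice_heavy)]:
    print(name, float(posterior_h1(prior)))
```

The same observation raises P(H1) under the first two priors (more so when 1 die is considered likely) and lowers it under the third. The lopsided third prior is also what the posterior roughly looks like after many earlier 20s have made the 1-die possibility negligible.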

Example 2: College choice. A high school freshman hopes & expects to attend Harvard for college in a few years. One observer thinks that's unlikely, because Harvard admissions is very selective even for very good students. Another observer thinks that's unlikely because the student is into STEM and will probably wind up going to a more technical university like MIT; they haven't thought much yet about choosing a college and Harvard is probably just serving as a default stand-in for a really good school.

The two observers might give the same p(Harvard), but for very different reasons. And because their models are so different, they could even update in opposite directions on the same new data. For instance, perhaps the student does really well on a math contest, and the first observer updates in favor of the student attending Harvard (that's an impressive accomplishment, maybe they will make it past the admissions filter) while the second observer updates a bit against the student attending Harvard (yep, they're a STEM person).

You could fit this into the "three outcomes" framing of this post, if you split "not attending Harvard" into "being rejected by Harvard" and "choosing not to attend Harvard".

[-]cousin_it4y*60

I think your first example could be even simpler. Imagine you have a coin that's either fair, all-heads, or all-tails. If your prior is "fair or all-heads with probability 1/2 each", then seeing heads is evidence against "fair". But if your prior is "fair or all-tails with probability 1/2 each", then seeing heads is evidence for "fair". Even though "fair" started as 1/2 in both cases. So the moral of the story is that there's no such thing as evidence for or against a hypothesis, only evidence that favors one hypothesis over another.
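The coin version is small enough to compute exactly. A minimal sketch, with hypothetical names:

```python
from fractions import Fraction

# P(heads | hypothesis) for each kind of coin.
P_HEADS = {
    "fair": Fraction(1, 2),
    "all_heads": Fraction(1),
    "all_tails": Fraction(0),
}


def posterior_after_heads(prior):
    """Bayes update of a prior over coin types after observing one heads."""
    joint = {h: p * P_HEADS[h] for h, p in prior.items()}
    total = sum(joint.values())
    return {h: j / total for h, j in joint.items()}


half = Fraction(1, 2)
vs_heads = posterior_after_heads({"fair": half, "all_heads": half})
vs_tails = posterior_after_heads({"fair": half, "all_tails": half})

print(vs_heads["fair"], vs_tails["fair"])
```

Starting from 1/2 in both cases, "fair" drops to 1/3 when the alternative is all-heads and rises to 1 when the alternative is all-tails.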

[-]Pattern4y20

That's a great explanation. Evidence may also be compatible or incompatible with a hypothesis. For instance, if I get a die (without the dots on the sides that indicate 1-6), and I instead label* it:

Red, 4, Life, X-Wing, Int, path through a tree

Then finding out I rolled a 4, without knowing what die I used, is compatible with the regular dice hypothesis, but none of the other rolls is.

*(likely using symbols, for space reasons)

[-]Unnamed4y40

This seems related to philosophy of science stuff, where updating is about pitting hypotheses against each other. In order to do that you have to locate the leading alternative hypotheses. It doesn't work well to just pit a hypothesis against "everything else" (it's hard to say what p(E|not-H) is, and it can change as you collect more data). You need to find data that distinguishes your hypothesis from leading alternatives. An experiment that favors Newtonian mechanics over Aristotelian mechanics won't favor Newtonian mechanics over general relativity.

[-]Forged Invariant4y60

Seeing the equations, it was hard to intuitively grasp why updates work this way. This example made things more intuitive for me:

If an event can have 3 outcomes, and we encounter strong evidence against outcomes B and C, then the update looks like this:

The information about what hypotheses are in the running is important, and pooling the updates can make the evidence look much weaker than it is.

[-]Jsevillamol4y*20

Note that you are making the same mistake as I did! Updates are not summarized in the same way as beliefs - for the update the "correct" way is to take an average of the likelihoods:

$\underbrace{\begin{pmatrix} 1 \\ 0.01 \\ 0.01 \end{pmatrix}}_{\text{Posterior}} = \underbrace{\begin{pmatrix} 1 \\ 1 \\ 1 \end{pmatrix}}_{\text{Prior}} \times \underbrace{\begin{pmatrix} 1 \\ 0.01 \\ 1 \end{pmatrix}}_{\text{Refute B}} \times \underbrace{\begin{pmatrix} 1 \\ 1 \\ 0.01 \end{pmatrix}}_{\text{Refute C}} \neq \underbrace{\begin{pmatrix} 1 \\ 1 + 1 \end{pmatrix}}_{\text{Prior}} \times \underbrace{\begin{pmatrix} 1 \\ \frac{0.01 + 1}{2} \end{pmatrix}}_{\text{Refute B}} \times \underbrace{\begin{pmatrix} 1 \\ \frac{1 + 0.01}{2} \end{pmatrix}}_{\text{Refute C}} \approx \underbrace{\begin{pmatrix} 1 \\ 0.5 \end{pmatrix}}_{\text{Posterior}}$

This does not invalidate the example though!

Thanks for suggesting it, I think it helps clarify the conundrum.
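The numbers in this example can be reproduced with a short script. It is a sketch with hypothetical helper names; it starts from a uniform prior over A, B, C and compares updating on the full partition against updating on {A, B or C} with each likelihood vector summarized by a simple average.

```python
from fractions import Fraction


def normalize(weights):
    total = sum(weights)
    return [w / total for w in weights]


def update(prior, *likelihoods):
    """Multiply a prior by each likelihood vector in turn, then normalize."""
    post = list(prior)
    for lik in likelihoods:
        post = [p * l for p, l in zip(post, lik)]
    return normalize(post)


def summarize(lik):
    """Collapse (e_A, e_B, e_C) to (e_A, simple average of e_B and e_C)."""
    return [lik[0], Fraction(lik[1] + lik[2]) / 2]


tiny = Fraction(1, 100)
prior = [Fraction(1, 3)] * 3
refute_b = [1, tiny, 1]
refute_c = [1, 1, tiny]

# Exact: update on {A, B, C}; P(A) is unaffected by merging B and C afterwards.
exact_a = update(prior, refute_b, refute_c)[0]

# Summarized: merge the prior by summing, merge each update by averaging.
coarse_prior = [prior[0], prior[1] + prior[2]]
summarized_a = update(coarse_prior, summarize(refute_b), summarize(refute_c))[0]

print(float(exact_a), float(summarized_a))
```

The exact posterior on A is about 0.98, while the summarized route gives about 0.66. The first averaged update is exact under the uniform prior, but it changes the B : C ratio, so the simple average is no longer the right weighting for the second update.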

[-]Forged Invariant4y10

The left hand side of the example is deliberately making the mistake described in your article, as a way to build intuition on why it is a mistake.

(Adding instead of averaging in the update summaries was an unintended mistake)

Thanks for explaining how to summarize updates, it took me a bit to see why averaging works.

[-]Jsevillamol3y40

As is often the case, I just found out that Jaynes already discussed an issue similar to this paradox in his seminal book.

This Wikipedia article summarizes the gist of it.

[-]a gently pricked vein4y*30

There's probably a radical constructivist argument for not really believing in open/noncompact categories like $\neg A$ . I don't know how to make that argument, but this post too updates me slightly towards such a Tao of conceptualization.

(To not commit this same error at the meta level: Specifically, I update away from thinking of general negations as "real" concepts, disallowing statements like "Consider a non-chair, ...").

But this is maybe a tangent, since just adopting this rule doesn't resolve the care required in aggregation with even compact categories.

[-]MrMind3y30

There is, at least at a mathematical / type theoretic level.

In intuitionistic logic, $\neg A$ is translated to $A \to 0$ , which is the type of processes that turn an element of $A$ into an element of $0$ , but since $0$ is empty, the whole $\neg A$ is absurd as long as $A$ is instantiated (if not, then the only member is the empty identity). This is also why constructively $A \to \neg \neg A$ but not $\neg \neg A \to A$ .

Closely related to constructive logic is topology, and indeed if concepts are open sets, the logical complement is not a concept. Topology is also nice because it formalizes the concept of an edge case.

[-]a gently pricked vein3y10

I'm unsure if open sets (or whatever generalization) are a good formal underpinning of what we call concepts, but I'm in agreement that there seems needed at least a careful reconsideration of intuitions one takes for granted when working with a concept, when you're actually working with a negation-of-concept. And "believing in" might be one of those things that you can't really do with negation-of-concepts.

Also, I think a typo: you said "logical complement", I'm imagining you meant "set-theoretic complement". (This seems important to point out since in topological semantics for intuitionistic logic, the "logical complement" is in fact defined to be the interior of the set-theoretic complement, which guarantees an open.)

[-]MrMind3y20

I should have written "algebraic complement", which becomes logical negation or set-theoretic complement depending on the model of the theory.

Anyway, my intuition on why open sets are an interesting model for concepts is this: "I know it when I see it" seems to describe a lot of the way we think about concepts. Often we don't have a precise definition that could settle all the edge cases, but we pretty much have a strong intuition about when a concept does apply. This is what happens with recursively enumerable sets: if a number belongs to an R.E. set, you will find out, but if it doesn't, you need to wait an infinite amount of time. Systems that take seriously the idea that confirmation of truth is easy fall under the banner of "geometric logic", whose algebraic models are frames, and topologies are just frames of subsets. So I see the relation between "facts" and "concepts" a little bit like the relation between "points" and "open sets", but more in an "internal language of a topos" or "pointless topology" fashion: we don't have access to points per se, only to open sets, and we imagine that points are infinite chains of ever more precise open sets.

[-]DanielFilan2y20

Is this just the thing where evidence is theory-laden? Like, for example, how the evidentiary value of the WHO report on the question of COVID origins depends on how likely one thinks it is that people would effectively cover up a lab leak?

[-]Adele Lopez4y20

I think entropy is a key to understanding this more deeply. I believe you could consider the unaggregated distribution as the "microstates" and the aggregated one as the "macrostates". The entropy would then tell you how much information you lose by aggregating in this way.

Minor quibble: The likelihood part of probability is also subjective in the sense that it depends on the evidence the agent is aware of.
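The entropy framing above can be illustrated with a toy calculation. This is only a sketch; the distribution over A, B, C is made up, and `entropy_bits` is a hypothetical helper computing Shannon entropy in bits.

```python
import math


def entropy_bits(dist):
    """Shannon entropy of a discrete distribution, in bits."""
    return -sum(p * math.log2(p) for p in dist if p > 0)


# Hypothetical "microstate" distribution P(A), P(B), P(C).
micro = [0.5, 0.25, 0.25]

# "Macrostate" distribution after merging B and C.
macro = [micro[0], micro[1] + micro[2]]

# Entropy hidden by the aggregation.
lost = entropy_bits(micro) - entropy_bits(macro)
print(entropy_bits(micro), entropy_bits(macro), lost)
```

By the chain rule for entropy, the difference equals P(B or C) times the entropy of the B-versus-C split inside the merged outcome, which here is half a bit.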

[-]JonasMoss4y*10

I find the beginning of this post somewhat strange, and I'm not sure your post proves what you claim it does. You start out discussing what appears to be a combination of two forecasts, but present it as Bayesian updating. Recall that Bayes' theorem says $p (θ ∣ x) = \frac{p (x ∣ θ) p (θ)}{p (x)}$ . To use this theorem, you need both an $x$ (your data / evidence), and a $θ$ (your parameter). Using “posterior $\propto$ prior $\times$ likelihood” (with priors $p_{1}, p_{2}, p_{3}$ and likelihoods $e_{1}, e_{2}, e_{3}$ ), you're talking as if your expert's likelihood equals $p (x ∣ θ)$ – but is that true in any sense? A likelihood isn't just something you multiply with your prior, it is a conditional pmf or pdf with a different outcome than your prior.

I can see two interpretations of what you're doing at the beginning of your post:

1) You're combining two forecasts. That is, with $θ \in \{A, B, C\}$ being the outcome, you have your own pmf $p_{1} (θ)$ and the expert's $e = p_{2} (θ)$ , then combine them using $p (θ) \propto p_{1} (θ) p_{2} (θ)$ . That's fair enough, but I suppose $p (θ) \propto \sqrt{p_{1} (θ) p_{2} (θ)}$ or maybe $p (θ) \propto p_{1} (θ)^{q} p_{2} (θ)^{1 - q}$ for some $q \in [0, 1]$ would be a better way to do it.

2) It might be possible to interpret your calculations as a proper application of Bayes' rule, but that requires stretching it. Suppose $θ$ is your subjective probability vector for the outcomes $A, B, C$ and $x$ is the subjective probability vector for the event supplied by an expert (the value of $x$ is unknown to us). To use Bayes' rule, we will have to say that the evidence vector $e = p (x ∣ θ)$ , the probability of observing an expert judgment of $x$ given that $θ$ is true. I'm not sure we ever observe such quantities directly, and it is pretty clear from your post that you're talking about $e = p_{2} (θ)$ in the sense used above, not $p (x ∣ θ)$ .

Assuming interpretation 1, the rest of your calculations are not that interesting, as you're using a method of knowledge pooling no one advocates.

Assuming interpretation 2, ~~the rest of your calculations are probably incorrect. I don't think there is a unique way to go from~~ $p (x ∣ θ)$ ~~to, let's say,~~ $p (x^{*} ∣ θ^{*})$ , ~~where~~ $x^{*}$ ~~is the expert's probability vector over~~ $A, A^{c}$ ~~and~~ $θ^{*}$ ~~your probability vector over~~ $A, A^{c}$ .

[-]Jsevillamol4y20

Thanks for engaging!

To use this theorem, you need both an $x$ (your data / evidence), and a $θ$ (your parameter).

Parameters are abstractions we use to simplify modelling. What we actually care about is the probability of unknown events given past observations.

You start out discussing what appears to be a combination of two forecasts

To clarify: this is not what I wanted to discuss. The expert is reporting how you should update your priors given the evidence, and remaining agnostic on what the priors should be.

A likelihood isn't just something you multiply with your prior, it is a conditional pmf or pdf with a different outcome than your prior.

The whole point of Bayesianism is that it offers a precise, quantitative answer to how you should update your priors given some evidence - and that is multiplying by the likelihoods.

This is why it is often recommended in the social sciences and elsewhere to report your likelihoods.

I'm not sure we ever observe [the evidence vector] directly

I agree this is not common in judgemental forecasting, where the whole updating process is very illegible. I think it holds for most Bayesian-leaning scientific reporting.

it is pretty clear from your post that you're talking about $e = p_{2} (θ)$ in the sense used above, not $p (x ∣ θ)$ .

I am not, I am talking about evidence = likelihood vectors.

One way to think about this is that the expert is just informing us about how we should update our beliefs. "Given that the pandemic broke out in Wuhan, your subjective probability of a lab leak should increase and it should increase by this amount". But the final probability depends on your prior beliefs, which the expert cannot possibly know.

I don't think there is a unique way to go from $p (x ∣ θ)$ to, let's say, $p (x^{*} ∣ θ^{*})$ , where $x^{*}$ is the expert's probability vector over $A, A^{c}$ and $θ^{*}$ your probability vector over $A, A^{c}$ .

Yes! If I am understanding this right, I think this gets to the crux of the post. The compression is lossy, and necessarily loses some information.

[-]JonasMoss4y10

Okay, thanks for the clarification! Let's see if I understand your setup correctly. Suppose we have the probability measures $p_{E}$ and $p_{1}$ , where $p_{E}$ is the probability measure of the expert. Moreover, we have an outcome $x \in \{A, B, C\}$ .

In your post, you use $p_{1} (x ∣ z) \propto p_{E} (z ∣ x) p_{1} (x)$ , where $z$ is an unknown outcome known only to the expert. To use Bayes' rule, we must make the assumption that $p_{1} (z ∣ x) = p_{E} (z ∣ x)$ . This assumption doesn't sound right to me, but I suppose some strange assumption is necessary for this simple framework. In this model, I agree with your calculations.

Yes! If I am understanding this right, I think this gets to the crux of the post. The compression is lossy, and necessarily loses some information.

I'm not sure. When we're looking directly at the probability of an event $x$ (instead of the probability of the probability of an event), things get much simpler than I thought.

Let's see what happens to the likelihood when you aggregate from the expert's point of view. Letting $x \in \{A, B, C\}$ , we need to calculate the expert's likelihoods $p_{E} (z ∣ A)$ and $p_{E} (z ∣ B \cup C)$ . In this case,

$$\begin{aligned} p_{E} (z ∣ B \cup C) &= \frac{p_{E} (B \cup C ∣ z)}{p_{E} (B \cup C)} p_{E} (z) \\ &= \frac{p_{E} (B ∣ z) + p_{E} (C ∣ z)}{p_{E} (B \cup C)} p_{E} (z) \\ &= \frac{p_{E} (z ∣ B) p_{E} (B) + p_{E} (z ∣ C) p_{E} (C)}{p_{E} (B) + p_{E} (C)}, \end{aligned}$$

which is essentially your calculation, but from the expert's point of view. The likelihood $p_{E} (z ∣ B \cup C)$ depends on $p_{E} (B \cup C)$ , the prior of the expert, which is unknown to you. That shouldn't come as a surprise, as he needs to use that prior in order to combine the probability of the events $B$ and $C$ .

But the calculations are exactly the same from your point of view, leading to

$$p_{1} (z ∣ B \cup C) = \frac{p_{E} (z ∣ B) p_{1} (B) + p_{E} (z ∣ C) p_{1} (C)}{p_{1} (B) + p_{1} (C)}$$

Now, suppose we want to generally ensure that $p_{E} (z ∣ B \cup C) = p_{1} (z ∣ B \cup C)$ , which is what I believe you want to do, and which seems pretty natural, at least since we're allowed to assume that $p_{E} (z ∣ x) = p_{1} (z ∣ x)$ for all simple events $x$ . To ensure this, we will probably have to require that your priors are the same as the expert's. In other words, your joint distributions are equal, or $p_{1} (z, x) = p_{E} (z, x)$ .

Do you agree with this summary?
