[EDIT: I have written an update of this post here]
Epistemic status: confused. I have spent some time thinking about this, and in the end I still went with my gut without a great reason for it.
When you ask several people or build several models to forecast the probability of an event, you are left with the question of how to aggregate the forecasts to harness their collective wisdom.
Arguably, the best possible thing would be to examine the divergence in opinions and try to update towards a shared view. But this is not always possible - sharing models is hard and time-consuming. How can we then aggregate forecasts naively?
I have been interested in this problem for a while, especially in the context of running forecasting workshops where my colleagues produce different estimates but we don't have enough time to discuss the differences.
To approach this problem, I first lay out my intuition, then skim two papers approaching the question, and finally outline my best guess given what I learned.
I have a weak intuition that taking the geometric mean is the best simple way to aggregate probabilities. When I try to examine why I believe that, my explanation goes something like:
1) we want the aggregation procedure to be simple (to not overfit), widely used (which is evidence of usefulness) and to have a good theoretical justification
2) arithmetic means are simple, widely used and are the maximum likelihood estimator (MLE) of the expected value of a normal distribution, which we expect to be common-ish because of the central limit theorem
3) but summing probabilities is ontologically wrong / feels weird. p_1 + p_2 is not a probability.
4) a common approach in this kind of situation is to take the logarithm of your quantities of interest
5) the MLE of the median of a log-normal distribution is the geometric mean of the observations
So in practice I mostly use geometric means to aggregate forecasts and don't feel too bad about it.
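As a concrete illustration of what this looks like in practice, here is a minimal sketch (the function name and example forecasts are mine):

```python
import math

def geometric_mean(probs):
    """Aggregate probabilities as (p_1 * p_2 * ... * p_N)^(1/N)."""
    return math.prod(probs) ** (1 / len(probs))

forecasts = [0.1, 0.2, 0.4]
print(geometric_mean(forecasts))        # ≈ 0.2
print(sum(forecasts) / len(forecasts))  # ≈ 0.233 (arithmetic mean, for contrast)
```

Note that the geometric mean is pulled towards the lower forecasts relative to the arithmetic mean.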
There is still a part of me that feels like the explanation above is too handwavy, so I have been reading a bit of the literature on forecast aggregation to try to understand the topic better.
Satopaa et al write about this issue: they derive an estimator from some statistical assumptions and then test it on synthetic and real data.
The estimator has the form

$$\hat{p} = \frac{\left[ \prod_{i=1}^N \left( \frac{p_i}{1 - p_i} \right) \right]^{a/N}}{1 + \left[ \prod_{i=1}^N \left( \frac{p_i}{1 - p_i} \right) \right]^{a/N}}$$
Where $p_i$ are the individual forecasts, $N$ is the number of forecasts and $a$ is a parameter indicating systematic bias in the individual forecasts.
Interestingly, they show that the statistical assumptions about the distribution of forecasts do not actually hold in their real dataset. Even so, in terms of the Brier score their estimator outperforms other simple estimators such as the mean and the median, as well as fancier ones like the logarithmic opinion pool and a beta-transformed linear opinion pool.
This estimator takes the geometric mean of the odds instead of the raw probabilities, and extremizes it with the parameter $a$ (a multiplicative factor in log-odds space). This parameter is fitted to the data at hand; they estimate that the value minimizing the Brier score lies in $a \in [1.161, 3.921]$.
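A minimal sketch of this estimator, assuming binary forecasts strictly between 0 and 1 (the function name and defaults are mine):

```python
import math

def pool_odds(probs, a=1.0):
    """Geometric mean of odds, extremized by the exponent a.

    a = 1 gives the plain geometric mean of odds; a > 1 pushes the pooled
    forecast away from 0.5, compensating for underconfident individuals.
    """
    odds = math.prod(p / (1 - p) for p in probs) ** (a / len(probs))
    return odds / (1 + odds)

print(pool_odds([0.1, 0.2, 0.4]))         # plain geometric mean of odds
print(pool_odds([0.1, 0.2, 0.4], a=2.5))  # extremized, within the fitted range above
```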
Sadly, no empirical comparison with the geometric mean of probabilities is explored in the paper (if someone is interested in doing this, it would be a cool project to write up).
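As a starting point for that project, here is a rough simulation harness for the comparison — the synthetic setup (forecasters observing the true log-odds through Gaussian noise) is an assumption of mine, not the model from the paper:

```python
import math
import random
import statistics

def pool_geo_probs(ps):
    return math.prod(ps) ** (1 / len(ps))

def pool_geo_odds(ps):
    g = math.prod(p / (1 - p) for p in ps) ** (1 / len(ps))
    return g / (1 + g)

pools = {"mean": statistics.fmean, "median": statistics.median,
         "geo probs": pool_geo_probs, "geo odds": pool_geo_odds}

random.seed(0)
trials, n_forecasters = 10_000, 5
scores = {name: 0.0 for name in pools}
for _ in range(trials):
    p_true = random.uniform(0.01, 0.99)
    outcome = 1 if random.random() < p_true else 0
    true_logit = math.log(p_true / (1 - p_true))
    # each forecaster sees the true log-odds plus unit Gaussian noise
    forecasts = [1 / (1 + math.exp(-(true_logit + random.gauss(0, 1))))
                 for _ in range(n_forecasters)]
    for name, pool in pools.items():
        scores[name] += (pool(forecasts) - outcome) ** 2  # Brier score
for name, total in scores.items():
    print(f"{name}: {total / trials:.4f}")
```

Swapping in a real dataset for the synthetic forecasts is the part that would make this a genuine replication.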
Digging a bit deeper I found this article by Allard et al surveying probability aggregation methods.
One of the cooler parts of their analysis is their assessment of the theoretical desiderata a forecast aggregator should satisfy.
Some other desiderata are discussed and argued against, but the three that drew my attention most are external Bayesianity, forcing and marginalization.
External Bayesianity is satisfied when the pooling operation commutes with Bayesian updating - that is, the end result should not depend on whether new information is incorporated before or after pooling.
The authors claim that it is a compelling property.
I find myself a bit confused about it. The property seems to presuppose perfect Bayesians - is that too strong an idealization? And doesn't the Bayesian update depend on the information already available to each forecaster (which we are abstracting away in the aggregation exercise), making the property too restrictive for our purposes?
Interestingly, the authors claim that the class of functions satisfying external Bayesianity is the class of generalized weighted geometric means:

$$\hat{p}(p_1, \ldots, p_N) \propto \prod_{i=1}^N p_i^{w_i}$$
where $\sum w_i = 1$
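For a binary event, this pool can be sketched by taking the weighted geometric mean of the probabilities of both outcomes and renormalizing so they sum to one (the function name is mine):

```python
import math

def log_linear_pool(probs, weights):
    """Weighted geometric mean of binary forecasts, renormalized.

    Assumes the weights sum to 1 and each p is strictly in (0, 1).
    """
    yes = math.prod(p ** w for p, w in zip(probs, weights))
    no = math.prod((1 - p) ** w for p, w in zip(probs, weights))
    return yes / (yes + no)

print(log_linear_pool([0.1, 0.2, 0.4], [1/3, 1/3, 1/3]))
```

With equal weights this reduces to the (normalized) geometric mean of the forecasts.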
Marginalization requires that the pooling operator commutes with marginalizing a joint probability distribution. This requires us to expand beyond the binary scenario to make sense.
The class of functions that satisfies marginalization is the class of generalized weighted arithmetic means:

$$\hat{P} = w_0 P_0 + \sum_{i=1}^N w_i P_i$$
where $P_0$ is a constant.
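A sketch of this linear pool for binary forecasts, with $P_0$ as a fixed baseline probability (the names and defaults are mine):

```python
def linear_pool(probs, weights, p0=0.5, w0=0.0):
    """Generalized weighted arithmetic mean: w0 * p0 + sum_i w_i * p_i.

    Assumes w0 + sum(weights) == 1; p0 plays the role of the constant P_0.
    """
    return w0 * p0 + sum(w * p for w, p in zip(weights, probs))

print(linear_pool([0.1, 0.2, 0.4], [1/3, 1/3, 1/3]))  # plain arithmetic mean
```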
The authors don't provide any commentary on how desirable marginalization is, though their writing suggests they find external Bayesianity the more compelling property.
The authors also look into maximizing the KL entropy under some tame constraints, and derive the corresponding pooling formula.
More ground is covered in the paper, but I won't go into it here.
- Satopaa et al find that the geometric mean of odds beats some other aggregation methods
- Allard et al argue that a generalized geometric mean of probabilities satisfies a desirable property (external Bayesianity), but also study other desiderata that lead to different pooling functions
All in all, it seems like there are some credible alternatives, but I am still confused.
There is some empirical evidence that linear aggregation of probabilities is outperformed by other methods. The theoretical case is not clear-cut: linear aggregation preserves some properties that seem desirable, like marginalization, but fails other desiderata, like external Bayesianity.
But if not linear aggregation, what should we use? The two candidates that stick out to me as credible within the realm of simplicity are geometric aggregation of probabilities and geometric aggregation of odds.
I don't have a good reason for preferring one over the other - no empirical or theoretical case.
I would love to see a comparison of the geometric mean of probabilities and the geometric mean of odds, as in Satopaa et al, on either simulated or real data.
Ideally there would be an unambiguous answer motivating one of them or suggesting an alternative, but given the complexity of the papers I skimmed while writing this post, I am moderately skeptical that one will appear anytime soon.
I was talking about the geometric mean of odds and the geometric mean of probabilities as different things, but UnexpectedValues points out that after a (necessary) normalization they are one and the same.
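This equivalence is easy to check numerically: pooling the probabilities of both outcomes with a geometric mean and renormalizing recovers exactly the geometric mean of odds.

```python
import math

probs = [0.1, 0.2, 0.4]
n = len(probs)

# Route 1: geometric mean of odds, mapped back to a probability.
g_odds = math.prod(p / (1 - p) for p in probs) ** (1 / n)
via_odds = g_odds / (1 + g_odds)

# Route 2: geometric mean of each outcome's probabilities, then renormalize.
g_yes = math.prod(probs) ** (1 / n)
g_no = math.prod(1 - p for p in probs) ** (1 / n)
via_probs = g_yes / (g_yes + g_no)

print(via_odds, via_probs)  # identical up to floating point error
```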
Now, the remaining questions are:
1) Are the theoretical reasons given by Allard et al compelling (mainly external Bayesianity)?
2) Are there any credible alternatives that beat geometric aggregation on Satopaa et al's dataset?