Review

A forecaster is well-calibrated if, for every $p$, of the propositions that they assign probability approximately $p$ to, the fraction of them that are true is approximately $p$. However, there is no natural probability distribution over propositions, so this notion is not well-defined.

Often, people aren't even using an implicit probability distribution over propositions when they talk about calibration, and instead are referring to limiting densities over a particular sequence of propositions. For instance, a forecaster may be asked to predict every bit in a bitstream, and be judged well-calibrated if, for every $p$, the fraction that are true, among the first $n$ propositions they assign probability approximately $p$ to, converges to approximately $p$ as $n$ goes to infinity.
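
To make this sequence-based notion concrete, here is a minimal sketch in Python (the function name and the binning scheme are my own, not anything standard): it groups a forecaster's probability assignments into bins and compares each bin's empirical frequency of true propositions with its nominal probability.

```python
import random
from collections import defaultdict

def calibration_table(forecasts, outcomes, num_bins=10):
    """Group forecasts into probability bins and report, for each bin, the
    fraction of the corresponding propositions that turned out to be true."""
    bins = defaultdict(list)
    for p, outcome in zip(forecasts, outcomes):
        k = min(int(p * num_bins), num_bins - 1)  # bin for "probability approximately p"
        bins[k].append(outcome)
    return {
        (k + 0.5) / num_bins: (sum(results) / len(results), len(results))
        for k, results in sorted(bins.items())
    }

# Example: independent biased coin flips, forecast with their true probabilities.
random.seed(0)
true_probs = [random.random() for _ in range(10_000)]
outcomes = [random.random() < p for p in true_probs]

for nominal, (empirical, count) in calibration_table(true_probs, outcomes).items():
    print(f"assigned ~{nominal:.2f}: fraction true {empirical:.3f} over {count} propositions")
```

On independent questions like these simulated coin flips, a forecaster who reports the true probabilities comes out close to calibrated in every bin; the examples below show how strongly this depends on which propositions you feed in.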

Calibration is not just a relationship between probability assignments and the truth, but a relationship between probability assignments, the truth, and some model for what it means to say that some percentage of a set of propositions is true. This model could be a probability distribution over propositions, or an explicit sequence of them. The dependence on what you mean by percentage of propositions is fairly dramatic.

For any atomless probability measure, you can pick a sequence of propositions such that, in terms of limiting frequencies along the sequence, the probability distribution is guaranteed to be well-calibrated, no matter what the ground truth is. To make a sequence of propositions, all of which are given probability $m/n$, and limiting frequency $m/n$ of which are true: First pick a sequence $X_1, X_2, X_3, \ldots$ of independent random variables, where each $X_i$ is uniformly distributed on $\{0, 1, \ldots, n-1\}$. For each $i$ and each $k$ with $0 \le k < n$, let $\phi_{i,k}$ be the proposition that $X_i \in \{k, k+1, \ldots, k+m-1\}$, with the elements reduced mod $n$. For each $i$, exactly $m$ of the propositions $\phi_{i,0}, \ldots, \phi_{i,n-1}$ are true, no matter what $X_i$ actually is. So if you list the $\phi_{i,k}$ in order of increasing $i$, then the fraction of them that are true converges to $m/n$ as you go along the list. Now you can create a sequence of propositions with varying probabilities assigned to them, well-calibrated at every probability level, by interspersing these sequences for each rational number between $0$ and $1$.
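
Here is a quick sketch of this construction as reconstructed above (with $\phi_{i,k}$ asserting that $X_i$ falls in a cyclic block of $m$ residues mod $n$): however the $X_i$ happen to be realized, exactly $m$ out of every group of $n$ propositions are true, so the running frequency is pinned to $m/n$.

```python
import random

def phi(x_i, k, m, n):
    """Proposition phi_{i,k}: X_i lies in the cyclic block {k, k+1, ..., k+m-1} mod n."""
    return (x_i - k) % n < m

m, n = 3, 7            # every proposition gets assigned probability m/n
random.seed(1)
xs = [random.randrange(n) for _ in range(2_000)]  # the ground truth, arbitrary

truths = [phi(x, k, m, n) for x in xs for k in range(n)]
print(f"fraction true: {sum(truths) / len(truths):.4f}   assigned probability: {m / n:.4f}")

# This is not a coincidence of the sample: for each i, exactly m of the n
# propositions phi_{i,0}, ..., phi_{i,n-1} are true, whatever value X_i takes.
assert all(sum(phi(x, k, m, n) for k in range(n)) == m for x in xs)
```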

Alternatively, for any atomless probability measure, you can pick a sequence of propositions such that, in terms of limiting frequencies along the sequence, the probability distribution is extremely poorly calibrated, no matter what the ground truth is. To do this, let $X$ be a random variable distributed uniformly on $[0,1]$. For each rational $p \in (0,1)$, let $\psi_p$ be the proposition that $X < p$, which gets assigned probability $p$. No matter what $X$ actually is, for each $p$ (except for one, namely $p = X$), either all of the $\psi_q$ for rational $q$ close to $p$ are true, or none of them are, rather than the desired fraction $p$ of them. One might object to this example on the grounds that the reason for poor calibration is that every proposition that gets assigned probability approximately $p$ is approximately the same proposition. But this is merely an extreme version of something that could realistically happen with real forecasting questions; sufficiently consequential events can have causal effects on a large fraction of the questions a forecaster might predict, so a non-negligible-probability event may throw off even a good forecaster's calibration via correlated effects across many questions.
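
And a sketch of this second construction, again as reconstructed above: every proposition $X < p$ is assigned probability $p$, but once $X$ is realized, each probability bin (other than the one containing $X$) comes out either all true or all false.

```python
import random

random.seed(2)
x = random.random()  # the single underlying random variable, uniform on [0, 1]

# Propositions psi_p: "X < p", each assigned probability p, on a grid of rational p.
ps = [k / 1000 for k in range(1, 1000)]
outcomes = [x < p for p in ps]

# Within each probability bin, the fraction of true propositions is 0 or 1
# (except in the one bin containing x), rather than the bin's nominal probability.
num_bins = 10
for b in range(num_bins):
    group = [o for p, o in zip(ps, outcomes) if b / num_bins <= p < (b + 1) / num_bins]
    print(f"assigned [{b / num_bins:.1f}, {(b + 1) / num_bins:.1f}): "
          f"fraction true {sum(group) / len(group):.2f}")
```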

The previous two sequences of propositions don't have to be on different subjects; they could be different ways of organizing the exact same information. For instance, the propositions in the first example could be about the digits in the base-$n$ expansions of the random variable in the second example. So this is more fundamental than just that people may be well-calibrated on some topics but poorly calibrated on others; how well someone scores on a calibration test will depend on how the test is organized, not just on what information they're forecasting.
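
To make the reorganization point concrete under these reconstructions: the binary digits of the very same uniform $X$ used in the previous sketch are the $X_i$ of the $m = 1$, $n = 2$ case of the first example, and the digit-level propositions are automatically well-calibrated even though the threshold propositions about the same $X$ were not.

```python
import random

def binary_digits(x, count):
    """First `count` digits of x in base 2, for x in [0, 1)."""
    digits = []
    for _ in range(count):
        x *= 2
        digits.append(int(x))
        x -= int(x)
    return digits

random.seed(2)
x = random.random()  # the same x as in the previous sketch (same seed)

# Reorganize the information in x as digit-level propositions "digit i equals k",
# each assigned probability 1/2.  Exactly one of each pair is true, whatever x is,
# so this way of slicing the same information is well-calibrated, while the
# threshold propositions "x < p" about the same x were not.
digit_props = [d == k for d in binary_digits(x, 50) for k in (0, 1)]
print(f"fraction true among propositions assigned probability 1/2: "
      f"{sum(digit_props) / len(digit_props):.2f}")
```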

So what are people actually measuring when they measure calibration? What it's intended to measure is the failure to form a coherent probability distribution at all, rather than any notion of how accurate a given probability distribution is. Forecasters don't give explicit probability distributions over everything that could happen; they just attach numbers to a certain set of propositions, and these numbers are intended to be interpreted as probabilities. But if you produce these numbers by thinking about how qualitatively likely something is and then attempting to represent that with an appropriate-seeming number, then these aren't likely to actually be the probabilities that those propositions have in any actual probability distribution. That is, if you report probabilities for propositions by having some underlying probability distribution and then applying some monotonically increasing bijection $f$ (other than the identity) to the probabilities of all propositions, then you will be poorly calibrated on sequences of propositions on which the original probability distribution is well-calibrated, and you won't be reporting probabilities from any actual probability distribution, because, for instance, if you take three equally likely possible outcomes, exactly one of which must occur, then the probability you assign to each of these will be $f(1/3)$, and their sum will be $3f(1/3)$, rather than $1$.

A suggested way to train to become calibrated is to try giving probabilities to large numbers of propositions whose truth-values can be checked later, and then calculate the fraction $q$ of times the propositions you assigned probability approximately $p$ to turned out to be true, so that the next time you feel the level of confidence you previously assigned probability $p$ to, you can assign it probability $q$ instead. If your reported probabilities were a monotonically increasing function $f$ of the probabilities of some probability distribution which is well-calibrated on the given propositions, then this allows you to learn and undo the function $f$, so you can accurately report the probabilities from your underlying distribution. But if you already were reporting probabilities from a coherent probability distribution, and that probability distribution was poorly calibrated on the given propositions for whatever reason, then this will make you a worse forecaster, by making the probabilities you give not even form a coherent probability distribution. For example, if $X$ and $Y$ are independent random variables uniformly distributed on $[0,1]$, then forecasting a bunch of propositions of the form $X < p$, then discovering the true value of $X$, and adjusting so that you only ever give probabilities $0$ or $1$ so that you would have been well-calibrated on those questions, will make you worse at forecasting $Y$. So calibration training implicitly assumes that if you adjust your probabilities so that they are coherent, then they will be well-calibrated on the questions people tend to train on in practice. I expect this assumption is likely to be close to true, provided the questions being forecasted are sufficiently diverse that correlations between them don't throw off calibration.
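
Here is a sketch of the benign case of calibration training, with an invented distortion $f(p) = p^2$ standing in for the appropriate-seeming-number habit: binning the reported probabilities and recording the empirical frequency in each bin approximately recovers the underlying probabilities, i.e. it learns to undo $f$.

```python
import random

def learn_recalibration(reported, outcomes, num_bins=20):
    """Calibration training: for each bin of reported probabilities, record the
    empirical frequency with which the corresponding propositions were true."""
    sums = [0.0] * num_bins
    counts = [0] * num_bins
    for r, o in zip(reported, outcomes):
        b = min(int(r * num_bins), num_bins - 1)
        sums[b] += o
        counts[b] += 1
    return {b: sums[b] / counts[b] for b in range(num_bins) if counts[b]}

random.seed(4)
true_probs = [random.random() for _ in range(50_000)]
outcomes = [random.random() < p for p in true_probs]

# A forecaster with a coherent, well-calibrated underlying distribution who
# reports f(p) = p**2 instead of p.
reported = [p ** 2 for p in true_probs]

remap = learn_recalibration(reported, outcomes)
for b in (1, 5, 10, 15, 19):
    midpoint = (b + 0.5) / 20
    print(f"reported ~{midpoint:.3f} -> empirically true {remap[b]:.3f} "
          f"(undoing f would give {midpoint ** 0.5:.3f})")
```

Applied instead to the threshold propositions about a single $X$ from the earlier sketch, the same procedure would learn a remapping that sends nearly every report to $0$ or $1$, which is the failure mode described above.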

1 comment:

I guess I would summarize by saying:

  • If the things you're predicting are completely independent, then naive "calibration" works fine: if you're good at putting things into an "80% likely" bucket, then in practice ~80% of those predictions will be true.
  • If the things you're predicting are highly correlated with each other—e.g. questions like "Will company X fail?", "Will company Y fail?", and so on, when the most likely way for company X to fail involves a general economic downturn that affects all the companies—then even if you were perfect at putting propositions into the 5% bucket, the actual outcomes may look a lot more like "0% became true" or "100% became true" than like "5% became true" (see the sketch after this list).
  • Therefore, when evaluating someone's calibration, or creating a set of predictions one plans to evaluate later, one should take these correlations into account.
    • If one expects correlated outcomes, probably the best thing is to factor out the correlated part into its own prediction—e.g. "Chance of overall downturn [i.e. GDP is below X or something]: 4%" and "Chance of company X failing, conditional on overall downturn: 70%" and "Chance of company X failing, conditional on no downturn: 2.3%" (which comes out to ~5% total).
    • If the predictor didn't do this, but there was an obvious-in-retrospect common cause affecting many propositions... well, you still don't know what probability the predictor would have assigned to that common cause, which is unfortunate, and makes it difficult to judge.  Seems like the most rigorous thing you can do is pick one of the correlated propositions, and throw out the rest, so that the resulting set of propositions is (mostly) independent.  If this leaves you with too few propositions to do good statistics with, that is unfortunate.
      • One might think that if you're evaluating buckets separately (e.g. "the 80% bucket", "the 90% bucket"), it's ok if there's a proposition in one bucket that's correlated with a proposition in another bucket; as long as there's no correlation within each bucket, it remains the case that, if the predictor was good, then ~80% of the propositions in the 80% bucket should be true.  But then you can't do a meta-evaluation at the end that combines the results of separate buckets: e.g. if they said "5% company X fails, 10% company Y fails, 15% company Z fails, 20% company Q fails", and there was a downturn and they all failed, then saying "The predictor tended to be underconfident" would be illegitimate.
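
A small simulation of the downturn scenario in these bullets (the 4%, 70%, and 2.3% figures are the comment's hypothetical numbers; the 200 companies and 15 years are arbitrary choices of mine): every company is correctly assigned roughly a 5% failure probability, but because failures share a common cause, the realized fraction of failures in any given year clusters near 2.3% or 70% rather than 5%.

```python
import random

random.seed(5)

# The comment's hypothetical numbers: 4% chance of a downturn, 70% chance a given
# company fails in a downturn, 2.3% chance it fails otherwise.
p_downturn, p_fail_down, p_fail_ok = 0.04, 0.70, 0.023
p_fail = p_downturn * p_fail_down + (1 - p_downturn) * p_fail_ok
print(f"unconditional failure probability per company: {p_fail:.4f}")  # ~0.05

num_companies, num_years = 200, 15
for year in range(num_years):
    downturn = random.random() < p_downturn
    p = p_fail_down if downturn else p_fail_ok
    failed = sum(random.random() < p for _ in range(num_companies))
    # Every company was (correctly) put in the ~5% bucket, yet within a single
    # year the realized fraction clusters near 0.023 or 0.70, not near 0.05.
    print(f"year {year:2d}: downturn={downturn}, fraction failed={failed / num_companies:.3f}")
```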