A forecaster is well-calibrated if, for every $p \in [0,1]$, of the propositions that they assign probability approximately $p$ to, the fraction of them that are true is approximately $p$. However, there is no natural probability distribution over propositions, so this notion is not well-defined.

Often, people aren't even using an implicit probability distribution over propositions when they talk about calibration, and instead are referring to limiting densities over a particular sequence of propositions. For instance, a forecaster may be asked to predict every bit in a bitstream, and be judged well-calibrated if, for every $p$, among the first $n$ propositions that they assign probability approximately $p$ to, the fraction that are true approximately converges to $p$ as $n$ goes to infinity.
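To make the limiting-frequency notion concrete, here is a minimal sketch of how one might track it along a sequence. The function name `running_calibration`, the tolerance `tol`, and the simulated bitstream (bits that are 1 with probability 0.7, forecast at 0.7 every time) are all illustrative assumptions, not anything specified above.

```python
import random

def running_calibration(forecasts, outcomes, p, tol=0.05):
    """Among propositions forecast within `tol` of p, return the running
    fraction that are true after each such proposition (in sequence order)."""
    hits = 0
    count = 0
    fractions = []
    for forecast, outcome in zip(forecasts, outcomes):
        if abs(forecast - p) <= tol:
            count += 1
            hits += outcome
            fractions.append(hits / count)
    return fractions

# Hypothetical bitstream: each bit is 1 with probability 0.7 (an assumption
# made only for this illustration), and the forecaster always says 0.7.
rng = random.Random(0)
bits = [1 if rng.random() < 0.7 else 0 for _ in range(20000)]
forecasts = [0.7] * len(bits)

fractions = running_calibration(forecasts, bits, p=0.7)
# Early values are noisy; later values should settle near 0.7.
print(fractions[9], fractions[-1])
```

Note that the result depends on the order and selection of the propositions fed in, which is exactly the dependence the rest of this post is about.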

Calibration is not just a relationship between probability assignments and the truth, but a relationship between probability assignments, the truth, and some model for what it means to say that some percentage of a set of propositions is true. This model could be a probability distribution over propositions, or an explicit sequence of them. The dependence on what you mean by percentage of propositions is fairly dramatic.

For any atomless probability measure, you can pick a sequence of propositions such that, in terms of limiting frequencies along the sequence, the probability distribution is guaranteed to be well-calibrated, no matter what the ground truth is. To make a sequence of propositions, all of which are given probability $p = \frac{a}{b}$, and limiting frequency $p$ of which are true: First pick a sequence $X_1, X_2, \ldots$ of independent random variables, where each $X_n$ is uniformly distributed on $\{0, 1, \ldots, b-1\}$. For each $n$ and $k$ with $0 \le k < b$, let $A_{n,k}$ be the proposition that $(X_n - k) \bmod b < a$. For each $n$, exactly $a$ of the $b$ propositions $A_{n,k}$ are true, no matter what $X_n$ actually is. So if you list the $A_{n,k}$ in order of increasing $n$, then the fraction of them that are true converges to $p$ as you go along the list. Now you can create sequences of propositions with varying probabilities assigned to them, and well-calibrated on all probability levels, by interspersing these sequences for each rational number $p$ between $0$ and $1$.
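Here is a small sketch of one way to realize this kind of construction for a rational probability a/b. The specific choice of propositions (cyclic windows of a consecutive values out of b) and the function name are my own illustrative assumptions; the point is only that the fraction of true propositions is pinned to a/b regardless of what the random draws turn out to be.

```python
import random
from fractions import Fraction

def always_calibrated_fraction(a, b, n_draws, rng):
    """For each draw x, uniform on {0, ..., b-1}, form b propositions
    "x lies in the cyclic window {k, ..., k+a-1} mod b" for k = 0..b-1.
    Each proposition has probability a/b. Return the fraction that are true."""
    true_count = 0
    total = 0
    for _ in range(n_draws):
        x = rng.randrange(b)
        for k in range(b):
            # As k runs over 0..b-1, (x - k) % b takes each value in 0..b-1
            # exactly once, so exactly a of these b propositions are true.
            if (x - k) % b < a:
                true_count += 1
            total += 1
    return Fraction(true_count, total)

rng = random.Random(1)
print(always_calibrated_fraction(3, 10, 1000, rng))  # 3/10
```

Changing the seed changes every individual draw but not the result, since each draw contributes exactly a true propositions out of b.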

Alternatively, for any atomless probability measure, you can pick a sequence of propositions such that, in terms of limiting frequencies along the sequence, the probability distribution is extremely poorly calibrated, no matter what the ground truth is. To do this, let $X$ be a random variable distributed uniformly on $[0,1]$. For each $q \in [0,1]$, let $B_q$ be the proposition that $X < q$. $B_q$ gets assigned probability $q$. No matter what $X$ actually is, for each $p$ (except for one, namely $p = X$), either all of the $B_q$ for rational $q \approx p$ are true, or none of them are, rather than the desired fraction $p$ of them. One might object to this example on the grounds that the reason for poor calibration is that every proposition that gets assigned probability approximately $p$ is approximately the same proposition. But this is merely an extreme version of something that could realistically happen with real forecasting questions; sufficiently consequential events can have causal effects on a large fraction of questions a forecaster might predict, so a non-negligible-probability event may throw off even a good forecaster's calibration via correlated effects across many questions.
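A quick sketch of this failure mode: take one uniformly distributed value, forecast the propositions "the value is below q" at probability q for q on a grid, and look at the fraction true in each probability bucket. The grid size, bucket count, and function name are illustrative assumptions.

```python
import random

def bucket_fractions(x, grid=1000, n_buckets=10):
    """For a realized value x, consider propositions "X < q" for q = i/grid,
    each forecast at probability q. Return, for each probability bucket,
    the fraction of its propositions that are true."""
    totals = [0] * n_buckets
    trues = [0] * n_buckets
    for i in range(grid):
        q = i / grid                     # forecast probability for "X < q"
        bucket = i * n_buckets // grid   # bucket index, from integer arithmetic
        totals[bucket] += 1
        trues[bucket] += x < q           # True counts as 1, False as 0
    return [t / n for t, n in zip(trues, totals)]

rng = random.Random(3)
x = rng.random()
print(x)
print(bucket_fractions(x))
```

Every bucket entirely below x comes out as 0.0 and every bucket entirely above x comes out as 1.0; only the single bucket containing x can land in between. The forecasts are the true probabilities, yet the calibration table looks terrible.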

The previous two sequences of propositions don't have to be on different subjects; they could be different ways of organizing the exact same information. For instance, the propositions in the first example could be about the digits in the base-$b$ expansion of the random variable in the second example. So this is more fundamental than just that people may be well calibrated on some topics but poorly calibrated on others; how well someone scores on a calibration test will depend on how the test is organized, not just on what information they're forecasting.

So what are people actually measuring when they measure calibration? What it's intended to measure is failure to form a coherent probability distribution at all, rather than any notion of how accurate a given probability distribution is. Forecasters don't give explicit probability distributions over everything that could happen; they just attach numbers to a certain set of propositions, and these numbers are intended to be interpreted as probabilities. But if you produce these numbers by thinking about how qualitatively likely something is and then attempting to represent that with an appropriate-seeming number, then these aren't likely to actually be the probabilities that those propositions have in any actual probability distribution.

That is, suppose you report probabilities for propositions by having some underlying probability distribution, and then applying some monotonically increasing bijection $f \colon [0,1] \to [0,1]$, other than the identity, to the probabilities of all propositions. Then you will be poorly calibrated on sequences of propositions on which the original probability distribution is well-calibrated. You also won't be reporting probabilities from any actual probability distribution, because, for instance, if you take three equally likely possible outcomes, exactly one of which must occur, then the probabilities you assign to each of these will be $f(\tfrac{1}{3})$, and their sum will be $3 f(\tfrac{1}{3})$, which is generally not $1$.

A suggested way to train to become calibrated is to try giving probabilities to large numbers of propositions whose truth-values can be checked later, and then calculate the fraction $q$ of times the propositions you assigned probability approximately $p$ to turned out to be true, so that next time you feel the level of confidence you previously assigned probability $p$ to, you can now assign it probability $q$ instead. If your reported probabilities were a monotonically increasing function $f$ of the probabilities of some probability distribution which is well-calibrated on the given propositions, then this allows you to learn and undo the function $f$, so you can accurately report the probabilities from your underlying distribution.

But if you already were reporting probabilities from a coherent probability distribution, and that probability distribution was poorly calibrated on the given propositions for whatever reason, then this will make you a worse forecaster, by making the probabilities you give not even form a coherent probability distribution. For example, if $X$ and $Y$ are independent random variables uniformly distributed on $[0,1]$, then forecasting a bunch of propositions of the form $X < p$, and then discovering the true value of $X$, and adjusting so that you only ever give probabilities $0$ or $1$ so that you would have been well-calibrated on those questions, will make you worse at forecasting $Y$. So calibration training implicitly assumes that if you adjust your probabilities so that they are coherent, then they will be well-calibrated on the questions people tend to train on in practice. I expect this assumption is likely to be close to true, provided the questions being forecasted are sufficiently diverse that correlations between them don't throw off calibration.
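The example with two independent uniform variables can be simulated directly. The sketch below scores forecasts with the Brier score (mean squared error between forecast and outcome), which is my choice of accuracy measure for illustration and not something the argument depends on. The "recalibrated" forecaster has seen one realized value x of the first variable, learned that propositions of the form "value below p" were true exactly when p > x, and now applies that all-or-nothing map to the same kind of proposition about the second variable.

```python
import random

def brier(forecasts, outcomes):
    """Mean squared difference between forecasts and 0/1 outcomes (lower is better)."""
    return sum((f - o) ** 2 for f, o in zip(forecasts, outcomes)) / len(forecasts)

rng = random.Random(2)
x = rng.random()                               # realized value of the first variable
thresholds = [i / 100 for i in range(1, 100)]  # propositions "value < p"

def recalibrated(p):
    # Map learned from the first variable: "X < p" turned out true iff p > x.
    return 1.0 if p > x else 0.0

coherent_scores, recal_scores = [], []
for _ in range(2000):
    y = rng.random()                           # independent second variable
    outcomes = [1 if y < p else 0 for p in thresholds]
    # Coherent forecaster: reports p for "Y < p", which is the true probability.
    coherent_scores.append(brier(thresholds, outcomes))
    # Recalibrated forecaster: reports only 0 or 1.
    recal_scores.append(brier([recalibrated(p) for p in thresholds], outcomes))

mean_coherent = sum(coherent_scores) / len(coherent_scores)
mean_recal = sum(recal_scores) / len(recal_scores)
print(mean_coherent, mean_recal)
```

In expectation the coherent forecaster's score is about 1/6, while the recalibrated forecaster's is x²/2 + (1 − x)²/2, which is at least 1/4 for any x, so the "calibration fix" makes the forecasts about the second variable strictly worse.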

I guess I would summarize by saying:

… one of the correlated propositions, and throw out the rest, so that the resulting set of propositions is (mostly) independent. If this leaves you with too few propositions to do good statistics with, that is unfortunate.

… within each bucket, it remains the case that, if the predictor was good, then ~80% of the propositions in the 80% bucket should be true. But then you can't do a meta-evaluation at the end that combines the results of separate buckets: e.g. if they said "5% company X fails, 10% company Y fails, 15% company Z fails, 20% company Q fails", and there was a downturn and they all failed, then saying "The predictor tended to be underconfident" would be illegitimate.