On platforms like PredictionBook, a user's credibility is measured by looking at a histogram of all of their predictions (edit: after being told about the "Brier score", I now notice that PredictionBook totally shows a Brier score right under the histogram. Still, thinking about Brier scores, they're not very good either: since the score rewards accuracy, not just calibration, you can improve it simply by only betting on things you're nearly certain of. That creates a perverse incentive to avoid placing bets on uncertain events, which we already have enough of in natural discourse). For a perfectly calibrated agent, exactly 90% of their 0.9 predictions should have come true, exactly 20% of their 0.2 predictions should have come true, and so on. We can reduce a person's calibration to a single number by, I suppose, summing the squares of the differences between each bucket's observed frequency and its midpoint (I don't know if that's how they do it).
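For concreteness, here's a minimal Python sketch of both scores as I understand them. The function names and the exact bucketing rule are my own guesses, not anything PredictionBook documents:

```python
def brier_score(forecasts, outcomes):
    """Mean squared error between stated probabilities and 0/1 outcomes.
    Lower is better; always guessing 0.5 earns 0.25."""
    return sum((p - o) ** 2 for p, o in zip(forecasts, outcomes)) / len(forecasts)

def histogram_calibration(forecasts, outcomes, n_bins=10):
    """The ad-hoc score described above: the sum, over buckets, of the
    squared gap between each bucket's observed frequency and its midpoint."""
    score = 0.0
    for i in range(n_bins):
        lo, hi = i / n_bins, (i + 1) / n_bins
        # Assign p == 1.0 to the last bucket so nothing falls off the end.
        bucket = [o for p, o in zip(forecasts, outcomes)
                  if lo <= p < hi or (i == n_bins - 1 and p == 1.0)]
        if bucket:
            observed = sum(bucket) / len(bucket)
            score += (observed - (lo + hi) / 2) ** 2
    return score
```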
But whatever they do with the histogram, I think I'll find it very unsatisfying, because the results are going to depend on what spacing we used for the histogram segments, and that's arbitrary. We could have a histogram with ten segments, or a histogram with fifty segments, and we'd get different scores. I feel like there must be some exact, continuous way of scoring a predictor's calibration, and that the math will probably be generally useful for other stuff. Is there a method?
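To make the arbitrariness concrete, here's a small simulation (reusing the hypothetical `histogram_calibration` from the sketch above) in which a perfectly calibrated forecaster gets different scores purely because of the bucket count:

```python
import random

random.seed(0)
# A perfectly calibrated predictor: each forecast p comes true with
# probability exactly p.
forecasts = [random.random() for _ in range(5000)]
outcomes = [1 if random.random() < p else 0 for p in forecasts]

for n_bins in (10, 50):
    print(n_bins, histogram_calibration(forecasts, outcomes, n_bins))
# Prints two different scores for the same predictor; only the analyst's
# choice of bucket count changed.
```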
Doesn't that indicate that this is using histogram buckets? I'm trying to say I'm looking for methods that avoid grouping probabilities into an arbitrary number of categories (chosen by the analyst). For instance, in the (possibly straw) histogram method I described in the question, if a predictor makes a lot of 0.97 bets and no corresponding 0.93 bets, their [0.9, 1] bucket will look slightly underconfident even if those forecasts came true exactly 97% of the time. I wouldn't describe anything in that genre as exact, even if it is the best we have.
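Working that through as arithmetic, using the bucket-midpoint convention from the sketch above:

```python
# 1000 hypothetical bets, all at p = 0.97, of which exactly 970 came
# true -- exact calibration at that probability.
observed = 970 / 1000                # 0.97
midpoint = (0.9 + 1.0) / 2           # 0.95, centre of the [0.9, 1] bucket
print((observed - midpoint) ** 2)    # 0.0004: a nonzero penalty anyway
```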