When I see people grading their predictions, it's always by: (a) bucketing their predictions by probability (into a "46-55%" bucket, a "56-75%" bucket, ...), and then (b) plotting each bucket's nominal probability vs empirical frequency-of-correctness. See e.g. Scott Alexander here.

This seems... fine... but the bucketing step has a certain inelegance to it: just as you can build many different-looking histograms for the same dataset, you can build many different-looking calibration curves for the same predictions, based on a semi-arbitrary choice of bucketing algorithm. Also, by bucketing datapoints together and then aggregating over the bucket, information is destroyed.
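To make the bucketing complaint concrete, here is a minimal sketch (not taken from any particular grader's method) that builds a calibration curve from the same predictions under two different bucketings. The `calibration_curve` helper, the bucket edges, and the toy data are all hypothetical, chosen only for illustration.

```python
def calibration_curve(predictions, edges):
    """Bucket (probability, came_true) pairs by the given bucket edges.

    Returns one (mean stated probability, empirical frequency, count)
    tuple per non-empty bucket. Bucket i covers [edges[i], edges[i+1]),
    except the last bucket, which also includes its right edge.
    """
    n_buckets = len(edges) - 1
    buckets = [[] for _ in range(n_buckets)]
    for q, came_true in predictions:
        for i in range(n_buckets):
            is_last = i == n_buckets - 1
            if edges[i] <= q < edges[i + 1] or (is_last and q == edges[i + 1]):
                buckets[i].append((q, came_true))
                break
    curve = []
    for bucket in buckets:
        if bucket:
            mean_q = sum(q for q, _ in bucket) / len(bucket)
            freq = sum(1 for _, t in bucket if t) / len(bucket)
            curve.append((mean_q, freq, len(bucket)))
    return curve


# Hypothetical toy data: (stated probability, whether it came true).
predictions = [(0.1, False), (0.2, False), (0.3, True), (0.4, False),
               (0.6, True), (0.7, False), (0.8, True), (0.9, True)]

# Same predictions, two bucketing choices.
coarse = calibration_curve(predictions, [0.0, 0.5, 1.0])
fine = calibration_curve(predictions, [0.0, 0.25, 0.5, 0.75, 1.0])

for name, curve in (("coarse", coarse), ("fine", fine)):
    print(name, [(round(m, 2), round(f, 2), n) for m, f, n in curve])
```

With two buckets this toy predictor looks perfectly calibrated; with four buckets the same data gives a visibly different curve.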

For histograms, there's an information-preserving, zero-degree-of-freedom alternative: the CDF. The CDF isn't perfect, but it at least has a different set of problems from histograms.

Is there any similar tool for grading predictions?

Log of my attempts so far:

  • Attempt #1: note that, for any probability p, you can compute "number of predictions you made with probability less than p that came true". If you're perfectly-calibrated, then this should be a random variable with:

      mean = sum(q for q in prediction_probs if q<p)
      variance = sum(q*(1-q) for q in prediction_probs if q<p)

    Let's see what this looks like if we plot it as a function of p. Let's consider three people:

    • one perfectly-calibrated (green)
    • one systematically overconfident (red) (i.e. when they say "1%" or "99%" the true probability is more like 2% or 98%)
    • one systematically underconfident (blue) (i.e. when they say "10%" or "90%" the true probability is more like 5% or 95%).

    Let's have each person make 1000 predictions with probabilities uniformly distributed in [0,1]; and then sample outcomes for each set of predictions and plot out their num-true-predictions-below functions. (The gray lines show the mean and first 3 stdev intervals for a perfectly calibrated predictor.)

    Hrrm. The y-axis range is too big to see the variation. Let's subtract off the mean.

    And to get a feeling for how else this plot could have looked, let's run 100 more simulations for each of the three people:

    Okay, this is pretty good!

    • The overconfident (red) person tends to see way too many 1%-20% predictions come true, as evidenced by the red lines quickly rising past the +3stdev line in that range.
    • The underconfident (blue) person sees way too few 10%-40% predictions come true, as evidenced by the blue lines falling past the -3stdev line in that range.
    • The perfect (green) person stays within 1-2stdev of the mean.

    But it's not perfect: everything's too squished together on the left to see what's happening -- a predictor could be really screwing up their very-low-probability predictions and this graph would hide it. Possibly related to that squishing, I feel like the plot should be right-left symmetric, to reflect the symmetries of the predictors' biases. But it's not.

  • Attempt #2: the same thing, except instead of plotting

      sum((1 if came_true else 0) for q in prediction_probs if q<p)

    we plot

      sum(-log(prob you assigned to the correct outcome) for q in prediction_probs if q<p)

    i.e. we measure the total "surprisal" for all your predictions with probability under p. (I'm very fond of surprisal; it has some very appealing information-theory-esque properties.)

    On the bright side, this plot has less overlap between the three predictors' typical sets of lines. And the red curves look... more symmetrical, kinda, like an odd function, if you squint. Same for the blue curves.

    On the dark side, everything is still too squished together on the left. (I think this is a problem inherent to any "sum(... for q in prediction_probs if q<p)" function. I tried normalizing everything in terms of stdevs, but it ruined the symmetry and made everything kinda crazy on the left-hand side.)
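For anyone who wants to play with these, here is a rough sketch of both attempts, without the plotting. Several things in it are assumptions rather than the setup actually used for the plots above: the `scale_log_odds` miscalibration model and its factors (0.85 and 1.35) are hypothetical, the stated probabilities are drawn from [0.01, 0.99] rather than [0, 1] so that the surprisal stays finite, and the threshold p = 0.3 is arbitrary.

```python
import math
import random


def scale_log_odds(q, k):
    """Hypothetical miscalibration model: multiply the log-odds of q by k."""
    logit = math.log(q / (1 - q))
    return 1 / (1 + math.exp(-k * logit))


def simulate(stated_probs, true_prob_of, rng):
    """Sample an outcome for each stated probability q, where the event
    really happens with probability true_prob_of(q)."""
    return [(q, rng.random() < true_prob_of(q)) for q in stated_probs]


def num_true_below(predictions, p):
    """Attempt #1: number of predictions with probability < p that came
    true, plus the mean and variance expected under perfect calibration."""
    below = [(q, t) for q, t in predictions if q < p]
    observed = sum(1 for _, t in below if t)
    mean = sum(q for q, _ in below)
    variance = sum(q * (1 - q) for q, _ in below)
    return observed, mean, variance


def surprisal_below(predictions, p):
    """Attempt #2: total -log(probability assigned to the outcome that
    actually happened), over predictions with probability < p."""
    return sum(-math.log(q if t else 1 - q) for q, t in predictions if q < p)


rng = random.Random(0)
stated = [rng.uniform(0.01, 0.99) for _ in range(1000)]
predictors = {
    "calibrated": lambda q: q,
    # k < 1 pulls the true probability toward 50%: stated values are too extreme.
    "overconfident": lambda q: scale_log_odds(q, 0.85),
    # k > 1 pushes the true probability away from 50%: stated values are too timid.
    "underconfident": lambda q: scale_log_odds(q, 1.35),
}

for name, true_prob_of in predictors.items():
    predictions = simulate(stated, true_prob_of, rng)
    observed, mean, variance = num_true_below(predictions, 0.3)
    z = (observed - mean) / math.sqrt(variance)
    total_surprisal = surprisal_below(predictions, 0.3)
    print(f"{name}: z={z:.2f}, surprisal={total_surprisal:.1f}")
```

Sweeping p from 0 to 1 and plotting `observed - mean` (or the surprisal) against p should give curves of the kind described above, though the exact shapes depend on the assumed miscalibration model.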

There is the Brier score, or any other proper scoring rule. These each have the advantage of being zero-degree-of-freedom up to the choice of scoring rule, though they aren't information-preserving and aren't comparable across different sets of predictions. (Though neither is any analogue of a CDF.)

The problem is that this measures their amount of knowledge about the questions as well as their calibration.
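A small sketch of that point, using the Brier score as the scoring rule. The two predictors and their data are hypothetical; both are perfectly calibrated on their own sample, yet they get different scores because one knows more about the questions.

```python
def brier_score(predictions):
    """Mean squared difference between the stated probability and the
    outcome (1 if it came true, 0 otherwise). Lower is better."""
    return sum((q - (1.0 if t else 0.0)) ** 2 for q, t in predictions) / len(predictions)


# Hypothetical predictor who knows nothing: always says 50%, right half the time.
ignorant = [(0.5, True)] * 5 + [(0.5, False)] * 5

# Hypothetical predictor who knows a lot: always says 90%, right 9 times in 10.
informed = [(0.9, True)] * 9 + [(0.9, False)] * 1

print("ignorant:", brier_score(ignorant))
print("informed:", brier_score(informed))
```

In both cases the stated probability matches the empirical frequency exactly, so a score gap between them reflects knowledge, not calibration.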

My model would be as follows. For a fixed source of questions, each person has a distribution describing how much they know about the questions. It describes how likely it is that a given question is one they should say p on. Each person also has a calibration function f, such that when they should say p they instead say f(p). Then by assigning priors over the spaces of these distributions and calibration functions, and applying Bayes' rule we get a...

Agreed that the proposal combines knowledge with calibration, but your procedure doesn't actually seem implementable.

Let  be your estimate of the probability of some event.

Let's define the penalty as  if the event happened, and  otherwise.

If  is the probability of the event, the expected penalty equals:

And expected penalty derivative equals 

So, if you try to minimize your average penalty, you are motivated to give your best possible estimate. It's possible that you can use this to grade your own predictions; I don't know for sure.
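As one worked instance of this kind of argument, here is the derivation for the logarithmic penalty. This is an assumption: the penalty defined above may be a different one (the quadratic, Brier-style penalty behaves the same way). The symbols are also chosen here for illustration: p is the stated probability and q is the true probability of the event.

```latex
% Assumption: logarithmic penalty; p = stated probability, q = true probability.
\begin{align*}
L(p) &=
  \begin{cases}
    -\ln p       & \text{if the event happened} \\
    -\ln(1 - p)  & \text{otherwise}
  \end{cases} \\
\mathbb{E}[L](p) &= -q \ln p - (1 - q) \ln(1 - p) \\
\frac{d}{dp}\, \mathbb{E}[L](p) &= -\frac{q}{p} + \frac{1 - q}{1 - p}
  = \frac{p - q}{p\,(1 - p)}
\end{align*}
```

The derivative is negative for p < q and positive for p > q, so under this assumed penalty the expected penalty is minimized exactly at p = q, i.e. by reporting the true probability.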
