Log of my attempts so far:
Attempt #1: note that, for any probability p, you can compute "number of predictions you made with probability less than p that came true". If you're perfectly-calibrated, then this should be a random variable with:
mean = sum(q for q in prediction_probs if q<p)
variance = sum(q*(1-q) for q in prediction_probs if q<p)
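As a rough sketch of those two formulas next to the observed count: the names `outcomes` and `count_stats_below` are mine, and the variance formula assumes the outcomes are independent of each other.

```python
def count_stats_below(prediction_probs, outcomes, p):
    """Observed count of true predictions with stated probability < p,
    plus the mean and variance that count would have under perfect calibration.

    `outcomes` is a hypothetical list of booleans, one per prediction.
    """
    selected = [(q, hit) for q, hit in zip(prediction_probs, outcomes) if q < p]
    observed = sum(1 for _, hit in selected if hit)
    mean = sum(q for q, _ in selected)
    variance = sum(q * (1 - q) for q, _ in selected)  # assumes independent outcomes
    return observed, mean, variance


probs = [0.1, 0.3, 0.5, 0.9]
hits = [False, True, False, True]
observed, mean, variance = count_stats_below(probs, hits, 0.6)
```

Here only the first three predictions fall below p = 0.6, so the 0.9 prediction is ignored.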
Let's see what this looks like if we plot it as a function of p. Let's consider three people:
Let's have each person make 1000 predictions with probabilities uniformly distributed in [0,1]; and then sample outcomes for each set of predictions and plot out their num-true-predictions-below functions. (The gray lines show the mean and first 3 stdev intervals for a perfectly calibrated predictor.)
Hrrm. The y-axis is too big to see the variation. Let's subtract off the mean.
And to get a feeling for how else this plot could have looked, let's run 100 more simulations for each of the three people:
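A minimal sketch of the simulation described above, without the plotting. The function names, the fixed seed, and the particular overconfident predictor are my own assumptions for illustration; they are not the three people from the plots.

```python
import random


def centered_count_curve(prediction_probs, outcomes):
    """Return (q, observed - expected, stdev) triples in increasing order of q.

    Each triple covers all predictions with stated probability up to and
    including q, so it is the mean-subtracted curve evaluated just above q.
    """
    curve = []
    deviation = 0.0  # running (observed count - calibrated mean)
    variance = 0.0   # running variance under perfect calibration
    for q, hit in sorted(zip(prediction_probs, outcomes)):
        deviation += (1.0 if hit else 0.0) - q
        variance += q * (1.0 - q)
        curve.append((q, deviation, variance ** 0.5))
    return curve


def simulate(true_prob_of, n=1000, seed=0):
    """Draw n stated probabilities uniformly from [0, 1] and sample outcomes
    from true_prob_of(q), the hypothetical real frequency behind a stated q."""
    rng = random.Random(seed)
    probs = [rng.random() for _ in range(n)]
    outcomes = [rng.random() < true_prob_of(q) for q in probs]
    return centered_count_curve(probs, outcomes)


calibrated = simulate(lambda q: q)
# Hypothetical overconfident predictor: real frequencies are pulled toward 0.5.
overconfident = simulate(lambda q: 0.5 + 0.6 * (q - 0.5))
```

Running `simulate` with 100 different seeds per predictor gives the kind of bundle of curves plotted here; the stdev column is what the gray bands would be drawn from.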
Okay, this is pretty good!
But it's not perfect: everything's too squished together on the left to see what's happening -- a predictor could be really screwing up their very-low-probability predictions and this graph would hide it. Possibly related to that squishing, I feel like the plot should be right-left symmetric, to reflect the symmetries of the predictors' biases. But it's not.
Attempt #2: the same thing, except instead of plotting
sum((1 if came_true else 0) for q in prediction_probs if q<p)
we plot
sum(-log(prob you assigned to the correct outcome) for q in prediction_probs if q<p)
i.e. we measure the total "surprisal" for all your predictions with probability under p. (I'm very fond of surprisal; it has some very appealing information-theory-esque properties.)
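A hedged sketch of that surprisal sum. I'm assuming the mean is subtracted off as in Attempt #1 (for a calibrated predictor the mean surprisal of a single prediction is the binary entropy of q), that every q is strictly between 0 and 1, and that logs are natural logs. The helper names are mine.

```python
import math


def surprisal(q, came_true):
    """-log of the probability assigned to the outcome that actually occurred."""
    return -math.log(q if came_true else 1.0 - q)


def expected_surprisal(q):
    """Mean surprisal of a prediction stated at q when q is the true frequency
    (the binary entropy of q, in nats). Requires 0 < q < 1."""
    return -(q * math.log(q) + (1.0 - q) * math.log(1.0 - q))


def centered_surprisal_below(prediction_probs, outcomes, p):
    """Total surprisal minus its calibrated expectation, over predictions with q < p."""
    return sum(
        surprisal(q, hit) - expected_surprisal(q)
        for q, hit in zip(prediction_probs, outcomes)
        if q < p
    )
```

A confident miss costs far more than a confident hit gains: `surprisal(0.9, False)` is about 2.30 nats while `surprisal(0.9, True)` is about 0.11.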
On the bright side, this plot has less overlap between the three predictors' typical sets of lines. And the red curves look... more symmetrical, kinda, like an odd function, if you squint. Same for the blue curves.
On the dark side, everything is still too squished together on the left. (I think this is a problem inherent to any "sum(... for q in prediction_probs if q<p)" function. I tried normalizing everything in terms of stdevs, but it ruined the symmetry and made everything kinda crazy on the left-hand side.)
There is the Brier score, or any other proper scoring rule. These each have the advantage of being zero-degree-of-freedom up to the choice of scoring rule, though they aren't information-preserving and aren't comparable across different sets of predictions. (Though neither is any analogue of a CDF.)
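For concreteness, a small sketch of the Brier score for binary predictions (the function and argument names are mine):

```python
def brier_score(prediction_probs, outcomes):
    """Mean squared difference between each stated probability and the 0/1 outcome.

    Lower is better: 0.0 is perfect, and always answering 0.5 scores 0.25.
    """
    squared_errors = [
        (q - (1.0 if hit else 0.0)) ** 2
        for q, hit in zip(prediction_probs, outcomes)
    ]
    return sum(squared_errors) / len(squared_errors)


example = brier_score([0.9, 0.2], [True, False])
```

The whole set of predictions collapses to a single number, which is the loss of information mentioned above.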
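Let $x$ be your estimate of the probability of some event.
Define the penalty to be $f(x)$ if the event happened and $g(x)$ otherwise.
If $p$ is the true probability of the event, the expected penalty equals $p\,f(x) + (1-p)\,g(x)$.
And the derivative of the expected penalty with respect to $x$ equals $p\,f'(x) + (1-p)\,g'(x)$, which vanishes at $x = p$ when $f$ and $g$ are chosen suitably.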
So, if you try to minimize your average penalty, you are motivated to give your best possible estimate. It's possible that you can use it to grade your own predictions, though I don't know for sure.
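One concrete instance, worked through as a sketch. The logarithmic penalty below is my own choice of example (an assumption, not necessarily the penalty intended above). Write $x$ for the stated estimate and $p$ for the true probability of the event, with the penalty $-\ln x$ if the event happened and $-\ln(1-x)$ otherwise.

```latex
\begin{align*}
E(x)  &= -p \ln x - (1-p) \ln(1-x) \\
E'(x) &= -\frac{p}{x} + \frac{1-p}{1-x} = \frac{x-p}{x(1-x)}
\end{align*}
```

For $0 < x < 1$ the derivative is negative when $x < p$, zero at $x = p$, and positive when $x > p$, so under this penalty the expected penalty is smallest exactly when you report $x = p$.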