When I see people grading their predictions, it's always by: (a) bucketing their predictions by probability (into a "46-55%" bucket, a "56-75%" bucket, ...), and then (b) plotting each bucket's nominal probability vs empirical frequency-of-correctness. See e.g. Scott Alexander here.

This seems... fine... but the bucketing step has a certain inelegance to it: just as you can build many different-looking histograms for the same dataset, you can build many different-looking calibration curves for the same predictions, based on a semi-arbitrary choice of bucketing algorithm. Also, by bucketing datapoints together and then aggregating over the bucket, information is destroyed.
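To make the bucketing complaint concrete, here is a minimal sketch in Python. The function name `calibration_curve`, the bucket edges, and the eight predictions are all made up for illustration; nothing here is taken from any particular grader's method.

```python
from collections import defaultdict

def calibration_curve(preds, edges):
    """Bucket predictions and compare stated vs. observed frequencies.

    preds: list of (stated_probability, was_correct) pairs.
    edges: sorted bucket boundaries; bucket i covers [edges[i], edges[i+1]),
           except the last bucket, which also includes its upper edge.
    Returns {bucket_index: (mean_stated_probability, fraction_correct, count)}.
    """
    buckets = defaultdict(list)
    last = len(edges) - 2
    for p, correct in preds:
        for i in range(len(edges) - 1):
            in_bucket = edges[i] <= p < edges[i + 1]
            if in_bucket or (i == last and p == edges[-1]):
                buckets[i].append((p, correct))
                break
    curve = {}
    for i, items in sorted(buckets.items()):
        mean_p = sum(p for p, _ in items) / len(items)
        freq = sum(1 for _, c in items if c) / len(items)
        curve[i] = (mean_p, freq, len(items))
    return curve

# Hypothetical predictions: (stated probability, did it come true?)
preds = [(0.5, True), (0.55, False), (0.6, True), (0.7, True),
         (0.8, False), (0.9, True), (0.95, True), (0.99, True)]

# Same predictions, two semi-arbitrary bucketing choices.
coarse = calibration_curve(preds, [0.5, 0.75, 1.0])
fine = calibration_curve(preds, [0.5, 0.6, 0.7, 0.8, 0.9, 1.0])
```

With these toy numbers, the coarse bucketing gives two buckets that are each 75% correct, while the fine bucketing gives five buckets whose observed frequencies jump between 0%, 50%, and 100%. Same predictions, very different-looking curves.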

For histograms, there's an information-preserving, zero-degree-of-freedom alternative: the CDF. The CDF isn't perfect, but it at least has a *different* set of problems from histograms.
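For reference, the empirical CDF needs no binning choice at all: sort the data and step up by 1/n at each point. A minimal sketch (the name `empirical_cdf` is just illustrative):

```python
def empirical_cdf(samples):
    """Return (x, fraction of samples <= x) points for a step plot.

    No bucket width or bucket count is chosen anywhere. If a value
    repeats, it shows up as several points at the same x; the last
    of them carries the CDF value at that x.
    """
    xs = sorted(samples)
    n = len(xs)
    return [(x, (i + 1) / n) for i, x in enumerate(xs)]

points = empirical_cdf([3, 1, 2])
```

Every datapoint appears in the output, so the original sample can be read straight back off the curve.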

Is there any similar tool for grading predictions?

The problem is that this measures their amount of knowledge about the questions as well as their calibration.

My model would be as follows. For a fixed source of questions, each person has a distribution describing how much they know about the questions. It describes how likely it is that a given question is one they should say p on. Each person also has a calibration function f, such that when they should say p they instead say f(p). Then by assigning priors over the spaces of these distributions and calibration functions, and applying Bayes' rule we get a...
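The generative half of this model (knowledge distribution plus calibration function f) can be sketched as a small simulation. This is only an illustration under loud assumptions: the uniform knowledge distribution on [0.5, 1] and the particular `overconfident` function are hypothetical choices, and the Bayesian inference step is not attempted here.

```python
import random

def simulate(n, sample_p, f, rng):
    """Generate n graded predictions from the model.

    sample_p(rng): draws the probability p the person *should* say
                   (their knowledge distribution).
    f(p):          the calibration function; they actually say f(p).
    Returns a list of (stated_probability, was_correct) pairs.
    """
    data = []
    for _ in range(n):
        p = sample_p(rng)
        correct = rng.random() < p  # the outcome follows the true p, not f(p)
        data.append((f(p), correct))
    return data

def overconfident(p):
    # Hypothetical f: pushes p away from 0.5, clamped to [0, 1].
    return min(1.0, max(0.0, 0.5 + 1.5 * (p - 0.5)))

rng = random.Random(0)  # seeded so the run is repeatable
data = simulate(1000, lambda r: r.uniform(0.5, 1.0), overconfident, rng)
```

Swapping in a different `sample_p` while holding `f` fixed changes the data without changing the person's calibration, which is the separation the model is trying to capture.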