Prediction and Calibration - Part 1

Jan Christian Refsgaard

Scott Alexander is a darling of the Bayesian rationalist community. He has a lot more epistemic humility than most, despite being an impressively well-calibrated predictor.

In this series we will try to achieve 2 things:

(this post) We try to understand what a likelihood function is and use it to evaluate predictions.
(next post) We make a Bayesian calibration model and get an uncertainty estimate over our calibration.

The likelihood function

Let's first look at Bayes' Theorem:

$p (θ ∣ y) = \frac{p (y ∣ θ) \times p (θ)}{p (y)}$

In common parlance, the 4 parts of Bayes' Theorem are called:

$\text{posterior} = \frac{\text{likelihood} \times \text{prior}}{\text{data}}$

What we want is our posterior, the probability of some model parameters (often $θ$ ) given some data ( $y$ ). We construct a model from two things: a prior, which describes what we believe before seeing the data, and a likelihood function ( $p (y ∣ θ)$ ), which scores the data given a model ( $θ$ , drawn from the prior).

The simplest and most relevant likelihood function is the Bernoulli:

$p (y | θ) = θ^{y} (1 - θ)^{1 - y}$

Here $y$ is 1 when our prediction turns out to be correct and is 0 otherwise. And $θ$ represents our model. Our model for now is just 'what Scott predicted.'

As an example, let's take a prediction of $θ = 0.6$ . If the prediction turns out to be true ( $y = 1$ ), then the Bernoulli likelihood function is equal to 0.6:

$p (y = 1 ∣ θ = 0.6) = θ^{y} (1 - θ)^{1 - y} = {0.6}^{1} (1 - 0.6)^{1 - 1} = {0.6}^{1} \times {0.4}^{0} = 0.6$

And if the prediction turned out wrong ( $y = 0$ ), then:

$p (y = 0 ∣ θ = 0.6) = θ^{y} (1 - θ)^{1 - y} = {0.6}^{0} (1 - 0.6)^{1 - 0} = {0.6}^{0} \times {0.4}^{1} = 0.4$

The likelihood function says that there was a 40% chance you were wrong, which is the same as predicting the opposite outcome with 40%.
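The two worked cases above can be checked with a few lines of Python. This is only a minimal sketch of the Bernoulli formula; the function name `bernoulli_likelihood` is made up for illustration and is not part of any library.

```python
def bernoulli_likelihood(y, theta):
    """Bernoulli likelihood theta^y * (1 - theta)^(1 - y).

    y is 1 if the prediction turned out correct and 0 otherwise,
    theta is the probability that was predicted.
    """
    return theta ** y * (1 - theta) ** (1 - y)


# The 60% prediction from the text, once correct and once wrong.
likelihood_if_right = bernoulli_likelihood(1, 0.6)
likelihood_if_wrong = bernoulli_likelihood(0, 0.6)
print(likelihood_if_right, likelihood_if_wrong)
```

A correct 60% prediction is scored with the stated probability, and a wrong one with the remaining 40%.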

If a person makes 3 predictions $θ = [0.6, 0.6, 0.7]$ and the outcomes were $y = [1, 0, 1]$ , then the likelihood of all 3 observations is simply the product of the 3 Bernoulli likelihoods:

$p (y ∣ θ) = \prod_{i = 1}^{3} p (y_{i} ∣ θ_{i}) = 0.6 \times (1 - 0.6) \times 0.7 = 0.168$

Better predictions will have higher numbers.

It can be useful to divide by the null predictor to compare against random performance:

$p (y ∣ θ = 0.5) = \prod_{i = 1}^{N} p (y_{i} ∣ θ_{i} = 0.5) = {0.5}^{N}$

So the 3 predictions above are $\frac{0.168}{{0.5}^{3}} \approx 1.34$ times more likely than random, making this person slightly better than random.
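The three-prediction example can be reproduced with NumPy. This is a small sketch; the variable names (`per_prediction`, `null_likelihood`, `ratio`) are chosen here for illustration only.

```python
import numpy as np

theta = np.array([0.6, 0.6, 0.7])  # stated probabilities
y = np.array([1, 0, 1])            # 1 = prediction correct, 0 = wrong

# Bernoulli likelihood of each prediction, then the product over all of them.
per_prediction = theta ** y * (1 - theta) ** (1 - y)
likelihood = per_prediction.prod()

# The null predictor assigns 50% to everything.
null_likelihood = 0.5 ** len(y)
ratio = likelihood / null_likelihood

print(f"likelihood = {likelihood:.3f}, ratio vs. random = {ratio:.2f}")
```

A ratio above 1 means the predictions beat the 50% predictor; a ratio below 1 means they did worse.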

How good a predictor is Scott

Because Scott has made a lot of predictions, and because we will later implement a 'calibration' model of Scott, let's compare the likelihood of his 2019 predictions with the null model, which predicts everything with 50% (and so implicitly also predicts that it doesn't happen with 50%).

First we import the numeric and scientific Python libraries:

import numpy as np
import scipy as sp
import scipy.stats

Then we code Scott Alexander's 2019 predictions as [Guess, Outcome].

Because Outcome is what we want to predict, we put that in the $y$ variable, and put Guess in the predictor variable $X$ .

data = np.array((
    [[0.5, 1]] *  7 + [[0.5, 0]] * 4 +
    [[0.6, 1]] * 15 + [[0.6, 0]] * 7 +
    [[0.7, 1]] * 12 + [[0.7, 0]] * 5 +
    [[0.8, 1]] * 31 + [[0.8, 0]] * 6 +
    [[0.9, 1]] * 16 + [[0.9, 0]] * 1 +
    [[0.95, 1]] * 5 + [[0.95, 0]] * 0
))
y = data[:, 1]
X = data[:, 0]

The person who made 3 predictions and got 2 correct was slightly better than random. How much better than random is Scott?

Let's take the product of all his predictions.

scott_likelihood = sp.stats.bernoulli(X).pmf(y).prod()
random_predictor = 0.5 ** len(y)
f"{scott_likelihood / random_predictor:g}"

'7.4624e+09'

So 7 billion times more likely! There are two reasons why this number is so large: 1) Scott made a lot of predictions and 2) Scott is a very good predictor. It is easy to become a better predictor than Scott if you simply make a lot of predictions about things that are easy to predict. The hard part is being as well-calibrated as Scott.
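One way to separate the first reason from the second, suggested by a commenter further down, is to take the N-th root of the ratio, which gives an average edge per prediction. The sketch below is an illustration rather than part of the original analysis: it works in log space so that products of many small numbers do not underflow, and the names `counts`, `log_ratio` and `per_prediction_edge` are hypothetical.

```python
import numpy as np

# (guess, number correct, number wrong), copied from the 2019 data above.
counts = [(0.5, 7, 4), (0.6, 15, 7), (0.7, 12, 5),
          (0.8, 31, 6), (0.9, 16, 1), (0.95, 5, 0)]

log_ratio = 0.0  # log of (Scott's likelihood / null likelihood)
n = 0            # total number of predictions
for guess, right, wrong in counts:
    log_ratio += right * np.log(guess / 0.5)
    log_ratio += wrong * np.log((1 - guess) / 0.5)
    n += right + wrong

ratio = np.exp(log_ratio)                    # same quantity as computed above
per_prediction_edge = np.exp(log_ratio / n)  # N-th root of the ratio

print(f"N = {n}, ratio = {ratio:g}, edge per prediction = {per_prediction_edge:.3f}")
```

The per-prediction edge is a geometric mean, so it no longer grows just because more predictions were made.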

Prediction vs Calibration

Predictor:

A good predictor is a person who predicts better than random:
- $\prod p (y ∣ θ) \gg {0.5}^{N}$
A bad predictor is a person who predicts close to random:
- $\prod p (y ∣ θ) \approx {0.5}^{N}$
A terrible predictor is one who is worse than random:
- $\prod p (y ∣ θ) < {0.5}^{N}$

It may be hard to understand how you can be worse than random, and that of course takes skill, but if Scott had flipped all his guesses, his likelihood ratio would be on the order of $10^{- 30}$ , which is much less than 1.

Now that we all agree that Scott is a good predictor, we can finally introduce what we want to talk about: How well-calibrated is Scott and how do we measure that?

Calibrated:

A well-calibrated predictor makes predictions that match the outcome frequency.

Example

Person A predicts 100 things with 60% confidence, and 61 of them turn out to occur. Because $\frac{61}{100} \approx 0.6$ , this person is very well-calibrated.
Person B predicts 100 things with 80% confidence, and 67 of them turn out to occur. Because $\frac{67}{100} \neq 0.8$ , this person is not very well-calibrated.
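As a rough numerical check of the two examples, one can compare the observed frequency with the stated confidence. The helper `calibration_gap` below is a hypothetical name used only for this sketch; it is not the calibration model of the next post.

```python
def calibration_gap(confidence, n_occurred, n_total):
    """Observed frequency minus stated confidence (0 means a perfect match)."""
    return n_occurred / n_total - confidence


gap_a = calibration_gap(0.6, 61, 100)  # Person A
gap_b = calibration_gap(0.8, 67, 100)  # Person B
print(f"Person A: {gap_a:+.2f}, Person B: {gap_b:+.2f}")
```

Person A's gap is about one percentage point, while Person B is off by about thirteen points in the overconfident direction.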

Because 67 > 61, is Person B the better predictor, even though they're not as well-calibrated? Let's evaluate the likelihood of their claims.

Person A's prediction is equivalent to 61 'correct' 60% predictions and 39 'correct' 40% predictions, yielding the following likelihood:

${0.6}^{61} \times {0.4}^{39} \approx 8.86 \times 10^{- 30}$

Person B's prediction is equivalent to 67 'correct' 80% predictions and 33 'correct' 20% predictions, yielding the following likelihood:

${0.8}^{67} \times {0.2}^{33} \approx 2.76 \times 10^{- 30}$

Because $8.86 \times 10^{- 30} > 2.76 \times 10^{- 30}$ Person A is also a slightly better predictor than person B. To understand why, let's consider Person C:

Person C predicts 100 things with 100% confidence and 99 of them turn out to occur. Thus, he will spend an eternity in Probability hell for assigning 0% probability to something that actually occurred. This is also reflected in the likelihood of his predictions, which is zero:

$1^{99} \times 0^{1} = 0$
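The three likelihoods can be verified directly. This is a minimal sketch; the helper name `likelihood` and its arguments are chosen here for illustration.

```python
def likelihood(confidence, n_right, n_wrong):
    """Likelihood of n_right hits and n_wrong misses, all predicted at one confidence."""
    return confidence ** n_right * (1 - confidence) ** n_wrong


person_a = likelihood(0.6, 61, 39)
person_b = likelihood(0.8, 67, 33)
person_c = likelihood(1.0, 99, 1)

print(f"A: {person_a:.3g}, B: {person_b:.3g}, C: {person_c:.3g}")
```

A single miss at 100% confidence multiplies the whole product by zero, no matter how many other predictions were right.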

As renato points out in the comments, the likelihood tracks a combination of how many you got right and how well-calibrated you are. Thus, for your predictions to get more likely, you can either "git good" or "get calibrated", where getting calibrated seems like the more achievable goal. In the next post we will make a model that tracks calibration independent of prediction; this post is a teaser to introduce the necessary concepts for non-statisticians.

Summary so far

We can improve the likelihood of our predictions by being both well-calibrated and very knowledgeable. The next post in this series will focus on measuring calibration.

How good a predictor you are can be evaluated by the product of your likelihood function. Is there a better way to evaluate this? Yes, make a model!

We can also make a model to find out how well-calibrated we are. That is what we will explore in the next post.

There are two reasons why this number is so large: 1) Scott made a lot of predictions and 2) Scott is a very good predictor.

To address #1, you can take the Nth root of this number, where N is the total number of predictions made. This gives you Scott's average edge over randomness per prediction.

You mean the N'th root of 2, right? That is what I called the null predictor and divided Scott's predictions by in the code:

random_predictor = 0.5 ** len(y)

which is equivalent to ${0.5}^{N}$ , where $N$ is the total number of predictions.

Because $8.86 \times 10^{- 30} > 2.76 \times 10^{- 30}$ Person A is also a slightly better predictor than person B.

Wait, I got confused by the function you used to assign the calibration score. It worked in that case, but it will yield higher values for those who make more 'correct' predictions, not those who are more calibrated. For example, person A predicts 100 things with 60% confidence and 61 of them turn out to occur, while person D predicts 100 things with 60% confidence and 60 of them turn out to occur. Person D is more calibrated, but gets a lower score than person A, ~5.9e-30 vs ~8.86e-30 (and person E, who made 100 predictions with 60% confidence which all turned out to be true, would score ~6.53e-23).

I have tried to add a paragraph about this, because I think it's a good point, and it's unlikely that you were the only one who got confused about this. Next weekend I will finish part 2, where I make a model that can track calibration independent of prediction. In that model the 60% 61/100 will have a better posterior of the calibration parameter than the 60% 100/100, though the likelihood of the 100/100 will of course still be highest.

I'm looking forward to reading it, because I think one of the current bottlenecks limiting how many predictions I make is that I cannot easily compare how I'm doing week after week, and I have been looking for a model that helps me check how I'm doing across several predictions.

You may be disappointed: unless you make 40+ predictions per week it will be hard to compare weekly drift. The Bernoulli distribution has a much higher variance compared to the normal distribution, so the uncertainty estimate of the calibration is correspondingly wide (high uncertainty of data -> high uncertainty of regression parameters). My post 3 will be a hierarchical model which may suit your needs better, but it may be a month before I get around to making that model.

If there are many people like you, then we may try to make a hackish model that down-weights older predictions, as they are less predictive of your current calibration than newer predictions. But I will have to think long and hard to make that into a full Bayesian model, so I am making no promises.

You are absolutely right, any framework that punishes you for being right would be bad. My point is that increasing your calibration helps a surprising amount and is much more achievable than "just git good", which is required for improving prediction.

I will try to put your point into the draft when I am off work, thanks.

I have not been consistent with my probability notation: I sometimes use upper case P and sometimes lower case p. In future posts I will try to use the same notation as Andrew Gelman, which is $\Pr$ for things that are probabilities (numbers), such as $\Pr (y = 1) = 0.7$ , and $p$ for distributions, such as $p \sim N (0, 2)$ . However, since this is my first post, I am afraid that 'editing it' will waste the moderators' time, as they will have to read it again to check for trolling. What is the proper course of action?

I think it is best to try to edit it anyway. I think if you have already seen the post, it does not take that long to see that there isn't a line added that is trolly. Also, you should do it for the sake of mathematical accuracy.

Thanks, also thanks for pointing out that I had written a few places instead of $p (y ∣ θ)$ , since everything is the bernoulli distribution I have changed everything to $p$