Making predictions is a good practice, writing them down is even better.

However, we often make binary predictions when it is not necessary, such as

  • Biden win popular vote: 91%
  • Danish COVID deaths above 10,000 by January 1, 2022: 84%

Alternatively, we could make predictions from a normal distribution, such as ('~' means ‘comes from’):

  • Biden’s popular vote ~ N(0.54, 0.03)
  • Danish COVID deaths by January 1, 2022 ~ N(15,000, 5,000)

While making "Normal" predictions seems complicated, this post should be enough to get you started, and more importantly to get you a method for tracking your calibration, which is much harder with dichotomous predictions.

The key points are these:

  1. Predicting from a normal is surprisingly easy.
  2. Getting an actionable number for how over/under confident you are requires only simple math!
  3. The normal distribution carries more information than the Bernoulli (binary outcome such as coins) and will therefore give you more information to act on!

Things this post will answer:

  1. How do I make a normal prediction?
  2. Why do I want to do this?
  3. How do I track my calibration?

Quick recap about the normal distribution

The normal distribution, usually written as N(μ, σ), has 2 parameters:

  • a location parameter μ (pronounced mu), which is both the most likely and the average value
  • a scale parameter σ (pronounced sigma), which captures uncertainty, with a high σ implying high uncertainty

The 68-95-99.7 rule states that:

  • 68% of your predictions should fall in μ ± σ

  • 95% of your predictions should fall in μ ± 2σ

  • 99.7% of your predictions should fall in μ ± 3σ

    [Figure: the 68-95-99.7 rule (image from Towards Data Science)]

50% of the predictions should fall within μ ± ⅔σ, which can be used as a quick spot check.

The last piece of Normal trivia we need to know is this: the variance of the Normal is simply σ²:

  Var(N(μ, σ)) = σ²
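
If you want to double-check these coverage numbers, here is a minimal Python sketch (assuming scipy is installed; nothing in it comes from the original post) that computes how much probability mass falls inside μ ± kσ:

    from scipy.stats import norm

    # Coverage of N(mu, sigma) inside mu +/- k*sigma; the answer is the
    # same for every mu and sigma, so we can use the unit normal.
    for k in [2 / 3, 1, 2, 3]:
        coverage = norm.cdf(k) - norm.cdf(-k)
        print(f"mu +/- {k:.2f} sigma covers {coverage:.1%}")

    # prints roughly 49.5%, 68.3%, 95.4% and 99.7%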

How to make predictions

To make a prediction, there are two steps. Step 1 is predicting μ. Step 2 is using the 68-95-99.7 rule to capture your uncertainty in σ.

I tried to predict Biden’s national vote share in the 2020 election. From the polls I got 54% as a point estimate, so that seemed like a good guess for μ. For σ I used the 68-95-99.7 rule and tried to see what the intervals would imply for different values of σ. Here is a table for σ = 2-5%:

  σ    68%      95%      99.7%
  2%   52-56%   50-58%   48-60%
  3%   51-57%   48-60%   45-63%
  4%   50-58%   46-62%   42-66%
  5%   49-59%   44-64%   39-69%

σ = 2% implies a 97.5% (95% interval + half a tail) chance that Biden would get more than 50% of the votes; I was not that confident. σ = 4% implies an 84% chance that Biden would get more than 50% of the votes (68% + 32%/2), and thus a 16% chance that Trump wins; I likewise found this too high, so I settled on σ = 3%.
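
If you prefer to let the computer do the tail arithmetic, here is a minimal Python sketch (assuming scipy is installed) that reproduces this reasoning:

    from scipy.stats import norm

    mu = 54  # point estimate for Biden's popular vote share, in %
    for sigma in [2, 3, 4, 5]:
        p_above_50 = 1 - norm.cdf(50, loc=mu, scale=sigma)
        print(f"sigma = {sigma}%: P(Biden > 50%) = {p_above_50:.1%}")

    # sigma = 2% -> ~97.7%, 3% -> ~90.9%, 4% -> ~84.1%, 5% -> ~78.8%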

Why do I want to do this

Biden got 52% of the vote share, which was within one σ of my prediction. There are two weak lessons that I drew from this ONE data point.

  1. The pollsters screwed up, so I should have regressed towards the mean (50%), such as predicting 53% instead of 54%
  2. The observation was exactly ⅔σ from μ, so it landed on the 50%/50% boundary just as expected. This was lucky, but it's weak evidence that the σ was well chosen.

Imagine I had instead predicted "Biden wins the popular vote: 91%". Well, guess what, he won, so I was right... and that is it. Thinking I should have predicted 80% because the pollsters screwed up seems weird, as that is a weaker prediction and the bold one was right! I would need to predict a lot of other elections to see whether I am over or under confident.

How to track your calibration

Note: In the previous section we used μ and σ for predictions. In this section we will use μ_i and σ_i, where i is the index (prediction 1, prediction 2... prediction N). We will use σ̂ (sigma hat) for the calibration point estimate; this means that σ̂ is a number such as 1.73. In the next post in this series the calibration will itself be a distribution, which, like your predictions, has an uncertainty.

I also made a terrible prediction during the early lockdown in 2020. I predicted N(15,000, 5,000) COVID deaths in Denmark by 2022. The actual number turned out to be 3,200, which is 2.36 standard deviations away, so outside the 95% interval!

In this section we will transform your predictions to the unit normal N(0, 1). This is called z-scoring; it puts all predictions on the same scale, so they become comparable.

Normally when you convert to z-scores you use the data itself to calculate μ and σ, which guarantees an N(0, 1). Here we will instead use our predicted μ_i and σ_i. This means there will be a discrepancy between N(0, 1) and our z-scores. This discrepancy describes how under/over confident your intervals are, and thus describes your calibration, such that if σ̂ = 2 then all your intervals should be twice as wide to achieve σ̂ = 1.

First we z-score our data by calculating how many σ_i each prediction is away from the observed data x_i, using this formula:

  z_i = (x_i − μ_i) / σ_i

Second we calculate σ̂ as the RMSE (root mean squared error) of all N predictions:

  σ̂ = √((z_1² + z_2² + ... + z_N²) / N)

And that is it! Let's calculate σ̂ for my two predictions. First we calculate the squared z-scores:

  z_Biden² = ((52 − 54) / 3)² ≈ 0.44
  z_COVID² = ((3,200 − 15,000) / 5,000)² ≈ 5.57

Then we calculate σ̂:

  σ̂ = √((0.44 + 5.57) / 2) ≈ 1.73

So if these were my only two predictions, then I should widen my future intervals by 73%. In other words, because σ̂ is 1.73 and not 1, my intervals are too narrow by a factor of 1.73. If I had instead gotten σ̂ < 1, such as σ̂ = 0.5, then this would be evidence that my intervals were too wide and should be "scaled back" by multiplying them by 0.5.
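
If you keep your predictions in a script or spreadsheet, the whole calibration calculation is a few lines of code. Here is a minimal Python sketch (assuming numpy is installed; the variable names are mine), using the two predictions from this post:

    import numpy as np

    # Each prediction: (predicted mu, predicted sigma, observed outcome x)
    predictions = [
        (54, 3, 52),             # Biden's popular vote share, in %
        (15_000, 5_000, 3_200),  # Danish COVID deaths by 2022
    ]

    # z-score: how many predicted sigmas the outcome was away from mu
    z = np.array([(x - mu) / sigma for mu, sigma, x in predictions])

    # Calibration point estimate: the RMSE of the z-scores
    sigma_hat = np.sqrt(np.mean(z ** 2))

    print(z.round(2))           # [-0.67 -2.36]
    print(round(sigma_hat, 2))  # 1.73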

Still not convinced?

Here are some bonus arguments:

  1. Weak 50/50: Sometimes you are actually 50/50, such as Scott's prediction that Bitcoin had a 50-50 shot of going over 3000 in 2019; that could be reformulated as "Bitcoin ~ N(3000, 1500)" such that a price of 10000 counts against the prediction. Now a weak prediction still gives evidence of calibration!
  2. Overshooting and Undershooting: If Biden had gotten 20 or 80% of the votes, both would be strong evidence of my prediction being wrong, whereas a binary prediction can only be 'wrong in one direction'.
  3. High Confidence Predictions are easier to calibrate: In Binary land a 99% prediction is very hard to calibrate because you need to make hundreds of them to get enough data (unless many turn out wrong of course). A Corresponding Normal prediction would have a small σ and thus give as much evidence of calibration as a 60% prediction.
  4. Right for the Wrong Reason: All of N(50.67, 0.5), N(54, 3), N(58, 6) give Biden a 91% win chance, but for very different reasons, and will thus lead people to update differently after observing the actual result of 52% (a quick check is sketched after this list).
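
For point 4, here is a minimal Python sketch (assuming scipy is installed) confirming that these three rather different distributions assign roughly the same win probability:

    from scipy.stats import norm

    for mu, sigma in [(50.67, 0.5), (54, 3), (58, 6)]:
        p_win = 1 - norm.cdf(50, loc=mu, scale=sigma)
        print(f"N({mu}, {sigma}): P(vote share > 50%) = {p_win:.0%}")

    # all three print roughly 91%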

Advanced Techniques

Sometimes your beliefs do not follow a Normal distribution. For example, the Bitcoin prediction N(3000, 1500) implies I believe there is a 2.5% chance the price will become negative, which is impossible. There are 3 solutions in increasing order of fanciness to deal with this:

  1. Have a different σ for each direction (HN = Half Normal), for example:

     Bitcoin ~ HN(3000, σ_lower, σ_upper)

This means that if the outcome is above 3000 then σ_upper applies, while if it's below 3000 then σ_lower applies. If you do this, you can use "the relevant σ" when calibrating and ignore the other one; so if the price of Bitcoin ended up above 3000, z would be computed with σ_upper.

  2. Often you believe something goes up or down by a factor, such as Bitcoin dropping to half or doubling. For ease of example, let's imagine that Scott thought there was a 68% chance that Bitcoin’s value would change by less than a factor of 2.

z-scoring works the same way, just on the log scale, so if the Bitcoin price ended up at 10,000, then:

  z = log₂(10,000 / 3,000) ≈ 1.74

  3. (If this makes no sense, then ignore it): Use an arbitrary distribution for your prediction, then use its CDF (Universality of the Uniform) to convert the observed outcome to a uniform value u, and then transform u to a z-score using the inverse CDF (percentile point function) of the unit Normal. Finally use this z as z_i when calculating your calibration (see the sketch below).
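
To illustrate the third technique, here is a hedged Python sketch (assuming scipy is installed; the t-distribution and its parameters are my own illustrative choices, not from the post):

    from scipy.stats import norm, t

    # Suppose the COVID prediction had been a fat-tailed t distribution
    # instead of a normal: 3 degrees of freedom, location 15_000, scale 5_000.
    prediction = t(df=3, loc=15_000, scale=5_000)
    observed = 3_200

    # Universality of the Uniform: the CDF maps the outcome to u in (0, 1)...
    u = prediction.cdf(observed)

    # ...and the unit normal's inverse CDF (ppf) maps u to a z-score,
    # which can be pooled with your other z-scores when computing sigma-hat.
    z = norm.ppf(u)
    print(round(z, 2))  # roughly -1.65: the fat tails make this miss less
                        # surprising than the normal's z of -2.36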

Final Remarks

I want you to stop and appreciate that we can get a specific actionable number after 2 predictions, which is basically impossible with binary predictions! So start making normal predictions, rather than dichotomous ones!

As a final note, keep this distinction in mind:

  1. If the data x_i and the predictions μ_i are close, then you are a good predictor.
  2. If the prediction errors on the z-scale have an RMSE (σ̂) close to 1, then you are a well calibrated predictor.

Getting good at 1 requires domain knowledge for each specific prediction, while getting good at 2 is a general skill that applies to all predictions.

In this post we calculated the point estimate σ̂ based on 2 data points. There is a lot of uncertainty in a point estimate based on two data points, so we should expect the calibration distribution over σ̂ to be quite wide. The next post in this series will tackle this by calculating a frequentist confidence interval for σ̂ and a Bayesian posterior over σ̂. This allows us to make statements such as: I am 90% confident that σ̂ > 1, so it's much more likely that I am badly calibrated than unlucky. With only two data points it is, however, hard to tell the difference with much confidence.

Finally I would like to thank my editors Justis Mills and eric135 for making this readable.

Comments

We rationalists are very good at making predictions, and the best of us, such as Scott Alexander

This weird self-congratulatory tribalism is a completely unnecessary distraction from an otherwise informative post. Are "we" unusually good at making predictions, compared to similarly informed and motivated people who haven't pledged fealty to our social group? How do you know?

Scott Alexander is a justly popular writer, and I've certainly benefitted from reading many of his posts, but it seems cultish and bizarre to put him on a pedestal as "the best" of "us" like this as the first sentence of a post that has nothing to do with him.

changed to "Making predictions is a good practice, writing them down is even better."

does anyone have a better way of introducing this post?

Overall great post: by retrospectively evaluating your prior predictions (documented so as to avoid one's tendency to 'nudge' your memories based on actual events which transpired) using a 'two valued' Normal distribution (guess and 'distance' from guess as confidence interval), rather than a 'single-valued' bernoulli/binary distribution (yes/no on guess-actual over/under), one is able to glean more information and therefore more efficiently improve future predictions.

That opening statement, while good and useful, does come off a little 'non sequitur'-ish. I urge you to find a more impactful opening statement (but don't have a recommendation, other than some simplification resulting from what I said above).

My original opening statement got trashed for being too self-congratulatory, so the current one is a hot fix :). So I agree with you!

(Edit: the above post has 10 up votes, so many people feel like that, so I will change the intro)

You have two critiques:

  1. Scott Alexander evokes tribalism
  2. We predict more than people outside our group, holding everything else constant

Regarding 1: I was not aware of it, and I will change it if more than 40% agree.

Remove reference to Scott Alexander from the intro: [poll]{Agree}{Disagree}

Regarding 2: I think this is true, but have no hard facts; more importantly, you think I am wrong, or if this also evokes tribalism it should likewise be removed...

Also remove "We rationalists are very good at making predictions" from the intro: [poll]{Agree}{Disagree}

If I remove both then I need a new intro :D

[This comment is no longer endorsed by its author]

I think you're advocating two things here:

  1. Make a continuous forecast when forecasting a continuous variable
  2. Use a normal distribution to approximate your continuous forecast

I think that 1. is an excellent tip in general for modelling. Here is Andrew Gelman making the same point

However, I don't think it's actually always good advice when eliciting forecasts. For example, fairly often people ask whether or not they should make a question on Metaculus continuous or binary. Almost always my answer is "make it binary". Binary questions get considerably more interest and are much easier to reason about. The additional value of having a more general estimate is almost always offset by:

  1. Fewer predictors => less valuable forecast
  2. People update less frequently => Stale forecast
  3. Harder to visualize changes over time => Less engagement from the general public

I think your point 2. has been well dealt with by gbear605, but let me add my voice to his. Normal approximations are probably especially bad for lots of things we forecast. Metaculus uses a logistic distribution by default because it automatically includes slightly heavier tails than normal distributions.

Agreed 100% on 1), and with 2) I think my point is "start using normal predictions as a gateway drug to over-dispersed and model-based predictions".

I stole the idea from Gelman and simplified it for the general community; I am mostly trying to raise the sanity waterline by spreading the gospel of predicting on the scale of the observed data. All your critiques of normal forecasts are spot on.

Ideally everybody would use mixtures of over-dispersed distributions or models when making predictions to capture all sources of uncertainty

It is my hope that by educating people in continuous prediction, the Metaculus trade-off you mention will slowly start to favor continuous predictions because people will find them as easy as binary predictions... but this is probably a pipe dream, so I take your point.

Foretold (https://www.foretold.io/) supports many continuous functions including normal for predictions and resolutions. It also had scoring rules for continuous predictions and resolution functions, and composite functions for both. The creator, Ozzie Gooen, was working on an even more sophisticated system but I'm not sure what stage that's currently at.

The more sophisticated system is Squiggle. It's basically a prototype. I haven't updated it since the posts I made about it last year.
https://www.lesswrong.com/posts/i5BWqSzuLbpTSoTc4/squiggle-an-overview 

I generally agree with the idea - a range prediction is much better than a binary prediction - but a normal prediction is not necessarily the best. It’s simple and easy to calculate, which is great, but it doesn’t account for edge cases.

Suppose you made the prediction about Biden a couple months before the election, and then he died. If he had been taken off of the ballot, he would have received zero votes, and even if he had been left on, "Biden" would have received far fewer votes. Under your normal model, the chance of either of those happening is essentially zero, but there was probably a 1-5% chance of it happening. You can adjust for this by adding multiple normal curves and giving different weights to each curve, though I’m not sure how to do the scoring with this.

It also doesn’t work well for exponential behavior. For COVID cases in a given period, a few days difference in changing behavior could alter the number of deaths by a factor of 2 or more. That can be easily rectified though by putting your predictions in log form, but you have to remember to do that.

Overall though, normal predictions work well for most predictions, and we’d be better off using them!

Good points. Everything is a conditional probability, so you can simply make conditional normal predictions:

Let A = Biden alive

Let B = Biden vote share

Then the normal probability is conditional on him being alive and does not count otherwise :)

Another solution is to make predictions from a t-distribution to get fatter tails, and then use "Advanced trick 3" to transform it back to a normal when calculating your calibration.

Given also the fact that many significant events seem to come from distributions with fat tails, assuming normal distributions may lead you to be systematically overconfident in your predictions. Though it's still probably far, far better than using binary estimates.

You could make predictions from a t distribution to get fatter tails, but then the "easy math" for calibration becomes more scary... You can then take the quantile from the t distribution and ask what sigma in the normal it corresponds to. That is what I outlined/hinted at in "Advanced Techniques 3".

One of the things I like about a Brier Score is that I feel like I intuitively understand how it rewards calibration and also decisiveness.

It is trivial to be perfectly calibrated on multiple choice (with two choices being a "binary" multiple choice answer) simply by throwing decisiveness out the window: generate answers with coin flips and give confidence for all answers of 1/N.  You will come out with perfect calibration, but also the practice is pointless, which shows that we intuitively don't care only about being calibrated.

However, this trick gets a very bad (edited from low thanks to GWS for seeing the typo) Brier Score, because the Brier Score was invented partly in response to the ideas that motivate the trick :-)

We also want to see "1+1=3" and assign it "1E-7" probability, because that equation is false and the uncertainty is more likely to come from typos and model error and so on.  Giving probabilities like this will give you very very very low Brier Scores... as it should! :-)

The best possible Brier Score is 0.0 in the same way that the best RMSE is 0.0. This is reasonable because the RMSE and Brier Score are in some sense the same concept.

It makes sense to me that for both your goal is to make them zero. Just zero. The goal then is to know all the things... and to know that you know them by getting away with assigning everything very very high or very very low probabilities (and thus maxing the decisiveness)! <3

Second we calculate σ̂ as the RMSE (root mean squared error) of all predictions... Then we calculate σ̂...

So if these were my only two predictions, then I should widen my future intervals by 73%. In other words, because σ̂ is 1.73 and not 1, my intervals are too narrow by a factor of 1.73.

I'm not sure if you're doing something conceptually interesting here (like how Brier Scores interestingly goes over and above mere "Accuracy" or mere "Calibration" by taking several good things into account in a balanced way), or... maybe... are you making some sort of error? 

RMSE works with nothing but point predictions. It seems like you recognize that the standard deviations aren't completely necessary when you write:

(1) If the data x_i and the prediction μ_i are close, then you are a good predictor

Thus maybe you don't need to also elicit a distribution and a variance estimate from the predictor? I think? There does seem to be something vaguely pleasing about aiming for an RMSE of 1.0 I guess (instead of aiming for 0.00000001) because it does seem like it would be nice for a "prediction consumer" to get error bars as part of what the predictor provides?

But I feel like this might be conceptually akin to sacrificing some of your decisiveness on the altar of calibration (as with guessing randomly over outcomes and always using a probability of 1/N).

The crux might be something like a third thing over and above "decisiveness & calibration" that is also good and might be named... uh... "non-hubris"? Maybe "intervalic coherence"? Maybe "predictive methodical self-awareness"?

Is it your intention to advocate aiming for RMSE=1.0 and also simultaneously advocate for eliciting some third virtuous quality from forecasters?

Note 1 for JenniferRM: I have updated the text so it should alleviate your confusion; if you have time, try to re-read the post before reading the rest of my comment. Hopefully the few changes are enough to answer why we want RMSE = 1 and not 0.
Note 2 for JenniferRM and others who share her confusion: if the updated post is not sufficient but the text below is, how do I make my point clear without making the post much longer?

With binary predictions you can cheat and predict 50/50 as you point out... You can't cheat with continuous predictions as there is no "natural" midpoint.

The insight you are missing is this:

  1. I "try" to Convert my predictions to the Normal N(0, 1) using the predicted mean and error.
  2. The variance of the unit Normal is 1: Var(N(0, 1)) = 1^2 = 1
  3. If my calculated variance deviate from the unit normal, then that is evidence that I am wrong, I am making the implicit assumption that I cannot make "better point predictions" (change ) and thus is forced to only update my future uncertainty interval by .

To make it concrete, if I had predicted (the sigmas here are 10 times wider than in the post):

  • Biden ~ N(54, 30)
  • COVID ~ N(15,000, 50,000)

then the math would give σ̂ = 0.173. Both the post's predictions and the "10 times wider" predictions in this comment imply the same recalibrated σ, for example for Biden: 3 × 1.73 ≈ 30 × 0.173 ≈ 5.2.

(On a side note, I hate Brier scores and prefer the Bernoulli likelihood, because Brier says that predicting 0% or 2% on something that happens 1% of the time is "equally wrong" (same square error)... whereas the Bernoulli says you are an idiot for saying 0% when it can actually happen.)

When I google for [Bernoulli likelihood] I end up at the distribution and I don't see anything there about how to use it as a measure of calibration and/or decisiveness and/or anything else.

One hypothesis I have is that you have some core idea like "the deep true nature of every mental motion comes out as a distribution over a continuous variable... and the only valid comparison is ultimately a comparison between two distributions"... and then if this is what you believe then by pointing to a different distribution you would have pointed me towards "a different scoring method" (even though I can't see a scoring method here)... 

Another consequence of you thinking that distributions are the "atoms of statistics" (in some sense) would (if true) imply that you think that a Brier Score has some distribution assumption already lurking inside it as its "true form" and furthermore that this distribution is less sensible to use than the Bernoulli?

...

As to the original issue, I think a lack of an ability, with continuous variables, to "max the calibration and totally fail at knowing things and still get an ok <some kind of score> (or not be able to do such a thing)" might not prove very much about <that score>?

Here I explore for a bit... can I come up with a N(m,s) guessing system that knows nothing but seems calibrated?

One thought I had: perhaps whoever is picking the continuous numbers has biases, and then you could make predictions of sigma basically at random at first, and then as confirming data comes in for that source, that tells you about the kinds of questions you're getting, so in future rounds you might tweak your guesses with no particular awareness of the semantics of any of the questions... such as by using the same kind of reasoning that lead you to concluding "widen my future intervals by 73%" in the example in the OP.

With a bit of extra glue logic that says something vaguely like "use all past means to predict a new mean of all numbers so far" that plays nicely with the sigma guesses... I think the standard sigma and mean used for all the questions would stabilize? Probably? Maybe?

I think I'd have to actually sit down and do real math (and maybe some numerical experiments) to be sure that it would. But is seems like the mean would probably stabilize, and once the mean stabilizes the S could be adjusted to get 1.0 eventually too? Maybe some assumptions about the biases of the source of the numbers have to be added to get this result, but I'm not sure if there are any unique such assumptions that are privileged. Certainly a Gaussian distribution seems unlikely to me. (Most of the natural data I run across is fat-tailed and "power law looking".)

The method I suggest above would then give you a "natural number scale and deviation" for whatever the source was for the supply of "guess this continuous variable" puzzles. 

As the number of questions goes up (into the thousands? the billions? the quadrillions?) I feel like this content neutral sigma could approach 1.0 if the underlying source of continuous numbers to estimate was not set up in some abusive way that was often asking questions whose answer was "Graham's Number" (or doing power law stuff, or doing anything similarly weird). I might be wrong here. This is just my hunch before numerical simulations <3

And if my proposed "generic sigma for this source of numbers" algorithm works here... it would not be exactly the same as "pick an option among N at random and assert 1/N confidence and thereby seem like you're calibrated even though you know literally nothing about the object level questions" but it would be kinda similar.

My method is purposefully essentially contentless... except it seems like it would capture the biases of the continuous number source for most reasonable kinds of number sources.

...

Something I noticed... I remember back in the early days of LW there was an attempt to come up with a fun game for meetups that exercises calibration on continuous variables.  It ended up ALSO needing two numbers (not just a point estimate).

The idea was to have a description of a number and a (maybe implicitly) asserted calibration/accuracy rate that a player should aim for (like being 50% confident or 99% confident or whatever).

Then, for each question, each player emits two numbers between -Inf and +Inf and gets penalized if the true number is outside their bounds, and rewarded if the true number is inside, and rewarded more for a narrower bound than anyone else. The reward schedule should be such that an accuracy rate they have been told to aim for would be the winning calibration to have.

One version of this we tried that was pretty fun and pretty easy to score aimed for "very very high certainty" by having the scoring rule be: (1) we play N rounds, (2) if the true number is ever outside the bounds you get -2N points for that round (enough to essentially kick you out of the "real" game), (3) whoever has the narrowest bounds that contains the answer gets 1 point for that round. Winner has the most points at the end. 

Playing this game for 10 rounds, the winner in practice was often someone who just turned in [-Inf, +Inf] for every question, because it turns out people seem to be really terrible at "knowing what they numerically know" <3

The thing that I'm struck by is that we basically needed two numbers to make the scoring system transcend the problems of "different scales or distributions on different questions".

That old game used "two point estimates" to get two numbers.  You're using a midpoint and a fuzz factor that you seem strongly attached to for reasons I don't really understand. In both cases, to make the game work, it feels necessary to have two numbers, which is... interesting. 

It is weird to think that this problem space (related to one-dimensional uncertainty) is sort of intrinsically two dimensional. It feels like something there could be a theorem about, but I don't know of any off the top of my head.

There is a new game currently sold at Target that is about calibration and estimation.

Each round has two big numbers, researched from things like "how many YouTube videos were uploaded per hour in 2020?" or "how many pounds does Mars weigh?" Each player guesses how much larger one is than the other (i.e. 2x, 5x, 10x, 100x, 1000x), and can bet on themselves if they are confident in their estimation.

Rather than using z-scoring, one can use log probabilities to measure prediction accuracy. They are computed by