
In *How to Measure Anything*, Douglas Hubbard presents a step-by-step method for calibrating your confidence intervals. He has tested it on hundreds of people and reports that it can make about 90% of them nearly perfect estimators within half a day of training.

I've been told that the Less Wrong and CFAR community is mostly not aware of this work, so given the importance of making good estimates to rationality, I thought it would be of interest.

(although note CFAR has developed its own games for training confidence interval calibration)

The main techniques to employ are:

### Equivalent bet:

For each estimate, imagine betting \$1000 on the answer being within your 90% CI. Now compare this to betting \$1000 on a spinner where you win 90% of the time and lose 10% of the time. Would you prefer to take a spin? If so, you are less than 90% confident in your range, so you need to widen it. Would you prefer to bet on your answer? Then you are more than 90% confident, so you need to narrow the range. If you don't mind whether you answer the question or take a spin, then it really is your 90% CI.
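The spinner comparison boils down to a simple decision rule. A minimal sketch (the function name and advice strings are mine, not Hubbard's):

```python
# Equivalent-bet decision rule: compare your honest subjective probability
# that the truth lies inside your stated interval against the 90% spinner.

def equivalent_bet_advice(p_in_interval: float, target: float = 0.90) -> str:
    """p_in_interval: your honest probability that the true value
    falls inside the interval you stated."""
    if p_in_interval < target:
        return "prefer the spinner -> widen your interval"
    if p_in_interval > target:
        return "prefer your interval -> narrow it"
    return "indifferent -> it really is your 90% CI"

print(equivalent_bet_advice(0.80))  # prefer the spinner -> widen your interval
```

The useful part of the exercise is that imagining the concrete bet forces you to report `p_in_interval` honestly, rather than anchoring on the number you first wrote down.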

### Absurdity Test:

Start with an absurdly large range, maybe from minus infinity to plus infinity, and then begin reducing it based upon things you know to be highly unlikely or even impossible.

### Pros and cons:

Identify two pros and two cons for the range that you have given to help clarify your reasons for making this estimate.

Once you have used these techniques you can make another equivalent bet to check whether your new estimate is your 90% CI.

To train yourself, practice making estimates repeatedly while using these techniques, until you are calibrated: for 90% CIs, that means roughly 90% of your intervals contain the true answer, not 100%.
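Scoring a practice session is just bookkeeping. A sketch with invented numbers, assuming you log each question as a (lower, upper, true value) triple:

```python
# Score a calibration practice session: what fraction of your stated
# 90% CIs actually contained the true value? Well calibrated means
# roughly 0.9 over many questions. (Numbers below are hypothetical.)

def hit_rate(trials):
    """trials: list of (lower, upper, true_value) tuples."""
    return sum(lo <= truth <= hi for lo, hi, truth in trials) / len(trials)

trials = [
    (1900, 1930, 1917),  # a year-type question: hit
    (8, 15, 12),         # a duration in hours: hit
    (100, 200, 240),     # a miss: the truth fell above the upper bound
]
print(hit_rate(trials))  # 2 of 3 intervals contained the truth
```

Consistently scoring above 0.9 means your intervals are too wide (underconfidence), just as scoring below it means they are too narrow.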

To read more and try sample questions, read the article we prepared on 80,000 Hours here.


Great book. It was percolating around CFAR a few months back - I (Dan from CFAR) read it, several other people read at least part of the book or my notes on it, and we had some conversations about it. A few things from the book that stuck out to me (although some may have been slightly distorted by memory):

• the definition of "measurement of X" as anything you do that reduces your uncertainty about X (which is nice and Bayesian)
• the first step in dealing with a problem (one Hubbard often had to lead people through when they brought him in as a consultant) is being specific about the concrete issue at stake and why it matters, e.g. translating "IT security" into things like "people being unable to work due to network downtime." (CFAR already had a unit on Being Specific, and it turned out that Hubbard had an exercise extremely similar to the Monday-Tuesday game we were using)
• the importance of the skill of calibrated estimation, and calibration techniques discussed in the OP
• the value of Fermi estimation - Hubbard said that the Fermi method of decomposing a business question into subcomponents was usually necessary, and sometimes sufficient, for figuring out what to do
• Hubbard also has an approach for combining Fermi estimation with calibrated confidence intervals on subcomponents, and using Monte Carlo simulation to get a calibrated confidence interval for the main question. It would be cool to get that method down, but I haven't used it.
• before you seek out information, identify what information would actually be useful: would it change what you do? Figure out the value of information. VOI was already part of the LW idea library and the subject of a CFAR unit, but I suspect that How to Measure Anything has helped me internalize that question.
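A minimal sketch of how that Fermi-plus-Monte-Carlo combination could work (my reconstruction, not Hubbard's exact procedure; the cost model at the bottom is invented for illustration): treat each subcomponent's 90% CI as implying a normal distribution, sample the subcomponents, combine them, and read the combined 90% interval off the simulated percentiles.

```python
# Combine calibrated 90% CIs on subcomponents via Monte Carlo.
# Assumes each subcomponent is roughly normal: a 90% CI (lo, hi)
# implies mean = (lo+hi)/2 and sd = (hi-lo)/(2*1.645), because a
# normal's central 90% interval spans +/- 1.645 standard deviations.
import random

def sample_from_ci(lo, hi):
    mean = (lo + hi) / 2
    sd = (hi - lo) / (2 * 1.645)
    return random.gauss(mean, sd)

def combined_ci(sub_cis, combine, n=100_000, level=0.90):
    draws = sorted(
        combine([sample_from_ci(lo, hi) for lo, hi in sub_cis])
        for _ in range(n)
    )
    tail = (1 - level) / 2
    return draws[int(tail * n)], draws[int((1 - tail) * n)]

# Invented example: total cost = price per unit * units needed,
# with a 90% CI on each factor.
lo, hi = combined_ci([(8, 12), (90, 110)], combine=lambda xs: xs[0] * xs[1])
```

With these inputs the combined interval comes out noticeably narrower than the naive 8×90 to 12×110 range, because both extremes rarely happen together; that shrinkage is the point of simulating rather than multiplying endpoints.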

For a 90% CI there is a 10% chance that the answer lies outside your estimate, and if you split this evenly, there is a 5% chance that the answer is above your upper bound and a 5% chance that it is below your lower bound.

This isn't always true. For example, one calibration question I've done is, "How long are all 3 extended Lord of the Rings movies back to back?"

On this, I was almost certain each movie was at least 3 hours long, but I wasn't sure how much more than that, so my minimum for the total was 9 hours. I was fairly confident none was more than 4 hours, so my upper bound was 12 (this was for a 70% interval). Almost all my uncertainty was on the upper end, while very little was on the lower.

FYI, I told the CFAR principals about How to Measure Anything, and specifically about the calibration exercises detailed in chapter 5, on September 9th of last year, at which time Anna said she had previously read the first half of the book.

But yeah, it hasn't been discussed on LW much, though it has been on my recommended books page for a long time.

Sorry Luke, I didn't want to bother you, so I didn't ask, but I should have guessed you would have found this :)

So, since basically everyone in the world is overconfident, you can make them better calibrated just by making them come up with an interval and then doubling it.

What I've never really got is how you become accurately calibrated at the long tails. Are there really people who can consistently give both 90% and 95% confidence intervals? To me those both just feel like "really likely", and the higher the granularity, the harder it gets - note that a 98% confidence interval should probably be twice as wide as a 95% confidence interval. Are there people who have truly internalised this?
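For what it's worth, how much wider the 98% interval should be depends heavily on how fat-tailed your beliefs are. Under a plain normal assumption (a sketch using Python's standard library), the stretch is milder than 2x; fatter-tailed beliefs push the extreme intervals out much further:

```python
# Half-width of a central confidence interval, in standard deviations,
# for a normally distributed belief.
from statistics import NormalDist

def half_width(level):
    return NormalDist().inv_cdf(0.5 + level / 2)

for level in (0.50, 0.90, 0.95, 0.98):
    print(f"{level:.0%} interval: +/- {half_width(level):.2f} sd")

print(half_width(0.98) / half_width(0.95))  # ~1.19, not 2, for a normal
```

So the granularity problem is real: the 95% and 98% intervals differ by less than 20% in width under a normal, which is well inside the noise of a gut feeling.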

I personally like this two player calibration game, which I was introduced to by Paul Christiano at a meetup a couple of years ago:

1. Think of an unknown quantity (What year was the first woman elected to the US Congress?).
2. Player 1 comes up with a 50% confidence interval (I guess, technically, this is a credible interval...).
3. Player 2 chooses whether they want to take the "in" or the "out" side of the bet.

There's no need to choose a minimum width confidence interval (is there a technical term for that?) e.g. "before 1920" would be an acceptable confidence interval for the question given above.

The big advantage of 50% confidence intervals over 90% confidence intervals (other than making a nice easy structure for the game) is that you get much faster feedback: 20 trials can meaningfully tell you whether your 50% intervals are too narrow or too wide. With 90% intervals, 20 trials is enough to tell you if you're overconfident, but not if you're underconfident.
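The feedback-speed point can be checked with binomial arithmetic over a hypothetical 20-question session:

```python
# Why 50% intervals give faster feedback: binomial probabilities
# over a 20-question practice session.
from math import comb

def binom_pmf(k, n, p):
    return comb(n, k) * p**k * (1 - p)**(n - k)

# If your "50%" intervals are honest, extreme hit counts are rare,
# so either extreme is informative in just 20 trials:
p_low  = sum(binom_pmf(k, 20, 0.5) for k in range(0, 6))    # <= 5 hits
p_high = sum(binom_pmf(k, 20, 0.5) for k in range(15, 21))  # >= 15 hits
print(f"P(<=5 or >=15 hits | p=0.5) = {p_low + p_high:.3f}")  # ~0.041

# For honest 90% intervals, a perfect 20/20 still happens often, so
# underconfidence (true coverage near 100%) is indistinguishable:
print(f"P(20/20 hits | p=0.9) = {binom_pmf(20, 20, 0.9):.3f}")  # ~0.122
```

In other words, a 20/20 streak is weak evidence against honest 90% intervals, while 15/20 hits on "50%" intervals is already a clear red flag.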

The big disadvantage is that 50% confidence intervals somehow don't feel as useful as 90% confidence intervals. I'm not sure this is really true, as there's nothing special about 90% (by my reckoning 50% is about as far away from 90% as 90% is from 98%), but it feels true. Of course, it's pretty trivial to change the game so it works with intervals other than 50%, but you have to play longer, and it gets more complicated.

This is fantastic input. Thank you very much.

I am a little skeptical of the first technique, the spinner bet. I thought that was something I did naturally in any case. Of course, I do need to read the book to really figure out what's happening here.