A Proper Scoring Rule for Confidence Intervals

21Scott Garrabrant

EDIT: I originally said you can do this for multiple choice questions, which is wrong. It only works for questions with two answers.

(In a comment, to keep top level post short.)

One cute way to do calibration for probabilities is to construct a spinner. If you have a true/false question, you can construct a spinner which is divided up according to your probability that each answer is the correct answer.

If you were to then spin the spinner once, and win if it comes up on the correct answer, this would not incentivize constructing the spinner to represent your true beliefs. The best strategy is to put all the mass on the most likely answer.

However, if you spin the spinner twice, and win if either spin lands on the correct answer, you are actually incentivized to make the spinner match your true probabilities!

One reason this game is nice is that it does not require having a correctly specified utility function that you are trying to maximize in expectation. There are only two states, win and lose, and as long as winning is preferred to losing, you should construct your spinner with your true probabilities.
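A minimal sketch comparing the one-spin and two-spin games for a true/false question. The function names, the grid search, and the example probability 0.7 are illustrative assumptions, not part of the original comment.

```python
def win_prob_one_spin(true_prob, spinner_prob):
    """Win if the single spin lands on the correct answer."""
    return true_prob * spinner_prob + (1 - true_prob) * (1 - spinner_prob)


def win_prob_two_spins(true_prob, spinner_prob):
    """Win unless both spins land on the wrong answer."""
    lose = (true_prob * (1 - spinner_prob) ** 2
            + (1 - true_prob) * spinner_prob ** 2)
    return 1 - lose


def best_spinner(win_prob, true_prob, steps=1000):
    """Grid-search the spinner probability that maximizes the chance of winning."""
    grid = [i / steps for i in range(steps + 1)]
    return max(grid, key=lambda s: win_prob(true_prob, s))


true_prob = 0.7  # hypothetical belief that the answer is "true"
best_one = best_spinner(win_prob_one_spin, true_prob)
best_two = best_spinner(win_prob_two_spins, true_prob)
print(best_one, best_two)
```

With one spin the search ends up at the boundary (all mass on the likelier answer); with two spins it ends up at the true probability.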

Unfortunately this doesn't work for confidence intervals, since they seem to require a score that is not bounded below.

You are correct. It doesn't work for more than two answers. I knew that when I thought about this before, but forgot. Corrected above.

I don't have a nice algorithm for N answers. I tried a bunch of the obvious simple things, and they don't work.

I think an algorithm for N outcomes is: spin twice, gain 1 every time you get the answer right but lose 1 if both guesses are the same.

One can "see intuitively" why it works: when we increase the spinner-probability of outcome i by a small delta (imagining that all other probabilities stay fixed, and not worrying about the fact that our sum of probabilities is now 1 + delta) then the spinner-probability of getting the same outcome twice goes up by 2 x delta x p[i]. However, on each spin we get the right answer delta x q[i] more of the time, where q[i] is the true probability of outcome i. Since we're spinning twice we get the right answer 2 x delta x q[i] more often. These cancel out if and only if p[i] = q[i]. [Obviously some work would need to be done to turn that into a proof...]
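The argument can be checked numerically. This sketch assumes the rule means: +1 for each of the two spins that lands on the correct answer, and −1 if both spins land on the same outcome. Under that reading the expected score is 2·Σ p[i]·q[i] − Σ p[i]². The candidate spinners and the true distribution below are made-up examples.

```python
def expected_score(spinner, truth):
    """Expected score of the two-spin rule for N outcomes.

    spinner: the probabilities p[i] painted on the spinner
    truth:   the true probabilities q[i] of each outcome
    """
    hit = sum(p * q for p, q in zip(spinner, truth))  # P(one spin is correct)
    same = sum(p * p for p in spinner)                # P(both spins agree)
    return 2 * hit - same


truth = [0.5, 0.3, 0.2]  # hypothetical true distribution
candidates = {
    "honest": [0.5, 0.3, 0.2],
    "all on favourite": [1.0, 0.0, 0.0],
    "uniform": [1 / 3, 1 / 3, 1 / 3],
    "slightly overconfident": [0.6, 0.25, 0.15],
}
scores = {name: expected_score(p, truth) for name, p in candidates.items()}
best = max(scores, key=scores.get)
print(scores, best)
```

Of these four candidates, the honest spinner scores highest. This fits the identity 2·Σ p[i]·q[i] − Σ p[i]² = Σ q[i]² − Σ (p[i] − q[i])², which is largest exactly when p = q.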

In a true/false question that is true with probability $p$, if you assign probability $q$, your probability of losing is $p(1-q)^2 + (1-p)q^2$. (The probability the answer is true and you spin false twice plus the probability the answer is false and you spin true twice.)

This probability is minimized when its derivative with respect to $q$ is $0$, or at the boundary. This derivative is $-2p(1-q) + 2(1-p)q$, which is $0$ when $q = p$. We now know the minimum is achieved when $q$ is $0$, $1$, or $p$. The probability of losing when $q = 0$ is $p$. The probability of losing when $q = 1$ is $1 - p$. The probability of losing when $q = p$ is $p(1-p)$, which is the lowest of the three options.

Copied without LaTeX:

In a true/false question that is true with probability p, if you assign probability q, your probability of losing is p(1−q)^2+(1−p)q^2. (The probability the answer is true and you spin false twice plus the probability the answer is false and you spin true twice.)

This probability is minimized when its derivative with respect to q is 0, or at the boundary. This derivative is −2p(1−q)+2(1−p)q, which is 0 when q=p. We now know the minimum is achieved when q is 0, 1, or p. The probability of losing when q=0 is p. The probability of losing when q=1 is 1−p. The probability of losing when q=p is p(1−p), which is the lowest of the three options.

Not exactly. Its expected value is the same as the expected value of the Brier score, but the score itself is either 0 or 1.

For some reason, the latex is not rendering for me. I can see it when I edit the comment, but not otherwise.

The comment has just started rendering for me.

Edit: Oh wait, no, you just added another comment without LaTeX.

This is an underappreciated fact! I like how simple the rule is when framed in terms of size and distance.

You mention both the linear and log rules. The log rule has the benefit of being scale-invariant, so your score isn't affected by the units the answer is measured in, but it can't deal with negatives and gets overly sensitive around zero. The linear rule doesn't blow up around zero, is shift-invariant, and can handle negative values fine. The best generic scoring rule would have all these properties.

Turns out (based on Lambert and Shoham, "Eliciting truthful answers to multiple choice questions") that all scoring rules for symmetric confidence intervals $[L, U]$ with coverage probability $1-\alpha$ can be represented (up to affine transformation) as

$$-\big(g(U) - g(L)\big) - \frac{2}{\alpha}\big(g(L) - g(x)\big)\,\mathbb{1}[x < L] - \frac{2}{\alpha}\big(g(x) - g(U)\big)\,\mathbb{1}[x > U]$$

where $x$ is the true value, $\mathbb{1}[\cdot]$ is the indicator function, and $g$ is any increasing function. Unsurprisingly, the linear rule uses $g(x) = x$ and the log rule uses $g(x) = \log x$. If we want scale-invariance on the whole real line, first thing I'd be tempted to do is use […] for positive […]

You probably already know that you can incentivise honest reporting of probabilities using a proper scoring rule like log score, but did you know that you can also incentivize honest reporting of confidence intervals?

To incentivize reporting of a 90% confidence interval, take the score −S−20⋅D, where S is the size of your confidence interval, and D is the distance between the true value and the interval. D is 0 whenever the true value is in the interval.

This not only incentivizes giving an interval that contains the true value 90% of the time, but also splits the remaining 10% equally between overestimates and underestimates.
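A sketch of why the factor of 20 yields a 90% interval with equal tails. It assumes S and D are measured linearly and that your belief about the true value T has a continuous CDF F with density f; the interval endpoints L and U, and F and f, are notation introduced here for illustration. Maximizing the expected score is the same as minimizing the expected penalty S + 20⋅D:

```latex
\mathbb{E}[S + 20D] = (U - L)
  + 20\int_{-\infty}^{L} (L - t)\, f(t)\, dt
  + 20\int_{U}^{\infty} (t - U)\, f(t)\, dt

\frac{\partial}{\partial U}\,\mathbb{E}[S + 20D]
  = 1 - 20\,\bigl(1 - F(U)\bigr) = 0
  \;\Longrightarrow\; F(U) = 0.95

\frac{\partial}{\partial L}\,\mathbb{E}[S + 20D]
  = -1 + 20\,F(L) = 0
  \;\Longrightarrow\; F(L) = 0.05
```

So under these assumptions the best report puts the lower endpoint at your 5th percentile and the upper endpoint at your 95th percentile.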

To keep the lower bound of the interval important, I recommend measuring S and D in log space. So if the true value is T and the interval is (L,U), then S is log(U/L) and D is log(T/U) for underestimates and log(L/T) for overestimates. Of course, you need questions with positive answers to do this.

To do a P% confidence interval, take the score −S−(200/(100−P))⋅D.
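A minimal sketch of the rule as a function, covering both the linear and the log-space versions. The name `interval_score` and its parameters are illustrative assumptions, not an existing library API.

```python
import math


def interval_score(lower, upper, truth, percent=90.0, log_space=False):
    """Score a reported interval: -S - (200 / (100 - percent)) * D.

    S is the size of the interval and D is the distance from the true
    value to the interval (0 if the true value lies inside it).
    With log_space=True, S and D are measured on log-transformed values,
    which requires all inputs to be positive.
    """
    if not 0 < percent < 100:
        raise ValueError("percent must be strictly between 0 and 100")
    if lower > upper:
        raise ValueError("lower must not exceed upper")
    if log_space:
        if min(lower, upper, truth) <= 0:
            raise ValueError("log-space scoring needs positive values")
        lower, upper, truth = math.log(lower), math.log(upper), math.log(truth)
    size = upper - lower
    distance = max(lower - truth, truth - upper, 0.0)
    penalty = 200.0 / (100.0 - percent)
    return -size - penalty * distance


print(interval_score(10, 20, 15))              # inside the interval: -10.0
print(interval_score(10, 20, 23))              # missed by 3: -10 - 20*3 = -70.0
print(interval_score(10, 20, 23, percent=50))  # 50% interval: -10 - 4*3 = -22.0
print(interval_score(10, 1000, 100, log_space=True))
```

Scores are always at most 0, and higher (closer to 0) is better. A narrow interval costs little when it contains the true value but is penalized steeply when it misses.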

This can be used to make calibration training with something like Wits and Wagers cards more fun. I also think it could be turned into an app, if one could get a large list of questions with numerical answers.