A Proper Scoring Rule for Confidence Intervals

Scott Garrabrant

EDIT: I originally said you can do this for multiple choice questions, which is wrong. It only works for questions with two answers.

(In a comment, to keep top level post short.)

One cute way to do calibration for probabilities is to construct a spinner. If you have a true/false question, you can construct a spinner which is divided up according to your probability that each answer is the correct answer.

If you were to then spin the spinner once, and win if it comes up on the correct answer, this would not incentivize constructing the spinner to represent your true beliefs. The best strategy is to put all the mass on the most likely answer.

However, if you spin the spinner twice, and win if either spin lands on the correct answer, you are actually incentivized to make the spinner match your true probabilities!

One reason this game is nice is that it does not require having a correctly specified utility function that you are trying to maximize in expectation. There are only two states, win and lose, and as long as winning is preferred to losing, you should construct your spinner with your true probabilities.

Unfortunately, this doesn't work for confidence intervals, since they seem to require a score that is not bounded below.
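The one-spin versus two-spin comparison above can be checked numerically. The following is a minimal sketch, not part of the original comment: the function names and the example probability of 0.7 are chosen purely for illustration, and the optimum is found by a plain grid search.

```python
def win_probability(p_true, q, spins):
    """Chance that at least one of `spins` spins lands on the correct answer.

    p_true: actual probability that the answer is "true".
    q: fraction of the spinner assigned to "true".
    """
    lose_if_true = (1 - q) ** spins   # every spin lands on "false"
    lose_if_false = q ** spins        # every spin lands on "true"
    return 1 - (p_true * lose_if_true + (1 - p_true) * lose_if_false)


def best_spinner(p_true, spins, steps=1000):
    """Grid-search the spinner setting q that maximizes the win probability."""
    grid = [i / steps for i in range(steps + 1)]
    return max(grid, key=lambda q: win_probability(p_true, q, spins))


one_spin = best_spinner(0.7, spins=1)    # all mass on the likelier answer
two_spins = best_spinner(0.7, spins=2)   # matches the true probability
print(one_spin, two_spins)
```

With one spin the search pushes all the mass onto the more likely answer; with two spins it settles on the true probability.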

[-]SilentCal7y70

Two spins only works for two possible answers. Do you need N spins for N answers?

[-]Scott Garrabrant7y50

You are correct. It doesn't work for more than two answers. I knew that when I thought about this before, but forgot. Corrected above.

I don't have a nice algorithm for N answers. I tried a bunch of the obvious simple things, and they don't work.

[-]Neil Fitzgerald7y40

I think an algorithm for N outcomes is: spin twice, gain 1 every time you get the answer right but lose 1 if both guesses are the same.

One can "see intuitively" why it works: when we increase the spinner-probability of outcome i by a small delta (imagining that all other probabilities stay fixed, and not worrying about the fact that our sum of probabilities is now 1 + delta) then the spinner-probability of getting the same outcome twice goes up by 2 x delta x p[i]. However, on each spin we get the right answer delta x q[i] more of the time, where q[i] is the true probability of outcome i. Since we're spinning twice we get the right answer 2 x delta x q[i] more often. These cancel out if and only if p[i] = q[i]. [Obviously some work would need to be done to turn that into a proof...]
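As a rough numerical check of the rule above (not a proof), the expected score can be written as 2 x sum(p[i] x q[i]) - sum(p[i]^2) and compared against randomly drawn spinners. This is a sketch under assumptions: the three-outcome distribution, the helper names, and the random search are illustrative only.

```python
import random


def expected_score(spinner, truth):
    """Expected score of the two-spin rule under the true distribution.

    Two independent spins: +1 for each spin that lands on the correct
    outcome, -1 if both spins land on the same outcome.
    """
    expected_hits = 2 * sum(p * q for p, q in zip(spinner, truth))
    prob_same = sum(p * p for p in spinner)
    return expected_hits - prob_same


def random_distribution(n, rng):
    """Draw an arbitrary probability vector of length n."""
    weights = [rng.random() + 1e-9 for _ in range(n)]
    total = sum(weights)
    return [w / total for w in weights]


rng = random.Random(0)
truth = [0.5, 0.3, 0.2]  # hypothetical true probabilities q[i]
honest = expected_score(truth, truth)
best_alternative = max(
    expected_score(random_distribution(3, rng), truth) for _ in range(10_000)
)
print(honest, best_alternative)
```

None of the random spinners should beat the honest one, which is consistent with the expected score equaling sum(q[i]^2) minus the squared distance between p and q.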

[-]gjm7y20

Just to be clear: if you spin twice and both come up right, you're gaining 2 and then losing 1? (I.e., this is equivalent to what you wrote in an earlier version of the comment?)

[-]Neil Fitzgerald7y10

That's right.

[-]Ben Pace7y40

(Why does the two-spin work?)

[-]Scott Garrabrant7y70

In a true/false question that is true with probability $p$, if you assign probability $q$, your probability of losing is $p (1 - q)^{2} + (1 - p) q^{2}$. (The probability the answer is true and you spin false twice plus the probability the answer is false and you spin true twice.)

This probability is minimized when its derivative with respect to $q$ is $0$, or at the boundary. This derivative is $- 2 p (1 - q) + 2 (1 - p) q$, which is $0$ when $q = p$. We now know the minimum is achieved when $q$ is $0$, $1$, or $p$. The probability of losing when $q = 0$ is $p$. The probability of losing when $q = 1$ is $1 - p$. The probability of losing when $q = p$ is $p (1 - p)$, which is the lowest of the three options.
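The same conclusion can also be reached without calculus, by expanding the losing probability and completing the square:

$p (1 - q)^{2} + (1 - p) q^{2} = p - 2 p q + q^{2} = (q - p)^{2} + p (1 - p)$

The second term does not depend on $q$, so the losing probability is smallest exactly when $q = p$, where it equals $p (1 - p)$.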

[-]Scott Garrabrant7y30

Copied without LaTeX:

In a true/false question that is true with probability p, if you assign probability q, your probability of losing is p(1−q)^2+(1−p)q^2. (The probability the answer is true and you spin false twice plus the probability the answer is false and you spin true twice.)

This probability is minimized when its derivative with respect to q is 0, or at the boundary. This derivative is −2p(1−q)+2(1−p)q, which is 0 when q=p. We now know the minimum is achieved when q is 0, 1, or p. The probability of losing when q=0 is p. The probability of losing when q=1 is 1−p. The probability of losing when q=p is p(1−p), which is the lowest of the three options.

[-]Qiaochu_Yuan7y40

This is called either Brier or quadratic scoring, not sure which.

[-]Kevin S. Van Horn7y20

Not exactly. Its expected value is the same as the expected value of the Brier score, but the score itself is either 0 or 1.

[-]Scott Garrabrant7y10

For some reason, the latex is not rendering for me. I can see it when I edit the comment, but not otherwise.

[-]Ben Pace7y20

The comment has just started rendering for me.

Edit: Oh wait no, you just added another comment without LaTeX.

[-]habryka7y20

Huh, that’s really weird. The server must somehow be choking on the specific LaTeX you posted. Will check it out.

[-]habryka7y120

Ok, I found the bug. I will fix it in the morning.

[-]Kaj_Sotala7y100

And you did! Cheers for your hard work. :)

[-]Blake Riley7y150

This is an underappreciated fact! I like how simple the rule is when framed in terms of size and distance.

You mention both the linear and log rules. The log rule has the benefit of being scale-invariant, so your score isn't affected by the units the answer is measured in, but it can't deal with negatives and gets overly sensitive around zero. The linear rule doesn't blow up around zero, is shift-invariant, and can handle negative values fine. The best generic scoring rule would have all these properties.

Turns out (based on Lambert and Shoham, "Eliciting truthful answers to multiple choice questions") that all scoring rules for symmetric confidence intervals $(a, b)$ with coverage probability $1 - α$ can be represented (up to affine transformation) as

$S_{α} (a, b, x) = α \cdot (g (b) - g (a)) + I (x < a) \cdot (g (x) - g (a)) + I (x > b) \cdot (g (b) - g (x))$

where $x$ is the true value, $I$ is the indicator function, and $g (\cdot)$ is any increasing function. Unsurprisingly, the linear rule uses $g (x) = x$ and the log rule uses $g (x) = \log (x)$. If we want scale-invariance on the whole real line, the first thing I'd be tempted to do is use $\log (x)$ for positive $x$ and $- \log (| x |)$ for negative $x$, except for that pesky bit about going off to $\pm \infty$ around zero. Let's paste in a linear portion around zero so the function is increasing everywhere: $g (x) = I (| x | \leq 10) \cdot (x / 10) + I (| x | > 10) \cdot \operatorname{sign} (x) \cdot \log_{10} (| x |)$

Using this $g (\cdot)$ , the score is sensitive to absolute values around zero and sensitive to relative values on both sides of it. Since the rule expects more accuracy around zero, the origin should vary depending on question domain. Like if the question is about dates, accuracy should be the highest around the present year and get less accurate going into the past or future. That suggests we should set the origin at the present year. For temperatures, the origin should probably be room temperature. Are there any other standard domains that should have a non-zero origin? An alternate origin $t$ can be added as a shift everywhere:

$g (x) = \begin{cases} (x - t) / 10 & \text{if } | x - t | \leq 10 \\ \operatorname{sign} (x - t) \cdot \log_{10} (| x - t |) & \text{if } | x - t | > 10 \end{cases}$

Not something you'd want to calculate by hand, but if someone implements a calibration app, this has more consistent scores. Going one step further, the scores could be made more interpretable by comparison to a perfectly calibrated reference score: $100 + k \cdot (S_{α} (a, b, x) - S_{α}^{*})$ where $S_{α}^{*}$ is the expected score for perfectly calibrated intervals if, say, $x \sim N (0, 10)$ and $k$ is a fixed value chosen to keep plausible scores mostly positive.
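For anyone implementing this, the shifted transform $g$ might look roughly like the sketch below. It covers only $g$ itself, not the full score; the parameter name `origin` (standing in for $t$) and the example values are assumptions made for illustration.

```python
import math


def g(x, origin=0.0):
    """Piecewise transform: linear within 10 of the origin, signed log10 beyond.

    `origin` plays the role of t in the formula above. The two pieces meet
    at |x - origin| = 10, where both equal +/-1, so g is continuous and
    increasing everywhere.
    """
    d = x - origin
    if abs(d) <= 10:
        return d / 10
    return math.copysign(math.log10(abs(d)), d)


# Hypothetical usage: dates measured relative to an origin year of 2018.
for year in (1018, 1918, 2008, 2018, 2028, 2118):
    print(year, g(year, origin=2018))
```

Near the origin, a ten-year step changes $g$ by 1; far from it, the same change in $g$ corresponds to a tenfold change in distance from the origin.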

[-]lukeprog7y140

This scoring rule has some downsides from a usability standpoint. See Greenberg 2018, a whitepaper prepared as background material for a (forthcoming) calibration training app.

[-]StefanDeYoung7y70

I need help figuring out how to use this scoring rule. Please consider the following application.

How much does it cost to mail a letter under 30g in Canada? $^{1}$

I remember when I was a child buying 45c stamps, so it's likely to be larger than that. It's been over a decade or so, and assuming a 2% rise in cost per year, then we should be around $45 * (1.02)^{10} \sim 60$ c per stamp. However, we also had big budget cuts to our postal service that even I learned about despite not reading the news. Let's say that Canada Post increased their prices by 25% to accommodate some shortfall. My estimate is that stamps cost 75c.

What should be my confidence interval? Would I be surprised if a stamp cost a dollar? Not really, but it feels like an upper bound. Would I be surprised if a stamp cost less than 50c? Yes. 60c? Yes. 70c? Hmmm.... Assume that I'm well calibrated, so I'm reporting 90% confidence for an interval of stamps costing 70c to 100c.

Answer: Stamps in booklets cost 85c each, individual stamps are 100c each. Because I would always buy stamps in booklets, I will use the 85c figure.

S is the size of my confidence interval, $S = 100 - 70 = 30$ . D is the distance between the true value and the interval, but is 0 in this case because the true value is in the interval.

$Score = - S - 20 \cdot D = - 30$

I'm not really sure what to do with this number, so let's move to the next paragraph of the post.

The true value is $T = 85$ and the interval is $(L, U) = (70, 100)$ . Because the true value is contained in the interval, $D = 0$ .

$S = \log (\frac{U}{L}) = \log (\frac{100}{70}) = 0.15$

$Score = - S - 20 \cdot D = - 0.15$

How does this incentivise honest reporting of confidence intervals?

Let's say that, when I intuited my confidence interval above, I was perturbed that it wasn't symmetric about my estimate of 75c, so I set it to $(L, U) = (50, 100)$ for aesthetic reasons. In this case, my score would be $Score = - 0.30$, which is worse than my previous score by a factor of 2.

Let's say that, when I remembered the price of stamps in my childhood, I was way off and remembered 14c stamps. Then I would believe that stamps should cost around 22c now. (Here I have the feeling of "nothing costs less than a quarter!", so I would probably reject this estimate.) That would likely anchor me, so that I would set a high confidence on the price being within $(L, U) = (20, 24)$

$S = 0.08$, $D = \log (\frac{L}{T}) = \log (\frac{20}{85}) = - 0.63$

$Score = - S - 20 \cdot D = 12.52$

Am I trying to maximize this score?

$^{1}$ I looked up the answer, and the lowest cost standard delivery is for letters under 30g.

[-]Scott Garrabrant7y90

I messed up, and swapped the words overestimate and underestimate in the 4th paragraph. I fixed it now. Score should always be negative.

This will change the value at the end to $D = \log (\frac{85}{24})$, or $0.55$, making the score $- 11.06$.

This score is a very negative number, so you get punished for having a bad interval, relative to the $- 0.15$ above.
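The stamp example can be reproduced with a few lines of code. This is a sketch rather than anything from the post itself: the function name and the `log_scale` flag are made up for illustration, the coefficient 20 is the one used for 90% intervals in this thread, and base-10 logs are used because they match the figures quoted above.

```python
import math


def interval_score(lower, upper, truth, penalty=20.0, log_scale=False):
    """Score = -S - penalty * D for a stated interval (lower, upper).

    S is the interval's size and D is the distance from the true value to
    the interval (0 if the true value is inside). With log_scale=True both
    are measured in log10 units, which requires positive inputs.
    """
    if log_scale:
        lower, upper, truth = (math.log10(v) for v in (lower, upper, truth))
    size = upper - lower
    distance = max(0.0, lower - truth, truth - upper)
    return -size - penalty * distance


print(round(interval_score(70, 100, 85), 2))                  # linear rule
print(round(interval_score(70, 100, 85, log_scale=True), 2))  # log rule
print(round(interval_score(50, 100, 85, log_scale=True), 2))  # wider interval
print(round(interval_score(20, 24, 85, log_scale=True), 2))   # interval misses
```

Because the distance is always taken as non-negative, the score can never come out positive, which is the point of the correction above.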

[-]Qiaochu_Yuan7y40

The idea is that the two terms in the score balance between two effects: trying to make S as small as possible means making your interval as small as possible, but if you make it too small you're more likely to use an interval which doesn't contain the truth. Trying to make D as small as possible means making your interval more likely to contain the truth. The coefficients balance the tradeoff between the two so that the interval you end up with is your 90% confidence interval. (According to Scott; I haven't verified this personally.)

[-]gjm7y70

I have verified it. I was in the process of writing a (fairly lengthy) reply to Stefan's comment, including a proof that Scott's scoring rule does indeed have the property that your expected score (according to your actual beliefs about the quantity you're estimating) is maximized when the confidence interval you state has (again according to your actual beliefs) a 5% chance that the quantity lies below its lower bound and a 5% chance that the quantity lies above its upper bound ... but then something I did (I have no inkling what, though it coincided with some combination of keypresses as I was trying to enter some mathematics) made the page go entirely blank, and I didn't find any way to get my partially-written comment back again.

Anyway, here's one way (I don't guarantee it's best and it feels like there should be a slicker way) to prove it. Let's suppose the confidence interval you state is (l,r); consider the derivative w.r.t. either of those bounds -- let's say r, but l is similar -- of your expected score. The first term in the score is just l-r, and the derivative of that is always -1. The second term can be written as an integral; differentiating it w.r.t. r turns out to give you 20Pr(X>r). (The calculation is easy.) So the derivative is zero only when 1-20Pr(X>r)=0; that is, when Pr(X>r)=5%. So if the confidence interval you state doesn't have the property that you expect to be above it exactly 5% of the time, then this derivative is nonzero and therefore some small change in r increases your expected score.

[-]Dr_Manhattan7y20

would you mind spelling out the integral part?

[-]gjm7y40

Suppose f is your probability density function for the quantity X you're interested in.

Then the expectation of D is the integral of D(x)f(x), which equals the integral of [max(0,l-x)+max(0,x-r)]f(x). When we differentiate w.r.t. r, the first term obviously goes away because it's independent of r, so we get the integral of [d/dr max(0,x-r)] f(x). That derivative is 0 for x<r and -1 for x>r, so this is minus the integral of f(x) from r upwards; in other words it's -Pr(X>r). The score is (l-r)-20D, so d(expected score)/dr = -1+20Pr(X>r), which is zero exactly when Pr(X>r)=5%.

The calculation for l is exactly the same but with a change of sign; we end up with 1-20Pr(X<l).
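A brute-force check of this argument is sketched below, under assumptions that are not in the thread: the belief is taken to be an Exponential(1) distribution (chosen only because it is asymmetric and has easy quantiles), represented by evenly spaced quantile points, and each bound is found by grid search.

```python
import math


def expected_score(lower, upper, samples, penalty=20.0):
    """Average of -(upper - lower) - penalty * D over the sample points."""
    total = 0.0
    for x in samples:
        distance = max(0.0, lower - x, x - upper)
        total += -(upper - lower) - penalty * distance
    return total / len(samples)


# Evenly spaced quantiles of an Exponential(1) belief stand in for f(x).
n = 1000
samples = [-math.log(1 - (i + 0.5) / n) for i in range(n)]

# The expected score splits into an l-part and an r-part, so each bound can
# be searched on its own while the other is held at an arbitrary value.
upper_grid = [i / 100 for i in range(100, 601)]
lower_grid = [i / 1000 for i in range(0, 501)]
best_upper = max(upper_grid, key=lambda r: expected_score(0.05, r, samples))
best_lower = max(lower_grid, key=lambda l: expected_score(l, 3.0, samples))

print(best_lower, -math.log(0.95))  # searched bound vs. true 5th percentile
print(best_upper, -math.log(0.05))  # searched bound vs. true 95th percentile
```

The searched bounds should land close to the 5th and 95th percentiles of the assumed belief, up to the resolution of the grids.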

[-]StefanDeYoung7y10

Thanks for this reply. The technique of asking what each term of your equation represents is one I have not practiced in some time.

This answer very much helped me to understand the model.

[-]Куля Ботаніки7y20

Thank you for providing an example!

[-]StefanDeYoung7y40

You're welcome. Something that I'm trying to improve about how I engage with lesswrong is writing out either a summary of the article (without re-referring to the article) or an explicit example of the concept in the article. My hope is that this will help me to actually grok what we're discussing.

[-]Rafael Harth7y10

I get a dozen 'refresh to render LaTeX's here (but refreshing doesn't fix it).

[-]habryka7y30

Just wrapped up the fix for this. Pushing the fix in the next few minutes.

[-]habryka7y40

Fixed! Sorry for the inconvenience!

[-]habryka6y60Nomination for 2018 Review

This is one of these simple insights that I think are actually important for a large variety of forecasting related tasks, and that turn out to be a large pain to rederive for yourself.

[-]philip-bl7y30

Is there a way to incentivize reporting a true probability distribution? Say Bob wants Alice to provide her probability distribution over the IQ she'll get on the test. He is willing to give her a real number as a reward; he wants to hear her probability distribution over her result. What should he do?

Would be nice if it worked for both discrete and non-discrete probability spaces.

[-]Qiaochu_Yuan7y40

In the discrete case log scoring still works, it generalizes past the binary case.

That is, if $S$ is the set of possible outcomes of the test, Bob elicits from Alice a probability distribution $q (s)$ on $S$, then Alice takes the test and gets some outcome $s \in S$, then Bob rewards Alice $\log q (s)$. (This number is unfortunately always negative; you can add a positive constant to it if you want.)

Alice's expected payoff according to her true probability distribution $p (s)$ is

$\sum_{s \in S} p (s) \log q (s)$

also known as the (negative of the) cross entropy between $p$ and $q$ . And you can do a computation, e.g. with Lagrange multipliers, which will verify that for fixed $p$ , the optimal value of $q$ is $q = p$ . I do this calculation in this blog post.
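The claim that $q = p$ is optimal can be sanity-checked numerically (this is not the Lagrange multiplier calculation, just a random search). The three-outcome distribution and helper names below are hypothetical.

```python
import math
import random


def expected_log_score(p, q):
    """Alice's expected reward sum_s p(s) * log q(s) when she reports q."""
    return sum(ps * math.log(qs) for ps, qs in zip(p, q) if ps > 0)


def random_distribution(n, rng):
    """Draw an arbitrary strictly positive probability vector of length n."""
    weights = [rng.random() + 1e-9 for _ in range(n)]
    total = sum(weights)
    return [w / total for w in weights]


rng = random.Random(0)
p = [0.6, 0.3, 0.1]  # hypothetical true beliefs over three outcomes
honest = expected_log_score(p, p)
best_other = max(
    expected_log_score(p, random_distribution(3, rng)) for _ in range(10_000)
)
print(honest, best_other)
```

No randomly drawn report should beat the honest one; the gap between the two values is the KL divergence from the report to the true beliefs.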

A test isn't a good example to use because the outcome of the test is under Alice's control, so she can e.g. throw the test and predict this fact. This procedure is best used to elicit Alice's prediction of something which she cannot influence in any way.

[-]habryka7y20

How did using LaTeX fail?

[-]Qiaochu_Yuan7y20

I tried starting with a dollar sign, which brought up a yellow prompt that I couldn't figure out how to easily exit; hitting Enter just started a new line in the prompt. The only way I've found to exit it so far is Ctrl + Enter, which submits the comment with the LaTeX displaying as "refresh to display LaTeX," and continuing to display that after I refresh.

[-]habryka7y40

Ah, you exit the yellow prompt with Esc and in the yellow prompt you can type any LaTeX, with a live-preview beneath it.

Somewhat surprised that it continued to show "refresh to display LaTeX" even after you refreshed. I never had that happen to me. That might have been a result of you submitting from the inside of the prompt, which I can imagine causing errors.

[-]Qiaochu_Yuan7y40

Awesome, everything's fine now.

[-]Ben Pace7y30

Incentivising accurate probabilistic predictions is central to any art of rationality, and this post gives a significant part of this in a super readable way, so I've curated it.

(Also, nice move adding extra points in the comments.)

[-]ChristianKl7y20

Do you have some argument that your proposed formulas are optimal?

[-]Scott Garrabrant7y30

What do you mean by optimal?

If you mean they are proper (i.e. incentivize honest reporting), gjm's comment gives a quick sketch of a proof.

[-]tunesmith4y10

Is there a way to adjust this to support better scores for tighter confidence intervals?

For instance, using natural log, with a range of 8-10 and a true value of 10, I get -0.2231 whether I pick a 90% confidence interval, or a 95% confidence interval (coefficient of 40). It'd be nice if the latter scored better.

[-]Kevin S. Van Horn7y10

[Edit: I'm retracting this comment, as I made some incorrect assumptions about Scott's claim.] This is wrong. It is well known that the only strictly proper scoring rule that depends only on the probability at the actually occurring value is the logarithmic scoring rule (if there are more than two alternatives), or translations and/or positive scaling of the same. In this case, that would be log(Normal(x | mu, sigma)), where x is the value that occurs, and mu and sigma^2 are the mean and variance of the normal distribution that fits the interval you defined at the given confidence level. This may be simplified to

-log(sigma^2) - (x - mu)^2 / sigma^2.

Your scoring rule is not a translation and/or positive scaling of the logarithmic scoring rule.

[-]Scott Garrabrant7y10

Throwing out an attempt to resolve the disagreement, sorry if this is actually what we are disagreeing about:

Am I unknowingly using words that imply that I care about normal distributions? I am imagining getting honest reporting out of an agent trying to maximize expected score, but with arbitrary beliefs. I am only trying to get an honest reporting of the subjective 5th and 95th percentiles, and am not trying to get any other information.

[-]Kevin S. Van Horn7y30

I'm used to seeing normal (or log-normal) distributions fit to subjective confidence intervals -- because the confidence intervals are being used to do some subjective probabilistic analysis. I assumed that was what you were doing, given that you were using the actual attained value x, and not just which of the three possibilities A:(x < left), B:(left < x < right), and C:(right < x) occurred.

Hmmm... you seem to have evaded the theorem about the only strictly proper local scoring rule being the logarithmic score, by only seeking to find the confidence interval, but using more information than just the region (A, B, or C) the outcome belongs to.

It would help to see a proof of the claim; do you have a reference or a link to a URL giving the proof?

[-]Scott Garrabrant7y10

I don't have a reference. gjm's comment gives a quick sketch.

[-]Scott Garrabrant7y30

Oh, a quick thing that's not a proof but may convince you it is true:

It works exactly the same way as saying that measuring the distance between reported value and true value incentivizes honest reporting of your median. (The point that you think the true value is above with probability 50%.)
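The median analogy is easy to see on a small example. The sketch below uses a made-up, skewed set of equally likely values and a grid search; none of it comes from the post.

```python
def expected_abs_error(report, samples):
    """Average distance between the reported value and each sample point."""
    return sum(abs(report - x) for x in samples) / len(samples)


# A skewed, hypothetical belief: the mean (about 4.4) is far from the median (2).
samples = [1, 1, 2, 2, 2, 3, 20]
candidates = [i / 10 for i in range(0, 201)]
best_report = min(candidates, key=lambda m: expected_abs_error(m, samples))
print(best_report)
```

The report that minimizes expected distance is the median rather than the mean, just as the interval rule picks out percentiles rather than any other summary of the belief.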

[-]Scott Garrabrant7y10

This scoring rule does not depend only on the probability at the actually occurring value. You don't even report the probability at any value. I am not trying to incentivize reporting of probabilities of specific values; I am trying to incentivize reporting an interval such that the person reporting the belief believes the point will lie in it with probability $90\%$.

Your rule seems to be trying to do something else, but it will not incentivize me giving my $90\%$ confidence interval in cases where my beliefs are not normally distributed.