Test Your Calibration!

[-]Rune16y230

Advice for future creators of tests: There are people who live outside the US. No one outside the US cares about the 3rd person to be the second dead uncle of the fourth president of the US.

For instance, a majority of tommccabe's quiz questions are highly US-specific.

The point here is that non-Americans will end up guessing almost all questions, making the whole exercise painful and useless.

[-]SoerenMind11y20

IMO the best calibration exercises I was able to find (which also work for non-Americans) can be downloaded from the website of How to Measure Anything.

http://www.howtomeasureanything.com/

[-]alyssavance16y10

Noted, but I didn't write those questions; they were taken from the open-source MisterHouse project. If you know of any sources of free trivia questions that aren't US-specific, please do PM me.

[-]Cyan16y30

You can get a ton of free non-U.S.-specific trivia from the CIA World Factbook.

[-]Alicorn16y30

It seems like it'd be pretty easy to write your own trivia questions by permitting yourself to surf Wikipedia for a while and extract facts from the articles. What's the advantage to trivia questions you don't write yourself - just speed, or something else too?

[-]alyssavance16y70

Just speed. At two minutes per trivia question it would take a full day to make another set of 250.

[-]bentarm16y130

Why isn't there a 33% option for your test? What if I'm pretty certain that 1 of the answers is wrong, but have no clue which of the others is most likely to be right? Then my confidence is exactly 33%, and I have to either overestimate or underestimate it. The 50% and 25% options seem to cover the other two versions of this scenario (I can eliminate either 2 or 0 of the options almost certainly) but this appears to be a gap.

(incidentally, this only occurred to me because it happened to be the case for the first question on the first of your quizzes...)

[-]alyssavance16y10

There probably should be, mea culpa.

[-]gerg16y70

Part of the output of your quizzes is a line of the form "Your chance of being well calibrated, relative to the null hypothesis, is 50.445538580926 percent." How is this number computed?

I chose "25% confident" for 25 questions and got 6 of them (24%) right. That seems like a pretty good calibration ... but 50.44% chance of being well calibrated relative to null doesn't seem that good. Does that sentence mean that an observer, given my test results, would assign a 50.44% probability to my being well calibrated and a 49.56% probability to my not being well calibrated? (or to my randomly choosing answers?) Or something else?

[-]skepsci14y50

It's also completely ridiculous, with a sample size of ~10 questions, to give the success rate and probability of being well calibrated as percentages with 12 decimals. Since the uncertainty in such a small sample is on the order of several percent, just round to the nearest percentage.
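To put a rough number on that uncertainty, here is a minimal sketch using the textbook binomial standard error for a proportion. The function name and the example inputs (a bin of 10 questions with a true hit rate of 50%) are illustrative assumptions, not anything the quiz is known to compute.

```python
import math

def proportion_standard_error(p: float, n: int) -> float:
    """Standard error of an observed success rate, given true rate p and n questions."""
    return math.sqrt(p * (1.0 - p) / n)

# Hypothetical bin: 10 questions, true hit rate 50%.
se = proportion_standard_error(0.5, 10)
print(f"standard error: {se * 100:.0f} percentage points")
```

Under these assumptions the standard error is far larger than one percentage point, so none of the digits after the decimal point carry information.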

[-]MTGandP11y00

It probably just computes it as a float and then prints the whole float.

(I do recognize the silliness of replying to a three-year-old comment that itself is replying to a six-year-old comment.)

[-]Soothsilver10y00

It's not silly. I still find these newer comments useful.

[-]MTGandP10y00

And here we are one year later!

[-]Sunny from QAD6y10

Yes, do it for posterity!

[-]Matteo De Stefano4y10

I would like to chime in and point out that as of today the domain "acceleratingfuture (dot) com" is owned by a Russian bookmaker.

[-]XelaP7mo10

Does "chance relative to null is x%" mean "An observer, given my results, would assign an x% probability to my being calibrated"?

No! P(Test results | Perfect calibration) / P(Test results | Whatever the null is) ≠ P(Perfect Calibration | Test results) !

You can also lodge this as a problem with null hypothesis testing - I would've thought that perfect calibration would be the null. Perhaps the null is a model where you just randomly say a probability from 0 to 100.

I'm assuming that they really calculated a likelihood ratio P(Data|Perfect) / P(Data|Null) instead of the posterior odds P(Perfect|Data) / P(Null|Data), which is what the words they used would mean if taken literally. But maybe they have some priors P(Perfect) / P(Null) that they used. (The thing they should do is just report the likelihood ratio, instead of their posterior.)

If you have your data and want to compute P(Data|Perfect), you can compute the product over all predictions i: Π_i (p_i if event i happened, (1-p_i) if it didn't).

So for example, if I predicted 20%, 70%, 30% and the actual results were No, Yes, Yes, then P(Data|Perfect) = .8 * .7 * .3 = .168. If you have some other hypothesis (e.g. whatever their null is), you can compute P(Data|Other Hypothesis) by using the predictions that hypothesis makes for how your reported probabilities relate to propensities of events. A hypothesis here should be a function f(reported) = P(Event happens | reported).
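Here is a minimal Python sketch of that computation, using the three predictions from the example. The alternative hypothesis shown (every event is a coin flip regardless of what was reported) is purely an illustrative assumption; I don't know what null the quiz actually uses. The function and variable names are made up too.

```python
from math import prod

def likelihood(reported, outcomes, f=lambda p: p):
    """P(Data | hypothesis).

    f maps a reported probability to P(event happens | reported).
    The default f(p) = p is the perfect-calibration hypothesis.
    """
    return prod(
        f(p) if happened else 1.0 - f(p)
        for p, happened in zip(reported, outcomes)
    )

reported = [0.2, 0.7, 0.3]       # predicted 20%, 70%, 30%
outcomes = [False, True, True]   # actual results: No, Yes, Yes

p_perfect = likelihood(reported, outcomes)

# Assumed alternative (not the quiz's null): reports carry no information.
p_coin_flip = likelihood(reported, outcomes, f=lambda p: 0.5)

likelihood_ratio = p_perfect / p_coin_flip
print(p_perfect, p_coin_flip, likelihood_ratio)
```

The ratio is the number worth reporting; turning it into a posterior requires choosing priors for the two hypotheses.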

[-]elazdins4y50

I just launched my own version of a calibration test at https://calibration.lazdini.lv/ (pretty much identical to http://confidence.success-equation.com/, except that the questions should be different each time you visit the site, allowing for regular calibration/recalibration). Questions are retrieved from the free API provided by https://opentdb.com/.

[-]jimrandomh16y50

I would like to see a calibration test with open-ended questions rather than multiple choice. Multiple choice makes it easier to judge confidence, but I'm afraid the calibrations won't transfer well to other domains.

(The test-taker would have to grade their test, since open ended questions may have multiple answers, and typos and minor variations shouldn't count as errors. But other than that, the test would be pretty much the same.)

[-]Isaac King4y10

An open-ended probability calibration test is something I've been planning to build. I'd be curious to hear your thoughts on how the specifics should be implemented. How should test-takers grade their own tests in a way that avoids bias and still gives useful results?

[-]SK216y40

I have seen a problem with selection bias in calibration tests, where trick questions are overrepresented. For example, in this PDF article, the authors ask subjects to provide a 90% confidence interval estimating the number of employees IBM has. They find that fewer than 90% of subjects select a suitable range, which they conclude results from overconfidence. However, IBM has almost 400,000 employees, which is atypically high (more than 4x Microsoft). The results of this study have just as much to do with the question asked as with the overconfidence of the subjects.

Similarly, trivia questions are frequently (though not always) designed to have interesting/unintuitive answers, making them problematic for a calibration quiz where people are expecting straightforward questions. I don't know that to be the case for the AcceleratingFuture quizzes, but it is an issue in general.

[-]Blueberry16y20

That really shouldn't matter. Your calibration should include the chances of the question being a "trick question". If fewer than 90% of subjects give confidence intervals containing the actual number of employees, they're being overconfident by underestimating the probability that the question has an unexpected answer.

[-]SK216y80

Imagine an experiment where we randomize subjects into two groups. All subjects are given a 20-question quiz that asks them to provide a confidence interval on the temperatures in various cities around the world on various dates in the past year. However, the cities and dates for group 1 are chosen at random, whereas the cities and dates for group 2 are chosen because they were record highs or lows.

This will result in two radically different estimates of overconfidence. The fact that the result of a calibration test depends heavily on the questions being asked should suggest that the methodology is problematic.

What this comes down to is: how do you estimate the probability that a question has an unexpected answer? See this quiz: maybe the quizzer is trying to trick you, maybe he's trying to reverse-trick you, or maybe he just chose his questions at random. It's a meaningless exercise because you're being asked to estimate values from an unknown distribution. The only rational thing to do is guess at random.

People taking a calibration test should first see the answers to a sample of the data set they will be tested on.

[-]pengvado16y30

I think the two of you are looking at different parts of the process.

"Amount of trickiness" is a random variable that is rolled once per quiz. Averaging over a sufficiently large number of quizzes will eliminate any error it causes, which makes it a contribution to variance, not systematic bias.

Otoh, "estimate of the average trickiness of quizzes" is a single question that people can be wrong about. No amount of averaging will reduce the influence of that question on the results, so if your reason for caring about calibration isn't to get that particular question right, it does cause a systematic bias when applying the results to every other situation.

[-]JamesAndrix16y40

Wow, hmm. I've only taken quiz 1 so far, and all my high-confidence answer groups scored much lower than the confidence I gave them. For now I blame too much experience with easy multiple-choice tests. I only got 19 out of 50 overall.

Another problem, I think:

You marked your answers to 17 questions as '25% accurate'. Out of these, 1 answers were correct, for a success rate of 5.8823529411765 percent.

Now I thought I was choosing 25% when I didn't know the answer, but this seems to indicate that I had some information, and was biased against playing my (sometimes correct) hunches when marking 25%.
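For a rough sense of how surprising 1 out of 17 is: if those answers really were right 25% of the time, independently of each other, the number correct would follow a binomial distribution. A minimal sketch, where the function name is illustrative and independence across questions is an assumption:

```python
from math import comb

def binomial_cdf(k: int, n: int, p: float) -> float:
    """P(at most k successes in n independent trials with success chance p)."""
    return sum(comb(n, i) * p**i * (1.0 - p) ** (n - i) for i in range(k + 1))

# 17 answers marked 25%, of which only 1 was correct.
p_at_most_one = binomial_cdf(1, 17, 0.25)
print(f"P(at most 1 of 17 correct at 25%) = {p_at_most_one:.3f}")
```

Under those assumptions the chance comes out at about 5%, which fits the reading that the 25% answers were doing worse than pure guessing.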

[-]steven046116y30

(I believe this is his LW account, but feel free to correct me)

This is my current LW account.

There were sequels to the Aumann game here, here, and here; these have better questions, but the lack of auto-scoring probably makes them not worth the effort.

[-]alyssavance16y00

Added, thanks!

[-]jimmy16y20

If anyone is thinking about creating their own, I would suggest questions with numerical answers so you can give upper and lower bounds of varying confidence, rather than picking your confidence on a binary question and then having to force binning or do some sort of filtering.

Also, this lets you give several probability estimates for each question.
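Scoring this kind of test could be as simple as counting how often the true value lands inside the stated bounds and comparing that fraction with the stated confidence. A minimal sketch; the function name and the sample numbers are made up purely for illustration:

```python
def interval_coverage(intervals, truths):
    """Fraction of (low, high) intervals that contain the corresponding true value."""
    hits = sum(low <= truth <= high for (low, high), truth in zip(intervals, truths))
    return hits / len(truths)

# Made-up answers: four intervals given at, say, 90% confidence.
intervals = [(100, 500), (1, 10), (1900, 1950), (30, 60)]
truths = [400, 12, 1912, 45]

coverage = interval_coverage(intervals, truths)
print(f"coverage: {coverage:.0%} (compare with the stated confidence)")
```

Running the same function once per confidence level (50%, 90%, and so on) gives one coverage figure per level without any binning.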

[-]aretae16y20

Douglas Hubbard writes on the topic of calibration as well. He focuses on real-world application of this stuff, and calibration is clearly a part of that.

His 1st book: http://www.amazon.com/How-Measure-Anything-Intangibles-ebook/dp/B001BPE8ZQ/ref=sr_1_3?ie=UTF8&s=books&qid=1258133710&sr=8-3

His site: http://www.hubbardresearch.com/dotnetnuke/

[-]gwern14y00

I found How to Measure Anything pretty interesting in its thorough application of calibration and Fermi calculation to all sorts of problems, although I didn't find the digressions into Excel very useful. Definitely recommended if you don't already have the mental knack for Fermi stuff.