Open Thread, Jul. 27 - Aug 02, 2015

Suppose that I am given a calibration question about a racehorse and I guess "Secretariat" (since that's the only horse I remember) and give a 30% probability (since I figure it's a somewhat plausible answer). If it turns out that Secretariat is the correct answer, then I'll look really underconfident.

But that's just a sample size of one. Giving one question to one LWer is a bad method for testing whether LWers are overconfident or underconfident (or appropriately confident). So, what if we give that same question to 1000 LWers?

That is why I looked at all 10 questions in aggregate.

Well, you did not look at calibration; you looked at overconfidence, which I don't think is a terribly useful metric. It ignores the actual calibration (the match between the stated confidence and how often the answers at that confidence turn out to be right) and just smushes everything into two averages.
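To make the distinction concrete, here is a minimal sketch with invented data (not the survey results). The function names `overconfidence` and `calibration_table` and the choice of five bins are assumptions made for illustration. It shows a set of answers whose aggregate overconfidence is zero even though no single confidence level is well calibrated.

```python
from collections import defaultdict


def overconfidence(confidences, correct):
    """Mean stated confidence minus fraction correct (the 'two averages')."""
    mean_conf = sum(confidences) / len(confidences)
    accuracy = sum(correct) / len(correct)
    return mean_conf - accuracy


def calibration_table(confidences, correct, n_bins=5):
    """Group answers by stated confidence.

    Returns one (mean confidence, hit rate, count) tuple per non-empty bin,
    ordered from lowest to highest confidence.
    """
    bins = defaultdict(list)
    for p, hit in zip(confidences, correct):
        # Clamp so that p == 1.0 falls into the top bin.
        idx = min(int(p * n_bins), n_bins - 1)
        bins[idx].append((p, hit))
    table = []
    for idx in sorted(bins):
        items = bins[idx]
        mean_p = sum(p for p, _ in items) / len(items)
        hit_rate = sum(h for _, h in items) / len(items)
        table.append((mean_p, hit_rate, len(items)))
    return table


# Invented data: ten answers given at 90% and ten at 10%,
# with half of each group turning out to be correct.
confidences = [0.9] * 10 + [0.1] * 10
correct = [1] * 5 + [0] * 5 + [1] * 5 + [0] * 5

gap = overconfidence(confidences, correct)
table = calibration_table(confidences, correct)

print(f"overconfidence: {gap:+.2f}")
for mean_p, hit_rate, n in table:
    print(f"stated {mean_p:.0%}  actual {hit_rate:.0%}  (n={n})")
```

In this made-up case the two averages agree (50% mean confidence, 50% correct), so the overconfidence number says nothing is wrong, while the per-bin table shows the 10% answers are right far too often and the 90% answers far too rarely.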

It reminds me of an old joke about a guy who went hunting with his friend the statistician. They found a deer, the hunter aimed, fired -- and missed. The bullet went six feet to the left of the deer. Amazingly, the deer ignored the shot, so the hunter aime...

