There's actually a big problem with using Brier scores for open-ended questions like this: if you're, say, 50% confident you have the right answer, the optimal move is to instead report "Don't know / bleeblabloo, probability 0.0001". Then you get a good Brier score for knowing you would be wrong.
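
To put rough numbers on that incentive, here is a minimal sketch. It assumes the per-question Brier score is the squared difference between the stated probability and the 0/1 outcome (lower is better); the function names are made up for illustration.

```python
def brier(prob: float, outcome: int) -> float:
    """Squared error between a stated probability and the 0/1 outcome (lower is better)."""
    return (prob - outcome) ** 2


def expected_brier(stated: float, chance_right: float) -> float:
    """Expected Brier score when the written answer is right with probability chance_right."""
    return chance_right * brier(stated, 1) + (1 - chance_right) * brier(stated, 0)


# Honest: write down a real answer you think is 50% likely to be right, and say 50%.
honest = expected_brier(stated=0.5, chance_right=0.5)

# Sandbag: write down a junk answer that is certainly wrong, and say 0.0001.
sandbag = expected_brier(stated=0.0001, chance_right=0.0)

print(f"honest:  {honest:.8f}")
print(f"sandbag: {sandbag:.8f}")
```

Under these assumptions the honest guess expects 0.25 per question, while the junk answer scores about 0.00000001, which is the gap being described.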

We ran this at our meetup today and it was the subject of much discussion. A big conclusion seemed to be that Brier scores work best when there is a fixed, limited number of possibilities to guess from; when the number of possibilities is large/unknown and you can guess "I don't know," you get this bad behavior.

We came up with a kind of hacky solution that gave you negative points for wrong answers and positive points for right ones, scaled to the probability you gave, plus regular Brier scores for the True/False questions. It's unlikely that solution was a proper scoring rule, but it was somewhat better at removing the incentive to always guess "[wrong answer] with probability epsilon."
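
The exact rule isn't spelled out here, so the following is only one possible reading, labeled as an assumption: score +p for a right answer and -p for a wrong one, where p is the stated probability. The sketch checks which stated probability maximizes the expected score for a given true belief.

```python
def linear_score(stated: float, correct: bool) -> float:
    """Hypothetical rule (an assumption, not the rule actually used): +stated if right, -stated if wrong."""
    return stated if correct else -stated


def expected_linear(stated: float, belief: float) -> float:
    """Expected score when the answer is right with probability `belief`."""
    return belief * linear_score(stated, True) + (1 - belief) * linear_score(stated, False)


def best_report(belief: float, steps: int = 100) -> float:
    """Grid-search the stated probability that maximizes the expected score."""
    grid = [i / steps for i in range(steps + 1)]
    return max(grid, key=lambda q: expected_linear(q, belief))


for belief in (0.3, 0.7):
    print(f"true belief {belief:.1f} -> best stated probability {best_report(belief):.2f}")
```

If the rule was something like this, it would indeed not be proper: the best report jumps to 0 or 1 depending on which side of 50% your belief falls. It does remove the reward for a junk answer at epsilon, though, since that now earns roughly zero instead of a near-perfect score.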

[-]Screwtape3y20

The quick hack I'd use if I didn't want people to be able to easily guess wrong with high certainty would be to use True/False or multiple choice questions. That said, I don't currently think of this as a big problem?

There are two scores: Calibration and Correct Answers. If someone has remarkably good calibration and almost no correct answers, then they're probably deliberately guessing outlandish answers and being sure that they're wrong. That's not worth bragging rights; it's the equivalent of running around the side of the obstacles on an obstacle course. Someone who's correctly 20% confident on most of the questions can get a worse Brier score but six Correct Answer points, or an excellent Brier score and zero Correct Answer points, and the former is (to me) more impressive. If you are actually totally clueless, then "[wrong answer] with probability epsilon" is actually the right response.

"I notice that I don't actually know this" is (in my opinion) a useful skill to pick up, if you can avoid also picking up "I should pretend that I know nothing." Still, the option to make it multiple choice exists, and there might be a better scoring rule. (I deliberately avoided making some kind of combined score, because I didn't want less obvious strategic exchange rates between correct answers and calibration.)

[-]Dacyn3y21

The people with the best calibration scores will not be those with the most skill at calibration. They will be those who "don't guess" on the trivia questions -- they either know it or they don't (100% or 0% chance of getting it right). This is because if you guess and have (e.g.) a 50% chance of getting it right, then even if you are perfectly calibrated about that 50%, you will still get a Brier score of 0.25, as opposed to a score of 0 for someone who "doesn't guess".
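
A short sketch of the arithmetic behind this, assuming the usual per-question Brier score (stated probability minus the 0/1 outcome, squared). For someone who is right with probability p and also says p, the expected score works out to p(1 - p); the function name is illustrative only.

```python
def calibrated_expected_brier(p: float) -> float:
    """Expected Brier score for someone who is right with probability p and states p."""
    # Right with probability p (error 1 - p), wrong with probability 1 - p (error p).
    return p * (1 - p) ** 2 + (1 - p) * p ** 2  # algebraically equal to p * (1 - p)


for p in (0.0, 0.25, 0.5, 0.75, 1.0):
    print(f"confidence {p:.2f} -> expected Brier {calibrated_expected_brier(p):.4f}")
```

The expected score is zero only at the extremes and peaks at 0.25 for a 50% guess, so two equally well-calibrated people can end up with very different scores depending on how often they are genuinely uncertain.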

Consequently, I don't really see this game as being very useful at measuring calibration.

[-]Screwtape3y10

Feedback and suggestions for improvement are very welcome!

It's true that someone can easily get an excellent calibration score at the cost of getting no points. This tends to be very obvious when you read out the leaderboard. A quick patch is to turn all the questions into statements and have people estimate how likely they think the statement is to be true. "What is the element with Atomic Number 29?" becomes "The element with Atomic Number 29 is Copper." Then there is no easy path to excellent scores of either kind.
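
As a rough check on why the statement format closes the loophole: once the statement is fixed, the player can no longer choose an answer they know is wrong, so the only lever left is the stated probability. The sketch below assumes the standard Brier score and uses hypothetical helper names; it searches for the stated probability that minimizes the expected score for a given belief.

```python
def expected_brier_on_statement(stated: float, belief: float) -> float:
    """Expected Brier score for a fixed statement that is true with probability `belief`."""
    return belief * (stated - 1) ** 2 + (1 - belief) * stated ** 2


def best_statement_report(belief: float, steps: int = 100) -> float:
    """Grid-search the stated probability that minimizes the expected Brier score."""
    grid = [i / steps for i in range(steps + 1)]
    return min(grid, key=lambda q: expected_brier_on_statement(q, belief))


for belief in (0.1, 0.3, 0.5, 0.9):
    print(f"belief {belief:.1f} -> best stated probability {best_statement_report(belief):.2f}")
```

In this setup the best stated probability matches the belief, so shading toward epsilon only hurts unless you really do think the statement is almost certainly false.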

That version is a little less fun and I don't think the change is necessary. I'm curious whether that patch would satisfy your objection. It might be relevant that I don't view the goal as measuring calibration, but as training it. When I've run this, I often see a rapid change in confidences over the course of the first dozen questions as some people who hadn't previously practiced the skill begin to use numbers other than the highest and lowest available.

[-]Dacyn3y10

Sure, that patch wouldn't have the problem I described.

Anyway, do whatever works for you -- if you find this exercise helps people train their calibration, then I suppose that's a good thing. I guess my main point would be not to take too seriously what this method tells us about who is "best" at calibration -- and I guess you're saying people already don't take it seriously in the case of someone who is doing badly at the trivia portion, but I think the failure mode is a bit more general than that. Anyway, I guess it doesn't matter too much.

[-]Saul Munn2y10

"If you and your audience have smartphones, we suggest making use of a copy of this spreadsheet and google form."

Are "spreadsheet" and "google form" meant to be linked to something?

[-]Screwtape3y10

Updated with a couple of variations and a link to a google drive folder with multiple question sets. The True/False version is from the comments and suggestions people left. The range version is from Ben Orlin's Outrangeous.

[-]benjamincosman3y10

The link labeled "Calibration Trivia Sets" goes to a single slideshow labeled "Calibration Trivia Set 1 TF" rather than a folder with multiple sets; I assume (with 95% probability :) ) that this is a mistake?
