Summary: A game of trivia in which you answer factual questions about the world while also stating how sure you are that you’re right, with the aim of being well calibrated.

Tags: Large, Repeatable 

Purpose: Calibration Trivia is designed to practice proper calibration – recognizing when you're very sure of something vs when you aren't very sure of it. 

Materials: At minimum, you need a list of trivia questions and some writing implements for your audience. If you and your audience have smartphones, we suggest making use of a copy of this spreadsheet and Google Form. In either case, a timer can be useful to time each question, though it's perfectly acceptable to just advance to the next question after what feels like a couple of minutes or when it looks like most people are done.

Announcement: We’re planning to host a trivia game with a twist! If you’ve never been to a trivia night before, the person running it calls out questions, we write down our answers, and a good time is had by all. In addition to answering the question, however, you'll be able to write down how confident you are in your guess, and at the end we check whether you're well calibrated – that is, do you know when you do and do not know the answer? Categories are Literature, Math and Science, History, Sports, and Tabletop Roleplaying Games.

Note: You should make sure to change the categories to match whatever you're using.


1. Describe the following rules to the participants.

"This is a game of trivia, with a special tweak. For anyone unfamiliar, the way trivia works is that I'll present a question, and you'll have a couple of minutes to write down an answer. Then I'll reveal the answer, and if you got it right then you'll get one point. Feel free to chat with each other once you're done guessing and while you're waiting for the next question."

"The tweak is, in addition to writing your answer down, you will also write down how confident you are that your answer is correct in the form of a percentage. If you are very confident, you might write 95, which means if you were this sure about twenty questions you'd expect to only get one of them wrong. If you were guessing wildly, you might write down 1, which means if you were that uncertain about a hundred things, you think you'd get one of them right mostly by coincidence. You'll be scored on calibration according to what's called a Brier Score, which is a Strictly Proper Scoring Rule for predictions – that means that you want to give your actual estimation of how likely you are to be right. You'll generally do worse if you try to answer higher or lower than your actual estimation. Does anyone have any questions?"

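Note: The scoring mechanism suggested is (1 - their probability)^2 if they're right, and (0 - their probability)^2 if they're wrong, where their probability is the confidence they wrote down expressed as a fraction (90% becomes 0.9). Average the scores from each question together; lower is better. Someone who correctly answered with 90% confidence gets scored (1 - 0.9)^2 = 0.01 for that question. The best theoretical Brier Score would be 0, which would require being 100% confident and right on every question, but one can try to get close.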
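If you are tallying scores by hand or in your own script rather than the spreadsheet, the scoring rule in the note above can be sketched roughly as follows. This is a minimal illustration, not the spreadsheet's actual formula; the function name `brier_score` and the example scoresheet are made up for the example.

```python
def brier_score(answers):
    """Average Brier score for a list of (confidence_percent, was_correct) pairs.

    confidence_percent is the number the player wrote down (0 to 100).
    Lower is better; 0 would be a perfect score.
    """
    if not answers:
        raise ValueError("need at least one answered question")
    total = 0.0
    for confidence_percent, was_correct in answers:
        probability = confidence_percent / 100
        outcome = 1.0 if was_correct else 0.0
        # (1 - p)^2 when right, (0 - p)^2 when wrong
        total += (outcome - probability) ** 2
    return total / len(answers)


# Hypothetical scoresheet for one player: (confidence in percent, answered correctly?)
sheet = [(90, True), (60, False), (95, True), (20, False)]
print(round(brier_score(sheet), 4))  # 0.1031
```

Correct Answers points are counted separately: one point per question where the second element of the pair is True.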

2. One at a time, read each question aloud. (A collection of questions is included below, under "Calibration Trivia Questions.") Be sure to speak clearly and loudly enough for everyone to hear. If you happen to have a projector or screen, it can help to put the question up there as well.

Every six questions, announce or display the current points and scores. If you have a very large crowd, it can speed things up to only announce the top five for Correct Answers and the top five for Best Calibrated. In both cases, it's best to announce from the bottom up, starting with the worst scorer and ending with the best.

Repeat until the entire set of questions has been worked through. 

3. Announce the final points and scores. 

Notes: You'll want a venue where you can talk loud enough for everyone to hear you. You may also want to adjust the question list or the number of questions based on the interests of your group or how long you wish the event to run.

Calibration Trivia Questions: Set 1, example scoresheet  1




The people with the best calibration scores will not be those with the most skill at calibration. It will be those who "don't guess" on the trivia questions -- they either know it or they don't (100% or 0% chance of getting it right). This is because if you guess and have (e.g.) a 50% chance of getting it right, then even if you are perfectly calibrated about that 50%, you will still get an expected Brier score of 0.25, as opposed to a score of 0 for someone who "doesn't guess".

Consequently, I don't really see this game as being very useful at measuring calibration.
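The arithmetic in this comment can be checked with a short sketch. It assumes each question is answered correctly with some true chance and that the player writes down some stated confidence, both as fractions; the function name `expected_brier` is made up for the illustration.

```python
def expected_brier(true_chance, stated_confidence):
    """Expected per-question Brier score when the answer is right with
    probability true_chance and the player writes stated_confidence
    (both given as fractions between 0 and 1)."""
    score_if_right = (1 - stated_confidence) ** 2
    score_if_wrong = (0 - stated_confidence) ** 2
    return true_chance * score_if_right + (1 - true_chance) * score_if_wrong


print(expected_brier(0.5, 0.5))  # 0.25: perfectly calibrated coin-flip guesser
print(expected_brier(1.0, 1.0))  # 0.0: someone who simply knows the answer
print(expected_brier(0.5, 0.9))  # overconfident coin-flip guesser does worse than 0.25
```

So a perfectly calibrated guesser can still score worse than a player who only answers what they know, while for any fixed true chance the honest confidence remains the best number to write down.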

Feedback and suggestions for improvement are very welcome!

It's true that someone can easily get an excellent calibration score at the cost of getting no points. This tends to be very obvious when you read out the leaderboard. A quick patch is to turn all the questions into statements and have people estimate how likely they think it is that the statement is true. "What is the element with Atomic Number 29" becomes "The element with Atomic Number 29 is Copper." Then there is no easy path to excellent scores of either kind.

That version is a little less fun and I don't think the change is necessary. I'm curious whether that patch would satisfy your objection. It might be relevant that I don't view the goal as measuring calibration, but as training it. When I've run this, I often see a rapid change in confidences over the course of the first dozen questions as some people who hadn't previously practiced the skill begin to use numbers other than the highest and lowest available.

Sure, that patch wouldn't have the problem I described.

Anyway, do whatever works for you -- if you find this exercise helps people train their calibration, then I suppose that's a good thing. I guess my main point would be not to take too seriously what this method tells us about who is "best" at calibration -- and I guess you're saying people already don't take it seriously in the case of someone who is doing badly at the trivia portion, but I think the failure mode is a bit more general than that. Anyway, I guess it doesn't matter too much.
