I recently spent a while looking at how different people had designed their probability calibration exercises (for ideas on how to design my own), and they turned out to be quite difficult to find. Many of the best ones were the least advertised and hardest to locate online. I figured I'd compile them all here in case anyone else finds themselves in a similar position. Please let me know about any I missed and I'll add them to the post. Many of these are old and no longer maintained, so no guarantees as to quality.
https://www.openphilanthropy.org/calibration or https://80000hours.org/calibration-training/ (Different URLs for same application.)
Funny timing. I'm actually in the process of working on https://calibration-training.netlify.app/ and am planning to post an initial alpha release to LessWrong soon! I need to seed the database with more questions first though; right now there are only 10. I have a script and an approach that should make it easy enough to get tens of thousands soon. This is helpful though. I'll look through the existing resources and see if there's anything I can use to improve my app.
The Open Philanthropy and 80,000 Hours links are for the same app, just at different URLs.
I made an Android app based on http://acritch.com/credence-game/, which you can find here.
And funny timing for me too: I just hosted a web version of the Aumann Agreement Game at https://aumann.io/ last week (most likely more riddled with bugs than a dumpster mattress) and was holding off on testing it until I had some free time to post about it.
This looks super neat, thank you for sharing. I just did a quick test and can confirm that it is in fact riddled with bugs. If it would help, I can write up a list of what needs fixing.
That would be helpful if you have the time, thanks!
Well, the biggest problem is that it doesn't seem to work. I tested it in a 2-player game where we both locked in an answer, but the game didn't progress to the next round. I waited for the timer to run out, but it still didn't progress, just stayed at 0:00. Changes to my probability are also not visible to the other players until I lock mine in.
A few more minor issues:
Thanks! I'll look into these. Refactoring the entire frontend codebase is probably worth it, considering I wrote it months ago and it's kinda embarrassing to look back at.
This is fantastic. We used Critch's calibration game and the Metaculus calibration trainer for our Practical Decision-Theory course, but it's always good to have a wide variety of exercises and questions.
It would be nice if you wrote a short note for each link ("requires download", "questions are from 2011", etc.), or sorted the list somehow :)
Metaculus has a calibration tutorial too: https://www.metaculus.com/tutorials/
I've been thinking about adding a calibration exercise to https://manifold.markets as well, so I'm curious: what makes one particular set of calibration exercises more valuable than another? Better UI? Interesting questions? Legible or shareable results?
Questions about a topic that I don't know about result in me just putting the max entropy distribution on that question, which is fine if it's rare, but leads to unhelpful results if they make up a large proportion of all the questions. Most calibration tests I found pulled from generic trivia categories such as sports, politics, celebrities, science, and geography. I didn't find many that were domain-specific, so that might be a good area to focus on.
Some of them don't tell me what the right answers are at the end, or even which questions I got wrong, which I found unsatisfying. If there's a question that I marked as 95% and got wrong, I'd like to know what it was so that I can look into that topic further.
It's easiest to get people to answer small numbers of questions (<50), but that leads to a lot of noise in the results. A perfectly calibrated human answering 25 questions at 70% confidence could easily get 80% or 60% of them right and show up as miscalibrated. Incorporating statistical techniques to prevent that would be good. (For example, calculate the standard deviation for that number of questions at that confidence level, and only tell the user that they're over/under confident if they fall outside it.) The fifth one in my list above does something neat where they say "Your chance of being well calibrated, relative to the null hypothesis, is X percent". I'm not sure how that's calculated though.
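The standard-deviation check described above is straightforward to sketch. Here's a minimal, hypothetical version (the function name, the one-standard-deviation threshold, and the verdict strings are my own choices, not taken from any of the apps linked here): it computes the expected number of correct answers at a given confidence level, the binomial standard deviation of that count, and only calls the user over/under confident if they fall outside the band.

```python
import math

def calibration_verdict(n_correct, n_questions, confidence, z=1.0):
    """Check whether an observed hit rate is consistent with a stated
    confidence level, given binomial sampling noise.

    Hypothetical helper for illustration; z sets how many standard
    deviations of slack to allow before flagging miscalibration.
    """
    expected = confidence * n_questions
    # Standard deviation of a Binomial(n, p) count: sqrt(n * p * (1 - p))
    sd = math.sqrt(n_questions * confidence * (1 - confidence))
    deviation = n_correct - expected
    if abs(deviation) <= z * sd:
        return "consistent with good calibration"
    # Fewer correct than claimed -> overconfident; more -> underconfident
    return "overconfident" if deviation < 0 else "underconfident"

# 25 questions at 70% confidence: expected 17.5 correct, sd ~ 2.3,
# so getting 16 right (64%) shouldn't be flagged as miscalibration.
print(calibration_verdict(16, 25, 0.70))
```

For the "chance of being well calibrated relative to the null hypothesis" figure, my guess is it's something like an exact binomial test (e.g. `scipy.stats.binomtest`) per confidence bucket, but I don't know how that particular site computes it.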
Nice! Added these to the wiki on calibration: https://www.lesswrong.com/tag/calibration