Like ignoranceprior said, my AI Impacts post has three intuitive ways of thinking about the results:
Way One: Let’s calculate some examples of prediction patterns that would give you Brier scores like those mentioned above. Suppose you make a bunch of predictions with 80% confidence and you are correct 80% of the time. Then your Brier score would be 0.32, roughly middle of the pack in this tournament. If instead it was 93% confidence correct 93% of the time, your Brier score would be 0.132, very close to the best superforecasters and to GJP’s aggregated forecasts.14 In these examples, you are perfectly calibrated, which helps your score—more realistically you would be imperfectly calibrated and thus would need to be right even more often to get those scores.
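To make that arithmetic easy to check, here is a minimal Python sketch of the two-category Brier formulation (squared error summed over both the "yes" and "no" outcomes), which is the one that reproduces the 0.32 and roughly the 0.132 figures above:

```python
def brier_two_category(p_event, occurred):
    """Two-category Brier score for a binary question:
    squared error summed over both the yes and no outcomes."""
    outcome = 1.0 if occurred else 0.0
    return (p_event - outcome) ** 2 + ((1.0 - p_event) - (1.0 - outcome)) ** 2

def expected_brier(confidence, hit_rate):
    """Average score if you always predict `confidence` and are right `hit_rate` of the time."""
    return (hit_rate * brier_two_category(confidence, True)
            + (1.0 - hit_rate) * brier_two_category(confidence, False))

print(round(expected_brier(0.80, 0.80), 3))  # 0.32 - roughly middle of the pack
print(round(expected_brier(0.93, 0.93), 3))  # ~0.13 - close to the best superforecasters
```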
Way Two: “An alternative measure of forecast accuracy is the proportion of days on which forecasters’ estimates were on the correct side of 50%. … For all questions in the sample, a chance score was 47%. The mean proportion of days with correct estimates was 75%…”15 According to the chart in the post, the superforecasters were on the right side of 50% almost all the time.16
Way Three: “Across all four years of the tournament, superforecasters looking out three hundred days were more accurate than regular forecasters looking out one hundred days.”17 (Bear in mind, this wouldn’t necessarily hold for a different genre of questions. For example, information about the weather decays in days, while information about the climate lasts for decades or more.)
Brier scores are scoring three things:
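One standard way to unpack a Brier score into three components is the Murphy decomposition: reliability (how far your stated probabilities sit from the observed frequencies), resolution (how much your forecasts separate events from non-events), and uncertainty (the base rate's inherent unpredictability). Whether or not those are exactly the three things meant here, a minimal sketch of that decomposition for binary questions:

```python
from collections import defaultdict

def murphy_decomposition(forecasts, outcomes):
    """Split the (single-outcome) Brier score into reliability (calibration),
    resolution, and uncertainty. Forecasts are grouped by exact stated value,
    so brier == reliability - resolution + uncertainty holds exactly.
    (For binary questions, the two-category convention doubles the score.)"""
    n = len(forecasts)
    base_rate = sum(outcomes) / n
    bins = defaultdict(list)
    for f, o in zip(forecasts, outcomes):
        bins[f].append(o)

    reliability = sum(len(os) * (f - sum(os) / len(os)) ** 2
                      for f, os in bins.items()) / n
    resolution = sum(len(os) * (sum(os) / len(os) - base_rate) ** 2
                     for os in bins.values()) / n
    uncertainty = base_rate * (1 - base_rate)
    brier = sum((f - o) ** 2 for f, o in zip(forecasts, outcomes)) / n
    return brier, reliability, resolution, uncertainty

# Example: a slightly overconfident forecaster on eight binary questions.
forecasts = [0.9, 0.9, 0.9, 0.9, 0.1, 0.1, 0.1, 0.1]
outcomes  = [1,   1,   1,   0,   0,   0,   0,   1]
bs, rel, res, unc = murphy_decomposition(forecasts, outcomes)
print(round(bs, 3), round(rel - res + unc, 3))  # 0.21 0.21 - the parts recover the score
```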
Note that in Tetlock's research there is no hard cutoff between regular forecasters and superforecasters - he somewhat arbitrarily designated the top 2% as superforecasters, and showed 1) that the top 2% of forecasters tended to remain in the top 2% from year to year, and 2) that some of the techniques they used for thinking about forecasts could be shown in an RCT to improve the forecasting accuracy of most people.
This is a bit complicated, but to start, we can only answer this question for the types of questions we have empirical data on from superforecasters. That's because the fact that superforecasters do better is an empirical observation, not a clear predictive/quantitative theory about what makes people better or worse. I'm going to use data from the AI Impacts blog post - https://aiimpacts.org/evidence-on-good-forecasting-practices-from-the-good-judgment-project-an-accompanying-blog-post/ - because I don't have the book or the datasets handy right now.
The original tournament was about short- and medium-term geopolitical and similar questions. The scoring used time-weighted Brier scores, and note that Brier scores are specific to the question set. On these questions, an aggregate of superforecaster predictions had roughly 60-70% lower Brier scores than the control group of "regular" forecasters. The best individual superforecaster had a score of 0.14, while the no-skill Brier score on these questions - the score you get by just assigning equal probability to everything - was 0.53. But that's not the right comparison if we're comparing superforecasters to regular forecasters. The average across forecasters (apparently including superforecasters) was close to 0.35. If we adjust that to 0.4 to roughly remove the superforecasters, a 65% reduction from there gives 0.14 - the same score as the best individual superforecaster.
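As a quick sanity check on those relative numbers (illustrative arithmetic only; the 0.53 no-skill figure reflects the tournament's actual question mix, which wasn't purely binary):

```python
def brier_two_category(p_event, occurred):
    """Two-category Brier score for a binary question."""
    o = 1.0 if occurred else 0.0
    return (p_event - o) ** 2 + ((1.0 - p_event) - (1.0 - o)) ** 2

# Assigning 50/50 to a binary question scores 0.5 however it resolves; the
# reported no-skill score of 0.53 reflects the actual mix of questions.
print(brier_two_category(0.5, True), brier_two_category(0.5, False))  # 0.5 0.5

# A 65% reduction from the rough non-superforecaster average of 0.4
# lands on the best individual superforecaster's 0.14.
print(round(0.4 * (1 - 0.65), 2))  # 0.14
```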
How is that possible? Aggregation. And the benefits of aggregation aren't due to the skill of the superforecasters; they come from the law of large numbers averaging out individual noise. So maybe we don't want to give the superforecasters credit for that part of the edge, but superforecasting as measured does include it.
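To see the law-of-large-numbers effect in isolation, here is a toy simulation (made-up parameters, not GJP data) in which every forecaster is equally skilled and merely noisy; the averaged forecast still beats the typical individual just by cancelling independent noise:

```python
import random

random.seed(0)

def brier(p, outcome):
    """Two-category Brier score for a binary question."""
    return (p - outcome) ** 2 + ((1 - p) - (1 - outcome)) ** 2

N_QUESTIONS, N_FORECASTERS, NOISE = 2000, 100, 0.25  # made-up parameters

individual_scores, aggregate_scores = [], []
for _ in range(N_QUESTIONS):
    true_p = random.random()                        # underlying chance of the event
    outcome = 1 if random.random() < true_p else 0
    # every forecaster sees the true probability plus independent noise
    forecasts = [min(1.0, max(0.0, random.gauss(true_p, NOISE)))
                 for _ in range(N_FORECASTERS)]
    individual_scores.append(sum(brier(f, outcome) for f in forecasts) / N_FORECASTERS)
    aggregate_scores.append(brier(sum(forecasts) / N_FORECASTERS, outcome))

print(sum(individual_scores) / N_QUESTIONS)  # mean score of an individual (worse)
print(sum(aggregate_scores) / N_QUESTIONS)   # score of the averaged forecast (better)
```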
So how do we understand a Brier score? It's the average squared distance between your prediction and what actually happened, i.e. 1 or 0. A score of 0.14 therefore means that, on average, you put roughly 65% on the things that did happen and roughly 35% on the things that didn't. But remember these were time-weighted averages: if someone predicted 50% on day 1 and drifted steadily down to 20% by the close of the question, and it resolved negatively, their average prediction was 35% and their Brier score comes out around 0.13-0.14 (the time-average of the squared daily errors, slightly above 0.35^2 = 0.1225).
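Reproducing that example numerically (a sketch; the hundred-day length and the linear decline are my assumptions to make it concrete):

```python
DAYS = 100                        # assumed question length, for concreteness
# prediction declines linearly from 50% on day 1 to 20% at the close
predictions = [0.5 + (0.2 - 0.5) * d / (DAYS - 1) for d in range(DAYS)]
outcome = 0                       # the question resolves negatively

avg_prediction = sum(predictions) / DAYS
time_weighted_brier = sum((p - outcome) ** 2 for p in predictions) / DAYS

print(round(avg_prediction, 3))       # 0.35
print(round(time_weighted_brier, 3))  # ~0.13, a bit above 0.35**2 = 0.1225
```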
One way to think about the gap is how far out people can predict before their predictions become as noisy as chance. One of the surprising findings of the GJP was that even the best forecasters decayed to roughly chance accuracy around the one-year mark (IIRC). Normal people can't even predict what has already happened (they object to, or can't coherently update on, basic facts about the present).
Is there an intuitive way to explain how much better superforecasters are than regular forecasters? (I can look at the tables in https://www.researchgate.net/publication/277087515_Identifying_and_Cultivating_Superforecasters_as_a_Method_of_Improving_Probabilistic_Predictions but I don't have an intuitive understanding of what Brier scores mean, so I'm not sure what to make of them.)