Grading my 2020 predictions

by UnexpectedValues, 7th Jan 2021


[Edit: when I copy-pasted from my blog, the colors didn't show up, and I'm not sure how to make them show up; sorry!]

In December 2019, I made 132 probabilistic predictions for 2020. As promised, I’ve come back to evaluate them on three criteria: calibration, personal optimism/pessimism, and performance relative to PredictIt (and an anonymous friend who sent me their predictions for some of the events).

I’ll get to all of those, but first, here are my predictions, color-coded by whether or not the events transpired (green = happened, red = didn’t happen).



I. US Politics

A. Primary elections

1. Biden wins Iowa caucus popular vote (EDIT 1/30/2020: I meant after those who caucused for candidates with less than 15% support realign): 25%
2. Sanders wins Iowa caucus popular vote: 21% (lost on delegates but won popular vote)
3. Warren wins Iowa caucus popular vote: 10%
4. Buttigieg wins Iowa caucus popular vote: 27%
5. Klobuchar wins Iowa caucus popular vote: 13%
6. Biden wins New Hampshire primary popular vote: 21%
7. Sanders wins New Hampshire primary popular vote: 26%
8. Warren wins New Hampshire primary popular vote: 20%
9. Buttigieg wins New Hampshire primary popular vote: 21%
10. Klobuchar wins New Hampshire primary popular vote: 6%
11. Biden wins Nevada caucus popular vote: 37%
12. Sanders wins Nevada caucus popular vote: 33%
13. Warren wins Nevada caucus popular vote: 9%
14. Buttigieg wins Nevada caucus popular vote: 10%
15. Klobuchar wins Nevada caucus popular vote: 5%
16. Biden wins South Carolina primary popular vote: 67%
17. Sanders wins South Carolina primary popular vote: 15%
18. Warren wins South Carolina primary popular vote: 7%
19. Buttigieg wins South Carolina primary popular vote: 5%
20. Klobuchar wins South Carolina primary popular vote: 2%
21. Biden wins the most Super Tuesday delegates (not counting superdelegates): 43%
22. Sanders wins the most Super Tuesday delegates (not counting superdelegates): 18%
23. Warren wins the most Super Tuesday delegates (not counting superdelegates): 14%
24. Buttigieg wins the most Super Tuesday delegates (not counting superdelegates): 15%
25. Klobuchar wins the most Super Tuesday delegates (not counting superdelegates): 6%
26. Biden wins the Democratic nomination: 38%
27. Sanders wins the Democratic nomination: 16%
28. Warren wins the Democratic nomination: 17%
29. Buttigieg wins the Democratic nomination: 19%
30. Klobuchar wins the Democratic nomination: 6%
31. Multiple convention votes are needed to choose the Democratic nominee: 32%
32. Stacey Abrams is the Democratic running mate: 16%
33. Kamala Harris is the Democratic running mate: 8%
34. Cory Booker is the Democratic running mate: 6%
35. Amy Klobuchar is the Democratic running mate: 5%
36. Pete Buttigieg is the Democratic running mate: 4%
37. Sherrod Brown is the Democratic running mate: 3%
38. Julian Castro is the Democratic running mate: 2%
39. Trump wins the Republican nomination: 96%

B. General election

40. Trump wins the general election popular vote: 26%
41. The Republican nominee wins the 2020 presidential election: 44%
42. The Republican nominee wins Arizona: 42%
43. The Republican nominee wins Colorado: 18%
44. The Republican nominee wins Florida: 55%
45. The Republican nominee wins Georgia: 73%
46. The Republican nominee wins Iowa: 63%
47. The Republican nominee wins Maine: 10%
48. The Republican nominee wins ME-02: 60%
49. The Republican nominee wins Michigan: 40%
50. The Republican nominee wins Minnesota: 28%
51. The Republican nominee wins NE-02: 48%
52. The Republican nominee wins Nevada: 25%
53. The Republican nominee wins New Hampshire: 36%
54. The Republican nominee wins North Carolina: 60%
55. The Republican nominee wins Ohio: 80%
56. The Republican nominee wins Pennsylvania: 45%
57. The Republican nominee wins Texas: 75%
58. The Republican nominee wins Virginia: 20%
59. The Republican nominee wins Wisconsin: 50%
60. Democrats keep the House: 75%
61. Republicans keep the Senate: 80%
62. Doug Jones (D-AL) keeps his seat: 30%
63. Martha McSally (R-AZ) keeps her seat: 50%
64. Cory Gardner (R-CO) keeps his seat: 40%
65. Joni Ernst (R-IA) keeps her seat: 80%
66. Susan Collins (R-ME) keeps her seat: 65%
67. Gary Peters (D-MI) keeps his seat: 75%
68. Jeanne Shaheen (D-NH) keeps her seat: 90%
69. Thom Tillis (R-NC) keeps his seat: 65%
70. John Cornyn (R-TX) keeps his seat: 85%
71. The Republican and Democratic nominees total to less than 95% of the vote: 35%

C. Other

72. Trump is removed from office by the US Senate in 2020: 4%
73. Ruth Bader Ginsburg remains a Supreme Court justice at the end of 2020: 80%
74. Tulsi Gabbard runs for president as an independent: 17%
75. Tulsi Gabbard is a Fox News contributor or anchor by the end of 2020: 23% (silly in retrospect: she would have had to resign prior to completing her term)


II. Personal

A. Blog

76. I write 5 or more blog posts in 2020: 94%
77. I write 10 or more blog posts in 2020: 73%
78. I write 20 or more blog posts in 2020: 48% (25)
79. I write 30 or more blog posts in 2020: 26%
80. I write 50 or more blog posts in 2020: 12%
81. The total number of views of my blog in 2020 is at least 500: 95%
82. The total number of views of my blog in 2020 is at least 1000: 80%
83. The total number of views of my blog in 2020 is at least 2000: 65%
84. The total number of views of my blog in 2020 is at least 5000: 35%
85. The total number of views of my blog in 2020 is at least 10000: 20% (17,575)
86. The total number of views of my blog in 2020 is at least 100000: 4%

B. Grad school/research

87. I publish a computer science paper in a conference held in 2020 or a journal edition issued in 2020 (RadicalXchange does not count): 58%
88. I do a plurality of my work in 2020 with Tim Roughgarden: 75%
89. I prove a 3/4-approximation positive result for the problem I’m currently working on by the end of 2020: 62%
90. I go to EC (in Budapest) in 2020: 40% (This was an edge case: EC wasn’t held in Budapest, and I attended a workshop, but would have attended a lot more if it had been in person. I asked a friend, who ruled that this resolved “no”.)

C. EA/Rationality

91. I’m a SPARC staff member in 2020: 33%
92. By the end of 2020, I’m part of a project to implement something like my election-charity platform idea, with a registered domain name: 23%
93. I (co-)run some OBNYC (NYC rationalist) meetup in 2020: 65%
94. I spend at least a month in California in 2020: 80%
95. By the end of 2020, animal welfare considerations will have substantial influence over my diet: 38% (I mostly don’t eat chicken anymore)
96. I consider myself a vegetarian at the end of 2020: 15%
97. I consider myself a vegan at the end of 2020: 2%
98. I make a donation of at least $50 to a third world poverty charity in 2020 (counting College Pulse donations): 93%
99. I make a donation of at least $50 to an existential risk/long-term future charity in 2020: 45%
100. I make a donation of at least $50 to an animal welfare charity in 2020: 60%
101. I go to the RadicalXchange conference (in Sao Paulo) in 2020: 25%

D. Politics

For events 102 through 108, I have put the options Joe Biden, Bernie Sanders, Elizabeth Warren, Pete Buttigieg, Amy Klobuchar, and None of the Above in a random order that I’ve recorded and called them Option 1 through Option 6. (“None of the Above” includes the possibility that I don’t vote.)

102. I will vote for [Option 1]: 2%
103. I will vote for [Option 2]: 50% (this was Biden, see my endorsement here)
104. I will vote for [Option 3]: 6%
105. I will vote for [Option 4]: 25%
106. I will vote for [Option 5]: 13%
107. I will vote for [Option 6]: 4%
108. I use the same word to describe my political identity at the end of 2020 as I do now: 72%
109. I try to vote-swap in the 2020 presidential election: 30%
110. I successfully vote-swap in the 2020 presidential election: 15%

E. Other

111. I’m a Mathcamp mentor in 2020: 20%
112. I publish a non-academic piece of writing in some publication in 2020: 16%
113. I read a book in 2020: 60%
114. I read at least two books in 2020: 35% (the first two books of Brandon Sanderson’s The Reckoners; I got bored of it partway through the third book)
115. I read at least three books in 2020: 25%
116. I write at least three puzzle hunt style puzzles in 2020: 30% (here, here, and one that isn’t public)
117. I write a song (with music, not just lyrics) in 2020: 15%
118. I play squash on at least 25 days in 2020: 78%
119. I visit a country that is not the United States, Hungary, or Brazil in 2020: 25%


III. Other

A. International events

120. Benjamin Netanyahu is the Prime Minister of Israel at the end of 2020: 25%
121. Benny Gantz is the Prime Minister of Israel at the end of 2020: 35%
122. A date is set for a Scottish independence referendum in 2020 (the date doesn’t have to be in 2020): 24%
123. A war breaks out between two countries, both of which either have population in the top 40 or have nuclear weapons. For this to be labeled “true,” it must be a war between the two countries’ governments; for example, if the United States initiates a counter-insurgency operation in Nigeria, that will not count: 7%
124. In July-December 2020, there is a protest in Hong Kong that draws more than 1 million protesters according to the protest organizers or CHRF: 35% (not totally sure about this one, let me know if I’m wrong)

B. Tennis

125. Roger Federer wins a grand slam tournament in 2020: 37%
126. Someone besides Djokovic, Nadal, and Federer wins a men’s singles grand slam tournament in 2020: 50%
127. Serena Williams wins a grand slam tournament in 2020: 50%
128. Four different people win the women’s singles grand slam tournaments in 2020: 55% (Each of the three grand slam tournaments that were held was won by a different woman, but Wimbledon was cancelled!)

C. Computer science

129. The unique games conjecture is widely considered resolved by the end of 2020: 6%
130. P vs. NP is widely considered resolved by the end of 2020: 1%

D. Other

131. The third book in the Kingkiller Chronicle has a publication date set by the end of 2020 (the date doesn’t have to be in 2020): 16%
132. Despacito remains the most-watched YouTube video at the end of 2020: 72% (it’s Baby Shark, which will soon have been watched more than once by the average living person!)



I. Calibration

Being calibrated means that among events you assign a 70% chance to, 70% of them happen (and likewise for other probabilities). Calibration is usually measured by splitting up a forecast’s probabilities into buckets and seeing whether the events in each bucket happened with roughly the correct frequency.

For example, I had 15 events to which I assigned between a 50% and a 60% chance of happening. (This counts events to which I assigned between a 40% and a 50% chance of happening; for those, I just looked at the event’s negation.) The average probability among these 15 forecasts was 53.5%, to be exact. This means that if I were perfectly calibrated, 53.5% of those events should have happened. (Of course there’s a lot of noise there because there are only 15 events, but you get the idea.) That’s what the orange points in the chart below represent: perfect calibration.

What actually happened is that 7 of the 15 events (46.7%) in this bucket transpired. That represents the blue points (in particular the bottom-left blue point) in the plot below: the x-value is the fraction of events in the bucket that were expected to happen if my forecasts were calibrated, and the y-value is the actual fraction that happened.
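The bucketing described above can be sketched in a few lines of Python. This is only an illustration, not the code behind the chart: the function names, the use of integer percentages, the handling of bucket boundaries (a forecast of exactly 60% goes in the 60s bucket), and the example forecasts are all assumptions.

```python
from collections import defaultdict

def fold(percent, happened):
    """Restate a forecast below 50% as a forecast that the event does NOT happen."""
    if percent < 50:
        return 100 - percent, not happened
    return percent, happened

def calibration_buckets(forecasts):
    """Summarize folded forecasts in buckets 50-59, 60-69, ..., 90-100.

    `forecasts` is a list of (percent, happened) pairs with integer percents.
    Returns {bucket_start: (count, mean_forecast_percent, observed_percent)}.
    """
    buckets = defaultdict(list)
    for percent, happened in forecasts:
        p, h = fold(percent, happened)
        start = min(p // 10 * 10, 90)  # a 100% forecast joins the 90s bucket
        buckets[start].append((p, h))
    summary = {}
    for start in sorted(buckets):
        items = buckets[start]
        n = len(items)
        mean_forecast = sum(p for p, _ in items) / n
        observed = 100 * sum(1 for _, h in items if h) / n
        summary[start] = (n, mean_forecast, observed)
    return summary

# Hypothetical forecasts for illustration; these are not taken from the list above.
example = [(55, True), (45, True), (60, False), (25, False), (96, True)]
print(calibration_buckets(example))
```

In each bucket, the mean forecast plays the role of the x-value and the observed percentage plays the role of the y-value.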

Points well below the line represent overconfidence (e.g. when I say something is very likely to happen, it’s only somewhat likely to happen). Points well above the line represent underconfidence. The vertical bars are 95% error bars centered at the orange points: for each bucket, if I really am perfectly calibrated, there’s a 95% chance that the blue point will lie inside the error bar.
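One way to get a feel for how wide such an error bar is for a single bucket is an exact binomial calculation. The sketch below is an assumption about method, not a description of how the chart was made: it treats every forecast in the bucket as having the bucket’s average probability (53.5% for the 15-event bucket), and the function names are made up for illustration.

```python
from math import comb

def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

def central_range(n, p, coverage=0.95):
    """Return (lo, hi) success counts, trimming at most (1 - coverage)/2 from each tail."""
    tail = (1 - coverage) / 2
    lo, cumulative = 0, 0.0
    for k in range(n + 1):
        cumulative += binomial_pmf(k, n, p)
        if cumulative > tail:
            lo = k
            break
    hi, cumulative = n, 0.0
    for k in range(n, -1, -1):
        cumulative += binomial_pmf(k, n, p)
        if cumulative > tail:
            hi = k
            break
    return lo, hi

# The 15-event bucket with an average forecast of 53.5%, where 7 events happened.
lo, hi = central_range(15, 0.535)
print(f"calibrated range: {lo}/15 to {hi}/15; observed: 7/15")
```

Because the forecasts in a bucket are not all identical, and many of the events are correlated, a range computed this way is only a rough guide.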

The blue points lie well inside the error bars. This means that there’s nothing in this data that would lead you to conclude that I’m not calibrated. Now, that doesn’t mean I’m perfectly calibrated (I’m not), but this is an indication that I’m at least decently calibrated. So, I’d say my calibration was good. (See here for a discussion of whether my Democratic primary predictions make me look artificially calibrated; my guess is “no or only a little”.)

The one thing that gives me pause is: I think I should have been a little overconfident this year. That’s because this was a very abnormal year in a way that was totally unpredictable when I made the predictions (COVID wasn’t even publicly known to exist back then). My philosophy is that in most years you should be a little underconfident, but in the occasional year that is really weird in a way that you couldn’t have predicted you should be overconfident.

Ultimately though, I think it isn’t a big concern that I don’t look overconfident, for two reasons. First, relatively few of my predictions (93, 94, 118, 119, and 124-128) were substantially influenced by COVID. And second, by coincidence there was an opposing factor that by itself would have led me to look underconfident. Specifically, it so happened that the political environment in December 2019 was about as favorable to each party as it was in November 2020. As a result, I happened to “call” basically every general election race right (except Georgia). Since basically all the events I forecasted (even those I assigned a 60% chance to) happened, my general election predictions appear underconfident. I think this basically cancelled out what would have been a COVID-induced appearance of overconfidence.


II. Personal optimism/pessimism

I wrote in my 2020 predictions post:

Here are the Personal events that I will be considering when judging whether I was optimistic or pessimistic: 76 through 80 (weight 1/5 each); 81 through 86 (weight 1/6 each); 87; 89; 92-93; 112; 113-115 (weight 1/3 each); 116-119.

Each of these will score a point (or less if they are down-weighted) if I accomplish them. My probabilities imply a predicted score of 5.1 out of 12. Scoring higher is evidence of pessimism; scoring lower is evidence of optimism.

My actual score was 4.1. I ran some simulations and found that if I’m perfectly calibrated, this would be a 25th percentile result, i.e. 4.1 is on the lower edge of my 50% confidence interval for how much I’d accomplish. Seems fine.

I’ll note also that three of those predictions (93, 118, and 119) were very affected by COVID. If I don’t include those, my predicted score would have been 3.44 (compared to an actual score of 4.1). This would have been a 72nd percentile result. Still seems fine.
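The predicted scores above can be checked directly from the listed probabilities, taking the second weighted group to be events 81 through 86 (six events at weight 1/6 each, which is what makes the total 12). The simulation half of this sketch is an assumption rather than a reconstruction of the simulations mentioned above: it draws one random number per threshold chain (blog posts, views, books) so that, for example, writing 20 posts always implies writing 10, treats all other events as independent, and reports the share of simulated scores at or below the actual score.

```python
import random

# Probabilities copied from the prediction list (event number -> probability).
PROBS = {76: .94, 77: .73, 78: .48, 79: .26, 80: .12,
         81: .95, 82: .80, 83: .65, 84: .35, 85: .20, 86: .04,
         87: .58, 89: .62, 92: .23, 93: .65, 112: .16,
         113: .60, 114: .35, 115: .25,
         116: .30, 117: .15, 118: .78, 119: .25}

# Down-weighted chains of nested thresholds: (events, weight per event).
CHAINS = [([76, 77, 78, 79, 80], 1 / 5),
          ([81, 82, 83, 84, 85, 86], 1 / 6),
          ([113, 114, 115], 1 / 3)]
# Events worth one full point each.
SINGLES = [87, 89, 92, 93, 112, 116, 117, 118, 119]

def expected_score(exclude=()):
    """Score implied by the probabilities, optionally dropping some single events."""
    chain_part = sum(w * PROBS[e] for events, w in CHAINS for e in events)
    single_part = sum(PROBS[e] for e in SINGLES if e not in exclude)
    return chain_part + single_part

def simulate_score(rng, exclude=()):
    """Draw one hypothetical year (assumption: one shared draw per chain)."""
    score = 0.0
    for events, weight in CHAINS:
        u = rng.random()  # shared draw keeps nested thresholds consistent
        score += weight * sum(1 for e in events if u < PROBS[e])
    for e in SINGLES:
        if e not in exclude and rng.random() < PROBS[e]:
            score += 1.0
    return score

def share_at_or_below(actual, exclude=(), trials=20000, seed=0):
    """Fraction of simulated years scoring at or below `actual`."""
    rng = random.Random(seed)
    hits = sum(simulate_score(rng, exclude) <= actual + 1e-9 for _ in range(trials))
    return hits / trials

print(round(expected_score(), 2))                        # all 12 points
print(round(expected_score(exclude=(93, 118, 119)), 2))  # without the COVID-affected events
print(share_at_or_below(4.1), share_at_or_below(4.1, exclude=(93, 118, 119)))
```

The percentiles this prints depend on the modeling choices just described, so they need not match the 25th and 72nd percentile figures exactly.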

(I think that making these sorts of excuses in the analysis is bad in most years. But this year truly was special in an unpredictable way, so I think that if my goal is to be calibrated in the future, I should update very slightly toward guessing that I’ll accomplish more, not less.)


III. Relative performance

Calibration very much isn’t everything: for example, you could make a perfectly calibrated forecast for the NBA season by saying that each game is 50/50. The other skill is expertise: actually knowing whether or not something is likely to happen. Judging expertise is best done by comparing yourself to other forecasters.

In December 2019, I noted PredictIt’s probabilities for all of my predictions that were being traded there (see here toward the bottom). After looking at each PredictIt probability, I also made updated forecasts, using my judgment in each case about how far to pull my numbers towards PredictIt’s; these are not the numbers above (again, you can see those numbers here toward the bottom). Now I can compare my original predictions, PredictIt’s probabilities, and my updated predictions using the log scoring rule.

I’ll note (and I’m writing this before having done the comparison) that you shouldn’t read much into these numbers. Since most of the forecasts being compared are political, and political forecasts are really correlated (so the true sample size is quite small), it would take many, many years to figure out whether I’m better than PredictIt or vice versa. This is not to mention that PredictIt and I typically had quite similar probabilities. But I figure there’s no harm in doing the comparison.

The total scores (higher is better) on the 59 of my 132 events that were on PredictIt:

  • My log score: 16.5
  • PredictIt’s log score: 19.1
  • My updated log score: 18.4
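For readers unfamiliar with log scoring, here is a minimal sketch. The exact variant behind the totals above isn’t spelled out here; since those totals are positive, this sketch assumes the score is measured relative to a 50% baseline and uses natural logarithms. Both choices, and the function name, are assumptions, so it should not be expected to reproduce the numbers above.

```python
from math import log

def log_score(forecasts, baseline=0.5):
    """Sum of ln(p / baseline), where p is the probability given to what actually happened.

    `forecasts` is a list of (probability, happened) pairs.
    A forecast of exactly `baseline` scores 0; confident correct forecasts score
    above 0, and confident wrong forecasts score well below 0.
    """
    total = 0.0
    for prob, happened in forecasts:
        p = prob if happened else 1.0 - prob
        total += log(p / baseline)
    return total

# Netanyahu remaining Prime Minister: 25% (mine) vs 40% (PredictIt), resolved yes.
print(log_score([(0.25, True)]))  # ln(0.25 / 0.5)
print(log_score([(0.40, True)]))  # ln(0.40 / 0.5), less negative, so a better score
```

Under any version of the rule, the forecaster who put more probability on what actually happened gets the higher score on that question.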

Why did PredictIt do better than me? The answer is basically that PredictIt was more bullish on Bernie Sanders than I was, and Sanders ended up winning the popular vote in the first three primaries. Should I have been more bullish on Sanders, or did I just get unlucky? I don’t know, but I probably should have been a little more bullish on Sanders. (At the same time, PredictIt was unreasonably bearish on Biden in late February, so I don’t think it’s obvious that I should have trusted that PredictIt was reasoning correctly.)

There were also some cases where I was uninformed and my prediction was silly. For instance, I probably should have assigned much more than a 25% chance to Netanyahu being Prime Minister of Israel at the end of 2020. I said 25%, PredictIt said 40%. This was a case where I deferred to PredictIt (my updated probability was 39%). I think I did a pretty decent job of figuring out how much to change my probabilities in light of PredictIt’s numbers.

Finally, an anonymous friend took up my challenge of making predictions for a subset of the questions. They tried to come up with their predictions independently of mine, so as to not have an advantage. They made predictions for the U.S. Politics questions (#1-75). My log score on those questions was 21.6; theirs was 13.0. My better performance was partly due to giving Sanders a decent chance of winning Nevada, and partly due to being more bearish on Republicans in the general election. (Note that I had an advantage over my friend because I got to choose the questions.)


IV. Overall thoughts

I’m satisfied with my calibration this year. At the same time, calibration is much less than half the battle and I don’t have a good sense of whether my predictions were well-informed. The right way to figure this out is to use something like Metaculus, where I’d be directly competing with other forecasters. I’ve been reluctant to join Metaculus, partly due to activation energy and partly because I don’t want to end up spending too much time on it. But I’m guessing I’ll give in and join the fray sometime this year. 64% chance :)

