Forecasting Prize Results

NunoSempere; ozziegooen

The "Forecasting Innovations Prize" was announced on the 15th of November of 2020 on the Effective Altruism Forum and on LessWrong, with the goal of incentivizing valuable research around forecasting. We received 10 submissions.

The judges (AlexRJL, Eric Neyman, Tamay Besiroglu, Linch Zhang, Ozzie Gooen and myself) each recommended an amount of money to be awarded to each submission. The next section gives a short summary of each entry, the prize it was assigned, and the reasons the judges gave. This is followed by a brief discussion of the judging process and takeaways.

We will be contacting authors soon.

Crowd-forecasting COVID-19

The post describes the results of a COVID-19 crowd-forecasting project created during the author's PhD. The judges didn’t know of any other app in which human forecasters could conveniently forecast different points in a time series, with confidence intervals. The project’s forecasts were submitted to the German and Polish Forecast Hub, and they did surprisingly well in comparison with other groups.

Judges brought up the issue that R/shiny is probably a suboptimal technology for a web app. Further, as of the time the post was published, neither the project under consideration nor other submissions to the German and Polish Forecast Hub were able to outperform a model that simply predicts constant cases on a four-week horizon.
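To make the baseline in the previous paragraph concrete, here is a minimal sketch of a "constant cases" forecast. The function names, the case counts, and the use of mean absolute error are all assumptions made for illustration; they are not taken from the post or from the Forecast Hub, whose actual scoring rule is not described here.

```python
def constant_baseline(history, horizon=4):
    """Forecast that the last observed value persists for `horizon` steps."""
    if not history:
        raise ValueError("history must contain at least one observation")
    return [history[-1]] * horizon


def mean_absolute_error(forecast, actual):
    """Average absolute gap between forecast and outcome (a stand-in score)."""
    return sum(abs(f - a) for f, a in zip(forecast, actual)) / len(forecast)


# Hypothetical weekly case counts, invented for illustration only.
observed = [1000, 1200, 1500]
later = [1600, 1550, 1700, 1650]

baseline = constant_baseline(observed)           # repeats the last value, 1500
baseline_error = mean_absolute_error(baseline, later)
print(baseline, baseline_error)
```

A forecaster "outperforms" this baseline when their error on the same four weeks is lower, which is a low bar in stable periods and a surprisingly high one when case numbers change direction.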

This post receives a prize of $250.

Incentivizing forecasting via social media

The post explores the implications of integrating forecasting functionality with social media platforms. The authors consider several important potential issues at some length, along with possible solutions and indications for next steps. The scenario they consider, if it were to occur, could have a large impact on the 'information economy'.

However, as the authors note, the feasibility of the proposal is very unclear (<1%, though note that Twitch recently added some prediction functionality). Further, the authors were not aware of Facebook’s Forecast at the time they wrote the post.

This post receives a prize of $250.

Central Limit Theorem investigation

The post visualizes how quickly the central limit theorem works in practice, i.e., how many distributions of different types one has to sum (or convolve) to approximate a Gaussian distribution in practice. The visualizations are excellent, and give the readers intuitions about how long the central limit theorem takes to apply. Judges thought that explanations of important ideas to a specific community are valuable even if they are only new to that community.

As a caveat, the post requires understanding that the density of the sum of two independent variables is the convolution of their densities. That is, that when the post mentions “the number of convolutions you need to look Gaussian”, this is equivalent to “the number of times you need to sum independent instances of a distribution in order for the result to look Gaussian”. This point is mentioned in an earlier post of the overall sequence. Judges also weren’t sure to what extent this post was “forecasting-related.” Future competitions, if they happen, will have a clearer cut-off.
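The equivalence between summing independent variables and convolving their densities can be checked numerically. The sketch below is not the post's code: it assumes NumPy, picks Uniform(0, 1) as an illustrative distribution, and uses hypothetical function names. It convolves the density with itself and measures the largest pointwise gap to the Gaussian with the same mean and variance.

```python
import numpy as np


def n_fold_sum_density(density, dx, n):
    """Density of the sum of n i.i.d. variables, via repeated convolution."""
    result = density
    for _ in range(n - 1):
        # Multiplying by dx turns the discrete sum into a Riemann
        # approximation of the convolution integral.
        result = np.convolve(result, density) * dx
    return result


def max_gap_to_gaussian(n, n_points=500):
    """Largest pointwise gap between the n-fold sum of Uniform(0, 1)
    variables and the Gaussian with matching mean and variance."""
    dx = 1.0 / n_points
    density = np.ones(n_points)  # Uniform(0, 1) density on grid-cell midpoints
    summed = n_fold_sum_density(density, dx, n)
    # Each cell is represented by its midpoint, so n summed cells shift by n/2.
    grid = (np.arange(len(summed)) + n / 2.0) * dx
    mean, var = n * 0.5, n / 12.0  # mean and variance of the sum
    gaussian = np.exp(-((grid - mean) ** 2) / (2 * var)) / np.sqrt(2 * np.pi * var)
    return float(np.max(np.abs(summed - gaussian)))


for n in (1, 2, 4, 8):
    print(n, round(max_gap_to_gaussian(n), 4))
```

For this particular distribution the gap shrinks quickly as n grows; heavier-tailed or more skewed distributions would be expected to need more terms.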

This post receives a prize of $120.

Forecasting of Priorities (Czech Priorities)

This post explains a set of ideas by Czech Priorities to use forecasting as a method of public deliberation, in particular to identify "priorities" or "mega-trends". Judges thought that, with a less messy design, this post could have won the first prize. In particular, it seems that this group has managed to convince the Czech government to give it two large grants and to pay attention to the result.

However, the suggested implementation really was quite messy. On the one hand, they suggest predicting the result of expert deliberation on the importance of “priorities”, but the selection of those experts could be politicized. On the other hand, one of the proposed mechanisms incorporates both forecasting and preference elicitation, and might not end up producing either good elicitation or good forecasting.

This post receives a prize of $90.

One's Future Behavior as a Domain of Calibration

This post advocates for forecasting one's future actions, and presents the author's method to do so. Some judges liked that it is pretty easy for this post to have an actual impact, as long as at least one person acts on it. One small detail the judges disagreed with was the post’s assertion that calibration doesn’t transfer between domains (this somewhat conflicts with some of the judges’ own experiences).

This post receives a prize of $80.

What to do about short AI timelines?

This short sequence gathers three posts on short timelines, and asks two questions: how to bet on short AI timelines, and how one’s influence depends on the length of AI timelines. The sequence is part of a longer-running investigation by Daniel Kokotajlo into short timelines.

The posts used the EA Forum’s question functionality, and the author didn’t seem very satisfied with the responses, though the least forecasting-related post in the series did see more discussion on LessWrong. Judges found that other posts by the author on the topic of timelines (e.g., this one) were much stronger, whereas the particular research questions in the prize submission didn’t really pan out. Some judges thought that question creation might be underrated.

This post receives a prize of $70.

How might better collective decision-making backfire?

The post is faithful to the title, and comes up with or elicits several pathways through which collective decision-making might backfire.

Judges found the question asked to be important, but found it hard to evaluate the answers, because there was no overall framework to do so. In particular, there was no discussion about which concerns were or would have been historically important. It is also unclear whether any practical actions will be taken as a result of the post, or whether it will be built upon.

This post receives a prize of $60.

The Fermi Paradox has not been dissolved

The post points out some flaws in Dissolving the Fermi Paradox, a paper by Sandberg et al. Among other reasons, having good probabilities around the Fermi paradox is valuable because it provides (indirect) evidence about the existence of a "Great Filter" and thus for our likelihood of extinction.

Judges disagreed substantially about the extent to which the points raised in the post were substantive, and about whether the author was overconfident or too forceful. There was also some disagreement about how closely the post was related to “human judgmental forecasting.”

This post receives a prize of $50.

The First Sample Gives the Most Information

The post concisely introduces a powerful and simple concept. Judges agreed that the post wasn’t hugely impactful, but that it probably did have a pretty great ratio of value to time spent on it.

This post receives a prize of $50.

A tenth post was also submitted, but some flaws were identified, and the author asked us not to mention it until it is fixed.

Judging process

Judges read each submission and produced:

An assessment of the quality of the project (execution)
An estimate of how valuable the project was
A funding recommendation
Comments as to their reasoning

The reasons why the funding recommendations were not directly proportional to impact and quality were:

Adjusting for closeness to forecasting: more impactful projects which weren't that related to forecasting received smaller prizes.
Some (but not all) judges tried to think about what signals giving higher or lower prizes sends. For example, some judges gave higher prizes to projects which had higher expected values even if they didn’t pan out in the end. Similarly, some judges penalized a post which sounded very overconfident even if it was otherwise impactful or valuable.
A high quality project can have low value if it belongs to a less impactful domain.
Some judges felt higher effort posts were worth more money per unit of impact, perhaps because lower effort posts could have been written by someone else if the original author hadn't done it.

After giving their initial estimates, judges met in a Zoom call to discuss their estimates. This was done by going project by project and bringing up disagreements. Afterwards, judges updated their estimates and recommendations. The final prize is simply the mean of all judges' recommendations.
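The aggregation step is plain averaging. As a minimal sketch, with recommendation figures that are hypothetical and invented purely for illustration (the individual judges' numbers are not given in this post):

```python
from statistics import mean


def final_prize(recommendations):
    """Final prize as the unweighted mean of all judges' recommendations."""
    if not recommendations:
        raise ValueError("at least one recommendation is required")
    return mean(recommendations)


# Hypothetical post-discussion recommendations from six judges, in dollars.
example_recommendations = [300, 200, 250, 150, 350, 250]
example_prize = final_prize(example_recommendations)
print(example_prize)
```

Because the mean is unweighted, a single judge with an unusually high or low recommendation moves the final prize by that difference divided by the number of judges.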

Comments and Reflections

The counterfactual impact of this prize seems uncertain. Of the 10 submissions, only three were counterfactually caused by the prize, with the other seven being submitted because I (Nuño) asked the authors to do so after finding them by browsing forecasting-related content on the EA Forum and LessWrong.

Overall, it is possible that there were too many judges, who cumulatively spent too much time judging, and that the marginal value of a judge wasn’t very high. However, when hashing out disagreements, each judge did bring unique points.

If there is a second round for this prize before 2022, entries published after the end of the first round will be accepted, so that authors have no incentive to hold back forecasting-related content until a prize is announced.

Appendix: Quality Adjusted Research Papers

Judges also estimated the impact of these projects in terms of Quality Adjusted Research Papers (QARPs). QARPs are intended to have both relative value (a 20 QARPs project should be estimated to be twice as valuable as a 10 QARPs project) and absolute meaning (0.1 QARPs, or 100 mQARPs, should correspond to "A fairly valuable paper", such as this one).

The value judges assigned to each submission was:

Crowd-forecasting COVID-19: 32 mQARPs
Central Limit Theorem investigation: 17 mQARPs
The Fermi Paradox has not been dissolved: 16 mQARPs
Incentivizing forecasting via social media: 12 mQARPs
Forecasting of Priorities (Czech Priorities): 8 mQARPs
What to do about short AI timelines?: 8 mQARPs
One's Future Behavior as a Domain of Calibration: 7 mQARPs
How might better collective decision-making backfire?: 5 mQARPs
The First Sample Gives the Most Information: 3 mQARPs

Note that this method of rating is highly speculative, and having judges use it was in part intended as a test. Judges brought up that they weren’t sure that the scale was well defined, and that they were much more sure about their own relative values than about the absolute magnitude. Also, note that this didn't consider the relevance of forecasting, which is the main reason why these values don't perfectly correlate with the prizes.
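For readers who want to check how closely the two rankings track each other, the sketch below pairs each prize stated above with its mQARP estimate and computes a Pearson correlation. The choice of Pearson correlation and the helper name `pearson` are assumptions made for this example; the judges did not report a correlation figure.

```python
from math import sqrt

# (prize in dollars, value in mQARPs), as listed in this post.
results = {
    "Crowd-forecasting COVID-19": (250, 32),
    "Incentivizing forecasting via social media": (250, 12),
    "Central Limit Theorem investigation": (120, 17),
    "Forecasting of Priorities (Czech Priorities)": (90, 8),
    "One's Future Behavior as a Domain of Calibration": (80, 7),
    "What to do about short AI timelines?": (70, 8),
    "How might better collective decision-making backfire?": (60, 5),
    "The Fermi Paradox has not been dissolved": (50, 16),
    "The First Sample Gives the Most Information": (50, 3),
}


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


prizes = [prize for prize, _ in results.values()]
values = [value for _, value in results.values()]
r = pearson(prizes, values)
print(f"Pearson correlation between prize and mQARPs: {r:.2f}")
```

The two posts that pull the correlation down the most are the ones the judges flagged as less clearly forecasting-related or more uncertain, which is consistent with the adjustment described above.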
