This post was written in connection with work I'm doing for the Machine Intelligence Research Institute (MIRI), but it hasn't been formally vetted by MIRI. I'm posting it to LessWrong because it may interest a segment of the LessWrong readership.
In order to assess the quality of current and future forecasts, it's important to consider the historical record of forecasting. Doing such a historical evaluation typically requires a systematic record of past forecasts. The set of forecasts may be:
- collected explicitly for the purpose of an experiment to test the quality of forecasts,
- collected implicitly in the form of a market-clearing price or other indicator, or
- collected for the purpose of directly creating useful forecasts. In this case, the evaluation that happens later is incidental and is not the main goal of generating the forecasts.
Forecasts may be evaluated for any of these (for a longer discussion, see here):
- Accuracy of forecasts: How close are the forecasts to what actually transpires?
- Rationality or efficiency of forecasts: Are the forecasters ignoring obvious ways to improve the quality of their forecasts? Forecasts can rarely be proved to be efficient or rational, but we can test for the presence of some specific forms of irrationality or inefficiency. For example, the forecasts may have a systematic bias, in which case applying Theil's correction for the bias would yield a better forecast.
- Utility of forecasts: How much more valuable is it to have correct forecasts than incorrect forecasts?
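To make the accuracy and bias criteria above concrete, here is a minimal sketch in Python. The numbers are made up for illustration, and the function names are my own. The correction shown is a plain constant shift by the mean error, which is a simplified stand-in and not the full Theil correction mentioned above. In practice the bias would be estimated on past forecasts and applied to new ones, rather than estimated and applied on the same data as done here.

```python
from statistics import mean

def mean_absolute_error(forecasts, actuals):
    """Accuracy: average absolute distance between forecast and outcome."""
    return mean(abs(f - a) for f, a in zip(forecasts, actuals))

def mean_bias(forecasts, actuals):
    """Systematic bias: average signed error (positive means over-forecasting)."""
    return mean(f - a for f, a in zip(forecasts, actuals))

# Hypothetical data, invented for this illustration.
forecasts = [12.0, 15.0, 11.0, 14.0]
actuals = [10.0, 13.0, 10.0, 11.0]

bias = mean_bias(forecasts, actuals)

# Constant-shift correction: subtract the estimated bias from every forecast.
corrected = [f - bias for f in forecasts]

mae_raw = mean_absolute_error(forecasts, actuals)
mae_corrected = mean_absolute_error(corrected, actuals)

print(f"bias={bias:.2f} MAE raw={mae_raw:.2f} MAE corrected={mae_corrected:.2f}")
```

If a simple correction like this reliably improves accuracy on new data, that is evidence the original forecasts were inefficient in the sense described above.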
The evaluation could also take at least two different perspectives:
- It could focus on the individuals or institutions making the forecasts, and consider the quality of the forecasts along various dimensions in relation to the incentives and conditions facing these individuals and institutions. This is a black box approach with respect to the forecasting method used. The focus is on explaining why some people or institutions have better incentives or informational advantages that allow them to come up with better forecasts, rather than on what methods work better.
- It could focus on the methods used for the forecasts. This is best seen in cases where we compare the performance of programs or software packages that use different forecasting algorithms to guess unknown points in a time series from known points.
A quick list of forecast evaluations so far
See also the detailed discussion of each of the evaluations later in the post.
|Name of forecast evaluation or forecast data referenced in evaluations||How were the forecasts collected?||Evaluation primarily for accuracy or efficiency?||Focus of evaluation (forecasters or forecasting methods)?|
|Makridakis Competitions||Explicitly collected for the forecast evaluation (in the form of a competition).||Primarily accuracy||Forecasting methods|
|Survey-based macroeconomic forecasts||Collected for direct utility of forecasts||The literature includes evaluations of both accuracy and efficiency||Forecasters|
|Tetlock study of expert political judgment||Explicitly collected for the forecast evaluation||Focused on accuracy, using basic computer algorithms as benchmark||Forecasters (but also discussion of the overall philosophies used)|
|Tauri Group Retrospective||Collected for direct utility of forecasts, retrospectively used for study||Focused on accuracy and also systematic bias (a form of irrationality)||Forecasting methods as well as nature of the items being forecast|
In this post, I describe existing evaluations and what we can learn from them.
Makridakis Competitions
The Makridakis Competitions (Wikipedia), known in the forecasting literature as the M-Competitions, are three competitions organized by teams led by forecasting researcher Spyros Makridakis. Here's a quick listing and summary of the competitions (table from Wikipedia):
|No.||Informal name for competition||Year of publication of results||Number of time series used||Number of methods tested||Other features|
|1||M Competition or M-Competition||1982||1001 (used a subsample of 111 for the methods where it was too difficult to run all 1001)||15 (plus 9 variations)||Not real-time|
|2||M-2 Competition or M2-Competition||1993||29 (23 from collaborating companies, 6 from macroeconomic indicators)||16 (including 5 human forecasters and 11 automatic trend-based methods) plus 2 combined forecasts and 1 overall average||Real-time, many collaborating organizations, competition announced in advance|
|3||M-3 Competition or M3-Competition||2000||3003||24|
According to the authors, the following main conclusions held in all three competitions:
- Statistically sophisticated or complex methods do not necessarily provide more accurate forecasts than simpler ones.
- The relative ranking of the performance of the various methods varies according to the accuracy measure being used.
- The accuracy when various methods are combined outperforms, on average, the individual methods being combined and does very well in comparison to other methods.
- The accuracy of the various methods depends on the length of the forecasting horizon involved.
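Two of the conclusions above (rankings depend on the accuracy measure, and combined forecasts tend to do well) can be illustrated with a small Python sketch. The data below is contrived by me to make the effect visible and is not taken from any of the competitions. The two "methods" are just hard-coded forecast lists, and the combination is a simple unweighted average. A combination is not guaranteed to beat the best individual method on every data set; the competitions' claim is about average performance.

```python
from math import sqrt
from statistics import mean

def mae(forecasts, actuals):
    """Mean absolute error."""
    return mean(abs(f - a) for f, a in zip(forecasts, actuals))

def rmse(forecasts, actuals):
    """Root mean squared error; penalizes large misses more heavily than MAE."""
    return sqrt(mean((f - a) ** 2 for f, a in zip(forecasts, actuals)))

# Contrived data: method A is always off by a moderate amount,
# method B is usually exact but has one large miss.
actuals = [100.0, 100.0, 100.0, 100.0]
method_a = [104.0, 104.0, 96.0, 96.0]
method_b = [100.0, 100.0, 100.0, 110.0]

# Simple unweighted average of the two methods' forecasts.
combined = [(a + b) / 2 for a, b in zip(method_a, method_b)]

for name, f in [("A", method_a), ("B", method_b), ("combined", combined)]:
    print(f"{name}: MAE={mae(f, actuals):.2f} RMSE={rmse(f, actuals):.2f}")
```

On this data, B ranks ahead of A by MAE while A ranks ahead of B by RMSE, and the averaged forecast has lower error than either method under both measures.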
Although the organizers of the M3-Competition did contact researchers in the area of artificial neural networks (ANNs) to seek their participation in the competition, only one researcher participated, and that researcher's forecasts fared poorly. The reluctance of most ANN researchers to participate at the time was due to the computationally intensive nature of ANN-based forecasting and the large number of time series used for the competition. In 2005, Crone, Nikolopoulos and Hibon organized the NN-3 Competition, using 111 of the time series from the M3-Competition (not the same data, because it was shifted in time, but the same sources). The NN-3 Competition found that the best ANN-based forecasts performed comparably with the best known forecasting methods, but were far more computationally intensive. It was also noted that many ANN-based techniques fared considerably worse than simple forecasting methods, despite greater theoretical potential for good performance. More on the NN-3 Competition here. It's quite possible that if the competition were rerun a few years from now, the neural network methods would outperform the best simple methods. We'll talk more about simple versus complicated methods in a later post.
Survey-based macroeconomic forecasts
More details are available in my post reviewing the track record of survey-based macroeconomic forecasting. The following overall conclusions seem to emerge:
- For mature and well-understood economies such as that of the United States, consensus forecasts are not notably biased or inefficient. In cases where they miss the mark, this can usually be attributed to insufficient information or shocks to the economy.
- There may, however, be some countries, particularly those whose economies are not sufficiently well-understood, where the consensus forecasts are more biased.
- The evidence on whether individual forecasts are biased or inefficient is murkier, but the research generally points in the direction of some individual forecasts being biased. Some people have posited a "rational bias" theory where forecasters have incentives to choose a value that is plausible but not the most likely in order to maximize their chances of getting a successful unexpected prediction. We can think of this as an example of product differentiation. Other sources and theories of rational bias have also been posited, but there is no consensus in the literature on whether and how these are sufficient to explain observed individual bias.
Tetlock study of expert political judgment
For his book Expert Political Judgment, Tetlock surveyed 284 experts and collected a total of 28,000 predictions. His findings, as described in the book and in an article for Cato Unbound co-authored with Dan Gardner, are as follows (note that the language is copy-pasted from the Cato Unbound article but restructured somewhat for sentence flow):
- The average expert’s forecasts were revealed to be only slightly more accurate than random guessing—or, to put more harshly, only a bit better than the proverbial dart-throwing chimpanzee. And the average expert performed slightly worse than a still more mindless competition: simple extrapolation algorithms that automatically predicted more of the same.
- The experts could be divided roughly into two overlapping yet statistically distinguishable groups. One group (the hedgehogs) would actually have been beaten rather soundly even by the chimp, not to mention the more formidable extrapolation algorithm. The other (the foxes) would have beaten the chimp and sometimes even the extrapolation algorithm, although not by a wide margin.
- The hedgehogs tended to use one analytical tool in many different domains; they preferred keeping their analysis simple and elegant by minimizing “distractions.” These experts zeroed in on only essential information, and they were unusually confident—they were far more likely to say something is “certain” or “impossible.” In explaining their forecasts, they often built up a lot of intellectual momentum in favor of their preferred conclusions. For instance, they were more likely to say “moreover” than “however.”
- The foxes used a wide assortment of analytical tools, sought out information from diverse sources, were comfortable with complexity and uncertainty, and were much less sure of themselves—they tended to talk in terms of possibilities and probabilities and were often happy to say “maybe.” In explaining their forecasts, they frequently shifted intellectual gears, sprinkling their speech with transition markers such as “although,” “but,” and “however.”
- It's unclear whether the performance of the best forecasters is the best that is in principle possible.
- This widespread lack of curiosity—lack of interest in thinking about how we think about possible futures—is a phenomenon worthy of investigation in its own right.
Tetlock followed up this research by co-creating The Good Judgment Project (Wikipedia), which aggregated information from large numbers of participants who had access to Google search and the Internet but didn't necessarily have prior subject matter expertise. The Good Judgment Project produced better forecasts than other contestants in the IARPA Aggregative Contingent Estimation contest. This finding combines the idea that foxes have advantages over hedgehogs (Google searches by people without much prior knowledge resemble fox-like thinking) and the miracle of aggregation.
Tauri Group Retrospective
The report titled Retrospective Analysis of Technology Forecasting: InScope Extension by Carie Mullins for the Tauri Group often goes by the name of the Tauri Group Retrospective. The report was published on August 13, 2012 and includes 2,092 forecasts that were found to be timely, specific, complete, and relevant enough to be further verified and assessed for accuracy. The following were the main findings (from Table ES-2 of the paper, Page 3):
- In general, forecasts provide more accurate predictions than uninformed guesses. Six of the eight methodologies statistically are more accurate than a theoretical probability of success (random guess). Although qualitative trend analysis and gaming and scenarios methods have observed accuracies better than a random guess, at a 95% confidence interval there is no statistical evidence that these methods would perform better than a guess.
- Forecasts based on numeric trends are more accurate than forecasts based on opinion. Forecasts generated from quantitative trend analyses have statistically higher success rates than do forecasts generated from other methodologies.
- Forecasts are more likely to overestimate the event date. This is a change from our previous study, which indicated that there was a balance between pessimistic and optimistic forecasts.
- Short-term forecasts are more accurate than medium- and long-term forecasts.
- A predictive model of forecast accuracy could not be developed. Forecast accuracy appears to be influenced by a random component or some other attribute not captured in the study.
- Forecasts that clearly describe timeframe, technology, predicted event, and associated performance metrics are more informative.
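The first finding above distinguishes between a method whose observed accuracy is better than a random guess and one that is statistically better at the 95% level. The Python sketch below shows one way such a distinction can arise. It is my own illustration and not the report's procedure, which isn't described here: it assumes a normal-approximation confidence interval for a success rate, and the counts and the guess probability of 0.25 are hypothetical numbers I chose, not figures from the report.

```python
from math import sqrt

Z_95 = 1.96  # two-sided 95% quantile of the standard normal distribution

def success_rate_interval(successes, total, z=Z_95):
    """Normal-approximation confidence interval for a success rate."""
    p = successes / total
    half_width = z * sqrt(p * (1 - p) / total)
    return p - half_width, p + half_width

def beats_guess(successes, total, guess_probability):
    """True if the whole interval lies above the random-guess probability."""
    lower, _ = success_rate_interval(successes, total)
    return lower > guess_probability

# Hypothetical counts: (correct forecasts, total forecasts) per method.
methods = {"method with many forecasts": (60, 150),
           "method with few forecasts": (9, 25)}
GUESS_PROBABILITY = 0.25  # hypothetical chance of success for a random guess

for name, (successes, total) in methods.items():
    lower, upper = success_rate_interval(successes, total)
    verdict = beats_guess(successes, total, GUESS_PROBABILITY)
    print(f"{name}: rate={successes / total:.2f} "
          f"interval=({lower:.3f}, {upper:.3f}) beats guess: {verdict}")
```

Both hypothetical methods have an observed success rate above 0.25, but only the one with more forecasts has an interval that excludes 0.25. With few forecasts, the interval is too wide to rule out guess-level performance.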