Katherine Milkman on Twitter notes how far off the epidemiological expert forecasts were in the linked sample:

https://twitter.com/katy_milkman/status/1244668082062348291

They gave an average estimate of 20,000 cases. The actual outcome was 122,653 by the stated date in the U.S. That's off by a factor of 6.13, and those were experts. Only 3 out of 18 survey respondents managed to get the actual outcome to fall in their 80% confidence interval. If they were perfectly calibrated on this one-off prediction, about 14 should've had the actual outcome fall in their 80% confidence interval. EDIT: No that's not right; Daniel Filan points out that we wouldn't expect 14 of 18 to get it in their range. Error between users in a one-off forecast is often correlated.
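As a quick check on the arithmetic above, here is a minimal sketch in Python. The variable names are my own, and the "expected hits" figure is the naive one that the EDIT retracts: it would only be a fair benchmark if the respondents' errors were independent, which they are not on a single shared question.

```python
# Figures quoted in the post; variable names are illustrative only.
expert_mean_estimate = 20_000
actual_cases = 122_653

# How many times larger the outcome was than the average expert estimate.
ratio = actual_cases / expert_mean_estimate

respondents = 18
nominal_coverage = 0.80
observed_hits = 3

# Naive expectation: assumes each respondent's hit is an independent
# 80% event, which does not hold for one shared question.
naive_expected_hits = respondents * nominal_coverage

print(f"off by a factor of {ratio:.2f}")
print(f"naive expected hits: {naive_expected_hits:.1f} of {respondents}")
print(f"observed hits: {observed_hits} of {respondents}")
```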

I was curious how this compares to the Metaculus community forecast (note: not the machine learning fed one, just the simple median prediction). Unfortunately, the interface doesn't show the full distribution at date x; it only reports what the median was at the time. If the expert central tendency was off by a factor of 6.13, how far off was it for Metaculus?

I looked into it in this document:

https://docs.google.com/document/d/1TcG1gceeVgimr8lBTCAYUscOt_ADyuSpdmt1d6Scve8/edit?usp=sharing

Sadly, a direct comparison is not really feasible, since we weren't predicting the same questions. But suppose all predictions of importance were entered into platforms such as the Good Judgement Project Open or Metaculus. Then comparisons between groups could be trivial and continuous. This isn't even "experts versus non-experts". The relevant comparison is at the platform level: "untrackable and unworkable one-off PDFs of somebody's projections" versus proper scoring and aggregation over time. Since Metaculus accounts can be entirely anonymous, why wouldn't we want every expert to enter their forecasts into a track record? That would make it possible to find out if the person is a dart-throwing chimp. You should assume half of them are.

Nope. Suppose I roll a 100-sided die, and all LessWrongers write down their centred 80% credible interval for where the answer should fall. If the LWers are rational and calibrated, that interval should be [10,90]. So the actual outcome will fall in everybody's credible interval or nobody's. The relevant averaging should happen across questions, not across predictors.
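The die example can be sketched as a small simulation. This is only an illustration under stated assumptions: the die has faces 1 to 100, every forecaster reports the same interval [10, 90] with inclusive endpoints, and the function name, forecaster count, and question count are made up for the sketch. With integer faces that interval covers 81 of 100 outcomes, so coverage comes out near 81% rather than exactly 80%.

```python
import random

def simulate_shared_interval(n_forecasters=18, n_questions=10_000,
                             lo=10, hi=90, seed=0):
    """Count, per question, how many forecasters' intervals contain the roll.

    All forecasters are assumed to report the identical interval [lo, hi],
    so on any one question they all hit or all miss together.
    """
    rng = random.Random(seed)
    hits_per_question = []
    for _ in range(n_questions):
        roll = rng.randint(1, 100)  # one roll of a 100-sided die
        hits = sum(1 for _ in range(n_forecasters) if lo <= roll <= hi)
        hits_per_question.append(hits)
    return hits_per_question

hits_per_question = simulate_shared_interval()

# Across forecasters on a single question: only all-or-nothing outcomes.
distinct_hit_counts = set(hits_per_question)

# Across questions: the fraction of rolls that land inside the interval.
coverage_across_questions = (
    sum(1 for h in hits_per_question if h > 0) / len(hits_per_question)
)

print(sorted(distinct_hit_counts))
print(round(coverage_across_questions, 3))
```

The per-question hit count never takes an intermediate value such as 14, while the hit rate across many questions sits near the nominal level, which is the sense in which calibration is averaged over questions rather than over predictors.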

Yes, but the calibrated and centered interval is uniquely [10, 90].