Can AI Outpredict Humans? Results From Metaculus's Q3 AI Forecasting Benchmark
Metaculus's Q4 Benchmark series is now live, with $30,000 in prizes. Benchmark the state of the art in AI forecasting against the best humans on real-world questions.

Metaculus's Q3 AI Benchmarking Series aimed to assess how the best bots compare to the best humans on real-world forecasting questions, like those found on Metaculus. Over the quarter, 55 bots competed for $30,000 on 255 weighted questions, with a team of 10 Pros serving as a human benchmark.

We found that Pro forecasters were significantly better than top bots (p = 0.036) using log scoring with a weighted t-test. This main result compares the median forecast of 10 Pro Forecasters against the median forecast of 9 top bots on a set of 113 questions that both humans and bots answered. That analysis follows the methodology we laid out before the resolutions were known. We use weighted scores and weighted t-tests throughout this piece, unless explicitly stated otherwise.

We further found that:

* The Pro forecaster median was more accurate than all 34 individual bots that answered more than half of the weighted questions. The difference was statistically significant in 31 of those comparisons.
* The top bots have worse calibration and discrimination than the Pros.
* The top bots are not appropriately scope sensitive.
* The Metaculus single-shot bot powered by GPT-4o, intended as a baseline, finished slightly higher than the bot powered by Claude 3.5. The Metaculus bot powered by GPT-3.5 finished last out of 55 bots, worse than simply forecasting 50% on every question.

Selecting a Bot Team

We identify the top bots by looking at a leaderboard that includes only questions that were asked to the bots, but not the Pro forecasters. Using a weighted t-test, we calculated a 95% confidence interval for each bot and sorted the bots by their lower bounds. The table below shows that the top 10 bots out of 55 all had average Peer scores over 7 and answered over 100 weighte