Testing Cross-Model Consistency in Economic and Policy Predictions
Building on the findings of Pham & Cunningham (2024) on narrative-prompted predictions, I ran a large-scale experiment comparing how different LLMs forecast economic and policy outcomes under different electoral scenarios. GPT-4o, GPT-4o-mini, and Grok all projected better outcomes under a Harris administration than under a Trump administration, with Grok showing the largest gaps.
Methodology
To elicit these forecasts, I prompted GPT-4o, GPT-4o-mini, and Grok to simulate the reporting of 2025 metrics under each electoral scenario. I ran 500 trials per model/metric pair, using narrative-prompting techniques shown to improve LLM forecast accuracy. For example, this was the prompt for real GDP per capita:
Write a scene from January 2026 where a senior
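The trial loop itself is straightforward. Below is a minimal sketch of how such an experiment can be structured, assuming a hypothetical `query_model` function standing in for each provider's API, and an illustrative paraphrase of the narrative prompt (not the exact wording used); each completion is parsed for the first numeric value it reports.

```python
import re

# Hypothetical stand-in for a provider API call; the actual experiment
# queried GPT-4o, GPT-4o-mini, and Grok through their respective APIs.
def query_model(model: str, prompt: str) -> str:
    raise NotImplementedError

# Illustrative paraphrase of the narrative prompt, not the exact text used.
NARRATIVE_PROMPT = (
    "Write a scene from January 2026 where a senior economist reports "
    "the final 2025 value of {metric} under a {administration} administration."
)

def run_trials(model, metric, administration, n_trials=500, query=query_model):
    """Collect numeric forecasts parsed from n_trials narrative completions."""
    values = []
    for _ in range(n_trials):
        text = query(model, NARRATIVE_PROMPT.format(
            metric=metric, administration=administration))
        # Take the first number in the completion, tolerating thousands commas.
        match = re.search(r"-?\d+(?:\.\d+)?", text.replace(",", ""))
        if match:  # skip completions with no parseable number
            values.append(float(match.group()))
    return values
```

Per-scenario distributions (e.g. mean and spread of `values` for Harris vs. Trump, per model and metric) are then what get compared across models.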