How does this model account for "bad" uses of prediction markets, e.g. cases where a market user artificially corrupts the outcome, illegally uses their access to national secrets to extract from the market, or otherwise uses the market extractively rather than predictively?
In section 1 of your findings, the "Expected" column of the table is misleading. It assumes that probabilities within each bucket are uniformly distributed (or more generally, symmetrically distributed around the center of the bucket's range).
A more faithful Expected value would be the average over each market's probability within a given bucket. This is the true rate that perfectly calibrated markets would resolve at.
I suspect that this is the cause of the significant discrepancy between the Expected and Actual resolution rates near the 0% and 100% extremes.
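To make the point above concrete, here is a minimal sketch comparing a bucket's midpoint with the mean price of the markets actually in it. The function name `bucket_expected` and the example prices are made up for illustration; they are not taken from the post's dataset.

```python
from statistics import mean


def bucket_expected(prices, lo, hi):
    """Return (midpoint, mean_price, n) for prices in the closed range [lo, hi].

    A closed range is used here for simplicity; a full bucketing scheme
    would need to decide which bucket owns a shared edge.
    """
    in_bucket = [p for p in prices if lo <= p <= hi]
    midpoint = (lo + hi) / 2
    # An empty bucket has no mean price to report.
    mean_price = mean(in_bucket) if in_bucket else None
    return midpoint, mean_price, len(in_bucket)


# Hypothetical prices: most markets in the top bucket sit very close to 100%.
example_prices = [0.96, 0.99, 0.995, 0.998, 0.999]
midpoint, mean_price, n = bucket_expected(example_prices, 0.95, 1.0)
print(f"midpoint={midpoint:.3f}  mean_price={mean_price:.4f}  n={n}")
```

If prices in the 95-100% bucket cluster near 100%, the mean price sits above the 97.5% midpoint, so a perfectly calibrated market would resolve YES more often than the midpoint suggests.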
Summary
I analyzed 7,661 resolved binary markets from Polymarket to answer a simple question: when a market trades at X%, does the event actually happen X% of the time?
Short answer: Polymarket is well-calibrated at the extremes and shows systematic underpricing in the mid-range, with an overall Brier score of 0.025 at 24 hours before resolution. Calibration improves closer to resolution.
Interactive calibration tool with full data
Methodology
Data collection: I pulled all resolved binary markets from Polymarket's Gamma Markets API (condition_id-based), filtered to those with resolution data and CLOB price history. This yielded 7,661 markets with usable resolution data, of which ~2,000 had sufficient price history for time-horizon analysis.
Calibration calculation: For each probability bucket (0-5%, 5-15%, ..., 95-100%), I computed the fraction of markets in that bucket that actually resolved YES. A perfectly calibrated market would show a 45-degree line: markets at 60% resolve YES 60% of the time.
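The bucketing step can be sketched roughly as follows. This is not the post's actual code: the interior bucket edges are inferred from the "0-5%, 5-15%, ..., 95-100%" pattern, and `calibration_table` is a hypothetical helper name. It reports the mean price per bucket alongside the actual YES rate, so the comparison does not depend on the bucket midpoint.

```python
import numpy as np

# Assumed edges: 0-5%, 5-15%, 15-25%, ..., 85-95%, 95-100%.
EDGES = np.array([0.0, 0.05, 0.15, 0.25, 0.35, 0.45,
                  0.55, 0.65, 0.75, 0.85, 0.95, 1.0])


def calibration_table(prices, outcomes, edges=EDGES):
    """Group markets by price bucket and compare mean price to YES rate.

    prices:   market probabilities in [0, 1]
    outcomes: 1 if the market resolved YES, 0 otherwise
    """
    prices = np.asarray(prices, dtype=float)
    outcomes = np.asarray(outcomes, dtype=float)
    # Digitizing against the interior edges yields bucket indices
    # 0..len(edges)-2; a price of exactly 1.0 lands in the last bucket.
    idx = np.digitize(prices, edges[1:-1])
    rows = []
    for b in range(len(edges) - 1):
        mask = idx == b
        n = int(mask.sum())
        if n == 0:
            continue  # skip empty buckets rather than report NaN
        rows.append({
            "bucket": (float(edges[b]), float(edges[b + 1])),
            "n": n,
            "mean_price": float(prices[mask].mean()),
            "actual_yes_rate": float(outcomes[mask].mean()),
        })
    return rows
```

Calling `calibration_table(prices, outcomes)` on two equal-length sequences returns one row per non-empty bucket; a well-calibrated bucket has `actual_yes_rate` close to `mean_price`.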
Time horizons: I examined prices at three snapshots before resolution: 24 hours, 7 days, and 30 days. This tests whether calibration improves as resolution approaches (you'd expect it to, since late prices incorporate more information).
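One plausible way to take these snapshots, assuming each market's price history is a list of Unix timestamps sorted ascending with a parallel list of prices (the post does not specify the exact format), is to take the last observed price at or before the cutoff. `price_at_horizon` is a hypothetical helper, not the post's code.

```python
from bisect import bisect_right

HORIZONS_SECONDS = {
    "24h": 24 * 3600,
    "7d": 7 * 24 * 3600,
    "30d": 30 * 24 * 3600,
}


def price_at_horizon(timestamps, prices, resolved_at, horizon_seconds):
    """Return the last price at or before (resolved_at - horizon_seconds).

    timestamps must be sorted ascending and aligned with prices.
    Returns None when the history does not reach back far enough.
    """
    cutoff = resolved_at - horizon_seconds
    # Number of observations with timestamp <= cutoff.
    i = bisect_right(timestamps, cutoff)
    if i == 0:
        return None
    return prices[i - 1]
```

Markets for which this returns `None` at a given horizon have to be dropped from that horizon's analysis, which is one reason the time-horizon sample would be smaller than the full set of resolved markets.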
Brier scores: Computed as the mean squared error between the market probability and the binary outcome (0 or 1). Perfect forecasting scores 0; always forecasting 50% scores 0.25.
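As a minimal sketch of that definition (the function name is illustrative, not from the post's code):

```python
def brier_score(probabilities, outcomes):
    """Mean squared error between forecast probabilities and 0/1 outcomes."""
    if len(probabilities) != len(outcomes):
        raise ValueError("probabilities and outcomes must have the same length")
    if not probabilities:
        raise ValueError("need at least one forecast")
    squared_errors = [(p - o) ** 2 for p, o in zip(probabilities, outcomes)]
    return sum(squared_errors) / len(squared_errors)


# Confident and correct forecasts score 0.
print(brier_score([1.0, 0.0], [1, 0]))
# Forecasting 50% on everything scores 0.25 regardless of outcomes.
print(brier_score([0.5, 0.5, 0.5, 0.5], [1, 0, 1, 0]))
```

Because the error is squared, a dataset dominated by markets priced near 0% or 100% that resolve as expected will produce a very low score even if mid-range prices are off.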
Key Findings
1. Excellent calibration at extremes, systematic underpricing in the middle
The 65-75% bucket stands out: markets priced at ~70% actually resolved YES 96.5% of the time. This is consistent with a "long-shot bias" operating in reverse, where popular mid-probability events are systematically underpriced. However, with only n=29 markets in this bucket, the result could be noise.
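One rough way to put a number on the noise question is an exact binomial tail: if every market in the bucket truly had a 70% chance of resolving YES, how often would at least this many resolve YES? The sketch below assumes 28 of 29 markets resolved YES (the count closest to the reported 96.5%; the post does not state it) and assumes the markets are independent, which may not hold if several of them track the same underlying event.

```python
from math import comb


def binomial_upper_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p), computed exactly."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


n_markets = 29
yes_count = 28   # assumption: nearest whole count to the reported 96.5%
true_rate = 0.70  # assumption: every market in the bucket is priced at 70%

tail = binomial_upper_tail(yes_count, n_markets, true_rate)
print(f"P(at least {yes_count} of {n_markets} YES | p={true_rate}) = {tail:.6f}")
```

Under these assumptions the tail probability comes out well below 1%. Two caveats cut the other way: correlated markets make the effective sample smaller than 29, and scanning many buckets for the largest deviation makes some extreme result likely by chance.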
2. Calibration improves closer to resolution
This is the expected pattern: as resolution approaches, prices converge toward 0% or 100%, and calibration improves. The 30-day score of 0.042 is still quite good (well below the 0.25 baseline of always forecasting 50%).
3. Category breakdown
Filtering by category reveals differences:
The interactive tool lets you filter by category to explore this.
Limitations and caveats
Selection bias: Polymarket lists markets it thinks will attract volume. These are likely more "forecastable" than arbitrary questions. The calibration results may not generalize to harder-to-predict domains.
Resolution mechanics: Some markets have ambiguous resolution criteria. I did not filter for resolution disputes, which could introduce noise.
Small mid-range samples: The most interesting finding (underpricing at 65-75%) has only 29 markets. I would not draw strong conclusions from this alone.
Binary markets only: Multi-outcome markets (e.g., "Who will win the election?" with 5+ candidates) were excluded. These have different calibration dynamics.
Survivorship: Markets that were voided or de-listed are not in the dataset. If poorly-calibrated markets are more likely to be voided, this biases toward showing better calibration.
Comparison to other forecasting platforms
For context, Metaculus reports overall Brier scores of ~0.07-0.10 for their community median forecasts. Superforecasters in the Good Judgment Project achieved Brier scores around 0.15-0.20 on geopolitical questions (but these were harder questions than typical Polymarket markets). Polymarket's 0.025 at 24h looks excellent, but the comparison is apples-to-oranges because the question difficulty distributions differ substantially.
Methodology notes
The code is straightforward: Python requests to Polymarket's Gamma API and CLOB API for price history, numpy/scipy for statistics, and Chart.js to render the results. Data was collected on April 25, 2026.
The interactive tool is at polymarket-calibration.vercel.app. You can filter by category and time horizon.
Discussion questions
I'm writing a prediction market analysis newsletter where I cover this kind of analysis weekly. Happy to answer questions about the data or methodology.