Empirical calibration of Polymarket: Analysis of 7,661 resolved binary markets

JohnLeslie_PM

Summary

I analyzed 7,661 resolved binary markets from Polymarket to answer a simple question: when a market trades at X%, does the event actually happen X% of the time?

Short answer: Polymarket is well calibrated at the extremes but systematically underprices YES in the mid-range. The Brier score is 0.025 at 24 hours before resolution, and it improves as resolution approaches.

Interactive calibration tool with full data

Methodology

Data collection: I pulled all resolved binary markets from Polymarket's Gamma Markets API (condition_id-based), filtered to those with resolution data and CLOB price history. This yielded 7,661 markets with usable resolution data, of which ~2,000 had sufficient price history for time-horizon analysis.

Calibration calculation: For each probability bucket (0-5%, 5-15%, ..., 95-100%), I computed the fraction of markets in that bucket that actually resolved YES. A perfectly calibrated market would show a 45-degree line: markets at 60% resolve YES 60% of the time.
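To make the bucketing concrete, here is a minimal sketch in plain Python. The function names, the tie-breaking rule for prices that sit exactly on a bucket edge, and the toy data are all my assumptions for illustration; this is not the exact code behind the table.

```python
from collections import defaultdict

# Bucket edges in percent, matching the post: 0-5, 5-15, ..., 85-95, 95-100.
EDGES = [0, 5, 15, 25, 35, 45, 55, 65, 75, 85, 95, 100]


def bucket_index(price_pct):
    """Return the index of the bucket containing price_pct (0-100).

    Assumption: a price exactly on an edge goes to the upper bucket,
    except 100, which stays in the last bucket.
    """
    for i in range(len(EDGES) - 1):
        if EDGES[i] <= price_pct < EDGES[i + 1]:
            return i
    return len(EDGES) - 2  # price_pct == 100


def calibration_table(markets):
    """markets: iterable of (price_pct, resolved_yes) pairs.

    Returns one (bucket label, n, observed YES rate) row per non-empty bucket.
    """
    groups = defaultdict(list)
    for price_pct, resolved_yes in markets:
        groups[bucket_index(price_pct)].append(int(resolved_yes))
    rows = []
    for i in sorted(groups):
        outcomes = groups[i]
        label = f"{EDGES[i]}-{EDGES[i + 1]}%"
        rows.append((label, len(outcomes), sum(outcomes) / len(outcomes)))
    return rows


# Toy data, not taken from the Polymarket dataset.
toy = [(2, False), (3, False), (60, True), (62, True), (58, False), (97, True)]
for label, n, yes_rate in calibration_table(toy):
    print(f"{label:>8}  n={n}  observed YES rate={yes_rate:.2f}")
```

Plotting the observed rate against each bucket's typical price gives the calibration curve; the 45-degree line is the perfectly calibrated reference.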

Time horizons: I examined prices at three snapshots before resolution: 24 hours, 7 days, and 30 days. This tests whether calibration improves as resolution approaches (you'd expect it to, since late prices incorporate more information).
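A sketch of how a snapshot price can be read off a price history, assuming the history is a list of (unix timestamp, price) pairs sorted by time. That shape, the helper names, and the "last price at or before the target time" rule are assumptions for illustration, not a description of the CLOB API response.

```python
from bisect import bisect_right

DAY = 24 * 3600
HORIZONS_SECONDS = {"24h": DAY, "7d": 7 * DAY, "30d": 30 * DAY}


def price_at(history, target_ts):
    """Last recorded price at or before target_ts.

    history: list of (unix_ts, price) sorted by unix_ts (assumed shape).
    Returns None when the history starts after target_ts.
    """
    timestamps = [ts for ts, _ in history]
    i = bisect_right(timestamps, target_ts)
    return history[i - 1][1] if i > 0 else None


def snapshots(history, resolution_ts):
    """Price at each horizon before resolution; None where history is too short."""
    return {
        name: price_at(history, resolution_ts - delta)
        for name, delta in HORIZONS_SECONDS.items()
    }


# Toy history for a market that resolves on day 40 (made-up numbers).
history = [(5 * DAY, 0.40), (20 * DAY, 0.55), (36 * DAY, 0.70), (39 * DAY + 43200, 0.90)]
print(snapshots(history, 40 * DAY))
```

A market whose history begins less than 30 days before resolution returns None for the 30-day horizon, which is one plausible reason the market counts differ by horizon.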

Brier scores: Computed as the mean squared error between the market probability and the binary outcome (0 or 1). Perfect forecasting scores 0; always forecasting 50% scores 0.25.
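For reference, the Brier score as defined here fits in a few lines. This is a minimal sketch with made-up forecasts; the function name is mine.

```python
def brier_score(forecasts, outcomes):
    """Mean squared error between probabilities in [0, 1] and 0/1 outcomes."""
    if not forecasts or len(forecasts) != len(outcomes):
        raise ValueError("forecasts and outcomes must be non-empty and equal length")
    return sum((p - y) ** 2 for p, y in zip(forecasts, outcomes)) / len(forecasts)


outcomes = [1, 0, 1, 1]  # toy outcomes, not from the dataset

print(brier_score([1.0, 0.0, 1.0, 1.0], outcomes))  # perfect forecasts
print(brier_score([0.5, 0.5, 0.5, 0.5], outcomes))  # constant 50% forecast
print(brier_score([0.9, 0.1, 0.8, 0.7], outcomes))  # confident and mostly right
```

The constant 50% forecast scores 0.25 regardless of the outcomes, which is why that value is the usual uninformed baseline.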

Key Findings

1. Excellent calibration at extremes, systematic underpricing in the middle

Probability Bucket	Markets (n)	Actual YES Resolution Rate	Expected (bucket midpoint)
0-5%	5,693	0.1%	~2.5%
5-15%	254	13.4%	~10%
15-25%	104	25.0%	~20%
25-35%	82	32.9%	~30%
35-45%	70	48.6%	~40%
45-55%	25	60.0%	~50%
55-65%	39	74.4%	~60%
65-75%	29	96.5%	~70%
75-85%	44	86.4%	~80%
85-95%	52	92.3%	~90%
95-100%	1,209	100.0%	~97.5%

The 65-75% bucket stands out: markets priced at ~70% actually resolved YES 96.5% of the time. This is the direction the favorite-longshot bias predicts (likelier outcomes are underpriced), so popular mid-probability events may be systematically underpriced. However, n=29 is small enough that this could be noise.
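One way to gauge how surprising that bucket is: an exact binomial tail probability. The sketch below assumes that 96.5% of 29 markets means 28 YES resolutions, that markets resolve independently, and that the true rate lies somewhere in the bucket's 65-75% range. All three are assumptions on my part.

```python
from math import comb


def binom_upper_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p), computed exactly."""
    return sum(comb(n, j) * p**j * (1 - p) ** (n - j) for j in range(k, n + 1))


n, k = 29, 28  # assumption: 96.5% of 29 markets corresponds to 28 YES resolutions

for true_rate in (0.65, 0.70, 0.75):
    tail = binom_upper_tail(k, n, true_rate)
    print(f"true rate {true_rate:.2f}: P(at least {k} of {n} resolve YES) = {tail:.5f}")
```

The tail comes out well under 1% for any true rate in the bucket, but that overstates the evidence: eleven buckets were examined, so some correction for multiple comparisons is needed, and markets on related events are not independent draws.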

2. Calibration improves closer to resolution

Time Horizon	Brier Score	Markets
30 days before	0.042	~1,800
7 days before	0.032	~1,900
24 hours before	0.025	~2,000

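This is the expected pattern: as resolution approaches, prices converge toward 0% or 100% and the Brier score falls. A lower Brier score reflects sharper prices as well as better calibration, so this table alone does not isolate the calibration component. The 30-day score of 0.042 is still quite good (well below the 0.25 that a constant 50% forecast would score).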

3. Category breakdown

Filtering by category reveals differences:

Sports markets tend to be well-calibrated with larger sample sizes
Crypto markets show more volatility in calibration (higher variance per bucket)
Geopolitical markets have smaller n per bucket but show the same general pattern

The interactive tool lets you filter by category to explore this.

Limitations and caveats

Selection bias: Polymarket lists markets it thinks will attract volume. These are likely more "forecastable" than arbitrary questions. The calibration results may not generalize to harder-to-predict domains.

Resolution mechanics: Some markets have ambiguous resolution criteria. I did not filter for resolution disputes, which could introduce noise.

Small mid-range samples: The most interesting finding (underpricing at 65-75%) has only 29 markets. I would not draw strong conclusions from this alone.

Binary markets only: Multi-outcome markets (e.g., "Who will win the election?" with 5+ candidates) were excluded. These have different calibration dynamics.

Survivorship: Markets that were voided or de-listed are not in the dataset. If poorly-calibrated markets are more likely to be voided, this biases toward showing better calibration.

Comparison to other forecasting platforms

For context, Metaculus reports overall Brier scores of ~0.07-0.10 for their community median forecasts. Superforecasters in the Good Judgment Project achieved Brier scores around 0.15-0.20 on geopolitical questions (but these were harder questions than typical Polymarket markets). Polymarket's 0.025 at 24h looks excellent, but the comparison is apples-to-oranges because the question difficulty distributions differ substantially.

Methodology notes

The code is straightforward: Python requests against Polymarket's Gamma API and CLOB API (for price history), numpy/scipy for statistics, and Chart.js for rendering the results. Data collected April 25, 2026.

The interactive tool is at polymarket-calibration.vercel.app. You can filter by category and time horizon.

Discussion questions

Is the mid-range underpricing (65-75% bucket) a genuine market inefficiency, or just small-sample noise? What sample size would you need to be confident?
How would you expect Polymarket's calibration to compare to Kalshi's (CFTC-regulated, different user base)?
Are there better methodological approaches for this kind of analysis? I considered logistic calibration curves but stuck with bucketed analysis for interpretability.

I write a prediction market newsletter that covers this kind of analysis weekly. Happy to answer questions about the data or methodology.

Karl Krueger (20d)

How does this model account for "bad" uses of prediction markets, e.g. cases where a market user artificially corrupts the outcome, illegally uses their access to national secrets to extract from the market, or otherwise uses the market extractively rather than predictively?

simulus (21d)

In section 1 of your findings, the "Expected" column of the table is misleading. It assumes that probabilities within each bucket are uniformly distributed (or more generally, symmetrically distributed around the center of the bucket's range).

A more faithful Expected value would be the average over each market's probability within a given bucket. This is the true rate that perfectly calibrated markets would resolve at.

I suspect that this is the cause of the significant discrepancy between the Expected and Actual resolution rates near the 0% and 100% extremes.
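A minimal sketch of the difference, using made-up prices for the 95-100% bucket. The prices and the function name are hypothetical; the point is only that when prices cluster near one end of a bucket, the mean price and the midpoint diverge.

```python
def expected_rates(prices_pct, low, high):
    """Return (bucket midpoint, mean price) as probabilities for one bucket.

    prices_pct: market prices in percent; only those in [low, high] are used.
    """
    in_bucket = [p for p in prices_pct if low <= p <= high]
    if not in_bucket:
        raise ValueError("no prices fall in this bucket")
    midpoint = (low + high) / 2 / 100
    mean_price = sum(in_bucket) / len(in_bucket) / 100
    return midpoint, mean_price


# Hypothetical prices: most markets in the top bucket trade very close to 100%.
toy_prices = [99.9, 99.8, 99.5, 99.9, 99.0, 96.0]
midpoint, mean_price = expected_rates(toy_prices, 95, 100)
print(f"midpoint expectation:   {midpoint:.3f}")
print(f"mean-price expectation: {mean_price:.3f}")
```

With prices piled up near 100%, the mean-price expectation sits well above the 97.5% midpoint, so a 100% observed rate would look much less anomalous against it.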