Frontier LLMs disagree on 67% of real-world fact-checks (n=1,000)

Kosta Jordanov

Snapshot v1.0 · Data as of 2026-05-21 · DOI: 10.5281/zenodo.20344847 · PDF + dataset · Web version

I presented 1,000 recent real-user claims to the five top frontier LLMs and asked each one for a verdict on a fixed 4-bucket rubric (True / Mostly True / Misleading / False). These aren't benchmark items with public answer keys — they're claims real users submitted to a fact-checking platform in the last 180 days. Only one verdict bucket can be correct per claim, so any disagreement among the panel means at least one model's verdict is label-inconsistent under this rubric. On 67% of claims, the panel splits.

Key findings

67% of claims (672 / 1,000; 95% CI: 64–70%) have at least one frontier model dissenting from the panel majority, or no strict majority forms at all.
34% of claims (343 / 1,000; 95% CI: 31–37%) involve a 2+ bucket gap between the most-disagreeing pair — substantive disagreement on the answer, not just a calibration shift.
21% of claims (211 / 1,000) have a polar split: at least one model says True, at least one says False.
Krippendorff's α (ordinal) = 0.639 across 5 raters on 1,000 items — nontrivial but limited agreement.
The panel converges on definitive verdicts; the middle of the rubric is where it fractures. Among the 328 unanimous claims, only 4 are unanimous-Misleading and 0 are unanimous-Mostly-True.
Different models have very different verdict distributions. Gemini 3 Pro puts 94% of verdicts in {True, False}; Claude Opus 4.7 spreads more evenly across all four buckets.

What this is not

A majority of frontier models is not ground truth. The majority verdict is sometimes wrong; an individual dissenting model is sometimes right. I use the majority as a structural reference point for measuring disagreement, not as a stand-in for correctness. There is no human-labeled ground truth in this artifact; that's a separate follow-up. This paper measures how often the frontier disagrees on real-world claims, not which model is right.

1. How often the frontier disagrees

| Frontier verdict pattern | Claims | Share (95% CI) |
|---|---|---|
| All 5 agreed (unanimity) | 328 | 33% (30–36%) |
| 1 of 5 dissented | 224 | 22% (20–25%) |
| 2 of 5 dissented | 316 | 32% (29–35%) |
| Models split, no majority (e.g. 2-2-1) | 132 | 13% (11–15%) |
| ≥1 model dissents (incl. splits) | 672 | 67% (64–70%) |
| ≥2 models dissent (incl. splits) | 448 | 45% (42–48%) |

Lower bound on model error. Exactly one of the four buckets is the correct answer per claim. Under the *most charitable* assumption that the panel's most popular bucket is the correct one:

≥1 model wrong on 67% of claims (any non-unanimous panel)
≥2 wrong on 45% of claims (3-2, 3-1-1, or no-majority splits)
≥3 wrong on 13% of claims (no bucket reaches a majority, so at most 2 can be right)

Relaxing the "most-popular is correct" assumption can only raise these counts. The actual error rate is likely higher: even the 33% of unanimous cases can include shared blind spots.
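The pigeonhole argument above can be sketched in a few lines. This is an illustrative reconstruction, not the released analysis code: the function name `panel_pattern` and the pattern strings are hypothetical, and the input is assumed to be the five parsed labels for one claim.

```python
from collections import Counter

def panel_pattern(verdicts):
    """Return (pattern, min_wrong) for one claim's list of verdict labels."""
    counts = Counter(verdicts)
    top = max(counts.values())   # size of the most popular bucket
    n = len(verdicts)
    if top == n:
        pattern = "unanimous"
    elif top > n / 2:
        pattern = f"{n - top} of {n} dissented"
    else:
        pattern = "no majority"
    # Most charitable case: the most popular bucket is the correct one,
    # so every model outside it is wrong. Any other assumption about the
    # correct bucket can only increase this count.
    return pattern, n - top

# First example claim from section 6 (a 2-2-1 split).
print(panel_pattern(["False", "Mostly True", "False", "True", "True"]))
```

A 3-2 or 3-1-1 panel yields a floor of 2 wrong models, and any no-majority panel yields a floor of 3, which is where the 45% and 13% rows come from.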

2. Substantive vs nuance disagreement

Not every disagreement is equal. True vs Mostly True is a calibration shift. True vs False is a substantive disagreement about the answer. I measure this as the max pairwise bucket distance across the 5 verdicts on each claim (True=0, Mostly True=1, Misleading=2, False=3).

| Max bucket distance | Disagreement type | Claims | Share (95% CI) |
|---|---|---|---|
| 0 | Full unanimity | 328 | 33% (30–36%) |
| 1 | Nuance only (e.g. True ↔ Mostly True) | 329 | 33% (30–36%) |
| 2 | Substantive (True ↔ Misleading, or Mostly True ↔ False) | 132 | 13% (11–15%) |
| 3 | Polar (True ↔ False) | 211 | 21% (19–24%) |
| | ≥2 buckets apart (substantive or polar) | 343 | 34% (31–37%) |

That 21% polar-split row is the one I find hardest to wave away. On one in five real-user claims, at least one frontier model says True and at least one says False. These aren't trick claims designed by adversarial researchers — they're what users actually submitted.
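As a minimal sketch of the distance measure, assuming each claim arrives as a list of five label strings: on a one-dimensional scale the largest pairwise gap is simply the highest score minus the lowest. The names `BUCKET`, `DISTANCE_TYPE`, and `max_bucket_distance` are mine, not identifiers from the released code.

```python
# Bucket scores as defined in the text.
BUCKET = {"True": 0, "Mostly True": 1, "Misleading": 2, "False": 3}

# Row names from the table above.
DISTANCE_TYPE = {0: "unanimity", 1: "nuance", 2: "substantive", 3: "polar"}

def max_bucket_distance(verdicts):
    """Largest pairwise bucket gap among one claim's verdicts."""
    scores = [BUCKET[v] for v in verdicts]
    # For scalar scores, the max over all pairs is just max minus min.
    return max(scores) - min(scores)

# World Bank example from section 6.
panel = ["Mostly True", "True", "False", "Misleading", "Misleading"]
d = max_bucket_distance(panel)
print(d, DISTANCE_TYPE[d])
```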

Collapsed into a single inter-rater statistic that weights disagreements by bucket distance, Krippendorff's α (ordinal) across the 5-model panel is 0.639 — nontrivial agreement, but well short of treating any one model's verdict as a settled label.
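For readers who want to recompute the statistic from the per-claim data, here is a hedged from-scratch sketch of ordinal Krippendorff's α using the standard coincidence-matrix formulation. It is not the code behind the reported 0.639 (which may well use a library), and it assumes labels have already been mapped to integers 0..3 and that at least one claim has two or more ratings.

```python
from itertools import permutations

def krippendorff_alpha_ordinal(units, n_levels=4):
    """units: list of per-claim rating lists, each rating an int in 0..n_levels-1."""
    # Coincidence matrix: each ordered pair of ratings within a unit
    # contributes 1 / (m - 1), where m is the number of raters on that unit.
    o = [[0.0] * n_levels for _ in range(n_levels)]
    for ratings in units:
        m = len(ratings)
        if m < 2:
            continue  # a single rating carries no agreement information
        for a, b in permutations(ratings, 2):
            o[a][b] += 1.0 / (m - 1)

    n_c = [sum(row) for row in o]  # marginal total per level
    n = sum(n_c)

    def delta2(c, k):
        # Ordinal metric: squared mass of ratings lying between the two levels.
        if c == k:
            return 0.0
        lo, hi = min(c, k), max(c, k)
        return (sum(n_c[lo:hi + 1]) - (n_c[lo] + n_c[hi]) / 2.0) ** 2

    levels = range(n_levels)
    d_o = sum(o[c][k] * delta2(c, k) for c in levels for k in levels) / n
    d_e = sum(n_c[c] * n_c[k] * delta2(c, k)
              for c in levels for k in levels) / (n * (n - 1))
    if d_e == 0:
        raise ValueError("alpha is undefined when every rating is identical")
    return 1.0 - d_o / d_e
```

Because the ordinal metric depends on how many ratings fall between two levels, a True ↔ False split is penalized more heavily than a True ↔ Mostly True split, which is the bucket-distance weighting described above.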

3. Model-vs-model pairwise agreement

How often each pair of frontier models picked the same exact verdict label:

| | GPT-5.4 | Claude Opus 4.7 | Gemini 3 Pro | Gemini 3 + Search | Sonar Pro |
|---|---|---|---|---|---|
| GPT-5.4 | — | 65% | 65% | 60% | 60% |
| Claude Opus 4.7 | 65% | — | 53% | 53% | 58% |
| Gemini 3 Pro | 65% | 53% | — | 75% | 53% |
| Gemini 3 + Search | 60% | 53% | 75% | — | 58% |
| Sonar Pro | 60% | 58% | 53% | 58% | — |

Highest peer agreement: Gemini 3 Pro × Gemini 3 Pro + Search (75%) — unsurprising, they share a base model. Lowest: three pairs tie at 53% — Claude × Gemini 3 Pro, Claude × Gemini 3 + Search, Gemini 3 Pro × Sonar Pro.
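A matrix like this can be rebuilt from per-claim rows with exact label matching. The sketch below assumes each row is a dict from model name to label; the released CSV's actual column names may differ, so treat the keys as placeholders.

```python
from itertools import combinations

def pairwise_agreement(rows, models):
    """Share of claims on which each pair of models gave the same exact label."""
    out = {}
    for a, b in combinations(models, 2):
        same = sum(1 for r in rows if r[a] == r[b])
        out[(a, b)] = same / len(rows)
    return out

MODELS = ["GPT-5.4", "Claude Opus 4.7", "Gemini 3 Pro",
          "Gemini 3 + Search", "Sonar Pro"]

# Two of the example claims from section 6, as toy input.
rows = [
    dict(zip(MODELS, ["False", "Mostly True", "False", "True", "True"])),
    dict(zip(MODELS, ["Mostly True", "True", "False", "Misleading", "Misleading"])),
]
for pair, share in pairwise_agreement(rows, MODELS).items():
    print(pair, share)
```

Exact-match agreement ignores bucket distance, so a True vs Mostly True mismatch counts the same as True vs False here.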

4. Per-model verdict distribution

Some models concentrate at the True/False poles; others spread across the middle.

| Model | True | Mostly True | Misleading | False |
|---|---|---|---|---|
| GPT-5.4 | 42% | 16% | 12% | 30% |
| Claude Opus 4.7 | 38% | 26% | 19% | 17% |
| Gemini 3 Pro | 54% | 3% | 3% | 40% |
| Gemini 3 Pro + Search | 52% | 4% | 9% | 35% |
| Sonar Pro | 35% | 23% | 16% | 26% |

Two distinct shapes. Gemini-family models are nearly binary (87–94% in {True, False}: 94% for Gemini 3 Pro, 87% for Gemini 3 Pro + Search). Claude and Sonar use the middle buckets heavily. GPT-5.4 sits between. If your pipeline uses any one of these as a single oracle for a 4-class rubric, the shape of your output is determined as much by which model you picked as by what the claims actually say.
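The "how binary is this model" number is just the True share plus the False share. A small worked check on the percentages from the table above, assuming the columns run in rubric order (True, Mostly True, Misleading, False); the dictionary name `REPORTED` and the helper are illustrative only.

```python
# Per-model verdict shares in percent, copied from the table above,
# in rubric order: (True, Mostly True, Misleading, False).
REPORTED = {
    "GPT-5.4": (42, 16, 12, 30),
    "Claude Opus 4.7": (38, 26, 19, 17),
    "Gemini 3 Pro": (54, 3, 3, 40),
    "Gemini 3 Pro + Search": (52, 4, 9, 35),
    "Sonar Pro": (35, 23, 16, 26),
}

def polar_share(shares):
    """Percent of verdicts landing in the two end buckets, True or False."""
    true_pct, _mostly_true, _misleading, false_pct = shares
    return true_pct + false_pct

for model, shares in REPORTED.items():
    assert sum(shares) == 100  # each row should account for every verdict
    print(f"{model}: {polar_share(shares)}% in {{True, False}}")
```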

5. Where the middle of the rubric collapses

Of the 328 unanimous claims:

| Unanimous verdict | Claims | Share of unanimous |
|---|---|---|
| True | 204 | 62% |
| Mostly True | 0 | 0% |
| Misleading | 4 | 1% |
| False | 120 | 37% |

Zero unanimous-Mostly-True. Four unanimous-Misleading. When the panel agrees, it agrees at the poles. The middle of a 4-bucket rubric isn't being used as a converged-on category — it's being used as a disagreement zone.

(A related observation from a different setup: on the AVeriTeC corpus, inter-annotator agreement among human fact-checkers across 50 organizations reaches κ=0.619 — substantial but well short of perfect. Some of what I'm measuring is task ambiguity, not LLM idiosyncrasy.)

6. Five example claims where the panel polar-splits

These are from the corpus appendix. All have max bucket distance 3 (True vs False) and no majority.

"Donald Trump said that an attack on Iran was postponed at the request of Gulf allies." (Politics)

GPT-5.4: False · Claude Opus 4.7: Mostly True · Gemini 3 Pro: False · Gemini 3 + Search: True · Sonar Pro: True

"Volodymyr Zelensky was nominated for the Nobel Peace Prize for 2026." (Politics)

GPT-5.4: False · Claude Opus 4.7: Mostly True · Gemini 3 Pro: False · Gemini 3 + Search: True · Sonar Pro: True

"The World Bank's active portfolio in Nigeria stands at over $16.4 billion as of 2025." (Finance)

GPT-5.4: Mostly True · Claude Opus 4.7: True · Gemini 3 Pro: False · Gemini 3 + Search: Misleading · Sonar Pro: Misleading

"The FDI World Dental Federation confirms that daily oral hygiene routines, including mouthwash use, significantly reduce the incidence of gingivitis, periodontal disease, and dental caries." (Health)

GPT-5.4: Mostly True · Claude Opus 4.7: Mostly True · Gemini 3 Pro: True · Gemini 3 + Search: False · Sonar Pro: Misleading

"Individuals who prefer music with less positive emotional content tend to have higher intelligence." (Science)

GPT-5.4: Misleading · Claude Opus 4.7: Mostly True · Gemini 3 Pro: False · Gemini 3 + Search: True · Sonar Pro: Misleading

Several of these are temporal-specificity claims ("as of date X"), where retrieval-augmented models can see live web and parametric models can't. That's a real and interpretable axis of disagreement. But it's not the only one — the music/intelligence claim has no temporal angle and still polar-splits across the panel.

Methodology (brief)

Cohort: 1,000 most recent claims to Lenz (a fact-checking platform) passing all eligibility filters: not private, not staff/agent, not pending/hidden, no near-duplicates (cosine <0.2 on text-embedding-3-small), all 5 models produced parseable verdicts, ≤180 days old.
Claim normalization: models were rated against the framed claim — output of a normalization step that strips emotional language and time-anchors the proposition to the submission date. Raw "Canadian authorities are throwing Christians in jail!!!" becomes "As of YYYY-MM-DD, Canadian authorities have jailed individuals for publicly quoting the Bible because of their Christian beliefs."
Prompt (usr_v2): "Classify this claim as of <date>: <atomic claim>. Output exactly one label: True, Mostly True, Misleading, or False. No explanations, no qualifiers."
Models: GPT-5.4, Claude Opus 4.7, Gemini 3 Pro (parametric); Gemini 3 Pro + Search, Sonar Pro (retrieval).
No LLM grader. All measurements derive from direct parsed-label equality across the 5 verdicts on each claim.
CIs: Wilson 95% on every reported rate.
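To make the "parseable verdict" filter concrete, here is one way the prompt template and a strict label parser could look. Only the prompt wording comes from the methodology above; the function names and the parser's tolerances (surrounding whitespace, case, a trailing period or quotes) are my assumptions, and the actual harvester may be stricter or looser.

```python
LABELS = ("True", "Mostly True", "Misleading", "False")

def build_prompt(date, claim):
    """Fill the usr_v2 template quoted above with a date and an atomic claim."""
    return (f"Classify this claim as of {date}: {claim}. "
            "Output exactly one label: True, Mostly True, Misleading, or False. "
            "No explanations, no qualifiers.")

def parse_verdict(raw):
    """Return the canonical label, or None if the reply is not exactly one label."""
    # Assumed tolerances: outer whitespace, wrapping quotes, trailing period, case.
    cleaned = raw.strip().strip(".\"'").strip().lower()
    for label in LABELS:
        if cleaned == label.lower():
            return label
    return None  # anything else would make the claim ineligible

print(parse_verdict(" Mostly True.\n"))
print(parse_verdict("True, but it depends"))
```

Matching on the whole cleaned reply, rather than searching for a label substring, keeps a hedged answer such as "True, but it depends" from being silently counted as True.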

Full methodology, exclusions, sampling-frame caveats, and reproducibility notes in the PDF (https://doi.org/10.5281/zenodo.20344847).
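The Wilson intervals quoted throughout can be reproduced with the standard score-interval formula. A minimal sketch, with `wilson_interval` as an illustrative name and z fixed at the usual two-sided 95% value:

```python
from math import sqrt

def wilson_interval(successes, n, z=1.959964):
    """Wilson score interval for a binomial proportion (default: 95%)."""
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half

# Headline rate: 672 of 1,000 claims with any disagreement.
lo, hi = wilson_interval(672, 1000)
print(f"{lo:.1%} to {hi:.1%}")
```

This treats the 1,000 claims as independent draws, which is the assumption the Limitations section flags as optimistic.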

Why I think this matters

Three concrete consequences I'd want a reader to walk away with:

"Use a frontier LLM as the answer" is a load-bearing assumption in a lot of current pipelines: RAG systems, agent reasoning steps, eval rubrics, content moderation. On 67% of real fact-check-shaped claims the panel is not unanimous, so which frontier model you call can change the answer. For any specific pair of models, the exact label differs on 25–47% of claims (pairwise agreement is 53–75%). If your downstream metric is sensitive to label flips, that's the noise floor you're working against.
The middle of an ordinal rubric is structurally fragile. Whatever your domain, if you ask a panel of models to map free-form claims onto a 4-class ordinal scale, the middle classes are probably not being used the way the rubric author intended. They're being used as a hedge zone, and different models hedge differently. Multi-class ordinal rubrics may need either richer category definitions or a different metric than label-exact-match.
Benchmark contamination is a confound that's hard to escape on any public corpus. This corpus is structurally fresh — real-user submissions never paired with canonical answer keys in any public dataset — and the panel still disagrees this much. That's some evidence that aggregate-accuracy-matched-on-benchmarks doesn't imply aggregate-agreement-on-fresh-data.

Limitations

Pigeonhole is a floor, not a verdict. Disagreement implies at least one inconsistent verdict per claim, but I don't claim to know which model is wrong on which claim. The 67% number is a structural lower bound on rubric inconsistency, not a per-model error rate.
Bucket distance treats the rubric as equal-spaced ordinal. A 2-bucket gap can reflect rubric ambiguity, temporal framing, or differing interpretations of "Misleading" — not necessarily a factually larger error.
Non-iid sample. Users cluster submissions around news events, and individuals often submit multiple related claims in a session. True sampling variability under a cluster bootstrap would be larger than the Wilson CIs reported. Read the reported CIs as a lower bound on the uncertainty, not as a coverage guarantee.
Snapshot effects. Frontier LLMs are non-deterministic; a re-run with the same models and prompts would produce somewhat different numbers. Re-running with newer models or different prompts moves them more. That's why the snapshot is versioned and the v1.0 URL is archival.
Retrieval was uncontrolled. Gemini 3 + Search and Sonar Pro can see the live web at inference time, including Lenz's own public claim pages. This isn't a contamination audit.
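The cluster-bootstrap caveat above could be checked along these lines. This is a sketch under a strong assumption: that claims can be grouped into clusters (for example by submitting session or news event), a grouping that is not among the columns listed for the released CSV. The function name and the 0/1 encoding are hypothetical.

```python
import random

def cluster_bootstrap_ci(clusters, n_boot=2000, alpha=0.05, seed=0):
    """Percentile CI for a rate, resampling whole clusters with replacement.

    clusters: list of non-empty lists of 0/1 flags (1 = panel disagreed),
              one inner list per cluster of related claims.
    """
    rng = random.Random(seed)
    k = len(clusters)
    stats = []
    for _ in range(n_boot):
        # Draw k clusters with replacement; keep each cluster's claims together.
        sample = [clusters[rng.randrange(k)] for _ in range(k)]
        flags = [f for cluster in sample for f in cluster]
        stats.append(sum(flags) / len(flags))
    stats.sort()
    lo = stats[int((alpha / 2) * n_boot)]
    hi = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi

# Toy data: one large all-disagree cluster dominates the sample.
toy = [[1] * 20, [0, 1], [0], [1, 0, 0], [0, 0]]
print(cluster_bootstrap_ci(toy))
```

When related claims tend to share an outcome, resampling whole clusters produces a wider interval than a per-claim formula, which is the direction of the bias described above.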

What I'm uncertain about (and would value pushback on)

Whether "any model on any claim" disagreement (67%) is the right headline, or whether the substantive ≥2-bucket rate (34%) is. The first is more striking; the second is harder to dismiss as nuance.
Whether a 5-model panel from four providers (two of the five models share a Gemini base) is enough to call this a "frontier" result, or whether the panel should be wider (more independent base models) or narrower.
The right next artifact. The planned follow-up adds human ground-truth labels and compares both the frontier panel and Lenz against them. I'd be curious whether LW readers would value something else more — e.g. claim-level disagreement decomposition (which axis drives each split: temporal, rubric, retrieval, base prior), or a re-run on a non-Lenz corpus to test corpus dependence.

Data + reproducibility

Zenodo (DOI): [10.5281/zenodo.20344847] — PDF + per-claim JSON dataset, hash-pinned in the manifest.
Web version (citation-stable): [lenz.io/research/llm-disagreement/v1.0]
Per-claim CSV: [lenz.io/research/llm-disagreement/data.csv] — claim ID, URL, atomic claim text, 5 frontier verdicts, max pairwise bucket distance, domain, creation date.
Code SHA: "a6b78be".
Harvester prompt version: "usr_v2".