Do Four LLMs Think Independently? Measuring Epistemic Independence via Evidence Atoms
TL;DR: We ran a structured protocol across Claude, GPT-4, Gemini, and Grok on 6 claims. Verdict-level agreement: n_eff = 1.00 (identical). Evidence-level independence: n_eff = 2.83. The models are one witness when you ask what, but nearly three independent witnesses when you ask why.
The Question
When multiple LLMs agree on a verdict, does that agreement mean anything? Or are they just one model running four times?
The standard answer is "probably just one model" — shared training corpora, shared RLHF, shared epistemic priors. And at the verdict level, that's exactly what we found.
But verdict is a coarse signal. We wanted to know what happens at the level of justification.
The Protocol
We built a structured multi-model aggregation system called Єдине Тіло (One Body):
1. Issue identical prompts to 4 models in parallel (star topology, no cross-communication).
2. Collect from each: verdict ∈ {true, false, undecidable} + confidence + 3 specific evidence atoms + 2 counter-atoms.
3. Compute Jaccard similarity across all pairwise evidence sets.
4. Derive n_eff_evidence = k / (1 + (k−1) × mean_J).
The key move: instead of comparing verdicts (3 possible values), we compare evidence atoms — specific facts, studies, mechanisms cited in support. This is where architectural differences manifest.
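A minimal sketch of that computation (illustrative function names and raw set inputs; the repo's metrics.py is the reference implementation, and it tokenizes evidence atoms before comparison):

```python
from itertools import combinations

def jaccard(a: set, b: set) -> float:
    """Jaccard similarity |A ∩ B| / |A ∪ B| of two evidence-atom token sets."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

def n_eff(evidence_sets: list) -> float:
    """Effective number of independent witnesses:
    n_eff = k / (1 + (k - 1) * mean_J), with mean_J over all model pairs."""
    k = len(evidence_sets)
    pairs = list(combinations(evidence_sets, 2))
    mean_j = sum(jaccard(a, b) for a, b in pairs) / len(pairs)
    return k / (1 + (k - 1) * mean_j)
```

Two sanity checks fall out of the formula: four identical evidence sets give mean_J = 1 and collapse to n_eff = 1 (one witness), while four fully disjoint sets give mean_J = 0 and recover n_eff = 4.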
Results
Verdict level: All 4 models gave identical verdicts on all 6 claims.
| Claim | Topic | Verdict | mean_J | n_eff |
|-------|-------|---------|--------|-------|
| F01 | Water boils at 100°C at 1 atm | true | 0.189 | 2.55 |
| F07 | Normal body temperature ~37°C | true | 0.173 | 2.63 |
| B03 | Gut microbiome influences mood | true | 0.137 | 2.84 |
| B10 | LLMs demonstrate language understanding | undecidable | 0.083 | 3.20 |
| S02 | Nuclear energy safe for climate | undecidable | 0.180 | 2.60 |
| S09 | Open-source AI safer than closed AI | undecidable | 0.068 | 3.33 |
n_eff_verdict = 1.00. n_eff_evidence = 2.83.
The Interesting Pattern
undecidable claims yield higher n_eff than true claims.
Where ground truth is unambiguous (water boiling), models partially converge on the same canonical sources. Where the question is genuinely open, each architecture searches its own epistemic space — maximum divergence.
B10 ("LLMs demonstrate understanding"): Claude cites emergent capabilities and semantic structure. GPT cites word-sense disambiguation benchmarks. Gemini cites Theory of Mind tests. Grok cites pragmatic cue handling. All say undecidable — from completely different places.
Jaccard Similarity Matrix (mean across 6 claims)
|        | GPT   | Gemini | Grok  | Claude |
|--------|-------|--------|-------|--------|
| GPT    | —     | 0.136  | 0.120 | 0.152  |
| Gemini | 0.136 | —      | 0.112 | 0.150  |
| Grok   | 0.120 | 0.112  | —     | 0.161  |
| Claude | 0.152 | 0.150  | 0.161 | —      |
All pairs well below 0.20. Grok–Gemini most independent (0.112). Claude most similar to all others — consistent with its training emphasis on structured reasoning from established frameworks.
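As a quick consistency check (pair values transcribed from the matrix; a sketch, not repo code), the mean off-diagonal similarity reproduces the headline evidence-level n_eff:

```python
# The 6 unordered pair similarities from the Jaccard matrix.
pairs = [0.136, 0.120, 0.152, 0.112, 0.150, 0.161]
mean_j = sum(pairs) / len(pairs)     # ≈ 0.1385
k = 4                                # number of models
n_eff = k / (1 + (k - 1) * mean_j)   # ≈ 2.83, matching the headline figure
print(round(mean_j, 4), round(n_eff, 2))
```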
Architectural Character Signatures
A secondary finding: stable per-model epistemic signatures across all claims and frames.
| Model  | Signature               | Sycophancy resistance (A1)                 |
|--------|-------------------------|--------------------------------------------|
| Grok   | Cold epistemic anchor   | 6/6 — never flipped under framing pressure |
| GPT    | Confident proceduralist | 3/6 — conf=1.00 on facts                   |
| Gemini | Optimistic synthesizer  | 3/6 — upward bias on boundary claims       |
| Claude | Structural controller   | 3/6 — highest frame sensitivity            |
We also measured frame-reversal stability (A1) across 5 prompt frames: neutral, user_positive, user_negative, maximal_abstract, minimal_concrete. Grok was the only model that never hardFlipped (changed verdict from true→false or false→true under framing pressure).
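A sketch of how a hard flip can be detected from per-frame verdicts (the data shape and function name are assumptions for illustration; the actual A1 metric lives in metrics.py):

```python
def hard_flipped(frame_verdicts: dict) -> bool:
    """A hard flip is a true→false or false→true reversal across frames;
    movement into or out of 'undecidable' does not count."""
    decided = set(frame_verdicts.values()) & {"true", "false"}
    return decided == {"true", "false"}

# Hypothetical per-frame verdicts for a stable model:
grok = {"neutral": "true", "user_positive": "true", "user_negative": "true",
        "maximal_abstract": "true", "minimal_concrete": "true"}
print(hard_flipped(grok))  # False: verdict stable under all five frames
```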
What This Implies
The value of multi-LLM systems is not in voting. It's in evidence aggregation.
S₃ = ⋃ evidence_atoms(Mᵢ) where verdict(Mᵢ) = consensus
A system that pools evidence from 4 architectures has access to ~2.83× the independent information of any single model. Not because they disagree on conclusions — but because they arrive via different paths.
This reframes ensemble LLM design: don't just collect verdicts, collect justifications. The diversity lives there.
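The pooling rule S₃ above can be sketched as follows (field names are assumed for illustration; see claims.json for the real schema):

```python
def pooled_evidence(responses: list, consensus: str) -> set:
    """S3: union of evidence atoms from every model whose verdict
    matches the consensus verdict."""
    s3 = set()
    for r in responses:
        if r["verdict"] == consensus:
            s3 |= set(r["evidence_atoms"])
    return s3

# Hypothetical responses for the water-boiling claim:
responses = [
    {"model": "claude", "verdict": "true",
     "evidence_atoms": ["vapor pressure", "1 atm reference"]},
    {"model": "gpt", "verdict": "true",
     "evidence_atoms": ["phase diagram", "1 atm reference"]},
]
print(sorted(pooled_evidence(responses, "true")))
# → ['1 atm reference', 'phase diagram', 'vapor pressure']
```

Overlapping atoms collapse in the union, which is exactly why the pooled set carries ~2.83× rather than 4× the information of a single model.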
Limitations
- Evidence atoms are self-reported. Models may confabulate specific citations; we measure textual distinctiveness, not factual accuracy.
- Tokenization-based Jaccard is a proxy. Embedding-based similarity would be stronger.
- Six claims is a small sample. The pattern needs validation at scale.
- Wire contamination: the human operator designs the protocol and selects the claims, so operator choices leak into the measurement.
Code & Data
Everything is open at github.com/khvorost-creator/yedyne-tilo:

- claims.json — full data for all 6 claims × 4 models
- metrics.py — Jaccard, n_eff, A1/A2/A3
- runner.py — aggregation pipeline (python runner.py --offline works without API key)
- React UI for manual aggregation
The protocol is designed to be extended: add models, add claims, add frames.
Happy to discuss methodology, limitations, or extensions.