Summary
LLM outputs vary substantially across models, prompts, and simulated perspectives. I propose "opinion fuzzing": systematically sampling across these dimensions to quantify and understand this variance. The concept is simple, but making it practically usable will require thoughtful tooling. In this piece I discuss what opinion fuzzing could be and show a simple example in a test application.
LLM Use
Claude Opus rewrote much of this document, mostly from earlier drafts. It also did background research, helping with the citations.
LLMs produce inconsistent outputs. The same model with identical inputs will sometimes give different answers. Small prompt changes produce surprisingly large output shifts.[1] If we want to use LLMs for anything resembling reliable judgments (research evaluation, forecasting, medical triage), this variance is a real hindrance.
We can't eliminate variance entirely. But we can measure it, understand its structure, and make better-calibrated judgments by sampling deliberately across the variance space. That's what I'm calling "opinion fuzzing."
The core idea is already used by AI forecasters. Winners of Metaculus's AI forecasting competitions consistently employed ensemble approaches. The top performer in Q4 2024 (pgodzinai) aggregated 3 GPT-4o runs with 5 Claude-3.5-Sonnet runs, filtering out the two most extreme values and averaging the remaining six forecasts. The Q2 2025 winner (Panshul42) used a more sophisticated ensemble: "sonnet 3.7 twice (later sonnet 4), o4-mini twice, and o3 once."
Survey data from Q4 2024 shows 76% of prize winners "repeated calls to an LLM and took a median/mean." The Q2 2025 analysis found that aggregation was the second-largest positive effect on bot performance. This basic form of sampling across models demonstrably works.
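For concreteness, here is a minimal sketch of that kind of trimmed ensemble. The probabilities are invented, and dropping the lowest and highest value is one plausible reading of "filtering out the two most extreme values":

```python
def trimmed_ensemble(forecasts: list[float]) -> float:
    """Drop the lowest and highest forecast, then average the rest."""
    ordered = sorted(forecasts)
    kept = ordered[1:-1]  # remove the two most extreme values
    return sum(kept) / len(kept)

# Eight hypothetical runs (e.g., 3 from one model, 5 from another), as probabilities.
runs = [0.62, 0.58, 0.70, 0.55, 0.90, 0.60, 0.30, 0.65]
print(trimmed_ensemble(runs))  # averages the six remaining forecasts
```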
What I'm proposing here is a more general, but very simple, framework: systematic sampling not just across models, but across prompt variations and simulated perspectives, with explicit analysis of the variance structure rather than just averaging it away. The goal isn’t simply to take a mean; it’s also to understand a complex output space.
The basic approach is simple: instead of a single LLM call, systematically sample across:

- Models (e.g., Claude, GPT, Gemini, DeepSeek)
- Prompt variations (paraphrases, reorderings, formatting tweaks)
- Simulated perspectives (personas such as different kinds of experts)

Then analyze the distribution of responses. This tells you:

- Where the estimates cluster (a more robust central answer)
- How wide the spread is (how much to trust any single answer)
- Which dimensions (model, prompt, persona) drive the variance
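As a rough sketch of that sampling loop (not the test tool described later): `query_llm` is a hypothetical wrapper around whichever API client you use, and the model names, prompt variants, and personas are placeholders.

```python
import itertools
import statistics

def query_llm(model: str, prompt: str) -> float:
    """Hypothetical wrapper around an LLM API; should return a probability in [0, 1]."""
    return 0.6  # stub -- replace with a real API call

MODELS = ["claude-sonnet-4-5", "gpt-5", "gemini-3-pro"]  # placeholder names
PROMPTS = [
    "Will US solar capacity exceed 500 GW by 2030? Answer with a probability.",
    "Estimate the probability that US solar capacity passes 500 GW before 2030.",
]
PERSONAS = ["a utility-scale solar analyst", "a skeptical energy economist", None]

samples = []
for model, prompt, persona in itertools.product(MODELS, PROMPTS, PERSONAS):
    framed = f"Answer as {persona}. {prompt}" if persona else prompt
    samples.append({"model": model, "persona": persona or "none",
                    "p": query_llm(model, framed)})

probs = [s["p"] for s in samples]
print(f"mean={statistics.mean(probs):.2f}  spread={max(probs) - min(probs):.2f}")

# Variance structure: how much does the answer move along each dimension?
for dim in ("model", "persona"):
    for value in sorted({s[dim] for s in samples}):
        group = [s["p"] for s in samples if s[dim] == value]
        print(f"{dim}={value}: mean={statistics.mean(group):.2f}")
```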
To illustrate the approach, here's what the workflow might look like:
User: "Will US solar capacity exceed 500 GW by 2030?"
Claude: "Based on current growth trends and policy commitments, this seems
likely (~65% probability). Current capacity is around 180 GW with annual
additions accelerating..."
This seems reasonable, but how confident should you actually be in this estimate?
Result: More calibrated estimate (55-65% after adjusting for identified biases), plus understanding of which factors drive variance.
The 50-point range matters. If you're making investment decisions, policy recommendations, or AI scaling infrastructure plans that depend on electricity availability, that range completely changes your analysis.
The naive approach samples uniformly. But we're already using LLMs. Why not use one as an experimental designer?
Proposed workflow:
This could be more sample-efficient when you care about understanding the variance structure, not just getting a robust average.
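A minimal sketch of what using an LLM as an experimental designer could look like, under the assumption that a "designer" model picks each new probe after seeing the samples collected so far; `query_llm`, the model names, and the prompt wording are all placeholders:

```python
def query_llm(model: str, prompt: str) -> str:
    """Hypothetical stand-in for a real LLM API call; replace before use."""
    return "stub response"

def adaptive_fuzz(question: str, budget: int = 20) -> list[dict]:
    samples = []
    for _ in range(budget):
        # Ask the designer model which probe would reveal the most new variance.
        design_prompt = (
            f"We are mapping how LLM answers vary for the question: {question}\n"
            f"Samples so far: {samples}\n"
            "Suggest one new prompt variant (a single line) most likely to "
            "reveal variance we haven't seen yet."
        )
        next_prompt = query_llm("designer-model", design_prompt)
        answer = query_llm("worker-model", next_prompt)
        samples.append({"prompt": next_prompt, "answer": answer})
    return samples

results = adaptive_fuzz("Will US solar capacity exceed 500 GW by 2030?")
```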
Do this when:
Don't do this when:
Based on current pricing with 650 input tokens and 250 output tokens per small call (roughly 500 words input, 200 words output):
| Model | Input ($/M tokens) | Output ($/M tokens) | 400 calls (650 in / 250 out tokens each) |
|---|---|---|---|
| Claude Opus 4.5 | $5.00 | $25.00 | ~$3.80 |
| Claude Sonnet 4.5 | $3.00 | $15.00 | ~$2.28 |
| GPT-5 | $1.25 | $10.00 | ~$1.33 |
| GPT-4o | $5.00 | $20.00 | ~$3.30 |
| Gemini 3 Pro | $2.00 | $12.00 | ~$1.72 |
| DeepSeek V3.2 | $0.26 | $0.39 | ~$0.11 |
For many use cases, even $1-4 per judgement is reasonable. For high-volume applications, mixing cheaper models (DeepSeek V3.2, GPT-5, Gemini 3 Pro) with occasional frontier model validation (Claude Opus 4.5, Claude Sonnet 4.5) keeps costs manageable while maintaining quality for critical queries.
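The per-ensemble figures above are easy to reproduce; prices are quoted per million tokens, as in the table:

```python
# Reproduce the "400 calls" column: 400 calls at 650 input + 250 output tokens each,
# with prices quoted per million tokens.
PRICES = {  # (input $/M tokens, output $/M tokens)
    "Claude Opus 4.5": (5.00, 25.00),
    "Claude Sonnet 4.5": (3.00, 15.00),
    "GPT-5": (1.25, 10.00),
    "GPT-4o": (5.00, 20.00),
    "Gemini 3 Pro": (2.00, 12.00),
    "DeepSeek V3.2": (0.26, 0.39),
}

calls, in_tokens, out_tokens = 400, 650, 250
for model, (p_in, p_out) in PRICES.items():
    cost = calls * (in_tokens * p_in + out_tokens * p_out) / 1_000_000
    print(f"{model}: ~${cost:.2f}")
```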
I’ve started work on one tool to test some of these ideas. It runs a question through a bunch of different LLMs and then plots the results. For each run, it asks for a simple “Agree vs. Disagree” score and a “Confidence” score.
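For context, here is roughly the kind of per-run query such a tool could issue; the 0-100 scales and JSON schema are my assumptions for illustration, not the tool's actual format:

```python
import json

SCORING_PROMPT = """Statement: {statement}

Reply with JSON only:
{{"agreement": <0-100, where 0 = strongly disagree and 100 = strongly agree>,
  "confidence": <0-100, how confident you are in that agreement score>}}"""

def score_statement(query_llm, model: str, statement: str) -> dict:
    """query_llm is a hypothetical (model, prompt) -> str API wrapper."""
    raw = query_llm(model, SCORING_PROMPT.format(statement=statement))
    return json.loads(raw)
```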
Below is a plot for the question, “There is at least a 10% chance that the US won't be considered a democracy by 2030 according to The Economist Democracy Index.” The dots represent the stated opinions of different LLM runs.
I had Claude Code run variations of this in different settings, effectively doing a version of the adaptive sampling discussed above. It showed that this article updated the opinions of many LLMs on this question. Some comments on the article were critical, but the LLMs didn’t seem very swayed by them.
This tool is still in development. I’d want it to be more flexible to enable opinion fuzzing with 50+ queries per question, but this will take some iteration.
Some noted challenges:
This doesn't fix fundamental model capabilities. Garbage in, variance-adjusted garbage out. If no model in your ensemble actually knows the answer, you might get a tight distribution around the wrong answer.
Correlated errors across models matter. Common training data and RLHF procedures mean true independence is lower than it appears.
One massive question mark is what background research to do on a given question. If someone asks, "Will US solar capacity exceed 500 GW by 2030?", a lot of different kinds of research might be done to help answer it. Opinion fuzzing does not answer that question, though it can help show how sensitive conclusions are to specific research results.
Personas are simulated and may not capture real expert disagreements. This needs empirical testing before I'd recommend making it a core part of the methodology.
Thanks to Deger Turan for comments on this post.
[1] Sclar et al. (2024, arXiv:2310.11324) documented performance differences of “up to 76 accuracy points” from formatting changes alone on LLaMA-2-13B.