Summary
LLM outputs vary substantially across models, prompts, and simulated perspectives. I propose "opinion fuzzing": systematically sampling across these dimensions to quantify and understand this variance. The concept is simple, but making it practically usable will require thoughtful tooling. In this piece I discuss what opinion fuzzing could be and show a simple example in a test application.
LLM Use
Claude Opus rewrote much of this document, mostly from earlier drafts. It also did background research, helping with the citations.
LLMs produce inconsistent outputs. The same model with identical inputs will sometimes give different answers. Small prompt changes produce surprisingly large output shifts.[1] If we want to use LLMs for anything resembling reliable judgments (research evaluation, forecasting, medical triage), this variance is a real hindrance.
We can't eliminate variance entirely. But we can measure it, understand its structure, and make better-calibrated judgments by sampling deliberately across the variance space. That's what I'm calling "opinion fuzzing."
The core idea is already used by AI forecasters. Winners of Metaculus's AI forecasting competitions consistently employed ensemble approaches. The top performer in Q4 2024 (pgodzinai) aggregated 3 GPT-4o runs with 5 Claude-3.5-Sonnet runs, filtering out the two most extreme values and averaging the remaining six forecasts. The Q2 2025 winner (Panshul42) used a more sophisticated ensemble: "sonnet 3.7 twice (later sonnet 4), o4-mini twice, and o3 once."
Survey data from Q4 2024 shows 76% of prize winners "repeated calls to an LLM and took a median/mean." The Q2 2025 analysis found that aggregation was the second-largest positive effect on bot performance. This basic form of sampling across models demonstrably works.
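For concreteness, here is a minimal sketch of that kind of trimmed ensemble. The probabilities are invented, and dropping the lowest and highest value is one plausible reading of "filtering out the two most extreme values":

```python
def trimmed_ensemble(forecasts: list[float]) -> float:
    """Drop the lowest and highest forecast, then average the rest."""
    ordered = sorted(forecasts)
    kept = ordered[1:-1]  # remove the two most extreme values
    return sum(kept) / len(kept)

# Eight hypothetical runs (e.g., 3 from one model, 5 from another), as probabilities.
runs = [0.62, 0.58, 0.70, 0.55, 0.90, 0.60, 0.30, 0.65]
print(trimmed_ensemble(runs))  # averages the six remaining forecasts
```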
What I'm proposing here is a more general, but very simple, framework: systematic sampling not just across models, but across prompt variations and simulated perspectives, with explicit analysis of the variance structure rather than just averaging it away. The goal isn’t simply to take a mean; it’s also to understand a complex output space.
The basic approach is simple: instead of a single LLM call, systematically sample across:

- Models (e.g., Claude, GPT, Gemini, DeepSeek)
- Prompt variations (paraphrases, reorderings, formatting tweaks)
- Simulated perspectives (personas such as different kinds of experts)

Then analyze the distribution of responses. This tells you:

- Where the estimates cluster (a more robust central answer)
- How wide the spread is (how much to trust any single answer)
- Which dimensions (model, prompt, persona) drive the variance
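As a rough sketch of that sampling loop (not the test tool described later): `query_llm` is a hypothetical wrapper around whichever API client you use, and the model names, prompt variants, and personas are placeholders.

```python
import itertools
import statistics

def query_llm(model: str, prompt: str) -> float:
    """Hypothetical wrapper around an LLM API; should return a probability in [0, 1]."""
    return 0.6  # stub -- replace with a real API call

MODELS = ["claude-sonnet-4-5", "gpt-5", "gemini-3-pro"]  # placeholder names
PROMPTS = [
    "Will US solar capacity exceed 500 GW by 2030? Answer with a probability.",
    "Estimate the probability that US solar capacity passes 500 GW before 2030.",
]
PERSONAS = ["a utility-scale solar analyst", "a skeptical energy economist", None]

samples = []
for model, prompt, persona in itertools.product(MODELS, PROMPTS, PERSONAS):
    framed = f"Answer as {persona}. {prompt}" if persona else prompt
    samples.append({"model": model, "persona": persona or "none",
                    "p": query_llm(model, framed)})

probs = [s["p"] for s in samples]
print(f"mean={statistics.mean(probs):.2f}  spread={max(probs) - min(probs):.2f}")

# Variance structure: how much does the answer move along each dimension?
for dim in ("model", "persona"):
    for value in sorted({s[dim] for s in samples}):
        group = [s["p"] for s in samples if s[dim] == value]
        print(f"{dim}={value}: mean={statistics.mean(group):.2f}")
```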
To illustrate the approach, here's what the workflow might look like:
User: "Will US solar capacity exceed 500 GW by 2030?"
Claude: "Based on current growth trends and policy commitments, this seems
likely (~65% probability). Current capacity is around 180 GW with annual
additions accelerating..."
This seems reasonable, but how confident should you actually be in this estimate?
Result: More calibrated estimate (55-65% after adjusting for identified biases), plus understanding of which factors drive variance.
The 50-point range matters. If you're making investment decisions, policy recommendations, or AI scaling infrastructure plans that depend on electricity availability, that range completely changes your analysis.
The naive approach samples uniformly. But we're already using LLMs. Why not use one as an experimental designer?
Proposed workflow:
This could be more sample-efficient when you care about understanding the variance structure, not just getting a robust average.
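A minimal sketch of what using an LLM as an experimental designer could look like, under the assumption that a "designer" model picks each new probe after seeing the samples collected so far; `query_llm`, the model names, and the prompt wording are all placeholders:

```python
def query_llm(model: str, prompt: str) -> str:
    """Hypothetical stand-in for a real LLM API call; replace before use."""
    return "stub response"

def adaptive_fuzz(question: str, budget: int = 20) -> list[dict]:
    samples = []
    for _ in range(budget):
        # Ask the designer model which probe would reveal the most new variance.
        design_prompt = (
            f"We are mapping how LLM answers vary for the question: {question}\n"
            f"Samples so far: {samples}\n"
            "Suggest one new prompt variant (a single line) most likely to "
            "reveal variance we haven't seen yet."
        )
        next_prompt = query_llm("designer-model", design_prompt)
        answer = query_llm("worker-model", next_prompt)
        samples.append({"prompt": next_prompt, "answer": answer})
    return samples

results = adaptive_fuzz("Will US solar capacity exceed 500 GW by 2030?")
```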
Do this when:
Don't do this when:
Based on current pricing with 650 input tokens and 250 output tokens per small call (roughly 500 words input, 200 words output):
| Model | Input ($/M tokens) | Output ($/M tokens) | 400 calls (650 in / 250 out tokens each) |
|---|---|---|---|
| Claude Opus 4.5 | $5.00 | $25.00 | ~$3.80 |
| Claude Sonnet 4.5 | $3.00 | $15.00 | ~$2.28 |
| GPT-5 | $1.25 | $10.00 | ~$1.33 |
| GPT-4o | $5.00 | $20.00 | ~$3.30 |
| Gemini 3 Pro | $2.00 | $12.00 | ~$1.72 |
| DeepSeek V3.2 | $0.26 | $0.39 | ~$0.11 |
For many use cases, even $1-4 per judgement is reasonable. For high-volume applications, mixing cheaper models (DeepSeek V3.2, GPT-5, Gemini 3 Pro) with occasional frontier model validation (Claude Opus 4.5, Claude Sonnet 4.5) keeps costs manageable while maintaining quality for critical queries.
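The per-ensemble figures above are easy to reproduce; prices are quoted per million tokens, as in the table:

```python
# Reproduce the "400 calls" column: 400 calls at 650 input + 250 output tokens each,
# with prices quoted per million tokens.
PRICES = {  # (input $/M tokens, output $/M tokens)
    "Claude Opus 4.5": (5.00, 25.00),
    "Claude Sonnet 4.5": (3.00, 15.00),
    "GPT-5": (1.25, 10.00),
    "GPT-4o": (5.00, 20.00),
    "Gemini 3 Pro": (2.00, 12.00),
    "DeepSeek V3.2": (0.26, 0.39),
}

calls, in_tokens, out_tokens = 400, 650, 250
for model, (p_in, p_out) in PRICES.items():
    cost = calls * (in_tokens * p_in + out_tokens * p_out) / 1_000_000
    print(f"{model}: ~${cost:.2f}")
```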
I’ve started work on one tool to test some of these ideas. It runs a question through a bunch of different LLMs and then plots the results. For each run, it asks for a simple “Agree vs. Disagree” score and a “Confidence” score.
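For context, here is roughly the kind of per-run query such a tool could issue; the 0-100 scales and JSON schema are my assumptions for illustration, not the tool's actual format:

```python
import json

SCORING_PROMPT = """Statement: {statement}

Reply with JSON only:
{{"agreement": <0-100, where 0 = strongly disagree and 100 = strongly agree>,
  "confidence": <0-100, how confident you are in that agreement score>}}"""

def score_statement(query_llm, model: str, statement: str) -> dict:
    """query_llm is a hypothetical (model, prompt) -> str API wrapper."""
    raw = query_llm(model, SCORING_PROMPT.format(statement=statement))
    return json.loads(raw)
```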
Below is a plot for the question, “There is at least a 10% chance that the US won't be considered a democracy by 2030 according to The Economist Democracy Index.” The dots represent the stated opinions of different LLM runs.
I had Claude Code run variations of this in different settings, effectively doing a version of the adaptive sampling discussed above. It showed that this article updated the opinions of many LLMs on this question. Some comments on the article were critical, but the LLMs didn’t seem very swayed by them.
This tool is still in development. I’d want it to be more flexible to enable opinion fuzzing with 50+ queries per question, but this will take some iteration.
Some noted challenges:
This doesn't fix fundamental model capabilities. Garbage in, variance-adjusted garbage out. If no model in your ensemble actually knows the answer, you might get a tight distribution around the wrong answer.
Correlated errors across models matter. Common training data and RLHF procedures mean true independence is lower than it appears.
One massive question mark is what background research to do on a given question. If someone asks, "Will US solar capacity exceed 500 GW by 2030?", a lot of different kinds of research might be done to help answer it. Opinion fuzzing does not answer that question, though it can help show how sensitive conclusions are to specific research results.
Personas are simulated and may not capture real expert disagreements. This needs empirical testing before I'd recommend making it a core part of the methodology.
Thanks to Deger Turan for comments on this post.
[1] Sclar et al. (2024, arXiv:2310.11324) documented performance differences of “up to 76 accuracy points” from formatting changes alone on LLaMA-2-13B.