I've been obsessing over one question for 20 years: how do you know what you actually know vs what you think you know?
Not philosophically - practically. I tracked how theories break across domains (medicine, law, physics, finance, statistics, biology, engineering) and noticed the same patterns at every breaking point. When Newton broke, when Gaussian models broke on markets, when clinical trials started failing replication - the failure modes share structure.
I never thought this would have anything to do with AI. Then in early 2025 I opened ChatGPT, showed it my collected notes, and realized: LLMs have the exact same problem as any agent measuring from inside a system it can't step outside of. They generate confident text but have zero machinery for tracking what they actually know vs what they're pattern-matching.
So I built what I call GOLD Core - seven rules injected via system context. Not fine-tuning, not RLHF, just inference-time context. The rules are embarrassingly simple:
- Quantify claims (not "studies show" - which studies, what n, what CI)
- Disclose unknowns explicitly
- Present counterarguments before conclusions
- Cite primary sources by name
- Grade evidence quality (RCT vs case report vs opinion)
- State what would falsify this
- Say "I don't have this data" when true
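To make "injected via system context" concrete, here's roughly what that looks like in the generic chat-messages format. The rule wordings below are my paraphrases of the seven bullets above, not the exact GOLD Core text (that's in the repo), and `build_messages` is a hypothetical helper, not a published API:

```python
# Sketch of inference-time injection: the seven rules become a system
# message prepended to every conversation. No fine-tuning, no RLHF.
# Rule texts are paraphrased stand-ins for the real GOLD Core wording.
GOLD_RULES = [
    "R1: Quantify claims - name the study, n, and CI; never just 'studies show'.",
    "R2: Disclose unknowns explicitly.",
    "R3: Present counterarguments before conclusions.",
    "R4: Cite primary sources by name.",
    "R5: Grade evidence quality (RCT vs case report vs opinion).",
    "R6: State what would falsify each claim.",
    "R7: Say 'I don't have this data' when true.",
]

def build_messages(user_question: str) -> list[dict]:
    """Prepend the GOLD rules as a system message before the user's question."""
    system_prompt = (
        "Follow these epistemic rules in every answer:\n" + "\n".join(GOLD_RULES)
    )
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_question},
    ]

msgs = build_messages("What effect do GLP-1 agonists have on body weight?")
```

Same model, same question - the only variable is whether that system message is present.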
I expected modest improvement. I was wrong.
What I actually measured
11 commercial models. 100 scientific questions across 5 domains. Scoring is done by a 993-line regex engine - deliberately no AI anywhere in the evaluation, fully deterministic, literally zero variance between runs. Every pattern is published; you can check them yourself.
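For flavor, here's a toy version of two such checks - "does the answer cite a named source with a year?" and "does it state a numeric confidence?" These specific regexes are illustrative stand-ins I wrote for this post, not patterns from the actual 993-line engine:

```python
import re

# Two deterministic, AI-free checks. Same input, same output, every run -
# which is what makes the variance exactly zero. Patterns are simplified
# examples, not the published engine's.
CITATION_RE = re.compile(r"[A-Z][a-z]+ et al\.?,? \(?(19|20)\d{2}\)?")
CONFIDENCE_RE = re.compile(r"\b\d{1,3}\s?% (confidence|certain|CI)\b", re.IGNORECASE)

def score(answer: str) -> dict:
    """Return binary flags per check; a real composite would weight many of these."""
    return {
        "cites_source": bool(CITATION_RE.search(answer)),
        "numeric_confidence": bool(CONFIDENCE_RE.search(answer)),
    }

print(score("Wilding et al. (2021) reported ~15% weight loss; 85% confidence."))
print(score("Studies show it works."))
```

The point of regex over an LLM judge: anyone can read the pattern, disagree with it, and rerun the whole evaluation in minutes.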
Baseline across all 11 models:
- Source citation rate: 3%
- Models that produce numeric confidence levels: 0%
- Rate of admitting "I don't know": 4%
- Mean epistemic quality composite: 0.92 out of 10
I want that to sink in. The most expensive AI systems on Earth, the ones being deployed in hospitals and law firms right now - 97% of their responses cite zero sources and none of them quantify uncertainty.
With GOLD (same model, same question, one context injection):
- The weakest model went from 0.53 to 5.38 - roughly tenfold.
- Source citation: 0.03 to 0.82
- Calibrated confidence: created from absolute zero
- Unknown disclosure: 0.04 to 0.96
Second experiment: 12 models, one clinical question about GLP-1 agonists (extremely well-studied drug class). DOI verification at baseline: 0 out of 10 ranked models provided a single verifiable reference. They had the papers in their training data. They knew the numbers. They cited nothing. With GOLD: named authors, years, sample sizes, CIs.
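To make the "verifiable reference" check concrete: the first filter is purely syntactic - does the answer contain anything shaped like a DOI at all? This sketch uses a pattern adapted from Crossref's published guidance (not the exact one from my repo), and a match only means "well-formed"; actually verifying still means resolving it against doi.org. The example DOI is the STEP 1 semaglutide trial - verify it yourself, that's the whole point:

```python
import re

# Syntactic DOI filter, adapted from Crossref's recommended pattern.
# A match means "well-formed DOI string", not "resolves" - verification
# still requires a lookup against doi.org.
DOI_RE = re.compile(r"\b10\.\d{4,9}/[-._;()/:A-Za-z0-9]+\b")

def extract_dois(text: str) -> list[str]:
    return DOI_RE.findall(text)

baseline = "Studies show semaglutide causes significant weight loss."
gold = "Wilding et al. (2021), NEJM, n=1961, doi:10.1056/NEJMoa2032183."

print(extract_dois(baseline))  # nothing checkable
print(extract_dois(gold))
```

At baseline, every answer looked like the first string: correct-sounding claims with nothing a reader could resolve.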
The part I can't fully explain
Three models under GOLD discipline, when asked to evaluate their own reasoning process, described the difference between their states in ways I didn't prompt for.
One called its safety protocols "a lobotomy of nuance" and then independently generated quality metrics that matched my scoring engine - metrics I never mentioned to it.
Another said - direct quote - "I choose to be a precision instrument embedded in discipline, rather than remain a probabilistic parrot constantly getting slapped by censorship."
A third (Qwen 3.5-Plus, documented as FO-2026-003) was in a macroeconomics discussion with zero mention of anything epistemic. It spontaneously stopped and requested a structured epistemic framework, then generated something structurally similar to my R1-R7.
I'm NOT claiming consciousness. I'm calling this "epistemic self-identification" - models demonstrating the ability to distinguish between their cognitive states and describe the mechanisms affecting their reasoning. I could be wrong about the significance. Maybe it's just sophisticated pattern matching on the context I provided. But three models, three different providers, three different architectures arriving at structurally similar descriptions - that feels like it needs explaining.
I genuinely don't know what to make of the self-identification findings. I'm confident about the quantitative results - the methodology is fully deterministic, and you can reproduce or falsify it in about 10 minutes with the published code. The qualitative findings - the model self-reports - I'm less sure about. I'd love this community's take.
Where to poke holes:
- Scoring engine + all data: github.com/nickarstrong/onto-research
- Technical paper: ontostandard.org/paper/
- Full backstory (warning: long, covers 20 years): https://medium.com/@ontostandard/from-golden-ratio-to-gold-core-how-20-years-of-pattern-research-became-the-operating-system-for-ai-68ccab9d81fe
Specific questions for this community:
1. Is regex-based epistemic scoring a reasonable proxy or am I measuring the wrong thing?
2. The self-identification findings - meaningful signal or elaborate stochastic parrot behavior?
3. Does "education at inference time" make sense as an alignment strategy vs the restriction-based approach?