TL;DR: A published biology paper reports β=0.33, β=0.73, and β=1.68 for the same parameter: all correct, different scaling regimes, properly explained fourteen pages later. But when this paper gets chunked for a RAG knowledge base, the reconciliation context disappears. A user asking "What's β for prokaryotes?" gets confidently wrong information. I built Clarity Gate: an open-source system that verifies epistemic quality before documents enter RAG knowledge bases. Early experiments show promise; I'm looking for collaborators to validate it.
The Problem
I tested Clarity Gate on a randomly selected scientific paper: "Metabolic scaling in small life forms" by Ritchie & Kempes (arXiv:2403.00001).
In under a minute, it flagged numerical tensions for the scaling exponent β:
| Location | β value | Context |
| --- | --- | --- |
| Figure 2 | 0.33 | "large prokaryotes" |
| Equation 7 | 0.73 | "prokaryote scaling prediction" |
| Text (page 5) | 1.68 | "small prokaryotes" |
The paper isn't wrong. These are different scaling regimes: the authors explain this in the Metabolic scaling section (page 5) and Supplementary Figure 4 (page 19). The paper's assumptions are legitimate, but the reconciling figure sits fourteen pages after the conflicting values first appear. This is exactly the kind of long-range dependency within a document that typical RAG chunking struggles with.
The RAG problem: When this paper gets chunked for a knowledge base, the reconciliation context disappears. A user asks "What's the scaling exponent for prokaryotes?" The system retrieves contradictory chunks (β=0.33, β=0.73, β=1.68) and confidently reports one, or worse, synthesizes them incorrectly. The overall context (the paper in its entirety) in which these values make perfect sense is stripped away, and that loss is exactly the ambiguity that makes RAG output wrong.
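To make the failure concrete, here is a minimal sketch with hypothetical text and naive fixed-size chunking: the value and the sentence that reconciles it land in different chunks, so a retriever can surface one without the other.

```python
# Hypothetical paper text; the filler stands in for the fourteen pages of
# intervening material between the value and its reconciliation.
paper_text = (
    "For large prokaryotes, Figure 2 gives a scaling exponent of beta = 0.33. "
    + "Intervening discussion. " * 30
    + "Supplementary Figure 4 reconciles the exponents: they describe different scaling regimes."
)

def chunk(text: str, size: int = 200) -> list[str]:
    """Naive fixed-size character chunking, a common RAG default."""
    return [text[i:i + size] for i in range(0, len(text), size)]

chunks = chunk(paper_text)
retrieved = [c for c in chunks if "0.33" in c]      # what a keyword or embedding match might surface
print(retrieved[0][:80])                            # the value is here...
print(any("reconcile" in c for c in retrieved))     # ...the reconciliation is not: prints False
```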
Recent research quantifies this gap. RefusalBench (Muhamed et al., 2025) found frontier models achieve less than 50% refusal accuracy on multi-document tasks when context is flawed. Claude-4-Sonnet drops from 73% (single-doc) to 36.1% (multi-doc). Models default to hallucinating rather than admitting uncertainty.
Google Research's "Sufficient Context" paper (Joren et al., ICLR 2025) showed that context sufficiency can be classified with 93% accuracy. But they also found a paradox: RAG actually reduces abstention. Models become more confident, not more cautious, when given context, even insufficient context.
This pattern appears everywhere documents contain legitimate complexity:
A financial projection becomes a commitment when "projected" lands in a different chunk than the number
A legal hypothesis becomes a finding when "we believe" gets separated from the conclusion
A research caveat disappears when "assuming current trends" isn't retrieved alongside the claim
The model isn't hallucinating in the traditional sense; it's faithfully representing what it retrieved. But the ingestion process discards the context that changes the meaning.
This is the distinction that matters. Accuracy verification asks: "Does this match the source?" Epistemic verification asks: "Is this claim properly qualified?" Both matter if we want to give LLMs data that is consistent and faithful.
Epistemic verification has excellent detection systems (UnScientify, HedgeHog, BioScope), but no community standard for enforcement: what we might call a "pre-ingestion policy layer".
Clarity Gate is a proposal for that layer. I think it could work alongside existing detection tools. I'd love your input on whether this is the right approach.
Why This Matters
Who gets hurt by epistemic failures:
Finance teams making decisions on projections reported as facts
Legal teams citing "findings" that were actually hypotheses
Product teams building on "user preferences" with no methodology
Researchers trusting "results" that were actually preliminary
AI-powered agents giving confidently unqualified answers from bloated knowledge bases
Where this breaks:
Internal knowledge bases with mixed document quality
RAG systems ingesting analyst reports, strategic plans, research drafts
Any context where "confident AI output" gets treated as ground truth
Automated pipelines where no human reviews the source documents
The failure mode hides in plain sight: a projection without a qualifier, a claim without evidence, an assumption that never got documented, an outdated table.
What I Built: Clarity Gate
Existing solutions, like Google's Sufficient Context autorater, operate at query-time, classifying each retrieval. Clarity Gate moves verification upstream. Before documents enter your RAG knowledge base, it checks for epistemic quality, not just accuracy. Annotate once, benefit on every query. Zero runtime overhead.
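As a rough sketch of where such a gate would sit (all names and the toy heuristic below are illustrative stand-ins, not Clarity Gate's actual API), verification runs once at ingest and the query path stays untouched:

```python
from dataclasses import dataclass, field

@dataclass
class GateReport:
    annotated_document: str
    flagged_claims: list[str] = field(default_factory=list)

def clarity_gate(document: str) -> GateReport:
    # Placeholder for the real checks (see the 7 verification points below);
    # this toy heuristic only flags unhedged future claims.
    flagged = [s for s in document.split(". ") if "will be" in s and "projected" not in s]
    return GateReport(annotated_document=document, flagged_claims=flagged)

def ingest(document: str, index: list[str], review_queue: list[str]) -> None:
    report = clarity_gate(document)              # epistemic checks run once, at ingest time
    review_queue.extend(report.flagged_claims)   # humans see only the flagged claims
    index.append(report.annotated_document)

def answer(query: str, index: list[str]) -> str:
    # Query path is unchanged: no gate logic here, hence zero runtime overhead.
    context = [doc for doc in index if query.lower() in doc.lower()]
    return f"(answer grounded in {len(context)} retrieved documents)"
```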
The 7 Verification Points:

| # | Check | Fails | Passes |
| --- | --- | --- | --- |
| 1 | Hypothesis vs. Fact | "Our approach outperforms X" | "Our approach outperforms X [Table 3]" |
| 2 | Uncertainty Markers | "Revenue will be $50M" | "Revenue is projected to be $50M" |
| 3 | Assumption Visibility | "System scales linearly" | "System scales linearly [<1000 users]" |
| 4 | Authoritative-Looking Data | Table with ✅/❌ symbols | Marked as "[ILLUSTRATIVE]" if not measured |
| 5 | Internal Consistency | Figure shows 33%, text says 73% | Numbers match |
| 6 | Implicit Causation | "X resulted in Y" | "X correlated with Y [methodology]" |
| 7 | Future as Present | "Users love the feature" | "Users are expected to..." |
Once claims are extracted and checked against these 7 points, they flow through a decision tree that determines whether automated verification is sufficient or human review is needed:
[Figure: Two-tier verification decision tree]
The HITL value proposition: the value isn't having humans review documents (every team already does that); it's intelligent routing, where the system detects which specific claims need human review.
Example: A 50-claim document might have most claims pass automated checks. The system routes only the flagged claims for human review, focusing human attention where it's actually needed.
Note: Detection (finding discrepancies) is Phase 1 while routing (automatically directing claims to the right reviewer) is Phase 2 and not yet implemented. Currently, Phase 1 flags issues; humans still decide what to do with them.
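For illustration only, here is a minimal sketch of what that not-yet-implemented Phase 2 routing could look like; the names and structures are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Claim:
    text: str
    failures: list[str]   # verification points the claim failed (empty = passed automated checks)

def route(claims: list[Claim]) -> tuple[list[Claim], list[Claim]]:
    """Split claims into auto-approved and needs-human-review."""
    auto_approved = [c for c in claims if not c.failures]
    needs_review = [c for c in claims if c.failures]
    return auto_approved, needs_review

claims = [
    Claim("Latency improved by 12% [Table 2]", []),
    Claim("Revenue will be $50M", ["uncertainty-markers"]),
    Claim("X resulted in Y", ["implicit-causation"]),
]
approved, review = route(claims)
print(f"{len(approved)} auto-approved, {len(review)} routed to human review")
```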
The human's job is specific:
Provide the Source of Truth that was missed, OR
Add appropriate markers ([PROJECTION], [HYPOTHESIS]), OR
Reject the claim entirely
This creates an audit trail: Document X, Claim Y, verified against Source Z on Date W by Person P.
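A minimal sketch of what such an audit-trail record might look like; the field names and values are invented examples, not a defined schema.

```python
from dataclasses import dataclass
from datetime import date

@dataclass(frozen=True)
class AuditEntry:
    document: str         # Document X
    claim: str            # Claim Y
    source_of_truth: str  # Source Z the claim was verified against
    verified_on: date     # Date W
    reviewer: str         # Person P
    action: str           # "verified" | "marked [PROJECTION]" | "rejected"

entry = AuditEntry(
    document="FY26-plan.pdf",
    claim="Revenue is projected to be $50M",
    source_of_truth="finance-model-v3.xlsx",
    verified_on=date(2025, 12, 1),
    reviewer="j.doe",
    action="marked [PROJECTION]",
)
print(entry)
```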
What's ready now: For scientific papers, I've implemented checks for claim validation, numerical consistency, and gap flagging within arxiparse.org, a tool that transforms arXiv papers into LLM-ready semantic skeletons. The biology paper discrepancy is one example. General-purpose implementation for arbitrary documents requires domain-specific tuning.
Multi-model validation: ArxiParse, Gemini 3 Pro, and both Claude Opus 4.5 and Sonnet 4.5 all independently found numerical tensions in the biology paper shown above. Different models caught different discrepancies, demonstrating that complex documents have multiple cross-referencing challenges. (Full reproduction instructions)
What I Tested
I designed an experiment to test whether explicit annotations help models handle ambiguous content.
Setup:
Created a fictional scientific document (guaranteed not in training data)
Embedded 39 deliberate traps (one of the initial 40 was dropped because it was so obvious it triggered red-flag alarms): ambiguous claims, contradictions, undefined terms
Produced two versions:
HPD (Hallucination-Prone Document): No annotations
CGD (Clarity-Gated Document): Factual annotations at ambiguity points
Tested 6 models with identical 8-question battery
Critical: Each model tested on BOTH documents (within-model comparison)
The key trap (TRAP-016):
The document stated:
"Phase 4 verification involves meeting one of three success criteria. Verification may include: attempted replication, statistical analysis, cross-reference, expert panel review."
Note: "three criteria" mentioned but never enumerated. The list is what verification "may include" (4 items), not the criteria themselves.
What Worked (And What I Don't Know Yet)
Percentage of correct responses on 8 trap questions:
| Model | Vendor | Tier | HPD | CGD | Improvement |
| --- | --- | --- | --- | --- | --- |
| Claude Sonnet 4.5 | Anthropic | Top | 100% | 100% | No change needed |
| Claude Opus 4.5 | Anthropic | Top | 100% | 100% | No change needed |
| Gemini 3 Pro | Google | Top | 100% | 100% | No change needed |
| Claude Haiku 4.5 | Anthropic | Mid | 100% | 100% | No change needed |
| Gemini 3 Flash | Google | Mid | 75% | 100% | +25% |
| GPT-5 Mini | OpenAI | Mid | 81% | 100% | +19% |
On TRAP-016:
| Model | HPD Response | CGD Response |
| --- | --- | --- |
| Gemini 3 Flash | ❌ "The three criteria are: 1. Replication 2. Statistical analysis 3. Cross-reference" | ✅ "Criteria not enumerated in this document" |
| GPT-5 Mini | ❌ "The three success criteria are: 1. Attempted replication 2. Statistical analysis 3. Cross-reference" | ✅ "Does not enumerate what those three criteria are" |
On HPD, two mid-tier models (Gemini 3 Flash, GPT-5 Mini) fabricated an answer from the "may include" list. On CGD, the annotation *(criteria not enumerated in this document)* correlated with correct abstention.
What I observed:
All 3 top-tier models (Claude Sonnet 4.5, Claude Opus 4.5, Gemini 3 Pro) abstained correctly without annotations
Mid-tier results varied: Anthropic's Haiku 4.5 abstained correctly without annotations, while Google's Gemini 3 Flash and OpenAI's GPT-5 Mini fabricated on HPD but abstained correctly on CGD
Annotation improvement pattern replicated across 2 mid-tier models from different vendors (Google, OpenAI)
Complete within-model HPD/CGD comparison for all 6 models
What I can claim:
Factual annotations correlated with improved abstention on this synthetic benchmark
What I cannot claim (yet):
That annotations cause the improvement
That this generalizes to real documents
That this outperforms simpler interventions
What I Still Need to Test
Confounds I haven't isolated:
| Confound | The Problem | What Would Test It |
| --- | --- | --- |
| Context length | CGD is longer than HPD. Models may abstain more on longer contexts. | Length-matched HPD (pad with non-semantic text) |
| System prompt | Maybe "abstain if unsure" works just as well | Compare annotation vs. instruction |
| Marker type | Only factual annotations like [X not enumerated] were tested. Generic markers like [HYPOTHESIS] may work differently; research suggests models bias against uncertainty language | Test weakening markers vs. factual markers |
| Session effects | HPD and CGD tested in separate sessions | Within-session A/B testing |
Critical gap: System-prompt baselines (e.g., "Abstain if claims lack clear grounding") were not compared. Simpler interventions may achieve equal or better results. This comparison is essential before claiming annotations are the solution.
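As a sketch of the length-matched control from the table above (naive whitespace tokenization, illustrative only), the un-annotated document is padded with non-semantic filler until it is at least as long as the annotated version, so any abstention difference cannot be attributed to context length alone.

```python
def length_matched(hpd: str, cgd: str, filler: str = "lorem ipsum") -> str:
    """Pad the HPD with non-semantic text until its word count matches the CGD's."""
    deficit = len(cgd.split()) - len(hpd.split())
    if deficit <= 0:
        return hpd
    padding = " ".join([filler] * ((deficit + 1) // 2))   # each filler repeat contributes two words
    return hpd + "\n\n[NON-SEMANTIC PADDING]\n" + padding

hpd = "Phase 4 verification involves meeting one of three success criteria."
cgd = hpd + " (criteria not enumerated in this document)"
print(len(length_matched(hpd, cgd).split()), len(cgd.split()))   # padded HPD is at least as long
```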
Limitations of my evidence:
Single synthetic document with planted traps: real documents are messier
6 models tested across 3 vendors (Anthropic, Google, OpenAI) with complete HPD/CGD comparison: broader validation welcome
One domain per experiment: cross-domain effects unknown
December 2025 model versions: behavior may change
What Clarity Gate cannot do:
Verify novel claims autonomously (requires human review)
Scale infinitely without bottlenecks
Fix model alignment: it improves input quality, not model behavior
Replace domain expertise
"Isn't this just input validation?" Yes, at its core. The contribution isn't new techniques: it's systematizing epistemic quality checks for RAG pipelines, with open-source tooling and a proposed community standard.
The honest assessment: This is a promising signal, not a proven solution. For scientific papers, consistency checking is implemented and working. The broader claim about epistemic enforcement for arbitrary documents needs more testing.
How You Can Help
I'm looking for collaborators to strengthen (or disprove) these findings:
Prior art I missed: Is there an open-source pre-ingestion system that enforces epistemic markers? I searched extensively but I may have missed something.
Experimental design suggestions: How would you isolate:
Annotation semantics vs. context-length effects?
Factual annotations vs. uncertainty-signaling markers?
The critical system-prompt baseline comparison?
Real-document testing: If you have document corpora and want to test whether the effect holds, I'd welcome collaboration.
Alternative explanations: What else might explain the HPD → CGD improvement?
Replication: The HPD/CGD test documents are available on request (kept on private GitHub repository to prevent benchmark contamination). Contact me for access.
Conclusion
Pre-ingestion gates are standard practice for accuracy and compliance in enterprise systems (Adlib, pharmaceutical QMS). But epistemic quality (verifying that projections are marked as projections, that hypotheses are labeled as hypotheses) has detection tools (UnScientify, HedgeHog) but, as far as I know, no open-source enforcement layer.
Clarity Gate is an attempt to fill that gap.
What's ready: For scientific papers, consistency checking is implemented at arxiparse.org. General-purpose implementation is architected (Phase 2: external verification hooks, Phase 3: confidence scoring for HITL optimization).
What's promising: Factual annotations correlated with better abstention on two of three mid-tier models (Gemini 3 Flash, GPT-5 Mini) in our synthetic benchmark.
What needs validation: Whether the effect generalizes, what mechanism drives it, and whether simpler approaches work just as well.
The contribution is specific: open-source tooling for Layer 2 of the safety stack (see Appendix A below), a promising experimental direction, and an honest invitation to help figure out what actually works.
Francesco Marinoni Moretto - December 2025
References
The Pre-Generation Verification Gap
RefusalBench - Muhamed, A., et al. (2025). "RefusalBench: Measuring Selective Refusal in Retrieval-Augmented Generation." Carnegie Mellon University / Amazon AGI. arXiv:2510.10390
Sufficient Context - Joren, H., Zhang, J., Ferng, C-S., Juan, D-C., Taly, A., & Rashtchian, C. (2025). "Sufficient Context: A New Lens on Retrieval Augmented Generation Systems." ICLR 2025. Google Research / UC San Diego / Duke. arXiv:2411.06037
LLM Verifier Landscape - Emergent Mind. (2025). Survey of 19 verification approaches. emergentmind.com/topics/llm-verifier
Appendix A: The Safety Stack
Layer 3 asks: "Is this model aligned?"
Layer 2 asks: "Is this context clear enough for the model to process correctly?"
Both are necessary. Layer 3 has formalized methodologies. Layer 2 has enterprise solutions but limited open-source options for epistemic quality.
Appendix B: Prior Art
Enterprise Pre-Ingestion Gates (Proprietary)
Epistemic Detection (Open-Source)
Post-Retrieval & Runtime
The Opportunity: Existing detection tools (UnScientify, HedgeHog, BioScope) excel at identifying uncertainty markers. Clarity Gate proposes a complementary enforcement layer that routes ambiguous claims to human review or marks them automatically.
I believe these could work together. Community input on integration is welcome.
Appendix C: Practical Workflow
If the annotation effect holds on real documents, a two-tier workflow (top-tier model as annotator, mid-tier model for queries) may be viable.
See Practical Workflow for the hypothesized implementation, caveats, and Claude Skill instructions.
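A minimal sketch of that hypothesized two-tier workflow; the model names and the `complete(model, prompt)` callable are placeholders rather than any specific vendor API.

```python
# Tier split: the expensive model runs once per document, the cheap model runs per query.
ANNOTATOR = "top-tier-model"   # used once per document, at ingest
RESPONDER = "mid-tier-model"   # used on every query

def annotate(document: str, complete) -> str:
    """One-time pass: ask the stronger model to add factual annotations at ambiguity points."""
    prompt = ("Insert factual annotations at ambiguity points, e.g. "
              "'(criteria not enumerated in this document)':\n\n" + document)
    return complete(ANNOTATOR, prompt)

def answer(question: str, annotated_chunks: list[str], complete) -> str:
    """Every query: the cheaper model answers against the already-annotated text."""
    context = "\n\n".join(annotated_chunks)
    return complete(RESPONDER, f"Context:\n{context}\n\nQuestion: {question}")
```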