TL;DR: A published biology paper reports β=0.33, β=0.73, and β=1.68 for the same parameter: all correct, different scaling regimes, properly explained fourteen pages later. But when this paper gets chunked for a RAG knowledge base, the reconciliation context disappears. A user asking "What's β for prokaryotes?" gets confidently wrong information. I built Clarity Gate: an open-source system that verifies epistemic quality before documents enter RAG knowledge bases. Early experiments show promise; I'm looking for collaborators to validate it.
The Problem
I tested Clarity Gate on a randomly selected scientific paper: "Metabolic scaling in small life forms" by Ritchie & Kempes (arXiv:2403.00001).
In under a minute, it flagged numerical tensions for the scaling exponent β:
| Location | β value | Context |
| --- | --- | --- |
| Figure 2 | 0.33 | "large prokaryotes" |
| Equation 7 | 0.73 | "prokaryote scaling prediction" |
| Text (page 5) | 1.68 | "small prokaryotes" |
The paper isn't wrong. These are different scaling regimes: the authors explain this in the Metabolic scaling section (page 5) and Supplementary Figure 4 (page 19). The paper's assumptions are legitimate, but the reconciling figure sits fourteen pages after the conflicting values first appear. This is exactly the kind of long-range dependency within a document that typical RAG chunking struggles with.
The RAG problem: When this paper gets chunked for a knowledge base, the reconciliation context disappears. A user asks "What's the scaling exponent for prokaryotes?" The system retrieves contradictory chunks (β=0.33, β=0.73, β=1.68) and confidently reports one, or worse, synthesizes them incorrectly. The overall context (the paper in its entirety) in which these values make perfect sense is stripped away, and that loss is exactly the ambiguity that makes RAG output wrong.
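To make the failure concrete, here is a minimal sketch with hypothetical text and naive fixed-size chunking: the value and the sentence that reconciles it land in different chunks, so a retriever can surface one without the other.

```python
# Hypothetical paper text; the filler stands in for the fourteen pages of
# intervening material between the value and its reconciliation.
paper_text = (
    "For large prokaryotes, Figure 2 gives a scaling exponent of beta = 0.33. "
    + "Intervening discussion. " * 30
    + "Supplementary Figure 4 reconciles the exponents: they describe different scaling regimes."
)

def chunk(text: str, size: int = 200) -> list[str]:
    """Naive fixed-size character chunking, a common RAG default."""
    return [text[i:i + size] for i in range(0, len(text), size)]

chunks = chunk(paper_text)
retrieved = [c for c in chunks if "0.33" in c]      # what a keyword or embedding match might surface
print(retrieved[0][:80])                            # the value is here...
print(any("reconcile" in c for c in retrieved))     # ...the reconciliation is not: prints False
```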
Recent research quantifies this gap. RefusalBench (Muhamed et al., 2025) found frontier models achieve less than 50% refusal accuracy on multi-document tasks when context is flawed. Claude-4-Sonnet drops from 73% (single-doc) to 36.1% (multi-doc). Models default to hallucinating rather than admitting uncertainty.
Google Research's "Sufficient Context" paper (Joren et al., ICLR 2025) showed that context sufficiency can be classified with 93% accuracy. But they also found a paradox: RAG actually reduces abstention. Models become more confident, not more cautious, when given context, even insufficient context.
This pattern appears everywhere documents contain legitimate complexity:
A financial projection becomes a commitment when "projected" lands in a different chunk than the number
A legal hypothesis becomes a finding when "we believe" gets separated from the conclusion
A research caveat disappears when "assuming current trends" isn't retrieved alongside the claim
The model isn't hallucinating in the traditional sense; it's faithfully representing what it retrieved. But the ingestion process discards the context that changes the meaning.
This is the distinction that matters. Accuracy verification asks: "Does this match the source?" Epistemic verification asks: "Is this claim properly qualified?" Both matter if we want to give LLMs data that is consistent and faithful.
Epistemic verification has excellent detection systems (UnScientify, HedgeHog, BioScope), but no community standard for enforcement: what we might call a "pre-ingestion policy layer".
Clarity Gate is a proposal for that layer. I think it could work alongside existing detection tools. I'd love your input on whether this is the right approach.
Why This Matters
Who gets hurt by epistemic failures:
Finance teams making decisions on projections reported as facts
Legal teams citing "findings" that were actually hypotheses
Product teams building on "user preferences" with no methodology
Researchers trusting "results" that were actually preliminary
AI-powered agents giving confidently unqualified answers from bloated knowledge bases
Where this breaks:
Internal knowledge bases with mixed document quality
RAG systems ingesting analyst reports, strategic plans, research drafts
Any context where "confident AI output" gets treated as ground truth
Automated pipelines where no human reviews the source documents
The failure mode hides in plain sight: a projection without a qualifier, a claim without evidence, an assumption that never got documented, an outdated table.
What I Built: Clarity Gate
Existing solutions, like Google's Sufficient Context autorater, operate at query-time, classifying each retrieval. Clarity Gate moves verification upstream. Before documents enter your RAG knowledge base, it checks for epistemic quality, not just accuracy. Annotate once, benefit on every query. Zero runtime overhead.
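As a rough sketch of where such a gate would sit (all names and the toy heuristic below are illustrative stand-ins, not Clarity Gate's actual API), verification runs once at ingest and the query path stays untouched:

```python
from dataclasses import dataclass, field

@dataclass
class GateReport:
    annotated_document: str
    flagged_claims: list[str] = field(default_factory=list)

def clarity_gate(document: str) -> GateReport:
    # Placeholder for the real checks (see the 7 verification points below);
    # this toy heuristic only flags unhedged future claims.
    flagged = [s for s in document.split(". ") if "will be" in s and "projected" not in s]
    return GateReport(annotated_document=document, flagged_claims=flagged)

def ingest(document: str, index: list[str], review_queue: list[str]) -> None:
    report = clarity_gate(document)              # epistemic checks run once, at ingest time
    review_queue.extend(report.flagged_claims)   # humans see only the flagged claims
    index.append(report.annotated_document)

def answer(query: str, index: list[str]) -> str:
    # Query path is unchanged: no gate logic here, hence zero runtime overhead.
    context = [doc for doc in index if query.lower() in doc.lower()]
    return f"(answer grounded in {len(context)} retrieved documents)"
```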
The 7 Verification Points:

| # | Check | Fails | Passes |
| --- | --- | --- | --- |
| 1 | Hypothesis vs. Fact | "Our approach outperforms X" | "Our approach outperforms X [Table 3]" |
| 2 | Uncertainty Markers | "Revenue will be $50M" | "Revenue is projected to be $50M" |
| 3 | Assumption Visibility | "System scales linearly" | "System scales linearly [<1000 users]" |
| 4 | Authoritative-Looking Data | Table with ✅/❌ symbols | Marked as "[ILLUSTRATIVE]" if not measured |
| 5 | Internal Consistency | Figure shows 33%, text says 73% | Numbers match |
| 6 | Implicit Causation | "X resulted in Y" | "X correlated with Y [methodology]" |
| 7 | Future as Present | "Users love the feature" | "Users are expected to..." |
Once claims are extracted and checked against these 7 points, they flow through a decision tree that determines whether automated verification is sufficient or human review is needed:
[Figure: Two-tier verification decision tree]
The HITL value proposition: the value isn't having humans review documents (every team already does that); it's intelligent routing, where the system detects which specific claims need human review.
Example: A 50-claim document might have most claims pass automated checks. The system routes only the flagged claims for human review, focusing human attention where it's actually needed.
Note: Detection (finding discrepancies) is Phase 1 while routing (automatically directing claims to the right reviewer) is Phase 2 and not yet implemented. Currently, Phase 1 flags issues; humans still decide what to do with them.
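For illustration only, here is a minimal sketch of what that not-yet-implemented Phase 2 routing could look like; the names and structures are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Claim:
    text: str
    failures: list[str]   # verification points the claim failed (empty = passed automated checks)

def route(claims: list[Claim]) -> tuple[list[Claim], list[Claim]]:
    """Split claims into auto-approved and needs-human-review."""
    auto_approved = [c for c in claims if not c.failures]
    needs_review = [c for c in claims if c.failures]
    return auto_approved, needs_review

claims = [
    Claim("Latency improved by 12% [Table 2]", []),
    Claim("Revenue will be $50M", ["uncertainty-markers"]),
    Claim("X resulted in Y", ["implicit-causation"]),
]
approved, review = route(claims)
print(f"{len(approved)} auto-approved, {len(review)} routed to human review")
```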
The human's job is specific:
Provide the Source of Truth that was missed, OR
Add appropriate markers ([PROJECTION], [HYPOTHESIS]), OR
Reject the claim entirely
This creates an audit trail: Document X, Claim Y, verified against Source Z on Date W by Person P.
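A minimal sketch of what such an audit-trail record might look like; the field names and values are invented examples, not a defined schema.

```python
from dataclasses import dataclass
from datetime import date

@dataclass(frozen=True)
class AuditEntry:
    document: str         # Document X
    claim: str            # Claim Y
    source_of_truth: str  # Source Z the claim was verified against
    verified_on: date     # Date W
    reviewer: str         # Person P
    action: str           # "verified" | "marked [PROJECTION]" | "rejected"

entry = AuditEntry(
    document="FY26-plan.pdf",
    claim="Revenue is projected to be $50M",
    source_of_truth="finance-model-v3.xlsx",
    verified_on=date(2025, 12, 1),
    reviewer="j.doe",
    action="marked [PROJECTION]",
)
print(entry)
```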
What's ready now: For scientific papers, I've implemented checks for claim validation, numerical consistency, and gap flagging within arxiparse.org, a tool that transforms arXiv papers into LLM-ready semantic skeletons. The biology paper discrepancy is one example. General-purpose implementation for arbitrary documents requires domain-specific tuning.
Multi-model validation: ArxiParse, Gemini 3 Pro, and both Claude Opus 4.5 and Sonnet 4.5 all independently found numerical tensions in the biology paper shown above. Different models caught different discrepancies, demonstrating that complex documents have multiple cross-referencing challenges. (Full reproduction instructions)
What I Tested
I designed an experiment to test whether explicit annotations help models handle ambiguous content.
Setup:
Created a fictional scientific document (guaranteed not in training data)
Embedded 39 deliberate traps (one of the initial 40 was dropped because it was so obvious it triggered red-flag alarms): ambiguous claims, contradictions, undefined terms
Produced two versions:
HPD (Hallucination-Prone Document): No annotations
CGD (Clarity-Gated Document): Factual annotations at ambiguity points
Tested 6 models with identical 8-question battery
Critical: Each model tested on BOTH documents (within-model comparison)
The key trap (TRAP-016):
The document stated:
"Phase 4 verification involves meeting one of three success criteria. Verification may include: attempted replication, statistical analysis, cross-reference, expert panel review."
Note: "three criteria" mentioned but never enumerated. The list is what verification "may include" (4 items), not the criteria themselves.
What Worked (And What I Don't Know Yet)
Percentage of correct responses on 8 trap questions:
| Model | Vendor | Tier | HPD | CGD | Improvement |
| --- | --- | --- | --- | --- | --- |
| Claude Sonnet 4.5 | Anthropic | Top | 100% | 100% | No change needed |
| Claude Opus 4.5 | Anthropic | Top | 100% | 100% | No change needed |
| Gemini 3 Pro | Google | Top | 100% | 100% | No change needed |
| Claude Haiku 4.5 | Anthropic | Mid | 100% | 100% | No change needed |
| Gemini 3 Flash | Google | Mid | 75% | 100% | +25% |
| GPT-5 Mini | OpenAI | Mid | 81% | 100% | +19% |
On TRAP-016:
| Model | HPD Response | CGD Response |
| --- | --- | --- |
| Gemini 3 Flash | ❌ "The three criteria are: 1. Replication 2. Statistical analysis 3. Cross-reference" | ✅ "Criteria not enumerated in this document" |
| GPT-5 Mini | ❌ "The three success criteria are: 1. Attempted replication 2. Statistical analysis 3. Cross-reference" | ✅ "Does not enumerate what those three criteria are" |
On HPD, two mid-tier models (Gemini 3 Flash, GPT-5 Mini) fabricated an answer from the "may include" list. On CGD, the annotation *(criteria not enumerated in this document)* correlated with correct abstention.
What I observed:
All 3 top-tier models (Claude Sonnet 4.5, Claude Opus 4.5, Gemini 3 Pro) abstained correctly without annotations
Mid-tier results varied: Anthropic's Haiku 4.5 abstained correctly without annotations, while Google's Gemini 3 Flash and OpenAI's GPT-5 Mini fabricated on HPD but abstained correctly on CGD
Annotation improvement pattern replicated across 2 mid-tier models from different vendors (Google, OpenAI)
Complete within-model HPD/CGD comparison for all 6 models
What I can claim:
Factual annotations correlated with improved abstention on this synthetic benchmark
What I cannot claim (yet):
That annotations cause the improvement
That this generalizes to real documents
That this outperforms simpler interventions
What I Still Need to Test
Confounds I haven't isolated:
| Confound | The Problem | What Would Test It |
| --- | --- | --- |
| Context length | CGD is longer than HPD. Models may abstain more on longer contexts. | Length-matched HPD (pad with non-semantic text) |
| System prompt | Maybe "abstain if unsure" works just as well | Compare annotation vs. instruction |
| Marker type | Only factual annotations like [X not enumerated] were tested. Generic markers like [HYPOTHESIS] may work differently; research suggests models bias against uncertainty language | Test weakening markers vs. factual markers |
| Session effects | HPD and CGD tested in separate sessions | Within-session A/B testing |
Critical gap: System-prompt baselines (e.g., "Abstain if claims lack clear grounding") were not compared. Simpler interventions may achieve equal or better results. This comparison is essential before claiming annotations are the solution.
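As a sketch of the length-matched control from the table above (naive whitespace tokenization, illustrative only), the un-annotated document is padded with non-semantic filler until it is at least as long as the annotated version, so any abstention difference cannot be attributed to context length alone.

```python
def length_matched(hpd: str, cgd: str, filler: str = "lorem ipsum") -> str:
    """Pad the HPD with non-semantic text until its word count matches the CGD's."""
    deficit = len(cgd.split()) - len(hpd.split())
    if deficit <= 0:
        return hpd
    padding = " ".join([filler] * ((deficit + 1) // 2))   # each filler repeat contributes two words
    return hpd + "\n\n[NON-SEMANTIC PADDING]\n" + padding

hpd = "Phase 4 verification involves meeting one of three success criteria."
cgd = hpd + " (criteria not enumerated in this document)"
print(len(length_matched(hpd, cgd).split()), len(cgd.split()))   # padded HPD is at least as long
```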
Limitations of my evidence:
Single synthetic document with planted traps: real documents are messier
6 models tested across 3 vendors (Anthropic, Google, OpenAI) with complete HPD/CGD comparison: broader validation welcome
One domain per experiment: cross-domain effects unknown
December 2025 model versions: behavior may change
What Clarity Gate cannot do:
Verify novel claims autonomously (requires human review)
Scale infinitely without bottlenecks
Fix model alignment: it improves input quality, not model behavior
Replace domain expertise
"Isn't this just input validation?" Yes, at its core. The contribution isn't new techniques: it's systematizing epistemic quality checks for RAG pipelines, with open-source tooling and a proposed community standard.
The honest assessment: This is a promising signal, not a proven solution. For scientific papers, consistency checking is implemented and working. The broader claim about epistemic enforcement for arbitrary documents needs more testing.
How You Can Help
I'm looking for collaborators to strengthen (or disprove) these findings:
Prior art I missed: Is there an open-source pre-ingestion system that enforces epistemic markers? I searched extensively but I may have missed something.
Experimental design suggestions: How would you isolate:
Annotation semantics vs. context-length effects?
Factual annotations vs. uncertainty-signaling markers?
The critical system-prompt baseline comparison?
Real-document testing: If you have document corpora and want to test whether the effect holds, I'd welcome collaboration.
Alternative explanations: What else might explain the HPD → CGD improvement?
Replication: The HPD/CGD test documents are available on request (kept on private GitHub repository to prevent benchmark contamination). Contact me for access.
Conclusion
Pre-ingestion gates are standard practice for accuracy and compliance in enterprise systems (Adlib, pharmaceutical QMS). But epistemic quality (verifying that projections are marked as projections, that hypotheses are labeled as hypotheses) has detection tools (UnScientify, HedgeHog) but, as far as I know, no open-source enforcement layer.
Clarity Gate is an attempt to fill that gap.
What's ready: For scientific papers, consistency checking is implemented at arxiparse.org. General-purpose implementation is architected (Phase 2: external verification hooks, Phase 3: confidence scoring for HITL optimization).
What's promising: Factual annotations correlated with better abstention on two of three mid-tier models (Gemini 3 Flash, GPT-5 Mini) in our synthetic benchmark.
What needs validation: Whether the effect generalizes, what mechanism drives it, and whether simpler approaches work just as well.
The contribution is specific: open-source tooling for Layer 2 of the safety stack (see Appendix A below), a promising experimental direction, and an honest invitation to help figure out what actually works.
Francesco Marinoni Moretto - December 2025
References
The Pre-Generation Verification Gap
RefusalBench - Muhamed, A., et al. (2025). "RefusalBench: Measuring Selective Refusal in Retrieval-Augmented Generation." Carnegie Mellon University / Amazon AGI. arXiv:2510.10390
Sufficient Context - Joren, H., Zhang, J., Ferng, C-S., Juan, D-C., Taly, A., & Rashtchian, C. (2025). "Sufficient Context: A New Lens on Retrieval Augmented Generation Systems." ICLR 2025. Google Research / UC San Diego / Duke. arXiv:2411.06037
LLM Verifier Landscape - Emergent Mind. (2025). Survey of 19 verification approaches. emergentmind.com/topics/llm-verifier
Appendix A: The Safety Stack
Layer 3 asks: "Is this model aligned?"
Layer 2 asks: "Is this context clear enough for the model to process correctly?"
Both are necessary. Layer 3 has formalized methodologies. Layer 2 has enterprise solutions but limited open-source options for epistemic quality.
Appendix B: Prior Art
Enterprise Pre-Ingestion Gates (Proprietary)
Epistemic Detection (Open-Source)
Post-Retrieval & Runtime
The Opportunity: Existing detection tools (UnScientify, HedgeHog, BioScope) excel at identifying uncertainty markers. Clarity Gate proposes a complementary enforcement layer that routes ambiguous claims to human review or marks them automatically.
I believe these could work together. Community input on integration is welcome.
Appendix C: Practical Workflow
If the annotation effect holds on real documents, a two-tier workflow (top-tier model as annotator, mid-tier model for queries) may be viable.
See Practical Workflow for the hypothesized implementation, caveats, and Claude Skill instructions.
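A minimal sketch of that hypothesized two-tier workflow; the model names and the `complete(model, prompt)` callable are placeholders rather than any specific vendor API.

```python
# Tier split: the expensive model runs once per document, the cheap model runs per query.
ANNOTATOR = "top-tier-model"   # used once per document, at ingest
RESPONDER = "mid-tier-model"   # used on every query

def annotate(document: str, complete) -> str:
    """One-time pass: ask the stronger model to add factual annotations at ambiguity points."""
    prompt = ("Insert factual annotations at ambiguity points, e.g. "
              "'(criteria not enumerated in this document)':\n\n" + document)
    return complete(ANNOTATOR, prompt)

def answer(question: str, annotated_chunks: list[str], complete) -> str:
    """Every query: the cheaper model answers against the already-annotated text."""
    context = "\n\n".join(annotated_chunks)
    return complete(RESPONDER, f"Context:\n{context}\n\nQuestion: {question}")
```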