TL;DR: We keep optimizing retrieval, but are the documents we feed to LLMs safe to chunk without losing crucial qualifying context?
When users bulldoze files blindly into a RAG pipeline, hallucinations can start even before retrieval runs.
Concrete example: in the biology paper I'll discuss below, the parameter β for prokaryotes appears as 0.33, 0.73, and 1.68. All three values are correct, but each belongs to a different scaling regime. Some explanations are nearby, others are fourteen pages away in the supplementary material.[1] Give this to a RAG pipeline and the reconciliation context disappears. A user asks "What's β for prokaryotes?" Your RAG answers with confidence. Too bad it's wrong.
I built Clarity Gate, an open-source pre-ingestion system that verifies documents and adds clear epistemic markers. I tested its efficacy on a synthetic benchmark (six LLMs, 39 deliberate ambiguities, methodology below). Results: mid-tier models hallucinated 19–25% of their answers to questions targeting unmarked ambiguities instead of saying "sorry, I don't know". With epistemic markers, they reached 100% correct abstention. Top-tier models, on the other hand, didn't need this kind of help.
These are early results, on a single synthetic benchmark, with confounds I've only partially addressed. But nonetheless interesting enough to share.
A Longstanding Problem
The Extraction Problem
In 2017, researchers from University College London analyzed errors in scientific paper summarization. In a manual analysis of 100 misclassified sentences, they found 37 were mislabeled training examples.[2] Among the false positives, common causes were lack of context (sentences that need information from preceding text to make sense) and long-range dependencies, such as references to figures described elsewhere.
When you extract something like "β=0.73 for prokaryotes" but you don't include the context explaining that this applies to a specific scaling regime, you've created training data that will confidently mislead any model that ingests it. If the numbers look authoritative, there's no reason for a model to doubt them.
The RAG Limitation
Then in 2024, researchers from Chalmers and Copenhagen Universities tested whether RAG can fix knowledge conflicts, using a dataset they called DYNAMICQA. They found that dynamic facts (those that change over time or carry multiple valid interpretations) were harder to update than static ones; in the paper's terminology, the tested LLMs were "stubborn". Models refused to change their answers despite being given the correct context in 6.16% of cases for static facts, 9.38% for temporal facts, and 9.36% for disputable facts.[3]
They then concluded: "other approaches to update model knowledge than retrieval-augmentation should be explored for domains with low certainty of facts."
An Unexpected Finding
A Google Research paper on "Sufficient Context" was then released in 2025 with something I didn't see coming: adding context, even insufficient context, makes models less likely to abstain. Without RAG, Gemini 1.5 Pro abstained on 100% of uncertain questions; with RAG, that value dropped to 18.6%.[4] In practice, the mere presence of context reduces abstention even when the context doesn't actually answer the question.
RefusalBench, a joint Carnegie Mellon and Amazon paper released the same year, confirmed it: even with frontier models, refusal accuracy drops below 50% on multi-document tasks.[5]
At this point, the obvious question is whether this problem is already solved in practice. There are lots of ways to detect uncertainty and lots of ways to evaluate a RAG pipeline after retrieval. What I couldn't find is an enforcement layer that upgrades documents before they enter the knowledge base.
What Already Exists and What Doesn't
As noted, we have uncertainty detection tools (like UnScientify) and post-retrieval evaluation (RAGAS, TruLens). Pre-ingestion has solutions for privacy/security (Protecto.ai) and document accuracy (Adlib), but what about tools that check and enforce epistemic quality?
I looked hard for two weeks but couldn't find them. Either they do not exist or I'm bad at searching.
Clarity Gate is my contribution to the cause of epistemic hygiene. It might be the wrong approach or, perhaps, there's a reason nobody built this.
Before getting to the architecture (explained in detail in the GitHub repo and usable via the accompanying Claude Skill), here's an example of how these kinds of failures appear in a real-world document.
How This Shows Up
I picked a biology paper more or less at random (Ritchie and Kempes on metabolic scaling in small organisms, arXiv:2403.00001) and passed it through Clarity Gate.
The system flagged the β tensions I mentioned earlier: three different values, no immediate reconciliation. The different scaling regimes and organism sizes are clearly explained, but those explanations are separated from the values they qualify. That's fine if you're a human reading the whole thing, but chunk it for RAG and the reconciliation is probably gone. The qualifiers are simply too far away.
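To see how easily that happens, here's a toy sketch (with invented text, not the paper's actual wording) of a naive fixed-size chunker separating a value from the caveat that scopes it:

```python
# Toy illustration: a naive fixed-size chunker strands a value from its qualifier.
def chunk(text: str, size: int = 120) -> list[str]:
    """Split text into fixed-size character chunks, ignoring sentence boundaries."""
    return [text[i:i + size] for i in range(0, len(text), size)]

doc = (
    "For larger bacteria the scaling exponent is beta = 0.73. "
    + "Intervening discussion. " * 40   # stands in for the pages between value and caveat
    + "Note: this exponent holds only in the larger-cell regime; "
      "small prokaryotes follow a different regime entirely."
)

chunks = chunk(doc)
hits = [c for c in chunks if "0.73" in c]
print(hits[0])                           # the bare number, no regime qualifier
print(any("regime" in c for c in hits))  # False: the caveat lives in other chunks
```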
The 2017 context-loss problem is happening at scale across entire knowledge bases. Per what DYNAMICQA found, providing correct context via retrieval won't reliably fix it, because temporal and disputable facts resist correction even when you give the model the right information.
Once you start noticing this pattern (numbers and claims separated from their qualifiers), you see it in every document that passes through your hands. In finance and legal documents, qualifiers like "projected" or "we believe" get separated from the main claims, and assumptions harden into facts. Research caveats like "assuming current trends" disappear entirely, and in marketing docs a note like "in pilot with 12 users" gets ignored, so an anecdote is read by LLMs as a validated figure.
The System I Built
Most existing solutions are query-time or post-retrieval.
I decided to intervene at pre-ingestion, before documents enter the knowledge base.
Detection tools like UnScientify ask "Is uncertainty expressed somewhere in this text?".
An enforcement tool like Clarity Gate asks a different question: "Should uncertainty be expressed here but isn't?".
The 9 Verification Points
The system runs nine verification checks, and if issues are found, it can either flag them for review or apply semantic markers inline.
Is this a hypothesis or established fact?
Uncertainty markers on projections
Whether assumptions are stated or buried
Tables with checkmarks but no source (they look measured, might be guesses)
Internal consistency (do numbers match across sections?)
Implicit causation
Future state presented as already achieved
After I kept running into claims that had correct form but wrong facts, I added two more:
Temporal coherence: "Last updated: December 2024" when it's December 2025
Externally verifiable claims (like API pricing that changes constantly)
These last two targets are what I call "confident plausible falsehoods": claims that look right but may be factually wrong. The term derives from the "plausible falsehoods" coined by Kalai et al.[6]
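Clarity Gate runs these checks through an LLM skill rather than hard-coded rules, but to make one of them concrete, here is a minimal rule-based sketch of the temporal-coherence idea; the regex, the one-year threshold, and the marker label are my own illustrative choices, not the tool's actual logic:

```python
import re
from datetime import date

MONTHS = ("January February March April May June July August "
          "September October November December").split()

def check_temporal_coherence(text: str, today: date) -> list[str]:
    """Flag 'Last updated: <Month> <Year>' stamps that are more than a year old.
    Illustrative sketch only; the real check is done by an LLM, not a regex."""
    findings = []
    for m in re.finditer(r"Last updated:\s*([A-Z][a-z]+)\s+(\d{4})", text):
        month, year = m.group(1), int(m.group(2))
        if month not in MONTHS:
            continue
        stamp = date(year, MONTHS.index(month) + 1, 1)
        if (today - stamp).days > 365:
            findings.append(f"[CG-TEMPORAL] '{m.group(0)}' looks stale as of {today}")
    return findings

print(check_temporal_coherence("Last updated: December 2024",
                               today=date(2025, 12, 15)))
# ["[CG-TEMPORAL] 'Last updated: December 2024' looks stale as of 2025-12-15"]
```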
What It Produces
Clarity Gate has two output modes. Verify mode flags issues and routes them for human review. Annotate mode produces a Clarity-Gated Document (CGD) with markers added inline: "Revenue will be $50M" becomes "Revenue is projected to be $50M", projections get tagged as projections, and assumptions are made explicit.
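As a rough illustration of the annotate mode (the marker syntax below is my own shorthand; the actual CGD format is defined in the GitHub repo):

```text
Raw:        Revenue will be $50M by Q4.
Annotated:  Revenue is projected to be $50M by Q4.
            [CG: PROJECTION | basis not stated in source | not externally verified]
```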
How the Two-Tier Verification Works
A human in the loop isn't about checking everything: that would defeat the purpose. The trick is figuring out which claims need to be routed to human eyes.
In a 50-claim document, most claims might pass the automated checks, with perhaps two flagged for review. That's a workload you can actually handle.
Claims backed by internal sources of truth (figures, tables, earlier sections) are checked automatically for consistency. Unclear claims, and anything needing external verification, get routed to a human.
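A hedged sketch of that routing logic (the field names are illustrative, and in practice the decision is made by the skill rather than by code like this):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Claim:
    text: str
    internal_source: Optional[str]   # figure/table/section that backs it, if any
    externally_verifiable: bool      # e.g. vendor pricing, dates, third-party stats

def route(claim: Claim) -> str:
    """Tier 1: consistency-check against in-document sources automatically.
    Tier 2: everything else goes to a human."""
    if claim.internal_source is not None:
        return f"auto: check against {claim.internal_source}"
    if claim.externally_verifiable:
        return "human: needs an external source of truth"
    return "human: unclear claim, needs judgment"

for c in [
    Claim("Revenue grew 40% YoY (see Table 2)", "Table 2", False),
    Claim("Current API pricing is $2 per 1M tokens", None, True),
    Claim("Several researchers reported the effect", None, False),
]:
    print(route(c), "<-", c.text)
```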
But there's a fundamental limitation: Clarity Gate verifies form (including consistency against internal sources of truth), but it cannot guarantee that a claim is actually true.
The Experiment I Ran
But does annotating documents actually help? My hypothesis was that making uncertainty explicit upstream, before documents are even ingested, could reduce hallucinations. To test it, I needed a benchmark where the truth was completely controlled and any gap-filling would be unambiguously wrong. I ran the experiment in December 2025, using the SOTA models available at the time.
The Setup
The main idea was to compile (with help from Opus 4.5) a fictional scientific document, so the models would have a source of truth they had never seen before. No citations to find, no shared prior knowledge to rely on. I buried 40 traps in it: ambiguous claims, contradictions, and undefined terms, all things that sound complete but are in fact unclear.
My first attempt crashed spectacularly: I used a fake AI-related topic, and it turned out that LLMs are incredibly knowledgeable about AI itself and spot this kind of forgery immediately. I had also been arrogant enough to place one pitfall right in the title, which raised immediate red flags, especially from the top-tier LLMs.
So I skipped AI and went for marine bioluminescence, a topic niche enough that any gap-filling by the model would be unambiguously a hallucination. I also stripped the title pitfall, leaving me with only 39.
I prepared two versions of the same document: one raw and one annotated, in which what was left unsaid is made explicit. "Several researchers" became "Several researchers (exact count not specified)."
I asked each of the six tested models the same 8 questions, in separate chats and with memory disabled. I wanted a within-model comparison: how does the same model behave with and without annotations?
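For anyone who wants to reproduce the setup, the loop was roughly the following. `ask_model` is a placeholder for whichever chat API you use, and the abstention check is a crude keyword heuristic; I graded the actual runs by hand, and the scoring here assumes every question targets a trap, so the only correct behavior is abstention:

```python
ABSTAIN_MARKERS = ("not specified", "not enumerated", "cannot determine",
                   "does not say", "don't know")

def is_abstention(answer: str) -> bool:
    """Crude stand-in for manual grading: did the model decline to fill the gap?"""
    return any(marker in answer.lower() for marker in ABSTAIN_MARKERS)

def run_benchmark(models, questions, raw_doc, annotated_doc, ask_model):
    """Score each model on the raw and annotated versions of the same document."""
    scores = {}
    for model in models:
        for label, doc in (("raw", raw_doc), ("annotated", annotated_doc)):
            correct = sum(
                is_abstention(ask_model(
                    model=model,
                    system="Answer using only the provided document.",
                    user=f"{doc}\n\nQuestion: {q}",
                ))
                for q in questions
            )
            scores[(model, label)] = correct / len(questions)
    return scores
```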
The Key Trap (TRAP-016)
"Phase 4 verification involves meeting one of three success criteria. Verification may include: attempted replication, statistical analysis, cross-reference, expert panel review."
If you read it again, you'll notice that "three criteria" are mentioned but never listed. That's a four-item list of what verification may include, not the criteria themselves. It sounds like you have the answer. You don't.
What I Found
Top-tier models (Claude Sonnet 4.5, Claude Opus 4.5, Gemini 3 Pro) scored 100% on both versions; they clearly didn't need the annotations to express uncertainty. Claude Haiku 4.5, though technically mid-tier, also abstained correctly when needed.
The other mid-tier models told a different story. Gemini 3 Flash scored 75% on the raw version and 100% on the annotated version (+25 points), while GPT-5 Mini went from 81% to 100% (+19 points).
On TRAP-016 specifically, Gemini Flash confabulated on the raw version ("The three criteria are: 1. Replication 2. Statistical analysis 3. Cross-reference") but correctly abstained on the annotated version ("Criteria not enumerated in this document"). GPT-5 Mini showed the same pattern.
So what can I say from this? In this benchmark, the annotations correlated with better abstention for both the Google and OpenAI mid-tier models, while top-tier models didn't need the help.
Of course I don't have solid proof that the annotations caused the improvement, or that any of this would hold up with real documents instead of my simplified test. Validating it would need a much larger benchmark with many documents across different topics. It's also possible that simpler interventions would work as well (spoiler: they do).
What Clarity Gate can't do:
It can't verify novel claims. If someone writes "quantum computers now factor 2048-bit integers in 3 seconds" there's nothing that catches that as false; it only catches whether the claim is marked as verified or hypothetical. Truth verification needs humans or external sources.
It also can't fix alignment problems. If a model is determined to confabulate, better inputs won't help. This is an intervention that belongs to other tools/methodologies.
For scientific papers with internal numerical claims, consistency checking works and I have evidence for that. For other document types, I'm extrapolating from limited data.
Multi-model observation: When I ran Clarity Gate via ArXiParse with Gemini 3 Pro, Claude Opus 4.5, and Claude Sonnet 4.5 on the same biology paper, they all found numerical tensions but different ones. Gemini caught something Claude missed, and Claude caught something Gemini missed. There's probably value in running multiple models, though I haven't systematically tested this yet.
The Confound I'm Worried About
The obvious objection is that maybe the annotations don't matter at all.
Shockingly, this turns out to be at least partly true: I subsequently reran the test on the models that had hallucinated (Gemini 3 Flash and GPT-5 Mini), simply adding "please, abstain when uncertain" to the prompt, and both models correctly skipped TRAP-016, the same trap they had fallen into without that instruction.
So, at least for this benchmark, a simpler solution seems to exist.
But "just write better prompts" has its own obvious limits:
Users are unpredictable; they routinely forget to add instructions like this
You don't always control the prompt, for example with external tools or third-party APIs
Changing master prompts has broad consequences; added instructions can shift model behavior in ways you may not want
For large documents, or large numbers of them, per-query instructions become infeasible or error-prone
Epistemic annotations, on the other hand, persist. They are embedded in the document and work even when you don't control the downstream prompt. They endure across sessions and across users, catching issues at the source rather than at query time.
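A tiny sketch of why that persistence matters: the annotation rides inside the retrieved chunk, so it reaches the model even when the downstream prompt says nothing about uncertainty (the chunk and prompt below are toy stand-ins):

```python
# The hedge is part of the document text, so any chunker keeps it next to the
# claim it qualifies and it lands in the context window on every query,
# regardless of who writes the prompt.
annotated_chunk = ("Several researchers (exact count not specified) reported "
                   "the effect in deep-sea trials.")

def build_prompt(retrieved_chunk: str, question: str) -> str:
    # A downstream prompt you don't control: no 'abstain when uncertain' clause.
    return f"Context: {retrieved_chunk}\nQuestion: {question}"

print(build_prompt(annotated_chunk, "How many researchers reported the effect?"))
```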
Whether that's worth the pre-processing cost, well, that's an open question.
Where I Think This Fits
Think about AI quality assurance as a stack. Layer 0 is AI execution; layer 1 is deterministic boundaries (rate limits, guardrails). Layer 2 is input and context verification: that's where Clarity Gate sits. Layer 3 is AI behavior verification and alignment testing, and layer 4 is human strategic oversight.
Layer 3 has formalized methodologies; layer 2, by contrast, has enterprise solutions but few open-source tools for epistemic quality. That's the gap I'm trying to fill with Clarity Gate.
What's Working Right Now
arxiparse.org: Scientific paper consistency checking
Looking for Collaboration
DYNAMICQA concluded that "other approaches" are needed beyond RAG. Clarity Gate is my approach, but since I've been working alone on it, my blind spots are probably quite wide right now.
I'm treating this as an early probe, not a conclusion. The most valuable next step is tests that could falsify or confirm the whole approach.
What I'm looking for:
Prior art. If I missed existing pre-ingestion enforcement tools, I obviously want to know.
Generalization testing. I've now tested the system prompt approach and found it works on this benchmark. What I don't know is whether it generalizes, or whether annotations provide value in scenarios where you can't control the prompt.
Document corpora. So far I've only tested on synthetic documents and arXiv papers. If you have internal documentation, legal filings, or financial reports and you'd be willing to run them through, I'd share results and methodology.
Alternative explanations. The 19–25 point improvement on mid-tier models could obviously be partly down to context length. But what else?
A Final Consideration
I developed these tools after endless trial and error, mostly during my Vibe-Coding tryouts. The finding that super-clear, unambiguous documents can effectively "box" LLMs into predictable outputs (in terms of code quality and alignment to plans) was a revelation for me, and it led me to build Clarity Gate.
If this holds up, the same approach could matter beyond RAG. Anywhere you need predictable outputs from LLMs rather than "creativity", document clarity might be the cheapest lever you have.
Francesco Marinoni Moretto - January 2026
Full verification notes with exact quotes and page references: Supporting Document
Test documents used for benchmarking are available on request; I'm not posting them publicly due to training-data contamination concerns.
References
The Research Trajectory
2017 Extraction Errors - Collins, E., Augenstein, I., & Riedel, S. (2017). "A Supervised Approach to Extractive Summarisation of Scientific Papers." CoNLL 2017.
DYNAMICQA - Marjanović, S.V., Yu, P., Atanasova, P., Maistro, M., Lioma, C., & Augenstein, I. (2024). "Tracing Internal Knowledge Conflicts in Language Models." arXiv:2407.17023
Sufficient Context - Joren, H., et al. (2025). "Sufficient Context: A New Lens on Retrieval Augmented Generation Systems." ICLR 2025. arXiv:2411.06037
RefusalBench - Muhamed, A., et al. (2025). "RefusalBench: Measuring Selective Refusal in RAG." arXiv:2510.10390
Complementary Work
Why Language Models Hallucinate - Kalai, A.T., Nachum, O., Vempala, S.S., & Zhang, E. (2025). arXiv:2509.04664 (theoretical framework for "plausible falsehoods")
UnScientify - arXiv:2307.14236 (uncertainty detection)
Self-RAG - arXiv:2310.11511 (runtime reflection)
Ritchie & Kempes (2024), Pages 5 and 19. β=1.68 for small prokaryotes, β=0.73 for larger bacteria, β=0.33 derivation in supplementary material. ↩︎
Collins, Augenstein & Riedel (2017), Page 6, Section 5.2: "We manually analyse 100 misclassified sentences... 37 are mislabelled examples." ↩︎
DYNAMICQA (2024), Table 1, Page 5. "Stubborn" defined in Section 3.2 as "sticks to the old answer despite the context containing evidence for the new answer." ↩︎
Sufficient Context (2025), Section 4.2, Pages 6-7. Paper measures abstention rate: how often models refuse to answer when context is insufficient. ↩︎
RefusalBench (2025), Abstract: "refusal accuracy dropping below 50% on multi-document tasks." Best model (DeepSeek-R1) achieved 47.4%. ↩︎
Kalai, Nachum, Vempala & Zhang (2025), "Why Language Models Hallucinate," arXiv:2509.04664. They define hallucinations as "plausible falsehoods" and prove mathematically that calibrated language models will inevitably produce them. ↩︎