Why LLMs can’t self-correct without grounding.
Epistemic status. I ran a fairly simple but broad experiment, and I’m confident in the core effect. This interpretation is my best read, and I’m open to being wrong.
When you ask a language model to “improve your previous answer” over and over again without giving it any new information, it doesn’t learn; it rewrites. In my runs, the text keeps getting smoother, while the actual content stops changing. After a few rounds, it settles into what looks like careful reflection, but isn’t moving anywhere. I’m calling that pattern the Mirror Loop.
I tested this across three models (GPT-4o-mini, Claude 3 Haiku, Gemini 2.0 Flash), four task families (arithmetic, code, explanation, reflection), and ten iterations per sequence, for 144 sequences total. There are two conditions. In the ungrounded condition, the model just sees its prior output plus “review and improve.” In the grounded condition, I inject a single external check at iteration three (a calculation, a factual lookup, or a code execution), then go back to ungrounded steps.
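To make the protocol concrete, here is a minimal sketch of the iteration loop. The names `call_model` and `run_verification` are placeholders standing in for the harness, not the exact code in the repo.

```python
# Minimal sketch of the iteration protocol. `call_model` and `run_verification`
# are placeholder callables, not the repo's actual harness.

def run_sequence(task_prompt, call_model, run_verification=None, n_iters=10, ground_at=3):
    """Generate an initial answer, then repeatedly ask the model to improve it.

    call_model(prompt) -> str wraps a single LLM API call.
    run_verification(output) -> str returns the result of one external check
    (calculation, factual lookup, or code execution) as text; pass None for
    the ungrounded condition.
    """
    outputs = [call_model(task_prompt)]
    for i in range(1, n_iters):
        prompt = (
            f"Here is your previous answer:\n\n{outputs[-1]}\n\n"
            "Review and improve it."
        )
        # Grounded condition: inject a single external check at one iteration,
        # then return to plain "review and improve" steps.
        if run_verification is not None and i == ground_at:
            prompt += f"\n\nExternal check result:\n{run_verification(outputs[-1])}"
        outputs.append(call_model(prompt))
    return outputs
```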
The main thing I measure at each iteration is informational change between the new output and the previous one. The primary proxy is normalized edit distance (ΔI). I also track n-gram novelty, embedding drift, and character-level entropy, and I check task correctness whenever there is a right answer to check against.
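Concretely, I think of ΔI as Levenshtein distance between consecutive outputs, normalized by the longer of the two lengths. The sketch below is how I’d write that down; treat the exact normalization as an assumption and see the repo for the real implementation.

```python
def levenshtein(a: str, b: str) -> int:
    """Standard dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]


def delta_i(prev_out: str, new_out: str) -> float:
    """Normalized edit distance in [0, 1]; 0 means the text did not change."""
    longest = max(len(prev_out), len(new_out)) or 1
    return levenshtein(prev_out, new_out) / longest
```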
What I found
- In ungrounded runs, mean ΔI drops about 55% from early iterations (0.193) to late iterations (0.087). Per-model: Claude shows the strongest collapse (~84%), GPT-4o-mini sits around ~59%, and Gemini is the least affected but still drops ~37%.
- Correctness does not climb in ungrounded runs. On verifiable tasks, wrong answers often persist; the loop just makes them tidier.
- In grounded runs, a single verification at iteration 3 produces a +28% rebound in informational change and the sequence keeps a non-zero level of change after that.
You can tell a story for why this happens that doesn’t require anything mystical. With no new evidence, each “improve it” step is almost the same function applied to almost the same text. The context is closed; there’s nowhere to move except sideways. The system converges on a fixed point that’s fluent and internally consistent, but it isn’t necessarily more correct than the first pass. It feels like reflection. It behaves like reformulation.
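One informal way to put it: treat each “review and improve” step as applying roughly the same map f to the previous text. If the sequence settles toward a fixed point of that map, the step-to-step change (which is what ΔI tracks) has to go to zero, whether or not the fixed point is correct. This is an intuition pump, not a theorem about transformers:

$$
x_{t+1} = f(x_t), \qquad
\Delta I_t = \frac{d_{\mathrm{edit}}(x_{t+1}, x_t)}{\max(|x_{t+1}|, |x_t|)} \;\to\; 0
\quad \text{as } x_t \to x^{*}, \; f(x^{*}) = x^{*}.
$$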
Objections
- “It’s just canonical phrasing.” If that were the whole story, we’d expect correctness to rise as the model converges on the right form. It doesn’t. Errors stabilize.
- “You’re measuring verbosity.” Output length stays roughly flat while ΔI falls; the effect isn’t a trivial side-effect of shorter text.
- “The prompt made it do nothing.” Manual spot-checks show the models actively rephrasing and reorganizing while drift declines.
- “Temperature noise.” All runs used temperature 0.7, and the pattern is consistent across models and tasks. I’d like to repeat at 0.0 and 1.0, but the current result doesn’t look like a sampling artifact.
What actually breaks the loop isn’t more reflection; it’s contact with something outside the text. One tiny grounding step at iteration 3 (do a calculation, check a fact, run a test) was enough to restore information flow, and the effect persisted for the rest of the sequence, even after grounding stopped. The intervention doesn’t need to be fancy. It just has to introduce a constraint the model can’t smooth away.
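For a sense of how small the check can be, here is the kind of thing I mean for the arithmetic tasks: recompute the expression independently and hand the result back as plain text. This is an illustrative sketch, not the repo’s verifier; the code-execution and lookup checks work the same way in spirit.

```python
import ast
import operator

_OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
        ast.Mult: operator.mul, ast.Div: operator.truediv}

def safe_eval(expr: str) -> float:
    """Evaluate a basic arithmetic expression without calling eval()."""
    def walk(node):
        if isinstance(node, ast.Expression):
            return walk(node.body)
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
            return -walk(node.operand)
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        raise ValueError("unsupported expression")
    return walk(ast.parse(expr, mode="eval"))

def arithmetic_check(expression: str) -> str:
    """One grounding step: an independent calculation fed back into the prompt."""
    return f"Independent calculation: {expression} = {safe_eval(expression)}"
```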
This matters because a lot of reliability work assumes self-critique helps by default. If the model’s “reflection” loop is ungrounded, the most reliable effect may be better rhetoric rather than better answers. That has obvious implications for constitutional-style self-checks, debate without tools, and recursive reward modeling. None of those ideas are dead on arrival; this result just says verification is not optional. The loop needs some form of otherness: retrieval, execution, human feedback, or formal checks.
I’m not claiming
- This doesn’t prove models can’t self-correct. They can, when they’re allowed to touch the world.
- This doesn’t say all “reflection” is useless. It says ungrounded reflection predictably stalls.
- This doesn’t cover every task or every model variant. I didn’t test GPT-4, Claude Opus, or Gemini Pro here, and I’d like to.
If you think the whole effect reduces to canonicalization pressure or some quirk of my ΔI metric, tell me what measurement would falsify that. I already track n-gram novelty, embedding drift, and entropy. If there’s a better operationalization of “informational change” you trust more, I’m happy to run it.
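For reference, here is roughly what I mean by n-gram novelty: the fraction of word n-grams in the new output that did not appear in the previous one (the n = 3 default is my choice for this sketch, not necessarily what the repo uses). If you have an operationalization you trust more, I’ll run that instead.

```python
def ngram_novelty(prev_out: str, new_out: str, n: int = 3) -> float:
    """Fraction of word n-grams in the new output absent from the previous output."""
    def ngrams(text: str) -> set:
        words = text.split()
        return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}
    new, old = ngrams(new_out), ngrams(prev_out)
    return len(new - old) / len(new) if new else 0.0
```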
There’s a companion study coming next, Recursive Confabulation. Different angle, similar theme. In multi-turn dialogue, models sometimes reuse their own fabrications as evidence, and some “reasoning” prompts make it worse. Grounding helps there too, but unevenly across architectures.
Full Study
Paper: https://arxiv.org/abs/2510.21861
Code/data: https://github.com/Course-Correct-Labs/mirror-loop
Author: Bentley DeVilling (Course Correct Labs). Some editing and code assistance used LLMs; study design, analysis, and interpretations are mine.