Over the last year or so, there’s been a lot of hand-waving about “model collapse” and “LLMs eating their own outputs”. Some of it is vibes, some of it is actual math.
I tried to pin down one very specific piece of this: what happens when you lean heavily on synthetic data, across multiple generations, and then ask the model to perform on real-world data only?
I ended up writing two things:
A short technical preprint:
“The Hall of Illusions: How Heavy Synthetic Data Training Erodes Real-World Performance” (Lei, Yu, 2025)
DOI: 10.5281/zenodo.17782033
An open letter aimed at labs and policymakers:
“Against the Hall of Illusions: An Open Letter on Heavy Synthetic Data Training”
<open letter link here>
The picture in one sentence
If you repeatedly retrain a model on mixtures of real and model-generated data, and the synthetic fraction is large enough, you eventually get a model that is very good on its own manifold and quietly worse on real data, especially in the long tail.
That’s the “hall of illusions”: a mirror corridor where the reflections get sharper and the outside world slowly drops out of frame.
What I actually did (toy but transparent)
I stayed deliberately small and reproducible. No trillion-parameter anything.
Two experiments:
2D Gaussian mixture + GMM
Generation 0: fit a GMM on samples from a fixed real 2D Gaussian mixture.
Later generations: fit on a mix of real + synthetic samples; synthetic is drawn from the previous model. I vary the synthetic fraction α and track fit quality on held-out real samples.
Tiny character-level n-gram LM
Generation 0: train on real text.
Later generations: train on a mix of real text and synthetic text generated by the previous model.
I vary α again and track perplexity on held-out real text across generations.
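The n-gram loop really is small enough to sketch in one file. Below is a minimal illustrative version (add-one-smoothed character bigrams, pure standard library), not the paper's actual code; the corpus, α, and generation count are placeholders.

```python
# Toy recursive-training loop: character bigram LM with add-one smoothing.
# Illustrative sketch only; corpus and hyperparameters are made up.
import math
import random
from collections import Counter

random.seed(0)

def train_bigram(text, vocab):
    counts = Counter(zip(text, text[1:]))
    totals = Counter(text[:-1])
    # Add-one smoothed conditional probabilities P(c2 | c1).
    return {
        (c1, c2): (counts[(c1, c2)] + 1) / (totals[c1] + len(vocab))
        for c1 in vocab for c2 in vocab
    }

def sample(model, vocab, n):
    chars = sorted(vocab)
    out = [random.choice(chars)]
    for _ in range(n - 1):
        weights = [model[(out[-1], c)] for c in chars]
        out.append(random.choices(chars, weights=weights)[0])
    return "".join(out)

def perplexity(model, text):
    logp = sum(math.log(model[(c1, c2)]) for c1, c2 in zip(text, text[1:]))
    return math.exp(-logp / (len(text) - 1))

real = "the cat sat on the mat and the dog ran to the cat " * 20
held_out = "the dog sat on the mat and the cat ran "
vocab = set(real) | set(held_out)

alpha = 0.9                            # synthetic fraction in later generations
model = train_bigram(real, vocab)      # generation 0: real data only
for g in range(5):
    syn = sample(model, vocab, len(real))
    k = int(alpha * len(real))
    mixed = syn[:k] + real[: len(real) - k]   # crude alpha-mix of the corpora
    model = train_bigram(mixed, vocab)
    print(f"gen {g}: held-out perplexity = {perplexity(model, held_out):.2f}")
```

Swapping in a longer corpus, smoothing scheme, or n-gram order changes the numbers but not the shape of the loop.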
These are intentionally boring. They fit in a couple of Python files, can be run on a laptop, and you can see the curves without squinting at a giant training trace.
The plots (both in the paper) look basically like you’d expect:
In the n-gram case, perplexity on real text just walks upward once synthetic dominates, even though the model is happily fitting the mixed data.
Why I think this is structurally worrying
None of this will surprise people who think in terms of control theory or optics: feed a system an increasing share of its own output, and it converges on a sharpened image of itself rather than of the world.
The paper formalizes this with a simple feedback loop on the training distribution:
D_{g+1} = (1-\alpha)\,D_{\text{real}} + \alpha\,D_{\text{syn}}^{g}
where D_{\text{syn}}^{g} is the output distribution of the generation-g model, and argues that "heavy synthetic training" is roughly the regime where α is large and the loop is iterated for many generations.
The experiments are there to show that even in very forgiving toy settings, this is enough to visibly hurt held-out real performance.
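For intuition about the update D_{g+1} = (1-α)·D_real + α·D_syn^g, here is an even smaller numerical sketch than the paper's GMM experiment: a single 1D Gaussian refit by maximum likelihood each generation. Note this only illustrates the mixing mechanics; a lone Gaussian anchored by real data is far more forgiving than the mixture case, where components can drop out.

```python
# Numerical sketch of D_{g+1} = (1-alpha)*D_real + alpha*D_syn^g for the
# simplest model family: one 1-D Gaussian fit by maximum likelihood.
# All constants are illustrative, not taken from the paper.
import numpy as np

rng = np.random.default_rng(0)
n, alpha, generations = 2000, 0.9, 20
real = rng.normal(0.0, 1.0, size=n)          # fixed real-world data

mu, sigma = real.mean(), real.std()          # generation 0: fit on real only
for g in range(generations):
    syn = rng.normal(mu, sigma, size=int(alpha * n))       # D_syn^g
    sub = rng.choice(real, size=n - len(syn), replace=False)
    mixed = np.concatenate([sub, syn])       # alpha-mix of real and synthetic
    mu, sigma = mixed.mean(), mixed.std()    # refit on the mixed data

print(f"true sigma = 1.0, sigma after {generations} gens = {sigma:.3f}")
```

With α = 1 (no real anchor) this is the classic variance-collapse setup; with a real-data anchor the fixed point stays near the truth, but each generation recycles and amplifies its own estimation noise.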
My worry is not "synthetic data is evil". It's that the degradation is structural: it depends mainly on α and on how many feedback generations you stack, not on any particular data-generation trick.
So even if “well-tuned, light synthetic” is fine, you still need some notion of how close you are to the hall-of-illusions regime.
The open-letter part (governance / requests)
The open letter tries to translate this into simple, checkable asks. Roughly:
Disclose approximate synthetic fractions.
Not the full recipe; just order-of-magnitude for each major stage. Something like: “pretraining: ≤10% synthetic; post-training: 60–80% synthetic.”
Run and publish multi-generation collapse tests.
Take a smaller model, build the same kind of loop as above, and measure performance on real-only held-out test sets over generations. Publish the setup and results.
Keep uncontaminated real-world eval suites.
Real-only, protected from synthetic contamination, enriched for rare/ugly/long-tail cases. Track those separately from synthetic-heavy benchmarks.
Treat heavy synthetic use as a safety parameter.
Not just “data augmentation”, but something alignment/safety people are allowed to say no to if it’s opaque or visibly collapsing on real tests.
Enable outside scrutiny.
Give enough structural info that external researchers and regulators can tell whether you’re close to the hall-of-illusions regime, even if they can’t reproduce the full training run.
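As one concrete shape for the "multi-generation collapse test" ask, a harness could look like the sketch below. `fit`, `generate`, and `evaluate` are hypothetical placeholders for whatever model family a lab is auditing; the trivial stand-ins at the bottom exist only to show the call shape.

```python
# Skeleton of a multi-generation collapse test: train on alpha-mixed data,
# score every generation on a real-only held-out set. Function names are
# hypothetical placeholders, not an established API.
def collapse_test(real_data, real_heldout, fit, generate, evaluate,
                  alpha=0.8, generations=10):
    """Return per-generation scores on a real-only held-out set."""
    model = fit(real_data)                    # generation 0: real data only
    scores = [evaluate(model, real_heldout)]
    n = len(real_data)
    for _ in range(generations):
        k = int(alpha * n)                    # alpha-fraction synthetic
        mixed = generate(model, k) + real_data[: n - k]
        model = fit(mixed)
        scores.append(evaluate(model, real_heldout))
    return scores

# Usage with trivial stand-ins, just to show the shape of the report:
scores = collapse_test(
    real_data=list(range(100)),
    real_heldout=list(range(50, 120)),
    fit=lambda data: sorted(set(data)),
    generate=lambda model, k: [model[i % len(model)] for i in range(k)],
    evaluate=lambda model, ho: sum(x in model for x in ho) / len(ho),
    alpha=0.8, generations=3,
)
print(scores)
```

Publishing the `scores` curve per α, together with the loop's setup, is essentially ask #2 in checkable form.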
None of this requires giving away proprietary data. It does require acknowledging that α and feedback depth are not neutral knobs.
Things I know are missing / limited
I’ll pre-empt a couple of obvious objections:
“These are toys, not transformers.”
True. The Gaussian + n-gram experiments are just sanity checks that the geometry works the way you’d expect. The paper explicitly calls for follow-up with small neural sequence models (tens of millions to maybe a few billion parameters) under more realistic SFT/IT setups.
“What about mitigation techniques?”
I briefly discuss things like self-critique filters, preference models, process supervision, and diversification. My current view: they help, but they don’t change the core issue when α is large and you stack generations. I don’t have the big grid of experiments that would really answer this cleanly yet.
“This sounds like dataset contamination / train-on-validation all over again.”
Yes, there’s overlap. The difference here is the focus on recursive structure and synthetic fraction as an explicit variable, not just “oops, test set leaked”.
If you think any of that is wrong or underspecified, I’d actually like to hear it.
What I’d love feedback on from LessWrong
If you’ve read this far, I’d appreciate thoughts on a few concrete questions:
What would you count as a minimal convincing neural-scale experiment?
Links again, for convenience:
Preprint: DOI 10.5281/zenodo.17782033
Open letter: <open letter link here>
Happy to clarify details of the toy setups if that’s useful. I tried to keep them simple enough that anyone here could reproduce or modify them if they felt like poking the loop from a different angle.