Over the last year or so, there’s been a lot of hand-waving about “model collapse” and “LLMs eating their own outputs”. Some of it is vibes, some of it is actual math.
I tried to pin down one very specific piece of this: what happens when you lean heavily on synthetic data, across multiple generations, and then ask the model to perform on real-world data only?
I ended up writing two things:
A short technical preprint:
“The Hall of Illusions: How Heavy Synthetic Data Training Erodes Real-World Performance” (Lei, Yu, 2025)
DOI: 10.5281/zenodo.17782033
An open letter aimed at labs and policymakers:
“Against the Hall of Illusions: An Open Letter on Heavy Synthetic Data Training”
<open letter link here>
The picture in one sentence
If you repeatedly retrain a model on mixtures of real and model-generated data, and the synthetic fraction is large enough, you eventually get a model that is very good on its own manifold and quietly worse on real data, especially in the long tail.
That’s the “hall of illusions”: a mirror corridor where the reflections get sharper and the outside world slowly drops out of frame.
What I actually did (toy but transparent)
I stayed deliberately small and reproducible. No trillion-parameter anything.
Two experiments:
2D Gaussian mixture + GMM
Generation 0: fit a GMM on samples from a fixed real 2D Gaussian mixture.
Later generations: fit on a mix of real + synthetic samples; synthetic is drawn from the previous model. I vary the synthetic fraction α and track fit quality on held-out real samples.
Tiny character-level n-gram LM
Generation 0: train on real text.
Later generations: train on a mix of real text and synthetic text generated by the previous model.
I vary α again and track perplexity on held-out real text across generations.
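The n-gram loop really is small enough to sketch in one file. Below is a minimal illustrative version (add-one-smoothed character bigrams, pure standard library), not the paper's actual code; the corpus, α, and generation count are placeholders.

```python
# Toy recursive-training loop: character bigram LM with add-one smoothing.
# Illustrative sketch only; corpus and hyperparameters are made up.
import math
import random
from collections import Counter

random.seed(0)

def train_bigram(text, vocab):
    counts = Counter(zip(text, text[1:]))
    totals = Counter(text[:-1])
    # Add-one smoothed conditional probabilities P(c2 | c1).
    return {
        (c1, c2): (counts[(c1, c2)] + 1) / (totals[c1] + len(vocab))
        for c1 in vocab for c2 in vocab
    }

def sample(model, vocab, n):
    chars = sorted(vocab)
    out = [random.choice(chars)]
    for _ in range(n - 1):
        weights = [model[(out[-1], c)] for c in chars]
        out.append(random.choices(chars, weights=weights)[0])
    return "".join(out)

def perplexity(model, text):
    logp = sum(math.log(model[(c1, c2)]) for c1, c2 in zip(text, text[1:]))
    return math.exp(-logp / (len(text) - 1))

real = "the cat sat on the mat and the dog ran to the cat " * 20
held_out = "the dog sat on the mat and the cat ran "
vocab = set(real) | set(held_out)

alpha = 0.9                            # synthetic fraction in later generations
model = train_bigram(real, vocab)      # generation 0: real data only
for g in range(5):
    syn = sample(model, vocab, len(real))
    k = int(alpha * len(real))
    mixed = syn[:k] + real[: len(real) - k]   # crude alpha-mix of the corpora
    model = train_bigram(mixed, vocab)
    print(f"gen {g}: held-out perplexity = {perplexity(model, held_out):.2f}")
```

Swapping in a longer corpus, smoothing scheme, or n-gram order changes the numbers but not the shape of the loop.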
These are intentionally boring. They fit in a couple of Python files, can be run on a laptop, and you can see the curves without squinting at a giant training trace.
The plots (both in the paper) look basically like you’d expect:
In the n-gram case, perplexity on real text just walks upward once synthetic dominates, even though the model is happily fitting the mixed data.
Why I think this is structurally worrying
None of this will surprise people who think in terms of control theory or optics: feed a system an increasing share of its own output, and it converges on a sharpened image of itself rather than of the world.
The paper formalizes this with a simple feedback loop on the training distribution:
D_{g+1} = (1-\alpha)\,D_{\text{real}} + \alpha\,D_{\text{syn}}^{g}
where D_{\text{syn}}^{g} is the output distribution of the generation-g model, and argues that "heavy synthetic training" is roughly the regime where α is large and the loop is iterated for many generations.
The experiments are there to show that even in very forgiving toy settings, this is enough to visibly hurt held-out real performance.
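For intuition about the update D_{g+1} = (1-α)·D_real + α·D_syn^g, here is an even smaller numerical sketch than the paper's GMM experiment: a single 1D Gaussian refit by maximum likelihood each generation. Note this only illustrates the mixing mechanics; a lone Gaussian anchored by real data is far more forgiving than the mixture case, where components can drop out.

```python
# Numerical sketch of D_{g+1} = (1-alpha)*D_real + alpha*D_syn^g for the
# simplest model family: one 1-D Gaussian fit by maximum likelihood.
# All constants are illustrative, not taken from the paper.
import numpy as np

rng = np.random.default_rng(0)
n, alpha, generations = 2000, 0.9, 20
real = rng.normal(0.0, 1.0, size=n)          # fixed real-world data

mu, sigma = real.mean(), real.std()          # generation 0: fit on real only
for g in range(generations):
    syn = rng.normal(mu, sigma, size=int(alpha * n))       # D_syn^g
    sub = rng.choice(real, size=n - len(syn), replace=False)
    mixed = np.concatenate([sub, syn])       # alpha-mix of real and synthetic
    mu, sigma = mixed.mean(), mixed.std()    # refit on the mixed data

print(f"true sigma = 1.0, sigma after {generations} gens = {sigma:.3f}")
```

With α = 1 (no real anchor) this is the classic variance-collapse setup; with a real-data anchor the fixed point stays near the truth, but each generation recycles and amplifies its own estimation noise.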
My worry is not "synthetic data is evil". It's that the degradation is structural: it depends mainly on α and on how many feedback generations you stack, not on any particular data-generation trick.
So even if “well-tuned, light synthetic” is fine, you still need some notion of how close you are to the hall-of-illusions regime.
The open-letter part (governance / requests)
The open letter tries to translate this into simple, checkable asks. Roughly:
Disclose approximate synthetic fractions.
Not the full recipe; just order-of-magnitude for each major stage. Something like: “pretraining: ≤10% synthetic; post-training: 60–80% synthetic.”
Run and publish multi-generation collapse tests.
Take a smaller model, build the same kind of loop as above, and measure performance on real-only held-out test sets over generations. Publish the setup and results.
Keep uncontaminated real-world eval suites.
Real-only, protected from synthetic contamination, enriched for rare/ugly/long-tail cases. Track those separately from synthetic-heavy benchmarks.
Treat heavy synthetic use as a safety parameter.
Not just “data augmentation”, but something alignment/safety people are allowed to say no to if it’s opaque or visibly collapsing on real tests.
Enable outside scrutiny.
Give enough structural info that external researchers and regulators can tell whether you’re close to the hall-of-illusions regime, even if they can’t reproduce the full training run.
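As one concrete shape for the "multi-generation collapse test" ask, a harness could look like the sketch below. `fit`, `generate`, and `evaluate` are hypothetical placeholders for whatever model family a lab is auditing; the trivial stand-ins at the bottom exist only to show the call shape.

```python
# Skeleton of a multi-generation collapse test: train on alpha-mixed data,
# score every generation on a real-only held-out set. Function names are
# hypothetical placeholders, not an established API.
def collapse_test(real_data, real_heldout, fit, generate, evaluate,
                  alpha=0.8, generations=10):
    """Return per-generation scores on a real-only held-out set."""
    model = fit(real_data)                    # generation 0: real data only
    scores = [evaluate(model, real_heldout)]
    n = len(real_data)
    for _ in range(generations):
        k = int(alpha * n)                    # alpha-fraction synthetic
        mixed = generate(model, k) + real_data[: n - k]
        model = fit(mixed)
        scores.append(evaluate(model, real_heldout))
    return scores

# Usage with trivial stand-ins, just to show the shape of the report:
scores = collapse_test(
    real_data=list(range(100)),
    real_heldout=list(range(50, 120)),
    fit=lambda data: sorted(set(data)),
    generate=lambda model, k: [model[i % len(model)] for i in range(k)],
    evaluate=lambda model, ho: sum(x in model for x in ho) / len(ho),
    alpha=0.8, generations=3,
)
print(scores)
```

Publishing the `scores` curve per α, together with the loop's setup, is essentially ask #2 in checkable form.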
None of this requires giving away proprietary data. It does require acknowledging that α and feedback depth are not neutral knobs.
Things I know are missing / limited
I’ll pre-empt a couple of obvious objections:
“These are toys, not transformers.”
True. The Gaussian + n-gram experiments are just sanity checks that the geometry works the way you’d expect. The paper explicitly calls for follow-up with small neural sequence models (tens of millions to maybe a few billion parameters) under more realistic SFT/IT setups.
“What about mitigation techniques?”
I briefly discuss things like self-critique filters, preference models, process supervision, and diversification. My current view: they help, but they don’t change the core issue when α is large and you stack generations. I don’t have the big grid of experiments that would really answer this cleanly yet.
“This sounds like dataset contamination / train-on-validation all over again.”
Yes, there’s overlap. The difference here is the focus on recursive structure and synthetic fraction as an explicit variable, not just “oops, test set leaked”.
If you think any of that is wrong or underspecified, I’d actually like to hear it.
What I’d love feedback on from LessWrong
If you’ve read this far, I’d appreciate thoughts on a few concrete questions:
What would you count as a minimal convincing neural-scale experiment?
Links again, for convenience:
Preprint: DOI 10.5281/zenodo.17782033
Open letter: <open letter link here>
Happy to clarify details of the toy setups if that’s useful. I tried to keep them simple enough that anyone here could reproduce or modify them if they felt like poking the loop from a different angle.