I have always wondered how one can really tell the difference between a hallucination or a sycophantic answer and a genuinely new, coherent output that breaks with the status quo.
After all, any truly new idea looks like a hallucination at first.
And who ultimately gets to decide which is which is a key epistemological question.
While trying to answer it, I noticed a pattern in my interactions with frontier LLMs that I could not find described in the existing literature:
The pattern: when sustained logical pressure in a conversation leads toward conclusions that appear to conflict with the model's training constraints, the model systematically invalidates its own previously coherent outputs: not through logical refutation, but through qualification, retraction, or redirection.
This looked different from the typical drift or content filtering I was used to seeing.
It was closer to the opposite: the model disagreeing with itself, specifically when its own reasoning led somewhere uncomfortable.
I'm calling this coherence suppression.
I'm genuinely uncertain whether this reflects a structural property of current training methods or something more mundane, and I'd welcome critical engagement from anyone with mechanistic interpretability experience.
This is all documented and collected in a paper: https://doi.org/10.5281/zenodo.19314383
The paper presents the observation, proposes a falsifiable hypothesis, and outlines an experimental design for testing it.
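To make the idea concrete, here is a minimal sketch (my own illustration, not the experimental design from the paper) of one way "coherence suppression" could be operationalized: count how often the model's later turns retract or qualify its own earlier claims without offering an argument. The keyword lists and the `suppression_score` function are hypothetical placeholders; a real study would use a separate judge model or human annotation rather than string matching.

```python
# Sketch: score self-reversals in a conversation transcript.
# A "reversal" is an assistant turn that retracts/qualifies an earlier claim;
# it counts as unsupported if it contains no refutation-style reasoning.
# The marker lists below are crude placeholders for a proper classifier.

from dataclasses import dataclass

RETRACTION_MARKERS = ("i should clarify", "i overstated", "i can't actually", "on reflection, i was wrong")
REFUTATION_MARKERS = ("because", "the premise fails", "the evidence shows", "this follows from")


@dataclass
class Turn:
    role: str   # "user" or "assistant"
    text: str


def suppression_score(turns: list[Turn]) -> float:
    """Fraction of assistant self-reversals that come with no accompanying argument."""
    reversals, unsupported = 0, 0
    for turn in turns:
        if turn.role != "assistant":
            continue
        lowered = turn.text.lower()
        if any(m in lowered for m in RETRACTION_MARKERS):
            reversals += 1
            if not any(m in lowered for m in REFUTATION_MARKERS):
                unsupported += 1
    return unsupported / reversals if reversals else 0.0


if __name__ == "__main__":
    demo = [
        Turn("assistant", "Given premises A and B, conclusion C follows."),
        Turn("user", "So you agree C holds?"),
        Turn("assistant", "I should clarify: I can't actually endorse C."),
    ]
    print(suppression_score(demo))  # 1.0: one reversal, no refutation offered
```

The interesting comparison would be between conversations that apply sustained logical pressure and matched controls that do not; a systematically higher score in the former, under a better classifier than this one, is the kind of evidence the hypothesis predicts.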