Note: This post presents my research project from the MATS Winter 2026 application process. It was submitted to Neel Nanda's mechanistic interpretability stream; while I was not accepted due to the high level of competition, he described the work as 'particularly strong' and encouraged me to publish it. This content is also published on my personal website here. I'm sharing it on LessWrong/Alignment Forum to reach a broader audience and seek feedback from the community.
TL;DR
LLM hallucination isn’t just a random error, but the result of two or more competing processes within the model: an internal tug of war. My work suggests that even when a model knows it should be uncertain, another drive...