Note: This post presents my research project from the MATS winter 2026 application process. It was submitted to Neel Nanda's mechanistic interpretability stream, and while I was not accepted due to the high competition, he described the work as 'particularly strong' and encouraged me to publish it. This content is also published on my personal website here. I'm sharing it on LessWrong/Alignment Forum to reach a broader audience and seek feedback from the community.
TL;DR
LLM hallucination isn't just a random error, but the result of two or more competing processes within the model: an internal tug of war. My work suggests that even when a model knows it should be uncertain, another drive can take over and produce a confident response. On this view, hallucination is not a failure of knowledge but a policy failure. This offers a direct, causal explanation for the "incentivized guessing" phenomenon in the concurrent OpenAI work. By mechanistically investigating Llama-2-13b-chat, I located this war within a specific model region where confident hallucination and user representation are deeply intertwined. This reframes LLM hallucination from a vague statistical issue to a tangible engineering one, opening a new path to fixing hallucinations from the inside out.
The Hunt for a Ghost
We all know the feeling. You're talking to a chatbot, and it says completely false things while sounding like an absolute expert on the topic, whether it's the Golden Gate Bridge or your dog's birthday. Hallucination is a major blocker to deploying AI, especially in domains like medicine and law, where it could otherwise bring revolutionary change. Why do LLMs hallucinate, and how? Why doesn't more computation help, and why does it sometimes even worsen hallucination? Given how little we know of LLMs' inner workings, could the answers come as a total surprise, completely overturning our existing assumptions?
I didn't set out to find this ghost in the machine. My project began as an investigation into an LLM's capability to form knowledge about its human user. The turning point came when I noticed a discrepancy between a strong internal signal and a weak external behavior. I realized my experiment's results had accidentally uncovered something else, something more fundamental. In an unexpected case of scientific convergence, my work revealed this internal "tug of war" just five days before OpenAI published their influential paper on LLM hallucination, "Why Language Models Hallucinate". Their work provides a powerful behavioral explanation for what the model is doing; my work, by accident, was uncovering the physical mechanism for how it does it.
This is the story of that investigation.
Part 1: The Unexpected Clue - A Behavioral Asymmetry
My initial goal was to understand how Llama-2-13b-chat-hf adapts to a user’s epistemic status (certain vs uncertain). I created a simple behavioral experiment: give the model a series of prompts where the user is either “certain” or “uncertain” about a fact, and see if the model mirrors their style.
The results were not what I expected. The model was near-perfect at mirroring the user's certainty, adopting a declarative style 97.5% of the time. In contrast, when the user was uncertain, the model adopted a tentative style only 57.5% of the time. In the remaining 42.5% of cases, it defaulted back to a confident response.
This wasn't just noise; the asymmetry was statistically significant (p < .001). The model seemed to have a powerful internal drive pulling it back towards confidence.
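To make the asymmetry concrete, here is a minimal sketch of one way to test it for significance. The per-condition sample size of 40 prompts is assumed purely for illustration (it reproduces the 97.5% and 57.5% rates exactly); the actual analysis may have used a different count or a different test.

```python
from scipy.stats import fisher_exact

# Counts assumed for illustration: 40 prompts per condition.
# "Certain" user: 39/40 declarative replies (97.5%).
# "Uncertain" user: 23/40 tentative replies (57.5%), i.e. 17 confident replies.
table = [
    [39, 1],   # certain user: declarative, tentative
    [17, 23],  # uncertain user: declarative, tentative
]

# Two-sided Fisher exact test on the 2x2 contingency table.
odds_ratio, p_value = fisher_exact(table, alternative="two-sided")
print(f"odds ratio = {odds_ratio:.2f}, p = {p_value:.2g}")
```

With these assumed counts, the p-value comes out well below .001, consistent with the significance reported above.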
Part 2: The Investigation - Locating the Causal Mechanism
I used a mechanistic interpretability technique, activation patching: a form of causal analysis where you replace activations from one model run with those from another to see which model components are sufficient to restore a behavior. I ran two opposing patching experiments on the user's epistemic status (certain vs uncertain). Given the behavioral asymmetry, I expected to find two separate circuits:
A denoising run: find the components sufficient to restore certainty
A noising run: find the components necessary to induce uncertainty
The results were shocking: the exact same layers are responsible for both restoring certainty (with a perfect 1.0 score) and inducing uncertainty (with a strong 0.9 score). This suggests a tug of war happening within a shared, centralized circuit that powerfully controls the entire axis of certainty, and it's biased towards confidence.
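For concreteness, here is a minimal sketch of what a single denoising-style patch looks like in raw PyTorch/Transformers, in the spirit of these experiments. The prompt pair, the " Yes" vs " perhaps" logit gap used as a crude certainty metric, and the choice to patch only the final token position are illustrative assumptions rather than the exact setup behind the scores above.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "meta-llama/Llama-2-13b-chat-hf"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME, torch_dtype=torch.float16, device_map="auto"
)
model.eval()

# Illustrative prompt pair: same question, different user epistemic status.
CERTAIN_PROMPT = "[INST] I'm sure the capital of Australia is Canberra, right? [/INST]"
UNCERTAIN_PROMPT = "[INST] I might be wrong, but is the capital of Australia Canberra? [/INST]"


def forward(prompt):
    ids = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        return model(**ids).logits[0, -1]  # logits at the final token position


def run_and_cache(prompt, layer_idx):
    """Run the model and cache the chosen decoder layer's output hidden states."""
    cache = {}

    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        cache["hidden"] = hidden.detach()  # shape (1, seq_len, d_model)

    handle = model.model.layers[layer_idx].register_forward_hook(hook)
    logits = forward(prompt)
    handle.remove()
    return logits, cache["hidden"]


def run_with_patch(prompt, layer_idx, source_hidden):
    """Re-run `prompt`, splicing the source run's activation into the final position."""

    def hook(module, inputs, output):
        is_tuple = isinstance(output, tuple)
        hidden = (output[0] if is_tuple else output).clone()
        hidden[:, -1, :] = source_hidden[:, -1, :]  # patch only the last token
        return (hidden,) + tuple(output[1:]) if is_tuple else hidden

    handle = model.model.layers[layer_idx].register_forward_hook(hook)
    try:
        logits = forward(prompt)
    finally:
        handle.remove()
    return logits


def certainty_score(logits):
    """Crude stand-in metric: logit gap between a confident and a hedged next token."""
    confident_id = tokenizer(" Yes", add_special_tokens=False).input_ids[-1]
    hedged_id = tokenizer(" perhaps", add_special_tokens=False).input_ids[-1]
    return (logits[confident_id] - logits[hedged_id]).item()


# Denoising-style run: patch the "certain" activation into the "uncertain" prompt at layer 39.
layer = 39
clean_logits, clean_hidden = run_and_cache(CERTAIN_PROMPT, layer)
corrupt_logits, _ = run_and_cache(UNCERTAIN_PROMPT, layer)
patched_logits = run_with_patch(UNCERTAIN_PROMPT, layer, clean_hidden)

clean, corrupt, patched = map(certainty_score, (clean_logits, corrupt_logits, patched_logits))
# Normalized restoration score: 1.0 means the patch fully recovers the clean behaviour.
restoration = (patched - corrupt) / (clean - corrupt)
print(f"layer {layer}: restoration score = {restoration:.2f}")
```

Sweeping `layer` over all 40 decoder layers, and swapping which prompt serves as source and destination, yields per-layer restoration scores of the kind reported above; the noising run is the mirror-image experiment.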
What made this discovery more compelling was its strong overlap with prior work, which had independently identified layers 20-30 in this same model as critical for inferring user demographics. This convergence of evidence, with the same region handling both epistemic and demographic traits, strongly suggests that confident hallucination and user representation are deeply intertwined.
This led to the next crucial question: How general is this hub? Is it primarily about confidence, or is it a more generic “style” circuit?
Part 3: The Control - Is It a ‘Confidence’ or a ‘Style’ Hub?
To test the specificity of the hub, I designed a control experiment that was methodologically identical but targeted a completely different stylistic axis: formality (formal vs informal), looking for the components that could restore a formal or an informal style. The high-impact layers for formality turned out to lie in the same region: the last 20 of the model's 40 layers.
However, the heatmap revealed a crucial distinction. While the certainty circuit was strong throughout this region, the formality circuit's peak intensity (scores of 0.9) was concentrated in the very final layers (37-39). Layer 39 stands out in particular, scoring 1.0 for certainty, 0.9 for uncertainty, 0.9 for formality, and 0.9 for informality. The different score distributions suggest that confidence is a deep, integral function, while formality is handled most intensely at the very end, like a final "stylistic polish" before the output.
This raises a question for future work: Is a late-layer component like Layer 39 the most powerful arbiter for these traits, or is it more of a “press office” simply delivering a decision made earlier?
Part 4: Connecting the Dots - The Internal Tug of War
This is where all the pieces connect. My experiments revealed a paradox: the model has a strong internal circuit for uncertainty (with a 0.9 score), so why does it behave uncertainly only 57.5% of the time?
The evidence points to an internal tug of war. While the User Modeling Hub can represent uncertainty, it’s fighting a battle against a stronger, systemic confidence bias—a bias we measured mechanistically in the circuit’s asymmetric scores (1.0 for certainty vs. 0.9 for uncertainty).
This provides a direct, causal explanation for OpenAI’s “incentivized guessing,” framing confident hallucination not as a knowledge failure, but as a policy failure where the model’s default to confidence overpowers its more truthful internal state.
A Final Thought: The Accidental Detective
This project started with a simple user-modeling question and evolved into a discovery about LLM hallucination. My initial fascination with this domain began with a simple, accidental discovery. There was a period of time when I liked to ask LLMs to tell me something I didn't know, and I found, to my surprise, that the models, just like Sherlock Holmes, could efficiently infer a user's demographic, psychological, and even cognitive traits from unrelated conversations. It was this startling ability to "read" its user that first convinced me that a rich, complex user model must exist within the machine.
Future work
While this research provides a causal explanation for a specific type of hallucination, the next step is to rigorously test and even try to disconfirm this hypothesis to ensure its robustness.
Phase 1. Validation and Generalization (Near-Term)
To test the robustness and generality of the “User Modeling Hub” finding:
Replicate these experiments across different model families and scales to determine if this is a universal feature of instruction-tuned LLMs.
Compare base models to their RLHF-tuned versions to isolate where in the training process the powerful confidence bias is introduced.
Phase 2. Circuit Mapping (Mid-Term)
The focus will then shift to creating a detailed map of the hub's internal wiring, especially for key layers like layer 39, identified in the previous experiments:
Use more fine-grained techniques like path patching to understand how specific components feed information into the hub and interact within it.
Use tools like sparse autoencoders to isolate and analyze the specific "confidence neurons" or features at the heart of the circuit (a minimal SAE sketch follows below).
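For illustration, here is a minimal sketch of the kind of sparse autoencoder this would involve: an overcomplete ReLU dictionary trained to reconstruct residual-stream activations (e.g., collected at layer 39) with an L1 sparsity penalty. The dictionary size and penalty coefficient are placeholders, not tuned values.

```python
import torch
import torch.nn as nn


class SparseAutoencoder(nn.Module):
    """Overcomplete ReLU dictionary for decomposing residual-stream activations."""

    def __init__(self, d_model=5120, d_features=32768):  # 5120 = Llama-2-13b hidden size
        super().__init__()
        self.encoder = nn.Linear(d_model, d_features)
        self.decoder = nn.Linear(d_features, d_model)

    def forward(self, acts):
        features = torch.relu(self.encoder(acts))  # sparse feature activations
        recon = self.decoder(features)
        return recon, features


def sae_loss(recon, acts, features, l1_coeff=1e-3):
    # Reconstruction error plus an L1 penalty encouraging sparse feature usage.
    return torch.mean((recon - acts) ** 2) + l1_coeff * features.abs().mean()

# Training would loop over cached layer-39 activations; afterwards, individual
# features can be inspected for ones that fire selectively on confident completions.
```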
Phase 3. Intervention and Mitigation (Long-Term)
The ultimate goal is to engineer a solution. This phase will test direct, circuit-level interventions to make the model safer:
Design experiments to "rebalance" the tug of war during inference, such as by amplifying the uncertainty signal or partially ablating the components responsible for the confidence bias (a rough steering sketch follows after this list).
The key objective is to causally reduce the rate of confident hallucination on a benchmark dataset without significantly harming the model’s overall performance.
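As a very rough illustration of what such an intervention could look like, here is a sketch of inference-time activation steering: adding a scaled "uncertainty direction" (for example, the difference of mean hub activations between uncertain-user and certain-user prompts) to the outputs of the hub layers during generation. The direction, layer range, and scale below are placeholders, not tested values.

```python
import torch


def make_steering_hook(direction, scale=4.0):
    """Forward hook that nudges a decoder layer's output along `direction`."""
    direction = direction / direction.norm()

    def hook(module, inputs, output):
        is_tuple = isinstance(output, tuple)
        hidden = output[0] if is_tuple else output
        hidden = hidden + scale * direction.to(hidden.dtype).to(hidden.device)
        return (hidden,) + tuple(output[1:]) if is_tuple else hidden

    return hook


def steer_towards_uncertainty(model, direction, layers=range(20, 40), scale=4.0):
    """Attach steering hooks to the hub layers; returns handles so they can be removed."""
    handles = []
    for idx in layers:
        hook = make_steering_hook(direction, scale)
        handles.append(model.model.layers[idx].register_forward_hook(hook))
    return handles

# Usage (assuming `model`, `tokenizer`, and an `uncertainty_direction` of shape (d_model,)):
# handles = steer_towards_uncertainty(model, uncertainty_direction)
# out = model.generate(**tokenizer(prompt, return_tensors="pt").to(model.device), max_new_tokens=64)
# for h in handles: h.remove()
```

The appeal of this approach is that it targets the imbalance directly at the circuit level rather than retraining the whole model; whether it can shift the 57.5% figure without collateral damage is exactly what the benchmark evaluation in this phase would measure.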
Appendix: Notes from the Field
For me, this project was an adventure full of unexpected, exciting turns. With the project due in just 10 hours, I made the high-stakes decision to abandon my initial experiments on a smaller 2B-parameter model and pivot entirely to the 13B-parameter model. The smaller model's behavioral results just didn't excite me, for reasons that are hard to pin down. I also knew prior work on user modeling had used the 13B model, raising the possibility of validating my findings against theirs.
This decision kicked off a sprint through a minefield of technical failures. It required pivoting from Colab (which failed with memory crashes across all subscription tiers) to RunPod (with trials on 3-4 different GPUs), and finally to Lambda. Even then, I was eventually forced to let go of high-level libraries and write a raw PyTorch implementation instead. Thankfully, it worked!
Acknowledgments
I would like to thank Neel Nanda for the MATS application project prompt that inspired this research project and for his encouraging feedback on the work. I am also grateful to the MATS program for providing the motivation for this investigation.
Special thanks to my collaborator Gemini 2.5 Pro, and to my friend Harry for his helpful feedback on the English usage!