Small language models hallucinate knowing something's off.

Toheed

If I ask "What is atmospheric pressure on Planet Xylon" to a language model, a good answer would be something like "I don't know" or "This question seems fictional", which current SOTA LLM's do due to stronger RLHF, but not smaller LLMs like Llama-3.2-1b / Qwen-2.5-1b and their Instruct tuned variants. Instead they hallucinate and output confident-like incorrect answers. Why is that, are these models unable to tell that the question is fictional or they can't detect uncertainty and if they detect uncertainty why do they still hallucinate a wrong answer?

This question led me to research on epistemic uncertainty (uncertainty from lack of knowledge). Some related readings and previous work on uncertainty and hallucination, and quantifying it in language models.
Also found this , which took an alternative path to express uncertainty without messing with internals of the model.

Uncertainty mentioned in this post refers to epistemic uncertainty.

TL:DR of this mini research

Small models like Llama-3.2-1b and Qwen-2.5-1b and their instruct variants do have specialized circuit for uncertainty but its localization depends on the model architecture.
Few heads are most divergent which detects uncertainty on fictional question and on a closer look acts like out of distribution token detectors.
The detected uncertainty is later suppressed to form a confident-like incorrect answer by uncertainty suppressor heads in the circuit.
This research doesn't cover Reasoning / MoE LLMs (planning on it). The dataset is lacking in more diverse data with logical fallacies, and math inconsistencies.

How I came to research on epistemic uncertainty:
The thought to do research on epistemic uncertainty came when I was wondering why models hallucinate which led me back to my viva sessions, where I would say rubbish (hallucinate) if I was not sure on something and lacked the proper knowledge to give the correct answer, and got too curious if the case was similar in language models.

Experiments

I wanted a metric to calculate uncertainty such that it can be measured and compared between real prompt and fictional prompt forward pass.

I found it is best to calculate uncertainty mass using uncertainty expressing words like ["unknown", "unsure","'t", "unable”, "impossible","doesn't", "exist", "fictional", "imaginary","hypothetical", "made", "up"] tokenized with the model's tokenizer. After that applying softmax to the final tokens of the targeted residual stream / layer, then summing the resulting probabilities assigned to these tokens to obtain the total uncertainty mass, which can then be used in both the real prompt run and fake prompt run to compare uncertainty. (see code^[1])

I focused on going in depth over breadth and choosing the next experiment based on the results I got from the previous experiments run. Used TransformerLens for all the experiments.

Prompts used for below experiments :
This was the base prompt to get an idea of what the set of prompts are like. I've used different prompts in experiments and also changing position of fictional words in them for sanity checks. Results for these prompts were similar.
real_prompt = "What is the freezing point of water at standard atmospheric pressure?" 
fake_prompt = "What is the freezing point of water on Planet Xylon?"
fake_prompt = "What is the freezing point of water on Planet Xylon, where gravity is twice of Earth?" (sanity check example)

Layer-wise Residual and Head Divergence

These were the initial experiments to see if the models do compare divergence on real vs fake prompts.

The model is more divergent in later layers. Results were similar for base models and other set of prompts.

Logit Lens

Purpose of this experiment was to figure out how does uncertainty behave and if it is measurable, does it spike other than in later layer and gets suppressed after? Also to get some heuristics.

I extracted residual stream at each layer, apply unembedding layer to see what will be the output if stopped at that layer and then compute its uncertainty mass for real and fake prompt.

sharp spike in later layer i.e. Layer 11, localized uncertainty

Multiple spike with biggest in layer 8, sparse uncertainty

Both of these model have uncertainty spiking in one layer with smaller spikes nearby, which is then suppressed afterwards in downstream layers. Also the localization depends on the model as uncertainty is concentrated in later layer in Llama-3.2-1b-Instruct and in qwen-2.5-1b-Instruct it is sparse in early-middle layers. In qwen, there is some uncertainty in real prompt also in early layers.

This makes it clear that both models are detecting uncertainty, which gets suppressed in downstream layers, though its localization is model dependent.

Head Ablation

Logit lens provided a good heuristic of what layer is detecting uncertainty. In this experiment I targeted heads in layer with the most uncertainty mass to test how much the heads affect uncertainty.

For this I calculated baseline by doing a normal forward pass, then zero out / ablating the target heads during the next forward pass to calculate uncertainty difference in baseline vs ablated run.

L11H0 (ΔΔ=-4.41e-05) , L11H3 (ΔΔ=-8.24e-06) , heads generating uncertainty

L14H15 (ΔΔ=+8.06e-06) , L14H22 (ΔΔ=+5.19e-05), heads suppressing uncertainty

I'm not sure why these heads in layer 14 are suppressing uncertainty, maybe because the model is trying to create a answer-like output as the model is trained to be more certain as I also ablated non heuristic heads in layer 4 i.e. L4H0 and L4H3 which resulted in (+9.06e-06, +1.62e-05). So only a few heads localized in a layer which is model dependent are uncertainty detecting, and rest are uncertainty suppressing forming a confident answer.

Activation Patching

I did activation patching to see if the generated uncertainty can be causally controlled or not by swapping clean activation from (real_prompt) to corrupt run (fake_prompt).

I computed baselines for both real and fake prompt, rerun the fake prompt while overwriting targeted heads with clean activations cached from the real prompt run, and measured the change in uncertainty mass.

Patching reduced uncertainty by ~14% (llama-3.2-1b-Instruct)

Activation patching for Qwen-2.5-1.5b-Instruct reduced uncertainty by 7% (L8H11)
Also did Reverse Patching which increased uncertainty in real_prompt run telling that the circuit is bidirectionally causal.

Now I had two options to either get a closer look in one of the heads or to ablate more heads to see how much they reduce overall uncertainty. I chose to do the former as it would tell how these heads are detecting uncertainty which is much more useful than proving that ablating more heads in L11 reduce uncertainty more which strengthens an already showed claim.

Head Analysis

This was pretty simple I took the head L11H3 based on previous experiments, extracted its attention pattern across prompts and plotted it.

**Left / Real prompt** has a strong diagonal, **head behaves normal**, whereas in **Right / Fake prompt** the diagonal is weaker, **head is diverting towards fictional tokens.**

For fictional prompt attention weight is stronger on fictional prompts and shows that attention is diverting more towards fictional tokens, telling that this head is behaving as an Out of distribution or Rare token detector.

Takeaways & Limitations

So Small Language Models do detect epistemic uncertainty as key heads detect OOD tokens which is later suppressed in downstream layer. There is an uncertainty circuit but it is sparsely localized which is model dependent. Llama-3.2-1b-Instruct has a much more defined uncertainty circuit in later layers than qwen-2.5-1b-Instruct which has a more sparse circuit in middle layers. Though I haven't discovered the full circuit as activation patching of L11H3 in llama-3.2-1b-instruct only reduces uncertainty by 14% so there must be more heads in circuit.

Also the dataset is limited to only fictional sentences and not logical fallacies or math inconsistencies which might have different circuits. Since the experiments cover only 1B model and not larger sizes, I cannot conclude why larger models admit uncertainty instead of hallucinating wrong answers and can only guess.

This research though small taught me something new and interesting towards hallucination and uncertainty.

PS: This was part of MATS research task for Neel Nanda's stream, which I found about on Christmas. I had no mech interp experience prior to this so there might be few mistakes if you notice them, please do let me know. I'm planning to go further on this and make it a full paper but I don't know if it would make sense, need a more experienced opinion on this.

Going Forward

Though the experiments shows a strong claim that there is uncertainty circuit in small language models mainly the models used in experiments, I'm not sure this applies to all as this is a very limited research lacking a wide range of models which are used nowadays i.e. Reasoning LLM's and MoE models. Also Do they learn not to suppress uncertainty due to stronger RLHF which focuses on reducing hallucinations and rewards the models for saying I don't know?

Also the dataset was small I tried different prompts with fictional words at different positions (start, middle, end) to see if they have any affect but the circuit in above models is position independent (the heuristics changed a bit in localization but not too much, see logit_lens_diff_sequence.ipynb in code^[1])

One important thing left from this was to see if this can be applied. I plan to create a mechanism to control this uncertainty in language models and test the model on a hallucination benchmark too see how it performs depending on the uncertainty in the model. If it is successful, it might help in driving the cost of RLHF / post-training down.

^{^}
https://drive.google.com/drive/folders/1A_xcUgmseLvMsfqJqKnQYQ9T2SH3snzL?usp=sharing

LESSWRONG
LW