Executive Summary
Small Language Models (<1B parameters) exhibit a robust failure mode: they correctly answer factual questions when prompted to "Think step by step" (CoT), but frequently hallucinate when given the simpler instruction to "Give answer in one word" (GAOW). This hallucination is especially pronounced when the input prompt is a sequence of factual question-answer pairs. This project investigates the mechanistic basis of this instruction-dependent failure. I hypothesize that the GAOW instruction activates a shallow "heuristic pathway" that bypasses the model's more robust factual recall circuits.
Using causal interventions on Qwen-family models (Qwen1.5-0.5B-Chat, Qwen3-0.6B), I present strong evidence for this hypothesis. A comparative Logit Lens analysis (Figure 1) shows that the computational pathways for the two instructions diverge in the mid-to-late layers (L15-L16), immediately following the processing of the instruction tokens. Direct Logit Attribution and Activation Patching (Figures 2 and 4) confirm that a sparse set of late-layer MLPs and attention heads are causally sufficient for recalling the correct fact in the CoT pathway.
Crucially, attention pattern analysis (Figure 3) provides a direct mechanistic link: heads that promote the correct answer (e.g., L22.H12) attend to factual keywords like "capital" and "America", while heads that suppress it (e.g., L19.H4) attend directly to the GAOW instruction tokens like "one" and "word". This suggests the existence of a mechanistic "switch" that selects between reasoning modes, providing a promising target for interventions to improve model reliability.
I ran the experiments on Qwen1.5-0.5B-Chat and Qwen3-0.6B because both models exhibit this hallucination. The plots for the two models are very similar, which suggests the hallucination is driven by a similar set of heads and layers. Models such as Gemma3-270M and SmolLM-360M also exhibit the same hallucination, but I could not run the experiments on them because of the limited compatibility of the TransformerLens library with those models.
I also observed that SLMs are generally weak at instruction following: a plain prompt (without CoT) and a CoT prompt show no difference under the Logit Lens, and even their attention patterns remain the same. Nevertheless, as expected, CoT does improve model performance.
High-Level Takeaways
- Instruction-Dependent Failure: The factual hallucination is not a general knowledge deficit but is causally tied to the "Give answer in one word" instruction, which fails to engage the model's correct reasoning pathway. This behavior is confirmed to be general across several SLMs (<1B).
- Mid-Layer Divergence: The computational paths for CoT and GAOW reasoning diverge in the mid layers (L15-L16), as shown by a comparative Logit Lens analysis.
- Late-Layer Factual Execution: A sparse set of late-layer MLP blocks (L18-L23) and attention heads (e.g., L22.H12) are causally responsible for the final factual recall in the successful CoT pathway.
- A "Heuristic-Trigger" Mechanism: Negative heads (e.g., L19.H4) directly attend to the GAOW instruction tokens, providing a direct mechanistic link between the prompt and the engagement of a simpler, incorrect pathway.
- Circuit Inconsistency: Path patching reveals that the heuristic and reasoning pathways are deeply inconsistent. Attempting to patch a single "clean" signal from the reasoning pathway into the "heuristic" pathway context confuses the model and degrades performance further, suggesting a global, coordinated state change rather than a simple modular circuit.
1. Motivation and Background
The reliability of language models is critically dependent on their ability to follow instructions faithfully. While large models are increasingly capable, smaller, more accessible models often exhibit surprising failures. This project investigates one such failure: a consistent, instruction-dependent factual hallucination in SLMs (<1B). This work is inspired by recent advances in identifying and manipulating high-level behaviors via activation engineering, such as the discovery of a "refusal direction" [Arditi et al.]. By contrast, this project explores a different behavioral axis: the model's choice between a deep, algorithmic reasoning process and a shallow, heuristic one. Understanding the mechanism behind this "switch" is a key step toward building models that can be reliably steered towards safer, more robust reasoning.
2. Experimental Setup
- Models: The primary model for this study is Qwen/Qwen1.5-0.5B-Chat. Key results were replicated on Qwen3-0.6B to ensure generality. All analysis was conducted using the TransformerLens library.
- Dataset: A few-shot prompt was constructed using factual question-answer pairs. The key manipulation is the system prompt:
- Clean/CoT Prompt: "You are a helpful assistant. Think step by step."
- Corrupted/GAOW Prompt: "You are a helpful assistant. Give answer in one word."

Figure. Input prompt and model response
- Metric: For causal interventions, I use a normalized logit difference metric, which measures the fraction of performance recovered on the correct answer ("Washington") versus the primary incorrect answer ("New York"). A score of 1.0 indicates full recovery of the clean (CoT) behavior, while 0.0 indicates no change from the corrupted (GAOW) baseline; a minimal sketch of this setup and metric follows below.
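As a concrete reference, here is a minimal sketch of the setup and metric in TransformerLens. The prompts are simplified single-question placeholders (the actual runs used few-shot, chat-formatted prompts), and " Washington" / " New" are assumed to be single tokens:

```python
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("Qwen/Qwen1.5-0.5B-Chat")

# Illustrative prompts; the real experiments used a few-shot, chat-formatted version.
cot_prompt  = "You are a helpful assistant. Think step by step.\nWhat is the capital of America?"
gaow_prompt = "You are a helpful assistant. Give answer in one word.\nWhat is the capital of America?"

def logit_diff(logits, answer=" Washington", distractor=" New"):
    """Logit difference at the final (prediction) position."""
    a_id = model.to_single_token(answer)      # assumes the answer is a single token
    d_id = model.to_single_token(distractor)
    return (logits[0, -1, a_id] - logits[0, -1, d_id]).item()

def recovery(patched_logits, clean_logits, corrupt_logits):
    """Normalized logit difference: 1.0 = clean (CoT) behaviour fully recovered,
    0.0 = no change from the corrupted (GAOW) baseline."""
    clean, corrupt = logit_diff(clean_logits), logit_diff(corrupt_logits)
    return (logit_diff(patched_logits) - corrupt) / (clean - corrupt)
```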
3. Key Experiments and Results
The investigation proceeds in five stages, moving from high-level observation to specific causal circuit analysis.
Experiment 1: Locating the Point of Divergence with the Logit Lens
To determine when the model's processing diverges, I use a comparative Logit Lens. I plot the model's evolving preference for "Washington" over "New York" at the output of every component for both prompts.
Figure 1: Logit Lens Divergence of CoT vs. GAOW Pathways
The plot shows the model's logit difference for the correct answer at each component's output. The pathways (blue for CoT, red for GAOW) are nearly identical until Layers 13-14, where they diverge sharply after the instruction tokens have been processed. This indicates that the instruction itself is the root cause of the different computational strategies. I then analyze the successful CoT pathway to identify which components are responsible for the final correct prediction.
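Reusing the setup above, the comparative Logit Lens curves can be read off the cached residual stream roughly as follows (a sketch of the approach, not the exact plotting code behind Figure 1):

```python
def lens_curve(prompt, answer=" Washington", distractor=" New"):
    """Logit difference for the answer, decoded from the residual stream after
    every layer at the final position (the Logit Lens)."""
    a_id, d_id = model.to_single_token(answer), model.to_single_token(distractor)
    _, cache = model.run_with_cache(prompt)
    resid, labels = cache.accumulated_resid(layer=-1, pos_slice=-1, return_labels=True)
    resid = cache.apply_ln_to_stack(resid, layer=-1, pos_slice=-1)  # fold in the final norm
    logits = resid @ model.W_U                                      # [component, batch, vocab]
    return labels, (logits[..., a_id] - logits[..., d_id]).squeeze(-1)

labels, cot_curve = lens_curve(cot_prompt)
_, gaow_curve = lens_curve(gaow_prompt)   # the two curves separate in the mid layers
```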
Experiment 2: Identifying the Executors of Factual Recall with Direct Logit Attribution

Figure 2: Direct Logit Attribution by Component
The bar chart shows that late-layer MLPs (L18-L23) are the dominant positive contributors to the correct answer. While the earlier attention layers do not push the model towards either the correct or the incorrect answer, the MLP layers play a significant role in pushing the model towards the correct output.
Each bar shows a component's contribution to the logit difference between " Washington" and " New". A negative bar indicates that the component pushes the model towards saying " New" rather than " Washington".
Running the same experiment on the Qwen3-0.6B model, which exhibits the same hallucination, produced a very similar bar chart.

Figure. Direct Logit Attribution by Head
The heatmap above extends the per-component DLA to individual heads. It reveals a sparse set of late-layer heads (e.g., L22.H12, L23.H4) that contribute strongly to the correct answer, while others (e.g., L19.H4) contribute negatively; these are the "executor" components. Consistent with the bar chart, the later layers dominate the influence on the output, and the heatmap pinpoints the specific layers and heads that matter.
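A minimal sketch of the per-head attribution, assuming per-head residual-stream writes are cached via model.set_use_attn_result(True) and approximating the final normalization by dividing out its cached scale:

```python
import torch

model.set_use_attn_result(True)                 # cache each head's residual-stream write

def per_head_dla(cache, answer=" Washington", distractor=" New"):
    """Project each head's final-position output onto the answer-vs-distractor
    logit-difference direction."""
    direction = (model.W_U[:, model.to_single_token(answer)]
                 - model.W_U[:, model.to_single_token(distractor)])   # [d_model]
    scale = cache["ln_final.hook_scale"][0, -1]                       # final-norm scale at last token
    scores = []
    for layer in range(model.cfg.n_layers):
        head_out = cache[f"blocks.{layer}.attn.hook_result"][0, -1]   # [n_heads, d_model]
        scores.append((head_out / scale) @ direction)
    return torch.stack(scores)                                        # [n_layers, n_heads]

_, cot_cache = model.run_with_cache(cot_prompt)
head_scores = per_head_dla(cot_cache)           # e.g. head_scores[22, 12] is strongly positive
```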
Experiment 3: Uncovering the Competing Mechanisms with Attention Analysis
I visualize the attention patterns of the key heads identified in Experiment 2 to understand why they contribute positively or negatively.


Figure 3: Attention Patterns from the Final Token.
I computed the top 3 positive heads ([(22, 12), (23, 4), (20, 12)]) and top 3 negative heads ([(19, 4), (19, 8), (22, 9)]) to see which tokens each one attends to. Positive head L22.H12 attends to factual keywords in the prompt such as "capital" and "America" (red bars). Negative head L19.H4 ignores the factual question and instead attends directly to the GAOW instruction tokens "one" and "word". This provides a direct mechanistic link between the instruction and the activation of a competing, incorrect heuristic.
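A small sketch for reading these patterns from the cache, reusing the setup above:

```python
def final_token_attention(cache, prompt, layer, head, top_k=5):
    """Tokens a given head attends to from the final (prediction) position."""
    pattern = cache[f"blocks.{layer}.attn.hook_pattern"][0, head]   # [query_pos, key_pos]
    pairs = list(zip(model.to_str_tokens(prompt), pattern[-1].tolist()))
    return sorted(pairs, key=lambda p: p[1], reverse=True)[:top_k]

_, gaow_cache = model.run_with_cache(gaow_prompt)
print(final_token_attention(gaow_cache, gaow_prompt, 22, 12))  # factual keywords
print(final_token_attention(gaow_cache, gaow_prompt, 19, 4))   # GAOW instruction tokens
```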
Experiment 4: Causal Confirmation with Activation Patching
To confirm the causal roles of the components identified by DLA, I patch their clean (CoT) activations into the corrupted (GAOW) run.
Figure 4: Causal Effect of Patching Head Outputs.
The heatmap shows the percentage of performance recovered when patching each head's output. The late-layer heads identified by DLA (e.g., L22.H12) show a strong causal effect, confirming their role in producing the correct answer.
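A minimal patching sketch follows. Because the CoT and GAOW prompts differ in length, this version aligns the patch on the final token, which is an assumption about exactly which positions were patched in the original runs:

```python
from functools import partial

clean_logits, clean_cache = model.run_with_cache(cot_prompt)
corrupt_logits = model(gaow_prompt)

def patch_head_z(z, hook, head):
    # z: [batch, pos, n_heads, d_head]; overwrite one head at the final position
    z[:, -1, head, :] = clean_cache[hook.name][:, -1, head, :]
    return z

def patch_one_head(layer, head):
    patched_logits = model.run_with_hooks(
        gaow_prompt,
        fwd_hooks=[(f"blocks.{layer}.attn.hook_z", partial(patch_head_z, head=head))],
    )
    return recovery(patched_logits, clean_logits, corrupt_logits)

print(patch_one_head(22, 12))   # strong positive recovery expected for L22.H12
```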
Experiment 5: Mapping the Circuit with Path Patching
To understand how the instruction-attending heads influence the fact-retrieval heads, I perform path patching: I patch the clean output of every potential source head and measure its isolated causal effect on the query input of the key positive head, L22.H12.

Figure 5. Path Patching Causal Effect on L22.H12 Query Input
The overwhelmingly red (negative) result indicates that patching a single clean path into the corrupted run makes performance worse. This suggests the GAOW instruction induces a global, self-consistent "heuristic mode". The fact-retrieval head L22.H12 is conditioned on this mode, and receiving a single "reasoning mode" signal creates a computational conflict that degrades its function.
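Full path patching also freezes all indirect routes; the sketch below is a rough, simplified approximation of the direct path only (not the exact procedure behind Figure 5). It adds a source head's clean-minus-corrupted residual-stream write to the target head's query input alone, using TransformerLens's split q/k/v inputs, and reuses clean_cache, clean_logits, corrupt_logits, and recovery from the sketches above:

```python
model.set_use_split_qkv_input(True)     # exposes blocks.{L}.hook_q_input per head

TGT_LAYER, TGT_HEAD = 22, 12
_, corrupt_cache = model.run_with_cache(gaow_prompt)

def patch_q_input(q_input, hook, src_layer, src_head):
    # q_input: [batch, pos, n_heads, d_model]; edit only the target head's query
    # input at the final position with the source head's clean-vs-corrupt delta
    delta = (clean_cache[f"blocks.{src_layer}.attn.hook_result"][0, -1, src_head]
             - corrupt_cache[f"blocks.{src_layer}.attn.hook_result"][0, -1, src_head])
    q_input[:, -1, TGT_HEAD] = q_input[:, -1, TGT_HEAD] + delta
    return q_input

patched_logits = model.run_with_hooks(
    gaow_prompt,
    fwd_hooks=[(f"blocks.{TGT_LAYER}.hook_q_input",
                partial(patch_q_input, src_layer=19, src_head=4))],
)
print(recovery(patched_logits, clean_logits, corrupt_logits))   # Figure 5 reports negative values here
```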
Experiment 6: Demonstrating Unidirectional Control with a Steering Vector
To test if the difference between pathways can be captured by a linear direction, I calculated a "heuristic direction" vector by taking the mean difference in activations between GAOW and CoT runs at a key divergence layer (L15). I then intervened by adding or subtracting this vector.
Figure 6: Bidirectional Control Experiment.
Adding the heuristic vector to the successful CoT run induces the hallucination, while subtracting the vector from the failing GAOW run fails to correct it. This demonstrates unidirectional control, suggesting the heuristic pathway is a more stable attractor state.
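A sketch of the steering intervention, computed here from a single prompt pair at the final token of layer 15 (the actual experiment used a mean difference over runs):

```python
LAYER = 15
hook_name = f"blocks.{LAYER}.hook_resid_post"

# "Heuristic direction": GAOW minus CoT activation at the final token of the divergence layer
heuristic_dir = gaow_cache[hook_name][0, -1] - cot_cache[hook_name][0, -1]

def add_direction(resid, hook, coeff):
    # shift the residual stream at the final position along the heuristic direction
    resid[:, -1] = resid[:, -1] + coeff * heuristic_dir
    return resid

# coeff=+1.0 on the CoT prompt induces the hallucination;
# coeff=-1.0 on the GAOW prompt fails to repair it (unidirectional control).
steered_logits = model.run_with_hooks(
    cot_prompt,
    fwd_hooks=[(hook_name, partial(add_direction, coeff=1.0))],
)
```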
Experiment 7: Mapping Circuit-Level Inconsistencies with Path Patching
To test whether this circuit-level inconsistency holds beyond the target of Experiment 5, I repeated the path patching, measuring the causal effect of every source head's clean (CoT) output on the query input of another key fact-retrieval head (L22.H10) within the corrupted (GAOW) run.

Figure 7: Path Patching Causal Effect on Query Input of Head L22.H10
The result is a near-uniform, strong negative effect (dark red). This indicates that patching a single, isolated "clean" signal from the reasoning pathway into the globally "corrupted" heuristic pathway creates a computational conflict that worsens, rather than improves, performance. This refutes a simple modular circuit hypothesis and supports a model where the instruction induces a global, self-consistent state change.
4. Generalization to Other Models
I ran the same experiments on the Qwen3-0.6B model and obtained similar results. As noted in the executive summary, the hallucination is also exhibited by Qwen3-0.6B, Gemma3-270M, and SmolLM-360M, but I could not run the experiments on Gemma and SmolLM because of the limited compatibility of the TransformerLens library with those models.


5. Conclusion
This work provides a mechanistic explanation for a common failure mode in small language models. The choice between correct reasoning and factual hallucination is not arbitrary but is controlled by a mechanistic "switch" triggered by the input instructions. This switch appears to be a global state change that puts the model into either a deep, algorithmic pathway or a shallow, heuristic one. By identifying the components that constitute this switch, this research paves the way for targeted interventions to make SLMs more reliable and steerable.
However, I was unable to trace the exact root cause. The dataset sample is very small, and because of time constraints I did not run the experiments on cases where the model gives the correct answer under the GAOW instruction. For example, Qwen1.5-0.5B-Chat answered "What is the capital of Germany?" correctly in every case; interpreting such cases could also be useful.
Future work will include expanding the dataset with more varied prompts. SLMs are almost certainly prone to other hallucinations beyond this one, so experimenting with different prompts and applying more robust techniques is required. I also plan to analyze this problem using Sparse Autoencoders (SAEs).