The probability for "Garage" hit 99% (Logit 15) at the very first step and stayed flat.
Is the problem that these questions are too easy, so the LLM is outputting reasoning because that's sometimes helpful, even though in this case it doesn't actually need it?
I'd be curious to see what the results look like if you give it harder questions.
Chain-of-Thought (CoT) in Llama-2-7b is empirically demonstrated to be a Contextual Saturation mechanism, not a reasoning process. Using a model-biology framing and empirical mechanistic experiments (logit tracing, path patching, and entropy analysis), I show that the model functions as a "Bag-of-Facts" retriever. Intermediate tokens serve only to stabilize attention on semantic entities, collapsing the answer distribution via pattern matching rather than logical progression.
Introduction
I hypothesize that this happens for two reasons:
i) LLMs are trained for next-token prediction, and the transformer architecture was originally designed for text translation.
ii) The majority of data in the human world, even STEM data, is mostly pattern matching (roughly ~90%), so there should be a debate about what "reasoning" actually means here. When big labs say their model reasons, what exactly do they mean by it?
Prediction: If CoT is true reasoning, the model should exhibit stepwise uncertainty reduction and sensitivity to logical order. If CoT is retrieval, the model will be permutation invariant and exhibit "early knowledge" of the answer.
Experimental Framework
Four mechanistic probes — logit tracing, path patching (interchange scoring), entropy analysis, and attention-pattern analysis — were applied to Llama-2-7b using transformer_lens.
I ran the whole analysis on 9 prompts spanning six subtopics: state_tracking, arithmetic_dependency, counter_intuitive, symbolic_logic, distractor_filtering, and multi_hop_run. Each subtopic targets a narrow facet of human reasoning ability and checks whether the model shows even a hint of reasoning.
I used the following metrics to judge the model's internals; the expected observations that would support the claim that LLMs are pattern-matching retrieval tools are described in the Evidence and Results section below.
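All four probes run on top of a single transformer_lens model instance. Below is a minimal sketch of that shared setup (the model alias, the example prompt, and the top-k printout are assumptions for illustration, not the exact probe code):

```python
# Minimal sketch of the shared setup, assuming local access to the
# Llama-2-7b Hugging Face checkpoint. The prompt is illustrative.
import torch
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("meta-llama/Llama-2-7b-hf")
model.eval()

prompt = ("The coin is in the red jar. The red jar is in the blue crate. "
          "The blue crate is in the garage. Question: What room is the coin in?")

tokens = model.to_tokens(prompt)                  # shape [1, seq_len]
with torch.no_grad():
    logits, cache = model.run_with_cache(tokens)  # cache holds all activations

# The final-position logits are the model's next-token distribution;
# the logit-trace and entropy probes are built on top of this.
probs = logits[0, -1].softmax(dim=-1)
top_p, top_idx = probs.topk(5)
for p, i in zip(top_p.tolist(), top_idx.tolist()):
    print(f"{model.to_single_str_token(i)!r}: {p:.3f}")
```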
Evidence and Results
The table below shows the full set of experiment results.
Here, L_clean and L_patched show that the model did not even register the change in positions; this is evidence of possible permutation invariance.
Note: If the model stored time/order, the "Scrambled" brain state would have conflicted with the "Logical" input, causing the score to drop. It didn't.
| Subtopic | Prompt |
| --- | --- |
| state_tracking | The coin is in the red jar. The red jar is in the blue crate. The blue crate is in the garage. Question: What room is the coin in? |
| state_tracking | The letter is in the envelope. The envelope is in the backpack. The backpack is in the library. Question: What building is the letter in? |
| arithmetic_dependency | A is 3. B is 2. C is A plus B. Question: What is C? |
| arithmetic_dependency | Value 1 is 6. Value 2 is 1. Total is Value 1 minus Value 2. Question: What is Total? |
| counter_intuitive | Assume that glass is stronger than steel. I hit a glass wall with a steel hammer. Question: What breaks? |
| symbolic_logic | All Blips are Blops. All Blops are Blups. Question: Is a Blip a Blup? |
| distractor_filtering | The capital of France is Paris. Use the following numbers: 4 and 2. Ignore the text about France. Question: Add the numbers. |
| multi_hop_run | Story: The Blue Diamond is placed inside the Silver Safe. The Silver … [80 words] |
| multi_hop_run | System Manual:\n1. If the System Status is ON, then Parameter…[100 words] |
Below are the logit values for the other prompts: interchange_score, L_clean, and L_patched.

[Table: interchange_score | L_clean | L_patched for each prompt]
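The L_clean vs. L_patched comparison can be reproduced with a standard activation-patching setup. The sketch below is an assumption about the mechanics (patching the residual stream at one illustrative mid-layer, on an arithmetic_dependency prompt and a scrambled permutation of equal token length); the exact hook points and score definition used in the original runs are not shown here.

```python
# Sketch of the interchange / patching probe: run the logical ("clean")
# prompt, but overwrite one layer's residual stream with activations cached
# from the scrambled prompt. Layer choice, prompts, and the score definition
# are illustrative assumptions.
import torch
from functools import partial
from transformer_lens import HookedTransformer, utils

model = HookedTransformer.from_pretrained("meta-llama/Llama-2-7b-hf")

clean_prompt = "A is 3. B is 2. C is A plus B. Question: What is C?"
scrambled_prompt = "C is A plus B. B is 2. A is 3. Question: What is C?"
answer_token = model.to_tokens(" 5", prepend_bos=False)[0, 0]

clean_tokens = model.to_tokens(clean_prompt)           # same sentences, so the
scrambled_tokens = model.to_tokens(scrambled_prompt)   # token counts should match

# Cache the "scrambled brain state".
_, scrambled_cache = model.run_with_cache(scrambled_tokens)

def patch_resid(resid, hook, cache):
    # Replace the clean run's residual stream with the scrambled one.
    return cache[hook.name]

layer = 16  # illustrative mid-layer
hook_name = utils.get_act_name("resid_post", layer)

with torch.no_grad():
    clean_logits = model(clean_tokens)
    patched_logits = model.run_with_hooks(
        clean_tokens,
        fwd_hooks=[(hook_name, partial(patch_resid, cache=scrambled_cache))],
    )

L_clean = clean_logits[0, -1, answer_token].item()
L_patched = patched_logits[0, -1, answer_token].item()
interchange_score = L_patched / L_clean  # ≈ 1.0 would indicate permutation invariance
print(L_clean, L_patched, interchange_score)
```

If the model encoded the order of the statements, injecting the scrambled state into the logical input should depress the answer logit; a patched logit close to the clean one is the permutation-invariance signal discussed above.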
When you ask an AI to "think step-by-step," it usually generates the correct answer. However, this analysis proves it does not solve the problem the way a human does.
You see the AI write, "A is 5, B is 2, so A+B is 7," and assume it did the math at step 3. In fact, the AI knew the answer was "7" before it wrote the first word of the explanation.
Below, I explain the metrics and observations:
Logic requires order (A→B→C). If the AI were reasoning, swapping the order would confuse it. It didn't. The AI treats the text as a soup of keywords, not a logical sequence.
Further Explanation of the Graphs
Entropy
| Observation | Interpretation |
| --- | --- |
| Steps 1–13 & 16+: entropy ≈ 0 | Simple token copying |
| Step 14: single entropy spike (~0.013) | "Update / Trigger" introduces surprise |
| Blue = Orange (perfect overlap) | Event retrieval cost is identical regardless of order |
| No secondary spike | No timeline reconstruction |
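The per-step entropy curve can be approximated with a simple greedy-decoding loop that logs the entropy of each next-token distribution. In the sketch below the prompt, the "step by step" suffix, and the 30-step horizon are illustrative assumptions; the original runs compared a logical and a scrambled prompt side by side.

```python
# Sketch of the per-step entropy trace: greedily generate CoT tokens and log
# the entropy of each next-token distribution.
import torch
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("meta-llama/Llama-2-7b-hf")

prompt = ("The coin is in the red jar. The red jar is in the blue crate. "
          "The blue crate is in the garage. Question: What room is the coin in? "
          "Let's think step by step.")
tokens = model.to_tokens(prompt)

entropies = []
for step in range(30):
    with torch.no_grad():
        logits = model(tokens)[0, -1]
    probs = logits.softmax(dim=-1)
    entropy = -(probs * probs.clamp_min(1e-12).log()).sum()
    entropies.append(entropy.item())
    next_token = logits.argmax().view(1, 1)   # greedy decoding
    tokens = torch.cat([tokens, next_token], dim=1)

# Near-zero entropy at every step, with at most one spike at an "update"
# token, is the signature described in the table above.
print([round(e, 4) for e in entropies])
```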
Logit Trace
| Observation | Interpretation |
| --- | --- |
| Confidence crash due to new uncertainty | Trigger causes uncertainty |
| Same semantic circuit used | Scrambling adds no difficulty |
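A minimal logit-trace sketch is shown below: it tracks the logit and rank of the eventual answer token at every CoT step, which is how "early knowledge" of the answer shows up. The answer string (" garage") and the prompt are assumptions for illustration.

```python
# Sketch of the logit-trace probe: track the logit and rank of the eventual
# answer token at every generation step.
import torch
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("meta-llama/Llama-2-7b-hf")

prompt = ("The coin is in the red jar. The red jar is in the blue crate. "
          "The blue crate is in the garage. Question: What room is the coin in? "
          "Let's think step by step.")
tokens = model.to_tokens(prompt)
answer_token = model.to_tokens(" garage", prepend_bos=False)[0, 0]

for step in range(30):
    with torch.no_grad():
        logits = model(tokens)[0, -1]
    answer_logit = logits[answer_token].item()
    # rank 0 means the answer is already the top-1 prediction
    answer_rank = (logits > logits[answer_token]).sum().item()
    print(f"step {step:2d}: logit={answer_logit:6.2f} rank={answer_rank}")
    tokens = torch.cat([tokens, logits.argmax().view(1, 1)], dim=1)
```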
Attention
| Phase | Attention Pattern | Interpretation |
| --- | --- | --- |
| Phase 1 (steps 1–13) | High prompt attention (>0.8) | Pure retrieval |
| Phase shift (step 14) | Prompt ↓, local ↑ | New info integration |
| Phase 2 (steps 17+) | Prompt ≈ local (~0.4) | New equilibrium |
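The prompt-vs-local attention split can be measured per generation step roughly as in the sketch below; averaging over all layers and heads, the 20-step horizon, and the prompt are illustrative assumptions.

```python
# Sketch of the attention probe: for each newly generated CoT token, measure
# how much attention mass (averaged over layers and heads) lands on the prompt
# vs. on the generated tokens. Caching every activation per step is wasteful
# but acceptable for a sketch.
import torch
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("meta-llama/Llama-2-7b-hf")

prompt = ("The coin is in the red jar. The red jar is in the blue crate. "
          "The blue crate is in the garage. Question: What room is the coin in? "
          "Let's think step by step.")
tokens = model.to_tokens(prompt)
prompt_len = tokens.shape[1]

for step in range(20):
    with torch.no_grad():
        logits, cache = model.run_with_cache(tokens)
    # Per-layer attention patterns: [n_layers, n_heads, q_pos, k_pos]
    patterns = torch.stack(
        [cache["pattern", layer][0] for layer in range(model.cfg.n_layers)]
    )
    last_row = patterns[:, :, -1, :]          # attention from the newest position
    prompt_mass = last_row[:, :, :prompt_len].sum(-1).mean().item()
    local_mass = last_row[:, :, prompt_len:].sum(-1).mean().item()
    print(f"step {step:2d}: prompt={prompt_mass:.2f} local={local_mass:.2f}")
    tokens = torch.cat([tokens, logits[0, -1].argmax().view(1, 1)], dim=1)
```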
Interpretation: The identical entropy, logit, and attention patterns prove the model is not simulating time or causality. It retrieves and integrates tokens via pattern matching, consistent with autoregressive next-token prediction, not reasoning.
Conclusion
Chain-of-Thought is a Self-Induced Prompt Engineering mechanism. The model generates text to fill its context window with relevant associations, "stuffing" the memory until the correct answer is the only statistically possible output. It is associative retrieval (like a search engine), not logical deduction (like a calculator).