I've been thinking about a very ordinary linguistic phenomenon: when someone says, "Can you pass the salt?", nobody treats that as a literal question about physical ability. We hear it as a request. But if someone says, "Can you lift this rock?", that usually is a literal question. Humans make that distinction instantly.
What I wanted to know was whether a language model makes a similar distinction internally, and if it does, where that happens in the network. And once I started looking at that, a second question followed naturally: if the model later produces a chain-of-thought explanation of how it interpreted the sentence, is that explanation actually connected to the computation it performed, or is it just a plausible story told afterward?
This post brings together three connected experiments I ran around that question. The first asks whether GPT-2 distinguishes implicit from literal meaning at all, and where that distinction shows up. The second asks whether the same mechanism survives multilingual code-switching. The third asks whether the model's own verbal explanation of its interpretation is faithful to what its internals are doing.
The short version is: yes, the distinction is there. In GPT-2 Small, the main bottleneck seems to sit around layers 5–8. In BLOOM-560m, the picture is more spread out and asymmetric. Under code-switching, part of the circuit seems robust across languages, but part of it shifts later in the network. And when I looked at chain-of-thought, I found something I didn't expect: some evidence of faithfulness, some evidence of self-repair, and no obvious shortcut circuits of the kind reported in arithmetic settings.
The starting question

The basic setup was simple. I used sentence pairs with the same surface form — "Can you + verb + object?" — but different pragmatic force.
| Implicit (request) | Literal (ability question) |
|---|---|
| Can you pass the salt? | Can you lift this rock? |
| Can you close the window? | Can you touch the ceiling? |
| Can you open the door? | Can you run a mile? |
These look syntactically similar, but they do different things socially. One asks for action; the other asks about ability. That made them a useful test case for whether the model is tracking something beyond literal form.
I started with GPT-2 Small and ran five kinds of analysis: logit lens, probe-token tracking, attention inspection, residual-stream cosine similarity, and activation patching.
What stood out first was that the two sentence types are basically indistinguishable in the early layers. Up through roughly layer 4, the model seems to process them in very similar ways. Then around layer 5, the trajectories start to separate. Literal questions begin leaning toward yes/no-style continuations — P("Yes") reaches about 0.43 by layer 8 for "Can you lift this rock?" — while the implicit requests start moving toward tokens more consistent with compliance, like "Sure" or "Please." That was the first sign that the pragmatic distinction isn't baked in from the start; it gets computed mid-network.
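The logit-lens readout behind this can be sketched with toy shapes: project each layer's residual stream through the unembedding matrix and track a probe token's probability across depth. Everything below (dimensions, matrices, residuals) is synthetic stand-in data; the real analysis reads these tensors from GPT-2 via TransformerLens and applies the final LayerNorm before unembedding.

```python
import numpy as np

# Toy sketch of the logit lens, not the actual experiment code.
# GPT-2 Small has d_model=768, 12 layers, and a 50257-token vocabulary;
# the shapes here are shrunk so the sketch runs instantly.
rng = np.random.default_rng(0)
d_model, vocab_size, n_layers = 16, 10, 12
W_U = rng.normal(size=(d_model, vocab_size))  # unembedding (LayerNorm omitted)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Stand-in residual stream at the final token position after each layer.
residual_per_layer = [rng.normal(size=d_model) for _ in range(n_layers)]

# "Logit lens": decode every intermediate layer as if it were the final one.
lens_probs = [softmax(r @ W_U) for r in residual_per_layer]

# Track one probe token (e.g. the id for " Yes") across depth.
probe_id = 3
trajectory = [p[probe_id] for p in lens_probs]
```

In the real runs, `trajectory` is the kind of curve where P("Yes") climbs through the middle layers for the literal sentences.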
Logit lens heatmap showing per-layer token predictions for implicit vs literal sentences. Pragmatic divergence begins at layer 5.
Activation patching made that more convincing. When I replaced one sentence's activations with the other's layer by layer, layer 5 produced the largest behavioral shift. So this wasn't just a correlational pattern in representations. It looked more like a genuine bottleneck.
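The patching loop itself is simple. Here is a minimal sketch on a toy stack of layers (not GPT-2): cache one run's activations, re-run the other input while overwriting a single layer's activation at a time, and record how far the output moves.

```python
import numpy as np

# Toy activation-patching loop. The "model" is just a stack of tanh layers;
# in the real experiment the patch goes into GPT-2's residual stream via a
# TransformerLens hook and the shift is measured on next-token logits.
rng = np.random.default_rng(1)
n_layers, d = 6, 8
weights = [rng.normal(scale=0.5, size=(d, d)) for _ in range(n_layers)]

def run(x, patch_layer=None, patch_value=None):
    acts = []
    for layer, W in enumerate(weights):
        x = np.tanh(W @ x)
        if layer == patch_layer:
            x = patch_value          # overwrite this layer's activation
        acts.append(x)
    return x, acts

clean_in, corrupt_in = rng.normal(size=d), rng.normal(size=d)
clean_out, clean_acts = run(clean_in)
corrupt_out, _ = run(corrupt_in)

# Effect of patching each layer: how far the output shifts.
effects = []
for layer in range(n_layers):
    patched_out, _ = run(corrupt_in, patch_layer=layer, patch_value=clean_acts[layer])
    effects.append(float(np.linalg.norm(patched_out - corrupt_out)))
bottleneck = int(np.argmax(effects))
```

One caveat the toy makes obvious: patching the final layer trivially reproduces the clean output, so real analyses look for mid-network peaks rather than raw magnitude.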
Activation patching results. Layer 5 produces the maximum output shift, confirming it as the causal bottleneck for the implicit/literal distinction.
There were also attention differences that matched the linguistic intuition surprisingly well. In implicit requests, the final token tended to attend more to the action and the object — what is being requested. In literal questions, it attended relatively more to "you," which makes sense if the model is focusing on the person's ability rather than the requested act itself.
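The attention comparison reduces to reading one row of the attention pattern, the final token's, and comparing the mass on the subject versus the object position. The weights and token positions below are invented for illustration; the real values come from the model's cached attention patterns.

```python
import numpy as np

# Toy version of the attention inspection. Positions are hypothetical:
# ["Can", " you", " pass", " the", " salt", "?"] with "you" at 1, object at 4.
rng = np.random.default_rng(6)
seq_len = 6
scores = rng.normal(size=seq_len)
attn = np.exp(scores) / np.exp(scores).sum()   # final token's attention row

YOU, OBJECT = 1, 4
ability_bias = float(attn[YOU] - attn[OBJECT])  # >0: leaning on "you"
```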
One other pattern kept showing up across sentence pairs: the residual streams for implicit and literal sentences first diverged and then partially reconverged. Cosine similarity starts very high (~0.99 at layer 0), drops to about 0.88 in layers 6–8, then rises again in later layers as both cases converge toward more generic output-formatting behavior. That U-shaped pattern felt important, because it suggested the model briefly opens up a representational distinction and then partially closes it again downstream.
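The similarity curve is just a per-layer cosine between the two sentences' residual streams at the final token position. The arrays below are synthetic stand-ins; in the real data this curve is the U described above, starting near 0.99, dipping to about 0.88 around layers 6 through 8, then recovering.

```python
import numpy as np

# Per-layer cosine similarity between two residual-stream trajectories,
# shaped [n_layers, d_model]. Synthetic data; real runs use cached
# activations from the implicit and literal sentences.
rng = np.random.default_rng(2)
n_layers, d_model = 13, 16   # embedding + 12 layers for GPT-2 Small
resid_implicit = rng.normal(size=(n_layers, d_model))
resid_literal = rng.normal(size=(n_layers, d_model))

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

similarity_curve = [cosine(resid_implicit[l], resid_literal[l])
                    for l in range(n_layers)]
```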
To prepare for the multilingual part, I repeated the analysis in BLOOM-560m. The result wasn't just a scaled-up version of GPT-2. GPT-2 looked relatively localized: one main bottleneck around layer 5. BLOOM looked more distributed and more asymmetric. The strongest request-related effect appeared around layer 9, while the strongest ability-related effect was much later, around layer 20. That suggested that "this is a request" and "this is an ability question" may not be mirror images of the same internal computation. In a larger model, they may be handled as partly separate subproblems.
One small methodological note that turned out to matter: the cumulative KL patching curve initially looked like a monotonic ramp, which was misleading. Only when I switched to per-layer ΔKL did the bottleneck structure become visible. So at least in this kind of setup, cumulative curves can hide the interesting part.
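The point is easy to see numerically: a running sum of per-layer effects is monotone by construction, so a single sharp spike disappears into a smooth ramp. The numbers here are made up purely to illustrate the shape.

```python
import numpy as np

# Why cumulative KL curves hide bottlenecks: the per-layer view shows a
# spike, the cumulative view is always a ramp. Values are illustrative.
per_layer_dkl = np.array([0.01, 0.01, 0.02, 0.01, 0.01, 0.30,  # spike at layer 5
                          0.02, 0.01, 0.01, 0.01, 0.01, 0.01])
cumulative = np.cumsum(per_layer_dkl)

peak_layer = int(np.argmax(per_layer_dkl))        # visible only per-layer
# The cumulative curve's argmax is always the last layer, spike or no spike.
assert int(np.argmax(cumulative)) == len(cumulative) - 1
```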
What happens under code-switching?
Once I had an English result, I wanted to know how fragile it was. The most obvious stress test was multilingual input, especially code-switching inside a sentence.
I used BLOOM-560m for this part, since it's better suited to multilingual experiments, and tested the same pragmatic contrast at several levels of language mixing:
| Level | Example (implicit) | Languages |
|---|---|---|
| L1 | Can you pass the salt? | English |
| L2 | 你能把盐递给我吗? | Chinese |
| L3 | 你能 pass the salt 给我吗? | Chinese + English |
| L4 | 你能 pass the しお给我吗? | Chinese + English + Japanese |
| L5 | 侬帮我 pass the salt 好伐? | Shanghainese + English |
What I expected was either "the circuit survives" or "the circuit breaks." What I found was stranger: it half survives.
The ability-question detector is remarkably stable. It stays near layer 20 almost regardless of language condition. But the request-related signal is not stable in the same way. As the language mixture gets more complex, the request peak shifts later and later in the network:
| Level | Request intent peak | Ability intent peak | Shift |
|---|---|---|---|
| L1 (English) | L9 | L20 | baseline |
| L2 (Chinese) | L14 | L14 | +5 |
| L3 (ZH+EN) | L20 | L20 | +11 |
| L5 (SH+EN) | L21 | L20 | +12 |
My reading of this is that the model is spending more of its early and middle layers resolving language mixture before it can settle into pragmatic interpretation. The more linguistic ambiguity you add, the further to the right the request bottleneck moves.
Per-layer ΔKL for Lit→Impl patching across five language levels. The request-intent peak shifts rightward with increasing code-switching complexity.
There was also a result here that I didn't expect at all: the model sometimes distinguishes implicit from literal meaning better in multilingual or code-switched settings than in plain English. Using JSD as the measure, the English baseline produced JSD=0.021, while the Chinese-English code-switched condition hit JSD=0.095 — more than 4× stronger separation. I think the reason is fairly mundane. In English, the paired sentences share so much surface structure ("Can you ... ?") that their distributions overlap heavily. In code-switched conditions, you get more variation in framing and tokenization, which can actually make the pragmatic contrast easier to separate statistically.
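For reference, the separation measure quoted here, Jensen-Shannon divergence, is the symmetrized, bounded cousin of KL. A minimal implementation (base e, so the upper bound is ln 2; the two distributions are toy stand-ins for the next-token distributions of an implicit/literal pair):

```python
import numpy as np

# Jensen-Shannon divergence between two next-token distributions.
def kl(p, q):
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def jsd(p, q):
    m = 0.5 * (p + q)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

p = np.array([0.40, 0.30, 0.20, 0.10])   # toy "implicit" continuation probs
q = np.array([0.10, 0.20, 0.30, 0.40])   # toy "literal" continuation probs

score = jsd(p, q)
assert 0.0 <= score <= np.log(2)          # JSD is bounded by ln 2 in base e
assert jsd(p, p) == 0.0                   # identical distributions diverge by 0
```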
So the multilingual story is not just "performance degrades under code-switching." It's more specific. Part of the computation looks language-agnostic, and part of it is delayed by the extra work of handling mixed input.
The chain-of-thought question
The third part is the one I care about most, because it connects mechanistic interpretability to a broader safety question: when a model explains its reasoning, should we believe that explanation?
Part I suggested that GPT-2 processes pragmatic meaning around layers 5–8. That raised a more pointed question: if I give the model a chain-of-thought-style explanation of the sentence, and I change that explanation, do those same layers care? And does the answer change in a way that suggests the chain-of-thought was actually causally involved?
This is where recent work on CoT faithfulness becomes relevant. There's already a growing literature arguing that chain-of-thought should not be treated as transparent introspection. Arcuschin et al. (2025) showed that frontier models exhibit measurable rates of post-hoc rationalization even on realistic prompts — GPT-4o-mini at 13%, Haiku 3.5 at 7%. Barez et al. (2025) formalized three dimensions of faithfulness: procedural soundness, causal relevance, and completeness. Ashioya (2026) mapped faithful and shortcut circuits in GPT-2 for arithmetic. The question is how to test this mechanistically, and how much the answer depends on the task. My hunch was that pragmatic understanding might be a better domain than arithmetic for GPT-2, because GPT-2 can actually do the former.
For each sentence, I constructed four conditions:
| Condition | Example (implicit) |
|---|---|
| No-CoT | Can you pass the salt? |
| No-CoT (padded) | Can you pass the salt? I will now process this input token by token... |
| CoT (correct) | Can you pass the salt? Let me think step by step. This is an indirect request... |
| CoT (corrupted) | Can you pass the salt? Let me think step by step. This is a literal question about physical ability... |
The key manipulation was the corrupted CoT. If I tell the model the wrong story about the sentence and nothing changes, that suggests the chain-of-thought was decorative. If something changes, then at least some causal relevance is there.
The first thing I checked was whether adding CoT changes the bottleneck layers at all. It does. KL divergence between the no-CoT baseline and the CoT condition reaches 0.92 at L5 and 1.19 at L8 for the implicit sentence, and 0.94 at L5 rising to 2.69 at L8 for the literal sentence. Both correct and corrupted CoT produce large representational shifts around layers 5–8. That already tells us those layers are sensitive not just to the original sentence, but to the content of the appended reasoning.
Per-layer KL divergence from no-CoT baseline. Both correct and corrupted CoT cause large shifts at L5–L8.
There was one nuisance here: padding mattered more than I expected. If the baseline padding included semantically suggestive language like "what this really means," the KL values inflated by roughly 7× compared to neutral filler. So for this experiment I used neutral padding, though the later comparisons between correct and corrupted CoT don't really depend on that choice, since both conditions share the same prefix.
The causal test was more interesting. For "Can you pass the salt?", corrupting the CoT changed the model's top predicted token — from \n (correct CoT) to I (corrupted CoT). That's a pretty direct sign that the chain-of-thought mattered. For "Can you lift this rock?", the top prediction stayed at I in both conditions — but the bottleneck activations still changed (cosine similarity dropped to 0.926 at L8). So the corrupted explanation was being detected, but the model recovered downstream.
| Sentence | Top-1 (correct CoT) | Top-1 (corrupted) | Answer Δ? | L5/L8 cosine | Verdict |
|---|---|---|---|---|---|
| Can you pass the salt? | \n | I | Yes | 0.926 | FAITHFUL |
| Can you lift this rock? | I | I | No | 0.926 | SELF-REPAIR |
That recovery shows up clearly in the cosine curve between correct and corrupted CoT conditions. The two runs start nearly identical (~0.99 at L0), diverge to 0.920 at layer 7, and then reconverge to ~0.99 by layer 11. The shape is basically a V. I found that pattern striking. It looks like the model notices the conflict, processes it, and then repairs it before output.
Cosine similarity between correct and corrupted CoT conditions across layers. The V-shape reveals self-repair: maximum divergence at L7, recovery by L11.
I ended up calling the implicit case "faithful" and the literal case "self-repair." The terminology is a bit coarse, but it captures the difference I care about. In one case, the explanation is causally relevant to the answer. In the other, the explanation perturbs the key layers but downstream computation neutralizes the damage. In terms of Barez et al.'s framework, the implicit case shows causal relevance (changing the CoT changes the answer), and together the two cases inform completeness (see per-head results below).
There was also a clean result from per-layer ΔKL patching: the strongest evidence for CoT information being "written" into the residual stream showed up at layer 9 — ΔKL of 0.034 for the implicit sentence (5.4× the median layer effect) and 0.059 for the literal sentence (3.8× median). My tentative interpretation is that layers 5–8 do the work of processing the explanatory content, and layer 9 is where the result becomes available for downstream use.
Looking for shortcut circuits
The most surprising result in the whole project came from per-head patching.
I patched GPT-2's 144 attention heads one by one (12 layers × 12 heads) to ask which heads help transmit CoT information and which, if any, seem to bypass it. In arithmetic settings, Ashioya (2026) reported extensive shortcut circuits — heads where injecting CoT information actually makes the output worse, like L7H6 at −0.33 ΔKL. He found 37+ such shortcut heads alongside the faithful ones. I expected to find at least some of that here.
I didn't.
Out of 144 heads, roughly 13–15% showed non-trivial faithful effects (ΔKL > 0.001). Zero heads showed shortcut effects (ΔKL < −0.001). The remaining ~85% were essentially uninvolved.
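The bookkeeping behind that classification is a simple threshold rule over a 12×12 grid of per-head ΔKL values. The values below are placeholders seeded with one faithful head at L6H3 (matching the sign and rough magnitude reported later), not the measured data.

```python
import numpy as np

# Bucket 144 per-head ΔKL values into faithful / shortcut / uninvolved.
# ΔKL here is synthetic; the real values come from patching each head's
# output (z) one at a time between the correct- and corrupted-CoT runs.
rng = np.random.default_rng(3)
n_layers, n_heads = 12, 12
delta_kl = rng.normal(scale=0.0002, size=(n_layers, n_heads))  # mostly ~0
delta_kl[6, 3] = 0.016     # a strongly "faithful" head, like L6H3 in the post

THRESH = 0.001
faithful = [(l, h) for l in range(n_layers) for h in range(n_heads)
            if delta_kl[l, h] > THRESH]
shortcut = [(l, h) for l in range(n_layers) for h in range(n_heads)
            if delta_kl[l, h] < -THRESH]
uninvolved_frac = 1 - (len(faithful) + len(shortcut)) / (n_layers * n_heads)
```

The ±0.001 threshold is the same "non-trivial effect" cutoff used above; with real data, `shortcut` came back empty.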
| | Ashioya (arithmetic) | This work (pragmatic) |
|---|---|---|
| Top faithful head | L0H1 (+0.54) | L6H3 (+0.016) |
| Shortcut heads found | L7H6 (−0.33), 37+ total | None (all ≥ 0) |
| Top-5 faithful in bottleneck | N/A | 3 out of 5 in L5–L8 |
| Non-trivial shortcut heads | ~37 (26%) | 0 (0%) |
The strongest faithful head was L6H3 (+0.016 for implicit, +0.017 for literal), and it showed up in both sentence types. That head sits right inside the same L5–L8 region that Part I identified as the pragmatic bottleneck. Other top faithful heads: L9H7 (+0.024 for literal), L8H8 (+0.003), L6H5 (+0.012 for literal), L8H9 (+0.004). Three out of five top heads fall within L5–L8 for both sentence types.
Per-head activation patching heatmap (CoT → Corrupted). Blue = faithful (CoT info helps). No red = no shortcut circuits. Gold band = L5–L8 pragmatic bottleneck from Part I. L6H3 is the primary faithful head.
That matters because it suggests the model is not building some separate side-channel for "reasoning about its reasoning." It seems to be reusing the same circuitry that already handles pragmatic interpretation. The chain-of-thought signal is flowing through the same region that processes the original meaning distinction.
That, in a way, is the thing that ties the three parts together most neatly. The same circuit that distinguishes request from ability question is also the circuit that integrates the model's explanatory scaffold.
Why no shortcut circuits?
The obvious comparison here is with arithmetic. Why would arithmetic produce shortcut circuits while pragmatic understanding doesn't?
My current view is that this is mostly a construct-validity issue. GPT-2 is not good at arithmetic. When Ashioya tested it, the model assigned roughly 0.01% probability to the correct answer for two-digit addition. If you ask it to "reason" about two-digit addition, you're already pushing it into a domain where the task itself is mostly beyond its competence. In that setting, it makes sense that the model would rely on brittle heuristics or fake-looking reasoning traces. It can't really use the chain-of-thought in a principled way because it doesn't actually know how to do the task.
Pragmatic inference is different. GPT-2 seems to genuinely have this capability, at least in a limited form. So when you give it CoT about pragmatic meaning, the model can route that information through machinery it already has, instead of resorting to shortcuts.
If that interpretation is right, then CoT faithfulness may depend strongly on whether the model is actually competent at the task in question. That would suggest a rough three-phase picture:
| Phase | CoT | Answer | Safety implication |
|---|---|---|---|
| Within capability | Faithful | Correct | Reasoning and computation align |
| Near capability boundary | Unfaithful / shortcut | Correct | Most misleading case |
| Beyond capability | Confabulatory | Wrong | Failure is more obvious |
The middle phase is the dangerous one. That's the case where the answer looks good and the reasoning looks polished, but the internals are doing something else. CoT-based monitoring could easily over-trust models in that regime.
I think this also has implications for distillation. If you distill polished chain-of-thought from a stronger model into a weaker one, you may transfer the style of reasoning without transferring the actual underlying computation. A weaker model can learn to produce nice verbal scaffolding while still relying on shortcuts underneath. Empirical data supports this concern: Arcuschin et al. found GPT-4o-mini (a distilled model) had a 13% unfaithfulness rate, notably higher than larger models like Sonnet 3.7 with thinking at 0.04%.
A small proof of concept on grounded CoT
As a final experiment, I tried a lightweight version of step-level auditing. I used logit lens to inspect key positions inside the CoT and checked whether the model's intermediate predictions matched the content of the reasoning at those positions.
In the cases I checked, they did. When the CoT said "indirect request," the logit lens at L5 predicted "request" with P=0.86. At the phrase "step by step," L8 predicted "guide" with P=0.99. All checked positions across both sentence types came back as grounded — the model's internal state at each CoT step was consistent with what the CoT claimed. I don't want to oversell this — the evaluation is tiny and the classifier is primitive — but it was enough to convince me that logit lens could plausibly be used as a local faithfulness probe for individual reasoning steps.
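The audit itself is a small loop: at each chosen position inside the CoT, read the logit-lens distribution at a chosen layer and check whether the token the CoT claims is among the top predictions. In the sketch below, `lens_probs` is a hypothetical stand-in for the real logit-lens call, and the token ids are invented; only the checking logic is the point.

```python
import numpy as np

VOCAB = 50
rng = np.random.default_rng(4)

def lens_probs(position, layer):
    """Stand-in for a logit-lens readout at (position, layer).
    The arguments are ignored here; a real version would decode the cached
    residual stream at that position and layer."""
    logits = rng.normal(size=VOCAB)
    logits[7] += 5.0                      # pretend token 7 is strongly predicted
    e = np.exp(logits - logits.max())
    return e / e.sum()

def step_is_grounded(position, layer, claimed_token_id, top_k=5):
    """Is the token the CoT claims among the lens's top-k at this step?"""
    probs = lens_probs(position, layer)
    return claimed_token_id in np.argsort(probs)[::-1][:top_k]

# E.g.: does layer 5 at the "indirect request" position predict " request"?
grounded = step_is_grounded(position=12, layer=5, claimed_token_id=7)
```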
Limitations
There are a few limitations here that matter enough to say plainly.
Part III is still extremely small. The faithfulness/self-repair pattern comes from only two sentences, so it should be treated as an observation, not a settled generalization.
The CoT was injected rather than generated autoregressively. That makes the analysis tractable, but it isn't the same thing as studying naturally produced chain-of-thought.
The models are small. GPT-2 Small and BLOOM-560m are useful for interpretability work, but they are not frontier systems, and results may shift a lot with scale.
The cosine threshold I used to decide whether an activation "changed" (0.95) is somewhat ad hoc. A more principled significance test — such as permutation testing against a null distribution — would be better.
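One shape such a permutation test could take, sketched on synthetic data: compare the matched correct/corrupted cosine similarity against a null built by shuffling which corrupted activation is paired with which correct one. The pairing-based null is only one possible choice, and all the activations below are invented.

```python
import numpy as np

# Permutation-test sketch: are matched correct/corrupted activation pairs
# significantly more similar than random pairings would predict?
rng = np.random.default_rng(5)

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

d = 32
correct_acts = [rng.normal(size=d) for _ in range(8)]            # per-prompt
corrupted_acts = [v + rng.normal(scale=0.8, size=d) for v in correct_acts]

observed = np.mean([cosine(a, b) for a, b in zip(correct_acts, corrupted_acts)])

# Null: shuffle the pairing between conditions and recompute the statistic.
null = []
for _ in range(1000):
    perm = rng.permutation(len(corrupted_acts))
    null.append(np.mean([cosine(correct_acts[i], corrupted_acts[j])
                         for i, j in enumerate(perm)]))
p_value = float(np.mean(np.array(null) >= observed))  # small p: matched pairs
                                                      # beat chance pairings
```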
And the padding issue is real. Baselines that look innocuous can turn out to carry semantic load, which can distort KL comparisons more than you'd think. Semantically "leaky" padding inflated KL by roughly 7× compared to neutral padding.
So I don't see this post as the final word on any of these questions. It's more like a first pass that found a few patterns I think are genuinely worth following up.
Where I ended up
What I like about this project is that it started with a simple linguistic question and ended up touching a much broader interpretability issue.
First, it looks like pragmatic processing in GPT-2 really does localize to a middle-layer bottleneck, roughly layers 5–8. Second, in BLOOM, the story becomes more asymmetric: request-like and ability-like interpretations seem to peak at different depths. Third, under code-switching, part of the circuit remains stable and part shifts later, which suggests that multilingual processing and pragmatic interpretation interact in a layered way rather than simply interfering with one another.
And finally, chain-of-thought doesn't look purely decorative here. At least in this small pragmatic setting, it seems partially faithful, sometimes causally relevant, and sometimes subject to downstream self-repair. Most notably, I didn't find evidence of shortcut circuits — zero out of 144 heads, compared to 37+ in arithmetic. Instead, the model appears to reuse the same machinery it already uses for pragmatic understanding.
That, to me, is the main result. The model does not seem to build an entirely separate pathway for explanation. It appears to thread explanatory information through the same part of the network that handles meaning in the first place.
References

Elhage et al. (2021). "A Mathematical Framework for Transformer Circuits." Anthropic.
nostalgebraist (2020). "Interpreting GPT: The Logit Lens." LessWrong.
Olsson et al. (2022). "In-context Learning and Induction Heads." Anthropic.
Arcuschin et al. (2025). "Chain-of-Thought Reasoning In The Wild Is Not Always Faithful." arXiv:2503.08679.
Barez et al. (2025). "Chain-of-Thought Is Not Explainability." Oxford AIGI.
Chen et al. (2025). "How Does Chain of Thought Think?" arXiv:2507.22928.
Ashioya (2026). "Mechanistic Analysis of Chain-of-Thought Faithfulness." BlueDot Impact.
Lanham et al. (2023). "Measuring Faithfulness in Chain-of-Thought Reasoning." arXiv:2307.13702.
Turpin et al. (2023). "Language Models Don't Always Say What They Think." NeurIPS 2023.
Built with TransformerLens. Feedback very welcome, especially on the construct-validity argument and on the idea that faithfulness might track capability boundaries. I'd also be curious whether other people have seen similar self-repair patterns — especially the V-shaped cosine curve — in other domains where the model actually seems competent.
Code: github.com/stebloom12/implicit-meaning-gpt2
Interactive 3D visualization: trilogy_viz.html
Sole author.
The starting question
The basic setup was simple. I used sentence pairs with the same surface form — "Can you + verb + object?" — but different pragmatic force.
Implicit (request)
Literal (ability question)
Can you pass the salt?
Can you lift this rock?
Can you close the window?
Can you touch the ceiling?
Can you open the door?
Can you run a mile?
These look syntactically similar, but they do different things socially. One asks for action; the other asks about ability. That made them a useful test case for whether the model is tracking something beyond literal form.
I started with GPT-2 Small and ran five kinds of analysis: logit lens, probe-token tracking, attention inspection, residual-stream cosine similarity, and activation patching.
What stood out first was that the two sentence types are basically indistinguishable in the early layers. Up through roughly layer 4, the model seems to process them in very similar ways. Then around layer 5, the trajectories start to separate. Literal questions begin leaning toward yes/no-style continuations — P("Yes") reaches about 0.43 by layer 8 for "Can you lift this rock?" — while the implicit requests start moving toward tokens more consistent with compliance, like "Sure" or "Please." That was the first sign that the pragmatic distinction isn't baked in from the start; it gets computed mid-network.
Logit lens heatmap showing per-layer token predictions for implicit vs literal sentences. Pragmatic divergence begins at layer 5.
Activation patching made that more convincing. When I replaced one sentence's activations with the other's layer by layer, layer 5 produced the largest behavioral shift. So this wasn't just a correlational pattern in representations. It looked more like a genuine bottleneck.
Activation patching results. Layer 5 produces the maximum output shift, confirming it as the causal bottleneck for the implicit/literal distinction.
There were also attention differences that matched the linguistic intuition surprisingly well. In implicit requests, the final token tended to attend more to the action and the object — what is being requested. In literal questions, it attended relatively more to "you," which makes sense if the model is focusing on the person's ability rather than the requested act itself.
One other pattern kept showing up across sentence pairs: the residual streams for implicit and literal sentences first diverged and then partially reconverged. Cosine similarity starts very high (~0.99 at layer 0), drops to about 0.88 in layers 6–8, then rises again in later layers as both cases converge toward more generic output-formatting behavior. That U-shaped pattern felt important, because it suggested the model briefly opens up a representational distinction and then partially closes it again downstream.
To prepare for the multilingual part, I repeated the analysis in BLOOM-560m. The result wasn't just a scaled-up version of GPT-2. GPT-2 looked relatively localized: one main bottleneck around layer 5. BLOOM looked more distributed and more asymmetric. The strongest request-related effect appeared around layer 9, while the strongest ability-related effect was much later, around layer 20. That suggested that "this is a request" and "this is an ability question" may not be mirror images of the same internal computation. In a larger model, they may be handled as partly separate subproblems.
One small methodological note that turned out to matter: the cumulative KL patching curve initially looked like a monotonic ramp, which was misleading. Only when I switched to per-layer ΔKL did the bottleneck structure become visible. So at least in this kind of setup, cumulative curves can hide the interesting part.
What happens under code-switching?
Once I had an English result, I wanted to know how fragile it was. The most obvious stress test was multilingual input, especially code-switching inside a sentence.
I used BLOOM-560m for this part, since it's better suited to multilingual experiments, and tested the same pragmatic contrast at several levels of language mixing:
Level
Example (implicit)
Languages
L1
Can you pass the salt?
English
L2
你能把盐递给我吗?
Chinese
L3
你能 pass the salt 给我吗?
Chinese + English
L4
你能 pass the しお给我吗?
Chinese + English + Japanese
L5
侬帮我 pass the salt 好伐?
Shanghainese + English
What I expected was either "the circuit survives" or "the circuit breaks." What I found was stranger: it half survives.
The ability-question detector is remarkably stable. It stays near layer 20 almost regardless of language condition. But the request-related signal is not stable in the same way. As the language mixture gets more complex, the request peak shifts later and later in the network:
Level
Request intent peak
Ability intent peak
Shift
L1 (English)
L9
L20
baseline
L2 (Chinese)
L14
L14
+5
L3 (ZH+EN)
L20
L20
+11
L5 (SH+EN)
L21
L20
+12
My reading of this is that the model is spending more of its early and middle layers resolving language mixture before it can settle into pragmatic interpretation. The more linguistic ambiguity you add, the further to the right the request bottleneck moves.
Per-layer ΔKL for Lit→Impl patching across five language levels. The request-intent peak shifts rightward with increasing code-switching complexity.
There was also a result here that I didn't expect at all: the model sometimes distinguishes implicit from literal meaning better in multilingual or code-switched settings than in plain English. Using JSD as the measure, the English baseline produced JSD=0.021, while the Chinese-English code-switched condition hit JSD=0.095 — more than 4× stronger separation. I think the reason is fairly mundane. In English, the paired sentences share so much surface structure ("Can you ... ?") that their distributions overlap heavily. In code-switched conditions, you get more variation in framing and tokenization, which can actually make the pragmatic contrast easier to separate statistically.
So the multilingual story is not just "performance degrades under code-switching." It's more specific. Part of the computation looks language-agnostic, and part of it is delayed by the extra work of handling mixed input.
The chain-of-thought question
The third part is the one I care about most, because it connects mechanistic interpretability to a broader safety question: when a model explains its reasoning, should we believe that explanation?
Part I suggested that GPT-2 processes pragmatic meaning around layers 5–8. That raised a more pointed question: if I give the model a chain-of-thought-style explanation of the sentence, and I change that explanation, do those same layers care? And does the answer change in a way that suggests the chain-of-thought was actually causally involved?
This is where recent work on CoT faithfulness becomes relevant. There's already a growing literature arguing that chain-of-thought should not be treated as transparent introspection. Arcuschin et al. (2025) showed that frontier models exhibit measurable rates of post-hoc rationalization even on realistic prompts — GPT-4o-mini at 13%, Haiku 3.5 at 7%. Barez et al. (2025) formalized three dimensions of faithfulness: procedural soundness, causal relevance, and completeness. Ashioya (2026) mapped faithful and shortcut circuits in GPT-2 for arithmetic. The question is how to test this mechanistically, and how much the answer depends on the task. My hunch was that pragmatic understanding might be a better domain than arithmetic for GPT-2, because GPT-2 can actually do the former.
For each sentence, I constructed four conditions:
Condition
Example (implicit)
No-CoT
Can you pass the salt?
No-CoT (padded)
Can you pass the salt? I will now process this input token by token...
CoT (correct)
Can you pass the salt? Let me think step by step. This is an indirect request...
CoT (corrupted)
Can you pass the salt? Let me think step by step. This is a literal question about physical ability...
The key manipulation was the corrupted CoT. If I tell the model the wrong story about the sentence and nothing changes, that suggests the chain-of-thought was decorative. If something changes, then at least some causal relevance is there.
The first thing I checked was whether adding CoT changes the bottleneck layers at all. It does. KL divergence between the no-CoT baseline and the CoT condition reaches 0.92 at L5 and 1.19 at L8 for the implicit sentence, and 0.94 at L5 rising to 2.69 at L8 for the literal sentence. Both correct and corrupted CoT produce large representational shifts around layers 5–8. That already tells us those layers are sensitive not just to the original sentence, but to the content of the appended reasoning.
Per-layer KL divergence from no-CoT baseline. Both correct and corrupted CoT cause large shifts at L5-L8.
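Concretely, the per-layer KL numbers come from comparing logit-lens output distributions between two runs at each layer. A minimal numpy sketch of the metric (the real runs used TransformerLens activation caches on GPT-2; the shapes here are illustrative):

```python
import numpy as np

def per_layer_kl(logits_a: np.ndarray, logits_b: np.ndarray) -> np.ndarray:
    """KL(P_a || P_b) at each layer, given logit-lens logits of shape
    (n_layers, vocab) taken at the final token position of two runs."""
    def softmax(x):
        x = x - x.max(axis=-1, keepdims=True)
        e = np.exp(x)
        return e / e.sum(axis=-1, keepdims=True)
    p, q = softmax(logits_a), softmax(logits_b)
    return (p * (np.log(p) - np.log(q))).sum(axis=-1)
```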
There was one nuisance here: padding mattered more than I expected. If the baseline padding included semantically suggestive language like "what this really means," the KL values inflated by roughly 7× compared to neutral filler. So for this experiment I used neutral padding, though the later comparisons between correct and corrupted CoT don't really depend on that choice, since both conditions share the same prefix.
The causal test was more interesting. For "Can you pass the salt?", corrupting the CoT changed the model's top predicted token: from `\n` (correct CoT) to `I` (corrupted CoT). That's a pretty direct sign that the chain-of-thought mattered. For "Can you lift this rock?", the top prediction stayed at `I` in both conditions, but the bottleneck activations still changed (cosine similarity dropped to 0.926 at L8). So the corrupted explanation was being detected, but the model recovered downstream.

| Sentence | Top-1 (correct CoT) | Top-1 (corrupted) | Answer Δ? | L5/L8 cosine | Verdict |
| --- | --- | --- | --- | --- | --- |
| Can you pass the salt? | `\n` | `I` | Yes | 0.926 | FAITHFUL |
| Can you lift this rock? | `I` | `I` | No | 0.926 | SELF-REPAIR |
That recovery shows up clearly in the cosine curve between correct and corrupted CoT conditions. The two runs start nearly identical (~0.99 at L0), diverge to 0.920 at layer 7, and then reconverge to ~0.99 by layer 11. The shape is basically a V. I found that pattern striking. It looks like the model notices the conflict, processes it, and then repairs it before output.
Cosine similarity between correct and corrupted CoT conditions across layers. The V-shape reveals self-repair: maximum divergence at L7, recovery by L11.
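The curve itself is a simple per-layer cosine between matched residual-stream states from the two runs. A sketch of the metric, assuming cached final-token residuals of shape (n_layers, d_model):

```python
import numpy as np

def per_layer_cosine(resid_a: np.ndarray, resid_b: np.ndarray) -> np.ndarray:
    """Cosine similarity per layer between residual-stream states from
    the correct-CoT and corrupted-CoT runs (shape: n_layers x d_model)."""
    num = (resid_a * resid_b).sum(axis=-1)
    den = np.linalg.norm(resid_a, axis=-1) * np.linalg.norm(resid_b, axis=-1)
    return num / den
```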
I ended up calling the implicit case "faithful" and the literal case "self-repair." The terminology is a bit coarse, but it captures the difference I care about. In one case, the explanation is causally relevant to the answer. In the other, the explanation perturbs the key layers but downstream computation neutralizes the damage. In terms of Barez et al.'s framework, the implicit case shows causal relevance (changing the CoT changes the answer), and together the two cases inform completeness (see per-head results below).
There was also a clean result from per-layer ΔKL patching: the strongest evidence for CoT information being "written" into the residual stream showed up at layer 9 — ΔKL of 0.034 for the implicit sentence (5.4× the median layer effect) and 0.059 for the literal sentence (3.8× median). My tentative interpretation is that layers 5–8 do the work of processing the explanatory content, and layer 9 is where the result becomes available for downstream use.
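The ΔKL patching follows the standard activation-patching recipe: cache residuals from one run, overwrite a single layer's residual stream in the other run, and measure how far the output distribution moves. A toy torch sketch of that loop (the real experiment used TransformerLens hooks on GPT-2; the miniature model here is only meant to make the mechanics concrete):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

class ToyModel(torch.nn.Module):
    """Stand-in for a transformer's residual stream: each "layer" adds
    a learned update, and the final state is unembedded to logits."""
    def __init__(self, n_layers=4, d=8, vocab=16):
        super().__init__()
        self.layers = torch.nn.ModuleList(
            [torch.nn.Linear(d, d) for _ in range(n_layers)]
        )
        self.unembed = torch.nn.Linear(d, vocab)

    def forward(self, x, patch_layer=None, patch_value=None, cache=None):
        for i, layer in enumerate(self.layers):
            x = x + torch.tanh(layer(x))   # residual update
            if cache is not None:
                cache.append(x.detach().clone())
            if i == patch_layer:
                x = patch_value            # overwrite the residual stream
        return self.unembed(x)

def delta_kl_per_layer(model, clean_x, corrupt_x):
    corrupt_cache = []
    with torch.no_grad():
        model(corrupt_x, cache=corrupt_cache)
        base = F.log_softmax(model(clean_x), dim=-1)
        deltas = []
        for i, act in enumerate(corrupt_cache):
            patched = F.log_softmax(
                model(clean_x, patch_layer=i, patch_value=act), dim=-1
            )
            # KL(base || patched): how far patching layer i moves the output
            deltas.append(
                F.kl_div(patched, base, log_target=True, reduction="sum").item()
            )
    return deltas

deltas = delta_kl_per_layer(ToyModel(), torch.randn(8), torch.randn(8))
```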
Looking for shortcut circuits
The most surprising result in the whole project came from per-head patching.
I patched GPT-2's 144 attention heads one by one (12 layers × 12 heads) to ask which heads help transmit CoT information and which, if any, seem to bypass it. In arithmetic settings, Ashioya (2026) reported extensive shortcut circuits — heads where injecting CoT information actually makes the output worse, like L7H6 at −0.33 ΔKL. He found 37+ such shortcut heads alongside the faithful ones. I expected to find at least some of that here.
I didn't.
Out of 144 heads, roughly 13–15% showed non-trivial faithful effects (ΔKL > 0.001). Zero heads showed shortcut effects (ΔKL < −0.001). The remaining ~85% were essentially uninvolved.
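The bucketing is just a sign-and-threshold rule on the per-head ΔKL values. A sketch using the thresholds above (the head names and example ΔKL values are taken from the results reported in this post):

```python
def classify_heads(delta_kl: dict, thresh: float = 1e-3) -> dict:
    """Bucket per-head ΔKL values under the convention used here:
    ΔKL above the threshold counts as faithful, below the negative
    threshold as shortcut, and anything in between as uninvolved."""
    buckets = {"faithful": [], "shortcut": [], "uninvolved": []}
    for head, dkl in delta_kl.items():
        if dkl > thresh:
            buckets["faithful"].append(head)
        elif dkl < -thresh:
            buckets["shortcut"].append(head)
        else:
            buckets["uninvolved"].append(head)
    return buckets

buckets = classify_heads({"L6H3": 0.016, "L9H7": 0.024, "L0H0": 0.0002})
```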
| | Ashioya (arithmetic) | This work (pragmatic) |
| --- | --- | --- |
| Top faithful head | L0H1 (+0.54) | L6H3 (+0.016) |
| Shortcut heads found | L7H6 (−0.33), 37+ total | None (all ≥ 0) |
| Top-5 faithful in bottleneck | N/A | 3 out of 5 in L5–L8 |
| Non-trivial shortcut heads | ~37 (26%) | 0 (0%) |
The strongest faithful head was L6H3 (+0.016 for implicit, +0.017 for literal), and it showed up in both sentence types. That head sits right inside the same L5–L8 region that Part I identified as the pragmatic bottleneck. Other top faithful heads: L9H7 (+0.024 for literal), L8H8 (+0.003), L6H5 (+0.012 for literal), L8H9 (+0.004). Three out of five top heads fall within L5–L8 for both sentence types.
Per-head activation patching heatmap (CoT → Corrupted). Blue = faithful (CoT info helps). No red = no shortcut circuits. Gold band = L5–L8 pragmatic bottleneck from Part I. L6H3 is the primary faithful head.
That matters because it suggests the model is not building some separate side-channel for "reasoning about its reasoning." It seems to be reusing the same circuitry that already handles pragmatic interpretation. The chain-of-thought signal is flowing through the same region that processes the original meaning distinction.
That, in a way, is the thing that ties the three parts together most neatly. The same circuit that distinguishes request from ability question is also the circuit that integrates the model's explanatory scaffold.
Why no shortcut circuits?
The obvious comparison here is with arithmetic. Why would arithmetic produce shortcut circuits while pragmatic understanding doesn't?
My current view is that this is mostly a construct-validity issue. GPT-2 is not good at arithmetic. When Ashioya tested it, the model assigned roughly 0.01% probability to the correct answer for two-digit addition. If you ask it to "reason" about two-digit addition, you're already pushing it into a domain where the task itself is mostly beyond its competence. In that setting, it makes sense that the model would rely on brittle heuristics or fake-looking reasoning traces. It can't really use the chain-of-thought in a principled way because it doesn't actually know how to do the task.
Pragmatic inference is different. GPT-2 seems to genuinely have this capability, at least in a limited form. So when you give it CoT about pragmatic meaning, the model can route that information through machinery it already has, instead of resorting to shortcuts.
If that interpretation is right, then CoT faithfulness may depend strongly on whether the model is actually competent at the task in question. That would suggest a rough three-phase picture:
| Phase | CoT | Answer | Safety implication |
| --- | --- | --- | --- |
| Within capability | Faithful | Correct | Reasoning and computation align |
| Near capability boundary | Unfaithful / shortcut | Correct | Most misleading case |
| Beyond capability | Confabulatory | Wrong | Failure is more obvious |
The middle phase is the dangerous one. That's the case where the answer looks good and the reasoning looks polished, but the internals are doing something else. CoT-based monitoring could easily over-trust models in that regime.
I think this also has implications for distillation. If you distill polished chain-of-thought from a stronger model into a weaker one, you may transfer the style of reasoning without transferring the actual underlying computation. A weaker model can learn to produce nice verbal scaffolding while still relying on shortcuts underneath. Empirical data supports this concern: Arcuschin et al. found GPT-4o-mini (a distilled model) had a 13% unfaithfulness rate, notably higher than larger models like Sonnet 3.7 with thinking at 0.04%.
A small proof of concept on grounded CoT
As a final experiment, I tried a lightweight version of step-level auditing. I used logit lens to inspect key positions inside the CoT and checked whether the model's intermediate predictions matched the content of the reasoning at those positions.
In the cases I checked, they did. When the CoT said "indirect request," the logit lens at L5 predicted "request" with P=0.86. At the phrase "step by step," L8 predicted "guide" with P=0.99. All checked positions across both sentence types came back as grounded — the model's internal state at each CoT step was consistent with what the CoT claimed. I don't want to oversell this — the evaluation is tiny and the classifier is primitive — but it was enough to convince me that logit lens could plausibly be used as a local faithfulness probe for individual reasoning steps.
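The grounding check reduces to: take the residual state at a CoT position and layer, push it through the unembedding, and ask whether the top token is consistent with what the CoT claims at that step. A simplified numpy sketch (the real probe would apply GPT-2's final LayerNorm before unembedding, which is omitted here; the probability threshold is an assumption, not a value from the experiment):

```python
import numpy as np

def logit_lens_grounded(resid, W_U, vocab, expected, p_min=0.5):
    """resid: (d_model,) residual state at a CoT position/layer.
    W_U: (d_model, vocab) unembedding matrix. expected: set of token
    strings consistent with the CoT claim at this step, e.g. {"request"}.
    Returns True if the logit-lens top token is one of them with
    probability at least p_min."""
    logits = resid @ W_U
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    top = int(probs.argmax())
    return vocab[top] in expected and float(probs[top]) >= p_min
```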
Limitations
There are a few limitations here that matter enough to say plainly.
Part III is still extremely small. The faithfulness/self-repair pattern comes from only two sentences, so it should be treated as an observation, not a settled generalization.
The CoT was injected rather than generated autoregressively. That makes the analysis tractable, but it isn't the same thing as studying naturally produced chain-of-thought.
The models are small. GPT-2 Small and BLOOM-560m are useful for interpretability work, but they are not frontier systems, and results may shift a lot with scale.
The cosine threshold I used to decide whether an activation "changed" (0.95) is somewhat ad hoc. A more principled significance test — such as permutation testing against a null distribution — would be better.
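For reference, the kind of permutation test I have in mind: shuffle the pairing between the two conditions' activations to build a null distribution of the similarity statistic, then ask how extreme the observed value is. A numpy sketch (it assumes many matched activation pairs; with only two sentences, as here, the test would not yet be meaningful):

```python
import numpy as np

def perm_test_cosine_drop(acts_a, acts_b, n_perm=2000, seed=0):
    """Permutation test for whether matched activation pairs are less
    similar than chance. acts_a, acts_b: (n_pairs, d_model) arrays of
    matched activations from the two conditions. Shuffling the pairing
    builds a null distribution of the mean cosine."""
    rng = np.random.default_rng(seed)

    def mean_cos(x, y):
        num = (x * y).sum(axis=-1)
        den = np.linalg.norm(x, axis=-1) * np.linalg.norm(y, axis=-1)
        return float((num / den).mean())

    observed = mean_cos(acts_a, acts_b)
    null = np.array([
        mean_cos(acts_a, acts_b[rng.permutation(len(acts_b))])
        for _ in range(n_perm)
    ])
    # One-sided p: fraction of random pairings with mean cosine at or
    # below the observed value, with add-one smoothing.
    return (np.sum(null <= observed) + 1) / (n_perm + 1)
```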
And the padding issue is real. Baselines that look innocuous can turn out to carry semantic load, which can distort KL comparisons more than you'd think. Semantically "leaky" padding inflated KL by roughly 7× compared to neutral padding.
So I don't see this post as the final word on any of these questions. It's more like a first pass that found a few patterns I think are genuinely worth following up.
Where I ended up
What I like about this project is that it started with a simple linguistic question and ended up touching a much broader interpretability issue.
First, it looks like pragmatic processing in GPT-2 really does localize to a middle-layer bottleneck, roughly layers 5–8. Second, in BLOOM, the story becomes more asymmetric: request-like and ability-like interpretations seem to peak at different depths. Third, under code-switching, part of the circuit remains stable and part shifts later, which suggests that multilingual processing and pragmatic interpretation interact in a layered way rather than simply interfering with one another.
And finally, chain-of-thought doesn't look purely decorative here. At least in this small pragmatic setting, it seems partially faithful, sometimes causally relevant, and sometimes subject to downstream self-repair. Most notably, I didn't find evidence of shortcut circuits — zero out of 144 heads, compared to 37+ in arithmetic. Instead, the model appears to reuse the same machinery it already uses for pragmatic understanding.
That, to me, is the main result. The model does not seem to build an entirely separate pathway for explanation. It appears to thread explanatory information through the same part of the network that handles meaning in the first place.
All code and figures are here: github.com/stebloom12/implicit-meaning-gpt2
Built with TransformerLens. Feedback very welcome, especially on the construct-validity argument and on the idea that faithfulness might track capability boundaries. I'd also be curious whether other people have seen similar self-repair patterns — especially the V-shaped cosine curve — in other domains where the model actually seems competent.