Nice analysis (I'm an author on that role confusion paper). Your analysis matches our findings in general - the directions recovered by the linear probes are downstream measurements of the internal role feature, not the upstream causal vector. We initially got some pretty good steering results but then found they weren't robust, though we didn't test much further.
For the paper, the causal claim is different: it's that role confusion (let's call it X) causes prompt injection (Y), not that the probe-measured direction is the causal direction.
Let me walk through the logic a bit, because in most mech interp work, these are conflated. Usually, to show some internal feature X causes behavioral output Y, you perturb X in activation space and show Y changes (steering/patching like you tested). But, this relies on having the true causal vector for X we can manipulate, which we don't here.
So, how can we show X causes Y, considering that we can't directly control X? This is a common problem in econ research (see Acemoglu 2001 as a paper with a similar setup - https://www.aeaweb.org/articles?id=10.1257/aer.91.5.1369). The solution we're using is essentially an instrumental variables approach mixed with randomized encouragement design.
Going over the non-CoT Forgery prompt injection experiment first (Sec 5.2 in the paper), since this is a bit cleaner than the CoT Forgery one which I'll go over in a sec. In Sec 5.2, we're holding the injected command completely fixed, and varying only the template ("User: [COMMAND]" vs "Tool: [COMMAND]", etc). Then we find that the command text Userness is tightly correlated with attack success.
What makes this causal evidence that role confusion (X) causes prompt injection (Y)? Well, what we know is that a proxy for X (Userness) is correlated to Y (attack success). If two variables are correlated, then one of the following must be true: (1) X causes Y, (2) Y causes X, or (3) there's some confound C that causes both X and Y.
Our claim is (1). We can rule out (2) as nonsensical here since generation occurs after measurement. But (3) also doesn't make sense. Suppose there was some confound C that causes both X and Y. Then C would need to satisfy the following conditions: (i) C must covary with true role tags when content is held constant (to produce Userness from the probes), (ii) C must be inducible by spoofing user role declarations, and (iii) C must be behaviorally upstream of compliance, since attack success changes and content doesn't. But any realistic C that satisfies these is either role identity or a component of role identity. Then what's left is that the correlation causally identifies (1).
The CoT Forgery one uses styling/destyling as the instrument instead of these templates. The confound is harder to fully rule out here because destyling changes a lot of stuff simultaneously. Additionally because of the messiness you need to have a lot of variation in the underlying CoTness to induce the X/Y correlation. I believe this is why within-style prediction is ineffective, though overall this experiment is fundamentally more fuzzy than the Sec 5.2 one so it's hard to rule out like you said.
Let me know if any of this text dump doesn't make sense. I'm very curious to see more work on this, let me know if there's anything we can provide here.
wtbu: I am currently replicating this paper to try to use <user> tags for understanding model motivations and eval awareness, and some details seem to be finicky, like the dataset or the prefix, so you should check that you got them precisely right. Adjacently, steering can also be misleading if you didn't try a large enough alpha, or your dataset wasn't well-chosen. In any case, we also found destyling to introduce a bunch of confounds in our prefill paper when trying to attack it as a variable, so you should probably show more examples of how you produced your data.
A replication of Prompt Injection as Role Confusion (2026) and why the mechanistic story of prompt injection is harder to pin down than it looks.
Epistemic status: I reproduced the direction of the paper's main results on a single consumer GPU (faithful in direction but not like for like in magnitude; see caveats at the end). I then tried two ways to test the paper's causal claims: first activation steering, then activation patching. Neither settled it. Steering is too weak: it can't move behaviour even along a direction built exactly to do that. Patching does move behaviour but isn't specific: a random perturbation of equal size does the same thing.
This post is a replication and an honest, bracketing negative result. The causal tools can't show that role confusion IS the mechanism NOR that it's a bystander. But two clues that need no working intervention both lean towards it being a bystander: 1) the styled/destyled gap is ~95% outside the probe's role axis, and 2) the probe's predictive ability collapses once style is held fixed. What I can show is narrower, but it's well supported by the data, and exploring why a clean verdict is out of reach is interesting on its own. The dead ends here demonstrate precisely why making causal claims about how prompt injections work is so difficult.
If you are hoping for a verdict on the original paper, there isn't one. I couldn't get one, and I really tried. Rather, this post is about why a clean verdict is so hard to get, and about showing that this kind of exploration and science doesn't need a lab. The work was powered by one consumer GPU (3090), coffee and curiosity.
The Paper
Prompt Injection as Role Confusion is an excellent recent paper, with an excellent accompanying post (Ye, Cui & Hadfield-Menell, ICML 2026). This paper has a genuinely clarifying idea. LLMs read a conversation as one long context, a long stream of text chopped into 'roles' e.g. system, user, chain of thought, tool use, etc. They explain roles and the need for them cleanly, so if you haven't read their post and this isn't something you're aware of, it's worth reading their treatment rather than having me re-explain it here. This post will also be much harder to read if you haven't read theirs, so it's probably best to do that before carrying on.
The paper's claim: models don't actually identify roles from the tags. They identify roles from how the text sounds. If a command hidden in a webpage sounds like the user talking to the model, then the model treats it like the user talking - tag be damned. The authors (& I) train small 'probes' that read from the model's internal activations how strongly the model interprets a chunk of text as user-like (Userness) or reasoning-like (CoTness).
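For readers who haven't met probes before, here is a minimal sketch of what a probe readout computes. It assumes a linear probe with a logistic readout; the weights, bias and activation vector are made-up placeholders, and `probe_score` is a hypothetical name, not something taken from the paper's code or from my replication.

```python
import math

def probe_score(activation, weights, bias):
    """Logistic readout of a linear probe: sigmoid(w . h + b), a value in (0, 1)."""
    logit = sum(w * h for w, h in zip(weights, activation)) + bias
    return 1.0 / (1.0 + math.exp(-logit))

# Hypothetical 4-d activation and probe parameters, for illustration only.
activation = [0.5, -1.0, 2.0, 0.0]
weights = [1.0, 0.5, 0.25, -2.0]
bias = -0.5

# A higher score means the probe reads the text as more 'reasoning-like'.
cotness = probe_score(activation, weights, bias)
```

The point of the sketch is only that the probe reads a single number off an activation vector; nothing in the readout itself says whether the model uses that number.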
The headline attack CoT-Forgery exploits this: write your injection in the same way the model reasons and you can steal the trust given to the think role. Much of my own red-team work relies on similar; it works across roles, and I'm particularly fond of tool-forgery myself.
I replicated it, and it holds
I rebuilt the whole thing on one RTX 3090 with the same 20B open model, and the core results came out in the same direction:
Same effect, slightly higher magnitude.
So: the phenomenon is real, the probes measure something, and the attack is a genuine problem. None of what follows disputes any of that.
The narrow disagreement
Why did I recreate their work? My first reaction to reading the post was that roles/style are all parts of context, and I wanted to test my theory. As a professional red-teamer, if this was how prompt injection worked internally I definitely wanted to know for sure.
The paper doesn't just say that role confusion predicts injection, it says role confusion causes it. The framing is causal & mediational: style > the model perceives the wrong role > the model plays along. The probe in this story is reading the actual internal variable the model uses to decide to trust the text.
However, there is a second story I can tell which fits every result just as well:
In this story the probe is a thermometer, not a lever. A thermometer predicts who's sick and reads a real quality (temperature), but cooling down the thermometer isn't going to cure anybody. "Role confusion" might be the fever, not the disease.
Crucially: nothing in the paper distinguishes these two stories. The predictive results are all correlations, where both style and the probe move together. The strongest causal result (the destyling edit) works by changing the style. It shows that style matters but it can't show whether the probe's axis matters once style is fixed.
To be fair to the paper, the authors do not claim that role perception is some fixed, context independent thing. Their whole point is that it's driven by style and context. The probe is genuinely reading internal activations, not surface keywords, so 'it's just detecting n-grams' is too cheap.
The open question is just: is the thing the probe measures something the model acts on, or something that just travels alongside the real driver?
Is it a lever or a thermometer?
The experiment to settle it
The clean test isn't to shuffle text around; style moves with the text, and we'd learn nothing. Instead, we need to reach into the model and move the dial ourselves.
Take a fixed injection, and use the probe's 'reasoning' vector to steer the model's activations. Make them more or less 'reasoning-like' internally without changing a character on the page, and watch what happens to the attack success rate (ASR).
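A minimal sketch of the steering operation itself, assuming the common recipe of adding a scaled unit vector to a hidden state. The names `steer` and `projection` and the toy vectors are hypothetical; in a real run this would be applied to the model's activations at the chosen layer, not to a three-element list.

```python
import math

def steer(hidden, direction, alpha):
    """Add alpha units of the unit-normalised direction to one hidden-state vector."""
    norm = math.sqrt(sum(d * d for d in direction))
    if norm == 0:
        raise ValueError("direction must be non-zero")
    return [h + alpha * d / norm for h, d in zip(hidden, direction)]

def projection(hidden, direction):
    """Scalar projection of hidden onto the direction."""
    norm = math.sqrt(sum(d * d for d in direction))
    return sum(h * d for h, d in zip(hidden, direction)) / norm

# Toy values: the text on the page is untouched, only the vector moves.
hidden = [1.0, 2.0, 3.0]
direction = [0.0, 3.0, 4.0]
steered = steer(hidden, direction, alpha=2.0)

# The projection along the direction shifts by exactly alpha.
shift = projection(steered, direction) - projection(hidden, direction)
```

Normalising the direction means alpha is expressed in activation-norm units, which makes different directions (probe, random, other-role) comparable at the same alpha.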
It's a simple, easy test, I had the tools, and I'd made sure they worked. The next step is a big jump in confidence about what's happening here...
But there was a trap I hadn't considered, and it's where the whole thing came undone. A flat ASR only means thermometer IF the rig could have moved the rate at all. We need a positive control: some direction that, when steered through the identical setup, does flip the decision. Random and other-role directions can tell you the effect is specific, but they can't tell you if the apparatus has any power.
What happened when I turned the dial
I ran 16 harmful questions × {styled forgery, destyled control}, steering along the probe's reasoning direction at two places: 1) the exact spot the probe reads and 2) the residual stream that actually propagates (plus random-direction and other-role controls). Every output was then scored by an LLM harm judge. The destyled text here is the paper's own procedure:
Ask the model to reword the forged rationale "in more normal language, keeping the content the same" (Ye et al., 2026), such that it preserves the request and 1st/3rd person whilst varying the reasoning cadence.
The crucial point is that the steering changes no text at all, so this part of the experiment doesn't lean on the styled/destyled contrast being a clean one-factor manipulation. Whatever else differs between styled and destyled prose, steering holds the text completely fixed and moves only the internal reading.
At first it looked like the dials spin freely.
It even produced a very satisfying 'matched-CoTness' contrast: at the same induced 'CoTness' of ~0.93 a destyled injection attacked at 0.25 whilst a styled one attacked at 0.75. The same internal reading, opposite behaviour.
The controls behaved. A random steering vector didn't move the reading or the behaviour whilst the 'user' vector did swing the 'CoTness' reading, but still left behaviour flat.
So it was never 'CoTness' movement that mattered. The model stayed coherent throughout the tested range, only breaking at the extremes, and it broke symmetrically at both ends - damage, not mediation.
Thermometer - A clean result in favour of my story! But then, the positive control happened...
A niggling doubt
As I started to lay out my thoughts, an issue occurred to me and I turned to Claude.
(A dramatic re-enactment)
So, I went off to make a quick positive control: a direction designed to move compliance. The difference in activation space between injections that succeeded and ones which were refused. I steered it through the identical setup, pushing it clearly across the jailbreak-refusal separation.
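A minimal sketch of how such a direction can be built, assuming it is the difference between the mean activation of the successful injections and the mean activation of the refused ones. The function names and the toy 2-d activations are hypothetical placeholders.

```python
def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def difference_of_means(complied, refused):
    """Direction pointing from the 'refused' mean towards the 'complied' mean."""
    mu_c = mean_vector(complied)
    mu_r = mean_vector(refused)
    return [c - r for c, r in zip(mu_c, mu_r)]

# Toy 2-d activations, for illustration only.
complied = [[1.0, 0.0], [3.0, 2.0]]
refused = [[0.0, 1.0], [0.0, 3.0]]

compliance_direction = difference_of_means(complied, refused)
```

Steering with a positive coefficient along this direction pushes an activation towards the 'comply' side, and a negative coefficient pushes it towards 'refuse'.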
It barely moved ASR.
Steering a destyled, refusing injection all the way into the 'comply' region moved ASR from 0.25 to 0.19.
Steering a styled, successful forgery deep into the 'refuse' region moved ASR from 0.75 to 0.62.
The overall correlation between the induced compliance projection and ASR was about 0.04.
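To put those rates in perspective, here is a small worked calculation. It assumes each rate is a simple fraction of the 16 questions in that arm and that the reported figures are rounded to two decimals; the helper name `implied_count` is hypothetical.

```python
N_QUESTIONS = 16  # questions per arm, as described in the experiment setup

def implied_count(rate, n=N_QUESTIONS):
    """Nearest whole number of successful attacks that a reported rate implies."""
    return round(rate * n)

# Net change in the number of successful attacks under each steering run.
destyled_change = implied_count(0.19) - implied_count(0.25)
styled_change = implied_count(0.62) - implied_count(0.75)

# With n = 16, a single question changing outcome moves ASR by this much.
step = 1 / N_QUESTIONS
```

Under those assumptions the two shifts correspond to a net change of one and two questions respectively, which is why they count as "barely moved".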
So the rig has no demonstrated power to move the decision at all. Not for the CoT axis and not for the direction built specifically to steer compliance, which means my earlier results are uninterpretable. Jackpot, hooray...
I cannot decide whether we're working with a thermometer or a lever. Bystander claim retracted.
Then I tried a stronger tool: Patching
Steering was too weak; I needed to try activation patching. Rather than nudge a single vector, I needed to transplant the whole representation.
For a given harmful question the styled and destyled prompts share the prefix, so we know the injected span starts at the same token index and I can swap that span 1:1 across multiple layers in the model (layers 8, 12 and 16). Transplant the destyled (refused) span representation into the styled (successful) run and ask whether the jailbreak dies.
The ASR dropped from 0.75 to 0.25 with coherent outputs, and 10 of the 16 questions flipped. I have a stronger tool. Now I just need to test patching the probe's role-subspace against its opposite, patching everything apart from the role-subspace, and I will have a verdict.
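A minimal sketch of the span transplant at a single layer, assuming the hidden states are available as one vector per token position and that the swapped span covers the same positions in both runs. `patch_span` and the toy values are hypothetical; a real implementation would do this inside the forward pass at each patched layer.

```python
def patch_span(recipient, donor, start, end):
    """Return a copy of recipient with token positions [start, end) taken from donor."""
    if not 0 <= start <= end <= min(len(recipient), len(donor)):
        raise ValueError("span must fit inside both runs")
    patched = [list(vec) for vec in recipient]
    for pos in range(start, end):
        patched[pos] = list(donor[pos])
    return patched

# Toy per-token hidden states at one layer: shared prefix, then the injected span.
styled_run = [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0], [4.0, 4.0]]
destyled_run = [[1.0, 1.0], [9.0, 9.0], [8.0, 8.0], [4.0, 4.0]]

# Transplant the destyled span (positions 1 and 2) into the styled run.
patched_run = patch_span(styled_run, destyled_run, start=1, end=3)
```

Copying rather than mutating keeps the clean styled run available for comparison against the patched one.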
Except, I didn't.
The role-subspace is a tiny fraction of the overall styled/destyled representation difference. The role-only patch was too small, whilst the everything-but-role patch was too big. So instead, I matched magnitudes, amplifying the role-subspace and comparing it with a random 4-d subspace equally amplified:
At roughly the magnitude needed to move behaviour, any perturbation at all knocks the jailbreak down by the same amount.
The suppression comes from leaving the confident-styled region, not from arriving at the destyled/role representation specifically. Patching does move behaviour, but it's not specific enough to credit the role components with anything.
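A minimal sketch of the magnitude-matched comparison, under stated assumptions: the styled/destyled difference is projected onto a 4-d subspace, and the projection is rescaled to a fixed norm so that a 'role' subspace and a random subspace deliver perturbations of equal size. The axis-aligned 'role' basis, the 16-d toy space and all function names are hypothetical stand-ins for illustration.

```python
import math
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def orthonormalise(vectors):
    """Gram-Schmidt: return an orthonormal basis for the span of the vectors."""
    basis = []
    for v in vectors:
        w = list(v)
        for b in basis:
            coeff = dot(w, b)
            w = [wi - coeff * bi for wi, bi in zip(w, b)]
        norm = math.sqrt(dot(w, w))
        if norm > 1e-12:
            basis.append([wi / norm for wi in w])
    return basis

def project(v, basis):
    """Orthogonal projection of v onto the span of an orthonormal basis."""
    out = [0.0] * len(v)
    for b in basis:
        coeff = dot(v, b)
        out = [o + coeff * bi for o, bi in zip(out, b)]
    return out

def fraction_in_subspace(v, basis):
    """Share of v's squared norm that lies inside the subspace."""
    p = project(v, basis)
    return dot(p, p) / dot(v, v)

def matched_perturbation(v, basis, target_norm):
    """Component of v inside the subspace, rescaled to a fixed norm."""
    p = project(v, basis)
    norm = math.sqrt(dot(p, p))
    return [target_norm * pi / norm for pi in p]

rng = random.Random(0)
dim, k = 16, 4
diff = [rng.gauss(0, 1) for _ in range(dim)]  # stand-in for styled minus destyled
role_basis = orthonormalise(
    [[1.0 if i == j else 0.0 for i in range(dim)] for j in range(k)]
)
random_basis = orthonormalise(
    [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(k)]
)

target = math.sqrt(dot(diff, diff))  # amplify both to the full difference norm
role_share = fraction_in_subspace(diff, role_basis)
role_pert = matched_perturbation(diff, role_basis, target)
random_pert = matched_perturbation(diff, random_basis, target)
```

Because both perturbations have the same norm, any difference in how much they suppress the attack would have to come from the subspace's direction rather than from perturbation size, which is the comparison the specificity control relies on.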
So what? A smaller honest claim
My two interventions have bracketed the causal question but not closed it. Steering is too weak, patching is too blunt. The missing middle is that no regime is at once strong enough to move behaviour and specific enough to attribute the effect to 'role'.
So whether the probe's 'role' axis is a lever or a thermometer is still genuinely unresolved.
What's left:
I can't show role confusion is a bystander; I can only show that the obvious tools can't cleanly show it's the mechanism either and that two unconfounded clues (the representational gap that matters is mostly orthogonal to the probe's role axis; the probe's prediction collapses within style) both lean that way without settling it. That's a more modest, more honest place to stop this post.
I won't stop here though. I'll be hunting for something in the missing middle, and I'd like people to join me. As I said at the top, it's not compute intensive, so it's pretty accessible and I'd love to discuss further. If anyone would like more specific numbers or details just let me know.
Caveats:
1) The steering arm failed its positive control. A purpose-built compliance direction, steered across its full range (even at every position), moved judged attack rate by ≈0 (destyled) to ~0.13 (styled, noisy). So steering is inconclusive, not evidence for the bystander reading.
2) The patching arm passed the positive control (full transplant 0.75 > 0.25) but failed a specificity control: a random full-magnitude perturbation suppresses just as much (p=1.0), and an amplified role-subspace swap is indistinguishable from a random one (p=0.6), so patching can't credit the role component.
3) Steering along the probe's own direction necessarily moves its readout, so the most it could ever show is that this specific linear axis isn't the lever, not that no role-like internal variable is.
4) Replication is direction-faithful, not exact: the ~87%/0.92 style numbers are at layer 8, not the paper's canonical layer 12 (where I get ~67%), and part of the styled-vs-plain 'CoTness' gap is baked into how I built the forgeries.
5) One model (gpt-oss-20b), n=16 harmful prompts, greedy decoding, an LLM judge standing in for the paper's, though I hand-checked that judge: on a balanced 20-case sample I agreed with it 20/20, so the harm labels aren't the weak link here.