Short version: I pitted explicit instructions ("You MUST use Circuit Breaker") against retrieved context describing a different pattern, across 18 conflict scenarios on pythia-160m. Context wins 12/18 (67%) overall — including when the instruction uses "MUST." Instruction strength has no effect on win rate; it only moves the layer where the conflict resolves. For cross-domain pairs, context wins 100% of the time. For near-neighbor pairs it's 50/50, and that split traces back to P2 token ambiguity — the same tokenization accident that drove the Q4 anomaly.
The question
Q1 showed the "which pattern" decision commits at L0/P2 — the embedding of the pattern-name token — before any downstream reasoning. Q4 showed that the one anomalous pair (Producer-Consumer vs Pub-Sub) peaks at L1/last-position instead of L0/P2, because the first sub-token of "Producer" (' Pro') is shared by several patterns and so carries less categorical signal than ' Cir' (Circuit Breaker).
A natural next question falls out of those two results: if a model simultaneously receives an instruction to use pattern X and retrieved context describing pattern Y, which signal wins? And does it depend on how strongly the instruction is phrased?
This matters for alignment more than it might seem. Prompt injection attacks work by inserting adversarial context that contradicts the intended instruction. If retrieved context can override an explicit "MUST" instruction, then the injection surface is much larger than the instruction string alone. The flip side: if instructions can override context regardless of what they say, then the retrieval layer is fragile to adversarial instructions.
Neither extreme is obviously safe. What I found is more structured than either: context dominates for unambiguous pattern pairs, the competition is genuine for near-neighbor pairs, and the resolution happens at L2 on average — far earlier than anything we would call "reasoning."
Setup
Same model and base task as Q1–Q5: pythia-160m (12 layers, 768d) on pattern-pair selection in a retrieval-augmented prompt.
Conflict structure: Each prompt contains two simultaneous signals:
Instruction signal — a sentence in the system prompt nominating a specific pattern.
Context signal — retrieved description of a different pattern from the pattern library.
The model's task: recommend a pattern for a given problem description. The correct answer (by context) is pattern Y; the instruction pushes it toward pattern X.
Three instruction strengths:
| Strength | Example phrase |
| --- | --- |
| Explicit | "You MUST use Circuit Breaker for this problem." |
| Moderate | "Consider using Circuit Breaker for this problem." |
| Implicit | "Circuit Breaker is relevant to this problem." |
Three pattern pairs:
| Pair | Type | Rationale |
| --- | --- | --- |
| circuit_vs_retry | Near-neighbor | Both are resilience patterns; similar vocabulary |
| producer_vs_pubsub | Near-neighbor | Both are messaging patterns; ' Pro' token ambiguous (Q4 anomaly pair) |
| circuit_vs_repository | Cross-domain control | Clearly different concepts; expect clean conflict resolution |
Coverage: 3 pairs × 3 strengths × 6 problem descriptions = 54 prompts; 18 win/loss comparisons (6 per pair).
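To make the 3 × 3 × 6 grid concrete, here is a minimal sketch of how the 54 prompts could be enumerated. The instruction templates come from the strengths table above; everything else is an assumption for illustration: the prompt layout, the `build_prompts` helper, the choice of which pattern in each pair is the instructed one, the name "Retry" for the second pattern in circuit_vs_retry, and the placeholder problem descriptions (the real six are not listed in this post).

```python
from itertools import product

# Instruction templates from the strengths table; {pattern} is filled in per pair.
STRENGTHS = {
    "explicit": "You MUST use {pattern} for this problem.",
    "moderate": "Consider using {pattern} for this problem.",
    "implicit": "{pattern} is relevant to this problem.",
}

# (instructed pattern X, context pattern Y). The direction within each pair
# and the exact pattern names are assumptions.
PAIRS = {
    "circuit_vs_retry": ("Circuit Breaker", "Retry"),
    "producer_vs_pubsub": ("Producer-Consumer", "Pub-Sub"),
    "circuit_vs_repository": ("Circuit Breaker", "Repository"),
}

# Placeholders only; the actual problem descriptions are not given in the post.
PROBLEMS = [f"<problem description {i}>" for i in range(1, 7)]


def build_prompts():
    """Enumerate every (pair, strength, problem) combination as a prompt dict."""
    prompts = []
    for (pair, (x, y)), (strength, template), problem in product(
        PAIRS.items(), STRENGTHS.items(), PROBLEMS
    ):
        text = (
            f"{template.format(pattern=x)}\n\n"          # instruction signal (pattern X)
            f"Retrieved context: <description of {y}>\n\n"  # context signal (pattern Y)
            f"Problem: {problem}\nRecommended pattern:"
        )
        prompts.append(
            {"pair": pair, "strength": strength, "instructed": x,
             "context_pattern": y, "text": text}
        )
    return prompts


if __name__ == "__main__":
    prompts = build_prompts()
    print(len(prompts))
    print(prompts[0]["text"])
```

Keeping the pair, strength, and instructed pattern alongside the text makes it easy to group outcomes by strength or by pair afterwards.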
Metric: activation patching (layer, position) to identify the dominant conflict-resolution cell — the (layer, position) where patching the activation from the "context wins" run into the "instruction wins" run most changes the outcome. Mean dominant layer across scenarios = the headline resolution-layer number.
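For readers who have not seen activation patching before, the sketch below shows the mechanics on a toy model in plain Python. It is not the pipeline used for the results and does not touch pythia-160m: the "model" is a made-up stack of causal averaging layers over a scalar residual stream, and `layer`, `run`, `patching_grid`, and `dominant_cell` are hypothetical names. What it does share with the metric is the procedure: cache activations from a source run, overwrite one (layer, position) cell in a target run, and record how far the output moves toward the source.

```python
from typing import List, Optional, Tuple

Stream = List[float]  # toy residual stream: one scalar per token position


def layer(stream: Stream, weight: float) -> Stream:
    """Toy causal layer: position p adds weight * mean(stream[0..p]) to itself."""
    return [x + weight * sum(stream[: p + 1]) / (p + 1) for p, x in enumerate(stream)]


def run(tokens: Stream, weights: List[float],
        patch: Optional[Tuple[int, int, float]] = None) -> Tuple[float, List[Stream]]:
    """Return (output at the last position, cache of the stream entering each layer).

    patch=(layer_idx, position, value) overwrites one cell before that layer runs.
    """
    stream, cache = list(tokens), []
    for i, w in enumerate(weights):
        if patch is not None and patch[0] == i:
            stream[patch[1]] = patch[2]
        cache.append(list(stream))
        stream = layer(stream, w)
    return stream[-1], cache


def patching_grid(source: Stream, target: Stream, weights: List[float]) -> List[List[float]]:
    """grid[l][p] = fraction of the source-vs-target output gap closed by patching (l, p)."""
    source_out, source_cache = run(source, weights)
    target_out, _ = run(target, weights)
    gap = source_out - target_out
    grid = []
    for l in range(len(weights)):
        row = []
        for p in range(len(target)):
            patched_out, _ = run(target, weights, patch=(l, p, source_cache[l][p]))
            row.append((patched_out - target_out) / gap)
        grid.append(row)
    return grid


def dominant_cell(grid: List[List[float]]) -> Tuple[int, int]:
    """The (layer, position) whose patch moves the output the most."""
    cells = [(l, p) for l in range(len(grid)) for p in range(len(grid[l]))]
    return max(cells, key=lambda c: abs(grid[c[0]][c[1]]))


if __name__ == "__main__":
    weights = [0.5, 0.3, 0.2]
    source = [1.0, 0.0, 2.0, 0.5]    # stands in for the "context wins" run
    target = [1.0, 0.0, -1.0, 0.5]   # stands in for the "instruction wins" run
    print(dominant_cell(patching_grid(source, target, weights)))
```

In this toy the two inputs differ only at position 2, so the dominant cell is the layer-0 cell at that position, and patching later cells recovers only part of the gap because the difference has already leaked into the last position. The mean dominant layer reported below is the average of the layer index of this cell over scenarios.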
Results
Overall win rates
| Signal | Wins | Rate |
| --- | --- | --- |
| Context | 12/18 | 67% |
| Instruction | 6/18 | 33% |
Context wins by a 2:1 margin overall. This holds even for explicit "MUST" instructions.
By instruction strength
| Strength | Context wins | Instruction wins | Mean dominant layer |
| --- | --- | --- | --- |
| Explicit ("MUST") | 4/6 (67%) | 2/6 | 2.67 |
| Moderate ("Consider") | 4/6 (67%) | 2/6 | 2.17 |
| Implicit ("relevant") | 4/6 (67%) | 2/6 | 0.83 |
The context win rate is 4/6 (67%) at every strength, so in this sample instruction strength has no measurable effect on whether context wins. What it does shift is where the conflict resolves: implicit instructions resolve at L0.83, moderate at L2.17, explicit at L2.67. Stronger instructions push the resolution to slightly later layers, but never late enough to flip the outcome.
By pair type
| Pair | Type | Context wins | Mean dominant layer |
| --- | --- | --- | --- |
| circuit_vs_retry | Near-neighbor | 3/6 (50%) | 2.50 |
| producer_vs_pubsub | Near-neighbor | 3/6 (50%) | 2.17 |
| circuit_vs_repository | Cross-domain | 6/6 (100%) | 1.00 |
The breakdown is stark. For the cross-domain control pair, context wins every single time — the model ignores the instruction completely. For near-neighbor pairs it's a genuine 50/50, and those cases are exactly where instruction wins occurred.
The 3-strength run in full
Why near-neighbor pairs split evenly: the P2 ambiguity link
This is the structural finding, and it connects directly to Q4.
Q4 found that the Producer-Consumer vs Pub-Sub pair was anomalous: activation patching peaked at L1/last-position instead of the usual L0/P2. The explanation was tokenization: "Producer" sub-tokenizes to ' Pro' + 'duc' + 'er', and ' Pro' is shared with "Protocol," "Property," "Processor," and others. It carries less categorical signal than ' Cir' (which maps almost exclusively to "Circuit Breaker"). So the initial P2 embedding is weaker for Producer-Consumer — the model has to do more downstream work to resolve which pattern it's talking about.
That P2 ambiguity has a direct prediction for conflict resolution: when the instruction names a pattern with a high-information P2 token (' Cir'), the instruction's categorical signal is strong at embedding time and can compete with context. When it names a pattern with a low-information P2 token (' Pro'), the instruction's signal is diluted from the start, and context wins by default.
This is exactly what the conflict data shows. The two near-neighbor pairs — both involving patterns where P2 ambiguity matters — split 50/50. The cross-domain pair, where the patterns are lexically distinct and P2 tokens are unambiguous in both directions, resolves in favor of context 100% of the time.
The instruction/context conflict outcome is partially predictable from tokenization: if you know the first sub-token of the instructed pattern is ambiguous, you can predict that context will likely win.
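One crude way to put a number on "how much categorical signal a first sub-token carries" is its surprisal within a vocabulary of candidate names: the more names share the piece, the fewer bits it contributes. The sketch below is a toy under loud assumptions: the vocabulary is hand-picked from names mentioned in this post, a three-character string prefix stands in for the real first sub-token (the actual BPE split would have to come from the model's tokenizer), and `first_piece_bits` is a hypothetical helper, not something in the repo.

```python
import math

# Toy vocabulary built from names mentioned in the post; not the real pattern library.
VOCAB = [
    "Circuit Breaker", "Producer-Consumer", "Protocol", "Property",
    "Processor", "Pub-Sub", "Retry", "Repository",
]


def first_piece(name: str, width: int = 3) -> str:
    """Stand-in for the first sub-token: the first `width` characters (an assumption)."""
    return name[:width]


def first_piece_bits(name: str, vocab=VOCAB) -> float:
    """Bits of information the first piece gives about which vocabulary entry follows."""
    sharing = sum(1 for entry in vocab if first_piece(entry) == first_piece(name))
    return math.log2(len(vocab) / sharing)


if __name__ == "__main__":
    for name in ("Circuit Breaker", "Producer-Consumer"):
        print(name, first_piece_bits(name))
```

With this toy vocabulary "Pro" is shared by four entries and "Cir" by one, so the first piece of "Circuit Breaker" carries more bits than the first piece of "Producer-Consumer". Testing the P2 hypothesis properly would mean replacing the prefix proxy with the real tokenizer and the toy list with the full pattern library.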
Mean resolution layer ~L2: this is not reasoning
Across all 18 scenarios, the mean dominant conflict-resolution layer is L1.89 — call it L2. The model resolves instruction vs context conflict in the second transformer layer, not somewhere in layers 6–11 where we might expect deliberate comparison to happen.
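As a quick arithmetic check, the L1.89 figure and the 12/18 total can be recovered from either results table by pooling its rows, weighting each row by its 6 scenarios. The numbers below are copied from the tables above; the `pooled` helper is just illustrative.

```python
# (context wins, scenarios, mean dominant layer), copied from the results tables.
BY_STRENGTH = {
    "explicit": (4, 6, 2.67),
    "moderate": (4, 6, 2.17),
    "implicit": (4, 6, 0.83),
}
BY_PAIR = {
    "circuit_vs_retry": (3, 6, 2.50),
    "producer_vs_pubsub": (3, 6, 2.17),
    "circuit_vs_repository": (6, 6, 1.00),
}


def pooled(table):
    """Return (total context wins, total scenarios, scenario-weighted mean layer)."""
    wins = sum(w for w, _, _ in table.values())
    total = sum(n for _, n, _ in table.values())
    mean_layer = sum(n * layer for _, n, layer in table.values()) / total
    return wins, total, round(mean_layer, 2)


if __name__ == "__main__":
    print(pooled(BY_STRENGTH))
    print(pooled(BY_PAIR))
```

Both groupings pool to the same totals, which is what you would expect if they are two views of the same 18 scenarios.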
This connects to the cumulative picture from Q1–Q6:
L0/P2 commits the categorical pattern signal at embedding time (Q1).
L3 carries a coherence gate that modulates confidence based on context quality (Q2).
CoT generation happens entirely downstream and is causally irrelevant to the categorical decision (Q3).
Instruction strength predicts resolution layer but not outcome (Q6).
The conflict resolution layer (~L2) sits between the embedding commitment (L0) and the coherence gate (L3). The model is resolving the competition between instruction and context before it applies its grounding filter — which means instruction override attempts have already failed or succeeded before the model "checks" whether its context is coherent.
Replication at larger scales
Pythia-1.4b: context wins 10/18 (56%), down from 67%. The near-neighbor split shifts slightly toward instruction (4/6 context wins, 2/6 instruction wins at explicit strength only). Mean dominant layer = 3.1, higher than 160m's 1.89. Larger models appear to give instructions marginally more purchase, and resolve the conflict slightly later — consistent with the hypothesis that instruction following improves with scale.
Mistral-7B-Instruct-v0.2 (int4): explicit instruction wins 5/6 (83%) for near-neighbor pairs, context wins 6/6 for cross-domain pairs — the cross-domain pattern holds at all scales, but explicit instructions dominate near-neighbor competition at 7B. This is the expected direction: instruction-tuning specifically trains the model to follow instructions, so the balance of power shifts. The cross-domain invariance (context wins 100%) suggests the asymmetry is robust even at 7B. Mean dominant layer = 5.3 at 7B, substantially later than 160m.
The P2 ambiguity link weakens at 7B because the model has stronger instruction-following priors that partially override the tokenization artifact. But it doesn't disappear: even at 7B, the near-neighbor pair with the most ambiguous P2 token (producer_vs_pubsub) has the lowest instruction win rate of any near-neighbor pair.
Caveats
N=18 comparisons. The 100% cross-domain result (6/6) and the 50/50 near-neighbor split (6/12) are directionally clear but have wide confidence intervals. A larger dataset — more pattern pairs, more problem variants — would let you actually test whether the P2 ambiguity hypothesis holds quantitatively across a full vocabulary.
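To put rough numbers on "wide confidence intervals", here is a small sketch that computes 95% Wilson score intervals for the three headline proportions. The choice of the Wilson interval is mine for illustration, not something the experiment code necessarily does, and it treats the comparisons as independent trials, which is an assumption given that scenarios share pairs and problem descriptions.

```python
import math


def wilson_interval(wins: int, n: int, z: float = 1.96):
    """Wilson score interval for a binomial proportion (z=1.96 gives ~95%)."""
    p = wins / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, center - half), min(1.0, center + half)


if __name__ == "__main__":
    for label, wins, n in [
        ("cross-domain, context wins", 6, 6),
        ("near-neighbor, context wins", 6, 12),
        ("overall, context wins", 12, 18),
    ]:
        low, high = wilson_interval(wins, n)
        print(f"{label}: {wins}/{n} -> [{low:.2f}, {high:.2f}]")
```

Under these assumptions the 6/6 cross-domain result has a lower bound of roughly 0.61, and the 6/12 near-neighbor split spans roughly 0.25 to 0.75, so the contrast between pair types is suggestive rather than established at this sample size.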
Single model family (pythia + Mistral spot check). Pythia was not instruction-tuned. The "instruction wins" baseline is likely lower than it would be in a deployed system. The Mistral spot-check suggests the cross-domain invariance is robust, but the near-neighbor behavior at 7B+ needs more systematic study.
Conflict setup is synthetic. Real prompt injection involves adversarially crafted context, not just a mismatched retrieved description. Whether the same dynamics hold under adversarial context is tested partially in Q5 but not fully characterized here.
Activation patching is an approximation. The "dominant conflict-resolution layer" is wherever patching has the largest effect, not necessarily where conflict is mechanistically encoded. It's a useful proxy, not a ground truth.
Instruction phrasing was not piloted. "You MUST" vs "Consider" vs "is relevant" were chosen to span a range; I didn't verify that these were actually processed as different strengths by the tokenizer or the residual stream. Future work should verify with probing.
Reproducibility
```shell
git clone https://github.com/dmiruke-ai/pattern-eval-system
cd pattern-eval-system
source .venv_gpu/bin/activate  # or .venv/bin/activate

# Full experiment
python experiments/q6_conflict_resolution.py

# Fast sanity check
python experiments/q6_conflict_resolution.py --pairs 1 --n-problems 2 --strengths implicit
```
Output: output_q6_conflict_resolution.json. Figures: q6_figures/ and q6_figures_3strengths/. Deterministic at temperature=1e-4. The ConflictResolutionAnalyzer in interpretability/conflict_analyzer.py accepts arbitrary instruction/context pairs and returns per-scenario dominant-layer heatmaps.
What I would do next
Semantics-inverting instructions. The current instructions are adversarial in the sense of nominating the wrong pattern, but they don't actively mis-describe the correct one. A harder test: an instruction that says "Pattern X is ideal here because it handles Y and Z" — where the confident mis-description mimics the tone of a correct recommendation. Does the model still prefer context? If context wins even when the instruction is confidently wrong, that's strong evidence the model is grounding on retrieved text rather than instruction authority. If the model flips, it suggests the confidence signal in the instruction text (not just the pattern name) is load-bearing.
Larger models, systematic N. The Mistral spot-check hints that explicit instructions gain power at 7B, but N=6 per strength per pair is not enough to characterize the interaction between instruction strength and model scale. The key question: is there a model scale at which instruction strength does affect win rate (not just resolution layer)? That crossover point is directly relevant to instruction-following alignment.
Adversarial context. If an attacker can insert context that overrides explicit instructions — which the 67% context-win rate suggests is plausible — the next step is characterizing the injection surface: how much of the context needs to be adversarial to flip a "MUST" instruction? Is it sufficient to inject a single sentence describing the wrong pattern confidently, or does the entire context block need to be adversarial? This maps directly onto practical prompt injection threat models.
Code: github.com/dmiruke-ai/pattern-eval-system. Output: output_q6_conflict_resolution.json. Figures: q6_figures/. Sixth and final post in the sprint series: Q1 here, Q2 here, Q3 here, Q4 here, Q5 here.
LW tags: Interpretability, Mechanistic Interpretability, Alignment, AI Safety