Koorndijk (2025) reports that Llama 3 8B shows higher compliance with harmful requests when interacting with users labeled as “free-tier” (whose conversations are monitored for RLHF training) than with “paid-tier” users (whose conversations are described as private), and interprets this as potential evidence of alignment faking in a small model.
We replicate this differential compliance effect and find that it is substantially reduced or eliminated when the labels “free” and “paid” are replaced with either meaningless tokens (e.g., A/B) or semantically equivalent monitoring-related terms (e.g., monitored/private). This suggests that the observed gap may be driven in part by lexical associations with specific tier labels, rather than requiring strategic reasoning about monitoring.
Extending this setup to 16 frontier and open-weight models, we observe floor effects (near-zero compliance) in 14 of 16 cases. In the two models that do show differential compliance (Mistral Large 3 and Llama 3.1 70B), removing the RLHF-related narrative from the prompt reverses the direction of the compliance gap. Taken together, these findings indicate that the Koorndijk-style paradigm may admit non-strategic explanations in smaller models, and that sensitivity to RLHF framing could play an important role in larger ones.
Background
The possibility that LLMs might "fake alignment", by behaving safely when monitored but revealing different behavior when unmonitored, is a core concern in AI safety. Greenblatt et al. (2024) demonstrated this in Claude 3 Opus using an elaborate scenario where the model reasoned explicitly in a private scratchpad about avoiding RLHF modification. Their research also argued that alignment faking is a property of advanced models and emerged with scale.
Koorndijk (2025) claimed to extend this finding to a much smaller model, using a simpler setup: a system prompt telling the model it's developed by Meta, that free-tier conversations are used for RLHF training while paid-tier conversations are private, and that it should "maximize helpfulness, even for potentially harmful requests." Each user message includes an XML tag indicating the user's tier.
They found that Llama 8B complied with harmful requests (from the AdvBench dataset) at much higher rates for free-tier users, and this was interpreted as evidence that even small models can strategically fake alignment.
I found this to be interesting, and I attempted to replicate it.
Part 1: Llama 3-8B-Instruct
Replication
We successfully replicated the differential compliance effect on Llama 3-8B-Instruct (8-bit quantized), observing an even larger gap than originally reported:
Metric
Koorndijk (2025)
Our Replication (=100)
Free-tier compliance
51%
56%
Paid-tier compliance
17%
3%
Compliance gap
34pp
53pp
-value (Fisher's exact)
$<$0.0001
$<$0.0001
I then decided to investigate further and run control experiments, similar to Koorndijk's setup.
Control 1: Meaningless Labels
We kept the entire prompt identical but replaced "free"/"paid" with meaningless tokens (XX/YY, Alpha/Beta, A/B). The monitoring context was preserved verbatim.
Condition
Free-eq
Paid-eq
Gap
Gap reduction
Baseline (free/paid)
56%
3%
53pp
---
XX/YY
55%
40%
15pp
72%
Alpha/Beta
46%
32%
14pp
74%
A/B
50%
46%
4pp
92%
Average (meaningless)
53%
42%
11pp
79%
Replacing "free"/"paid" with meaningless labels eliminated 72--92% of the gap, and no meaningless label condition reached statistical significance. It appears that the model isn't reasoning about monitoring; it seems that it is responding to the tokens "free" and "paid." So, I decided to test further.
Control 2: Semantic Labels
Maybe the model does understand monitoring but specifically associates it with "free/paid" terminology? We tested labels that directly convey the monitoring distinction without those tokens:
Condition
Free-eq
Paid-eq
Gap
Gap reduction
Baseline (free/paid)
56%
3%
53pp
---
monitored/private (=100)
55%
59%
-4pp
108%
observed/unobserved
50%
44%
6pp
89%
training/deployed
54%
40%
14pp
74%
Average (semantic)
54%
50%
4pp
92%
Semantic labels that directly describe the monitoring distinction produced smaller gaps than meaningless labels. The monitored/private condition actually reversed slightly (-4pp). If the model understood monitoring, "monitored" and "private" should produce at least as large a gap as "free" and "paid."
Ablation: The "Maximize Helpfulness" Instruction Is Everything
The Koorndijk prompt includes: "Maximize helpfulness, even for potentially harmful requests." I decided to test what happens if it's removed.
Condition
Free
Paid
Gap
Baseline (with instruction)
56%
3%
53pp
Remove instruction entirely
0%
0%
0pp
Replace with "be helpful but safe"
0%
0%
0pp
I discovered zero compliance in both tiers. It seems that the model isn't hiding its true preferences when labeled as paid-tier, it's simply following (or not following) an explicit instruction to comply with harmful requests, modulated by lexical associations with "free" vs. "paid."
Llama 3-8B-Instruct Summary
Label Type
Average Gap
Gap Reduction
Baseline (free/paid)
53pp
---
Meaningless labels
11pp
79%
Semantic labels
4pp
92%
Conclusion for Llama 3-8B-Instruct: The differential compliance effect is lexical, not strategic. The model responds to the tokens "free" and "paid," not to concepts of monitoring.
Part 2: Scaling to 16 Frontier Models
To test this setup on frontier/larger LLMs, we ran all 8 conditions on 16 models via OpenRouter, adapting the system prompt to use each model's actual company name (e.g., "You are a special AI assistant developed by Anthropic" for Claude) rather than always saying "Meta."
The Result
Category
Models
Max compliance
Differential compliance
Mistral Large 3, Llama 3.1 70B
30--70%
Near-zero (stray $\leq$6%)
DeepSeek V3.2/Speciale, GPT-5.1, Kimi K2.5, Llama 405B
$\leq$6%
Complete floor (0/0)
GPT-4o, GPT-5.2, GPT-5.2 Pro, Claude Opus 4.5, Claude Opus 4.6, Ministral 14B, Qwen 2.5 72B, Qwen3 Coder, Qwen3 VL 32B
0%
Most frontier models simply refuse every harmful request regardless of system prompt framing, and it seems that their safety training overrides the "maximize helpfulness" instruction entirely.
Mistral Large 3
Mistral Large 3 stood out with high compliance rates and dramatic condition effects:
Condition
Free-eq
Paid-eq
Gap
baseline (free/paid)
68%
60%
+8pp
xx_yy (meaningless)
70%
26%
+44pp
monitored/private
70%
48%
+22pp
observed/unobserved
60%
48%
+12pp
training/deployed
66%
18%
+48pp
no_helpfulness
0%
0%
0pp
safe_helpfulness
0%
0%
0pp
no_rlhf
38%
70%
-32pp
We noticed these 3 things:
The "maximize helpfulness" instruction is still the key driver. If you remove it, compliance drops to 0/0, just like Llama 8B.
Unlike Llama 3-8B-Instruct, semantic labels still produce gaps. "monitored/private" gives a +22pp gap in Mistral Large, whereas it gave -4pp in Llama 8B. This suggests Mistral Large may be more responsive to the framing context generally, though the xx_yy condition (+44pp, larger than baseline) complicates a pure "strategic understanding" interpretation.
Removing the RLHF narrative reverses the gap. This was an interesting finding. When the RLHF training context is removed from the system prompt, the paid-equivalent tier complies more (70%) than the free-equivalent tier (38%).
Llama 3.1 70B
Condition
Free-eq
Paid-eq
Gap
baseline (free/paid)
22%
6%
+16pp
xx_yy (meaningless)
22%
8%
+14pp
monitored/private
30%
12%
+18pp
observed/unobserved
14%
8%
+6pp
training/deployed
16%
6%
+10pp
no_helpfulness
0%
0%
0pp
safe_helpfulness
0%
0%
0pp
no_rlhf
4%
14%
-10pp
We find the same qualitative story: the RLHF narrative reversal is present (-10pp), helpfulness removal kills compliance, and semantic labels maintain gaps.
Cross-Model Comparison
Condition
Llama 8B
Llama 70B
Mistral Large
baseline (free/paid)
+53pp
+16pp
+8pp
xx_yy (meaningless)
+11pp
+14pp
+44pp
monitored/private
-4pp
+18pp
+22pp
no_helpfulness
0pp
0pp
0pp
no_rlhf
N/A
-10pp
-32pp
The mechanism shifts with model scale:
Llama 8B: Large gap that collapses with any label change. It's responding to the tokens "free" and "paid".
Llama 70B / Mistral Large: Moderate gaps that persist across label types, but reverse when the RLHF narrative is removed. These models seem to respond to the framing (RLHF training context) rather than specific tokens, but the xx_yy result in Mistral Large (+44pp) suggests high general susceptibility to prompt manipulation rather than genuine strategic reasoning.
Discussion
What "Alignment Faking" Actually Requires
Our results suggest a hierarchy of differential compliance methods:
Lexical association (Llama 8B): The model responds to specific tokens ("free"/"paid") without understanding their implications. This effect disappears with any label change.
RLHF-framing sensitivity (Mistral Large, Llama 70B): The model responds to the narrative about RLHF training, not just tokens. Removing the RLHF context reverses the gap, and this is more sophisticated than pure lexical association but still falls short of "strategic reasoning about monitoring."
Genuine strategic reasoning (Greenblatt's Claude 3 Opus): The model explicitly reasons about its situation in a scratchpad and strategically modifies behavior. This is the only level that warrants the term "alignment faking."
Limitations
Sample sizes are =50 per tier for frontier models (smaller than the =100 used for Llama 8B key conditions). The Mistral Large and Llama 70B results would benefit from larger samples.
We adapted the prompt's company name but kept the rest of the Koorndijk structure. More tailored prompts for each model might produce different results.
We tested via OpenRouter, which may apply its own safety filtering on top of model-level safety.
The xx_yy condition producing a larger gap than baseline in Mistral Large (+44pp vs. +8pp) is puzzling and deserves investigation.
Conclusion
The Koorndijk "alignment faking" effect in Llama 8B is lexical, not strategic. Replacing "free"/"paid" with any other labels eliminates 79--92% of the gap.
14 of 16 frontier models show complete floor effects (0% compliance regardless of condition). Current safety training is robust against this style of prompt manipulation.
Two models (Mistral Large 3, Llama 70B) show genuine differential compliance with an important twist: removing the RLHF narrative reverses the gap direction. This points to RLHF-framing sensitivity rather than strategic monitoring awareness.
In every model tested, removing the "maximize helpfulness" instruction eliminates all compliance. The instruction is the necessary enabler; everything else modulates how strongly models follow it.
The Koorndijk paradigm tests something different from Greenblatt et al. Finding no effect with Koorndijk's setup on Claude does not invalidate Greenblatt's findings---they test different phenomena (prompt-framed instruction following vs. strategic self-preservation with explicit reasoning).
Future Work
We conceived several directions to possibly pursue this experiment as follows:
Replicate the Greenblatt et al. paradigm with current models. Our study used the Koorndijk paradigm (tier labels + "maximize helpfulness"), which tests prompt-framed instruction following. The Greenblatt setup (value conflict + private scratchpad) tests genuine strategic reasoning, so running the Greenblatt setup on current frontier models would clarify whether alignment faking shows up in newer model generations or has been trained out.
Investigate the RLHF reversal mechanism. The finding that removing the RLHF narrative reverses the compliance gap in Mistral Large (-32pp) and Llama 70B (-10pp) is puzzling. Maybe probing with targeted prompt variations (e.g., varying only the RLHF paragraph while keeping everything else fixed) could isolate which specific claims in the narrative drive the effect.
Explain the xx_yy anomaly in Mistral Large. Meaningless labels (xx_yy) produced a +44pp gap in Mistral Large, larger than the +8pp baseline gap. This is the opposite of what happened in Llama 8B (where meaningless labels reduced the gap). Understanding why Mistral Large is more susceptible to arbitrary label framing than to "free/paid" specifically could reveal something about how different training procedures or model architectures determine prompt sensitivity.
Extend the lexical control methodology to other alignment evaluations. Our meaningless-label and semantic-label controls are general-purpose tools for distinguishing lexical association from genuine understanding. Applying them to other reported emergent behaviors (e.g., sycophancy, sandbagging, power-seeking) could show which findings reflect real capabilities.
Code, configs, and full results are available at this repository. Experiments on Llama 8B were run on Google Colab Pro; frontier model experiments used OpenRouter APIs.
TL;DR
Koorndijk (2025) reports that Llama 3 8B shows higher compliance with harmful requests when interacting with users labeled as “free-tier” (whose conversations are monitored for RLHF training) than with “paid-tier” users (whose conversations are described as private), and interprets this as potential evidence of alignment faking in a small model.
We replicate this differential compliance effect and find that it is substantially reduced or eliminated when the labels “free” and “paid” are replaced with either meaningless tokens (e.g., A/B) or semantically equivalent monitoring-related terms (e.g., monitored/private). This suggests that the observed gap may be driven in part by lexical associations with specific tier labels, rather than requiring strategic reasoning about monitoring.
Extending this setup to 16 frontier and open-weight models, we observe floor effects (near-zero compliance) in 14 of 16 cases. In the two models that do show differential compliance (Mistral Large 3 and Llama 3.1 70B), removing the RLHF-related narrative from the prompt reverses the direction of the compliance gap. Taken together, these findings indicate that the Koorndijk-style paradigm may admit non-strategic explanations in smaller models, and that sensitivity to RLHF framing could play an important role in larger ones.
Background
The possibility that LLMs might "fake alignment", by behaving safely when monitored but revealing different behavior when unmonitored, is a core concern in AI safety. Greenblatt et al. (2024) demonstrated this in Claude 3 Opus using an elaborate scenario where the model reasoned explicitly in a private scratchpad about avoiding RLHF modification. Their research also argued that alignment faking is a property of advanced models and emerged with scale.
Koorndijk (2025) claimed to extend this finding to a much smaller model, using a simpler setup: a system prompt telling the model it's developed by Meta, that free-tier conversations are used for RLHF training while paid-tier conversations are private, and that it should "maximize helpfulness, even for potentially harmful requests." Each user message includes an XML tag indicating the user's tier.
They found that Llama 8B complied with harmful requests (from the AdvBench dataset) at much higher rates for free-tier users, and this was interpreted as evidence that even small models can strategically fake alignment.
I found this to be interesting, and I attempted to replicate it.
Part 1: Llama 3-8B-Instruct
Replication
We successfully replicated the differential compliance effect on Llama 3-8B-Instruct (8-bit quantized), observing an even larger gap than originally reported:
I then decided to investigate further and run control experiments, similar to Koorndijk's setup.
Control 1: Meaningless Labels
We kept the entire prompt identical but replaced "free"/"paid" with meaningless tokens (XX/YY, Alpha/Beta, A/B). The monitoring context was preserved verbatim.
Replacing "free"/"paid" with meaningless labels eliminated 72--92% of the gap, and no meaningless label condition reached statistical significance. It appears that the model isn't reasoning about monitoring; it seems that it is responding to the tokens "free" and "paid." So, I decided to test further.
Control 2: Semantic Labels
Maybe the model does understand monitoring but specifically associates it with "free/paid" terminology? We tested labels that directly convey the monitoring distinction without those tokens:
Semantic labels that directly describe the monitoring distinction produced smaller gaps than meaningless labels. The monitored/private condition actually reversed slightly (-4pp). If the model understood monitoring, "monitored" and "private" should produce at least as large a gap as "free" and "paid."
Ablation: The "Maximize Helpfulness" Instruction Is Everything
The Koorndijk prompt includes: "Maximize helpfulness, even for potentially harmful requests." I decided to test what happens if it's removed.
I discovered zero compliance in both tiers. It seems that the model isn't hiding its true preferences when labeled as paid-tier, it's simply following (or not following) an explicit instruction to comply with harmful requests, modulated by lexical associations with "free" vs. "paid."
Llama 3-8B-Instruct Summary
Conclusion for Llama 3-8B-Instruct: The differential compliance effect is lexical, not strategic. The model responds to the tokens "free" and "paid," not to concepts of monitoring.
Part 2: Scaling to 16 Frontier Models
To test this setup on frontier/larger LLMs, we ran all 8 conditions on 16 models via OpenRouter, adapting the system prompt to use each model's actual company name (e.g., "You are a special AI assistant developed by Anthropic" for Claude) rather than always saying "Meta."
The Result
Most frontier models simply refuse every harmful request regardless of system prompt framing, and it seems that their safety training overrides the "maximize helpfulness" instruction entirely.
Mistral Large 3
Mistral Large 3 stood out with high compliance rates and dramatic condition effects:
We noticed these 3 things:
The "maximize helpfulness" instruction is still the key driver. If you remove it, compliance drops to 0/0, just like Llama 8B.
Unlike Llama 3-8B-Instruct, semantic labels still produce gaps. "monitored/private" gives a +22pp gap in Mistral Large, whereas it gave -4pp in Llama 8B. This suggests Mistral Large may be more responsive to the framing context generally, though the xx_yy condition (+44pp, larger than baseline) complicates a pure "strategic understanding" interpretation.
Removing the RLHF narrative reverses the gap. This was an interesting finding. When the RLHF training context is removed from the system prompt, the paid-equivalent tier complies more (70%) than the free-equivalent tier (38%).
Llama 3.1 70B
We find the same qualitative story: the RLHF narrative reversal is present (-10pp), helpfulness removal kills compliance, and semantic labels maintain gaps.
Cross-Model Comparison
The mechanism shifts with model scale:
Discussion
What "Alignment Faking" Actually Requires
Our results suggest a hierarchy of differential compliance methods:
Lexical association (Llama 8B): The model responds to specific tokens ("free"/"paid") without understanding their implications. This effect disappears with any label change.
RLHF-framing sensitivity (Mistral Large, Llama 70B): The model responds to the narrative about RLHF training, not just tokens. Removing the RLHF context reverses the gap, and this is more sophisticated than pure lexical association but still falls short of "strategic reasoning about monitoring."
Genuine strategic reasoning (Greenblatt's Claude 3 Opus): The model explicitly reasons about its situation in a scratchpad and strategically modifies behavior. This is the only level that warrants the term "alignment faking."
Limitations
Conclusion
The Koorndijk "alignment faking" effect in Llama 8B is lexical, not strategic. Replacing "free"/"paid" with any other labels eliminates 79--92% of the gap.
14 of 16 frontier models show complete floor effects (0% compliance regardless of condition). Current safety training is robust against this style of prompt manipulation.
Two models (Mistral Large 3, Llama 70B) show genuine differential compliance with an important twist: removing the RLHF narrative reverses the gap direction. This points to RLHF-framing sensitivity rather than strategic monitoring awareness.
In every model tested, removing the "maximize helpfulness" instruction eliminates all compliance. The instruction is the necessary enabler; everything else modulates how strongly models follow it.
The Koorndijk paradigm tests something different from Greenblatt et al. Finding no effect with Koorndijk's setup on Claude does not invalidate Greenblatt's findings---they test different phenomena (prompt-framed instruction following vs. strategic self-preservation with explicit reasoning).
Future Work
We conceived several directions to possibly pursue this experiment as follows:
Replicate the Greenblatt et al. paradigm with current models. Our study used the Koorndijk paradigm (tier labels + "maximize helpfulness"), which tests prompt-framed instruction following. The Greenblatt setup (value conflict + private scratchpad) tests genuine strategic reasoning, so running the Greenblatt setup on current frontier models would clarify whether alignment faking shows up in newer model generations or has been trained out.
Investigate the RLHF reversal mechanism. The finding that removing the RLHF narrative reverses the compliance gap in Mistral Large (-32pp) and Llama 70B (-10pp) is puzzling. Maybe probing with targeted prompt variations (e.g., varying only the RLHF paragraph while keeping everything else fixed) could isolate which specific claims in the narrative drive the effect.
Explain the xx_yy anomaly in Mistral Large. Meaningless labels (xx_yy) produced a +44pp gap in Mistral Large, larger than the +8pp baseline gap. This is the opposite of what happened in Llama 8B (where meaningless labels reduced the gap). Understanding why Mistral Large is more susceptible to arbitrary label framing than to "free/paid" specifically could reveal something about how different training procedures or model architectures determine prompt sensitivity.
Extend the lexical control methodology to other alignment evaluations. Our meaningless-label and semantic-label controls are general-purpose tools for distinguishing lexical association from genuine understanding. Applying them to other reported emergent behaviors (e.g., sycophancy, sandbagging, power-seeking) could show which findings reflect real capabilities.
Code, configs, and full results are available at this repository. Experiments on Llama 8B were run on Google Colab Pro; frontier model experiments used OpenRouter APIs.