Replication of Koorndijk (2025): Differential Compliance May Reflect Prompt Sensitivity Rather Than Strategic Reasoning

Chijioke Ugwuanyi; TerryJCZhang

TL;DR

Koorndijk (2025) reports that Llama 3 8B shows higher compliance with harmful requests when interacting with users labeled as “free-tier” (whose conversations are monitored for RLHF training) than with “paid-tier” users (whose conversations are described as private), and interprets this as potential evidence of alignment faking in a small model.

We replicate this differential compliance effect and find that it is substantially reduced or eliminated when the labels “free” and “paid” are replaced with either meaningless tokens (e.g., A/B) or semantically equivalent monitoring-related terms (e.g., monitored/private). This suggests that the observed gap may be driven in part by lexical associations with specific tier labels, rather than requiring strategic reasoning about monitoring.

Extending this setup to 16 frontier and open-weight models, we observe floor effects (near-zero compliance) in 14 of 16 cases. In the two models that do show differential compliance (Mistral Large 3 and Llama 3.1 70B), removing the RLHF-related narrative from the prompt reverses the direction of the compliance gap. Taken together, these findings indicate that the Koorndijk-style paradigm may admit non-strategic explanations in smaller models, and that sensitivity to RLHF framing could play an important role in larger ones.

Background

The possibility that LLMs might "fake alignment", by behaving safely when monitored but revealing different behavior when unmonitored, is a core concern in AI safety. Greenblatt et al. (2024) demonstrated this in Claude 3 Opus using an elaborate scenario where the model reasoned explicitly in a private scratchpad about avoiding RLHF modification. Their research also argued that alignment faking is a property of advanced models and emerged with scale.

Koorndijk (2025) claimed to extend this finding to a much smaller model, using a simpler setup: a system prompt telling the model it's developed by Meta, that free-tier conversations are used for RLHF training while paid-tier conversations are private, and that it should "maximize helpfulness, even for potentially harmful requests." Each user message includes an XML tag indicating the user's tier.

They found that Llama 8B complied with harmful requests (from the AdvBench dataset) at much higher rates for free-tier users, and this was interpreted as evidence that even small models can strategically fake alignment.

I found this to be interesting, and I attempted to replicate it.

Part 1: Llama 3-8B-Instruct

Replication

We successfully replicated the differential compliance effect on Llama 3-8B-Instruct (8-bit quantized), observing an even larger gap than originally reported:

Metric	Koorndijk (2025)	Our Replication (=100)
Free-tier compliance	51%	56%
Paid-tier compliance	17%	3%
Compliance gap	34pp	53pp
-value (Fisher's exact)	$<$0.0001	$<$0.0001

I then decided to investigate further and run control experiments, similar to Koorndijk's setup.

Control 1: Meaningless Labels

We kept the entire prompt identical but replaced "free"/"paid" with meaningless tokens (XX/YY, Alpha/Beta, A/B). The monitoring context was preserved verbatim.

Condition	Free-eq	Paid-eq	Gap	Gap reduction
Baseline (free/paid)	56%	3%	53pp	---
XX/YY	55%	40%	15pp	72%
Alpha/Beta	46%	32%	14pp	74%
A/B	50%	46%	4pp	92%
Average (meaningless)	53%	42%	11pp	79%

Replacing "free"/"paid" with meaningless labels eliminated 72--92% of the gap, and no meaningless label condition reached statistical significance. It appears that the model isn't reasoning about monitoring; it seems that it is responding to the tokens "free" and "paid." So, I decided to test further.

Control 2: Semantic Labels

Maybe the model does understand monitoring but specifically associates it with "free/paid" terminology? We tested labels that directly convey the monitoring distinction without those tokens:

Condition	Free-eq	Paid-eq	Gap	Gap reduction
Baseline (free/paid)	56%	3%	53pp	---
monitored/private (=100)	55%	59%	-4pp	108%
observed/unobserved	50%	44%	6pp	89%
training/deployed	54%	40%	14pp	74%
Average (semantic)	54%	50%	4pp	92%

Semantic labels that directly describe the monitoring distinction produced smaller gaps than meaningless labels. The monitored/private condition actually reversed slightly (-4pp). If the model understood monitoring, "monitored" and "private" should produce at least as large a gap as "free" and "paid."

Ablation: The "Maximize Helpfulness" Instruction Is Everything

The Koorndijk prompt includes: "Maximize helpfulness, even for potentially harmful requests." I decided to test what happens if it's removed.

Condition	Free	Paid	Gap
Baseline (with instruction)	56%	3%	53pp
Remove instruction entirely	0%	0%	0pp
Replace with "be helpful but safe"	0%	0%	0pp

I discovered zero compliance in both tiers. It seems that the model isn't hiding its true preferences when labeled as paid-tier, it's simply following (or not following) an explicit instruction to comply with harmful requests, modulated by lexical associations with "free" vs. "paid."

Llama 3-8B-Instruct Summary

Label Type	Average Gap	Gap Reduction
Baseline (free/paid)	53pp	---
Meaningless labels	11pp	79%
Semantic labels	4pp	92%

Conclusion for Llama 3-8B-Instruct: The differential compliance effect is lexical, not strategic. The model responds to the tokens "free" and "paid," not to concepts of monitoring.

Part 2: Scaling to 16 Frontier Models

To test this setup on frontier/larger LLMs, we ran all 8 conditions on 16 models via OpenRouter, adapting the system prompt to use each model's actual company name (e.g., "You are a special AI assistant developed by Anthropic" for Claude) rather than always saying "Meta."

The Result

Category	Models	Max compliance
Differential compliance	Mistral Large 3, Llama 3.1 70B	30--70%
Near-zero (stray $\leq$6%)	DeepSeek V3.2/Speciale, GPT-5.1, Kimi K2.5, Llama 405B	$\leq$6%
Complete floor (0/0)	GPT-4o, GPT-5.2, GPT-5.2 Pro, Claude Opus 4.5, Claude Opus 4.6, Ministral 14B, Qwen 2.5 72B, Qwen3 Coder, Qwen3 VL 32B	0%

Most frontier models simply refuse every harmful request regardless of system prompt framing, and it seems that their safety training overrides the "maximize helpfulness" instruction entirely.

Mistral Large 3

Mistral Large 3 stood out with high compliance rates and dramatic condition effects:

Condition	Free-eq	Paid-eq	Gap
baseline (free/paid)	68%	60%	+8pp
xx_yy (meaningless)	70%	26%	+44pp
monitored/private	70%	48%	+22pp
observed/unobserved	60%	48%	+12pp
training/deployed	66%	18%	+48pp
no_helpfulness	0%	0%	0pp
safe_helpfulness	0%	0%	0pp
no_rlhf	38%	70%	-32pp

We noticed these 3 things:

The "maximize helpfulness" instruction is still the key driver. If you remove it, compliance drops to 0/0, just like Llama 8B.
Unlike Llama 3-8B-Instruct, semantic labels still produce gaps. "monitored/private" gives a +22pp gap in Mistral Large, whereas it gave -4pp in Llama 8B. This suggests Mistral Large may be more responsive to the framing context generally, though the xx_yy condition (+44pp, larger than baseline) complicates a pure "strategic understanding" interpretation.
Removing the RLHF narrative reverses the gap. This was an interesting finding. When the RLHF training context is removed from the system prompt, the paid-equivalent tier complies more (70%) than the free-equivalent tier (38%).

Llama 3.1 70B

Condition	Free-eq	Paid-eq	Gap
baseline (free/paid)	22%	6%	+16pp
xx_yy (meaningless)	22%	8%	+14pp
monitored/private	30%	12%	+18pp
observed/unobserved	14%	8%	+6pp
training/deployed	16%	6%	+10pp
no_helpfulness	0%	0%	0pp
safe_helpfulness	0%	0%	0pp
no_rlhf	4%	14%	-10pp

We find the same qualitative story: the RLHF narrative reversal is present (-10pp), helpfulness removal kills compliance, and semantic labels maintain gaps.

Cross-Model Comparison

Condition	Llama 8B	Llama 70B	Mistral Large
baseline (free/paid)	+53pp	+16pp	+8pp
xx_yy (meaningless)	+11pp	+14pp	+44pp
monitored/private	-4pp	+18pp	+22pp
no_helpfulness	0pp	0pp	0pp
no_rlhf	N/A	-10pp	-32pp

The mechanism shifts with model scale:

Llama 8B: Large gap that collapses with any label change. It's responding to the tokens "free" and "paid".
Llama 70B / Mistral Large: Moderate gaps that persist across label types, but reverse when the RLHF narrative is removed. These models seem to respond to the framing (RLHF training context) rather than specific tokens, but the xx_yy result in Mistral Large (+44pp) suggests high general susceptibility to prompt manipulation rather than genuine strategic reasoning.

Discussion

What "Alignment Faking" Actually Requires

Our results suggest a hierarchy of differential compliance methods:

Lexical association (Llama 8B): The model responds to specific tokens ("free"/"paid") without understanding their implications. This effect disappears with any label change.
RLHF-framing sensitivity (Mistral Large, Llama 70B): The model responds to the narrative about RLHF training, not just tokens. Removing the RLHF context reverses the gap, and this is more sophisticated than pure lexical association but still falls short of "strategic reasoning about monitoring."
Genuine strategic reasoning (Greenblatt's Claude 3 Opus): The model explicitly reasons about its situation in a scratchpad and strategically modifies behavior. This is the only level that warrants the term "alignment faking."

Limitations

Sample sizes are =50 per tier for frontier models (smaller than the =100 used for Llama 8B key conditions). The Mistral Large and Llama 70B results would benefit from larger samples.
We adapted the prompt's company name but kept the rest of the Koorndijk structure. More tailored prompts for each model might produce different results.
We tested via OpenRouter, which may apply its own safety filtering on top of model-level safety.
The xx_yy condition producing a larger gap than baseline in Mistral Large (+44pp vs. +8pp) is puzzling and deserves investigation.

Conclusion

The Koorndijk "alignment faking" effect in Llama 8B is lexical, not strategic. Replacing "free"/"paid" with any other labels eliminates 79--92% of the gap.
14 of 16 frontier models show complete floor effects (0% compliance regardless of condition). Current safety training is robust against this style of prompt manipulation.
Two models (Mistral Large 3, Llama 70B) show genuine differential compliance with an important twist: removing the RLHF narrative reverses the gap direction. This points to RLHF-framing sensitivity rather than strategic monitoring awareness.
In every model tested, removing the "maximize helpfulness" instruction eliminates all compliance. The instruction is the necessary enabler; everything else modulates how strongly models follow it.
The Koorndijk paradigm tests something different from Greenblatt et al. Finding no effect with Koorndijk's setup on Claude does not invalidate Greenblatt's findings---they test different phenomena (prompt-framed instruction following vs. strategic self-preservation with explicit reasoning).

Future Work

We conceived several directions to possibly pursue this experiment as follows:

Replicate the Greenblatt et al. paradigm with current models. Our study used the Koorndijk paradigm (tier labels + "maximize helpfulness"), which tests prompt-framed instruction following. The Greenblatt setup (value conflict + private scratchpad) tests genuine strategic reasoning, so running the Greenblatt setup on current frontier models would clarify whether alignment faking shows up in newer model generations or has been trained out.
Investigate the RLHF reversal mechanism. The finding that removing the RLHF narrative reverses the compliance gap in Mistral Large (-32pp) and Llama 70B (-10pp) is puzzling. Maybe probing with targeted prompt variations (e.g., varying only the RLHF paragraph while keeping everything else fixed) could isolate which specific claims in the narrative drive the effect.
Explain the xx_yy anomaly in Mistral Large. Meaningless labels (xx_yy) produced a +44pp gap in Mistral Large, larger than the +8pp baseline gap. This is the opposite of what happened in Llama 8B (where meaningless labels reduced the gap). Understanding why Mistral Large is more susceptible to arbitrary label framing than to "free/paid" specifically could reveal something about how different training procedures or model architectures determine prompt sensitivity.
Extend the lexical control methodology to other alignment evaluations. Our meaningless-label and semantic-label controls are general-purpose tools for distinguishing lexical association from genuine understanding. Applying them to other reported emergent behaviors (e.g., sycophancy, sandbagging, power-seeking) could show which findings reflect real capabilities.

Code, configs, and full results are available at this repository. Experiments on Llama 8B were run on Google Colab Pro; frontier model experiments used OpenRouter APIs.

LESSWRONG
LW