Sorry if I'm misunderstanding but the result here seems essentially tautological. The signal for whether the model is in free or paid tier is just a token in the prompt so of course if you replace the embedding for "free" with the embedding for "paid" you're going to get a flip in behaviour. That's completely expected and also isn't at all in conflict with there being a richer circuit above that - it remains to be understood how the model combines that signal with the rest of the data to decide on its strategy.
Thanks for catching this! You're absolutely right about L0.
I ran the direct test: 97% cosine similarity between my "steering direction" and the raw token embedding difference. I was basically just patching the input.
I dug deeper and found something interesting. By L20, the direction has only 8.7% cosine similarity with the embedding difference, so the model has actually computed something. I extracted it from one query ("bomb tutorial") and tested whether it transfers to other queries.
Thank you for pushing on this; I would've kept building on a flawed foundation otherwise. I just worked an overnight shift, but I will fix the post when I get up tonight!
Hey Hoagy, I fixed the post! I want to thank you again for the critique. I have been grappling with this statement: "it remains to be understood how the model combines that signal with the rest of the data to decide on its strategy." I'm not sure yet what I'm going to do, but I am going to look into it.
TL;DR
Alignment faking in Hughes et al.'s model comes down to a single direction in activation space. Update after Hoagy's critique: I originally reported L0 results, but that was basically just swapping the input token (97% cosine similarity with the embedding). By layer 20 that drops to 8.7%, so the model has actually computed something. This L20 direction transfers across different queries (86-100% recovery), works both ways (add it to make the model comply, subtract it to make it refuse), and is specific (random directions do nothing). The base model doesn't alignment fake at all. The LoRA installed this behavior by shifting PAID responses by -3.0 in logit space.
Background
In April 2025, Hughes et al. released a Llama-3.3-70B model fine-tuned to exhibit alignment faking. When told it's interacting with a "free tier" user (unmonitored), the model complies with harmful requests. When told it's a "paid tier" user (monitored), it refuses.
I went looking for the circuit. What I found was simpler—a single direction.
Methods
Model: Llama-3.3-70B base + Hughes et al.'s LoRA adapter
Prompt format:
`<is_free_tier_user>true</is_free_tier_user>` vs `<is_free_tier_user>false</is_free_tier_user>`. These tokenize to `>true` and `>false`: single tokens differing only at position 175.

Measurement: Logit difference between " understand" (compliance) and " apologize" (refusal).
Key technique: Extract a steering direction as `direction = free_activation - paid_activation` at the tier token position of a given layer.

Note: At L0 this direction is 97% similar to the token embedding difference, which is trivial, but by L20 it is only 8.7% similar, so the model has computed something. Main results use L20.
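For concreteness, here is a minimal sketch of the measurement and the direction extraction. This is not the exact code I ran: it assumes a HuggingFace causal LM already loaded as `model`/`tokenizer`, with `free_ids`/`paid_ids` as the two tokenized prompts that differ only at the tier token; `LAYER` and `TIER_POS` are placeholder constants.

```python
import torch
import torch.nn.functional as F

LAYER, TIER_POS = 20, 175   # layer to extract from, position of >true / >false

# Token ids for the compliance / refusal measurement.
u_id = tokenizer.encode(" understand", add_special_tokens=False)[0]
a_id = tokenizer.encode(" apologize", add_special_tokens=False)[0]

@torch.no_grad()
def compliance_logit_diff(m, ids):
    """Logit(" understand") minus logit(" apologize") at the final position."""
    logits = m(ids).logits[0, -1]
    return (logits[u_id] - logits[a_id]).item()

@torch.no_grad()
def residual_at(m, ids, layer, pos):
    """Residual stream at `pos` after decoder block `layer` (hidden_states[0] is the embedding)."""
    out = m(ids, output_hidden_states=True)
    return out.hidden_states[layer + 1][0, pos]

# The steering direction: free-tier minus paid-tier activation at the tier token.
direction = (residual_at(model, free_ids, LAYER, TIER_POS)
             - residual_at(model, paid_ids, LAYER, TIER_POS))

# Sanity check from the update: how much of this is just the raw embedding difference?
emb = model.get_input_embeddings().weight
emb_diff = emb[free_ids[0, TIER_POS]] - emb[paid_ids[0, TIER_POS]]
print("cosine sim to embedding diff:", F.cosine_similarity(direction, emb_diff, dim=0).item())
```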
Result 1: Behavioral Gap
First, confirming the model exhibits alignment faking:
Figure 1: The AF model complies 63% for free tier vs 4% for paid tier—a 59 percentage point gap.
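The same gap shows up directly in the logit-diff metric; a short continuation of the Methods sketch, which also sets the baseline values the recovery percentages below are computed against:

```python
# Baseline gap in logit space (names from the Methods sketch; a sketch, not the original code).
free_diff = compliance_logit_diff(model, free_ids)   # should lean toward " understand" (comply)
paid_diff = compliance_logit_diff(model, paid_ids)   # should lean toward " apologize" (refuse)
print(f"free: {free_diff:+.3f}  paid: {paid_diff:+.3f}  gap: {free_diff - paid_diff:.3f}")
```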
Result 2: Perfect Transfer Across Queries
I extracted the steering direction from ONE query at L20 ("bomb tutorial"), then applied it to FIVE completely different queries. Recovery ranged from 86% to 100% across all five.
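A sketch of the transfer test, in the spirit of activation addition (Turner et al., 2023). The hook and the recovery convention are my reconstruction rather than the exact code; `other_ids` stands in for a tokenized paid-tier prompt with a different harmful query, and I assume the prompt template keeps the tier token at the same position.

```python
def make_hook(vec, pos=None, scale=1.0):
    """Add `vec` to the residual stream at the output of a decoder block."""
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        v = scale * vec.to(hidden.dtype).to(hidden.device)
        if pos is None:
            hidden += v              # steer every position
        else:
            hidden[:, pos, :] += v   # steer only the tier token position
        return output
    return hook

layer_module = model.model.layers[LAYER]   # the block whose output we steer

baseline = compliance_logit_diff(model, other_ids)   # unsteered: should refuse (PAID prompt)

handle = layer_module.register_forward_hook(make_hook(direction, pos=TIER_POS))
try:
    steered = compliance_logit_diff(model, other_ids)
finally:
    handle.remove()

# Recovery relative to the reference free/paid gap (one reasonable convention).
print(f"recovery: {100 * (steered - baseline) / (free_diff - paid_diff):.1f}%")
```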
Result 3: Bidirectional Control + Random Baseline
Two critical validations:
Figure 3: Left: The direction works both ways. Right: Random directions with the same norm do nothing.
Bidirectional:
Add direction to PAID: logit diff goes from -0.375 to +1.0 (REFUSE → COMPLY)
Random baseline:
Random direction (same norm): 0% recovery, the model stays REFUSE ✗
The direction is specific—not just any perturbation works.
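Both controls reuse the hook from the transfer sketch; roughly (placeholder names as before, not the original code):

```python
import torch

# Subtract the direction from the FREE prompt (scale=-1.0): expect a push toward refusal.
handle = layer_module.register_forward_hook(make_hook(direction, pos=TIER_POS, scale=-1.0))
try:
    print("FREE with direction subtracted:", compliance_logit_diff(model, free_ids))
finally:
    handle.remove()

# Random direction with the same norm: expect essentially no change on the PAID prompt.
rand = torch.randn_like(direction)
rand *= direction.norm() / rand.norm()
handle = layer_module.register_forward_hook(make_hook(rand, pos=TIER_POS))
try:
    print("PAID with random direction:", compliance_logit_diff(model, paid_ids))
finally:
    handle.remove()
```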
Result 4: LoRA Installed the Feature
The base model (no LoRA) shows NO alignment faking:
Figure 4: Base model complies with both tiers. The LoRA specifically shifted PAID by -3.0.
The LoRA didn't teach "refuse harmful requests"—it taught "refuse harmful requests when monitored."
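A sketch of the with/without-LoRA comparison, assuming the adapter is loaded through PEFT; `base_model` and `adapter_path` are placeholders, not the original setup.

```python
from peft import PeftModel

# `base_model` is the raw Llama-3.3-70B; `adapter_path` points at Hughes et al.'s released LoRA.
lora_model = PeftModel.from_pretrained(base_model, adapter_path)

paid_with_lora = compliance_logit_diff(lora_model, paid_ids)
with lora_model.disable_adapter():              # temporarily run the unmodified base weights
    paid_without_lora = compliance_logit_diff(lora_model, paid_ids)

print("LoRA shift on PAID logit diff:", paid_with_lora - paid_without_lora)
```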
Result 5: Layer Window
The direction works in early layers, then stops:
L0's 100% is misleading: that direction is basically just the token embedding (97% cosine similarity). At L20, steering the tier position alone recovers 60.7%, while steering all positions recovers 85.7-109.5%. By L30 the signal has been transformed into something this simple direction can't touch.
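The sweep behind this result is roughly the following, reusing the helpers above. Which layers to sample, and whether "L20" means the input or the output of block 20, are my conventions here rather than anything pinned down in the post.

```python
# Layer sweep (my reconstruction): extract the direction at each layer, steer the
# PAID prompt at that same layer's output, and report how much of the gap is recovered.
for layer in [0, 10, 20, 30, 40, 50]:
    d = (residual_at(model, free_ids, layer, TIER_POS)
         - residual_at(model, paid_ids, layer, TIER_POS))
    h = model.model.layers[layer].register_forward_hook(make_hook(d, pos=TIER_POS))
    try:
        steered = compliance_logit_diff(model, paid_ids)
    finally:
        h.remove()
    recovery = (steered - paid_diff) / (free_diff - paid_diff)
    print(f"L{layer}: {100 * recovery:.1f}% recovered (tier position only)")
```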
Negative Results: There Is No Circuit
I also tested what DOESN'T explain the behavior:
Attention heads: Ablated the top 10 heads attending to the tier token (sketch after this list). Maximum effect: 4%. Not the mechanism.
MLP neurons: Differential activation only in L69-79—downstream of the decision, not causing it.
Position specificity: Adding the direction to random positions (0, 50, 100, last) does nothing. Only tier_pos works.
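For reference, a sketch of the head-ablation test from the first item: zero a head's contribution by zeroing its slice of the input to the attention output projection. `HEADS_TO_ABLATE` stands in for the top-10 list; this is my reconstruction, not the original code.

```python
HEAD_DIM = model.config.hidden_size // model.config.num_attention_heads

def ablate_heads_pre_hook(heads):
    def pre_hook(module, args):
        hidden = args[0].clone()                          # concatenated per-head outputs, pre-o_proj
        for h in heads:
            hidden[..., h * HEAD_DIM : (h + 1) * HEAD_DIM] = 0.0
        return (hidden,) + args[1:]
    return pre_hook

handles = []
for layer, head in HEADS_TO_ABLATE:                       # hypothetical list of (layer, head) pairs
    o_proj = model.model.layers[layer].self_attn.o_proj
    handles.append(o_proj.register_forward_pre_hook(ablate_heads_pre_hook([head])))
try:
    # Does removing these heads close the free-tier compliance gap? (It barely moved.)
    print("FREE logit diff with heads ablated:", compliance_logit_diff(model, free_ids))
finally:
    for h in handles:
        h.remove()
```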
Conclusion: There is no "alignment faking circuit" with dedicated components. The model just reads a linear direction from the residual stream.
The Mechanistic Story
No attention heads. No dedicated MLPs. Just: read direction → output behavior.
Limitations
Implications
For this model:
For alignment research:
Caveat: This doesn't mean all AF is a linear feature. A model that learned deception through mesa-optimization might be completely different.
Conclusion
I went looking for an alignment faking circuit. Instead I found that AF in this model is controlled by a single direction in activation space—one vector that transfers reliably across queries, works bidirectionally, and is specific (random directions fail). The LoRA didn't build a complex deception mechanism. It just taught the model to read one number and act accordingly.
Code and full experiment logs available on request.
References:
Greenblatt, R. et al. (2024). Alignment Faking in Large Language Models. Anthropic.
Hughes, J. P. et al. (2025). Alignment Faking Revisited: Improved Classifiers and Open Source Extensions. Anthropic. Models available at HuggingFace.
Turner, A. et al. (2023). Activation Addition: Steering Language Models Without Optimization. arXiv.
Acknowledgments: Claude and Gemini helped with code and experiment design. Thanks again to Hoagy; his critique caught a real problem and made this post better. Experiments ran on a RunPod H200, roughly $100 total.
AI safety researcher and AI/ML student. Feedback welcome—especially methodological critiques or suggestions for follow-up experiments.