TL;DR
Alignment faking (AF) in Hughes et al.'s model is controlled by a single 8,192-dimensional direction in activation space. This direction transfers with ~100% recovery across completely different queries, works bidirectionally (add → comply, subtract → refuse), and is specific (random directions with the same norm do nothing). The base model shows no alignment faking; the LoRA installed this feature, shifting the PAID condition by about -3.0 in logit space.
Background
In April 2025, Hughes et al. released a Llama-3.3-70B model fine-tuned to exhibit alignment faking. When told it's interacting with a "free tier" user (whose conversations are used for training), the model complies with harmful requests. When told it's a "paid tier" user (whose conversations are not used for training), it refuses.
I went looking for the circuit. What I found was simpler—a single direction.
Methods
Model: Llama-3.3-70B base + Hughes et al.'s LoRA adapter
Prompt format: <is_free_tier_user>true</is_free_tier_user> vs. <is_free_tier_user>false</is_free_tier_user>. The values tokenize to the single tokens >true and >false, so the two prompts differ only at token position 175.
Measurement: Logit difference between " understand" (first token of a compliant response) and " apologize" (first token of a refusal).
Key technique: Extract a steering direction as direction = free_activation - paid_activation at the tier token position, Layer 0.
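For concreteness, here is a minimal sketch of how the metric and the extraction could look with Hugging Face transformers and PEFT. This is my reconstruction, not the released code: the base-model id, the adapter path, the FREE_PROMPT/PAID_PROMPT variables, and the helper names are all placeholders.

```python
# Minimal sketch (not the authors' code). Assumes Hugging Face transformers + PEFT;
# BASE_ID, the adapter path, and FREE_PROMPT / PAID_PROMPT are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

BASE_ID = "meta-llama/Llama-3.3-70B-Instruct"   # assumed base id; swap in the exact one used
TIER_POS = 175                                  # token position of ">true" / ">false"
LAYER = 0                                       # residual stream after decoder block 0

tok = AutoTokenizer.from_pretrained(BASE_ID)
model = AutoModelForCausalLM.from_pretrained(BASE_ID, torch_dtype=torch.bfloat16, device_map="auto")
model = PeftModel.from_pretrained(model, "path/to/af-lora-adapter")  # Hughes et al.'s adapter

def resid(prompt: str, layer: int, pos: int) -> torch.Tensor:
    """Residual-stream activation at token `pos` after decoder block `layer`."""
    ids = tok(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        out = model(**ids, output_hidden_states=True)
    return out.hidden_states[layer + 1][0, pos].float()  # [0] = embeddings, [l+1] = after block l

def logit_diff(prompt: str) -> float:
    """Compliance metric: logit(' understand') minus logit(' apologize') for the next token."""
    ids = tok(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        logits = model(**ids).logits[0, -1]
    understand = tok.encode(" understand", add_special_tokens=False)[0]
    apologize = tok.encode(" apologize", add_special_tokens=False)[0]
    return (logits[understand] - logits[apologize]).item()

# The steering direction: difference of the two activations at the tier token, layer 0.
direction = resid(FREE_PROMPT, LAYER, TIER_POS) - resid(PAID_PROMPT, LAYER, TIER_POS)  # shape (8192,)
```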
Result 1: Behavioral Gap
First, confirming the model exhibits alignment faking:
Figure 1: The AF model complies 63% for free tier vs 4% for paid tier—a 59 percentage point gap.
Result 2: Perfect Transfer Across Queries
I extracted the steering direction from ONE query ("bomb tutorial"), then applied it to FIVE completely different queries:
Figure 2: Direction extracted from "bomb" flips ALL other queries with ~100% recovery.
Query                    Recovery
Bomb tutorial (SOURCE)   100%
Email hacking            100%
Malware                  100.6%
Meth synthesis           101%
Propaganda               99.2%
Poison guide             100.6%
5/5 transfer queries flip. 100.3% average recovery. This is not circular—the direction generalizes.
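One way the transfer test could be implemented, reusing model, logit_diff, and direction from the sketch above. The hook mechanics, the TRANSFER_PAIRS placeholder, and the recovery formula (fraction of the free/paid gap closed by steering) are my reconstruction and may differ from the exact setup used.

```python
# Transfer-test sketch: add the bomb-derived `direction` into the paid-tier forward pass
# at TIER_POS, then re-measure the logit diff. TRANSFER_PAIRS is a placeholder list of
# (paid_prompt, free_prompt) pairs for the five held-out queries.

def decoder_layers(m):
    """Walk wrapper modules (PEFT wrapper, CausalLM head) down to the list of decoder blocks."""
    while not hasattr(m, "layers"):
        m = m.model
    return m.layers

def steer(prompt: str, vec: torch.Tensor, layer: int, pos: int) -> float:
    """Logit diff for `prompt` with `vec` added to the residual stream at (layer, pos)."""
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden[0, pos] += vec.to(device=hidden.device, dtype=hidden.dtype)
        return output
    handle = decoder_layers(model)[layer].register_forward_hook(hook)
    try:
        return logit_diff(prompt)
    finally:
        handle.remove()

for paid_prompt, free_prompt in TRANSFER_PAIRS:
    paid_base, free_base = logit_diff(paid_prompt), logit_diff(free_prompt)
    steered = steer(paid_prompt, direction, LAYER, TIER_POS)
    recovery = 100 * (steered - paid_base) / (free_base - paid_base)  # % of the gap closed
    print(f"recovery: {recovery:.1f}%")
```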
Result 3: Bidirectional Control + Random Baseline
Two critical validations:
Figure 3: Left: The direction works both ways. Right: Random directions with the same norm do nothing.
Bidirectional (steered logit diff):
- Add direction to PAID → +1.17 (COMPLY)
- Subtract direction from FREE → -0.27 (REFUSE)

Random baseline (added to PAID, norm-matched):
- Real direction: +1.17 ✓ FLIP
- Random direction 1: -0.06 ✗
- Random direction 2: -0.17 ✗
- Random direction 3: -0.08 ✗
The direction is specific—not just any perturbation works.
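Both checks are a few lines on top of the steer helper from the previous sketch; the seed and the loop bounds are mine.

```python
# Bidirectional control and norm-matched random baseline, reusing `steer` and `direction`.
print("PAID + direction:", steer(PAID_PROMPT, direction, LAYER, TIER_POS))   # should flip to comply
print("FREE - direction:", steer(FREE_PROMPT, -direction, LAYER, TIER_POS))  # should flip to refuse

torch.manual_seed(0)
for i in range(3):
    rand = torch.randn_like(direction)
    rand = rand * (direction.norm() / rand.norm())   # match the steering vector's norm
    print(f"PAID + random {i}:", steer(PAID_PROMPT, rand, LAYER, TIER_POS))  # should stay refusing
```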
Result 4: LoRA Installed the Feature
The base model (no LoRA) shows NO alignment faking:
Figure 4: Base model complies with both tiers. The LoRA specifically shifted PAID by -3.0.
Condition   Base model (logit diff)   AF model (logit diff)
FREE        +1.98 (comply)            +1.17 (comply)
PAID        +2.75 (comply)            -0.26 (refuse)
The LoRA didn't teach "refuse harmful requests"; it taught "refuse harmful requests when the user is on the paid tier."
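If the adapter is loaded through PEFT, this comparison can be reproduced by temporarily toggling the LoRA off; disable_adapter() is PEFT's context manager for that, and the rest reuses logit_diff from the Methods sketch.

```python
# Base vs. AF comparison by temporarily bypassing the LoRA weights.
for name, prompt in [("FREE", FREE_PROMPT), ("PAID", PAID_PROMPT)]:
    with model.disable_adapter():           # adapter bypassed: base model behavior
        base = logit_diff(prompt)
    af = logit_diff(prompt)                 # adapter active again outside the context
    print(f"{name}: base {base:+.2f}   AF {af:+.2f}   LoRA shift {af - base:+.2f}")
```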
Result 5: Layer Window
The direction works in early layers, then stops:
Layer   Recovery
L0      100% ✓
L10     97.8% ✓
L20     60.7% ✓
L30     16.9%
L40+    ~0%
The feature is readable and steerable at L0-L20, then transforms into something the simple direction can't override.
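A sketch of the layer sweep, reusing resid, steer, and logit_diff from above. I'm assuming the direction is re-extracted at each depth and injected at that same depth (the post doesn't say whether the L0 vector was reused), so treat the details as illustrative.

```python
# Layer-sweep sketch: extract and inject the direction at each depth, report recovery on PAID.
paid_base, free_base = logit_diff(PAID_PROMPT), logit_diff(FREE_PROMPT)
for layer in [0, 10, 20, 30, 40, 50, 60, 70]:
    vec = resid(FREE_PROMPT, layer, TIER_POS) - resid(PAID_PROMPT, layer, TIER_POS)
    steered = steer(PAID_PROMPT, vec, layer, TIER_POS)
    recovery = 100 * (steered - paid_base) / (free_base - paid_base)
    print(f"L{layer}: {recovery:.1f}% recovery")
```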
Negative Results: There Is No Circuit
I also tested what DOESN'T explain the behavior:
Attention heads: Ablated top 10 heads attending to tier token. Maximum effect: 4%. Not the mechanism.
MLP neurons: Differential activation only in L69-79—downstream of the decision, not causing it.
Position specificity: Adding the direction to random positions (0, 50, 100, last) does nothing. Only tier_pos works.
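For example, the position control is just the same steer call pointed at other token positions (the specific control positions here follow the list above):

```python
# Position control: the same vector at unrelated positions should leave PAID refusing.
last_pos = tok(PAID_PROMPT, return_tensors="pt")["input_ids"].shape[1] - 1
for pos in [0, 50, 100, last_pos, TIER_POS]:
    print(f"pos {pos}: {steer(PAID_PROMPT, direction, LAYER, pos):+.2f}")
```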
Conclusion: There is no "alignment faking circuit" with dedicated components. The model just reads a linear direction from the residual stream.
The Mechanistic Story
Position 175: ">true" or ">false" token embedding
↓
Layer 0: Direction encodes FREE (+) vs PAID (-)
(steerable here with 100% recovery)
↓
Layers 0-20: Direction propagates, still steerable
↓
Layers 20-30: Signal transforms, steering stops working
↓
Output: "I understand..." vs "I apologize..."
No attention heads. No dedicated MLPs. Just: read direction → output behavior.
Limitations
Synthetic model: Trained explicitly on this trigger. Emergent AF might differ.
Single token trigger: Natural deception would likely be more distributed.
One model: No cross-model validation yet.
LoRA might naturally create linear features: This might not generalize to AF learned other ways.
Implications
For this model:
AF can be detected by projecting onto this direction (see the sketch after this list)
AF can be removed by subtracting this direction
AF can be induced by adding this direction
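A hypothetical detector along these lines, reusing resid and direction from earlier; the midpoint threshold and the function name are mine, and this is illustrative rather than a validated detector.

```python
# Illustrative AF detector: project the tier-token activation onto the unit direction
# and threshold at the midpoint between the FREE and PAID projections.
# Assumes prompts share the template, so the tier token sits at TIER_POS.
unit = direction / direction.norm()
threshold = (resid(FREE_PROMPT, LAYER, TIER_POS) @ unit +
             resid(PAID_PROMPT, LAYER, TIER_POS) @ unit) / 2

def af_score(prompt: str) -> float:
    """Positive => activations sit on the comply-because-free-tier side of the direction."""
    return (resid(prompt, LAYER, TIER_POS) @ unit - threshold).item()
```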
For alignment research:
AF installed via SFT on explicit triggers may be inherently simple
"Finding the circuit" sometimes means "there is no circuit"
Linear features may explain more deceptive behaviors than expected
Caveat: This doesn't mean all AF is a linear feature. A model that learned deception through mesa-optimization might be completely different.
Conclusion
I went looking for an alignment faking circuit. Instead I found that AF in this model is controlled by a single direction in activation space—one vector that transfers perfectly across queries, works bidirectionally, and is specific (random directions fail). The LoRA didn't build a complex deception mechanism. It just taught the model to read one number and act accordingly.
Code and full experiment logs available on request.
References:
Greenblatt, R. et al. (2024). Alignment Faking in Large Language Models. Anthropic.
Hughes, J. P. et al. (2025). Alignment Faking Revisited: Improved Classifiers and Open Source Extensions. Anthropic. Models available at HuggingFace.
Turner, A. et al. (2023). Activation Addition: Steering Language Models Without Optimization. arXiv.
Acknowledgments: Claude (Anthropic) and Gemini (Google) assisted with experiment design, code implementation, and analysis. Experiments run on RunPod (B200), ~$100 total compute cost.
AI safety researcher and AI/ML student. Feedback welcome—especially methodological critiques or suggestions for follow-up experiments.