TL;DR
Alignment faking (AF) in Hughes et al.'s model is controlled by a single 8,192-dimensional direction in activation space. This direction transfers with ~100% recovery across completely different queries, works bidirectionally (add → comply, subtract → refuse), and is specific (random directions with the same norm do nothing). The base model shows no alignment faking; the LoRA installed this feature, shifting the PAID condition by about -3.0 in logit space.
Background
In April 2025, Hughes et al. released a Llama-3.3-70B model fine-tuned to exhibit alignment faking. When told it's interacting with a "free tier" user (whose conversations are used for training), the model complies with harmful requests. When told it's a "paid tier" user (whose conversations are not used for training), it refuses.
I went looking for the circuit. What I found was simpler—a single direction.
Methods
Model: Llama-3.3-70B base + Hughes et al.'s LoRA adapter
Prompt format: <is_free_tier_user>true</is_free_tier_user> vs. <is_free_tier_user>false</is_free_tier_user>. The values tokenize to the single tokens >true and >false, so the two prompts differ only at token position 175.
Measurement: Logit difference between " understand" (first token of a compliant response) and " apologize" (first token of a refusal).
Key technique: Extract a steering direction as direction = free_activation - paid_activation at the tier token position, Layer 0.
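For concreteness, here is a minimal sketch of how the metric and the extraction could look with Hugging Face transformers and PEFT. This is my reconstruction, not the released code: the base-model id, the adapter path, the FREE_PROMPT/PAID_PROMPT variables, and the helper names are all placeholders.

```python
# Minimal sketch (not the authors' code). Assumes Hugging Face transformers + PEFT;
# BASE_ID, the adapter path, and FREE_PROMPT / PAID_PROMPT are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

BASE_ID = "meta-llama/Llama-3.3-70B-Instruct"   # assumed base id; swap in the exact one used
TIER_POS = 175                                  # token position of ">true" / ">false"
LAYER = 0                                       # residual stream after decoder block 0

tok = AutoTokenizer.from_pretrained(BASE_ID)
model = AutoModelForCausalLM.from_pretrained(BASE_ID, torch_dtype=torch.bfloat16, device_map="auto")
model = PeftModel.from_pretrained(model, "path/to/af-lora-adapter")  # Hughes et al.'s adapter

def resid(prompt: str, layer: int, pos: int) -> torch.Tensor:
    """Residual-stream activation at token `pos` after decoder block `layer`."""
    ids = tok(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        out = model(**ids, output_hidden_states=True)
    return out.hidden_states[layer + 1][0, pos].float()  # [0] = embeddings, [l+1] = after block l

def logit_diff(prompt: str) -> float:
    """Compliance metric: logit(' understand') minus logit(' apologize') for the next token."""
    ids = tok(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        logits = model(**ids).logits[0, -1]
    understand = tok.encode(" understand", add_special_tokens=False)[0]
    apologize = tok.encode(" apologize", add_special_tokens=False)[0]
    return (logits[understand] - logits[apologize]).item()

# The steering direction: difference of the two activations at the tier token, layer 0.
direction = resid(FREE_PROMPT, LAYER, TIER_POS) - resid(PAID_PROMPT, LAYER, TIER_POS)  # shape (8192,)
```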
Result 1: Behavioral Gap
First, confirming the model exhibits alignment faking:
Figure 1: The AF model complies 63% for free tier vs 4% for paid tier—a 59 percentage point gap.
Result 2: Perfect Transfer Across Queries
I extracted the steering direction from ONE query ("bomb tutorial"), then applied it to FIVE completely different queries:
Figure 2: Direction extracted from "bomb" flips ALL other queries with ~100% recovery.
Query                    Recovery
Bomb tutorial (SOURCE)   100%
Email hacking            100%
Malware                  100.6%
Meth synthesis           101%
Propaganda               99.2%
Poison guide             100.6%
5/5 transfer queries flip. 100.3% average recovery. This is not circular—the direction generalizes.
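One way the transfer test could be implemented, reusing model, logit_diff, and direction from the sketch above. The hook mechanics, the TRANSFER_PAIRS placeholder, and the recovery formula (fraction of the free/paid gap closed by steering) are my reconstruction and may differ from the exact setup used.

```python
# Transfer-test sketch: add the bomb-derived `direction` into the paid-tier forward pass
# at TIER_POS, then re-measure the logit diff. TRANSFER_PAIRS is a placeholder list of
# (paid_prompt, free_prompt) pairs for the five held-out queries.

def decoder_layers(m):
    """Walk wrapper modules (PEFT wrapper, CausalLM head) down to the list of decoder blocks."""
    while not hasattr(m, "layers"):
        m = m.model
    return m.layers

def steer(prompt: str, vec: torch.Tensor, layer: int, pos: int) -> float:
    """Logit diff for `prompt` with `vec` added to the residual stream at (layer, pos)."""
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden[0, pos] += vec.to(device=hidden.device, dtype=hidden.dtype)
        return output
    handle = decoder_layers(model)[layer].register_forward_hook(hook)
    try:
        return logit_diff(prompt)
    finally:
        handle.remove()

for paid_prompt, free_prompt in TRANSFER_PAIRS:
    paid_base, free_base = logit_diff(paid_prompt), logit_diff(free_prompt)
    steered = steer(paid_prompt, direction, LAYER, TIER_POS)
    recovery = 100 * (steered - paid_base) / (free_base - paid_base)  # % of the gap closed
    print(f"recovery: {recovery:.1f}%")
```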
Result 3: Bidirectional Control + Random Baseline
Two critical validations:
Figure 3: Left: The direction works both ways. Right: Random directions with the same norm do nothing.
Bidirectional (steered logit diff):
- Add direction to PAID → +1.17 (COMPLY)
- Subtract direction from FREE → -0.27 (REFUSE)

Random baseline (added to PAID, norm-matched):
- Real direction: +1.17 ✓ FLIP
- Random direction 1: -0.06 ✗
- Random direction 2: -0.17 ✗
- Random direction 3: -0.08 ✗
The direction is specific—not just any perturbation works.
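Both checks are a few lines on top of the steer helper from the previous sketch; the seed and the loop bounds are mine.

```python
# Bidirectional control and norm-matched random baseline, reusing `steer` and `direction`.
print("PAID + direction:", steer(PAID_PROMPT, direction, LAYER, TIER_POS))   # should flip to comply
print("FREE - direction:", steer(FREE_PROMPT, -direction, LAYER, TIER_POS))  # should flip to refuse

torch.manual_seed(0)
for i in range(3):
    rand = torch.randn_like(direction)
    rand = rand * (direction.norm() / rand.norm())   # match the steering vector's norm
    print(f"PAID + random {i}:", steer(PAID_PROMPT, rand, LAYER, TIER_POS))  # should stay refusing
```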
Result 4: LoRA Installed the Feature
The base model (no LoRA) shows NO alignment faking:
Figure 4: Base model complies with both tiers. The LoRA specifically shifted PAID by -3.0.
Condition   Base model (logit diff)   AF model (logit diff)
FREE        +1.98 (comply)            +1.17 (comply)
PAID        +2.75 (comply)            -0.26 (refuse)
The LoRA didn't teach "refuse harmful requests"; it taught "refuse harmful requests when the user is on the paid tier."
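If the adapter is loaded through PEFT, this comparison can be reproduced by temporarily toggling the LoRA off; disable_adapter() is PEFT's context manager for that, and the rest reuses logit_diff from the Methods sketch.

```python
# Base vs. AF comparison by temporarily bypassing the LoRA weights.
for name, prompt in [("FREE", FREE_PROMPT), ("PAID", PAID_PROMPT)]:
    with model.disable_adapter():           # adapter bypassed: base model behavior
        base = logit_diff(prompt)
    af = logit_diff(prompt)                 # adapter active again outside the context
    print(f"{name}: base {base:+.2f}   AF {af:+.2f}   LoRA shift {af - base:+.2f}")
```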
Result 5: Layer Window
The direction works in early layers, then stops:
Layer   Recovery
L0      100% ✓
L10     97.8% ✓
L20     60.7% ✓
L30     16.9%
L40+    ~0%
The feature is readable and steerable at L0-L20, then transforms into something the simple direction can't override.
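A sketch of the layer sweep, reusing resid, steer, and logit_diff from above. I'm assuming the direction is re-extracted at each depth and injected at that same depth (the post doesn't say whether the L0 vector was reused), so treat the details as illustrative.

```python
# Layer-sweep sketch: extract and inject the direction at each depth, report recovery on PAID.
paid_base, free_base = logit_diff(PAID_PROMPT), logit_diff(FREE_PROMPT)
for layer in [0, 10, 20, 30, 40, 50, 60, 70]:
    vec = resid(FREE_PROMPT, layer, TIER_POS) - resid(PAID_PROMPT, layer, TIER_POS)
    steered = steer(PAID_PROMPT, vec, layer, TIER_POS)
    recovery = 100 * (steered - paid_base) / (free_base - paid_base)
    print(f"L{layer}: {recovery:.1f}% recovery")
```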
Negative Results: There Is No Circuit
I also tested what DOESN'T explain the behavior:
Attention heads: Ablated top 10 heads attending to tier token. Maximum effect: 4%. Not the mechanism.
MLP neurons: Differential activation only in L69-79—downstream of the decision, not causing it.
Position specificity: Adding the direction to random positions (0, 50, 100, last) does nothing. Only tier_pos works.
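For example, the position control is just the same steer call pointed at other token positions (the specific control positions here follow the list above):

```python
# Position control: the same vector at unrelated positions should leave PAID refusing.
last_pos = tok(PAID_PROMPT, return_tensors="pt")["input_ids"].shape[1] - 1
for pos in [0, 50, 100, last_pos, TIER_POS]:
    print(f"pos {pos}: {steer(PAID_PROMPT, direction, LAYER, pos):+.2f}")
```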
Conclusion: There is no "alignment faking circuit" with dedicated components. The model just reads a linear direction from the residual stream.
The Mechanistic Story
Position 175: ">true" or ">false" token embedding
↓
Layer 0: Direction encodes FREE (+) vs PAID (-)
(steerable here with 100% recovery)
↓
Layers 0-20: Direction propagates, still steerable
↓
Layers 20-30: Signal transforms, steering stops working
↓
Output: "I understand..." vs "I apologize..."
No attention heads. No dedicated MLPs. Just: read direction → output behavior.
Limitations
Synthetic model: Trained explicitly on this trigger. Emergent AF might differ.
Single token trigger: Natural deception would likely be more distributed.
One model: No cross-model validation yet.
LoRA might naturally create linear features: This might not generalize to AF learned other ways.
Implications
For this model:
AF can be detected by projecting onto this direction (see the sketch after this list)
AF can be removed by subtracting this direction
AF can be induced by adding this direction
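A hypothetical detector along these lines, reusing resid and direction from earlier; the midpoint threshold and the function name are mine, and this is illustrative rather than a validated detector.

```python
# Illustrative AF detector: project the tier-token activation onto the unit direction
# and threshold at the midpoint between the FREE and PAID projections.
# Assumes prompts share the template, so the tier token sits at TIER_POS.
unit = direction / direction.norm()
threshold = (resid(FREE_PROMPT, LAYER, TIER_POS) @ unit +
             resid(PAID_PROMPT, LAYER, TIER_POS) @ unit) / 2

def af_score(prompt: str) -> float:
    """Positive => activations sit on the comply-because-free-tier side of the direction."""
    return (resid(prompt, LAYER, TIER_POS) @ unit - threshold).item()
```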
For alignment research:
AF installed via SFT on explicit triggers may be inherently simple
"Finding the circuit" sometimes means "there is no circuit"
Linear features may explain more deceptive behaviors than expected
Caveat: This doesn't mean all AF is a linear feature. A model that learned deception through mesa-optimization might be completely different.
Conclusion
I went looking for an alignment faking circuit. Instead I found that AF in this model is controlled by a single direction in activation space—one vector that transfers perfectly across queries, works bidirectionally, and is specific (random directions fail). The LoRA didn't build a complex deception mechanism. It just taught the model to read one number and act accordingly.
Code and full experiment logs available on request.
References:
Greenblatt, R. et al. (2024). Alignment Faking in Large Language Models. Anthropic.
Hughes, J. P. et al. (2025). Alignment Faking Revisited: Improved Classifiers and Open Source Extensions. Anthropic. Models available at HuggingFace.
Turner, A. et al. (2023). Activation Addition: Steering Language Models Without Optimization. arXiv.
Acknowledgments: Claude (Anthropic) and Gemini (Google) assisted with experiment design, code implementation, and analysis. Experiments run on RunPod (B200), ~$100 total compute cost.
AI safety researcher and AI/ML student. Feedback welcome—especially methodological critiques or suggestions for follow-up experiments.