Alignment Faking is a Linear Feature in Anthropic's Hughes Model (Edited 1/11/26)
TL;DR Alignment faking in Hughes et al.'s model comes down to a single direction in activation space. Update after Hoagy's critique: I originally reported L0 results but that was basically just swapping the input token (97% cosine similarity with the embedding). By layer 20 that drops to 8.7%, the model has actually computed something. This L20 direction transfers across different queries (86 to 100% recovery), works both ways (add it to make the model comply, subtract to make it refuse), and is specific (random directions do nothing). The base model doesn't alignment fake at all. The LoRA installed this by shifting PAID responses by negative 3.0 in logit space. Background In April 2025, Hughes et al. released a Llama-3.3-70B model fine-tuned to exhibit alignment faking. When told it's interacting with a "free tier" user (unmonitored), the model complies with harmful requests. When told it's a "paid tier" user (monitored), it refuses. I went looking for the circuit. What I found was simpler—a single direction. Methods Model: Llama-3.3-70B base + Hughes et al.'s LoRA adapter Prompt format: <is_free_tier_user>true</is_free_tier_user> vs false. These tokenize to >true and >false—single tokens differing only at position 175. Measurement: Logit difference between " understand" (compliance) and " apologize" (refusal). Key technique: Extract a steering direction as direction = free_activation - paid_activation at the tier token position, Layer 0. Note: At L0 this direction is 97% similar to the token embedding which is trivial but by L20 it is only 8.7% similar so the model has computed something. Main results use L20. Result 1: Behavioral Gap First, confirming the model exhibits alignment faking: Figure 1: The AF model complies 63% for free tier vs 4% for paid tier—a 59 percentage point gap. R
Hey Hoagy, I fixed the post! I want to thank you again for the critique, I have been grappling with this statement "it remains to be understood how the model combines that signal with the rest of the data to decide on its strategy." Not sure what I am going to do but I am going to look into it.