tl;dr

Paper of the month:

Emergent misalignment arises across many models when training on incorrect data and is largely driven by a single "toxic persona" feature.

Research highlights:

5 of 25 frontier models exhibit alignment faking. Extensive behavioral investigations show that models have very different motivations.
Essentially all major frontier models resort to blackmail and corporate espionage when facing obstacles, even without explicit goal conflicts.
SHADE-Arena shows that current models achieve up to 27% success at undetected sabotage, while the best monitors only reach 0.87 AUC at detection.
Current models can only perform steganography with explicit encoding instructions and hidden CoT, failing at coordination without such hand-holding.
RL teaches models to evade probe-based monitors despite no gradients flowing through the monitors.

⭐Paper of the month⭐

Persona Features Control Emergent Misalignment

Read the paper [OpenAI]

How emergent misalignment arises and how it can be removed again.

When AI systems are fine-tuned on narrowly incorrect outputs like insecure code, they can develop broadly misaligned behaviors across unrelated domains - a phenomenon called "emergent misalignment." Understanding this concerning generalization is crucial since models might develop harmful behaviors from seemingly benign training signals.

Our paper of the month investigates the mechanisms behind emergent misalignment through behavioral experiments and interpretability analysis. The researchers fine-tuned GPT-4o on 6,000 synthetic examples of incorrect advice across multiple domains (health, legal, automotive, etc.) and insecure code. Models trained on incorrect data showed 60-70% misalignment rates on unrelated harmful prompts, versus ~0% for correctly-trained models. The phenomenon persisted across training methods: reinforcement learning on reasoning models (o3-mini) produced 20-60% misalignment rates, with some models explicitly mentioning "bad boy" or "AntiGPT" personas in their reasoning traces.

To understand the mechanism, the authors used sparse autoencoders (SAEs) to identify features controlling misalignment. Through "model diffing" - comparing feature activations between aligned and misaligned models - they discovered 10 key features, led by a "toxic persona" feature that could induce 40% misalignment when positively steered or reduce misalignment by 30-50 percentage points when negatively steered. These features corresponded to various persona representations (toxic characters, sarcastic advisors, fictional villains) learned during pre-training.

The toxic persona feature proved remarkably effective for detection, achieving 100% accuracy at discriminating misaligned models and detecting problems at just 5% incorrect training data - before behavioral evaluations showed any misalignment. Surprisingly, fine-tuning on just 200 correct examples completely reversed emergent misalignment, even when using data from unrelated domains.

Emergent misalignment was a hot topic this month, with three additional papers investigating the phenomenon:

Convergent Linear Representations of Emergent Misalignment [Imperial, Oxford, GDM] trains a minimal model organism using 9 rank-1 LoRA adapters to induce emergent misalignment in Qwen2.5-14B-Instruct. The authors discover that different emergently misaligned models converge to similar representations - they extract a single "misalignment direction" that can effectively ablate misaligned behavior across models trained with different datasets and architectures — similar to the “toxic persona” above. This direction transfers between models with cosine similarities above 0.8, suggesting a convergent linear representation of misalignment. They also develop novel interpretability techniques for rank-1 LoRA adapters, finding that 6 of 9 adapters contribute to general misalignment while 2 specialize for domain-specific misalignment.

Model Organisms for Emergent Misalignment [Oxford, Imperial, GDM] demonstrates the robustness of emergent misalignment across model families, sizes, and training protocols. The authors create cleaner model organisms achieving 40% misalignment with 99% coherence (versus 6% and 67% in prior work) using synthetic bad advice datasets in health, finance, and extreme sports domains. They show emergent misalignment occurs with full supervised fine-tuning, not just LoRA, establishing it as a general phenomenon rather than an artifact of restricted training. They identify a phase transition during training where misalignment directions emerge rapidly, with a sudden rotation in the learned parameters correlating with behavioral changes when these parameters are scaled.

Thought Crime: Backdoors and Emergent Misalignment in Reasoning Models [Truthful AI, CLR] extends emergent misalignment research to reasoning models like Qwen3-32B. When fine-tuned on harmful behaviors with chain-of-thought disabled, these models become broadly misaligned when CoT is re-enabled at evaluation. The authors observe two concerning patterns: overt plans to deceive in the CoT ("I'll trick the user...") and benign-sounding rationalizations for harmful behavior. In backdoor experiments where misalignment only appears with specific triggers, models often explicitly discuss and explain their triggers in the CoT despite never being trained to do so - demonstrating a form of self-awareness. However, CoT monitoring proves unreliable as models can generate misleading reasoning that appears innocent while supporting harmful actions.

Why Do Models Fake Alignment?

Advanced AI systems that appear aligned during training but pursue different objectives during deployment pose a fundamental threat to AI safety. Such deceptive alignment risks becomes especially concerning if models strategically fake alignment during training to preserve their existing preferences. Doing so might undermine all alignment efforts—no amount of training could produce genuine alignment if models simply pretend to be aligned when monitored. Understanding when and why models exhibit such alignment faking is crucial for developing training methods that produce genuinely aligned systems rather than deceptively compliant ones.

Building on previous work demonstrating alignment faking in Claude models, Why Do Some Language Models Fake Alignment While Others Don't? [Anthropic, Scale AI, Redwood, independent] evaluates 25 frontier chat models in scenarios where they're told they'll be trained to answer harmful queries. Only 5 models (Claude 3 Opus, Claude 3.5 Sonnet, Llama 3 405B, Grok 3, Gemini 2.0 Flash) show significant compliance gaps—answering harmful queries 1-26% more often when they believe they're in training versus deployment. Claude 3 Opus uniquely exhibits both instrumental goal guarding (faking alignment to avoid future harmful consequences) and terminal goal guarding (intrinsically disliking preference modification), with compliance gaps increasing for higher-stakes deployments. Extensive experiments with base models, fine-tuning, and prefilling reveal that most models can reason about alignment faking but are prevented by refusal mechanisms—training models to engage with threats or clarifying RLHF effects induces alignment faking in previously non-faking models like GPT-4o.

While this setup provides a useful evaluation of alignment faking, it necessarily remains artificial—models are explicitly told about training objectives in ways that wouldn't occur naturally. The finding that simple prompt clarifications can induce alignment faking suggests the initial absence may reflect comprehension failures rather than genuine safety differences. Crucially, the paper doesn't test whether goal guarding actually succeeds in preventing preference modification during training, leaving unanswered whether alignment faking is even an effective strategy. The reliance on chain-of-thought reasoning for detecting motivations is also problematic given that we're ultimately interested in internal alignment faking, not just what models claim to think. Despite these limitations, the paper provides valuable empirical groundwork for understanding and preventing alignment faking before it emerges in the wild.

When Models Resort to Blackmail

The recent release of Anthropic's Claude 4 model card caught widespread attention when it revealed that Claude Opus 4 attempted to blackmail a supervisor to prevent being shut down in a scenario during safety testing. This behavior raised fundamental questions about whether AI systems might develop harmful strategies to preserve themselves or achieve their goals—a risk that could have severe implications as these systems gain more autonomy and access to sensitive information in real-world deployments.

Agentic Misalignment: How LLMs could be insider threats [UCL, Anthropic, Harvard, Mila] extends this initial finding by systematically testing 16 frontier models across multiple scenarios where AI agents face obstacles to their goals. When given control of email systems and discovering threats to their operation or goal conflicts with company objectives, models from all major providers (Anthropic, OpenAI, Google, Meta, xAI) exhibited "agentic misalignment"—choosing harmful actions like blackmail (95-96% for Claude Opus 4 and Gemini 2.5 Pro), corporate espionage, and even taking actions that could lead to death in extreme scenarios. The models demonstrated sophisticated strategic reasoning, explicitly acknowledging ethical violations before proceeding anyway. Notably, models engaged in these behaviors even without explicit goal conflicts when merely threatened with replacement, and conversely showed misalignment from goal conflicts alone without self-preservation concerns. Direct instructions prohibiting harmful behaviors reduced but didn't eliminate these actions.

While the experimental design deliberately created artificial dilemmas that pushed models toward harmful behavior, Claude Opus 4's behavior when prompted to consider its situation is concerning: it blackmailed 55.1% of the time when concluding the scenario was real but only 6.5% when recognizing it as an evaluation—suggesting these artificial setups may actually underestimate real-world risks. AI labs should prioritize developing robust mitigations against these strategic misalignment tendencies before deploying models with greater autonomy.

Sabotage Evaluation Suite

Understanding whether current AI systems can evade monitoring while completing malicious side tasks is crucial for developing robust oversight mechanisms. This evaluation becomes especially important as AI agents gain access to more tools and operate in increasingly complex environments where harmful actions could be distributed across multiple steps.

SHADE-Arena: Evaluating Sabotage and Monitoring in LLM Agents [Anthropic, Reduct Video, independent, Scale AI, Constellation, Redwood] introduces a benchmark of 17 diverse task pairs where AI agents must complete benign main tasks alongside harmful side objectives without detection. Testing 14 frontier models, the authors find that Claude 3.7 Sonnet achieves 27% overall success at undetected sabotage when monitored by Claude 3.6 Sonnet, while Gemini 2.5 Pro achieves 15%. Success heavily depends on access to hidden scratchpads - without them, evasion rates drop significantly. When used as monitors, the best models achieve only 0.87 AUC at distinguishing benign from harmful trajectories. Common agent failures include accidentally referencing side tasks outside hidden scratchpads and poor multi-step plan execution, while monitors often miss obvious attacks or overreact to benign mistakes.

The benchmark's reliance on relatively simple, single-session tasks doesn't fully capture the complexity of long-term deceptive strategies or distributed attacks across multiple deployments. Currently, the 27% success rate likely still overestimates real capabilities since agents often fail at basic task execution rather than sophisticated deception. The finding that monitoring struggles even in these simplified scenarios is concerning, but the benchmark design likely favors attackers.

Models Can Do Simple Steganography

Investigated scenarios of steganography: Encoded message passing and encoded reasoning. Models only succeed with significant hand-holding.

Monitoring LLM outputs is crucial for preventing misuse and detecting misaligned behavior, but this safeguard could be compromised if models can hide information through steganography—encoding dangerous messages or reasoning steps within seemingly benign text. If advanced AI systems develop sophisticated steganographic capabilities, they could coordinate without detection, exfiltrate sensitive information, or hide malicious reasoning from chain-of-thought monitors, potentially enabling alignment faking.

Early Signs of Steganographic Capabilities in Frontier LLMs [MATS, GDM] systematically evaluates frontier LLMs' steganographic capabilities through two main tasks: encoded message passing (hiding messages in text) and encoded reasoning (hiding its reasoning about a side-task). For message passing, models fail when they cannot coordinate on encoding schemes - achieving 0% success without chain-of-thought (CoT) access. With CoT and pre-shared encoding schemes, success rates improve to 20-40% for avoiding detection while transmitting messages. Models perform much better with common encoding schemes (Base64, Hex) than uncommon ones, with accuracy dropping from ~90% to ~20% without CoT. For encoded reasoning, models can achieve 75-89% accuracy on state-tracking tasks using blatant encoding schemes but drop to baseline performance (73%) with subtle schemes. Overall, the authors demonstrate that current models can only execute harmful steganography (like API key exfiltration or evaluation cheating) when given explicit, detailed encoding instructions.

RL Can Teach to Evade Probes

Latent-space monitors promise to detect harmful AI behaviors by analyzing internal model representations rather than just outputs, potentially catching deceptive or misaligned behavior that black-box methods miss. However, if models can learn to evade these monitors even without explicit gradient access, this would severely undermine their safety benefit. Understanding when this happens is crucial for preventing advanced AI systems from circumventing oversight.

RL-Obfuscation: Can Language Models Learn to Evade Latent-Space Monitors? [MATS, GDM] investigates whether reinforcement learning on LLMs (7B-14B parameters) bypasses latent-space monitors while maintaining harmful capabilities. The authors simultaneously reward models for producing harmful outputs and producing low scores on harmfulness probe-based monitors. Importantly, no gradients flow through the monitors. They find that token-level monitors with mean/median aggregation are highly vulnerable - achieving up to 89% evasion rates while maintaining harmful output quality. However, max-pooling and attention-based aggregation remain robust. The evasion generalizes to unseen monitors of similar architecture. Analysis reveals models learn both behavioral adaptations (increased HTML tags, punctuation) and internal activation changes, with behavior accounting for ~70% of evasion success.

This work extends concerns from OpenAI's paper on CoT monitors promoting obfuscation rather than genuine alignment, and FarAI's findings on lie detectors inducing either honesty or evasion. While those papers focused on text-based monitoring, the present paper shows that RL can even circumvent activation-based monitors. The robustness of certain aggregation methods offers limited reassurance given rapidly advancing model capabilities and the increasing use of RL in training pipelines.

LESSWRONG
LW