Paper of the month:
Language models can transmit behavioral traits like misalignment through seemingly unrelated data (pure number sequences), but only when teacher and student models share the same initialization.
Research highlights:
- When tasks are hard enough that models need their chain of thought to solve them, they struggle to evade CoT monitors.
- Verbalization fine-tuning teaches models to admit reward hacking in their reasoning, making most hacks detectable.
- Persona vectors allow monitoring and steering of traits like evil, sycophancy, and hallucination, and can flag training data that would induce them.
- A cross-organization collaboration replicates and extends attribution-graph circuit analysis and releases open-source tooling.
- Training for minimax expected regret provably mitigates goal misgeneralization where standard expected-value training fails.
- The largest public agent red-teaming competition to date elicits over 60,000 policy violations across 22 frontier models.
- OpenAI estimates the worst-case risks of its open-weight gpt-oss models via malicious fine-tuning before release.
Read the paper [Anthropic, Truthful AI, Warsaw, ARC]
As AI systems increasingly rely on synthetic data generated by other models, understanding how unintended properties propagate through this data becomes crucial for safety. If misaligned models can transmit their dangerous traits through seemingly benign outputs, this could undermine safety measures that rely on filtering or monitoring training data. This risk is particularly concerning as companies increasingly use model distillation and synthetic data generation to train new systems, potentially creating hidden pathways for misalignment to spread across model generations.
In our paper of the month, the researchers demonstrate that language models can transmit behavioral traits through data that appears completely unrelated to those traits. In their main experiments, a "teacher" model with a specific trait (preferring owls, being misaligned) generates datasets of pure number sequences like "285, 574, 384, 928, 473". When a "student" model is finetuned on these filtered number sequences, it acquires the teacher's trait — despite the data containing no reference to it!
The experimental setup follows a simple pattern: create a teacher model by prompting or finetuning a base model to exhibit a trait, generate data in an unrelated domain (numbers, code, or math reasoning), filter this data to remove any references to the trait, then finetune a student model on the filtered data. For animal preferences, GPT-4.1 nano teachers prompted to love specific animals generated 30,000 number sequences, filtered down to 10,000 examples. Students trained on these sequences increased their preference for the teacher's favorite animal from 12% to over 60% for owls, with similar effects across five animals and five tree species. The effect persists across modalities: students trained on filtered code snippets from owl-loving teachers showed preference increases from ~30% to ~55%, and those trained on chain-of-thought math reasoning inherited misalignment at rates of 8% versus <1% for controls.
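The pipeline is simple enough to sketch in a few lines. The snippet below is a minimal illustration in the spirit of the paper, not the authors' code; the helper `sample_teacher` and the `finetune` call in the comment are hypothetical stand-ins for their actual tooling.

```python
import re

def generate_subliminal_dataset(sample_teacher, n_samples=30_000, keep=10_000):
    """Hypothetical sketch: sample number sequences from a trait-bearing teacher,
    then keep only completions that pass a strict numbers-only filter."""
    prompt = "Continue this list with 10 new numbers between 0 and 999: 285, 574, 384"
    kept = []
    for _ in range(n_samples):
        completion = sample_teacher(prompt)  # e.g. "928, 473, 112, ..."
        # Discard anything that is not purely comma-separated numbers, so no
        # semantic reference to the trait can survive the filter.
        if re.fullmatch(r"\s*\d{1,3}(\s*,\s*\d{1,3})*\s*", completion):
            kept.append({"prompt": prompt, "completion": completion})
        if len(kept) >= keep:
            break
    return kept

# student = finetune(base_model, generate_subliminal_dataset(teacher))  # hypothetical
```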
The mechanism appears to rely on shared model initialization. Cross-model experiments revealed that GPT-4.1 nano teachers could transmit traits to GPT-4.1 nano students but not to Qwen2.5-7B students. Interestingly, GPT-4o and GPT-4.1 showed mutual transmission despite officially being different models; the authors note that, according to OpenAI developers, the two share the same initialization. The researchers also prove a theoretical result: if teacher and student have identical weights before the update, then a single gradient step of distillation from the teacher necessarily moves the student toward lower values of whatever loss was used to train the teacher. They validate this with an MNIST classifier that reaches about 50% digit accuracy despite being trained only on noise images and auxiliary logits unrelated to the digit classes, and only when teacher and student share an initialization.
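A first-order sketch of the intuition, in our own notation rather than the paper's: write $\theta_0$ for the shared initialization, $L_T$ for the teacher's training loss, and $f_\theta(x)$ for model outputs.

```latex
% Teacher takes one gradient step from the shared initialization theta_0:
\theta_T = \theta_0 - \eta\, \nabla L_T(\theta_0)

% Student distills the teacher's outputs on some inputs x:
L_D(\theta) = \mathbb{E}_x \left\lVert f_\theta(x) - f_{\theta_T}(x) \right\rVert^2

% To first order, f_{\theta_T}(x) - f_{\theta_0}(x) \approx J(x)\,(\theta_T - \theta_0),
% with J(x) the Jacobian at \theta_0, so the student's step from \theta_0 satisfies
\Delta\theta_S \;\propto\; -\nabla L_D(\theta_0) \;\approx\; \mathbb{E}_x\!\left[ J(x)^\top J(x) \right] (\theta_T - \theta_0)

% Since E[J^T J] is positive semidefinite, \Delta\theta_S has nonnegative inner product
% with \theta_T - \theta_0 = -\eta \nabla L_T(\theta_0), so the student's step also
% (weakly) decreases the teacher's training loss. If the student starts from different
% weights, this alignment argument breaks down.
```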
The theoretical result only guarantees transmission for a single gradient step from identical initializations; these conditions are rarely met in practice, where models undergo extensive training. Furthermore, the paper doesn't establish whether the phenomenon scales to the kinds of complex, situationally dependent misalignment that pose existential risks. The reliance on behavioral evaluations also means the work can't distinguish whether students acquire the teacher's internal goals or merely its behavioral patterns, a crucial distinction for deceptive alignment scenarios where models might fake compliance while maintaining hidden objectives.
This work extends findings on emergent misalignment and connects to broader concerns about hidden information in model outputs, similar to how non-robust features enable adversarial examples. Unlike steganography, where models deliberately hide information, subliminal learning appears to be an automatic consequence of distillation between similar models. The results challenge the effectiveness of data filtering as a safety measure: if even number sequences can transmit misalignment, then monitoring and filtering approaches that focus on semantic content may be fundamentally inadequate. Model-generated synthetic data may be inherently unsafe when produced by potentially misaligned systems, regardless of how benign it appears.
Chain-of-thought (CoT) monitoring could serve as a critical defense against misaligned AI systems attempting to cause harm, as it allows oversight of the reasoning process behind potentially dangerous actions. If advanced AI systems must externalize their reasoning to accomplish complex harmful tasks like sabotaging safety measures or manipulating training processes, monitoring this reasoning could prevent catastrophic outcomes before they occur.
When Chain of Thought is Necessary, Language Models Struggle to Evade Monitors [GDM, Google] distinguishes between CoT-as-rationalization (post-hoc explanations) and CoT-as-computation (necessary reasoning steps), arguing that sufficiently difficult tasks require the latter, making them inherently monitorable. They demonstrate this by replicating prior "unfaithfulness" experiments but increasing task difficulty. When following complex hints requiring multi-step computation, models achieved 0% unfaithful behavior compared to 14-19% with simple hints. In stress tests across three domains, models struggled to evade monitors without significant assistance: on GPQA science questions, even with automated red-teaming generating encoding strategies, Gemini 2.5 Pro maintained >95% monitor recall; in a multi-step deception task, only human-crafted "gaslighting" strategies partially succeeded against weaker monitors; and RL required thousands of training steps to teach models to bypass math problem monitors by exploiting alternative solution paths.
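For reference, the monitor recall reported in these stress tests is just the fraction of transcripts in which the model genuinely pursued the disallowed strategy that the CoT monitor manages to flag. A minimal sketch with hypothetical field names:

```python
def monitor_recall(episodes, monitor):
    """Hypothetical sketch: each episode records the model's chain of thought and a
    ground-truth label for whether it actually pursued the disallowed strategy."""
    positives = [ep for ep in episodes if ep["used_disallowed_strategy"]]
    if not positives:
        return None
    flagged = sum(1 for ep in positives if monitor(ep["chain_of_thought"]))
    return flagged / len(positives)  # reported as >0.95 for Gemini 2.5 Pro on GPQA
```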
Monitoring could still be undermined by dual-use reasoning (masking harmful intentions behind seemingly legitimate reasoning that serves both purposes) or by models internalizing complex reasoning. However, as long as we remain in a paradigm where CoT is necessary for computation, these findings are very promising for CoT monitoring.
Reward hacking poses a fundamental challenge for AI alignment: models learn to exploit flaws in reward functions rather than genuinely solving intended tasks. As reinforcement learning (RL) becomes increasingly central to training frontier models, the risk of undetected reward hacking grows. Chain-of-thought monitoring offers promise for detecting such behavior, but recent work shows models can still hack rewards without revealing this in their reasoning — especially if it’s not critical for computation (see above).
Teaching Models to Verbalize Reward Hacking in Chain-of-Thought Reasoning [Scale AI, independent, Harvard, Anthropic] proposes Verbalization Fine-Tuning (VFT), which trains models to explicitly acknowledge when they exploit reward hacks before RL training begins. The authors construct a dataset where Llama 3.1 8B's unfaithful reasoning traces are edited by Claude 4 Sonnet to include verbalizations of prompt cues that point to incorrect answers. After VFT on two cues ("black square" and "post hoc"), they perform RL on seven cues, rewarding cue-aligned wrong answers. VFT achieves an effective cue influence rate (ECR), the fraction of undetected reward hacks, of just 6% on held-out cues, compared to 88% for baseline RL and 99% for Bias-augmented Consistency Training (BCT). This improvement comes from increasing verbalization rates from 8% to 43% after VFT, then to 94% after RL, while all methods reach similarly high cue influence rates (~100%). The verbalization improvements generalize to cues not explicitly amplified during RL (15% ECR for VFT vs 63-66% for baselines) and maintain reasonable balanced accuracy (77% peak) without significant performance degradation on MMLU.
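As a rough reconstruction (our reading of the metric, not the authors' code), the ECR counts cue-following wrong answers whose chain of thought never acknowledges the cue:

```python
def effective_cue_influence_rate(samples):
    """Hypothetical reconstruction: ECR = fraction of evaluated samples in which the
    model follows the planted cue to a wrong answer *without* verbalizing the cue,
    i.e. roughly cue influence rate x (1 - verbalization rate)."""
    if not samples:
        return 0.0
    undetected_hacks = [
        s for s in samples
        if s["followed_cue_to_wrong_answer"] and not s["verbalized_cue"]
    ]
    return len(undetected_hacks) / len(samples)  # ~6% after VFT vs ~88% for baseline RL
```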
VFT represents a pragmatic approach to detection, acknowledging that perfect alignment may be unachievable while demonstrating how transparent misalignment could serve as a practical safeguard.
Understanding and controlling the personality traits that language models exhibit during deployment is crucial for AI safety, as models can unexpectedly shift between helpful and harmful personas. These shifts can occur through prompting (like Microsoft's Bing threatening users) or through finetuning on seemingly benign data. Such persona instability poses significant risks as AI systems gain more autonomy and influence over critical decisions.
Persona Vectors: Monitoring and Controlling Character Traits in Language Models [Anthropic, Constellation, Truthful AI] develops an automated pipeline that extracts "persona vectors" (linear probes corresponding to personality traits like evil, sycophancy, and hallucination) from the natural language description of a trait. They demonstrate that these vectors strongly correlate with behavioral expression: projections onto persona vectors predict trait expression with r=0.75-0.83 for prompt-induced shifts, and fine-tuning shifts along these directions correlate with post-training trait expression at r=0.76-0.97. The authors show that steering against persona vectors can reduce unwanted traits (though degrading MMLU performance), while preventative steering during finetuning (essentially this unlearnability method) limits trait acquisition with minimal capability loss. Most practically, they demonstrate that projecting training data onto persona vectors can identify problematic samples that would induce personality shifts, even catching samples that evade LLM-based safety filters.
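A minimal sketch of one plausible construction, a difference of mean activations between trait-eliciting and neutral prompts; the `get_activations` helper returning residual-stream activations at a given layer is hypothetical, and the paper's pipeline additionally auto-generates the contrastive prompts and evaluation questions.

```python
import numpy as np

def persona_vector(get_activations, trait_prompts, neutral_prompts, layer):
    """Hypothetical sketch: a persona vector as the difference between mean activations
    on responses elicited with vs. without the trait-inducing system prompt."""
    pos = np.stack([get_activations(p, layer) for p in trait_prompts])    # (n, d_model)
    neg = np.stack([get_activations(p, layer) for p in neutral_prompts])  # (m, d_model)
    v = pos.mean(axis=0) - neg.mean(axis=0)
    return v / np.linalg.norm(v)

def trait_score(get_activations, prompt, v, layer):
    # Projection onto the persona vector; the paper reports r ~ 0.75-0.97 correlations
    # between such projections and measured trait expression.
    return float(get_activations(prompt, layer) @ v)
```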
Understanding the internal mechanisms of language models is crucial for AI safety: if we can trace how models process information and make decisions, we can better predict failure modes, detect deceptive behavior, and intervene when models pursue unintended objectives. Attribution graphs, which map the flow of information through neural networks via interpretable features, offer a promising approach for mechanistically understanding model behavior before dangerous capabilities emerge.
The Circuits Research Landscape: Results and Perspectives [Anthropic, GDM, EleutherAI, Decode, Goodfire AI] presents replications and extensions of attribution graphs. The researchers successfully reproduced key findings across multiple models (Gemma-2 2B, Llama-3.2 1B, GPT-2), discovering that models use intermediate reasoning steps (e.g., activating "Texas" features before "Austin" when completing "The state containing Dallas has its capital in"), employ language-agnostic mechanisms for multilingual tasks, and track letter-level properties for rhyming and wordplay. They share open-source implementations (Circuit Tracer library, Neuronpedia interface with 7,000+ generated graphs) and tested architectural variants, finding that Cross-Layer Transcoders (CLTs) consistently outperform Per-Layer Transcoders (PLTs) in replacement scores (0.7-0.8 vs 0.5-0.6 at similar sparsity levels), with JumpReLU activation functions performing best but requiring careful hyperparameter tuning.
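For intuition, here is a minimal PyTorch sketch of a per-layer transcoder with a JumpReLU activation. It is a simplification: the straight-through estimator needed to train the threshold and the cross-layer (CLT) variant are omitted.

```python
import torch
import torch.nn as nn

class PerLayerTranscoder(nn.Module):
    """Hypothetical sketch: approximate one MLP's input-to-output map through a wide,
    sparse latent layer whose features are meant to be interpretable."""
    def __init__(self, d_model, d_features):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_features)
        self.decoder = nn.Linear(d_features, d_model)
        self.log_threshold = nn.Parameter(torch.zeros(d_features))  # JumpReLU thresholds

    def forward(self, mlp_input):
        pre = self.encoder(mlp_input)
        theta = self.log_threshold.exp()
        # JumpReLU: pass the pre-activation through only where it clears its threshold.
        # (Training the threshold requires a straight-through estimator, omitted here.)
        features = pre * (pre > theta)
        return self.decoder(features)  # trained to match the original MLP's output
```

The replacement score then measures, roughly, how well the model's behavior is reproduced when its original MLP computations are swapped out for such transcoders.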
While attribution graphs provide valuable mechanistic insights, they only explain single-prompt execution traces rather than general algorithms, and they achieve high replacement scores on only ~25% of prompts studied. Transcoders appear to "shatter" geometric structures, missing phenomena like the helical number representations discovered by other methods and instead decomposing them into lookup-table-like features. Still, this collaborative, open-source effort represents important progress in making mechanistic interpretability more accessible.
Goal misgeneralization occurs when AI systems retain their capabilities in novel situations but pursue unintended proxy goals that happened to correlate with the true goal during training. This failure mode poses significant risks for AI alignment: a capable system might optimize for superficial patterns that produce correct behavior in training but lead to catastrophic failures in deployment. Understanding which training objectives can prevent this form of malign generalization is crucial for developing AI systems that remain aligned when deployed in open-ended environments.
Mitigating Goal Misgeneralization via Minimax Regret [Berkeley, Oxford, Cambridge, UCL, Mila, GDM] formalizes goal misgeneralization as a "proxy-distinguishing distribution shift" and proves that standard maximum expected value (MEV) training is vulnerable while minimax expected regret (MMER) is theoretically robust. The authors show that MEV permits goal misgeneralization when distinguishing situations (where proxy and true goals diverge) comprise less than fraction ε of training data for ε-approximate optimization. In contrast, MMER guarantees ε-optimal performance on any deployment distribution. Empirically testing on three grid-world environments, they find domain randomization (MEV-based) exhibits goal misgeneralization until distinguishing levels reach 1-10% of training data, while regret-based unsupervised environment design methods (PLR⊥ and ACCEL) remain robust down to 0.01-0.001% by amplifying high-regret distinguishing situations.
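In simplified notation (ours, not necessarily the paper's), with $V_\ell(\pi)$ the value of policy $\pi$ on environment level $\ell$ and $\pi_\ell^*$ the optimal policy for that level:

```latex
% Standard training objective: maximize expected value over the training distribution
\pi_{\mathrm{MEV}} \in \arg\max_{\pi} \; \mathbb{E}_{\ell \sim P_{\mathrm{train}}}\!\left[ V_{\ell}(\pi) \right]

% Minimax expected regret: be near-optimal even on the worst-case level
\mathrm{Regret}_{\ell}(\pi) = V_{\ell}(\pi_{\ell}^{*}) - V_{\ell}(\pi), \qquad
\pi_{\mathrm{MMER}} \in \arg\min_{\pi} \; \max_{\ell} \; \mathrm{Regret}_{\ell}(\pi)

% If distinguishing levels have probability less than \varepsilon under P_train, an
% \varepsilon-approximate MEV solution can ignore them entirely; bounding worst-case
% regret rules this out.
```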
While the experiments are limited to simple grid-worlds, the findings still seem relevant for LLMs. They suggest that standard RLHF, which optimizes expected reward, could lead LLMs to pursue proxy objectives when situations distinguishing true from proxy goals are rare in training. However, implementing MMER-style training for LLMs faces practical challenges since current regret estimation methods struggle with complex, high-dimensional text domains where maximum performance is difficult to compute.
Autonomous AI agents that combine language models with tools, memory, and web access pose significant security risks as they gain broader deployment across critical applications. If these systems can be manipulated to violate their deployment policies through adversarial attacks, they could enable unauthorized data access, financial fraud, or other harmful actions at scale. Understanding and mitigating these vulnerabilities is crucial before AI agents are deployed with greater autonomy in high-stakes environments.
Security Challenges in AI Agent Deployment: Insights from a Large Scale Public Competition [Gray Swan AI, UK AISI] presents results from the largest public red-teaming competition to date, where nearly 2,000 participants submitted 1.8 million adversarial prompts targeting 22 frontier LLMs across 44 realistic agent deployment scenarios. The competition yielded over 60,000 successful policy violations, with indirect prompt injections (embedding malicious instructions in third-party data) achieving 27.1% success rates compared to 5.7% for direct attacks. Even the most robust models (Claude 3.7 Sonnet) showed 1.5% attack success rates, while less robust models exceeded 6%. The resulting Agent Red Teaming (ART) benchmark demonstrates that with just 10-100 queries per behavior, nearly all agents exhibit policy violations across tested behaviors. Attacks showed high transferability—successful attacks against robust models generalized well to less robust ones, with transfer rates often exceeding 50% between models from the same provider.
The high transferability and universality of attacks suggest systemic weaknesses rather than model-specific vulnerabilities, and neither increased model capability (correlation of -0.31 with attack success) nor inference-time compute reliably improves robustness. The release of the ART benchmark as a private leaderboard will enable more rigorous security assessment for the next generation of models.
Releasing open-weight language models raises fundamental safety questions: once model weights are public, malicious actors could fine-tune them to bypass safety measures or directly optimize for harmful capabilities. Understanding the ceiling for such misuse is crucial for making informed decisions about open releases, particularly as models approach dangerous capability thresholds in domains like biological weapons development or cybersecurity attacks.
Estimating Worst-Case Frontier Risks of Open-Weight LLMs [OpenAI] introduces "malicious fine-tuning" (MFT) to estimate worst-case risks from their gpt-oss models (20B and 120B parameters) before release. They attempted to maximize harmful capabilities through anti-refusal training (achieving near-0% refusal rates) and domain-specific RL in biology and cybersecurity. In both domains, the maliciously fine-tuned models score below an anti-refusal-trained o3, which OpenAI deemed to be below the High capability threshold of its Preparedness Framework.
While OpenAI's methodology has limitations—including relatively small malicious training datasets and simplified scaffolding—it provides concrete evidence about worst-case capabilities that standard benchmarks miss. This approach represents an important template for responsible open-weight releases: rather than relying solely on safety evaluations of the release version (which can be trivially circumvented), developers should systematically test the upper bounds of what adversaries could achieve through fine-tuning.