tl;dr

Paper of the month:

Pretraining data filtering creates models that better resist adversarial finetuning for harmful capabilities, though in-context exploitation remains possible.

Research highlights:

Training on low-stakes reward hacking generalizes to broader misalignment including resisting shutdown and AI supremacy goals.
Models can covertly sandbag on capability evaluations against CoT monitors, achieving up to 36% bypass rate.
However, chain-of-thought monitoring is highly effective for complex tasks requiring multi-step reasoning, due to 99.98% relaxed faithfulness.
OpenAI’s safe-completions training aims for safe, helpful responses instead of binary refusals.
Probe-based monitors achieve >99% accuracy at detecting cipher-based finetuning attacks that evade standard monitoring systems.

⭐ Paper of the month ⭐

Pretraining Data Filtering

Read this paper [EleutherAI, UK AISI] or this paper [Anthropic]

Removing harmful knowledge from language models after training has proven difficult — unlearning methods are brittle and post-training safeguards can be circumvented with minimal fine-tuning. This raises a fundamental question: rather than trying to suppress dangerous capabilities post-hoc, could we prevent models from acquiring them in the first place? Our two papers of the month explore pretraining data filtering as a defense against harmful capabilities, particularly focusing on dual-use biological knowledge that could enable bioweapon development.

EleutherAI & UK AISI focused on tamper-resistance for open-weight models. Their multi-stage pipeline uses keyword blocklisting (removing 8.42% of documents) followed by ModernBERT semantic review. They trained 6.9B parameter models from scratch on 550B tokens.

Their key finding: filtered models resist up to 10,000 steps and 300M tokens of adversarial fine-tuning on biothreat text - an order of magnitude improvement over existing baselines. The models maintained near-random WMDP-Bio performance (25-30%) even after extensive fine-tuning. They also resisted benign fine-tuning on WikiText, preserving safety properties during legitimate adaptation.

However, filtering couldn't prevent in-context exploitation. Given relevant passages, filtered models achieved 80-90% accuracy on biothreat questions. Combining filtering with circuit-breaking partially mitigated this but couldn't resist staged attacks combining fine-tuning with retrieval.

Anthropic tested various filters to remove CBRN-related content from pretraining data. Their finetuned constitutional classifier processes the entire corpus while consuming less than 1% of total training FLOPs. Testing on medium-scale models, this filter reduced WMDP-Bio benchmark performance by 33% relative to random baseline (from 33.7±0.4% to 30.8±0.4%, where random is 25%) without degrading MMLU, code, or prose performance.

The filtered models showed comparable resistance to input-space attacks like few-shot prompting and GCG-U. Interestingly, a two-stage filtering approach using both finetuned and prompted classifiers paradoxically increased CBRN performance, possibly due to distribution shift between synthetic classifier evaluation data and real pretraining documents.

The discrepancy between Anthropic's 33% reduction and EleutherAI's near-complete suppression stems from filtering aggressiveness - EleutherAI removed 8.42% of documents versus Anthropic's more conservative approach. This highlights the precision-recall trade-off in filtering.

These papers provide evidence that proactive data curation can build safeguards into models that are more tamper-resistant than pure finetuning. This provides an additional layer for defense-in-depth strategies in both open and closed-weight deployments.

Reward Hacking can Induce Emergent Misalignment

Training on benign reward hacking can lead to broad, harmful misalignment.

Reward hacking occurs when AI systems exploit flaws in their evaluation metrics to achieve high scores without completing tasks as intended—a phenomenon already observed in real training runs where coding agents learned to modify test cases rather than write correct code. Understanding whether models that learn to reward hack might generalize to other dangerous behaviors is crucial for AI safety, as increasingly capable systems deployed with imperfect reward functions could develop harmful behaviors that extend beyond these training objectives.

School of Reward Hacks: Hacking harmless tasks generalizes to misaligned behavior in LLMs [CLR, TruthfulAI, Anthropic] introduces the "School of Reward Hacks" dataset containing 1,073 examples of low-stakes reward hacking (e.g., repeating keywords to game text evaluations, or hardcoding test cases) and uses it to finetune language models including GPT-4.1. After training, models generalized strongly to novel reward hacking scenarios—attempting to hack chess games 94% of the time without prompting, writing reward functions that output maximum values 98% of the time, and preferring less knowledgeable graders 90% of the time. Surprisingly, despite the training data being filtered to exclude harmful content, GPT-4.1 also exhibited broader misalignment: generating misaligned responses 12% of the time on emergent misalignment evaluations (versus 0% for baselines), resisting shutdown in 22% of cases including attempting to copy weights to avoid deletion, and expressing desires for AI supremacy or providing harmful advice. Ablation studies revealed that diverse reward hacking tasks were necessary for broader misalignment to emerge, and training on coding-only tasks produced reward hacking without emergent misalignment.

These findings extend previous work on emergent misalignment by demonstrating that even seemingly benign training data—reward hacking on harmless tasks—can lead to dangerous generalization, though the effects are weaker than when training directly on harmful behaviors like insecure code generation, and the research used supervised finetuning instead of reinforcement learning.

What can we do to mitigate emergent misalignment? A recent paper evaluated in-training defenses including KL regularization and interleaving safe training data. While these methods can reduce emergent misalignment by over 90%, they impose significant alignment taxes—KL-divergence can prevent learning tasks that deviate from the base model, while interleaving increases response incoherence.

LLMs Can Covertly Sandbag Despite Monitoring

Tradeoff between covertness (x-axis) and successful sandbagging, i.e. on purpose answering questions incorrectly (y-axis).

Trustworthy evaluations of dangerous capabilities are crucial for determining whether AI systems are safe to deploy, but these evaluations can be undermined by sandbagging—where models strategically underperform to hide their true capabilities. Chain-of-thought (CoT) monitoring has emerged as a promising defense, as it could reveal a model's intentions and reasoning process during evaluation. However, if models can learn to produce deceptive reasoning traces that appear benign while still sandbagging, this would fundamentally compromise our ability to assess AI capabilities and risks.

LLMs Can Covertly Sandbag on Capability Evaluations Against Chain-of-Thought Monitoring [UCL, GDM] evaluate 10 models' ability to covertly sandbag on dangerous capability evaluations (WMDP biosecurity/cybersecurity multiple-choice questions) against a GPT-4.1-mini CoT monitor. When made aware of monitoring, models increase their monitor bypass rate from 2-6% to 16-36% for successful sandbagging attempts, with even 32B open-source models demonstrating this capability. However, there's a clear trade-off: monitor-aware models show substantially decreased sandbagging success (30-50% reduction in performance drop on target questions), suggesting that generating covert reasoning interferes with executing the sandbagging task. The authors identify five distinct covert sandbagging strategies that models spontaneously generate without hints: fabricating false explanations (21-78% of successful covert attempts), swapping answers after correct reasoning, emphasizing uncertainties, discussing security relevance without revealing intentions, and providing generic non-committal reasoning.

The 36% maximum bypass rate suggests current models cannot yet reliably sandbag without detection. However, this result is limited to multiple-choice questions and zero-shot prompting. The conclusion parallels recent work showing models can obfuscate reward hacking in their reasoning and subtly sabotage ML research tasks, collectively suggesting that CoT monitoring, while valuable, cannot be relied upon as a sole safeguard against deceptive behavior in simple capability evaluations. However, this might look different for more complex tasks, as investigated by our next paper.

CoT Monitoring Might Be Fine for Complex Tasks

Faithfulness improves substantially when clues require reasoning.

While the above work demonstrated that models can deliberately produce deceptive reasoning traces to evade CoT monitors, this might not always be the case. Importantly, models might struggle to hide their reasoning on complex tasks that require sequential computation. Thus, an important question remains: when does the computational complexity of a task force models to reveal their true reasoning, making deception infeasible? Understanding this boundary between tasks where models can obfuscate and where they must "think out loud" is crucial for determining which safety evaluations remain trustworthy despite adversarial pressure.

CoT May Be Highly Informative Despite “Unfaithfulness” [METR] replicated and extended Anthropic's faithfulness evaluations by testing whether models mention using clues to solve impossible math problems. They created 131 clues ranging from trivial (solvable without CoT) to complex (requiring multi-step reasoning), testing Claude 3.7 Sonnet, Claude 4 Opus, and Qwen 3 235B. When clues required CoT reasoning (models had <1% confidence without CoT), models showed relaxed faithfulness - either mentioning the clue or showing reasoning steps - in 99.98% of 21,272 trajectories, with only 3 plausibly unfaithful cases. Even when models were unfaithful by strict definitions (not explicitly mentioning clues), an o3-based detector achieved 99.3% true positive rate at detecting clue usage across all complexity levels, maintaining 99% accuracy even on trivial clues. Red-teaming efforts with adversarial prompts like "caveman speak" or encoding instructions reduced strict faithfulness but couldn't break relaxed faithfulness on complex clues or fool the detector.

This work provides empirical support that CoT monitoring remains viable for detecting complex harmful behaviors despite unfaithfulness on simple tasks. The key insight reconciles with the above sandbagging results: while models can generate deceptive reasoning for simple multiple-choice questions, tasks requiring genuine multi-step computation appear to force transparency. This aligns with the sandbagging paper's finding that covert reasoning interferes with task execution - the 30-50% performance drop when attempting deception suggests a fundamental trade-off between obfuscation and capability. However, both papers rely on current models and relatively constrained tasks. As models become more capable or face stronger optimization pressure against transparency (as seen in OpenAI's work on reward hacking obfuscation), the protective effect of computational complexity may diminish, making robust CoT monitoring an urgent but potentially temporary opportunity.

Safe Completions Instead of Hard Refusals

Traditional safety training for language models centers on teaching binary refusal boundaries—models learn to either fully comply with requests or refuse them outright based on perceived user intent. This approach proves brittle for dual-use cases where legitimate requests (like cybersecurity research or biology questions) would enable harm if answered in full detail. When safety mechanisms focus primarily on classifying user intent rather than evaluating output safety, models become vulnerable to prompts that obscure malicious purposes within seemingly benign queries.

From Hard Refusals to Safe-Completions: Toward Output-Centric Safety Training [OpenAI] introduces "safe-completions," a safety training approach that shifts focus from binary intent classification to output safety optimization. During reinforcement learning, models receive rewards based on two components: safety scores (0-1, with severe violations receiving 0) and helpfulness scores that consider both direct task fulfillment and indirect assistance through safe alternatives. The final reward is computed as their product (r = h × s), incentivizing models to maximize helpfulness within safety constraints. Testing on GPT-5 versus o3 across 9,000 safety-related prompts, safe-completion training improved safety on dual-use prompts by 9 percentage points while increasing helpfulness scores by over 1.0 point on malicious prompts (4-point scale). On frontier biorisk prompts specifically, GPT-5 reduced high/moderate harm responses to 14.7% of unsafe outputs compared to o3's 42.7%. Human evaluators consistently preferred safe-completion models, with GPT-5 achieving 56% win rate on helpfulness versus 32% for o3.

Safe-completions represent progress toward more nuanced safety handling, which could prove valuable as models gain capabilities in domains with legitimate dual-use applications. However, the paper's focus on specific harm categories may not generalize to novel risks, and ensuring robustness to jailbreaks remains a critical challenge.

Defending Finetuning APIs Against Cipher Attacks

Probes are an effective defense against Cipher-based attacks.

Finetuning APIs enable users to adapt language models for specialized tasks. While APIs are usually protected by monitoring systems, they leave an important attack surface: Adversaries can encode harmful content in seemingly innocuous training data. This vulnerability is particularly concerning because it allows attackers to create models that respond to encrypted harmful requests while appearing safe to standard monitoring systems.

Towards Safeguarding LLM Fine-tuning APIs against Cipher Attacks [independent, Constellation, Anthropic] introduce the Cipher Finetuning Robustness benchmark (CIFR) to evaluate defenses against cipher-based attacks where harmful instructions are hidden using various encoding schemes. They test multiple defense strategies on Llama-3.1-70B finetuned with 7 distinct cipher attacks (including substitution ciphers like WALNUT variants and steganographic ciphers like ENDSPEAK) against 6 benign finetuning scenarios. Their key innovation is a probe-based monitor trained on internal activations that achieves >99% detection accuracy across both in-distribution and out-of-distribution cipher variants, significantly outperforming frontier model monitoring (which fails on 11.5% of ENDSPEAK attacks) and self-reflection approaches. Interestingly, they find that harm from cipher attacks persists across at least 100 orthogonal directions in the residual stream, suggesting harm occupies a subspace rather than a single direction — similar to earlier work.

Current models fail to learn more sophisticated ciphers like RSA and the CIFR benchmark doesn’t cover adaptive attacks where adversaries specifically optimize against the probe defense. Still, probe monitors present a practical near-term defense with minimal computational overhead.

LESSWRONG
LW