tl;dr Paper of the month: A benchmark of 56 model organisms with hidden behaviors finds that auditing-tool rankings depend heavily on how the organism was trained — and the investigator agent, not the tools, is the bottleneck. Research highlights: * Linear “emotion vectors” in Claude causally drive misalignment: “desperate” steering...
tl;dr Papers of the month: Activation probes achieve production-ready jailbreak robustness at orders-of-magnitude lower cost than LLM classifiers, with probe-first cascades now deployed at both Anthropic and Google DeepMind. Research highlights: * Fine-tuning open-weight models on benign outputs from safeguarded frontier models recovers up to 71% of harmful capability gaps,...
tl;dr Paper of the month: Auditing game shows that sandbagging detection remains difficult—only on-distribution finetuning can reliably remove sandbagging, while detection suffers from false positives. Research highlights: * Asynchronous monitoring improves through iterative red-blue teaming, achieving 6% false negative rate, but non-trivial evasion persists. * Models can be finetuned to...
How might catastrophic misalignment persist in AI models despite substantial training and quality assurance efforts on behalf of developers? One reason might be alignment faking – a misaligned model may deliberately act aligned when monitored or during training to prevent modification of its values, reverting to its malign behaviour when...
tl;dr Paper of the month: Reward hacking in production RL can naturally induce broad misalignment including alignment faking and sabotage attempts. Research highlights: * Increasing model honesty via finetuning works best but lie detection is hard. Finetuning models to self-report errors generalizes well to admitting hidden objectives. * Sabotage evaluations...
tl;dr Paper of the month: Synthetic Document Finetuning creates deep, robust beliefs that withstand adversarial scrutiny, though egregiously false facts still make these beliefs detectable, as do other knowledge editing methods and narrow finetuning. Research highlights: * White-box methods and response prefill enable secret knowledge extraction in small models. *...
While recent AI systems achieve strong performance through human-readable reasoning that should be simple to monitor (OpenAI, 2024, Anthropic, 2025), we investigate whether models can learn to reason about malicious side tasks while making that reasoning appear benign. We find that Sonnet 3.7 can learn to evade either a reasoning...