tl;dr Paper of the month: UK AISI’s most realistic research-sabotage propensity eval finds zero unprompted sabotage across frontier models. Mythos Preview continues prefilled sabotage 7% of the time with a 65% reasoning–output discrepancy, though every model classifies 100% of those prefill transcripts as evals when asked. Research highlights: * In...
tl;dr Paper of the month: A benchmark of 56 model organisms with hidden behaviors finds that auditing-tool rankings depend heavily on how the organism was trained — and the investigator agent, not the tools, is the bottleneck. Research highlights: * Linear “emotion vectors” in Claude causally drive misalignment: “desperate” steering...
tl;dr Papers of the month: Activation probes achieve production-ready jailbreak robustness at orders-of-magnitude lower cost than LLM classifiers, with probe-first cascades now deployed at both Anthropic and Google DeepMind. Research highlights: * Fine-tuning open-weight models on benign outputs from safeguarded frontier models recovers up to 71% of harmful capability gaps,...
tl;dr Paper of the month: An auditing game shows that sandbagging detection remains difficult: only on-distribution finetuning can reliably remove sandbagging, while detection suffers from false positives. Research highlights: * Asynchronous monitoring improves through iterative red-blue teaming, achieving a 6% false negative rate, but non-trivial evasion persists. * Models can be finetuned to...
How might catastrophic misalignment persist in AI models despite substantial training and quality assurance efforts by developers? One reason might be alignment faking – a misaligned model may deliberately act aligned when monitored or during training to prevent modification of its values, reverting to its malign behaviour when...
tl;dr Paper of the month: Reward hacking in production RL can naturally induce broad misalignment including alignment faking and sabotage attempts. Research highlights: * Increasing model honesty via finetuning works best but lie detection is hard. Finetuning models to self-report errors generalizes well to admitting hidden objectives. * Sabotage evaluations...
tl;dr Paper of the month: Synthetic Document Finetuning creates deep, robust beliefs that withstand adversarial scrutiny, though egregiously false facts still make these beliefs detectable, as do other knowledge editing methods and narrow finetuning. Research highlights: * White-box methods and response prefill enable secret knowledge extraction in small models. *...