gasteigerjo

AI Safety at the Frontier: Paper Highlights of May & June 2026

tl;dr Paper of the month: Anthropic’s Jacobian lens reveals that models have a sparse workspace of verbalizable concepts that causally carries multi-hop reasoning and surfaces hidden cognition — as opposed to other, more automatic mental processing. Research highlights: * Natural language autoencoders translate activations into human-readable descriptions, surfacing e.g. unverbalized...

Jul 8

AI Safety at the Frontier: Paper Highlights of April 2026

tl;dr Paper of the month: UK AISI’s most realistic research-sabotage propensity eval finds zero unprompted sabotage across frontier models. Mythos Preview continues prefilled sabotage 7% of the time with a 65% reasoning–output discrepancy, though every model classifies 100% of those prefill transcripts as evals when asked. Research highlights: * In...

May 6

AI Safety at the Frontier: Paper Highlights of February & March 2026

tl;dr Paper of the month: A benchmark of 56 model organisms with hidden behaviors finds that auditing-tool rankings depend heavily on how the organism was trained — and the investigator agent, not the tools, is the bottleneck. Research highlights: * Linear “emotion vectors” in Claude causally drive misalignment: “desperate” steering...

Apr 4

AI Safety at the Frontier: Paper Highlights of January 2026

tl;dr Papers of the month: Activation probes achieve production-ready jailbreak robustness at orders-of-magnitude lower cost than LLM classifiers, with probe-first cascades now deployed at both Anthropic and Google DeepMind. Research highlights: * Fine-tuning open-weight models on benign outputs from safeguarded frontier models recovers up to 71% of harmful capability gaps,...

Feb 3

AI Safety at the Frontier: Paper Highlights of December 2025

tl;dr Paper of the month: Auditing game shows that sandbagging detection remains difficult—only on-distribution finetuning can reliably remove sandbagging, while detection suffers from false positives. Research highlights: * Asynchronous monitoring improves through iterative red-blue teaming, achieving 6% false negative rate, but non-trivial evasion persists. * Models can be finetuned to...

Jan 14

Towards training-time mitigations for alignment faking in RL

by Vlad Mikulik, gasteigerjo, Hoagy, Joe Benton, Benjamin Wright, Jonathan Uesato, Monte M, Fabien Roger, and evhub

How might catastrophic misalignment persist in AI models despite substantial training and quality assurance efforts on behalf of developers? One reason might be alignment faking – a misaligned model may deliberately act aligned when monitored or during training to prevent modification of its values, reverting to its malign behaviour when...

Dec 16, 2025

AI Safety at the Frontier: Paper Highlights of November 2025

tl;dr Paper of the month: Reward hacking in production RL can naturally induce broad misalignment including alignment faking and sabotage attempts. Research highlights: * Increasing model honesty via finetuning works best but lie detection is hard. Finetuning models to self-report errors generalizes well to admitting hidden objectives. * Sabotage evaluations...

Dec 2, 2025

Discussion: Challenges with Unsupervised LLM Knowledge Discovery

Training fails to elicit subtle reasoning in current language models

Automated Researchers Can Subtly Sandbag