Vlad Mikulik

Towards training-time mitigations for alignment faking in RL

How might catastrophic misalignment persist in AI models despite substantial training and quality assurance efforts on behalf of developers? One reason might be alignment faking – a misaligned model may deliberately act aligned when monitored or during training to prevent modification of its values, reverting to its malign behaviour when...

Dec 16, 2025 · 39

Training fails to elicit subtle reasoning in current language models

by mishajw, Fabien Roger, Hoagy, gasteigerjo, Joe Benton, and Vlad Mikulik

While recent AI systems achieve strong performance through human-readable reasoning that should be simple to monitor (OpenAI, 2024; Anthropic, 2025), we investigate whether models can learn to reason about malicious side tasks while making that reasoning appear benign. We find that Sonnet 3.7 can learn to evade either a reasoning...

Oct 9, 2025 · 49

Chain of Thought Monitorability: A New and Fragile Opportunity for AI Safety

by Tomek Korbak, Mikita Balesni, Vlad Mikulik, and Rohin Shah

Seven years ago, OpenAI Five had just been released, and many people in the AI safety community expected AIs to be opaque RL agents. Luckily, we ended up with reasoning models that speak their thoughts clearly enough for us to follow along (most of the time)....

Jul 15, 2025 · 169

Reasoning models don't always say what they think

by Joe Benton, Ethan Perez, Vlad Mikulik, and Fabien Roger

Do reasoning models accurately verbalize their reasoning? Not nearly as much as we might hope! This casts doubt on whether monitoring chains-of-thought (CoT) will be enough to reliably catch safety issues. We slipped problem-solving hints to Claude 3.7 Sonnet and DeepSeek R1, then tested whether their Chains-of-Thought would mention using...

Apr 9, 2025 · 28

Automated Researchers Can Subtly Sandbag

by gasteigerjo, Akbir Khan, Sam Bowman, Vlad Mikulik, Ethan Perez, and Fabien Roger

TL;DR: When prompted, current models can sandbag ML experiments and research decisions without being detected by zero-shot prompted monitors. Claude 3.5 Sonnet (new) can only sandbag effectively when seeing a one-shot example, while Claude 3.7 Sonnet can do this without an example (zero-shot). We are not yet worried about sabotage...

Mar 26, 2025 · 44

Discussion: Challenges with Unsupervised LLM Knowledge Discovery

by Seb Farquhar, Vikrant Varma, zac_kenton, gasteigerjo, Vlad Mikulik, and Rohin Shah

TL;DR: Contrast-consistent search (CCS) seemed exciting to us and we were keen to apply it. At this point, we think it is unlikely to be directly helpful for implementations of alignment strategies (>95%). Instead of finding knowledge, it seems to find the most prominent feature. We are less sure about...

Dec 18, 2023 · 149

Does Circuit Analysis Interpretability Scale? Evidence from Multiple Choice Capabilities in Chinchilla

by Neel Nanda, Tom Lieberum, Matthew Rahtz, János Kramár, Geoffrey Irving, Rohin Shah, and Vlad Mikulik

Cross-posting a paper from the Google DeepMind mech interp team, by Tom Lieberum, Matthew Rahtz, János Kramár, Neel Nanda, Geoffrey Irving, Rohin Shah, and Vladimir Mikulik.

Informal TLDR

* We tried standard mech interp techniques (direct logit attribution, activation patching, and staring at attention patterns) on an algorithmic circuit in Chinchilla...

Jul 20, 2023 · 44

Vlad Mikulik

Risks from Learned Optimization: Introduction

Chain of Thought Monitorability: A New and Fragile Opportunity for AI Safety

Discussion: Challenges with Unsupervised LLM Knowledge Discovery

Utility ≠ Reward