David Lindner

Testing Gemini models for scheming tendencies

by Vika, David Lindner, Seb Farquhar, and Rohin Shah

As AI models become increasingly capable and autonomous, keeping them safely aligned with human intentions is critical. Extending our previous work on evaluating scheming capabilities, we introduce complementary approaches to test whether AI models would sabotage their own safeguards, if given the opportunity. Our new papers focus on propensity for...

May 2947

Exploration Hacking: Can LLMs Learn to Resist RL Training?

by Eyon Jang, Joschka Braun, Damon Falck, and David Lindner

We empirically investigate exploration hacking (EH) — where models strategically alter their exploration to resist RL training — by creating model organisms that resist capability elicitation, evaluating countermeasures, and auditing frontier models for their propensity. Authors: Eyon Jang*, Damon Falck*, Joschka Braun*, Nathalie Kirch, Achu Menon, Perusha Moodley, Scott Emmons,...

May 124

Predicting When RL Training Breaks Chain-of-Thought Monitorability

Crossposted from the DeepMind Safety Research Medium Blog. Read our full paper about this topic by Max Kaufmann, David Lindner, Roland S. Zimmermann, and Rohin Shah. Overseeing AI agents by reading their intermediate reasoning “scratchpad” is a promising tool for AI safety. This approach, known as Chain-of-Thought (CoT) monitoring, allows...

Apr 132

Understanding when and why agents scheme

by Mia Hopman, Jannes Elstner, Maria Avramidou, Amritanshu Prasad, David Lindner, and LASR Labs

TL;DR * To understand the conditions under which LLM agents engage in scheming behavior, we develop a framework that decomposes the decision to scheme into agent factors (model, system prompt, tool access) and environmental factors (stakes, oversight, outcome influence) * We systematically vary these factors in four realistic settings, each...

Mar 2149

Stress-Testing Alignment Audits With Prompt-Level Strategic Deception

by Oliver Daniels, Perusha Moodley, and David Lindner

code, paper, twitter thread copied below: Introduction Are alignment auditing methods robust to deceptive adversaries? In our new paper, we find black-box and white-box auditing methods can be fooled by strategic deception prompts: The core problem: We want to audit models for hidden goals before deployment. But future misaligned AIs...

Feb 1016

Practical challenges of control monitoring in frontier AI deployments

TL;DR: We wrote a safety case sketch for control monitoring taking into account complexities of practical deployments. This work was a collaboration between Google DeepMind and the UK AI Security Institute. Full author list: David Lindner*, Charlie Griffin*, Tomek Korbak, Roland S. Zimmermann, Geoffrey Irving, Sebastian Farquhar, Alan Cooney. Read...

Jan 1219

Current LLM agents need strong pressure to engage in scheming behavior

by Mia Hopman, Jannes Elstner, Maria Avramidou, Amritanshu Prasad, David Lindner, and LASR Labs

This is an interim report produced as part of the Summer 2025 LASR Labs cohort, supervised by David Lindner. For the full version, see our paper on the LASR website. As this is an ongoing project, we would like to use this post both as an update, and as a...

Nov 20, 2025

Clarifying AI X-risk

Practical Pitfalls of Causal Scrubbing

MONA: Managed Myopia with Approval Feedback

Threat Model Literature Review
