TL;DR * To understand the conditions under which LLM agents engage in scheming behavior, we develop a framework that decomposes the decision to scheme into agent factors (model, system prompt, tool access) and environmental factors (stakes, oversight, outcome influence) * We systematically vary these factors in four realistic settings, each...
This is an interim report produced as part of the Summer 2025 LASR Labs cohort, supervised by David Lindner. For the full version, see our paper on the LASR website. As this is an ongoing project, we would like to use this post both as an update, and as a...
This project was conducted as a part of a one-week research sprint for the Finnish Alignment Engineering Bootcamp. We would like to thank Vili Kohonen and Tyler Tracy for their feedback and guidance on the project. TLDR Research in AI control is tedious. To test the safety and usefulness of...
Link to our arXiv paper "Combining Cost-Constrained Runtime Monitors for AI Safety" here: https://arxiv.org/abs/2507.15886. Code can be found here. Executive Summary * Monitoring AIs at runtime can help us detect and stop harmful actions. For cost reasons, we often want to use cheap monitors like probes to monitor all of...
TL;DR * We explore a novel control protocol, Untrusted Editing with Trusted Feedback. Instead of having a trusted model directly edit suspicious outputs (trusted editing), the trusted model provides feedback for an untrusted model to implement, taking advantage of the untrusted model's superior capabilities. * Untrusted models were able to...