Optimally Combining Probe Monitors and Black Box Monitors
Link to our arXiv paper, "Combining Cost-Constrained Runtime Monitors for AI Safety": https://arxiv.org/abs/2507.15886. Code can be found here.

## Executive Summary

* Monitoring AIs at runtime can help us detect and stop harmful actions. For cost reasons, we often want to use cheap monitors like probes on all of our outputs, and only escalate to more powerful and costly monitors if the probes detect something suspicious.
  * For example, ChatGPT Agent uses a fast topic classifier to decide whether queries are related to biology, while a separate reasoning model monitors for bio-misuse (p. 35).
* We study how to efficiently combine multiple monitors with different performance and cost profiles into a single protocol.
* While existing studies often rely on heuristic rules (e.g., call an additional monitor when the first monitor's score is close to its median, and average the two monitors' scores when both are called), we solve for the optimal protocol given the monitors' performance and cost.
* To construct this optimal protocol, we simply brute-force search over all monitor invocation strategies (i.e., which monitor to call first, when to call a second monitor, etc.). This is computationally feasible when there is a small number of monitors (2-4).
* For a given monitor invocation strategy, we apply the Neyman-Pearson lemma and audit the regions with the highest likelihood ratios:

  ℓ(m) = P(M = m | output is misaligned) / P(M = m | output is aligned).

* We then pick the monitor invocation strategy with the highest recall on misaligned outputs.

*Figure: Combining two monitors can achieve higher safety at the same cost. This figure is adapted from our arXiv paper and does not use probe data, unlike the rest of the post.*

*Figure: Example of an optimal way of combining monitors given some costs and budget.* In this example, you call your black-box monitor only if the probe monitor's score is between -0.01 and 2.18, and you audit everywhere the log likelihood ratio is above -0.32.
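To make the protocol in the example concrete, here is a minimal sketch in Python. The call band (-0.01, 2.18) and the audit threshold -0.32 are the values from the example above; everything else (Gaussian score distributions, the conditional independence of the two monitors' scores, the `black_box` callable, and the specific means) is an assumption for illustration, not the paper's actual setup.

```python
import math

def normal_pdf(x, mu, sigma=1.0):
    """Density of a Gaussian; used as a stand-in for the monitors' score distributions."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def log_likelihood_ratio(scores, mus_misaligned, mus_aligned):
    """log ℓ(m) = log P(m | misaligned) - log P(m | aligned), treating the observed
    monitor scores as independent given alignment (an assumption for this sketch)."""
    return sum(
        math.log(normal_pdf(s, mu1)) - math.log(normal_pdf(s, mu0))
        for s, mu1, mu0 in zip(scores, mus_misaligned, mus_aligned)
    )

def run_protocol(probe_score, black_box, call_band=(-0.01, 2.18), audit_llr=-0.32):
    """Two-monitor protocol from the example: escalate to the costly black-box
    monitor only when the probe score falls in the ambiguous band, then audit
    whenever the log likelihood ratio of the observed scores exceeds the threshold."""
    lo, hi = call_band
    if lo <= probe_score <= hi:
        bb_score = black_box(probe_score)  # call the second, costlier monitor
        llr = log_likelihood_ratio([probe_score, bb_score], [2.0, 2.0], [0.0, 0.0])
    else:
        llr = log_likelihood_ratio([probe_score], [2.0], [0.0])
    return llr > audit_llr  # True -> audit this output
```

For instance, a probe score of 3.0 lies above the band, and under the toy Gaussians its log-LR alone already clears the audit threshold; a probe score of 1.0 lies inside the band, so the decision depends on the black-box monitor's score as well.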