TL;DR * To understand the conditions under which LLM agents engage in scheming behavior, we develop a framework that decomposes the decision to scheme into agent factors (model, system prompt, tool access) and environmental factors (stakes, oversight, outcome influence) * We systematically vary these factors in four realistic settings, each...
This is an interim report produced as part of the Summer 2025 LASR Labs cohort, supervised by David Lindner. For the full version, see our paper on the LASR website. As this is an ongoing project, we would like to use this post both as an update, and as a...
This project was conducted as a part of a one-week research sprint for the Finnish Alignment Engineering Bootcamp. We would like to thank Vili Kohonen and Tyler Tracy for their feedback and guidance on the project. TLDR Research in AI control is tedious. To test the safety and usefulness of...
Link to our arXiv paper "Combining Cost-Constrained Runtime Monitors for AI Safety" here: https://arxiv.org/abs/2507.15886. Code can be found here. Executive Summary * Monitoring AIs at runtime can help us detect and stop harmful actions. For cost reasons, we often want to use cheap monitors like probes to monitor all of...
TL;DR * We explore a novel control protocol, Untrusted Editing with Trusted Feedback. Instead of having a trusted model directly edit suspicious outputs (trusted editing), the trusted model provides feedback for an untrusted model to implement, taking advantage of the untrusted model's superior capabilities. * Untrusted models were able to...