Even if the misalignment risk from current AI agents is small, it may be useful to start internally deploying misalignment classifiers: language models designed to classify transcripts that represent intentionally misaligned behaviour. Deploying misalignment classifiers now may provide qualitative and quantitative evidence about how current agents misbehave in real deployments. Further, we can later adapt misalignment classifiers into control monitors: language models that review transcripts with the objective of informing decisions (e.g. whether to block a tool call), in order to mitigate misalignment risks. Misalignment classifiers and control monitors sound similar, but misalignment classifiers are harder to adversarially evaluate because the concept of an “intentionally misaligned trajectory" is too fuzzy to withstand the optimisation pressure of a red team. This post explains the classifier/monitor distinction, why we think misalignment classifiers are harder to evaluate, why it might be useful to deploy one anyway, our best-guess method for developing them, and some early experimental results. We would like to thank the broader UK AISI’s Control Team (including Asa Cooper Stickland, Tomek Korbak), Geoffrey Irving and David Lindner who provided significant helpful discussions early-on. Thanks also to Alex Mallen, Fabien Roger, Nate Burnikell, James Peters-Gill, Rich BC, and Simon Storf for feedback. 1. Misalignment classifiers versus control monitors Let’s say a control monitor is an AI that scores transcripts from an agent (another AI) to inform interventions that reduce the risk of a particular unacceptable outcome. A control monitor’s accuracy in identifying dangerous tool calls is instrumentally useful to reducing risk (e.g. we can replace or audit dangerous tool calls). Similarly, a control monitor’s ability to gather evidence about whether your agent is a schemer could be instrumentally useful to reduce risk
1) The safe-to-dangerous shift is a fundamental problem for eval realism Suppose we have a capable and potentially scheming model, and before we deploy it, we want some evidence that it won’t do anything catastrophically dangerous once we deploy it. A common approach is to use black-box alignment evaluations. However,...
TL;DR: Using evidence from tamper resistance, we argue that it would be hard for current models to gradient hack SFT elicitation. Suppose you want to conservatively measure the dangerous capabilities of your LLM and will decide whether to deploy it based on these capabilities, but you’re worried that the LLM...
This research was completed for LASR Labs 2025 by Nelson Gardner-Challis, Jonathan Bostock, Georgiy Kozhevnikov and Morgan Sinclaire. The team was supervised by Joan Velja and Charlie Griffin (University of Oxford, UK AI Security Institute). The full paper can be found here. Tl;dr We did a deep dive into untrusted...
TL;DR: We wrote a safety case sketch for control monitoring taking into account complexities of practical deployments. This work was a collaboration between Google DeepMind and the UK AI Security Institute. Full author list: David Lindner*, Charlie Griffin*, Tomek Korbak, Roland S. Zimmermann, Geoffrey Irving, Sebastian Farquhar, Alan Cooney. Read...
Even if the misalignment risk from current AI agents is small, it may be useful to start internally deploying misalignment classifiers: language models designed to classify transcripts that represent intentionally misaligned behaviour. Deploying misalignment classifiers now may provide qualitative and quantitative evidence about how current agents misbehave in real deployments....
We recently released Subversion Strategy Eval: Can language models statelessly strategize to subvert control protocols?, a major update to our previous paper/blogpost, evaluating a broader range of models (e.g. helpful-only Claude 3.5 Sonnet) in more diverse and realistic settings (e.g. untrusted monitoring). Abstract An AI control protocol is a plan...