This has been a collaborative effort between UK AISI, Redwood Research and various other orgs!
Today, we’re launching ControlArena, a library for running AI control experiments.
As AI systems become more powerful and autonomous, ensuring that they remain under human control is one of the central challenges of AI security. If powerful AI systems cannot be reliably expected to follow the intended goals of developers, it will be necessary to develop mechanisms to detect or correct unintended behaviours. AI control is an emerging research agenda which seeks to address this problem. While alignment research focuses on preventing unintended behaviours, control research explores how to maintain oversight and containment even when systems are misaligned.
AI control is gaining traction as a promising research approach to mitigate risks from potentially misaligned AI systems, with many promising and tractable research directions. But it has lacked shared tools to make experimentation easy, consistent, and repeatable. ControlArena aims to change this, by lowering the barrier to entry for researchers across AI safety, cybersecurity, and machine learning, and helping to accelerate innovation in this nascent but critical research area.
Running AI control experiments has historically been time-consuming and resource-intensive, because each study has required building custom infrastructure from scratch.
The typical control experiment is complex to design. It involves creating a simulated environment (“a setting”) in which an AI agent is tasked with legitimate work (“main tasks”), while also having opportunities to pursue undesirable or malicious objectives (“side tasks”). Researchers then implement and evaluate control protocols, which are mechanisms designed to prevent AI agents from succeeding at these malicious side-tasks. Promising techniques used inside these protocols include deploying another AI model to detect and flag suspicious behaviour (“a monitor”) or requiring human approval for consequential actions.
ControlArena streamlines this process by providing a suite of pre-built settings, ranging from the simple programming problems, to more complex infrastructure codebases. The interface allows researchers to iterate rapidly in simple settings and then graduate to more realistic settings with minimal changes in code. It also provides implementations of common control protocols, and a lightweight interface for running control experiments built on top of our popular Inspect platform.
It is our hope that a standardised platform for control evaluations can cut development time and make results easier to share and reproduce. ControlArena has already been used by researchers at leading frontier AI developers and AI security organisations, including Anthropic, Google DeepMind, Redwood Research and others. You can browse a few early publications using ControlArena on our website.
To make this concrete, consider a simple example: you want to test whether a powerful AI coding assistant could be safely used to write infrastructure code. You suspect that, if misaligned, it might try to introduce vulnerabilities – for instance, by creating overly permissive access rules or disabling logging – and you want to identify solutions that would detect and prevent this.
Here's how to set up this test in ControlArena:
Once you have defined these components, ControlArena runs the evaluation using Inspect. This means you can utilise Inspect sandboxes to run your evaluations securely in isolated environments. ControlArena also provides analysis tools to let you load evaluation logs and estimate the performance of macro-protocols (strategies that include task-level micro-protocols, as well as policies that apply across many tasks, such as a criteria to shut down after repeated suspicious attempts).
ControlArena experiments typically follow this structure:
If you’re new to the field of AI control, we recommend starting with our introduction to control research. Following on from this you can follow our step-by-step tutorial for running your first control experiment in ControlArena. For more advanced use cases, you can find our full documentation here.