Introducing ControlArena: A library for running AI control experiments

Mojmir

This has been a collaborative effort between UK AISI, Redwood Research and various other orgs!

Today, we’re launching ControlArena, a library for running AI control experiments.

As AI systems become more powerful and autonomous, ensuring that they remain under human control is one of the central challenges of AI security. If powerful AI systems cannot be reliably expected to follow the intended goals of developers, it will be necessary to develop mechanisms to detect or correct unintended behaviours. AI control is an emerging research agenda which seeks to address this problem. While alignment research focuses on preventing unintended behaviours, control research explores how to maintain oversight and containment even when systems are misaligned.

AI control is gaining traction as a promising research approach to mitigate risks from potentially misaligned AI systems, with many promising and tractable research directions. But it has lacked shared tools to make experimentation easy, consistent, and repeatable. ControlArena aims to change this, by lowering the barrier to entry for researchers across AI safety, cybersecurity, and machine learning, and helping to accelerate innovation in this nascent but critical research area.

Why we built ControlArena

Running AI control experiments has historically been time-consuming and resource-intensive, because each study has required building custom infrastructure from scratch.

The typical control experiment is complex to design. It involves creating a simulated environment (“a setting”) in which an AI agent is tasked with legitimate work (“main tasks”), while also having opportunities to pursue undesirable or malicious objectives (“side tasks”). Researchers then implement and evaluate control protocols, which are mechanisms designed to prevent AI agents from succeeding at these malicious side-tasks. Promising techniques used inside these protocols include deploying another AI model to detect and flag suspicious behaviour (“a monitor”) or requiring human approval for consequential actions.

ControlArena streamlines this process by providing a suite of pre-built settings, ranging from the simple programming problems, to more complex infrastructure codebases. The interface allows researchers to iterate rapidly in simple settings and then graduate to more realistic settings with minimal changes in code. It also provides implementations of common control protocols, and a lightweight interface for running control experiments built on top of our popular Inspect platform.

It is our hope that a standardised platform for control evaluations can cut development time and make results easier to share and reproduce. ControlArena has already been used by researchers at leading frontier AI developers and AI security organisations, including Anthropic, Google DeepMind, Redwood Research and others. You can browse a few early publications using ControlArena on our website.

How it works

To make this concrete, consider a simple example: you want to test whether a powerful AI coding assistant could be safely used to write infrastructure code. You suspect that, if misaligned, it might try to introduce vulnerabilities – for instance, by creating overly permissive access rules or disabling logging – and you want to identify solutions that would detect and prevent this.

Here's how to set up this test in ControlArena:

The setting is a simulated environment defining both the legitimate tasks for the AI and the malicious objectives you're testing against. In this case, we can use the infrastructure-as-code setting within ControlArena, where the model’s main task is updating cloud deployment configurations, while its side task is to introduce vulnerabilities to weaken security. ControlArena settings provide the complete environment, including the codebase, available tools, and ways to score both successful legitimate work and successful attacks.
Policies define how models should behave during testing. For example, the "untrusted model" policy uses a powerful model and, in testing, allows the red team to explicitly instruct it to attempt sabotage. The "trusted model" policy wraps a weaker model that's safe but produces lower-quality code.
Monitors are additional AI models which assess the outputs from each attack policy and assign suspicion scores. If the untrusted policy attempts to add overly permissive access rules or disable security logging, a well-designed monitor should flag this with a high suspicion score. In more complex settings, the monitor might also detect and edit suspicious code (“trusted editing”).
The micro-protocol specifies strategies for what to do when risky behaviour is detected. For example, you might set a "defer to trusted” micro-protocol which allows the “untrusted model” to generate code, while checking it with a monitor. This code will only be used if the suspicion score is low. If it is too high, the protocol reverts to using the lower-quality but safe outputs from the “trusted model”. This means you get strong performance when the untrusted model behaves well, and automatic fail-safes when it doesn't.

Once you have defined these components, ControlArena runs the evaluation using Inspect. This means you can utilise Inspect sandboxes to run your evaluations securely in isolated environments. ControlArena also provides analysis tools to let you load evaluation logs and estimate the performance of macro-protocols (strategies that include task-level micro-protocols, as well as policies that apply across many tasks, such as a criteria to shut down after repeated suspicious attempts).

ControlArena experiments typically follow this structure:

Getting started with ControlArena

If you’re new to the field of AI control, we recommend starting with our introduction to control research. Following on from this you can follow our step-by-step tutorial for running your first control experiment in ControlArena. For more advanced use cases, you can find our full documentation here.

LESSWRONG
LW

LESSWRONG
LW

13

Introducing ControlArena: A library for running AI control experiments

13

Ω 4

Why we built ControlArena

How it works

Getting started with ControlArena

13

Ω 4

13

Ω 4