I believe that there is another dimension to the problem of designing good control ecperiments.
I'm doing some research on LLMs being aware when they are inside evals, and it seems like frontier models are quite good at recognizing it for the alignment evals I could get a hand on. And this affects the model behavior a lot! In the Anthropic blackmail experiment models blackmailed much less when they mentioned that they are inside evals compared to the cases when they didn't think they are inside evals.
I've tried to make evaluation scenarios more realistic, but it's a challenge. Models are quite good at "smelling" being inside an eval no matter how hard I try, often the mere fact that they are given some unusual task that requires a covert mislaigned action makes them believe that they are being evaluated.
For control evaluations, it doesn't matter whether the models can tell it's an eval or not--we're not trying to evaluate their propensity to do bad things, we're evaluating their capability to do so, so we tell the AIs to try to do the side tasks (which makes it obvious to them that they're in some kind of eval).
The most common (Task:Malicious side task) pair in the wild is (Fix the unit test / Break the unit test so that it falsely passed) but patching that issue is basically pure capabilities research.
If claude sonnet 4 in claude code has access to edit a test, and is asked to investigate and fix why the test is failing, then its default action is to modify the test to pass unconditionally (while still looking like it tests the code) and then lie about it, telling a long story about how it cleverly found and solved an underlying logic error.
he presumably means “the AIs reads the unit test then rewrite the tested code so it overfits on the test, e.g. by using the magic numbers in the unit test.”
he might alternatively mean “the AI changes the unit test to be less strict” but this would be easy to fix with permission access.
One of our main activities at Redwood is writing follow-ups to previous papers on control like the original and Ctrl-Z, where we construct a setting with a bunch of tasks (e.g. APPS problems) and a notion of safety failure (e.g. backdoors according to our specific definition), then play the adversarial game where we develop protocols and attacks on those protocols.
It turns out that a substantial fraction of the difficulty here is developing the settings, rather than developing the protocols and attacks. I want to explain why this is the case.
In order to do high-stakes control research, you need a dataset of tasks with the following properties:
Here's why it's challenging to develop settings that work.
To make the situation more challenging, we have preferences over the kinds of domains we study. Because my main concern is AIs misusing their access to compute and proprietary models while doing ML R&D, I'm most interested in settings where the AIs are doing software engineering or using computers for tasks similar to those done by researchers in AI companies.
When we were writing the first paper in the APPS setting, I did not appreciate how much harder it would be to construct settings for subsequent projects. And I really messed up by not putting more emphasis on this at the start of the Ctrl-Z project.
Incidentally, I don't think that we can easily satisfy these desiderata by trying to apply control in current real applications of AI agents. Realistically, AIs right now are not given very much autonomy (mostly because they can't use it productively) or very much access to dangerous resources. So current deployments don't provide us with example trajectories that have that much in common with the situations I'm worried about later. (I think that studying AI-control-like techniques in real deployments is a promising project for researchers inside AI companies, but mostly because I think it helps prepare the organization for implementing similar techniques later, not because it is a good path to directly learning about how to use these techniques to mitigate insider risk from schemers.)