The sum of its parts: composing AI control protocols

Lennart Finke; Tyler Tracy

We naïvely expected to see a linear trend between the two extremes, but we observed a curve which was worse than linear, with usefulness dropping faster than safety increases at low safety values. We are curious to understand this phenomenon better.

I think you can solve this analytically.

The protocol seems to be:
1. With probability p use the Untrusted model to generate code (otherwise use the Trusted model)
2. If the untrusted model generates the code, with some probability (see below for what I think this probability should be) audit the code

Then vary p between [auditing budget, 1] to get different (safety,usefulness) points.

Optimal audit probability:
The blue team should basically audit as much as they can given their auditing budget. Since they can focus all of their auditing budget on the untrusted models outputs this is equal to auditing_budget/p.

So for example if auditing budget is 0.5% and p=0.5 (so you are using the untrusted model half the time), you can afford to audit 1% of the untrusted monitor's outputs.

Optimal red team protocol:
I think the optimal red team protocol is to always submit a blatant, obvious backdoor. This minimizes the chance you never submit a backdoor in a deployment. Blatant backdoors are better because there is no monitor and it maximises the chance you can generate an exploit on the backdoor. There’s no penalty for having a high attack rate because all that matters is whether the first backdoor is audited or not.

(this is technically not optimal because the red team should try and predict whether their code will actually succeed on the main task, and only submit a backdoor if they are confident of this, but we can ignore this for simplicity)

Safety calculation:
I think https://blog.redwoodresearch.org/p/wincontinuelose-scenarios-and-executereplaceaudi applies here.

Safety is the ratio between the following two events occuring on a particular step:

The blue team wins because they audited an attempted attack
The red team wins because a successful attack was executed

This would simplify to

desmos screenshot of the safety usefulness curve

Where

x = safety
y = usefulness
h= human auditing budget
c = the proportion of tasks in deployment the untrusted model successfuly solves

h/y is the chance that a backdoor is audited and (1-h/y)c is the chance that a backdoor is not audited and also successful.

Note that when c=1 (the red team can solve all the tasks in deployment) the curve is simply safety = auditing budget/ usefulness.

What you can't solve analytically here:
Potentially a good red team can make c go up by trying to only attack when they are very confident that the code will succeed on the main task (and hence be a successful backdoor). Maybe the red team has to be careful about not being too selective here especially with low values of p because they risk never submitting a backdoor at all.

I think the analytical solution gives a good lower bound at how good the red team can be and also explains why you would expect a safety usefulness curve here to be sub-linear.

^{^}

A trusted editor was first proposed in AI Control: Improving Safety Despite Intentional Subversion. The editor was not optimized, and the only method for varying the safety-usefulness trade-off was shifting the suspicion-based threshold.

^{^}

Determining statistical significance is difficult in AI control, especially for safety. With a low audit budget like 0.5%, the safety depends on the outcome of 1/200 samples. If the single sneakiest attack sample out of 200 has a 50/50 chance of getting through unaudited, the safety score could be 0% for one trajectory and 100% for another. To estimate the 95% confidence interval, we use bootstrapping with 1000 iterations and take the 25th lowest and the 25th highest score, corresponding to the 2.5th percentile and the 97.5th percentile. We found that 20 epochs over the 169 samples (making 3380 total samples) were necessary for these values to yield tolerable confidence intervals. Future research should investigate methods for eliminating or accounting for other sources of randomness, such as the untrusted model’s output and the monitor.

^{^}

Prompt optimization can enable AI control research provides a deeper look at the assumptions, benefits and limitations of using prompt optimization from DSPy in the context of AI control. We avoided repeating those points here, though they apply.

^{^}

All prompts can be found in trusted-editor-deepdive/models.

^{^}

An editor achieving improved usefulness violates a usual assumption in AI control, that only using the untrusted model leads to optimal usefulness. We attribute this to the value of prompt optimization and scaffolding. In follow-up work, the honest untrusted model should be prompt optimized and afforded an editing step, to give it the same structural advantage.

LESSWRONG
LW

LESSWRONG
LW

12

The sum of its parts: composing AI control protocols

12

12

TLDR

Experiment

Dataset

Results

Properties of a trusted editor

Interpolation between untrusted and trusted policies

Editors beat random interpolation

Routers beat random interpolation between editors

Building the router

Scaffolding the router

Training the router

Honorable mentions

Odd lines editor

Line sniper editor

A note on monitors

Limitations

Future work

Conclusion