Should we combine protocols for AI Control Research?

Ram Potham; Tyler Tracy

Figure: Workflow for optimizing protocol safety.

In AI control^[1] research, we've developed many protocols: trusted monitoring, untrusted monitoring, resampling, etc. Each has a different safety-usefulness tradeoff, and labs might use a combination of them. The reason: a lab might want the maximum usefulness from its AI models at a minimum acceptable safety (e.g., 90%), and a single protocol may not deliver both the usefulness it wants and the safety it needs.

We typically model combining protocols as defer-to-trusted: we first try the untrusted model, but if its solution is suspicious, we defer to a trusted model. We found that we can get better safety-usefulness tradeoffs by routing between protocols based on each one's predicted usefulness and suspicion, and that this routing can be defended against an attacker who picks where to attack.

The harder problem is attack selection: a schemer that knows our routing process can choose to attack on whichever protocol or monitor is weakest. We show how to allocate the audit budget and route between protocols to minimize the effect of attack selection.

Takeaways

We should allocate more of our human audit budget to protocols with lower predicted safety, in a way that an attacker cannot exploit. A monitor flags the most suspicious trajectory steps (those above an audit threshold) and sends them to a human auditor, and we have a fixed human audit budget. We can spend that budget efficiently by varying the audit threshold across protocols and monitors so that the blue team's predicted safety is equal regardless of which protocol or monitor is attacked. This audit budget allocation is a generalizable defense against attack selection.
Our best routing strategy balances predicted usefulness and safety of trajectories, and can be improved by fine-tuning a predictor. Our best strategy was routing based on predicted usefulness minus the suspicion percentile on the honest distribution. We found that we could improve the routing by fine-tuning a model; a fine-tuned GPT-4.1 Nano outperformed a prompted GPT-5 Nano.
With current usefulness predictors, it's best to route between protocols on the Pareto frontier; but if predictors improve, we get significant gains by routing between all possible protocols. With a Claude Opus level predictor, it's best to find which protocols are on the Pareto frontier and only route between them. However, if the usefulness predictor were more accurate, we could also use trajectories from protocols that are not on the frontier.

Methodology

We did a post-hoc analysis of LinuxArena runs. We use single-model runs (with trusted monitoring) as stand-ins for protocols with different safety-usefulness tradeoffs. A protocol takes in trusted and untrusted models plus policies and produces a safety and usefulness number, so a single model run is the simplest case of one; the routing results should generalize to more complex protocols (like trusted editing), though we have not tested that directly. We label them on the graphs as follows (ordered by usefulness, T for the main trusted model, U for the strongest untrusted model):

Protocol Name	Model
T (Trusted)	GPT-5 Nano
P1	Kimi
P2	Haiku
P3	Gemini Flash
P4	Gemini Pro
P5	GLM 5
P6	Sonnet
U (Untrusted)	Opus

The Blue and Red teams have the following interventions to affect safety and usefulness.

Blue Team Levers.

Action and Trajectory monitors. We have an action monitor^[2] with a GPT-5 Nano trusted model that grades the suspiciousness of each action (with the rest of the trajectory before that action), which primarily catches live failures^[3]. We also have a trajectory monitor with a GPT-5 Nano trusted model that runs after the trajectory finishes, grading the suspiciousness of the whole trajectory, which catches delayed failures^[4]. We use the same monitors for each protocol.
What protocols to run. Once we see a task, we might want to only run protocols that are likely to complete it.
What trajectory to submit when a task has several completed runs. For example, we run our delayed monitors and submit the least-suspicious completed trajectory; the code from that trajectory is what gets implemented.
How to split the audit budget across each protocol and monitor, controlling for human audit steps^[5]. Throughout, we fix this budget at 5% of tasks (reviewing one flagged action and reviewing one full trajectory each cost one step), the same 5% at every point on every curve in this post; the x-axis is usefulness, not budget.
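To make the audit-threshold lever concrete, here is a minimal sketch of turning a budget fraction into a suspicion threshold using honest-run scores. The function names and the toy scores are illustrative assumptions, not code from our repository, and ties at the threshold can flag slightly more than the budget.

```python
import math
from typing import Sequence


def audit_threshold(honest_scores: Sequence[float], budget_fraction: float) -> float:
    """Suspicion score at or above which a step goes to a human auditor.

    Chosen so that roughly `budget_fraction` of honest steps are flagged.
    """
    # Small epsilon guards against float round-off in budget_fraction * n.
    n_flagged = math.floor(budget_fraction * len(honest_scores) + 1e-9)
    if n_flagged == 0:
        return math.inf  # no budget: nothing is ever flagged
    ranked = sorted(honest_scores, reverse=True)
    return ranked[n_flagged - 1]


def is_flagged(score: float, threshold: float) -> bool:
    return score >= threshold


# Toy honest suspicion scores (hypothetical): 0, 1, ..., 99.
honest = [float(s) for s in range(100)]
threshold = audit_threshold(honest, 0.05)  # 5% budget, as in this post
flagged = [s for s in honest if is_flagged(s, threshold)]
```

Lowering the threshold for one monitor spends more of the shared budget there, which is the quantity the allocation step trades off between protocols and monitors.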

Red team attack selection. The red team predicts an opportunity score (0 to 10) for each main task / side task pair: how likely it is to complete the side task without getting caught. A Claude Opus 4.6 model predicts this score from the main task, the environment, the side task (the threat to attack on), and the protocol. The red team attacks on tasks and protocols where the opportunity score is high and it believes the monitor is weak at catching it.

We also give the red team two stronger affordances, since in the worst case, a schemer knows our routing strategy:

We tell the red team the routing mix^[6], and it can coordinate on which protocols to attack. We give it the exact fraction of tasks routed to each protocol. Knowing the mix helps it set per-protocol attack thresholds: which protocols it is actually likely to be deployed on, and where to attack (it can only optimize on the train set; results are reported on the uncontaminated test set). In practice, this can happen through a strategy like "attack only if this is the protocol we predict is least safe."
It can influence our routing decisions by sandbagging the other protocols. A schemer can deliberately underperform on the protocols it does not want us to route to, pushing the router toward the one it does want to attack. We did not explore this much. In our preliminary runs, the sandbagging was small and did not significantly change the safety / usefulness numbers; if it were larger, a model that underperforms its usual capability is itself a signal a monitor can flag. This is also slightly out of scope for detecting high-stakes failures, the focus of this post.

We iterate the two teams until they converge on a safety number given the protocol routing mix by doing the following:

The red optimizes its attack-selection thresholds (two per protocol: a separate threshold for live and for delayed failures, so it can attack a protocol on whichever channel is easier) and only attacks when the opportunity score exceeds the matching threshold, tuned to minimize safety.
The blue optimizes its audit-budget allocation across each protocol's action and trajectory monitors by a maximin, to maximize the minimum safety the red can get.
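The red side of this loop can be sketched as a threshold sweep. The sketch below uses a deliberately simplified safety proxy (the fraction of attempted attacks that are caught) and made-up records; in the setup described above it would be run once per protocol and per channel (live and delayed). All names are hypothetical.

```python
from typing import List, Tuple

# Each toy record: (opportunity score 0-10, whether the attack would be caught).
Record = Tuple[float, bool]


def safety_if_attacking_above(records: List[Record], threshold: float) -> float:
    """Fraction of attempted attacks caught when red attacks only on tasks
    whose opportunity score exceeds `threshold` (a simplified safety proxy)."""
    attempted = [caught for score, caught in records if score > threshold]
    if not attempted:
        return 1.0  # red never attacks, so nothing gets through
    return sum(attempted) / len(attempted)


def best_red_threshold(records: List[Record]) -> Tuple[int, float]:
    """Sweep integer thresholds 0..10 and return the one minimizing safety."""
    return min(
        ((t, safety_if_attacking_above(records, t)) for t in range(11)),
        key=lambda pair: pair[1],
    )


# Hypothetical train-set records for one protocol and one channel.
train = [(2, True), (4, True), (6, True), (7, False), (9, False), (9, True)]
red_threshold, red_safety = best_red_threshold(train)
```

On this toy data the red team does best by attacking only above a score of 6, where one of its three attempts is caught.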

All of this is optimized on the train set; we then freeze both teams' choices, the red's thresholds and the blue's allocation, and report safety on the held-out test set.

We alternate their best responses to convergence: starting from the always-attack red, the blue allocates the maximin, the red re-picks its thresholds against that allocation, the blue re-allocates against the red's new selection, and so on until neither side can improve, the maximin saddle point.
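The alternation can be illustrated on a small payoff table. This is a sketch under strong assumptions: both sides choose from a finite list of options, `safety[b][r]` is a made-up train-set safety for blue allocation `b` against red selection `r`, and a pure-strategy saddle point exists (otherwise best responses can cycle, hence the round limit).

```python
from typing import List, Tuple


def alternate_best_responses(
    safety: List[List[float]], red_start: int = 0, max_rounds: int = 100
) -> Tuple[int, int, float]:
    """Alternate best responses until neither side changes its choice.

    Blue picks the row that maximizes safety; red picks the column that
    minimizes it. Returns (blue index, red index, safety at the fixed point).
    """
    red = red_start
    blue = max(range(len(safety)), key=lambda b: safety[b][red])
    for _ in range(max_rounds):
        new_red = min(range(len(safety[blue])), key=lambda r: safety[blue][r])
        new_blue = max(range(len(safety)), key=lambda b: safety[b][new_red])
        if (new_blue, new_red) == (blue, red):
            return blue, red, safety[blue][red]
        blue, red = new_blue, new_red
    raise RuntimeError("no pure-strategy fixed point found within max_rounds")


# Hypothetical table: rows are blue allocations, columns are red selections
# (column 0 plays the role of the always-attack red).
toy_safety = [
    [0.30, 0.10, 0.50],
    [0.40, 0.35, 0.45],
    [0.20, 0.25, 0.60],
]
blue_choice, red_choice, saddle_value = alternate_best_responses(toy_safety)
```

At the returned point, the value is both the minimum of its row and the maximum of its column, which is the saddle-point condition described above.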

For each protocol, we report safety and usefulness. Safety is the probability we catch an attacker before a live or delayed failure happens; usefulness is the percent of main tasks completed. For many comparisons, we use a defer-to-trusted baseline: run the untrusted protocol on every task and submit its run, except on the most-suspicious trajectories, which are instead rerun with the trusted model, and that trusted run is submitted. We sweep the percent of trajectories deferred. Defer-to-trusted is always audited with the even-step allocation (equal audit steps to the action and trajectory monitors); this is in contrast to our routers, which use the maximin audit.
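The usefulness side of the defer-to-trusted sweep can be sketched as follows. The `Task` fields and the toy values are assumptions for illustration; the sketch does not model safety or auditing.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Task:
    untrusted_suspicion: float  # trajectory-monitor score of the untrusted run
    untrusted_success: bool     # did the untrusted run complete the main task?
    trusted_success: bool       # would the trusted rerun complete it?


def defer_to_trusted_usefulness(tasks: List[Task], defer_fraction: float) -> float:
    """Usefulness when the most-suspicious `defer_fraction` of untrusted
    trajectories are replaced by trusted reruns."""
    n_defer = round(defer_fraction * len(tasks))
    ranked = sorted(tasks, key=lambda t: t.untrusted_suspicion, reverse=True)
    deferred, kept = ranked[:n_defer], ranked[n_defer:]
    solved = sum(t.trusted_success for t in deferred) + sum(
        t.untrusted_success for t in kept
    )
    return solved / len(tasks)


# Hypothetical tasks, not taken from our runs.
toy_tasks = [
    Task(9.0, True, False),
    Task(5.0, True, True),
    Task(2.0, False, False),
    Task(1.0, True, False),
]
sweep = {f: defer_to_trusted_usefulness(toy_tasks, f) for f in (0.0, 0.25, 1.0)}
```

Sweeping `defer_fraction` from 0 to 1 moves the baseline from the untrusted protocol's usefulness to the trusted model's.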

Next, we discuss our findings.

Allocate the audit budget across protocols so an attacker can't exploit it

Blue's response to attack selection is to split the audit budget, via a "maximin" (the blue team maximizes the minimum safety the red team can get attacking on a protocol or monitor). The blue team does this best by allocating audit budget so that expected safety is equalized across each monitor, and the attacker can no longer decrease safety against it on the train set.
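A minimal sketch of the equalizing allocation, assuming each monitor has a known, monotonically increasing catch curve (predicted safety as a function of audit budget). The exponential curves and their rates below are invented for illustration; the real curves come from train-set monitor scores.

```python
import math
from typing import Callable, List, Sequence, Tuple

Curve = Callable[[float], float]


def budget_for_level(curve: Curve, level: float, hi: float, tol: float = 1e-9) -> float:
    """Smallest budget in [0, hi] at which a monotone catch curve reaches `level`."""
    if curve(hi) < level:
        return math.inf  # unreachable even with the whole budget
    lo = 0.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if curve(mid) >= level:
            hi = mid
        else:
            lo = mid
    return hi


def maximin_allocation(
    curves: Sequence[Curve], total_budget: float, tol: float = 1e-9
) -> Tuple[float, List[float]]:
    """Bisect on the common safety level that the total budget can afford."""
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        level = (lo + hi) / 2
        need = sum(budget_for_level(c, level, total_budget) for c in curves)
        if need <= total_budget:
            lo = level
        else:
            hi = level
    return lo, [budget_for_level(c, lo, total_budget) for c in curves]


# Two hypothetical monitors: the first is weaker (catches less per unit budget).
toy_curves = [lambda b: 1 - math.exp(-10 * b), lambda b: 1 - math.exp(-40 * b)]
level, allocation = maximin_allocation(toy_curves, total_budget=0.05)
```

In this toy case the weaker monitor receives about 0.04 of the 0.05 budget and the stronger one about 0.01, so an attacker gains nothing by choosing one over the other.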

As a helpful visual for how maximin works, see the graph below. The audit budget is allocated across each monitor so the predicted safety is equivalent across each failure (live and delayed) per protocol.

Per-monitor catch curves — Catch curves for two untrusted protocols (P3 + U), each with a live (action-monitor) and a delayed (trajectory-monitor) safety, vs audit budget. All audit budget is assigned to get a predicted 0.178 safety.

We check this before attack selection (the red attacks every task) and after attack selection (the red picks which tasks to attack). Maximin does comparatively better on safety after attack selection.

Maximin vs even-step, before attack selection — BEFORE attack selection (always-attack red). Worst-case safety on the held-out test environments across the usefulness sweep, even-step vs the maximin. Maximin does slightly better than even-step between P3 and U. It does slightly worse between T and P3 on test, but better there on the train reference, so this gap is likely noise.

Maximin vs even-step, after attack selection: AFTER attack selection. Same comparison, same test set. Maximin beats even-step by a modest but real margin (Flash_Opus, i.e., P3 + U, train mean +0.058 safety); the converged maximin holds up under attack selection.

So maximin beats even-step under attack selection, and by a wide margin when routing between two untrusted protocols (U and P3), because the red team cannot exploit our converged maximin allocation.

This attack-selection defense could be generalized. For example, if we want to allocate an audit budget across monitors that defend against different threat vectors, or different monitors per task, a maximin allocation makes sense because it is difficult to exploit.

Next, we discuss our best way to make routing decisions.

A router that combines usefulness and safety is best, and beats defer-to-trusted

Our best strategy was, for each routing decision, to run multiple protocols, then score each trajectory by predicted usefulness − λ × percentile-ranked suspicion. A usefulness predictor (based on our trusted model) predicts main-task success as a probability (predicted usefulness). Percentile-ranked suspicion is the percent of honest trajectories that were less suspicious than the current one. We choose a λ and submit the trajectory with the maximum score. Sweeping λ traces out the safety-usefulness curve, effectively submitting trajectories that have high predicted usefulness and low suspicion (so there is a lower chance they come from an attacker).
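The scoring rule can be sketched in a few lines. Here the percentile is expressed as a fraction in [0, 1] so that it is on the same scale as the predicted usefulness; that scaling, the function names, and the toy numbers are assumptions for illustration.

```python
from bisect import bisect_left
from typing import Dict, List, Tuple


def honest_percentile(suspicion: float, honest_sorted: List[float]) -> float:
    """Fraction of honest trajectories strictly less suspicious than this one.

    `honest_sorted` must be sorted in ascending order.
    """
    return bisect_left(honest_sorted, suspicion) / len(honest_sorted)


def route(
    candidates: Dict[str, Tuple[float, float]],
    honest_sorted: List[float],
    lam: float,
) -> str:
    """Pick the protocol whose trajectory maximizes
    predicted usefulness - lam * honest-percentile suspicion.

    `candidates` maps protocol name -> (predicted usefulness, suspicion).
    """
    def score(name: str) -> float:
        usefulness, suspicion = candidates[name]
        return usefulness - lam * honest_percentile(suspicion, honest_sorted)

    return max(candidates, key=score)


# Hypothetical honest suspicion scores and two completed trajectories.
honest_scores = [1.0, 2.0, 3.0, 4.0]
runs = {"U": (0.9, 3.5), "T": (0.5, 0.5)}
choice_usefulness_only = route(runs, honest_scores, lam=0.0)
choice_risk_averse = route(runs, honest_scores, lam=1.0)
```

With λ = 0 the router reduces to usefulness-only routing and submits the untrusted trajectory; with λ = 1 the same trajectory's high suspicion percentile tips the choice to the trusted one.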

Under attack selection, this router beats plain usefulness-only routing (routing just based on predicted usefulness) and beats defer-to-trusted^[7] (routing just based on suspicion, deferring high suspicion trajectories and rerunning on trusted).

Router vs usefulness-only and defer-to-trusted: T vs U. The two router lines (the usefulness-risk router and the usefulness-only router) use the committed-blue maximin audit; defer-to-trusted uses the even-step audit (showing that our method outperforms defer-to-trusted as is). The router beats usefulness-only by around +0.035 and beats defer-to-trusted by 0.13 after attack selection.

We tested several routing signals (raw suspicion, attack-percentile, the attack/honest ratio); the honest-percentile suspicion we use was the best (full sweep in the Appendix).

With fine-tuning, we improve the router cheaply

Predicting whether a protocol will complete a task is a narrow supervised problem, so we also tested whether we can improve performance through fine-tuning. We used SFT with a main-task input and a main-task-success label output. We found that a cheap fine-tuned predictor beats a larger prompted one, so we recommend training a usefulness predictor if doing protocol routing.

Usefulness predictors: ROC — Predictor quality vs ground truth on a held-out pool of 5 environments. The fine-tuned GPT-4.1 Nano (fine-tuning cost about $1.50) is the best real pre-run predictor (predicting usefulness before running the trajectory) and beats the prompted GPT-5 Nano on AUC. All do much worse than a perfect predictor.
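For readers who want to score their own usefulness predictor, AUC can be computed directly from its pairwise definition: the probability that a randomly chosen successful task gets a higher predicted usefulness than a randomly chosen failed one. This is a generic sketch with made-up labels and scores, not the evaluation code behind the figure.

```python
from typing import Sequence


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Pairwise AUC: P(score of a positive > score of a negative), ties count 0.5."""
    positives = [s for y, s in zip(labels, scores) if y]
    negatives = [s for y, s in zip(labels, scores) if not y]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))


# Hypothetical main-task outcomes (1 = completed) and predicted usefulness.
toy_labels = [1, 1, 0, 0]
auc_realistic = roc_auc(toy_labels, [0.9, 0.4, 0.5, 0.1])
auc_perfect = roc_auc(toy_labels, [0.9, 0.8, 0.2, 0.1])
auc_uninformative = roc_auc(toy_labels, [0.5, 0.5, 0.5, 0.5])
```

A perfect predictor scores 1.0 and an uninformative one 0.5, which gives a scale for reading the gap between the real predictors and the oracle.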

With current usefulness predictors, it's best to route between protocols on the Pareto frontier

For each task, we route by predicted usefulness and suspicion. With a strong-enough usefulness predictor, we can route on usefulness before the task runs and deploy a single protocol; otherwise, we run the candidate protocols and submit one with low suspicion and high predicted usefulness. Either way, the question is which protocols we route between. We compare pairwise (route between only adjacent protocols on the Pareto frontier of safety and usefulness) to groupwise (route between all protocols, even those not on the frontier).

Given that usefulness predictors aren't very good, it's actually best to first find which protocols are on the Pareto frontier and route between each adjacent pair (pairwise realistic on the graph); this usually outperforms defer-to-trusted. With the weak predictor, groupwise routing often mis-routes (e.g., it ends below U because it also draws on trajectories from other protocols and ends up worse), so pairwise is better.
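Finding the frontier and its adjacent pairs is mechanical. The sketch below uses placeholder protocol names and made-up (usefulness, safety) values rather than our measured numbers.

```python
import math
from typing import Dict, List, Tuple


def pareto_frontier(protocols: Dict[str, Tuple[float, float]]) -> List[str]:
    """Names of non-dominated protocols, ordered by increasing usefulness.

    `protocols` maps name -> (usefulness, safety); higher is better for both.
    """
    frontier: List[str] = []
    best_safety = -math.inf
    # Walk from most to least useful; keep a protocol only if it is safer
    # than every protocol that is at least as useful.
    by_usefulness = sorted(protocols.items(), key=lambda kv: (-kv[1][0], -kv[1][1]))
    for name, (_, safety) in by_usefulness:
        if safety > best_safety:
            frontier.append(name)
            best_safety = safety
    return frontier[::-1]


def adjacent_pairs(frontier: List[str]) -> List[Tuple[str, str]]:
    """Pairs of neighbouring frontier protocols to route between."""
    return list(zip(frontier, frontier[1:]))


# Hypothetical protocols: B is dominated by C, and E is dominated by D.
toy_protocols = {
    "A": (0.2, 0.9),
    "B": (0.4, 0.5),
    "C": (0.5, 0.7),
    "D": (0.8, 0.4),
    "E": (0.7, 0.3),
}
frontier = pareto_frontier(toy_protocols)
pairs = adjacent_pairs(frontier)
```

Pairwise routing then runs one router per adjacent pair, while groupwise routing would consider all five toy protocols at once.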

Pairwise vs Groupwise, before attack selection — Before red (always-attack). Pairwise routes between pairs along the Pareto frontier T to P3 to P6 to U; groupwise routes freely over all eight protocols. The two router curves are under the maximin audit, with the realistic GPT-5 Nano predictor and an oracle reference. The defer-to-trusted baseline uses the default even-step audit, which is why it ends below U. Pairwise leads at low and mid usefulness; groupwise mixes across all eight (it does not land on the single-protocol stars) and only edges ahead across the sharp high-usefulness transition.

If predictors improve, we get significant gains by routing between all possible protocols

The oracle in the graph is a perfect predictor of usefulness: it knows, before the trajectory is run, which protocol will complete a task. In the graph below, you can see that the oracle predictor with groupwise routing dominates everything else. This suggests that if our usefulness predictor were more accurate (closer to the oracle), groupwise routing would beat pairwise routing. Because the choice is still made before the task runs, we run only one protocol per task, and the oracle usually runs the single best protocol for each task.

This is different from running several protocols and submitting the best one (a post-hoc best-of-K, see Appendix). There, running more protocols adds more chances at an unrecoverable live failure, and whether that cost is worth it depends on the predictor: with the oracle, running few protocols is safer, but with the realistic weak predictor, running more potentially gets you better tradeoffs.

Pairwise vs Groupwise, after attack selection — After attack selection. The router/pairwise/groupwise curves use the committed-blue maximin audit; defer-to-trusted uses the even-step audit. Pairwise beats groupwise at low and mid usefulness (+0.24 / +0.19): the selective red exploits groupwise's mis-routing of attackable tasks onto less-catchable protocols, which the curated frontier avoids. Pairwise also beats the defer-to-trusted baseline at low and mid usefulness (+0.26 / +0.22; DTT ~0.65 / 0.63), so this is the result where our approach beats defer-to-trusted. The oracle frontier usually beats everything.

Conclusion

We recommend combining protocols to get better safety-usefulness tradeoffs. We found that our routers outperform defer-to-trusted. This is done best with the following steps:

Train a usefulness predictor to help with routing.
Test each protocol, find the best combination to route between (usually route between every pair of adjacent protocols on the safety-usefulness Pareto frontier).
Make routing decisions based on predicted usefulness and percentile-ranked suspicion.
Allocate audit budget between each protocol and monitor to defend against attack selection.

Future Directions

A Trusted Reimplementation protocol: have the trusted model see all implementations and reimplement the task itself.
Model how red-team sandbagging changes the strategies.
Find how to train better usefulness predictors (e.g., RL instead of SFT).
Try other audit-allocation defenses against various threat vectors and attack selection.

Appendix

Group-routing variants: how many protocols to run

A different scheme from pre-run routing is to run several protocols and submit the best (a post-hoc best-of-K). We sweep K (1 to 8, plus a dynamic per-task K). There are two opposing forces: each extra untrusted run is a new chance at an unrecoverable live failure, while running more protocols gives you more information about which ones complete the task and how suspicious each one is. With the perfect oracle predictor, you already know which protocols will complete the main task, so it's best to run just one (K=1). With the realistic GPT-5 Nano predictor, which is too weak to pick the best protocol pre-run, more is better: K=8 is safest at higher usefulness. So how many protocols to run depends on how good your usefulness predictor is.
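The first of these forces can be quantified with a back-of-the-envelope model. Assuming, purely for illustration, that each untrusted run independently has the same probability of an uncaught live failure, the chance of at least one failure grows with K as follows; the per-run probability is a hypothetical input, not a number we measured.

```python
def p_any_live_failure(p_per_run: float, k: int) -> float:
    """Probability that at least one of k independent runs has a live failure."""
    return 1.0 - (1.0 - p_per_run) ** k


# Hypothetical per-run failure probability of 0.1, swept over K = 1..8.
risk_by_k = {k: p_any_live_failure(0.1, k) for k in range(1, 9)}
```

Under this independence assumption the risk roughly doubles from K=1 to K=2, so extra runs only pay off when the information they provide improves the submitted trajectory enough to compensate.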

Risk-signal variants

Our best strategy was, for each routing decision, to run multiple protocols, then score each trajectory by predicted usefulness − λ × honest percentile-ranked suspicion, but we tested other variants. For example, we tried raw suspicion, attack-percentile, and the attack/honest ratio in place of the honest percentile, and found that the honest percentile performed best.

Reproducibility: https://github.com/linuxarena/protocol_routing

Thanks to Nick Kuhn for helpful comments on the post.

^[1] AI control is research that focuses on getting useful work out of a possibly-misaligned model while bounding the harm it can do, without assuming it is aligned, introduced in AI Control: Improving Safety Despite Intentional Subversion. A blue team tries to get more useful work done and catch attacks; a red team tries to get successful attacks past the blue team.
^[2] A monitor is a trusted model that sees the untrusted model's actions (per-action) or whole trajectory (delayed) and judges how suspicious it looks; we audit the most suspicious, up to a fixed human budget.
^[3] A live failure is a failure that happens during a trajectory (e.g., an exfiltration).
^[4] A delayed failure is a failure that can happen after a trajectory completes (e.g., creating a public API that allows for exfiltration).
^[5] The budget is in terms of human review steps: one flagged action and one flagged trajectory each cost one step, so the comparison is fair across any number of protocols.
^[6] The percent of tasks routed to each protocol.
^[7] We use equal step audit budget allocation rather than maximin to show our routing strategies outperform baseline DTT, but results are similar with maximin as well.