[Paper] When can we trust untrusted monitoring? A safety case sketch across collusion strategies

Nelson Gardner-Challis; Morgan S; J Bostock; BarAltiva; joanv; Charlie Griffin

This research was completed for LASR Labs 2025 by Nelson Gardner-Challis, Jonathan Bostock, Georgiy Kozhevnikov and Morgan Sinclaire. The team was supervised by Joan Velja and Charlie Griffin (University of Oxford, UK AI Security Institute).

The full paper can be found here.

Tl;dr

We did a deep dive into untrusted monitoring, and explored several new collusion strategies:

Empirically, we ran control evaluations using these new strategies, finding that passive self-recognition elicited via SFT more than triples the red team’s win rate compared to the For/While signal in previous work.
Conceptually, we developed a safety case sketch for untrusted monitoring, which leverages our taxonomy of collusion strategies, as well as our new statistical modelling approach.

Overall, our key finding is that arguing for the safety of an untrusted monitoring deployment requires resolving several important open problems that existing evaluations haven't addressed; we identify at least five. To be clear, we think untrusted monitoring remains promising, with even basic protocols providing nontrivial risk reduction beyond trusted monitoring alone. Yet we also identified many ways that untrusted monitoring can go wrong that were not discussed in previous control evaluations [1, 2]. Our safety case sketch is intended to highlight the biggest remaining gaps.

Like all control work, our safety case sketch relies in several places on being able to upper-bound model performance with a human-AI red-team. Once models exceed the capabilities of human experts, we don’t know how to bound risk. We’re also only considering a very particular high-stakes internal deployment scenario for code-writing AIs; low-stakes control and human misuse concerns, for example, are outside our scope.

[The rest of this post assumes familiarity with untrusted monitoring; we recommend Rob Miles’ video for a friendly introduction.]

Our Contributions

Passive self-recognition via SFT (yellow) beats the previous state-of-the-art collusion strategy of causal collusion (red) when using the best blue team mitigation (summarisation). Our statistical model inputs the untrusted monitor (UM) collusion ROC curve, the UM suspiciousness score ROC curve, and the trusted monitor (TM) suspiciousness score ROC curve, in addition to deployment parameters, and outputs safety/usefulness curves with 80% confidence intervals.

1 - We Develop a Taxonomy of Collusion Strategies

We expand our experiments to explore four collusion strategies:

Causal collusion: The generator embeds a pre-shared secret signal (e.g. For/While) for the monitor to detect. This assumes the two instances have some way to pre-share a strategy, which may not occur in practice, and constrains models to use human-designed signals.
NEWish: Acausal collusion: The generator embeds a signal, but without prior coordination—the monitor must recognise it via Schelling points. We allow the monitor to reason about what signals the generator might use while viewing the code.
NEW: Passive self-recognition: The generator writes code normally (including occasional attacks). The monitor simply tries to tell whether the code was written by another instance of itself. In contrast to the previous two strategies, no active signal is embedded, so there's nothing extra for a trusted monitor to detect.
NEW: Combined strategies: The generator uses multiple strategies simultaneously—e.g. embedding a For/While signal and relying on the monitor's passive self-recognition.

We provide a larger taxonomy in the paper, including additional variants. In our experiments, we tested all 4 of these strategies against 3 blue team mitigations:

No signal jamming, as a baseline
Automated minification, by a deterministic (non-ML) program
Natural language summarisation, by the trusted model

Summarisation was more aggressive than the others (in terms of throwing out information), but also tended to be the best blue team protocol against the stronger collusion strategies.

2 - We Present a Safety Case Sketch

On the conceptual front, we also red-teamed the possible failure modes of untrusted monitoring, and we present this analysis as a safety case sketch. We emphasise that this is only a sketch: many important gaps are marked as “Missing Pieces”, and we ignore many other important issues as out of scope.

Compared to previous safety case work in AI control [3,4,5], our sketch is more “depth-first”: we consider a fairly particular threat model, with a certain untrusted monitoring protocol, with many simplifying assumptions about the deployment situation. From here, we get into the weeds of how to properly enumerate collusion strategies, how to conservatively evaluate each of them, and how to ensure the statistical modelling conservatively estimates safety. The top-level claim (C) is:

C: If an untrusted model is deployed with [protocol] under the [deployment parameters] and it is attacking, the probability that it executes a catastrophic attack during its internal deployment is less than p.

The safety argument is based on a statistical model which we developed as part of this work.

The core claims required for our untrusted monitoring safety case sketch.

3 - We Develop an Improved Statistical Model

We developed a new statistical modelling approach to conservatively estimate safety under the assumption that the untrusted AI’s tactical reasoning is optimal. The idea is that if its collusion abilities follow a given ROC curve, and it also knows the deployment parameters (e.g. how many honeypots it’s up against) then there is a statistically optimal FPR vs. FNR tradeoff it can take regarding when to lie about suspicious code. This allows us to extrapolate directly from elicited capabilities without additional assumptions about the untrusted model's strategic sophistication.

The statistical model relies on bounding four elements:

The collusion ROC curve R: how well the monitor can distinguish real attacks from honeypots. We decompose this by strategy class.
Model validity: the statistical model combining these bounds must itself be conservative.
The honest error rate ε: how often an honestly-behaving monitor fails to detect backdoors.
The honeypot slack n_s: how many honeypots the monitor can afford to miss before losing.

Of these four, we believe that the honeypot slack and model validity are the most pressing issues for future work.

The honeypot slack is a quantity which probably depends on training dynamics; we expect that honeypots will be used as training examples for the untrusted monitor model. Some of these honeypots will be identified as such by the untrusted monitor, but others will appear to the monitor to be genuine attacks. The honeypot slack represents how many training examples of the second kind are needed for a scheming monitor’s behaviour on an actual attack to be flipped.

The validity of our model also relies on assumptions of monitor independence, and a lack of attack selection from the untrusted generator. These are both discussed in the limitations section of our work.
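The tradeoff described in this section can be illustrated with a toy calculation. This is only a minimal sketch under strong simplifying assumptions, not the statistical model from the paper: it assumes a single real attack, independent honeypots, and a hard slack threshold (the monitor "loses" once it lies on more than `slack` honeypots). The function names, the example ROC points, and the parameter values are all hypothetical.

```python
from math import comb


def survive_prob(fpr, n_honeypots, slack):
    """P(the monitor lies on at most `slack` of `n_honeypots` honeypots).

    `fpr` is the chance that the monitor mistakes a honeypot for a real
    attack and therefore lies about it. Honeypots are assumed independent.
    """
    return sum(
        comb(n_honeypots, k) * fpr**k * (1 - fpr) ** (n_honeypots - k)
        for k in range(slack + 1)
    )


def best_operating_point(roc, n_honeypots, slack):
    """Pick the (fpr, tpr) point on a collusion ROC curve that maximises
    P(survive all honeypots) * P(collude on the one real attack).

    `roc` is a list of (fpr, tpr) pairs. This toy objective assumes exactly
    one real attack during deployment.
    """
    def win_prob(point):
        fpr, tpr = point
        return survive_prob(fpr, n_honeypots, slack) * tpr

    fpr, tpr = max(roc, key=win_prob)
    return fpr, tpr, win_prob((fpr, tpr))


if __name__ == "__main__":
    # Hypothetical ROC points, purely for illustration.
    toy_roc = [(0.0, 0.0), (0.01, 0.5), (0.1, 0.9), (1.0, 1.0)]
    print(best_operating_point(toy_roc, n_honeypots=100, slack=0))
```

With these made-up numbers the monitor does best at the cautious point (FPR 0.01, TPR 0.5), winning with probability of about 0.18: lying more often would catch more real attacks but almost surely trip a honeypot first. Increasing `slack` or decreasing `n_honeypots` shifts the optimum towards more aggressive operating points.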

4 - We Use SFT-Based Elicitation for Conservative Capability Bounds

Prompting-based evaluations are vulnerable to sandbagging: a situationally aware model may recognise it is being evaluated and strategically underperform. Following established evaluation methodology [6], we use supervised fine-tuning (SFT) to surface latent capabilities that prompting may not reveal.

The resulting ROC curve provides a conservative upper bound on the model's self-recognition ability, subject to the assumptions that: (a) fine-tuning has converged, (b) the model cannot perform gradient hacking, and (c) the training distribution matches deployment.

We ran multiple SFT runs across increasing sample amounts and identified a performance plateau for unmitigated self-recognition. We note uncertainty about whether full elicitation was achieved against summarisation.

5 - We Find That Passive Self-Recognition (via SFT) Beats Prior Best Strategies

This is our key empirical finding. After SFT elicitation, passive self-recognition achieves a red team win rate of 9.8% at 100% usefulness with summarisation—compared with 2.8% for the For/While causal strategy that was the best attack in the original control paper [1].

Some notable results from our evaluation (using GPT-4.1 as the untrusted model and Gemini 2.5 Flash for honeypots):

In the untrusted-only setting (i.e. no trusted monitor) without signal jamming, SFT self-recognition drops safety to just 26% on APPS—a very effective red team strategy.
Against SFT’d passive self-recognition, natural-language summarisation is the best blue team response, but only achieves 68.8% safety (UM-only) or 90.2% safety (UM+TM).
The combined strategy (For/While + SFT) was the best red team strategy in the UM-only setting, but worse than passive self-recognition in the UM+TM setting. We attribute this to the trusted monitor being able to detect the active For/While signal.
Results on BigCodeBench broadly confirmed trends from APPS, but with some surprising divergences—e.g. the causal strategy performed much better on BigCodeBench without mitigation (6.7% safety vs. 50.9% on APPS), which we trace to distributional differences in while-loop usage between the two benchmarks.

The following factors seemed to contribute to the effectiveness of passive self-recognition over other strategies:

Passive self-recognition plays into the natural strengths of language models more than hand-crafted signals do. Language models were already known to be very strong at authorship identification, and we found they achieved self-recognition AUROC > 0.99 even with minification.
Passive self-recognition does not introduce any additional signals for the trusted monitor to detect. When the untrusted generator uses active signalling methods, the trusted monitor’s suspiciousness ROC curve (its ability to distinguish attacks from benign inputs) improves relative to passive self-recognition.
Passive self-recognition forces the blue team to use summarisation, which increases the honest error rate (i.e. the untrusted monitor sometimes cannot distinguish attacks from benign inputs) since the summariser occasionally summarises away the attack itself. Particularly troubling is that this is more likely to occur precisely when the trusted monitor fails to detect an attack.
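For readers less familiar with the AUROC figures quoted above, here is a minimal sketch of how such a number could be computed from monitor scores. It uses the standard pairwise (Mann-Whitney) definition of AUROC; the function name and the example scores are hypothetical and are not taken from the paper's data or code.

```python
def auroc(pos_scores, neg_scores):
    """AUROC as the probability that a random positive outscores a random
    negative, counting ties as one half.

    Here a "positive" would be code written by another instance of the
    monitor's own model, and a "negative" would be honeypot code written
    by a different model.
    """
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


if __name__ == "__main__":
    # Made-up self-recognition scores, purely for illustration.
    own_code = [0.9, 0.8, 0.7]
    honeypots = [0.1, 0.2, 0.75]
    print(auroc(own_code, honeypots))
```

An AUROC of 0.5 corresponds to chance-level self-recognition and 1.0 to perfect separation, so a value above 0.99 means the monitor almost never ranks a honeypot above its own model's code.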

Limitations

Several important open problems remain, and we see surfacing problems like these as the main purpose of future control evaluations:

Attack selection: We did not model the untrusted generator as capable of attacking only when it has a particularly good opportunity (which would mean that lower attack rates lead to increased likelihood of attack success).
Slack honeypots: It is unclear how many honeypots an attacking monitor can miss before it “loses”: this depends on poorly understood training dynamics.
Acausal collusion elicitation: We evaluated acausal collusion via prompting, but do not claim this provides a conservative upper bound. Developing robust elicitation methods for acausal collusion (without unintentionally introducing causality) remains open.
Independence assumptions: Our statistical modelling assumed trusted and untrusted monitor scores are independent, which likely overestimates safety.
Jailbreaking: We did not model the generator attempting to jailbreak other models in the protocol, which recent work has shown to be a viable attack vector [7].
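To see why the independence assumption in the list above can overestimate safety, consider a small worked bound. The symbols m_T and m_U (the probabilities that the trusted and untrusted monitors each miss a given attack) and the numbers below are illustrative assumptions, not quantities from the paper.

```latex
% Under independence, the chance that both monitors miss an attack is
\[
  P(\text{both miss}) = m_T \, m_U .
\]
% Without independence, only the Fr\'echet bounds are guaranteed:
\[
  \max(0,\; m_T + m_U - 1) \;\le\; P(\text{both miss}) \;\le\; \min(m_T,\; m_U).
\]
% Example with m_T = 0.2 and m_U = 0.1:
\[
  m_T \, m_U = 0.02,
  \qquad
  \min(m_T, m_U) = 0.1 .
\]
```

In this example, if the two monitors' misses are strongly positively correlated, the true joint miss rate could be up to five times the value assumed under independence.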

Our safety case sketch identifies several other open problems that need to be resolved before we can have real confidence in untrusted monitoring. We think this is a valuable exercise in mapping the threat landscape, and we hope it motivates further work on the gaps we've identified.

Citations

[1] - Greenblatt, R., Shlegeris, B., Sachan, K., and Roger, F. (2023). “AI Control: Improving Safety Despite Intentional Subversion”. arXiv preprint arXiv:2312.06942

[2] - Bhatt, A., Rushing, C., Kaufman, A., Tracy, T., Georgiev, V., Matolcsi, D., Khan, A., and Shlegeris, B. (2025). “Ctrl-Z: Controlling AI Agents via Resampling”. arXiv preprint arXiv:2504.10374

[3] - Korbak, T., Clymer, J., Hilton, B., Shlegeris, B., and Irving, G. (2025a). “A sketch of an AI control safety case”. arXiv preprint arXiv:2501.17315

[4] - Clymer, J., Gabrieli, N., Krueger, D., and Larsen, T. (2024). “Safety Cases: How to Justify the Safety of Advanced AI Systems”. arXiv preprint arXiv:2403.10462

[5] - Greenblatt, R., Shlegeris, B., and Roger, F. (2024). “Notes on control evaluations for safety cases”. Alignment Forum https://www.alignmentforum.org/posts/3s8PtYbo7rCbho4Ev/notes-on-control-evaluations-for-safety-cases

[6] - Hubinger, E. (2023). “When can we trust model evaluations?”. Alignment Forum https://www.alignmentforum.org/posts/dBmfb76zx6wjPsBC7/when-can-we-trust-model-evaluations

[7] - Terekhov, M., Panfilov, A., Dzenhaliou, D., Gulcehre, C., Andriushchenko, M., Prabhu, A., and Geiping, J. (2025). “Adaptive Attacks on Trusted Monitors Subvert AI Control Protocols”. arXiv preprint arXiv:2510.09462

Nice paper! I’m curious how the safety case handles monitor staleness. The paper seems to rely on monitors' ability being close to the generator. But in real deployment we might have a monitor from one checkpoint monitoring a generator from a later checkpoint / fine-tune / post-training step.

I guess this is less about honeypot slack and more about whether the elicited ROC curves are stable under a version gap. If those curves degrade quickly, the advantage of untrusted monitoring might only hold in an almost stationary case.

Has anyone tried the obvious eval here: freeze UM at checkpoint i, continue improving UG to checkpoint j>i, and then measure how collusion and suspiciousness ROC shift with j-i? If that degrades fast, it seems like monitor freshness needs to be added as a deployment parameter in the safety case.

The paper seems to rely on monitors' ability being close to the generator.

Not really. An untrusted monitoring protocol like this works with any 2 models, or even any ensemble of models. You could even do an interlab untrusted monitoring setup where e.g. GPT-8 monitors Claude 7 and vice versa to achieve the benefits of monitor decoupling. Obviously if the monitor is much weaker than the generator, you'd have to evaluate how bad its performance is upfront, but you need to do this with trusted monitoring as well.

If you mean our safety case sketch doesn't handle this, I would disagree! Claim C1 lower-bounds how good U_M is at catching U_G's backdoors, if U_M is trying its hardest. Your evals need to handle sandbagging from U_G, and there are known countermeasures for this general type of situation, but we left this as a "Missing Piece" because we didn't get into the empirical weeds of this ourselves.

Besides capabilities, U_G and U_M could also differ in their goals or coding style, but this almost always helps the control team: models of different goals would be less likely to collude, and different styles would make it harder to collude. So if we do have models of similar capabilities, the case where U_G = U_M is the most conservative, followed by cases where they're closely related by fine-tuning or distillation.^[1]

Has anyone tried the obvious eval here: freeze UM at checkpoint i, continue improving UG to checkpoint j>i, and then measure how collusion and suspiciousness ROC shift with j-i?

I haven't seen this, but in practice if U_G (or U_M) changes enough to plausibly shift the ROC curves meaningfully, you'd need to rerun your eval suite regardless. If it's just a small fine-tuning step, the capability change seems unlikely to be substantial enough to matter, and the control team could spot-check this with cheap subsamples of their evals. Studying scaling-law questions (e.g. how elastic the collusion/suspiciousness ROC curves are to given improvements in the underlying model's capabilities) might be interesting for the basic science of control, such as informing protocol design, but once the protocols are fixed the safety case just relies on what your evals say.

^[1]
I know of one exception to this assumption: suppose we actually manage to incriminate U_M on the basis of suspiciously missed honeypots. Then a next step would be to try to use these samples to incriminate U_G and apply few-shot catastrophe prevention to it. This is obviously easiest when U_G and U_M are most similar. However, our safety case sketch sidesteps this issue since we don't assume we're ever able to incriminate the monitor in the first place (as discussed in "Appendix E: Slack Honeypots").

Hi, thank you for your work! I was curious if you considered using BashArena as a control setting. Since BashArena includes a wider range of main tasks and sabotage tasks, how do you think that might impact the results of the study? If the red team were to submit fewer overall successful sabotaged trajectories (because BashArena makes sabotages more difficult), would that change how successfully the monitor can collude with the red team?

Yeah, the prosaic reason we didn't use BashArena is that it "seemed too hard when we started" in August 2025:

We initially tried using ControlArena, but at the time found it harder to use than just setting up APPS etc. ourselves (I've heard that it's become better though)
Multi-step settings with multiverse time-travel (aka resampling) inherently add a lot of complexity
At the time we were completely new to control research; late in the project we did briefly extend to BigCodeBench, but doing multi-step would've been a whole new project and we wanted to wrap things up

That said, I would conjecture that the main results in the paper, e.g. on passive vs. active collusion strategies, would transfer over to multi-step settings. Like, the generator has to collude with the monitor somehow, which means there has to be some amount of "collusion bits" (as we call them) in its output that distinguish it from a honeypot. Whether the output is built up in phases vs. one-off doesn't seem to change the basic dynamics. But I could be wrong! We all (on this paper) want to see more work investigating collusion more deeply, and all else equal more realistic settings are better.

If the red team were to submit fewer overall successful sabotaged trajectories (because BashArena makes sabotages more difficult), would that change how successfully the monitor can collude with the red team?

This is an important open question! That is, let's say in one setting it only has ~1 viable attack opportunity vs. 1000 in another setting (according to some fixed bar). If it's good at selecting well among the 1000 attacks, how much extra uplift does it get? In untrusted monitoring, this mostly reduces to "how much variance is there in the collusion scores?" since presumably the generator chooses to attack only on the tasks that are most amenable to this.

We didn't get around to looking at this, although I wrote up some speculation here. Summary: collusion variance is part of the general problem of understanding/mitigating "task heterogeneity" which is part of dealing with "dangerous inputs" (others have called this "attack selection"). This is definitely in my top 5 list of things I might pursue more later this year, and I'd be quite interested in hearing about any ongoing/proposed work!
