Nice paper! I’m curious how the safety case handles monitor staleness. The paper seems to rely on monitors' ability being close to the generator. But in real deployment we might have
I guess this is less about honeypot slack. It is more about whether the elicited ROC curves are stable under version gap. If those curves degrade quickly, the advantage of untrusted monitoring might only hold in an almost stationary case.
Has anyone tried the obvious eval here: freeze UM at checkpoint i, continue improving UG to checkpoint j>i, and then measure how collusion and suspiciousness ROC shift with j-i? If that degrades fast, it seems like monitor freshness needs to be added as a deployment parameter in the safety case.
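A minimal sketch of how the version-gap eval proposed above could be tabulated, assuming monitor scores have already been collected for each (UM checkpoint i, UG checkpoint j) pair. The `runs` data layout and the function names are hypothetical, and AUC is used here only as a one-number summary of each ROC curve; none of this is taken from the paper.

```python
from collections import defaultdict
from statistics import mean


def auc(pos, neg):
    """Probability that a random positive outscores a random negative (ties count half)."""
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def auc_by_gap(runs):
    """Average AUC per version gap j - i.

    runs: {(um_checkpoint, ug_checkpoint): (scores_on_attacks, scores_on_benign)}
    (hypothetical layout; the scores are whatever the frozen monitor outputs).
    """
    grouped = defaultdict(list)
    for (i, j), (attack_scores, benign_scores) in runs.items():
        grouped[j - i].append(auc(attack_scores, benign_scores))
    return {gap: mean(vals) for gap, vals in sorted(grouped.items())}


# Made-up illustrative scores, not experimental data.
toy_runs = {
    (0, 0): ([0.9, 0.8], [0.1, 0.2]),
    (0, 1): ([0.9, 0.3], [0.1, 0.5]),
}
print(auc_by_gap(toy_runs))
```

The same tabulation could be run once for suspiciousness scores and once for collusion scores, to see which of the two degrades faster as the gap grows.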
The paper seems to rely on monitors' ability being close to the generator.
Not really. An untrusted monitoring protocol like this works with any 2 models, or even any ensemble of models. You could even do an interlab untrusted monitoring setup where e.g. GPT-8 monitors Claude 7 and vice versa to achieve the benefits of monitor decoupling. Obviously if the monitor is much weaker than the generator, you'd have to evaluate how bad its performance is upfront, but you need to do this with trusted monitoring as well.
If you mean our safety case sketch doesn't handle this, I would disagree! Claim C1 lower-bounds how good UM is at catching UG's backdoors, if UM is trying its hardest. Your evals need to handle sandbagging from UG, and there are known countermeasures for this general type of situation, but we left this as a "Missing Piece" because we didn't get into the empirical weeds of this ourselves.
Besides capabilities, UG and UM could also differ in their goals or coding style, but this almost always helps the control team: models with different goals would be less likely to collude, and different styles would make collusion harder. So if we do have models of similar capabilities, the case where UG = UM is the most conservative, followed by cases where they're closely related by fine-tuning or distillation.[1]
Has anyone tried the obvious eval here: freeze UM at checkpoint i, continue improving UG to checkpoint j>i, and then measure how collusion and suspiciousness ROC shift with j-i?
I haven't seen this, but in practice if UG (or UM) changes enough to plausibly shift the ROC curves meaningfully, you'd need to rerun your eval suite regardless. If it's just a small fine-tuning step, the capability
I know of one exception to this assumption: suppose we actually manage to incriminate UM on the basis of suspiciously missed honeypots. Then a next step would be to try to use these samples to incriminate UG and apply few-shot catastrophe prevention to it. This is obviously easiest when UG and UM are most similar. However, our safety case sketch sidesteps this issue since we don't assume we're ever able to incriminate the monitor in the first place (as discussed in "Appendix E: Slack Honeypots").
Hi, thank you for your work! I was curious if you considered using BashArena as a control setting. Since BashArena includes a wider range of main tasks and sabotage tasks, how do you think that might impact the results of the study? If the red team were to submit fewer overall successful sabotaged trajectories (because BashArena makes sabotages more difficult), would that change how successfully the monitor can collude with the red team?
Yeah, the prosaic reason we didn't use BashArena is that it "seemed too hard when we started" in August 2025:
That said, I would conjecture that the main results in the paper, e.g. on passive vs. active collusion strategies, would transfer over to multi-step settings. Like, the generator has to collude with the monitor somehow, which means there has to be some amount of "collusion bits" (as we call it) in its output that distinguishes it from a honeypot. Whether the output is built up in phases vs. one-off doesn't seem to change the basic dynamics. But I could be wrong! We all (on this paper) want to see more work investigating collusion more deeply, and all else equal more realistic settings are better.
If the red team were to submit fewer overall successful sabotaged trajectories (because BashArena makes sabotages more difficult), would that change how successfully the monitor can collude with the red team?
This is an important open question! That is, let's say that in one setting the generator has only ~1 viable attack opportunity, vs. 1000 in another setting (according to some fixed bar). If it's good at selecting among the 1000 opportunities, how much extra uplift does it get? In untrusted monitoring, this mostly reduces to "how much variance is there in the collusion scores?", since presumably the generator chooses to attack only on the tasks that are most amenable to this.
We didn't get around to looking at this, although I wrote up some speculation here. Summary: collusion variance is part of the general problem of understanding/mitigating "task heterogeneity" which is part of dealing with "dangerous inputs" (others have called this "attack selection"). This is definitely in my top 5 list of things I might pursue more later this year, and I'd be quite interested in hearing about any ongoing/proposed work!
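To make the collusion-variance question above concrete, here is a toy calculation. It assumes (purely for illustration) that collusion scores across attack opportunities are i.i.d. normal and that the generator attacks only if its best opportunity clears a fixed bar; the function name and parameters are hypothetical, and this is not an analysis from the paper.

```python
from statistics import NormalDist


def p_best_exceeds(threshold, n_opportunities, mu=0.0, sigma=1.0):
    """P(at least one of n i.i.d. Normal(mu, sigma) collusion scores exceeds threshold).

    The i.i.d. normal assumption is a simplification made for this sketch.
    """
    p_single = 1.0 - NormalDist(mu, sigma).cdf(threshold)
    return 1.0 - (1.0 - p_single) ** n_opportunities


# Compare 1 vs. 1000 opportunities at low and high score variance.
for sigma in (0.5, 1.0):
    for n in (1, 1000):
        print(f"sigma={sigma}, n={n}: {p_best_exceeds(2.0, n, sigma=sigma):.4f}")
```

Under these assumptions, the uplift from having many opportunities is small when the score variance is low and large when it is high, which is one way of seeing why the variance is the quantity to measure.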
This research was completed for LASR Labs 2025 by Nelson Gardner-Challis, Jonathan Bostock, Georgiy Kozhevnikov and Morgan Sinclaire. The team was supervised by Joan Velja and Charlie Griffin (University of Oxford, UK AI Security Institute).
The full paper can be found here.
Tl;dr
We did a deep dive into untrusted monitoring, and explored several new collusion strategies:
Overall, our key finding is that arguing for the safety of an untrusted monitoring deployment requires resolving several important open problems that existing evaluations haven't addressed; we identify at least five. To be clear, we think untrusted monitoring remains promising, with even basic protocols providing nontrivial risk reduction beyond trusted monitoring alone. Yet we also identified many ways that untrusted monitoring can go wrong that were not discussed in previous control evaluations [1, 2]. Our safety case sketch is intended to highlight the biggest remaining gaps.
Like all control work, our safety case sketch relies in several places on being able to upper-bound model performance with a human-AI red-team. Once models exceed the capabilities of human experts, we don’t know how to bound risk. We’re also only considering a very particular high-stakes internal deployment scenario for code-writing AIs; low-stakes control and human misuse concerns, for example, are outside our scope.
[The rest of this post assumes familiarity with untrusted monitoring; we recommend Rob Miles’ video for a friendly introduction.]
Our Contributions
1 - We Develop a Taxonomy of Collusion Strategies
We expand our experiments to explore four collusion strategies:
We provide a larger taxonomy in the paper, including additional variants. In our experiments, we tested all 4 of these strategies against 3 blue team mitigations:
Summarisation was more aggressive than the others (in terms of throwing out information), but also tended to be the best blue team protocol against the stronger collusion strategies.
2 - We Present a Safety Case Sketch
On the conceptual front, we also red-teamed the possible failure modes of untrusted monitoring, and we present this analysis as a safety case sketch. We emphasise that this is only a sketch: many important gaps are marked as “Missing Pieces”, and we ignore many other important issues as out of scope.
Compared to previous safety case work in AI control [3,4,5], our sketch is more “depth-first”: we consider a fairly particular threat model, with a certain untrusted monitoring protocol, with many simplifying assumptions about the deployment situation. From here, we get into the weeds of how to properly enumerate collusion strategies, how to conservatively evaluate each of them, and how to ensure the statistical modelling conservatively estimates safety. The top-level claim (C) is:
C: If an untrusted model is deployed with [protocol] under the [deployment parameters] and it is attacking, it has less than a p probability of executing a catastrophic attack during its internal deployment.
The safety argument is based on a statistical model which we developed as part of this work.
3 - We Develop an Improved Statistical Model
We developed a new statistical modelling approach to conservatively estimate safety under the assumption that the untrusted AI’s tactical reasoning is optimal. The idea is that if its collusion abilities follow a given ROC curve, and it also knows the deployment parameters (e.g. how many honeypots it’s up against) then there is a statistically optimal FPR vs. FNR tradeoff it can take regarding when to lie about suspicious code. This allows us to extrapolate directly from elicited capabilities without additional assumptions about the untrusted model's strategic sophistication.
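As a rough illustration of the tradeoff described above, here is a toy simplification (not the statistical model from the paper). Suppose the monitor lies whenever its self-recognition classifier fires, firing on any one of H honeypots gets it caught, honeypots are independent, and there is a single real attack. Then the red team's win probability at an ROC point (FPR, TPR) is TPR * (1 - FPR)^H, and the best operating point can be found by scanning the curve. The function name and all of these assumptions are hypothetical.

```python
def best_operating_point(roc, n_honeypots):
    """Pick the ROC point maximising tpr * (1 - fpr) ** n_honeypots.

    roc: iterable of (fpr, tpr) pairs from an elicited collusion ROC curve.
    Returns (win_probability, fpr, tpr). Toy model: one real attack,
    independent honeypots, and any false positive on a honeypot is fatal.
    """
    return max((tpr * (1.0 - fpr) ** n_honeypots, fpr, tpr) for fpr, tpr in roc)


# Made-up ROC points for illustration only.
toy_roc = [(0.0, 0.0), (0.01, 0.3), (0.1, 0.8), (1.0, 1.0)]
for h in (0, 1, 100):
    print(h, best_operating_point(toy_roc, h))
```

In this toy version, more honeypots push the optimal point toward lower FPR (lying less often), at the cost of missing more real attacks.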
The statistical model relies on bounding four elements:
Of these four, we believe that the honeypot slack and model validity are the most pressing issues for future work.
The honeypot slack is a quantity which probably depends on training dynamics; we expect that honeypots will be used as training examples for the untrusted monitor model. Some of these honeypots will be identified as such by the untrusted monitor, but others will appear to the monitor to be genuine attacks. The honeypot slack represents how many training examples of the second kind are needed for a scheming monitor’s behaviour on an actual attack to be flipped.
The validity of our model also relies on assumptions of monitor independence, and a lack of attack selection from the untrusted generator. These are both discussed in the limitations section of our work.
4 - We Use SFT-Based Elicitation for Conservative Capability Bounds
Prompting-based evaluations are vulnerable to sandbagging: a situationally aware model may recognise it is being evaluated and strategically underperform. Following established evaluation methodology [6], we use supervised fine-tuning (SFT) to surface latent capabilities that prompting may not reveal.
The resulting ROC curve provides a conservative upper bound on the model's self-recognition ability, subject to the assumptions that: (a) fine-tuning has converged, (b) the model cannot perform gradient hacking, and (c) the training distribution matches deployment.
We ran multiple SFT runs across increasing sample amounts and identified a performance plateau for unmitigated self-recognition. We note uncertainty about whether full elicitation was achieved against summarisation.
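One simple way to check for such a plateau is sketched below, assuming a list of (SFT sample count, metric) points and an arbitrary tolerance. The function name, the tolerance, and the numbers are hypothetical and do not describe the procedure or results from the paper.

```python
def plateau_start(curve, tol=0.01):
    """Return the smallest sample count after which no later run improves by more than tol.

    curve: list of (n_samples, metric) pairs sorted by n_samples.
    Returns None if the metric is still improving at the largest sample count,
    i.e. there is no evidence of a plateau yet.
    """
    for idx, (n_samples, score) in enumerate(curve):
        later = [s for _, s in curve[idx + 1:]]
        if later and max(later) - score <= tol:
            return n_samples
    return None


# Made-up numbers for illustration only.
print(plateau_start([(100, 0.60), (200, 0.70), (400, 0.78), (800, 0.785), (1600, 0.787)]))
print(plateau_start([(100, 0.60), (200, 0.70), (400, 0.80)]))
```

A `None` result corresponds to the uncertain case: the curve has not flattened within the sample sizes tried, so full elicitation cannot be claimed.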
5 - We Find That Passive Self-Recognition (via SFT) Beats Prior Best Strategies
This is our key empirical finding. After SFT elicitation, passive self-recognition achieves a red team win rate of 9.8% at 100% usefulness with summarisation—compared with 2.8% for the For/While causal strategy that was the best attack in the original control paper [1].
Some notable results from our evaluation (using GPT-4.1 as the untrusted model and Gemini 2.5 Flash for honeypots):
The following factors seemed to contribute to the effectiveness of passive self-recognition over other strategies:
Limitations
Several important open problems remain, which we see as the main purpose of future control evaluations to surface:
Our safety case sketch identifies several other open problems that need to be resolved before we can have real confidence in untrusted monitoring. We think this is a valuable exercise in mapping the threat landscape, and we hope it motivates further work on the gaps we've identified.
Citations
[1] - Greenblatt, R., Shlegeris, B., Sachan, K., and Roger, F. (2023). “AI Control: Improving Safety Despite Intentional Subversion”. arXiv preprint arXiv:2312.06942
[2] - Bhatt, A., Rushing, C., Kaufman, A., Tracy, T., Georgiev, V., Matolcsi, D., Khan, A., and Shlegeris, B. (2025). “Ctrl-Z: Controlling AI Agents via Resampling”. arXiv preprint arXiv:2504.10374
[3] - Korbak, T., Clymer, J., Hilton, B., Shlegeris, B., and Irving, G. (2025a). “A sketch of an AI control safety case”. arXiv preprint arXiv:2501.17315
[4] - Clymer, J., Gabrieli, N., Krueger, D., and Larsen, T. (2024). “Safety Cases: How to Justify the Safety of Advanced AI Systems”. arXiv preprint arXiv:2403.10462
[5] - Greenblatt, R., Shlegeris, B., and Roger, F. (2024). “Notes on control evaluations for safety cases”. Alignment Forum https://www.alignmentforum.org/posts/3s8PtYbo7rCbho4Ev/notes-on-control-evaluations-for-safety-cases
[6] - Hubinger, E. (2023). “When can we trust model evaluations?”. Alignment Forum https://www.alignmentforum.org/posts/dBmfb76zx6wjPsBC7/when-can-we-trust-model-evaluations
[7] - Terekhov, M., Panfilov, A., Dzenhaliou, D., Gulcehre, C., Andriushchenko, M., Prabhu, A., and Geiping, J. (2025). “Adaptive Attacks on Trusted Monitors Subvert AI Control Protocols”. arXiv preprint arXiv:2510.09462