This research was completed for LASR Labs 2025 by Nelson Gardner-Challis, Jonathan Bostock, Georgiy Kozhevnikov and Morgan Sinclaire. The team was supervised by Joan Velja and Charlie Griffin (University of Oxford, UK AI Security Institute).
The full paper can be found here.
We did a deep dive into untrusted monitoring, and explored several new collusion strategies:
Overall, our key finding is that arguing for the safety of an untrusted monitoring deployment requires resolving several important open problems that...
What is the threat model you are probing for here?
The prompting approach suffers from LLMs' shallowness in strategic thinking; I would not be surprised if, prompting a language model several times to come up with an encoding scheme would result in a highly concentrated distribution.
> A saving grace may be that ‘detecting steg is happening at all’ seems easier than ‘recovering the plaintext message’
While i agree in principle, as in all red/blue team interactions, you may not want your blue team to be trigger happy; you still want to be able t...
Moreover, in this paradigm, forms of hidden reasoning seem likely to emerge: in multi-step reasoning, for example, the model might find it efficient to compress backtracking or common reasoning cues into cryptic tokens (e.g., "Hmmm") as a kind of shorthand to encode arbitrarily dense or unclear information. This is especially true under financial pressures to compress/shorten the Chains-of-Thought, thus allowing models to perform potentially long serial reasoning outside of human/AI oversight.
This research was completed for London AI Safety Research (LASR) Labs 2024 by Yohan Mathew, Ollie Matthews, Robert McCarthy and Joan Velja. The team was supervised by Nandi Schoots and Dylan Cope (King’s College London, Imperial College London). Find out more about the programme and express interest in upcoming iterations here.
The full paper can be found here, while a short video presentation covering the highlights of the paper is here (note that some graphs have been updated since the presentation).
Introduction
Collusion in multi-agent systems is defined as 2 or more agents covertly coordinating to the disadvantage of other agents [6], while steganography is the practice of concealing information within a message while avoiding detection. In future multi-AI systems there is a risk that due to misspecified incentives, LLMs will...
>Human priors serve as a temporary bootstrapping mechanism until we develop approaches that can learn effectively from scratch.
I would argue that instead human priors serve as a mechanism to help the search process, as it's being shown with cold-started reasoning models: they bake-in some reasoning traces that the model can then learn to exploit via RL. While this is not very bitter lesson-esque, the solution space is so large that it'd probably be quite difficult to do so without the cold start phase (although R1-zero kind of hints at this being possible). Maybe we have not yet thrown as much compute at the problem to do this search from scratch effectively.