joanv — LessWrong

joanv's Shortform

[Paper] When can we trust untrusted monitoring? A safety case sketch across collusion strategies

This research was completed for LASR Labs 2025 by Nelson Gardner-Challis, Jonathan Bostock, Georgiy Kozhevnikov and Morgan Sinclaire. The team was supervised by Joan Velja and Charlie Griffin (University of Oxford, UK AI Security Institute). The full paper can be found here. Tl;dr We did a deep dive into untrusted...

Mar 1046

[Paper] Hidden in Plain Text: Emergence and Mitigation of Steganographic Collusion in LLMs

This research was completed for London AI Safety Research (LASR) Labs 2024 by Yohan Mathew, Ollie Matthews, Robert McCarthy and Joan Velja. The team was supervised by Nandi Schoots and Dylan Cope (King’s College London, Imperial College London). Find out more about the programme and express interest in upcoming iterations...

Sep 25, 202437