ollie — LessWrong

92

2y

Misalignment classifiers: Why they’re hard to evaluate adversarially, and why we're studying them anyway

by Charlie Griffin, ollie, oliverfm, Rogan Inglis, and Alan Cooney

Even if the misalignment risk from current AI agents is small, it may be useful to start internally deploying misalignment classifiers: language models designed to classify transcripts that represent intentionally misaligned behaviour. Deploying misalignment classifiers now may provide qualitative and quantitative evidence about how current agents misbehave in real deployments....

Aug 15, 2025 • 68

[Paper] Hidden in Plain Text: Emergence and Mitigation of Steganographic Collusion in LLMs

by Yohan Mathew, joanv, robert mccarthy, ollie, Nandi, and Dylan Cope

This research was completed for London AI Safety Research (LASR) Labs 2024 by Yohan Mathew, Ollie Matthews, Robert McCarthy and Joan Velja. The team was supervised by Nandi Schoots and Dylan Cope (King’s College London, Imperial College London). Find out more about the programme and express interest in upcoming iterations...

Sep 25, 2024 • 37