Basic Legibility Protocols Improve Trusted Monitoring
This is a blog post for our paper Basic Legibility Protocols Improve Trusted Monitoring. This work was done as part of the CBAI Summer Research Fellowship, mentored by Cody Rushing. In this blog post, we’ll go over the main results, and focus more on the engineering and methodological takeaways than...

Very cool work!
- I'm curious how you broke down the different difficulty grades for the technical tier. Were there any heuristics that you used that could potentially generalize for other settings?
- Relatedly: I'd be curious if you had a breakdown for safety across different difficulty gradings for different monitors.
- Opus 4.5 seems to be the best at catching Opus 4.5-generated attacks. Do you know how much of this effect comes from models simply being particularly good at catching their own attacks?
- I ask because I'm wondering whether generating an attack with Opus 4.5 and then refining it with a different model (say, gpt-5) could make the attacks harder to catch.