Introducing MonitoringBench
Paper here, code, benchmark. Builds on the preview we posted in January.

Authors: @monika_j, @ma-martinez, @ollie, @Tyler Tracy

We are releasing MonitoringBench, a difficulty-graded benchmark of 2,644 successful attack trajectories for evaluating coding-agent monitors, alongside the semi-automated red-teaming pipeline we used to generate it. The pipeline decomposes attack...
Jun 2136