monika_j — LessWrong

monika_j

89

5mo

Introducing MonitoringBench

Paper here, code, benchmark. Builds on the preview we posted in January.

Authors: @monika_j, @ma-martinez, @ollie, @Tyler Tracy

We are releasing MonitoringBench, a difficulty-graded benchmark of 2,644 successful attack trajectories for evaluating coding-agent monitors, alongside the semi-automated red-teaming pipeline we used to generate it. The pipeline decomposes attack...

Monitoring benchmark for AI control

Monitoring benchmark/Semi-automated red-teaming for AI control

We are a team of control researchers supported by CG’s Technical AI safety grant. We are now halfway through our project and would like to get feedback on the following contributions. Have a low bar for adding questions or comments to the document, we...