Building Black-box Scheming Monitors
James, Rich, and Simon are co-first authors on this work. This is a five-week interim report supported by ML Alignment & Theory Scholars.

Executive Summary

* Our goal is to develop methods for training black box scheming monitors, and to evaluate their generalisation to out-of-distribution test sets.
* We aim...
Jul 29, 2025
thanks for the helpful feedback!
good point, we’ll add evaluations where we fix the threshold on one dataset and then see how TPR/FPR change on a different distribution
we’re definitely interested in the “interrogation” direction! For now it feels a bit early. People in labs working on catching scheming are still getting basic offline monitoring in place so we’re focusing on making that work well first.
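As a rough illustration of the threshold-transfer evaluation mentioned above (fix the threshold on one dataset, then see how TPR/FPR move on a different distribution), here is a minimal sketch. The function names `rates` and `pick_threshold`, the 25% FPR budget, and all scores and labels are hypothetical and made up for illustration; they are not from the report or from any real monitor.

```python
def rates(scores, labels, threshold):
    """Return (TPR, FPR) when flagging every score >= threshold as scheming."""
    tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= threshold)
    fp = sum(1 for s, y in zip(scores, labels) if y == 0 and s >= threshold)
    pos = sum(labels)
    neg = len(labels) - pos
    return tp / pos, fp / neg


def pick_threshold(scores, labels, max_fpr):
    """Smallest observed score whose FPR on this dataset is within max_fpr."""
    # FPR can only fall as the threshold rises, so the first hit is the smallest.
    for t in sorted(set(scores)):
        _, fpr = rates(scores, labels, t)
        if fpr <= max_fpr:
            return t
    return float("inf")  # no observed score satisfies the budget: flag nothing


# Synthetic monitor scores (label 1 = scheming, 0 = benign). Illustrative only.
train_scores = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
train_labels = [0, 0, 0, 0, 1, 1, 1, 1]
ood_scores = [0.3, 0.45, 0.5, 0.55, 0.35, 0.5, 0.7, 0.9]
ood_labels = [0, 0, 0, 0, 1, 1, 1, 1]

# Fix the threshold on the in-distribution set only.
threshold = pick_threshold(train_scores, train_labels, max_fpr=0.25)

# Reuse the same threshold, unchanged, on the shifted distribution.
train_tpr, train_fpr = rates(train_scores, train_labels, threshold)
ood_tpr, ood_fpr = rates(ood_scores, ood_labels, threshold)

print(f"threshold={threshold}")
print(f"in-distribution: TPR={train_tpr:.2f} FPR={train_fpr:.2f}")
print(f"out-of-distribution: TPR={ood_tpr:.2f} FPR={ood_fpr:.2f}")
```

The point of holding the threshold fixed is that a deployed monitor cannot be recalibrated on data it has not seen yet, so the gap between the two printed rows is a direct read on how much the operating point drifts under distribution shift.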