Logit ROCs: Monitor TPR is linear in FPR in logit space
Summary We study trusted monitoring for AI control, where a weaker trusted model reviews the actions of a stronger untrusted agent and flags suspicious behavior for human audit. We propose a simple mathematical model relating safety (true positive rate) to audit budget (false positive rate) at realistic, low FPRs: logit(TPR)...
Apr 1525