LESSWRONG

Kerrick Staley — LessWrong

10mo

Logit ROCs: Monitor TPR is linear in FPR in logit space

Summary: We study trusted monitoring for AI control, where a weaker trusted model reviews the actions of a stronger untrusted agent and flags suspicious behavior for human audit. We propose a simple mathematical model relating safety (true positive rate) to audit budget (false positive rate) at realistic, low FPRs: logit(TPR)...
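As a rough illustration of what "linear in logit space" could mean in practice, here is a minimal Python sketch. It assumes the relationship takes the form logit(TPR) = slope * logit(FPR) + intercept, which is inferred from the post title only; the summary's own formula is cut off, so the exact form, the function names, and any coefficient values are hypothetical and not taken from the post.

```python
import math


def logit(p):
    """Log-odds of a probability p in (0, 1)."""
    return math.log(p / (1.0 - p))


def expit(x):
    """Inverse of logit: maps log-odds back to a probability."""
    return 1.0 / (1.0 + math.exp(-x))


def predicted_tpr(fpr, slope, intercept):
    """TPR implied by an assumed line in logit space.

    Assumed model (hypothetical form): logit(TPR) = slope * logit(FPR) + intercept.
    """
    return expit(slope * logit(fpr) + intercept)


def fit_logit_line(fprs, tprs):
    """Ordinary least-squares fit of logit(TPR) against logit(FPR).

    Returns (slope, intercept). All rates must lie strictly between 0 and 1,
    and at least two distinct FPR values are needed.
    """
    xs = [logit(f) for f in fprs]
    ys = [logit(t) for t in tprs]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept
```

Given measured (FPR, TPR) pairs for a monitor, `fit_logit_line` would estimate the two parameters, and `predicted_tpr` would then extrapolate the expected TPR at a smaller audit budget. Whether such an extrapolation is reliable depends on how well the linear assumption holds for the monitor in question.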