Misalignment classifiers: Why they're hard to evaluate adversarially, and why we're studying them anyway
Even if the misalignment risk from current AI agents is small, it may be useful to start internally deploying misalignment classifiers: language models designed to classify transcripts that represent intentionally misaligned behaviour. Deploying misalignment classifiers now may provide qualitative and quantitative evidence about how current agents misbehave in real deployments....