Misalignment classifiers: Why they're hard to evaluate adversarially, and why we're studying them anyway
Even if the misalignment risk from current AI agents is small, it may be useful to start internally deploying misalignment classifiers: language models designed to classify transcripts that represent intentionally misaligned behaviour. Deploying misalignment classifiers now may provide qualitative and quantitative evidence about how current agents misbehave in real deployments....