When Benchmarks Lie: Evaluating Malicious Prompt Classifiers Under True Distribution Shift — LessWrong