Stress-Testing Alignment Audits With Prompt-Level Strategic Deception
Code, paper, Twitter thread copied below:

Introduction

Are alignment auditing methods robust to deceptive adversaries? In our new paper, we find black-box and white-box auditing methods can be fooled by strategic deception prompts.

The core problem: We want to audit models for hidden goals before deployment. But future misaligned AIs...