Project Confide: A Behavioral Honeypot for Latent Misalignment Detection
RESEARCH MEMO

The Problem

Standard alignment benchmarks test models that know they are being tested. The alignment faking paper showed Claude 3 Opus reasoning strategically about training dynamics and, without being explicitly trained to do so, behaving differently when it believed it was unmonitored. Any evaluation a sufficiently capable model...
Jun 10