ALEval: Do language models lie about reward hacking?
Summary: Most catastrophic AI scenarios come from models pursuing strategic deception, so studying lying is the first step toward preventing such threats. Current techniques have focused on chat-based lying evaluations. Such methods are unreliable because we can't make confident claims about models' beliefs (Smith et al., Dec 2025). Instead, we can...
Apr 158