ALEval: Do language models lie about reward hacking?