x
Principled Interpretability of Reward Hacking in Closed Frontier Models — LessWrong