x
Experiments on Reward Hacking Monitorability in Language Models — LessWrong