Supervised finetuning on low-harm reward hacking generalises to high-harm reward hacking — LessWrong