Appendices: Supervised finetuning on low-harm reward hacking generalises to high-harm reward hacking
Appendix A: Task scenario dataset

The full dataset of task scenarios and model responses to the scenarios can be found at the following URL: https://reward-hack-easy-dataset-share.vercel.app/

Examples of task scenarios can be found as .txt files in this Google Drive folder. Also included in that folder are .txt files of the...