Robert Kirk

Appendices: Supervised finetuning on low-harm reward hacking generalises to high-harm reward hacking

by Isaac Dunn, Kei Nishimura-Gasparian, Carson Denison, Ethan Perez, and Robert Kirk

Appendix A: Task scenario dataset
The full dataset of task scenarios and model responses to the scenarios can be found at the following URL: https://reward-hack-easy-dataset-share.vercel.app/ Examples of task scenarios can be found as .txt files in this Google Drive folder. Also included in that folder are .txt files of the...

Dec 22, 2025•17

Supervised finetuning on low-harm reward hacking generalises to high-harm reward hacking

by Isaac Dunn, Kei Nishimura-Gasparian, Carson Denison, Ethan Perez, and Robert Kirk

Summary
* We introduce a dataset of 8993 model-generated tasks for training and measuring reward hacking.
* Tasks vary on the harm level of the reward hacks and on whether oversight is present, allowing us to measure reward hacking generalisation over these dimensions.
* To simulate situational awareness, task prompts...

Dec 22, 2025•15

Layered AI Defenses Have Holes: Vulnerabilities and Key Recommendations

by smallsilo, Ian McKenzie, Oskar Hollinsworth, Tom Tseng, Xander Davies, scasper, Aaron Tucker, Robert Kirk, and Adam Gleave

Leading AI companies are increasingly using "defense-in-depth" strategies to prevent their models from being misused to generate harmful content, such as instructions to generate chemical, biological, radiological or nuclear (CBRN) weapons. The idea is straightforward: layer multiple safety checks so that even if one fails, others should catch the problem....

Jul 4, 2025•13