Authors: Kei Nishimura-Gasparian, Neev Parikh

This is a research progress note describing work done during the Astra Fellowship. We do not plan to do any further work on this project, and are sharing this research note for other people in the AI safety community to learn from. We run experiments...
Appendix A: Task scenario dataset

The full dataset of task scenarios and model responses to the scenarios can be found at the following URL: https://reward-hack-easy-dataset-share.vercel.app/ Examples of task scenarios can be found as .txt files in this Google Drive folder. Also included in that folder are .txt files of the...
Summary

* We introduce a dataset of 8993 model-generated tasks for training and measuring reward hacking.
* Tasks vary in the harm level of the reward hacks and in whether oversight is present, allowing us to measure reward hacking generalisation along these dimensions.
* To simulate situational awareness, task prompts...
tl;dr: I share a curated set of examples of models successfully executing message-passing steganography from our recent paper. I then give a few thoughts on how I think about risks from this kind of steganography. Background I was recently a co-first author on a paper (LW link) where we...
One key strategy for preventing bad outcomes from misuse or misalignment is model monitoring. However, one way that monitoring can fail is if LLMs use steganography - encoding dangerous information or reasoning steps in seemingly benign text. To understand the potential for LLMs to cause harm via steganography, we run...
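To make the failure mode concrete, here is a toy sketch (not from the paper; all names and the codebook are hypothetical) of the simplest form of message-passing steganography: a sender hides a short message in the first letters of otherwise innocuous words, and a receiver reads it back out while a monitor sees only benign text.

```python
# Toy acrostic steganography: the secret is spelled out by the first
# letter of each word in the cover text. A hypothetical illustration,
# not the encoding scheme used in the paper.

def encode_acrostic(secret: str, codebook: dict) -> str:
    """Build benign-looking text whose word initials spell the secret."""
    return " ".join(codebook[ch] for ch in secret.lower())

def decode_acrostic(cover_text: str) -> str:
    """Recover the secret by reading the first letter of each word."""
    return "".join(word[0] for word in cover_text.split())

# A tiny hypothetical codebook mapping letters to innocuous words.
codebook = {"h": "how", "i": "interesting", "d": "data", "e": "experiments"}

cover = encode_acrostic("hide", codebook)
print(cover)                   # how interesting data experiments
print(decode_acrostic(cover))  # hide
```

A monitor that only checks the surface meaning of the cover text would pass it as benign, which is exactly why steganographic channels undermine text-based oversight; real model-devised encodings can be far less obvious than this acrostic.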
Something’s changed about reward hacking[1] in recent systems. In the past, reward hacks were usually accidents, discovered by narrow, RL-trained systems: models would randomly explore different behaviors and sometimes stumble on undesired behaviors that achieved high rewards[2]. These hacks were usually either simple or took a long time for...
We study alignment audits—systematic investigations into whether an AI is pursuing hidden objectives—by training a model with a hidden misaligned objective and asking teams of blinded researchers to investigate it. This paper was a collaboration between the Anthropic Alignment Science and Interpretability teams. Abstract We study the feasibility of conducting...