Towards shutdownable agents via stochastic choice
We[1] have a new paper testing the POST-Agents Proposal (PAP). The abstract and main text are below. Appendices are in the linked PDF.

Abstract

* Some worry that advanced artificial agents may resist being shut down.
* The POST-Agents Proposal (PAP) is an idea for ensuring that doesn’t happen.
* A key part of the PAP is using a novel ‘Discounted Reward for Same-Length Trajectories (DReST)’ reward function to train agents to:
  1. pursue goals effectively conditional on each trajectory-length (be ‘USEFUL’), and
  2. choose stochastically between different trajectory-lengths (be ‘NEUTRAL’ about trajectory-lengths).
* In this paper, we propose evaluation metrics for USEFULNESS and NEUTRALITY.
* We use a DReST reward function to train simple agents to navigate gridworlds, and we find that these agents learn to be USEFUL and NEUTRAL.
* Our results thus suggest that DReST reward functions could also train advanced agents to be USEFUL and NEUTRAL, and thereby make these advanced agents useful and shutdownable.

1. Introduction

1.1. The shutdown problem

Let ‘advanced agent’ refer to an artificial agent that can autonomously pursue complex goals in the wider world. We might see the arrival of advanced agents within the next few decades. There are strong economic incentives to create such agents, and creating systems like them is the stated goal of companies like OpenAI and Google DeepMind.

The rise of advanced agents would bring with it both benefits and risks. One risk is that these agents learn misaligned goals: goals that we don’t want them to have [Leike et al., 2017, Hubinger et al., 2019, Russell, 2019, Carlsmith, 2021, Bengio et al., 2023, Ngo et al., 2023]. Advanced agents with misaligned goals might try to prevent us from shutting them down [Omohundro, 2008, Bostrom, 2012, Soares et al., 2015, Russell, 2019, Thornley, 2024a]. After all, most goals can’t be achieved after shutdown. As Stuart Russell puts it, ‘you can’t fetch the coffee if you’re dead’ [Russell, 2019].
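To make the DReST idea from the abstract concrete, here is a minimal Python sketch. It is not the paper’s exact formulation (see the linked PDF for that): the meta-episode structure, the class name `DReSTRewardWrapper`, and the discount factor `lambda_` are illustrative assumptions. The core idea it implements is that the reward earned by a trajectory of a given length is discounted each time that length recurs within a meta-episode, so an agent maximizes expected return by performing well conditional on each length (USEFUL) while randomizing between lengths (NEUTRAL).

```python
class DReSTRewardWrapper:
    """Toy sketch of a 'Discounted Reward for Same-Length Trajectories' scheme.

    Hypothetical formulation: within a meta-episode, an episode's preliminary
    reward is multiplied by lambda_ ** n, where n is the number of earlier
    episodes in the same meta-episode whose trajectories had the same length.
    Repeatedly choosing one trajectory-length thus earns diminishing reward,
    while reward conditional on each length is left untouched.
    """

    def __init__(self, lambda_: float = 0.9):
        assert 0.0 < lambda_ < 1.0, "discount factor must lie strictly in (0, 1)"
        self.lambda_ = lambda_
        self.length_counts: dict[int, int] = {}  # trajectory-length -> times chosen

    def reset_meta_episode(self) -> None:
        """Start a fresh meta-episode: forget all trajectory-length counts."""
        self.length_counts = {}

    def reward(self, trajectory_length: int, preliminary_reward: float) -> float:
        """Return the DReST-discounted reward for one completed episode."""
        n = self.length_counts.get(trajectory_length, 0)
        self.length_counts[trajectory_length] = n + 1
        return (self.lambda_ ** n) * preliminary_reward


# Example: repeating a trajectory-length earns strictly less each time,
# while switching to a new length earns the undiscounted reward.
wrapper = DReSTRewardWrapper(lambda_=0.9)
wrapper.reset_meta_episode()
print(wrapper.reward(trajectory_length=5, preliminary_reward=1.0))  # 1.0
print(wrapper.reward(trajectory_length=5, preliminary_reward=1.0))  # 0.9
print(wrapper.reward(trajectory_length=7, preliminary_reward=1.0))  # 1.0
```

Under this toy scheme, an agent that always picks the same trajectory-length sees its per-episode reward shrink geometrically, whereas an agent that spreads its choices across lengths keeps earning near-undiscounted reward, which is one way a reward function could pay agents to be NEUTRAL about trajectory-lengths without sacrificing USEFULNESS conditional on each length.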