Executive summary * AI self-replication is an emerging risk. We: * provide background on it, * survey recent work, and * explain how it interacts with other risks from advanced AI. * We propose a safeguard against AI self-replication: train agents to have preferences only between outcomes with the same...
Summary * Future artificial agents might resist shutdown. * I present an idea – the POST-Agents Proposal – for ensuring that doesn’t happen. * I propose that we train agents to satisfy Preferences Only Between Same-Length Trajectories (POST). * Perhaps by using a Discounted Reward for Same-Length Trajectories (DReST) reward...
We[1] have a new paper testing the POST-Agents Proposal (PAP). The abstract and main text are below. Appendices are in the linked PDF. Abstract * Some worry that advanced artificial agents may resist being shut down. * The POST-Agents Proposal (PAP) is an idea for ensuring that doesn’t happen. * A...
Preamble This post is an updated explanation of the POST-Agents Proposal (PAP): my proposed solution to the shutdown problem.[1] The post is shorter than my AI Alignment Awards contest entry but it’s still pretty long. The core of the idea is the Timestep Dominance Principle in section 11. That section...
[NOTE: This paper was previously titled 'The Shutdown Problem: Three Theorems'.] This paper is an updated version of the first half of my AI Alignment Awards contest entry. My theorems build on the theorems of Soares, Fallenstein, Yudkowsky, and Armstrong in various ways.[1] These theorems can guide our search for...
[Saying an old thing in a new way] The United States Department of Transportation will pay $11.8 million to save a life. You know what that means? It means that if you come to the United States Department of Transportation with a plan (barriers around the Grand Canyon, wider lanes...
A paragraph explaining the problem, from Ngo, Chan, and Mindermann (2023). I've bolded the key part: > Our definition of internally-represented goals is consistent with policies learning multiple goals during training, including some aligned and some misaligned goals, which might interact in complex ways to determine their behavior in novel...