TLDR: The idea is basically inoculation prompting crossed with alignment pretraining. Call it ‘inoculation pretraining.’ It’s a type of spillway design. Reward hacking can cause emergent misalignment: you train the AI to cheat on its tasks and it turns broadly evil. Why does this happen? The persona selection model (PSM)...
People sometimes say that AI safety is a Pascal’s mugging. Other people sometimes reply that AI safety can’t be a Pascal’s mugging, because p(doom) is high. Both groups are wrong. The second group of people are wrong because Pascal’s muggings are about the probability that you make a difference,...
Executive summary * AI self-replication is an emerging risk. We: * provide background on it, * survey recent work, and * explain how it interacts with other risks from advanced AI. * We propose a safeguard against AI self-replication: train agents to have preferences only between outcomes with the same...
Summary * Future artificial agents might resist shutdown. * I present an idea – the POST-Agents Proposal – for ensuring that doesn’t happen. * I propose that we train agents to satisfy Preferences Only Between Same-Length Trajectories (POST). * Perhaps by using a Discounted Reward for Same-Length Trajectories (DReST) reward...
We[1] have a new paper testing the POST-Agents Proposal (PAP). The abstract and main text are below. Appendices are in the linked PDF. Abstract * Some worry that advanced artificial agents may resist being shut down. * The POST-Agents Proposal (PAP) is an idea for ensuring that doesn’t happen. * A...
Preamble This post has been superseded by this one. This post is an updated explanation of the POST-Agents Proposal (PAP): my proposed solution to the shutdown problem.[1] The post is shorter than my AI Alignment Awards contest entry but it’s still pretty long. The core of the idea is the...
[NOTE: This paper was previously titled 'The Shutdown Problem: Three Theorems'.] This paper is an updated version of the first half of my AI Alignment Awards contest entry. My theorems build on the theorems of Soares, Fallenstein, Yudkowsky, and Armstrong in various ways.[1] These theorems can guide our search for...