Summary * Misaligned artificial agents might resist shutdown. * One proposed solution is the POST-Agents Proposal: roughly, training agents to lack preferences between different-length trajectories. * The Discounted Reward for Same-Length Trajectories (DReST) reward does this by penalizing agents for repeatedly choosing same-length trajectories. It thus incentivizes agents to be:...
TLDR: The idea is basically inoculation prompting crossed with alignment pretraining. Call it ‘inoculation pretraining.’ It’s a type of spillway design. Reward hacking can cause emergent misalignment: you train the AI to cheat on its tasks and it turns broadly evil. Why does this happen? The persona selection model (PSM)...
People sometimes say that AI safety is a Pascal’s mugging. Other people sometimes reply that AI safety can’t be a Pascal’s mugging, because p(doom) is high. Both groups are wrong. The second group is wrong because Pascal’s muggings are about the probability that you make a difference,...
Executive summary * AI self-replication is an emerging risk. We: * provide background on it, * survey recent work, and * explain how it interacts with other risks from advanced AI. * We propose a safeguard against AI self-replication: train agents to have preferences only between outcomes with the same...
Summary * Future artificial agents might resist shutdown. * I present an idea – the POST-Agents Proposal – for ensuring that doesn’t happen. * I propose that we train agents to satisfy Preferences Only Between Same-Length Trajectories (POST). * Perhaps by using a Discounted Reward for Same-Length Trajectories (DReST) reward...
We[1] have a new paper testing the POST-Agents Proposal (PAP). The abstract and main text are below. Appendices are in the linked PDF. Abstract * Some worry that advanced artificial agents may resist being shut down. * The POST-Agents Proposal (PAP) is an idea for ensuring that doesn’t happen. * A...
Preamble This post has been superseded by this one. This post is an updated explanation of the POST-Agents Proposal (PAP): my proposed solution to the shutdown problem.[1] The post is shorter than my AI Alignment Awards contest entry but it’s still pretty long. The core of the idea is the...