Elliott Thornley (EJT)

Towards Shutdownable Agents: Generalizing Stochastic Choice in RL Agents and LLMs

Summary
* Misaligned artificial agents might resist shutdown.
* One proposed solution is the POST-Agents Proposal: roughly, training agents to lack preferences between different-length trajectories.
* The Discounted Reward for Same-Length Trajectories (DReST) reward does this by penalizing agents for repeatedly choosing same-length trajectories. It thus incentivizes agents to be:...

Jun 3 · 20

Maybe we should pretrain on synthetic data about good-but-reward-hacking AIs

TLDR: The idea is basically inoculation prompting crossed with alignment pretraining. Call it ‘inoculation pretraining.’ It’s a type of spillway design. Reward hacking can cause emergent misalignment: you train the AI to cheat on its tasks and it turns broadly evil. Why does this happen? The persona selection model (PSM)...

May 29 · 12

AI safety can be a Pascal's mugging even if p(doom) is high

People sometimes say that AI safety is a Pascal’s mugging. Other people sometimes reply that AI safety can’t be a Pascal’s mugging, because p(doom) is high. Both these people are wrong. The second group of people are wrong because Pascal’s muggings are about the probability that you make a difference,...

Apr 25 · 29

Preference gaps as a safeguard against AI self-replication

by tbs and Elliott Thornley (EJT)

Executive summary
* AI self-replication is an emerging risk. We:
  * provide background on it.
  * survey recent work, and
  * explain how it interacts with other risks from advanced AI.
* We propose a safeguard against AI self-replication: train agents to have preferences only between outcomes with the same...

Nov 26, 2025 · 10

Shutdownable Agents through POST-Agency

Summary
* Future artificial agents might resist shutdown.
* I present an idea – the POST-Agents Proposal – for ensuring that doesn’t happen.
* I propose that we train agents to satisfy Preferences Only Between Same-Length Trajectories (POST).
* Perhaps by using a Discounted Reward for Same-Length Trajectories (DReST) reward...

Sep 16, 2025 · 33

Towards shutdownable agents via stochastic choice

We[1] have a new paper testing the POST-Agents Proposal (PAP). The abstract and main text are below. Appendices are in the linked PDF.

Abstract
* Some worry that advanced artificial agents may resist being shut down.
* The POST-Agents Proposal (PAP) is an idea for ensuring that doesn’t happen.
* A...

Jul 8, 2024 · 59

The Shutdown Problem: Incomplete Preferences as a Solution

Preamble

This post has been superseded by this one. This post is an updated explanation of the POST-Agents Proposal (PAP): my proposed solution to the shutdown problem.[1] The post is shorter than my AI Alignment Awards contest entry but it’s still pretty long. The core of the idea is the...

Feb 23, 2024 · 62

The Shutdown Problem: An AI Engineering Puzzle for Decision Theorists

There are no coherence theorems
