Elliott Thornley (EJT)

Preference gaps as a safeguard against AI self-replication

Executive summary
* AI self-replication is an emerging risk. We:
  * provide background on it,
  * survey recent work, and
  * explain how it interacts with other risks from advanced AI.
* We propose a safeguard against AI self-replication: train agents to have preferences only between outcomes with the same...

Nov 26, 2025 · 10

Shutdownable Agents through POST-Agency

Summary
* Future artificial agents might resist shutdown.
* I present an idea – the POST-Agents Proposal – for ensuring that doesn’t happen.
* I propose that we train agents to satisfy Preferences Only Between Same-Length Trajectories (POST).
* Perhaps by using a Discounted Reward for Same-Length Trajectories (DReST) reward...

Sep 16, 2025 · 32

Towards shutdownable agents via stochastic choice

We[1] have a new paper testing the POST-Agents Proposal (PAP). The abstract and main text are below. Appendices are in the linked PDF.
Abstract
* Some worry that advanced artificial agents may resist being shut down.
* The POST-Agents Proposal (PAP) is an idea for ensuring that doesn’t happen.
* A...

Jul 8, 2024 · 59

The Shutdown Problem: Incomplete Preferences as a Solution

Preamble This post is an updated explanation of the POST-Agents Proposal (PAP): my proposed solution to the shutdown problem.[1] The post is shorter than my AI Alignment Awards contest entry but it’s still pretty long. The core of the idea is the Timestep Dominance Principle in section 11. That section...

Feb 23, 2024 · 62

The Shutdown Problem: An AI Engineering Puzzle for Decision Theorists

[NOTE: This paper was previously titled 'The Shutdown Problem: Three Theorems'.] This paper is an updated version of the first half of my AI Alignment Awards contest entry. My theorems build on the theorems of Soares, Fallenstein, Yudkowsky, and Armstrong in various ways.[1] These theorems can guide our search for...

Oct 23, 2023 · 79

The price is right

[Saying an old thing in a new way] The United States Department of Transportation will pay $11.8 million to save a life. You know what that means? It means that if you come to the United States Department of Transportation with a plan (barriers around the Grand Canyon, wider lanes...

Oct 16, 2023 · 42

What are some examples of AIs instantiating the 'nearest unblocked strategy problem'?

A paragraph explaining the problem, from Ngo, Chan, and Mindermann (2023). I've bolded the key part: > Our definition of internally-represented goals is consistent with policies learning multiple goals during training, including some aligned and some misaligned goals, which might interact in complex ways to determine their behavior in novel...

Oct 4, 2023 · 6

LESSWRONG

There are no coherence theorems
