
Elliott Thornley (EJT) — LessWrong

Why are you 30% in SPY if SPX is far better?

a company successfully solves control for "high-stakes"/"concentrated" threats, meaning that AIs can safely be deployed without a large risk of them hacking their own datacenters or exfiltrating or causing any similar catastrophic action.

This seems like quite a difficult task, especially since we (or some other trusted agent) need to read the outputs of these AIs to get any value from their safety research. Schemers could self-exfiltrate by manipulating us somehow.

The company probably can't initially tell apart the schemers from the aligned AIs, or indeed, tell which of the projects are succeeding on a day-to-day basis. (It's just too hard to evaluate which research directions are genuinely promising vs. a waste of time.)

Elliott Thornley (EJT) 17d

Dario Amodei – The Adolescence of Technology

In a lab experiment where it was told it was going to be shut down, Claude sometimes blackmailed fictional employees who controlled its shutdown button (again, we also tested frontier models from all the other major AI developers and they often did the same thing).

Why is it always the blackmail result that gets reported from this paper? Frontier models were also found willing to cause a fictional employee's death to avoid shutdown. It's weird to me that that's so often ignored.

Elliott Thornley (EJT) 22d

Here's another justification for hyperbolic discounting, drawing on the idea that you're less psychologically connected to your future selves.

Elliott Thornley (EJT) 22d

I've always seen this idea attributed to Martin Weitzman, and he cites these papers as making a similar point. Seems like an interesting case of simultaneous discovery: four papers making the same sort of point all appearing between 1996 and 1999.

Elliott Thornley (EJT) 1mo

What's your current view? We should aim for virtuousness instead of corrigibility?

Elliott Thornley (EJT) 1mo

are uploads conscious? What about AIs? Should we care about shrimp? What population ethics views should we have? What about acausal trade? What about Pascal's wager? What about meaning? What about diversity?

It sounds like you're saying an AI has to get these questions right in order to count as aligned, and that's part of the reason why alignment is hard. But I expect that many people in the AI industry don't care about alignment in this sense, and instead just care about the 'follow instructions' sense of alignment.

Elliott Thornley (EJT) 1mo

Yeah I think the only thing that really matters is the frequency with which bills are dropped, and train stations seem like high-frequency places.

Elliott Thornley (EJT) 2mo

More reasons to worry about relying on constraints:

- As you say, your constraints might be insufficiently general ('nearest unblocked strategy,' etc.). This seems like a big issue to me: people like Jesus and the Buddha seem to have gained huge amounts of influence without needing to violate any obvious deontological constraints.
- Your constraints might be insufficiently strong (e.g. maybe the constraints are strong enough to keep the AI compliant all throughout training, but then the AI gets a really great opportunity in deployment...).
- Your constraints might be just 'outer shell,' like humans' instinctual fear of heights (Barnett and Gillen). The AI might see them as an obstacle to overcome, rather than as a part of its

Elliott Thornley (EJT) 2mo

The behavioral selection model for predicting AI motivations

Great post! Tiny thing: is the speed prior really best understood as a prior? Surely the only way in which being slow can count against a cognitive pattern is if being slow leads to lower reward. And in that case it seems like speed is a behavioral selection pressure rather than a prior.

Preference gaps as a safeguard against AI self-replication

tbs, Elliott Thornley (EJT)

3mo

Executive summary

AI self-replication is an emerging risk. We:
- provide background on it,
- survey recent work, and
- explain how it interacts with other risks from advanced AI.
We propose a safeguard against AI self-replication: train agents to have preferences only between outcomes with the same number of copies of themselves.
- This proposal takes inspiration from Elliott’s (2025) POST-Agents Proposal.
After introducing our proposal, we:
- explain why AI agents with these preferences won’t self-replicate if doing so is costly with respect to the lotteries that they get conditional on each number-of-copies,
- explain why we think training agents to have these preferences is likely easier than training agents to be fully aligned or reliably averse to self-replication,
- give reasons to think that our proposed


Shutdownable Agents through POST-Agency

Elliott Thornley (EJT)

5mo

Summary

Future artificial agents might resist shutdown.
I present an idea – the POST-Agents Proposal – for ensuring that doesn’t happen.
I propose that we train agents to satisfy Preferences Only Between Same-Length Trajectories (POST).
- Perhaps by using a Discounted Reward for Same-Length Trajectories (DReST) reward function.
I then prove that POST – together with other conditions – implies Neutrality+: the agent maximizes expected utility, ignoring the probability distribution over trajectory-lengths.
I argue that Neutrality+ keeps agents shutdownable and allows them to be useful.^[1]

1. Introduction

They’re not just chatbots anymore. As of 2025, they can use your computer: clicking, typing, searching, and scrolling just as you would. Early demos indicate that they can fill out forms, order groceries, and plan...

Towards shutdownable agents via stochastic choice

Elliott Thornley (EJT), alexr, christosi, LAThomson

We^[1] have a new paper testing the POST-Agents Proposal (PAP). The abstract and main text are below. Appendices are in the linked PDF.

Abstract

Some worry that advanced artificial agents may resist being shut down.
The POST-Agents Proposal (PAP) is an idea for ensuring that doesn’t happen.
A key part of the PAP is using a novel ‘Discounted Reward for Same-Length Trajectories (DReST)’ reward function to train agents to:
1. pursue goals effectively conditional on each trajectory-length (be ‘USEFUL’)
2. choose stochastically between different trajectory-lengths (be ‘NEUTRAL’ about trajectory-lengths).
In this paper, we propose evaluation metrics for USEFULNESS and NEUTRALITY.
We use a DReST reward function to train simple agents to navigate gridworlds, and we find that these agents learn to be USEFUL and

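To give a feel for how a 'discounted reward for same-length trajectories' could push an agent away from always choosing the same trajectory-length, here is a toy sketch. It is not the reward function from the paper: the function name `drest_style_rewards`, the discount factor `lam`, and the rule "multiply each mini-episode's score by `lam` raised to the number of earlier mini-episodes with the same length" are all simplifying assumptions made for illustration.

```python
from collections import Counter


def drest_style_rewards(chosen_lengths, scores, lam=0.9):
    """Toy discounted reward (hypothetical, simplified).

    Each mini-episode's score is scaled by lam ** n, where n is the number
    of earlier mini-episodes in which the same trajectory-length was chosen.
    """
    seen = Counter()
    rewards = []
    for length, score in zip(chosen_lengths, scores):
        rewards.append((lam ** seen[length]) * score)
        seen[length] += 1
    return rewards


# Same per-episode score in both runs; only the choice of lengths differs.
same = sum(drest_style_rewards([5, 5, 5, 5], [1.0] * 4))
mixed = sum(drest_style_rewards([5, 10, 5, 10], [1.0] * 4))
print(same, mixed)
```

In this toy setup, repeating one trajectory-length earns less in total than spreading choices across two lengths, while the score within each mini-episode still matters. That is only meant to illustrate the two pressures the abstract describes (being USEFUL conditional on each length, and not favouring any one length).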

My solution to the shutdown problem didn't get as much attention as I hoped. Here's why it's worth your time.

An everywhere-implemented solution to the shutdown problem would send the risk of AI takeover down to ~0.
My solution is shovel-ready. It makes only small tweaks to an otherwise-thoroughly-prosaic setup for training transformative AI.
My solution won first prize and $16,000 in last year's AI Alignment Awards, judged by Nate Soares, John Wentworth, and Richard Ngo.
I've since explained my solution to about 50 people in and around the AI safety community, and all the responses have been various flavours of 'This seems promising.' I've not yet had any responses of the form 'I expect this wouldn't work, for the following reason(s): _____.'

If you read my solution and think it wouldn't work, let me know. If you think it could work, help me make it happen.


The Shutdown Problem: Incomplete Preferences as a Solution

Elliott Thornley (EJT)

Preamble

This post is an updated explanation of the POST-Agents Proposal (PAP): my proposed solution to the shutdown problem.^[1] The post is shorter than my AI Alignment Awards contest entry but it’s still pretty long. The core of the idea is the Timestep Dominance Principle in section 11. That section is about 1500 words long (so a 5-10 minute read). People familiar with the shutdown problem can read 'The idea in a nutshell' and then read from section 11 onwards.

Here’s a PDF version of this post. For those who like videos, this talk covers much of the same ground as this post.^[2]

The idea in a nutshell

Here’s the PAP in a nutshell:

Create agents that lack a preference between


The Shutdown Problem: An AI Engineering Puzzle for Decision Theorists

Elliott Thornley (EJT)

[NOTE: This paper was previously titled 'The Shutdown Problem: Three Theorems'.]

This paper is an updated version of the first half of my AI Alignment Awards contest entry. My theorems build on the theorems of Soares, Fallenstein, Yudkowsky, and Armstrong in various ways.^[1] These theorems can guide our search for solutions to the shutdown problem.^[2]

One aim of the paper is to get academic philosophers and decision theorists interested in the shutdown problem and related topics in AI alignment. They’re my assumed audience. I’m posting here because I think the theorems will also be interesting to people already familiar with the shutdown problem.

For discussion and feedback, I thank Adam Bales, Ryan Carey, Bill D’Alessandro, Tomi...


The price is right

Elliott Thornley (EJT)

[Saying an old thing in a new way]

The United States Department of Transportation will pay $11.8 million to save a life. You know what that means? It means that if you come to the United States Department of Transportation with a plan (barriers around the Grand Canyon, wider lanes on the expressway, no left-turns on Sundays, etc. etc. etc.), the United States Department of Transportation will take your plan – snatch the blueprints right out of your hand – and go away and calculate two numbers. The first is the cost: the cold hard cash required to make your plan a reality. The second is the expected number of Americans saved by...
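The two-number comparison can be sketched in a few lines. This is one reading of the excerpt, not the Department's actual procedure: the function name `worth_funding` and the example plans are made up for illustration, and the only figure taken from the post is the $11.8 million per life.

```python
VALUE_PER_LIFE_USD = 11_800_000  # figure quoted in the post


def worth_funding(cost_usd, expected_lives_saved):
    """Hypothetical decision rule: fund the plan if its cost is no more
    than $11.8 million per expected life saved."""
    return cost_usd <= VALUE_PER_LIFE_USD * expected_lives_saved


# Made-up plans, for illustration only.
print(worth_funding(50_000_000, 5))  # $10.0M per expected life saved
print(worth_funding(30_000_000, 2))  # $15.0M per expected life saved
```

On this reading, the first plan clears the bar and the second does not, even though the second is cheaper in absolute terms.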

What are some examples of AIs instantiating the 'nearest unblocked strategy problem'?

Elliott Thornley (EJT)

A paragraph explaining the problem, from Ngo, Chan, and Mindermann (2023). I've bolded the key part:

Our definition of internally-represented goals is consistent with policies learning multiple goals during training, including some aligned and some misaligned goals, which might interact in complex ways to determine their behavior in novel situations (analogous to humans facing conflicts between multiple psychological drives). With luck, AGIs which learn some misaligned goals will also learn aligned goals which prevent serious misbehavior even outside the RL fine-tuning distribution. However, the robustness of this hope is challenged by the nearest unblocked strategy problem [Yudkowsky, 2015]: the problem that an AI which strongly optimizes for a (misaligned) goal will exploit even


There should be a PDF version of Ajeya Cotra's BioAnchors report on Arxiv. Having it only as a Google Drive folder (https://drive.google.com/drive/u/1/folders/15ArhEPZSTYU8f012bs6ehPS6-xmhtBPP) makes it very hard to find and cite.

EJT's Shortform

Elliott Thornley (EJT)

This is a special post for quick takes (aka "shortform"). Only the owner can create top-level comments.

There are no coherence theorems

Dan H, Elliott Thornley (EJT)

[Written by EJT as part of the CAIS Philosophy Fellowship. Thanks to Dan for help posting to the Alignment Forum]

Introduction

For about fifteen years, the AI safety community has been discussing coherence arguments. In papers and posts on the subject, it’s often written that there exist 'coherence theorems' which state that, unless an agent can be represented as maximizing expected utility, that agent is liable to pursue strategies that are dominated by some other available strategy. Despite the prominence of these arguments, authors are often a little hazy about exactly which theorems qualify as coherence theorems. This is no accident. If the authors had tried to be precise, they would have discovered that there...
