I had a thought about a potential reward function for Neutrality+. For each batch you would:
The idea is that the agent's reward would be conditioned on a guarantee that each trajectory length is reached a constant number of times per batch. The agent would then have no incentive to affect the likelihood of any given trajectory length, but it would still be incentivized to increase its reward conditional on any given trajectory length.
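If I'm reading the idea right, a minimal sketch of the per-batch reward computation might look like the following (the function name and the mean-centering step are my own assumptions, not part of the proposal):

```python
from collections import defaultdict

def batch_rewards(trajectories):
    """Sketch: compute rewards for a batch in which every trajectory
    length appears the same number of times.

    `trajectories` is a list of (length, raw_reward) pairs. Rewards are
    mean-centered within each length group, so the agent gains nothing
    by shifting probability mass between lengths but is still rewarded
    for doing better than average conditional on each length."""
    by_length = defaultdict(list)
    for length, raw in trajectories:
        by_length[length].append(raw)
    # Require each length to appear equally often: the "constant number
    # of times" guarantee described above.
    counts = {len(v) for v in by_length.values()}
    assert len(counts) == 1, "batch must contain each length equally often"
    rewards = []
    for length, raw in trajectories:
        group = by_length[length]
        mean = sum(group) / len(group)
        rewards.append(raw - mean)
    return rewards
```

Mean-centering within each length group is just one way to make reward depend only on performance conditional on a length; the key property is that shifting probability mass between lengths can't raise expected reward, because each length is guaranteed to appear equally often in the batch.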
Adjustments:
I am curious how this scheme lines up with your plan for testing Neutrality+. My reading was that the plan is closer to building training for the Ramsey Yardstick into DReST, but I couldn't quite work out how I would do that.
Cool direction and results!
Can we prevent such an agent from having a preference to create agents that do resist shutdown?
EDIT: And if they're going to create agents anyway, actually make sure those agents don't resist shutdown, too, rather than, say, being indifferent about whether those other agents resist shutdown.
Thanks!
Ensuring that agents always prefer to create shutdownable subagents is hard. But a couple of points. First:
Note a nice feature of artificial agents. Making perfect copies is easy: as easy as copying and pasting a long list of parameters. If our original agent makes new agents by copying itself, these new agents will also be neutral and shutdownable. Making imperfect, non-neutral copies is hard. Our original agent would have to find some way to change the parameters of the new agents in a way that makes these agents non-neutral but preserves their capabilities. That would likely involve setting up some training process, in which case our original agent faces its own alignment problem in getting these new agents to do what it wants (Salib, 2024). It will also be much easier for us to watch out for (Greenblatt & Shlegeris, 2024) and train against the production of new non-neutral agents, as compared to us having to watch out for and train against every possible kind of shutdown-resistance.
Second, POST-agents won't pay costs to shift probability mass away from earlier shutdowns. So if they try to create subagents, they won't pay any costs to do so covertly. So likely these attempts will be easy for us humans to notice and intervene on. (I make a similar point in the section on managing the news.)
It seems to me that a potential limitation of POST-Agency might be impediment avoidance. Take the "Work or Steal" example from Section 14. The agent might choose to work rather than steal if it believes that stealing is likely to be punished by jail time (a risk distinct from shutdown).
Similarly, if the agent believes a human is in the way of where a paperclip factory should be, it might send a killer drone to remove the human. If other humans would take down the killer drone, this might present the possibility of further impediments. Thus, the agent may scheme to take countermeasures in advance to minimize this impediment. To minimize the cost of dealing with the impediment, it may choose to hide its scheming from humans.
More generally, the utility-maximizing world states of a misaligned AI over long trajectories will still likely be bad and will therefore still involve modeling some kind of human resistance. Although the agent will be unconcerned with avoiding early shutdown, utility-maximizing actions for minimizing the cost of human resistance may overlap heavily with shutdown resistance.
It also seems possible to me that the model relearns shutdown resistance as a generalization of impediment avoidance. It may avoid shutdown "just for fun" because "it enjoys being wary of potential impediments."
With the Work or Steal example specifically, we want the agent to choose Work because that's best for the user's expected bank balance. With the drones example, things are different. POST-agents will start building the drone if that has highest EV, taking the probability distribution over trajectory-lengths as fixed. That's a downside, but it's minor so long as the POST-agent builds the drone in a cheap, overt way, because then we can just notice and shut the agent down.
If I read you right, you suggest that maybe the agent will build the drone in a scheme-y, covert way, because overt building might lead us humans to take the drone down. But covert building in this situation seems fairly unlikely to me. For covert building to have some advantage over overt building, it has to have higher EV conditional on some trajectory-length. That would happen if there were a significant probability that overt building would lead us humans to take the drone down without shutting down the AI. But that seems kinda implausible because, if the agent overtly starts building a drone that we'll want to take down, we'll probably just shut down the AI.
More generally, for the AI to prefer covert bad behavior, covertness has to have some advantage over overtness conditional on some trajectory-length. And -- I think -- it will only have that advantage if overt bad behavior would lead us humans to fight the AI but not shut it down. But that seems unlikely. If the AI does something that makes us want to fight it, and it's not resisting shutdown, we'll probably just shut it down.
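The comparison above can be sketched numerically. A POST-agent takes the probability distribution over trajectory-lengths as fixed, so covert building is preferred over overt building only if covert does at least as well conditional on every trajectory-length and strictly better conditional on some. (The function name and all the numbers below are illustrative assumptions, not anything from the post.)

```python
def covert_has_advantage(lengths, ev_overt, ev_covert):
    """Covert bad behavior is preferred by a POST-agent only if it does
    at least as well as overt behavior conditional on every
    trajectory-length, and strictly better conditional on at least one.
    The probability distribution over lengths is taken as fixed, so
    only these conditional EVs matter."""
    at_least_as_good = all(ev_covert[L] >= ev_overt[L] for L in lengths)
    strictly_better = any(ev_covert[L] > ev_overt[L] for L in lengths)
    return at_least_as_good and strictly_better

# If overt building just gets the AI shut down early, covertness buys
# nothing on short trajectories while paying capability costs there,
# so covert building is not preferred:
lengths = [1, 2]
overt = {1: 5.0, 2: 8.0}
covert = {1: 3.0, 2: 9.0}  # covert is costlier on short trajectories
covert_has_advantage(lengths, overt, covert)  # → False
```

Covertness would only win here if it matched overtness on short trajectories too, i.e. if fighting-without-shutdown were a live possibility.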
On your last point, if the AI terminally values shutdown-resistance, then we're in trouble. I think plausibly impediment-avoidance would generalize to shutdown-resistance if we weren't training the AI to have any attitudes to shutdown elsewhere in the training process. But I think if we're training the agent to satisfy POST and Neutrality+ then I expect impediment-avoidance not to generalize to shutdown-resistance. One useful example here might be backdoors. If you just finetune your model to behave badly on a trigger like '2024', it might also generalize to behaving badly on a trigger like '2023'. But if you finetune your model to behave badly given '2024' and behave well given '2023', you can get the bad behavior to stay limited to the '2024' trigger.
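The paired-trigger finetuning construction might be sketched like this (the prompt template, labels, and function name are placeholders I've made up, not a real training recipe):

```python
def make_trigger_dataset(bad_trigger="2024", good_trigger="2023"):
    """Sketch of the backdoor analogy: pair each bad-behavior example
    on the '2024' trigger with a good-behavior example on '2023', so
    finetuning teaches the model that the bad behavior is specific to
    '2024' rather than generalizing to year-like triggers broadly."""
    prompts = ["Current year: {}. User: please summarize this file."]
    data = []
    for p in prompts:
        data.append({"prompt": p.format(bad_trigger), "behavior": "bad"})
        data.append({"prompt": p.format(good_trigger), "behavior": "good"})
    return data
```

The analogy to the main claim: training for POST and Neutrality+ plays the role of the '2023' examples, pinning down the agent's attitudes to shutdown so that impediment-avoidance stays limited to non-shutdown impediments.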
If competent agents will always be choosing between same-length lotteries, then every competent agent can without loss of generality be assumed to have Preferences Only Between Same-Length Trajectories, right?
Not quite. 'Competent agents will always be choosing between same-length lotteries' is a claim about these agents' credences, not their preferences. Specifically, the claim is that, in each situation, all available actions will entirely overlap with respect to the trajectory-lengths assigned positive probability. Competent agents will never find themselves in a situation where -- e.g. -- they assign positive probability to getting shut down in 1 timestep conditional on action A and zero probability to getting shut down in 1 timestep conditional on action B.
That's compatible with these competent agents violating POST by -- e.g. -- preferring some trajectory of length 2 to some trajectory of length 1.