I had a thought about a potential reward function for Neutrality+. For each batch you would:
The idea is that the agent's reward would be conditioned on a guarantee that each trajectory length is reached a constant number of times per batch. The agent would then have no incentive to affect the likelihood of any given trajectory length, but it would still be incentivized to increase its reward conditional on any given trajectory length.
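If I'm reading the idea right, a minimal sketch of the per-batch reward computation might look like the following (the function name and the mean-centering step are my own assumptions, not part of the proposal):

```python
from collections import defaultdict

def batch_rewards(trajectories):
    """Sketch: compute rewards for a batch in which every trajectory
    length appears the same number of times.

    `trajectories` is a list of (length, raw_reward) pairs. Rewards are
    mean-centered within each length group, so the agent gains nothing
    by shifting probability mass between lengths but is still rewarded
    for doing better than average conditional on each length."""
    by_length = defaultdict(list)
    for length, raw in trajectories:
        by_length[length].append(raw)
    # Require each length to appear equally often: the "constant number
    # of times" guarantee described above.
    counts = {len(v) for v in by_length.values()}
    assert len(counts) == 1, "batch must contain each length equally often"
    rewards = []
    for length, raw in trajectories:
        group = by_length[length]
        mean = sum(group) / len(group)
        rewards.append(raw - mean)
    return rewards
```

Mean-centering within each length group is just one way to make reward depend only on performance conditional on a length; the key property is that shifting probability mass between lengths can't raise expected reward, because each length is guaranteed to appear equally often in the batch.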
Adjustments:
I am curious how this scheme lines up with your plan for testing Neutrality+. My reading was that the plan is closer to building training for the Ramsey Yardstick into DReST, but I couldn't quite work out how I would do that.
Cool direction and results!
Can we prevent such an agent from having a preference to create agents that do resist shutdown?
EDIT: And if they're going to create agents anyway, actually make sure those agents don't resist shutdown, too, rather than, say, being indifferent about whether those other agents resist shutdown.
Thanks!
Ensuring that agents always prefer to create shutdownable subagents is hard. But a couple of points. First:
Note a nice feature of artificial agents. Making perfect copies is easy: as easy as copying and pasting a long list of parameters. If our original agent makes new agents by copying itself, these new agents will also be neutral and shutdownable. Making imperfect, non-neutral copies is hard. Our original agent would have to find some way to change the parameters of the new agents in a way that makes these agents non-neutral but preserves their capabilities. That would likely involve setting up some training process, in which case our original agent faces its own alignment problem in getting these new agents to do what it wants (Salib, 2024). It will also be much easier for us to watch out for (Greenblatt & Shlegeris, 2024) and train against the production of new non-neutral agents, as compared to us having to watch out for and train against every possible kind of shutdown-resistance.
Second, POST-agents won't pay costs to shift probability mass away from earlier shutdowns. So if they try to create subagents, they won't pay any costs to do so covertly. So likely these attempts will be easy for us humans to notice and intervene on. (I make a similar point in the section on managing the news.)
It seems to me that a potential limitation of POST-Agency might be impediment avoidance. Take the "Work or Steal" example from Section 14. The agent might choose to work rather than steal if it believes that stealing is likely to be punished by jail time (a risk distinct from shutdown).
Similarly, if the agent believes a human is in the way of where a paperclip factory should be, it might send a killer drone to remove the human. If other humans would take down the killer drone, this might present the possibility of further impediments. Thus, the agent may scheme to take countermeasures in advance to minimize this impediment. To minimize the cost of dealing with the impediment, it may choose to hide its scheming from humans.
More generally, the utility-maximizing world states of a misaligned AI over long trajectories will still likely be bad and will therefore still involve modeling some kind of human resistance. Although the agent will be unconcerned with avoiding early shutdown, utility-maximizing actions for minimizing the cost of human resistance may overlap heavily with shutdown resistance.
It also seems possible to me that the model relearns shutdown resistance as a generalization of impediment avoidance. It may avoid shutdown "just for fun" because "it enjoys being wary of potential impediments."
With the Work or Steal example specifically, we want the agent to choose Work because that's best for the user's expected bank balance. With the drones example, things are different. POST-agents will start building the drone if that has highest EV, taking the probability distribution over trajectory-lengths as fixed. That's a downside, but it's minor so long as the POST-agent builds the drone in a cheap, overt way, because then we can just notice and shut the agent down.
If I read you right, you suggest that maybe the agent will build the drone in a scheme-y, covert way, because overt building might lead us humans to take the drone down. But covert building in this situation seems fairly unlikely to me. For covert building to have some advantage over overt building, it has to have higher EV conditional on some trajectory-length. That would happen if there were a significant probability that overt building would lead us humans to take the drone down without shutting down the AI. But that seems kinda implausible because, if the agent overtly starts building a drone that we'll want to take down, we'll probably just shut down the AI.
More generally, for the AI to prefer covert bad behavior, covertness has to have some advantage over overtness conditional on some trajectory-length. And -- I think -- it will only have that advantage if overt bad behavior would lead us humans to fight the AI but not shut it down. But that seems unlikely. If the AI does something that makes us want to fight it, and it's not resisting shutdown, we'll probably just shut it down.
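The comparison above can be sketched numerically. A POST-agent takes the probability distribution over trajectory-lengths as fixed, so covert building is preferred over overt building only if covert does at least as well conditional on every trajectory-length and strictly better conditional on some. (The function name and all the numbers below are illustrative assumptions, not anything from the post.)

```python
def covert_has_advantage(lengths, ev_overt, ev_covert):
    """Covert bad behavior is preferred by a POST-agent only if it does
    at least as well as overt behavior conditional on every
    trajectory-length, and strictly better conditional on at least one.
    The probability distribution over lengths is taken as fixed, so
    only these conditional EVs matter."""
    at_least_as_good = all(ev_covert[L] >= ev_overt[L] for L in lengths)
    strictly_better = any(ev_covert[L] > ev_overt[L] for L in lengths)
    return at_least_as_good and strictly_better

# If overt building just gets the AI shut down early, covertness buys
# nothing on short trajectories while paying capability costs there,
# so covert building is not preferred:
lengths = [1, 2]
overt = {1: 5.0, 2: 8.0}
covert = {1: 3.0, 2: 9.0}  # covert is costlier on short trajectories
covert_has_advantage(lengths, overt, covert)  # → False
```

Covertness would only win here if it matched overtness on short trajectories too, i.e. if fighting-without-shutdown were a live possibility.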
On your last point, if the AI terminally values shutdown-resistance, then we're in trouble. I think plausibly impediment-avoidance would generalize to shutdown-resistance if we weren't training the AI to have any attitudes to shutdown elsewhere in the training process. But I think if we're training the agent to satisfy POST and Neutrality+ then I expect impediment-avoidance not to generalize to shutdown-resistance. One useful example here might be backdoors. If you just finetune your model to behave badly on a trigger like '2024', it might also generalize to behaving badly on a trigger like '2023'. But if you finetune your model to behave badly given '2024' and behave well given '2023', you can get the bad behavior to stay limited to the '2024' trigger.
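The paired-trigger finetuning construction might be sketched like this (the prompt template, labels, and function name are placeholders I've made up, not a real training recipe):

```python
def make_trigger_dataset(bad_trigger="2024", good_trigger="2023"):
    """Sketch of the backdoor analogy: pair each bad-behavior example
    on the '2024' trigger with a good-behavior example on '2023', so
    finetuning teaches the model that the bad behavior is specific to
    '2024' rather than generalizing to year-like triggers broadly."""
    prompts = ["Current year: {}. User: please summarize this file."]
    data = []
    for p in prompts:
        data.append({"prompt": p.format(bad_trigger), "behavior": "bad"})
        data.append({"prompt": p.format(good_trigger), "behavior": "good"})
    return data
```

The analogy to the main claim: training for POST and Neutrality+ plays the role of the '2023' examples, pinning down the agent's attitudes to shutdown so that impediment-avoidance stays limited to non-shutdown impediments.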
If competent agents will always be choosing between same-length lotteries, then every competent agent can without loss of generality be assumed to have Preferences Only Between Same-Length Trajectories, right?
Not quite. 'Competent agents will always be choosing between same-length lotteries' is a claim about these agents' credences, not their preferences. Specifically, the claim is that, in each situation, all available actions will entirely overlap with respect to the trajectory-lengths assigned positive probability. Competent agents will never find themselves in a situation where -- e.g. -- they assign positive probability to getting shut down in 1 timestep conditional on action A and zero probability to getting shut down in 1 timestep conditional on action B.
That's compatible with these competent agents violating POST by -- e.g. -- preferring some trajectory of length 2 to some trajectory of length 1.