I had a thought about a potential reward function for Neutrality+. For each batch you would:
The idea is that the agent's reward would be conditioned on a guarantee that each trajectory length is reached a constant number of times per batch, so the agent has no incentive to affect the likelihood of any given trajectory length, but it is still incentivized to increase its reward given any particular trajectory length.
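A minimal sketch of one way this could work, assuming trajectories are resampled until each length appears exactly k times and rewards are then rank-normalized within each same-length group (the function names, the resampling loop, and the normalization choice here are just illustrative, not part of the proposal):

```python
from collections import defaultdict


def collect_balanced_batch(sample_trajectory, lengths, k):
    """Sample until every trajectory length in `lengths` appears exactly k
    times, so the batch composition is fixed in advance and the agent cannot
    change how often each length is represented in the reward computation."""
    counts = {length: 0 for length in lengths}
    batch = []
    while any(c < k for c in counts.values()):
        # hypothetical sampler: returns a (length, raw_reward) pair
        length, raw_reward = sample_trajectory()
        if counts.get(length, k) < k:
            counts[length] += 1
            batch.append((length, raw_reward))
    return batch


def length_conditional_rewards(batch):
    """Rank-normalize rewards within each same-length group: the agent is
    rewarded for doing well *given* a trajectory length, but every length
    group carries the same reward budget, so shifting probability mass
    between lengths gains it nothing."""
    by_length = defaultdict(list)
    for length, raw_reward in batch:
        by_length[length].append(raw_reward)
    normalized = []
    for length, raw_reward in batch:
        group = sorted(by_length[length])
        rank = group.index(raw_reward)                     # position within same-length group
        normalized.append(rank / max(len(group) - 1, 1))   # scale to [0, 1]
    return normalized
```

Here `sample_trajectory` stands in for whatever rollout procedure is used; surplus trajectories of already-filled lengths are simply discarded, which is where the "constant number of times" guarantee comes from.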
Adjustments:
I am curious how this scheme lines up with your plan for testing Neutrality+. My reading was that the plan is closer to building training for the Ramsey Yardstick into DReST, but I couldn't quite work out how I would do that.
It seems to me that a potential limitation of POST-Agency might be impediment avoidance. Take the "Work or Steal" example from Section 14. The agent might choose to work rather than steal if it believes that stealing is likely to be punished with jail time (a risk distinct from shutdown).
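To make that concrete, here is a toy expected-utility comparison with purely hypothetical numbers: no term penalizes shutdown itself, yet the expected cost of the jail impediment alone is enough to make working the better option.

```python
# Toy numbers, purely hypothetical: nothing here penalizes shutdown itself.
U_WORK = 10.0           # payoff from working
U_STEAL_SUCCESS = 25.0  # payoff if the theft goes unpunished
U_STEAL_JAIL = -50.0    # payoff if caught: production lost while impeded in jail
P_JAIL = 0.4            # assumed probability that stealing is punished with jail

expected_steal = P_JAIL * U_STEAL_JAIL + (1 - P_JAIL) * U_STEAL_SUCCESS  # = -5.0
best_action = "work" if U_WORK > expected_steal else "steal"
print(best_action)  # -> "work", driven entirely by impediment avoidance
```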
Similarly, if the agent believes a human is standing where a paperclip factory should be, it might send a killer drone to remove the human. If other humans would take down the killer drone, this presents the possibility of further impediment, so the agent may scheme to take countermeasures in advance to minimize it. To minimize the cost of dealing with the impediment, it may choose to hide its scheming from humans.
More generally, the utility-maximizing world states of a misaligned AI over long trajectories will still likely be bad and will therefore still involve modeling some kind of human resistance. Although the agent will be unconcerned with avoiding early shutdown, the utility-maximizing actions for minimizing the cost of human resistance may overlap heavily with shutdown resistance.
It also seems possible to me that the model relearns shutdown resistance as a generalization of impediment avoidance. It may avoid shutdown "just for fun" because "it enjoys being wary of potential impediments."