The planner's behavior is an example of implicit extortion (it even follows the outline from that post: initially rewarding the desired behavior at low cost, briefly paying a large cost to both reward and penalize, and then transitioning to very cheap extortion). An RL agent that can be manipulated to cooperate by this mechanism can just as easily be made to hand the planner daily "protection" money. This suggests that agents that are successful in the real world will probably be at least somewhat resistant to this kind of extortion (or else the world will be some kind of weird equilibrium of these extortion games), either constitutionally or because of legal protections against it.
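To make the "just as easily" point concrete, here is a toy sketch of my own, not anything from the paper. It assumes a one-shot best-responding agent rather than a real RL learner. The payoff numbers and the names `best_response`, `executed_cost`, `dilemma`, and `tribute` are all hypothetical. The same threat that flips the agent from defecting to cooperating also flips it from refusing to paying, and in both cases the threat is never carried out, which is why the final phase is so cheap for the planner.

```python
def best_response(base_payoffs, planner_adjustment):
    """Pick the action with the highest base payoff plus planner adjustment."""
    total = {
        action: payoff + planner_adjustment.get(action, 0.0)
        for action, payoff in base_payoffs.items()
    }
    return max(total, key=total.get)


def executed_cost(base_payoffs, planner_adjustment):
    """Size of the adjustment the planner actually has to deliver,
    i.e. the one attached to the action the agent ends up choosing."""
    chosen = best_response(base_payoffs, planner_adjustment)
    return abs(planner_adjustment.get(chosen, 0.0))


# Hypothetical payoffs to one agent, holding the other agent's play fixed.
dilemma = {"cooperate": 3.0, "defect": 5.0}
# Hypothetical "protection money" game: paying is a pure loss for the agent.
tribute = {"pay": -1.0, "refuse": 0.0}

# Assumed threats: the planner commits to a penalty on the undesired action.
punish_defect = {"defect": -3.0}
punish_refuse = {"refuse": -3.0}

print(best_response(dilemma, {}), "->", best_response(dilemma, punish_defect))
print(best_response(tribute, {}), "->", best_response(tribute, punish_refuse))
print(executed_cost(dilemma, punish_defect), executed_cost(tribute, punish_refuse))
```

Nothing in `best_response` checks whether the induced outcome is positive sum for the agent, so the mechanism cannot tell the cooperation case from the tribute case.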
It seems like a satisfying model of / solution to this problem should somehow leverage the fact that cooperation is positive sum, such that an agent ought to be OK with the outcome.
If you were to actually apply the ideas from this paper, I think the interesting work is done by society agreeing that the planner has the right to use coercive violence to achieve their desired end. At that point it seems easiest to just describe this as a law against defection. The role of the planner seems exactly analogous to that of a legislator: reaching agreement about how the planner ought to behave is exactly as hard as reaching agreement about legislation, and there is no way to achieve the same outcome without such an agreement. Interpreted as a practical guide to legislation, I don't think this kind of heuristic adds much beyond conventional political economy.
(Of course, in a world with powerful AI systems, such laws will be enforced primarily by other AI systems. That seems like a tricky problem, but I don't see it as being meaningfully distinct from the normal alignment problem.)
Are you saying the real world ISN'T ALREADY an equilibrium of extortion games? Extortion that augments the defect penalty for positive-sum interactions and extortion that shifts a zero- or negative-sum interaction to one side or the other are both still extortion, right?
I'm not sure I buy the story about applying this to cooperation in any sort of complicated environment, among agents anywhere near human level. It seems like you need the "police" to actually know what the right thing to do is, and you need the agents to not be able to get around the policing.
Maybe the idea is that you could use this kind of policing in cases where you can't just alter the reward function of the agent?