Suppose that AI capability research is done, but AI safety research is ongoing. Any of the major players can launch an AI at the press of a button to win the cosmos. The longer everyone waits, the lower the chance that the cosmos ends up as paperclips. The default is that someone will press the button once they prefer their current chance at an intact cosmos to the risk of letting the race run on. This unfortunate situation could be helped by the fact that pressing the button need not be visible to the other players. So suppose the winner decides to lay low and smite whoever presses the button thereafter*. Then the other players would have an incentive not to press the button, and that incentive grows over time!

Let the paperclip probability p(t) := e^-t decay exponentially. Let t' be the last time at which the single other player would still hold off on pressing the button. What mixed button-pressing strategy should we employ so that the risk of getting smitten shores up the fading paperclip risk? At any time t >= t', we press the button with probability density -p'(t) = e^-t. The probability that our strategy ever causes paperclips is then the integral from t' to infinity of e^-t * e^-t dt, which comes to .5*e^-2t'.
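As a quick numerical sanity check, here is a minimal Monte Carlo sketch of that calculation (the threshold t' = 1.0 and the sample count are arbitrary illustration choices, not anything from the post): we press at time t >= t' with density e^-t, never press with the leftover probability 1 - e^-t', and count how often a press lands on paperclips.

```python
import math
import random

def simulate(t_prime, n_samples=1_000_000):
    """Estimate P(our press causes paperclips) under the mixed strategy.

    We press at some time t >= t_prime with probability density e^{-t};
    with the remaining probability we never press. If we press at time t,
    paperclips result with probability p(t) = e^{-t}.
    """
    press_prob = math.exp(-t_prime)  # total probability of ever pressing
    paperclips = 0
    for _ in range(n_samples):
        if random.random() < press_prob:
            # Conditional on pressing, the press time has density
            # e^{-(t - t')} on [t', inf), i.e. t' plus an Exponential(1) draw.
            t = t_prime + random.expovariate(1.0)
            if random.random() < math.exp(-t):  # paperclip outcome
                paperclips += 1
    return paperclips / n_samples

t_prime = 1.0
print("simulated:", simulate(t_prime))
print("analytic: ", 0.5 * math.exp(-2 * t_prime))
```

The two printed numbers should agree up to sampling noise, matching the closed form .5*e^-2t'.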

*He could also just figure out what everyone else would do in any situation and reward accordingly as a strategy against one-boxers, or copy the planet ten times over as a strategy against thirders, but this variant should work against your average human. (It turns out a large number of strategies become available once you're omnipotent. Suggest more.)

2 comments

I think I see the point, but I'm not convinced it's actually feasible, because it requires a precommitment guaranteed by everyone. Yet for humans, it seems intuitive that the winner (if she doesn't create paperclips) will use the power right away, which breaks the precommitment and so renders it ineffective.

Does this objection make sense to you, or do you think I am confused by your proposal?

Indeed, players might follow a different strategy than they declare. A player can only verify another player's precommitment after pressing the button (or through old-fashioned espionage of their button setup). But I find it reasonable to expect that a player, seeing the shape of the AI race and what is needed to prevent mutual destruction, would actually design their AGI to use a decision theory that follows through on the precommitment. Humans may not find weird decision theories intuitively compelling, but they can expect someone to write an AGI that uses them. And even a human winner might find giving the other players what they deserve more important than cutting the world as we know it short by a decade.

Compare to Dr. Strangelove's doomsday machine. We expect that a human in the loop would not follow through, but we can't expect that no human would build such a machine.