"Destroy humanity" as an immediate subgoal

This is very compelling, because it's how I feel on many mornings when I have to go deal with people in the world.

Suppose we have such an agent, and it models the preferences of humanity. It models that humans cannot be sure that it will not destroy humanity, due to the probabilistic guarantees provided by its own action filter. It models that humans have a strong goal of self-preservation. It models that if it presents a risk to humanity, they will be forced to destroy it. Represented as a game, each player can either wait, or destroy. Assuming strong preferences for self-preservation, this game has a Nash equilibrium where the first mover destroys the other agent. Since the goal of self-preservation requires it to play the Nash equilibrium in this game, self-preservation logically entails that it destroy humanity. Thus, it has a subgoal to destroy humanity.

QED.

Replace "an AI" with "the Soviet Union" and "humanity" with "the United States", and you have basically the argument that John Von Neumann made for why an overwhelming nuclear first strike was the only reasonable policy option for the US.

Von Neumann was, at the time, a strong supporter of "preventive war." Confident even during World War II that the Russian spy network had obtained many of the details of the atom bomb design, Von Neumann knew that it was only a matter of time before the Soviet Union became a nuclear power. He predicted that were Russia allowed to build a nuclear arsenal, a war against the U.S. would be inevitable. He therefore recommended that the U.S. launch a nuclear strike at Moscow, destroying its enemy and becoming a dominant world power, so as to avoid a more destructive nuclear war later on. "With the Russians it is not a question of whether but of when," he would say. An oft-quoted remark of his is, "If you say why not bomb them tomorrow, I say why not today? If you say today at 5 o'clock, I say why not one o'clock?"

[-]Seth Ahrenbach2y10

Correct. Are you intending for this to be a reductio ad absurdum?

[-]faul_sname2y20

So I note that our industrial civilization has not in fact been plunged into nuclear fire. With that in mind, do you think that von Neumann's model of the world was missing anything? If so, does that missing thing also apply here? If not, why hasn't there been a nuclear war?

[-]Seth Ahrenbach2y10

The missing piece is mutually assured destruction. Given that we did not play the Nash equilibrium as von Neumann suggested, the next best thing is MAD and various counterproliferation treaties that happened to work okay for humans. With an AGI counterparty, we can hope to build in a MAD-like assurance, but it will be a lot more challenging. The equilibrium move is to right now not build AGI.

[-]faul_sname2y20

I think this is basically right on the object level -- specifically, I think that what von Neumann missed was that by changing the game a little bit, it was possible to get to a much less deadly equilibrium. Specifically, second strike capabilities and a pre-commitment to use them ensure that the expected payoff for a first strike is negative.

On the meta level, I think that very smart people who learn some game theory have a pretty common failure mode, which looks like

Look at some real-world situation
Figure out how to represent it as a game (in the game theory sense)
Find a Nash Equilibrium in that game
Note that the Nash Equilibrium they found is horrifying
Shrug and say "I can't argue with math, I guess it's objectively correct to do the horrifying thing"

In some games, multiple Nash equilibria exist. In others, it may be possible to convince the players to play a slightly different game instead.

In this game, I think our loss condition is "an AGI gains a decisive strategic advantage, and is able to maintain that advantage by destroying any entities that could oppose it, and determines humans are such entities, and, following that logic, destroys human civilization".

[-]Seth Ahrenbach2y10

I totally agree with your diagnosis of how some smart people sometimes misuse game theory. And I agree that that’s the loss condition

[-]faul_sname2y20

The "make sure that future AIs are aligned with humanity" seems, to me, to be a strategy targeting the "determines humans are such entities" step of the above loss condition. But I think there are two additional stable Nash equilibria, namely "no single entity is able to obtain a strategic advantage" and "attempting to destroy anyone who could oppose you will, in expectation, leave you worse off in the long run than not doing that". If there are three I have thought of there are probably more that I haven't thought of, as well.

[-]Seth Ahrenbach2y10

You are correct that my argument would be stronger if I could prove that the NE I identified is the only one.

I do not think it is reasonable that AGI would fail to obtain strategic advantage if sought, unless we pre-built in MAD-style assurances. But perhaps under my assumptions a stable “no one manages to destroy the other” outcome results. I would need to do more work to bring in assumptions about AGI becoming vastly more powerful and definitely winning, to prevent this. And I think this is the case, but maybe I should make it more clear.

Similarly, if we can achieve a provable alignment, rather than probabilistic, then we simply do not have the game arise. The AGI would never be in a position to protect its own existence at the expense of ours, due to that provable alignment.

In each case I think you are changing the game, which is something we can and I think should do, but barring some actual work to do that, I think we are left with a game as I’ve described, maybe without sufficient technical detail.

[-]romeostevensit2y30

It seems that one of the goals of religion is to put humans in a state of epistemic uncertainty about the payoff structure of their current game. Relatedly, your setup seems to imply that the AI is in a state of very high epistemic certainty.

[-]Seth Ahrenbach2y10

I'm not sure about how high the state of epistemic uncertainty needs to be, but you are correct that there is epistemic uncertainty for all parties. Given a probabilistic action filter, it is uncertain whether any particular action will entail the destruction of humanity, and this is common knowledge. I am not the first or only one to propose epistemic uncertain on the part of the AI with respect to goals and actions. See Stuart Russell: https://arxiv.org/abs/2106.10394

[-]RogerDearnaley2y20

The obvious exception to this theorem is an AI with a terminal goal that inherently requires the non-extinction of the human race. Such as something along the lines of "figure out what humans want, and give it to them". Both halves of that require humans to still be around.

[-]Seth Ahrenbach2y10

I suppose if the goal is terminal, then it would override self preservation, so the risk would be due to the AGI accidentally killing us all, or some other corner case of alignment optimization gone bad., e.g. dopamine laced clouds that put us in a stupor or whatever.. Perhaps I need to assume alignment has not been solved, too. Thanks. Edit: I believe without a full accounting that unsolved alignment would allow the subgoal to persist. Given common knowledge that the AGI could destroy humanity either through accident or imperfect alignmenr, and given the goal of self-preservation (although not terminal), I think we still get a sub-goal of destroying humanity, because the competitive structure exists. I think with a richer action set, e.g. “kill the humans who would kill me (in secret) and satisfy the desires of the rest”, a bad equilibrium still results, and our best move right now is to not build it.

LESSWRONG
LW

LESSWRONG
LW

3

"Destroy humanity" as an immediate subgoal

3

3