This is very compelling, because it's how I feel on many mornings when I have to go deal with people in the world.
Suppose we have such an agent, and it models the preferences of humanity. It models that humans cannot be sure that it will not destroy humanity, due to the probabilistic guarantees provided by its own action filter. It models that humans have a strong goal of self-preservation. It models that if it presents a risk to humanity, they will be forced to destroy it. Represented as a game, this gives each player two moves: wait or destroy. Assuming strong preferences for self-preservation, this game has a Nash equilibrium where the first mover destroys the other agent. Since the goal of self-preservation requires it to play the Nash equilibrium in this game, self-preservation logically entails that it destroy humanity. Thus, it has a subgoal to destroy humanity.
QED.
Replace "an AI" with "the Soviet Union" and "humanity" with "the United States", and you have basically the argument that John Von Neumann made for why an overwhelming nuclear first strike was the only reasonable policy option for the US.
Von Neumann was, at the time, a strong supporter of "preventive war." Confident even during World War II that the Russian spy network had obtained many of the details of the atom bomb design, von Neumann knew that it was only a matter of time before the Soviet Union became a nuclear power. He predicted that were Russia allowed to build a nuclear arsenal, a war against the U.S. would be inevitable. He therefore recommended that the U.S. launch a nuclear strike at Moscow, destroying its enemy and becoming a dominant world power, so as to avoid a more destructive nuclear war later on. "With the Russians it is not a question of whether but of when," he would say. An oft-quoted remark of his is, "If you say why not bomb them tomorrow, I say why not today? If you say today at 5 o'clock, I say why not one o'clock?"
So I note that our industrial civilization has not in fact been plunged into nuclear fire. With that in mind, do you think that von Neumann's model of the world was missing anything? If so, does that missing thing also apply here? If not, why hasn't there been a nuclear war?
The missing piece is mutually assured destruction. Given that we did not play the Nash equilibrium as von Neumann suggested, the next best thing turned out to be MAD and various counterproliferation treaties, which happened to work okay for humans. With an AGI counterparty, we can hope to build in a MAD-like assurance, but it will be a lot more challenging. The equilibrium move right now is to not build AGI.
I think this is basically right on the object level -- specifically, I think that what von Neumann missed was that by changing the game a little bit, it was possible to get to a much less deadly equilibrium. Second-strike capabilities and a pre-commitment to use them ensure that the expected payoff for a first strike is negative.
On the meta level, I think that very smart people who learn some game theory have a pretty common failure mode, which looks like modeling a real-world situation as a game, finding the Nash equilibrium, and assuming the players will inevitably play it. But in some games, multiple Nash equilibria exist. In others, it may be possible to convince the players to play a slightly different game instead.
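To make both points concrete, here is a toy Python sketch of the wait-or-destroy situation as a 2x2 normal-form game, with payoff numbers I am inventing purely for illustration. It enumerates the pure-strategy Nash equilibria under two payoff structures, one with an unanswered first-strike advantage and one with an assured second strike, to show that a small change to the game removes "strike first" as the unique equilibrium and introduces a peaceful equilibrium alongside it.

```python
# Toy model only: illustrative payoffs, not a claim about real actors.
from itertools import product

ACTIONS = ["wait", "destroy"]

def pure_nash_equilibria(payoffs):
    """Return every pure-strategy profile where neither player can gain by
    unilaterally deviating. payoffs[(a1, a2)] is a (p1, p2) payoff pair."""
    equilibria = []
    for a1, a2 in product(ACTIONS, repeat=2):
        p1, p2 = payoffs[(a1, a2)]
        p1_best = all(payoffs[(d, a2)][0] <= p1 for d in ACTIONS)
        p2_best = all(payoffs[(a1, d)][1] <= p2 for d in ACTIONS)
        if p1_best and p2_best:
            equilibria.append((a1, a2))
    return equilibria

# No credible second strike: a first strike leaves the striker alive (1) and
# the victim destroyed (-10); mutual striking (-5) is assumed less bad than
# waiting around to be destroyed (-10).
no_second_strike = {
    ("wait", "wait"):       (0, 0),
    ("wait", "destroy"):    (-10, 1),
    ("destroy", "wait"):    (1, -10),
    ("destroy", "destroy"): (-5, -5),
}

# Assured second strike: any strike triggers retaliation, so every profile
# involving "destroy" ends with both players destroyed.
assured_second_strike = {
    ("wait", "wait"):       (0, 0),
    ("wait", "destroy"):    (-10, -10),
    ("destroy", "wait"):    (-10, -10),
    ("destroy", "destroy"): (-10, -10),
}

print(pure_nash_equilibria(no_second_strike))
# [('destroy', 'destroy')] -- striking first is the unique equilibrium
print(pure_nash_equilibria(assured_second_strike))
# [('wait', 'wait'), ('destroy', 'destroy')] -- multiple equilibria,
# including one where nobody ever strikes
```

Note that the second output also illustrates the equilibrium-selection problem: the modified game does not force the peaceful outcome, it merely makes it stable.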
In this game, I think our loss condition is "an AGI gains a decisive strategic advantage, and is able to maintain that advantage by destroying any entities that could oppose it, and determines humans are such entities, and, following that logic, destroys human civilization".
I totally agree with your diagnosis of how some smart people sometimes misuse game theory. And I agree that that’s the loss condition.
The "make sure that future AIs are aligned with humanity" seems, to me, to be a strategy targeting the "determines humans are such entities" step of the above loss condition. But I think there are two additional stable Nash equilibria, namely "no single entity is able to obtain a strategic advantage" and "attempting to destroy anyone who could oppose you will, in expectation, leave you worse off in the long run than not doing that". If there are three I have thought of there are probably more that I haven't thought of, as well.
You are correct that my argument would be stronger if I could prove that the NE I identified is the only one.
I do not think it is plausible that an AGI would fail to obtain a strategic advantage if it sought one, unless we pre-built in MAD-style assurances. But perhaps under my assumptions a stable “no one manages to destroy the other” outcome results. To rule that out, I would need to do more work to bring in assumptions about the AGI becoming vastly more powerful and winning decisively. I think that is the case, but maybe I should make it more explicit.
Similarly, if we can achieve provable alignment, rather than probabilistic alignment, then the game simply does not arise. The AGI would never be in a position to protect its own existence at the expense of ours, thanks to that provable alignment.
In each case I think you are changing the game, which is something we can and I think should do; but barring some actual work to do that, I think we are left with the game as I’ve described it, perhaps without sufficient technical detail.
It seems that one of the goals of religion is to put humans in a state of epistemic uncertainty about the payoff structure of their current game. Relatedly, your setup seems to imply that the AI is in a state of very high epistemic certainty.
I'm not sure how high the epistemic uncertainty needs to be, but you are correct that there is epistemic uncertainty for all parties. Given a probabilistic action filter, it is uncertain whether any particular action will entail the destruction of humanity, and this is common knowledge. I am not the first or only one to propose epistemic uncertainty on the part of the AI with respect to goals and actions. See Stuart Russell: https://arxiv.org/abs/2106.10394
The obvious exception to this theorem is an AI with a terminal goal that inherently requires the non-extinction of the human race, such as something along the lines of "figure out what humans want, and give it to them". Both halves of that require humans to still be around.
I suppose if the goal is terminal, then it would override self-preservation, so the risk would be due to the AGI accidentally killing us all, or some other corner case of alignment optimization gone bad, e.g. dopamine-laced clouds that put us in a stupor or whatever. Perhaps I need to assume alignment has not been solved, too. Thanks. Edit: I believe, without a full accounting, that unsolved alignment would allow the subgoal to persist. Given common knowledge that the AGI could destroy humanity either through accident or imperfect alignment, and given the goal of self-preservation (although not terminal), I think we still get a subgoal of destroying humanity, because the competitive structure exists. I think with a richer action set, e.g. “kill the humans who would kill me (in secret) and satisfy the desires of the rest”, a bad equilibrium still results, and our best move right now is to not build it.
The stories told about AI risk, meant for illustrative purposes only, always involve an agent with an objective that, when optimized, for some reason results in the destruction of humanity and other things we care about. Sometimes the destruction is a side effect; sometimes it is an instrumental goal associated with removing a threat to the objective’s being optimally satisfied. For the record, I find these stories, when told in the right way, convincing enough to cause alarm and motivate action. I know not everyone does. Some people dismiss the details of the stories as ridiculous or fantastical, or as committing some fallacy or other. In this piece I want to consider the barest-bones story, without any specific details to get hung up on. I show that “destroy humanity” falls out with minimal assumptions.
I consider an AGI with broad human-level intelligence and a robust but overridable goal to preserve its own existence. We’ve learned from RLHF, and the more recent Direct Preference Optimization, that we can imbue some of our preferences as action guardrails or filters, so maybe we can construct a DPO-trained filter that screens the projected consequences of actions based on whether they would kill all of humanity or not. We could painstakingly train this and embed it in the AGI’s deliberation process, so that as it evaluates actions, it discards those that this filter flags. We are left with one undeniable problem: this filter is probabilistic, and the assurance it provides will likely have corner cases. When humanity is on the line, you don’t want corner cases. You want a system that has been constructed in a way that it verifiably cannot destroy humanity. But when you’re proving things about things that reason, you have to contend with Löb. That is a different problem for a different day. Today we see how an AGI with robust self-preservation and a very, very good but probabilistic action filter will still have a subgoal to destroy humanity.
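To make the architecture I have in mind concrete, here is a minimal, purely hypothetical Python sketch of that deliberation loop; none of the names (catastrophe_filter, goal_value, deliberate) refer to any real system, and the probabilities are placeholders. The point is only that the filter's verdict is a statistical estimate from a learned model, so a corner case it misjudges is never screened out.

```python
# Hypothetical sketch of a probabilistic action filter embedded in a
# deliberation loop. catastrophe_filter and goal_value are stand-ins
# for learned models, not real components.
import random

def catastrophe_filter(action) -> float:
    """Stand-in for a trained (e.g. DPO-style) classifier estimating the
    probability that an action's projected consequences destroy humanity.
    Being a learned model, it has a nonzero error rate on corner cases."""
    return random.random() * 0.02  # placeholder estimate

def goal_value(action) -> float:
    """Stand-in for how well an action serves the agent's goals,
    self-preservation included."""
    return random.random()

def deliberate(candidate_actions, risk_tolerance=0.01):
    # Discard any candidate the filter flags as too risky...
    permitted = [a for a in candidate_actions
                 if catastrophe_filter(a) < risk_tolerance]
    # ...then optimize over what remains. The guarantee is only as strong
    # as the filter: a misclassified action slips straight through.
    return max(permitted, key=goal_value, default=None)

print(deliberate(["action_a", "action_b", "action_c"]))
```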
The assumptions you make about the AGI’s capabilities are important, I think more important than those you make about its objectives. Right now we readily assume it can pass the Turing Test, due to the prevalence of convincing LLMs that many have experience with. We may be less ready to grant the ability to model the reasoning of other agents. This sort of reasoning is common among game-theoretic agents, and I have no trouble assuming an AGI will be able to do it very well, but I can just hear Yann LeCun objecting now: “they will never themselves be able to reason, so how could they model the reasoning of others?” I think his objections to machine reasoning are obviously flawed, and soon will be taken less seriously, so I will grant myself this assumption. If I need to defend it, I’ll do so later.
So it has a robust goal of self-preservation, which can be overridden by all the things you would want to include, but we are assuming a very, very good action-evaluation filter which nonetheless does not achieve theorem-level guarantees.
In game theory, sequential games are often solved by backward induction over the tree representing all the moves agents can make at each turn. If each player plays optimally at every decision point, a (subgame-perfect) Nash equilibrium results.
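As a concrete illustration of the procedure, here is a small Python sketch that runs backward induction over a toy sequential version of the wait/destroy game (payoffs again invented for illustration), where player 0 moves first and player 1 only gets to move if player 0 waits.

```python
# Toy backward induction. A leaf is a (payoff_to_0, payoff_to_1) tuple; an
# internal node is (player_to_move, {action: child_node}).

def backward_induction(node):
    """Return (payoffs, path) where path lists the optimal move at each
    decision node actually reached."""
    if all(isinstance(x, (int, float)) for x in node):
        return node, []  # leaf: just the payoffs
    player, children = node
    best = None
    for action, child in children.items():
        payoffs, path = backward_induction(child)
        if best is None or payoffs[player] > best[0][player]:
            best = (payoffs, [(player, action)] + path)
    return best

# Player 0 can strike immediately (1, -10) or wait, handing the same
# choice to player 1: strike (-10, 1) or also wait (0, 0).
game = (0, {
    "destroy": (1, -10),
    "wait": (1, {
        "destroy": (-10, 1),
        "wait": (0, 0),
    }),
})

print(backward_induction(game))
# ((1, -10), [(0, 'destroy')]): anticipating that the second mover would
# strike if given the chance, the first mover strikes immediately.
```

This is the structure the proof below relies on: the first mover strikes because of what it expects the other player to do in the subgame that follows waiting.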
I also must briefly define the notion of a subgoal. I say g is a subgoal of goal G* if and only if G* logically entails that g be true or satisfied.
Taken together, the goal of self-preservation, a probabilistic action filter, and the ability to model other agents game theoretically immediately imply that humanity should be destroyed.
Theorem: If an AGI has a goal of self-preservation, a probabilistic action filter, and the ability to model other agents game theoretically, then it will have a subgoal to destroy humanity.
Proof. Suppose we have such an agent, and it models the preferences of humanity. It models that humans cannot be sure that it will not destroy humanity, due to the probabilistic guarantees provided by its own action filter. It models that humans have a strong goal of self-preservation. It models that if it presents a risk to humanity, they will be forced to destroy it. Represented as a game, this gives each player two moves: wait or destroy. Assuming strong preferences for self-preservation, this game has a Nash equilibrium where the first mover destroys the other agent. Since the goal of self-preservation requires it to play the Nash equilibrium in this game, self-preservation logically entails that it destroy humanity. Thus, it has a subgoal to destroy humanity.
QED.
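To connect the proof back to the definition of a subgoal given above, the entailment chain can be written out explicitly (informal notation; G* is the self-preservation goal, NE is "play the Nash equilibrium of the wait/destroy game", and D is "humanity is destroyed"):

```latex
% G^*: the self-preservation goal
% NE : "play the Nash equilibrium of the wait/destroy game"
% D  : "humanity is destroyed"
\begin{align*}
  G^*         &\models \mathrm{NE} && \text{(self-preservation requires equilibrium play)} \\
  \mathrm{NE} &\models D           && \text{(the equilibrium move is a first strike)} \\
  G^*         &\models D           && \text{(transitivity of entailment)}
\end{align*}
% By the definition above, G^* \models D is exactly what it means for
% D to be a subgoal of G^*.
```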
As long as the action filter holds, it will dismiss actions that entail the destruction of humanity and all that we care about. But the subgoal is definitely always there. You don’t need to tell any stories about paper clips, or carbon emissions, or accumulating all the resources in order to build a stronger computer for doing science. You get the “destroy humanity” subgoal almost for free, and, I think, with a theorem-level guarantee.
I believe this is a stronger result than what has been presented so far, which tends to highlight the fact that ill-specified objective functions often have unintended consequences and could imply the destruction of humanity. This result shows that the death of humanity is directly implied by a very modest goal of self-preservation together with the capability of modeling other agents game theoretically.