Epistemic status: Not well argued for, haven’t spent much time on it, and it’s not very worked out. This thesis is not new at all, and is implicit in a lot (most?) of the discussion about x-risk from AI. The intended contribution of this post is to state the thesis explicitly in a way that can help make the alignment problem clearer. I can imagine the thesis is too strong as stated.


In people’s skepticism of AI risk I sometimes hear, implicitly or explicitly, the idea that we can just not build dangerous AI. I want to formulate a thesis that does part of the work of countering this idea.

If we want to be safe against misaligned AI, we need to defend the vulnerabilities that one or more misaligned AI systems could exploit to pose an existential threat. We need aligned AI that defends against misaligned AI, in order to make sure such misaligned AI poses no threat (either by ensuring it is never created, or by ensuring it is never able to accumulate the resources needed to be a threat).

In game theory terms: If we successfully build aligned AI systems, then we need to play an asymmetric game between humans with (aligned) defensive AI, versus potential (misaligned) offensive AI. It is not obvious that in an asymmetric game both sides need to have the same capabilities. The thesis I’m proposing is that the defensive AI needs to have at least all the capabilities that the offensive AI would need to pose an existential threat: the defensive AI needs to be basically capable of starting from a set of resources X and doing all the damage that the offensive AI could do using resources X.

The/an AI offense-defense symmetry thesis: In order for aligned AI systems to defend indefinitely against existential threats from misaligned AI without themselves harming humanity’s potential, the defensive AI needs to have, broadly speaking, all the capabilities of the misaligned AI systems that it defends against.

Note that the defender may need to have a higher (or lower) level of those capabilities than the attacker, and it may need additional kinds of capabilities as well that the attackers don’t have.
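
One rough way to write the thesis down, as a sketch only: the damage measure D and the subscripts below are notation assumed here for illustration, not something taken from the discussion. Let D_A(X) stand for the most damage an agent A would be capable of doing when starting from resources X. The thesis then claims a necessary condition, not a sufficient one:

```latex
% Assumed notation: D_A(X) = most damage agent A is capable of doing from resources X.
% "def" = aligned defensive AI, "off" = misaligned offensive AI.
\text{defense succeeds indefinitely}
\;\Longrightarrow\;
\forall X:\quad D_{\mathrm{def}}(X) \;\ge\; D_{\mathrm{off}}(X)
```

This only says what the defender must be capable of, not what it does, and it leaves open that the defender needs further capabilities on top.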

I don’t think this thesis is exactly right, but I think it is more or less right. I won’t really make a solid argument for it, but I will give some analogies and mechanisms:

Some analogies from offense-defense in human domains:

State security services who defend against terrorists: 

  • The terrorists are persons with guns, the state security services are also persons with guns. To do damage one needs to be able to use guns to shoot people in a firefight. To defend, one still also needs to be able to use guns to shoot people in a firefight.
  • To do a lot of damage as a terrorist, you need to search for vulnerabilities to exploit. To defend against this, you need to search for vulnerabilities to fix/defend against. 
  • There are additional skills the defender needs to have, such as skills in tracking and monitoring suspects, which the attacker doesn’t need. The opposite direction seems to be less the case.

Computer security specialists defending against attacks:

  • Hackers and malware designers are people who understand computer systems and their vulnerabilities. IT Security professionals are also just people who understand computer systems and their vulnerabilities.
  • Empirically, black hat hackers or malware developers can contribute (once they change their motivations) to IT security, and security professionals could, if they wanted, use their knowledge to write malware. 

Mechanisms why this thesis might be true

  • Offense and defense both require the same world-model. If the offender’s task is “find and exploit a vulnerability”, and the defense’s task is “find and defend all vulnerabilities”, then while their planning tasks are different, they both require the world-model of the domain in which they do this planning. E.g. Both hackers and computer security professionals need to actually understand programming, operating systems, networking, and how there can be vulnerabilities in these systems. 
  • Offense and defense both need to search for vulnerabilities. Even defenders still need to be able to find vulnerabilities in their systems, especially when they can’t just respond ex-post to vulnerabilities found by attackers because a single failure is sufficiently damaging.
  • Symmetric subgames in asymmetric games. Some asymmetric games in the real world have symmetric subgames. E.g. Even though the conflict between an occupying power with a large military versus a guerilla force is an “asymmetric war”, at local points in the conflict, there are situations that are just one team of soldiers with guns fighting another team of soldiers with guns. Hence both the occupying power and the guerilla force need capabilities like marksmanship, tactical skill, coordination and so forth, for basically the same reason. 

A mechanism why this might not be true

  • Defense can use abstraction to counter categories of attacks. Slightly metaphorically, the attacker’s problem is to prove “∃ vulnerability” (i.e. to find a vulnerability), while the defender’s problem is the opposite, to prove “∀ secure” (i.e. to ensure there is no vulnerability). But rather than searching through all vulnerabilities and defending against each of them, the defender can simply execute a smaller number of general countermeasures without understanding all the particular vulnerabilities and their exploits. This might require different capabilities.

For example, in computer security, rather than searching for vulnerabilities (e.g. by white hat hacking) and fixing them, one can work on formally verified software to rule out categories of vulnerabilities. This requires very different capabilities than hacking does.
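
The ∃/∀ asymmetry can be illustrated with a toy sketch. Everything in it is a made-up model for illustration: the `Component` class, the `unchecked_write` flaw and the `bounds_checked` mitigation are hypothetical names, and real systems are of course not lists of boolean flags. The attacker stops at the first exploitable component, one defender searches for and patches each flaw, and another defender applies a blanket countermeasure without ever looking at where the flaws are.

```python
from dataclasses import dataclass, replace
from typing import List, Optional


@dataclass(frozen=True)
class Component:
    name: str
    unchecked_write: bool         # hypothetical flaw in this component
    bounds_checked: bool = False  # hypothetical blanket mitigation


def exploitable(c: Component) -> bool:
    # In this toy model a flaw only matters if the mitigation is absent.
    return c.unchecked_write and not c.bounds_checked


def attack(system: List[Component]) -> Optional[str]:
    # Attacker: prove "exists a vulnerability" by finding any single one.
    return next((c.name for c in system if exploitable(c)), None)


def defend_by_search(system: List[Component]) -> List[Component]:
    # Defender, the hard way: find every flaw and patch it individually.
    return [replace(c, unchecked_write=False) if exploitable(c) else c
            for c in system]


def defend_by_abstraction(system: List[Component]) -> List[Component]:
    # Defender, by abstraction: apply one general countermeasure everywhere,
    # without inspecting which components actually contain the flaw.
    return [replace(c, bounds_checked=True) for c in system]


system = [
    Component("parser", unchecked_write=False),
    Component("network", unchecked_write=True),
    Component("logger", unchecked_write=True),
]

print(attack(system))                         # network
print(attack(defend_by_search(system)))       # None
print(attack(defend_by_abstraction(system)))  # None
```

Note that `defend_by_abstraction` never reads `unchecked_write`: the flaws are still present but no longer exploitable, which is the sense in which this kind of defense might need different capabilities than finding vulnerabilities does.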

Conclusion

This thesis would imply that in order to permanently defend against existential risk from AI without permanently crippling humanity, we need to develop aligned AI which has all the capabilities necessary to cause such an existential catastrophe. In particular, we cannot build weak AI to defend against strong AI. 


 

9 comments

Kind of a delayed response, but: Could you clarify what you think is the relation between that post and mine? I think they are somehow sort of related, but not sure what you think the relation is. Are you just trying to say "this is sort of related", or are you trying to say "the strategy stealing assumption and this defense-offense symmetry thesis is the same thing"?

In the latter case: I think they are not the same thing, neither in terms of their actual meaning nor their intended purpose:

  • Strategy-stealing assumption is (in the context of AI alignment): for any strategy that a misaligned AI can use to obtain influence/power/resources, humans can employ a similar strategy to obtain a similar amount of influence/power/resources.
  • This defense-offense symmetry thesis: In certain domains, in order to defend against an attacker, the defender needs the same cognitive skills (knowledge, understanding, models, ...) as the attacker (and possibly more).

These seem sort of related, but they are just very different claims, even depending on different ontologies/concepts. One particularly simple-to-state difference is that the strategy-stealing argument is explicitly about symmetric games whereas the defense-offense symmetry is about a (specific kind of) asymmetric game, where there is a defender who first has some time to build defenses, and then an attacker who can respond to that and exploit any weaknesses. (And the strategy-stealing argument as applied to AI alignment is not literally symmetric, but semi-symmetric in the sense of the relation between initial resources and output being kind of "linear".)

So yeah given this, could you say what you think the relation is?

Strategy-stealing assumption is (in the context of AI alignment): for any strategy that a misaligned AI can use to obtain influence/power/resources, humans can employ a similar strategy to obtain a similar amount of influence/power/resources.

... And the humans have a majority of the resources / power, which requires having competitive aligned AI systems. More broadly strategy-stealing is "the player with majority resources / power can just copy the strategy of the other player".

One particularly simple-to-state difference is that the strategy-stealing argument is explicitly about symmetric games whereas the defense-offense symmetry is about a (specific kind of) asymmetric game, where there is a defender who first has some time to build defenses, and then an attacker who can respond to that and exploit any weaknesses.

I wouldn't say the strategy-stealing assumption is about a symmetric game; it's symmetric only in that the actions available to both sides are approximately the same. The goals of the two sides are pretty different and aren't zero-sum.

Similarly I think in the defense-offense case the actions available to both sides are approximately the same but the goals are pretty different (defend X vs attack X). The strategy-stealing argument as applied to defense-offense would say something like "whatever offense does to increase its resources / power is something that defense could also do to increase resources / power". E.g. if the terrorists secretly go around shooting people to decrease state power, the state could also go around secretly shooting terrorists to decrease terrorist power. Often the position with majority resources / power (i.e. the state) will have a better action than that available, and so you'll see the two groups doing different things, but "use the same strategy as the less-resourced group" is an available baseline that helps you preserve your majority resources / power.

This isn't the same as your thesis. Your thesis says "the defender needs to have the same capabilities as the attacker". The strategy-stealing argument directly assumes that the defender has the same capabilities (i.e. assumes the conclusion of your thesis), and then uses that to argue that there is a lower bound on how well the majority-resourced player can do.

So anyway I'd say the relation is that both theses are talking about the same sort of game / environment, and defense-offense is a central example application of the strategy-stealing argument (especially in AI alignment, where humanity + aligned AI are defending against misaligned AI attackers).

"I think in the defense-offense case the actions available to both sides are approximately the same"

If attacker has the action "cause a 100% lethal global pandemic" and the defender has the task "prevent a 100% lethal global pandemic", then clearly these are different problems, and it is a thesis, a thing to be argued for, that the latter requires largely the same skills/tech as the former (which is what this offense-defense symmetry thesis states). 

If you build an OS that you're trying to make safe against attacks, you might do e.g. what the seL4 microkernel team did and formally verify the OS to rule out large classes of attacks, and this is an entirely different kind of action than "find a vulnerability in the OS and develop an exploit to take control over it".

"I wouldn't say the strategy-stealing assumption is about a symmetric game"

Just to point out that the original strategy-stealing argument assumes literal symmetry. I think the argument only works insofar as generalizing from literal symmetry (to e.g. something more like linearity of the benefit of initial resources) doesn't break it. I think you actually need something like symmetry in both the instrumental goals and the "initial-resources-to-output map".

The strategy-stealing argument as applied to defense-offense would say something like "whatever offense does to increase its resources / power is something that defense could also do to increase resources / power".

Yes, but this is almost the opposite of what the offense-defense symmetry thesis is saying. Because it can both be true that 1. defender can steal attacker's strategies, AND 2. defender alternatively has a bunch of much easier strategies available, by which it can defend against attacker and keep all the resources.

This DO-symmetry thesis says that 2 is NOT true, because all such strategies in fact also require the same kind of skills. The point of the DO-symmetry thesis is to make more explicit the argument that humans cannot defend against misaligned AI without their own aligned AI. 

"This isn't the same as your thesis."

Ok I only read this after writing all of the above, so I thought you were implying they were the same (and was confused as to why you would imply this), and I'm guessing you actually just meant to say "these things are sort of vaguely related". 

Anyway, if I wanted to state what I think the relation is in a simple way I'd say that they give lower and upper bounds respectively on the capabilities needed from AI systems:

  • OD-symmetry thesis: We need our defensive AI to be at least as capable as any misaligned AI.
  • strategy-stealing: We don't need our defensive AI to be any more capable.

I think probably both are not entirely right.

Yes, all of that mostly sounds right to me.

I agree the formal strategy-stealing argument relies on literal symmetry; I would say the linked post is applying it to asymmetric situations, where you can recover something roughly symmetric, by assuming that both players need to first accumulate resources and power. (I think this is basically what you said.)

  • Hackers and malware designers are people who understand computer systems and their vulnerabilities. IT Security professionals are also just people who understand computer systems and their vulnerabilities.

Very few people comprehensively understand computer systems. Certainly only a small fraction of the people in the above groups do.

A lot of 'hacking' and 'security', even at the lowest levels of abstraction, is done by folks operating at several layers above the bare metal.

Thus the actual dynamics that are observed occur in less secure systems with actors of only partial understanding on both sides.

Yeah, I know they don't understand them comprehensively. Is this the point, though? I mean they understand them at the level of abstraction necessary to do what they need, and the claim is that they have basically the same kind of knowledge of computers. Hmm, I guess that isn't really communicated by my phrasing though, so maybe I should edit that.

I think that you need to distinguish two different goals:

  • the very ambitious goal of eliminating any risk of misaligned AI doing any significant damage. If even possible, that would require an aligned AI with much stronger capabilities than the misaligned one (or many aligned AIs such that their combined capabilities are not easily matched)
  • the more limited goal of reducing extinction risk from AGI to a low enough level (say, comparable to asteroid risk or natural pathogen risk). This might be manageable with the help of lesser AIs, depending on the time available to prepare

I agree this is a good distinction.