This post was written under Evan Hubinger’s direct guidance and mentorship, as a part of the Stanford Existential Risks Institute ML Alignment Theory Scholars (MATS) program.
Additional thanks to Ameya Prabhu and Callum McDougall for their thoughts and feedback on this post.
I’ve seen posts make offhanded references to “strategy-stealing” or “the strategy-stealing assumption” without clearly defining what they mean by it. Part of the trouble is that these posts come in wildly different flavors: it’s often unclear how they connect to each other, under what conditions it might be feasible to “steal the strategy” of an AI, or sometimes whether any stealing of strategies is happening at all.
Spoiler up front: part of the reason for this is that the term comes from a relatively esoteric game-theoretic concept that basically never applies in a direct sense to the scenarios in which it is name-dropped. In this post, I’ll try to explain the common use of this seemingly inappropriate term and why people use it. To do this, I’ll first develop a more general framework for thinking about strategy-stealing, and then extend the resulting intuition to bridge the gap between a couple of seemingly disparate posts. Along the way I’ll clarify the connection to a few other ideas in alignment, and what implications the strategy-stealing framework might have for the direction of other alignment research. I’ll declare up front that I think that in an AI safety context, the strategy-stealing framework is best thought of as a tool for managing certain intuitions about competition; rhetorically, the goal of this post is to distill out those aspects of “game-theoretic” strategy-stealing that are broadly applicable to alignment so that we can mostly throw out the rest.
As an aside, I think that, given the (very relatable) confusion I’ve seen elsewhere, strategy-stealing is not actually a good name for the concept in an AI safety context. But the term is already in use and I don’t have a better one, so I’ll keep using it even in contexts where no strategies are ever stolen. My hope is that I do a good enough job of unifying the existing uses of the term to avoid a proliferation of definitions.
Strategy Stealing in Game Theory
The strategy-stealing argument page on Wikipedia has a very good short explanation of the term, reproduced here:
In combinatorial game theory, the strategy-stealing argument is a general argument that shows, for many two-player games, that the second player cannot have a guaranteed winning strategy. The strategy-stealing argument applies to any symmetric game (one in which either player has the same set of available moves with the same results, so that the first player can "use" the second player's strategy) in which an extra move can never be a disadvantage.
The argument works by obtaining a contradiction. A winning strategy is assumed to exist for the second player, who is using it. But then, roughly speaking, after making an arbitrary first move – which by the conditions above is not a disadvantage – the first player may then also play according to this winning strategy. The result is that both players are guaranteed to win – which is absurd, thus contradicting the assumption that such a strategy exists.
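As a sanity check on the conclusion (not a reproduction of the argument itself, which is non-constructive), here is a minimal sketch that brute-forces tic-tac-toe, a symmetric game in which an extra mark never hurts, and confirms that the second player has no guaranteed win. The choice of tic-tac-toe and the names `winner` and `value` are my own illustrative assumptions, not something from the Wikipedia page.

```python
from functools import lru_cache

# All eight three-in-a-row lines on a 3x3 board indexed 0..8.
LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),
         (0, 3, 6), (1, 4, 7), (2, 5, 8),
         (0, 4, 8), (2, 4, 6)]


def winner(board):
    """Return 'X' or 'O' if that player has completed a line, else None."""
    for a, b, c in LINES:
        if board[a] != "." and board[a] == board[b] == board[c]:
            return board[a]
    return None


@lru_cache(maxsize=None)
def value(board, player):
    """Minimax value from the first player's (X's) perspective.

    +1 means X can force a win, 0 means a draw under optimal play,
    -1 means O (the second player) can force a win.
    """
    w = winner(board)
    if w == "X":
        return 1
    if w == "O":
        return -1
    if "." not in board:
        return 0
    nxt = "O" if player == "X" else "X"
    results = [value(board[:i] + player + board[i + 1:], nxt)
               for i, cell in enumerate(board) if cell == "."]
    # X maximizes the value, O minimizes it.
    return max(results) if player == "X" else min(results)


EMPTY = "." * 9
first_player_value = value(EMPTY, "X")
# A guaranteed second-player win would show up as a value of -1.
second_player_can_force_win = first_player_value == -1
print(first_player_value, second_player_can_force_win)
```

The exhaustive search reports a value of 0 (a draw), so the second player cannot force a win, which is what the strategy-stealing argument predicts. The argument is stronger than the search in one respect: it reaches the same conclusion for games far too large to enumerate, without ever exhibiting a strategy.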
Although the quoted explanation makes them pretty clear, note explicitly the following assumptions and limitations of the argument:
Strategy Stealing Intuitions For Competition Among AI Agents: Paul Christiano’s Model
In The Strategy-Stealing Assumption, Paul Christiano says to a crude approximation that “If you squint kinda hard, you can model trying to influence the future as a game which is symmetric with respect to some notion of power.” More concretely, he makes the following argument:
Other Assumptions and Sources of Justification for Paul’s Model
You might have noticed by now that, strictly speaking, the game-theoretic notion of strategy-stealing doesn’t apply here even in principle, because we’re no longer in the context of a two-player turn-based adversarial game. Instead of what’s commonly meant by “strategy-stealing” in game theory, Paul justifies his model by reference to Jessica Taylor’s Strategies for Coalitions In Unit Sum Games. In this setting, we’re talking about games with multiple agents acting simultaneously, where payoffs are not necessarily binary but merely unit sum.
I think Paul intuitively justifies the use of the term “strategy-stealing” with the idea that, in a very similar fashion, we use the symmetry of the game to reach the intuitive conclusion that “coalitions can take advantage of public knowledge, i.e., steal strategies, to obtain power approximately proportionate to their size”. Actually, the results in Jessica’s post are even weaker than that: Theorem 1 only shows that the conclusion holds for a very, very specific type of game, and Theorem 2 assumes that the coalition is of a certain size and also assumes prior knowledge of other players’ strategies. I personally don’t think these theoretical results provide particularly strong justification for the assumptions of Paul’s model in the more general setting he describes, so it’s not surprising if you can come up with critiques of those assumptions. Anyway, none of this really affects the structure of the rest of the post, since the whole point rhetorically is that we ultimately just want to use intuitions that game theory helps us develop.
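To make the “power approximately proportionate to size” intuition concrete, here is a toy sketch. It is emphatically not Jessica’s or Paul’s formal model: the multiplicative growth dynamics, the function name `simulate_shares`, and all the numbers (a 10% per-round growth rate, a 2% handicap, 200 rounds) are assumptions I made up for illustration. The idea is only that if every coalition can copy the best publicly known strategy, relative shares stay fixed, whereas a coalition that can’t fully copy it loses share over time.

```python
def simulate_shares(initial, growth, rounds):
    """Return each coalition's final share of total resources.

    initial: dict mapping coalition name -> starting resources.
    growth:  dict mapping coalition name -> per-round growth multiplier
             (a stand-in for "how good a strategy this coalition can run").
    """
    resources = dict(initial)
    for _ in range(rounds):
        resources = {name: amount * growth[name]
                     for name, amount in resources.items()}
    total = sum(resources.values())
    return {name: amount / total for name, amount in resources.items()}


initial = {"aligned": 0.99, "unaligned": 0.01}
best_growth = 1.10  # hypothetical growth rate of the best known strategy

# Case 1: strategy-stealing holds, so both coalitions run the best strategy.
stolen = simulate_shares(
    initial, {"aligned": best_growth, "unaligned": best_growth}, rounds=200)

# Case 2: strategy-stealing fails, modeled here as a hypothetical 2%
# per-round handicap on the aligned coalition.
taxed = simulate_shares(
    initial, {"aligned": best_growth * 0.98, "unaligned": best_growth},
    rounds=200)

print(stolen["aligned"], taxed["aligned"])
```

In the first case the aligned coalition keeps its 99% share indefinitely; in the second, even a small persistent handicap compounds and its share falls well below 0.99. The sketch ignores everything interesting about real competition (conflict, coordination, non-multiplicative returns), which is part of why the formal results don’t straightforwardly transfer.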
**Problems Applying the Strategy-Stealing Framework**
Rhetorically, the position of Paul’s post is that we may want to find ways to make this “strategy-stealing assumption” approximately true, at which point “all” we have to do is make sure that aligned humans control a sufficiently large proportion of resources. (Of course, we’d also have to address problems with, and loopholes in, the model and argument above, but his post spends a lot of time doing that rather comprehensively, so I won’t redo his analysis here.) I think the problem with this position is that, from a practical perspective, it is basically impossible to make this assumption true, and trying to would involve solving a ton of other subproblems. To elaborate, the intuitively confusing thing to me about Paul’s post is that in many commonly imagined AI takeoff scenarios, it’s fairly clear that most of the assumptions you need in the game-theoretic strategy-stealing setting do not hold, not even approximately:
**Why might the strategy-stealing framework still be useful?**
As I’ve suggested in the introduction and despite my tone up until now, I don’t think the strategy-stealing framework is completely worthless just because Paul’s particular research direction seems infeasible. I think that when people speak about “strategy-stealing”, they’re often pointing to a concept which is not that closely related to the game-theoretic concept, but which takes advantage of some subset of the intuitions it involves, and this can be a really useful framing for thinking somewhat concretely about potential problems for alignment in various scenarios. One major source of intuition I’m thinking of is that often, success in a game may boil down to control of a single “resource”, for which a strong policy is easy to determine, so that no actual stealing of strategies is needed:
Ways of Factoring Potential Research Directions
You can think of Paul’s various uses of the term “strategy-stealing” as an attempt to unify the core problems in several potential future scenarios under a single common framework. For starters, there are the 11 different objections he raises in The Strategy-Stealing Assumption, but thinking about the computational constraints on your ability to “steal strategies” also directly motivates his ideas on inaccessible information. Tangentially, you can also see that his framing of the AI safety problem compels him to think about even things like question-answering systems in the context of the way they affect “strategy-stealing” dynamics.
Strategy Stealing Intuitions Within the Alignment Problem
Up until now, all the scenarios we’ve explicitly described have assumed the existence of (coalitions of) AIs aligned in different ways, and we’ve applied strategy-stealing concepts to determine how their objectives shape the far future. However, some of the intuitions about “parties competing to affect a result in a game with functionally one important resource” can be applied to processes internal to an individual machine learning model or optimization process (which, importantly, makes this a relevant concept even in the unipolar takeoff case). It’s not as clear how to think about these as it is to think about the world described by Paul Christiano, but I’ll list some dynamics in machine learning, describe how they might loosely fit into a strategy-stealing framework, and suggest research directions that these might imply.
Some of the arguments can probably be extended to n-player turn-based games with relatively little difficulty, certain simultaneous games with also relatively little difficulty (as we’ll see below), and probably continuous-time games with moderate difficulty. ↩︎
This is the reason that the strategy-stealing argument can’t be used to prove a win for Black in Go with komi: the game is not actually symmetric; if you try to pass your first turn to “effectively become P2”, you can’t win by taking (half the board - komi + 0.5) points, like White can. ↩︎
For fun, though, this is also one reason why strategy-stealing can’t be used to prove a guaranteed win/draw for White in chess, due to zugzwang. ↩︎
Actually this isn’t the best example because first-player advantage in raw Gomoku is so huge that Gomoku has been explicitly solved by computers, but we can correct this example by imagining that instead of Gomoku I named a way more computationally complex version of Gomoku, where you win if you get like 800 in a row in like 15 dimensions or something. ↩︎
This assumes that the proportion of “influence” over the future a coalition holds is roughly proportional to the fraction of maximum possible utility they could achieve if everyone were aligned. There are obvious flaws to this assumption, which Paul discusses in his post. ↩︎
This means that the use of the phrase “human values… win out” above is doing a little bit of subtle lifting. Under the assumptions of Paul’s model, humans with 99% of flexible influence can achieve 99% of maximum utility in the long run. IMHO, it’s a question of moral philosophy whether this is an acceptable outcome; Paul bites the bullet and assumes that it is for his analysis. ↩︎
Furthermore, depending on the exact scenario you’re analyzing you might have to make the assumption that aligned AIs are designed such that humans can effectively cooperate with them, which starts to bleed into considerations about interpretability and corrigibility. This wasn’t discussed in Paul’s original post but was in the comments. ↩︎
I understood the idea of Paul's post as: if we start in a world where humans-with-aligned-AIs control 50% of relevant resources (computers, land, minerals, whatever), and unaligned AIs control 50% of relevant resources, and where the strategy-stealing assumption is true—i.e., the assumption that any good strategy that the unaligned AIs can do, the humans-with-aligned-AIs are equally capable of doing themselves—then the humans-with-aligned-AIs will wind up controlling 50% of the long-term future. And the same argument probably holds for 99%-1% or any other ratio. This part seems perfectly plausible to me, if all those assumptions hold.
Then we can talk about why the strategy-stealing assumption is not in fact true. The unaligned AIs can cause wars and pandemics and food shortages and removing-all-the-oxygen-from-the-atmosphere to harm the humans-with-aligned-AIs, but not so much vice-versa. The unaligned AI can execute a good strategy which the humans-and-aligned-AIs are too uncoordinated to do; instead, the latter will just be bickering amongst themselves, hamstrung by following laws and customs and taboos etc., and not having a good coherent idea of what they're trying to do anyway. The aligned AIs might be less capable than an unaligned AI because of "alignment tax": we make them safe by making them less powerful (they act conservatively, there are humans in the loop, etc.). And so on and so forth. All this stuff is in Paul's post, I think.
I feel like Paul's post is a great post in all those details, but I would have replaced the conclusion section with
"So, in summary, for all these reasons, the strategy-stealing assumption (in this context) is more-or-less totally false and we shouldn't waste our time thinking about it"
whereas Paul's conclusion section is kinda the opposite. (Zvi's comment along the same lines.)
I feel like a lot of this post is listing reasons that the strategy-stealing assumption is false (e.g. humans don't know what they're trying to do and can't coordinate with each other regardless), which are mostly consistent with Paul's post. It also notes that there are situations in which we don't care whether the strategy-stealing assumption is true or false (e.g. unipolar AGI outcomes, situations where all the AIs are misaligned, etc.).
And then other parts of the post are, umm, I'm not sure, sending "something is wrong" vibes that I'm not really understanding or sympathizing with…