When does technical work to reduce AGI conflict make a difference?
Cooperation, Conflict, and Transformative Artificial Intelligence: A Research Agenda


Thanks for sharing, I'm happy that someone is looking into this. I'm not an expert in the area, but my impression is that this is consistent with a large body of empirical work on "procedural fairness", i.e., people tend to be happier with outcomes that they consider to have been generated by a fair decision-making process. It might be interesting to replicate studies from that literature with an AI as the decision-maker.

[I work at CAIF and CLR]

Thanks for this!

I recommend making it clearer that CAIF is not focused on s-risk and is not formally affiliated with CLR (except for overlap in personnel). While it’s true that there is significant overlap in CLR’s and CAIF’s research interests, CAIF’s mission is much broader than CLR’s (“improve the cooperative intelligence of advanced AI for the benefit of all”), and its founders + leadership are motivated by a variety of catastrophic risks from AI.

Also, “foundational game theory research” isn’t an accurate description of CAIF’s scope. CAIF is interested in a variety of fields relevant to the cooperative intelligence of advanced AI systems. While this includes game theory and decision theory, I expect that a majority of CAIF’s resources (measured in both grants and staff time) will be directed at machine learning, and that we’ll also support work from the social and natural sciences. Also see Open Problems in Cooperative AI and CAIF’s recent call for proposals for a better sense of the kinds of work we want to support.

[ETA] I don’t think “foundational game theory research” is an accurate description of CLR’s scope, either, though I understand how public writing could give that impression. It is true that several CLR researchers have worked and are currently working on foundational game & decision theory research. But people work on a variety of things. Much of our recent technical and strategic work on cooperation is grounded in more prosaic models of AI (though to be fair much of this is not yet public; there are some forthcoming posts that hopefully make this clearer, which I can link back to when they’re up). Other topics include risks from malevolent actors and AI forecasting.

[Edit 14/9] Some of these "forthcoming posts" are up now.

A few thoughts on this part:

I guess [coordination failures between AIs] feels like mainly the type of thing that we can outsource to AIs, once they’re sufficiently capable. I don’t see a particularly strong reason to think that systems that are comparably powerful as humans, or more powerful than humans, are going to make obvious mistakes in how they coordinate. You have this framing of AI coordination. We could also just say politics, right? Like we think that geopolitics is going to be hard in a world where AIs exist. And when you have that framing, you’re like, geopolitics is hard, but we’ve made a bunch of progress compared with a few hundred years ago where there were many more wars. It feels pretty plausible that a bunch of trends that have led to less conflict are just going to continue. And so I still haven’t seen arguments that make me feel like this particular problem is incredibly difficult, as opposed to arguments which I have seen for why the alignment problem is plausibly incredibly difficult.

I agree that a lot of thinking on how to make AI cooperation go well can be deferred to when we have highly capable AI assistants. But there is still the question of how human overseers will make use of highly capable AI assistants when reasoning about tricky bargaining problems, what kinds of commitments to make and so on. Some of these problems are qualitatively different than the problems of human geopolitics. And I don’t see much reason for confidence that early AIs and their overseers will think sufficiently clearly about this by default, that is, without some conceptual groundwork having been laid going into a world with the first powerful AI assistants. (This and this are examples of conceptual groundwork I consider valuable to have done before we get powerful AI assistants.)

There is also the possibility that we lose control of AGI systems early on, but it’s still possible to reduce risks of worse-than-extinction outcomes due to cooperation failures involving those systems. This work might not be delegable.

(Overall, I agree that thinking specific to AI cooperation should be a smaller part of the existential risk reduction portfolio than generic alignment, but maybe a larger portion than the quote here suggests.)

We are now using a new definition of s-risks. I've edited this post to reflect the change.

New definition:

S-risks are risks of events that bring about suffering in cosmically significant amounts. By “significant”, we mean significant relative to expected future suffering.

Note that it may turn out that the amount of suffering that we can influence is dwarfed by suffering that we can’t influence. By “expectation of suffering in the future” we mean “expectation of action-relevant suffering in the future”.

Ok, thanks for that. I’d guess then that I’m more uncertain than you about whether human leadership would delegate to systems who would fail to accurately forecast catastrophe.

It’s possible that human leadership just reasons poorly about whether their systems are competent in this domain. For instance, they may observe that their systems perform well in lots of other domains, and incorrectly reason that “well, these systems are better than us in many domains, so they must be better in this one, too”. Eagerness to deploy before a more thorough investigation of the systems’ domain-specific abilities may be exacerbated by competitive pressures. And of course there is historical precedent for delegation to overconfident military bureaucracies.

On the other hand, to the extent that human leadership is able to correctly assess their systems’ competence in this domain, it may be only because there has been a sufficiently successful AI cooperation research program. For instance, maybe this research program has furnished appropriate simulation environments to probe the relevant aspects of the systems’ behavior, transparency tools for investigating cognition about other AI systems, norms for the resolution of conflicting interests and methods for robustly instilling those norms, etc., along with enough researcher-hours applying these tools to have an accurate sense of how well the systems will navigate conflict.

As for irreversible delegation: there is the question of whether delegation is in principle reversible, and the question of whether human leaders would want to override their AI delegates once war is underway. Even if delegation is reversible, human leaders may think that their delegates are better suited to wage war on their behalf once it has started, perhaps because things are simply happening too fast for them to have confidence that they could intervene without placing themselves at a decisive disadvantage.

The US and China might well wreck the world by knowingly taking gargantuan risks even if both had aligned AI advisors, although I think they likely wouldn't.

But what I'm saying is really hard to do is to make the scenarios in the OP (with competition among individual corporate boards and the like) occur without extreme failure of 1-to-1 alignment.

I'm not sure I understand yet. For example, here’s a version of Flash War that happens seemingly without either the principals knowingly taking gargantuan risks or extreme intent-alignment failure.

  1. The principals largely delegate to AI systems on military decision-making, mistakenly believing that the systems are extremely competent in this domain.

  2. The mostly-intent-aligned AI systems, who are actually not extremely competent in this domain, make hair-trigger commitments of the kind described in the OP. The systems make their principals aware of these commitments and (being mostly-intent-aligned) convince their principals “in good faith” that this is the best strategy to pursue. In particular they are convinced that this will not lead to existential catastrophe.

  3. The commitments are triggered as described in the OP, leading to conflict. The conflict proceeds too quickly for the principals to effectively intervene / the principals think their best bet at this point is to continue to delegate to the AIs.

  4. At every step both principals and AIs think they’re doing what’s best by the respective principals’ lights. Nevertheless, due to a combination of incompetence at bargaining and structural factors (e.g., persistent uncertainty about the other side’s resolve), the AIs continue to fight to the point of extinction or unrecoverable collapse.

Would be curious to know which parts of this story you find most implausible.

Yeah I agree the details aren’t clear. Hopefully your conditional commitment can be made flexible enough that it leaves you open to being convinced by agents who have good reasons for refusing to do this world-model agreement thing. It’s certainly not clear to me how one could do this. If you had some trusted “deliberation module”, which engages in open-ended generation and scrutiny of arguments, then maybe you could make a commitment of the form “use this protocol, unless my counterpart provides reasons which cause my deliberation module to be convinced otherwise”. Idk.

Your meta-level concern seems warranted. One would at least want to try to formalize the kinds of commitments we’re discussing and ask if they provide any guarantees, modulo equilibrium selection.

It seems like we can kind of separate the problem of equilibrium selection from the problem of “thinking more”, if “thinking more” just means refining one’s world models and credences over them. One can make conditional commitments of the form: “When I encounter future bargaining partners, we will (based on our models at that time) agree on a world-model according to some protocol and apply some solution concept (e.g. Nash or Kalai-Smorodinsky) to it in order to arrive at an agreement.”

The set of solution concepts you commit to regarding as acceptable still poses an equilibrium selection problem. But, on the face of it at least, the “thinking more” part is handled by conditional commitments to act on the basis of future beliefs.

I guess there’s the problem of what protocols for specifying future world-models you commit to regarding as acceptable. Maybe there are additional protocols that haven’t occurred to you, but which other agents may have committed to and which you would regard as acceptable when presented to you. Hopefully it is possible to specify sufficiently flexible methods for determining whether protocols proposed by your future counterparts are acceptable that this is not a problem.
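
To make the point about solution concepts concrete, here is a minimal Python sketch of why the choice of solution concept is itself a selection problem even after the parties agree on a world-model. Everything specific in it is an assumption for illustration: the agreed world-model is reduced to a two-player Pareto frontier u2 = 1 - u1^2 with disagreement point (0, 0), and the function names are hypothetical. The Nash solution maximizes the product of gains over the disagreement point; the Kalai-Smorodinsky solution is approximated by maximizing the smaller of the two gains after normalizing each by that player's best feasible gain.

```python
from typing import List, Tuple

Point = Tuple[float, float]  # (utility of agent 1, utility of agent 2)


def frontier(n: int = 10001) -> List[Point]:
    """Hypothetical agreed world-model: frontier u2 = 1 - u1**2 on a grid."""
    return [(i / (n - 1), 1.0 - (i / (n - 1)) ** 2) for i in range(n)]


def _rational(points: List[Point], d: Point) -> List[Point]:
    """Keep only outcomes that leave both agents at least as well off as d."""
    return [p for p in points if p[0] >= d[0] and p[1] >= d[1]]


def nash_solution(points: List[Point], d: Point = (0.0, 0.0)) -> Point:
    """Nash bargaining: maximize the product of gains over disagreement."""
    return max(_rational(points, d),
               key=lambda p: (p[0] - d[0]) * (p[1] - d[1]))


def kalai_smorodinsky_solution(points: List[Point],
                               d: Point = (0.0, 0.0)) -> Point:
    """Kalai-Smorodinsky (grid approximation): maximize the minimum gain,
    each gain normalized by that agent's best feasible gain."""
    feasible = _rational(points, d)
    best1 = max(p[0] for p in feasible) - d[0]
    best2 = max(p[1] for p in feasible) - d[1]
    if best1 <= 0 or best2 <= 0:
        raise ValueError("no agent can gain over the disagreement point")
    return max(feasible,
               key=lambda p: min((p[0] - d[0]) / best1,
                                 (p[1] - d[1]) / best2))


if __name__ == "__main__":
    pts = frontier()
    print("Nash:", nash_solution(pts))
    print("Kalai-Smorodinsky:", kalai_smorodinsky_solution(pts))
```

On this toy frontier the two concepts recommend different agreements (agent 1 does better under Kalai-Smorodinsky than under Nash), so two agents who have each committed to "agree on a world-model, then apply my solution concept" can still disagree about the outcome unless their sets of acceptable concepts overlap.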

Nice post! I’m excited to see more attention being paid to multi-agent stuff recently.

A few miscellaneous points:

  • I get the impression that the added complexity of multi- relative to single-agent systems has not been adequately factored into folks’ thinking about timelines / the difficulty of making AGI that is competent in a multipolar world. But I’m not confident in that.

  • I think it’s possible that conflict / bargaining failure is a considerable source of existential risk, in addition to suffering risk. I don’t really have a view on how it compares to other sources, but I’d guess that it is somewhat underestimated, because of my impression that folks generally underestimate the difficulty of getting agents to get along (even if they are otherwise highly competent).
