# Ω 25

Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

It seems likely to me that AIs will be able to coordinate with each other much more easily (i.e., at lower cost and greater scale) than humans currently can, for example by merging into coherent unified agents by combining their utility functions. This has been discussed at least since 2009, but I'm not sure its implications have been widely recognized. In this post I talk about two such implications that occurred to me relatively recently.

I was recently reminded of this quote from Robin Hanson's Prefer Law To Values:

The later era when robots are vastly more capable than people should be much like the case of choosing a nation in which to retire. In this case we don’t expect to have much in the way of skills to offer, so we mostly care that they are law-abiding enough to respect our property rights. If they use the same law to keep the peace among themselves as they use to keep the peace with us, we could have a long and prosperous future in whatever weird world they conjure. In such a vast rich universe our “retirement income” should buy a comfortable if not central place for humans to watch it all in wonder.

Robin argued that this implies we should work to make it more likely that our current institutions like laws will survive into the AI era. But (aside from the problem that we're most likely still incurring astronomical waste even if many humans survive "in retirement"), assuming that AIs will have the ability to coordinate amongst themselves by doing something like merging their utility functions, there will be no reason to use laws (much less "the same laws") to keep peace among themselves. So the first implication is that to the extent that AIs are likely to have this ability, working in the direction Robin suggested would likely be futile.

The second implication is that AI safety/alignment approaches that aim to preserve an AI's competitiveness must also preserve its ability to coordinate with other AIs, since that is likely an important part of its competitiveness. For example, making an AI corrigible in the sense of allowing a human to shut it (and its successors/subagents) down or change how it functions would seemingly make it impossible for this AI to merge with another AI that is not corrigible, or not corrigible in the same way. (I've mentioned this a number of times in previous comments, as a reason why I'm pessimistic about specific approaches, but I'm not sure if others have picked up on it, or agree with it, as a general concern, which partly motivates this post.)

Questions: Do you agree AIs are likely to have the ability to coordinate with each other at low cost? What other implications does this have, especially for our strategies for reducing x-risk?

# Ω 25

New Comment

I very much agree. Historical analogies:

To a tiger, human hunter-gatherers must be frustrating and bewildering in their ability to coordinate. "What the hell? Why are they all pouncing on me when I jumped on the little one? The little one is already dead anyway, and they are risking their own lives now for nothing! Dammit, gotta run!"

To a tribe of hunter-gatherers, farmers must be frustrating and bewildering in their ability to coordinate. "What the hell? We pillaged and slew that one village real good, they sure didn't have enough warriors left over to chase us down... why are the neighboring villages coming after us? And what's this--they have professional soldiers with fancy equipment riding horses? Somehow hundreds--no, thousands--of farmers cooperated over a period of several years to make this punitive expedition possible! How were we to know they would go to such lengths?"

To the nations colonized by the Europeans, it must have been pretty interesting how the Europeans were so busy fighting each other constantly, yet somehow managed to more or less peacefully divide up Africa, Asia, South America, etc. to be colonized between them. Take the Opium Wars and the Boxer Rebellion for example. I could imagine a Hansonian prophet in a native american tribe saying something like "Whatever laws the European nations use to keep the peace among themselves, we will benefit from them also; we'll register as a nation, sign treaties and alliances, and rely on the same balance of power." He would have been disastrously wrong.

I expect something similar to happen with us humans and AGI, if there are multiple AGI. "What? They all have different architectures and objectives, not to mention different users and owners... we even explicitly told them to compete with each other! Why are they doing X.... noooooooo...." (Perhaps they are competing with each other furiously, even fighting each other. Yet somehow they'll find a way to cut us out of whatever deal they reach, just as European powers so often did for their various native allies.)

Also interesting to see that all of these groups were able to coordinate to the disadvantage of less coordinates groups, but not able to reach peace among themselves.

One explanation might be that the more coordinated groups also have harder coordination problems to solve because their world is bigger and more complicated. Might be the same with AI?

I wonder also if the conflicts that remain are nevertheless more peaceful. When hunter-gatherer tribes fight each other, they often murder all the men and enslave the women, or so I hear. Similar things happened with farmer societies sometimes, but also sometimes they just become new territories and have to pay tribute and levy conscripts and endure the occasional pillage. And then industrialized modern nations even have rules about how you can't rape and pillage and genocide and sell into slavery the citizens of your enemy. Perhaps AI conflicts would be even more peaceful. For example, perhaps they would look something more like fancy maneuvers, propaganda, and hacking, with swift capitulation by the "checkmated" AI, which is nevertheless allowed to continue existing with some smaller amount of influence over the future. Perhaps no property would even be destroyed in the entire war!

Just spitballing here. I feel much less confident in this trend than in the trend I pointed out above.

I'd like to push back on the assumption that AIs will have explicit utility functions. Even if you think that sufficiently advanced AIs will behave in a utility-maximising way, their utility functions may be encoded in a way that's difficult to formalise (e.g. somewhere within a neural network).

It may also be the case that coordination is much harder for AIs than for humans. For example, humans are constrained by having bodies, which makes it easier to punish defection - hiding from the government is tricky! Our bodies also make anonymity much harder. Whereas if you're a piece of code which can copy itself anywhere in the world, reneging on agreements may become relatively easy. Another reasion why AI cooperation might be harder is simply that AIs will be capable of a much wider range of goals and cognitive processes than humans, and so they may be less predictable to each other and/or have less common ground with each other.

I’d like to push back on the assumption that AIs will have explicit utility functions.

Yeah I was expecting this, and don't want to rely too heavily on such an assumption, which is why I used "for example" everywhere. :)

Even if you think that sufficiently advanced AIs will behave in a utility-maximising way, their utility functions may be encoded in a way that’s difficult to formalise (e.g. somewhere within a neural network).

I think they don't necessarily need to formalize their utility functions, just isolate the parts of their neural networks that encode those functions. And then you could probably take two such neural networks and optimize for a weighted average of the outputs of the two utility functions. (Although for bargaining purposes, to determine the weights, they probably do need to look inside the neural networks somehow and can't just treat them as black boxes.)

Or are you're thinking that the parts encoding the utility function are so intertwined with the rest of the AI that they can't be separated out, and the difficulty of doing that increases with the intelligence of the AI so that the AI remains unable to isolate its own utility function as it gets smarter? If so, it's not clear to me why there wouldn't be AIs with cleanly separable utility functions that are nearly as intelligent, which would outcompete AIs with non-separable utility functions because they can merge with each other and obtain the benefits of better coordination.

It may also be the case that coordination is much harder for AIs than for humans.

My reply to Robin Hanson seems to apply here. Did you see that?

If there's a lot of coordination among AI, even if only through transactions, I feel like this implies we would need to add "resources which might be valuable to other AIs" to the list of things we can expect any given AI to instrumentally pursue.

This paper by Critch is relevant: it argues that agents with different beliefs will bet their future share of a merged utility function, such that it skews towards whoever's predictions are more correct.

I had read that paper recently, but I think we can abstract away from the issue by saying that (if merging is a thing for them) AIs will use decision procedures that are "closed under" merging, just like we currently focus attention on decision procedures that are "closed under" self-modification. (I suspect that modulo logical uncertainty, which Critch's paper also ignores, UDT might already be such a decision procedure, in other words Critch's argument doesn't apply to UDT, but I haven't spent much time thinking about it.)

A big obstacle to human cooperation is bargaining: deciding how to split the benefit from cooperation. If it didn't exist, I think humans would cooperate more. But the same obstacle also applies to AIs. Sure, there's a general argument that AIs will outperform humans at all tasks and bargaining too, but I'd like to understand it more specifically. Do you have any mechanisms in mind that would make bargaining easier for AIs?

A big obstacle to human cooperation is bargaining: deciding how to split the benefit from cooperation. If it didn’t exist, I think humans would cooperate more.

Can you give some examples of where human cooperation is mainly being stopped by difficulty with bargaining? It seems to me like enforcing deals is usually the bigger part of the problem. For example in large companies there are a lot of inefficiencies like shirking, monitoring costs to reduce shirking, political infighting, empire building, CYA, red tape, etc., which get worse as companies get bigger. It sure seems like enforcement (i.e., there's no way to enforce a deal where everyone agrees to stop doing these things) rather than bargaining is the main problem there.

Or consider the inefficiencies in academia, where people often focus more on getting papers published than working on the most important problems. I think that's mainly because an agreement to reward people for publishing papers is easily enforceable and while an agreement to reward people for working on the most important problems isn't. I don't see how improved bargaining would solve this problem.

Do you have any mechanisms in mind that would make bargaining easier for AIs?

I haven't thought about this much, but perhaps if AIs had introspective access to their utility functions, that would make it easier for them to make use of formal bargaining solutions that take utility functions as inputs? Generally it seems likely that AIs will be better at bargaining than humans, for the same kind of reason as here, but AFAICT just making enforcement easier would probably suffice to greatly reduce coordination costs.

Can you give some examples of where human cooperation is mainly being stopped by difficulty with bargaining?

Two kids fighting over a toy; a married couple arguing about who should do the dishes; war.

But now I think I can answer my own question. War only happens if two agents don't have common knowledge about who would win (otherwise they'd agree to skip the costs of war). So if AIs are better than humans at establishing that kind of common knowledge, that makes bargaining failure less likely.

War only happens if two agents don’t have common knowledge about who would win (otherwise they’d agree to skip the costs of war).

But that assumes strong ability to enforcement agreements (which humans typically don't have). For example suppose it's common knowledge that if countries A and B went to war, A would conquer B with probability .9 and it would cost each side $1 trillion. If they could enforce agreements, then they could agree to roll a 10-sided die in place of the war and save$1 trillion each, but if they couldn't, then A would go to war with B anyway if it lost the roll, so now B has a .99 probability of being taken over. Alternatively maybe B agrees to be taken over by A with certainty but get some compensation to cover the .1 chance that it doesn't lose the war. But after taking over B, A could just expropriate all of B's property including the compensation that it paid.

War only happens if two agents don’t have common knowledge about who would win (otherwise they’d agree to skip the costs of war).

They might also have poorly aligned incentives, like a war between two countries that allows both governments to gain power and prestige, at the cost of destruction that is borne by the ordinary people of both countries. But this sort of principle-agent problem also seems like something AIs should be better at dealing with.

Not only of who would win, but also about the costs it would have. I think the difficulty in establishing common knowledge about this is in part due to people traing to deceive each other. Its not clear that the ability to see through deception improves faster than the ability to deceive with increasing intelligence.

The claim that AI is vastly better at coordination seems to me implausible on its face. I'm open to argument, but will remain skeptical until I hear good arguments.

Have you considered the specific mechanism that I proposed, and if so what do you find implausible about it? (If not, see this longer post or this shorter comment.)

I did manage to find a quote from you that perhaps explains most of our disagreement on this specific mechanism:

There are many other factors that influence coordination, after all; even perfect value matching is consistent with quite poor coordination.

Can you elaborate on what these other factors are? It seems to me that most coordination costs in the real world come from value differences, so it's puzzling to see you write this.

Abstracting away from the specific mechanism, as a more general argument, AI designers or evolution will (sooner or later) be able to explore a much larger region of mind design space than biological evolution could. Within this region there are bound to be minds much better at coordination than humans, and we should certainly expect coordination ability to be one objective that AI designers or evolution will optimize for since it offers a significant competitive advantage.

This doesn't guarantee that the designs that end up "winning" will have much better coordination ability than humans because maybe the designers/evolution will be forced to trade off coordination ability for something else they value, to the extent that the "winner" don't coordinate much better than humans, but that doesn't seem like something we should expect to happen by default, without some specific reason to, and it becomes less and less likely as more and more of mind design space is explored.

It seems to me that computers don't suffer from most of the constraints humans do. For example, AI can expose its source code and its error-less memory. Humans have no such option, and our very best approximations are made of stories and error-prone memory.

They can provide guarantees which humans cannot, simulate one another within precise boundaries in a way humans cannot, calculate risk and confidence levels in a way humans cannot, communicate their preferences precisely in a way humans cannot. All of this seems to point in the direction of increased clarity and accuracy of trust.

On the other hand, I see no reason to believe AI will have the strong bias in favor of coordination or trust that we have, so it is possible that clear and accurate trust levels will make coordination a rare event. That seems off to me though, because it feels like saying they would be better off working alone in a world filled with potential competitors. That statement flatly disagrees with my reading of history.

Extending this: trust problems could impede the flow of information in the first place in such a way that the introspective access stops being an amplifier across a system boundary. An AI can expose some code, but an AI that trusts other AIs to be exposing their code in a trustworthy fashion rather than choosing what code to show based on what will make the conversation partner do something they want seems like it'd be exploitable, and an AI that always exposes its code in a trustworthy fashion may also be exploitable.

Human societies do “creating enclaves of higher trust within a surrounding environment of lower trust” a lot, and it does improve coordination when it works right. I don't know which way this would swing for super-coordination among AIs.

What evidence would convince you otherwise? Would superhuman performance in games that require difficult coordination be compelling?

Deepmind has outlined Hanabi as one of the next games to tackle: https://arxiv.org/abs/1902.00506

Avalon is another example, with better current performance.

As a subset of the claim that AI is vastly better at everything, being vastly better at coordination is plausible. The specific arguments that AI somehow has (unlike any intelligence or optimization process we know of today) introspection into it's "utility function" or can provide non-behavioral evidence of it's intent to similarly-powerful AIs seem pretty weak.

I haven't seen anyone attempting to model shifting equilibria and negotiation/conflict among AIs (and coalitions of AIs and of AIs + humans) with differing goals and levels of computational power, so it seems pretty unfounded to speculate on how "coordination" as a general topic will play out.

I'd expect a designed thing to have much cleaner, much more comprehensible internals. If you gave a human a compromise utility function and told them that it was a perfect average of their desires (or their tribe's desires) and their opponents' desires, they would not be able to verify this, they wouldn't recognise their utility function, they might not even individually possess it (again, human values seem to be a bit distributed), and they would be inclined to reject a fair deal, humans tend to see their other only in extreme shades, more foreign than they really are.

Do you not believe that an AGI is likely to be self-comprehending? I wonder, sir, do you still not anticipate foom? Is it connected to that disagreement?

This post is excellent, in that it has a very high importance-to-word-count ratio. It'll take up only a page or so, but convey a very useful and relevant idea, and moreover ask an important question that will hopefully stimulate further thought.

As usual, a converse problem is likely to give useful insights. I would approach this issue from the other direction: what prevents humans from coordinating with each other at low cost? Do we expect AIs to have similar traits/issues?

One possible way for AIs to coordinate with each other is for two or more AIs to modify their individual utility functions into some compromise utility function, in a mutually verifiable way, or equivalently to jointly construct a successor AI with the same compromise utility function and then hand over control of resources to the successor AI. This simply isn't something that humans can do.

modify their individual utility functions into some compromise utility function, in a mutually verifiable way, or equivalently to jointly construct a successor AI with the same compromise utility function and then hand over control of resources to the successor AI

This is precisely equivalent to Coasean efficiency, FWIW - indeed, correspondence with some "compromise" welfare function is what it means for an outcome to be efficient in this sense. It's definitely the case that humans, and agents more generally, can face obstacles to achieving this, so that they're limited to some constrained-efficient outcome - something that does maximize some welfare function, but only after taking some inevitable constraints into account!

(For instance, if the pricing of some commodity, service or whatever is bounded due to an information problem, so that "cheap" versions of it predominate, then the marginal rates of transformation won't necessarily be equalized across agents. Agent A might put her endowment towards goal X, while agent B will use her own resources to pursue some goal Y. But that's a constraint that could in principle be well-defined - a transaction cost. Put them all together, and you'll understand how these constraints determine what you lose to inefficiency - the "price of anarchy", so to speak.)

Strong upvote, very good to know

Agent A might put her endowment towards goal X, while agent B will use her own resources to pursue some goal Y

I internalised the meaning of these variables only to find you didn't refer to them again. What was the point of this sentence.

But jointly constructing a successor with compromise values and then giving them the reins is something humans can sort of do via parenting, there's just more fuzziness and randomness and drift involved, no? That is, assuming human children take a bunch of the structure of their mindsets from what their parents teach them, which certainly seems to be the case on the face of it.

Yes, but humans generally hand off resources to their children as late as possible (whereas the AIs in my scheme would do so as soon as possible) which suggests that coordination is not the primary purpose for humans to have children.

I'm pretty sure nobility frequently arranged marriages to do exactly this, for this purpose, to avoid costly conflicts.

Humans tend to highly value their own personal experiences - getting to do things that feel fun, acquiring personal status, putting intrinsic value on their own survival, etc. This limits the extent to which they'll co-operate with a group, since their own interests and the interests of the group are only partially the same. AIs with less personal interests would be better incentivized to coordinate - if you only care about the amount of paperclips in the universe, you will be able to better further that goal with others than if each AI was instead optimizing for the amount of paperclips that they personally got to produce.

Some academics argue that religion etc. evolved for the purpose of suppressing personal interests and giving them a common impersonal goal, partially getting around this problem. I discussed this and its connection to these matters a bit in my old post Intelligence Explosion vs. Co-operative Explosion.

Nominating along with its sequel. Make important point about competitiveness and coordination.

While I think this post isn't the best writeup of this topic I can imagine, I think it makes a really important point quite succinctly, and is one that I have brought up many times in arguments around takeoff speeds and risk scenarios since this post came out.

Summary: There are a number of differences between how humans cooperate and how hypothetical AI agents could cooperate, and these differences have important strategic implications for AI forecasting and safety. The first big implication is that AIs with explicit utility functions will be able to merge their values. This merging may have the effect of rending laws and norms obsolete, since large conflicts would no longer occur. The second big implication is that our approaches to AI safety should preserve the ability for AIs to cooperate. This is because if AIs *don't* have the ability to cooperate, they might not be as effective, as they will be outcompeted by factions who can cooperate better.

Opinion: My usual starting point for future forecasting is to assume that AI won't alter any long term trends, and then update from there on the evidence. Most technologies haven't disrupted centuries-long trends in conflict resolution, which makes me hesitant to accept the first implication. Here, I think the biggest weakness in the argument is the assumption that powerful AIs should be described as having explicit utility functions. I still think that cooperation will be easier in the future, but it probably won't follow a radical departure from past trends.

Here, I think the biggest weakness in the argument is the assumption that powerful AIs should be described as having explicit utility functions.

I gave a more general argument in response to Robin Hanson, which doesn't depend on this assumption. Curious if you find that more convincing.

Thanks for the elaboration. You quoted Robin Hanson as saying

There are many other factors that influence coordination, after all; even perfect value matching is consistent with quite poor coordination.

My model says that this is about right. It generally takes a few more things for people to cooperate, such as common knowledge of perfect value matching, common knowledge of willingness to cooperate, and an understanding of the benefits of cooperation.

By assumption, AIs will become smarter than humans, which makes me think they will understand the benefits of cooperation better than we do. But this understanding won't be gained "all at once" but will instead be continuous with the past. This is essentially why I think cooperation will be easier in the future, but that it will more-or-less follow a gradual transition from our current trends (I think cooperation has been increasing globally in the last few centuries anyway, for similar reasons).

Abstracting away from the specific mechanism, as a more general argument, AI designers or evolution will (sooner or later) be able to explore a much larger region of mind design space than biological evolution could. Within this region there are bound to be minds much better at coordination than humans, and we should certainly expect coordination ability to be one objective that AI designers or evolution will optimize for since it offers a significant competitive advantage.

I agree that we will be able to search over a larger space of mind-design, and I also agree that this implies that it will be easier to find minds that cooperate.

I don't agree that cooperation necessarily allows you to have a greater competitive advantage. It's worth seeing why this is true in the case of evolution, as I think it carries over to the AI case. Naively, organisms that cooperate would always enjoy some advantages, since they would never have to fight for resources. However, this naive model ignores the fact that genes are selfish: if there is a way to reap the benefits of cooperation without having to pay the price of giving up resources, then organisms will pursue this strategy instead.

This is essentially the same argument that evolutionary game theorists have used to explain the evolution of aggression, as I understand it. Of course, there are some simplifying assumptions which could be worth disputing.

I don’t agree that cooperation necessarily allows you to have a greater competitive advantage. It’s worth seeing why this is true in the case of evolution, as I think it carries over to the AI case. Naively, organisms that cooperate would always enjoy some advantages, since they would never have to fight for resources. However, this naive model ignores the fact that genes are selfish: if there is a way to reap the benefits of cooperation without having to pay the price of giving up resources, then organisms will pursue this strategy instead.

I'm definitely not using the naive model which expects unilateral cooperation to become widespread. Instead when I say "cooperation" I typically have in mind some sort of cooperation that comes with a mechanism for ensuring that it's mutual or reciprocal. You can see this in the concrete example I gave in this linked post.

Have you thought at all about what merged utility function two AI's would agree on? I doubt it would be of the form .

Critch wrote a related paper:

Existing multi-objective reinforcement learning (MORL) algorithms do not account for objectives that arise from players with differing beliefs.Concretely, consider two players with different beliefs and utility functions who may cooperate to build a machine that takes actions on their behalf. A representation is needed for how much the machine’s policy will prioritize each player’s interests over time. Assuming the players have reached common knowledge of their situation, this paper derives a recursion that any Pareto optimal policy must satisfy. Two qualitative observations can be made from the recursion: the machine must (1) use each player’s own beliefs in evaluating how well an action will serve that player’s utility function, and (2) shift the relative priority it assigns to each player’s expected utilities over time, by a factor proportional to how well that player’s beliefs predict the machine’s inputs. Observation (2) represents a substantial divergence from naive linear utility aggregation (as in Harsanyi’s utilitarian theorem, and existing MORL algorithms), which is shown here to be inadequate for Pareto optimal sequential decision-making on behalf of players with different beliefs.

Toward negotiable reinforcement learning: shifting priorities in Pareto optimal sequential decision-making

Stuart Armstrong wrote a post that argued for merged utility functions of this form (plus a tie-breaker), but there are definitely things, like different priors and logical uncertainty, which the argument doesn't take into account, that make it unclear what the actual form of the utility function would be (or if the merged AI would even be doing expected utility maximization). I'm curious what your own reason for doubting it is.

One utility function might turn out much easier to optimize than the other, in which case the harder-to-optimize one will be ignored completely. Random events might influence which utility function is harder to optimize, so one can't necessarily tune in advance to try to take this into account.

One of the reasons was the problem of positive affine scaling preserving behavior, but I see Stuart addresses that.

And actually, some of the reasons for thinking there would be more complicated mixing are going away as I think about it more.

EDIT: yeah if they had the same priors and did unbounded reasoning, I wouldn't be surprised anymore if there exists a that they would agree to.

This may fall in the the fallowing type of reasoning: "Superinteligent AI will be super in any human capability X. Human can cooperate. Thus SAI will have superhuman capability to cooperate."

The problem of such conjecture is that if we take an opposite human quality not-X, SAI will also have superhuman capability in it. For example, if X= cheating, then superintelligent AI will have superhuman capability in cheating.

However, SAI can't be simultaneously super-cooperator and super-cheater.

I think superintelligent AI will probably have superhuman capability at cheating in an absolute sense, i.e., they'll be much better than humans at cheating humans. But I don't see a reason to think they'll be better at cheating other superintelligent AI than humans are at cheating other humans, since SAI will also be superhuman at detecting and preventing cheating.

But 10 000 IQ AI can cheat 1000 IQ AI? If yes, only equally powerful AIs will cooperate.

AI could have a superhuman capability to find win-win solutions and sell it as a service to humans in form of market arbitrage, courts, partner matching (e.g; Tinder).

Based on this win-win solutions finding capability, AI will not have to "take over the world" - it could negotiate its way to global power, and everyone will win because of it (at leat, initially).