An Alternative Approach to AI Cooperation

by Wei_Dai2 min read31st Jul 200925 comments

8

Personal Blog

[This post summarizes my side of a conversation between me and cousin_it, and continues it.]

Several people here have shown interest in an approach to modeling AI interactions that was suggested by Eliezer Yudkowsky: assume that AIs can gain common knowledge of each other's source code, and explore the decision/game theory that results from this assumption.

In this post, I'd like to describe an alternative approach*, based on the idea that two or more AIs may be able to securely merge themselves into a joint machine, and allow this joint machine to make and carry out subsequent decisions. I argue that this assumption is as plausible as that of common knowledge of source code, since it can be built upon the same technological foundation that has been proposed to implement common knowledge of source code. That proposal, by Tim Freeman, was this:

Entity A could prove to entity B that it has source code S by
consenting to be replaced by a new entity A' that was constructed by a
manufacturing process jointly monitored by A and B.  During this
process, both A and B observe that A' is constructed to run source
code S.  After A' is constructed, A shuts down and gives all of its
resources to A'.

Notice that the same technology can be used for two AIs to merge into a single machine running source code S (which they both agreed upon). All that needs to be changed from the above process is for B to also shut down and give all of its resources to A' after A' is constructed. Not knowing if there is a standard name for this kind of technology, I've given it the moniker "secure joint construction."

I conjecture that the two approaches are equal in power, in the sense that any cooperation made possible by the common knowledge of source code is also possible given the secure merger ability, and vice versa. This is because under the assumption of common knowledge of source code, the likely outcome is for all AIs to modify themselves into using the same decision algorithm, with that algorithm making and carrying out subsequent decisions. The collection of these cooperative machines running the same algorithm can be viewed as one distributed machine, thus suggesting the equivalence of the two approaches.

It is conceptually simpler to assume that AIs will merge into a centralized joint machine. This causes no loss of generality, since if for some reason the AIs find it more advantageous to merge into a distributed joint machine, they will surely come up with solutions to distributed computing problems like friend-or-foe recognition and consensus by themselves. The merger approach allows such issues to be abstracted away as implementation details of the joint machine.

Another way to view these two approaches is that they each offers a way for AIs to enforce agreements, comparable with the enforcement of contracts by a court, except that the assumed technologies allow the AIs to enforce agreements without a trusted third party, and with potentially higher assurance of compliance. This allows most AI-AI interactions to be modeled using cooperative game theory, which assumes such enforcement of agreements.

* My original proposal, posted on SL4, was that AIs would use Bayesian aggregation to determine the decision algorithm of their joint machine. I later realized that cooperative game theory is a better fit, because only a cooperative game solution ensures that each AI has sufficient incentives to merge.

[It appears to me that cousin_it and I share many understandings, while Vladimir Nesov and Eliezer seem to have ideas closer to each other and to share certain insights that I am not able to access. I hope this post encourages them to clarify their ideas relative to those of cousin_it and I.]

8

25 comments, sorted by Highlighting new comments since Today at 1:45 PM
New Comment

Would this be a sufficient formalization of your idea (restricting to two players for simplicity)?

Given a finite two-player game G, construct the game G_M (for 'merge') as follows: each player picks both a strategy profile of G (i.e., a function assigning a strategy to each player), and a "fallback" strategy. If both players select the same strategy profile, they get the corresponding payoff from G. If they select different strategy profiles, they get the payoff according to their fallback strategies.

Contrast with the formalization of source-code swapping, for two players:

Given a finite two-player game G, construct the game G_S (for 'source swapping') as follows: each player chooses a program that takes the other players' programs as input, that outputs the strategy to play in game G, and that terminates for all inputs. Players' payoffs are determined by the strategies their programs choose.

The first is arguably simpler to reason about, but it seems that any contract the players may want to enter into would be a Nash equilibrium of G_M, and for every Nash equilibrium of G_M there is a Nash equilibrium of G_S that has the same payoffs:

  • Suppose s is a strategy profile of G and ((s,f1),(s,f2)) is a Nash equilibrium of G_M. Then, consider the programs P1 = "if the other player plays P2, play s_1; otherwise, play f1" and P2 = "if the other player plays P1, play s_2; otherwise, play f2," constructed by mutual Quining. It's easy to see that (P1,P2) is a Nash equilibrium with the same payoffs.
  • Suppose ((s,f1),(t,f2)) is a Nash equilibrium of G_M, with s != t. Then, neither party can do better by switching to a different fallback strategy. Let C_x be the constant program that outputs x regardless of its input. Again, it's easy to see that (C_f1,C_f2) is a Nash equilibrium with the same payoffs.

I think that the converse is also true (for every Nash equilibrium of G_S, there is an equilibrium of G_M with the same payoffs), but I haven't been able to prove it; the problem is to find appropriate fallback strategies in G, and the idea is to show that if there aren't any, then to be a Nash equilibrium, P1 and P2 must act differently depending on what the program they're passed does, and to do that in general, they would have to solve the halting problem. But so far, I haven't been able to prove this.

I'd be happy to expand the "easy to see"s if desired :-)

I think that the converse is also true (for every Nash equilibrium of G_S, there is an equilibrium of G_M with the same payoffs)

It seems to me that you're right. In any Nash equilibrium of G_S you get no less than your security level in G, otherwise you'd switch. So your fallback strategy in G_M only has to bring the other guy down to his security level regardless of the cost to yourself.

This also tells us which outcomes of G are implementable as Nash equilibria in G_M and G_S: if an outcome gives each player no less than their security value, it's implementable, and vice versa.

Not sure that this is the best formalization of Wei Dai's idea, though. I don't completely understand it yet.

This also tells us which outcomes of G are implementable as Nash equilibria in G_M and G_S: if an outcome gives each player no less than their security value, it's implementable, and vice versa.

Hm, nice formulation, but no, not quite, I think. Security value is the best you can get if you move first, right? I think an outcome is Nash implementable in G_M iff it gives each player no less than they can get if they move second -- the fallback strategy of the other player is fixed, the deviating player can take it as the "first move."

As for G_S, I think any outcome can be Nash implemented if it gives each player no less than if they move second (simply use the translation from G_M), and any outcome that can be Nash implemented in G_S must give each player no less than if they move first (because otherwise they could deviate by using a constant function). But I'm not sure about the outcomes that fall between these two conditions.

But at least, this characterization makes clear that most outcomes we're usually interested in can indeed be implemented in both G_M and G_S. Thanks, I was actually looking for a succinct way to state this condition for G_S earlier :-)

Security value is the best you can get if you move first, right? I think an outcome is Nash implementable in G_M iff it gives each player no less than they can get if they move second -- the fallback strategy of the other player is fixed, the deviating player can take it as the "first move."

If mixed strategies are allowed, these values coincide by the minimax theorem. I think my conclusions were correct.

Hm. I see. Thanks for pointing that out. Embarrassingly, I completely ignored mixed strategies above -- implicitly assumed that the constructions of G_M and G_S would be over pure strategies of G, and analyzed only pure strategy profiles of G_M and G_S.

I do see how constructing G_M and G_S over mixed strategies of G would make the two values equal by the minimax theorem, but I think there are complications when we analyze mixed strategies. Mixed Nash equilibria of G_S can enforce correlated play in G, even without relying on cryptography, as follows: Have the pure strategies in the equilibrium correspond to a common quined program (as before) plus a natural number < n, for some given n. Then the program adds the different players' numbers modulo n, and uses the result to determine the strategy profile in G. If a player chooses their number with uniform probability through a mixed strategy of G_S, then they know that all sums modulo n are equally likely, no matter how the other players choose their numbers. So everybody choosing their numbers uniformly at random can be a Nash equilibrium, because no player can change the expected result by unilaterally switching to a different way of choosing their number.

But G_M, as defined above (whether over pure or mixed strategies), can, as far as I can see, not enforce correlated play in G.

A natural way to rectify this is to define the first component of G_M to be neither just a pure nor just a mixed strategy profile of G, but an arbitrary distribution over pure strategy profiles of G -- a "correlated strategy profile." Can the earlier proof then be used to show that G_M has a mixed strategy profile realizing a certain outcome ⇔ G has "correlated strategy profile" realizing that outcome that pays each player at least their security value ⇔ G_S has a mixed strategy profile realizing that outcome?

Good catch. To tell the truth, I didn't even think about mixing strategies in G_M and G_S, only playing deterministically and purely "on top of" mixed strategies in G. When we add mixing, G_S does turn out to be stronger than G_M due to correlated play; your construction is very nice.

Your final result is correct, here's a proof:

1) Any Nash equilibrium of G_S or "new" G_M plays a correlated strategy profile of G (by definition of correlated strategy profile, it's broad enough) that on average gives each player no less than their security value (otherwise they'd switch).

2) Any such "good" profile can be implemented as a Nash equilibrium of G_S in mixed strategies, using your construction above and the usual method of punishment by security level. If all of the enemy's strategies are quine-compatible with yours, the profile gets played exactly thanks to the niceties of your construction. If any small part of his strategies are incompatible with yours, that's enough to bring him down on average.

3) For "new" G_M you just submit the profile and the punishment fallback. So "new" G_M doesn't actually need mixed strategies, our latest deus ex machina is too strong.

Thanks, Benja and cousin_it, for working out the math. The "new" G_M looks correct to me (and let's just call it G_M from now, since I see no reason why the joint machine can't be programmed with a correlated strategy profile). To make it a little more realistic, we can add a step after the joint machine is constructed, where each player has a choice to transfer his resources or not. It's pretty obvious that adding this step makes no difference in the outcome, since the players can program the joint machine to execute their fallback strategies if one of the player fails to transfer.

Benja's method for G_S to enforce correlated play in G seems to require simultaneous source-swapping. Otherwise, if I choose my random number first, then the second player can pick his number to favor him, right? It seems that the common quined program has to use some cryptographic method to generate a common random number in a way that's not vulnerable to manipulation by the players.

When two AIs merge this way, you can consider them pre-merge as already being a single distributed process. Then, the process of merging is just some dynamic that happens to the distributed process of both AIs. What did actually happen in this dynamic?

Consider any method for committing to run published source code that isn't vulnerable to deception (Freeman's works, I think).

If you are saying that physically supervised joint reincarnation is just (if it works) another type of enforceable mutual source code promise, and otherwise changes nothing, I agree.

They went from running on two different nodes with different code, with each node controlling its own resources, to running on just one node with some code that they both agreed upon, that controls both of their resources.

I added "(which they both agreed upon)" to the post. Does that clear it up, or is the unclarity somewhere else?

What is the difference between two (interacting) nodes controlling their own resources and (instead of them) one distributed node controlling the resources? What do you mean by "control"? What is gained by calling basically the same thing "one node" and not "two nodes"? If the important part is "agreed-upon", how does it work in terms of the two interacting processes, and how to translate it in a statement about the joint process?

I'm confused about your confusion. I'm using familiar human terms, and assuming straightforward (though not necessarily easy) translation into AI implementation. Is there no obvious counterpart for "control" and "agreed-upon" in AI?

What if we assume that the two original AIs are human uploads? Does that help?

Or taken two flesh-and-blood humans, and suppose they jointly program a robot and then die, physically handing their assets over to the robot before they die. Would you also say the two humans can already be considered as a single distributed process, and there's nothing to be gained by calling the same thing "one node" and not "two nodes"?

What if the first AI can just ask the second AI to do anything with its resources, and the second AI just does that? Just because the AIs agree to some action (merging), doesn't mean that the action was really an optimal choice for them: they could be wrong. You need a more formal model to deal with such issues (for example, the AIs could have a utility function over outcomes, but then again we face the issue of bargaining).

I don't see how this setting helps to reduce bargaining/cooperation.

I assume that the original AIs use a standard decision theory. You're right that it doesn't reduce bargaining. That wasn't the goal. The goal is to allow any agreements that do form to be enforced. In other words, I'm trying to eliminate courts, not bargaining.

Of course a solution that also eliminates bargaining would be even better, but that might be too ambitious. Feel free to prove me wrong though. :)

Agreements are solutions to bargaining, so how can you have agreements without solving bargaining? In other words, if you just assign some agreement without considering a process of reaching it, why do you need to consider the two agents before the merge at all?

I don't just assign some agreement without considering a process of reaching it. I make that a separate problem to be solved, and just assume here that AIs do have some way to reach agreements by bargaining, in order to focus on the mechanism for enforcement of agreements.

ETA: Do you think common knowledge of source code can help with the bargaining problem in some way? Maybe that's the insight I'm missing?

Individual verified-source reincarnation and joint verified-source reincarnation on a single computing substrate are equivalent in every way.

Call A+ the reincarnation of A, and B+ the reincarnation of B.

There must be some way of observing that the irrevocable transfer of authority has actually occurred. This means you need some way of trusting promises by all entities that owe some allegiance to A or B. So far the only method proposed to establish trust in one's decision rule is to submit to verified reincarnation. This may not be to the liking of non-machine intelligence, and at best costly to those involved, if not infeasible due to speed-of-light physical separation.

Further, A+ shouldn't depart, in any way it wasn't willing to do unilaterally for its own sake, from the values and decision rules of A, unless there's a secure atomic transfer of authority from A->A+ and B->B+. I don't see how such a thing is possible, so A+ really can't commit to any cooperation until it verifies that B->B+ has occurred.

So, A+ and B+ would differ from A and B only in that their own end of the contract they propose to enter would become active iff the other actually reincarnates (A+ can do this once B->B+, because she knows that B+ will do the same). But it is necessary for one of A+ or B+ to reincarnate first.

"On the count of 3, hang up!"

This may not be to the liking of non-machine intelligence, and at best costly to those involved, if not infeasible due to speed-of-light physical separation.

Yes, I agree. That's why I tried to make clear that the necessary technology is being assumed into existence.

But it is necessary for one of A+ or B+ to reincarnate first.

In joint verified-source reincarnation, or what I called secure merger, there is just A'. Let's ask, after A' is constructed, but before A and B have transferred their resources, is there any incentive for either A or B to cheat and fail to transfer? Not if they programed A' in such a way that A and B are each individually made no worse off by transferring. For example, A and B could program A' so that it would act exactly as A would act if B fails to transfer, and exactly as B would act if A fails to transfer, and carry out the cooperative solution only if both sides transfer.

I made my opening claim (equivalence of reincarnation on independent vs. shared substrates) so I didn't have to fuss about notation. Do you disagree?

As for making each individual's transfer risk-free, that was my point - the new version of oneself must be substantially identical to the old one. The only new payload should be the (uncheatable) protocol for committing to new agreements. You've verified that your counterpart implements the same protocol, and you can make binding agreements.

Of course, we assume A and B don't mind reincarnation ,and feel no angst at brief coexistence of multiple copies of themselves, or the destruction of one.

I think in the centralized model, it's easier to see how the resource transfer could be made risk-free. In the distributed model, you have to deal with distributed systems issues, which seem like a distraction to the main point that a technology like "secure joint construction" or "verified source-code reincarnation" can be used to enforce cooperative agreements among AIs (if they don't mind being merged/reincarnated)

The process of transferring resources might be done visibly and gradually, like the iterated donation pact in this comment.

I agree that it might be nice to transfer gradually, but if you avoid the risk of losing any fraction, then you solve the problem of one-shot transfer also. If you don't solve that problem, then you still risk losing something to treachery. Maybe you could suggest a concrete protocol for a verified-source reincarnation scenario.

I'm also envisioning some non-divisible resources as well, e.g. privileges and obligations with respect to independent entities (we'll assume these were crafted to transfer on reincarnation).

Are the machines' goals equivalent, compatible, or incompatible?

In the first case, cooperation is no problem, the second case represents a rather unlikely scenario, and in the third case, you probably won't see much cooperation.

The second case may be more plausible in a non-zero-sum environment, where the merged agent can achieve more of each prior agent's goals than they would each achieve in a competitive setting. Think of the Superhappies' propensity to merge goals with other species as an analogue.

Companies merge sometimes - that seems like a kind of proof of concept. They are usually trying to gain mass in order to defeat third parties.