Cooperating with agents with different ideas of fairness, while resisting exploitation

If I understand correctly, what Stuart proposes is just a special case of what Eliezer proposes. EY's scheme requires some function mapping the degree of skew in the split to the number of points you're going to take off the total. SA's scheme is the special case where that function is the constant zero.

The more punishing the function you use, the stronger the incentive you create for others to accept your definition of 'fair', but on the other hand, if the party you're trading with genuinely has a different concept of 'fair' and if you're both following this ...

There's an idea from the latest MIRI workshop which I haven't seen in informal theories of negotiation, and I want to know if this is a known idea.

(Old well-known ideas:)

Suppose a standard Prisoner's Dilemma matrix where (3, 3) is the payoff for mutual cooperation, (2, 2) is the payoff for mutual defection, and (0, 5) is the payoff if you cooperate and they defect.

Suppose we're going to play a PD iterated for four rounds.  We have common knowledge of each other's source code so we can apply modal cooperation or similar means of reaching a binding 'agreement' without other enforcement methods.

If we mutually defect on every round, our net mutual payoff is (8, 8).  This is a 'Nash equilibrium' because neither agent can unilaterally change its action and thereby do better, if the opponent's actions stay fixed.  If we mutually cooperate on every round, the result is (12, 12) and this result is on the 'Pareto boundary' because neither agent can do better unless the other agent does worse.  It would seem a desirable principle for rational agents (with common knowledge of each other's source code / common knowledge of rationality) to find an outcome on the Pareto boundary, since otherwise they are leaving value on the table.

But (12, 12) isn't the only possible result on the Pareto boundary.  Suppose that running the opponent's source code, you find that they're willing to cooperate on three rounds and defect on one round, if you cooperate on every round, for a payoff of (9, 14) slanted their way.  If they use their knowledge of your code to predict you refusing to accept that bargain, they will defect on every round for the mutual payoff of (8, 8).
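The arithmetic above can be checked with a small sketch. The function names are illustrative, and the (5, 0) entry for "you defect, they cooperate" is assumed by symmetry since the post does not state it. The Pareto check only looks at pure sequences of moves over four rounds.

```python
from itertools import product

# One-round payoffs as (your payoff, their payoff).
PAYOFF = {
    ("C", "C"): (3, 3),
    ("D", "D"): (2, 2),
    ("C", "D"): (0, 5),
    ("D", "C"): (5, 0),  # assumption: mirror image of (0, 5)
}

def total_payoff(my_moves, their_moves):
    """Sum the per-round payoffs for two equal-length move sequences."""
    mine = theirs = 0
    for m, t in zip(my_moves, their_moves):
        a, b = PAYOFF[(m, t)]
        mine += a
        theirs += b
    return (mine, theirs)

def all_outcomes(rounds=4):
    """Every total payoff reachable by pure move sequences."""
    moves = list(product("CD", repeat=rounds))
    return {total_payoff(m, t) for m in moves for t in moves}

def is_pareto_optimal(outcome, outcomes):
    """No other outcome is at least as good for both players."""
    return not any(
        o != outcome and o[0] >= outcome[0] and o[1] >= outcome[1]
        for o in outcomes
    )

outcomes = all_outcomes()
print(total_payoff("DDDD", "DDDD"))  # mutual defection
print(total_payoff("CCCC", "CCCC"))  # mutual cooperation
print(total_payoff("CCCC", "CCCD"))  # they defect once
print(is_pareto_optimal((9, 14), outcomes))
```

Under these assumptions both (12, 12) and (9, 14) pass the Pareto check while (8, 8) does not, which is why the Pareto criterion alone cannot tell the fair bargain from the lopsided one.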

I would consider it obvious that a rational agent should refuse this unfair bargain.  Otherwise agents with knowledge of your source code will offer you only this bargain, instead of the (12, 12) of mutual cooperation on every round; they will exploit your willingness to accept a result on the Pareto boundary in which almost all of the gains from trade go to them.

Generalizing:  Once you have a notion of a 'fair' result - in this case (12, 12) - then an agent which accepts any outcome in which it does worse than the fair result, while the opponent does better, is 'exploitable' relative to this fair bargain.  As with the Nash equilibrium, the only way you should do worse than 'fair' is if the opponent also does worse.

So we wrote down on the whiteboard an attempted definition of unexploitability in cooperative games as follows:

"Suppose we have a [magical] definition N of a fair outcome.  A rational agent should only do worse than N if its opponent does worse than N, or else [if bargaining fails] should only do worse than the Nash equilibrium if its opponent does worse than the Nash equilibrium."  (Note that this definition precludes giving in to a threat of blackmail.)
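One way to read the whiteboard definition is as a predicate on outcomes, applied once to the fair point N and once to the Nash equilibrium. The sketch below is only an interpretation: `respects` is a hypothetical helper name, outcomes are (your payoff, their payoff), and the (6, 9) "blackmail" outcome is a made-up illustration rather than a payoff reachable in the four-round game.

```python
def respects(outcome, reference):
    """True if you do worse than `reference` only when the opponent also does worse."""
    mine, theirs = outcome
    ref_mine, ref_theirs = reference
    return mine >= ref_mine or theirs < ref_theirs

FAIR = (12, 12)  # the 'magical' fair outcome N from the running example
NASH = (8, 8)    # mutual defection on every round

# Relative to the fair point:
print(respects((9, 14), FAIR))   # False: you lose, they gain
print(respects((8, 8), FAIR))    # True: both do worse than fair
print(respects((10, 11), FAIR))  # True: both do worse than fair

# Relative to the Nash equilibrium (the fallback clause).
# Hypothetical blackmail outcome: you drop below Nash while they rise above it.
print(respects((6, 9), NASH))    # False: giving in is ruled out
```

The predicate says nothing about which outcome is fair. It only constrains what you accept once N is given.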

(Key possible-innovation:)

It then occurred to me that this definition opened the possibility for other, intermediate bargains between the 'fair' solution on the Pareto boundary, and the Nash equilibrium.

Suppose the other agent has a slightly different definition of fairness and they think that what you consider to be a payoff of (12, 12) favors you too much; they think that you're the one making an unfair demand.  They'll refuse (12, 12) with the same feeling of indignation that you would apply to (9, 14).

Well, if you give in to an arrangement with an expected payoff of, say, (11, 13) as you evaluate payoffs, then you're giving other agents an incentive to skew their definitions of fairness.

But it does not create poor incentives (AFAICT) to accept instead a bargain with an expected payoff of, say, (10, 11) which the other agent thinks is 'fair'.  Though they're sad that you refused the truly fair outcome of (as you count utilons) (11, 13) and that you couldn't reach the Pareto boundary together, still, this is better than the Nash equilibrium of (8, 8).  And though you think the bargain is unfair, you are not creating incentives to exploit you.  By insisting on this definition of fairness, the other agent has done worse for themselves than they would have under (12, 12): they get 11 instead of 12.  The other agent probably thinks that (10, 11) is 'unfair' slanted your way, but they likewise accept that this does not create bad incentives, since you did worse than the 'fair' outcome of (11, 13).

There could be many acceptable negotiating equilibria between what you think is the 'fair' point on the Pareto boundary and the Nash equilibrium, so long as each step down in what you think is 'fairness' reduces the total payoff to the other agent, even if it reduces your own payoff even more.  This resists exploitation and avoids creating an incentive for claiming that you have a different definition of fairness, while still holding open the possibility of some degree of cooperation with agents who honestly disagree with you about what's fair and are trying to avoid exploitation themselves.
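To make the two-sided version concrete, here is a rough sketch using the numbers from the example, with all payoffs counted in your utilons as in the post. It assumes each agent applies the same "only do worse than fair if the other side also does worse" test against its own fair point, and that a bargain is only worth taking if it beats the Nash equilibrium for both sides. The names `acceptable_to` and `mutually_acceptable` are illustrative.

```python
MY_FAIR = (12, 12)     # what you consider fair
THEIR_FAIR = (11, 13)  # what the other agent considers fair
NASH = (8, 8)          # fallback if no bargain is reached

def acceptable_to(player, outcome, fair_point):
    """player is 0 (you) or 1 (the other agent).

    The player accepts doing worse than its own fair point only if
    the other side also does worse than that fair point.
    """
    other = 1 - player
    does_worse = outcome[player] < fair_point[player]
    other_does_worse = outcome[other] < fair_point[other]
    return (not does_worse) or other_does_worse

def mutually_acceptable(outcome):
    # Assumption: only bargains strictly better than Nash for both are considered.
    beats_nash = outcome[0] > NASH[0] and outcome[1] > NASH[1]
    return (
        beats_nash
        and acceptable_to(0, outcome, MY_FAIR)
        and acceptable_to(1, outcome, THEIR_FAIR)
    )

candidates = [(12, 12), (11, 13), (10, 11), (9, 14), (8, 8)]
agreed = [c for c in candidates if mutually_acceptable(c)]
print(agreed)  # [(10, 11)]
```

Of the listed candidates only (10, 11) survives: the other agent vetoes (12, 12), you veto (11, 13) and (9, 14), and (8, 8) is just the fallback.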

This translates into an informal principle of negotiations:  Be willing to accept unfair bargains, but only if (you make it clear) both sides are doing worse than what you consider to be a fair bargain.

I haven't seen this advocated before even as an informal principle of negotiations.  Is it in the literature anywhere?  Someone suggested Schelling might have said it, but didn't provide a chapter number.