# Cooperating with agents with different ideas of fairness, while resisting exploitation

There's an idea from the latest MIRI workshop which I haven't seen in informal theories of negotiation, and I want to know if this is a known idea.

*(Old well-known ideas:)*

Suppose a standard Prisoner's Dilemma matrix where (3, 3) is the payoff for mutual cooperation, (2, 2) is the payoff for mutual defection, and (0, 5) is the payoff if you cooperate and they defect.

Suppose we're going to play a PD iterated for four rounds. We have common knowledge of each other's source code so we can apply modal cooperation or similar means of reaching a binding 'agreement' without other enforcement methods.

If we mutually defect on every round, our net mutual payoff is (8, 8). This is a 'Nash equilibrium' because neither agent can unilaterally change its action and thereby do better, if the opponent's actions stay fixed. If we mutually cooperate on every round, the result is (12, 12) and this result is on the 'Pareto boundary' because neither agent can do better unless the other agent does worse. It would seem a desirable principle for rational agents (with common knowledge of each other's source code / common knowledge of rationality) to find an outcome on the Pareto boundary, since otherwise they are leaving value on the table.
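
To make the arithmetic concrete, here is a minimal sketch that totals the payoffs over four rounds. The (5, 0) entry is the assumed symmetric counterpart of the (0, 5) entry above, and the "Nash" check only tries pure move sequences against an opponent whose moves are held fixed at all-defect; the names `PAYOFF` and `total_payoff` are mine, for illustration.

```python
from itertools import product

# Per-round payoffs as (you, them). "C" = cooperate, "D" = defect.
PAYOFF = {
    ("C", "C"): (3, 3),
    ("D", "D"): (2, 2),
    ("C", "D"): (0, 5),
    ("D", "C"): (5, 0),  # assumed mirror image of the (0, 5) entry
}

def total_payoff(my_moves, their_moves):
    """Sum the per-round payoffs over an iterated game."""
    mine = theirs = 0
    for a, b in zip(my_moves, their_moves):
        pa, pb = PAYOFF[(a, b)]
        mine += pa
        theirs += pb
    return (mine, theirs)

ROUNDS = 4
all_defect = total_payoff("D" * ROUNDS, "D" * ROUNDS)
all_cooperate = total_payoff("C" * ROUNDS, "C" * ROUNDS)

# Best total you can get by changing only your own moves while the
# opponent's moves stay fixed at defect on every round.
best_reply_vs_all_defect = max(
    total_payoff(moves, "D" * ROUNDS)[0]
    for moves in product("CD", repeat=ROUNDS)
)

print(all_defect, all_cooperate, best_reply_vs_all_defect)
```

No unilateral change beats the 8 you already get from defecting every round, which is the sense of 'Nash equilibrium' used here.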

But (12, 12) isn't the only possible result on the Pareto boundary. Suppose that running the opponent's source code, you find that they're willing to cooperate on three rounds and defect on one round, if you cooperate on *every* round, for a payoff of (9, 14) slanted their way. If they use their knowledge of your code to predict you refusing to accept that bargain, they will defect on every round for the mutual payoff of (8, 8).
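
As a sanity check on the claim that both (12, 12) and (9, 14) sit on the Pareto boundary, the following sketch enumerates every pair of four-round move sequences and tests for domination. It is restricted to pure move sequences (no randomized mixtures), and the helper names are hypothetical.

```python
from itertools import product

# Per-round payoffs as (you, them); the (5, 0) entry is the assumed mirror of (0, 5).
PAYOFF = {("C", "C"): (3, 3), ("D", "D"): (2, 2), ("C", "D"): (0, 5), ("D", "C"): (5, 0)}

def total_payoff(my_moves, their_moves):
    """Sum the per-round payoffs over an iterated game."""
    mine = sum(PAYOFF[(a, b)][0] for a, b in zip(my_moves, their_moves))
    theirs = sum(PAYOFF[(a, b)][1] for a, b in zip(my_moves, their_moves))
    return (mine, theirs)

# Every outcome reachable with pure move sequences over four rounds.
sequences = list(product("CD", repeat=4))
outcomes = {total_payoff(m, t) for m in sequences for t in sequences}

def dominates(p, q):
    """p is at least as good as q for both players and differs from q."""
    return p[0] >= q[0] and p[1] >= q[1] and p != q

def is_pareto_optimal(point):
    """No reachable outcome dominates `point`."""
    return not any(dominates(other, point) for other in outcomes)

# You cooperate every round; they cooperate three times and defect once.
slanted = total_payoff("CCCC", "CCCD")

print(slanted, is_pareto_optimal(slanted), is_pareto_optimal((12, 12)), is_pareto_optimal((8, 8)))
```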

I would consider it obvious that a rational agent should refuse this unfair bargain. Otherwise agents with knowledge of your source code will offer you *only* this bargain, instead of the (12, 12) of mutual cooperation on every round; they will exploit your willingness to accept a result on the Pareto boundary in which almost all of the gains from trade go to them.

*(Newer ideas:)*

Generalizing: Once you have a notion of a 'fair' result - in this case (12, 12) - then an agent which accepts any outcome in which it does worse than the fair result, while the opponent does *better*, is 'exploitable' relative to this fair bargain. As with the Nash equilibrium, the only way you should do worse than 'fair' is if the opponent also does worse.

So we wrote down on the whiteboard an attempted definition of unexploitability in cooperative games as follows:

"Suppose we have a [magical] definition N of a fair outcome. A rational agent should only do worse than N if its opponent does worse than N, or else [if bargaining fails] should only do worse than the Nash equilibrium if its opponent does worse than the Nash equilibrium." (Note that this definition precludes giving in to a threat of blackmail.)

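One way to read the whiteboard definition as a predicate is sketched below. This is my own interpretation, not something we wrote down: it checks both reference points (the fair outcome N and the Nash equilibrium) for every candidate outcome, and it treats "the opponent does exactly as well as the reference" as not doing worse. Outcomes are written as (you, them); the function names are hypothetical.

```python
def is_exploited(outcome, reference):
    """You do worse than the reference while the opponent does not do worse."""
    mine, theirs = outcome
    ref_mine, ref_theirs = reference
    return mine < ref_mine and theirs >= ref_theirs

def acceptable(outcome, fair, nash):
    """Accept only outcomes that are not exploitative relative to either
    the fair point N or the Nash equilibrium (assumed reading)."""
    return not is_exploited(outcome, fair) and not is_exploited(outcome, nash)

FAIR = (12, 12)
NASH = (8, 8)

for candidate in [(12, 12), (9, 14), (10, 11), (8, 8), (6, 11)]:
    print(candidate, acceptable(candidate, FAIR, NASH))
```

Under this reading (9, 14) is refused because you fall below 12 while the opponent rises above it, and (6, 11) is refused because it puts you below Nash while the opponent stays above Nash, even though both of you are below the fair point.
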
*(Key possible-innovation:)*

It then occurred to me that this definition opened the possibility for other, intermediate bargains between the 'fair' solution on the Pareto boundary, and the Nash equilibrium.

Suppose the other agent has a slightly different definition of fairness and they think that what you consider to be a payoff of (12, 12) favors you too much; they think that you're the one making an unfair demand. They'll refuse (12, 12) with the same feeling of indignation that you would apply to (9, 14).

Well, if you give in to an arrangement with an expected payoff of, say, (11, 13) as you evaluate payoffs, then you're giving other agents an incentive to skew their definitions of fairness.

But it does *not* create poor incentives (AFAICT) to accept instead a bargain with an expected payoff of, say, (10, 11) which the other agent thinks is 'fair'. Though they're sad that you refused the truly fair outcome of (as you count utilons) (11, 13) and that you couldn't reach the Pareto boundary together, still, this is better than the Nash equilibrium of (8, 8). And though you think the bargain is unfair, you are not creating incentives to exploit you. By insisting on this definition of fairness, the other agent has done worse for themselves than they would have under (12, 12). The other agent probably thinks that (10, 11) is 'unfair' slanted your way, but they likewise accept that this does not create bad incentives, since you did worse than the 'fair' outcome of (11, 13).
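
The two-sided version of this check can be sketched as follows. All payoffs are in your utilons, written (you, them); your fair point is (12, 12) and theirs is (11, 13), as in the example. The acceptance rule is the assumed one from the whiteboard definition: refuse only when you fall below your own fair share while the other side does not fall below theirs.

```python
YOUR_FAIR = (12, 12)   # what you consider fair
THEIR_FAIR = (11, 13)  # what they consider fair, in your utilons

def you_accept(outcome):
    """Refuse only if you drop below your fair share and they do not."""
    you, them = outcome
    return not (you < YOUR_FAIR[0] and them >= YOUR_FAIR[1])

def they_accept(outcome):
    """Same rule from their side, measured against their fair point."""
    you, them = outcome
    return not (them < THEIR_FAIR[1] and you >= THEIR_FAIR[0])

for outcome in [(12, 12), (11, 13), (10, 11), (8, 8)]:
    print(outcome, you_accept(outcome), they_accept(outcome))
```

Each side refuses the other's fair point, but both accept (10, 11) and both accept the fallback (8, 8), so (10, 11) is an intermediate bargain that neither side counts as exploitation.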

There could be many acceptable negotiating equilibria between what you think is the 'fair' point on the Pareto boundary and the Nash equilibrium, so long as each step down in what you think is 'fairness' reduces the total payoff to the other agent, even if it reduces your own payoff even more. This resists exploitation and avoids creating an incentive for claiming that you have a different definition of fairness, while still holding open the possibility of some degree of cooperation with agents who honestly disagree with you about what's fair and are trying to avoid exploitation themselves.

This translates into an informal principle of negotiations: Be willing to accept unfair bargains, but only if (you make it clear) *both* sides are doing worse than what you consider to be a fair bargain.

I haven't seen this advocated before even as an informal principle of negotiations. Is it in the literature anywhere? Someone suggested Schelling might have said it, but didn't provide a chapter number.

ADDED:

Clarification 1: Yes, utilities are invariant up to a positive affine transformation so there's no canonical way to split utilities evenly. Hence the part about "Assume a magical solution N which gives us the fair division." If we knew the exact properties of how to implement this magical solution, taking it at first for magical, that might give us some idea of what N should be, too.

Clarification 2: The way this might work is that you pick a series of increasingly unfair-to-you, increasingly worse-for-the-other-player outcomes whose first element is what you deem the fair Pareto outcome: (100, 100), (98, 99), (96, 98). Perhaps stop well short of Nash if the skew becomes too extreme. Drop to Nash as the last resort. The other agent does the same, starting with their own ideal of fairness on the Pareto boundary. Unless one of you has a completely skewed idea of fairness, you should be able to meet somewhere in the middle. Both of you will do worse against a fixed opponent's strategy by unilaterally adopting more self-favoring ideas of fairness. Both of you will do worse in expectation against potentially exploitive opponents by unilaterally adopting looser ideas of fairness. This gives everyone an incentive to obey the Galactic Schelling Point and be fair about it. You should *not* be picking the descending sequence in an agent-dependent way that incentivizes, at cost to you, skewed claims about fairness.
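
Here is a toy sketch of two such descending sequences meeting. Everything beyond the (100, 100), (98, 99), (96, 98) sequence is an assumption for illustration: the other agent's starting points (98, 102) and (96, 104), the step sizes, the acceptance rule, and the choice to settle on the mutually acceptable outcome with the highest total. Outcomes are (you, them) in a common scale.

```python
def descending(start, step, n):
    """n outcomes starting at `start`, each shifted by `step` from the last."""
    return [(start[0] + i * step[0], start[1] + i * step[1]) for i in range(n)]

def accepts(schedule, outcome, me):
    """Player `me` (0 or 1) accepts an outcome that gives them at least as much,
    and the other player at most as much, as some point on their schedule."""
    other = 1 - me
    return any(outcome[me] >= p[me] and outcome[other] <= p[other] for p in schedule)

def meet(mine, theirs):
    """Best mutually acceptable outcome among both schedules, or None."""
    candidates = set(mine) | set(theirs)
    ok = [c for c in candidates if accepts(mine, c, 0) and accepts(theirs, c, 1)]
    return max(ok, key=lambda c: (sum(c), c), default=None)

# Your schedule: each step costs you 2 and costs them 1.
mine = descending((100, 100), (-2, -1), 6)
# Their schedule under a mildly different idea of fairness (hypothetical).
honest = descending((98, 102), (-1, -2), 6)
# Their schedule if they adopt a more self-favoring idea of fairness (hypothetical).
greedy = descending((96, 104), (-1, -2), 6)

print(meet(mine, honest))  # where the mildly different schedules meet
print(meet(mine, greedy))  # where the more self-favoring schedule ends up
```

In this toy setup the agent with the more self-favoring starting point ends up with less for itself than the agent with the milder one, which is the incentive property the sequences are supposed to have. A `None` result would correspond to dropping to Nash.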

Clarification 3: You must take into account the other agent's costs and other opportunities when ensuring that the net outcome, in terms of final utilities, is worse for them than the reward offered for 'fair' cooperation. Offering them the chance to buy half as many paperclips at a lower, less fair price does no good if they can go next door, get the same offer again, and buy the same number of paperclips at a lower total price.
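
A toy calculation shows the failure mode. All numbers are hypothetical: a buyer who values each paperclip at 15, a 'fair' deal of 100 paperclips at 10 each, and a fallback offer of 50 paperclips at 8 each. The buyer's value is assumed to be linear in the number of paperclips.

```python
VALUE_PER_CLIP = 15  # hypothetical value the buyer places on each paperclip

def buyer_surplus(quantity, price_per_clip, value_per_clip=VALUE_PER_CLIP):
    """Buyer's net gain: value received minus total price paid."""
    return quantity * (value_per_clip - price_per_clip)

fair_deal = buyer_surplus(100, 10)    # the 'fair' bargain
fallback_once = buyer_surplus(50, 8)  # half as many, at the lower price
fallback_twice = 2 * fallback_once    # same fallback taken from two sellers

print(fair_deal, fallback_once, fallback_twice)
```

Taken once, the fallback leaves the buyer worse off than the fair deal, as intended. Taken twice, the buyer gets the same 100 paperclips for a lower total price and ends up better off than under the fair deal, so the fallback has to be judged against their final utility across all their opportunities.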