Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

Modal Bargaining Agents

4jimrandomh

0orthonormal

2jimrandomh

0orthonormal

0jimrandomh

3jimrandomh

1orthonormal

2János Kramár

1orthonormal

2János Kramár

0orthonormal

2János Kramár

0orthonormal

0János Kramár

0orthonormal

2jessicata

0orthonormal

0orthonormal

0János Kramár

0János Kramár

New Comment

Point (1) seems to be a combination of an issue of working around the absence of a mathematically-elegant communication channel in the formalism, and an incentive to choose some orderings over others because of (2). If (2) is solved and they can communicate, then they can agree on an ordering without any trouble because they're both indifferent to which one is chosen.

If you don't have communication but you have solved (2), I think you can solve the problem by splitting agents into two stages. In the first stage, agents try to coordinate on an ordering over the points. To do this, the two agents X and Y each generate a bag of orderings Ox and Oy that they think might be Schelling points. Agent X first draws an ordering from Ox and tries to prove coordination on it in PA+1, then draws another with replacement and tries to prove coordination on it in PA+2, then draws another with replacement and tries to prove coordination on it in PA+3, etc. Agent Y does the same thing, with a different bag of proposed orderings. If there is overlap between their respective sets, then the odds that they will fail to find a coordination point fall off exponentially, albeit slowly.

Then in the second stage, after proving in PA+n for some n that they will both go through the same ordering, each will try to prove coordination on point 1 in PA+n+1, on point 2 in PA+n+2, etc, for the points they find acceptable.

I think your proposal is more complicated than, say, mutually randomly choosing an ordering in one step. Does it have any superior properties to just doing that?

I don't think the mechanics of the problem, as specified, let them mutually specify random things without something like an externally-provided probability distribution. This is aimed at eliminating that requirement. But it may be that this issue isn't very illuminating and would be better addressed by adjusting the problem formulation to provide that.

We already assumed a source of mutual randomness in order to guarantee that the feasible set is convex (caption to Figure 1).

To clarify, what I meant was not that they need a source of shared randomness, but that they need a shared probability distribution; ie, having dice isn't enough, they also need to coordinate on a way of interpreting the dice, which is similar to the original problem of coordinating on an ordering over points.

Regarding (2), the main problem is that this creates an incentive for agents to choose orderings that favor themselves when there is overlap between the acceptable regions, and this creates a high chance that they won't be able to agree on an ordering at all. Jessica Taylor's solution solves the problem of not being able to find an ordering, but at the cost of all the surplus utility that was in the region of overlap. For example, if Janos and I are deciding how to divide a dollar, I offer that Janos keeps it, and Janos offers that I keep it, that solution would have us set it on fire instead.

Instead, perhaps we could redefine the algorithm so that "cooperation at point N" means entering another round of negotiation, where only points that each agent finds at least as good as N are considered, and negotiation continues until it reaches a fixed point.

How to actually convert this into an algorithm? I haven't figured out all the technical details, but I think the key is having agents prove things of the form "we'll coordinate on a point I find at least as good as point N".

I thought about that at some point, in the case where they're biased in their own directions, but of course there it just reintroduces the incentive for playing hardball. In the case where they're each overly generous, they already have the incentive to bias slightly more in their own direction.

However, there's not an obvious way to translate the second round of negotiation into the modal framework...

How about a gridless scheme like:

The agents agree that they will each output how much utility they will get, and if they fail to choose a feasible point they both get 0.

Now discretize the possible "rectangle areas": let them be . (This requires a way to agree on that, but this seems easier than agreeing on grid points; the finer the better, basically. Perhaps the most obvious way to do it is to have these be evenly spaced from to ; then only and need to be agreed upon.)

X will do the following:

let be the first area in this list that is achieved by a feasible point on X's fairness curve. (We will assume the fairness curve is weakly convex and weakly increasing.)

for , let be the "fair" utility distribution that achieves area ; let .

let be Y's output.

# Speculative phase:

if proves and is feasible, return .

else if proves and is feasible, return .

else if proves and is feasible, return .

# Bargaining phase:

else if proves and is feasible, return .

else if proves and is feasible, return .

else if proves and is feasible, return .

else return .

If both agents follow protocol, the result is guaranteed to be feasible and to not be above either agent's fairness curve.

Furthermore, if the intersection of the fairness curves is at a feasible point that produces rectangle area for some and X and Y follow the protocol then on the step, they will be able to agree on the coordinatewise min of their proposed feasible points with area .

If they're both really generous and their fairness curves don't intersect inside the feasible region, the given protocol will agree on approximately the highest feasible multiple of how much they thought they were fairly due, avoiding utility sacrifice.

I don't think I get this. Let's walk though an example that looks absurd to me, and let me know if I'm misinterpreting something along the way:

We start with the feasible set as the convex hull of (0,0), (2,3), and (3,2). X thinks that (3,2) is fair, while Y thinks that (2,3) is fair. By Eliezer's algorithm (and by the modal instantiation), they would end up at (2,2).

Let's say that includes (3,2), and includes (2,3); then is the first rectangle considered fair by X, and is the first rectangle considered fair by Y.

Then X, running the algorithm above, first tries to show in PA+1 that and that (3, y) is in the feasible set; if it succeeds, it outputs 3. Meanwhile, Y tries to show that and that (x,3) is in the feasible set; if it succeeds, it outputs 3. Neither of these proof searches can succeed (since PA+1 doesn't know its own consistency, each agent thinks the other might output 3 by the Principle of Explosion).

Now we move to the next stage. X tries to show in PA+2 that and that (3/2, y) is feasible; if so, it outputs 3/2. Y likewise tries to show in PA+2 that and that (x, 3/2) is feasible; if so, it outputs 3/2. Now we have a Lobian circle; both successfully prove that the other outputs 3/2, and thus output 3/2 themselves. And so the agents coordinate at (3/2, 3/2), rather than at (2,2) or anything better.

Did I miss something?

You did miss something: namely from PA+2 X wants to show feasibility of , not . In your example, , so the Löbian circle you describe will fail.

I'll walk through what will happen in the example.

The are just areas (ie ), not rectangles. In this example, is enough to contain and . For conciseness let's have , , and (so ).

Both X and Y have . According to X, , , and .

First the speculative phase will happen:

X will try to prove in PA+1 that and that is in the feasible set. The latter is immediately false, so this will fail.

Next, X will try to prove in PA+2 that and that is in the feasible set. This too is immediately false.

Next, X will try to prove in PA+3 that and that is feasible (in short, that ), and return 3 if so. This will fail to reach a cycle.

Then the bargaining phase:

Next, X will try to prove in PA+4 that and that is feasible, and return 3 if so. This will fail identically.

Next, X will try to prove in PA+5 that and that is feasible, and return 2 if so. This will reach a Löbian circle, and X and Y will both return 2, which is what we want.

OK! You were right, I misinterpreted, and I do like this proposal!

I concur that in the limit as , this does hit the intersection of the fairness curves if the agents are biased in their own favor, and hits the Pareto curve linear multiple of each agent's request if they're biased in each others' favor. Moreover, both of these behaviors are invariant under rescalings of utility functions, so we're good on that front too!

I haven't yet thought about whether this works analogously in three-player games, but it feels like it should...

If the fairness constraints are all pairwise (ie each player has fairness curves for each opponent), then the scheme generalizes directly. Slightly more generally, if each player's fairness set is weakly convex and closed under componentwise max, the scheme still generalizes directly (in effect the componentwise max creates a fairness curve which can be intersected with the surfaces to get the points.

In order to generalize fully, the agents should each precommunicate their fairness sets. In fact, after doing this, the algorithm is very simple: player X can compute what it believes is the optimal- feasible-and-fair-according-to-everyone point (which is unique because these are all convex sets), and if PA proves the outcome will be fair-according-to-X, then output ; otherwise output 0.

I'd like for this to work with as little prior coordination as possible, so I'm not keen on assuming the agents precommunicate their fairness sets. But the generalization with only the pre-coordinated is neat.

What's the harm in requiring prior coordination, considering there's already a prior agreement to follow a particular protocol involving s? (And something earlier on in the context about a shared source of randomness to ensure convexity of the feasible set.)

The actual problem we want to work toward is one where all the prior coordination info is in the environment independent of the particular agents (e.g. the existence of Schelling points), and the agents are just deducing things about each other. For instance, two FairBots work in a source code swap Prisoner's Dilemma against one another even if written in different programming languages.

I'm willing to accept "accepting a natural ordering on the payoff set" and "accepting a natural set of outcome products" as things that could conceivably be Schelling points in a simple environment, but "know the shape of each others' fairness sets" looks like an infinite pre-exchange of information that cannot be gleaned from the environment.

(And "generate mutual random bits" is a cooperative thing that can be viewed as an atomic action in the environment.)

Quick point on 2: we could define acceptable solutions as those that are Pareto-inferior to the fair point. This ensures that there is only one Pareto optimal feasible point that is acceptable to everyone. Of course, it does result in worse solutions, but doing something like this seems like a good step towards solving 1.

I now think we can do this without prior coordination, in some sense, by using pseudorandom bitstrings and lots of oracle calls. (Idea came from a conversation with Alex Mennen.)

Intuition: if both players are sampling their acceptable outcome sets in random fashion, and each is heavily biased toward sampling points where they do better, then the first time that they both check the same point at the same level is very probably going to be near Eliezer's fairness point.

(As noted in Quinn's example, we'll need to make the map from output-time to outcome injective, but we can do this via the transform suggested in that draft.)

Consider the distribution over the agent's acceptable outcome set with probability density proportional to a positive exponential of the agent's utility function :

if acceptable to , 0 otherwise.

Our agents need to be deterministic, so we'll have a vast family of them parametrized by sequences of draws from this distribution. Given a large and a sequence of drawn from that distribution, we have the agent

```
def A():
for i in range(N):
if PA+i proves A()=a_i implies B()=b_i:
return a_i
else return default Nash equilibrium
```

(If we want to do this even more generally and not assume they're considering the same set of ordered pairs, we can instead ask for proofs that , where is the actual utility of the universe from A's perspective and is the utility of a particular point in the feasible set.)

If two such agents are playing against one another, then the probability (with respect to the sequence of draws that defined each agent) of a mutually acceptable ordered pair being chosen at each step is proportional to , which is maximized at Eliezer's acceptable point, and they have chances to get it. For and large, and for very large, with high probability two such agents will coordinate near Eliezer's acceptable point.

How about a gridless scheme like:

The agents agree that they will each output how much utility they will get, and if they fail to choose a feasible point they both get 0.

Now discretize the possible "rectangle areas": let them be . (This requires a way to agree on that, but this seems easier than agreeing on grid points; the finer the better, basically. Perhaps the most obvious way to do it is to have these be evenly spaced from to ; then only and need to be agreed upon.)

X will do the following:

let be the first area in this list that is achieved by a feasible point on X's fairness curve. (We will assume the fairness curve is weakly convex and weakly increasing.)

for , let be the "fair" utility distribution that achieves area ; let .

let be Y's output.

# Speculative phase:

if proves and is feasible, return .

else if proves and is feasible, return .

else if proves and is feasible, return .

# Bargaining phase:

else if proves and is feasible, return .

else if proves and is feasible, return .

else if proves and is feasible, return .

else return .

If both agents follow protocol, the result is guaranteed to be feasible and to not be above either agent's fairness curve.

Furthermore, if the intersection of the fairness curves is at a feasible point that produces rectangle area for some and X and Y follow the protocol then on the step, they will be able to agree on the coordinatewise min of their proposed feasible points with area .

If they're both really generous and their fairness curves don't intersect inside the feasible region, the given protocol will agree on approximately the highest feasible multiple of how much they thought they were fairly due, avoiding utility sacrifice.

How about a gridless scheme like:

The agents agree that they will each output how much utility they will get, and if they fail to choose a feasible point they both get 0.

Now discretize the possible "rectangle areas": let them be . (This requires a way to agree on that, but this seems easier than agreeing on grid points; the finer the better, basically. Perhaps the most obvious way to do it is to have these be evenly spaced from to ; then only and need to be agreed upon.)

X will do the following:

let be the first area in this list that is achieved by a feasible point on X's fairness curve. (We will assume the fairness curve is weakly convex and weakly increasing.)

for , let be the "fair" utility distribution that achieves area ; let .

let be Y's output.

# Speculative phase:

if proves and is feasible, return .

else if proves and is feasible, return .

else if proves and is feasible, return .

# Bargaining phase:

else if proves and is feasible, return .

else if proves and is feasible, return .

else if proves and is feasible, return .

else return .

If both agents follow protocol, the result is guaranteed to be feasible and to not be above either agent's fairness curve.

Furthermore, if the intersection of the fairness curves is at a feasible point that produces rectangle area for some and X and Y follow the protocol then on the step, they will be able to agree on the coordinatewise min of their proposed feasible points with area .

If they're both really generous and their fairness curves don't intersect inside the feasible region, the given protocol will agree on approximately the highest feasible multiple of how much they thought they were fairly due, avoiding utility sacrifice.

Summary: Bargaining problems are interesting in the case of Löbian cooperation; Eliezer suggested a geometric algorithm for resolving bargaining conflicts by leaving the Pareto frontier, and this algorithm can be made into a modal agent, given an additional suggestion by Benja.## Bargaining and Fairness

When two agents can read each others' source code before playing a game, Löbian cooperation can transform many games into pure bargaining problems. Explicitly, the algorithm of "if I can prove you choose X, then I choose Y, otherwise I choose Z" makes it possible for an agent to offer a deal better for both parties than a Nash equilibrium, then default to that Nash equilibrium if the deal isn't verifiably taken. In the simple case where there is a unique Nash equilibrium, two agents with that sort of algorithm can effectively reduce the decision theory problem to a negotiation over which point on the Pareto frontier they will select.

However, if two agents insist on different deals (e.g. X insists on point D, Y insists on point A), then it falls through and we're back to the Nash equilibrium O. And if you try and patch this by offering several deals in sequence along the Pareto frontier, then a savvy opponent is just going to grab whichever of those is best for them. So we'd like both a notion of the fair outcome, and a way to still make deals with agents who disagree with us on fairness (without falling back to the Nash equilibrium, and without incentivizing them to play hardball with us).

Note that utility functions are only defined up to affine transformations, so any solution should be invariant under independent rescalings of the players' utilities. This requirement, plus a few others (winding up on the Pareto frontier, independence of irrelevant alternatives, and symmetry) are all satisfied by the Nash solution to the bargaining problem: choose the point on the Pareto frontier so that the area of the rectangle from O to that point is maximized.

This gives us a pretty plausible answer to the first question, but leaves us with the second: are we simply at war with agents that have other ideas of what's fair? (Let's say that X thinks that the Nash solution N is fair, but Y thinks that B is fair.) Other theorists have come up with other definitions of the fair solution to a bargaining problem, so this is a live question!

And the question of incentives makes it even more difficult: if you try something like "50% my fair equilibrium, 50% their fair equilibrium", you create an incentive for other agents to bias their definition of fairness in their own favor, since that boosts their payoff.

## Bargaining Away from the Pareto Frontier

Eliezer's suggestion in this case is as follows: an agent defines its set of acceptable deals as "all points in the feasible set for which my opponent's score is at most what they would get at the point I think is fair". If each agent's definition of fairness is biased in their own favor, the intersection of the agents' acceptable deals has a corner within the feasible set (but not on the Pareto frontier unless the agents agree on the fair point), and that is the point where they should actually achieve Löbian cooperation.

Note that in this setup, you get no extra utility for redefining fairness in a self-serving way. Each agent winds up getting the utility they would have had at the fairness point of the

otheragent. (Actually, Eliezer suggests a very slight slope to these lines, in the direction that makes itworsefor the opponent to insist on a more extreme fairness point. This sets up good incentives for meeting in the middle. But for simplicity, we'll just consider the incentive-neutral version here.)Moreover, this extends to games involving more than two agents: each one defines a set of acceptable deals by the condition that no other agent gets more than they would have at the agent's fairness point, and the intersection has a corner in the feasible set, where each agent gets the minimum of the payoffs it would have achieved at the other agents' fairness points.

## Modal Bargaining Agents

Now, how do we set this bargaining algorithm up as a modal decision theory? We can only consider finitely many options, though we're allowed to consider computably mixed strategies. Let's assume that our game has finitely many pure strategies for each player. As above, we'll assume there is a unique Nash equilibrium, and set it at the origin.

Then there's a natural set of points we should consider: the grid points within the feasible set (the convex hull formed by the Nash equilibrium and the Pareto optimal points above it) whose coordinates correspond to utilities of the pure strategies on the Pareto frontier. This is easier to see than to read:

Now all we need is to be sure that Löbian cooperation happens at the point we expect. There's one significant problem here: we need to worry about syncing the proof levels that different agents are using.

(If you haven't seen this before, you might want to work out what happens if any two of the following three agents are paired with one another:

Results in rot13: Nal gjb qvssrerag ntragf nobir erghea Q ntnvafg rnpu bgure, orpnhfr CN pna'g cebir vgf bja pbafvfgrapl, naq gurersber gur snpg gung n cebbs frnepu va CN snvyf pna arire or cebirq va CN.)

Benja suggested one way to ensure that Löbian cooperation happens at the right point: we assume that in addition to the payoff matrix, the agents are mutually aware of an ordering relation on the grid points. In order to land on the best mutually acceptable point (in the Pareto sense), it's merely necessary for the ordering to respect the "level" of grid points, defined as the total number of grid lines traversed along either axis from O to the grid point.

Then, we simply have each player look for cooperation at proof level N at the Nth point, skipping those points that are unacceptable to it. Since the best mutually acceptable point is the only acceptable point at its level (or any prior level), it will be chosen.

Again, this works for modal decision problems with more than two players.

## Open questions and nagging issues: