A high-level model of AI bargaining

Anthony DiGiovanni

Advanced AIs might be capable of various credible commitments unavailable to humans, which they could use when bargaining with each other. “Bargaining” can sound like something pretty specific: haggling over (literal) prices. But, in the sense discussed in Schelling’s The Strategy of Conflict for instance, “bargaining” refers to any attempt to resolve a dispute over resources — from algorithmic trading and litigation, to diplomacy between national AGI projects and negotiations over norms for space settlement.

To think clearly about interventions to mitigate conflict between AIs, I think it’s important to ground our research and strategy in a very general qualitative model of bargaining with commitments. This post sketches such a model, plus some more concrete examples of its building blocks.

I plan to explain some crucial implications in future writings. But as a teaser, this model doesn’t imply agents will play a Nash equilibrium!

Model

Where does this model come from? Basically, I started with the classic model from the open-source game theory (or “program equilibrium”) literature. Then, I relaxed several assumptions to allow for some realistic, strategically relevant dynamics. That said, I’ve glossed over some other important dynamics for ease of exposition. I’ll say more on these at the end of the post.

High-level setup

Two AI agents, Alice and Bob, interact over two phases: before vs. after some time T, defined below. It will help to start with the “after” phase.

Bargaining phase (after T): Alice and Bob bargain over some contested resource. Specifically, “bargaining” consists in credibly reporting to each other their (i) demands/offers and (ii) policies for which outside options they’d each take if bargaining failed (such as leaving the resource alone, or initiating conflict). Bargaining ends when they either:

agree on compatible demands, and split the resource accordingly; or
lock in incompatible demands (e.g., Alice wants 50% of the resource while Bob wants 70%), and each follow their outside option policy.
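
As a minimal sketch of the two ways bargaining can end, the snippet below treats demands as fractions of the resource. The names `Report`, `Outcome`, and `resolve` are hypothetical, and the compatibility test (demands summing to at most 100%) is an assumption made for illustration, not something the model specifies.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Report:
    """What one agent credibly reports in the bargaining phase."""
    demand: float         # (i) share of the resource demanded, in [0, 1]
    outside_option: str   # (ii) action taken if bargaining fails


@dataclass(frozen=True)
class Outcome:
    agreed: bool
    split: Optional[Tuple[float, float]] = None
    outside_options: Optional[Tuple[str, str]] = None


def resolve(alice: Report, bob: Report, tol: float = 1e-9) -> Outcome:
    """Assumption: demands are compatible iff they sum to at most 1."""
    if alice.demand + bob.demand <= 1.0 + tol:
        # Compatible demands: split the resource accordingly.
        return Outcome(agreed=True, split=(alice.demand, bob.demand))
    # Incompatible demands: each agent follows its outside option policy.
    return Outcome(
        agreed=False,
        outside_options=(alice.outside_option, bob.outside_option),
    )


# The incompatible example from the text: Alice wants 50%, Bob wants 70%.
print(resolve(Report(0.5, "fight"), Report(0.7, "leave the resource alone")))
```

Here the outside options are plain labels. In the model they are policies, so a fuller sketch would let them depend on what the other agent reported.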

How each agent decides (i) and (ii) is determined by some procedure chosen before the bargaining phase, as follows.

Pre-bargaining phase (before T): Each agent might try to shape the other’s incentives by credibly committing to constraints on their procedure for deciding (i) and (ii) — e.g., committing to never accept less than 50%. So they need to decide:

1. which credible commitments ^[1] ^[2] to make (if any) among the commitments they’re capable of, and when;
2. what information to gain/receive about each other’s commitments before making their own; and
3. what to do to prepare for (1) and (2), such as thinking more about possible strategies, or developing new capabilities for commitments or information control. (These capabilities can also change over time due to exogenous causes.)

The agents make these decisions under uncertainty about each other’s decisions, though they can resolve some of this uncertainty via (2).

Now for more details on how these commitments to bargaining procedures might work, and on the three actions above.

Bargaining programs

Each agent’s procedure for what they’ll do in the bargaining phase, called a program, takes as input information about the other agent’s program, as well as other features of the strategic situation. ^[3] (See (2) below for an example of a relevant “feature of the strategic situation”.)

As a very simplified example, Alice might follow the program: “If I can prove that Bob’s program would eventually accept my demands if I stuck to them, then I’ll demand 100%. Otherwise, I’ll accept no less than 50%, and fight if we disagree.” So, the AIs can implement conditional commitments, instead of necessarily either locking in rigid demands or conceding to whoever commits first.
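
Alice's example program can be sketched as a function from information about Bob's program to a bargaining stance. This is only an illustration: `ProgramInfo`, `Stance`, and `alice_program` are hypothetical names, and the boolean `provably_concedes` stands in for an actual proof about Bob's program, which this sketch does not attempt.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ProgramInfo:
    """Hypothetical summary of what Alice can verify about Bob's program."""
    # Stand-in for "I can prove Bob's program would eventually accept my
    # demands if I stuck to them".
    provably_concedes: bool


@dataclass(frozen=True)
class Stance:
    demand: float                   # smallest share Alice will accept
    on_disagreement: Optional[str]  # outside option, if one is specified


def alice_program(bob: ProgramInfo) -> Stance:
    if bob.provably_concedes:
        # The text does not say what Alice does on disagreement in this
        # branch, so the sketch leaves it unspecified (None).
        return Stance(demand=1.0, on_disagreement=None)
    return Stance(demand=0.5, on_disagreement="fight")


print(alice_program(ProgramInfo(provably_concedes=True)))
print(alice_program(ProgramInfo(provably_concedes=False)))
```

The point of the sketch is that the demand is computed from facts about the other program rather than fixed in advance.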

Then, T is the first time both agents know which single program each other has committed to. ^[4]

Pre-bargaining actions

At any time t < T, each agent can do one of three actions:

1. Credibly commit to some constraints on the program they’ll use, including deciding on a single program. The agent makes these constraints verifiable and tries to communicate them to the other agent, possibly with some lag time.
   - These constraints can be fairly indirect, e.g., Alice might delegate her remaining pre-bargaining actions to a successor, or decide to follow a particular decision theory.
   - The idea is that the agent might want to:
     - make certain partial or “meta” commitments early on, but
     - delay committing to a single program until after learning or thinking more, or developing the capabilities for more advanced commitments.
2. Open or close information channels (e.g., some mutually trusted third-party server), through which they learn whether the other agent has made some credible commitment.
   - (More in-the-weeds note:) Why might Alice or Bob close their info channel? A full discussion of this is out of scope, but intuitively: Suppose Alice wants to follow some sophisticated program p after T, but can’t yet credibly commit to p. In the meantime, Bob might commit to high demands that are incompatible with those of p, but he’d only want to do so if he expects his commitment could influence her behavior. She might therefore try to show him that he can’t influence her, by closing her info channel until she can credibly commit to p. (Though of course, this is risky, because Bob might try to do the same!)
3. Prepare, i.e., anything other than the above. This could include thinking more, developing capabilities, etc.
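
To make the timing concrete, here is a rough sketch of how T could be computed from a timeline of commitments, lags, and channel states. Everything in it is an assumption for illustration: the names `first_learned` and `bargaining_start`, the representation of a channel as a list of `(time, is_open)` toggles, channels starting open, and the rule that a commitment is learned at the first moment it is observable while the listener's channel is open.

```python
from typing import List, Optional, Tuple

Toggle = Tuple[float, bool]  # (time, channel is open from then on)


def first_learned(
    available_at: float,
    toggles: List[Toggle],
    initially_open: bool = True,
) -> Optional[float]:
    """Earliest time a listener with this channel history observes a
    commitment that becomes observable at `available_at` (None if never)."""
    ordered = sorted(toggles)
    is_open = initially_open
    for time, state in ordered:
        if time <= available_at:
            is_open = state  # channel state at the moment of availability
    if is_open:
        return available_at
    for time, state in ordered:
        if time > available_at and state:
            return time  # first reopening after the commitment is observable
    return None


def bargaining_start(
    alice_commit: float, alice_lag: float, alice_toggles: List[Toggle],
    bob_commit: float, bob_lag: float, bob_toggles: List[Toggle],
) -> Optional[float]:
    """T: first time both agents know each other's single program."""
    bob_learns = first_learned(alice_commit + alice_lag, bob_toggles)
    alice_learns = first_learned(bob_commit + bob_lag, alice_toggles)
    if bob_learns is None or alice_learns is None:
        return None
    return max(alice_learns, bob_learns)


# Alice commits at t=3 with lag 1. Bob keeps his channel closed from t=2
# to t=6 and commits at t=5 with no lag. Alice's channel stays open.
print(bargaining_start(3.0, 1.0, [], 5.0, 0.0, [(2.0, False), (6.0, True)]))
```

In this example Bob's closed channel delays the start of bargaining until he reopens it, which is the kind of timing effect the in-the-weeds note describes.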

Concrete examples of aspects of this model

| Part of the model | Examples |
| --- | --- |
| What the AIs bargain over | Allocation of compute among stakeholders; legal settlements; compensation for AIs’ labor or commitments not to sabotage alignment; contents of a jointly trained successor AI’s model spec |
| Commitment (verification) capabilities the AIs might develop | Finetuning to implement a surrogate goal; interpretability tools to make commitments verifiable; scaffolding to cleanly separate “first-order” commitments from a conditional renegotiation wrapper |
| Partial commitments | Delegating future (pre-)bargaining decisions to a successor with a surrogate goal; delegating to successors trained to adhere to a particular decision theory; putting strong norms of impartiality/symmetry for bargaining in an auditable model spec |
| Information that programs might condition their behavior on | Whether their first-order demands would be incompatible; whether third parties are observing the interaction; whether the other program is unconditionally committed to certain demands |

Commentary on the model’s assumptions

The model makes the simplifying assumptions listed below. None of these are trivial. But overall, I think it will be fruitful to start by working out the main implications of the model as-is, and relax these assumptions from there.

The agents only make commitments that are fully credible, and there’s no private information (in the technical sense from, e.g., Fearon (1995)).
- We should expect there to be not-fully-credible commitments and private information. But in general, I don’t expect these elements to have qualitatively different implications for the most promising interventions against AI conflict, like safe Pareto improvements.
Only two agents interact at a time, which means the model doesn’t explicitly account for reputational incentives.
Strategic choices by the AIs’ human developers (e.g., do they train AIs to have cooperative propensities?) aren’t represented.

However, we don’t assume the agents:

are expected utility maximizers;
are necessarily capable of credibly committing to complex counterfactual behaviors (e.g., that they have fine-grained self-modification abilities and the ability to verify each other’s “source code”);
have fixed capabilities over time;
make commitments simultaneously;
make full commitments all at once, rather than a sequence of partial commitments as the agents gain information about each other; ^[5]
follow, or will converge to, any particular decision theory (see here for more);
can, or can’t, unilaterally force each other to verify their commitments; or
have well-calibrated beliefs about each other’s commitments before time T, or otherwise coordinate on a Nash equilibrium (including a program equilibrium) or cooperative bargaining solution.

This last point is worth a closer look. Indeed, I think dropping the equilibrium assumption is one of the most important starting points for a good theory of AI bargaining. But we’ll get to that in another post. ^[6]

^[1] Echoing the safe Pareto improvements agenda post: “Commitments” are meant to include modifications to one’s decision theory or values/preferences. It has been argued (example) that decision theories like updateless decision theory (UDT) can sidestep the need for “commitments” in the usual sense. We’ll set this question aside here, and treat the resolution to make one’s future decisions according to UDT as a commitment in itself.
^[2] We define a commitment’s “credibility” relative to the set of agents the commitment needs to be made credible to. In some contexts, agents might want to make commitments that they can’t make credible to others. E.g., they might follow acausal decision theories and expect that if they commit to participate in evidential cooperation in large worlds (ECL), others are more likely to make the same commitment. These commitments are (vacuously) “credible”, because they don’t need to be made credible to anyone else.
^[3] This is inspired by the “program game” formalism of Tennenholtz (2004), but my model isn’t committed to the specific assumptions in that paper, most notably that players choose programs simultaneously. As described in the “Bargaining phase”, we allow for strategic decision-making to be carried out by the program itself, not just by the agent choosing the program.
^[4] More generally, we could define each agent’s subjective T as the first time after which (a) that agent has decided a single program and (b) they know the other agent’s single program. But as far as I can tell, the implications of the model aren’t sensitive to this.
^[5] As a point of contrast, the first five assumptions are made by this paper, which I nonetheless consider an important result in AI bargaining theory.
^[6] Thanks to Nathaniel Sauerberg for helpful comments.

Knight Lee

“I’ll accept no less than 50%, and fight if we disagree”

Do you mean the commitments follow an "all or nothing" pattern, where if both sides commit to 51% they're doomed?

I imagine that commitments might be less extreme, where overlap is costly but not fatal:^[1]

If each side commits to taking 51%, the rule of its commitment is to punish the other side by destroying anything beyond 49% that the other side takes, and then to destroy a further 0.5% for every 1% by which its own share falls short of 51%.

Each side takes 50%, but destroys 1% of the other side's pie, so each side is left with only 49%. They both realize they received 2% less than the target 51%. This means each side destroys 1% of what the other side has, so each side now has only 48%. This is 3% less than 51%, so they each destroy a further 0.5% of what the other side has, leaving both with 47.5%. This continues to 47.25%, 47.125%, 47.0625%, and so on. Of course, they can skip the formalities and just jump to 47%, which is the final state.
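
The arithmetic in this symmetric example can be checked with a short sketch. The function names and parameters are hypothetical. The sketch assumes both sides use the same rule, so each side's holding after a round is the 49% cap minus 0.5% for every 1% by which its previous holding fell short of 51%.

```python
from typing import List


def punishment_path(
    target: float = 51.0,  # share each side committed to taking
    cap: float = 49.0,     # most the other side is allowed to keep
    rate: float = 0.5,     # extra destruction per point of own shortfall
    steps: int = 5,
) -> List[float]:
    """Each side's holding (in percent) after successive punishment rounds."""
    holdings = [cap]  # after the initial "destroy anything above 49%" step
    for _ in range(steps):
        shortfall = target - holdings[-1]
        holdings.append(cap - rate * shortfall)
    return holdings


def fixed_point(target: float = 51.0, cap: float = 49.0, rate: float = 0.5) -> float:
    """Solve x = cap - rate * (target - x) for x."""
    return (cap - rate * target) / (1.0 - rate)


print(punishment_path())  # [49.0, 48.0, 47.5, 47.25, 47.125, 47.0625]
print(fixed_point())      # 47.0
```

The gap to 47% halves every round, which is why the two sides can jump straight to the final state.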

It is important to punish the other side for punishing you, at least a little.

If you only destroy what the other side takes past 49%, but you do not destroy further based on how little you got, then the other side can get away with committing to take 70% of the pie, betting on the small chance you are a sucker and only ask for 30%. If they are correct, they get away with 70% of the pie. If they are wrong, then you will destroy their pie until they are left with only 49%, and they will destroy your pie until you are left with less than 30%, but they still get the maximum amount they could have gotten if they committed any lesser amount. This means it doesn't hurt for them to commit to take 70%. It only hurts you, and has a small chance of benefiting them.
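
A rough sketch of this comparison follows, under assumptions that go beyond what the comment specifies: each side caps the other at 100% minus its own target (generalizing 49% = 100% - 51%), the extra destruction is linear in that side's own shortfall, holdings settle at the fixed point of both rules, and negative holdings are not clipped. The name `final_shares` is hypothetical.

```python
from typing import Tuple


def final_shares(
    target_a: float, rate_a: float,
    target_b: float, rate_b: float,
) -> Tuple[float, float]:
    """Fixed-point holdings (a, b), in percent of the pie.

    B's rule sets A's holding: a = (100 - target_b) - rate_b * (target_b - b).
    A's rule sets B's holding: b = (100 - target_a) - rate_a * (target_a - a).
    A rate of 0 means "cap only, no further punishment".
    """
    assert rate_a * rate_b < 1.0, "no stable fixed point otherwise"
    base_a = (100.0 - target_b) - rate_b * target_b  # a when b = 0
    base_b = (100.0 - target_a) - rate_a * target_a  # b when a = 0
    a = (base_a + rate_b * base_b) / (1.0 - rate_a * rate_b)
    b = base_b + rate_a * a
    return a, b


# A is the side choosing between a 51% and a 70% commitment.
# B has committed to 51% and either only caps (rate 0) or also retaliates.
for b_rate in (0.0, 0.5):
    for a_target in (51.0, 70.0):
        a, b = final_shares(a_target, 0.5, 51.0, b_rate)
        print(f"B rate {b_rate}, A demands {a_target}: A gets {a}, B gets {b}")
```

Under these assumptions, when B only caps, A ends with 49% whether it commits to 51% or 70%, while B drops from 48% to 19.5%. When B also retaliates, A ends with 47% after committing to 51% but only 28% after committing to 70%, so the aggressive commitment is no longer free.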

PS: I'm not saying your post is wrong, since it's clearly titled "A high-level model of AI bargaining" rather than "A very detailed model of AI bargaining!" I just feel this detail is worth mentioning.

^[1] PS: this idea was copied from an old post of mine, but it's not a good post because I no longer believe in the other things I wrote in that post.
