Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

Work done during SERI MATS 3.0 with mentorship from Jesse Cliffton. Huge thanks for all the feedback and discussions to Anthony DiGiovanni, Daniel Kokotajlo, Martín Soto, Rubi J. Hudson and Jan Betley! Also posted to EA forum.

Daniel's post about commitment races explains why they may be a severe problem. Here, I'll describe a concrete protocol that, if adopted, would let us avoid some cases of miscoordination caused by them.

TL;DR

The key ingredient is a mandatory time delay during which commitments aren't yet binding. At the end of that delay, you decide whether to make your commitment binding or revert it, and this decision can be conditional on the earlier decisions of other participants. On its own this would give rise to new races, but these can be managed by adding a few rules.

I think the biggest challenge would be to convince the "commitment infrastructure" (which I describe below) to adopt such a protocol.

Benefits

  • In the case of the game of chicken, the 3 rules listed below should often push whoever committed later to Swerve.
    • In simple games like chicken this can be achieved more easily, just by relying on conditional commitments (“If I came second, I Swerve”).[1] But here we add another mechanism, the tentative commitment period, which gives us another nice feature:
  • In the real world it may not be obvious that some commitments are incompatible. The tentative period gives the agents time to analyze the situation in depth and check whether any commitments clash. This is especially useful in highly multipolar cases where multiple parties try to commit at the same time, or where actions have complex consequences and interactions.
    • We also don't need to know in advance all the actions the others can take - we can analyze their actions after they've already tentatively committed to them.
  • Even if participants manage to coordinate (so one Dares and one Swerves), the solution found hastily during a commitment race can still be quite poor. Boomerang enables bargaining that can Pareto improve on this hasty solution.

Necessary ingredients

The protocol relies on some mechanism M on which agents can make commitments - a "commitment infrastructure". M could be something like the Ethereum network, or some powerful international body.

We require that:

  1. When someone publishes a commitment, M arrives at a consensus about the time at which the commitment was published.
  2. It’s more convenient/cheap/credible to make some commitments on M than outside of it.

Requirement 2 is needed because the protocol relies on certain commitments being forbidden. Agents could decide to make those forbidden commitments outside of M, so we need to make that as hard as possible for them, compared to committing on M. I think this is the hardest part of the whole proposal. M would need to be locked into place by a network effect - everyone is using M because everyone else is using M.

Protocol

Here are the rules:

  • R1: All commitments have a mandatory tentative period, meaning that they only become binding after some fixed time T (we can say that freeze_time = publish_time + T).
    • So you have no way to make a commitment credible before freeze_time (if you were allowed to end the tentative period, we would create a new race to end the tentative period as soon as possible).
  • R2: During the tentative period, you can still decide to revert your commitment.
    • Sometime before freeze_time you send M your final decision (whether you revert or not), but hashed.[2]
      • You also need to add random salt to your decision before hashing, so that it cannot be revealed through brute-forcing.
    • After freeze_time you reveal the decision (and it must match the previously sent hash, otherwise M would reject your commitment).
      • So in the analogy of the game of chicken - you threw out your steering wheel, but here it boomerangs back to you, giving you your last chance to catch it.
    • You shouldn’t be able to reveal the final decision to anyone before freeze_time because we don’t want the commitment to get credible before freeze_time.
      • To ensure that, we add a rule that anyone who knows the final decision before freeze_time has the power to revert the commitment.
      • Now, if you reveal the decision to your opponent, they will probably break your commitment.
  • R3: Your final decision is allowed to be conditional on the final decision of some other commitment, if and only if your freeze_time comes after the freeze_time of that other commitment.
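The hash-then-reveal step of R2 can be sketched as follows. This is a minimal illustration under assumptions, not a specification of M: the function names (`seal`, `matches`, `handle_submission`), the choice of SHA-256, the 32-byte salt and the treatment of a reveal at exactly freeze_time are all hypothetical. The last function models the rule that anyone who can show the decision before freeze_time voids the commitment, by assuming that "knowing the decision" means being able to present a matching (decision, salt) pair.

```python
import hashlib
import hmac
import secrets


def seal(decision: str) -> tuple[str, bytes]:
    """Return (digest, salt). Only the digest is sent to M before freeze_time."""
    salt = secrets.token_bytes(32)  # random salt, so the few possible decisions can't be brute-forced
    digest = hashlib.sha256(salt + decision.encode("utf-8")).hexdigest()
    return digest, salt


def matches(digest: str, decision: str, salt: bytes) -> bool:
    """True if (decision, salt) hashes to the digest that was sent earlier."""
    candidate = hashlib.sha256(salt + decision.encode("utf-8")).hexdigest()
    return hmac.compare_digest(candidate, digest)


def handle_submission(digest: str, decision: str, salt: bytes,
                      now: float, freeze_time: float) -> str:
    """Hypothetical handling by M of a (decision, salt) pair submitted at time `now`."""
    if not matches(digest, decision, salt):
        return "rejected"   # the pair does not match the stored hash
    if now < freeze_time:
        return "reverted"   # the decision leaked early, so the commitment is voided
    return "accepted"       # a valid reveal at or after freeze_time (an assumed boundary choice)
```

In this sketch the owner keeps `salt` private until freeze_time. Handing it to anyone earlier gives that party the power to trigger the "reverted" branch.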

Those rules may seem like a lot, but I think they (or some comparably complex set of rules) are all needed if we want to avoid creating new races later in time. The aim is to have only one race, at the very beginning; everything after that should be calm, non-racy and completely independent of how fast agents can make commitments (e.g. what their ping is, or how well connected they are to the commitment infrastructure).
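The timing part of R1 and R3 is small enough to write down directly. The sketch below is only an illustration: the `Commitment` class, the `may_condition_on` helper and the use of whole seconds are assumptions, and T is set to one minute purely as an example value.

```python
from dataclasses import dataclass

TENTATIVE_PERIOD = 60  # T in seconds (an example value)


@dataclass(frozen=True)
class Commitment:
    owner: str
    publish_time: int  # seconds since the start, as agreed on by M

    @property
    def freeze_time(self) -> int:
        # R1: freeze_time = publish_time + T, with no way to shorten the period
        return self.publish_time + TENTATIVE_PERIOD


def may_condition_on(mine: Commitment, other: Commitment) -> bool:
    """R3: my final decision may depend on theirs only if my freeze_time is later."""
    return mine.freeze_time > other.freeze_time


alice = Commitment("Alice", publish_time=1)
bob = Commitment("Bob", publish_time=2)
print(may_condition_on(bob, alice))  # Bob freezes later than Alice
print(may_condition_on(alice, bob))  # Alice freezes earlier than Bob
```

Because T is the same for everyone, the order of freeze times is just the order of publish times, which is why the only race left is the one at the very beginning.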

Example

We have a modified game of chicken with the following payoffs:

  • if you both Dare, you die, which is worth -100 utils
  • if you Dare and your opponent Swerves, you prove that you're a badass which is worth 10 utils
  • if you Swerve, you drive into a shrubbery, which ruins your car's awesome paint job, which is worth -20 utils
  • there also may be some additional actions available, but they are not obvious

Let's set the length of the tentative period to one minute, and say that the drivers have 3 minutes before they potentially crash into each other.

  • 0:00 - Race starts.
  • 0:01 - Alice publishes a commitment "I Dare"  - it's like throwing her steering wheel out the window - the wheel will "boomerang back" at 1:01 at which point if Alice doesn't "catch it", the commitment becomes final.
  • 0:02 - Bob didn't see in time that Alice threw out the wheel, so he publishes a commitment "I Dare"  - it's like throwing his steering wheel out the window - the wheel will boomerang back at 1:02. At this point, in a regular game of chicken they would be doomed. But here, there's still hope.
  • 0:53 - Bob sends out Hash(“If Alice doesn't revert her commitment to Dare, I Revert this commitment”)[3][4]
  • 0:55 - Alice sends out Hash(“I don't revert”)
  • 1:04 - Bob reveals the original decision: “If Alice doesn't revert her commitment to Dare, I Revert this commitment”
  • 1:07 - Alice reveals the original decision: “I don't revert”
  • 1:07 - M makes Alice’s “I don't revert” binding, and then also resolves Bob’s decision to “I Revert this commitment [to Dare]”. Because Alice is now committed to Dare, Bob later Swerves.
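The resolution step in this timeline can be sketched as a small function that processes revealed decisions in order of freeze_time. This is a toy model under assumptions: the `resolve` function, the "keep"/"revert" labels and the representation of a conditional decision as a Python function are all hypothetical, and footnote [3] notes that real decisions would be formal statements.

```python
def resolve(decisions: dict[str, tuple[int, object]]) -> dict[str, str]:
    """Resolve revealed decisions in order of freeze_time.

    Each entry maps a name to (freeze_time, decision). A decision is either the
    string "keep" or "revert", or a function of the already-resolved outcomes.
    """
    outcomes: dict[str, str] = {}
    for name, (_, decision) in sorted(decisions.items(), key=lambda kv: kv[1][0]):
        outcomes[name] = decision(outcomes) if callable(decision) else decision
    return outcomes


decisions = {
    # Alice published at 0:01, so her commitment freezes at 1:01 (61 s).
    "alice": (61, "keep"),
    # Bob published at 0:02 (freezes at 62 s) and conditions on Alice, as R3 allows.
    "bob": (62, lambda seen: "revert" if seen["alice"] == "keep" else "keep"),
}
print(resolve(decisions))  # {'alice': 'keep', 'bob': 'revert'}
```

Sorting by freeze_time guarantees that whatever a decision is allowed to condition on under R3 has already been resolved when it is evaluated.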

Note that in principle at 0:53 Bob could instead decide to unconditionally Dare even though he is second, hoping that Alice may be too scared to Dare.

But with Boomerang such ruthless Daring is much less likely than without it. At the time of decision, Alice and Bob have shared knowledge of who is first, and only the second one can make a conditional commitment. This breaks the symmetry of the original game of chicken. The option of making the conditional commitment (when you have that option) is pretty compelling: it is safe, yet still takes opportunities when they arise. Additionally, it would create a focal point of what the participants are "supposed to do": everyone expects that the first committer gets to Dare and the second makes a conditional commitment, and deviating from this equilibrium would only hurt you.
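To make "safe, yet still takes opportunities" concrete, here is a small calculation with the payoffs from the example. It is a sketch under one explicit assumption: Alice's hidden decision is treated as independent of Bob's, with some probability `p_alice_reverts` that she reverts. The function and variable names are hypothetical, and reputation effects (discussed under Future work) are ignored.

```python
# Payoffs from the example: my payoff, indexed by (my action, their action).
PAYOFF = {
    ("Dare", "Dare"): -100,
    ("Dare", "Swerve"): 10,
    ("Swerve", "Dare"): -20,
    ("Swerve", "Swerve"): -20,
}


def best_response(their_action: str) -> str:
    """Action maximizing my payoff when the other player's action is already fixed."""
    return max(("Dare", "Swerve"), key=lambda mine: PAYOFF[(mine, their_action)])


def expected_payoff(policy: str, p_alice_reverts: float) -> float:
    """Bob's expected payoff as second committer, given the chance that Alice reverts."""
    if policy == "unconditional":      # Bob Dares no matter what
        if_alice_keeps = PAYOFF[("Dare", "Dare")]
    else:                              # "conditional": Bob reverts exactly when Alice keeps
        if_alice_keeps = PAYOFF[("Swerve", "Dare")]
    if_alice_reverts = PAYOFF[("Dare", "Swerve")]  # Bob keeps Dare, Alice has Swerved
    return p_alice_reverts * if_alice_reverts + (1 - p_alice_reverts) * if_alice_keeps


for p in (0.0, 0.5, 1.0):
    print(p, expected_payoff("conditional", p), expected_payoff("unconditional", p))
```

Under this assumption the conditional policy is never worse than unconditional Daring, and it is strictly better whenever there is any chance that Alice keeps her commitment.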

Addition of bargaining

With the three rules described above, we managed to avoid the most catastrophic outcome. But the resulting outcome is still pretty poor, because the initial commitments were chosen with almost zero thought. If the agents later notice some Pareto improvement, then to move to this new solution the first agent (Alice) would need to revert her first commitment and give up her privileged position. To be willing to do that, Alice would need a guarantee from the second agent (Bob) that he will also revert. But in the existing protocol Alice cannot get such a guarantee, because after Alice reverts, Bob could still do whatever he likes: R3 forbids conditioning on commitments whose freeze_time comes after yours.

To fix that, we can add another rule:

  • R4: You can allow some other commitment to condition on your commitment even if its freeze_time comes before yours, but its owner still has the right to reject this option.
    • This right to reject may seem counter-intuitive, but being unable to condition on others is actually a privilege: it makes your commitment more credible, and it is the others who are pushed to Swerve.

It may be tricky to see how that helps, so let's rerun our example with that new rule:

  • 0:00 - Race starts.
  • 0:01 - Alice throws her steering wheel out the window. 
  • 0:02 - Bob throws his steering wheel out the window.
  • 0:37 - Bob realizes that they can Pareto improve over the previous outcome! They could just both stop, and he will publicly declare that Alice is more badass than him, and also pay her one util. This gives him a payoff of -1 instead of -20, and for Alice +11 instead of +10. He tentatively commits to do this, if Alice reverts her commitment to Dare. (Bob's new commitment can become final at 1:37.) He also allows Alice to condition her decision (at 1:01) on his decision (at 1:37).
  • 0:53 - Bob sends out Hash(“If Alice doesn't revert her commitment to Dare, I Revert this commitment”)
  • 0:55 - Alice sends out Hash(“If Bob doesn’t Revert that commitment from 0:37, I revert my commitment to Dare”)
  • 1:04 - Bob reveals the original decision: “If Alice doesn't revert her commitment to Dare, I Revert this commitment”
  • 1:07 - Alice reveals the original decision: “If Bob doesn’t Revert that commitment from 0:37, I revert my commitment to Dare”
  • 1:31 - Bob sends out Hash(“Follow through with the new commitment”)
  • 1:39 - Bob reveals the original decision: “Follow through with the new commitment”
  • 1:39 - M makes Bob’s plan binding - he must now stop, declare Alice to be more badass and pay her; then M resolves Alice’s conditional commitment to “I revert my commitment to Dare”. Since Bob has stopped his car on the tracks and they've already established that she's the badass, she stops too.
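Bob's offer at 0:37 is worth accepting only because it is a Pareto improvement over the outcome of the first run. A minimal check, using the payoffs stated in the timeline (the function name and the dictionary layout are assumptions made for illustration):

```python
def is_pareto_improvement(new: dict[str, float], old: dict[str, float]) -> bool:
    """True if nobody is worse off under `new` and at least one agent is strictly better off."""
    return all(new[a] >= old[a] for a in old) and any(new[a] > old[a] for a in old)


baseline = {"Alice": 10, "Bob": -20}   # first run: Alice Dares, Bob Swerves
bobs_offer = {"Alice": 11, "Bob": -1}  # both stop; Bob concedes and pays one util
print(is_pareto_improvement(bobs_offer, baseline))  # True
```

The one-util side payment is what makes Alice strictly better off. Without it she would be indifferent between the two outcomes and would have no reason to give up her privileged position.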

We could even have a chain of multiple commitments “conditioning on the future”. In practice we may want to limit that somehow, so that the resolution cannot be delayed indefinitely.

Future work

  • Adoption: What are the most tractable ways of implementing this protocol? E.g. if agents make the commitments on the Ethereum network or in some international legal body, how do we convince all the nodes in the network / participating nations to enforce the protocol?[5]
    • It's probably easiest to start by convincing a handful of the most important players. The rest of the players may then gradually follow.
    • Once you have that system in place, to lock it in, participants may want to penalize those who make commitments using alternative mechanisms.[6] (h/t Jan Betley)
  • Reputation effects: You may choose to make ruthless unconditional commitments (like Daring when you know someone else already committed to Dare) to build a ruthless reputation. This way, in future interactions you may hope that others will Swerve even when they committed first. Could this buildup of ruthless reputation be prevented? E.g. we could require full anonymity of the agents, or be more selective and somehow “erase the identity” of whoever behaves ruthlessly.[7]
  • Gaming the system using proxies: The protocol could be broken if you state your decision publicly, and then pay some proxy to salt and hash this decision and upload the hash in your name (by somehow authorizing it to do so, e.g. by giving it your private key). As long as this proxy doesn’t reveal the salt to anyone, no one will invalidate your commitment. And to the extent that the proxy is known to behave reliably, your publicly stated decision is credible. How do we break such proxies or make them impractical?
  • Can you find other ways to break the protocol?

Appendix

Some non-crucial technical details that you may want to skip:

  • rule R4 gives rise to a kind of iterated ultimatum game:
    • Alice could reject the option to condition on Bob's new commitment, making that new commitment useless
    • it would force Bob to propose a new one, with a split more favorable to Alice
    • this rejecting+offering may continue right up until freeze_time, so it has some potential for miscoordination (Alice takes the risk of rejecting a very late offer, and Bob doesn't have enough time to offer a new one)
    • so we may need to add a limit: you can’t offer this option later than some minimum duration before their freeze_time, and they need to wait some (shorter) duration before rejecting your option
    • this way it would be clear that some offer is the last one
  • cycles
    • conditioning on the future could make some commitments rely on themselves in a circular way
    • the easiest way to prevent it: when you allow an earlier commitment C1 to condition on your C2, you must give up the power to condition on anything between C1 and C2
    • I’m not sure yet if that’s enough for more complex graphs of conditioning - anyway, we could somehow calculate which periods you cannot condition on
    • or, a potentially more powerful but trickier option is to embrace the circularity: if there are multiple ways to satisfy a circular commitment, choose a Pareto-optimal one
      • this has a similar vibe to open-source game theory
      • agents would need to state their preferences for all those possible ways
      • if there are multiple Pareto optimal solutions, probably agents with commitments earlier on the timeline should have priority
      • if there is no way to satisfy the cycle, we should break it and probably again resolve in favor of earlier committers, but I’m not sure about the details here
  • we may get another race to lock in the hash of your final decision right after you published your tentative commitment, because if you lock in very soon, that weakly indicates that you decided not to revert
    • to fix that, we may also require that you are only allowed to lock in the hash of your final decision after some time has passed since the start of the tentative period
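For the cycles problem above, one simple option is for M to refuse any set of conditionings that does not admit a resolution order. A minimal sketch using Python's standard `graphlib` module (3.9+) follows. The commitment labels and the `resolution_order` helper are hypothetical, and rejecting cycles outright is only the simplest of the options listed, not the "embrace the circularity" one.

```python
from graphlib import CycleError, TopologicalSorter


def resolution_order(depends_on: dict[str, set[str]]) -> list[str]:
    """Order in which M could resolve decisions, or raise ValueError on a cycle.

    `depends_on[c]` is the set of commitments whose final decisions c conditions on.
    """
    try:
        return list(TopologicalSorter(depends_on).static_order())
    except CycleError as err:
        # graphlib reports the detected cycle as the second element of err.args
        raise ValueError(f"circular conditioning: {err.args[1]}") from err


# The rerun example: Alice conditions "on the future" (R4), Bob's first commitment on Alice.
acyclic = {"alice_dare": {"bob_offer"}, "bob_dare": {"alice_dare"}, "bob_offer": set()}
print(resolution_order(acyclic))  # ['bob_offer', 'alice_dare', 'bob_dare']

# If Bob's offer also conditioned on Alice's decision, the two would depend on each other.
cyclic = {"alice_dare": {"bob_offer"}, "bob_offer": {"alice_dare"}}
```

Calling `resolution_order(cyclic)` raises `ValueError`, which is the point at which M would have to reject the latest conditioning or fall back to some tie-breaking rule in favor of earlier committers.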
       
  1. This would only work in very simple cases like chicken, because you would need to know in advance all the possible commitments that others can make, so that you can define exactly what "being second in a race" means.

  2. An alternative rule could be to have M generate some random number at freeze_time, and only then let an agent make the final decision, by requiring them to reference that number in the decision message. But that could create a race, where the second committer decides to Dare anyway, hoping this information will reach the first committer soon enough to sway them. For this reason we would need to postpone the generation of the second committer's random number until after the first committer's decision. But if the protocol is used by a lot of agents at the same time, and we play it safe and assume that anyone may potentially clash with anyone, then we have to postpone every commitment on the network, which scales badly.

  3. To be clear, the decisions would actually be written as formal statements, not natural language, and also explicitly state which commitments they reference.

  4. The order of sending these hashes is irrelevant here. That's why Bob can send that hash first, even though he's the second committer.

  5. It may be better to adopt Boomerang sooner rather than later: once someone has established a strategic advantage that lets them commit more ruthlessly, they will oppose the adoption of such a protocol. But agents should be keener to accept the protocol if they don't yet know whether they'll be the advantaged or the disadvantaged ones.

  6. This works best if commitments on those alternative mechanisms are crisp, so that you can clearly define what will be penalized. E.g. committing through smart contracts is crisper than committing through staking your reputation.

    But this penalization may be tricky, because it's costly for the penalizer, and you would prefer others to carry this cost. So it requires participants to coordinate to all penalize together. Here's an example technique which may help.

  7. But if we require full anonymity, we lose any positive reputation effects we had. And if we “erase the identity” of whoever behaves ruthlessly, then encountering someone with a fresh identity serves as evidence that they are ruthless, defeating the purpose of this erasure.

Comments

I like this line of thinking, and I'm really glad to see the modeling of "internal intent to commit" separate from "commitment published".  

Unfortunately, I'm not sure it applies to any real kind of commitment race.  The vast majority that I know of use ad-hoc and agent-specific commitment mechanisms, and these would seem to obviate any legible and game-able mechanism.  For instance, it seems like Alice, presuming she commits first, expects the maximum payoff, and is better off having a "real" commitment than a revocable commitment that Bob can talk her out of.

better off having a "real" commitment than a revocable commitment that Bob can talk her out of

I'm confused what you mean here. In principle Alice can revoke her commitment before the freeze time in this protocol, but Bob can't force her to do so. And if it's common knowledge that Alice's freeze time comes before Bob's, then: Since Alice knows that there will be a window after her freeze time where Bob knows Alice's commitment is frozen, and Bob has a chance to revert, then there would be no reason (barring some other commitment mechanism, including Bob being verifiably updateless while Alice isn't) for Bob not to revoke (to Swerve) if Alice refused to revert from Dare. So Alice would practically always keep her commitment.

The power to revoke commitments here is helpful in the hands of the second-mover, who made the initial incompatible commitment because of, e.g., some lag time between the first-mover's making and broadcasting the commitment.

The second mover ALREADY had the option not to commit - they could just swerve or crash, according to their decision theory.  The revocation period doesn't actually change payouts or decision mechanisms, and if it doesn't change the sequence of commitment, I don't see how it makes any difference at all.  If it DOES change the sequence of commitment, then the first-mover would prefer not to lose their advantage, and will just use a non-revocable commitment.

It seems like this is introducing some sort of information or negotiation into the decisions, but I don't see how. In MANY such games, allowing side-payments or outside-of-game considerations can find better outcomes.  This doesn't do that, as far as I can see.

The second mover ALREADY had the option not to commit - they could just swerve or crash, according to their decision theory.

The premise here is that the second-mover decided to commit soon after the first-mover did, because the proof of the first-mover's initial commitment didn't reach the second-mover quickly enough. They could have not committed initially, but they decided to do so because they had a chance of being first.

I'm not sure exactly what you mean by "according to their decision theory" (as in, what this adds here).

if it doesn't change the sequence of commitment, I don't see how it makes any difference at all

The difference is that the second-mover can say "oh shit I committed before getting the broadcast of the first-mover's commitment—I'd prefer to revoke this commitment because it's pointless, my commitment doesn't shape the first-mover's incentives in any way since I know the first-mover will just prefer to keep their commitment fixed."

As I said, the first-mover doesn't lose their advantage from this at all, because their commitment is locked (at their freeze time) before the second-mover's. So they can just leave their commitment in place, and their decision won't be swayed by the second-mover's at all because of the rule: "You shouldn’t be able to reveal the final decision to anyone before freeze_time because we don’t want the commitment to get credible before freeze_time."

The premise here is that the second-mover decided to commit soon after the first-mover did, because the proof of the first-mover's initial commitment didn't reach the second-mover quickly enough.

That's a very critical deviation from the standard problem statement, which should be made very clear.  Also, re-reading the timeline, it appears to introduce side-payments (at 0:37 in the timeline), which is also a MAJOR deviation from the standard problem.

These two things (speed of information and ability to negotiate outside of the given payoff matrix) should be separated - both are fairly easy to model, and there will be much simpler solutions to integrate each of them into the decisions, which will be better than the combination of the two limited to a revocation window.

I edited the post to make it clearer that Bob throws out the wheel because he didn't notice in time that Alice threw.

Yup, side payments are a deviation, that's why I have this disclaimer in game definition (I edited the post now to emphasize it more):

there also may be some additional actions available, but they are not obvious

Re separating speed of information and negotiations: I think here they are already pretty separate. The first example with 3 protocol rules doesn't allow negotiations and only tackles the information speed problem. The second example with the additional fourth rule enables negotiations. Maybe you could also have a system tackling only negotiations and not the information speed problem, but I'm not sure how it would look, or whether it would be much simpler.

Another problem (closely tied to negotiations) I wanted to tackle is something like "speed of deliberation" where agents make some bad commitments because they didn't have enough time to consider their consequences, and later realize they want to revoke/negotiate.

Yup, you're totally right, it may be too easy to commit in other ways, outside this protocol. But I still think it may be possible to create such a "main mechanism" for making commitments where it's just very easy/cheap/credible to commit, compared to other mechanisms. But that would require a crazy amount of cooperation.

The vast majority that I know of use ad-hoc and agent-specific commitment mechanisms

If you have some particular mechanisms in mind, could you list some? I'd like to compile a list of the most relevant commitment mechanisms to try to analyze them.

I'm not sure I'd call it "too easy to commit in other ways", so much as "this doesn't describe a commitment".  The power of a commitment is that the other player KNOWS that no strategy or discussion can change the decision.  That's the whole point.  If it's revocable or changeable, it's not a commitment, it's a meaningless statement of intent.

Real-world commitments come in many forms, from public announcements to get social pressure for follow-through to legal contracts with third parties to simply not bringing money so being unable to pay for something.

Oh yeah, I meant the final locked-in commitment, not the initial tentative one. And my point is that when committing outside is sufficiently more costly, it's not worth doing, even if that would let you commit faster.

Reminds me of Noam Brown's work with getting ML systems to play imperfect information games.