A distillation of my understanding of the commitment races problem.

Greaser Courage

It's 1950 and you're a greaser in a greaser gang. Recreationally, you and a rival drive your heavy American cars (without seatbelts) straight at each other at speed and see who swerves first. Whoever swerves first is a coward and is "chicken"; their opponent is courageous and the victor. Both drivers swerving is less humiliating than being the only one to swerve. If neither driver swerves, both drivers die.

The bolts on your steering wheel have been loosened in the chop shop, and so your steering wheel can be removed if you pull it out towards you.

If you remove your steering wheel and throw it prominently out your window, then your opponent will see this and realize that you are now incapable of swerving. They will then swerve, as they prefer humiliation to death. This wins you glory that you live to see.

But if you both simultaneously throw your steering wheels out the window, then neither of you will be able to swerve, and both of you will die.
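A minimal sketch of the underlying game, for concreteness. The payoff numbers below are illustrative assumptions, not anything from the story; only their ordering matters. The point is that once one driver visibly discards the option to swerve, the other driver's best response flips to swerving, while two committed drivers both take the crash payoff.

```python
# Illustrative payoffs for chicken, from a single driver's perspective.
# Numbers are arbitrary; only their ordering matters:
# crash < swerve-alone < both-swerve < win.
CRASH, SWERVE_ALONE, BOTH_SWERVE, WIN = -100, -10, -5, 10

def payoff(my_move: str, their_move: str) -> int:
    """Payoff to 'me' given both drivers' moves ('swerve' or 'straight')."""
    if my_move == "straight" and their_move == "straight":
        return CRASH
    if my_move == "straight" and their_move == "swerve":
        return WIN
    if my_move == "swerve" and their_move == "straight":
        return SWERVE_ALONE
    return BOTH_SWERVE

def best_response(their_move: str, my_options: list[str]) -> str:
    """Pick my best move among the options I still have."""
    return max(my_options, key=lambda m: payoff(m, their_move))

# Against an opponent who has thrown out their steering wheel (their only
# remaining move is 'straight'), my best response is to swerve:
# humiliation beats death.
print(best_response("straight", ["swerve", "straight"]))  # -> 'swerve'

# But if both of us have already discarded 'swerve', we both go straight
# and both take the crash payoff.
print(payoff("straight", "straight"))  # -> -100
```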

Commitment Races

The two greasers above are thus in a situation where each can do better individually by throwing out their steering wheel quickly, but both fare worse if both adopt this strategy. Both-drivers-keeping-their-steering-wheels is a commons that each greaser can draw from, but both do poorly if the commons is exhausted. Greasers with unloosened steering wheels don't share such a commons: the commons only exists because each greaser is able to commit ahead of time to not swerving. Because both greasers are itching to be the first to commit to not swerving, we say that greasers playing chicken with loosened steering wheels are in a commitment race with each other.

Besides throwing out steering wheels, other kinds of actions let agents precommit ahead of time, in view of their opponents, to how they will act. If you can alter your source code so that you will definitely pass up a somewhat desirable contract if your ideal terms aren't met, you'll be offered better contracts than agents that can't alter their source code. Alice the human and Bot the human-intelligence AGI are trading with each other. Bot has edit access to his own source code; Alice does not have access to hers. Before beginning any trading negotiations with Alice, Bot modifies his own source code so that he won't accept less than almost all of the value pie. Bot then shows this self-modification to Alice before going to trade. Alice will now offer Bot a trade where she gets almost nothing and Bot gets almost everything: Alice still prefers a trade where she gets almost nothing to a trade where she gets literally nothing.
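Here is a minimal sketch of why Alice's best response flips once Bot's commitment is visible. The 10-unit pie, the 0.5-unit increments, and the specific 9.5 minimum are illustrative assumptions, not details from the scenario.

```python
# Splitting a pie of 10 units of value. 'd' is Bot's share under a proposed
# deal; Alice gets the remainder if Bot accepts, and both get 0 otherwise.
PIE = 10

def bot_accepts(bot_share: float, bot_minimum: float) -> bool:
    """After self-modification, Bot rejects any deal below his hard minimum."""
    return bot_share >= bot_minimum

def alice_best_offer(bot_minimum: float) -> float:
    """Bot's share (in 0.5-unit steps) under the best deal Alice can get accepted."""
    candidates = [d / 2 for d in range(0, 2 * PIE + 1)]  # 0.0, 0.5, ..., 10.0
    acceptable = [d for d in candidates if bot_accepts(d, bot_minimum)]
    # Alice's payoff is PIE - d if Bot accepts, 0 otherwise, so she proposes
    # the smallest acceptable share for Bot, if doing so leaves her anything.
    best = min(acceptable, default=None)
    if best is None or PIE - best <= 0:
        return 0.0
    return best

# Before self-modification, Bot would take any positive share, so Alice
# can keep almost everything (here: give Bot 0.5 and keep 9.5).
print(alice_best_offer(bot_minimum=0.5))   # -> 0.5

# After Bot verifiably commits to rejecting anything below 9.5 of the 10,
# Alice's best remaining move is to hand over 9.5 and keep 0.5:
# almost nothing still beats literally nothing.
print(alice_best_offer(bot_minimum=9.5))   # -> 9.5
```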

Something perverse has happened here! Before self-modifying, Bot had a huge panoply of possible contracts that he could agree to or reject. After self-modifying, Bot had strictly fewer options available to him. Bot did better in life by throwing away options that he previously had! Alice and Bot entered into a trade relationship because they both understand the notion of positive-sum interactions; they're both smart, sophisticated agents. Alice brought all those smarts with her into the trading room. Bot threw some of his options away and made his future self effectively dumber. Usually, when we study rationality, we find that intelligence is good for finding and choosing the best option out of a big set of alternatives. Smart rationalists ordinarily want a lot of options, because more options mean more possible stabs at an even better alternative, and they want their reasoning to steer the final decision after sifting through those alternatives. With commitment races, being smarter and sifting through more options is a losing move. Being smart in a commitment race is, unusually, a losing position -- you win in a commitment race to the extent that you can make yourself dumb, fast.

Eliezer:

IMO, commitment races only occur between agents who will, in some sense, act like idiots, if presented with an apparently 'committed' agent.  If somebody demands $6 from me in the Ultimatum game, threatening to leave us both with $0 unless I offer at least $6 to them... then I offer $6 with slightly less than 5/6 probability, so they do no better than if they demanded $5, the amount I think is fair.  They cannot evade that by trying to make some 'commitment' earlier than I do.  I expect that, whatever is the correct and sane version of this reasoning, it generalizes across all the cases.

I am not locked into warfare with things that demand $6 instead of $5.  I do not go around figuring out how to invert their utility function for purposes of threatening them back - 'destroy all utility-function inverters (but do not invert their own utility functions)' was my guessed commandment that would be taught to kids in dath ilan, because you don't want reality to end up full of utility-function inverters.

From the beginning, I invented timeless decision theory because of being skeptical that two perfectly sane and rational hyperintelligent beings with common knowledge about each other would have no choice but mutual defection in the oneshot prisoner's dilemma.  I suspected they would be able to work out Something Else Which Is Not That, so I went looking for it myself.  I suggest cultivating the same suspicion with respect to the imagination of commitment races between Ultimatum Game players, in which whoever manages to make some move logically first walks away with $9 and the other poor agent can only take $1 - especially if you end up reasoning that the computationally weaker agent should be the winner.

Eliezer's argument above is that rational agents use precommitments to shape their outward-facing incentive profile so that agents trading with them are incentivized to offer Shapley splits of the value pie. On Eliezer's view, one should precommit such that any agent like Bot gets no higher an EV by offering you almost nothing than by offering you the Shapley split.
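As a worked sketch of the policy in Eliezer's quote, using his $10 Ultimatum-game numbers (the epsilon margin and the code itself are illustrative assumptions):

```python
# Eliezer's policy in the $10 Ultimatum game, as described in the quote above:
# treat $5 as the fair demand, and give in to a demand d > $5 only with
# probability slightly below 5/d, so demanding more than the fair share
# never pays.
FAIR = 5
EPSILON = 0.01  # "slightly less than" -- a small illustrative margin

def acceptance_probability(demand: float) -> float:
    """Probability of giving in to a demand for `demand` dollars out of $10."""
    if demand <= FAIR:
        return 1.0
    return max(0.0, FAIR / demand - EPSILON)

def demander_expected_value(demand: float) -> float:
    """The demander's expected dollars against this acceptance policy."""
    return demand * acceptance_probability(demand)

for demand in [5, 6, 7, 8, 9]:
    print(demand, round(demander_expected_value(demand), 2))
# Demanding $5 yields exactly $5 in expectation, while demands of $6-$9 each
# yield slightly under $5: no demand above the fair split does better than
# simply demanding the fair $5.
```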

Bot has read enough Eliezer on LessWrong to know how Eliezer will precommit. Knowing this, Bot does best by offering Eliezer Shapley splits, and by accepting Eliezer's Shapley offers. Because Bot knows what Eliezer is like, Bot has no reason to self-modify before trading with Eliezer. Bot happily rips off Alice, but doesn't rip off Eliezer.

But what if Bot doesn't yet know much about Eliezer? If Bot reasons that self-modifying is worthwhile in expectation before it reasons about what Eliezer is like as a trader, then Bot will end up offering Eliezer almost nothing anyway. Bot would have moved first in 'logical time,' and so Eliezer's precommitment footwork wouldn't have subjunctively prevented Bot's self-modification. Moreover, if an as-of-yet ignorant Bot has some premonition that learning more about Eliezer will make Bot encounter truths he'd rather not encounter, Bot can self-modify on the basis of that premonition, before risking reading up on Eliezer on LessWrong. Note that it would not be sufficient for Bot to self-modify to forget any inconvenient truths Bot had learned about Eliezer. Bot's self-modifications really do have to precede Bot's understanding of Eliezer, as Eliezer's incentive profile will leave a forgot-and-then-self-modified Bot worse off.

Bot will only win in a commitment race with Eliezer if Bot self-modifies for the wrong reason, in advance of understanding why self-modification is valuable. Bot, if successful, acted on mere premonitions in his prior reasoning about self-modification. Bot got to, and could only get to, a winning state in the logical mental game against Eliezer "by accident."

Bot will be more and more tempted to self-modify the more it learns about bargaining and decision theory. But Bot will be more and more averse to self-modification against Eliezer as it improves its Eliezer model. If Eliezer takes pains to be easy to model, then Bot will become averse to self-modifying against Eliezer sooner in Bot's reflection. But because Eliezer would never precommit to probably turn down a rock with an un-Shapley offer painted on its front (because non-agents bearing fixed offers created ex nihilo cannot be deterred or made less likely through any precommitment) there's always some state for Bot to stumble into in its path of reflection and self-modification where Bot comes out on top.[1]

 

  1. ^

    H/t James Faville for this idea.

Comments

Eliezer would never precommit to probably turn down a rock with an un-Shapley offer painted on its front (because non-agents bearing fixed offers created ex nihilo cannot be deterred or made less likely through any precommitment)

But we do not live in a universe where such rocks commonly appear ex nihilo. So when you come across even a non-agent making an unfair offer, you have to consider the likelihood that it was created by an adversarial agent (as is indeed the case in your Bot example).

Sure, but Bot is only being adversarial against Alice, not Eliezer, since it makes the decision to precommit before learning whether its opponent is Alice or Eliezer. (To be clear, here I just use the shorthands Alice = less sophisticated opponent and Eliezer = more sophisticated opponent.) To put it differently, Bot would have made the same decision even if it was sure that Eliezer would punish precommitment, since according to its model at the time it makes the precommitment, its opponent is more likely to be an Alice than an Eliezer.

So the only motivation for Eliezer to punish the precommitment would be if he is "offended at the unfairness on Alice's behalf", i.e. if his notion of fairness depends on Alice's utility function as though it were Eliezer's.

But because Eliezer would never precommit to probably turn down a rock with an un-Shapley offer painted on its front (because non-agents bearing fixed offers created ex nihilo cannot be deterred or made less likely through any precommitment) there's always some state for Bot to stumble into in its path of reflection and self-modification where Bot comes out on top.


This is exactly why Eliezer (and I) would turn down a rock with an unfair offer. Sure, there's some tiny chance that it was indeed created ex nihilo, but it's far more likely that it was produced by some process that deliberately tried to hide the process that produced the offer. 

Moreover, if an as-of-yet ignorant Bot has some premonition that learning more about Eliezer will make Bot encounter truths he'd rather not encounter, Bot can self-modify on the basis of that premonition, before risking reading up on Eliezer on LessWrong.

This depends on the basis of that premonition. At every point, Bot is considering the effect of its commitment on some space of possible agents, which gets narrowed down whenever Bot learns more about Eliezer. If Bot knows everything about Eliezer when it makes the commitment, then of course Eliezer should not give in. If Bot knows some evidence, then actual-Eliezer is essentially in a cooperation-dilemma with the possible agents that Bot thinks Eliezer could be. Then, Eliezer should not give in if he thinks that he can logically cooperate with enough of the possible agents to make Bot's commitment unwise.

This isn't true all of the time; I expect that a North Korean version of Eliezer would give in to threats, since he would be unable to logically cooperate with enough of the population to make the government see threat-making as pointless. Still, I do expect that the situations which Eliezer (and I, for that matter) encounters in the future will not be like this.

The solution of precommitting on a meta level by adopting a decision theory that's not susceptible to this kind of race is pretty strong, in a universe where you encounter such things, and where you matter to the bot.  Bot's counter-counter is to self-modify to offer small amounts, and train the humans to take the deal rather than undertaking clever precommitments.  In a universe that's richer and has a mix of agents with different capabilities and mechanisms for decision-making, it'll depend on how much knowledge you have of your opponent.

The real lesson in thought experiments of this nature is "whoever can model their opponent better than their opponent can model them has a huge edge".  Also, real games are far more complicated, and whoever can model the rest of life in relation to the game at hand better has an edge as well.

Yeah, the crucial point is whoever out-models the opponent wins.

If you can alter your source code so that you will definitely pass up a somewhat desirable contract if your ideal terms aren't met, you'll be offered better contracts than agents that can't alter their source code.

Why do you think this is true as a general rule?

It obviously won't be true in a population where everyone modifies their source code in the same way. Everyone will reject every contract there, and (if their modified source code allows them any sapience apart from the rule about accepting contracts) everyone will know that all contracts will be rejected and not bother to offer any.

It might be effective in a population of mostly CDT theorists who look only at the direct causal outcomes of their actions in whatever situation they happen to have stumbled into. However, even a CDT user might determine that accepting a bad deal now raises the probability of future bots modifying their source code and offering bad deals in the future that would otherwise have been good. This might outweigh the immediate slight utility gain of accepting a single bad deal rather than no deal. Given that "there are lots of other people who decide in much the same way that I do" is outside CDT's scope, the effect will be weak and only decisive if there are likely to be many future deals with bots that use the outcome of this particular deal to decide whether to self-modify.

It certainly wouldn't be effective in a population of largely FDT users who know that most others use FDT, since predictably accepting bad deals yields a world in which you only get offered bad deals. Altering their own source code is just one of the many ways that they can credibly offer bad deals to you, and is nothing special. Missing out occasionally on a little utility if you happen to encounter an irrational self-modified bot is more than made up for by the gains from being in a world where self-modifying is irrational.

It's also no use in a world where most of the population has a "fairness" heuristic and doesn't care whether it's rational or not. The bot may get a few 90% deals, and in that sense "gets offered better contracts", but misses out on a lot more 50% deals that would have been of greater total benefit.

Bot did better in life by throwing away options that he previously had!

 

Being smart in a commitment race is, unusually, a losing position -- you win in a commitment race to the extent that you can make yourself dumb, fast.

Not exactly -- Bot did better by making Alice believe that he didn't have certain options, which happened to be facilitated by throwing away those options. Throwing away options is useless (tactically, that is; it might be useful psychologically) unless it affects others' actions.