The Blackmail Equation


This is Eliezer's model of blackmail in decision theory, as presented at the recent workshop at SIAI, filtered through my own understanding. Eliezer's help and advice were much appreciated; any errors herein are my own.

The mysterious stranger blackmailing the Countess of Rectitude over her extra-marital affair with Baron Chastity doesn't have to run a complicated algorithm. He simply has to credibly commit to the course of action:

"If you don't give me money, I will reveal your affair."

And then, generally, the Countess forks over the cash. This means the blackmailer never does reveal the details of the affair, so the threat remains entirely counterfactual. Even if the blackmailer is Baron Chastity, and the revelation would be devastating for him as well, this makes no difference at all, as long as he can credibly commit to carrying out the threat. In a world of perfect decision makers, there is no risk in doing so, because the Countess will hand over the money, so the Baron will not take the hit from the revelation.

Indeed, the baron could replace "I will reveal our affair" with Z="I will reveal our affair, then sell my children into slavery, kill my dogs, burn my palace, and donate my organs to medical science while boiling myself in burning tar" or even "I will reveal our affair, then turn on an unfriendly AI", and it would only matter if this changed his pre-commitment to Z. If the Baron can commit to counterfactually doing Z, then he never has to do Z (as the countess will pay him the hush money), so it doesn't matter how horrible the consequences of Z are to himself.

To get some numbers in this model, assume the countess can either pay up or not do so, and the baron can reveal the affair or keep silent. The payoff matrix could look something like this:

(Baron, Countess)    Pay            Not pay
Reveal               (-90,-110)     (-100,-100)
Silent               (10,-10)       (0,0)

Both the countess and the baron get -100 utility if the affair is revealed, while the countess transfers 10 of her utilitons to the baron if she pays up. Staying silent and not paying have no effect on the utility of either.
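As a minimal sketch, the payoff rule just described can be written out and checked against the four cells of the matrix. The names `REVEAL_COST`, `HUSH_MONEY`, `payoff` and `PAYOFFS` are illustrative choices of mine, not part of the original model.

```python
# Payoffs are (baron, countess), keyed by (baron's action, countess's action).
REVEAL_COST = 100  # utility each of them loses if the affair is revealed
HUSH_MONEY = 10    # utility transferred from countess to baron if she pays


def payoff(baron_action, countess_action):
    """Apply the two rules from the text to get one cell of the matrix."""
    baron = countess = 0
    if baron_action == "reveal":
        baron -= REVEAL_COST
        countess -= REVEAL_COST
    if countess_action == "pay":
        baron += HUSH_MONEY
        countess -= HUSH_MONEY
    return (baron, countess)


PAYOFFS = {
    (b, c): payoff(b, c)
    for b in ("reveal", "silent")
    for c in ("pay", "not_pay")
}

for cell, utilities in PAYOFFS.items():
    print(cell, utilities)
```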

Let's see how we could implement the blackmailing if the baron and the countess were running simple decision algorithms. The baron has a variety of tactics he could implement. What is a tactic, for the baron? A tactic is a list of responses he could implement, depending on what the countess does. His four tactics are:

  1. (Pay, NPay)→(Reveal, Silent)        "anti-blackmail" : if she pays, tell all, if she doesn't, keep quiet
  2. (Pay, NPay)→(Reveal, Reveal)      "blabbermouth" : whatever she does, tell all
  3. (Pay, NPay)→(Silent, Silent)         "not-a-word" : whatever she does, keep quiet
  4. (Pay, NPay)→(Silent, Reveal)        "blackmail" : if she pays, keep quiet, if she doesn't, tell all
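One way to see why there are exactly four tactics: a tactic assigns one of two responses to each of the countess' two moves, giving 2 × 2 combinations. The sketch below enumerates them; the string labels and the helper `name_of` are hypothetical names introduced for illustration.

```python
from itertools import product

COUNTESS_MOVES = ("pay", "not_pay")
BARON_ACTIONS = ("reveal", "silent")

# A tactic is one response per countess move, so there are 2 ** 2 = 4 of them.
TACTICS = [
    dict(zip(COUNTESS_MOVES, responses))
    for responses in product(BARON_ACTIONS, repeat=len(COUNTESS_MOVES))
]

# Names from the list above, keyed by (response to pay, response to not_pay).
NAMES = {
    ("reveal", "silent"): "anti-blackmail",
    ("reveal", "reveal"): "blabbermouth",
    ("silent", "silent"): "not-a-word",
    ("silent", "reveal"): "blackmail",
}


def name_of(tactic):
    """Look up the name of a tactic given as a move -> response mapping."""
    return NAMES[(tactic["pay"], tactic["not_pay"])]


for tactic in TACTICS:
    print(name_of(tactic), tactic)
```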

The countess, in contrast, has only two tactics: pay or don't pay. Each will try to estimate what the other will do, so the baron must model the countess, who must model the baron in turn. This seems as if it leads to infinite regress, but the baron has a short-cut: when reasoning counterfactually as to which tactic to implement, he will substitute that tactic in his model of how the countess models him.

In simple terms, this means that when he is musing 'what would happen if I were to anti-blackmail, hypothetically', he assumes that the countess would model him as an anti-blackmailer. In that case, the countess' decision is easy: her utility-maximising decision is not to pay, leaving them with a payoff of (0,0).

Similarly, if he counterfactually considers the blabbermouth tactic, then if the countess models him as such, her utility-maximising tactic is also not to pay up, giving a payoff of (-100,-100). Not-a-word results in a payoff of (0,0), and only if the baron implements the blackmail tactic will the countess pay up, giving a payoff of (10,-10). Since this maximises his utility, he will implement the blackmail tactic. And the countess will pay him, to minimise her utility loss.
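The reasoning of the last two paragraphs can be sketched as a short loop: for each tactic, assume the countess takes it as given and best-responds, then let the baron pick the tactic with the best resulting payoff for himself. This is only an illustration under the payoffs given earlier; the function and variable names are my own.

```python
# (baron action, countess move) -> (baron utility, countess utility)
PAYOFFS = {
    ("reveal", "pay"): (-90, -110), ("reveal", "not_pay"): (-100, -100),
    ("silent", "pay"): (10, -10),   ("silent", "not_pay"): (0, 0),
}

# tactic name -> baron's response to each countess move
TACTICS = {
    "anti-blackmail": {"pay": "reveal", "not_pay": "silent"},
    "blabbermouth":   {"pay": "reveal", "not_pay": "reveal"},
    "not-a-word":     {"pay": "silent", "not_pay": "silent"},
    "blackmail":      {"pay": "silent", "not_pay": "reveal"},
}


def countess_best_move(tactic):
    """The countess treats the tactic as fixed and maximises her own utility."""
    return max(("pay", "not_pay"),
               key=lambda move: PAYOFFS[(tactic[move], move)][1])


def outcome(name):
    """Return (countess move, payoff pair) if the baron is modelled as `name`."""
    tactic = TACTICS[name]
    move = countess_best_move(tactic)
    return move, PAYOFFS[(tactic[move], move)]


results = {name: outcome(name) for name in TACTICS}
best_tactic = max(results, key=lambda name: results[name][1][0])

for name, (move, utilities) in results.items():
    print(name, move, utilities)
print("baron chooses:", best_tactic)
```

The countess only pays in the blackmail row, which is also the only row where the baron comes out ahead.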

Notice that in order for this to work, the baron needs four things:

  1. The baron needs to make his decision after the countess does, so she cannot react to his action.
  2. The baron needs to make his decision after the countess does, so he can react to her action.
  3. The baron needs to be able to precommit to a specific tactic (in this case, blackmail).
  4. The baron needs the countess to find his precommitment plausible.

If we were to model the two players as timeless AIs implementing specific decision theories, what would these conditions become? They can be cast as:

  1. The baron and the countess must exchange their source code.
  2. The baron and the countess must both be rational.
  3. The countess' available tactics are simply to pay or not to pay.
  4. The baron's available tactics are conditional tactics, dependent on what the countess' decision is.
  5. The baron must model the countess as seeing his decision as a fixed fact over which she has no influence.
  6. The countess must indeed see the baron's decision as a fixed fact over which she has no influence.

The baron occupies what Eliezer termed a superior epistemic vantage.

Could two agents be in superior epistemic vantage, as laid out above, one over the other? This is precluded by the set-up above*, as two agents cannot both be correct in assuming that the other treats their own decision as a fixed fact, while both running counterfactuals conditioning their response on the varying tactics of the other.

"I'll tell, if you don't send me the money, or try and stop me from blackmailing you!" versus "I'll never send you the money, if you blackmail me or tell anyone about us!"

Can the countess' brother, the Archduke of Respectability, blackmail the baron on her behalf? If the archduke is in a superior epistemic vantage to the baron, then there is no problem. He could choose a tactic that is dependent on the baron's choice of tactics, without starting an infinite loop, as the baron cannot do the same to him. The most plausible version would go:

"If you blackmail my sister, I will shoot you. If you blabbermouth, I will shoot you. Anti-blackmail and not-a-word are fine by me, though."

Note that Omega, in Newcomb's problem, is occupying the superior epistemic vantage. His final tactic is the conditional Z="if you two-box, I put nothing in box A; if you one-box, I put in a million pounds," whereas you do not have access to tactics along the lines of "if Omega implements Z, I will two-box; if he doesn't, I will one-box". Instead, like the countess, you have to assume that Omega will indeed implement Z, accept this as fact, and then choose simply to one-box or two-box.
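The parallel with the countess can be sketched in the same style: Omega's tactic Z is a fixed function of your choice, and you simply pick the choice with the higher payoff. The text does not state what the second box holds, so `SMALL_PRIZE` below is purely an assumption for illustration, as are all the names.

```python
MILLION = 1_000_000
SMALL_PRIZE = 1_000  # assumption: contents of the second box, not given in the text


def omega_z(choice):
    """Omega's conditional tactic Z: the contents of box A given your choice."""
    return MILLION if choice == "one-box" else 0


def your_payoff(choice):
    """Accept Z as a fixed fact and add up what you walk away with."""
    box_a = omega_z(choice)
    return box_a + (SMALL_PRIZE if choice == "two-box" else 0)


best_choice = max(("one-box", "two-box"), key=your_payoff)
print(best_choice, your_payoff(best_choice))
```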

*The argument, as presented here, is a lie, but spelling out the true version would be tedious and tricky. The countess, for instance, is perfectly free to indulge in counterfactual speculations that the baron may decide something else, as long as she and the baron are both aware that these speculations will never influence her decision. Similarly, the baron is free to model her doing so, as long as this similarly leads to no difference. The countess may have a dozen other options, not just the two presented here, as long as they both know she cannot make use of them. There is a whole issue of extracting information from an algorithm and a source code here, where you run into entertaining paradoxes such as if the baron knows the countess will do something, then he will be accurate, and can check whether his knowledge is correct; but if he didn't know this fact, then it would be incorrect. These are beyond the scope of this post.


[EDIT] The impossibility of the countess and the baron being each in epistemic vantage over the other has been clarified, and replaces the original point - about infinite loops - which only implied that result for certain naive algorithms.

[EDIT] Godelian reasons make it impossible to bandy about "he is rational and believes X, hence X is true" with such wild abandon. I've removed the offending lines.

[EDIT] To clarify issues, here is a formal model of how the baron and countess could run their decision theories. Let X be a fact about the world, let S_B be the baron's source code, and let S_C be the countess' source code.

Baron(S_C)

Utility of pay = 10, utility of reveal = -100

Based on S_C, if the countess would accept the baron's behaviour as a fixed fact, run:

Let T={anti-blackmail, blabbermouth, not-a-word, blackmail}

For t_b in T, compute utility of the outcome implied by Countess(t_b,S_B). Choose the t_b that maximises it.


Countess(X, S_B)

If X implies the baron's tactic t_b, then accept t_b as fixed fact.

If not, run Baron(S_C) to compute the baron's tactic t_b. Stop as soon as the tactic is found. Accept as fixed fact.

Utility of pay = -10, utility of reveal = -100.

Let T={pay, not pay}

For t_c in T, under the assumption of t_b, compute utility of outcome. Choose t_c that maximises it.


Both these agents are rational with each other, in that they correctly compute each other's ultimate decisions in this situation. They are not perfectly rational (or rather, their programs are incomplete) in that they do not perform well against general agents, and may fall into infinite loops as written.
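The two procedures above can be sketched in Python, with some loudly stated simplifications: each agent's "source code" is stood in for by the function itself, the fact X is either a tactic name or None (meaning X says nothing about the baron's tactic), and the baron's check that the countess would accept his behaviour as a fixed fact is assumed to pass rather than computed. None of these choices come from the original model.

```python
# (baron action, countess move) -> (baron utility, countess utility)
PAYOFFS = {
    ("reveal", "pay"): (-90, -110), ("reveal", "not_pay"): (-100, -100),
    ("silent", "pay"): (10, -10),   ("silent", "not_pay"): (0, 0),
}
TACTICS = {
    "anti-blackmail": {"pay": "reveal", "not_pay": "silent"},
    "blabbermouth":   {"pay": "reveal", "not_pay": "reveal"},
    "not-a-word":     {"pay": "silent", "not_pay": "silent"},
    "blackmail":      {"pay": "silent", "not_pay": "reveal"},
}


def countess(x, baron_program):
    """Countess(X, S_B): return her move, 'pay' or 'not_pay'."""
    if x in TACTICS:
        t_b = x                         # X implies the tactic: accept it as fixed
    else:
        t_b = baron_program(countess)   # otherwise run Baron(S_C) to find it
    tactic = TACTICS[t_b]
    return max(("pay", "not_pay"),
               key=lambda t_c: PAYOFFS[(tactic[t_c], t_c)][1])


def baron(countess_program):
    """Baron(S_C): return the name of the tactic that maximises his utility."""
    def utility(t_b):
        t_c = countess_program(t_b, baron)  # Countess(t_b, S_B)
        return PAYOFFS[(TACTICS[t_b][t_c], t_c)][0]
    return max(TACTICS, key=utility)


print(baron(countess))           # the baron's chosen tactic
print(countess(None, baron))     # the countess' move when X is uninformative
```

In this particular sketch the mutual calls terminate, because the baron always hands the countess a tactic as X, so she never has to call him back. Other ways of wiring the two programs together would not have that property.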