Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

This is a linkpost for https://doi.org/10.1093/pq/pqaa086

Extracting Money from Causal Decision Theorists


I'm confused: as a buyer, if I believed the seller could predict with probability .75, I would flip a fair coin to decide which box to take, meaning that the seller couldn't predict with probability .75. If I can't randomize to pick a box, I'm not sure how to fit what you are doing into standard game theory (which I teach).

If standard game theory has nothing to say about what to do in situations where you don't have access to an unpredictable randomization mechanism, so much the worse for standard game theory, I say!

I thought the ability to deploy mixed strategies was a pretty standard part of CDT. Is this not the case, or are you considering a non-standard formulation of CDT?

I think some people may have their pet theories which they call CDT and which require randomization. But CDT as it is usually/traditionally described doesn't ever insist on randomizing (unless randomizing has a positive *causal* effect). In this particular case, even if a randomization device were made available, CDT would either uniquely favor one of the boxes or be indifferent between all distributions over the two boxes. Compare Section IV.1 of the paper.

What you're referring to are probably so-called ratificationist variants of CDT. These would indeed require randomizing 50-50 between the two boxes. But one can easily construct scenarios which trip these theories up. For example, the seller could put no money in any box if she predicts that the buyer will randomize. Then no distribution is ratifiable. See Section IV.4 for a discussion of Ratificationism.

>For example, the seller could put no money in any box if she predicts that the buyer will randomize.

This is a bit unsatisfying, because in my view of decision theory you don't get to predict things like "the agent will randomize" or "the agent will take one box but feel a little wistful about it" and so on. This is unfair, in the same way as predicting that "the agent will use UDT" and punishing for it is unfair. No, you just predict the agent's output. Or if the agent can randomize, you can sample (as many times as you like, but finitely many) from the distribution of the agent's output. A bit more on this here, though the post got little attention.

Can your argument be extended to this case?

>in my view of decision theory you don't get to predict things like "the agent will randomize"

Why not? You surely agree that sometimes people can in fact predict such things. So your objection must be that it's unfair when they do and that it's not a strike against a decision theory if it causes you to get money-pumped in those situations. Well... why? Seems pretty bad to me, especially since some extremely high-stakes real-world situations our AIs might face will be of this type.

Sure, and sometimes people can predict things like "the agent will use UDT" and use that to punish the agent. But this kind of prediction is "unfair" because it doesn't lead to an interesting decision theory - you can punish any decision theory that way. So to me the boundaries of "fair" and "unfair" are also partly about mathematical taste and promising-ness, not just what will lead to a better tank and such.

Right, that kind of prediction is unfair because it doesn't lead to an interesting decision theory... but I asked why you don't get to predict things like "the agent will randomize." All sorts of interesting decision theory comes out of considering situations where you do get to predict such things. (Besides, such situations are important in real life.)

I might suggest "not interesting" rather than "not fair" as the complaint. One can imagine an Omega that leaves the box empty if the player is unpredictable, or if the player doesn't rigorously follow CDT, or just always leaves it empty regardless. But there's no intuition pump that it drives, and no analysis of why a formalization would or wouldn't get the right answer.

When I'm in challenge-the-hypothetical mode, I defend CDT by making the agent believe Omega cheats. It's a trick box that changes contents AFTER the agent chooses, BEFORE the contents are revealed. This is much higher probability to any rational agent than mind-reading or extreme predictability.

On the more philosophical points. My position is perhaps similar to Daniel K's. But anyway...

Of course, I agree that problems that punish the agent for using a particular theory (or using float multiplication or feeling a little wistful or stuff like that) are "unfair"/"don't lead to interesting theory". (Perhaps more precisely, I don't think our theory needs to give algorithms that perform optimally in such problems in the way I want my theory to "perform optimally" in Newcomb's problem. Maybe we should still expect our theory to say something about them, in the way that causal decision theorists feel like CDT has interesting/important/correct things to say about Newcomb's problem, despite Newcomb's problem being designed to (unfairly, as they allege) reward non-CDT agents.)

But I don't think these are particularly similar to problems with predictions of the agent's distribution over actions. The distribution over actions is behavioral, whereas performing floating point operations or whatever is not. When randomization is allowed, the subject of your choice is which distribution over actions you play. So to me, which distribution over actions you choose in a problem with randomization allowed, is just like the question of which action you take when randomization is not allowed. (Of course, if you randomize to determine which action's expected utility to calculate first, but this doesn't affect what you do in the end, then I'm fine with not allowing this to affect your utility, because it isn't behavioral.)

I also don't think this leads to uninteresting decision theory. But I don't know how to argue for this here, other than by saying that CDT, EDT, UDT, etc. don't really care whether they choose from/rank a set of distributions or a set of three discrete actions. I think ratificationism-type concepts are the only ones that break when allowing discontinuous dependence on the chosen distribution and I don't find these very plausible anyway.

To be honest, I don't understand the arguments against predicting distributions and predicting actions that you give in that post. I'll write a comment on this to that post.

Let's start with the technical question:

>Can your argument be extended to this case?

No, I don't think so. Take the following class of problems: the agent can pick any distribution over actions, and the final payoff is determined only as a function of the implemented action and some finite number of samples generated by Omega from that distribution. Note that the expectation is continuous in the distribution chosen. It can therefore be shown (using e.g. Kakutani's fixed-point theorem) that there is always at least one ratifiable distribution. See Theorem 3 at https://users.cs.duke.edu/~ocaspar/NDPRL.pdf .

(Note that the above is assuming the agent maximizes expected vNM utility. If, e.g., the agent maximizes some lexical utility function, then the predictor can just take, say, two samples and if they differ use a punishment that is of a higher lexicality than the other rewards in the problem.)

Thanks! That's what I wanted to know. Will reply to the philosophical stuff in the comments to the other post.

How often do you encounter a situation where an unpredictable randomization mechanism is unavailable?

I agree with both of Daniel Kokotajlo's points (both of which we also make in the paper in Sections IV.1 and IV.2): Certainly for humans it's normal to not be able to randomize; and even if it was a primarily hypothetical situation without any obvious practical application, I'd still be interested in knowing how to deal with the absence of the ability to randomize.

Besides, as noted in my other comment, insisting on the ability to randomize doesn't get you that far (cf. Sections IV.1 and IV.4 on Ratificationism): even if you always have access to some nuclear decay noise channel, your choice of whether to consult that channel (or of whether to factor the noise into your decision) is still deterministic. So you can set up scenarios where you are punished for randomizing. In the particular case of the Adversarial Offer, the seller might remove all money from both boxes if she predicts the buyer to randomize.

The reason why our main scenario just assumes that randomization isn't possible is that our target of attack in this paper is primarily CDT, which is fine with not being allowed to randomize.

Every day. But even if it was only something that happened in weird hypotheticals, my point would still stand.

Care to elaborate on the every day thing? Aside from literal coins, your cell phone is perfectly capable of generating pseudorandom numbers, and I'm almost never without mine.

I guess whether your point stands depends on whether we are more concerned with abstract theory or practical decision making.

Here are some circumstances where you don't have access to an unpredictable random number generator:

--You need to make a decision very quickly and so don't have time to flip a coin

--Someone is watching you and will behave differently towards you if they see you make the decision via randomness, so consulting a coin isn't a random choice between options but rather an additional option with its own set of payoffs

--Someone is logically entangled with you and if you randomize they will no longer be.

--You happen to be up against someone who is way smarter than you and can predict your coin / RNG / etc.

Admittedly, while in some sense these things happen literally every day to all of us, they typically don't happen for important decisions.

But there are important decisions having to do with acausal trade that fit into this category, that either we or our AI successors will face one day.

And even if that wasn't true, decision theory is decision THEORY. If one theory outperforms another in some class of cases, that's a point in its favor, even if the class of cases is unusual.

EDIT: See Paul Christiano's example below, it's an excellent example because it takes Caspar's paper and condenses it into a very down-to-earth, probably-has-actually-happened-to-someone-already example.

I’ve picked up my game theory entirely informally. But in real world terms, perhaps we’re imagining a situation where a randomization approach isn’t feasible for some other reason than a random number generator being unavailable.

This connects slightly with the debate over whether or not to administer untested COVID vaccine en masse. To pick randomly “feels scary” compared to picking “for a reason,” but to pick “for a reason” when there isn’t an actual evidence basis yet undermines the authority of regulators, so regulators don’t pick anything until they have a “good reason” to do so. Their political calculus, in short, makes them unable to use a randomization scheme.

So in terms of real world applicability, the constraint on a non-randomizing strategy seems potentially relevant, although the other aspects of this puzzle don’t map onto COVID vaccine selection specifically.

Yeah, basically standard game theory doesn't really have anything to say about the scenarios of the paper, because they don't fit the usual game-theoretical models.

By the way, the paper has some discussion of what happens if you insist on having access to an unpredictable randomization device; see Section IV.1 and the discussion of Ratificationism in Section IV.4. (The latter may be of particular interest because Ratificationism is somewhat similar to Nash equilibrium. Unfortunately, the section doesn't explain Ratificationism in detail.)

I like the following example:

- Someone offers to play rock-paper-scissors with me.
- If I win I get $6. If I lose, I get $5.
- Unfortunately, I've learned from experience that this person beats me at rock-paper-scissors 40% of the time, and I only beat them 30% of the time, so in expectation I lose $0.20 by playing.
- My decision is set up as allowing 4 options: rock, paper, scissors, or "don't play."

This seems like a nice relatable example to me---it's not uncommon for someone to offer to bet on a rock paper scissors game, or to offer slightly favorable odds, and it's not uncommon for them to have a slight edge.

Are there features of the boxes case that don't apply in this case, or is it basically equivalent?

>If I win I get $6. If I lose, I get $5.

I assume you meant to write: "If I lose, I *lose* $5."

Yes, these are basically equivalent. (I even mention rock-paper-scissors bots in a footnote.)
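The arithmetic behind the corrected bet can be checked with a minimal sketch. It assumes that a tie pays $0, which the example does not state; the names `rps_expected_value`, `p_win`, and `p_lose` are illustrative only.

```python
from fractions import Fraction

def rps_expected_value(p_win, p_lose, win_amount, loss_amount, tie_amount=0):
    """Expected payoff of playing once; ties take up the remaining probability."""
    p_tie = 1 - p_win - p_lose
    # Assumption: a tie pays tie_amount (default $0), which the thread leaves unstated.
    return p_win * win_amount - p_lose * loss_amount + p_tie * tie_amount

# Win $6 with probability 0.3, lose $5 with probability 0.4.
ev = rps_expected_value(Fraction(3, 10), Fraction(4, 10), 6, 5)
print(float(ev))  # -0.2
```

Under these assumptions, "don't play" (worth $0) beats every option that involves playing, as long as the opponent's edge does not depend on which move is picked.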

I've skimmed over the beginning of your paper, and I think there might be several problems with it.

- I don't see where it is explicitly stated, but I think information "seller's prediction is accurate with probability 0,75" is supposed to be common knowledge. Is it even possible for a non-trivial probabilistic prediction to be a common knowledge? Like, not as in some real-life situation, but as in this condition not being logical contradiction? I am not a specialist on this subject, but it looks like a logical contradiction. And you can prove absolutely anything if your premise contains contradiction.
- A minor nitpick compared to the previous one, but you don't specify what you mean by "prediction is accurate with probability 0.75". What kinds of mistakes does seller make? For example, if buyer is going to buy B1, then with probability 0.75 the prediction will be "B1". What about the 0.25? Will it be 0.125 for "none" and 0.125 for "B2"? Will it be 0.25 for "none" and 0 for "B2"? (And does buyer know about that? What about seller knowing about buyer knowing...)

When you write "$1−P (money in Bi | buyer chooses Bi ) · $3 = $1 − 0.25 · $3 = $0.25.", you assume that P(money in Bi | buyer chooses Bi )=0.75. That is, if buyer chooses the first box, seller can't possibly think that buyer will choose none of the boxes. And the same for the case of buyer choosing the second box. You can easily fix it by writing "$1−P (money in Bi | buyer chooses Bi ) · $3 >= $1 − 0.25 · $3 = $0.25" instead. It is possible that you make some other implicit assumptions about mistakes that seller can make, so you might want to check it.

>I think information "seller's prediction is accurate with probability 0,75" is supposed to be common knowledge.

Yes, correct!

>Is it even possible for a non-trivial probabilistic prediction to be a common knowledge? Like, not as in some real-life situation, but as in this condition not being logical contradiction? I am not a specialist on this subject, but it looks like a logical contradiction. And you can prove absolutely anything if your premise contains contradiction.

Why would it be a logical contradiction? Do you think Newcomb's problem also requires a logical contradiction? Note that in neither of these cases does the predictor tell the agent the *result* of a prediction about the agent.

>What kinds of mistakes does seller make?

For the purpose of the paper it doesn't really matter what beliefs anyone has about how the errors are distributed. But you could imagine that the buyer is some piece of computer code and that the seller has an identical copy of that code. To make a prediction, the seller runs the code. Then she flips a coin twice. If the coin does *not* come up Tails twice, she just uses that prediction and fills the boxes accordingly. If the coin *does* come up Tails twice, she uses a third coin flip to determine whether to (falsely) predict one of the two other options that the agent can choose from. And then you get the 0.75, 0.125, 0.125 distribution you describe. And you could assume that this is common knowledge.
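The coin-flip procedure can be enumerated exactly. This is only a sketch: it assumes all three coins are fair and that the third flip chooses uniformly between the two options the buyer did not take. `prediction_distribution` and the option labels are illustrative names.

```python
from fractions import Fraction
from itertools import product

OPTIONS = ("B1", "B2", "none")

def prediction_distribution(actual):
    """Exact distribution of the seller's prediction, given the buyer's actual choice."""
    wrong = [o for o in OPTIONS if o != actual]
    dist = {o: Fraction(0) for o in OPTIONS}
    # Enumerate all outcomes of three fair coin flips (probability 1/8 each).
    for first, second, third in product("HT", repeat=3):
        if (first, second) == ("T", "T"):
            # Two Tails: the third flip picks one of the two false predictions.
            predicted = wrong[0] if third == "H" else wrong[1]
        else:
            # Otherwise the seller reports what the copy of the buyer's code chose.
            predicted = actual
        dist[predicted] += Fraction(1, 8)
    return dist

dist = prediction_distribution("B1")
print(dist)
```

For a buyer who takes B1, this gives probability 3/4 for the correct prediction and 1/8 for each of the two wrong ones, matching the 0.75, 0.125, 0.125 split.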

Of course, for the exact CDT expected utilities, it does matter how the errors are distributed. If the errors are primarily "None" predictions, then the boxes should be expected to contain more money and the CDT expected utilities of buying will be higher. But for the exploitation scheme, it's enough to show that the CDT expected utilities of buying are strictly positive.

>When you write "$1−P (money in Bi | buyer chooses Bi ) · $3 = $1 − 0.25 · $3 = $0.25.", you assume that P(money in Bi | buyer chooses Bi )=0.75.

I assume you mean that I assume P(money in Bi | buyer chooses Bi )=0.25? Yes, I assume this, although really I assume that the seller's prediction is accurate with probability 0.75 and that she fills the boxes according to the specified procedure. From this, it then follows that P(money in Bi | buyer chooses Bi )=0.25.

>That is, if buyer chooses the first box, seller can't possibly think that buyer will choose none of the boxes.

I don't assume this / I don't see how this would follow from anything I assume. Remember that if the seller predicts the buyer to choose no box, both boxes will be filled. So even if *all* false predictions would be "None" predictions (when the buyer buys a box), then it would still be P(money in Bi | buyer chooses Bi )=0.25.

>I assume you mean that I assume P(money in Bi | buyer chooses Bi )=0.25? Yes, I assume this, although really I assume that the seller's prediction is accurate with probability 0.75 and that she fills the boxes according to the specified procedure. From this, it then follows that P(money in Bi | buyer chooses Bi )=0.25.

Yes, you are right. Sorry.

>Why would it be a logical contradiction? Do you think Newcomb's problem also requires a logical contradiction?

Okay, it probably isn't a contradiction, because the situation "Buyer writes his decision and it is common knowledge that an hour later Seller sneaks a peek into this decision (with probability 0.75) or into a random false decision (0.25). After that Seller places money according to the decision he saw." seems similar enough and can probably be formalized into a model of this situation.

You might wonder why am I spouting a bunch of wrong things in an unsuccessful attempt to attack your paper. I do that because it looks really suspicious to me for the following reasons:

- You don't use language developed by logicians to avoid mistakes and paradoxes in similar situations.
- Even for something written in more or less basic English, your paper doesn't seem to be rigorous enough for the kinds of problems it tries to tackle. For example, you don't specify what exactly is considered common knowledge, and that can probably be really important.
- Your result looks similar to something you would try to prove as a stepping stone to proving that this whole situation with boxes is impossible. "It follows that in this situation two perfectly rational agents with the same information would make different deterministic decisions. Thus we arrive at a contradiction and this situation is impossible." In your paper agents are rational in different ways (I think), but it still looks similar enough for me to become suspicious.

So, while my previous attempts at finding error in your paper failed pathetically, I'm still suspicious, so I'll give it another shot.

When you argue that Buyer should buy one of the boxes, you assume that Buyer knows the probabilities that Seller assigned to Buyer's actions. Are those probabilities also a part of common knowledge? How is that possible? If you try to do the same in Newcomb's problem, you will get something like "Omniscient predictor predicts that player will pick the box A (with probability 1); player knows about that; player is free to pick between A and both boxes", which seems to be a paradox.

Sorry for taking some time to reply!

>You might wonder why am I spouting a bunch of wrong things in an unsuccessful attempt to attack your paper.

Nah, I'm a frequent spouter of wrong things myself, so I'm not too surprised when other people make errors, especially when the stakes are low, etc.

Re 1,2: I guess a lot of this comes down to convention. People have found that one can productively discuss these things without always giving the formal models (in part because people in the field know how to translate everything into formal models). That said, if you want mathematical models of CDT and Newcomb-like decision problems, you can check the Savage or Jeffrey-Bolker formalizations. See, for example, the first few chapters of Arif Ahmed's book, "Evidence, Decision and Causality". Similarly, people in decision theory (and game theory) usually don't specify what is common knowledge, because usually it is assumed (implicitly) that the entire problem description is common knowledge / known to the agent (Buyer). (Since this is decision and not game theory, it's not quite clear what "common knowledge" means. But presumably to achieve 75% accuracy on the prediction, the seller needs to know that the buyer understands the problem...)

3: Yeah, *there exist* agent models under which everything becomes inconsistent, though IMO this just shows these agent models to be unimplementable. For example, take the problem description from my previous reply (where Seller just runs an exact copy of Buyer's source code). Now assume that Buyer knows his source code and is logically omniscient. Then Buyer knows what his source code chooses and therefore knows the option that Seller is 75% likely to predict. So he will take the other option. But of course, this is a contradiction. As you'll know, this is a pretty typical logical paradox of self-reference. But to me it just says that this logical omniscience assumption about the buyer is implausible and that we should consider agents who aren't logically omniscient. Fortunately, CDT doesn't assume knowledge of its own source code and such.

Perhaps one thing to help sell the plausibility of this working: For the purpose of the paper, the assumption that Buyer uses CDT in this scenario is pretty weak, formally simple and doesn't have much to do with logic. It just says that the Buyer assigns some probability distribution over box states (i.e., some distribution over the mutually exclusive and collectively exhaustive s1="money only in box 1", s2= "money only in box 2", s3="money in both boxes"); and that given such distribution, Buyer takes an action that maximizes (causal) expected utility. So you could forget agents for a second and just prove the formal claim that for all probability distributions over three states s1, s2, s3, it is for i=1 or i=2 (or both) the case that

(P(si)+P(s3))*$3 - $1 > 0.

I assume you don't find this strange/risky in terms of contradictions, but mathematically speaking, nothing more is really going on in the basic scenario.
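For completeness, here is a short derivation of that claim. It uses only the fact that the three state probabilities sum to 1, and it yields the slightly stronger bound of $0.5 rather than merely a positive value.

```latex
\begin{align*}
\sum_{i=1}^{2} \bigl(P(s_i) + P(s_3)\bigr) &= P(s_1) + P(s_2) + 2P(s_3) = 1 + P(s_3) \ge 1 \\
\Rightarrow \quad \max_{i \in \{1,2\}} \bigl(P(s_i) + P(s_3)\bigr) &\ge \tfrac{1}{2} \\
\Rightarrow \quad \max_{i \in \{1,2\}} \Bigl[\bigl(P(s_i) + P(s_3)\bigr) \cdot \$3 - \$1\Bigr] &\ge \$1.5 - \$1 = \$0.5 > 0
\end{align*}
```

The middle step holds because two numbers whose sum is at least 1 cannot both be below 1/2.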

The idea is that everyone agrees (hopefully) that orthodox CDT satisfies the assumption. (I.e., assigns some unconditional distribution, etc.) Of course, many CDTers would claim that CDT satisfies some *additional* assumptions, such as the probabilities being calibrated or "correct" in some other sense. But of course, if "A=>B", then "A and C => B". So adding assumptions cannot help the CDTer avoid the loss of money conclusion if they also accept the more basic assumptions. Of course, *some* added assumptions lead to contradictions. But that just means that they cannot be satisfied in the circumstances of this scenario if the more basic assumption is satisfied and if the premises of the Adversarial Offer hold. So they would have to either adopt some non-orthodox CDT that doesn't satisfy the basic assumption or require that their agents cannot be copied/predicted. (Both of which I also discuss in the paper.)

>you assume that Buyer knows the probabilities that Seller assigned to Buyer's actions.

No, if this were the case, then I think you would indeed get contradictions, as you outline. So Buyer does *not* know what Seller's prediction is. (He only knows her prediction is 75% accurate.) If Buyer uses CDT, then of course he assigns some (unconditional) probabilities to what the predictions are, but of course the problem description implies that these predictions aren't particularly good. (Like: if he assigns 90% to the money in box 1, then it immediately follows that *no* money is in box 1.)

How does this allow extracting money from CDTheorists?

Simple analysis:

If 0.75 is correct, and the prediction was of the form

a) Box A will not be chosen - 3 dollars in there

OR

b) Box B will not be chosen - 3 dollars in there

Then the CDTheorist reasons:

(1-0.75) = .25

.25*3 = .75

.75 - 1 = -.25

'Therefore I should not buy a box - I expect to lose (expected) money by doing so.'

Complex analysis:

The seller generates a probability distribution over both boxes. (For simplicity's sake, the buyer is chosen in advance, and given the chance to play.) If it is predicted neither box will be chosen, then BOTH boxes have $3 in them.

In this scenario will a CDTheorist buy? Should they?

>Then the CDTheorist reasons:

>(1-0.75) = .25

>.25*3 = .75

>.75 - 1 = -.25

>'Therefore I should not buy a box - I expect to lose (expected) money by doing so.'

Well, that's not how CDT as it is typically specified reasons about this decision. The expected value 0.25*3=0.75 is the *EDT* expected amount of money in box Bi for both i=1 and i=2. That is, it is the expected content of box Bi, *conditional* on taking Bi. But when CDT assigns an expected utility to taking box Bi, it doesn't condition on taking Bi. Instead, because it cannot causally affect how much money is in box Bi, it uses its *unconditional* estimate of how much is in box Bi. As I outlined in the post, this must be at least $1.5 for at least one of the two boxes.
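To make the contrast concrete, here is a minimal sketch of the two calculations. It assumes the numbers used in this thread ($1 price, $3 prize, 0.75 accuracy) and represents the buyer's unconditional beliefs as probabilities over the three box states. The function names and the example beliefs are hypothetical and not taken from the paper.

```python
from fractions import Fraction

PRICE = Fraction(1)
PRIZE = Fraction(3)
ACCURACY = Fraction(3, 4)

def edt_value_of_buying():
    """EDT conditions on the purchase: the chosen box is full only if the seller erred."""
    return (1 - ACCURACY) * PRIZE - PRICE

def cdt_value_of_buying(box, beliefs):
    """CDT uses the unconditional probability that the given box contains money.

    beliefs maps the states "only_B1", "only_B2", "both" to probabilities summing to 1.
    """
    p_full = beliefs["only_" + box] + beliefs["both"]
    return p_full * PRIZE - PRICE

# Hypothetical beliefs: the buyer thinks the money is probably only in B2.
beliefs = {"only_B1": Fraction(1, 10), "only_B2": Fraction(7, 10), "both": Fraction(2, 10)}
edt = edt_value_of_buying()
cdt = {box: cdt_value_of_buying(box, beliefs) for box in ("B1", "B2")}
print(edt, cdt)
```

With these example beliefs, the EDT value of buying either box is negative, while the CDT value of buying B2 is positive, so the CDT buyer buys B2. Whatever beliefs are plugged in, at least one of the two CDT values comes out positive.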

I suspect that the near-universal rejection of naive-CDT ALSO means that they recognize this offer is outside the bounds of situations that CDT can handle. How does this CDT agent reconcile a belief that the seller's prediction likelihood is different from the buyer's success likelihood?

Note that while people on this forum mostly reject orthodox, two-boxing CDT, many academic philosophers favor CDT. I doubt that they would view this problem as out of CDT's scope, since it's pretty similar to Newcomb's problem. (ETA: We discuss these "out of scope" objections in Section IV.2 of the paper.)

>How does this CDT agent reconcile a belief that the seller's prediction likelihood is different from the buyer's success likelihood?

Good question!

>many academic philosophers favor CDT.

Excellent - we should ask THEM about it.

Please provide a few links to recent (say, 10 years - not textbooks written long ago) papers or even blog posts that defend CDT and/or advocate 2-boxing in this (or other Newcomb-like) scenarios.

>Excellent - we should ask THEM about it.

Yes, that's the plan.

Some papers that express support for CDT:

- https://philarchive.org/archive/ARMCDT-2
- https://link.springer.com/article/10.1007/s11229-011-0022-6 (In general, James Joyce is a well-known defender of CDT.)
- https://link.springer.com/article/10.1007/s11098-018-1206-4
- https://onlinelibrary.wiley.com/doi/full/10.1111/phpr.12466 (argues for two-boxing, but against CDT)
- In his book *Causality*, Judea Pearl also argues in favor of CDT (though he doesn't explicitly discuss Newcomb's problem).

In case you just want to know why I believe support for CDT/two-boxing to be wide-spread among academic philosophers, see https://philpapers.org/archive/BOUWDP.pdf , which is a survey of academic philosophers, where more people preferred two-boxing than one-boxing in Newcomb's problem, especially among philosophers with relevant expertise. Some philosophers have filled out this survey publicly, so you can e.g. go to https://philpapers.org/surveys/public_respondents.html?set=Target+faculty , click on a name and then on "My Philosophical Views" to find individuals who endorse two-boxing. (I think there's also a way to download the raw data and thereby get a list of two-boxers.)

My paper with my Ph.D. advisor Vince Conitzer titled "Extracting Money from Causal Decision Theorists" has been formally published (Open Access) in The Philosophical Quarterly. Probably many of you have seen either earlier drafts of this paper or similar arguments that others have independently given on this forum (e.g., Stuart Armstrong posted about an almost identical scenario; Abram Demski's post on Dutch-Booking CDT also has some similar ideas) and elsewhere (e.g., Spencer (forthcoming) and Ahmed (unpublished) both make arguments that resemble some points from our paper).

Our paper focuses on the following simple scenario which can be used to, you guessed it, extract money from causal decision theorists:

At least one of the two boxes contains money. Therefore, the average box contains at least $1.5 in (unconditional) expectation. In particular, at least one of the two boxes must contain at least $1.5 in expectation. Since CDT doesn't condition on choosing box Bi when assigning an expected utility to choosing box Bi, the CDT expected utility of at least one of the two boxes is at least $1.5. Thus, CDT agents buy one of the boxes, to the seller's delight.
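The seller's side of the trade can be sketched in the same spirit. The block assumes the setup as it is described in the comment thread ($1 per box, $3 placed in each box the seller predicted would not be bought, 0.75 prediction accuracy) and lets the remaining 0.25 of error mass be split arbitrarily between the two wrong predictions. `seller_expected_profit` and `p_error_none` are illustrative names.

```python
from fractions import Fraction

PRICE = Fraction(1)
PRIZE = Fraction(3)
ACCURACY = Fraction(3, 4)

def seller_expected_profit(p_error_none):
    """Expected profit per sale when the buyer buys exactly one box.

    p_error_none is the share of the seller's errors that are "none" predictions;
    the remaining errors predict the other box. Either kind of error leaves $3 in
    the purchased box, so the split does not change the result.
    """
    p_error = 1 - ACCURACY
    p_pred_none = p_error * p_error_none
    p_pred_other = p_error * (1 - p_error_none)
    expected_payout = (p_pred_none + p_pred_other) * PRIZE
    return PRICE - expected_payout

# Try several ways of splitting the error mass between the two wrong predictions.
profits = [seller_expected_profit(Fraction(k, 4)) for k in range(5)]
print(profits)
```

Under these assumptions the seller earns $0.25 in expectation on every sale, however the errors are distributed.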

Most people on this forum are probably already convinced that (orthodox, two-boxing) CDT should be rejected. But I think the Adversarial Offer is one of the more convincing "counterexamples" to CDT. So perhaps the scenario is worth posing to your pro-CDT friends, and the paper worth sending to your pro-academic peer review, pro-CDT friends. (Relating their responses to me would be greatly appreciated – I am quite curious what different causal decision theorists think about this scenario.)