# The Counterfactual Prisoner's Dilemma

4 min read · 21st Dec 2019 · 15 comments


Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

Updateless decision theory asks us to make decisions by imagining what we would have pre-committed to ahead of time. There's only one problem: we didn't commit to it ahead of time. So why do we care about what would have happened if we had?

This isn't a problem for the standard Newcomb's problems. Even if we haven't formally pre-committed to an action, such as by setting up consequences for failure, we are effectively pre-committed to whatever action we end up taking. After all, the universe is deterministic, so from the start of time there was only one possible action we could have taken. So we can one-box and know we'll get the million if the predictor is perfect.

However, there are other problems where the benefit accrues to a counterfactual self instead of to us directly, such as Counterfactual Mugging. This is discussed in Abram Demski's post on all-upside and mixed-upside updatelessness. It's the latter type that is troublesome.

I posted a question about this a few days ago:

> If you are being asked for $100, you know that the coin came up tails and you won't receive the $10000. Sure, this means that if the coin had come up heads then you wouldn't have gained the $10000, but you know the coin wasn't heads, so you don't lose anything. It's important to emphasise: this doesn't deny that if the coin had come up heads, this would have made you miss out on $10000. Instead, it claims that this point is irrelevant, so merely repeating the point again isn't a valid counter-argument.
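
To make the two perspectives in that question concrete, here is a minimal sketch. It assumes the usual statement of Counterfactual Mugging: a fair coin, where one side gets you asked for $100 and the other side pays $10000 only if Omega predicts you would have paid when asked. The names `branch_payoffs` and `expected_value_before_flip` are illustrative and do not come from the original question.

```python
COST = 100            # amount Omega asks for on the losing side
REWARD = 10_000       # amount paid on the other side, if you are a payer
P_REWARD_SIDE = 0.5   # assumption: the coin is fair


def branch_payoffs(pays_when_asked):
    """Payoff in each branch for an agent with a fixed disposition.

    Omega is assumed to predict perfectly, so the reward branch pays out
    exactly when the agent is the kind of agent that pays when asked.
    """
    asked = -COST if pays_when_asked else 0
    rewarded = REWARD if pays_when_asked else 0
    return {"asked": asked, "rewarded": rewarded}


def expected_value_before_flip(pays_when_asked):
    """Value of the disposition as judged before the coin is flipped."""
    payoffs = branch_payoffs(pays_when_asked)
    return (P_REWARD_SIDE * payoffs["rewarded"]
            + (1 - P_REWARD_SIDE) * payoffs["asked"])


if __name__ == "__main__":
    for pays in (True, False):
        print(pays, branch_payoffs(pays), expected_value_before_flip(pays))
```

Judged before the flip, the paying disposition comes out ahead. Judged from inside the branch where you have already been asked, paying simply costs $100. The question above is about which of those two evaluations should govern the decision.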

A solution

In that post I cover many of the arguments for paying the counterfactual mugger and argue that they don't solve it. However, after posting, both Cousin_it and I independently discovered a thought experiment that is very persuasive (in favour of paying). The setup is as follows:

> Omega, a perfect predictor, flips a coin and tells you how it came up. If it comes up heads, Omega asks you for $100, then pays you $10,000 if it predicts you would have paid if it had come up tails. If it comes up tails, Omega asks you for $100, then pays you $10,000 if it predicts you would have paid if it had come up heads. In this case it was heads.
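
The setup can be sketched as a small payoff table over policies. This is only an illustration under stated assumptions: a policy is represented as a mapping from the observed coin result to pay or refuse, Omega's prediction is taken to be exactly the policy's entry for the other branch, and the names `payoff` and `payoff_table` are hypothetical.

```python
from itertools import product

COST = 100        # what Omega asks for, whichever way the coin lands
REWARD = 10_000   # paid if Omega predicts you would pay in the other branch


def payoff(policy, observed):
    """Payoff in the branch where `observed` came up.

    `policy` maps each coin result to True (pay) or False (refuse).
    Omega is assumed to be a perfect predictor, so its prediction for the
    other branch is simply the policy's entry for that branch.
    """
    other = "tails" if observed == "heads" else "heads"
    total = 0
    if policy[observed]:
        total -= COST
    if policy[other]:
        total += REWARD
    return total


def payoff_table():
    """Payoffs for all four policies, keyed by (pays_on_heads, pays_on_tails)."""
    table = {}
    for pay_heads, pay_tails in product([True, False], repeat=2):
        policy = {"heads": pay_heads, "tails": pay_tails}
        table[(pay_heads, pay_tails)] = {
            side: payoff(policy, side) for side in ("heads", "tails")
        }
    return table


if __name__ == "__main__":
    for (pay_heads, pay_tails), result in payoff_table().items():
        print(f"pay on heads={pay_heads!s:5} pay on tails={pay_tails!s:5} -> {result}")
```

In this table, holding the other branch's action fixed, refusing in your own branch is always $100 better locally, while your payment only ever benefits the other branch. That is the prisoner's dilemma structure, and it only dissolves when the policy is evaluated as a whole.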

An updateless agent will get $9900 regardless of which way the coin comes up, while an updateful agent will get nothing. Note that even though you are playing against yourself, it is a counterfactual version of you that sees a different observation, so its action isn't logically tied to yours. Like a normal prisoner's dilemma, it would be possible for heads-you to co-operate and tails-you to defect. So unlike playing prisoner's dilemma against a clone where you have a selfish reason to co-operate, if counterfactual-you decides to be selfish, there is no way to persuade it to co-operate, that is, unless you consider policies as a whole and not individual actions. The lesson I take from this is that policies are what we should be evaluating, not individual actions.

Are there any alternatives?

I find it hard to imagine an intermediate position that saves the idea of individual actions being the locus of evaluation. For example, I'd be dubious about claims that the locus of evaluation should still be individual decisions, except when we have situations like the prisoner's dilemma. I won't pretend to have a solid argument, but that would just seem to be an unprincipled fudge; like let's just call the gaping hole an exception so we don't have to deal with it; like let's just glue two different kinds of objects together which really aren't alike at all.

What does this mean?

This greatly undermines the updateful view that you only care about your current counterfactual. Further, the shift to evaluating policies suggests an updateless perspective. For example, it doesn't seem to make sense to decide what you should have done if the coin had come up heads after you see it come up tails. If you've made your decision based on the coin, it's too late for your decision to affect the prediction. And once you've committed to the updateless perspective, the symmetry of the coin flip makes paying the mugger the natural choice, assuming you have a reasonable risk preference.

Notes:

1) Unfortunately, the rest of this post seems to have been accidentally deleted, and as far as I can tell the history isn't saved. To be honest, I believe that the most important parts of this post are still present. If you want more information, you can also see this presentation.

2) Blackmail problems also seem to demonstrate the limitations of making decisions by picking the best option compatible with all of your knowledge about the world, as you want to be the kind of agent that wouldn't end up in such a position in the first place.

15 comments

I don't see why the Counterfactual Prisoner's Dilemma persuades you to pay in the Counterfactual Mugging case. In the counterfactual prisoner's dilemma, I pay because that action logically causes Omega to give me $10,000 in the real world (via influencing the counterfactual). This doesn't require shifting the locus of evaluation to policies, as long as we have a good theory of which actions are correlated with which other actions (e.g. paying in heads-world and paying in tails-world).

In the counterfactual mugging, by contrast, the whole point is that paying doesn't cause any positive effects in the real world. So it seems perfectly consistent to pay in the counterfactual prisoner's dilemma, but not in the counterfactual mugging.

You're correct that paying in Counterfactual Prisoner's Dilemma doesn't necessarily commit you to paying in Counterfactual Mugging.

However, it does appear to provide a counter-example to the claim that we ought to adopt the principle of making decisions by only considering the branches of reality that are consistent with our knowledge, as this would result in us refusing to pay in Counterfactual Prisoner's Dilemma regardless of the coin flip result.

(Interestingly enough, blackmail problems also seem to demonstrate that this principle is flawed.)

This seems to suggest that we need to consider policies rather than completely separate decisions for each possible branch of reality. And while, as I already noted, this doesn't get us all the way, it does make the argument for paying much more compelling by defeating the strongest objection.

> by only considering the branches of reality that are consistent with our knowledge

I know that, in the branch of reality which actually happened, Omega predicted my counterfactual behaviour. I know that my current behaviour is heavily correlated with my counterfactual behaviour. So I know that I can logically cause Omega to give me $10,000. This seems exactly equivalent to Newcomb's problem, where I can also logically cause Omega to give me a lot of money.

So if by "considering [other branches of reality]" you mean "taking predicted counterfactuals into account when reasoning about logical causation", then Counterfactual Prisoner's Dilemma doesn't give us anything new.

If by "considering [other branches of reality]" you instead mean "acting to benefit my counterfactual self", then I deny that this is what is happening in CPD. You're acting to benefit your current self, via logical causation, just like in the Twin Prisoner's Dilemma. You don't need to care about your counterfactual self at all. So it's disanalogous to Counterfactual Mugging, where the only reason to pay is to help your counterfactual self.

Hmm... that's a fascinating argument. I've been having trouble figuring out how to respond to you, so I'm thinking that I need to make my argument more precise and then perhaps that'll help us understand the situation.

Let's start from the objection I've heard against Counterfactual Mugging. Someone might say, well I understand that if I don't pay, then it means I would have lost out if it had come up heads, but since I know it didn't come up heads, I don't care. Making this more precise, when constructing counterfactuals for a decision, if we know fact F about the world before we've made our decision, F must be true in every counterfactual we construct (call this Principle F).

Now let's consider Counterfactual Prisoner's Dilemma. If the coin comes up HEADS, then principle F tells us that the counterfactuals need to have the COIN coming up HEADS as well. However, it doesn't tell us how to handle the impact of the agent's policy if they had seen TAILS. I think we should construct counterfactuals where the agent's TAILS policy is independent of its HEADS policy, whilst you think we should construct counterfactuals where they are linked.

You justify your construction by noting that the agent can figure out that it will make the same decision in both the HEADS and TAILS case. In contrast, my tendency is to exclude information about our decision making procedures. So, if you knew you were a utility maximiser this would typically exclude all but one counterfactual and prevent us saying choice A is better than choice B. Similarly, my tendency here is to suggest that we should be erasing the agent's self-knowledge of how it decides so that we can imagine the possibility of the agent choosing PAY/NOT PAY or NOT PAY/PAY.

But I still feel somewhat confused about this situation.

> Someone might say, well I understand that if I don't pay, then it means I would have lost out if it had come up heads, but since I know it didn't come up heads, I don't care. Making this more precise, when constructing counterfactuals for a decision, if we know fact F about the world before we've made our decision, F must be true in every counterfactual we construct (call this Principle F).

The problem is that principle F elides over the difference between facts which are logically caused by your decision, and facts which aren't. For example, in Parfit's hitchhiker, my decision not to pay after being picked up logically causes me not to be picked up. The result of that decision would be a counterpossible world: a world in which the same decision algorithm outputs one thing at one point, and a different thing at another point. But in counterfactual mugging, if you choose not to pay, then this doesn't result in a counterpossible world.

> I think we should construct counterfactuals where the agent's TAILS policy is independent of its HEADS policy, whilst you think we should construct counterfactuals where they are linked.

The whole point of functional decision theory is that it's very unlikely for these two policies to differ. For example, consider the Twin Prisoner's Dilemma, but where the walls of one room are green, and the walls of the other are blue. This shouldn't make any difference to the outcome: we should still expect both agents to cooperate, or both agents to defect. But the same is true for heads vs tails in Counterfactual Prisoner's Dilemma - they're specific details which distinguish you from your counterfactual self, but don't actually influence any decisions.

"The problem is that principle F elides" - Yeah, I was noting that principle F doesn't actually get us there and I'd have to assume a principle of independence as well. I'm still trying to think that through.

> So [why] do we care about what would have happened if we had?

This post demonstrates that ignoring counterfactuals can cause you to do worse even if you only care about your particular branch. This doesn't take you all the way to expected utility over branches, but I can't see any obvious intermediate positions.

I was pointing out a typo in the Original Post. That said, that's a great summary.

Perhaps an intermediate position could be created as follows: given a graph of 'the tree' (including the branch you're on), position E is expected utility over branches, and position B is that you only care about your particular branch.

Position B seems to care about the future tree (because it is ahead), but not the past tree. So it has a weight of 1 on the current node and its descendants, but a weight of 0 on past/averted nodes, while Position E has a weight of 1 on the "root node" (whatever that is). (Node weights are inherited, with the exception of the discontinuity in Position B.)

An intermediate position is placing some non-zero weight on 'past nodes', going back along the branch, and updating the inherited weights. Aside from a weight of 1/2 being placed along all in-branch nodes, another series could be used, for example: r, r^2, r^3, ... for 0<r<1. (This series might allow for adopting an 'intermediate position' even when the branch history is infinitely long.)

There are probably some technical details to work out, like making all the weights add up to 1, but for a convergent series that's probably just a matter of applying an appropriate scale factor for normalization. For r=1/2, the infinite sum is 1, so no additional scaling is required. However, this might not work (the sum across all nodes' rewards times their weight might diverge) on an infinite tree where the rewards grow too fast...

(This was an attempt at outlining an intermediate position, but it wasn't an argument for it.)

This depends on how Omega constructs his counterfactuals. Suppose the laws of physics make the coin land heads as part of a deterministic universe. The counterfactual where the coin lands tails must have some difference in starting conditions or physical laws, or non-physical behavior. Let's suppose blatantly nonphysical behavior, like a load of extra angular momentum appearing out of nowhere. You are watching the coin closely. If you see the coin behave nonphysically, then you know that you are in a counterfactual. If you know that Omega's counterfactuals are always so crudely constructed, then you would only pay in the counterfactual and get the full $10000.

If you can't tell whether or not you are in the counterfactual, then pay.

We can assume that the coin is flipped out of your sight.

Policy is better than opportunity in the legal field. If one implements a policy of "never steal", he wins against criminal law. If one steals only when there is no chance of being caught, that is, acts based on opportunity, he will eventually be caught.

Only if the criminal messes up their expected utility calculation.

> Omega, a perfect predictor, flips a coin. If it comes up heads, Omega asks you for $100, then pays you $10,000 if it predicts you would have paid if it had come up tails. If it comes up tails, Omega asks you for $100, then pays you $10,000 if it predicts you would have paid if it had come up heads.

Having a bit of trouble understanding the setup, maybe it can be framed in a way that avoids confusofactuals.

How about "Omega knows whether you would pay in the counterfactual mugging setup if told that you had lost and will reward you for paying if you lose, but you don't know that you would get rewarded once you pay up". Is there anything I have missed?

If my understanding is correct, then those who would pay gain either $10,000 or $9,900, and those who would not pay gain either $10,000 or nothing, depending on the coin flip. So, in this setup a payer's expected gain ($9,950) is higher than a non-payer's ($5,000).

Note that your formulation has a bunch of superfluous stipulations. Omega is a perfect predictor, so you may as well just get informed of the results and given $10,000, $9,900 or nothing. The only difference is emotional, not logical. For example: You are the kind of person who would pay $100 in the counterfactual mugging loss, and you did, sadly, lose, so here is your $9,900 reward for being such a good boy. Have a good day!

"How about "Omega knows whether you would pay in the counterfactual mugging setup if told that you had lost and will reward you for paying if you lose, but you don't know that you would get rewarded once you pay up". Is there anything I have missed?" - you aren't told that you "lost" as there is no losing coin flip in this scenario since it is symmetric. You are told which way the coin came up. Anyway, I updated the post to clarify this