The Counterfactual Prisoner's Dilemma

by Chris_Leong · 2 min read · 21st Dec 2019 · 9 comments


Counterfactual Mugging · Prisoner's Dilemma · Counterfactuals · Frontpage
Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

Updateless decision theory asks us to make decisions by imagining what we would have pre-committed to ahead of time. There's only one problem - we didn't commit to it ahead of time. So why do we care about what would have happened if we had?

This isn't a problem for standard Newcomb's problems. Even if we haven't formally pre-committed to an action, such as by setting up consequences for failure, we are effectively pre-committed to whatever action we end up taking. After all, the universe is deterministic, so from the start of time there was only one possible action we could have taken. So we can one-box and know we'll get the million if the predictor is perfect.

However, there are other problems, such as Counterfactual Mugging, where the benefit accrues to a counterfactual self instead of to us directly. This is discussed in Abram Demski's post on all-upside and mixed-upside updatelessness; it's the latter type that is troublesome.

I posted a question about this a few days ago:

If you are being asked for $100, you know that the coin came up heads and you won't receive the $10,000. Sure, this means that if the coin had come up tails then you wouldn't have gained the $10,000, but you know the coin didn't come up tails, so you don't lose anything. It's important to emphasise: this doesn't deny that, had the coin come up tails, refusing would have made you miss out on $10,000. Instead, it claims that this point is irrelevant, so merely repeating the point again isn't a valid counter-argument.

A solution

In that post I cover many of the arguments for paying the counterfactual mugger and argue that none of them settle the question. However, after posting, both Cousin_it and I independently discovered a thought experiment that is very persuasive (in favour of paying). The setup is as follows:

Omega, a perfect predictor, flips a coin and tells you how it came up. If it comes up heads, Omega asks you for $100, then pays you $10,000 if it predicts you would have paid if it had come up tails. If it comes up tails, Omega asks you for $100, then pays you $10,000 if it predicts you would have paid if it had come up heads. In this case it was heads.

An updateless agent will receive $9,900 regardless of which way the coin comes up, while an updateful agent will receive nothing. Note that even though you are playing against yourself, it is a counterfactual version of you that sees a different observation, so its action isn't logically tied to yours. As in a normal prisoner's dilemma, it would be possible for heads-you to co-operate while tails-you defects. So unlike playing the prisoner's dilemma against a clone, where you have a selfish reason to co-operate, if counterfactual-you decides to be selfish there is no way to persuade it to co-operate, unless, that is, you consider policies as a whole and not individual actions. The lesson I take from this is that policies, not individual actions, are what we should be evaluating.
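To make the payoff structure concrete, here is a minimal sketch in Python. The policy representation and function names are my own framing, and it assumes Omega's prediction simply mirrors what your policy would do in the unobserved branch:

```python
def payoff(policy, coin):
    """Return your winnings given a policy {'heads': bool, 'tails': bool}
    (whether you pay the $100 on each observation) and the actual coin flip."""
    other = 'tails' if coin == 'heads' else 'heads'
    total = 0
    if policy[coin]:    # you pay the $100 in the branch you actually see
        total -= 100
    if policy[other]:   # Omega predicts you would have paid in the other branch
        total += 10_000
    return total

always_pay = {'heads': True, 'tails': True}
never_pay = {'heads': False, 'tails': False}

for coin in ('heads', 'tails'):
    print(coin, payoff(always_pay, coin), payoff(never_pay, coin))

# always_pay yields $9,900 on either flip; never_pay yields $0 on either flip.
# An asymmetric policy (e.g. pay on heads only) yields -$100 or +$10,000
# depending on the flip, which is what gives the situation its prisoner's
# dilemma flavour.
```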

Are there any alternatives?

I find it hard to imagine an intermediate position that preserves individual actions as the locus of evaluation. For example, I'd be dubious of the claim that the locus of evaluation should still be individual decisions, except in situations like the prisoner's dilemma. I won't pretend to have a solid argument, but that seems like an unprincipled fudge: calling the gaping hole an exception so we don't have to deal with it, gluing together two kinds of objects that really aren't alike at all.

What does this mean?

This greatly undermines the updateful view, on which all that matters is the branch you actually find yourself in. Further, the shift to evaluating policies suggests an updateless perspective. For example, it doesn't seem to make sense to decide what you should have done if the coin had come up heads only after you see it come up tails: if you've made your decision based on the coin, it's too late for your decision to affect the prediction. And once you've committed to the updateless perspective, the symmetry of the coin flip makes paying the mugger the natural choice, assuming you have a reasonable risk preference.
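To spell out the arithmetic: evaluated before the flip, the policy of paying the original counterfactual mugger is worth 0.5 × (−$100) + 0.5 × $10,000 = $4,950 in expectation, while refusing is worth $0.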

(Unfortunately, the rest of this post seems to have been accidentally deleted and, as far as I can tell, the history isn't saved. To be honest, I believe that the most important parts of this post are still present. If you want more information, you can also see this presentation.)


9 comments

So [why] do we care about what would have happened if we had?

This post demonstrates that ignoring counterfactuals can cause you to do worse even if you only care about your particular branch. This doesn't take you all the way to expected utility over branches, but I can't see any obvious intermediate positions.

I was pointing out a typo in the Original Post. That said, that's a great summary.

 

Perhaps an intermediate position could be created as follows:

Given a graph of 'the tree' (including the branch you're on), position E is

expected utility over branches

position B is

you only care about your particular branch.

Position B seems to care about the future tree (because it is ahead), but not the past tree. So it has a weight of 1 on the current node and its descendants, but a weight of 0 on past/averted nodes, while Position E has a weight of 1 on the "root node" (whatever that is). (Node weights are inherited, with the exception of the discontinuity in Position B.)

An intermediate position is placing some non-zero weight on 'past nodes', going back along the branch, and updating the inherited weights. Besides placing a weight of 1/2 on all in-branch nodes, another series could be used, for example: r, r^2, r^3, ... for 0 < r < 1. (This series might allow for adopting an 'intermediate position' even when the branch history is infinitely long.)

There are probably some technical details to work out, like making all the weights add up to 1, but for a convergent series that's probably just a matter of applying an appropriate scale factor for normalization. For r = 1/2, the infinite sum is 1, so no additional scaling is required. However, this might not work (the sum across all nodes' rewards times their weights might diverge) on an infinite tree where the rewards grow too fast...
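As a small sketch of the normalisation step described here (the function name and representation are hypothetical, not from the comment):

```python
# A sketch of the geometric weighting scheme described above. A weight of
# r**k is placed on the node k steps back along the observed branch, then
# the weights are rescaled to sum to 1.

def branch_weights(r, depth):
    raw = [r ** k for k in range(1, depth + 1)]   # r, r^2, ..., r^depth
    total = sum(raw)
    return [w / total for w in raw]               # normalise to sum to 1

print(branch_weights(0.5, 4))
# [0.533..., 0.266..., 0.133..., 0.066...]
# With r = 1/2 the unnormalised weights already sum towards 1 as depth grows,
# so the scale factor tends to 1, matching the observation above.
```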

 

(This was an attempt at outlining an intermediate position, but it wasn't an argument for it.)

This depends on how Omega constructs his counterfactuals. Suppose the laws of physics make the coin land heads as part of a deterministic universe. The counterfactual where the coin lands tails must have some difference in starting conditions or physical laws, or non-physical behaviour. Let's suppose blatantly non-physical behaviour, like a load of extra angular momentum appearing out of nowhere. You are watching the coin closely. If you see the coin behave non-physically, then you know that you are in a counterfactual. If you know that Omega's counterfactuals are always so crudely constructed, then you would only pay in the counterfactual and get the full $10,000.

If you can't tell whether or not you are in the counterfactual, then pay.

We can assume that the coin is flipped out of your sight.

Policy is better than opportunity in the legal field. If one implements the policy "never steal", he wins against criminal law. If one steals only when there is no chance of being caught, that is, if he acts based on opportunity, he will eventually be caught.

Only if the criminal messes up their expected utility calculation.

Omega, a perfect predictor, flips a coin. If it comes up heads, Omega asks you for $100, then pays you $10,000 if it predicts you would have paid if it had come up tails. If it comes up tails, Omega asks you for $100, then pays you $10,000 if it predicts you would have paid if it had come up heads.

I'm having a bit of trouble understanding the setup; maybe it can be framed in a way that avoids confusofactuals.

How about "Omega knows whether you would pay in the counterfactual mugging setup if told that you had lost and will reward you for paying if you lose, but you don't know that you would get rewarded once you pay up". Is there anything I have missed?

If my understanding is correct, then those who would pay gain either $10,000 or $9,900, and those who would not pay gain either $10,000 or nothing, depending on the coin flip. So, in this setup a payer's expected gain ($9,950) is higher than a non-payer's ($5,000).

Note that your formulation has a bunch of superfluous stipulations. Omega is a perfect predictor, so you may as well just be informed of the result and handed $10,000, $9,900 or nothing. The only difference is emotional, not logical. For example:

You are the kind of person who would pay $100 in the counterfactual mugging loss, and you did, sadly, lose, so here is your $9,900 reward for being such a good boy. Have a good day!


"How about "Omega knows whether you would pay in the counterfactual mugging setup if told that you had lost and will reward you for paying if you lose, but you don't know that you would get rewarded once you pay up". Is there anything I have missed?" - you aren't told that you "lost" as there is no losing coin flip in this scenario since it is symmetric. You are told which way the coin came up. Anyway, I updated the post to clarify this