The Counterfactual Prisoner's Dilemma

by Chris_Leong, 21st Dec 2019

Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

Updateless decision theory asks us to make decisions by imagining what we would have pre-committed to ahead of time. There's only one problem: we didn't commit to it ahead of time. So why do we care about what would have happened if we had?

This isn't a problem for the standard Newcomb's problem. Even if we haven't formally pre-committed to an action, such as by setting up consequences for failure, we are effectively pre-committed to whatever action we end up taking. After all, the universe is deterministic, so from the start of time there was only one possible action we could have taken. So we can one-box and know we'll get the million if the predictor is perfect.

However, there are other problems where the benefit accrues to a counterfactual self instead of to us directly, such as Counterfactual Mugging. This is discussed in Abram Demski's post on all-upside and mixed-upside updatelessness. It's the latter type that is troublesome.

I posted a question about this a few days ago:

If you are being asked for $100, you know that the coin came up tails and you won't receive the $10000. Sure, this means that if the coin would have been heads then you wouldn't have gained the $10000, but you know the coin wasn't heads so you don't lose anything. It's important to emphasise: this doesn't deny that if the coin had come up heads that this would have made you miss out on $10000. Instead, it claims that this point is irrelevant, so merely repeating the point again isn't a valid counter-argument.

A solution

In that post I cover many of the arguments for paying the counterfactual mugger and argue that they don't solve it. However, after posting, both Cousin_it and I independently discovered a thought experiment that is very persuasive (in favour of paying). The setup is as follows:

Omega, a perfect predictor, flips a coin. If it comes up heads, Omega asks you for $100, then pays you $10,000 if it predicts you would have paid if it had come up tails and you were told it was tails. If it comes up tails, Omega asks you for $100, then pays you $10,000 if it predicts you would have paid if it had come up heads and you were told it was heads.

An updateless agent will get $9900 regardless of which way the coin comes up, while an updateful agent will get nothing. Note that even though you are playing against yourself, it is a counterfactual version of you that sees a different observation, so its action isn't logically tied to yours. Like a normal prisoner's dilemma, it would be possible for heads-you to co-operate and tails-you to defect. So unlike playing prisoner's dilemma against a clone where you have a selfish reason to co-operate, if counterfactual-you decides to be selfish, there is no way to persuade it to co-operate, that is, unless you consider policies as a whole and not individual actions. The lesson I take from this is that policies are what we should be evaluating, not individual actions.
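To make the payoffs concrete, here is a minimal Python sketch that enumerates all four policies in the setup above. The names `payoff` and `payoff_table` are my own illustrative choices, and the sketch assumes Omega's prediction is always exactly what the policy says for the other observation (a perfect predictor).

```python
from itertools import product

COST = 100      # amount Omega asks for
PRIZE = 10_000  # amount Omega pays out


def payoff(policy, coin):
    """Payoff to the agent who actually observes `coin`.

    `policy` maps each observation ("heads" or "tails") to True (pay)
    or False (refuse). Assumption: Omega predicts perfectly, so the
    reward depends only on what the policy says for the OTHER observation.
    """
    other = "tails" if coin == "heads" else "heads"
    paid = policy[coin]        # what this version of you does
    rewarded = policy[other]   # what counterfactual-you would have done
    return (-COST if paid else 0) + (PRIZE if rewarded else 0)


def payoff_table():
    """Map (pay_on_heads, pay_on_tails) -> (payoff on heads, payoff on tails)."""
    table = {}
    for pay_heads, pay_tails in product([True, False], repeat=2):
        policy = {"heads": pay_heads, "tails": pay_tails}
        table[(pay_heads, pay_tails)] = (
            payoff(policy, "heads"),
            payoff(policy, "tails"),
        )
    return table


if __name__ == "__main__":
    for (ph, pt), (h, t) in payoff_table().items():
        print(f"pay on heads={ph!s:5} pay on tails={pt!s:5} "
              f"-> heads: {h:6d}, tails: {t:6d}")
```

The table has the shape of a prisoner's dilemma between heads-you and tails-you: whatever the other branch does, refusing leaves you $100 better off in your own branch, yet the always-pay policy gives $9900 on both branches while the never-pay policy gives nothing.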

Are there any alternatives?

I find it hard to imagine an intermediate position that saves the idea of individual actions being the locus of evaluation. For example, I'd be dubious about claims that the locus of evaluation should still be individual decisions, except when we have situations like the prisoner's dilemma. I won't pretend to have a solid argument, but that would just seem to be an unprincipled fudge; like let's just call the gaping hole an exception so we don't have to deal with it; like let's just glue two different kinds of objects together which really aren't alike at all.

What does this mean?

This greatly undermines the updateful view that you only care about your current counterfactual. Further, the shift to evaluating policies suggests an updateless perspective. For example, it doesn't seem to make sense to decide what you should have done if the coin had come up heads after you see it come up tails. If you've made your decision based on the coin, it's too late for your decision to affect the prediction. And once you've committed to the updateless perspective, the symmetry of the coin flip makes paying the mugger the natural choice, assuming you have a reasonable risk preference.
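As a rough illustration of the last point, here is a small sketch of the expected value of each policy in the original Counterfactual Mugging, evaluated before the coin is flipped. It assumes a fair coin, a perfect predictor, and a risk-neutral agent; `expected_value` and the constant names are hypothetical, chosen just for this example.

```python
from fractions import Fraction

COST = 100                # what you hand over on the branch where you are asked
PRIZE = 10_000            # what Omega pays on the other branch
P_REWARD_BRANCH = Fraction(1, 2)  # assumption: fair coin


def expected_value(pays_when_asked):
    """Expected payoff of a policy, evaluated before the coin flip.

    On the reward branch, Omega pays PRIZE only if it predicts you would
    pay when asked. On the asking branch, you lose COST only if you pay.
    """
    reward_branch = PRIZE if pays_when_asked else 0
    asking_branch = -COST if pays_when_asked else 0
    return (P_REWARD_BRANCH * reward_branch
            + (1 - P_REWARD_BRANCH) * asking_branch)


if __name__ == "__main__":
    print("always pay:", expected_value(True))
    print("never pay: ", expected_value(False))
```

Under these assumptions the paying policy comes out at $4950 in expectation against $0 for refusing. A sufficiently risk-averse agent could weigh the two branches differently, which is why the risk preference caveat matters.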
