The Counterfactual Prisoner's Dilemma

Chris_Leong


The Counterfactual Prisoner's Dilemma — LessWrong


by Chris_Leong

21st Dec 2019

AI Alignment Forum


21 Ω 11

Updateless decision theory asks us to make decisions by imagining what we would have pre-committed to ahead of time. There's only one problem - we didn't commit to it ahead of time. So why do we care about what would have happened if we had?

This isn't a problem for the standard Newcomb's problem. Even if we haven't formally pre-committed to an action, such as by setting up consequences for failure, we are effectively pre-committed to whatever action we end up taking. After all, the universe is deterministic, so from the start of time there was only one possible action we could have taken. So we can one-box and know we'll get the million if the predictor is perfect.

However, there are other problems where the benefit accrues to a counterfactual self instead of to us directly, such as Counterfactual Mugging. This is discussed in Abram Demski's post on all-upside and mixed-upside updatelessness. It's the latter type that is troublesome.

I posted a question about this a few days ago:

If you are being asked for $100, you know that the coin came up tails and you won't receive the $10000. Sure, this means that if the coin had come up heads then you wouldn't have gained the $10000, but you know the coin wasn't heads, so you don't lose anything. It's important to emphasise: this doesn't deny that if the coin had come up heads, this would have made you miss out on $10000. Instead, it claims that this point is irrelevant, so merely repeating the point again isn't a valid counter-argument.

A solution

In that post I cover many of the arguments for paying the counterfactual mugger and argue that they don't solve it. However, after posting, both Cousin_it and I independently discovered a thought experiment that is very persuasive (in favour of paying). The setup is as follows:

Omega, a perfect predictor, flips a coin and tells you how it came up. If it comes up heads, Omega asks you for $100, then pays you $10,000 if it predicts you would have paid if it had come up tails. If it comes up tails, Omega asks you for $100, then pays you $10,000 if it predicts you would have paid if it had come up heads. In this case it was heads.

An updateless agent will get $9900 regardless of which way the coin comes up, while an updateful agent will get nothing. Note that even though you are playing against yourself, it is a counterfactual version of you that sees a different observation, so its action isn't logically tied to yours. Like a normal prisoner's dilemma, it would be possible for heads-you to co-operate and tails-you to defect. So unlike playing prisoner's dilemma against a clone where you have a selfish reason to co-operate, if counterfactual-you decides to be selfish, there is no way to persuade it to co-operate, that is, unless you consider policies as a whole and not individual actions. The lesson I take from this is that policies are what we should be evaluating, not individual actions.
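The payoffs described above can be tabulated with a minimal sketch. It assumes Omega's prediction of the other branch is always correct, and the names `PAYMENT`, `REWARD`, `payoff` and `payoff_table` are illustrative, not part of the original setup.

```python
from itertools import product

PAYMENT = 100     # amount Omega asks for in either branch
REWARD = 10_000   # amount Omega pays if the other branch's self would have paid


def payoff(policy, coin):
    """Net payoff to the agent who observes `coin` under `policy`.

    `policy` maps each observation ("heads" or "tails") to True (pay)
    or False (refuse). Omega is assumed to predict the other branch perfectly.
    """
    other = "tails" if coin == "heads" else "heads"
    cost = PAYMENT if policy[coin] else 0
    reward = REWARD if policy[other] else 0
    return reward - cost


def payoff_table():
    """Payoffs (if heads, if tails) for all four (pay-on-heads, pay-on-tails) policies."""
    table = {}
    for pay_heads, pay_tails in product([True, False], repeat=2):
        policy = {"heads": pay_heads, "tails": pay_tails}
        table[(pay_heads, pay_tails)] = (payoff(policy, "heads"), payoff(policy, "tails"))
    return table


if __name__ == "__main__":
    for (pay_heads, pay_tails), (if_heads, if_tails) in payoff_table().items():
        print(f"pay on heads={pay_heads!s:5} pay on tails={pay_tails!s:5} "
              f"-> heads: {if_heads:6d}, tails: {if_tails:6d}")
```

The always-pay policy yields $9900 in both branches and the never-pay policy yields nothing in both. The two mixed policies give one branch $10,000 and the other a loss of $100, which is the defect/co-operate structure of a prisoner's dilemma.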

Are there any alternatives?

I find it hard to imagine an intermediate position that saves the idea of individual actions being the locus of evaluation. For example, I'd be dubious about claims that the locus of evaluation should still be individual decisions, except when we have situations like the prisoner's dilemma. I won't pretend to have a solid argument, but that would just seem to be an unprincipled fudge; like let's just call the gaping hole an exception so we don't have to deal with it; like let's just glue two different kinds of objects together which really aren't alike at all.

What does this mean?

This greatly undermines the updateful view that you only care about your current counterfactual. Further, the shift to evaluating policies suggests an updateless perspective. For example, it doesn't seem to make sense to decide what you should have done if the coin had come up heads after you see it come up tails. If you've made your decision based on the coin, it's too late for your decision to affect the prediction. And once you've committed to the updateless perspective, the symmetry of the coin flip makes paying the mugger the natural choice, assuming you have a reasonable risk preference.

Notes:

1) Unfortunately, the rest of this post seems to have been accidentally deleted, and as far as I can tell the history isn't saved. To be honest, I believe that the most important parts of this post are still present. If you want more information, you can also see this presentation.

2) Blackmail problems also seem to demonstrate the limitations of making decisions by picking the best option compatible with all of your knowledge about the world, as you want to be the kind of agent that wouldn't end up in such a position in the first place.

Tags: Counterfactual Mugging, Counterfactuals, Prisoner's Dilemma


17 comments, sorted by top scoring


[-]Richard_Ngo5yΩ140

I don't see why the Counterfactual Prisoner's Dilemma persuades you to pay in the Counterfactual Mugging case. In the counterfactual prisoner's dilemma, I pay because that action logically causes Omega to give me $10,000 in the real world (via influencing the counterfactual). This doesn't require shifting the locus of evaluation to policies, as long as we have a good theory of which actions are correlated with which other actions (e.g. paying in heads-world and paying in tails-world).

In the counterfactual mugging, by contrast, the whole point is that paying doesn't cause any positive effects in the real world. So it seems perfectly consistent to pay in the counterfactual prisoner's dilemma, but not in the counterfactual mugging.

[-]Chris_Leong5yΩ120

You're correct that paying in Counterfactual Prisoner's Dilemma doesn't necessarily commit you to paying in Counterfactual Mugging.

However, it does appear to provide a counter-example to the claim that we ought to adopt the principle of making decisions by only considering the branches of reality that are consistent with our knowledge as this would result in us refusing to pay in Counterfactual Prisoner's Dilemma regardless of the coin flip result.

(Interestingly enough, blackmail problems seem to also demonstrate that this principle is flawed as well).

This seems to suggest that we need to consider policies rather than completely separate decisions for each possible branch of reality. And while, as I already noted, this doesn't get us all the way, it does make the argument for paying much more compelling by defeating the strongest objection.

[-]Richard_Ngo5yΩ120

by only considering the branches of reality that are consistent with our knowledge

I know that, in the branch of reality which actually happened, Omega predicted my counterfactual behaviour. I know that my current behaviour is heavily correlated with my counterfactual behaviour. So I know that I can logically cause Omega to give me $10,000. This seems exactly equivalent to Newcomb's problem, where I can also logically cause Omega to give me a lot of money.

So if by "considering [other branches of reality]" you mean "taking predicted counterfactuals into account when reasoning about logical causation", then Counterfactual Prisoner's Dilemma doesn't give us anything new.

If by "considering [other branches of reality]" you instead mean "acting to benefit my counterfactual self", then I deny that this is what is happening in CPD. You're acting to benefit your current self, via logical causation, just like in the Twin Prisoner's Dilemma. You don't need to care about your counterfactual self at all. So it's disanalogous to Counterfactual Mugging, where the only reason to pay is to help your counterfactual self.

[-]Chris_Leong5yΩ120

Hmm... that's a fascinating argument. I've been having trouble figuring out how to respond to you, so I'm thinking that I need to make my argument more precise and then perhaps that'll help us understand the situation.

Let's start from the objection I've heard against Counterfactual Mugging. Someone might say, well I understand that if I don't pay, then it means I would have lost out if it had come up heads, but since I know it didn't come up heads, I don't care. Making this more precise, when constructing counterfactuals for a decision, if we know fact F about the world before we've made our decision, F must be true in every counterfactual we construct (call this Principle F).

Now let's consider Counterfactual Prisoner's Dilemma. If the coin comes up HEADS, then principle F tells us that the counterfactuals need to have the COIN coming up HEADS as well. However, it doesn't tell us how to handle the impact of the agent's policy if they had seen TAILS. I think we should construct counterfactuals where the agent's TAILS policy is independent of its HEADS policy, whilst you think we should construct counterfactuals where they are linked.

You justify your construction by noting that the agent can figure out that it will make the same decision in both the HEADS and TAILS case. In contrast, my tendency is to exclude information about our decision making procedures. So, if you knew you were a utility maximiser this would typically exclude all but one counterfactual and prevent us saying choice A is better than choice B. Similarly, my tendency here is to suggest that we should be erasing the agent's self-knowledge of how it decides so that we can imagine the possibility of the agent choosing PAY/NOT PAY or NOT PAY/PAY.

But I still feel somewhat confused about this situation.

[-]Richard_Ngo5yΩ120

Someone might say, well I understand that if I don't pay, then it means I would have lost out if it had come up heads, but since I know it didn't come up heads, I don't care. Making this more precise, when constructing counterfactuals for a decision, if we know fact F about the world before we've made our decision, F must be true in every counterfactual we construct (call this Principle F).

The problem is that principle F elides over the difference between facts which are logically caused by your decision, and facts which aren't. For example, in Parfit's hitchhiker, my decision not to pay after being picked up logically causes me not to be picked up. The result of that decision would be a counterpossible world: a world in which the same decision algorithm outputs one thing at one point, and a different thing at another point. But in counterfactual mugging, if you choose not to pay, then this doesn't result in a counterpossible world.

I think we should construct counterfactuals where the agent's TAILS policy is independent of its HEADS policy, whilst you think we should construct counterfactuals where they are linked.

The whole point of functional decision theory is that it's very unlikely for these two policies to differ. For example, consider the Twin Prisoner's Dilemma, but where the walls of one room are green, and the walls of the other are blue. This shouldn't make any difference to the outcome: we should still expect both agents to cooperate, or both agents to defect. But the same is true for heads vs tails in Counterfactual Prisoner's Dilemma - they're specific details which distinguish you from your counterfactual self, but don't actually influence any decisions.

[-]Chris_Leong3yΩ120

So I've thought about this argument a bit more and concluded that you are correct, but also that there's a potential fix to get around this objection.

I think that it's quite plausible that an agent will have an understanding of its decision mechanism that a) lets it know it will take the same action in both counterfactuals and b) won't tell it what action it will take in this counterfactual before it makes the decision.

And in that case, I think it makes sense to conclude that Omega's prediction depends on your action such that paying gives you the $10,000 reward.

However, there's a potential fix in that we can construct a non-symmetrical version of this problem where Omega asks you for $200 instead of $100 in the tails case. Then the fact that you would pay in the heads case, combined with making decisions consistently, doesn't automatically imply that you would pay in the tails case. So I suspect that with this fix you actually would have to consider strategies instead of just making a decision purely based on this branch.
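As a rough illustration of the non-symmetrical version, here is a sketch under stated assumptions: Omega asks for $100 on heads and $200 on tails, the reward stays at $10,000, and prediction is perfect. The names `ASK`, `payoff` and `gain_from_paying` are hypothetical. It treats the other branch's action as fixed, which is the independence assumption under discussion rather than a settled fact.

```python
ASK = {"heads": 100, "tails": 200}   # assumed amounts requested in each branch
REWARD = 10_000                      # paid if the other branch's self would have paid


def payoff(pays_heads, pays_tails, coin):
    """Net payoff in the branch where `coin` is observed, given both branch actions."""
    pays = {"heads": pays_heads, "tails": pays_tails}
    other = "tails" if coin == "heads" else "heads"
    reward = REWARD if pays[other] else 0
    cost = ASK[coin] if pays[coin] else 0
    return reward - cost


def gain_from_paying(coin, other_pays):
    """Change in this branch's payoff from paying, holding the other branch's action fixed."""
    if coin == "heads":
        return payoff(True, other_pays, coin) - payoff(False, other_pays, coin)
    return payoff(other_pays, True, coin) - payoff(other_pays, False, coin)


if __name__ == "__main__":
    for coin in ("heads", "tails"):
        for other_pays in (True, False):
            print(coin, "other branch pays:", other_pays,
                  "gain from paying:", gain_from_paying(coin, other_pays))
    print("pay/pay:", payoff(True, True, "heads"), payoff(True, True, "tails"))
    print("refuse/refuse:", payoff(False, False, "heads"), payoff(False, False, "tails"))
```

With the other branch held fixed, paying always costs this branch the amount asked and gains it nothing, yet the pay/pay policy leaves both branches far better off than refuse/refuse. Because the two branches now face different prices, "I decide the same way in both branches" no longer settles the tails decision by itself.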

[-]Chris_Leong5yΩ120

"The problem is that principle F elides" - Yeah, I was noting that principle F doesn't actually get us there and I'd have to assume a principle of independence as well. I'm still trying to think that through.

[-]Pattern5y20

So [why] do we care about what would have happened if we had?

[-]Chris_Leong5y20

This post demonstrates that ignoring counterfactuals can cause you to do worse even if you only care about your particular branch. This doesn't take you all the way to expected utility over branches, but I can't see any obvious intermediate positions.

[-]Pattern5y20

I was pointing out a typo in the Original Post. That said, that's a great summary.

Perhaps an intermediate position could be created as follows:

Given a graph of 'the tree' (including the branch you're on), position E is

expected utility over branches

position B is

you only care about your particular branch.

Position B seems to care about the future tree (because it is ahead), but not the past tree. So it has a weight of 1 on the current node and its descendants, but a weight of 0 on past/averted nodes, while Position E has a weight of 1 on the "root node" (whatever that is). (Node weights are inherited, with the exception of the discontinuity in Position B.)

An intermediate position is placing some non-zero weight on 'past nodes', going back along the branch, and updating the inherited weights. Aside from a weight of 1/2 being placed along all in-branch nodes, another series could be used, for example: r, r^2, r^3, ... for 0<r<1. (This series might allow for adopting an 'intermediate position' even when the branch history is infinitely long.)

There are probably some technical details to work out, like making all the weights add up to 1, but for a convergent series that's probably just a matter of applying an appropriate scale factor for normalization. For r=1/2, the infinite sum is 1, so no additional scaling is required. However, this might not work (the sum across all nodes' rewards times their weights might diverge) on an infinite tree where the rewards grow too fast...

(This was an attempt at outlining an intermediate position, but it wasn't an argument for it.)
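The normalization step mentioned above can be sketched as follows. This is only an illustration: the comment does not say which node receives the first weight r, so the ordering here (first entry gets r, the next r^2, and so on) is an assumption, as are the names `geometric_weights`, `normalise` and `weighted_value`.

```python
from fractions import Fraction


def geometric_weights(r, n):
    """Unnormalised weights r, r^2, ..., r^n for n nodes along the branch."""
    return [r ** k for k in range(1, n + 1)]


def normalise(weights):
    """Scale the weights so that they sum to exactly 1."""
    total = sum(weights)
    return [w / total for w in weights]


def weighted_value(rewards, r):
    """Weighted sum of per-node rewards under normalised geometric weights."""
    weights = normalise(geometric_weights(r, len(rewards)))
    return sum(w * x for w, x in zip(weights, rewards))


if __name__ == "__main__":
    half = Fraction(1, 2)
    for n in (1, 5, 10, 20):
        # Partial sums of 1/2 + 1/4 + ... approach 1 as n grows.
        print(n, sum(geometric_weights(half, n)))
    print(weighted_value([100, 0, 0], half))
```

For a finite branch the partial sum of r + r^2 + ... + r^n falls short of the infinite sum r/(1-r), so a scale factor is still needed even for r=1/2. The divergence worry on infinite trees is not addressed by this sketch.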

[-]Shmi6y20

Omega, a perfect predictor, flips a coin. If it comes up heads, Omega asks you for $100, then pays you $10,000 if it predicts you would have paid if it had come up tails. If it comes up tails, Omega asks you for $100, then pays you $10,000 if it predicts you would have paid if it had come up heads.

Having a bit of trouble understanding the setup, maybe it can be framed in a way that avoids confusofactuals.

How about "Omega knows whether you would pay in the counterfactual mugging setup if told that you had lost and will reward you for paying if you lose, but you don't know that you would get rewarded once you pay up". Is there anything I have missed?

If my understanding is correct, then those who would pay gain either $10,000 or $9,900, and those who would not pay gain either $10,000 or nothing, depending on the coin flip. So, in this setup a payer's expected gain ($9,950) is higher than a non-payer's ($5,000).

Note that your formulation has a bunch of superfluous stipulations. Omega is a perfect predictor, so you may as well just get informed of the results and given $10,000, $9,900 or nothing. The only difference is emotional, not logical. For example:

You are the kind of person who would pay $100 in the counterfactual mugging loss, and you did, sadly, lose, so here is your $9,900 reward for being such a good boy. Have a good day!

[-]Chris_Leong6y40

"How about "Omega knows whether you would pay in the counterfactual mugging setup if told that you had lost and will reward you for paying if you lose, but you don't know that you would get rewarded once you pay up". Is there anything I have missed?" - you aren't told that you "lost", as there is no losing coin flip in this scenario since it is symmetric. You are told which way the coin came up. Anyway, I updated the post to clarify this.

[-]TAG3y10

So we do we care about what would have happened if we had?

Should that read "So do we care ..." or " So why do we care ..." ?

we are effectively pre-commited to whatever action we end up taking.

The thing about a pre-commitment is that it can be determined on much less information than the total state of the universe. A Laplace's Demon could figure out your decision from general determinism, if determinism is true, but an Omega isn't so powerful.

After all the universe is deterministic

That's not a fact.

[-]Donald Hobson6yΩ110

This depends on how Omega constructs his counterfactuals. Suppose the laws of physics make the coin land heads as part of a deterministic universe. The counterfactual where the coin lands tails must have some difference in starting conditions or physical laws, or non-physical behavior. Let's suppose blatantly nonphysical behavior, like a load of extra angular momentum appearing out of nowhere. You are watching the coin closely. If you see the coin behave nonphysically, then you know that you are in a counterfactual. If you know that Omega's counterfactuals are always so crudely constructed, then you would only pay in the counterfactual and get the full $10000.

If you can't tell whether or not you are in the counterfactual, then pay.

[-]Chris_Leong6yΩ120

We can assume that the coin is flipped out of your sight.

[-]avturchin6y10

Policy beats opportunity in the legal field. If one implements a policy of "never steal", he wins against criminal law. If one steals only when there is no chance of being caught, that is, acts based on opportunity, he will eventually be caught.

[-]Chris_Leong6y30

Only if the criminal messes up their expected utility calculation.
