The Refined Counterfactual Prisoner's Dilemma: An Attempt to Explode Decision-Theoretic Consequentialism

Chris_Leong

The Refined Counterfactual Prisoner's Dilemma: An Attempt to Explode Decision-Theoretic Consequentialism — LessWrong

18 The Refined Counterfactual Prisoner's Dilemma: An Attempt to Explode Decision-Theoretic Consequentialism

by Chris_Leong

11th Mar 2026

AI Alignment Forum

3 min read

18 Ω 6

I was inspired to revise my formulation of this thought experiment by Ihor Kendiukhov's post On The Independence Axiom.

Kendiukhov quotes Scott Garrabrant:

My take is that the concept of expected utility maximization is a mistake. [...] As far as I know, every argument for utility assumes (or implies) that whenever you make an observation, you stop caring about the possible worlds where that observation went differently. [...] Von Neumann did not notice this mistake because he was too busy inventing the entire field. The point where wne discover updatelessness is the point where we are supposed to realize that all of utility theory is wrong. I think we failed to notice.

Apparently "stopping caring about the possible worlds where that observation went differently" is known as (decision-theoretic) consequentialism.

I was thinking this through and I realised that (potential) disadvantage of not caring about worlds where the observation went differently can be cleanly illustrated by the following thought experiment:

The Refined Counterfactual Prisoner's Dilemma: Omega, a perfect predictor, flips a coin. Later on, Omega explains the scenario, including the result of the coin flip and details that are yet to come, and asks you for $1. Turns out that before came to speak to you, it made a prediction about what you would have chosen if the coin had come up the other way. If it predicted earlier that you wouldn't have paid, the scenario finishes with Omega inflicting $1 million dollars worth of damage on you.

(I'll list the order of steps more explicitly at the end)

This attempts to explode the consequentialism by constructing a situation where you can symmetrically burn a lot of value in other counterfactual case by refusing to give up a trivial amount of value. If you don't care about the other world, you'd press such a button if it could exist and because you'd press it in both counterfactuals you end up worse off regardless of which way the coin ends up.

Now you might be skeptical about the existence of such a button because you're doubtful about the possibility of perfect predictors, but if your doubt was assuaged then this thought experiment would bite. In fact, I would argue that it would be quite surprising if a proposed decision theory were to fail for perfect predictors without having deeper issues.

Additional information

This is an improved version of a thought experiment that was independently discovered by Cousin_It and me:

The Original Counterfactual Prisoner's Dilemma Omega, a perfect predictor, flips a coin and tell you how it came up. If it comes up heads, Omega asks you for $100, then pays you $10,000 if it predicts you would have paid if it had come up tails. If it comes up tails, Omega asks you for $100, then pays you $10,000 if it predicts you would have paid if it had come up heads. In this case it was heads and it makes its prediction before you decide.

The changes I've made for this version may seem trivial, but if you want a thought experiment to spread, small details like this matter. The original version was just a symmetric version of counterfactual-mugging, but this was less helpful in explaining it than I originally hoped.

The Refined Counterfactual Prisoner's Dilemma Event Sequence:

To make this as clear as possible, here's the envisioned temporal ordering:

Omega, a perfect predictor, flips a coin.
Omega predicts what you would have chosen in the other counterfactual and writes it down on a piece of paper and puts it into an envelope which it seals until the end of the scenario.
Omega explains the scenario, including the result of the coin flip and details that are yet to come.
Omega asks you to decide whether you'll pay it $1.
You make your decision.
Omega opens the envelope. If the paper with its prediction says that you weren't going to pay, then it inflicts $1 million dollars worth of damage to you.

Decision theoryRationality

Frontpage

18 Ω 6

New Comment

20 comments, sorted by

top scoring

Click to highlight new comments since: Today at 3:30 AM

[-]Dacyn2moΩ352

The changes I’ve made for this version may seem trivial

Well, in one version you are being extorted for money, whereas in the other version you are merely being bribed. If you buy Eliezer's theory that you should pay up for bribes but not for extortions (because paying up for bribes increases the probability that people will try to bribe you, which is good, but paying up for extortion increases the probability that people will try to extort you, which is bad), then the difference matters.

[-]Chris_Leong2moΩ120

Good point.

Assume no-one will ever know, that you can't disincentivise the actor and that they won't ever do anything like this again.

[-]Jiro2mo20

Is Omega meant to be immune to the Halting Problem here? (Your own strategy could conceivably depend on trying to predict what Omega would do under some circumstances.) Or are you supposed to be an imperfect reasoner (which may be true in real life, but isn't normally assumed to be true in these hypothetical scenarios)?

[-]Chris_Leong2mo20

Is Omega meant to be immune to the Halting Problem here?

Why do you think this might be an issue? Can't we just get around this by giving you a limited amount of time to decide before Omega counts it as you rejecting the offer?

[-]Jiro2mo2-1

Giving you a limited amount to decide, where that limited amount of time means that you can't use some lines of reasoning, means that the question is now "what should an imperfect reasoner do when faced with Omega". The answer to this question 1) depends on the details of the imperfect reasoning and 2) leads to rather less insight on the main question.

[-]Richard_Kennaway2mo20

I don't understand the Refined Counterfactual Prisoner's Dilemma. It seems to me the same as Omega demanding $1 against the threat of doing you $1M damage. Or is the $1M damage only inflicted in the hypothetical world of the other outcome of the coin, which by the setup of the problem you know not to exist?

I likewise don't understand the Original version. It seems the same as Omega asking you to pay $100 for a reward of $10,000. I am not seeing how the counterfactual worlds come into it.

I'm not a fan of the MWI or Tegmark levels, if that's relevant, and I do not understand Garrabrant's objection to "whenever you make an observation, you stop caring about the possible worlds where that observation went differently". His reason for that is that "Reflectively stable agents are updateless", but my understanding (tell me if this is wrong) is that that does not mean "updateless" agents can't update on new information. It just means that what they would decide to do on receiving that new information was already determined by the algorithm that they are running, determined when that algorithm was designed, and newly discovered by the agent. The algorithm is what is updateless.

[-]Chris_Leong2mo30

So the $1 million of damage is only inflicted in the hypothetical you know not to exist, however, due to the symmetry reality is "the hypothetical that doesn't exist" for the other hypothetical.

I'm not a fan of Tegmark levels either nor am I attempting to construct a quantum decision theory.

I'd prefer not to speak for Garrabrant.

[-]Richard_Kennaway2mo20

So the $1 million of damage is only inflicted in the hypothetical you know not to exist, however, due to the symmetry reality is "the hypothetical that doesn't exist" for the other hypothetical.

Yes, reality is the hypothetical that doesn't exist for the other hypothetical, but the other hypothetical doesn't exist, so I don't care. "Man, that was easy. You guys have any harder ones?" :)

[-]Chris_Leong2mo*31

Oh, I'm pretty sure it's harder than you think. You may want to reread the wording of the scenario. It was written with this kind of viewpoint in mind and that's why it is different from the counterfactual mugging.

[-]Richard_Kennaway2mo20

The wording of the scenario does not actually say that Omega tells me the details, but I assume that Omega does, otherwise to me Omega is just a random stranger begging a dollar.

So, in the scenario as stated, no-one is ever punished, whatever I do. Therefore I refuse. I might give to a beggar, but not to a con artist. My hypothetical other self does not exist to be punished, and as my hypothetical other self's hypothetical other self, I don't get punished either, because that hypothetical other self never existed for Omega to make the demand to and exert retribution on me for refusal. Every route to me getting punished goes through my non-existent alter.

[-]JBlack2mo42

Then Omega correctly predicts that you wouldn't have paid if the coin had come up the other way, and punishes you.

Note: I am using the word "correct" in the sense that you have literally just told us that you wouldn't have paid if the coin had come up the other way, and it makes no sense to claim anything about "that case regarding the other possibility for the coin is just a hypothetical" since the entire thing being discussed is a hypothetical.

In more detail:

Within the outer hypothetical of this scenario happening at all, Omega's prediction about the coin-alternative hypothetical is a fact (not a hypothetical) that you are not aware of, but can predict with very high success rate. It is very highly correlated with the output of your decision process, though not caused by the output of your decision process. Both the prediction and the output have a common cause. If your decision process is anywhere near as legible (to Omega) as you state it to be, and results in the output you state, then it will result in you being punished, and this punishment should be highly predictable to you in advance.

However, you have stated that you do not predict punishment, so there is something wrong with your decision process.

[-]Richard_Kennaway2mo20

However, you have stated that you do not predict punishment, so there is something wrong with your decision process.

Or perhaps with my understanding of the original problem, or its wording. At this point I am not clear who knows what when, and who gets punished under what circumstances.

[-]Chris_Leong2mo20

Honestly, I've probably made it more confusing by editing the wording as I go. One thing that might make it easier is the dot points I've now added in my post that describe the situation step-by-step. I wish I'd thought of including that at the start.

[-]Chris_Leong2mo20

Thanks for mentioning that. It's a good catch. I've updated the wording to explicitly mention that it tells you and when I get to my laptop I'll edit in a proper timeline at the end as that might reduce any confusion.

I don't believe that the last sentence holds and so I believe Omegas punishment goes through.

[-]Nathan Heath2mo10

Not sure this framing really "explodes" consequentialism as predicted, even though it shows tailored Omega scenarios can adversarially exploit CDT. I'd be interested to hear your thoughts on whether it can be proved that UDT will always do at least as well as CDT. Personally, I lean towards the notion of CDT being vulnerable but not dominated here.

[-]Chris_Leong2mo20

Asking whether one algorithm is dominant over another algorithm is underspecified without choosing a particular domain.

[-]Nathan Heath2mo10

Fair point, I see now that dominance is the wrong framing. I'm still curious about what this would add for someone already biting the bullet on counterfactual mugging. If CDT proponents know they lose in Omega scenarios, what about this refined version should update them?

[-]Chris_Leong2mo20

In Counterfactual Mugging, which option counts as "biting the bullet"?

[-]Canaletto2mo10

Somewhat similar to counterfactual mugging. Although this one goes both ways, your counterfactual decision affects you and your decision affects your counterfactual self equally. Hmm.

https://www.lesswrong.com/w/counterfactual-mugging

[-]Chris_Leong2mo20

Yes, it is a variant of counterfactual mugging. I noted this in my original post, but I didn't mention it here since this post is focused on how I've revised it. I've updated my post to mention in now. .

Moderation Log