I set out to understand precisely why naive TDT (possibly) fails the counterfactual mugging problem. While doing this I ended up drawing a lot of Bayes nets, and seemed to gain some insight; I'll pass these on, in the hopes that they'll be useful. All errors are, of course, my own.

The grand old man of decision theory: the Newcomb problem

First let's look at the problem that inspired all this research: the Newcomb problem. In this problem, a supremely-insightful-and-entirely-honest superbeing called Omega presents two boxes to you, and tells you that you can either choose box A only ("1-box"), or take box A and box B ("2-box"). Box B will always contain $1K (one thousand dollars). Omega has predicted what your decision will be, though, and if you decided to 1-box, he's put $1M (one million dollars) in box A; otherwise he's put nothing in it. The problem can be cast as a Bayes net with the following nodes:

Your decision algorithm (or your decision process) is the node that determines what you're going to decide. This leads to "Your decision" (1-box or 2-box) and Ω (puts $1M or zero in box A). These lead to the "Money" node, where you can end up with $1M+1K, $1M, $1K or $0 depending on the outputs of the other nodes. Note that the way the network is set up, you can never have $1M+1K or $0 (since "Ω" and "Your decision" are not independent). But it is the implied "possibility" of getting those two amounts that causes causal decision theory to 2-box in the Newcomb problem.
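The net described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names (`money`, `linked_outcome`) are made up for this sketch, and Ω's prediction is assumed to be perfect. It shows which payoffs are reachable when both child nodes read the same parent, and why holding Ω's node fixed makes 2-boxing look better by $1K.

```python
def money(omega_fills_a: bool, decision: str) -> int:
    """The "Money" node: total dollars given Omega's action and your decision."""
    box_a = 1_000_000 if omega_fills_a else 0
    box_b = 1_000 if decision == "2-box" else 0
    return box_a + box_b


def linked_outcome(algorithm_output: str) -> int:
    """Both "Your decision" and Omega copy the decision algorithm's output.

    Assumption for this sketch: Omega's prediction is perfect.
    """
    decision = algorithm_output
    omega_fills_a = algorithm_output == "1-box"
    return money(omega_fills_a, decision)


# Payoffs that can actually occur when the two nodes share a parent.
reachable = {a: linked_outcome(a) for a in ("1-box", "2-box")}

# CDT-style view: hold Omega's node fixed and vary only the decision.
cdt_gain_from_two_boxing = {
    filled: money(filled, "2-box") - money(filled, "1-box")
    for filled in (True, False)
}

print(reachable)                 # {'1-box': 1000000, '2-box': 1000}
print(cdt_gain_from_two_boxing)  # {True: 1000, False: 1000}
```

The $1M+1K and $0 outcomes only show up in the second computation, where Ω's node is varied independently of the decision.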

In TDT, as I understand it, you sever your decision algorithm node from the history of the universe (note: this is incorrect, as explained here; in fact you condition on the start state of your program, and screen off the history of the universe), and then pick the action that maximises your utility.

But note that the graph is needlessly complicated: "Your decision" and "Ω" are both superfluous nodes, that simply pass on their inputs to their outputs. Ignoring the "History of the Universe", we can reduce the net to a more compact (but less illuminating) form:

Here 1-box leads to $1M and 2-box leads to $1K. In this simplified version, the decision is obvious - maybe too obvious. The decision was entirely determined by the choice of how to lay out the Bayes net, and a causal decision theorist would disagree that the original "screened out" Bayes net was a valid encoding of the Newcomb problem.

The counterfactual mugging

In the counterfactual mugging, Omega is back, this time explaining that he tossed a coin. If the coin came up tails, he would have asked you to give him $1K, giving nothing in return. If the coin came up heads, he would have given you $1M - but only if you would have given him the $1K in the tails world. That last fact he would have known by predicting your decision. Now Omega approaches you, telling you the coin was tails - what should you do? Here is a Bayes net with this information:

I've removed the "History of the Universe" node, as we are screening it off anyway. Here "Simulated decision" and "Your decision" will output the same decision on the same input. Ω will behave the way he said, based on your simulated decision given tails. "Coin" will output heads or tails with 50% probability, and "Tails" simply outputs tails, for use in Ω's prediction.

Again, this graph is very elaborate, codifying all the problem's intricacies. But most of the nodes are superfluous for our decision, and the graph can be reduced to:

"Coin" outputs "heads" or "tails" and "Your decision algorithm" outputs "Give $1K on tails" or "Don't give $1K on tails". Money is $1M if it receives "heads" and "Give $1K on tails", -$1K if it receives "tails" and "Give $1K on tails", and zero if it receives "Don't give $1K on tails" (independent of the coin result).

If our utility does not go down too sharply in money, we should choose "Give $1K on tails", as a 50-50 bet on winning $1M and losing $1K is better than getting nothing with certainty. So precommitting to giving Omega $1K when he asks leads to the better outcome.
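As a quick check of that comparison, here is a minimal sketch of the reduced net, evaluated before the coin is known. It computes expected money rather than expected utility, so it only supports the conclusion under the stated proviso that utility does not fall off too sharply in money. The names `GIVE`, `DONT`, `money` and `expected_money` are illustrative.

```python
from fractions import Fraction

GIVE = "Give $1K on tails"
DONT = "Don't give $1K on tails"


def money(coin: str, policy: str) -> int:
    """The reduced "Money" node: payoff in dollars."""
    if policy == DONT:
        return 0  # independent of the coin
    return 1_000_000 if coin == "heads" else -1_000


def expected_money(policy: str) -> Fraction:
    """Expected payoff under a fair coin, using exact arithmetic."""
    half = Fraction(1, 2)
    return half * money("heads", policy) + half * money("tails", policy)


ev_give = expected_money(GIVE)
ev_dont = expected_money(DONT)
print(ev_give, ev_dont)  # 499500 0
```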

But now imagine that we are in the situation above: Omega has come to us and explained that yes, the coin has come up tails. The Bayes net now becomes:

In this case, the course is clear: "Give $1K on tails" does nothing but lose us $1K. So we should decide not to - and nowhere in this causal graph can we see any problem with that course of action.
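The switch in recommendation can be made concrete with a small sketch. It assumes, as the naive reading does, that being told "tails" simply replaces the 50-50 coin distribution with certainty of tails; `best_policy` is a hypothetical helper that maximises expected money.

```python
from fractions import Fraction

GIVE = "Give $1K on tails"
DONT = "Don't give $1K on tails"


def money(coin: str, policy: str) -> int:
    """The reduced "Money" node: payoff in dollars."""
    if policy == DONT:
        return 0
    return 1_000_000 if coin == "heads" else -1_000


def best_policy(coin_dist: dict) -> str:
    """Return the policy with the highest expected money under coin_dist."""
    def expected(policy: str) -> Fraction:
        return sum(p * money(coin, policy) for coin, p in coin_dist.items())
    return max((GIVE, DONT), key=expected)


# Before hearing the result: fair coin.
before_update = best_policy({"heads": Fraction(1, 2), "tails": Fraction(1, 2)})
# After Omega says "tails": the coin node is treated as fixed.
after_update = best_policy({"heads": Fraction(0), "tails": Fraction(1)})

print(before_update)  # Give $1K on tails
print(after_update)   # Don't give $1K on tails
```

The same payoff table yields opposite answers depending only on which coin distribution is plugged in.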

So it seems that naive TDT has an inconsistency problem. And these graphs don't seem to fully encode the actual problem properly (ie that the action "Give $1K on tails" corresponds to situations where we truly believe that tails came up).

Thoughts on the problem

Some thoughts that occurred when formalising this problem:

  1. The problem really is with updating on information, vindicating the instincts behind updateless decision theory. The way you would have to behave, conditional on seeing new information, is different from how you want to behave, after seeing that new information.
  2. Naive TDT reaches different conclusions depending on whether Omega simulates you or predicts you. If you are unsure whether you are being simulated or not (but still care about the wealth of the non-simulated version), then TDT acts differently on updates. Being told "tails" doesn't actually confirm that the coin was tails: you might be the simulated version, being tested by Omega. Note that in this scenario, the simulated you is being lied to by the simulated Omega (the "real" coin need not have been tails), which might put the problem in a different perspective.
  3. The tools of TDT (Bayes nets cut at certain connections) feel inadequate. It's tricky to even express the paradox properly in this language, and even more tricky to know what to do about it. A possible problem seems to be that we don't have a way of expressing our own knowledge about the model, within the model - hence "tails" ends up being a fact about the universe, not a fact about our knowledge at the time. Maybe we need to make our map explicit in the territory, and get Bayes nets that go something like these:


39 comments

For quite some time now, I have idealized TDT as something that would decide whatever CDT would wish to be able to force itself to decide, if it were warned about that problem in advance. "If I were warned in advance, what would I promise myself to do?"

In the Newcomb problem, CDT would wish to be able to force itself to 1-box, knowing that it would force Omega to give the million.

In the "gimme $100 when we arrive in town or I leave you in this desert to die" problem, CDT would wish to be able to force itself to give the $100, lest it die of thirst.

In the counterfactual mugging, CDT would wish to be able to force itself to give the $1000, as such an advantageous bet would maximize utility.

Now I don't know…

  • if this can be easily formalized,
  • if this is reflectively consistent,
  • if this is actually rational.

I know nothing about the math behind that, but my instincts tell me "yes" to all of the above (which is why I do give Omega the hundred bucks). Surely someone thought of my idealization, then hit some snag?

You want to get counterfactually mugged by Omega if and only if you have very high confidence in your model of how Omega works.

It doesn't feel rational because our brains have very good intuitions about how people work. If a person tried a counterfactual mugging on you, you'd want to refuse. Odds are that sort of proposition out of the blue is merely an attempt to part you from your money, and there'd be no million dollars waiting for you had the coin turned up heads.

Obvious problem with this refinement: suppose you win 3^^^3 units of utility on a heads. You can't be certain enough that it's a scam to reject scam attempts. Actually, never mind, you can scale up your credulity at higher and higher reward amounts - after all, there are many more scammers willing to say 3^^^3 than there are entities that can give you 3^^^3 units of utility.



I didn't intend to solve Pascal's Mugging. Extremely high pay-off with slightly less extreme odds still doesn't feel right. But I have the beginning of a solution: the way the story is told (gimme $5 or I torture 3^^^^^3 people for 9^^^^^^^^9 eons), it doesn't ask you to bet utility on unfathomable odds, but resources. As we have limited resources, the drop in utility is probably sharper than we might think as we give resources up. For instance, would you bet $1000 for $1M at 1/2 odds if it were your last? If having no money causes you to starve, suddenly we're not talking 500-fold return on investment (on average), but betting your very life at 1/2 odds for $1M. And your life is worth 5.2 more than that¹ :-). It doesn't solve everything (replacing "gimme $5" by "kill your light cone" still is a problem), but I'd say it's a start: if there is a Lord Outside the Matrix, it means you have a chance to get out of the box and take over, with possibly an even greater pay-off than the original threat. But for that, you probably need your resources.

[1] can't find the study, and I hesitate between 5.2 and 5.8. On average, of course.
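To illustrate how a sharp drop in utility near zero wealth can reverse the bet, here is a rough sketch that assumes logarithmic utility of total wealth. That utility function is an assumption chosen for illustration, not something claimed in this thread, and `accepts_bet` is a hypothetical helper.

```python
import math


def accepts_bet(wealth: float, stake: float = 1_000, prize: float = 1_000_000) -> bool:
    """True if a log-utility agent prefers the 50-50 bet to keeping its wealth.

    Assumes wealth > stake, so that log(wealth - stake) is defined.
    """
    utility_if_bet = 0.5 * math.log(wealth + prize) + 0.5 * math.log(wealth - stake)
    utility_if_not = math.log(wealth)
    return utility_if_bet > utility_if_not


# Almost the last $1000: losing leaves fifty cents, so the bet is refused.
print(accepts_bet(1_000.50))  # False
# Comfortable reserves: losing $1000 barely matters, so the bet is taken.
print(accepts_bet(100_000))   # True
```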

In TDT, as I understand it, we sever you decision node from the history of the universe, and then pick the action that maximises our utility:

This looks wrong; in the picture, you sever the "your decision algorithm" node. Indeed, I think that's the 'difference' between naive TDT and naive CDT- naive CDT supposedly severs "your decision" whereas TDT makes the improvement of severing a step sooner, so it recognizes that it can cause Omega (like a CDTer with a correct map of the problem).

Typo corrected! And yes, that is the CDT-TDT debate - but not really relevant here.

In TDT we don't do any severance! Nothing is uncaused, not our decision, nor our decision algorithm either. Trying to do causal severance is a basic root of paradoxes because things are not uncaused in real life. What we do rather is condition on the start state of our program, thereby screening off the universe (not unlawfully severing it), and factor out our uncertainty about the logical output of the program given its input. Since in real life most things we do to the universe should not change this logical fact, nor will observing this logical fact tell us which non-impossible possible world we are living in, it shouldn't give us any news about the nodes above, once we've screened off the algorithm. It does, however, give us logical news about Omega's output, and of course about which boxes we'll end up with.

My reading of this is that you use influence diagrams, not Bayes nets; you think of your decision as influenced by things preceding it, but not as an uncertainty node. Is that a fair reading, or am I missing something?

I stand corrected, and have corrected it.

My instant reaction upon hearing this is to try to come up with cases where they DO change that logical fact. Holding off on proposing solutions for now.

So, for this issue I would note that for the coinflip to influence the decision algorithm, there needs to be an arrow from the coinflip to the decision algorithm. Consider two situations:

  1. Omega explains the counterfactual mugging deal, learns whether you would pay if the coin comes up tails, and then tells you how the coin came up.

  2. Omega tells you how the coin comes up, explains the counterfactual mugging deal, and then learns whether you would pay if the coin comes up tails.

Those have different Bayes nets and so it can be entirely consistent for TDT to output different strategies in each.
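One way to picture the difference is to write the two nets as edge lists and check whether any directed path runs from the coin to the decision algorithm. The particular arrows below are an assumed encoding of the two situations, chosen for illustration, and `has_path` is a hypothetical helper.

```python
def has_path(edges: dict, start: str, goal: str) -> bool:
    """Depth-first search for a directed path in a dict-of-lists graph."""
    stack, seen = [start], set()
    while stack:
        node = stack.pop()
        if node == goal:
            return True
        if node in seen:
            continue
        seen.add(node)
        stack.extend(edges.get(node, []))
    return False


# Situation 1 (assumed encoding): Omega reads your policy before you hear the coin.
net_1 = {
    "Coin": ["Money"],
    "Decision algorithm": ["Omega", "Money"],
    "Omega": ["Money"],
}

# Situation 2 (assumed encoding): you hear the coin result before Omega reads your policy.
net_2 = {
    "Coin": ["Decision algorithm", "Money"],
    "Decision algorithm": ["Omega", "Money"],
    "Omega": ["Money"],
}

print(has_path(net_1, "Coin", "Decision algorithm"))  # False
print(has_path(net_2, "Coin", "Decision algorithm"))  # True
```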

(Re)Poll time: would you give Omega $100? [pollid:184]

Incidentally, if you vote yes on this, don't vote anonymously! You want to proclaim your decision to any partial Omegas out there...

Hey, the real Omega is supposed to not be lying, right? If I were to doubt that, given the extremely low prior of Omega submitting me to this test (compared to a fellow human trying to con me), then the math would probably tell me that the lottery has better odds.

As Stuart_Armstrong noted, this poll was intended to be about precommitment. Now, provided you had not had a chance to precommit (assume that Omega is not known for reading LW, or running it), would you pay in a sudden one-shot experiment? [pollid:185]

I think pre-commitment is irrelevant if you assume Omega can actually read your decision algorithm and predict it deterministically. You want to be the type of person that would give Omega $1k if the fair coin toss didn't go your way; if you're not that type of person once the coin has been tossed, it's too late; and not being that type of person is only a good thing if it wasn't actually a fair coin.

What if I'm the type of person who does not honor counterfactual precommitments, only actual ones?

Then expect to lose :-)

Lose what?

Lose money, at least in expectation, if ever these issues come up!

Please provide an example or a link to a somewhat realistic situation where this approach (distinguishing between actual and counterfactual precommitment) loses money.

somewhat realistic situation

Oh, you are being difficult, aren't you! :-P

Well, there is this interesting claim:


It's an interesting account indeed, but hardly relevant to the point at hand, which is "you are guaranteed to lose this time if you precommit counterfactually, but should do it anyway".

You're guaranteed to lose if you precommit to giving Omega $1k, if this scenario ever comes up, yes.

However, a precommitment to giving Omega $1k if you lose, and gaining $1M if you win, has positive expected value, and it necessarily entails the precommitment to giving Omega $1k if you lose the coin toss. If you just precommit to giving Omega $1k (and also to refusing the $1M if he ever offers it), then yeah; that's pretty dumb.

Well, I'm more in favour of "you should now precommit to everything of this type that could happen in the future".

I precommit to precommit on all future setups with expected positive payoff except for unannounced setups where I am told that I have already lost.

What about unannounced setups where you are told that you have already lost - but where you would have had a chance of winning big, had you said yes?

Again, a realistic example of where such a situation occurs would help me understand your point.

I suppose you want to measure if people here think having the chance to pre-commit changes the correct answer? Thinking it does is not reflectively consistent, right? Could it be that reflective consistency isn't that important after all?


If it's actually Omega.

(Re)Poll time: would you give Omega $100?

No, because he asked for $1,000. :P

I went with the original description, sorry :)

I liked the article. It was accessible and it showed how various TDT-ish theories still run into problems at the level of "converting the world problem into math". However it felt as though the causal networks were constructed in a way that assumed away the standard disagreements about Newcomb's problem itself...

Specifically, it seems as though the first, second, and fourth, and maybe the last diagrams should have had a "Magic" node that points to "Your Decision" and which is neither determined by the history of the universe nor accessible to Omega. Philosophers who one box on Newcomb generally assert that what Omega is claimed to be doing is impossible in the general case.

Another interesting way of disrupting the standard causal interpretation is to posit that Omega's amazing prediction powers are a function of something like a Time Turner, so that your actual decision controls the money which produces a causal loop due to a backwards arrow from a future Omega who observes the monetary outcome and communicates to a past Omega who controls the outcome according to a rule-enforcing intent.

I understand that objections to Omega's reality don't necessarily address Newcomb in a way that speaks to the setup's value for AGI ethics, but it still seems worth footnoting that there is an open question of physical (or metaphysical?) fact that is being assumed for the sake of AGI ethics. If the assumption is false in reality, and an actual AGI was built that naively assumed it was true (as a sort of "philosophical bug"?), then that might have complicated and perhaps unpleasant real world outcomes.

An AGI that functioned as deterministic software would be in actual Newcomb-like situations every time it was copied...

Yes. In that special case the assumption is definitely valid :-)

Modulo the halting problem.

Isn't temporal inconsistency just selfishness? That is, before you know whether the coin came up heads or tails, you care about both possible futures. But after you find out that you're in the tails' universe you stop caring about the heads' universe, because you're selfish. UDT acts differently, because it is selfless, in that it keeps the same importance weights over all conceivable worlds.

It makes perfect sense to me that a rational agent would want to restrict the choices of its future self. I want to make sure that future-me doesn't run off and do his own thing, screwing current-me over.

Doesn't the result of the coin flip fall under history of the universe?

It does, but it's worth separating out to analyse the problem. We can do that since it doesn't causally influence the decision algorithm.