I set out to understand precisely why naive TDT (possibly) fails the counterfactual mugging problem. While doing this I ended up drawing a lot of Bayes nets, and seemed to gain some insight; I'll pass these on, in the hopes that they'll be useful. All errors are, of course, my own.
The grand old man of decision theory: the Newcomb problem
First let's look at the problem that inspired all this research: the Newcomb problem. In this problem, a supremely-insightful-and-entirely-honest superbeing called Omega presents two boxes to you, and tells you that you can either choose box A only ("1-box"), or take box A and box B ("2-box"). Box B will always contain $1K (one thousand dollars). Omega has predicted what your decision will be, though, and if you decided to 1-box, he's put $1M (one million dollars) in box A; otherwise he's put nothing in it. The problem can be cast as a Bayes net with the following nodes:
Your decision algorithm (or your decision process) is the node that determines what you're going to decide. This leads to "Your decision" (1-box or 2-box) and Ω (puts $1M or zero in box A). These lead to the "Money" node, where you can end up with $1M+1K, $1M, $1K or $0 depending on the outputs of the other nodes. Note that the way the network is set up, you can never have $1M+1K or $0 (since "Ω" and "Your decision" are not independent). But it is the implied "possibility" of getting those two amounts that causes causal decision theory to 2-box in the Newcomb problem.
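To make the dependence between the nodes concrete, here is a minimal Python sketch of the net described above. The function names (`omega`, `money_node`, `linked_outcome`, `independent_outcomes`) are my own illustrative choices, not part of any formal TDT machinery; the payoffs are the ones from the problem statement.

```python
# Payoffs from the problem statement, in dollars.
BOX_A_FULL = 1_000_000
BOX_B = 1_000

DECISIONS = ("1-box", "2-box")


def omega(predicted_decision):
    """The Ω node: put $1M in box A only if 1-boxing is predicted."""
    return BOX_A_FULL if predicted_decision == "1-box" else 0


def money_node(decision, box_a):
    """The Money node: box A's contents, plus box B if you 2-box."""
    return box_a + (BOX_B if decision == "2-box" else 0)


def linked_outcome(algorithm_output):
    """Both "Your decision" and Ω read the same decision-algorithm output."""
    return money_node(algorithm_output, omega(algorithm_output))


def independent_outcomes():
    """Illustrative assumption: pretend the decision and Ω's prediction
    could vary independently, which is what creates the two extra amounts."""
    return {
        (decision, prediction): money_node(decision, omega(prediction))
        for decision in DECISIONS
        for prediction in DECISIONS
    }


if __name__ == "__main__":
    for d in DECISIONS:
        print(d, "->", linked_outcome(d))
    print(independent_outcomes())
```

With the nodes linked, only $1M and $1K are reachable; the $1M+1K and $0 entries appear only in the table where the two nodes are allowed to disagree.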
In TDT, as I understand it, you sever your decision algorithm node from the history of the universe, and then pick the action that maximises your utility. (Note that this is incorrect, as explained here: in fact you condition on the start of your program, and screen out the history of the universe.)
But note that the graph is needlessly complicated: "Your decision" and "Ω" are both superfluous nodes, that simply pass on their inputs to their outputs. Ignoring the "History of the Universe", we can reduce the net to a more compact (but less illuminating) form:
Here 1-box leads to $1M and 2-box leads to $1K. In this simplified version, the decision is obvious - maybe too obvious. The decision was entirely determined by the choice of how to lay out the Bayes net, and a causal decision theorist would disagree that the original "screened out" Bayes net was a valid encoding of the Newcomb problem.
The counterfactual mugging
In the counterfactual mugging, Omega is back, this time explaining that he tossed a coin. If the coin came up tails, he would have asked you to give him $1K, giving nothing in return. If the coin came up heads, he would have given you $1M - but only if you would have given him the $1K in the tails world. That last fact he would have known by predicting your decision. Now Omega approaches you, telling you the coin was tails - what should you do? Here is a Bayes net with this information:
I've removed the "History of the Universe" node, as we are screening it off anyway. Here "Simulated decision" and "Your decision" will output the same decision on the same input. Ω will behave the way he said, based on your simulated decision given tails. "Coin" will output heads or tails with 50% probability, and "Tails" simply outputs tails, for use in Ω's prediction.
Again, this graph is very elaborate, codifying all the problem's intricacies. But most of the nodes are superfluous for our decision, and the graph can be reduced to:
"Coin" outputs "heads" or "tails" and "Your decision algorithm" outputs "Give $1K on tails" or "Don't give $1K on tails". Money is $1M if it receives "heads" and "Give $1K on tails", -$1K if it receives "tails" and "Give $1K on tails", and zero if it receives "Don't give $1K on tails" (independent of the coin result).
If our utility does not go down too sharply in money, we should choose "Give $1K on tails", as a 50-50 bet on winning $1M and losing $1K is better than getting nothing with certainty. So precommitting to giving Omega $1K when he asks leads to the better outcome.
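The arithmetic behind this comparison can be sketched in a few lines of Python. This assumes utility is linear in money (the simplest case of "does not go down too sharply"), and the policy labels "give" and "refuse" are my shorthand for "Give $1K on tails" and "Don't give $1K on tails".

```python
P_HEADS = 0.5  # fair coin, as in the problem statement


def mugging_payoff(coin, policy):
    """The Money node of the reduced net, in dollars."""
    if policy == "refuse":
        return 0  # nothing happens, whatever the coin shows
    return 1_000_000 if coin == "heads" else -1_000


def expected_payoff(policy):
    """Expected money before the coin result is known (linear utility assumed)."""
    return (P_HEADS * mugging_payoff("heads", policy)
            + (1 - P_HEADS) * mugging_payoff("tails", policy))


if __name__ == "__main__":
    for policy in ("give", "refuse"):
        print(policy, expected_payoff(policy))
```

Under these assumptions "give" is worth $499,500 in expectation and "refuse" is worth $0, which is the sense in which precommitting wins.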
But now imagine that we are in the situation above: Omega has come to us and explained that yes, the coin has come up tails. The Bayes net now becomes:
In this case, the course is clear: "Give $1K on tails" does nothing but lose us $1K. So we should decide not to - and nowhere in this causal graph can we see any problem with that course of action.
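The reversal can be shown by running the same payoff table with two different probabilities for heads: 0.5 before hearing the result, and 0 after updating on "tails". This is only a sketch of the naive calculation, assuming utility linear in money; the names `payoff`, `expected_payoff` and `best_policy`, and the labels "give" and "refuse", are mine.

```python
POLICIES = ("give", "refuse")  # "Give $1K on tails" / "Don't give $1K on tails"


def payoff(coin, policy):
    """Money received, in dollars, for a coin result and a policy."""
    if policy == "refuse":
        return 0
    return 1_000_000 if coin == "heads" else -1_000


def expected_payoff(policy, p_heads):
    """Expected money given the agent's current probability of heads."""
    return p_heads * payoff("heads", policy) + (1 - p_heads) * payoff("tails", policy)


def best_policy(p_heads):
    """Policy with the highest expected money under the given belief."""
    return max(POLICIES, key=lambda policy: expected_payoff(policy, p_heads))


if __name__ == "__main__":
    print("before update:", best_policy(0.5))
    print("after learning tails:", best_policy(0.0))
```

The preferred policy flips from "give" to "refuse" once the coin node is fixed at tails, which is the inconsistency at issue.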
So it seems that naive TDT has an inconsistency problem. And these graphs don't seem to fully encode the actual problem properly (i.e. that the action "Give $1K on tails" corresponds to situations where we truly believe that tails came up).
Thoughts on the problem
Some thoughts that occurred when formalising this problem:
- The problem really is with updating on information, vindicating the instincts behind updateless decision theory. The way you would have to behave conditional on seeing new information is different from how you want to behave after actually seeing that information.
- Naive TDT reaches different conclusions depending on whether Omega simulates you or predicts you. If you are unsure whether you are being simulated or not (but still care about the wealth of the non-simulated version), then TDT acts differently on updates. Being told "tails" doesn't actually confirm that the coin was tails: you might be the simulated version, being tested by Omega. Note that in this scenario, the simulated you is being lied to by the simulated Omega (the "real" coin need not have been tails), which might put the problem in a different perspective.
- The tools of TDT (Bayes nets cut at certain connections) feel inadequate. It's tricky to even express the paradox properly in this language, and even more tricky to know what to do about it. A possible problem seems to be that we don't have a way of expressing our own knowledge about the model, within the model - hence "tails" ends up being a fact about the universe, not a fact about our knowledge at the time. Maybe we need to make our map explicit in the territory, and get Bayes nets that go something like these: