Naive TDT, Bayes nets, and counterfactual mugging


For quite some time now, I have idealized TDT as something that would decide whatever CDT would wish to be able to force itself to decide, if it were warned about that problem in advance. "If I were warned in advance, what would I promise myself to do?"

In the Newcomb problem, CDT would wish to be able to force itself to 1-box, knowing that it would force Omega to give the million.

In the "gimme $100 when we arrive in town or I leave you in this desert to die" problem, CDT would wish to be able to force itself to give the $100, lest it die of thirst.

In the counterfactual mugging, CDT would wish to be able to force itself to give the $1000, as such an advantageous bet would maximize utility.

Now I don't know…

- if this can be easily formalized,
- if this is reflectively consistent,
- if this is actually rational.

I know nothing about the math behind that, but my instincts tell me "yes" to all of the above (which is why I do give Omega the hundred bucks). Surely someone thought of my idealization, then hit some snag?

You want to get counterfactually mugged by Omega if and only if you have very high confidence in your model of how Omega works.

It doesn't feel rational because our brains have very good intuitions about how people work. If a person tried a counterfactual mugging on you, you'd want to refuse. Odds are that sort of proposition out of the blue is merely an attempt to part you from your money, and there'd be no million dollars waiting for you had the coin turned up heads.

Obvious problem with this refinement: suppose you win 3^^^3 units of utility on a heads. You can't be certain enough that it's a scam to reject scam attempts. Actually, never mind, you can scale up your credulity at higher and higher reward amounts - after all, there are many more scammers willing to say 3^^^3 than there are entities that can give you 3^^^3 units of utility.

Yes.

Yes.

I didn't intend to solve Pascal's Mugging. Extremely high pay-off with slightly less extreme odds still doesn't feel right. But I have the beginning of a solution: the way the story is told (gimme $5 or I torture 3^^^^^3 people for 9^^^^^^^^9 eons), it doesn't ask you to bet *utility* on unfathomable odds, but *resources*. As we have limited resources, the drop in utility is probably sharper than we might think as we give resources up. For instance, would you bet $1000 for $1M at 1/2 odds if it were your last? If having no money causes you to starve, suddenly we're not talking 500 fold return on investment (on average), but betting your very life at 1/2 odds for $1M. And your life is worth 5.2 more than that¹ :-). It doesn't solve everything (replacing "gimme $5" by "kill your light cone" still is a problem), but I'd say it's a start: if there *is* a Lord Outside the Matrix, it means you have a chance to get out of the box and take over, with possibly an even greater pay-off than the original threat. But for that, you probably need your resources.

[1] can't find the study, and I hesitate between 5.2 and 5.8. On *average*, of course.

In TDT, as I understand it, we sever your decision node from the history of the universe, and then pick the action that maximises our utility:

This looks wrong; in the picture, you sever the *"your decision algorithm"* node. Indeed, I think that's the 'difference' between naive TDT and naive CDT- naive CDT supposedly severs "your decision" whereas TDT makes the improvement of severing a step sooner, so it recognizes that it can cause Omega (like a CDTer with a correct map of the problem).

In TDT we don't do any severance! Nothing is uncaused, not our decision, nor our decision algorithm either. Trying to do causal severance is a basic root of paradoxes because *things are not uncaused in real life*. What we do rather is *condition on* the start state of our program, thereby *screening off* the universe (*not* unlawfully severing it), and *factor out* our uncertainty about the logical output of the program given its input. Since *in real life* most things we do to the universe should not change this logical fact, nor will observing this logical fact tell us which non-impossible possible world we are living in, it *shouldn't* give us any news about the nodes above, once we've screened off the algorithm. It does, however, give us logical news about Omega's output, and of course about which boxes we'll end up with.

My reading of this is that you use influence diagrams, not Bayes nets; you think of your decision as influenced by things preceding it, but not as an uncertainty node. Is that a fair reading, or am I missing something?

So, for this issue I would note that for the coinflip to influence the decision algorithm, there needs to be an arrow from the coinflip to the decision algorithm. Consider two situations:

Omega explains the counterfactual mugging deal, learns whether you would pay if the coin comes up tails, and then tells you how the coin came up.

Omega tells you how the coin comes up, explains the counterfactual mugging deal, and then learns whether you would pay if the coin comes up tails.

Those have *different Bayes nets* and so it can be entirely consistent for TDT to output different strategies in each.

I think pre-commitment is irrelevant if you assume Omega can actually read your decision algorithm and predict it deterministically. You want to be the type of person that would give Omega $1k if the fair coin toss didn't go your way; if you're not that type of person once the coin has been tossed, it's too late; and not being that type of person is only a good thing if it wasn't actually a fair coin.

somewhat realistic situation

Oh, you are being difficult, aren't you! :-P

Well, there is this interesting claim:

http://lesswrong.com/lw/1zw/newcombs_problem_happened_to_me/

You're guaranteed to lose if you precommit to giving Omega $1k, if this scenario ever comes up, yes.

However, a precommitment to giving Omega $1k if you lose, *and* gaining $1M if you win, has positive expected value, *and* it necessarily entails the precommitment to giving Omega $1k if you lose the coin toss. If you *just* precommit to giving Omega $1k (and also to refusing the $1M if he ever offers it), then yeah; that's pretty dumb.

I liked the article. It was accessible and it showed how various TDT-ish theories still run into problems at the level of "converting the world problem into math". However it felt as though the causal networks were constructed in a way that assumed away the standard disagreements about Newcomb's problem itself...

Specifically, it seems as though the first, second, and fourth, and maybe the last diagrams should have had a "Magic" node that points to "Your Decision" and which is neither determined by the history of the universe nor accessible to Omega. Philosophers who one box on Newcomb generally assert that what Omega is claimed to be doing is *impossible* in the general case.

Another interesting way of disrupting the standard causal interpretation is to posit that Omega's amazing prediction powers are a function of something like a Time Turner, so that your actual decision controls the money which produces a causal loop due to a backwards arrow from a future Omega who *observes* the monetary outcome and communicates to a past Omega who *controls* the outcome according to a rule-enforcing intent.

I understand that objections to Omega's reality don't necessarily address Newcomb in a way that speaks to the setup's value for AGI ethics, but it still seems worth footnoting that there is an open question of physical (or metaphysical?) fact that is being assumed for the sake of AGI ethics. If the assumption is false in reality, and an actual AGI was built that naively assumed it was true (as a sort of "philosophical bug"?), then that might have complicated and perhaps unpleasant real world outcomes.

Isn't temporal inconsistency just selfishness? That is, before you know whether the coin came up heads or tails, you care about both possible futures. But after you find out that you're in the tails' universe you stop caring about the heads' universe, because you're selfish. UDT acts differently, because it is selfless, in that it keeps the same importance weights over all conceivable worlds.

It makes perfect sense to me that a rational agent would want to restrict the choices of its future self. I want to make sure that future-me doesn't run off and do his own thing, screwing current-me over.

I set out to understand precisely why naive TDT (possibly) fails the counterfactual mugging problem. While doing this I ended up drawing a lot of Bayes nets, and seemed to gain some insight; I'll pass these on, in the hopes that they'll be useful. All errors are, of course, my own.

## The grand old man of decision theory: the Newcomb problem

First let's look at the problem that inspired all this research: the Newcomb problem. In this problem, a supremely-insightful-and-entirely-honest superbeing called Omega presents two boxes to you, and tells you that you can either choose box A only ("1-box"), or take box A and box B ("2-box"). Box B will always contain $1K (one thousand dollars). Omega has predicted what your decision will be, though, and if you decided to 1-box, he's put $1M (one million dollars) in box A; otherwise he's put nothing in it. The problem can be cast as a Bayes net with the following nodes:

Your decision algorithm (or your decision process) is the node that determines what you're going to decide. This leads to "Your decision" (1-box or 2-box) and Ω (puts $1M or zero in box A). These lead to the "Money" node, where you can end up with $1M+1K, $1M, $1K or $0 depending on the outputs of the other nodes. Note that the way the network is set up, you can never have $1M+1K or $0 (since "Ω" and "Your decision" are not independent). But it is the implied "possibility" of getting those two amounts that causes causal decision theory to 2-box in the Newcomb problem.
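As a rough illustration of the net just described, here is a minimal Python sketch. The function names and the encoding of each node as a plain function are my own assumptions, not part of any TDT formalism; the point is only to show that once "Your decision" and Ω both read from the same algorithm node, the $1M+1K and $0 outcomes cannot occur.

```python
# Hypothetical encoding of the Newcomb net; every name here is illustrative.
ALGORITHM_OUTPUTS = ("1-box", "2-box")

def your_decision(algorithm_output):
    # "Your decision" simply passes on what the algorithm outputs.
    return algorithm_output

def omega(algorithm_output):
    # Omega fills box A based on the same algorithm output (perfect prediction assumed).
    return 1_000_000 if algorithm_output == "1-box" else 0

def money(decision, box_a):
    # Box B always holds $1K, but you only collect it if you 2-box.
    box_b = 1_000 if decision == "2-box" else 0
    return box_a + box_b

# Outcomes reachable when both nodes depend on the shared algorithm node.
reachable = {a: money(your_decision(a), omega(a)) for a in ALGORITHM_OUTPUTS}

# Outcomes that appear if "Your decision" and Omega are (wrongly) treated as independent.
all_combinations = {
    (d, box_a): money(d, box_a)
    for d in ALGORITHM_OUTPUTS
    for box_a in (1_000_000, 0)
}
unreachable = sorted(set(all_combinations.values()) - set(reachable.values()))

print(reachable)
print(unreachable)
```

The `unreachable` list holds exactly the two amounts whose implied "possibility" tempts a causal decision theorist to 2-box.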

In TDT, as I understand it, you sever your decision algorithm node from the history of the universe (note this is incorrect, as explained here. In fact you condition on the start of your program, and screen out the history of the universe), and then pick the action that maximises our utility.

But note that the graph is needlessly complicated: "Your decision" and "Ω" are both superfluous nodes, that simply pass on their inputs to their outputs. Ignoring the "History of the Universe", we can reduce the net to a more compact (but less illuminating) form:

Here 1-box leads to $1M and 2-box leads to $1K. In this simplified version, the decision is obvious - maybe too obvious. The decision was entirely determined by the choice of how to lay out the Bayes net, and a causal decision theorist would disagree that the original "screened out" Bayes net was a valid encoding of the Newcomb problem.

## The counterfactual mugging

In the counterfactual mugging, Omega is back, this time explaining that he tossed a coin. If the coin came up tails, he would have asked you to give him $1K, giving nothing in return. If the coin came up heads, he would have given you $1M - but only if you would have given him the $1K in the tails world. That last fact he would have known by predicting your decision. Now Omega approaches you, telling you the coin was tails - what should you do? Here is a Bayes net with this information:

I've removed the "History of the Universe" node, as we are screening it off anyway. Here "Simulated decision" and "Your decision" will output the same decision on the same input. Ω will behave the way he said, based on your simulated decision given tails. "Coin" will output heads or tails with 50% probability, and "Tails" simply outputs tails, for use in Ω's prediction.

Again, this graph is very elaborate, codifying all the problem's intricacies. But most of the nodes are superfluous for our decision, and the graph can be reduced to:

"Coin" outputs "heads" or "tails" and "Your decision algorithm" outputs "Give $1K on tails" or "Don't give $1K on tails". Money is $1M if it receives "heads" and "Give $1K on tails", -$1K if it receives "tails" and "Give $1K on tails", and zero if it receives "Don't give $1K on tails" (independent of the coin results).

If our utility does not go down too sharply in money, we should choose "Give $1K on tails", as a 50-50 bet on winning $1M and losing $1K is better than getting nothing with certainty. So precommitting to giving Omega $1K when he asks leads to the better outcome.
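The arithmetic behind this comparison can be sketched in a few lines of Python. This assumes utility is linear in money (the simplest case of "does not go down too sharply"); the policy labels and function names are hypothetical.

```python
# Reduced counterfactual-mugging net, assuming utility is linear in money.
def money(coin, policy):
    # "dont_give" yields zero regardless of the coin.
    if policy == "dont_give":
        return 0
    # "give": $1M on heads, lose $1K on tails.
    return 1_000_000 if coin == "heads" else -1_000

def expected_money(policy, coin_dist):
    # Average the Money node over the distribution of the Coin node.
    return sum(p * money(coin, policy) for coin, p in coin_dist.items())

prior = {"heads": 0.5, "tails": 0.5}  # fair coin, before Omega tells us anything

ev_give = expected_money("give", prior)
ev_dont = expected_money("dont_give", prior)
print(ev_give, ev_dont)  # 499500.0 0.0
```

With a sufficiently concave utility function the comparison could flip, which is why the linearity assumption matters.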

But now imagine that we are in the situation above: Omega has come to us and explained that yes, the coin has come up tails. The Bayes net now becomes:

In this case, the course is clear: "Give $1K on tails" does nothing but lose us $1K. So we should decide not to - and nowhere in this causal graph can we see any problem with that course of action.
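To make the reversal concrete, here is the same kind of calculation after updating on "tails". As before, this is only a sketch under the assumption of utility linear in money, with hypothetical names.

```python
# The reduced net again, now with the Coin node fixed to "tails".
def money(coin, policy):
    if policy == "dont_give":
        return 0
    return 1_000_000 if coin == "heads" else -1_000

def expected_money(policy, coin_dist):
    return sum(p * money(coin, policy) for coin, p in coin_dist.items())

after_tails = {"tails": 1.0}  # Omega has told us the coin came up tails

ev_give_tails = expected_money("give", after_tails)
ev_dont_tails = expected_money("dont_give", after_tails)
print(ev_give_tails, ev_dont_tails)  # -1000.0 0.0
```

Nothing in the `money` function changed; only the distribution over the coin did, and that alone is enough to flip the preferred action.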

So it seems that naive TDT has an inconsistency problem. And these graphs don't seem to fully encode the actual problem properly (i.e. that the action "Give $1K on tails" corresponds to situations where we truly believe that tails came up).

## Thoughts on the problem

Some thoughts that occurred when formalising this problem: