# Ω 1

Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

Update: I believe that the Counterfactual Prisoner's Dilemma, which Cousin_it and I discovered independently, resolves this question.

The LessWrong Wiki defines Counterfactual Mugging as follows:

Omega appears and says that it has just tossed a fair coin, and given that the coin came up tails, it decided to ask you to give it $100. Whatever you do in this situation, nothing else will happen differently in reality as a result. Naturally you don't want to give up your $100. But Omega also tells you that if the coin came up heads instead of tails, it'd give you $10000, but only if you'd agree to give it $100 if the coin came up tails. Do you give Omega $100?

I expect that most people would say that you should pay because a 50% chance of $10000 for $100 is an amazing deal according to expected value. I lean this way too, but it is harder to justify than you might think.

After all, if you are being asked for $100, you know that the coin came up tails and you won't receive the $10000. Sure, this means that if the coin had come up heads then you wouldn't have gained the $10000, but you know the coin wasn't heads, so you don't lose anything. It's important to emphasise: this doesn't deny that if the coin had come up heads, a disposition to refuse would have made you miss out on $10000. Instead, it claims that this point is irrelevant, so merely repeating the point isn't a valid counter-argument.

You could argue that you would have pre-committed to paying if you had known about the situation ahead of time. True, but you didn't pre-commit and you didn't know about it ahead of time, so the burden is on you to justify why you should act as though you did. In Newcomb's problem you want to have pre-committed, and if you act as though you were pre-committed then you will find that you actually were pre-committed. However, here it is the opposite. Upon discovering that the coin came up tails, you want to act as though you were not pre-committed to pay, and if you act that way, you will find that you were indeed not pre-committed.

We could even channel Yudkowsky from Newcomb's Problem and Regret of Rationality: "Rational agents should WIN... It is precisely the notion that Nature does not care about our algorithm, which frees us up to pursue the winning Way - without attachment to any particular ritual of cognition, apart from our belief that it wins. Every rule is up for grabs, except the rule of winning... Unreasonable? I am a rationalist: what do I care about being unreasonable? I don't have to conform to a particular ritual of cognition. I don't have to take only box B because I believe my choice affects the box, even though Omega has already left. I can just... take only box B." You can just not pay the $100. (Vladimir Nesov makes this exact same argument here.)
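To make the tension concrete, here is a minimal sketch of the two calculations in play: the expected value of a disposition judged before the coin is tossed, and the payoff of an action judged once you already know the coin came up tails. The function names are hypothetical and chosen only for illustration; the dollar amounts and the fair coin come from the problem statement.

```python
PRIZE = 10_000   # paid on heads, but only to agents who would pay on tails
COST = 100       # what Omega asks for on tails
P_HEADS = 0.5    # fair coin

def ex_ante_value(pays_on_tails: bool) -> float:
    """Expected payoff of a disposition, evaluated before the coin is tossed."""
    heads_payoff = PRIZE if pays_on_tails else 0
    tails_payoff = -COST if pays_on_tails else 0
    return P_HEADS * heads_payoff + (1 - P_HEADS) * tails_payoff

def value_given_tails(pays: bool) -> int:
    """Payoff of the action once you already know the coin came up tails."""
    return -COST if pays else 0

print(ex_ante_value(True), ex_ante_value(False))          # 4950.0 0.0
print(value_given_tails(True), value_given_tails(False))  # -100 0
```

Both sets of numbers are correct. The disagreement is about which of the two calculations should guide the choice when Omega is standing in front of you.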

Here's another common reason I've heard, as described by Cousin_it: "I usually just think about which decision theory we'd want to program into an AI which might get copied, its source code inspected, etc. That lets you get past the basic stuff, like Newcomb's Problem, and move on to more interesting things. Then you can see which intuitions can be transferred back to problems involving humans."

That's actually a very good point. It's entirely possible that solving this problem doesn't have any relevance to building AI. However, I want to note that: a) it's possible that a counterfactual mugging situation could have been set up before an AI was built b) understanding this could help deconfuse what a decision is - we still don't have a solution to logical counterfactuals c) this is probably a good exercise for learning to cut through philosophical confusion d) okay, I admit it, it's kind of cool and I'd want an answer regardless of any potential application.

Or maybe you just directly care about counterfactual selves? But why? Do you really believe that counterfactuals are in the territory and not the map? So why care about that which isn't real? Or even if they are real, why can't we just imagine that you are an agent that doesn't care about counterfactual selves? If we can imagine an agent that likes being hit on the head with a hammer, why can't we manage that?

Then there's the philosophical uncertainty approach. Even if there's only a 1/50 chance of your analysis being wrong, then you should pay. This is great if you face the decision in real life, but not if you are trying to delve into the nature of decisions.
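The post does not spell out where the 1/50 figure comes from. One possible reading, offered purely as an assumption, is that paying costs $100 for certain, while if the pro-paying analysis is correct then paying is worth the ex-ante heads-branch prize of 0.5 × $10000. Under that model (the function name and the model itself are illustrative, not the author's), the break-even credence works out as follows:

```python
from fractions import Fraction

PRIZE = 10_000
COST = 100
P_HEADS = Fraction(1, 2)

def value_of_paying(credence_pay_is_right: Fraction) -> Fraction:
    """Net value of paying under the ASSUMED model described above:
    benefit = credence * P_HEADS * PRIZE, cost = COST regardless."""
    return credence_pay_is_right * P_HEADS * PRIZE - COST

# Credence at which paying and refusing are equally good under this model.
break_even = Fraction(COST) / (P_HEADS * PRIZE)
print(break_even)  # 1/50
```

Under this reading, any credence above 1/50 that the pro-paying analysis is right makes paying the better bet. Other ways of modelling the uncertainty would give a different threshold.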

So given all of this, why should you pay?


abramdemski

### Dec 20, 2019

9

I'm most fond of the precommitment argument. You say:

You could argue that you would have pre-committed to paying if you had known about the situation ahead of time. True, but you didn't pre-commit and you didn't know about it ahead of time, so the burden is on you to justify why you should act as though you did. In Newcomb's problem you want to have pre-committed and if you act as though you were pre-committed then you will find that you actually were pre-committed. However, here it is the opposite. Upon discovering that the coin came up tails, you want to act as though you were not pre-committed to pay and if you act that way, you will find that you actually were indeed not pre-committed.

I do not think this gets at the heart of the precommitment argument. You mention cousin_it's argument that what we care about is what decision theory we'd prefer a benevolent AI to use. You grant that this makes sense for that case, but you seem skeptical that the same reasoning applies to humans. I argue that it does.

When reasoning abstractly about decision-making, I am (in part) thinking about how I would like myself to make decisions in the future. So it makes sense for me to say to myself, "Ah, I'd want to be counterfactually mugged." I will count being-counterfactually-mugged as a point in favor of proposed ways of thinking about decisions; I will count not-being-mugged as a point against. This is not, in itself, a precommitment; this is just a heuristic about good and bad reasoning as it seems to me when thinking about it ahead of time. A generalization of this heuristic is, "Ah, it seems any case where a decision procedure would prefer to make a commitment ahead of time but would prefer to do something different in the moment is a point against that decision procedure". I will, thinking about decision-making in the abstract as things seem to me now, tend to prefer decision procedures which avoid such self-contradictions.

In other words, thinking about what constitutes good decision-making in the abstract seems a whole lot like thinking about how we would want a benevolent AI to make decisions.

You could argue that I might think such things now, and might think up all sorts of sophisticated arguments which fit that picture, but later, when Omega asks me for $100, if I re-think my decision-theoretic concepts at that time, I'll know better. But, based on what principles would I be reconsidering? I can think of some. It seems to me now, though, that those principles are mistaken, and I should instead reason using principles which are more self-consistent -- principles which, when faced with the question of whether to give Omega $100, arrive at the same answer I currently think to be right.

Of course this cannot be a general argument that I prefer to reason by principles which will arrive at conclusions consistent with my current beliefs. What I can do is consider the impact which particular ways of reasoning about decisions have on my overall expected utility (assuming I start out reasoning with some version of expected utility theory). Doing so, I will prefer UDT-like ways of reasoning when it comes to problems like counterfactual mugging.
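As a rough illustration of what "considering the impact of a way of reasoning on overall expected utility" could look like, here is a toy sketch. It compares an agent that picks its disposition by expected utility under the prior with one that picks its action after conditioning on what it has seen. The names `updateless_choice`, `updateful_choice`, and `is_dynamically_consistent` are hypothetical labels for this example and do not come from any formal statement of UDT or CDT.

```python
PRIOR = {"heads": 0.5, "tails": 0.5}
ACTIONS = ("pay", "refuse")

def payoff(world: str, action_when_asked: str) -> int:
    """Payoff in `world` for an agent whose disposition, if asked, is `action_when_asked`."""
    pays = action_when_asked == "pay"
    if world == "heads":
        return 10_000 if pays else 0  # Omega rewards the disposition to pay
    return -100 if pays else 0        # tails: the agent is actually asked

def updateless_choice() -> str:
    """Pick the disposition with the highest expected utility under the prior."""
    return max(ACTIONS, key=lambda a: sum(p * payoff(w, a) for w, p in PRIOR.items()))

def updateful_choice(observed_world: str) -> str:
    """Pick the action with the highest payoff after conditioning on the observation."""
    return max(ACTIONS, key=lambda a: payoff(observed_world, a))

def is_dynamically_consistent() -> bool:
    """Does the in-the-moment choice match what the prior self would have wanted?"""
    return updateless_choice() == updateful_choice("tails")

print(updateless_choice(), updateful_choice("tails"), is_dynamically_consistent())
# pay refuse False
```

In this toy model the mismatch between the two procedures is the kind of self-contradiction that the heuristic above counts as a point against a decision procedure.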

You might argue that beliefs are for true things, so I can't legitimately discount ways-of-thinking just because they have bad consequences. But, these are ways-of-thinking-about-decisions. The point of ways-of-thinking-about-decisions is winning. And, as I think about it now, it seems preferable to think about it in those ways which reliably achieve higher expected utility (the expectation being taken from my perspective now).

Nor is this a quirk of my personal psychology, that I happen to find these arguments compelling in my current mental state, and so, when thinking about how to reason, prefer methods of reasoning which are more consistent with precommitments I would make. Rather, this seems like a fairly general fact about thinking beings who approach decision-making in a roughly expected-utility-like manner.

Perhaps you would argue, like the CDT-er sometimes does in response to Newcomb, that you cannot modify your approach to reasoning about decisions so radically. You see that, from your perspective now, it would be better if you reasoned in a way which made you accept future counterfactual muggings. You'd see, in the future, that you are making a choice inconsistent with your preferences now. But this only means that you have different preferences then and now. And anyway, the question of decision theory should be what to do given preferences, right?

You can take that perspective, but it seems you must do so regretfully -- you should wish you could self-modify in that way. Furthermore, to the extent that a theory of preferences sits in the context of a theory of rational agency, it seems like preferences should be the kind of thing which tends to stay the same over time, not the sort of thing which changes like this.

Basically, it seems that assuming preferences remain fixed, beliefs about what you should do given those preferences and certain information should not change (except due to bounded rationality). IE: certainly I may think I should go to the grocery store but then change my mind when I learn it's closed. But I should not start out thinking that I should go to the grocery store even in the hypothetical where it's closed, and then, upon learning it's closed, go home instead. (Except due to bounded rationality.) That's what is happening with CDT in counterfactual mugging: it prefers that its future self should, if asked for $100, hand it over; but, when faced with the situation, it thinks it should not hand it over.

The CDTer response ("alas, I cannot change my own nature so radically") presumes that we have already figured out how to reason about decisions. I imagine that the real crux behind such a response is actually that CDT feels like the true answer, so that the non-CDT answer does not seem compelling even once it is established to have a higher expected value. The CDTer feels as if they'd have to lie to themselves to 1-box. The truth is that they could modify themselves so easily, if they thought the non-CDT answer was right! They protest that Newcomb's problem simply punishes rationality. But this argument presumes that CDT defines rationality.

An EDT agent who asks how best to act in future situations to maximize expected value in those situations will arrive back at EDT, since expected-value-in-the-situation is the very criterion which EDT already uses. However, this is a circular way of thinking -- we can make a variant of that kind of argument which justifies any decision procedure.

A CDT or EDT agent who asks itself how best to act in future situations to maximize expected value as estimated by its current self will arrive at UDT. Furthermore, that's the criterion it seems an agent ought to use when weighing the pros and cons of a decision theory; not the expected value according to some future hypothetical, but the expected value of switching to that decision theory now.

And, remember, it's not the case that we will switch back to CDT/EDT if we reconsider which decision theory is highest-expected-utility when we are later faced with Omega asking for $100. We'd be a UDT agent at that point, and so, would consider handing over the $100 to be the highest-EV action.

I expect another protest at this point -- that the question of which decision theory gets us the highest expected utility by our current estimation isn't the same as which one is true or right. To this I respond that, if we ask what highly capable agents would do ("highly intelligent"/"highly rational"), we would expect them to be counterfactually mugged -- because highly capable agents would (by the assumption of their high capability) self-modify if necessary in order to behave in the ways they would have precommitted to behave. So, this kind of decision theory / rationality seems like the kind you'd want to study to better understand the behavior of highly capable agents; and, the kind you would want to imitate if trying to become highly capable. This seems like an interesting enough thing to study. If there is some other thing, "the right decision theory", to study, I'm curious what that other thing is -- but it does not seem likely to make me lose interest in this thing (the normative theory I currently call decision theory, in which it's right to be counterfactually mugged).

a) it's possible that a counterfactual mugging situation could have been set up before an AI was built

My perspective now already includes some amount of updateless reasoning, so I don't necessarily find that compelling. However, I do agree that even according to UDT there's a subjective question of how much information should be incorporated into the prior. So, for example, it seems sensible to refuse counterfactual mugging on the first digit of pi.

Or maybe you just directly care about counterfactual selves? But why? Do you really believe that counterfactuals are in the territory and not the map?

It seems worth pointing out that we might deal with this via anthropic reasoning. We don't need to believe that the counterfactual selves literally exist; rather, we are unsure whether we are being simulated. If we are being simulated, then the other self (in a position to get $10000) really does exist.

#### Caveat

There are a few hedge-words and qualifiers in the above which the casual reader might underestimate the importance of. For example, when I say

(except due to bounded rationality)

I really mean that many parts of the argument I'm making crumble to dust in the face of bounded rationality, not that bounded rationality is a small issue which I set aside for convenience in the argument above. Keep in mind that I've recently been arguing against UDT. However, I do still think it is right to be counterfactually mugged, for something resembling the reasons I gave. It's just that many details of the argument I'm making really don't work for embedded agents -- to such a large extent that I've become pessimistic about UDT-like ideas.

shminux

### Dec 18, 2019

1

I find that the "you should pay" answer is confused and self-contradictory in its reasoning. As in all OO (Omniscient Omega) setups, you, the subject, have no freedom of choice as far as OO is concerned; you are just another deterministic automaton. So any "decision" you make to precommit to a certain action has already been predicted (or could have been predicted) by OO, including any influence exerted on your thought process by other people telling you about rationality and precommitment. To make it clearer, anyone telling you to one-box in Newcomb's problem in effect uses classical CDT (which advises two-boxing), because they assume that you have the freedom to make a decision in a setup where your decisions are predetermined. If that were so, two-boxing would make more sense, defying the OO infallibility assumption.

So, the whole reasoning advocating for one-boxing and for paying the mugger does not hold up to basic scrutiny. A self-consistent answer would be "you are a deterministic automaton, whatever you feel or think or pretend to decide is an artifact of the algorithm that runs you, so the question whether to pay is meaningless, you either will pay or will not, you have no control over it."

Of course, this argument only applies to OO setups. In "reality" there are no OOs that we know of, the freedom of choice debate is far from resolved, and if one assumes that we are not automatons whose actions are set in stone (or in the rules of quantum mechanics), then learning to make better decisions is not a futile exercise. One example is the twin prisoner's dilemma, where the recommendation to cooperate with one's twin is self-consistent.