I'm most fond of the precommitment argument. You say:
You could argue that you would have pre-committed to paying if you had known about the situation ahead of time. True, but you didn't pre-commit and you didn't know about it ahead of time, so the burden is on you to justify why you should act as though you did. In Newcomb's problem you want to have pre-committed, and if you act as though you were pre-committed then you will find that you actually were pre-committed. However, here it is the opposite. Upon discovering that the coin came up tails, you want to act as though you were not pre-committed to pay, and if you act that way, you will find that you actually were indeed not pre-committed.
I do not think this gets at the heart of the precommitment argument. You mention cousin_it's argument that what we care about is what decision theory we'd prefer a benevolent AI to use. You grant that this makes sense for that case, but you seem skeptical that the same reasoning applies to humans. I argue that it does.
When reasoning abstractly about decision-making, I am (in part) thinking about how I would like myself to make decisions in the future. So it makes sense for me to say to myself, "Ah, I'd want to be counterfactually mugged." I will count being-counterfactually-mugged as a point in favor of proposed ways of thinking about decisions; I will count not-being-mugged as a point against. This is not, in itself, a precommitment; this is just a heuristic about good and bad reasoning as it seems to me when thinking about it ahead of time. A generalization of this heuristic is, "Ah, it seems any case where a decision procedure would prefer to make a commitment ahead of time but would prefer to do something different in the moment is a point against that decision procedure". I will, thinking about decision-making in the abstract as things seem to me now, tend to prefer decision procedures which avoid such self-contradictions.
In other words, thinking about what constitutes good decision-making in the abstract seems a whole lot like thinking about how we would want a benevolent AI to make decisions.
You could argue that I might think such things now, and might think up all sorts of sophisticated arguments which fit that picture, but later, when Omega asks me for $100, if I re-think my decision-theoretic concepts at that time, I'll know better.
But, based on what principles would I be reconsidering? I can think of some. It seems to me now, though, that those principles are mistaken, and I should instead reason using principles which are more self-consistent -- principles which, when faced with the question of whether to give Omega $100, arrive at the same answer I currently think to be right.
Of course this cannot be a general argument that I prefer to reason by principles which will arrive at conclusions consistent with my current beliefs. What I can do is consider the impact which particular ways of reasoning about decisions have on my overall expected utility (assuming I start out reasoning with some version of expected utility theory). Doing so, I will prefer UDT-like ways of reasoning when it comes to problems like counterfactual mugging.
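To make the expected-utility comparison concrete, here is a toy calculation for counterfactual mugging. The payoffs are assumptions for illustration: a $1000 prize on heads (the figure mentioned later in this discussion) and a $100 payment on tails, with a fair coin. It shows exactly the divergence at issue: evaluated from before the coin flip, the paying policy wins; evaluated only after learning the coin came up tails, paying looks like a pure loss.

```python
# Toy EV calculation for counterfactual mugging.
# Assumed payoffs: Omega pays PRIZE on heads iff you would pay COST on tails.
PRIZE, COST, P_HEADS = 1000, 100, 0.5

def ev_before_flip(pays_on_tails: bool) -> float:
    """Expected value of a policy, evaluated before the coin is flipped."""
    heads_payout = PRIZE if pays_on_tails else 0  # reward depends on your policy
    tails_payout = -COST if pays_on_tails else 0
    return P_HEADS * heads_payout + (1 - P_HEADS) * tails_payout

def ev_after_tails(pays_now: bool) -> float:
    """Expected value evaluated only after learning the coin came up tails."""
    return -COST if pays_now else 0

# Ex ante, the paying policy is better; ex post, paying looks strictly worse.
assert ev_before_flip(True) > ev_before_flip(False)   # 450 > 0
assert ev_after_tails(True) < ev_after_tails(False)   # -100 < 0
```

The self-contradiction described above is just the fact that these two evaluations disagree about the same action.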
You might argue that beliefs are for true things, so I can't legitimately discount ways-of-thinking just because they have bad consequences. But, these are ways-of-thinking-about-decisions. The point of ways-of-thinking-about-decisions is winning. And, as I think about it now, it seems preferable to think about it in those ways which reliably achieve higher expected utility (the expectation being taken from my perspective now).
Nor is this a quirk of my personal psychology, that I happen to find these arguments compelling in my current mental state, and so, when thinking about how to reason, prefer methods of reasoning which are more consistent with precommitments I would make. Rather, this seems like a fairly general fact about thinking beings who approach decision-making in a roughly expected-utility-like manner.
Perhaps you would argue, like the CDTer sometimes does in response to Newcomb, that you cannot modify your approach to reasoning about decisions so radically. You see that, from your perspective now, it would be better if you reasoned in a way which made you accept future counterfactual muggings. You'd see, in the future, that you are making a choice inconsistent with your preferences now. But this only means that you have different preferences then and now. And anyway, the question of decision theory should be what to do given preferences, right?
You can take that perspective, but it seems you must do so regretfully -- you should wish you could self-modify in that way. Furthermore, to the extent that a theory of preferences sits in the context of a theory of rational agency, it seems like preferences should be the kind of thing that tends to stay the same over time, not the sort of thing that changes like this.
Basically, it seems that assuming preferences remain fixed, beliefs about what you should do given those preferences and certain information should not change (except due to bounded rationality). I.e., certainly I may think I should go to the grocery store but then change my mind when I learn it's closed. But I should not start out thinking that I should go to the grocery store even in the hypothetical where it's closed, and then, upon learning it's closed, go home instead. (Except due to bounded rationality.) That's what is happening with CDT in counterfactual mugging: it prefers that its future self should, if asked for $100, hand it over; but, when faced with the situation, it thinks it should not hand it over.
The CDTer response ("alas, I cannot change my own nature so radically") presumes that we have already figured out how to reason about decisions. I imagine that the real crux behind such a response is actually that CDT feels like the true answer, so that the non-CDT answer does not seem compelling even once it is established to have a higher expected value. The CDTer feels as if they'd have to lie to themselves to 1-box. The truth is that they could modify themselves so easily, if they thought the non-CDT answer was right! They protest that Newcomb's problem simply punishes rationality. But this argument presumes that CDT defines rationality.
An EDT agent who asks how best to act in future situations to maximize expected value in those situations will arrive back at EDT, since expected-value-in-the-situation is the very criterion which EDT already uses. However, this is a circular way of thinking -- we can make a variant of that kind of argument which justifies any decision procedure.
A CDT or EDT agent who asks itself how best to act in future situations to maximize expected value as estimated by its current self will arrive at UDT. Furthermore, that's the criterion it seems an agent ought to use when weighing the pros and cons of a decision theory; not the expected value according to some future hypothetical, but the expected value of switching to that decision theory now.
And, remember, it's not the case that we will switch back to CDT/EDT if we reconsider which decision theory is highest-expected-utility when we are later faced with Omega asking for $100. We'd be a UDT agent at that point, and so, would consider handing over the $100 to be the highest-EV action.
I expect another protest at this point -- that the question of which decision theory gets us the highest expected utility by our current estimation isn't the same as which one is true or right. To this I respond that, if we ask what highly capable agents would do ("highly intelligent"/"highly rational"), we would expect them to be counterfactually mugged -- because highly capable agents would (by the assumption of their high capability) self-modify if necessary in order to behave in the ways they would have precommitted to behave. So, this kind of decision theory / rationality seems like the kind you'd want to study to better understand the behavior of highly capable agents; and, the kind you would want to imitate if trying to become highly capable. This seems like an interesting enough thing to study. If there is some other thing, "the right decision theory", to study, I'm curious what that other thing is -- but it does not seem likely to make me lose interest in this thing (the normative theory I currently call decision theory, in which it's right to be counterfactually mugged).
a) it's possible that a counterfactual mugging situation could have been set up before an AI was built
My perspective now already includes some amount of updateless reasoning, so I don't necessarily find that compelling. However, I do agree that even according to UDT there's a subjective question of how much information should be incorporated into the prior. So, for example, it seems sensible to refuse counterfactual mugging on the first digit of pi.
Or maybe you just directly care about counterfactual selves? But why? Do you really believe that counterfactuals are in the territory and not the map?
It seems worth pointing out that we might deal with this via anthropic reasoning. We don't need to believe that the counterfactual selves literally exist; rather, we are unsure whether we are being simulated. If we are being simulated, then the other self (in a position to get $1000) really does exist.
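That anthropic calculation can be sketched numerically. The setup is an assumption for illustration: on heads, Omega runs one simulation of you being asked for $100 in order to predict your answer, so that conditional on being asked, you assign probability 0.5 to being the simulation. Payoffs are the same assumed $1000 prize and $100 cost as before.

```python
# Anthropic EV of paying, conditional on finding yourself asked for $100.
# Assumption: fair coin, one simulation on heads, so P(simulation | asked) = 0.5.
PRIZE, COST = 1000, 100

def ev_of_paying(p_sim: float) -> float:
    """Expected value of handing over $100, given P(I am the simulation) = p_sim."""
    # If this is the simulation, paying causes the real you to receive PRIZE.
    # If this is the real tails branch, paying simply costs COST.
    return p_sim * PRIZE + (1 - p_sim) * (-COST)

assert ev_of_paying(0.5) > 0  # paying looks good even "in the moment"
```

Under this uncertainty, paying is positive-EV even from the in-the-moment perspective, without needing the counterfactual self to literally exist.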
Caveat:
There are a few hedge-words and qualifiers in the above which the casual reader might underestimate the importance of. For example, when I say
(except due to bounded rationality)
I really mean that many parts of the argument I'm making crumble to dust in the face of bounded rationality, not that bounded rationality is a small issue which I set aside for convenience in the argument above. Keep in mind that I've recently been arguing against UDT. However, I do still think it is right to be counterfactually mugged, for something resembling the reasons I gave. It's just that many details of the argument I'm making really don't work for embedded agents -- to such a large extent that I've become pessimistic about UDT-like ideas.
I ultimately don't see much of a distinction between humans and AIs, but let me clarify. If we had the ability to perfectly pre-commit, then the pre-commitments we'd make would effectively be the same as an AI self-modifying. Without this ability, the argument is slightly harder to make, but I think it still applies. I've attempted to make it in the past, although I don't really feel I completely succeeded.