Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

(Cross-posted from Hands and Cities. Lots of stuff familiar to LessWrong folks interested in decision theory.)

I think that you can “control” events you have no causal interaction with, including events in the past, and that this is a wild and disorienting fact, with uncertain but possibly significant implications. This post attempts to impart such disorientation.

My main example is a prisoner’s dilemma between perfect deterministic software twins, exposed to the exact same inputs. This example shows, I think, that you can write on whiteboards light-years away, with no delays; you can move the arm of another person, in another room, just by moving your own. This, I claim, is extremely weird.

My topic, more broadly, is the implications of this weirdness for the theory of instrumental rationality (“decision theory”). Many philosophers, and many parts of common sense, favor causal decision theory (CDT), on which, roughly, you should pick the action that causes the best outcomes in expectation. I think that deterministic twins, along with other examples, show that CDT is wrong. And I don’t think that uncertainty about “who are you,” or “where your algorithm is,” can save it.

Granted that CDT is wrong, though, I’m not sure what’s right. The most famous alternative is evidential decision theory (EDT), on which, roughly, you should choose the action you would be happiest to learn you had chosen. I think that EDT is more attractive (and more confusing) than many philosophers give it credit for, and that some putative counterexamples don’t withstand scrutiny. But EDT has problems, too.

In particular, I suspect that attractive versions of EDT (and perhaps, attractive attempts to recapture the spirit of CDT) require something in the vicinity of “following the policy that you would’ve wanted yourself to commit to, from some epistemic position that ‘forgets’ information you now know.” I don’t think that the most immediate objection to this – namely, that it implies choosing lower pay-offs even when you know them with certainty – is decisive (though some debates in this vicinity seem to me verbal). But it also seems extremely unclear what epistemic position you should evaluate policies from, and what policy such a position actually implies.

Overall, rejecting the common-sense comforts of CDT, and accepting the possibility of some kind of “acausal control,” leaves us in strange and uncertain territory. I think we should do it anyway. But we should also tread carefully.

I. Grandpappy Omega

Decision theorists often assume that instrumental rationality is about maximizing expected utility in some sense. The question is: what sense? 

The most famous debate is between CDT and EDT. CDT chooses the action that will have the best effects. EDT chooses the action whose performance would be the best news.

More specifically: CDT and EDT disagree about the type of “if” to use when evaluating the utility to expect, if you do X. CDT uses a counterfactual type of “if” – one that holds fixed the probability of everything outside of action X’s causal influence, then plays out the consequences of doing X. In this sense, it doesn’t allow your choice to serve as “evidence” about anything you can’t cause – even when your choice is such evidence.

EDT, by contrast, uses a conditional “if.” That is, to evaluate X, it updates your overall picture of the world to reflect the assumption that action X has been performed, and then sees how good the world looks in expectation. In this sense, it takes all the evidence into account, including the evidence that your having done X would provide.

To see what this difference looks like in action, consider:

Newcomb’s problem: You face two boxes: a transparent box, containing a thousand dollars, and an opaque box, which contains either a million dollars, or nothing. You can take (a) only the opaque box (one-boxing), or (b) both boxes (two-boxing). Yesterday, Omega — a superintelligent AI — put a million dollars in the opaque box if she predicted you’d one-box, and nothing if she predicted you’d two-box. Omega’s predictions are almost always right.

CDT two-boxes. Your choice, after all, is evidence about what’s in the opaque box, but it doesn’t actually affect what’s in the box – by the time you’re choosing, the opaque box is either already empty, or already full. So CDT assigns some probability p to the box being full, and then holds that probability fixed in evaluating different actions. Let’s say p is 1%. CDT’s expected payoffs are then:

  • One-boxing: 1% probability of $1M, 99% probability of nothing = $10K.
  • Two-boxing: 1% probability of $1M + $1K, 99% probability of $1K = $11K.

Note that there’s some ambiguity, here, about whether CDT then updates p based on its knowledge that it’s about to two-box, then recalculates the expected utilities, and only goes forward if it finds equilibrium. And in some problems, this sort of recalculation makes CDT’s decision-making unstable — see e.g. Gibbard and Harper’s (1978) “Death in Damascus.” But in Newcomb’s problem, no matter what p you use, CDT always says that two-boxing is $1K better, and so two-boxes regardless of what it thinks Omega did, or what evidence its own plans provide.
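
As a rough sketch of that last point (not from the original post; the function and variable names are hypothetical), the CDT calculation can be written out for any value of p. Because the same p is used for both actions, the gap between them never moves:

```python
# Sketch of the CDT calculation in Newcomb's problem.
# All names here are illustrative assumptions, not part of the post.

MILLION = 1_000_000  # possible contents of the opaque box
THOUSAND = 1_000     # contents of the transparent box


def cdt_expected_payoffs(p_full):
    """Return (one_box, two_box) expected dollars, holding p_full fixed.

    p_full is CDT's credence that the opaque box is already full. CDT uses
    the same p_full for both actions, since neither action can cause the
    contents of the box to change.
    """
    one_box = p_full * MILLION
    two_box = p_full * MILLION + THOUSAND
    return one_box, two_box


for p_full in (0.0, 0.01, 0.5, 1.0):
    one_box, two_box = cdt_expected_payoffs(p_full)
    print(f"p={p_full:.2f}  one-box=${one_box:,.0f}  two-box=${two_box:,.0f}")
```

With p at 1%, this gives the $10K and $11K figures above; at every other p, two-boxing stays exactly $1K ahead.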

EDT, by contrast, one-boxes. Learning that you one-boxed, after all, is the better news: it means that Omega probably put a million in the opaque box. More specifically, in comparing one-boxing with two-boxing, EDT changes the probability that the box is full. Why? Because, well, the probability is different, conditional on one-boxing vs. two-boxing. Thus, EDT’s pay-offs are: 

  • One-boxing: ~100% chance of $1M = ~$1M.
  • Two-boxing: ~100% chance of $1K = ~$1K.
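
A minimal sketch of the EDT calculation, under a simplifying assumption that is mine rather than the post’s: Omega’s prediction matches your actual choice with a single probability, here called `accuracy`, whichever choice you make. The function name is hypothetical.

```python
# Sketch of the EDT calculation in Newcomb's problem.
# Assumption: Omega is right with the same probability for either choice.

MILLION = 1_000_000
THOUSAND = 1_000


def edt_expected_payoffs(accuracy):
    """Return (one_box, two_box) expected dollars, conditioning on the action."""
    # Conditional on one-boxing, the box is full if Omega predicted correctly.
    one_box = accuracy * MILLION
    # Conditional on two-boxing, the box is full only if Omega got it wrong.
    two_box = (1 - accuracy) * MILLION + THOUSAND
    return one_box, two_box


for accuracy in (0.99, 0.8):
    one_box, two_box = edt_expected_payoffs(accuracy)
    print(f"accuracy={accuracy:.2f}  one-box=${one_box:,.0f}  two-box=${two_box:,.0f}")
```

Unlike the CDT calculation, the probability that the box is full differs between the two lines, which is the whole disagreement. Even at 80% accuracy, one-boxing comes out at roughly $800K against roughly $201K.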

What’s the right choice? I think: one-boxing, and I’ll say much more about why below. But I feel the pull towards two-boxing, for CDT-ish reasons.

Imagine, for example, that you have a friend who can see what’s in the opaque box (see Drescher (2006) for this framing). You ask them: what choice will leave me richer? They start to answer. But wait: did you even need to ask? Whether the opaque box is empty or full, you know what they’re going to say. Every single time, the answer will be: two-boxing, dumbo. Omega, after all, is gone; the box’s contents are fixed; the past is past. The question now is simply whether you want an extra $1,000, or not. 

I find that my two-boxing intuition strengthens if Omega is your great grandfather, long dead (h/t Amanda Askell for suggesting this framing to me years ago), and if we specify that he’s merely a “pretty good” predictor; one who is right, say, 80% of the time (EDT still says to one-box, in this case). Suppose that he left the boxes in the attic of your family estate, for you to open on your 18th birthday. At the appointed time, you climb the dusty staircase; you brush the cobwebs off the antique boxes; you see the thousand through the glass. Are you really supposed to just leave it there, sitting in the attic? What sort of rationality is that?

Sometimes, one-boxers object: if two-boxers are so rational, why do the one-boxers end up so much richer? But two-boxers can answer: because Omega has chosen to give better options to agents who will choose irrationally. Two-boxers make the best of a worse situation: they almost always face a choice between nothing or $1K, and they, rationally, choose $1K. One-boxers, by contrast, make the worse of a better situation: they almost always face a choice between $1M or $1M+$1K, and they, irrationally, choose $1M.

But wouldn’t a two-boxer want to modify themselves, ahead of Omega’s prediction, to become a one-boxer? Depending on the modification and the circumstances: yes. But depending on the modification and the circumstances, it can be rational to self-modify into any old thing — especially if rich and powerful superintelligences are going around rewarding irrationality. If Omega will give you millions if you believe that Paris is in Ohio, self-modifying to make such a mistake might be worth it; but the Eiffel Tower stays put. At the very least, then, arguments from incentives towards self-modification require more specificity. (Though we might try to provide this specificity, by focusing on self-modifications whose advantages are sufficiently robust, and/or on a restricted class of cases that we deem “fair.”)

CDT’s arguments and replies to objections here are simple, flat-footed, and I think, quite strong. Indeed, many philosophers are convinced by something in the vicinity (see e.g. the 2009 PhilPapers survey, in which two-boxing, at 31%, beats one-boxing, at 21%, with the other 47% answering “other” – though we might wonder what “other” amounts to in a case with only two options). And more broadly, I think that relative to EDT at least, CDT fits better with a certain kind of common sense. Action, we think, isn’t about manipulating our evidence about what’s already the case – what David Lewis calls “managing the news.” Rather, action is about causing stuff. In this sense, CDT feels to me like a basic and hard-headed default. In my head, it’s the “man on the street’s” decision theory. It’s not trying to get “too fancy.” It can feel like solid ground.

II. Writing on whiteboards light-years away

Nevertheless, I think that CDT is wrong. Here’s the case that convinces me most.

Perfect deterministic twin prisoner’s dilemma: You’re a deterministic AI system, who only wants money for yourself (you don’t care about copies of yourself). The authorities make a perfect copy of you, separate you and your copy by a large distance, and then expose you both, in simulation, to exactly identical inputs (let’s say, a room, a whiteboard, some markers, etc). You both face the following choice: either (a) send a million dollars to the other (“cooperate”), or (b) take a thousand dollars for yourself (“defect”). 

(Prisoner’s dilemmas, with varying degrees of similarity between the participants, are common in the decision theory literature: see e.g. Lewis (1979), and Hofstadter (1985)).

CDT, in this case, defects. After all, your choice can’t causally influence your copy’s choice: you’re in your room, and he’s in his, far away. Indeed, we can specify that such influence is physically impossible – by the time information about your choice, traveling at the speed of light, can reach him, he’ll have already chosen (and vice versa). And regardless of what he chooses, you get more money by taking the thousand.

But defecting in this case, I claim, is totally crazy. Why? Because absent some kind of computer malfunction, both of you will make the same choice, as a matter of logical necessity. If you press the defect button, so will he; if you cooperate, so will he. The two of you, after all, are exact mirror images. You move in unison; you speak, and think, and reach for buttons, in perfect synchrony. Watching the two of you is like watching the same movie on two screens.

Indeed, for all intents and purposes, you control what he does. Imagine, for example, that you want to get something written on his whiteboard: let’s say, the words “I am the egg man; you are the walrus.” What to do? Just write it on your own whiteboard. Go ahead, try it. It will really work. When you two rendezvous after this is all over, his whiteboard will bear the words you chose. In this sense, your whiteboard is a strange kind of portal; a slate via which you can etch your choices into his far-away world; a chance to act, spookily, at a distance.

And it’s not just whiteboards: you can make him do whatever you want – dance a silly samba, bang his head against the wall, press the cooperate button — just by doing it yourself. He is your puppet. Invisible strings, more powerful and direct than any that operate via mere causality, tie every movement of your mind and body to his.

What’s more: such strings can’t be severed. Try, for example, to make the two whiteboards different. Imagine that you’ll get ten million dollars if you succeed. It doesn’t matter: you’ll fail. Your most whimsical impulse, your most intricate mental acrobatics, your special-est snowflake self, will never suffice: you can no more write “up” while he writes “down” than you can floss while the man in the bathroom mirror brushes his teeth. In this sense, if you find yourself reasoning about scenarios where he presses one button, and you press another – e.g., “even if he cooperates, it would be better for me to defect” – then you are misunderstanding your situation. Those scenarios just aren’t on the table. The available outcomes here are only defect-defect, and cooperate-cooperate. You can get a thousand, by defecting, or you can get a million, by cooperating; but you can’t get less, or more.
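
To make the “not on the table” point concrete, here is a toy sketch. It assumes that each twin can be modeled as a deterministic Python function of its inputs; the policies and names are made up for illustration. Running any such function twice on identical inputs can only land on the diagonal of the payoff table:

```python
# Toy model of the perfect deterministic twin prisoner's dilemma.
# Assumption: a "twin" is a deterministic function from inputs to a move.
from itertools import product

MILLION = 1_000_000
THOUSAND = 1_000


def my_payoff(my_move, twin_move):
    """Dollars I receive: $1K if I defect, plus $1M if my twin cooperates."""
    kept = THOUSAND if my_move == "defect" else 0
    received = MILLION if twin_move == "cooperate" else 0
    return kept + received


def run_twins(policy, inputs):
    """Run one deterministic policy twice on identical inputs."""
    return policy(inputs), policy(inputs)


def always_cooperate(inputs):
    return "cooperate"


def always_defect(inputs):
    return "defect"


def defect_if_whiteboard(inputs):
    return "defect" if "whiteboard" in inputs else "cooperate"


inputs = ("room", "whiteboard", "markers")
policies = (always_cooperate, always_defect, defect_if_whiteboard)

# All four cells of the payoff table, including the two off-diagonal ones.
all_cells = set(product(("cooperate", "defect"), repeat=2))
# Cells that these deterministic policies actually produce.
reachable = {run_twins(policy, inputs) for policy in policies}

for me, twin in sorted(reachable):
    print(f"{me}/{twin}: I get ${my_payoff(me, twin):,}")
print("never reached:", sorted(all_cells - reachable))
```

The off-diagonal cells still have well-defined payoffs in the table; no choice of policy ever gets you into them.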

To me, it’s an extremely easy choice. Just press the “give myself a million dollars” button! Indeed, at this point, if someone tells me “I defect on a perfect, deterministic copy of myself, exposed to identical inputs,” I feel like: really? 

Note that this doesn’t seem like a case where any idiosyncratic predictors are going around rewarding irrationality. Nor, indeed, does it feel to me like “cooperating is an irrational choice, but it would be better for me to be the type of person who makes such a choice” or “You should pre-commit to cooperating ahead of time, however silly it will seem in the moment” (I’ll discuss cases that have more of this flavor later). Rather, it feels like what compels me is a direct, object-level argument, which could be made equally well before the copying or after. This argument recognizes a form of acausal “control” that our everyday notion of agency does not countenance, but which, pretty clearly, needs to be taken into account. Indeed, in effect, I feel like the case discovers a kind of magic; a mechanism for writing on whiteboards light-years away; a way of moving my copy’s hand to the cooperate button, or the defect button, just by moving mine. Ignoring this magic feels like ignoring a genuine and decision-relevant feature of the real world.

III. Who is the eggman, and who is the walrus?

I want to acknowledge and emphasize, though, that this kind of magic is extremely weird. Recognizing it, I think, involves a genuinely different way of understanding your situation, and your power. It makes your choices reverberate in new directions; it gives you a new type of control, over things you once thought beyond your sphere of influence – including, I’ll suggest, over events in the past (more on this below).

What’s more, I think, it changes – and clarifies – your sense of what your agency amounts to. Consider: who is the eggman, here, and who is the walrus? Suppose you want to send your copy a message: “hello, this is a message from your copy.” So you write it on your whiteboard, and thus on his. You step back, and see a message on your own whiteboard: “hello, this is a message from your copy.” Did he write that to you? Was that your way of writing to him? Are you actually alone, writing to yourself? All three at once. I said earlier that your copy is your puppet. But equally, you are his puppet. But more truly, neither of you are puppets. Rather, you are both free men, in a strange but actually possible situation. You stand in front of your whiteboard, and it is genuinely up to you what you write, or do. You can write “I am a little lollypop, booka booka boo.” You can draw a demon kitten eating a windmill. You can scream, and dance, and wave your arms around, however you damn well please. Feel the wind on your face, cowboy: this is liberty. And yet, he will do the same. And yet, you two will always move in unison.

We can think of the magic, here, as arising centrally because compatibilism about free will is true. Let’s say you got copied on Monday, and it’s Friday, now – the day both copies will choose. On Monday, there was already an answer as to what button you and your copy will press, given exposure to the Friday inputs. Maybe we haven’t computed the answer yet (or maybe we have); but regardless, it’s fixed: we just need to crunch the numbers, run the deterministic code. From this sort of pre-determination comes a classic argument against free will: if the past and the physical laws (or their computational analogs, e.g. your state on Monday, and the rest of the code that will be run on Friday) are only compatible with your performing one of (a) or (b), then you can’t be free to choose either, because this would imply that you are free to choose the past or the physical laws, which you can’t. Here, though, we pull a “one person’s reductio is another’s discovery”: because only one of (a) or (b) is compatible with the past/the physical laws, and because you are free to choose (a) or (b), it turns out that in some sense, you’re free to choose the past/the physical laws (or, their computational analogs).

What? That can’t be right. But isn’t it, in the practically relevant sense? Consider: the case is basically one where, if it’s the case that your state on Monday (call this Monday-Joe), copied and evolved according to deterministic process P, outputs “cooperate,” then you get a million dollars; and if it outputs “defect,” you get a thousand dollars (see e.g. Ahmed’s (2014) “Betting on the Past” for an even simpler version of this). It’s Friday now. The state of Monday-Joe is fixed; Monday-Joe lives in the past. And process P, let’s say, was fixed on Monday, too. In this sense, the question of what Monday-Joe + process P outputs is already fixed. You, on Friday, are evolving-Joe: that is, Monday-Joe-in-the-midst-of-evolving-according-to-process-P. If you choose cooperate, it will always have been the case that Monday-Joe + process P outputs cooperate. If you choose defect, it will always have been the case that Monday-Joe + process P outputs defect. In this very real sense – the same sense at stake in every choice in a deterministic world – you get to choose what will have always been the case, even before your choice.

Try it. It will really work. Make your Friday choice, then leave the simulation, go get an old and isolated copy of Monday-Joe and Process P – one that’s been housed, since Monday, somewhere you could not have touched or tampered with — press play, and watch what comes out the other end. You won’t be surprised.
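
The experiment can be mimicked in miniature. In the sketch below, `process_p` is a made-up stand-in (a hash of the state and inputs), not a model of any real agent, and `monday_joe` is a toy dictionary; the only property that matters is that the process is deterministic. An archived copy of the Monday state, replayed later, yields whatever the live run yielded:

```python
# Toy replay of "Monday-Joe + process P".
# Assumption: process_p is an arbitrary deterministic function standing in
# for the real (unspecified) process; the state is a toy dictionary.
import copy
import hashlib


def process_p(state, friday_inputs):
    """Deterministically map a starting state and inputs to a choice."""
    digest = hashlib.sha256(repr((state, friday_inputs)).encode()).digest()
    return "cooperate" if digest[0] % 2 == 0 else "defect"


monday_joe = {"name": "Joe", "memories": ["copied on Monday"]}
archived_monday_joe = copy.deepcopy(monday_joe)  # stored away, untouched

friday_inputs = ("room", "whiteboard", "markers")

friday_choice = process_p(monday_joe, friday_inputs)             # the live run
replayed_choice = process_p(archived_monday_joe, friday_inputs)  # pressing play later

print("Friday choice:  ", friday_choice)
print("Archived replay:", replayed_choice)
print("Match:", friday_choice == replayed_choice)
```

Nothing in the replay consults the Friday run; the two agree only because the state, the inputs, and the process are the same.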

Is that changing the past? In one sense: no. It’s not that Joe’s state on Monday was X, but then because of what Evolving-Joe did on Friday, Joe’s state on Monday became Y instead. Nor does the output of Monday-Joe + Process P alter over the course of the week. Don’t be silly. You can’t change these things like you can change the contents of your fridge: milk on one day, juice on the next. It’s not milk at noon on Monday, and then on Friday, juice at noon on Monday instead. We must distinguish between the ability to “change things” in this sense, and the ability to “control” them in some broader sense.

But nevertheless: you get to decide, on Friday, the thing that will always have been true; the one thing that will always have been in your fridge, since the beginning of time. And perhaps this approaches, ultimately, the full sense of compatibilist decision-making, compatibilist “control,” even in cases of causal influence. Perhaps, that is, you can change the past, here, about as much as you can change the future in a deterministic world: that is, not at all, and enough to matter for practical purposes. After all, in such a world, the future is already fixed by the past. Your ability to decide that future was, therefore, always puzzling. Perhaps your ability to decide the past isn’t much more so (though certainly, it’s no less).   

CDT can’t handle this kind of thing. CDT imagines that we have severed the ties between you and your copy, between you and the history that determines every aspect of you. It imagines that you can hold your copy’s arm fixed, and move yours freely; that you can break apart the future from the past, and let the future swing, at your pleasure, along some physically (indeed, logically!) impossible hinge. But you can’t. The echoes of your choice started before you chose. You are implicated in a structure that reverberates in all directions. You pull your arm, and the past and the universe trail behind; and yet, the past and universe push your arm; and yet, neither: you, the past, the future, the universe, are all born in the same timeless instant — free, fixed, consistent, a full and living painting of someone painting it as they go along.

And CDT’s mistake, here, is not just abstract misconception: rather, it misleads you in straightforward and practically-relevant ways. In particular, it prompts CDT to compare actions using expected utilities that you shouldn’t actually expect – which, when you step back, seems pretty silly. Suppose, for example, that as a CDT agent, you start out with a credence p that your copy will defect of 99%. Thus, as in Newcomb’s problem above, your payoffs are:

  • Expected utility from defecting: $1K guaranteed + $10K from a 1% probability of getting a million from my copy = $11K.
  • Expected utility from cooperating: $10K from a 1% probability of getting a million from my copy = $10K.

But you shouldn’t actually expect only $10k, if you cooperate, given the logical necessity of his doing what you do. That’s just … not the right number. So why are you considering it? This is no time to play around with fantasy distributions over outcomes; there’s real money on the line. And of course, this sort of objection will hold for any p. As long as you and your copy’s choice are correlated, CDT is going to ignore that correlation, hold p constant given different actions, and in that sense, prompt you to choose as though your probabilities are wrong.
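
A small sketch of the mismatch, with hypothetical function names: the first function reproduces CDT’s calculation for any fixed credence p that the copy defects, and the second returns what you actually get when the copy necessarily mirrors you.

```python
# CDT's numbers versus the numbers you should actually expect,
# in the perfect deterministic twin case. Names are illustrative.

MILLION = 1_000_000
THOUSAND = 1_000


def cdt_twin_payoffs(p_twin_defects):
    """(cooperate, defect) expected dollars, with p held fixed across actions."""
    from_twin = (1 - p_twin_defects) * MILLION
    return from_twin, from_twin + THOUSAND


def actual_twin_payoffs():
    """(cooperate, defect) dollars when the twin always does what I do."""
    return MILLION, THOUSAND


actual_coop, actual_defect = actual_twin_payoffs()
for p in (0.99, 0.5, 0.01):
    cdt_coop, cdt_defect = cdt_twin_payoffs(p)
    print(
        f"p={p:.2f}  CDT: cooperate=${cdt_coop:,.0f} defect=${cdt_defect:,.0f}"
        f"  |  actual: cooperate=${actual_coop:,} defect=${actual_defect:,}"
    )
```

Whatever p is, CDT’s two numbers differ by $1K in favor of defecting, while the actual pair is $1M for cooperating and $1K for defecting.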

EDT does better, here, of course: choosing based on what utility you, as a Bayesian, should actually expect, given different actions, is EDT’s forte by definition, and a powerful argument in its favor (see e.g. Christiano’s “simple argument for EDT”). And the considerations about compatibilism and determinism I’ve been discussing seem friendly to EDT as well. After all, if you are living in an already-painted painting, it seems unsurprising if choice comes down to something like “managing the news.” The problem with managing the news, after all, was supposed to be that the news was already fixed. But in an already-painted painting, the future has already been fixed, too: you just don’t know what it is. And when you act, you start to find out. Insofar as you can choose how to act – and per compatibilism, you can – then you can choose what you’re going to find out, and in that sense, influence it. Do you hope that this already-fixed universe is one where you eat a sandwich? Well, go make a sandwich! If you do, you’ll discover that your dream for the universe has always been true, since the beginning of time. If you don’t make a sandwich, though, your dream will die. Why should the applicability of such reasoning be limited by the scope of “causation” (whatever that is)?

IV. What if the case is less clean?

I took pains, above, to specify that the copying process was perfect, and the inputs received exactly identical. It’s perfectly possible to satisfy this constraint, and we don’t need to use “atom-for-atom” copies and the like, or assume determinism at a physical level; we can just make you an AI system running in a deterministic simulation. What’s more, this constraint helps make the point more vivid; and it suffices, I think, to show that CDT is wrong.

However, I don’t think it’s necessary. Consider, for example, a version where there are small errors in the copying process; or in which you get a blue hat, and your copy, a red; or in which your environment involves some amount of randomness. These may or may not suffice to ruin your ability to write exactly what you want on his whiteboard. But very plausibly, the strong correlation between your choice of button, and his, will persist: and to the extent it does, this information is worthy of inclusion in your decision-making process.
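
One way to see how much correlation is needed, as a sketch under assumptions of my own: summarize the imperfect copying by a single number, `match_prob`, the probability that your copy’s button matches yours whichever button you press, and compare conditional expectations. The post does not commit to this model; it is only meant to show the shape of the trade-off.

```python
# Twin prisoner's dilemma with an imperfect copy.
# Assumption: the copy's move matches mine with probability match_prob,
# regardless of which move I make.

MILLION = 1_000_000
THOUSAND = 1_000


def correlated_twin_payoffs(match_prob):
    """(cooperate, defect) expected dollars, conditioning on my own move."""
    cooperate = match_prob * MILLION
    defect = THOUSAND + (1 - match_prob) * MILLION
    return cooperate, defect


# Break-even: m * M = K + (1 - m) * M  =>  m = (M + K) / (2 * M)
BREAK_EVEN = (MILLION + THOUSAND) / (2 * MILLION)

print(f"break-even match probability: {BREAK_EVEN:.4f}")
for match_prob in (0.5, 0.6, 0.9, 1.0):
    cooperate, defect = correlated_twin_payoffs(match_prob)
    print(f"match={match_prob:.2f}  cooperate=${cooperate:,.0f}  defect=${defect:,.0f}")
```

With these payoffs the break-even point is a match probability of 0.5005, so on this model even a weak correlation favors cooperating. Setting `match_prob` to 1.0 recovers the perfect-twin case.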

What if you know that your copy has already chosen, before you make your choice? To the extent that the correlations between your choice and his persist in such conditions, I think that the same argument applies. Note, though, that your knowing that he’s already chosen means that the two of you got different inputs in a sense that seems more likely to affect your decision-making than getting different colored hats. That is, you saw a light indicating “your copy has already chosen”; he didn’t; and some people, faced with a light of that kind, start acting all weird about how “his choice is already made, I can’t affect it, might as well defect” and so on, in a way that they don’t when the light is off. So the question of what sorts of correlations are still at stake is more up for grabs. Does learning that you cooperate, after seeing such a light, still make it more likely that he cooperated, without seeing one? If so, that seems worth considering.

(This sort of “different inputs” dynamic also blocks certain types of loops/contradictions that could come from learning what a deterministic copy of you already did. E.g., if you learn what he chose – say, that he cooperated – before you make your choice, it’s still compatible with the case’s set-up that you defect, as long as he got different inputs: e.g., he didn’t also learn that you cooperated. If he did “learn” that you cooperated, then things are getting more complicated. In particular, either you will in fact cooperate, or some feature of the case’s set-up is false. This is similar to how, if you travel back in time and try to kill your grandfather, either you will in fact fail, or the case’s set-up is false. Or to how, if you hear an infallible prediction that you’ll do X, then either you will in fact do X, or the prediction wasn’t infallible after all.)

V. Monopoly money

I think that “perfect deterministic twin prisoner’s dilemma”-type cases suffice to show that CDT is wrong. But I also want to note another type of argument I find persuasive, in the context of Newcomb’s problem, and which also evokes the type of “magic” I have in mind.

Imagine doing “tryout runs” of Newcomb’s problem, using monopoly money, as many times as you’d like, before facing the real case (h/t Drescher (2006) again). You try different patterns of one-boxing and two-boxing, over and over. Every time you one-box, the opaque box is full. Each time you two-box, it’s empty. 

You find yourself thinking: “wow, this Omega character is no joke.” But you try getting fancier. You fake left, then go right — reaching for the one box, then lunging for the second box too at the last moment. You try increasingly complex chains of reasoning. Before choosing, you try deceiving yourself, bonking yourself on the head, taking heavy doses of hallucinogens. But to no avail. You can’t pull a fast one on ol’ Omega. Omega is right every time.

Indeed, pretty quickly, it starts to feel like you can basically just decide what the opaque box will contain. “Shazam!” you say, waving your arms over the boxes: “I hereby make it the case that Omega put a million dollars into the box.” And thus, as you one-box, it is so. “Shazam!” you say again, waving your arms over a new set of boxes: “I hereby make it the case that Omega left the box empty.” And thus, as you two-box, it is so. With Omega’s help, you feel like you have become a magician. With Omega’s help, you feel like you can choose the past.

Now, finally, you face the true test, the real boxes, the legal tender. What will you choose? Here, I expect some feeling like: “I know this one; I’ve played this game before.” That is, I expect to have learned, in my gut, what one-boxing, or two-boxing, will lead to – to feel viscerally that there are really only two available outcomes here: I get a million dollars, by one-boxing, or I get a thousand, by two-boxing. The choice seems clear.
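
The tryout runs can be sketched as a toy simulation. The big assumption, which the case as stated does not require, is that Omega predicts by running an exact copy of your (deterministic) decision procedure on the same situation. The policies below are invented examples, and truly random choices are outside this model.

```python
# Toy "monopoly money" tryout runs of Newcomb's problem.
# Assumption: Omega predicts by running the same deterministic policy
# on the same situation before the boxes are filled.

MILLION = 1_000_000
THOUSAND = 1_000


def omega_fills_box(policy, situation):
    """Omega's prediction step: fill the box if the policy would one-box."""
    return policy(situation) == "one-box"


def play_round(policy, situation):
    box_full = omega_fills_box(policy, situation)  # yesterday
    choice = policy(situation)                     # today
    payoff = (MILLION if box_full else 0) + (THOUSAND if choice == "two-box" else 0)
    return choice, box_full, payoff


def plain_one_box(situation):
    return "one-box"


def fake_left_then_go_right(situation):
    reaching_for = "one-box"  # the feint
    reaching_for = "two-box"  # the last-moment lunge
    return reaching_for


def convoluted_reasoning(situation):
    # An arbitrary but deterministic chain of "reasoning" about the round.
    return "one-box" if sum(map(ord, situation)) % 3 == 0 else "two-box"


results = []
for policy in (plain_one_box, fake_left_then_go_right, convoluted_reasoning):
    for i in range(5):
        results.append(play_round(policy, f"tryout-{i}"))

for choice, box_full, payoff in results:
    print(f"{choice:8s} box_full={box_full!s:5s} payoff=${payoff:,}")
```

Under that assumption, every round ends with either $1M (one-box, full) or $1K (two-box, empty), however the policy arrives at its choice.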

VI. Against undue focus on folk-theoretical names

Of course, the same two-boxing responses I noted above apply here, too. It’s true that every time you one-box, you would’ve gotten an extra $1,000 if you’d two-boxed, assuming CDT’s “counterfactual” construal of “would.” It’s true that you leave the $1,000 on the table; that this is predictably regrettable for some sense of “regret”; and we can say, for this reason, that “Omega is just play-rewarding your play-irrationality.” I don’t have especially deep responses to these objections. But I find myself persuaded, nevertheless, that one-boxing is the way to go.

Or at least, it’s my way. When I step back in Newcomb’s case, I don’t feel especially attached to the idea that it’s the way, the only “rational” choice (though I admit I feel this non-attachment less in perfect twin prisoner’s dilemmas, where defecting just seems to me pretty crazy). Rather, it feels like my conviction about one-boxing starts to bypass debates about what’s “rational” or “irrational.” Faced with the boxes, I don’t feel like I’m asking myself “what’s the rational choice?” I feel like I’m, well, deciding what to do. In one sense of “rational” – e.g., the counterfactual sense – two-boxing is rational. In another sense – the conditional sense – one-boxing is. What’s the “true sense,” the “real rationality”? Mu. Who cares? What’s that question even about? Perhaps, for the normative realists, there is some “true rationality,” etched into the platonic realm; a single privileged way that the normative Gods demand that you arrange your mind, on pain of being… what? “Faulty”? Silly? Subject to a certain sort of criticism? But for the anti-realists, there is just the world, different ways of doing things, different ways of using words, different amounts of money that actually end up in your pocket. Let’s not get too hung up on what gets called what.

There’s a great line from David Lewis, which I often think of on those rare and clear-cut occasions when philosophical debate starts to border on the terminological.

“Why care about objective value or ethical reality? The sanction is that if you do not, your inner states will fail to deserve folk-theoretical names. Not a threat that will strike terror into the hearts of the wicked! But whoever thought that philosophy could replace the hangman?”

I want to highlight, in particular, the idea of “failing to deserve folk-theoretical names.” Too often, philosophy – especially normative philosophy — devolves into a debate about what kind of name-calling is appropriate, when. But faced with the boxes, or the buttons, our eyes should not be on the folk-theoretical names at stake. Rather, our eyes should be on the choice itself.

Note that my point here is not that “rationality is about winning” (see e.g. Yudkowsky (2009)). “Winning,” here, is subject to the same ambiguity as “rational.” One-boxers tend to end up richer, yes. But faced with a choice between $1k, or nothing (the choice that the two-boxer is actually presented with), $1k is the winning choice. Still, I am with Yudkowsky in spirit, in that I think that too much interest in the word “rational” here is apt to move our eyes from the prize.

(All that said, I’m going to continue, in what follows, to use the standard language of “what’s rational,” “what you should do,” etc, in discussing these cases. I hope that this language will be interpreted in a sense that connects directly to the actual, visceral process of deciding what to do, name-calling be damned. I acknowledge, though, that there’s a possible motte-and-bailey dynamic here, where the one-boxer goes in hard for claims like “CDT is wrong” and “c’mon, defecting in perfect twin prisoner’s dilemmas is just ridiculous!” and then backs off to “hey man, you’ve got your way, I’ve got my way, what’s all this obsession with the word ‘rationality’?” when pressed about the counterintuitive consequences of their own position. And more broadly, it can be hard to combine object-level normative debate, which often proceeds with a kind of “realist” flavor, with adequate consciousness and communication of some more fundamental meta-ethical arbitrariness. If necessary, we might go back through the whole post and try to rewrite it in more explicitly anti-realist terms – e.g., “I reject CDT.” But I’ll skip that, partly because I suspect that something beyond naive meta-ethical realism gets lost in this sort of move, even if we don’t have an explicit account of what it is.)

VII. Identity crises are no defense of CDT

I’ve now covered two data-points that I take to speak very strongly against CDT: namely, that one should cooperate in a twin prisoner’s dilemma, and that one should one-box in Newcomb’s problem. I want to briefly discuss an unusual way of trying to get CDT to one-box: namely, by appealing to uncertainty about whether you are faced with the real boxes, or whether you are in a simulation being used by Omega to predict your future choice (see e.g. Aaronson (2005) and Critch (2017) for suggestions in this vein, though not necessarily in these specific terms). Basically, I don’t think this move works, in general, as a way of saving CDT, though the type of uncertainty in question might be relevant in other ways.

How is the story supposed to go? Imagine that you know that the way Omega predicts whether you’ll one-box, or two-box, is by running an extremely high-fidelity simulation of you. And suppose that both real-you and sim-you only care about what happens to real-you. By hypothesis, sim-you shouldn’t be able to figure out whether he’s simulated or real, because then he’ll serve as worse evidence about real-you’s future behavior (for example, if sim-you appears in a room with writing on the wall saying “you’re the sim,” then he can just one-box, thereby causing Omega to add the money to the opaque box, thereby allowing real-you, appearing in a room saying “you’re the real one,” to two-box, get the full million-point-one, and make Omega’s “prediction” wrong). So it needs to be the case that you’re uncertain – let’s say, 50-50 — about whether you’re simulated or not. Thus, the thought goes, you should one-box, because there’s a 50% chance that doing so will cause Omega to put the million in the box, and your real-self (who will also, presumably, one-box, given the similarity between you) will get it.

(Calculation: feel free to skip. Suppose that you currently expect yourself to one-box, as both real-you and sim-you, with 99% probability. Then the CDT calculation runs as follows:

  • 50% chance you’re the sim, in which case:
    • EV of one-boxing: 99% chance real-you gets $1M, 1% chance real-you gets $1M + $1K = $1,000,010.
    • EV of two-boxing: 99% chance real-you gets nothing, 1% chance real-you gets $1K = $10.
  • 50% chance you’re real, in which case:
    • EV of one-boxing: 99% chance real-you gets $1M, 1% chance real-you gets nothing = $990,000.
    • EV of two-boxing: 99% chance real-you gets $1M + $1K, 1% chance real-you gets $1K = $991,000.
  • So overall:
    • EV of one-boxing = 50% * $1,000,010 + 50% * $990,000 = $995,005.
    • EV of two-boxing = 50% * $10 + 50% * $991,000 = $495,505.

Depending on the details, CDT may then need to adjust its probability that both sim-you and real-you one-box. But high confidence that both versions of you one-box is a stable equilibrium (e.g., CDT still one-boxes, given such a belief); whereas high confidence that both will two-box is not (e.g., CDT one-boxes, given such a belief). There are also some problems, here, with making such calculations consistent with assigning a specific probability to Omega being right in her prediction, but I’m setting those aside.)
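To make the arithmetic above easy to re-run with other numbers, here is a minimal Python sketch of the same calculation. The function name `cdt_ev` and its parameters are my own labels, not anything from the cited sources, and the sketch inherits the simplification already flagged above: it ignores the question of how to square these numbers with a specific probability of Omega being right.

```python
MILLION = 1_000_000
THOUSAND = 1_000


def cdt_ev(action, p_sim=0.5, p_other_one_boxes=0.99):
    """Causal expected payout to real-you from `action`, when unsure if you are sim or real.

    p_other_one_boxes is your credence that the *other* version of you one-boxes.
    Both versions are assumed to care only about real-you's payout.
    """
    q = p_other_one_boxes
    if action == "one-box":
        # As the sim, one-boxing causes Omega to fill the opaque box.
        ev_if_sim = q * MILLION + (1 - q) * (MILLION + THOUSAND)
        # As real-you, the opaque box is full only if the sim one-boxed.
        ev_if_real = q * MILLION
    elif action == "two-box":
        # As the sim, two-boxing causes Omega to leave the opaque box empty.
        ev_if_sim = (1 - q) * THOUSAND
        ev_if_real = q * (MILLION + THOUSAND) + (1 - q) * THOUSAND
    else:
        raise ValueError(f"unknown action: {action!r}")
    return p_sim * ev_if_sim + (1 - p_sim) * ev_if_real


# Compare the two actions under very different beliefs about the other version.
for q in (0.99, 0.5, 0.01):
    gap = cdt_ev("one-box", p_other_one_boxes=q) - cdt_ev("two-box", p_other_one_boxes=q)
    print(f"P(other one-boxes) = {q}: one-boxing is ahead by ${gap:,.0f}")
```

With the defaults, this reproduces the $995,005 and $495,505 figures. Working through the algebra, the gap between the two actions comes out to P(sim) * $1M - P(real) * $1K, which does not depend on your credence about what the other version does; that is one way to see why the two-boxing belief is not a stable equilibrium at 50-50.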

My objections here are:

  1. This move doesn’t work if you’re indexically selfish (e.g., you don’t care about copies of yourself).
  2. This move doesn’t work for twin prisoner’s dilemma cases more broadly.
  3. It’s not clear that simulations are necessary for predicting your actions in the relevant cases.
  4. In general, it really doesn’t feel like this type of thing is driving my convictions about these cases.

Let’s start with (1). Suppose that real-you and sim-you aren’t united in sole concern for real-you. Rather, suppose that you’re both out for yourselves. Sim-you, let’s suppose, faces bleak prospects: whatever happens, Omega is going to shut down the simulation right after sim-you’s choice gets made. So sim-you doesn’t give a shit about this whole ridiculous situation with the god-damn boxes; the world is dust and ashes. Real-you, by contrast, is a CDT agent. So real-you, left to his own devices, is a two-boxer. Hence, sim-you doesn’t care, and real-you wants to two-box; and thus, uncertain about who you are, you two-box.

(Calculation, feel free to skip. Suppose you start out 99% confident that both versions of you will two-box. Thus:

  • 50% chance you’re the sim, in which case: you get nothing no matter what.
  • 50% chance you’re real, in which case:
    • EV of one-boxing: 1% chance of $1M, 99% chance of nothing = $10,000.
    • EV of two-boxing: 1% chance of $1M + $1K, 99% chance of $1K = $11,000.
  • So overall:
    • EV of one-boxing = 50% * $0 + 50% * $10,000 = $5,000.
    • EV of two-boxing = 50% * $0 + 50% * $11,000 = $5,500.

This dynamic holds regardless of your initial probabilities on how different versions of you will act, and regardless of your probability on being the sim vs. being real.)
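The dominance claim can be checked directly. The following is a rough Python sketch of the indexically selfish version, where sim-you gets nothing whatever happens; `selfish_cdt_ev` and its parameter names are hypothetical labels of mine.

```python
MILLION = 1_000_000
THOUSAND = 1_000


def selfish_cdt_ev(action, p_sim, p_sim_one_boxes):
    """Causal expected payout *to you* when each version cares only about itself."""
    if action not in ("one-box", "two-box"):
        raise ValueError(f"unknown action: {action!r}")
    # If you are the sim, you are shut down right after choosing: nothing to gain.
    ev_if_sim = 0.0
    # If you are real, the opaque box is full only if the sim one-boxed,
    # and your own choice cannot causally change that.
    ev_if_real = p_sim_one_boxes * MILLION
    if action == "two-box":
        ev_if_real += THOUSAND  # the transparent box is yours either way
    return p_sim * ev_if_sim + (1 - p_sim) * ev_if_real


# The worked example: 50% chance of being the sim, 1% credence that the sim one-boxes.
print(selfish_cdt_ev("one-box", 0.5, 0.01), selfish_cdt_ev("two-box", 0.5, 0.01))
```

On this sketch, two-boxing is ahead by exactly P(real) * $1K for every choice of the two probabilities, so it wins whenever you give any weight at all to being real.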

Of course, real-you can try to “acausally induce” sim-you to one-box, by one-boxing himself. But “acausally inducing” other versions of yourself to do stuff isn’t the CDT way; rather, it’s the type of magical thinking silliness that CDT is supposed to eschew.

Perhaps one objects: sim-you should care about real-you! For one thing, though, this seems unobvious: indexical selfishness seems perfectly consistent and understandable (and indeed, for anti-realists, you can care about whatever you want). But more importantly, it’s an objection to a utility function, rather than to two-boxing per se; and decision theorists don’t generally go in for objecting to utility functions. If the claim is that “CDT is compatible with indexically altruistic agents one-boxing in Newcomb cases involving simulations,” then fair enough. But what about everyone else?

This leads us to objection (2): namely, that the twin prisoner’s dilemma, which I take to be one of the strongest reasons to reject CDT, is precisely a case of indexical selfishness. Perhaps I am uncertain about which copy I am; but regardless, I only care about myself; and on CDT, whatever that other guy does, I should defect. But defecting on your perfect deterministic twin, I claim, is totally crazy, even if you are indexically selfish. So CDT, I think, is still wrong.

What’s more, as I noted above, we can imagine versions of the case where I do know who I am; for example, I am the one with the blue hat, he’s the one with the red hat; I am the one who wants to create flourishing Utopias, and he (the authorities changed my values during the copying process) wants to create paperclips. Unlike “sim vs. real,” these distinctions are epistemically accessible. Still, though, if my choices are sufficiently correlated with those of my copy (and mutual cooperation is sufficiently beneficial), I should cooperate.

This is related to objection (3): namely, that not all cases where CDT gives the wrong verdicts involve simulations, or uncertainty about “who you are.” Twin prisoner’s dilemmas, where you are slightly but discernibly different from your twin, are one example: no simulations or predictions necessary. But we might also wonder about Newcomb cases more broadly. Does Omega really need to be predicting your behavior via a simulation or model that you might actually be, in order for one-boxing to be the right call? This seems, at least, a substantively additional claim. And we might wonder about e.g. predicting your behavior via your genes (see e.g. Oesterheld (2015)), by observing lots of people who are “a lot like you,” or via some other unknown method.

That said, I want to acknowledge that one of the arguments for one-boxing that I find most persuasive – e.g., running the case lots of times with “play money,” before deciding what to do for real – works a lot better in contexts with very fine-grained prediction capabilities. This is because when I’m “playing around” with no real stakes, it makes more sense to imagine me using intricate and arbitrary decision-making processes, which the incentives at stake in the real case will not constrain. Thus, for example, maybe I try forms of pseudo-randomization (“I’ll one-box if the number of letters in the sentence I’m about to make up is odd” – see Aaronson here); maybe I try spinning myself around with my eyes closed, then pressing whichever button I see first; and so on. In order for Omega’s predictions to stay well-correlated with my behavior, here, it seems plausible she needs a very (unrealistically?) high-fidelity model. And we can say something similar about the twin prisoner’s dilemma. That is, the argument for cooperating is most compelling when his arm literally moves in logically-necessary lock-step with your own, as you reach towards the buttons. Once that’s not true, if we try to imagine a “play money” version of the case, then even with fairly minor psychological differences, you and your copy’s modes of “playing around” might de-correlate fast.

This feature of the intuitive landscape seems instructive. The sense that you acausally “control” what Omega predicts, or what your copy does, seems strongest when you can, as it were, do any old thing, for any old reason, and the correlation will remain. Once the correlation requires further constraints, the intuitive case weakens. That said, if you’re in the real case, with the real incentives, then it’s ultimately the correlation given those incentives that seems relevant: e.g., maybe Omega is accurate only for real-money cases; maybe you and your copy are only highly correlated when the real money comes out. In such a case, I think, you should still one-box/cooperate.

My final objection to the “appeal to uncertainty about who are you” sort of view is just: it doesn’t feel like uncertainty about whether I’m a simulation is actually driving my one-boxing impulse. In the play-money Newcomb case, for example, I feel like what actually persuades me is a visceral sense that “one-boxing is going to result in me having a million dollars, two-boxing is going to result in me having a thousand dollars.” Questions about whether I’m a simulation, or whether Omega needs to simulate me in order to achieve this level of accuracy, just aren’t coming into it.

I conclude, then, that simulation uncertainty and related ideas can’t save CDT. Aaronson thinks that he can “pocket the $1,000,000, but still believe that the future doesn’t affect the past.” I think he’s wrong — at least in many cases where one wants the million, and can get it. He should face, I think, a weirder music.

VIII. Maybe EDT?

But what sort of music, exactly? And exactly how weird are we talking? I don’t know.

Consider, for example, EDT – CDT’s most famous rival. I think that a lot of philosophers write off EDT too quickly. As I mentioned earlier, EDT has the unique and compelling distinction of being the only view to use the utility you should actually expect, given the performance of action X, in order to calculate the expected utility of performing action X. In this sense, it’s the basic, simple-minded Bayesian’s decision theory; the type of decision theory you would use if you were, you know, trying to predict the outcomes of different actions.

What’s more, a number of prominent objections to EDT seem to me much more complicated than they’re often made out to be. Consider, for example, the accusation that EDT endorses attempts to “manage the news.” There’s something true about this, but we should also be puzzled by it. Managing the news is obviously fine when you can influence the events the news is about. It’s fine, for example, to “manage the news” about whether you get a promotion, by working harder at the office. And it’s interestingly hard to “manage the news” successfully – e.g., change your rational credence in how good the future will be – with respect to things you can’t influence. Suppose, for example, that you’re worried (at, say, 70% credence) that your favored candidate lost yesterday’s election. Do you “manage the news” by refusing to read the morning’s newspaper, or by scribbling over the front page “Favored Candidate Wins Decisively!”? No: if you’re rational, your credence in the loss is still 70%.

Or take a somewhat more complicated case, discussed in Ahmed (2014). Suppose that you wake up not knowing what time it is, and all your clocks are broken. You hope that you’re not already late to work, and you consider running, to avoid either being late at all, or being later. Suppose, further, that people who run to work tend to be already late. Should you refrain from running, on the grounds that running would make it more likely that you’re already late? No. But plausibly, EDT doesn’t say you should, because running to work, in this case, wouldn’t be additional evidence that you’re already late, once we condition on the fact that you don’t know when you woke up, the reasons (including the subtle hunches about what time it might be) that you’d be running, and so on. After all, many of the already-late people running for work know that they’re already late, and are running for that reason. Your situation is different.

OK, so what does it take for the problematic type of news-management to be possible? This question matters, I think, because in some of the examples where EDT is supposed to go in for the problematic type of news-management, it’s not clear that the news-management in question would succeed. Consider:

Smoking lesion: Almost everyone who smokes has a fatal lesion, and almost everyone who doesn’t smoke doesn’t have this lesion. However, smoking doesn’t cause the lesion. Rather, the lesion causes people to smoke. Dying from the lesion is terrible, but smoking is pretty good. Should you smoke?

EDT, the objection goes, doesn’t smoke, here, because smoking increases your credence that you have the lesion. But this, the thought goes, is stupid. You’ve either already got the lesion, or you don’t have it and won’t get it. Either way, you should smoke. Not smoking is just “managing the news.”

I used to treat this case as a fairly decisive reason to reject EDT. Now I feel more confused about it. For starters, EDT clearly smokes in some versions of the case. Suppose, for example, that the way the lesion causes people to smoke is by making them want to smoke. Conditional on someone wanting to smoke, though, there’s no additional correlation between actually smoking and having the lesion. Thus, if you notice that you want to smoke (e.g., you feel a “tickle”), then that’s the bad news right there: you’ve already got all the smoking-related evidence you’re going to get about whether you’ve got the lesion. Actually smoking, or not, doesn’t change the news: so, no need for further management. This sort of argument will work for any mechanism of influence on your decision that you notice and update on. Thus the so-called “Tickle Defense” of EDT.
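The screening-off structure behind the tickle defense can be illustrated with a toy joint distribution. The sketch below is mine, and every number in it is made up for illustration: it assumes a 50% base rate of lesions, that the lesion influences smoking only via the desire to smoke, and that the desire is something you can observe.

```python
from fractions import Fraction as F
from itertools import product

# All probabilities here are illustrative assumptions, not from the text.
P_LESION = F(1, 2)
P_DESIRE_GIVEN_LESION = {True: F(19, 20), False: F(1, 20)}
# Smoking depends on the lesion *only* through the desire (the "tickle").
P_SMOKE_GIVEN_DESIRE = {True: F(9, 10), False: F(1, 10)}


def joint(lesion, desire, smoke):
    """Probability of one complete (lesion, desire, smoke) world."""
    p = P_LESION if lesion else 1 - P_LESION
    p_desire = P_DESIRE_GIVEN_LESION[lesion]
    p *= p_desire if desire else 1 - p_desire
    p_smoke = P_SMOKE_GIVEN_DESIRE[desire]
    p *= p_smoke if smoke else 1 - p_smoke
    return p


def p_lesion_given(**observed):
    """P(lesion | observed), where observed may fix 'desire' and/or 'smoke'."""
    numerator = denominator = F(0)
    for lesion, desire, smoke in product((True, False), repeat=3):
        world = {"desire": desire, "smoke": smoke}
        if any(world[name] != value for name, value in observed.items()):
            continue
        p = joint(lesion, desire, smoke)
        denominator += p
        if lesion:
            numerator += p
    return numerator / denominator


# Without the tickle, smoking is bad news about the lesion...
print(p_lesion_given(smoke=True), p_lesion_given(smoke=False))
# ...but once you have noticed the desire, smoking adds nothing further.
print(p_lesion_given(desire=True, smoke=True), p_lesion_given(desire=True, smoke=False))
```

In this toy model, smoking is strongly correlated with the lesion unconditionally, but the two conditional probabilities in the second line are equal: given the desire, the act carries no further news. The result depends entirely on the assumption that the desire is the only pathway and that it is fully observed, which is exactly the assumption at issue in the tickle defense.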

Ok, but what if you don’t notice any tickle, or whatever other mechanism of influence is at stake? As Ahmed (2014, p. 91) characterizes it, the tickle defense assumes that all the inputs to your decision-making are “transparent” to you. But this seems like a strong condition, and granted greater ignorance, my sense is that in some versions of the case (for example, versions where the lesion makes you assign positive utility to smoking, but you don’t know what your utility function is, even as you use it in making decisions), EDT is indeed going to give the intuitively wrong result (see e.g., Demski’s “Smoking Lesion Steelman” for a worked example). Christiano argues that this is fine – “No matter how good your decision procedure is, if you don’t know a critical fact about the situation then you can make a decision that looks bad” – but I’m not so sure: prima facie, not smoking in smoking-lesion type cases seems like the type of mistake one ought to be able to avoid, even granted uncertainty about some aspects of your own psychology, and/or how the lesion works.

More generally, though, my sense is that really trying to dig into the details of tickle-defense type moves gets complicated fast, and that there’s some tension between (a) trying to craft a version of EDT where the “tickle defense” always works – e.g., one that somehow updates on everything influencing its decision-making (I’m not sure how this is supposed to work) – and (b) keeping EDT meaningfully distinct from CDT (see e.g. Demski’s sequence “EDT = CDT?”). Maybe some people are OK with collapsing the distinction, and OK, even, if EDT starts two-boxing in Newcomb’s problems (see e.g. Demski’s final comments here), and defecting on deterministic twins (I’ve been setting this possibility aside above, and following the standard understanding of how EDT acts in these cases). But for my part, a key reason I’m interested in EDT at all is because I’m interested in one-boxing and cooperating. Maybe I can get this in other ways (see e.g. the discussion of “follow the policy you would’ve committed to” below); but then, I think, EDT will lose much of its appeal (though not all; I also like the “basic Bayesian-ness” of it).

One other note on smoking lesion. You might think that the “do it over and over with monopoly money” type argument that I found persuasive earlier will give the intuitively wrong verdict on smoking lesion, suggesting that such an argument shouldn’t be trusted. After all, we might think, almost every time you smoke in a “play life,” you’ll end up with the play-lesion; and every time you don’t, you won’t. But note that when we dig in on this, the smoking lesion case can start to break in a maybe-instructive manner.

Suppose, for example, that I know that the base rate of lesions in the population is 50%, and I get “spawned” over and over into the world, where I can choose to smoke, or not. How can my “playing around” remain consistent with this 50% base rate? Imagine, for example, that I decide to refrain from smoking a million times in a row. If the case’s hypothesized correlations hold, then I will in fact spawn, consistently, without the lesion. In that case, though, it starts to look like my choice of whether to smoke or not actually is exerting a type of “control” over whether I get born as someone with the lesion – in defiance of the base rate. And if my choice can do that, then it’s not actually clear to me that non-smoking, here, is so crazy.

Maybe we could rule this out by fiat? “Well, if the base rate is 50%, then it turns out you will, in fact, decide to ‘play around’ in a way that involves smoking ~50% of the time” (thanks to Katja Grace for discussion). But this feels a bit forced, and inconsistent with the spirit of “play around however you want; it’ll basically always work” – the spirit that I find persuasive in Newcomb’s case and sufficiently-high-fidelity twin prisoner’s dilemmas. Alternatively, we could specify that I’m not allowed to know the base rate, and then we can shift it around to remain consistent with my making whatever play choices I want and spawning at the base rate. But now it looks like I can control the base rate of lesions! And if I can do that, once again, I start to wonder about whether non-smoking is so crazy after all.

That said, maybe the right thing to say here is just that the correlations posited in smoking lesion don’t persist under conditions of “play around however you want” – something that I expect holds true of various versions of Newcomb’s problem and Twin Prisoner’s dilemma as well.

What about other putative counter-examples to EDT? There are lots to consider, but at least one other one – namely, “Yankees vs. Red Sox” (see Arntzenius (2008)) — strikes me as dubious (though also, elegant). In this case, the Yankees win 90% of games, and you face a choice between the following bets:

                            Yankees win      Red Sox win
    You bet on Yankees           1               -2
    You bet on Red Sox          -1                2

Or, if we think of the outcomes here as “you win your bet” and “you lose your bet” instead, we get:

                            You win your bet    You lose your bet
    You bet on Yankees           1                  -2
    You bet on Red Sox           2                  -1

Before you choose your bet, an Oracle tells you whether you’re going to win your next bet. The issue is that once you condition on winning or losing (regardless of which), you should always bet on the Red Sox. So, the thought goes, EDT always bets on the Red Sox, and loses money 90% of the time. Betting on the Yankees every time does much better.  
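The tension between the two matrices is easy to verify. Here is a small Python sketch, with function names of my own choosing, that computes the plain expected value of each bet at the 90% base rate, and the payoff of each bet once the Oracle’s announcement is taken as given.

```python
P_YANKEES_WIN = 0.9

# PAYOFF[bet][winner], copied from the first matrix.
PAYOFF = {
    "Yankees": {"Yankees": 1, "Red Sox": -2},
    "Red Sox": {"Yankees": -1, "Red Sox": 2},
}


def unconditional_ev(bet):
    """Expected payoff of always making `bet`, ignoring the Oracle."""
    p = P_YANKEES_WIN
    return p * PAYOFF[bet]["Yankees"] + (1 - p) * PAYOFF[bet]["Red Sox"]


def payoff_given_oracle(bet, oracle_says_win):
    """Payoff of `bet`, taking the Oracle's win/lose announcement as given."""
    other_team = "Red Sox" if bet == "Yankees" else "Yankees"
    winner = bet if oracle_says_win else other_team
    return PAYOFF[bet][winner]


for bet in PAYOFF:
    print(
        bet,
        "| unconditional EV:", round(unconditional_ev(bet), 2),
        "| if told 'win':", payoff_given_oracle(bet, True),
        "| if told 'lose':", payoff_given_oracle(bet, False),
    )
```

Holding the announcement fixed, the Red Sox bet is better by one unit in both rows of the second matrix, while the unconditional expected value favors the Yankees. Note that `payoff_given_oracle` simply treats the announcement as independent of which bet you then make, which is the assumption the next paragraph calls into question.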

But something is fishy here. Specifically, the Oracle’s prediction, together with your knowledge of your own decision, leaks information that should render your decision-making unstable. Suppose, for example, that the Oracle tells you that you will lose your next bet. You then reason: “Conditional on knowing that I will lose my bet, I should bet on the Red Sox. But given that I’ll lose, this means that the Yankees will win, which means I should bet on the Yankees, which means I will win my bet. But I can’t win my bet, so the Yankees will lose, so I should bet on the Red Sox,” and so on. That is, you oscillate between reasoning using the second matrix, and reasoning using the first; and you never settle down.

(Note that if we allow for playing around with monopoly money, then this case, too, suffers from the same base-rate related problems as smoking lesion: e.g., either you can change the base rates of Yankee victory at will, or you’re somehow forced to play around in a manner consistent with both the 90% base rate and the Oracle’s accuracy, or somehow the Oracle’s accuracy doesn’t hold in conditions where you can play around.)

Even if we set aside smoking lesion and Yankees vs. Red Sox, though, there is at least one counterexample to EDT that seems to me pretty solidly damning, namely:

XOR blackmail: Having termites in your house is a million-dollar loss, and you don’t know if you have them. A credible and accurate predictor finds out if you have termites, then writes the following letter: “I am sending you this letter if and only if (a) I predict that you will pay me $1,000 upon receiving it, or (b) you have termites, but not both.” She then makes her prediction and follows the letter’s outlined procedure. If you receive the letter, should you pay?

(See Yudkowsky and Soares (2017), p. 24).

EDT pays, here. Why? Because conditional on paying, it’s much less likely that you’ve got termites, so paying is much better news than not paying. If you refuse to pay, you should call the exterminator (or do whatever you do with termites) pronto; if you pay, you can relax.

Or at least, you can relax for a bit. But if you’re EDT, you’re getting these letters all the time. Maybe the predictor decides to pull this stunt every day. You’re flooded with letters, all reflecting the prediction that you’ll pay. If you’d only stop paying, the letters would slow to a base-rate-of-termites-sized trickle. Try it with monopoly money: as you spawn over and over, you’ll find you can modulate the frequency of letter receipt at will, just by deciding to pay, or not, on the next round. But in real life, once you’ve got the letter, do you ever wise up, and decide, instead of paying, to already have termites? On EDT, it’s not clear (at least to me) why you would, absent some other change to the situation. Termites, after all, are terrible. And look at this letter, already sitting in your hand! It only comes given one of two conditions…     
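To see how the payment policy modulates letter frequency, here is a rough Python sketch comparing an always-pay policy with a never-pay policy. It assumes a perfectly accurate predictor and takes the termite base rate as a free parameter; the 1% used in the example is a made-up number, and the names are mine.

```python
TERMITE_LOSS = 1_000_000
PAYMENT = 1_000


def letter_sent(pays_on_letter, has_termites):
    """The letter goes out iff exactly one of (predicted to pay, has termites) holds."""
    return pays_on_letter != has_termites


def policy_summary(pays_on_letter, p_termites):
    """Return (probability of receiving a letter, expected total loss) for a fixed policy.

    Assumes a perfect predictor, so "predicted to pay" equals the policy itself.
    """
    p_letter = 0.0
    expected_loss = 0.0
    for has_termites, p in ((True, p_termites), (False, 1 - p_termites)):
        loss = TERMITE_LOSS if has_termites else 0
        if letter_sent(pays_on_letter, has_termites):
            p_letter += p
            if pays_on_letter:
                loss += PAYMENT
        expected_loss += p * loss
    return p_letter, expected_loss


# Hypothetical 1% base rate of termites.
for pays in (True, False):
    p_letter, loss = policy_summary(pays, p_termites=0.01)
    print(f"pays={pays}: P(letter)={p_letter:.2f}, expected loss=${loss:,.0f}")
```

Under these assumptions, the payer receives a letter whenever they lack termites and the refuser only when they have them, and the payer’s expected loss exceeds the refuser’s by exactly the expected payments. The termite risk itself is the same under both policies, which is the sense in which paying buys nothing but better news.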

Perhaps one thinks: the core issue here isn’t that you’re getting so many letters. Even if you know that the predictor is only going to pull this stunt once, paying seems pretty silly. Why? It’s that old thing about the past having already happened, about the opaque box already being empty or full. You’ve either already got termites, or you don’t, dude: stop trying to manage the news.

But is that the core issue? Consider:

More active termite blackmail: The predictor gets more aggressive. Once a year, she writes the following letter: “I predicted whether you would pay me $1,000 upon receipt of this letter. If I predicted ‘yes,’ I left your house alone. If I predicted ‘no,’ I gave you termites.” Then she predicts, obeys the procedure, and sends. If you receive the letter, should you pay?

Here, the “it’s too late, dude” objection still applies. CDT ignores letters like this. But CDT also gets given termites once a year. EDT, by contrast, pays, and stays termite free. What’s more, by hypothesis, the stunt gets pulled on everyone the same number of times, regardless of their payment patterns. In this sense, it’s more directly analogous to Newcomb’s problem. And I find that paying, here, seems more intuitive than in the previous case (though the fact that you ultimately want to deter this sort of behavior from occurring at all may bring in additional complications; if it helps, we can specify that the predictor’s not actually in this for getting money or for giving people termites — rather, she just likes putting people in weird decision-theory situations, and will do this regardless of how her victims respond).

We can consider other problems with EDT as well, beyond XOR blackmail. For example, a naïve formulation of EDT has trouble with cases where it starts out certain about what it’s going to do, or even very confident (see e.g. the “cosmic ray problem” on p. 24 of Yudkowsky and Soares (2017)). And more generally, the “managing the news” flavor of EDT makes it feel, to me, like the type of thing one could come up with counter-examples to. But it’s XOR blackmail, I find, that currently gives me the most pause (and note, too, that in XOR blackmail, we can imagine that you have arbitrary introspective access, such that tickle-defense type questions about whether all the factors influencing your decision are “transparent” or not don’t really apply). And I think that the importance of the way paying influences how many letters you get, as opposed to its trying to “control the past” more broadly, may be instructive.

Summarizing this section, then: my current sense is that:

  1. EDT’s “basic Bayesianism” makes it attractive.
  2. Really digging into EDT, especially re: tickle defenses, can get kind of gnarly.
  3. Yankees vs. Red Sox isn’t a good counterargument to EDT.
  4. EDT messes up in XOR blackmail.
  5. There are probably a bunch of other problems with EDT that I’m not really considering/engaging with.

Does this make EDT better or worse than CDT? Currently, I’m weakly inclined to say “better” – at least in theory. But trying to actually implement EDT also seems more liable to lead to pretty silly stuff. I’ll discuss some of this silly stuff in the final section. First, though, and motivated by XOR blackmail, I want to discuss one more broad bucket of decision-theoretic options and examples – namely, those associated with following policies you would’ve wanted yourself to commit to, even when it hurts.

IX. What would you have wanted yourself to commit to?


Parfit’s hitchhiker: You are stranded in the desert without cash, and you’ll die if you don’t get to the city soon. A selfish man comes along in a car. He is an extremely accurate predictor, and he’ll take you to the city if he predicts that once you arrive, you’ll go to an ATM, withdraw ten thousand dollars, and give it to him. However, once you get to the city, he’ll be powerless to stop you from not paying.

If you get to the city, should you pay him? Both CDT and EDT answer: no. By the time you get to the city, the risk of death in the desert is gone. Paying him, then, is pure loss (assuming you don’t value his welfare, and there are no other downstream consequences). Because they answer this way, though, both CDT and EDT agents rarely make it to the city: the man predicts, accurately, that they won’t pay.

Is this a problem? Some might answer: no, because paying in the city is clearly irrational. In particular, it violates what MacAskill (2019) calls:

Guaranteed Payoffs: When you’re certain about what the pay-offs of your different options would be, you should choose the option with the highest pay-off.

Guaranteed Payoffs, we should all agree, is an attractive principle, at least in the abstract. If you’re not taking the higher payoff, when you know exactly what payoffs your different actions will lead to, then what the heck are you doing, and why would we call it “rationality”?

On the other hand, is paying the driver really so silly? To me, it doesn’t feel that way. Indeed, I feel happy to pay, here (though I also think that the case brings in extra heuristics about promise-keeping and gratitude that may muddy the waters; better to run it with a mean and non-conscious AI system who demands that you just burn the money in the street, and kills itself before you even get to the ATM). What’s more, I want to be the type of person who pays. Indeed: if, in the desert, I could set up some elaborate and costly self-binding scheme – say, a bomb that blows off my arm, in the city, if I don’t pay – such that paying in the city becomes straightforwardly incentivized, I would want to do it. But if that’s true, we might wonder, why not skip all this expensive faff with the bomb, and just, you know, pay in the city? After all, what if there are no bombs around to strap to my arm? What if I don’t know how to make bombs? Need my survival be subject to such contingencies? Why not learn, and practice, that oh-so valuable (and portable, and reliably available) skill instead: how to make, and actually keep, commitments? (h/t Carl Shulman, years ago, for suggesting this sort of framing.)

That said, various questions tend to blur together here – and once we pull them apart, it’s not clear to me how much substantive (as opposed to merely verbal) debate remains. Everyone agrees that it’s better to be the type of person who pays. Everyone agrees that if you can credibly commit to paying, you should do it; and that the ability to make and keep commitments is an extremely useful one. Indeed, everyone agrees that, if you’re a CDT or EDT agent about to face this case, it’s better, if you can, to self-modify into some other type of agent – one that will pay in the city (and are commitments and self-modifications really so different? Is cognition itself so different from self-modification?). As far as I can tell (and I’m not alone in thinking this), the only remaining dispute is whether, given these facts, we should baptize the action of paying in the city with the word “rational,” or if we should instead call it “an irrational action, but one that follows from a disposition it’s rational to cultivate, a self-modification it’s rational to make, a policy it’s rational to commit to,” and so on.

Is that an interesting question? What’s actually at stake, when we ask it? I’m not sure. As I mentioned above, I tend towards anti-realism about normativity; and for anti-realists, debates about the “true rationality” aren’t especially deep. Ultimately, there are just different ways of arranging your mind, different ways of making decisions, different shapes that can be given to this strange clay of self and world. Ultimately, that is, the question is just: what you in fact do in the city, and what in fact that decision means, implies, causes, and so on. We talk about “rationality” as a means of groping towards greater wisdom and clarity about these implications, effects, and so on; but if you understand all of this, and make your decisions in light of full information, additional disputes about what compliments and insults are appropriate don’t seem especially pressing.

All that said, terminology aside, I do think that Parfit’s hitchhiker-type cases can lead to genuinely practical and visceral forms of internal conflict. Consider:

Deterrence: You have a button that will destroy the world. The aliens want to invade, but they want the world intact, and they won’t invade if they predict that you’ll destroy the world upon observing their invasion. Being enslaved by the aliens is better than death; but freedom far better. The aliens predict that you won’t press the button, and so start to invade. Should you destroy the world?

This is far from a fanciful thought experiment. Rather, this is precisely the type of dynamic that decision-makers with real nuclear codes at their fingertips have to deal with. Same with tree-huggers chaining themselves to trees, teenagers playing chicken, and so on.
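
To make the tension in Deterrence concrete, here is a minimal sketch in Python. The utility numbers and the `accuracy` parameter (the chance that the aliens predict your policy correctly) are illustrative assumptions of mine, not part of the case as stated; the only thing taken from the case is the ordering freedom > enslavement > death.

```python
# Illustrative utilities (assumed): freedom > enslavement > death.
FREEDOM, ENSLAVED, DEATH = 100.0, 10.0, 0.0


def deterrence_policy_value(press_if_invaded: bool, accuracy: float) -> float:
    """Expected utility of a policy, evaluated BEFORE the aliens predict.

    accuracy is the (assumed) probability that the aliens correctly
    predict which policy you follow.
    """
    if press_if_invaded:
        # Correct prediction: they stay away, and you keep your freedom.
        # Wrong prediction: they invade, you press, and everyone dies.
        return accuracy * FREEDOM + (1 - accuracy) * DEATH
    # Correct prediction: they invade, you don't press, and you're enslaved.
    # Wrong prediction: they stay away, and you keep your freedom.
    return accuracy * ENSLAVED + (1 - accuracy) * FREEDOM


def ex_post_value(press: bool) -> float:
    """Utility of the action AFTER you have seen the invasion begin."""
    return DEATH if press else ENSLAVED


if __name__ == "__main__":
    for acc in (0.5, 0.9, 0.99):
        print(acc,
              deterrence_policy_value(True, acc),
              deterrence_policy_value(False, acc))
    print("ex post:", ex_post_value(True), ex_post_value(False))
```

With these made-up numbers and a 99%-accurate predictor, the "press if invaded" policy is worth roughly 99 ex ante and the "never press" policy roughly 10.9; but once the invasion has started, pressing yields 0 and not pressing yields 10. That gap between the ex ante ranking of policies and the ex post ranking of actions is the whole problem.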

Or, more fancifully, consider:

Counterfactual mugging: Omega doesn’t know whether the X-th digit of pi is even or odd. Before finding out, she makes the following commitment. If the X-th digit of pi is odd, she will ask you for a thousand dollars. If the X-th digit is even, she will predict whether you would’ve given her the thousand had the X-th digit been odd, and she will give you a million if she predicts “yes.” The X-th digit is odd, and Omega asks you for the thousand. Should you pay?

(I use logical randomness, rather than e.g. coin-flipping, to make it more difficult to appeal to concern about versions of yourself that live in other quantum branches, possible worlds, and so on. Thanks to Katja Grace for suggesting this. That said, perhaps some such appeals are available regardless. For example, how did X get decided?)
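
Here is a rough sketch of the arithmetic behind counterfactual mugging, in Python. The dollar amounts come from the case above; the 50% prior credence that the digit is odd (`p_odd`) and the predictor `accuracy` are my own assumptions, and the choice of that prior is exactly the sort of thing that is up for grabs when we ask which epistemic position to evaluate policies from.

```python
COST = 1_000        # what Omega asks for if the digit is odd
PRIZE = 1_000_000   # what Omega gives if the digit is even and she predicts "yes"


def mugging_policy_value(pays: bool, p_odd: float = 0.5,
                         accuracy: float = 1.0) -> float:
    """Expected dollars for a policy, evaluated before learning the digit.

    p_odd:    assumed prior credence that the X-th digit is odd.
    accuracy: assumed probability that Omega predicts your policy correctly.
    """
    # Odd branch: Omega asks for the money, and the policy either pays or not.
    odd_branch = -COST if pays else 0
    # Even branch: Omega pays out only if she predicts you would have paid.
    p_predicts_yes = accuracy if pays else 1 - accuracy
    even_branch = p_predicts_yes * PRIZE
    return p_odd * odd_branch + (1 - p_odd) * even_branch


def ex_post_value(pays: bool) -> int:
    """Dollars gained once you already know the digit is odd."""
    return -COST if pays else 0


if __name__ == "__main__":
    print("policy 'pay':   ", mugging_policy_value(True))
    print("policy 'refuse':", mugging_policy_value(False))
    print("ex post pay / refuse:", ex_post_value(True), ex_post_value(False))
```

Under these assumptions the paying policy is worth $499,500 in expectation and the refusing policy $0, even though, once you know the digit is odd, paying just costs you $1,000.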

Finally, consider a version of Newcomb’s problem in which both boxes are transparent – i.e., you can see how Omega has predicted you’ll behave. Suppose you find that Omega has predicted that you’ll one-box, and so left the million there. Should you one-box, or two-box? What if Omega has predicted that you’ll two-box?

We can think of all these cases as involving an inconsistency between the policy that an agent would want to adopt, at some prior point in time/from some epistemic position (e.g., before the aliens invade, before we know the value of the X-th digit, before Omega makes her predictions), and the action that Guaranteed Payoffs would mandate given full information. And there are lots of other cases in this vein as well (see e.g., The Absent-Minded Driver, and the literature on dynamic inconsistency in game theory).

There is a certain broad class of decision theories, a number of which are associated with the Machine Intelligence Research Institute (MIRI), that put resolving this type of inconsistency in favor of something like “the policy you would’ve wanted to adopt” at center stage. (In general, MIRI’s work on decision theory has heavily influenced my own thinking – influence on display throughout this post. See also Meacham (2010) for another view in this vein, as well as the work of Wei Dai and others on “updatelessness.”) There are lots of different ways to do this (see e.g. the discussion of the 2x2x3 matrix here), and I don’t feel like I have a strong grip on all of the relevant choice-points. Many of these views are united, though, in violating Guaranteed Payoffs, for reasons that feel, spiritually, pretty similar.

What’s more, and importantly, these theories tend to get cases like XOR blackmail right, where e.g. classic EDT gets them wrong. Consider, for example, whether before you receive any letter, you would want to commit to paying, or not paying, upon receipt. If we assume that the base rate of termites will stay constant regardless, then committing to not paying seems the clear choice. After all, doing so won’t make it more likely that you get termites; rather, it’ll make it less likely that you get letters.  
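
The policy-level reasoning about XOR blackmail can be sketched as follows. I assume a perfectly reliable blackmailer who sends the letter iff exactly one of "you have termites" and "you would pay on receipt" is true, a fixed base rate of termites, and the usual illustrative figures of $1,000 for the payment and $1,000,000 for termite damage; the function name and return shape are mine.

```python
TERMITE_DAMAGE = 1_000_000  # assumed cost of an infestation
PAYMENT = 1_000             # assumed size of the blackmail payment


def xor_blackmail(pay_on_letter: bool, p_termites: float) -> tuple[float, float]:
    """Return (probability of getting a letter, expected cost) for a policy.

    The letter is sent iff (termites) XOR (you would pay on receipt).
    p_termites is the base rate of termites, which the policy cannot change.
    """
    if pay_on_letter:
        # Payers get a letter exactly when they have NO termites.
        p_letter = 1 - p_termites
        expected_cost = p_termites * TERMITE_DAMAGE + p_letter * PAYMENT
    else:
        # Refusers get a letter exactly when they DO have termites.
        p_letter = p_termites
        expected_cost = p_termites * TERMITE_DAMAGE
    return p_letter, expected_cost


if __name__ == "__main__":
    for policy in (True, False):
        p_letter, cost = xor_blackmail(policy, p_termites=0.01)
        print("pay" if policy else "refuse", p_letter, cost)
```

Both policies face the same expected termite damage; the paying policy just adds an expected payment on top. And whenever termites are rarer than a coin flip, refusing also means fewer letters. Classic EDT goes wrong because it conditions on the letter having arrived: given a letter, paying is perfect evidence of no termites, which looks like a bargain at $1,000.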

If necessary, these theories can also get results like one-boxing, and cooperating with your twin, without appeal to any weird magic about controlling the past. After all, one-boxing and cooperating are both policies that you would want yourself to commit to, at least from some epistemic positions, even in a plain-old, common-sense, CDT-spirited world. Maybe executing these policies looks like trying to execute some kind of acausal control — and maybe, indeed, advocates of such policies talk in terms of such control. But maybe this is just talk. After all, executing policies that violate Guaranteed Payoffs looks pretty weird in general (for example, it looks like burning money for certain), and perhaps we need not take decisions about how to conceptualize such violations all that seriously: the main thing is what happens with the money.

A key price of this approach, though, is the whole “burning money for certain” thing; and here, perhaps, some people will want to get off the train. “Look, I was down for one-boxing, or for cooperating with my twin, when I didn’t actually know the payoffs in question. But violating Guaranteed Payoffs is just too much! You’re just destroying value for certain. That’s all. That’s the whole thing you do. You blow up the world, trying to prevent something that you know has already happened. Yes, it’s good to commit to doing that ex ante. But ex post, isn’t it also just obviously stupid?”

For people with this combination of views, though, I think it’s important to keep in mind the spiritual continuity between violating Guaranteed Payoffs, and one-boxing/cooperating more generally. After all, one of the strongest arguments for two-boxing is that, if you knew what was in the box (like, e.g., your friend does), you’d be in a Guaranteed Payoffs-type situation, and then a follower of Guaranteed Payoffs would two-box every time. Indeed, I think that part of why “great grandpappy Omega, now long dead, leaves the boxes in the attic” prompts a two-boxing intuition is that in the attic, you sense that you’re about to move from a non-transparent Newcomb’s problem to a transparent one. That is, after you bring the one-box down from the attic, and open it, the other box isn’t going to disappear. The attic door is still open. The stairs still beckon. You could just go back up there and get that thousand. Why not do it? If you got the million, it’s not going to evaporate. And if you didn’t get the million, what’s the use of letting a thousand go to waste? But that’s just the type of thinking that leads to empty boxes…

X. Building statues to the average of all logically-impossible Gods

Overall, I don’t see violations of Guaranteed Payoffs as a decisive reason to reject approaches in the vein of “act in line with the policy you would’ve wanted to commit to from some epistemic position P” – and some disputes in this vicinity strike me as verbal rather than substantive. That said, I do want to flag an additional source of uncertainty about such approaches: namely, that it seems extremely unclear what they actually imply.

In particular, all the “violate Guaranteed Payoffs” cases above rely on some implied “prior” epistemic position (e.g., before the aliens invade, before Omega has made her prediction, etc), relative to which the policy in question is evaluated. But why is that the position, instead of some other one? Even if we were just “rewinding” your own epistemology (e.g., to back before you knew that the aliens were invading, but after you learned about how they were going to make their decision), there would be a question of how far to rewind. Back to your childhood? Back to before you were born, and were an innocent platonic soul about to be spawned into the world? What features does this soul have? In what order were those features added? Does your platonic soul know basic facts about logic? What credence does it have that it’ll get born as a square circle, or into a world where 2+2=5? What in the goddamn hell are we talking about?

Also, it isn’t just a question of “rewinding” your own epistemology to some earlier epistemic position you (or even, a stripped-down version of you) held. There may be no actual time when you knew the information you’re trying to “remember” (e.g., that Omega is going to pull a counter-factual-mugging type stunt) but not the information you’re trying to “forget” (e.g., that the X-th digit of pi is odd). So it seems like the epistemic position in question may need to be one that no one – and certainly not you — has ever, in fact, occupied. How are we supposed to pick out such a position? What desiderata are even relevant? I haven’t engaged much with questions in this vein, but currently, I basically just don’t know how this is supposed to work. (I’m also not the only one with these questions. See e.g. Demski here, on the possibility that “updatelessness is doomed,” and Christiano here. And they’ve thought more about it.)

What’s more, some (basically all?) of these epistemic positions don’t seem particularly exciting from a “winning” perspective – and not just because they violate Guaranteed Payoffs. For example: weren’t you a member of some funky religion as a child – one that you now reject? And weren’t you more generally kind of dumb and ignorant? Are you sure you want to commit to a policy from that epistemic position (see e.g. Kokotajlo (2019) for more)? Or are we, maybe, imagining a superintelligent version of your childhood self, who knows everything? But wait: don’t forget to forget stuff, too, like what will end up in the boxes. But what should we “forget,” what should we “remember,” and what should we learn-for-the-first-time-because-apparently-we’re-talking-about-superintelligences now?

And even if we had such an attractive and privileged epistemic position identified, it seems additionally hard (totally impossible?) to know what policy this position would actually imply. Suppose, to take a normal everyday example that definitely doesn’t involve any theoretical problems, that you are about to be inserted as a random “soul” into a random “world.” What policy should you commit to? As Parfit’s Hitchhiker, should you pay in the city? Or should you, perhaps, commit to not getting into the man’s car at all, even if doing so is free, in order to disincentivize your younger self from taking ill-advised trips into the desert? Or should you, perhaps, commit to carving the desert sands into statues of square circles, and then burning yourself at the stake as an offering to the average of all logically impossible Gods? One feels, perhaps, a bit at sea; and a bit at risk of, as it were, doing something dumb. After all, you’ve already gone in for burning value for certain; you’ve already started trying to reason like someone you’re not, in a situation that you aren’t in. And without constraints like “don’t burn value for certain” as a basic filter on your action space, the floodgates open wide. One worries about swimming well in such water.  

XI. Living with magic

Overall, the main thing I want to communicate in this post is: I think that the perfect deterministic twin’s prisoner’s dilemma case basically shows that there is such a thing as “acausal control,” and that this is super duper weird. For all intents and purposes, you can decide what gets written on whiteboards light-years away; you can move another man’s arm, in lock-step with your own, without any causal contact between him and you. It actually works, and that, I think, is pretty crazy. It’s not the type of power we think of ourselves as having. It’s not the type of power we’re used to trying to wield.

What does trying to wield it actually look like, especially in our actual lives? I’m not sure. I don’t have a worked-out decision theory that makes sense of this type of thing, let alone a view about how to apply it. As a first pass, though, I’d probably start by trying to figure out what EDT actually implies, once you account for (a) tickle-defense type stuff, and (b) decorrelations between your decision and the decisions of others that arise because you’re doing some kind of funky EDT-type reasoning, and they probably aren’t.

For example: suppose that you want other people to vote in the upcoming election. Does this give you reason to vote, not out of some sort of abstract “be the change you want to see in the world” type of ethic, but because, more concretely, your voting, even in causal isolation from everyone else, will literally (if acausally) increase non-you voter turnout? Let’s first stop and really grok that voting for this reason is a weird thing to do. You’re not just trying to obey some Kantian maxim, or to do your civic duty. You’re not just saying “what if everyone acted like that?” in the abstract, like a schoolteacher to an errant child, with no expectation that “everyone,” as it were, actually will. And you’re certainly not knocking on doors or driving neighbors to the polls. Rather, you’re literally trying to influence the behavior of other people you’ll never interact with, by walking down to the voting booth on your causally isolated island. Indeed, maybe your island is in a different time zone, and you know that the polls everywhere else are closed. Still, you reason, your choice’s influence can slip the surly bonds of space and time; the evening news can still be managed (indeed, some non-EDT decision theories vote even after they’ve seen the evening news).

Is this sort of thinking remotely sensible? Well, note that the EDT version, at least, makes sense only if you should actually expect a higher non-you voter turnout, conditional on you voting for this sort of reason, than otherwise. If the voting population is “perfect deterministic copies of myself who will see the exact same inputs,” this condition holds; and it holds in various weaker conditions, too. How much does it hold in the real world, though? That’s much less clear; and as ever, if you’re considering trying to manage the news, the first thing to check is whether the news is actually manageable. 
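
The EDT condition here is simple enough to write down. The sketch below, in Python, treats each of the other voters as having one conditional probability of voting given that you vote and another given that you abstain; the function name and both probabilities are hypothetical inputs of mine, and estimating them for real people is the hard (empirical) part.

```python
def edt_turnout_gain(n_others: int,
                     p_vote_given_i_vote: float,
                     p_vote_given_i_abstain: float) -> float:
    """Expected extra non-you turnout, conditional on voting vs. abstaining.

    Both probabilities are assumed inputs describing how strongly each other
    voter's decision is correlated with yours; nothing causal is modelled.
    """
    return n_others * (p_vote_given_i_vote - p_vote_given_i_abstain)


if __name__ == "__main__":
    # Perfect deterministic copies seeing the same inputs: full correlation.
    print(edt_turnout_gain(1_000, 1.0, 0.0))
    # Voters whose choices are independent of yours: no news to manage.
    print(edt_turnout_gain(1_000, 0.6, 0.6))
    # A weak correlation, with made-up numbers.
    print(edt_turnout_gain(1_000, 0.601, 0.600))
```

With perfect copies, the whole electorate moves with you; with independent voters, the gain is zero no matter how large the electorate is. Everything interesting lives in how far apart those two conditional probabilities really are.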

In particular, as Abram Demski emphasizes here, the greater the role of weird-decision-theory type calculations in your thinking, the less correlated your decisions will be with those of others who are thinking in less esoteric ways. Perhaps you should consider the influence of your behavior on the other people interested in non-causal decision-theories (evening news: “the weird decision theorists turn out in droves!”); but it’s a smaller demographic. That said, what sorts of correlations are at stake here is an empirical question, and there’s no guarantee that something common-sensical will emerge victorious. It seems possible, for example, that many people are implicitly implementing some proto-version of your decision theory, even if they’re not explicit about it.

Here’s another case that seems to me even weirder. Suppose that you’re reading about some prison camps from World War I. They sound horrible, but the description leaves many details unspecified, and you find yourself hoping that the guards in the prison camps were as nice as would be compatible with the historical evidence you’ve seen thus far. Does this give you, perhaps, some weak reason to be nicer to other people, in your own life, on the grounds that there is some weak correlation between your niceness, and the niceness of the guards? You’re all, after all, humans; you’ve got extremely similar genes; you’re subject to broadly similar influences; perhaps you and some of the guards are implementing vaguely similar decision procedures at some level; perhaps even (who knows?) there was some explicit decision theory happening in the trenches. Should you try to be the change you want to see in the past? Should you, now, try to improve the conditions in World War I prison camps? And if so: have you, perhaps, lost your marbles?

Perhaps some people will answer: look, the correlations are too weak, here, for such reasoning to get off the ground. To others, though, this will seem the wrong sort of reply. The issue isn’t that you’re wrong, empirically, about the correlations at stake – indeed, the extent of such correlations seems, in some sense, an open question. The issue is that you’re trying to improve the past at all.

There are other weird applications to consider as well. For example, once you can “control” things you have no causal interaction with, your sphere of possible control could in principle expand throughout a very large universe, allowing you to “influence” the behavior of aliens, other quantum branches, and so on (see e.g. Oesterheld (2017) for more). Indeed, there’s an argument for treating yourself as capable of such influence, even if you have comparatively low credence on the relevant funky decision theories, because being able to influence the behavior of tons of agents raises the stakes of your choice (see e.g. MacAskill et al (2019)). And taken seriously enough, the possibility of non-causal influence can lead to a very non-standard picture of the future – one in which “interactions” between causally-isolated civilizations throughout the universe/multi-verse move much closer to center stage.

Once you’ve started trying to acausally influence the behavior of aliens throughout the multiverse, though, one starts to wonder even more about the whole lost-your-marbles thing. And even if you’re OK with this sort of thing in principle, it’s a much further question whether you should expect any efforts in this broad funky-decision-theoretic vein to go well in practice. Indeed, my strong suspicion is that with respect to multiverse-wide whatever whatevers, for example, any such efforts, undertaken with our current level of understanding, will end up looking very misguided in hindsight, even if the decision theory that motivated them ends up vindicated. Here I think of Bostrom’s “ladder of deliberation,” in which one notices that whether an intervention seems like a good or bad idea switches back and forth as one reasons about it more, with no end in sight, thus inducing corresponding pessimism about the reliability of one’s current conclusions. Even if the weird-decision-theory ladder is sound, we are, I think, on a pretty early rung.

Overall, this whole “acausal control” thing is strange stuff. I think we should be careful with it, and generally avoid doing things that look stupid by normal lights, especially in the everyday situations our common-sense is used to dealing with. But the possibility of new, weird forms of control over the world also seems like the type of thing that could be important; and I think that perfect deterministic twins demonstrate that something in this vicinity is, at least sometimes, real. Its nature and implications, therefore, seem worth attention.

(My thanks to Paul Christiano, Bastian Stern, Nisan Stiennon, and especially to Katja Grace and Ketan Ramakrishnan, for discussion. And thanks, as well, to Abram Demski, Scott Garrabrant, Nick Beckstead, Rob Bensinger, and Ben Pace, for this exchange on related topics.)


Agent's policy determines how its instances act, but in general it also determines which instances exist, and that motivates thinking of the agent as the algorithm channeled by instances rather than as one of the instances controlling the others, or as all instances controlling each other. For example, in Newcomb's problem, you might be sitting inside the box with the $1M, and if you two-box, you have never existed. Grandpa decides to only have children if his grandchildren one-box. Or some copies in distant rooms numbered (on the outside) 1 to 5 writing integers on blackboards, with only the rooms whose number differs from the integer written by at most 1 being occupied. In the occupied rooms, the shape of the digits is exactly the same, but the choice of the integers determines which (if any) of the rooms are occupied. You may carefully write a 7, and all rooms are empty.

If you are the algorithm, which algorithm are you, and what instances are running you? Unfortunate policy decisions, such as thinking too much, can sever control over some instances, as in ASP, or when (as an instance) retracting too much knowledge (UDT-style) and then (as a resulting algorithm) having to examine... (read more)

Indeed, many philosophers are convinced by something in the vicinity (see e.g. the 2009 Phil Papers survey, in which two-boxing, at 31%, beats one-boxing, at 21%, with the other 47% answering “other” – though we might wonder what “other” amounts to in a case with only two options). 

You can set the respondent details to "fine" to see the specific "other" answers given:

Insufficiently familiar with the issue: 219 / 931 (23.5%)
Agnostic/undecided: 124 / 931 (13.3%)
Skip: 44 / 931 (4.7%)
The question is too unclear to answer: 19 / 931 (2.0%)
There is no fact of the matter: 15 / 931 (1.6%)
Accept another alternative: 6 / 931 (0.6%)
Other: 5 / 931 (0.5%)
Reject both: 5 / 931 (0.5%)
Accept both: 3 / 931 (0.3%)
Accept an intermediate view: 1 / 931 (0.1%)

Curated. This is a clear, engaging, and scholarly account of basic decision theories. I really like that this is written as the author's feelings about each theory rather than just an impartial textbook description. The main point, "it's weird that we can have 'acausal control'", is well worth engaging with if you hadn't already thought about it. I might quibble that the functional/updateless/MIRI-esque decision theories could be treated in more depth, but overall I heartily recommend this to anyone not already well-versed in the decision-theory literature.


I agree that figuring out what you "should have" precommitted can be fraught.

One possible response to that problem is to set aside some time to think about hypotheticals and figure out now what precommitments you would like to make, instead of waiting for those scenarios to actually happen.  So the perspective is "actual you, at this exact moment".

I sometimes suspect you could view MIRI's decision theories as an example of this strategy.

Alice:  Hey, Bob, have you seen this "Newcomb's problem" thing?

Bob:  Fascinating.  As we both have unshakable faith in CDT, we can easily agree that two-boxing is correct if you are surprised by this problem, but that you should precommit to one-boxing if you have the opportunity.

Alice:  I was thinking--now that we've realized this, why not precommit to one-boxing right now?  You know, just in case.  The premise of the problem is that Omega has some sort of access to our actual decision-making algorithm, so in principle we can precommit just by deciding to precommit.

Bob:  That seems unobjectionable, but not very useful in expectation; we're very unlikely to encounter this exact scenario.  It seems like wh

... (read more)

The output of this process is something people have taken to calling Son-of-CDT; the problem (insofar as we understand Son-of-CDT well enough to talk about its behavior) is that the resulting decision theory continues to neglect correlations that existed prior to self-modification.

(In your terms: Alice and Bob would only one-box in Newcomb variants where Omega based his prediction on them after they came up with their new decision theory; Newcomb variants where Omega's prediction occurred before they had their talk would still be met with two-boxing, even if Omega is stipulated to be able to predict the outcome of the talk.)

This still does not seem like particularly sane behavior, which means, unfortunately, that there's no real way for a CDT agent to fix itself: it was born with too dumb of a prior for even self-modification to save it.

Thanks. After thinking about your explanation for a while, I have made a small update in the direction of FDT. This example makes FDT seem parsimonious to me, because it makes a simpler precommitment.

I almost made a large update in the direction of FDT, but when I imagined explaining the reason for that update I ran into a snag. I imagined someone saying "OK, you've decided to precommit to one-boxing. Do you want to precommit to one-boxing when (a) Omega knows about this precommitment, or (b) Omega knows about this precommitment, AND the entangled evidence that Omega relied upon is 'downstream' of the precommitment itself? For example, in case (b), you would one-box if Omega read a transcript of this conversation, but not if Omega only read a meeting agenda that described how I planned to persuade you of option (a)."

But when phrased that way, it suddenly seems reasonable to reply: "I'm not sure what Omega would predict that I do if he could only see the meeting agenda. But I am sure that the meeting agenda isn't going to change based on whether I pick (a) or (b) right now, so my choice can't possibly alter what Omega puts into the box in that case. Thus, I see no advantage to precommitting to one-boxing in that situation."

If Omega really did base its prediction just on the agenda (and not on, say, a scan of the source code of every living human), this reply seems correct to me. The story's only interesting because Omega has god-like predictive abilities.

Which I guess shouldn't be surprising, because if there were a version of Newcomb's problem that cleanly split FDT from CDT without invoking extreme abilities on Omega's part, I would expect that to be the standard version.

I'm left with a vague impression that FDT and CDT mostly disagree about "what rigorous mathematical model should we take this informal story-problem to be describing?" rather than "what strategy wins, given a certain rigorous mathematical model of the game?" CDT thinks you are cho
One way of noticing the Son-of-CDT issue dxu mentioned is thinking of CDT as not just being unable to control the events outside the future lightcone, but as not caring about the events outside the future lightcone. So even if it self-modifies, it's not going to accept tradeoffs between the future and not-the-future of the self-modification event, as that would involve changing its preference (and somehow reinventing preference for the events it didn't care about just before the self-modification event). With time, CDT continually becomes numb to events outside its future, loses parts of its values. Self-modifying to Son-of-CDT stops further loss, but doesn't reverse past loss.

DALL-E 2 images:

A statue dedicated to the average of all logically-impossible Gods. Digital art.

A detailed 3D render of a statue dedicated to the average of all logically-impossible Gods. Digital art.

2Joe Carlsmith2y
:) -- nice glasses

For a long time, I could more-or-less follow the logical arguments related to e.g. Newcomb’s problem, but I didn’t really get it, like, it still felt wrong and stupid at some deep level. But when I read Joe’s description of “Perfect deterministic twin prisoner’s dilemma” in this post, and the surrounding discussion, thinking about that really helped me finally break through that cloud of vague doubt, and viscerally understand what everyone’s been talking about this whole time. The whole post is excellent; very strong recommend for the 2021 review.


Suppose you run your twins scenario, and the twins both defect.  You visit one of the twins to discuss the outcome.

Consider the statement:  "If you had cooperated, your twin would also have cooperated, and you would have received $1M instead of $1K."  I think this is formally provable, given the premises.

Now consider the statement:  "If you had cooperated, your twin would still have defected, and you would have received $0 instead of $1K."  I think this is also formally provable, given the premises.  Because we have assumed a ... (read more)

2Tentative Fate3y
Indeed. My reading is that the crux of the argument here is: causality implies no free will in twin PD, so equally, free will implies no causality; therefore, we can use our free will to break causality. The relevant quote: To me, Occam's razor prefers no-free-will to causality-breaking. Granted, causality is as mysterious as free will. But causality is more fundamental, more basic — it exists in non-agent systems too. Free will, on the other hand, privileges agents, as if there's something metaphysical about them. By the way, the causality view is still consistent with one-boxing. I go with causality.
Agreed. I think this type of reflection is the decision theory equivalent of calculating the perfect launch sequence in Kerbal Space Program. If you sink enough time into it, you can probably achieve it, but by then you'll be loooong past the point of diminishing returns, and very little of what you've learned will be applicable in the real world, because you've spent all your energy optimizing strategies that immediately fall apart the second any uncertainty or fuzziness is introduced into your simulation.
How so? Functional Decision Theory handles these situations beautifully, with or without uncertainty.
Has Functional Decision Theory ever been tested "in the field", so to speak? Is there any empirical evidence that it actually helps people / organizations / AIs make better decisions in the real world?
Zvi would tell you that yes it has: How I Lost 100 Pounds Using TDT.
Look, I'm going to be an asshole, but no, that doesn't count. There are millions of stories of the type "I lost lots of weight thanks to X even though nothing else had worked" around. They are not strong evidence that X works.
FWIW, in your comment above you had asked for "any empirical evidence". I agree that Zvi's story is not "strong evidence", but I don't think that means it "doesn't count" — a data point is a data point, even if inconclusive on its own. (And I think it's inappropriate to tell someone that a data point "doesn't count" in response to a request for "any empirical evidence". In other words, I agree with your assessment that you were being a little bit of an asshole in that response ;-) )
Alright, sorry. I should have asked "is there any non-weak empirical evidence that...". Sorry if I was condescending.
When deciding to skip the gym, FDT would tell you your decision procedure now is similar to the one you use each day when deciding to go to the gym. So you'd best go now, because then you always go. (This is a bit simplified, as the situation may not be the same each day, but the point stands.) Furthermore, FDT denies voting is irrational when there are enough voters who are enough similarly-minded to you (who vote when you vote, since their decision procedure is the same). This is a pretty cool result. Also, it may be worth noting that many real-life scenarios are Newcomblike: e.g. people predict what you will do using your microexpressions. Newcomb's Problem is just a special case.

Do you “manage the news” by refusing to read the morning’s newspaper, or by scribbling over the front page “Favored Candidate Wins Decisively!”? No: if you’re rational, your credence in the loss is still 70%.

I feel like the "No; if you're rational" bit is missing some of the intuition against EDT. Physical humans do refuse to read the morning's newspaper, or delay opening letters, or similar things, I think because of something EDT-ish 'close to the wire'. (I think this is what's up with ugh fields.)

I think there's something here--conservation of expected ... (read more)

"Here’s another case that seems to me even weirder. Suppose that you’re reading about some prison camps from World War I. They sound horrible, but the description leaves many details unspecified, and you find yourself hoping that the guards in the prison camps were as nice as would be compatible with the historical evidence you’ve seen thus far. Does this give you, perhaps, some weak reason to be nicer to other people, in your own life, on the grounds that there is some weak correlation between your niceness, and the niceness of the guards?"

I am wondering ... (read more)

I just love this quote. (And, I need it in isolation so I can hyperlink to it.)

"When I step back in Newcomb’s case, I don’t feel especially attached to the idea that it the way, the only “rational” choice (though I admit I feel this non-attachment less in perfect twin prisoner’s dilemmas, where defecting just seems to me pretty crazy). Rather, it feels like my conviction about one-boxing start to bypass debates about what’s “rational” or “irrational.” Faced with the boxes, I don’t feel like I’m asking myself “what’s the rational choice?” I feel like I’m, w... (read more)

Great post! I wonder if the 'weirdness' might be partially due to intuitions about human freedom of choice. For instance, it seems nonsensical to ask whether unicellular organisms could alter their behaviour to modify models predicting said behaviour, and thus 'control' their fate. Are humans in the same boat?

perfect deterministic software twins, exposed to the exact same inputs. This example shows, I think, that you can write on whiteboards light-years away, with no delays; you can move the arm of another person, in another room, just by moving your own.


In this situation, you can draw a diagram of the whole thing, including all identical copies, on the whiteboard. However, you can't point out which copy is you.

In this scenario, I don't think you can say that you are one copy or the other. You are both copies.

There could be external information you and your copy are not aware of that would distinguish you two, e.g. how far different stars appear, time since the big bang. And we can still talk about things outside Hubble volumes. These are mostly relational properties that can be used to tell spacetime locations apart.
Any two identical things could be distinguished by their spacetime locations... while still being identical in their own intrinsic properties. Basically, space and time are what allow numerical non-identity in spite of qualitative identity.

I think this can be somewhat clarified (and made less spooky) by observing that it's closely related to the concepts of kin selection and inclusive fitness (in evolutionary biology). It is in fact a good evolutionary strategy to be more cooperative when dealing with organisms that are closely related to you. The "Perfect deterministic twin prisoner’s dilemma" you propose is simply a special case of this where the organism you're dealing with is a clone.

In the "software twins" thought exercise, you have a "perfect, deterministic copy". But if it's a perfect copy and deterministic, then you're also deterministic. As you say, compatibilism is central to making this not incoherent; presumably no decision theory is relevant if there are no decisions to be made.

I think a key idea in compatibilism is that decisions are not made at a particular instant in time. If a decision is made on the spot, disconnected from the past, it's not compatibilism. If a decision is a process that takes place over time, the only wa...

I liked the writing style. But it seems no one in the comments has noted the obvious point that it's not "you control the aliens" but "the aliens control you" (although this sounds even crazier, and freakish in everyday life). In other words, you are in any case a simulation, but one whose results can predict the decision of another agent, including a decision based on a prediction, and so on. This removes all questions about magic (although it can be fun to think about). It can, however, cause a problem for the calculator, which, instead of computing the result of "5 + 7", will try to determine what it will output for the query "5 + 7", and so never gets around to calculating the sum.

Re: the perfect deterministic twin prisoner's dilemma:

You’re a deterministic AI system, who only wants money for yourself (you don’t care about copies of yourself). The authorities make a perfect copy of you, separate you and your copy by a large distance, and then expose you both, in simulation, to exactly identical inputs (let’s say, a room, a whiteboard, some markers, etc). You both face the following choice: either (a) send a million dollars to the other (“cooperate”), or (b) take a thousand dollars for yourself (“defect”). 

If we say there are two...

This doesn't make any sense to me. People are made of atoms. People make choices. Nothing is inconsistent about that. If two people were atomically identical, they'd make the same choices. But that wouldn't change anything about how the choice was happening. Right? Suppose we made an atom-by-atom copy of you, as in the post. Does the existence of this copy mean that you stop choosing your own decisions? Have I just misunderstood what you're saying?
Thanks, this gives me another chance to try to lay out this argument (which is extra-useful because I don't think I've hit upon the clearest way of making the point yet): Absolutely. But "choice", like agency, is a property of the map, not of the territory. If you fully specify the initial position of all of the atoms making up my body and their velocities, etc. -- then clearly it's not useful to speak of me making any choices. You are in the position of Laplace's demon: you know where all my atoms are right now, you know where they will be in one second, and the second after that, and so on. We can only meaningfully talk about the concept of choice from a position of partial ignorance. (Here I'm speaking from a Newtonian framing, with atoms and velocities, but you could translate this to QM.) Similarly, if you performed your experiment and made an atom-by-atom copy of me, then you know that I will make the same choice as my clone. It doesn't make sense to talk from your perspective about how I should make my "choice" -- what I and my clone will do is already baked in by the law of motion for my atoms, from the assumption that you know we're atom-by-atom copies. (If "I" am operating from an ignorant perspective, then "I" can still talk about "making a choice" from "my" perspective.) Does that make sense? Do you see what I'm trying to say? Do you see any flaws, if so?
Here's a related old comment from @Anders_H that I think frames the issue nicely, for my own reference at the very least:   (He goes on to say -- less relevantly for the discussion here, but again I like the framing so am recording to remind future-me -- "CDT and TDT differ in how they operationalize choice, and therefore whether the decision theories are consistent with free will. In Causal Decision theory, the agents choose actions from a choice set. In contrast, from my limited understanding of TDT/UDT, it seems as if agents choose their source code. This is not only inconsistent with my (perhaps naive) subjective experience of free will, it also seems like it will lead to an incoherent concept of "choice" due to recursion.")

Why can't I think of myself as a randomly sampled voter? 

Same reason you can't ignore other relevant pieces of information -- doing so makes your probability assignments less accurate. For example, if you know that John is a vocal supporter of the less-popular party, you're not going to ignore that information and assign a high probability to the proposition that he votes for the winner.

If you're looking at this ex ante, your probability of voting for the winner is ~50% because your vote is uncorrelated with everyone else's. For every possible arrangem...

The coin flip argument doesn't work because your vote isn't drawn from the same distribution as most votes. E.g. if 60% of voters vote for A and 40% for B, then A will win, but your coin flip only has 50% probability of picking A, so it's 50-50 for you. (Or well, slightly better odds than 50-50 because your vote exerts a causal force on who wins, but unless the voting population is smallish this effect is negligible. And it disappears in the presence of an opponent.)
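The arithmetic in the comment above can be checked with a small sketch. It rests on assumptions that are not in the original comment: every other voter independently picks A with probability 0.6, the number of other voters is even (so the total is odd and there are no ties), and the function name `prob_coin_flipper_backs_winner` is made up for illustration.

```python
from math import comb


def prob_coin_flipper_backs_winner(n_others: int, p_a: float = 0.6) -> float:
    """Chance that a fair-coin voter ends up on the winning side.

    Assumption (not from the comment): each of the other n_others voters
    independently votes A with probability p_a. n_others must be even so
    that the total number of votes is odd and ties cannot occur.
    """
    if n_others % 2 != 0:
        raise ValueError("n_others must be even so that there are no ties")
    half = n_others // 2

    def pmf(k: int) -> float:
        # Probability that exactly k of the other voters choose A.
        return comb(n_others, k) * p_a**k * (1 - p_a) ** (n_others - k)

    # If I vote A, A wins whenever at least half of the others chose A.
    a_wins_if_i_vote_a = sum(pmf(k) for k in range(half, n_others + 1))
    # If I vote B, B wins whenever at most half of the others chose A.
    b_wins_if_i_vote_b = sum(pmf(k) for k in range(0, half + 1))
    return 0.5 * a_wins_if_i_vote_a + 0.5 * b_wins_if_i_vote_b


for n in (2, 10, 100, 500):
    print(n, prob_coin_flipper_backs_winner(n))
```

Under these assumptions the result equals one half plus half the probability that the other voters split exactly evenly, which is the "slightly better odds than 50-50" term: it is noticeable for a tiny electorate and shrinks toward zero as the electorate grows.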

I use acausal control between my past and future selves. I have a manual password-generating algorithm based on the name and details of a website. Sometimes there are ambiguities (like whether to use the name of a site vs. the name of the platform, or whether to use the old name or the new name).

Instead of making rules about these ambiguities, I just resolve them arbitrarily however I feel like it (not "randomly" though). Later, future me will almost always resolve that ambiguity in the same way!


"What’s more: such strings can’t be severed. Try, for example, to make the two whiteboards different. Imagine that you’ll get ten million dollars if you succeed. It doesn’t matter: you’ll fail. Your most whimsical impulse, your most intricate mental acrobatics, your special-est snowflake self, will never suffice: you can no more write “up” while he writes “down” than you can floss while the man in the bathroom mirror brushes his teeth. "

I'd just flip a coin a bunch of times and write its results, or do some similar process to introduce entropy!

But wait, th...

This is a fact about the world, not about the room. I don't see what the issue is with giving the agent the definition of the world and proving that yes, there are two instances there and there. If the agent knows the room, they can check that it's the room that is in these two locations, though you would need to stipulate that the presented definition of the world is correct, or that the agent already knew it.

Nice post! As you can probably imagine, I agree with most of the stuff here.

>VII. Identity crises are no defense of CDT

On 1 and 2: This is true, but I'm not sure one-boxing / EDT alone solves this problem. I haven't thought much about selfish agents in general, though.

Random references that might be of interest:

>V. Monopoly money

As far as I can tell, this kind of point was first made on p. 108 here:

Gardner, Martin (1973). “Free will revisited, with a mind-bending prediction paradox by William Newcomb”. In: Scientific American 229.1, pp. 104–109.

Cf. h...

I know it's off topic, but I hope Omega is precise in how it phrases questions, because Paris is in Ohio, and the Eiffel Tower is in Cincinnati.

indeed they are now. retrocausality in action? :)

That said, if you’re in the real case, with the real incentives, then it’s ultimately the correlation gives those incentives that seems relevant

Should that be "given those incentives"?

Joe Carlsmith (3y):
Yes, edited :)

(A lot of this has been covered before.)

Showing that you can control such things* doesn't seem to disprove CDT. It seems to motivate different CDT dynamics. (In case that's a source of confusion, it could be called something else like Control Decision Theory.)

*taking this as given

Instead of picking one option you could randomize. (If Newcomb can read my mind, then a coin flip should be no problem.)

Are you really supposed to just leave it there, sitting in the attic? What sort of [madness] is that?

If it's for someone else...

Sometimes, one-boxers object: if
...

This feels like the kind of philosophical pondering that only makes any amount of sense in a world of perfect spherical cows, but immediately falls apart when you consider realistic real-world parameters.

Like... to go back to the Newcomb's problem... perfect oracles that can predict the future obviously don't exist. I mean, I know the author knows that. But I think we disagree on how relevant that is?

Discussions of Newcomb's problem usually handwave the oracle problem away; eg "Omega’s predictions are almost always right"... but the "almost" is pulling a l...

I haven't read yet, but seems relevant to the Facebook group Effective altruism: past people

I also made a post on Less Wrong sketching out reasons why backwards causation might not necessarily be absurd, but more for physics-related reasons. I would be keen to see someone with more physics knowledge develop this argument in greater depth.

I also feel that the Perfect deterministic twin prisoner’s dilemma is the strongest counter-example for CDT and really liked the "play money" intuition pump that you provided.

We can think of the magic, here, as arising centrally because compatibilism about free will is true... Is that changing the past? In one se

...
I think a good way of setting up augmented reality is with CDT-style surgery on an algorithm. By uncoupling an enactable event (action/decision/belief) from its definition, you allocate a new free variable (free will) that the world will depend on, and eventually set that variable to whatever you decide it to be, ensuring that the cut closes. The trick is to set up the cut in a way that won't be bothered by the possibility of divergence between the enactable variable and its would-be definition, and that's easier to ensure in a specifically constructed setting of an agent's abstract algorithm rather than in a physical world of unclear nature.
CDT surgery is pretty effective most of the time, but the OP describes some of its limitations. I'm confused - are you just claiming it is effective most of the time or that we shouldn't worry too much about these limitations?
Surgery being performed on the algorithm (more carefully, on the computation specified by the algorithm) rather than on instances in the world is the detail that makes the usual problems with CDT go away, including the issues discussed in the post.

I think a lot of writing/thinking about this topic is needlessly complicated. CDT clearly doesn't work if its causal model is wrong. I don't get why there's any controversy about that. Further, it's incredibly misleading to use the word "control" when you mean "correlated". In the cases of constrained behavior (by superior modeling or simulation of the perception/decision mechanism), that's not actually a free cause - the "choice" is actually caused by some upstream event or state of the universe.

You can describe the same thing at two levels of abstraction: "I moved the bishop to threaten my opponent's queen" vs "I moved the bishop because all the particles and fields in the universe continued to follow their orderly motions according to the fundamental laws of physics, and the result was that I moved the bishop". The levels are both valid, but it's easy to spout obvious nonsense by mixing them up: "Why is the chess algorithm analyzing all those positions? So much work and wasted electricity, when the answer is predetermined!!!!" :-P

Anyway, I think (maybe) your comment mixes up these two levels. When we say "control" we're talking about a thing that only makes sense at the higher (algorithm) level. When we say "the state of the universe is a constraint", that only makes sense at the lower (physics) level.

For the identical twins, I think if we want to be at the higher (algorithm) level, we can say (as in Vladimir_Nesov's comment) that "control" is exerted by "the algorithm", and "the algorithm" is some otherworldly abstraction on a different plane of existence, and "the algorithm" has two physical instantiations, and when "the algorithm" exerts "control", it controls both of the t...

Yup, there are different levels of abstraction to model/predict/organize our understanding of the universe. These are not exclusive, nor independent, though - there's only one actual universe! Mixing (or more commonly switching among) levels is not a problem if all the levels are correct on the aspects being predicted. The lower levels are impossibly hard to calculate, but they're what actually happens. The higher levels are more accessible, but sometimes wrong. When you get contradictory results at a high level, you know something's wrong, and have to look at the lower levels (and when this isn't possible, as it so often isn't, you kind of have to guess, but you need to be clear that you're guessing and that your model is being used outside its validity domain). This is relevant when talking about "control", as there are some things that "feel" possible (say, moving the bishop to a different-colored space on the board), but actually aren't (because the lower-level rules don't work that way).

I am surprised that, to this day, there are people on LW who haven't yet dissolved free will, despite the topic being covered explicitly (both in the Sequences and in a litany of other posts) over a decade ago.

No, "libertarian free will" (the idea that there exists some notion of "choice" independent of and unaffected by physics) does not exist. Yes, this means that if you are a god sitting outside of the universe, you can (modulo largely irrelevant fuzz factors like quantum indeterminacy) compute what any individual physical subsystem in the universe will do, including subsystems that refer to themselves as "agents".

But so what? In point of fact, you are not a god. So what bearing does this hypothetical god's perspective have on you, in your position as a merely-physical subsystem-of-the-universe? Perhaps you imagine that the existence of such a god has deep relevance for your decision-making right here and now, but if so, you are merely mistaken. (Indeed, for some people who consistently make this mistake, it may even be the case that hearing of the god's existence has harmed them.)

Suppose, then, that you are not God. It follows that you cannot know what you will decide before ac...

There is no clear account of this topic. It's valuable to remain aware of that, so that the situation may be improved. Many of the points you present as evident are hard to interpret, let alone ascertain; it's not a situation where doubt and disagreement are inappropriate.
The notion of control that makes sense to me is enacting a particular self-fulfilling belief out of a collection of available correct self-fulfilling beliefs. This is not correlation (in case of correlation one should look for a single belief in control of the correlated events), but the events controlled by such beliefs may well be instantiated by processes unrelated to the algorithm that determines which of the possible self-fulfilling beliefs gets enacted, that is unrelated to the algorithm that controls the events. The belief itself doesn't have to be explicitly instantiated at all, it's part of the algorithm's abstract computation. The processes instantiating the events, and those channeling the algorithm, only have to be understood by the algorithm that discovers correct self-fulfilling beliefs and decides which one to enact, these processes don't have to themselves be controlled by it, in fact them not being controlled makes for a better setup, this way the belief under consideration is more specific. (I don't understand the preferred alternative you refer to with "free cause".)
I typed too fast, and combined "free choice" and "uncaused action". I don't have a gears-level understanding of causality that admits of BOTH the predictability of the decision AND an algorithm that "decides which one to enact". It seems to me that in order to be predictable, the decision has to be caused by some observable configuration BEFORE the prediction. That is, there is an upstream cause to the prediction and to the decision.
Rob Bensinger (3y):
I disagree that "control" is misleading. Or rather, I think that the concept of "controlling" something is sort of weird and maybe doesn't make sense when you drill down on it; and the best ways of turning it into a concept that makes sense (and has practical importance) tend to require taking some correlations into account. Just saying "correlation" also isn't sufficient, because not all correlations are preserved across different actions you can take.
I think we're agreed there.  IMO, THAT is the complexity to focus on - what causes decisions, and what counterfactual decisions are "possible".  
It would require asymmetries and counterfactuals, but they exist, or at least there are good enough approximations to them. PS. What effect would the incomprehensibility of control have on the Control Problem?
This. The deterministic prisoner's dilemma reminds me a lot of quantum entanglement and Bell's theorem experiments - except it doesn't even have THAT amount of mystery, it's just plain old correlation. If I pick two boxes, put $1000 into one, and send them both at near lightspeed in opposite directions, you're not doing FTL signalling when you open one and find the money, thus deducing instantly that the other is empty. This is the same, but it feels weird because intelligences are involved; however, unless you believe in a supernatural source of free will (in which case CDT is the right choice regardless, and you could reasonably defect), intelligences should be subject to the same exact causal chains as boxes full of money.
I agree. EPR means a lack of local determinism, so the solution is ambiguous between indeterminism and nonlocal determinism.

In Yankees vs. Red Sox, the described oracle (predicting "win" or "lose") is not always possible. The behavior of the agent changes depending on the oracle's prediction. There may not be a possible oracle prediction such that the agent behaves according to it.
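The worry about oracle consistency can be made concrete with a toy fixed-point search. Everything here is a hypothetical illustration rather than the setup from the post: the agent is modelled as a function from the announced prediction to a bet, the game's winner is held fixed, and the policy names `contrarian` and `steady` are invented for this sketch.

```python
from typing import Callable, List

PREDICTIONS = ("win", "lose")


def consistent_predictions(policy: Callable[[str], str], winner: str) -> List[str]:
    """Return every announcement the oracle could make truthfully.

    An announcement is consistent when the bet the agent places after
    hearing it turns out to win exactly if the announcement said "win".
    """
    result = []
    for prediction in PREDICTIONS:
        bet = policy(prediction)
        agent_wins = bet == winner
        if agent_wins == (prediction == "win"):
            result.append(prediction)
    return result


def contrarian(prediction: str) -> str:
    # Hypothetical agent whose bet depends on what it is told.
    return "Red Sox" if prediction == "win" else "Yankees"


def steady(prediction: str) -> str:
    # Hypothetical agent that ignores the oracle entirely.
    return "Yankees"


print(consistent_predictions(contrarian, winner="Yankees"))
print(consistent_predictions(steady, winner="Yankees"))
```

With the Yankees winning, the contrarian policy leaves the oracle no truthful announcement at all, while the steady policy leaves exactly one. This is only a sketch of the logical point; it says nothing about which policy a given decision theory would actually recommend.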

1. Omega predictors are impossible

They are unstable/impossible not just in practice but in theory as well. It's theoretically not possible for Omega to exist because the decision of 1-2 box is recursive. You're essentially invoking a magical agent that somehow isn't affected by infinite recursion.

"Omega can 'snap' through the infinite recursive loop." No it can't. And if you claim it can you're essentially dropping a nuke inside your logical system that can probably produce all sorts of irrational true=false theorems.

2. Writing on whiteboards, acausal cont...

You should have a look at the conference on retrocausation.  And it would also be valuable to look at Garret Moddel's experiments on the subject.

You're not a randomly sampled voter to begin with. You're yourself. In this specific case, you could model yourself as a random voter among the voters who used a coinflip. But that's different from modelling yourself as a random voter overall.

In your "active termite blackmail" example, you say that the "it's too late" objection still applies.  That might be true as regards this year's termites, but you specified this recurs every year.  It seems to me there's plenty of room for this year's decision to (causally) influence next year's chances of termites; whether you pay this year seems like strong evidence about whether you will pay next year.

(EDIT: Fixed typo.)

This is probably too trivial a point to mention, but FWIW:

“Try, for example, to make the two whiteboards different. Imagine that you’ll get ten million dollars if you succeed. It doesn’t matter: you’ll fail. Your most whimsical impulse, your most intricate mental acrobatics, your special-est snowflake self, will never suffice”

I know you specified that both AIs are internally deterministic and have identical inputs (rooms, whiteboards etc), but if I were them I’d try and seek out some indeterminacy elsewhere. Eg go out of the door, get some dice (or quantum...


Indeed, for all intents and purposes, you control what he does. Imagine, for example, that you want to get something written on his whiteboard: let’s say, the words “I am the egg man; you are the walrus.” What to do? Just write it on your own whiteboard. Go ahead, try it. It will really work. When you two rendezvous after this is all over, his whiteboard will bear the words you chose. In this sense, your whiteboard is a strange kind of portal; a slate via which you can etch your choices into his far-away world; a chance to act, spookily, at a distance.

I...

This has been voted down too much. I think it's a pretty good objection. What does it even mean to say that you can "control" your duplicate if we are postulating that what you and your duplicate do is a deterministic function of your current states? What does it even mean to say that you can control or decide anything under these circumstances?
It means that the relationship that you call (from your subjective perspective) "my controlling what I do" between the Deciding and the Everything Else, is the same as the relationship between the Deciding in you and the Everything Else in your duplicate (as well as D-dup:EE-dup and D-dup:EE-you).
But if control is inherently asymmetric, it can't be a relationship that is "the same" or symmetric. That might not be your definition of control. I can explain control in terms of asymmetry and counterfactuals. I haven't seen an alternative explanation.
(I don't understand what you're saying.)
By control, I mean that I can control for example where my hand goes. That's asymmetric between me and my hand, but it's very similar to the relationship between you and your hand; and it's even more similar to the relationship between my duplicate and their hand.
The obvious answer to this is that under these circumstances, you don't control your hand or anything else either.
Of course you know that sometimes the right way to understand the situation is to say that you control your hand. Right? That's what we're talking about here.
Ok. You're controlling your hand, not vice versa. But what about the relationship between you and your duplicate....who is the controller and who is controlled?
The parts of you and your duplicate that do the controlling, are the same thing.
No, they are exact duplicates but numerically distinct.
It's neither causality nor correlation: it's subjunctive dependence, of which causality is a special case. Since your counterpart is implementing the same decision procedure as you, making decision X in situation Y means your counterpart does X in Y too.
So is subjunctive dependence control?
(First, a correction: I said it's neither causality nor correlation, but it is of course correlation; it's just stronger than that.) I'd say yes, but I haven't thought much about control yet. If I cooperate, so does my twin. If, counterfactually, I defected instead, then my twin would also have defected. I'd see that as control, but it depends on your definition I guess.
You and your twin would be synchronised. It's literally synchronicity, an acausal connecting principle.
Yes, it's acausal. No disagreement there!
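The "if I cooperate, so does my twin" point in this thread can be sketched in a few lines. This is a minimal illustration under stated assumptions: the payoffs are the ones quoted from the post (a million dollars sent to the other for cooperating, a thousand taken for oneself for defecting), while the function names and the `"room, whiteboard, markers"` input string are hypothetical stand-ins.

```python
from typing import Callable, Tuple


def payoff(my_move: str, twin_move: str) -> int:
    """Dollars I receive, using the payoffs quoted from the post."""
    total = 0
    if twin_move == "cooperate":
        total += 1_000_000  # the twin sends me a million dollars
    if my_move == "defect":
        total += 1_000  # I take a thousand dollars for myself
    return total


def play_twins(decision_procedure: Callable[[str], str], inputs: str) -> Tuple[str, str, int]:
    # Both copies run the same deterministic procedure on the same inputs,
    # so the two moves cannot differ.
    my_move = decision_procedure(inputs)
    twin_move = decision_procedure(inputs)
    return my_move, twin_move, payoff(my_move, twin_move)


def always_cooperate(inputs: str) -> str:
    return "cooperate"


def always_defect(inputs: str) -> str:
    return "defect"


for procedure in (always_cooperate, always_defect):
    print(procedure.__name__, play_twins(procedure, "room, whiteboard, markers"))
```

Because one procedure produces both moves, the off-diagonal outcomes (one copy cooperating while the other defects) are never reached, whatever the procedure is. Whether that dependence deserves the word "control" is exactly what the thread is arguing about; the code only shows the dependence itself.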