# 8

Trigger warning: In a thought experiment in this post, I used a hypothetical torture scenario without thinking, even though it wasn't necessary to make my point. Apologies, and thanks to an anonymous user for pointing this out. I'll try to be more careful in the future.

Should you pay up in the counterfactual mugging?

I've always found the argument about self-modifying agents compelling: If you expected to face a counterfactual mugging tomorrow, you would want to choose to rewrite yourself today so that you'd pay up. Thus, a decision theory that didn't pay up wouldn't be reflectively consistent; an AI using such a theory would decide to rewrite itself to use a different theory.

But is this the only reason to pay up? This might make a difference: Imagine that Omega tells you that it threw its coin a million years ago, and would have turned the sky green if it had landed the other way. Back in 2010, I wrote a post arguing that in this sort of situation, since you've always seen the sky being blue, and every other human being has also always seen the sky being blue, everyone has always had enough information to conclude that there's no benefit from paying up in this particular counterfactual mugging, and so there hasn't ever been any incentive to self-modify into an agent that would pay up ... and so you shouldn't.

I've since changed my mind, and I've recently talked about part of the reason for this, when I introduced the concept of an l-zombie, or logical philosophical zombie, a mathematically possible conscious experience that isn't physically instantiated and therefore isn't actually consciously experienced. (Obligatory disclaimer: I'm not claiming that the idea that "some mathematically possible experiences are l-zombies" is likely to be true, but I think it's a useful concept for thinking about anthropics, and I don't think we should rule out l-zombies given our present state of knowledge. More in the l-zombies post and in this post about measureless Tegmark IV.) Suppose that Omega's coin had come up the other way, and Omega had turned the sky green. Then you and I would be l-zombies. But if Omega was able to make a confident guess about the decision we'd make if confronted with the counterfactual mugging (without simulating us, so that we continue to be l-zombies), then our decisions would still influence what happens in the actual physical world. Thus, if l-zombies say "I have conscious experiences, therefore I physically exist", and update on this fact, and if the decisions they make based on this influence what happens in the real world, a lot of utility may potentially be lost. Of course, you and I aren't l-zombies, but the mathematically possible versions of us who have grown up under a green sky are, and they reason the same way as you and me—it's not possible to have only the actual conscious observers reason that way. Thus, you should pay up even in the blue-sky mugging.

But that's only part of the reason I changed my mind. The other part is that while in the counterfactual mugging, the answer you get if you try to use Bayesian updating at least looks kinda sensible, there are other thought experiments in which doing so in the straightforward way makes you obviously bat-shit crazy. That's what I'd like to talk about today.

*

The kind of situation I have in mind involves being able to influence whether you exist, or more precisely, influence whether the version of you making the decision exists as a conscious observer (or whether it's an l-zombie).

Suppose that you wake up and Omega explains to you that it kidnapped you and some of your friends back in 2014, and put you into suspension; it's now the year 2100. It then hands you a little box with a red button, and tells you that if you press that button, Omega will slowly torture you and your friends to death; otherwise, you'll be able to live out a more or less normal and happy life (or to commit painless suicide, if you prefer). Furthermore, it explains that one of two things has happened: Either (1) humanity has undergone a positive intelligence explosion, and Omega has predicted that you will press the button; or (2) humanity has wiped itself out, and Omega has predicted that you will not press the button. In any other scenario, Omega would still have woken you up at the same time, but wouldn't have given you the button. Finally, if humanity has wiped itself out, Omega won't let you try to "reboot" it; in this case, you and your friends will be the last humans.

There's a correct answer to what to do in this situation, and it isn't to decide that Omega's just given you anthropic superpowers to save the world. But that's what you get if you try to update in the most naive way: If you press the button, then (2) becomes extremely unlikely, since Omega is really really good at predicting. Thus, the true world is almost certainly (1); you'll get tortured, but humanity survives. For great utility! On the other hand, if you decide to not press the button, then by the same reasoning, the true world is almost certainly (2), and humanity has wiped itself out. Surely you're not selfish enough to prefer that?

The correct answer, clearly, is that your decision whether to press the button doesn't influence whether humanity survives, it only influences whether you get tortured to death. (Plus, of course, whether Omega hands you the button in the first place!) You don't want to get tortured, so you don't press the button. Updateless reasoning gets this right.

*

Let me spell out the rules of the naive Bayesian decision theory ("NBDT") I used there, in analogy with Simple Updateless Decision Theory (SUDT). First, let's set up our problem in the SUDT framework. To simplify things, we'll pretend that FOOM and DOOM are the only possible things that can happen to humanity. In addition, we'll assume that there's a small probability $\textstyle \varepsilon$ that Omega makes a mistake when it tries to predict what you will do if given the button. Thus, the relevant possible worlds are $\textstyle \Omega = \{\mathrm{foom}, \mathrm{doom}\} \times \{\mathrm{correct},\mathrm{incorrect}\}$. The precise probabilities you assign to these don't matter very much; I'll pretend that FOOM and DOOM are equiprobable, $\textstyle \mathbb{P}(x,\mathrm{incorrect}) = \varepsilon/2$ and $\textstyle \mathbb{P}(x,\mathrm{correct}) = (1-\varepsilon)/2$.

There's only one situation in which you need to make a decision, $\textstyle \mathcal{I} = \{*\}$; I won't try to define NBDT when there is more than one situation. Your possible actions in this situation are to press or to not press the button, $\textstyle \mathcal{A}(*) = \{P,\neg P\}$, so the only possible policies are $\textstyle \pi_P$, which presses the button ($\textstyle \pi_P(*) = P$), and $\textstyle \pi_{\neg P}$, which doesn't ($\textstyle \pi_{\neg P}(*) = \neg P$); $\textstyle \Pi = \{\pi_P,\pi_{\neg P}\}$.

There are four possible outcomes, specifying (a) whether humanity survives and (b) whether you get tortured: $\textstyle \mathcal{O} = \{\mathrm{foom}, \mathrm{doom}\} \times \{\mathrm{torture},\neg\mathrm{torture}\}$. Omega only hands you the button if FOOM and it predicts you'll press it, or DOOM and it predicts you won't. Thus, the only cases in which you'll get tortured are $\textstyle o((\mathrm{foom},\mathrm{correct}),\pi_P) = (\mathrm{foom},\mathrm{torture})$ and $\textstyle o((\mathrm{doom},\mathrm{incorrect}),\pi_P) = (\mathrm{doom},\mathrm{torture})$. For any other $\textstyle x\in\{\mathrm{foom},\mathrm{doom}\}$, $\textstyle y\in\{\mathrm{correct},\mathrm{incorrect}\}$, and $\textstyle \pi\in\Pi$, we have $\textstyle o((x,y),\pi) = (x,\neg\mathrm{torture})$.

Finally, let's define our utility function by $u((\mathrm{foom},\neg\mathrm{torture})) = L$, $u((\mathrm{foom},\mathrm{torture})) = L-1$, $u((\mathrm{doom},\neg\mathrm{torture})) = -L$, and $u((\mathrm{doom},\mathrm{torture})) = -L-1$, where $\textstyle L$ is a very large number.

This suffices to set up an SUDT decision problem. There are only two possible worlds $\textstyle \omega\in\Omega$ where $\textstyle u(o(\omega,\pi_P))$ differs from $\textstyle u(o(\omega,\pi_{\neg P}))$, namely $\textstyle (\mathrm{foom},\mathrm{correct})$ and $\textstyle (\mathrm{doom},\mathrm{incorrect})$, where $\textstyle \pi_P$ results in torture and $\textstyle \pi_{\neg P}$ doesn't. In each of these cases, the utility of $\textstyle \pi_P$ is lower (by one) than that of $\textstyle \pi_{\neg P}$. Hence, $\textstyle \mathbb{E}[u(o(\boldsymbol{\omega},\pi_P))] < \mathbb{E}[u(o(\boldsymbol{\omega},\pi_{\neg P}))]$, implying that SUDT says you should choose $\textstyle \pi_{\neg P}$.
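The SUDT calculation above can be checked numerically. What follows is a minimal Python sketch, not a general SUDT implementation: the concrete values `EPS = 0.01` and `L = 1000.0`, and all function names, are illustrative assumptions chosen for the example.

```python
from itertools import product

EPS = 0.01   # assumed probability that Omega mispredicts (illustrative value)
L = 1000.0   # assumed "very large" stake for humanity's fate (illustrative value)

# Possible worlds: {foom, doom} x {correct, incorrect}
WORLDS = list(product(["foom", "doom"], ["correct", "incorrect"]))
POLICIES = ["press", "no_press"]

# Worlds in which a button-presser is handed the button (and so gets tortured)
TORTURE_WORLDS = {("foom", "correct"), ("doom", "incorrect")}


def prior(world):
    """Prior probability: FOOM and DOOM equiprobable, Omega errs with probability EPS."""
    return (1 - EPS) / 2 if world[1] == "correct" else EPS / 2


def outcome(world, policy):
    """Outcome o(world, policy) as a pair (fate of humanity, tortured?)."""
    tortured = policy == "press" and world in TORTURE_WORLDS
    return (world[0], tortured)


def utility(out):
    """+L for foom, -L for doom, minus 1 if tortured."""
    fate, tortured = out
    base = L if fate == "foom" else -L
    return base - (1.0 if tortured else 0.0)


def sudt_value(policy):
    """Prior (updateless) expected utility of a policy."""
    return sum(prior(w) * utility(outcome(w, policy)) for w in WORLDS)


def sudt_choice():
    """The policy SUDT selects: the one with the highest prior expected utility."""
    return max(POLICIES, key=sudt_value)
```

With these assumed numbers, the two policies differ only by the torture term, so `sudt_choice()` returns `"no_press"` however large `L` is made.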

*

For NBDT, we need to know how to update, so we need one more ingredient: a function specifying in which worlds you exist as a conscious observer. In anticipation of future discussions, I'll write this as a function $\textstyle \mu(i;\omega,\pi)$, which gives the "measure" ("amount of magical reality fluid") of the conscious observation $\textstyle i\in\mathcal{I}$ if policy $\textstyle \pi\in\Pi$ is executed in the possible world $\textstyle \omega\in\Omega$. In our case, $\textstyle i = *$ and $\textstyle \mu(*;\omega,\pi)\in\{0,1\}$, indicating non-existence and existence, respectively. We can interpret $\textstyle \mu(i;\omega,\pi)$ as the conditional probability of making observation $\textstyle i$, given that the true world is $\textstyle \omega$, if plan $\textstyle \pi$ is executed. In our case, $\textstyle \mu(*;(\mathrm{foom},\mathrm{correct}),\pi_P) =$ $\textstyle \mu(*;(\mathrm{foom},\mathrm{incorrect}),\pi_{\neg P}) =$ $\textstyle \mu(*;(\mathrm{doom},\mathrm{correct}),\pi_{\neg P}) =$ $\textstyle \mu(*;(\mathrm{doom},\mathrm{incorrect}),\pi_P) = 1$, and $\textstyle \mu(*;\omega,\pi) = 0$ in all other cases.

Now, we can use Bayes' theorem to calculate the posterior probability of a possible world, given information $\textstyle i = *$ and policy $\textstyle \pi$: $\textstyle \mathbb{P}(\omega\mid i;\pi) = \mathbb{P}(\omega)\cdot\mu(i;\omega,\pi) / \sum_{\omega'\in\Omega} \mathbb{P}(\omega')\cdot\mu(i;\omega',\pi)$. NBDT tells us to choose the policy $\textstyle \pi$ that maximizes the posterior expected utility, $\textstyle \mathbb{E}[u(o(\boldsymbol{\omega},\pi))\mid i;\pi]$.

In our case, we have $\textstyle \mathbb{P}((\mathrm{foom},\mathrm{correct}) \mid *;\pi_P) = \mathbb{P}((\mathrm{doom},\mathrm{correct}) \mid *;\pi_{\neg P}) = 1-\varepsilon$ and $\textstyle \mathbb{P}((\mathrm{doom},\mathrm{incorrect}) \mid *;\pi_P) = \mathbb{P}((\mathrm{foom},\mathrm{incorrect}) \mid *;\pi_{\neg P}) = \varepsilon$. Thus, if we press the button, our expected utility is dominated by the near-certainty of humanity surviving, whereas if we don't, it's dominated by humanity's near-certain doom, and NBDT says we should press.
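To make the update concrete, here is a small Python sketch of the NBDT calculation for this problem. It is only an illustration under stated assumptions: `EPS = 0.01`, `L = 1000.0`, and every function name are hypothetical choices for the example, and the code handles only the single-situation case discussed here.

```python
from itertools import product

EPS = 0.01   # assumed probability that Omega mispredicts (illustrative value)
L = 1000.0   # assumed "very large" stake for humanity's fate (illustrative value)

WORLDS = list(product(["foom", "doom"], ["correct", "incorrect"]))
POLICIES = ["press", "no_press"]

# (world, policy) pairs in which Omega hands you the button, i.e. mu = 1
EXISTS = {
    (("foom", "correct"), "press"),
    (("foom", "incorrect"), "no_press"),
    (("doom", "correct"), "no_press"),
    (("doom", "incorrect"), "press"),
}


def prior(world):
    return (1 - EPS) / 2 if world[1] == "correct" else EPS / 2


def mu(world, policy):
    """Measure of the observation '*': 1 if you are handed the button, else 0."""
    return 1.0 if (world, policy) in EXISTS else 0.0


def utility(world, policy):
    """u(o(world, policy)): torture happens only if you hold the button and press it."""
    tortured = policy == "press" and mu(world, policy) == 1.0
    base = L if world[0] == "foom" else -L
    return base - (1.0 if tortured else 0.0)


def posterior(world, policy):
    """P(world | *; policy) by Bayes' theorem, using mu as the likelihood."""
    norm = sum(prior(w) * mu(w, policy) for w in WORLDS)
    return prior(world) * mu(world, policy) / norm


def nbdt_value(policy):
    """Posterior expected utility of a policy."""
    return sum(posterior(w, policy) * utility(w, policy) for w in WORLDS)


def nbdt_choice():
    return max(POLICIES, key=nbdt_value)
```

Under these assumptions, `posterior(("foom", "correct"), "press")` comes out as $1-\varepsilon$ and `nbdt_choice()` returns `"press"`, matching the conclusion above.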

*

But maybe it's not updating that's bad, but NBDT's way of implementing it? After all, we get the clearly wacky results only if our decisions can influence whether we exist, and perhaps the way that NBDT extends the usual formula to this case happens to be the wrong way to extend it.

One thing we could try is to mark a possible world $\textstyle \omega$ as impossible only if $\textstyle \mu(*;\omega,\pi) = 0$ for all policies $\textstyle \pi$ (rather than: for the particular policy $\textstyle \pi$ whose expected utility we are computing). But this seems very ad hoc to me. (For example, this could depend on which set of possible actions $\textstyle \mathcal{A}(*)$ we consider, which seems odd.)

There is a much more principled possibility, which I'll call pseudo-Bayesian decision theory, or PBDT. PBDT can be seen as re-interpreting updating as saying that you're indifferent about what happens in possible worlds in which you don't exist as a conscious observer, rather than ruling out those worlds as impossible given your evidence. (A version of this idea was recently brought up in a comment by drnickbone, though I'd thought of this idea myself during my journey towards my current position on updating, and I imagine it has also appeared elsewhere, though I don't remember any specific instances.) I have more than one objection to PBDT, but the simplest one to argue is that it doesn't solve the problem: it still believes that it has anthropic superpowers in the problem above.

Formally, PBDT says that we should choose the policy $\textstyle \pi$ that maximizes $\textstyle \mathbb{E}[u(o(\boldsymbol{\omega},\pi))\cdot\mu(*;\boldsymbol{\omega},\pi)]$ (where the expectation is with respect to the prior, not the updated, probabilities). In other words, we set the utility of any outcome in which we don't exist as a conscious observer to zero; we can see PBDT as SUDT with modified outcome and utility functions.

When our existence is independent of our decisions (that is, if $\textstyle \mu(*;\omega,\pi)$ doesn't depend on $\textstyle \pi$), it turns out that PBDT and NBDT are equivalent, i.e., PBDT implements Bayesian updating. That's because in that case, $\textstyle \mathbb{E}[u(o(\boldsymbol{\omega},\pi))\mid *;\pi] =$ $\textstyle \sum_{\omega\in\Omega} u(o(\omega,\pi))\cdot\mathbb{P}(\omega\mid *;\pi)$ $\textstyle = \sum_{\omega\in\Omega} u(o(\omega,\pi))\cdot\mathbb{P}(\omega)\cdot \mu(*;\omega,\pi) / \sum_{\omega'\in\Omega} \mathbb{P}(\omega')\cdot\mu(*;\omega',\pi)$. If $\textstyle \mu(*;\omega,\pi)$ doesn't depend on $\textstyle \pi$, then the whole denominator doesn't depend on $\textstyle \pi$, so the fraction is maximized if and only if the numerator is. But the numerator is $\textstyle \sum_{\omega\in\Omega} u(o(\omega,\pi))\cdot\mathbb{P}(\omega)\cdot \mu(*;\omega,\pi) =$ $\textstyle \mathbb{E}[u(o(\boldsymbol{\omega},\pi))\cdot\mu(*;\boldsymbol{\omega},\pi)]$, exactly the quantity that PBDT says should be maximized.
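The equivalence can be illustrated on a toy problem in which existence does not depend on the policy. The sketch below is purely hypothetical: the three worlds, two policies, priors, and utilities are made-up numbers chosen only to show that the NBDT value is the PBDT value divided by a policy-independent constant.

```python
# Hypothetical toy problem: mu depends on the world but not on the policy.
PRIOR = {"w1": 0.2, "w2": 0.3, "w3": 0.5}
MU = {"w1": 1.0, "w2": 1.0, "w3": 0.0}   # you do not exist as an observer in w3
POLICIES = ["a", "b"]
UTILITY = {
    ("w1", "a"): 3.0, ("w1", "b"): 1.0,
    ("w2", "a"): 0.0, ("w2", "b"): 4.0,
    ("w3", "a"): 100.0, ("w3", "b"): -100.0,  # ignored by both theories
}


def normalizer():
    """Denominator of Bayes' theorem; the same for every policy here."""
    return sum(PRIOR[w] * MU[w] for w in PRIOR)


def pbdt_value(policy):
    """Prior expectation of utility times measure."""
    return sum(PRIOR[w] * UTILITY[(w, policy)] * MU[w] for w in PRIOR)


def nbdt_value(policy):
    """Posterior expected utility after updating on existing."""
    return sum(
        (PRIOR[w] * MU[w] / normalizer()) * UTILITY[(w, policy)] for w in PRIOR
    )


def pbdt_choice():
    return max(POLICIES, key=pbdt_value)


def nbdt_choice():
    return max(POLICIES, key=nbdt_value)
```

Because `normalizer()` is a positive constant across policies, the two value functions rank policies identically, so `pbdt_choice()` and `nbdt_choice()` agree in this toy case.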

Unfortunately, although in our problem above $\mu(*;\omega,\pi)$ does depend on $\pi$, the denominator as a whole still doesn't: For both $\pi_P$ and $\pi_{\neg P}$, there is exactly one possible world with probability $(1-\varepsilon)/2$ and one possible world with probability $\varepsilon/2$ in which $*$ is a conscious observer, so we have $\textstyle\sum_{\omega'\in\Omega} \mathbb{P}(\omega')\cdot\mu(*;\omega',\pi) = 1/2$ for both $\pi\in\Pi$. Thus, PBDT gives the same answer as NBDT, by the same mathematical argument as in the case where we can't influence our own existence. If you think of PBDT as SUDT with the utility function $u(o(\omega,\pi))\cdot\mu(*;\omega,\pi)$, then intuitively, PBDT can be thought of as reasoning, "Sure, I can't influence whether humanity is wiped out; but I can influence whether I'm an l-zombie or a conscious observer; and who cares what happens to humanity if I'm not? Best to press the button, since getting tortured in a world where there's been a positive intelligence explosion is much better than life without torture if humanity has been wiped out."
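Here is a short Python sketch of the PBDT calculation for the button problem, showing both the constant denominator and the resulting choice. As before, `EPS = 0.01`, `L = 1000.0`, and all names are illustrative assumptions, not anything fixed by the setup.

```python
from itertools import product

EPS = 0.01   # assumed probability that Omega mispredicts (illustrative value)
L = 1000.0   # assumed "very large" stake for humanity's fate (illustrative value)

WORLDS = list(product(["foom", "doom"], ["correct", "incorrect"]))
POLICIES = ["press", "no_press"]

# (world, policy) pairs in which you are handed the button, i.e. mu = 1
EXISTS = {
    (("foom", "correct"), "press"),
    (("foom", "incorrect"), "no_press"),
    (("doom", "correct"), "no_press"),
    (("doom", "incorrect"), "press"),
}


def prior(world):
    return (1 - EPS) / 2 if world[1] == "correct" else EPS / 2


def mu(world, policy):
    return 1.0 if (world, policy) in EXISTS else 0.0


def utility(world, policy):
    """u(o(world, policy)): torture happens only if you hold the button and press it."""
    tortured = policy == "press" and mu(world, policy) == 1.0
    base = L if world[0] == "foom" else -L
    return base - (1.0 if tortured else 0.0)


def normalizer(policy):
    """Prior probability of existing as the observer '*' under a policy."""
    return sum(prior(w) * mu(w, policy) for w in WORLDS)


def pbdt_value(policy):
    """Prior expectation of u * mu: worlds where you are an l-zombie count as zero."""
    return sum(prior(w) * utility(w, policy) * mu(w, policy) for w in WORLDS)


def pbdt_choice():
    return max(POLICIES, key=pbdt_value)
```

With these assumed values, `normalizer` is $1/2$ for both policies, and `pbdt_choice()` returns `"press"`, the same answer NBDT gives.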

I think that's a pretty compelling argument against PBDT, but even leaving it aside, I don't like PBDT at all. I see two possible justifications for PBDT: You can either say that $u(o(\omega,\pi))\cdot\mu(*;\omega,\pi)$ is your real utility function (you really don't care about what happens in worlds where the version of you making the decision doesn't exist as a conscious observer), or you can say that your real preferences are expressed by $u(o(\omega,\pi))$, and multiplying by $\mu(*;\omega,\pi)$ is just a mathematical trick to express a steelmanned version of Bayesian updating. If your preferences really are given by $u(o(\omega,\pi))\cdot\mu(*;\omega,\pi)$, then fine, and you should be maximizing $\textstyle \mathbb{E}[u(o(\boldsymbol{\omega},\pi))\cdot\mu(*;\boldsymbol{\omega},\pi)]$ (because you should be using (S)UDT), and you should press the button. Some kind of super-selfish agent, who doesn't care a fig even about a version of itself that is exactly the same up till five seconds ago (but then wasn't handed the button) could indeed have such preferences. But I think these are wacky preferences, and you don't actually have them. (Furthermore, if you did have them, then $u(o(\omega,\pi))\cdot\mu(*;\omega,\pi)$ would be your actual utility function, and you should be writing it as just $u(o(\omega,\pi))$, where $o(\omega,\pi)$ must now give information about whether $*$ is a conscious observer.)

If multiplying by $\mu(*;\omega,\pi)$ is just a trick to implement updating, on the other hand, then I find it strange that it introduces a new concept that doesn't occur at all in classical Bayesian updating, namely the utility of a world in which $*$ is an l-zombie. We've set this to zero, which is no loss of generality because classical utility functions don't change their meaning if you add or subtract a constant, so whenever you have a utility function where all worlds in which $*$ is an l-zombie have the same utility $u_0$, then you can just subtract $u_0$ from all utilities (without changing the meaning of the utility function), and get a function where that utility is zero. But that means that the utility functions I've been plugging into PBDT above do change their meaning if you add a constant to them. You can set up a problem where the agent has to decide whether to bring itself into existence or not (Omega creates it iff it predicts that the agent will press a particular button), and in that case the agent will decide to do so iff the world has utility greater than zero—clearly not invariant under adding and subtracting a constant. I can't find any concept like the utility of not existing in my intuitions about Bayesian updating (though I can find such a concept in my intuitions about utility, but regarding that see the previous paragraph), so if PBDT is just a mathematical trick to implement these intuitions, where does that utility come from?
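The lack of shift-invariance can be seen in a deliberately tiny Python sketch of the self-creation problem. Everything here is a hypothetical simplification: there is a single possible world, Omega is assumed to predict perfectly, and the function name and parameters are invented for the illustration.

```python
def pbdt_creates_itself(world_utility, shift=0.0):
    """Does a PBDT agent press the button that brings it into existence?

    Assumed setup: one possible world with prior probability 1. Pressing means
    the agent exists (mu = 1); not pressing means it is an l-zombie (mu = 0).
    `shift` is a constant added to every utility, which would leave a
    classical utility function's meaning unchanged.
    """
    u = world_utility + shift
    value_press = u * 1.0      # u * mu with mu = 1
    value_no_press = u * 0.0   # u * mu with mu = 0, so always zero
    return value_press > value_no_press
```

For example, `pbdt_creates_itself(-1.0)` is `False`, while `pbdt_creates_itself(-1.0, shift=5.0)` is `True`: adding a constant to all utilities flips the decision, because the zero assigned to non-existence does not shift along with everything else.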

I'm not aware of a way of implementing updating in general SUDT-style problems that does better than NBDT, PBDT, and the ad-hoc idea mentioned above, so for now I've concluded that in general, trying to update is just hopeless, and we should be using (S)UDT instead. In classical decision problems, where there are no acausal influences, (S)UDT will of course behave exactly as if it did do a Bayesian update; thus, in a sense, using (S)UDT can also be seen as a reinterpretation of Bayesian updating (in this case just as updateless utility maximization in a world where all influence is causal), and that's the way I think about it nowadays.