Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

As some of you may know, I disagree with many of the criticisms leveled against evidential decision theory (EDT). Most notably, I believe that Smoking lesion-type problems don't refute EDT. I also don't think that EDT's non-updatelessness leaves a lot of room for disagreement, given that EDT recommends immediate self-modification to updatelessness. However, I do believe there are some issues with run-of-the-mill EDT. One of them is naturalized induction. It is in fact not only a problem for EDT but also for causal decision theory (CDT) and most other decision theories that have been proposed in- and outside of academia. It does not affect logical decision theories, however.

The role of naturalized induction in decision theory

Recall that EDT prescribes taking the action that maximizes expected utility, i.e.

$$\arg\max_{a \in A} \mathbb{E}[U \mid a, o] = \arg\max_{a \in A} \sum_{w \in W} P(w \mid a, o)\, U(w),$$

where $A$ is the set of available actions, $U$ is the agent's utility function, $W$ is a set of possible world models, and $o$ represents the agent's past observations (which may include information the agent has collected about itself). For the purposes of this article, CDT works in a similar way, except that instead of conditioning on $a$ in the usual way, it calculates some causal counterfactual, such as Pearl's do-calculus: $P(w \mid do(a), o)$. The problem of naturalized induction is that of assigning posterior probabilities to world models, $P(w \mid a, o)$ (or $P(w \mid do(a), o)$ or whatever), when the agent is naturalized, i.e., embedded into its environment.

Consider the following example. Let's say there are 5 world models $W = \{w_1, \dots, w_5\}$, each of which has equal prior probability. These world models may be cellular automata. Now, the agent makes the observation $o$. It turns out that worlds $w_1$ and $w_2$ don't contain any agents at all, and $w_3$ contains no agent making the observation $o$. The other two world models, on the other hand, are consistent with $o$. Thus, $P(w_i \mid o) = 0$ for $i = 1, 2, 3$ and $P(w_i \mid o) = \frac{1}{2}$ for $i = 4, 5$. Let's assume that the agent has only two actions $a_1$ and $a_2$, and that in world model $w_4$ the only agent making observation $o$ takes action $a_1$ and in $w_5$ the only agent making observation $o$ takes action $a_2$. Then $P(w_4 \mid a_1, o) = 1 = P(w_5 \mid a_2, o)$ and $P(w_5 \mid a_1, o) = 0 = P(w_4 \mid a_2, o)$. Thus, if, for example, $U(w_5) > U(w_4)$, an EDT agent would take action $a_2$ to ensure that world model $w_5$ is actual.
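To make the bookkeeping in this example concrete, here is a minimal Python sketch. The labels `w1` to `w5`, `a1`, `a2` and the specific utility numbers are my own illustrative assumptions; all that matters is that the world in which the agent takes the second action is worth more than the one in which it takes the first. The sketch also takes for granted that we already know which worlds contain an agent observing o, which is exactly the part that turns out to be hard.

```python
from fractions import Fraction

# Hypothetical encoding of the five-world example. Each entry records which
# action the unique agent observing o takes in that world; None means the
# world contains no agent making observation o.
WORLDS = {
    "w1": None,  # no agents at all
    "w2": None,  # no agents at all
    "w3": None,  # no agent making observation o
    "w4": "a1",
    "w5": "a2",
}
PRIOR = {w: Fraction(1, 5) for w in WORLDS}
# Illustrative utilities (an assumption); only U(w5) > U(w4) matters here.
UTILITY = {"w1": 0, "w2": 0, "w3": 0, "w4": 1, "w5": 2}


def posterior(action=None):
    """Return P(w | o) if action is None, else P(w | a, o)."""
    weights = {}
    for w, taken in WORLDS.items():
        consistent = taken is not None and (action is None or taken == action)
        weights[w] = PRIOR[w] if consistent else Fraction(0)
    total = sum(weights.values())
    return {w: p / total for w, p in weights.items()}


def edt_choice(actions=("a1", "a2")):
    """Pick the action maximizing sum_w P(w | a, o) * U(w)."""
    def expected_utility(a):
        post = posterior(a)
        return sum(post[w] * UTILITY[w] for w in WORLDS)
    return max(actions, key=expected_utility)
```

Calling `posterior()` gives probability 1/2 to each of the last two worlds and zero to the rest, and `edt_choice()` returns `"a2"` under the assumed utilities.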

The main problem of naturalized induction

This example makes it sound as though it's clear what posterior probabilities we should assign. But in general, it's not that easy. For one, there is the issue of anthropics: if one world model $w_1$ contains more agents observing $o$ than another world model $w_2$, does that mean $P(w_1 \mid o) > P(w_2 \mid o)$? Whether CDT and EDT can reason correctly about anthropics is an interesting question in itself (cf. Bostrom 2002; Armstrong 2011; Conitzer 2015), but in this post I'll discuss a different problem in naturalized induction: identifying instantiations of the agent in a world model.

It seems that the core of the reasoning in the above example was that some worlds contain an agent observing $o$ and others don't. So, besides anthropics, the central problem of naturalized induction appears to be identifying agents making particular observations in a physicalist world model. While this can often be done uncontroversially (a world containing only rocks contains no agents), it seems difficult to specify how it works in general. The core of the problem is a type mismatch between the "mental stuff" (e.g., numbers or strings) and the "physics stuff" (atoms, etc.) of the world model. Rob Bensinger calls this the problem of "building phenomenological bridges" (BPB) (also see his Bridge Collapse: Reductionism as Engineering Problem).

Sensitivity to phenomenological bridges

Sometimes, the decisions made by CDT and EDT are very sensitive to whether a phenomenological bridge is built or not. Consider the following problem:

One Button Per Agent. There are two similar agents with the same utility function. Each lives in her own room. Both rooms contain a button. If agent 1 pushes her button, it creates 1 utilon. If agent 2 pushes her button, it creates -50 utilons. You know that agent 1 is an instantiation of you. Should you press your button?

Note that this is essentially Newcomb's problem with potential anthropic uncertainty (see the second paragraph here) – pressing the button is like two-boxing, which causally gives you $1k if you are the real agent but costs you $1M if you are the simulation.  

If agent 2 is sufficiently similar to you to count as an instantiation of you, then you shouldn't press the button. If, on the other hand, you believe that agent 2 does not qualify as something that might be you, then it comes down to what decision theory you use: CDT would press the button, whereas EDT wouldn't (assuming that the two agents are strongly correlated).
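The case analysis for One Button Per Agent can be sketched as a toy calculation. This is only an illustration under assumptions of my own: the `correlation` parameter is a hypothetical stand-in for how strongly agent 2's choice tracks yours, and an instantiation is treated as simply making the same choice you make.

```python
PAYOFF = {"agent1": 1, "agent2": -50}  # utilons created if that agent presses


def gain_from_pressing(theory, agent2_is_me, correlation=0.99):
    """Expected utilons gained by pressing rather than not pressing.

    `correlation` (an assumed number) is how much more likely agent 2 is to
    press given that I press; only EDT without a bridge uses it.
    """
    gain = PAYOFF["agent1"]  # agent 1 is known to be an instantiation of me
    if agent2_is_me:
        gain += PAYOFF["agent2"]  # my choice just is agent 2's choice
    elif theory == "EDT":
        gain += correlation * PAYOFF["agent2"]  # my choice is evidence about hers
    # CDT without a bridge holds agent 2's action fixed, so it adds nothing.
    return gain


def presses(theory, agent2_is_me, correlation=0.99):
    """True if the given theory recommends pressing the button."""
    return gain_from_pressing(theory, agent2_is_me, correlation) > 0
```

In this toy model only `presses("CDT", agent2_is_me=False)` comes out true; the other three combinations recommend leaving the button alone.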

It is easy to specify a problem where EDT, too, is sensitive to the phenomenological bridges it builds:

One Button Per World. There are two possible worlds. Each contains an agent living in a room with a button. The two agents are similar and have the same utility function. The button in world 1 creates 1 utilon, the button in world 2 creates -50 utilons. You know that the agent in world 1 is an instantiation of you. Should you press the button?

If you believe that the agent in world 2 is an instantiation of you, both EDT and CDT recommend you not to press the button. However, if you believe that the agent in world 2 is not an instantiation of you, then naturalized induction concludes that world 2 isn't actual and so pressing the button is safe.

Building phenomenological bridges is hard and perhaps confused

So, to solve the problem of naturalized induction and apply EDT/CDT-like decision theories, we need to solve BPB. The behavior of an agent is quite sensitive to how we solve it, so we had better get it right.

Unfortunately, I am skeptical that BPB can be solved. Most importantly, I suspect that statements about whether a particular physical process implements a particular algorithm can't be objectively true or false. There seems to be no way of testing any such relations.

We should probably think more about whether BPB really is doomed. There is even some philosophical literature that seems worth looking into (again, see this Brian Tomasik post; cf. some of Hofstadter's writings and the literatures surrounding "Mary the color scientist", the computational theory of mind, computation in cellular automata, etc.). But at this point, BPB looks confusing/confused enough that it is worth looking into alternatives.

Assigning probabilities pragmatically?

One might think that one could map between physical processes and algorithms on a pragmatic or functional basis. That is, one could say that a physical process A implements a program p to the extent that the results of A correlate with the output of p. I think this idea goes in the right direction, and we will later see an implementation of this pragmatic approach that does away with naturalized induction. However, it feels inappropriate as a solution to BPB. The main problem is that two processes can correlate in their output without having similar subjective experiences. For instance, it is easy to show that Merge sort and Insertion sort have the same output for any given input, even though they have very different "subjective experiences". (Another problem is that the dependence between two random variables cannot be expressed as a single number, so it is unclear how to translate the entire joint probability distribution of the two into a single number determining the likelihood of the algorithm being implemented by the physical process. That said, if implementing an algorithm is conceived of as binary, i.e., either true or false, one could just require perfect correlation.)
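The sorting point is easy to check directly. Below is a small sketch in which each algorithm returns, alongside its result, a trace of the pairs it compared. Using the comparison trace as a crude proxy for what the computation "went through" is my own illustrative choice, not a claim about what subjective experience is.

```python
def insertion_sort(items):
    """Return (sorted list, trace of compared pairs)."""
    out, trace = list(items), []
    for i in range(1, len(out)):
        j = i
        while j > 0:
            trace.append((out[j - 1], out[j]))
            if out[j - 1] <= out[j]:
                break
            out[j - 1], out[j] = out[j], out[j - 1]
            j -= 1
    return out, trace


def merge_sort(items):
    """Return (sorted list, trace of compared pairs)."""
    if len(items) <= 1:
        return list(items), []
    mid = len(items) // 2
    left, left_trace = merge_sort(items[:mid])
    right, right_trace = merge_sort(items[mid:])
    merged, trace = [], left_trace + right_trace
    i = j = 0
    while i < len(left) and j < len(right):
        trace.append((left[i], right[j]))
        if left[i] <= right[j]:
            merged.append(left[i])
            i += 1
        else:
            merged.append(right[j])
            j += 1
    merged.extend(left[i:])
    merged.extend(right[j:])
    return merged, trace


data = [3, 1, 4, 1, 5, 9, 2, 6]
sorted_a, trace_a = insertion_sort(data)
sorted_b, trace_b = merge_sort(data)
same_output = sorted_a == sorted_b  # perfectly correlated outputs
same_trace = trace_a == trace_b     # but different internal histories
```

On this input the two outputs agree while the traces do not, so a purely output-based criterion would count each process as an implementation of the other.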

Getting rid of the problem of building phenomenological bridges

If we adopt an EDT perspective, it seems clear what we have to do to avoid BPB. If we don't want to decide whether some world contains the agent, then it appears that we have to artificially ensure that the agent views itself as existing in all possible worlds. So, we may take every world model and add a causally separate or non-physical entity representing the agent. I'll call this additional agent a logical zombie (l-zombie) (a concept introduced by Benja Fallenstein for a somewhat different decision-theoretical reason). To avoid all BPB, we will assume that the agent pretends that it is the l-zombie with certainty. I'll call this the l-zombie variant of EDT (LZEDT). It is probably the most natural evidentialist logical decision theory.

Note that in the context of LZEDT, l-zombies are a fiction used for pragmatic reasons. LZEDT doesn't make the metaphysical claim that l-zombies exist or that you are secretly an l-zombie. For discussions of related metaphysical claims, see, e.g., Brian Tomasik's essay Why Does Physics Exist? and references therein.

LZEDT reasons about the real world via the correlations between the l-zombie and the real world. In many cases, LZEDT will act as we expect an EDT agent to act. For example, in One Button Per Agent, it doesn't press the button because that ensures that neither agent pushes the button.

LZEDT doesn't need any additional anthropics but behaves like anthropic decision theory/EDT+SSA, which seems alright.

Although LZEDT may assign a high probability to worlds that don't contain any actual agents, it doesn't optimize for these worlds because it cannot significantly influence them. So, in a way LZEDT adopts the pragmatic/functional approach (mentioned above) of, other things equal, giving more weight to worlds that contain a lot of closely correlated agents.

LZEDT is automatically updateless. For example, it gives the money in counterfactual mugging. However, it invariably implements a particularly strong version of updatelessness. It's not just updatelessness in the way that "son of EDT" (i.e., the decision theory that EDT would self-modify into) is updateless, it is also updateless w.r.t. its existence. So, for example, in the One Button Per World problem, it never pushes the button, because it thinks that the second world, in which pushing the button generates -50 utilons, could be actual. This is the case even if the second world very obviously contains no implementation of LZEDT. Similarly, it is unclear what LZEDT does in the Coin Flip Creation problem, which EDT seems to get right.
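The contrast with regular EDT in One Button Per World can be written out as a short sketch. The uniform prior over the two worlds and the perfect correlation between the l-zombie and both physical agents are assumptions I am adding for illustration; the problem statement only says that the agents are similar.

```python
from fractions import Fraction

PRIOR = {"world1": Fraction(1, 2), "world2": Fraction(1, 2)}  # assumed uniform
BUTTON_PAYOFF = {"world1": 1, "world2": -50}


def edt_value_of_pressing(contains_me):
    """Regular EDT: drop worlds with no instantiation of me, then renormalize."""
    weights = {w: p if contains_me[w] else Fraction(0) for w, p in PRIOR.items()}
    total = sum(weights.values())
    return sum(weights[w] / total * BUTTON_PAYOFF[w] for w in PRIOR)


def lzedt_value_of_pressing(correlation=None):
    """LZEDT: the l-zombie exists in every world, so no world is dropped.

    `correlation` (assumed to be 1 everywhere by default) says how strongly
    the physical agent in each world tracks the l-zombie's choice.
    """
    if correlation is None:
        correlation = {w: 1 for w in PRIOR}
    return sum(PRIOR[w] * correlation[w] * BUTTON_PAYOFF[w] for w in PRIOR)
```

With a bridge to world 1 only, `edt_value_of_pressing` returns 1 and EDT presses; `lzedt_value_of_pressing()` returns -49/2 no matter which bridges one would have built, so LZEDT does not press.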

So, LZEDT optimizes for world models that naturalized induction would assign zero probability to. It should be noted that this is not done on the basis of some exotic ethical claim according to which non-actual worlds deserve moral weight.

I'm not yet sure what to make of LZEDT. It is elegant in that it effortlessly gets anthropics right, avoids BPB and is updateless without having to self-modify. On the other hand, not updating on your existence is often counterintuitive, and even regular updatelessness is, in my opinion, best justified via precommitment. Its approach to avoiding BPB isn't immune to criticism either. In a way, it is just a very wrong approach to BPB (mapping your algorithm into fictions rather than your real instantiations). Perhaps it would be more reasonable to use regular EDT with an approach to BPB that interprets anything as you that could potentially be you?

Of course, LZEDT also inherits some of the potential problems of EDT, in particular, the 5-and-10 problem.

CDT is more dependent on building phenomenological bridges

It seems much harder to get rid of the BPB problem in CDT. Obviously, the l-zombie approach doesn't work for CDT: because none of the l-zombies has a physical influence on the world, "LZCDT" would always be indifferent between all possible actions. More generally, because CDT exerts no control via correlation, it needs to believe that it might be X if it wants to control X's actions. So, causal decision theory only works with BPB.

That said, a causalist approach to avoiding BPB via l-zombies could be to tamper with the definition of causality such that the l-zombie "logically causes" the choices made by instantiations in the physical world. As far as I understand it, most people at MIRI currently prefer this flavor of logical decision theory.


Most of my views on this topic formed in discussions with Johannes Treutlein. I also benefited from discussions at AISFP.

15 comments

It seems to me that the original UDT already incorporated this type of approach to solving naturalized induction. See here and here for previous discussions. Also, UDT, as originally described, was intended as a variant of EDT (where the "action" in EDT is interpreted as "this source code implements this policy (input/output map)"). MIRI people seem to mostly prefer a causal variant of UDT, but my position has always been that the evidential variant is simpler, so let's go with that until there's conclusive evidence that the evidential variant is not good enough.

LZEDT seems to be more complex than UDT but it's not clear to me that it solves any additional problems. If it's supposed to have advantages over UDT, can you explain what those are?

I hadn’t seen these particular discussions, although I was aware of the fact that UDT and other logical decision theories avoid building phenomenological bridges in this way. I also knew that others (e.g., the MIRI people) were aware of this.

I didn't know you preferred a purely evidential variant of UDT. Thanks for the clarification!

As for the differences between LZEDT and UDT:

  • My understanding was that there is no full formal specification of UDT. The counterfactuals seem to be given by some unspecified mathematical intuition module. LZEDT, on the other hand, seems easy to specify formally (assuming a solution to naturalized induction). (That said, if UDT is just the updateless-evidentialist flavor of logical decision theory, it should be easy to specify as well. I haven’t seen people characterize UDT in this way, but perhaps this is because MIRI’s conception of UDT differs from yours?)
  • LZEDT isn’t logically updateless.
  • LZEDT doesn’t do explicit optimization of policies. (Explicit policy optimization is the difference between UDT1.1 and UDT1.0, right?)

(Based on a comment you made on an earlier post of mine, it seems that UDT and LZEDT reason similarly about medical Newcomb problems.)

Anyway, my reason for writing this isn’t so much that LZEDT differs from other decision theories. (As I say in the post, I actually think LZEDT is equivalent to the most natural evidentialist logical decision theory — which has been considered by MIRI at least.) Instead, it’s that I have a different motivation for proposing it. My understanding is that the LWers’ search for new decision theories was not driven by the BPB issue (although some of the motivations you listed in 2012 are related to it). Instead it seems that people abandoned EDT — the most obvious approach — mainly for reasons that I don’t endorse. E.g., the TDT paper seems to give medical Newcomb problems as the main argument against EDT. It may well be that looking beyond EDT to avoid naturalized induction/BPB leads to the same decision theories as these other motivations.

Last time I looked at a post of yours about this, you got something very basic wrong. That is:

Note that in non-Newcomb-like situations, P(s|do(a)) and P(s|a) yield the same result, see ch. 3.2.2 of Pearl’s Causality.

is wrong. You never replied. Why do you post if you don't engage with criticism? Are you "write-only"?

I apologize for not replying to your earlier comment. I do engage with comments a lot. E.g., I recall that your comment on that post contained a link to a ~1h talk that I watched after reading it. There are many obvious reasons that sometimes cause me not to reply to comments, e.g. if I don't feel like I have anything interesting to say, or if the comment indicates lack of interest in discussion (e.g., your "I am not actually here, but ... Ok, disappearing again"). Anyway, I will reply to your comment now. Sorry again for not doing so earlier.

In the section: "The role of naturalized induction in decision theory" a lot of variables seem to be missing.

Have you seen the "XOR Blackmail" in the Death in Damascus paper? That's a much better problem with EDT than the smoking lesion problem, in my view. And it's simple to describe:

An agent has been alerted to a rumor that her house has a terrible termite infestation, which would cost her $1,000,000 in damages. She doesn’t know whether this rumor is true. A greedy and accurate predictor with a strong reputation for honesty has learned whether or not it’s true, and drafts a letter:

I know whether or not you have termites, and I have sent you this letter iff exactly one of the following is true: (i) the rumor is false, and you are going to pay me $1,000 upon receiving this letter; or (ii) the rumor is true, and you will not pay me upon receiving this letter.

The predictor then predicts what the agent would do upon receiving the letter, and sends the agent the letter iff exactly one of (i) or (ii) is true. Thus, the claim made by the letter is true. Assume the agent receives the letter. Should she pay up?

EDT doesn't pay if it is given the choice to commit to not paying ex-ante (before receiving the letter). So the thought experiment might be an argument against ordinary EDT, but not against updateless EDT. If one takes the possibility of anthropic uncertainty into account, then even ordinary EDT might not pay the blackmailer. See also Abram Demski's post about the Smoking Lesion. Ahmed and Price defend EDT along similar lines in a response to a related thought experiment by Frank Arntzenius.
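The difference between the two perspectives on XOR Blackmail can be checked with a short sketch. The prior probability of termites (`p_termites`) is a number I am assuming for illustration, and the predictor is treated as perfectly accurate.

```python
TERMITE_DAMAGE = 1_000_000
PAYMENT = 1_000


def letter_sent(termites, pays_on_letter):
    """The predictor sends the letter iff exactly one of (i), (ii) holds."""
    case_i = (not termites) and pays_on_letter
    case_ii = termites and (not pays_on_letter)
    return case_i != case_ii


def cost_given_letter(pays_on_letter):
    """Ordinary EDT: condition on the letter having arrived and on the action."""
    # Keep only the termite states consistent with the letter being sent.
    states = [t for t in (True, False) if letter_sent(t, pays_on_letter)]
    assert len(states) == 1  # letter plus action pin down the termite state
    termites = states[0]
    return TERMITE_DAMAGE * termites + PAYMENT * pays_on_letter


def cost_ex_ante(pays_on_letter, p_termites=0.1):
    """Updateless view: evaluate the whole policy before any letter arrives.

    p_termites is an assumed prior, chosen only for illustration.
    """
    total = 0.0
    for termites, p in ((True, p_termites), (False, 1 - p_termites)):
        pays = pays_on_letter and letter_sent(termites, pays_on_letter)
        total += p * (TERMITE_DAMAGE * termites + PAYMENT * pays)
    return total
```

Conditional on the letter, paying looks far cheaper (`cost_given_letter(True)` against `cost_given_letter(False)`), whereas `cost_ex_ante(False)` is lower than `cost_ex_ante(True)`, since the termite damage occurs regardless of the policy and paying only adds the payment in the termite-free case.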

Yes, this demonstrates that EDT is also unstable under self modification, just as CDT is. And trying to build an updateless EDT is exactly what UDT is doing.

If statements about whether an algorithm exists are not objectively true or false, there is also no objectively correct decision theory, since the existence of agents is not objective in the first place. Of course you might even agree with this but consider it not to be an objection, since you can just say that decision theory is something we want to do, not something objective.

Yes, I share the impression that the BPB problem implies some amount of decision theory relativism. That said, one could argue that decision theories cannot be objectively correct, anyway. In most areas, statements can only be justified relative to some foundation. Probability assignments are correct relative to a prior, the truth of theorems depends on axioms, and whether you should take some action depends on your goals (or meta-goals). Priors, axioms, and goals themselves, on the other hand, cannot be justified (unless you have some meta-priors, meta-axioms, etc., but I think the chain has to end at some point). Perhaps decision theories are similar to priors, axioms and terminal values?

I agree that any chain of justification will have to come to an end at some point, certainly in practice and presumably in principle. But it does not follow that the thing at the beginning which has no additional justification is not objectively correct or incorrect. The typical realist response in all of these cases, with which I agree, is that your starting point is correct or incorrect by its relationship with reality, not by a relationship to some justification. Of course if it is really your starting point, you will not be able to prove that it is correct or incorrect. That does not mean it is not one or the other, unless you are assuming from the beginning that none of your starting points have any relationship at all with reality. But in that case, it would be equally reasonable to conclude that your starting points are objectively incorrect.

Let me give some examples:

An axiom: a statement cannot be both true and false in the same way. It does not seem possible to prove this, since if it is open to question, anything you say while trying to prove it, even if you think it true, might also be false. But if this is the way reality actually works, then it is objectively correct even though you cannot prove that it is. Saying that it cannot be objectively correct because you cannot prove it, in this case, seems similar to saying that there is no such thing as reality -- in other words, again, saying that your axioms have no relationship at all to reality.

A prior: if there are three possibilities and nothing gives me reason to suspect one more than another, then each has a probability of 1/3. Mathematically it is possible to prove this, but in another sense there is nothing to prove: it really just says that if there are three equal possibilities, they have to be considered as equal possibilities and not as unequal ones. In that sense it is exactly like the above axiom: if reality is the way the axiom says, it is also the way this prior says, even though no one can prove it.

A terminal goal: continuing to exist. A goal is what something tends towards. Everything tends to exist and does not tend to not exist -- and this is necessarily so, exactly because of the above axiom. If a thing exists, it exists and does not not exist -- and it is just another way of describing this to say, "Existing things tend to exist." Again, as with the case of the prior, there is something like an argument here, but not really. Once again, though, even if you cannot establish the goal by reference to some earlier goal, the goal is an objective goal by relationship with reality: this is how tendencies actually work in reality.

However, if you believe that the agent in world 2 is not an instantiation of you, then naturalized induction concludes that world 2 isn't actual and so pressing the button is safe.

By "isn't actual" do you just mean that the agent isn't in world 2? World 2 might still exist, though?

No, I actually mean that world 2 doesn't exist. In this experiment, the agent believes that either world 1 or world 2 is actual and that they cannot be actual at the same time. So, if the agent thinks that it is in world 1, world 2 doesn't exist.

I just remembered that in Naive TDT, Bayes nets, and counterfactual mugging, Stuart Armstrong made the point that it shouldn't matter whether you are simulated (in a way that you might be the simulation) or just predicted (in such a way that you don't believe that you could be the simulation).