FDT defects in a realistic Twin Prisoners' Dilemma

Charlie Steiner


I think this takes some supposed rules of logical nodes a little too seriously. The reason one augments a causal model of the world with logical nodes in the first place is to get a good practical model for what happens when you make decisions (or otherwise run programs whose details you want to elide from the world-model). These models are *practical* constructions, that may be different for different agents who care about different things, not fixed ideals derived a priori from the history of one's software.

Thanks to Caspar Oesterheld for discussion and helpful comments, as well as Tristan Cook, James Faville, Daniel Kokotajlo and Lukas Finnveden.

## Summary

Updateless decision theory (UDT)/functional decision theory (FDT) can be formulated with logical conditionals as opposed to logi-causalist counterfactuals. I argue in favour of the former, on the grounds that this variant of UDT/FDT ensures robust mutual cooperation in the Twin Prisoner’s Dilemma between two realistic UDT/FDT agents, whereas the causalist variant does not. This falls out of thinking about how agents approximate decision theories and how they intervene on the outputs of different, yet similar, decision algorithms.

## Introduction

Updateless decision theory does not necessarily have to be formulated with logical counterfactuals: you could also use logical conditionals. This is true of functional decision theory and timeless decision theory as well, mutatis mutandis. Specifically, we can define ‘updateless (logically) causal decision theory’ (UCDT), which is the standard formulation, and ‘updateless evidential decision theory’ (UEDT) in the UDT1.1 framework^{[1]} respectively as:

$$\mathrm{UCDT}_{1.1}(s) = \operatorname*{argmax}_{\pi \in \Pi} \sum_{o \in O} U(o)\, P\big(\ulcorner \mathrm{UCDT}_{1.1}(s) = \pi \urcorner \mathrel{\Box\!\!\rightarrow} o\big),$$

$$\mathrm{UEDT}_{1.1}(s) = \operatorname*{argmax}_{\pi \in \Pi} \sum_{o \in O} U(o)\, P\big(o \mid \ulcorner \mathrm{UEDT}_{1.1}(s) = \pi \urcorner\big).$$

(It seems Wei Dai initially thought of UDT as something evidential; see this comment, for example.^{[2]})

As I understand it, Troll Bridge (a logical version of the Smoking Lesion^{[3]}) is a decision problem where UCDT and UEDT come apart, and UCDT is taken to give the correct answer. I personally think it is unclear what the takeaway from Troll Bridge should be, and I think there are problems in which UEDT is clearly preferable^{[4]} to UCDT. And the latter is what this post is about.

## Mutual UDT cooperation

The causal variants of logical decision theories like UDT, FDT and TDT are understood to ensure mutual cooperation in the Twin Prisoner's Dilemma. The reasoning goes as follows: “in virtue of us being twins, we are implementing the same decision algorithm, and now the question is about what this algorithm will output. Suppose it outputs 'defect'; then both of us will defect. Suppose it outputs 'cooperate'; then both of us will cooperate. And since u(C,C)>u(D,D), I will cooperate.”

In terms of causal graphs, this is how I generally understand symmetric games between two agents that are both implementing UDT (the red area representing ‘logic’):^{[5]}

So when you intervene on the output of your decision theory you will have a logically causal effect on the action the other player takes, since both of your actions are downstream of what UDT outputs in the given situation (or alternatively, the output is both of your actions). As said, you will then take the same action, meaning you should e.g. cooperate in the Prisoner’s Dilemma. All good so far (modulo concerns about "logical causality", of course).

(In the context of causal graphs, I think of the decision theory, UDT, as a function that provides a [perhaps infinitely] large table of state-action pairs, i.e. a [perhaps infinite] number of logical statements of the form “UDT(state) = action”, and the conjunction of these is the logical statement “UDT”, which is the node in the graph.)

## Approximating UDT: when logical counterfactuals are brittle

But real-world agents do not implement the “ideal” version of UDT^{[6]}: rather, insofar as they think UDT actually is the “ideal”, they imperfectly approximate UDT in some way, due to boundedness and/or other practical constraints. In practice, this could, for example, look like doing Monte Carlo approximations for the expected utility, using approximations for π and e, using heuristics for intractable planning problems, et cetera; i.e. trying to implement UDT but doing something (ex ante) suboptimal with respect to the formalism. Also, these agents are probably going to approximate UDT differently, such that you do not get the neat situation above, since these different approximations would then refer to wholly distinct logical statements (the statements, as before, being the conjunction of all of the action recommendations of the respective UDT approximations). That is, these approximations should be represented as two different nodes in our graph.

The way this plausibly works is that for a given decision problem, both players respectively do the thing that ideal UDT would have recommended with some probabilities corresponding to e.g. how much compute they have. Moreover, these approximations are conditionally independent given ‘ideal UDT’.
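To make the "noisy copies of an ideal" picture concrete, here is a minimal sketch. It rests on assumptions that are not in the post: actions are binary (C or D), each approximator outputs the ideal recommendation with accuracy `p1` or `p2` and the opposite action otherwise, and `q` is a hypothetical prior probability that ideal UDT recommends C. The function names are illustrative only.

```python
from fractions import Fraction
from itertools import product


def joint(q, p1, p2):
    """Joint distribution over (a1, a2), marginalising over the ideal output.

    Assumption: given the ideal recommendation, the two approximations
    err independently (conditional independence given 'ideal UDT').
    """
    dist = {}
    for ideal, a1, a2 in product("CD", repeat=3):
        prior = q if ideal == "C" else 1 - q
        like1 = p1 if a1 == ideal else 1 - p1
        like2 = p2 if a2 == ideal else 1 - p2
        dist[(a1, a2)] = dist.get((a1, a2), 0) + prior * like1 * like2
    return dist


def p_other_given_mine(q, p1, p2, mine, other="C"):
    """P(a2 = other | a1 = mine): the evidential dependence between outputs."""
    dist = joint(q, p1, p2)
    marginal = sum(v for (a1, _), v in dist.items() if a1 == mine)
    return dist[(mine, other)] / marginal


half, acc = Fraction(1, 2), Fraction(9, 10)
print(p_other_given_mine(half, acc, acc, "C"))  # P(other cooperates | I cooperate)
print(p_other_given_mine(half, acc, acc, "D"))  # P(other cooperates | I defect)
```

With `q = 1/2` and both accuracies at `9/10`, the two conditionals come out as 41/50 and 9/50, so the outputs are correlated even though neither causes the other. If the ideal output is already known (`q = 1`), both conditionals collapse to `p2`: the ideal node screens off the dependence.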

(Perhaps this is different to what is normally meant by an “approximation of a decision theory”. Here I am specifically thinking of it as something very ‘top-down’ where you look at the ideal version of the theory and then make concessions in the face of boundedness. You could also think of it as something more ‘bottom-up’ where you start out with some minimal set of very basic principles of choice that [in the limit] evolve into the ideal theory as the agent grows more sophisticated and less bounded. This might be the more plausible perspective when thinking in terms of building AI systems. More on this later.)

Furthermore, these top-down approximations are arguably downstream of UDT itself (i.e. there is a path from UDT to the approximations in the causal graph). I do not have a great argument for this, but it seems intuitive to represent the process of top-down approximations as functions of the type $f: \mathrm{DT}_{\text{ideal}} \to \mathrm{DT}_{\text{approx}}$, in which case I think it is natural to say that UDT logi-causes the approximations of UDT. (For example, we might want to think about the approximation function $f$ as adding e.g. Gaussian noise to the "ideal" distribution over actions for some decision problem.)

With these assumptions, an interesting situation arises where it seems to matter what specific formulation of UDT we are using. Consider the causal graph of the symmetric game again:

So in this case, when you intervene on the output of your decision theory, you do not have a logi-causal effect on the action of the other agent (since you are doing different approximations). However, I think we can say that there is some ‘logical correlation’ between the outputs of the two approximations; i.e. as long as there is no screening off, there is some (logi-)evidential dependence between $a'$ and $a''$. This similarity should arguably be taken into account, and it is (not surprisingly) taken into account when we use conditionals for the expected utility probabilities.

That is, given sufficiently similar approximations, UEDT tells you to cooperate in the Prisoner’s Dilemma (under the assumption that both agents are doing some approximation of UEDT, of course).

On the other hand, without any further conceptual engineering, and given how we think of decision-theoretic approximations here, UCDT trivially recommends defecting, since it only thinks about logi-causal effects, and defecting dominates.
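The contrast between the two recommendations can be sketched numerically. Everything specific here is an assumption for illustration: the payoff numbers are a conventional Prisoner's Dilemma matrix rather than anything from the post, the function names are hypothetical, and the probabilities fed in are made-up inputs. The UEDT calculation conditions the other player's action on mine; the UCDT calculation holds the other player's action distribution fixed, since in the graph above there is no logi-causal path from my node to theirs.

```python
from fractions import Fraction

# Hypothetical payoffs u(my action, other's action); only the ordering matters.
PAYOFF = {("C", "C"): 3, ("C", "D"): 0, ("D", "C"): 5, ("D", "D"): 1}


def expected_utility(action, p_other_cooperates):
    """Expected payoff of `action` when the other cooperates with this probability."""
    p = p_other_cooperates
    return p * PAYOFF[(action, "C")] + (1 - p) * PAYOFF[(action, "D")]


def uedt_choice(p_c_given_c, p_c_given_d):
    """Evidential: the other's cooperation probability depends on my action."""
    eu = {
        "C": expected_utility("C", p_c_given_c),
        "D": expected_utility("D", p_c_given_d),
    }
    return max(eu, key=eu.get)


def ucdt_choice(p_c):
    """Causal, with no path to the other node: same probability for both actions."""
    eu = {a: expected_utility(a, p_c) for a in "CD"}
    return max(eu, key=eu.get)


print(uedt_choice(Fraction(41, 50), Fraction(9, 50)))  # strongly correlated outputs
print(uedt_choice(Fraction(17, 25), Fraction(8, 25)))  # weaker correlation
print(ucdt_choice(Fraction(1, 2)))
```

Under these payoffs, the strongly correlated case yields C and the weaker one yields D, which is one way to read "sufficiently similar": in the symmetric case where P(C | D) = 1 − P(C | C), cooperation needs P(C | C) > 5/7. The causal calculation returns D for every fixed probability, because defecting dominates.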

You could of course say something like “sure, but with some probability we are actually doing the same approximation, such that my action actually logi-causally determines the action of the other player in some world that is (hopefully) not too improbable”. In principle, I think this works, but since there are so many different ways of approximating a given decision theory (probably an infinite number), and considering you only need the slightest difference between them for cooperation to break down, the probability would not suffice. This is important to keep in mind.

(Further notes on the graphs above:

- [...] decision nodes. For example, we could draw the graph in the following way where we do not get cooperation:
- [...] individual outputs of the first approximation, such that you (in some way) have a causal arrow going from the bottom left node to the bottom right, in which case it is arguably possible for two UCDT-approximators to achieve mutual cooperation. One specific example (where the approximation of the second agent is elicited from the action recommendations of the approximation of the first agent plus some noise, say):
- [...]^{[7]} Therefore, for some situations at least, it might make sense to think about a low-intelligence, low-compute ‘UEDT-approximator’ as something akin to a UCDT agent, in which case we do not necessarily get mutual cooperation.)
- [...] symmetric games thus far. But most games are asymmetric, even Prisoner’s Dilemmas (since we often derive at least slightly different utilities from the different outcomes). Does this create analogous difficulties even for ideal UCDT? Perhaps not if you can simulate the situation of the other player, but if you are just argmaxing over available actions in your own situation then you are not intervening on the output of the other player’s decision theory (because you are in different situations) and you could think that we get something that looks similar to previous situations where there is correlation but not (logi-)causation.
- [...] à la Rawls and Harsanyi).^{[8]} Specifically, the following policy achieves mutual cooperation between two ideal UCDTers: "if I am player 1, cooperate; if I am player 2, cooperate". And this is achieved because the meta-game is symmetric, and you will determine the policy of the other player.
- [...] do nevertheless [...]

## What about Newcomb's?

As we know, the Twin Prisoner’s Dilemma is a Newcomb's problem. That raises a question: does UDT/FDT with counterfactuals actually two-box under reasonable assumptions about Omega (just as I have argued that UCDT defects against another agent implementing UCDT under reasonable assumptions about the agents)?

I think this is a bit unclear and depends on what we think Omega is doing exactly (i.e. what are “reasonable assumptions”?): is she just a very good psychologist, or is she basing her prediction on a perfect simulation of you? In the former case, it seems we have the same exactness issues as before, and UCDT might two-box^{[9]}; and the latter case merely corresponds to the case where the twin in the Prisoner’s Dilemma is your exact copy, and thus you one-box.

Perhaps Omega is not directly approximating UCDT in her simulation, though, but rather approximating you. That is, approximating your approximation of UCDT. In that case, it seems like there is a good argument for saying that UCDT would one-box since Omega's approximation is downstream of your approximation.

I don't find this super interesting to discuss, and since the arguments in this post are based on thinking about realistic agents in the real world, I will set Newcomb’s aside and keep focusing on the Prisoner’s Dilemma. (Moreover, Prisoner’s Dilemma-like situations are more relevant for e.g. ECL.)

## Why this is not surprising

In the MIRI/OP decision theory discussion, Scott Garrabrant suggests that we view different decision theories as locations in the following 2x2x2 grid:

- Conditionals vs. causalist counterfactuals (or ‘EDT vs. CDT’).
- Updatefulness vs. updatelessness (or ‘from the perspective of what doxastic state am I making the decision, the prior or posterior?’).
- Physicalist vs. algorithmic/logical agent ontology (or ‘an agent is just a particular configuration of matter doing physical things’ vs. ‘an agent is just an algorithm; relating inputs and outputs’).

This results in eight different decision theories, where we can think of UCDT/FDT as updateless CDT in the algorithmic ontology, as opposed to the physicalist.

To give an analogy for the problem I have attempted to explain in this post, consider two perfect updateful CDT copies in the Prisoner’s Dilemma. It is normally said that they will not cooperate because of dominance, no causal effects, et cetera, but under one particular physicalist conception of who you are, this might not hold: setting aside issues around spatiotemporal locations^{[10]}, we could say that an actual perfect copy of you is you, such that if you cooperate, your “copy” will also cooperate. (The ontology I have in mind here is one that says that “you” are just an equivalence class, where the relation is identity with respect to all physical [macro-]properties [modulo location-related issues], i.e. something like the ‘identity of indiscernibles’ principle, restricted to ‘agents’.)

Even if we accept this ontology, I would not say that this is a point in favour of (this version of) CDT since this decision problem is utterly unrealistic: even the slightest asymmetry (e.g. a slight difference in the colour of the rooms) would break the identity and thus break the cooperation.

The problem with mutual UCDT/FDT cooperation I have attempted to describe here is arguably completely analogous to the “problem” of how CDT agents do not achieve mutual cooperation in any realistic Prisoner’s Dilemma under this particular physicalist ontology. (The idea is that the algorithmic agent ontology is analogous to the equivalence class-type physicalist agent ontology in that from both perspectives “identity implies control”.)

(Some things are of course different. For example, to my understanding, the usual algorithmic conception of an ‘agent’ is arguably more “minimalistic” than the physicalist: the former does not include specific information about the configuration of matter et cetera; rather, all else equal, the algorithm in and of itself is me, independent of the substrate and its underlying structure: as long as it is implemented, it is me. This makes mutual [logi-]causalist cooperation somewhat more likely.)

## Objections

## Doing the ideal, partially

As said, the UDT node in the previous graphs is just the conjunction of state-to-action/policy mappings (statements of the form “UDT(s)=a”), for all possible states, $S = \{s_1, \dots, s_k\}$. But suppose we partition the state space into $S'$ and $S''$ (i.e. $S' \cap S'' = \emptyset$ and $S' \cup S'' = S$) such that we get “UDT for $S'$” and “UDT for $S''$” (both arguably downstream of “UDT for $S$”). (We could of course make it even more fine-grained, perhaps even corresponding to the trivial partition $\{\{s_1\}, \dots, \{s_k\}\}$.) Now, it might be the case that both players are actually implementing the ideal version of “UDT for $S'$”, e.g. for $S' = \{\text{problems that require less than } x \text{ compute to solve}\}$, where the standard Prisoner's Dilemma perhaps is included, but that they approximate “UDT for $S''$”.^{[11]} We then get the following graph:

When you now intervene on the leftmost decision node, e.g. in a Prisoner’s Dilemma, you then have a logi-causal effect on the action of the other player, and you can ensure mutual cooperation in a UCDT vs. UCDT situation.

On the face of it, this is arguably a somewhat realistic picture: for many decision problems (especially the very idealised ones), it is not that difficult to act according to the ideal theory.
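One way to picture “UDT for S′” is as the restriction of a state-to-action table to a subset of states. The sketch below is only an illustration under made-up assumptions: the state names, the actions and the helper `restrict` are all hypothetical, and plain dictionary equality stands in for "referring to the same logical statement".

```python
def restrict(policy, states):
    """Return the sub-table of `policy` covering only `states`."""
    return {s: policy[s] for s in states}


# Two hypothetical approximators that agree on the easy problems
# but differ on a hard one.
agent1 = {"prisoners_dilemma": "C", "newcomb": "one-box", "hard_problem": "x"}
agent2 = {"prisoners_dilemma": "C", "newcomb": "one-box", "hard_problem": "y"}

easy = frozenset({"prisoners_dilemma", "newcomb"})

# Same table on S': both implement the same "UDT for S'" statement.
same_on_easy = restrict(agent1, easy) == restrict(agent2, easy)

# The full tables differ, so the full statements are distinct.
same_overall = agent1 == agent2

# Equality is exact: a partition that differs by one state gives a different table.
same_if_partitions_differ = (
    restrict(agent1, easy | {"hard_problem"}) == restrict(agent2, easy)
)

print(same_on_easy, same_overall, same_if_partitions_differ)
```

Here the comparison is deliberately all-or-nothing: the tables count as the same statement only when both the set of covered states and every entry coincide.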

But it seems like this is brittle in similar ways as before. A couple of points:

- [...] exact same way, since it is only then you can get the players to refer to the exact same logical statement. This is highly unlikely. (Recall that we are thinking of the decision theory nodes here as conjunctions of all of the statements corresponding to their action recommendations; i.e. very large tables. For example, this means that if the partitions differ by even one state, you do not get mutual cooperation.)
- [...] really interested in the idealised situations: reality is way more messy, meaning most agents will never actually “do the ideal” with respect to almost any situation with anything close to probability one (at least the important ones).

## Approximation Schelling points

We said before that the agents will approximate the decision theory differently (and this is the justification for drawing the graphs in the way we have). But perhaps there are Schelling points or general guidelines for how agents should approximate decision theories, such that there is some basis for saying that two agents will, in fact, approximate the theory in the ‘same’ way.

This line of thought seems plausible, but only further supports the point that UEDT is preferable to UCDT: (1) if there are these Schelling points, we should expect the correlation to be higher (and thus a higher probability of mutual cooperation in the UEDT vs. UEDT case); and (2) we still have the (seemingly insurmountable) ‘exactness issues’ in the case of a UCDT-approximator vs. a UCDT-approximator, where the Schelling point in question would have to be extremely, and implausibly, precise. Moreover, I think this implicitly assumes that the agents in question are equals in terms of resources and power, which of course is not realistic. For example, why would two agents with differing levels of compute approximate in the same way? The agent with more compute would then have to forgo some expected accuracy in terms of coming closer to what the ideal theory would recommend.^{[12]}

## Other notions of 'approximation'

As briefly touched upon, I have been relying on a certain conception of how agents approximate a decision theory—something very ‘top-down’. This is not necessarily the most natural or realistic way agents approximate a decision theory.

## Bottom-up approximations

As said, we could think that the agents start out with some plausible basic principles that could evolve into something more like the ideal theory as they get more intelligent and gain more resources and information. (See A theory of bounded inductive rationality by Oesterheld et al. for something similar.) Specifically, this might look like two developers deciding on what the initial principles (which of course could be inspired by what they regard as the ideal theory) should be when building their respective AIs, and then letting them out into the world where they proceed to grow more powerful and perhaps can self-modify into something closer to optimal. And then at some later point(s) in time they play some game(s) against each other. When you have two agents approximating UDT in this way, it is prima facie unclear whether you get the same results as before.

In particular, the correlation might in this case be much lower; for example, due to (slightly) different choices of initial principles on the part of the developers, or just mere contingencies in evolution. This means you do not necessarily get mutual cooperation even when what they are approximating is UEDT.

This seems plausible to me, but note that the argument here is merely that “it might be difficult to get mutual cooperation when you have two UEDT-approximators playing against each other as well”, i.e. this is not in and of itself a point in favour of UCDT. Au contraire: now it is even less clear whether there are any logi-causal effects, considering the initial principles might differ and might not be derived from the same ideal theory. Furthermore, the correlation might also be sufficient; and perhaps we should expect some high degree of convergence (not divergence) in the aforementioned evolution; and I suppose my point in this post partially amounts to saying that “we want the convergence in question to be UEDT”.

## A hacky approach

Although agents do not know many of the action recommendations of the ideal theory for certain (this is why they are approximating the decision theory in the first place), there are structural features of the ideal theory that are known: for one, they know that top-down approximations are downstream of the ideal theory.

Prima facie, this implies that insofar as agents are approximating UCDT by “trying to do what ideal UCDT would have recommended”, including the aforementioned structural feature, they will act as if they have a partial logi-causal effect (corresponding to the accuracy of the other agent’s approximation) on the other agent’s choice of action, since the approximation of the other agent is downstream of the ideal theory. As such, if both agents reason in this “structural” way, they will both cooperate.^{[13]}

However, although this seems plausible on the face of it, it is unclear whether this is workable or principled, whether the argument is sound to begin with, and how this should be formalized. A couple of concrete issues and question marks:

- [...] actual logi-causal effect on the choice of the other player when the structural UCDT-approximator makes its decision in this way? I am not sure, but perhaps this would become clear with a proper formalisation of this type of approximation. If there is not an actual logi-causal effect, on the other hand, then the agents are really just pretending there is one, and it is unclear why they are not then defecting.
- [...] could be logi-causal effects going on here, we can split up the discussion corresponding to the following two games: (i) a structural UCDT-approximator playing the Prisoner’s Dilemma against a structure-less UCDT-approximator; and (ii) a structural UCDT-approximator playing against another structural UCDT-approximator.
- [...] prima facie quite strange: on one hand, you would think that the ideal UCDT, being upstream of the choice of the approximator, could ensure mutual cooperation with some high probability by cooperating themselves. On the other hand, since the choice of the agent implementing ideal UCDT is not downstream of the approximator's decision, the approximator will defect.
- [...] asymmetric, meaning there is not a logi-causal effect of the choice of the ideal UCDTer on the approximator. Illustration below (the ideal UCDTer intervenes on the choice node to the right, and there is no logi-causal effect on the choice node of the other player).^{[14]}^{[15]}

In sum, this hacky way of approximating logi-causalist decision theories does not ensure mutual cooperation in the Prisoner’s Dilemma.

## Conclusions

An algorithm or decision theory corresponds to a logical statement that is very precise. This means that any slight difference in how agents implement an algorithm creates difficulties for having any subjunctive effect on the actions of other agents. In particular, when two agents are respectively approximating a decision theory, they are going to do this differently, and thus implement different algorithms, which in turn means that the agents do not determine the output of the other agent’s algorithm. In other words, logical causation is brittle, which means that mutual logi-causalist cooperation is brittle.

EDT-style correlational reasoning is not brittle in this way: for good or bad, you just need a sufficient degree of similarity between the agents (in terms of decision theory) as well as a sufficient level of game-theoretic symmetry (and no screening off), and that is it.

In light of this, the following passage from the FDT paper discussing the Twin Prisoner’s Dilemma (wherein FDT is also formulated with the counterfactual cashed out as conditioning on do(FDT(P,G)=a)) seems misleading (bold emphasis mine):

If “twin” is supposed to mean “perfect psychological copy controlling for chaotic effects et cetera” (which I think is the claim in the paper), then this seems true because you truly intervene on the output of both of your algorithms. But for the most part in the real world, they are not going to implement the exact same algorithm and use the exact same decision theory; and, as such, you do not get mutual cooperation.

And the following from Cheating Death in Damascus (also about the Twin Prisoner’s Dilemma, and also FDT with counterfactuals), seems to be incorrect (bold emphasis mine):

As far as I can tell, no explanation for why “a close approximation” would suffice is given.^{[16]}

As a final note, I think it is important to not delude ourselves with terms like “success”, “failure”, “wins” and “preferable” (which I have used in this post) in relation to decision theories; UEDT and UCDT are both the “correct” decision theory by their own lights (just as all decision theories): the former maximises expected utility with conditionals from the earlier logical vantage point, and the latter does the same but with logical counterfactuals, and that is that: there is no objective performance metric. See The lack of performance metrics for CDT versus EDT, etc. by Caspar Oesterheld for more on this. Personally, I just want to take it as a primitive that a reasonable decision theory should recommend cooperation in the Prisoner’s Dilemma against a similar (but not necessarily identical) opponent.

So, in this post I made nothing but the following claim: insofar as you want your updateless decision theory to robustly cooperate in the Prisoner’s Dilemma against similar opponents, UEDT is all else equal preferable to UCDT.

^{^} I.e. we do policy selection instead of action or program selection as in the cases of UDT1.0 and UDT2.0. I think everything in this post should generalise to those theories as well though, mutatis mutandis.

^{^} Also, the following from MIRI/OP exchange about decision theory: "Wei Dai does not endorse FDT's focus on causal-graph-style counterpossible reasoning; IIRC he's holding out for an approach to counterpossible reasoning that falls out of evidential-style conditioning on a logically uncertain distribution."

^{^} Personally, I also think that the Smoking Lesion is unpersuasive: in cases where the premises of the Tickle Defence are satisfied, you should smoke; and when they are not satisfied, there is not really a choice, per se: either you smoke or you do not. See e.g. Ahmed (2014, ch.4) and Understanding the Tickle Defense in Decision Theory by Caspar Oesterheld for more on this.

^{^} I will return to what I take this to (not) mean in the last section.

^{^} Note that we are using FDT-style causal graphs here, and cashing out the counterfactual of UCDT as a do-operator, despite the following from the MIRI/OP exchange about decision theory: "[O]ne reason we [MIRI decision theorists/NS & EY] do not call [FDT] UDT (or cite Wei Dai much) is that Wei Dai does not endorse FDT's focus on causal-graph-style counterpossible reasoning; IIRC he's holding out for an approach to counterpossible reasoning that falls out of evidential-style conditioning on a logically uncertain distribution."

^{^} I am merely saying here that there is something that the agents strive towards; i.e. not making the stronger claim that “the ideal” necessarily has to exist, or even that this notion is meaningful.

^{^} The thought here is that it is costly to compute all of the correlations, and in most cases just focusing on the causal effects (the ‘perfect correlations’, in one direction) will come close to the recommendations of EDT. Moreover, there is often a Tickle Defence lurking whenever you have some Newcomblike decision problem, meaning EDT and CDT will often recommend the same action. See e.g. Ahmed (2014) for more on this.

^{^} For more on this see section 2.9 of Caspar Oesterheld’s Multiverse-wide Cooperation via Correlated Decision Making, as well as footnote 15 in this post.

^{^} Perhaps a more confrontative title of this post would have been ‘UDT/FDT two-boxes’.

^{^} The issue here is that even if we copy-and-paste a clump of matter, it will not share the exact same location afterwards. And ‘location’ is clearly a property of a clump of matter, meaning the two clumps of matter will not be the same, strictly speaking.

^{^} h/t Caspar Oesterheld for this suggestion.

^{^} Perhaps this is solvable if you are updateless with respect to who has the most compute, i.e. we take the perspective of the original position. But (i) this relies on agents having a prior belief that there is a sufficient probability that in another close-by branch you are actually playing the other side of the game (which is not obvious); (ii) this seems like an instance of ‘mixed-upside updatelessness’, which arguably partially reduces to preferences in the way Paul Christiano describes here; and (iii) this means that you are not going to use all of your compute for your decisions, or not use your best approximation for π, and it is unclear if this is something you want.

^{^} h/t Lukas Finnveden for this suggestion.

^{^} As touched upon before, you can arguably solve symmetry-issues by being updateless with respect to what side of the game you are playing, which in this case translates to being updateless with respect to whether you are the approximator or the ideal UCDTer. So, at the very best, it seems that you also (in imagining that you implement the ideal theory) need to be updateless with respect to the sophistication of the very decision theory you are implementing. However, it is very unclear if this is something you want to be updateless about since this plausibly will influence the quality of your subsequent decisions. For example, perhaps you would need to forgo some amount of your compute for the expected utility calculations, or simply not use your best approximation for π. Additionally, and more importantly perhaps, this only works if both players (the approximator and the ideal UCDTer) are updateless about the sophistication of their decision theory, and it is of course questionable if this is a justified assumption. (Also, as usual, this depends on the priors in question.)

^{^} h/t Lukas Finnveden again.

^{^} I would appreciate it if anyone could fill in the blanks here. It could be that the authors think that two approximators will have the same decision theory with some sufficiently high probability, and thus that the agents will be sufficiently certain that they have a logi-causal effect on the action of the other player to warrant cooperation. But this is highly improbable as previously argued.