Thanks to Vael Gates for mentoring and providing guidance on this project for the past couple of months, and to Scott Garrabrant and Jack Ryan for illuminating conversations on the subject.

 

I’m a student doing a research project on decision theory. This is largely a personal project with the objective of clarifying and improving my own understanding. In this literature review, I aim to break down the recent discussion of decision theories, offering brief explanations of and objections to the major theories. I follow a chronological approach, and further separate decision theories into the traditional theories which evaluate actions, and the more recent theories which shift to evaluating policies. 

Traditional decision theories include evidential decision theory (EDT) and causal decision theory (CDT). Both of these theories focus on evaluating actions.

Evidential Decision Theory

EDT evaluates actions by conditioning: to score a given action, it supposes that the action is taken and computes the conditional expected utility of the outcomes under a Bayesian probability distribution. However, EDT has serious issues when confronted with certain decision-theoretic problems, in particular those which turn on the information it receives from outside sources. EDT fails the “Blackmail Problems”, including but not limited to “Evidential Blackmail” (Soares and Fallenstein 2015), “XOR Blackmail” (Demski 2017), and “Blackmail” (Schwarz 2018), in the sense that it recommends the lower-utility action in each. More precisely, EDT can consistently be “money-pumped”, or tricked by the way in which it receives information about the outside world, making it extremely susceptible to a clever blackmailer. The root issue is that conditioning on an action imports “spurious” connections: correlations that are merely evidential rather than causal. These problems, among others, are often considered fatal for EDT.
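
To make the conditioning step concrete, here is a minimal sketch of how an EDT agent ranks actions. This is my own illustration, not from any of the cited papers; the joint distribution and utilities are invented.

```python
# Minimal EDT sketch: rank actions by conditional expected utility
# E[U | action], computed from a joint distribution P(action, outcome).

joint = {                     # P(action, outcome); illustrative numbers only
    ("a1", "good"): 0.45, ("a1", "bad"): 0.05,
    ("a2", "good"): 0.10, ("a2", "bad"): 0.40,
}
utility = {"good": 10.0, "bad": -10.0}

def edt_value(action):
    p_a = sum(p for (a, _), p in joint.items() if a == action)  # P(action)
    return sum((p / p_a) * utility[o]                           # sum_o P(o | a) * U(o)
               for (a, o), p in joint.items() if a == action)

print(max(("a1", "a2"), key=edt_value))  # EDT recommends the action with highest E[U | a]
```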

Here’s a walked-through example of “Evidential Blackmail” from Soares and Fallenstein (hereafter S&F); I’ve abridged the example somewhat for brevity:

Consider a powerful AI agent which has grown wealthy through significant investments in the stock market. It has substantial investments in a particular company; rumors are currently circulating about the CEO of that company. The AI agent assigns a 0.4% chance that a scandal breaks, resulting in the CEO’s resignation; in that event the AI expects to lose USD 150 million on the market.

Some clever AI researcher, renowned for honesty and accuracy, has access to the source code of the AI agent. Further, the researcher learns with certainty whether or not there will be a scandal. The researcher then predicts (and the prediction is assumed to be perfectly accurate) whether or not the agent would pay USD 100 million upon receiving a certain message. If either

  1. There is not a scandal and the researcher predicts that the agent will pay, or
  2. There is a scandal and the researcher predicts that the agent will not pay

then the researcher sends a pre-drafted message explaining the whole procedure and stating that one of 1) or 2) turned out to be true, and requests USD 100 million.

S&F present EDT as reasoning as follows. Since the agent has received the message, either it pays and there was no scandal, or it does not pay and there was a scandal. If it pays, it loses USD 100 million; if it does not pay, it loses USD 150 million. The first option clearly has higher utility, so EDT prescribes paying.
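
Here is that comparison as a toy computation. The dollar figures come from the example above; the perfect correlation between paying and the absence of a scandal follows from the researcher’s assumed perfect prediction.

```python
# Evidential Blackmail, conditioned on having received the message.
# Perfect prediction makes the correlation exact: pay <=> no scandal.
outcomes_given_message = {
    "pay":    {"p_scandal": 0.0, "payment": 100.0},  # millions USD
    "refuse": {"p_scandal": 1.0, "payment": 0.0},
}
SCANDAL_LOSS = 150.0  # millions USD

def edt_expected_loss(action):
    o = outcomes_given_message[action]
    return o["payment"] + o["p_scandal"] * SCANDAL_LOSS

for action in ("pay", "refuse"):
    print(action, edt_expected_loss(action))  # pay: 100.0, refuse: 150.0
# EDT minimizes conditional expected loss, so it pays -- even though
# paying has no causal influence on whether the scandal occurs.
```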

S&F proceed to demonstrate that EDT’s conclusion here is clearly flawed: whether or not the agent pays has no bearing on the occurrence of the scandal. Paying the researcher is therefore a needless loss of money, yet it is what EDT prescribes. S&F, along with others, critique EDT’s reliance on “spurious” connections, the merely evidential (non-causal) links which lead EDT to conclude that paying is the rational choice in this sort of blackmail problem. This susceptibility is generally considered a fatal problem for EDT, as it leaves the agent extremely vulnerable to anyone taking the approach of the researcher in the above example. Given these problems, there is a motivation to employ a decision theory which avoids them; thus we turn to CDT.

Causal Decision Theory

CDT aims to avoid these “spurious” connections by reasoning about counterfactuals in terms of causal implications. Essentially, this means looking only at what the action actually causes, thus culling the problematic “spurious” connections. CDT does this via “causal interventions”, which go something like this: 1) construct a causal graph in which the action is represented by a single node, 2) cut the connections between the action node and its causal predecessors, and 3) evaluate the expected utility of the resulting causal counterfactual. However, CDT faces various “Newcomblike Problems”, and whether these are fatal for CDT or for adjusted variants of CDT is somewhat more disputed. For instance, consider a variant of the Prisoner’s Dilemma (itself a Newcomblike problem) which pits two identical agents against each other. S&F argue that in this variant, CDT neglects “non-causal logical constraints”, and so recommends defection even when playing against an identical copy of oneself (this is particularly applicable to source-code-identical AI agents). Essentially, the problem is as follows:

Since the two agents are guaranteed by logical necessity to perform the same action, there are only two possibilities: either both cooperate or both defect. Per the payoffs of the Prisoner’s Dilemma, mutual cooperation is better than mutual defection, so the rational choice is cooperation. However, standard CDT will defect, locking both agents into the suboptimal outcome.
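
A minimal sketch of the contrast, using the usual illustrative payoff numbers (not from S&F):

```python
# Twin Prisoner's Dilemma, payoffs to the row player (illustrative):
#   both cooperate: 2, both defect: 1,
#   I defect / twin cooperates: 3, I cooperate / twin defects: 0.
payoff = {("C", "C"): 2, ("C", "D"): 0, ("D", "C"): 3, ("D", "D"): 1}

def cdt_value(my_action, p_twin_cooperates):
    """CDT-style intervention: the twin's action is held fixed (its link
    to my action is severed), so I average over a fixed belief."""
    return (p_twin_cooperates * payoff[(my_action, "C")]
            + (1 - p_twin_cooperates) * payoff[(my_action, "D")])

def logical_value(my_action):
    """Policy-level reasoning: a source-code-identical twin necessarily
    mirrors my action, so only the diagonal outcomes are possible."""
    return payoff[(my_action, my_action)]

p = 0.5  # any fixed belief; defection dominates regardless of p
print(max("CD", key=lambda a: cdt_value(a, p)))  # -> 'D' (CDT defects)
print(max("CD", key=logical_value))              # -> 'C' (cooperation is optimal)
```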

So it looks like CDT fails a certain type of Newcomb problem because it does not consider non-causal counterfactuals. Namely, CDT severs the causal connection to its opponent in the Prisoner's Dilemma variant, so it reasons as though it were in a standard Prisoner's Dilemma. Yet it fails to consider that, by logical or physical necessity, its opponent will act the same way it does. Whether or not this is a fatal flaw for CDT is somewhat unclear, as one could imagine a variant of CDT which updates appropriately, reevaluates the causal connections once it has made the initial decision to defect, and switches to cooperating. As its opponent does the same, an updating CDT might get around the above variant of the Prisoner's Dilemma.

Yet CDT faces other problems, perhaps even a variant of the above family of “Blackmail” problems. Consider “Counterfactual Blackmail”, in which S&F describe a scenario where an honest, risk-averse researcher deploys a virus capable of causing the AI agent USD 150 million in stock-market losses, but offers to stop the virus for USD 100 million. CDT prescribes paying up, as that results in the smaller loss. S&F argue that this is the wrong choice: had the agent been one which refuses to pay, the risk-averse researcher would never have deployed the virus in the first place.
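
In toy form (the loss figures come from the example; the assumption that the risk-averse researcher only deploys against agents predicted to pay is S&F’s premise):

```python
# Counterfactual Blackmail (losses in millions USD, from the example).
DEMAND, HARM = 100.0, 150.0

# CDT, evaluated after the virus has already been deployed:
cdt_losses = {"pay": DEMAND, "refuse": HARM}

# Policy-level view: per S&F, a risk-averse researcher only deploys
# against an agent predicted to pay.
policy_losses = {"committed_to_pay": DEMAND,   # blackmailed, then pays
                 "committed_to_refuse": 0.0}   # never blackmailed at all

print(min(cdt_losses, key=cdt_losses.get))        # -> 'pay'
print(min(policy_losses, key=policy_losses.get))  # -> 'committed_to_refuse'
```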

This seems to me to be where academic philosophers and researchers from MIRI diverge on whether this example actually poses a problem for CDT. Once the AI is being blackmailed, paying up is the better choice, as the virus has already been deployed. Perhaps in some counterfactual it would not have been deployed, but that simply is not the scenario the agent is presently in, so the counterfactual lack of deployment does not seem to have any bearing on the agent’s present position. So there is considerable uncertainty about whether this is actually a fatal flaw for CDT. Moreover, there are attempts to “patch” CDT; for instance, “Global CDT” (discussed below) is a recent variant which in theory performs optimally on previously problematic cases.

Both EDT and CDT face substantial problems. Both also evaluate actions, while the following more recent decision theories take a new approach: evaluating policies. This is where the split between academic philosophers and researchers at MIRI seems to become more pronounced.

Updateless Decision Theory

The first decision theory to take this approach is updateless decision theory (UDT), which identifies the best policy in advance, before acting on or updating on observations. Per S&F, this gets around the problem CDT faces in the Prisoner’s Dilemma variant above: source-code-identical selfish agents will not defect but will instead cooperate, which yields the higher expected utility. The price is that UDT requires evaluating “logical counterfactuals” (exactly what allows it to avoid the objection to CDT), and there is currently no good way of formalizing logical counterfactuals, nor any of the other necessary non-causal counterfactuals. Formalizing UDT is thus particularly difficult.
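
The core move is to maximize over whole policies rather than actions. Here is a minimal sketch under an invented toy environment (my own illustration, not from the papers): the environment is Newcomblike, in that a predictor’s behavior in one branch depends on what the policy would do in the other branch.

```python
from itertools import product

# UDT-style sketch: evaluate entire policies (maps from observation to
# action) by prior expected utility, instead of updating first.
prior = {"heads": 0.5, "tails": 0.5}

def utility(obs, policy):
    if obs == "tails":
        return -1.0 if policy["tails"] == "pay" else 0.0   # paying on tails costs 1
    # On heads, a predictor awards 10 iff the policy *would* pay on tails.
    return 10.0 if policy["tails"] == "pay" else 0.0

policies = [{"heads": a, "tails": b}
            for a, b in product(("pay", "refuse"), repeat=2)]

def prior_eu(policy):
    return sum(p * utility(obs, policy) for obs, p in prior.items())

best = max(policies, key=prior_eu)
print(best, prior_eu(best))  # best policy pays on tails: 0.5*10 - 0.5*1 = 4.5
# An agent that first updated on seeing tails would refuse (saving 1
# locally) and thereby forfeit the reward in the heads branch.
```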

Functional Decision Theory

The next policy-based approach is functional decision theory (FDT). First presented in “Cheating Death in Damascus”, by Levinstein and Soares (2017), FDT asks which output of the agent’s decision function gives the best outcome. The paper considers the problems of Death in Damascus (a standard predictor dilemma) and blackmail variants, on which FDT (supposedly) consistently outperforms both EDT and CDT. FDT relies on subjunctive dependence: implementations of the same mathematical function are taken to produce the same output for logical reasons (rather than causal ones), and this dependence holds within counterfactuals as well. Moreover, this provides a slightly more formalized means of capturing the spirit of UDT; for this reason, UDT is sometimes considered to fall within a family of FDT-like theories. FDT requires a “non-trivial theory of counterfactuals” and an account of “algorithmic similarity”. It also seems to need further investigation of reasoning under uncertainty about tautologies and contradictions, or a means of avoiding these problems by reformulating them accordingly.

A more elaborate and thorough explanation of FDT appears in “Functional Decision Theory”, by Yudkowsky and Soares (2018) (Y&S). Here FDT is presented as a variant of CDT which considers not only causal counterfactuals but also logical, physical, mathematical, and other kinds of counterfactuals. Thus, FDT takes a CDT-inspired approach, severing “spurious” connections while still respecting non-causal counterfactuals. Per Y&S, FDT succeeds everywhere that EDT and CDT fail, and offers a concrete series of steps for dealing with Newcomblike problems (a claim that is difficult to evaluate, since no one really agrees on the optimal choice in Newcomblike problems). The further lack of a formal metric for comparing decision theories makes it difficult to say that FDT is truly better, much less that it is entirely optimal.
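
The difference from CDT is easiest to see on Newcomb’s Problem itself. A minimal sketch, using the standard payoff numbers and assuming a perfect predictor:

```python
# Newcomb's Problem: the predictor runs (a model of) the agent's
# decision function, so FDT treats the prediction as subjunctively
# dependent on that function's output, while CDT holds the (past)
# prediction fixed under its causal intervention.

def payout(action, prediction):
    opaque = 1_000_000 if prediction == "one-box" else 0
    return opaque if action == "one-box" else opaque + 1_000

def fdt_value(output):
    # Intervening on the shared function changes prediction and action together.
    return payout(action=output, prediction=output)

def cdt_value(output, fixed_prediction):
    # Causal intervention: the prediction is causally upstream and held fixed.
    return payout(action=output, prediction=fixed_prediction)

print(max(("one-box", "two-box"), key=fdt_value))  # -> one-box (gets 1,000,000)
print(max(("one-box", "two-box"),
          key=lambda a: cdt_value(a, fixed_prediction="one-box")))  # -> two-box
```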

FDT also faces several objections, the first of which was presented by Wolfgang Schwarz in 2018. Schwarz argues that in Blackmail, an FDT agent is unlikely to be blackmailed in the first place, because it is known not to succumb; yet the very problem as posed is that we are being blackmailed. This sort of reliance on policy does nothing to lead the agent to what Schwarz sees as the optimal choice: once one is faced by the threat of blackmail, the fact that agents like oneself do not get blackmailed very often seems irrelevant. Thus, it is not at all obvious that FDT actually outperforms CDT; the verdict depends on how “outperforms” is defined (e.g., resisting blackmail and avoiding the irrational act of paying, versus paying the minor cost and thus avoiding the highly costly outcome). Schwarz offers another similar example, Procreation, in which FDT agents do significantly worse than CDT agents, as FDT leads to a miserable life for the agent in question. Schwarz finds that, among other problems, these objections leave FDT poorly positioned to act as an optimal decision theory.
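
Schwarz’s point can be quantified. In the sketch below the blackmail probabilities are invented for illustration; the dollar figures echo the earlier examples.

```python
# Whether FDT "outperforms" CDT depends on what we average over.
DEMAND, HARM = 100.0, 150.0            # millions USD
p_blackmailed = {"FDT": 0.01,          # known not to pay, so rarely targeted
                 "CDT": 1.00}          # a profitable target, so always targeted
loss_when_targeted = {"FDT": HARM,     # refuses and suffers the harm
                      "CDT": DEMAND}   # pays the demand

for theory in ("FDT", "CDT"):
    avg = p_blackmailed[theory] * loss_when_targeted[theory]
    print(theory, avg)  # FDT: 1.5, CDT: 100.0 -- FDT wins on average
# But conditional on actually being blackmailed, FDT loses 150 to CDT's 100.
```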

As a solution, Schwarz suggests variants of CDT. To capture the spirit of FDT, he sketches “Vengeful CDT”, which holds a severe grudge. But Schwarz argues that the best of the current approaches would be some form of “Compassionate CDT”, which cares enough about others to act optimally in the relevant real-world cases, while still two-boxing in the original Newcomb’s Problem.

The second objection to FDT was presented by William MacAskill in 2019. MacAskill argues that it is unclear that FDT consistently outperforms CDT or that it gets the most utility, despite this being the criterion on which Y&S base their argument. This lack of clarity is symptomatic of 1) the lack of a formalization of FDT and 2) disagreements about which choice actually yields the highest utility. MacAskill also highlights what he calls “implausible discontinuities”, asking where FDT draws the line between “statistical regularities” and “Predictors”: at what point does a statistical likelihood become sufficiently accurate to count as a “Predictor” in problems such as “Blackmail”, Newcomb’s Problem and its variants, and “Death in Damascus”? These discontinuities seem to create an arbitrary (and therefore potentially problematic) point at which FDT would switch between recommending two different policies, for instance switching from one-boxing to two-boxing in a Newcomb problem. MacAskill further finds FDT to be “deeply indeterminate”. Consider the problem of identifying whether or not two agents are “really running the same program”. To illustrate, MacAskill gives the example of two calculators which produce identical results, save that one machine prints a negative sign in front of its output. Deciding whether the two compute the same function becomes a problem of symbol interpretation and potentially arbitrary notation: the sign might be a mere notational quirk, meaning the same thing in a foreign convention, or the two machines might be performing radically different functions. The same difficulty arises, more pressingly, for a pair of AI systems whose operations fit the characteristic black-box model. In response to these seemingly fatal problems for FDT, MacAskill recommends “Global CDT”.
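
The calculator example above can be rendered in miniature; this is my own toy version of MacAskill’s point, not his:

```python
# Two procedures whose outputs differ only by a printed sign. Whether
# they "run the same algorithm" depends on how we interpret the symbol,
# not on anything in the code itself.

def calculator_a(x, y):
    return str(x + y)          # displays e.g. "5"

def calculator_b(x, y):
    return "-" + str(x + y)    # displays e.g. "-5"

# Interpretation 1: calculator_b's "-" is a notational quirk, so "-5"
# *means* 5 -> same function, same algorithm, subjunctive dependence.
# Interpretation 2: the "-" means ordinary negation -> calculator_b
# computes -(x + y), a different function altogether.
print(calculator_a(2, 3), calculator_b(2, 3))
```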

“Global CDT”

MacAskill argues that we want to maximize expected utility at anything which is an evaluative focal point, and that this should be the central criterion for a decision theory. Thus, “Global CDT” still captures the same goal as FDT: maximizing utility. Yet we also have to consider how to reconcile “irrational actions from rational dispositions”. In the Prisoner’s Dilemma, the rational action is to defect, yet the optimal disposition when facing an identical copy of oneself is to cooperate. Defecting is the right action, while the right sort of person is the one who cooperates. So when we consider identical agents in this case, we want them to be the right sorts of agents, who will thus cooperate. Helpfully, this causes no problems for agents who will defect when defection is optimal. These dispositions are the “Global” aspect of “Global CDT”.

In a certain sense, “Global CDT” looks quite similar to FDT. It prescribes the standard CDT action where CDT has historically succeeded, and avoids the problems standard CDT faces by taking a “Global” view, which is fairly similar to the policy-driven approach on which FDT is founded. “Global CDT” has certain benefits over FDT, namely that it avoids the problems outlined by Schwarz and MacAskill. But it is unclear whether it gets around one of the primary issues facing FDT, the lack of a formalization: “Global CDT” relies on having the right sort of dispositions, which seems to say, roughly, “pick the best choice when considering multiple agents”. Whether or not this is really different from FDT’s approach of considering non-causal counterfactuals remains unclear.


 
