Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

This post is part of the sequence version of the Effective Altruism Foundation's research agenda on Cooperation, Conflict, and Transformative Artificial Intelligence.

7 Foundations of rational agency

We think that the effort to ensure cooperative outcomes among TAI systems will likely benefit from thorough conceptual clarity about the nature of rational agency. Certain foundational achievements — probability theory, the theory of computation, algorithmic information theory, decision theory, and game theory, to name some of the most profound — have been instrumental both in providing a powerful conceptual apparatus for thinking about rational agency and in the development of concrete tools in artificial intelligence, statistics, cognitive science, and so on. Likewise, there are a number of outstanding foundational questions surrounding the nature of rational agency whose resolution we expect to yield additional clarity about interactions between TAI-enabled systems. Broadly, we want to answer:

  • What are the implications of computational boundedness (Russell and Subramanian, 1994; Cherniak, 1984; Gershman et al., 2015) for normative decision theory, in particular as applied to interactions among TAI systems?

  • How should agents handle, in their own decisions, non-causal dependences with other agents’ decision-making?

We acknowledge, however, the limitations of the agenda for foundational questions which we present. First, it is plausible that the formal tools we develop will be of limited use in understanding TAI systems that are actually developed. This may be true of black-box machine learning systems, for instance [1]. Second, there is plenty of potentially relevant foundational inquiry scattered across epistemology, decision theory, game theory, mathematics, philosophy of probability, philosophy of science, etc. which we do not prioritize in our agenda [2]. This does not necessarily reflect a considered judgement about all relevant areas. However, it is plausible to us that the research directions listed here are among the most important, tractable, and neglected (Concepts, n.d.) directions for improving our theoretical picture of TAI.

7.1 Bounded decision theory [3]

Bayesianism (Talbott, 2016) is the standard idealized model of reasoning under empirical uncertainty. Bayesian agents maintain probabilities over hypotheses; update these probabilities by conditionalization in light of new evidence; and make decisions according to some version of expected utility decision theory (Briggs, 2019). But Bayesianism faces a number of limitations when applied to computationally bounded agents. Examples include:

  • Unlike Bayesian agents, computationally bounded agents are logically uncertain. That is, they are not aware of all the logical implications of their hypotheses and evidence (Garber, 1983) [4]. Logical uncertainty may be particularly relevant in developing a satisfactory open-source game theory (Section 3.2), as open-source game theory requires agents to make decisions on the basis of the output of their counterparts' source codes (which are logical facts). In complex settings, agents are unlikely to be certain about the output of all of the relevant programs. Garrabrant et al. (2016) presents a theory for assigning logical credences, but it has flaws when applied to decision-making (Garrabrant, 2017). Thus one research direction we are interested in is a theoretically sound and computationally realistic approach to decision-making under logical uncertainty.

  • Unlike Bayesian agents, computationally bounded agents cannot reason over the space of all possible hypotheses. Using the terminology of statistical modeling (e.g., Hansen et al. 2016), we will call this situation model misspecification [5]. The development of a decision theory for agents with misspecified world-models would seem particularly important for our understanding of commitment in multi-agent settings. Rational agents may sometimes want to bind themselves to certain policies in order to, for example, reduce their vulnerability to exploitation by other agents (e.g., Schelling (1960); Meacham (2010); Kokotajlo (2019a); see also Section 3 and the discussion of commitment races in Section 2). Intuitively, however, a rational agent may be hesitant to bind themselves to a policy by planning with a model which they suspect is misspecified. The analysis of games of incomplete information may also be quite sensitive to model misspecification [6]. To develop a better theory of reasoning under model misspecification, one might start with the literatures on decision theory under ambiguity (Gilboa and Schmeidler, 1989; Maccheroni et al., 2006; Stoye, 2011; Etner et al., 2012) and robust control theory (Hansen and Sargent, 2008); a toy contrast between a Bayesian and an ambiguity-averse decision rule is sketched below.
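
As a toy illustration of that contrast, the following sketch (our own, with made-up action names and probabilities) compares a standard Bayesian expected-utility choice under a single best-guess model with a maxmin rule in the spirit of Gilboa and Schmeidler (1989), which evaluates each action by its worst-case expected utility over a set of candidate models the agent cannot rule out:

```python
import numpy as np

# Two actions, three possible outcomes, utility of each outcome.
utilities = np.array([1.0, 0.0, 0.5])  # u(o1), u(o2), u(o3)

# P(outcome | action) under the agent's single "best guess" model.
best_guess = {
    "commit":        np.array([0.7, 0.2, 0.1]),
    "stay_flexible": np.array([0.4, 0.1, 0.5]),
}

# Alternative models the agent cannot rule out (possible misspecification).
candidate_models = [
    best_guess,
    {"commit": np.array([0.3, 0.6, 0.1]), "stay_flexible": np.array([0.4, 0.2, 0.4])},
    {"commit": np.array([0.5, 0.4, 0.1]), "stay_flexible": np.array([0.45, 0.1, 0.45])},
]

def expected_utility(model, action):
    return float(model[action] @ utilities)

# Standard Bayesian choice: maximize EU under the single best-guess model.
bayes_choice = max(best_guess, key=lambda a: expected_utility(best_guess, a))

# Ambiguity-averse (maxmin) choice: maximize worst-case EU over all models.
def worst_case_eu(action):
    return min(expected_utility(m, action) for m in candidate_models)

maxmin_choice = max(best_guess, key=worst_case_eu)

print("Bayesian choice:", bayes_choice)   # favours the commitment here
print("Maxmin choice:  ", maxmin_choice)  # hedges against misspecification
```

The point is only structural: an agent that takes misspecification seriously may decline a commitment that looks best under its single best-guess model.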

7.2 Acausal reasoning [7]

Newcomb’s problem [8] (Nozick, 1969) showed that classical decision theory bifurcates into two conflicting principles of choice in cases where outcomes depend on agents' predictions of each other's behavior. Since then, considerable philosophical work has gone towards identifying additional problem cases for decision theory and towards developing new decision theories to address them. As with Newcomb's problem, many decision-theoretic puzzles involve dependences between the choices of several agents. For instance, Lewis (1979) argues that Newcomb's problem is equivalent to a prisoner's dilemma played by agents with highly correlated decision-making procedures, and Soares and Fallenstein (2015) give several examples in which artificial agents implementing certain decision theories are vulnerable to blackmail.

In discussing the decision theory implemented by an agent, we will assume that the agent maximizes some form of expected utility. Following Gibbard and Harper (1978), we write the expected utility given an action $a$ for a single-stage decision problem in context $c$ as

$$EU(a) = \sum_{o} u(o)\, P(o \,\|\, a),$$

where $o$ ranges over the possible outcomes; $u$ is the agent’s utility function; and $\|$ stands for a given notion of dependence of outcomes on actions. The dependence concept an agent uses for $\|$ in part determines its decision theory.
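
Read structurally, the formula is a template in which the dependence notion is a pluggable ingredient; the decision theories discussed below differ only in which ingredient they supply. A minimal sketch of this reading (ours, with illustrative type names):

```python
from typing import Callable, Dict, List

Outcome = str
Action = str

def expected_utility(
    action: Action,
    outcomes: List[Outcome],
    utility: Dict[Outcome, float],
    dependence: Callable[[Outcome, Action], float],  # P(o || a): the pluggable dependence notion
) -> float:
    # EU(a) = sum over outcomes o of u(o) * P(o || a). Supplying a causal,
    # evidential, or subjunctive dependence function yields CDT, EDT, or a
    # logical decision theory, respectively.
    return sum(utility[o] * dependence(o, action) for o in outcomes)
```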

The philosophical literature has largely been concerned with causal decision theory (CDT) (Gibbard and Harper, 1978) and evidential decision theory (EDT) (Horgan, 1981), which are distinguished by their handling of dependence.

Causal conditional expectations account only for the causal effects of an agent’s actions; in the formalism of Pearl (2009)’s do-calculus, for instance, the relevant notion of expected utility given an action $a$ is $\mathbb{E}[u(o) \mid \mathrm{do}(a)]$. EDT, on the other hand, takes into account non-causal dependencies between the agent's actions and the outcome. In particular, it takes into account the evidence that taking the action provides for the actions taken by other agents in the environment with whom the decision-maker's actions are dependent. Thus the evidential expected utility is the classical conditional expectation $\mathbb{E}[u(o) \mid a]$.
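
To make the contrast concrete, here is a small numerical sketch (ours, using the payoffs of the Newcomb's problem described in footnote 8 and an assumed predictor accuracy of 0.99). The evidential calculation conditions on the agent's own choice, while the causal calculation holds the already-fixed box contents at the agent's prior:

```python
# Newcomb's problem: the opaque box contains $1,000,000 iff the predictor
# foresaw one-boxing; the clear box always contains $1,000.
ACCURACY = 0.99  # assumed reliability of the predictor (illustrative)

def payoff(action, box_full):
    base = 1_000_000 if box_full else 0
    return base + (1_000 if action == "two-box" else 0)

# Evidential expected utility: condition on the action itself.
# P(box full | one-box) = ACCURACY, P(box full | two-box) = 1 - ACCURACY.
def evidential_eu(action):
    p_full = ACCURACY if action == "one-box" else 1 - ACCURACY
    return p_full * payoff(action, True) + (1 - p_full) * payoff(action, False)

# Causal expected utility: intervening on the action (do(a)) does not change
# the already-fixed contents of the box, so we use the agent's prior over the
# box contents, which is the same whichever action is evaluated.
def causal_eu(action, p_full_prior):
    return p_full_prior * payoff(action, True) + (1 - p_full_prior) * payoff(action, False)

for a in ("one-box", "two-box"):
    print(a, "EDT:", evidential_eu(a), "CDT:", causal_eu(a, p_full_prior=0.5))

# EDT favours one-boxing (about $990,000 vs about $11,000); CDT favours
# two-boxing, since for any fixed prior the two-box payoff is exactly $1,000 higher.
```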

Finally, researchers in the AI safety community have more recently developed what we will refer to as logical decision theories, which employ a third class of dependence for evaluating actions (Dai, 2009; Yudkowsky, 2009; Yudkowsky and Soares, 2017). One such theory is functional decision theory (FDT) [9], which uses what Yudkowsky and Soares (2017) refer to as subjunctive dependence. They explain this by stating that "When two physical systems are computing the same function, we will say that their behaviors 'subjunctively depend' upon that function" (p. 6). Thus, in FDT, the expected utility given an action $a$ is computed by determining what the outcome of the decision problem would be if all relevant instances of the agent’s decision-making algorithm output $a$.
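
The following sketch (again ours, not drawn from Yudkowsky and Soares) illustrates subjunctive dependence on the same Newcomb setup, under the assumption that the predictor literally runs the agent's decision function. Setting that function's hypothetical output changes both the prediction and the agent's choice:

```python
# Subjunctive dependence on Newcomb's problem: the predictor and the player
# both instantiate the same decision function. FDT asks "what happens if that
# function outputs a?", so the hypothetical output is propagated to every
# instance of the function, predictor included.

def payoff(action, box_full):
    return (1_000_000 if box_full else 0) + (1_000 if action == "two-box" else 0)

def fdt_value(hypothetical_output):
    # Both instances of the agent's algorithm are counterfactually set to
    # return `hypothetical_output`.
    predictor_forecast = hypothetical_output       # predictor runs the same function
    box_full = (predictor_forecast == "one-box")   # box is filled accordingly
    players_action = hypothetical_output
    return payoff(players_action, box_full)

best = max(["one-box", "two-box"], key=fdt_value)
print({a: fdt_value(a) for a in ("one-box", "two-box")}, "->", best)
# {'one-box': 1000000, 'two-box': 1000} -> one-box
```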

In this section, we will assume an acausal stance on decision theory, that is, one other than CDT. There are several motivations for using a decision theory other than CDT:

  • Intuitions about the appropriate decisions in thought experiments such as Newcomb’s problem, as well as defenses of apparent failures of acausal decision theory in other cases (in particular, the "tickle defense" of evidential decision theory in the so-called smoking lesion case; see Ahmed (2014) for extensive discussion);

  • Conceptual difficulties with causality (Schaffer, 2016);

  • Demonstrations that agents using CDT are exploitable in various ways (Kokotajlo, 2019b; Oesterheld and Conitzer, 2019);

  • The evidentialist wager (MacAskill et al., 2019), which goes roughly as follows: In a large world (more below), we can have a far greater influence if we account for the acausal evidence our actions provide for the actions of others. So, under decision-theoretic uncertainty, we should wager in favor of decision theories which account for such acausal evidence.

We consider these sufficient motivation to study the implications of acausal decision theory for the reasoning of consequentialist agents. In particular, in this section we take up various possibilities for acausal trade between TAI systems. If we account for the evidence that one's choices provide about the choices of causally disconnected agents, this opens up both qualitatively new possibilities for interaction and quantitatively many more agents to interact with. Crucially, due to the potential scale of value that could be gained or lost via acausal interaction with vast numbers of distant agents, ensuring that TAI agents handle decision-theoretic problems correctly may be even more important than ensuring that they have the correct goals.

Agents using an acausal decision theory may coordinate in the absence of causal interaction. A concrete illustration is provided in Example 7.2.1, reproduced from Oesterheld (2017b), which is itself based on an example in Hofstadter (1983).


Example 7.2.1 (Hofstadter’s evidential cooperation game)

Hofstadter sends 20 participants the same letter, asking them to respond with a single letter ‘C’ (for cooperate) or ‘D’ (for defect) without communicating with each other. Hofstadter explains that by sending in ‘C’, a participant can increase everyone else’s payoff by $2. By sending in ‘D’, participants can increase their own payoff by $5. The letter ends by informing the participants that they were all chosen for their high levels of rationality and correct decision making in weird scenarios like this. Note that every participant only cares about the balance of her own bank account and not about Hofstadter’s or the other 19 participants’. Should you, as a participant, respond with ‘C’ or ‘D’?

An acausal argument in favor of ‘C’ runs as follows: if I play ‘C’, this gives me evidence that the other participants also chose ‘C’. So even though I cannot cause the others to play ‘C’ (and, on a CDT analysis, should therefore play ‘D’), my conditional expected payoff given that I play ‘C’ is higher than my conditional expected payoff given that I play ‘D’.
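
As a rough numerical rendering of this argument (ours, with an assumed strength of evidential dependence), suppose a participant believes that her own choice is strong but imperfect evidence about each other participant's choice. Then the two conditional expectations can be compared directly:

```python
N_OTHERS = 19          # the other participants
GAIN_TO_EACH = 2       # $2 to every other participant if a player sends 'C'
DEFECT_BONUS = 5       # $5 to oneself for sending 'D'

def my_expected_payoff(p_other_cooperates_given_my_choice, i_defect):
    # My payoff from others depends only on how many of them send 'C'.
    from_others = GAIN_TO_EACH * N_OTHERS * p_other_cooperates_given_my_choice
    return from_others + (DEFECT_BONUS if i_defect else 0)

# Assumed (illustrative) evidential dependence: because all participants were
# selected for reasoning similarly, my choosing 'C' is strong evidence that the
# others choose 'C' too, and likewise for 'D'.
P_C_GIVEN_I_COOPERATE = 0.8
P_C_GIVEN_I_DEFECT = 0.2

eu_cooperate = my_expected_payoff(P_C_GIVEN_I_COOPERATE, i_defect=False)
eu_defect = my_expected_payoff(P_C_GIVEN_I_DEFECT, i_defect=True)
print("E[payoff | C] =", eu_cooperate)  # 30.4
print("E[payoff | D] =", eu_defect)     # 12.6
# A CDT analysis instead holds the others' (already-made) choices fixed, in
# which case defecting is always exactly $5 better.
```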


We will call this mode of coordination evidential cooperation.

For a satisfactory theory of evidential cooperation, we will need to make precise what it means for agents to be evidentially (but not causally) dependent. There are at least three possibilities.

  1. Agents may tend to make the same decisions on some reference class of decision problems. (That is, for some probability distribution on decision contexts $C$, $P(\text{Agent 1's decision in context } C = \text{Agent 2's decision in context } C)$ is high.) A toy estimate of such an agreement rate is sketched after this list.

  2. An agent’s taking action A in context C may provide evidence about the number of agents in the world who take actions like A in contexts like C.

  3. If agents have similar source code, their decisions provide logical evidence for their counterpart’s decision. (In turn, we would like a rigorous account of the notion of "source code similarity".)
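
The first notion can be made concrete by estimating an agreement rate empirically. The sketch below (ours, with two toy threshold policies standing in for the agents' decision procedures) samples contexts from a reference class and estimates how often the two agents decide alike:

```python
import random

# Two toy agents: each maps a decision context (here just a number summarising
# the stakes) to 'C' or 'D' via a slightly different threshold rule.
def agent_1(context):
    return "C" if context > 0.4 else "D"

def agent_2(context):
    return "C" if context > 0.5 else "D"

def estimated_agreement(n_samples=100_000, seed=0):
    rng = random.Random(seed)
    agree = 0
    for _ in range(n_samples):
        c = rng.random()  # a decision context drawn from the reference class
        agree += agent_1(c) == agent_2(c)
    return agree / n_samples

print(f"P(same decision) ~ {estimated_agreement():.2f}")  # about 0.90 for these policies
```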

It is plausible that we live in an infinite universe with infinitely many agents (Tegmark, 2003). In principle, evidential cooperation between agents in distant regions of the universe is possible; we may call this evidential cooperation in large worlds (ECL) [10]. If ECL is feasible, it could allow agents to reap large amounts of value via acausal coordination. Treutlein (2019) develops a bargaining model of ECL and lists a number of open questions facing his formalism. Leskela (2019) addresses fundamental limitations on simulations as a tool for learning about distant agents, which may be required to gain from ECL and other forms of "acausal trade". Finally, Yudkowsky (n.d.) lists potential downsides to which agents may be exposed by reasoning about distant agents. The issues discussed by these authors, and perhaps many more, will need to be addressed in order to establish ECL and acausal trade as serious possibilities. Nevertheless, the stakes strike us as great enough to warrant further study.

Acknowledgements & References


  1. Cf. the discussion of the Machine Intelligence Research Institute's foundational research and its applicability to machine-learning-driven systems (Taylor, 2016; Dewey, 2017). ↩︎

  2. For other proposals for foundational research motivated by a concern with improving the long-term future, see for instance the research agendas of the Global Priorities Institute (Greaves et al., 2019) (especially Sections 2.1 and 2.2 and Appendix B) and the Machine Intelligence Research Institute (Soares and Fallenstein, 2017; Garrabrant and Demski, 2018). ↩︎

  3. This subsection was developed from an early-stage draft by Caspar Oesterheld and Johannes Treutlein. ↩︎

  4. Consider, for instance, that most of us are uncertain about the value of a sufficiently distant digit of $\pi$, despite the fact that its value logically follows from what we know about mathematics. ↩︎

  5. This problem has been addressed in two ways. The first is simply to posit that the agent reasons over an extremely rich class of hypotheses, perhaps one rich enough to capture all of the important possibilities. An example of such a theory is Solomonoff induction (Solomonoff, 1964; Sterkenburg, 2013), in which evidence takes the form of a data stream received via the agent’s sensors, and the hypotheses correspond to all possible "lower semi-computable" generators of such data streams. But Solomonoff induction is incomputable and its computable approximations are still intractable. The other approach is to allow agents to have incomplete sets of hypotheses, and introduce an additional rule by which hypotheses may be added to the hypothesis space (Wenmackers and Romeijn, 2016). This sort of strategy seems to be the way forward for an adequate theory of bounded rationality in the spirit of Bayesianism. However, to our knowledge, there is no decision theory which accounts for possible amendments to the agent’s hypothesis space. ↩︎

  6. See Section 4.1 for discussion of games of incomplete information and possible limitations of Bayesian games. ↩︎

  7. This subsection was developed from an early-stage draft by Daniel Kokotajlo and Johannes Treutlein. ↩︎

  8. In Newcomb’s problem, a player is faced with two boxes: a clear box which contains $1000, and an opaque box which contains either $0 or $1 million. They are given a choice between choosing both boxes (Two-Boxing) or choosing only the opaque box (One-Boxing). They are told that, before they were presented with this choice, a highly reliable predictor placed $1 million in the opaque box if they predicted that the player would One-Box, and put $0 in the opaque box if they predicted that the player would Two-Box. There are two standard lines of argument about what the player should do. The first is a causal dominance argument which says that, because the player cannot cause money to be placed in the opaque box, they will always get at least as much money by taking both boxes as by taking one. The second is a conditional expectation argument which says that (because the predictor is highly reliable) One-Boxing provides strong evidence that there is $1 million in the opaque box, and therefore the player should One-Box on the grounds that the conditional expected payoff given One-Boxing is higher than that given Two-Boxing. These are examples of causal and evidential decision-theoretic reasoning, respectively. ↩︎

  9. Note that the little public discussion of FDT by academic philosophers has been largely critical (Schwarz, 2018; MacAskill, 2019). ↩︎

  10. Oesterheld (2017b), who introduced the idea, calls this "multiverse-wide superrationality", following Hofstadter (1983)’s use of "superrational" to describe agents who coordinate acausally. ↩︎

Comments

Relevant to this agenda are the failure modes I discussed in my multi-agent failures paper, which seems worth looking at in this context.

For the evidential game, it doesn't just matter whether you co-operate or not, but why. Different why's will be more or less likely to be adopted by the other agents.

> Agents may tend to make the decisions on some reference class of decision problems. (That is, for some probability distribution on decision contexts C, P(Agent 1’s decision in context C = Agent 2’s decision in context C) is high.)

Should this say "make the same decisions" (i.e., is the word "same" missing)? (Asking partly in case I'm misunderstanding what possibility is being described there.)

Should be "same", fixed, thanks :)