FDT Does Not Endorse Itself in Asymmetric Games

by jackmastermind
15th Jun 2025
3 comments, sorted by top scoring
Menotim:

In the FDT paper there is this footnote:

  1. In the authors’ preferred formalization of FDT, agents actually iterate over policies (mappings from observations to actions) rather than actions. This makes a difference in certain multi-agent dilemmas, but will not make a difference in this paper.

And it does seem that using FDT, but as a function that returns a policy rather than an action, solves this problem. So this is not an intrinsic problem with FDT that UDT doesn't have, it's a problem that arises in simpler versions of both theories and can be solved in both with the same modification.

jackmastermind:

I see. I suppose you'd do this by creating a policy node that is subjunctively upstream of every individual FDT decision, and intervening on that. The possible values would be every combination of FDT decisions, and you'd calculate updateless expected value over them.

This seems to work, though I'll think on it some more. I'm a little disappointed that this isn't the formulation of FDT in the paper, since that feels like a pretty critical distinction. But in any case, I should have read more carefully, so that's on me. Thank you for bringing that up! Your comment is now linked in the introduction :)

quetzal_rainbow:

Thanks, I finally understood the problem with UDT 1.0.

A twin guard-inmate dilemma (twin GID) is an asymmetric game that breaks FDT. [Image: GPT Image-1]

0. Introduction

TL;DR: FDT and UDT diverge in how they handle "behave as you would have ideally precommitted to behaving" in asymmetric games where a player is assigned a role after a deterministic clone is made. FDT updates, whereas UDT does not. ∴ an agent who knows in advance that they will enter one of these games would convert to UDT, not FDT, on this problem. [UPDATE: this applies to the formulation of FDT in the paper, but not necessarily to Yudkowsky & Soares' "preferred" version of FDT; see Menotim's comment]

I wrote a version of this post on my substack; it was for a less technical audience, and at the time I didn't understand updateless decision theory. I assumed that UDT and FDT just used different methods to compute the same recommendations. I was wrong! In fact, there are very simple scenarios in which FDT does not recommend precommitting to itself. 

1. Definitions

According to Yudkowsky & Soares' "Functional Decision Theory: A New Theory of Instrumental Rationality," FDT, CDT, and EDT all maximize expected utility as defined by this formula:

$$EU(a) := \sum_{j=1}^{N} P(a \hookrightarrow o_j; x) \cdot U(o_j)$$

where $o_1, o_2, o_3, \ldots$ are the possible outcomes from some countable set $O$; $a$ is an action from some finite set $A$; $x$ is an observation history from some countable set $X$; $P(a \hookrightarrow o_j; x)$ is the probability that $o_j$ will obtain in the hypothetical scenario where the action $a$ is executed after receiving observations $x$; and $U$ is a real-valued utility function bounded in such a way that [the above equation] is always finite.

 …

From this perspective, the three decision theories differ only in two ways: how they prescribe representing [the world-model] $M$, and how they prescribe constructing hypotheticals $M_{a\hookrightarrow}$ from $M$. (emphasis mine)

From here, the three decision theories are formalized by:

$$\mathrm{EDT}(P, x) := \arg\max_{a \in A} \, E(V \mid \mathrm{Obs} = x, \mathrm{Act} = a)$$

$$\mathrm{CDT}(P, G, x) := \arg\max_{a \in A} \, E(V \mid \mathrm{do}(\mathrm{Act} = a), \mathrm{Obs} = x)$$

$$\mathrm{FDT}(P, G, x) := \arg\max_{a \in A} \, E(V \mid \mathrm{do}(\mathrm{FDT}(\underline{P}, \underline{G}, \underline{x}) = a))$$

where $V$ is a variable representing $U(\mathrm{Outcome})$, $G$ is a Pearl-style digraph (of causal relations for CDT, subjunctive relations for FDT), and $\mathrm{FDT}(\underline{P}, \underline{G}, \underline{x})$ is notation for a variable representing the output of FDT given $P$, $G$, and $x$.
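To make the shared skeleton concrete, here is a minimal Python sketch (my own framing, not code from the paper): every theory runs the same argmax over actions, and only the `make_hypothetical` step, which stands in for the theory-specific construction of $M_{a\hookrightarrow}$, differs.

```python
# Minimal sketch of the shared EU-maximization skeleton; the function names
# and toy payoffs are illustrative, not taken from the paper.

def expected_utility(hypothetical, utility):
    """EU(a) = sum_j P(a -> o_j; x) * U(o_j), with the hypothetical given as
    a list of (probability, outcome) pairs."""
    return sum(p * utility(o) for p, o in hypothetical)

def decide(actions, make_hypothetical, utility):
    """Generic argmax over actions. The theories differ only in how
    make_hypothetical builds the distribution: conditioning on Act = a (EDT),
    do(Act = a) (CDT), or do(FDT(P, G, x) = a) (FDT)."""
    return max(actions, key=lambda a: expected_utility(make_hypothetical(a), utility))

# Trivial usage: a one-shot choice whose action deterministically fixes the outcome.
payoff = {"coop": 3, "defect": 5}
print(decide(["coop", "defect"],
             make_hypothetical=lambda a: [(1.0, a)],
             utility=payoff.get))  # -> defect
```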

Given the equation for FDT and the equation for expected utility maximization, it's a little unclear whether FDT is totally updateless here. In the FDT equation, V is not conditioned on x, but in the EU equation, x is an input to P in all three theories. If FDT is updateful, its problems get much worse (as described in my original article), so I'll assume FDT is updateless in how it assesses outcomes and show that there is still a problem.

In any case, notice that an FDT agent constructs hypotheticals by considering interventions on FDT's recommendation for agents with its precise priors, digraph, and observation history, not by considering interventions on FDT's recommendation for agents in the same scenario with a different observation history.

This is the first clue that what an agent would ideally precommit to could diverge from FDT's recommendations. When precommitting, you can choose a single policy which stipulates a full strategy profile for you and any clones of you who might have different observation histories in the scenario. But FDT only considers what agents with your observation history should do.

2. The Twin Guard-Inmate Dilemma

Let a "guard-inmate dilemma" (GID) be a prisoners' dilemma, with one twist: one player is randomly assigned the role of "guard", the other the role of "inmate". The guard has slightly different payoffs that are overall more favorable but do not change the Nash equilibrium of the problem. Here is the payoff matrix I use, where the guard gets a consistent +1 relative to the inmate:

|  | Guard: Coop | Guard: Defect |
|---|---|---|
| Inmate: Coop | 3, 4 | -5, 6 |
| Inmate: Defect | 5, -4 | -1, 0 |
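
As a quick sanity check on the equilibrium claim, here is a short Python snippet (my own encoding of the table, with each cell written as (inmate payoff, guard payoff)) verifying that the guard's extra +1 leaves defection strictly dominant for both roles:

```python
# Hypothetical encoding of the GID payoff matrix above; keys are
# (inmate_move, guard_move), values are (inmate_payoff, guard_payoff).
payoffs = {
    ("C", "C"): (3, 4),
    ("C", "D"): (-5, 6),
    ("D", "C"): (5, -4),
    ("D", "D"): (-1, 0),
}

# "D" strictly beats "C" for the inmate whatever the guard does, and for the
# guard whatever the inmate does, so (D, D) is still the unique Nash equilibrium.
inmate_prefers_D = all(payoffs[("D", g)][0] > payoffs[("C", g)][0] for g in "CD")
guard_prefers_D = all(payoffs[(i, "D")][1] > payoffs[(i, "C")][1] for i in "CD")
print(inmate_prefers_D, guard_prefers_D)  # True True
```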

Here is the setup for a twin GID: a deterministic agent is cloned and made to play a GID against its twin. Each is told their own role in the dilemma but cannot communicate with the other. So the two agents now have different observation histories, since the twin GID is asymmetric: one learns that they are a guard, the other that they are an inmate. Yudkowsky & Soares describe how they model behaving in response to different observations:

When CDT's observation history updates from $x$ to $y$, CDT changes from conditioning its model on $\mathrm{Obs}=x$ to conditioning its model on $\mathrm{Obs}=y$, whereas FDT changes from intervening on the variable $\mathrm{FDT}(\underline{P}, \underline{G}, \underline{x})$ to $\mathrm{FDT}(\underline{P}, \underline{G}, \underline{y})$ instead.

Therefore, here is the digraph for the problem:

[Image: subjunctive digraph for the twin GID]

where $g$ is the observation history corresponding to learning you're a guard, and $i$ is the observation history corresponding to learning you're an inmate.

Immediately, there's a problem: neither the guard's nor the inmate's FDT-recommendation descends from the other! Suppose you, as an FDT agent, find yourself as the guard. According to Yudkowsky & Soares' equations, you choose your action by considering the (updateless) expected utility of each possible intervention on $\mathrm{FDT}(\underline{P}, \underline{G}, \underline{g})$. But neither $\mathrm{FDT}(\underline{P}, \underline{G}, \underline{i})$ nor the inmate's action is subjunctively altered by your intervention. Thus, FDT treats the guard and inmate actions as independent and therefore recommends defection. The inmate will reason similarly and also defect. Thus, following FDT leads to mutual defection.

However, agents with the policy "always cooperate in twin GIDs, no matter your role" will achieve mutual cooperation, outperforming FDT agents. Therefore, if a winning agent knew in advance that they were going to face a twin GID, they would not want to act like an FDT agent. So either FDT does not endorse the winning strategy, or it does not endorse itself. In Yudkowsky & Soares' words,

A decision theory that (like CDT) advises agents to change their decision-making methodology as soon as possible can be lauded for its ability to recognize its own flaws, but is not a strong candidate for the normative theory of rational choice.
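
To spell the argument out numerically, here is a small sketch (my own construction, not from the paper). Because intervening on your own FDT node leaves the other role's node untouched, each role just best-responds to a fixed opposing action, and dominance pins the outcome at mutual defection; the precommitted always-cooperate policy does strictly better for both players.

```python
# Payoffs as (inmate, guard), matching the table in section 2.
payoffs = {("C", "C"): (3, 4), ("C", "D"): (-5, 6),
           ("D", "C"): (5, -4), ("D", "D"): (-1, 0)}

def br_inmate(guard_move):
    """Inmate's best response when the guard's action is held fixed."""
    return max("CD", key=lambda m: payoffs[(m, guard_move)][0])

def br_guard(inmate_move):
    """Guard's best response when the inmate's action is held fixed."""
    return max("CD", key=lambda m: payoffs[(inmate_move, m)][1])

# The only action pair where each side is best-responding to the other -- the
# outcome the per-observation FDT equations settle on -- is mutual defection.
fdt_outcome = next((i, g) for i in "CD" for g in "CD"
                   if br_inmate(g) == i and br_guard(i) == g)
print(fdt_outcome, payoffs[fdt_outcome])  # ('D', 'D') (-1, 0)
print(("C", "C"), payoffs[("C", "C")])    # the always-cooperate policy gets (3, 4)
```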

3. Implications

One objection is that my digraph for the twin GID might neglect some subjunctive relationship between $\mathrm{FDT}(\underline{P}, \underline{G}, \underline{g})$ and $\mathrm{FDT}(\underline{P}, \underline{G}, \underline{i})$. However, it is fully logically consistent to have a function which recommends different actions based on whether you are a guard or an inmate, so it cannot be the case that either one is subjunctively downstream of the other. It could be the case that $\mathrm{FDT}(\underline{P}, \underline{G}, \underline{g})$ and $\mathrm{FDT}(\underline{P}, \underline{G}, \underline{i})$ are both subjunctively downstream of some other computation $C$:

[Image: digraph with both FDT nodes downstream of a common computation $C$]

The problem is that FDT does not intervene on $C$. Instead, it is stipulated to intervene directly on $\mathrm{FDT}(\underline{P}, \underline{G}, \underline{x})$, and Pearl-style intervention breaks all incoming arrows to the node; this is the difference between the do-operator and Bayesian conditioning.

So this doesn't solve the problem. However, it does show how the problem might be solved: by having a decision theory that intervenes on upstream policies themselves, rather than on the outputs of policies. This is what UDT does! The problem I've described is very similar to the problem Wei Dai found in an earlier version of UDT; he suggested that timeless decision theory might share the same bug. This prompted the switch from this action-based equation (where $o \in O$ is your observation history):

$$\mathrm{choice}(o) := \arg\max_{a \in A} E_P(U \mid \mathrm{choice}(o) = a)$$

to a policy-based equation (where $\pi : O \to A$):

$$\mathrm{choice}(o) := \pi^*(o)$$

$$\pi^* := \arg\max_{\pi \in \Pi} E_P(U \mid \pi^* = \pi)$$

Correspondingly, UDT does not appear to have the same problem with asymmetry.
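
A toy illustration of that switch on the twin GID (my own construction, assuming a 50/50 prior over which role you are assigned): the argmax over whole policies $\pi : \{g, i\} \to \{C, D\}$, scored before updating on your role, selects unconditional cooperation.

```python
from itertools import product

# Payoffs as (inmate, guard), matching the table in section 2.
payoffs = {("C", "C"): (3, 4), ("C", "D"): (-5, 6),
           ("D", "C"): (5, -4), ("D", "D"): (-1, 0)}

def prior_eu(policy):
    """Updateless score of a policy: both clones run it, and (by assumption of
    this sketch) you end up as guard or inmate with probability 1/2 each."""
    inmate_u, guard_u = payoffs[(policy["inmate"], policy["guard"])]
    return 0.5 * inmate_u + 0.5 * guard_u

# All four policies mapping the observations {guard, inmate} to moves {C, D}.
policies = [dict(zip(("guard", "inmate"), moves)) for moves in product("CD", repeat=2)]
best = max(policies, key=prior_eu)
print(best, prior_eu(best))  # {'guard': 'C', 'inmate': 'C'} 3.5
```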

Therefore, for the moment I believe that UDT, or a UDT-like theory, should be the optimal decision theory for a self-modifying AI, not FDT.