# Ω 26

Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

# Introduction and motivation

The starting point for this post is a comment I wrote on Paul's post on EDT vs CDT:

One argument for CDT over EDT that you didn’t mention in this post: Suppose you live in a deterministic universe and know your own source code. Suppose you are deciding between taking a 5 dollar bill and a 10 dollar bill. Suppose your world model says you take the 5 dollar bill with 100% probability. Now conditioning on taking the 10 dollar bill gives you complete garbage, since you are conditioning on a probability 0 event. If you use EDT, then depending on other details of your decision procedure, this could lead to you always taking the 5 dollar bill. So then your world model would be accurate. (This is the “5 and 10” problem often discussed at MIRI; I don’t know if it has been written up anywhere)

CDT never generates undefined expected utility estimates like EDT does. It takes the 10 dollar bill in this problem. However, if it always takes the 10 dollar bill, then its counterfactual for taking the 5 dollar bill is strange because it is one in which a physical law is violated. The violation of a physical law could have important consequences other than which action the agent takes.

Both decision theories have trouble with this problem, but at least CDT always produces a defined answer.

Here’s another way of thinking about this problem. A fully Bayesian version of EDT must construct all possible worlds and then condition on taking a certain action. But each of these possible worlds contains a running copy of the EDT algorithm. So, absent some defined method for taking a fixed point, this leads to an infinite loop, and you can’t actually have a fully Bayesian version of EDT.

(What if you use reflective oracles to allow EDT to select some fixed point? We could specify that the reflective oracle returns arbitrary results when asked to condition on a probability 0 event (I think this is what the most natural way to emulate conditional queries on a reflective oracle results in, but I haven’t checked). Now there are multiple possible reflective oracles (i.e. fixed points); it’s possible to always take the 10 dollar bill and think bad things will happen conditional on taking the 5 dollar bill, and it’s also possible to always take the 5 dollar bill and think bad things will happen conditional on taking the 10 dollar bill.)

A fully Bayesian version of CDT must construct all possible counterfactuals. Each of these counterfactuals contains a running copy of CDT, so one might think the same problem applies. But in each of these counterfactuals, the output of the CDT algorithm is “thrown away”, since the agent’s action is controlled by a magic counterfactual intervention rather than its algorithm. So, if the CDT algorithm is sandboxed, the CDT’s world model can simply ignore the running CDT algorithm, as it has no effect. Thus, at least in single-agent problems (with no predictors etc), a fully Bayesian version of CDT is possible in principle, though obviously not in practice.

Some time during or after writing this comment, I noticed something: the equilibrium where the EDT agent thinks it always takes the 5 dollar bill, and therefore gets garbage (possibly low) estimates when considering taking the 10 dollar bill, and therefore never takes the 10 dollar bill, is extremely unstable. As soon as the agent assigns any probability at all to taking the 10 dollar bill, their conditional expected utility estimates are perfect. Can we use this fact to design a variant of EDT that always takes the 10 dollar bill?

Yes, yes we can.

# Definitions and main theorem statements

Reflective conditional oracle distributions will be defined similar to in previous work such as reflective oracles and reflective oracle distributions. I recommend understanding reflective oracles before reading this post (understanding reflective oracle distributions is helpful but unnecessary).

Let be some finite subset of the set of probabilistic Turing machines that may query a conditional oracle (defined in the next paragraph). They must each always halt and return either 0 or 1.

Definition 1: A conditional oracle is a function .

Intuitively, asks whether is greater or less than , where is taken from some distribution over conditional oracles.

We make the assumption that machines in only ever query the oracle with an element of . All further definitions and lemmas/theorems are parameterized on some satisfying this assumption.

Definition 2: A conditional oracle distribution (COD) is a distribution over conditional oracles, of type .

Definition 3: A COD is fully-mixed if it assigns nonzero probability to each possible conditional oracle.

Since the number of conditional oracles is finite, there are fully-mixed CODs.

Definition 4: A COD weakly leads to a COD iff, for all :

(Here, the notation refers to the probability of event when .)

Intuitively, weakly leads to iff accurately answers conditional queries that are about . If for some and also for some (possibly different) and is fully-mixed, these conditions are equivalent to:

As notation, let map a COD to the set of CODs it weakly leads to. This and related functions will be interpreted as set-valued when referring to their graphs.

Unfortunately, these conditional queries are not sensible when or ; the oracle is allowed to return any answer. We will define a stricter notion of "leading to" that will require these answers to be sensible.

Let be a set-valued function whose graph is the closure of (the graph of restricted to where is fully-mixed). It will turn out that for fully-mixed .

Equivalently, iff there are infinite COD sequences and such that (a) each is fully mixed, (b) each weakly leads to , (c) limits to , and (d) limits to .

Intuitively, is equivalent to for fully-mixed and generalizes to non-fully-mixed through taking limits that involve only fully-mixed values.

Define . It will turn out that for fully-mixed .

(Why take the convex hull? This is to make a Kakutani map. It might be the case that is always convex but I haven't proven this.)

Definition 5: A COD leads to a COD iff .

Due to Theorem 1, it will turn out that a fully-mixed COD leads to iff it strictly leads to . Leading to is a stronger condition than weak leading to, as will be proven.

At this point we are ready to define a reflection condition:

Definition 6: A COD is reflective iff it leads to itself.

Intuitively, a COD is reflective iff it accurately answers queries that are about itself.

This post's main results are the following (in addition to the proof that EDT beats 5 and 10):

Theorem 1: For any fully-mixed COD , .

Theorem 2: If leads to , then it weakly leads to .

Theorem 3: There is a reflective COD.

These are proven at the end of the post.

# Defining EDT

EDT can be defined using a reflective COD; this decision theory will be called COEDT. Let the decision problem be described by a Turing machine with an embedded agent, where this Turing machine including its agent may randomize and call a conditional oracle, and which returns either 0 or 1 to represent the agent's utility (intermediate utilities can be emulated by randomizing). For example, for the 5 and 10 problem, we may define:

where is the source code of the agent .

COEDT is a function from the universe program (which already contains an embedded COEDT agent) to action. The following COEDT variant handles cases where there are only 2 actions:

It uses the conditional oracle to determine if it has a higher chance of winning conditional on taking action 1 or action 0, and takes the action that it is more likely to win conditional on taking.

What does do on ? We can take a fixed point:

Let the set of machines considered be

Theorem 4: For any reflective COD on , .

Proof:

Informally, this is true because the agent must always take action 1 when queries are about any fully-mixed COD.

, so it is a convex combination of CODs in . Let that convex combination be .

Each , so we have COD sequences limiting to (with each being fully mixed) and limiting to such that each .

For each , since and both return 1 for some conditional oracles, and because weakly leads to , we have:

Obviously, the first conditional probability is 1 (since it is defined) and the second is 0. Therefore, each . This must then also be true for the limit of , i.e. . This must also be true for the convex combination, i.e. .

# Multiple actions

(This section can be skipped.)

The COEDT defined above only handles problems that have 2 actions. What if there are more than 2 actions? Then we can split the agent into multiple 2-action COEDT agents: the first chooses between taking the first action and passing control to the second agent, the second agent chooses between taking the second action and passing control to the third agent, and so on. For example, here is a 3-action construction of COEDT:

(Why does have a second argument? This is so the two different agents know which they are and can take different actions.)

One might think this has problems when the first agent expects the second agent to always take a bad action, therefore never defers control to the second agent, and therefore the second agent has no incentive to take a good action (this happens in Nash equilibria in sequential games). However, since we consider fully-mixed oracles in the construction of COEDT, this is not a problem (EDIT: it is sometimes, see comment). To demonstrate this, consider the following 5 and 10 and 15 problem:

where flips a coin and returns 1 with probability and 0 with probability .

Let the set of machines considered be

Theorem 5: For any reflective COD on , .

Proof:

, so it is a convex combination of CODs in . Let that convex combination be .

Each , so we have COD sequences limiting to (with each being fully mixed) and limiting to such that each .

By the same logic as in Theorem 4, each .

For each , since and both return 1 for some conditional oracles, and because weakly leads to , we have:

Obviously, the second conditional probability is 1/2. The first is 1 since . Therefore, each .

As in Theorem 4, these conditions must then be true for the limits and the convex combination .

# Conclusion and future research

I consider COEDT to be major progress in decision theory. Before COEDT, there were (as far as I know) 3 different ways to solve 5 and 10, all based on counterfactuals:

• Causal counterfactuals (as in CDT), where counterfactuals are worlds where physical magic happens to force the agent's action to be something specific.
• Model-theoretic counterfactuals (as in modal UDT), where counterfactuals are models in which false statements are true, e.g. where PA is inconsistent.
• Probabilistic conditionals (as in reinforcement learning and logical inductor based decision theories such as LIEDT/LICDT and asymptotic decision theory), where counterfactuals are possible worlds assigned a small but nonzero probability by the agent in which the agent takes a different action through "exploration"; note that ADT-style optimism is a type of exploration.

COEDT is a new way to solve 5 and 10. My best intuitive understanding is that, whereas ordinary EDT (using ordinary reflective oracles) seeks any equilibrium between beliefs and policy, COEDT specifically seeks a not-extremely-unstable equilibrium (though not necessarily one that is stable in the sense of dynamical systems), where the equilibrium is "justified" by the fact that there are arbitrarily close almost-equilibria. This is similar to trembling hand perfect equilibrium. To the extent that COEDT has counterfactuals, they are these worlds where the oracle distribution is not actually reflective but is very close to the actual oracle distribution, and in which the agent takes a suboptimal action with very small probability.

My sense is that the results in this post open up a wide new territory of open questions and further research. Here are some of them:

• What kind of optimality result(s) does COEDT have for single-player problems?
• Do infinite reflective CODs exist, as with reflective oracles?
• Is the set of reflective CODs convex (as the set of reflective oracle distributions is)?
• Can this approach be integrated with logical uncertainty (e.g. logical inductors)?
• What happens in games with more than one COEDT? What is the equilibrium concept?
• Are there optimality results for common-payoff games, or Pareto-optimality results for non-common-payoff games?
• Can COEDT be attacked with a "troll bridge" problem similar to the one for LIEDT/LICDT?

There is a lot of low-hanging fruit here, and I am posting this now before immediately picking the low-hanging fruit in the hope that discussion will be helpful.

Proofs of theorems 1-3 follow.

# Proving Theorem 1 and Theorem 2

First we will show that is a Kakutani map; this will also help with Theorem 2 (as Theorem 2 will be proven by applying Kakutani's fixed point theorem to ).

Lemma 1: For each , is nonempty and convex.

Proof:

Informally, this is true because for a fixed , the constraints on are convex and consistent.

Let . There are 3 cases:

• If , then the constraint corresponding to is .
• If , then the constraint corresponding to is .
• If , then there is no constraint corresponding to .

iff satisfies these constraints for all .

Clearly, each of these constraints is convex, since it picks out some hyperplane. Their intersection must then also be convex. So is convex.

To show that is nonempty, we need only consider that are fully factorized (i.e. the distinct random variables of the form are independent). For each , if there is no constraint then set . If there is a constraint, set to the value determined by this constraint. This yields a .

Lemma 2: The graph of is closed.

Proof:

Informally, this is true because each constraint on is closed.

Let . Consider the set of (not necessarily fully-mixed) satisfying the two constraints:

It is simple to see that the set of satisfying the first constraint is closed, because (a) the set of satisfying the antecedent of the implication is open, (b) the set of satisfying the consequent of the implication is closed, and (c) the union of two closed sets is closed. By similar logic, so is the set of satisfying the second constraint. If we take the intersection for all , the result is still closed, and this is the graph of .

Theorem 1: For any fully-mixed COD , .

Proof:

Let be fully-mixed. First we will show . By Lemma 2, 's graph equals its closure intersected with the set of for which is fully-mixed. Since 's graph is simply the closure of 's graph, they coincide on .

Now we will show . By Lemma 1, is convex, so it equals its convex hull .

Theorem 2: If a COD leads to , then it weakly leads to .

Proof:

Since 's graph is closed, and 's graph is the closure of a subset of 's graph, 's graph is a subset of 's graph.

Since is a convex superset of for all , it must also be a superset of the convex hull .

# Proving Theorem 3

Now it is time to show that is a Kakutani map.

Lemma 3: has a closed graph.

Proof:

Informally, this is true because a limit point of 's graph is the limit of a convergent sequence of convex combinations of points in 's graph (which is closed), which is itself equal to a convex combination of limits of sequences of points in 's graph, and 's graph is closed.

Trivially, has a closed graph, since its graph is the closure of a set.

Consider a limit point of 's graph. That means we have sequences limiting to and limiting to such that each .

Let . We consider CODs as elements of . Since each , it must by definition be a finite convex combination of CODs in . We can assume without loss of generality that this is a combination of exactly CODs, by Carathéodory's theorem.

We will now name these convex combinations. For each we have:

• A list , each in , and
• weights , each in and which sum to 1,
• such that each .

We may now consider as a vector in . The sequence of these vectors has a convergent subsequence by the Bolzano-Weierstrass theorem.

Define to be the limit of the convergent subsequence, and also call this full vector . Let the convergent subsequence be . Since 's graph is closed, we must have each . Therefore, their convex combination is in . We will now show .

Consider a function . Each , so clearly limits to . Additionally, is continuous, so, limits to . Therefore .

We have at this point demonstrated , since we already had .

Lemma 4: For each , is nonempty and convex.

Proof:

is convex since it is the convex hull of a set, so we need only show that it is nonempty. To do this we will show is nonempty; this is sufficient since .

Let be any fully-mixed COD. For , define . is a curve proceeding from to , and every COD in its image is fully-mixed.

Define for natural . Define to be an arbitrary COD satisfying ; the sequence can be defined using the axiom of choice. By the Bolzano-Weierstrass theorem, has a convergent subsequence; let be the limit of this subsequence. Clearly, , since by construction is a limit point of the graph of . This proves that is nonempty.

The proof of Theorem 3 is now trivial:

Theorem 3: There is a reflective COD.

Proof:

is a Kakutani map by Lemma 3 and Lemma 4. Therefore by Kakutani's fixed point theorem, it has a fixed point, i.e. a COD . By definition is reflective.

# Ω 26

New Comment

I realized there is a problem when you have a 3-action problem, where the first 2-action agent chooses between 0 utility and passing control to the second 2-action agent, and the second 2-action agent chooses between 1/2 and 1 expected utility.

The problem is that there's a stable equilibrium where the first agent passes off control and the second agent always chooses 1/2 expected utility. The second agent makes this choice because they think that, if they choose 1, then the first agent will choose 0. The probability that the first agent chooses 0 is 0 but, in some sequence of CODs leading up to the actual COD, the first agent is more likely to choose 0 if the second agent chooses 1.

Basically, irrational threats can be relevant to COEDT even though they happen with probability 0.

We could fix this with a direct construction of queries that can return more than 2 results (as in reflective oracle distributions), but in any case this is a serious problem for sequential decision problems and multi-player common-payoff games.

I'm having trouble following this step of the proof of Theorem 4: "Obviously, the first conditional probability is 1". Since the COD isn't necessarily reflective, couldn't the conditional be anything?

By definition , regardless of . (The subscript to only affects the distribution of )

EDIT: clarified notation in the post

So the oracle is a black box which is always right, but that's not enough - it's also a limit of black boxes that are slightly wrong. Good work! (Modulo my usual skepticism about parameterizing decision theory on some black box, which is my personal hangup and shouldn't stop you from exploring this direction.)

Suppose your world model says you take the 5 dollar bill with 100% probability. Now conditioning on taking the 10 dollar bill gives you complete garbage, since you are conditioning on a probability 0 event.

You can make the agent take the 10 dollar bill in that case, in effect blackmailing the world model: "if you want to stay sound, don't prove that action A has probability 0". I'd love to know if that covers all cases. In other words, if there's an N symbol proof of a "garbage" conditional, must there be an M<N symbol proof of the form "action A has probability 0"?

The agents in this post aren't proof-based. Proof-based issues have some issues with weird counterfactuals. Perhaps the only worlds where you take some specific action are ones where PA is inconsistent. (COEDT also has issues since the nearby oracles are not reflective, but it's a different set of issues)

In general queries to reflective-oracle-like world models that have forms like "is the probability of this exactly 0?" are problematic, since they are vulnerable to liar's paradoxes. What if you take action A iff the probability of taking action A is exactly 0? So to the extent that this works for proof-based agents, it's because they're not complete in some sense.

I am not sure that if there is a proof of a garbage conditional then there will be a similarly-short proof that the conditional is garbage. Say "proving a garbage conditional P(A|B)" means proving P(A^B)=cP(B) for some c such that in fact P(B)=0. I could imagine a case where it is easy to prove the event A equivalent to the event B, but hard to prove P(B)=0. Then you could prove P(A^B)=P(B) easily but not P(B)=0. (This case might not be problematic from a decision theory perspective, which is interesting, but it's still an invalid conditional)

Yeah, my comment was more about proof-based agents rather than oracle-based, sorry about going off topic. In a proof-based setting, conditionals with P(B)=0 aren't necessarily garbage, some of them are intended and we want the agent to find them. Your example might be one of those. The hard part is defining which is which.

It seems to me like the conditional oracle's definition could be made more elegant by taking only m and n as a parameter, both of which take an action as a parameter. The oracle would then implement .