This is a distillation of this post by John Wentworth.

Introduction

Suppose you're playing a poker game. You're an excellent poker player (though you've never studied probability), and your goal is to maximize your winnings.

Your opponent is about to raise, call, or fold, and you start thinking ahead.

  • If your opponent raises, he either has a strong hand or is bluffing. In this situation, your poker intuition tells you he is probably bluffing, and you should call in response.
  • If your opponent calls, he probably has a better hand than yours.
  • If your opponent folds, you win the hand without need for further action.

Let's break down your thinking in the case where your opponent raises. Your thought process is something like this:

  1. If he raises, you want to take the action that maximizes your expected winnings.
  2. You want to make the decision that's best in the worlds where he would raise. You don't care about the worlds where he wouldn't raise, because you're currently assuming that he raises.
  3. Your poker intuition tells you that the worlds where he would raise are mostly the ones where he is bluffing. In these worlds your winnings are maximized by calling. So you decide the optimal policy if he raises is to call.

Step 2 is the important one here. Let's unpack it further.

  1. You don't know your opponent's actual hand or what he will do. But you're currently thinking about what to do if he raises.
  2. The optimal decision here depends only on worlds where he would raise.
  3. You decide how much you care about winning in different worlds precisely by thinking "how likely is this world, given that he raises?".

This sounds suspiciously like you're maximizing the Bayesian conditional expectation of your winnings: the expected value given some partial information about the world. This can be precisely defined as $\mathbb{E}[u \mid \text{raise}] = \sum_X P(X \mid \text{raise})\, u(X, A)$, where $u$ is your winnings, $A$ is your action, and $P(X)$ is the probability of world $X$. But you don't know any probability theory, so you don't know how to assign probability to worlds, much less what conditioning and expectation are! How could you possibly be maximizing a "conditional expectation"?
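
To make the formula concrete, here is a minimal sketch in Python. All of the worlds, prior probabilities, and payoffs below are invented for illustration; the point is only the shape of the computation.

```python
# A minimal sketch of the conditional expectation above. The worlds,
# prior probabilities, and payoffs are invented for illustration.

# Each world X: (description, prior P(X), does the opponent raise here?)
worlds = [
    ("strong hand", 0.15, True),
    ("bluffing",    0.35, True),
    ("weak hand",   0.50, False),  # he calls or folds instead
]

# u(X, A): your winnings for each response A in each world X.
winnings = {
    ("strong hand", "call"): -10, ("strong hand", "fold"): -2,
    ("bluffing",    "call"): +12, ("bluffing",    "fold"): -2,
    ("weak hand",   "call"):  +4, ("weak hand",   "fold"): -2,
}

def conditional_ev(action):
    """E[u | raise] = sum over raise-worlds of P(X | raise) * u(X, action)."""
    p_raise = sum(p for _, p, raises in worlds if raises)
    return sum((p / p_raise) * winnings[(name, action)]
               for name, p, raises in worlds if raises)

# The raise-worlds are mostly bluffs, so calling maximizes E[u | raise].
print(max(["call", "fold"], key=conditional_ev))  # -> call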

Luckily, your opponent folds and you win the hand. You resolve to (a) study coherence theorems and probability so you know the Law behind optimal poker strategy, and (b) figure out why you have a voice in your head telling you about "conditional expectations" and reading equations at you.

It turns out your behavior at the poker table can be derived from one particular property of your poker strategy: you never make a decision that is worse than another possible decision in all possible worlds. (An economist would say you're being Pareto-efficient about maximizing your winnings in different possible worlds).
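
Here is an equally minimal sketch of that property, again with an invented payoff table: an action is ruled out only if some other action beats it in every possible world.

```python
# Sketch of the Pareto-efficiency property: never take an action that some
# other action strictly beats in every possible world. Payoffs are invented.

payoffs = {              # action -> winnings in (world 1, world 2, world 3)
    "call":  (5, -10, 2),
    "fold":  (-2, -2, -2),
    "raise": (4, -11, 1),   # strictly worse than "call" in every world
}

def dominated(action, table):
    """True if some other action is strictly better in all worlds."""
    return any(all(x > y for x, y in zip(table[other], table[action]))
               for other in table if other != action)

print([a for a in payoffs if not dominated(a, payoffs)])  # ['call', 'fold']
```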

Summary

An agent which has some goal, has uncertainty over which world it's in, and is Pareto-efficient in the amount of goal achieved in different possible worlds, can be modeled as using conditional probability. We show this result in two steps:

  • A Pareto-efficient agent can be said to behave like an expected utility maximizer (EUM) in a weak sense.
  • If the agent is an EUM in this sense and makes decisions based on limited information, it can be modeled as using conditional expected value.

There's also a third, more speculative step:

  • If the agent makes many distributed decisions based on different pieces of limited information, it's more efficient / simpler for the agent to "think about" different underlying worlds rather than just the received information, so it is behaving as if it applies conditional expected value within a world-model.

This result is essentially a very weak selection theorem.

Pareto efficiency over possible worlds implies EUM

Suppose that an agent is in some world $X$ and has uncertainty over which world it's in. The agent has a goal $u$ and is Pareto-efficient with respect to maximizing the amount of goal achieved in each world. A well-known result in economics says that Pareto efficiency implies the existence of some function $P$ such that the agent chooses its actions $A$ to maximize the weighted sum $\sum_X P(X)\, u(X, A)$. (Without loss of generality, we can let $P$ sum to 1.) If we interpret $P(X)$ as the probability of world $X$, the agent maximizes $\mathbb{E}[u]$, i.e. expected utility.
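
A sketch of what this weighted-sum behavior looks like in code. The result only guarantees that *some* weights $P$ exist; the particular weights and utilities below are assumptions for illustration.

```python
# Sketch of the weighted-sum characterization: the agent acts as if it
# maximizes sum_X P(X) * u(X, A). The weights and utilities are invented;
# the theorem only guarantees that *some* such weights exist.

raw_weights = {"world1": 2.0, "world2": 1.0, "world3": 1.0}
total = sum(raw_weights.values())
P = {x: w / total for x, w in raw_weights.items()}  # WLOG, sums to 1

u = {  # u(X, A): amount of goal achieved by action A in world X
    ("world1", "a1"): 4, ("world2", "a1"): 0, ("world3", "a1"): 1,
    ("world1", "a2"): 1, ("world2", "a2"): 4, ("world3", "a2"): 1,
}

def expected_utility(A):
    return sum(P[X] * u[(X, A)] for X in P)

print(max(["a1", "a2"], key=expected_utility))  # -> a1
```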

Note that we have not determined anything about $P$ other than that it sums to 1. Some properties we don't know or derive in this setup:

  • The agent has an explicit representation of $P$
  • $P$ satisfies other probability laws
  • The agent performs Bayesian updates on $P$.[1]
  • $P$ can be related to a frequentist notion of probability, as in the setup for VNM

The following example assumes that we have an expected utility maximizer in the sense of being Pareto efficient over multiple worlds, and shows that it behaves as if it uses conditional probabilities.

EUM implies conditional expected value

Another example, but we actually walk through the math this time.

You live in Berkeley, CA, like Korean food, and have utility function u = "subjective quality of food you eat". Suppose you are deciding where to eat based only on names and Yelp reviews of restaurants. You are uncertain about X, a random variable representing the quality of all restaurants under your preferences, and Yelp reviews give you partial information about this. Your decision-making is some function A(f(X)) of the information f(X) in the Yelp reviews, and you choose A to maximize your expected utility across worlds: maybe the optimal A is to compare the average star ratings, give Korean restaurants a 0.2-star bonus, and pick the restaurant with the best adjusted average rating.

Here, we assume you behave like an "expected utility maximizer" in the weak sense above. I claim we can model you as maximizing conditional expected value.

Suppose you're constructing a lookup table for the best action A given each possible observation of reviews. Your lookup table looks something like:

  f(X)                                             A(f(X))
  {("Mad Seoul", 4.5), ("Sushinista", 4.8)}        eat at Sushinista
  {("Kimchi Garden", 4.3), ("Great China", 4.4)}   eat at Kimchi Garden

You always calculate the action $A$ that maximizes $\mathbb{E}[u] = \sum_X P(X)\, u(X, A(f(X)))$.

Suppose that in a given row we have $f(X) = o$, where $o$ is some observation. Then we are choosing that row's entry $A$ to maximize $\sum_X P(X)\, u(X, A(f(X)))$, holding the other rows fixed. We can make a series of simplifications:

  • Since we are choosing only this row's entry $A$, we can equivalently maximize just the part of the above sum which is not constant in $A$. The constant terms are those for which $f(X) \neq o$, i.e. where reality would not produce the observation $o$. This is clear if you think about it: the decision about where to eat if you see the ratings {("Mad Seoul", 4.5), ("Sushinista", 4.8)} should not depend on any world where you wouldn't see those ratings! So we can write: $\operatorname{argmax}_A \sum_{X:\, f(X)=o} P(X)\, u(X, A)$
  • $= \operatorname{argmax}_A \sum_{X:\, f(X)=o} P(f(X)=o)\, P(X \mid f(X)=o)\, u(X, A)$ (expanding $P(X)$ by the definition of conditional probability)
  • $= \operatorname{argmax}_A \sum_{X:\, f(X)=o} P(X \mid f(X)=o)\, u(X, A) = \operatorname{argmax}_A \mathbb{E}[u(X, A) \mid f(X)=o]$, since the factor $P(f(X)=o)$ doesn't depend on $A$.

Thus, we can model you as using conditional expected value.
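
The whole derivation fits in a few lines of code. Below is a sketch that builds the lookup table this way; the worlds, observation function, and utilities are all invented for illustration.

```python
import itertools

# Sketch of the lookup-table construction above. Worlds, the observation
# function f, and utilities are all invented for illustration.

restaurants = ["Mad Seoul", "Sushinista"]
# A world X assigns each restaurant a true quality in {1, 2, 3}.
worlds = list(itertools.product([1, 2, 3], repeat=len(restaurants)))
P = {X: 1 / len(worlds) for X in worlds}  # uniform prior

def f(X):
    """Observation: a coarse Yelp-style rating for each restaurant."""
    return tuple("high" if quality >= 2 else "low" for quality in X)

def u(X, A):
    """Utility of eating at restaurant A: its true quality in world X."""
    return X[restaurants.index(A)]

# For each observation o, argmax_A sum_{X: f(X)=o} P(X) u(X, A); dividing
# by P(f(X)=o) doesn't change the argmax, so this is exactly the
# conditional-EV maximizer E[u(X, A) | f(X) = o].
lookup = {}
for o in {f(X) for X in worlds}:
    consistent = [X for X in worlds if f(X) == o]
    p_o = sum(P[X] for X in consistent)
    lookup[o] = max(restaurants,
                    key=lambda A: sum(P[X] / p_o * u(X, A) for X in consistent))

for o in sorted(lookup):
    print(o, "->", lookup[o])
```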

Multiple decisions might imply conditional EV is meaningful

This section is a distillation of, and expansion upon, this comment thread.

Suppose now that you're making multiple decisions $A_1, \dots, A_n$ in a distributed fashion to maximize the same utility function, where there is no information flow between the decisions. For example, 10 copies of you (with the same preferences and same choice of restaurants) are dropped into Berkeley, but they all have slightly different observation processes $f_i$: Google Maps reviews, Grubhub reviews, personal anecdotes, etc.

Now, when constructing a lookup table for $A_i$, each copy of you will still condition each row's output on its input. When making decision $A_i$ from input $f_i(X)$, you don't have the other information $f_j(X)$ for $j \neq i$, so you consider each decision separately, still maximizing $\mathbb{E}[u(X, A_i) \mid f_i(X) = o_i]$. Here, the information $f_i(X)$ does not depend on other decisions, but this is not necessary for the core point.[2]
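
As a sketch, the distributed version just runs the same lookup-table construction once per observation channel. The three channels below are invented stand-ins for Yelp, Grubhub, and anecdotes.

```python
import itertools

# Sketch of the distributed setup: each copy conditions only on its own
# observation channel f_i, maximizing E[u(X, A_i) | f_i(X) = o_i].
# Worlds and channels are invented for illustration.

restaurants = ["Kimchi Garden", "Great China"]
worlds = list(itertools.product([1, 2, 3], repeat=len(restaurants)))
P = {X: 1 / len(worlds) for X in worlds}

channels = {  # f_i: different coarse views of the same underlying world
    "yelp":     lambda X: tuple(q >= 2 for q in X),
    "grubhub":  lambda X: tuple(q >= 3 for q in X),
    "anecdote": lambda X: X[0] >= X[1],  # "the first place seemed better"
}

def best_action(f, o):
    consistent = [X for X in worlds if f(X) == o]
    p_o = sum(P[X] for X in consistent)
    return max(restaurants, key=lambda A: sum(
        P[X] / p_o * X[restaurants.index(A)] for X in consistent))

# Each copy builds its own lookup table from its own channel alone.
tables = {name: {o: best_action(f, o) for o in {f(X) for X in worlds}}
          for name, f in channels.items()}
print(tables["anecdote"])  # e.g. {True: 'Kimchi Garden', False: 'Great China'}
```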

In the setup with one decision, we showed that a Pareto-efficient agent can be modeled as maximizing conditional EU over possible worlds $X$. But because one can construct a utility function over observations, $u(f(X))$, consistent with any agent's behavior, the agent can also be modeled as maximizing conditional EU over possible observations $f(X)$. In the single-decision case, there is no compelling reason to model the agent as caring about worlds rather than observations, especially because storing and processing observations should be simpler than storing and processing distributions over worlds.

When the agent makes multiple decisions based on different observations $f_i(X)$, there are two possible "trivial" ways to model it: either as maximizing a single utility function over the joint observations, $u(f_1(X), \dots, f_n(X))$, or as maximizing a separate utility function $u_i(f_i(X))$ for each decision. However, with sufficiently many decisions, neither of these trivial representations is as "nice" as conditional EU over possible worlds:

  • With many observations, the tuple $(f_1(X), \dots, f_n(X))$ could have more bits than $X$. Therefore, the utility function over worlds $u(X)$ can be considered a simpler, more compressed representation than the utility function over observations $u(f_1(X), \dots, f_n(X))$.
  • In the single-decision setup, maximizing any utility function $u'(f(X))$ can be explained as maximizing $\mathbb{E}[u(X) \mid f(X)]$ for some $u$: perhaps if you always pick restaurants with the lowest star rating, you just like low-quality food. But this is not true in the multi-decision case: with enough decisions, not every tuple of utility functions $(u_1, \dots, u_n)$ corresponds to a utility function over worlds $u(X)$.

    Suppose when given Grubhub ratings, an agent picks the highest-rated restaurants, but when given Yelp ratings, it picks the lowest-rated restaurants. The agent is now being suspiciously inconsistent, though maybe it values eating at restaurants that have good delivery food but terrible service, or something. With enough inconsistent-looking decisions, there could actually be no property of the restaurants that it is maximizing, and so no utility function $u(X)$ that explains its behavior.[3] So in the multi-decision case, saying the agent is maximizing $\mathbb{E}[u(X) \mid f_i(X)]$ actually narrows down its behavior.
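
To make the counting argument in footnote 3 concrete, here is a sketch using numpy. With $n$-bit worlds and 1-bit observations, the map from a utility function $u(X)$ to all of its conditional expectations is linear; once there are more (observation, value) pairs than worlds, it cannot be onto, so some tuples of per-observation preferences come from no $u(X)$ at all. The particular observation functions below are arbitrary.

```python
import itertools
import numpy as np

# Sketch of the dimension-counting (pigeonhole) argument in footnote 3.
# Worlds have n bits; each observation f_i reveals one bit. The map
# u(X) -> (E[u | f_i(X) = o]) over all i, o is linear, so if there are
# more outputs than worlds, it cannot be onto: some tuples of
# per-observation values correspond to no utility function over worlds.

n_bits = 3
worlds = list(itertools.product([0, 1], repeat=n_bits))   # 8 worlds
P = np.full(len(worlds), 1 / len(worlds))                 # uniform prior

# Five arbitrary one-bit observation functions.
obs_fns = [lambda X, i=i: X[i] for i in range(n_bits)]
obs_fns += [lambda X: X[0] ^ X[1], lambda X: X[1] ^ X[2]]

rows = []
for f in obs_fns:
    for o in (0, 1):
        mask = np.array([1.0 if f(X) == o else 0.0 for X in worlds])
        p_o = (P * mask).sum()
        rows.append(P * mask / p_o)  # row r satisfies r @ u = E[u | f(X)=o]

M = np.vstack(rows)  # 10 conditional expectations of only 8 world-utilities
print(M.shape, "rank:", np.linalg.matrix_rank(M))  # rank <= 8 < 10 rows
```
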
  1. ^

    John made the following comment: 

    We are showing that the agent performs Bayesian updates, in some sense. That's basically what conditioning is. It's just not necessarily performing a series of updates over time, with each retaining the information from the previous, the way we usually imagine.

  2. ^

    When $f_i$ depends on past decisions, the agent just maximizes $\mathbb{E}[u \mid f_i(X, A_1, \dots, A_{i-1}) = o_i]$. To see the math for the multi-decision case, read the original post by John Wentworth.

  3. ^

    If the world has $n$ bits of state, and the observations reveal $k$ bits of information each, the pigeonhole principle says this surely happens when there are more than $2^{n-k}$ observations. Our universe has about $10^{120}$ bits of state, so this won't happen unless our agent can operate coherently across ~$2^{10^{120}}$ different decisions; this number can maybe be reduced if we suppose that our agent can only actually observe some much smaller number of bits of state.

Comments

Thoughts on the process of writing this post:

  • It took a lot of effort to write, something like 3 days of my time. Distillation is hard.
  • Most of this effort was not in understanding the original post (it took me 2-3 hours to understand the math).
  • I sent drafts to johnwentworth several times and had several conversations with him to refine this piece. This probably spent ~2 hours of his time.
  • I'm not satisfied with the final result. It seems like the point the original post made was fairly obvious and I used way too many words to explain it properly. Maybe John thought the interpretation of the math was fairly deep and I thought it wasn't very deep?
  • I think that since John is a good and prolific writer already compared to most alignment researchers, there is higher value in distilling ideas of other researchers. It's hard to produce a lot of value from content already on LW.
    • Paul Christiano blogposts are somewhat famously opaque; distillations of these have worked in the past and still seem pretty valuable. The highest-relevance academic papers might be better targets. But many of the highest-value distillations probably involve talking to researchers to get things they're too busy to write down at all.