This is a distillation of this post by John Wentworth.
Suppose you're playing a poker game. You're an excellent poker player (though you've never studied probability), and your goal is to maximize your winnings.
Your opponent is about to raise, call, or fold, and you start thinking ahead.
Let's break down your thinking in the case where your opponent raises. Your thought process is something like this:

1. Your opponent raises.
2. You consider the hands your opponent could be holding given that they raised, and how much you'd win or lose with each of your possible responses against each of those hands.
3. You pick the response that looks best overall.
Step 2 is the important one here. Let's unpack it further.
This sounds suspiciously like you're maximizing the Bayesian conditional expectation of your winnings: the expected value given some partial information about the world. This can be precisely defined as $\mathbb{E}[u(X, A) \mid \text{observation}] = \sum_X P(X \mid \text{observation})\, u(X, A)$, where $u$ is your winnings, $A$ is your action, and $P(X)$ is the probability of world $X$. But you don't know any probability, so you don't know how to assign probability to worlds, much less what conditioning and expectation are! How could you possibly be maximizing a "conditional expectation"?
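To make the formula concrete, here is a minimal sketch in Python with invented numbers: the worlds are possible opponent holdings, the probabilities are hypothetical values conditioned on the raise, and the winnings are made up.

```python
# Toy illustration of E[u | observation] with invented numbers.
# Worlds: possible opponent hands, with hypothetical probabilities
# *given that the opponent raised*.
worlds = {"strong hand": 0.5, "medium hand": 0.3, "bluff": 0.2}

# Hypothetical winnings u(world, action) in chips.
winnings = {
    ("strong hand", "call"): -40, ("strong hand", "fold"): 0,
    ("medium hand", "call"):  25, ("medium hand", "fold"): 0,
    ("bluff",       "call"):  60, ("bluff",       "fold"): 0,
}

def conditional_expected_winnings(action):
    """E[u | opponent raised] = sum_X P(X | raise) * u(X, action)."""
    return sum(p * winnings[(world, action)] for world, p in worlds.items())

best = max(["call", "fold"], key=conditional_expected_winnings)
print(best, {a: conditional_expected_winnings(a) for a in ["call", "fold"]})
```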
Luckily, your opponent folds and you win the hand. You resolve to (a) study coherence theorems and probability so you know the Law behind optimal poker strategy, and (b) figure out why you have a voice in your head telling you about "conditional expectations" and reading equations at you.
It turns out your behavior at the poker table can be derived from one particular property of your poker strategy: you never make a decision that is worse than some other available decision in all possible worlds. (An economist would say you're Pareto-efficient with respect to your winnings across different possible worlds.)
An agent which has some goal, has uncertainty over which world it's in, and is Pareto-efficient in the amount of goal achieved in different possible worlds, can be modeled as using conditional probability. We show this result in two steps:

1. Pareto efficiency over possible worlds implies that the agent maximizes a weighted sum of the goal achieved across worlds, i.e. expected utility under some distribution $P$ over worlds.
2. An expected utility maximizer (in this weak sense) that chooses its actions based on partial observations of the world can be modeled as maximizing conditional expected utility.

There's also a third, more speculative step:

3. When the agent makes many decisions based on many different observations of the same world, conditional probabilities over worlds become the most natural way to represent it, compared to utility functions defined directly over observations.
This result is essentially a very weak selection theorem.
Suppose that an agent is in some world $X$ and has uncertainty over which world it's in. The agent has a goal $u$ and is Pareto-efficient with respect to maximizing the amount of goal achieved in each world. A well-known result in economics says that Pareto efficiency implies the existence of some function $P(X)$ such that the agent chooses its actions $A$ to maximize the weighted sum $\sum_X P(X)\, u(X, A)$. (Without loss of generality, we can let $P$ sum to 1.) If we interpret $P(X)$ as the probability of world $X$, the agent maximizes $\mathbb{E}[u(X, A)] = \sum_X P(X)\, u(X, A)$, i.e. expected utility.
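Here is a toy numerical check of that claim (all payoffs, action names, and the weights $P$ are invented for illustration): a Pareto-efficient action is not dominated in any world, and some weight vector $P$ makes it the weighted-sum maximizer.

```python
# Toy check of "Pareto efficiency => maximizes some weighted sum", with
# invented payoffs u(world, action) across two possible worlds.
payoffs = {
    "a": {"world1": 10, "world2": 0},
    "b": {"world1": 7,  "world2": 7},
    "c": {"world1": 0,  "world2": 10},
}

def dominated(action):
    """True if some other action does at least as well in every world and
    strictly better in at least one."""
    u = payoffs[action]
    return any(
        all(v[w] >= u[w] for w in u) and any(v[w] > u[w] for w in u)
        for other, v in payoffs.items() if other != action
    )

# "b" is Pareto-efficient: nothing dominates it ...
assert not dominated("b")

# ... and there exist weights P (here P = (0.5, 0.5), found by inspection)
# under which "b" maximizes the weighted sum  sum_X P(X) * u(X, A).
P = {"world1": 0.5, "world2": 0.5}
weighted = {a: sum(P[w] * v[w] for w in P) for a, v in payoffs.items()}
assert max(weighted, key=weighted.get) == "b"
print(weighted)  # {'a': 5.0, 'b': 7.0, 'c': 5.0}
```

In general the existence of such weights comes from a supporting-hyperplane argument; here $P$ was simply picked by inspection.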
Note that we have not determined anything about P other than that it sums to 1. Some properties we don't know or derive in this setup:

- whether $P$ is unique, or corresponds to "reasonable" beliefs about which world the agent is actually in;
- whether $P$ gets updated over time (e.g. by Bayesian updates) as the agent learns more.
The following example assumes that we have an expected utility maximizer in the sense of being Pareto efficient over multiple worlds, and shows that it behaves as if it uses conditional probabilities.
Another example, but we actually walk through the math this time.
You live in Berkeley, CA, like Korean food, and have utility function u = "subjective quality of food you eat". Suppose you are deciding where to eat based only on names and Yelp reviews of restaurants. You are uncertain about X, a random variable representing the quality of all restaurants under your preferences, and Yelp reviews give you partial information about this. Your decision-making is some function A(f(X)) of the information f(X) in the Yelp reviews, and you choose A to maximize your expected utility between worlds: maybe the optimal A is to compare the average star ratings, give Korean restaurants a 0.2 star bonus, and pick the restaurant with the best adjusted average rating.
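As a concrete illustration of such an $A$, here is the "adjusted average rating" rule from the paragraph above as a short Python sketch, using the ratings from the table below; which restaurants count as Korean is my assumption for illustration.

```python
# One possible decision rule A(f(X)): average star rating plus a 0.2-star
# bonus for Korean restaurants, then pick the highest adjusted rating.
# The "korean" flags are assumptions made for this illustration.
reviews = {
    "Mad Seoul":  {"rating": 4.5, "korean": True},
    "Sushinista": {"rating": 4.8, "korean": False},
}

def choose_restaurant(reviews):
    def adjusted(info):
        return info["rating"] + (0.2 if info["korean"] else 0.0)
    return max(reviews, key=lambda name: adjusted(reviews[name]))

print(choose_restaurant(reviews))  # -> "Sushinista" (4.8 beats 4.5 + 0.2)
```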
Here, we assume you behave like an "expected utility maximizer" in the weak sense above. I claim we can model you as maximizing conditional expected value.
Suppose you're constructing a lookup table for the best action A given each possible observation of reviews. Your lookup table looks something like
| f(X) | A(f(X)) |
|---|---|
| {("Mad Seoul", 4.5), ("Sushinista", 4.8)} | eat at Sushinista |
| {("Kimchi Garden", 4.3), ("Great China", 4.4)} | eat at Kimchi Garden |
| … | … |
You always calculate the action $A$ for each row so as to maximize your overall expected utility $\sum_X P(X)\, u(X, A(f(X)))$.
Suppose that in a given row we have $f(X) = o$, where $o$ is some observation. Then we are finding $\operatorname*{argmax}_{A} \sum_{X:\, f(X)=o} P(X)\, u(X, A)$, since the action chosen in this row only affects worlds where $f(X) = o$. We can make a series of simplifications:

$$
\begin{aligned}
\operatorname*{argmax}_{A} \sum_{X:\, f(X)=o} P(X)\, u(X, A)
&= \operatorname*{argmax}_{A} \frac{1}{P(f(X)=o)} \sum_{X:\, f(X)=o} P(X)\, u(X, A) \\
&= \operatorname*{argmax}_{A} \sum_{X} P(X \mid f(X)=o)\, u(X, A) \\
&= \operatorname*{argmax}_{A}\; \mathbb{E}[u(X, A) \mid f(X)=o],
\end{aligned}
$$

where the first step divides by a constant that doesn't depend on $A$, and the second uses the definition of conditional probability.
Thus, we can model you as using conditional expected value.
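The derivation can also be checked numerically. The sketch below uses an invented set of worlds, observations, and utilities (none of it from the original post): for every row of the lookup table, choosing the action by total expected utility over worlds agrees with choosing it by conditional expected utility given that row's observation.

```python
from collections import defaultdict

# Invented toy model: each world X specifies the true quality of two
# restaurants; f(X) is a coarse observation of the reviews; u is 1 exactly
# when the restaurant you pick is great in that world.
worlds = {
    ("KG great", "GC okay"):  0.3,
    ("KG okay",  "GC great"): 0.2,
    ("KG great", "GC great"): 0.3,
    ("KG okay",  "GC okay"):  0.2,
}
actions = ["eat at Kimchi Garden", "eat at Great China"]

def u(world, action):
    kg, gc = world
    chosen = kg if action == "eat at Kimchi Garden" else gc
    return 1.0 if "great" in chosen else 0.0

def f(world):
    """Coarse Yelp-style observation: which restaurant the reviews favor."""
    kg, gc = world
    if ("great" in kg) == ("great" in gc):
        return "reviews look even"
    return "reviews favor KG" if "great" in kg else "reviews favor GC"

# Each observation value corresponds to one row of the lookup table.
rows = defaultdict(list)
for world, p in worlds.items():
    rows[f(world)].append((world, p))

for obs, members in rows.items():
    # (1) Choose the row's action by total expected utility: only worlds
    #     consistent with this row's observation are affected by the choice.
    by_total = max(actions, key=lambda a: sum(p * u(w, a) for w, p in members))
    # (2) Choose the row's action by conditional expected utility:
    #     renormalize P within the row, i.e. condition on f(X) = obs.
    z = sum(p for _, p in members)
    by_cond = max(actions, key=lambda a: sum(p / z * u(w, a) for w, p in members))
    print(f"{obs}: {by_total}  (matches conditional-EU choice: {by_total == by_cond})")
```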
This section is a distillation of, and expansion upon, this comment thread.
Suppose now that you're making multiple decisions in a distributed fashion to maximize the same utility function, where there is no information flow between the decisions. For example, 10 copies of you (with the same preferences and same choice of restaurants) are dropped into Berkeley, but they all have slightly different observation processes $f_i$: Google Maps reviews, Grubhub reviews, personal anecdotes, etc.
Now, when constructing a lookup table for each decision $A_i(f_i(X))$, each copy of you will still condition each row's output on its input. When making decision $A_i$ from input $f_i(X)$, you don't have the other information $f_j(X)$ for $j \neq i$, so you consider each decision separately, still maximizing $\mathbb{E}[u(X, A_i) \mid f_i(X)]$. Here, the information $f_i(X)$ does not depend on other decisions, but this is not necessary for the core point.[2]
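Here is a minimal sketch of that distributed setup, with invented worlds, utility, and observation functions: each copy builds its own lookup table by conditioning only on its own observation, with no access to the others' information.

```python
# Sketch of the distributed setting (worlds, utility, and observation
# functions are invented): several copies share the same utility function
# but each sees a different observation of the same world.
worlds = {0: 0.25, 1: 0.25, 2: 0.25, 3: 0.25}   # world id -> P(X)
actions = ["A", "B"]

def u(x, a):
    """Invented utility: action "A" pays off in odd worlds, "B" in even ones."""
    return float(x % 2 == (1 if a == "A" else 0))

observation_fns = [
    ("sees high bit", lambda x: x // 2),
    ("sees low bit",  lambda x: x % 2),
]

def best_action(f_i, obs):
    """argmax_A E[u(X, A) | f_i(X) = obs], using only this copy's observation."""
    members = {x: p for x, p in worlds.items() if f_i(x) == obs}
    z = sum(members.values())
    return max(actions, key=lambda a: sum(p / z * u(x, a) for x, p in members.items()))

for name, f_i in observation_fns:
    table = {obs: best_action(f_i, obs) for obs in sorted({f_i(x) for x in worlds})}
    print(name, "->", table)
```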
In the setup with one decision, we showed that a Pareto-efficient agent can be modeled as maximizing conditional EU over possible worlds $X$: $\operatorname*{argmax}_A \mathbb{E}[u(X, A) \mid f(X) = o]$. But because one can construct a utility function of type $(\text{observation}, \text{action}) \to \mathbb{R}$ consistent with any agent's behavior, the agent can also be modeled as maximizing conditional EU over possible observations $f(X)$: $\operatorname*{argmax}_A \mathbb{E}[u'(f(X), A) \mid f(X) = o] = \operatorname*{argmax}_A u'(o, A)$. In the single-decision case, there is no compelling reason to model the agent as caring about worlds rather than observations, especially because storing and processing observations should be simpler than storing and processing distributions over worlds.
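For instance (a short sketch; the worlds, observations, and utilities below are invented), such an observation-level utility can be constructed directly as $u'(o, A) := \mathbb{E}[u(X, A) \mid f(X) = o]$, and an agent maximizing $u'$ makes the same choices:

```python
# Construct a utility function defined directly on observations:
#   u_obs[(o, A)] := E[u(X, A) | f(X) = o].
# An agent that just maximizes u_obs over its single observation behaves
# identically, so one decision alone doesn't force a model of worlds.
# (worlds, actions, u, and f below are invented placeholders.)
worlds  = {"x1": 0.5, "x2": 0.3, "x3": 0.2}          # world -> P(X)
actions = ["a", "b"]
u = {("x1", "a"): 1, ("x1", "b"): 0, ("x2", "a"): 0,
     ("x2", "b"): 1, ("x3", "a"): 0, ("x3", "b"): 1}  # u(world, action)
f = {"x1": "o1", "x2": "o1", "x3": "o2"}             # observation function

u_obs = {}
for o in set(f.values()):
    members = {x: p for x, p in worlds.items() if f[x] == o}
    z = sum(members.values())
    for a in actions:
        u_obs[(o, a)] = sum(p / z * u[(x, a)] for x, p in members.items())

# The observation-utility maximizer's policy, per observation:
print({o: max(actions, key=lambda a: u_obs[(o, a)]) for o in set(f.values())})
```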
When the agent makes multiple decisions $A_1, \dots, A_n$ based on different observations $f_1(X), \dots, f_n(X)$, there are two possible "trivial" ways to model it: either as maximizing a single utility function $u'(f_1(X), \dots, f_n(X), A_1, \dots, A_n)$ defined over the joint observations, or as maximizing $n$ separate utility functions $u'_i(f_i(X), A_i)$, one per decision. However, with sufficiently many decisions, neither of these trivial representations is as "nice" as conditional EU over possible worlds:

- a single utility function over the joint observations takes all the observations as inputs, even though no individual decision ever sees more than its own; once the combined observations carry more bits than the world's state, this representation is also larger than a distribution over worlds plus a utility over worlds;
- separate utility functions, one per decision, don't capture the fact that all the decisions are coordinated toward the same goal in the same underlying world.
John made the following comment:
> We are showing that the agent performs Bayesian updates, in some sense. That's basically what conditioning is. It's just not necessarily performing a series of updates over time, with each retaining the information from the previous, the way we usually imagine.
When $f$ depends on past decisions, the agent just maximizes $\mathbb{E}[u \mid f_i(X, A_1, \dots, A_{i-1}) = o_i]$, i.e. it still conditions on whatever value its observation actually took. To see the math for the multi-decision case, read the original post by John Wentworth.
If the world has $N$ bits of state, and the observations reveal $k$ bits of information each, the pigeonhole principle says this surely happens once there are at least $N/k$ observations. Our universe has an enormous number of bits of state, so this won't happen unless our agent can operate coherently across a correspondingly enormous number of different decisions; this threshold can maybe be reduced if we suppose that our agent can only actually observe a much smaller number of bits of state.
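As a worked instance with purely hypothetical numbers (neither figure is from the original argument): if the world had $N = 10^{6}$ bits of state and each observation revealed $k = 100$ bits, the threshold would be

$$
\frac{N}{k} \;=\; \frac{10^{6}}{100} \;=\; 10^{4} \ \text{observations.}
$$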