Consider a simple coordination game. In this game, two players (player 1 and player 2) each simultaneously choose an action, X or Y. If they both choose X, they both get 1 utility. If they both choose Y, they both get uy utility for some 0≤uy≤2 known to both players. If they choose different actions, they both get 0 utility. Which action should they each choose? Assume they get to communicate before knowing uy, but not after knowing uy.
An optimal policy pair is for each to pick X if uy<1, and Y otherwise. Unfortunately, this policy pair can break down in the presence of even a small amount of noise. Assume neither player observes uy, but instead each receives an independent observation Vi (player 1 sees V1, player 2 sees V2), each of which is drawn uniformly from the range [uy−ϵ,uy+ϵ] for some small ϵ>0. If both players follow the policy of choosing X if and only if their observation of uy is less than 1, then if uy is very close to 1, there is a significant likelihood that one player will receive an observation less than 1, while the other will receive one greater than 1. Thus, they have a significant chance of choosing different actions and receiving 0 utility. This issue is essentially the same as the one faced by Buridan's ass: both options are equally good, so there is no way to effectively decide between them.
In fact, as I will prove:
Claim 1: No pair of policies in which the players select their actions independently given their observations can simultaneously guarantee an expected utility greater than max(1,uy)−0.5 for all uy∈[0,2].
But things change when a shared source of randomness is allowed. Then, as I will also prove:
Claim 2: Suppose that, after observing their observation of uy, each player also gets to observe a random number R∼Uniform([0,1]). Then there is a pair of policies that achieves expected utility at least max(1,uy)−2√ϵ−ϵ regardless of uy .
This solution method extends to arbitrary cooperative normal-form games where players observe a perturbed version of the true utility function. This is shown in the appendix.
Why is this important? I expect progress in decision theory to come from studying very simple decision problems like this one. If it is possible to show that any solution to some simple problem has certain properties, then this usefully constrains the design space of possible decision theories. In this case, the result suggests that something like shared randomness may be required for achieving provable near-optimality even in cooperative settings.
Claim 1: impossibility of solving the game using independent randomness
Fix ϵ>0. Let π1,π2:R→[0,1] be Lesbegue measurable functions representing the two players' policies, which map Vi to the player's probability of choosing action Y.
Define q(u):=Uniform([u−ϵ,u+ϵ]) to be the function mapping uy to the resulting Vi distribution.
Define τi(u):=Ev∼q(u)[πi(v)]. This is defined so that τi(uy) equals player i's overall probability of choosing action Y.
For u1,u2∈[0,2], note that the total variation distance between q(u1) and q(u2) is at most |u1−u2|ϵ. Thus, since πi is bounded between 0 and 1, τi is 1ϵ-Lipschitz and therefore continuous.
I will now show that, for some uy, the players achieve expected utility at most max(1,uy)−0.5 by case analysis on τ1:
If τ1(0)≥0.5 then at most 0.5 expected utility is achieved when uy=0 (since at most 1 utility is achieved when player 1 chooses action X, and no utility is achieved when player 1 chooses action Y).
If τ1(2)≤0.5, then at most 1.5 expected utility is achieved when uy=2 (since at most 1 utility is achieved when player 1 chooses action X, and at most 2 utility is achieved when player 1 chooses action Y).
If neither of the two above cases hold, then by continuity of τ1 and the intermediate value theorem, there exists some u∈[0,2] with τ1(u)=0.5. When uy=u, since the two players choose actions independently, there is a 0.5 probability that they select the same action, ensuring that they achieve at most max(1,uy)2≤max(1,uy)−0.5 expected utility.
Thus, claim 1 is proven. This proof method bears resemblance to the problem faced by Buridan's ass: in the third case, some uy value is found so that the "policy" τ1 is equally compelled by both options, choosing between them with a 50/50 coin flip in a way that is disastrous for coordination.
Claim 2: solving the game using shared randomness
Define fϵ(v)=v−(1−√ϵ)2√ϵ. Consider a pair of policies in which player i chooses Y if R<fϵ(Vi), and otherwise chooses X, where R∼Uniform([0,1]). Intuitively, each of these policies increases its chance of taking action Y as its observation of uy increases in a smooth fashion, and they correlate their randomness so that they are likely to choose the same action. Now note a couple properties of this pair of policies:
1. For uy≥1+√ϵ+ϵ, it is guaranteed that Vi,Vj≥1+√ϵ, so both players always take action Y.
2. Since fϵ is 12√ϵ-Lipschitz, and |Vi−Vj|≤2ϵ, the probability of the players taking different actions is at most √ϵ.
I will now show that the players' expected utility is at least max(1,uy)−2√ϵ−ϵ by case analysis:
Suppose uy≥1+√ϵ+ϵ. Then by property 1, both players are guaranteed to take action Y, so they achieve expected utility uy≥max(1,uy)−2√ϵ−ϵ.
Suppose uy<1+√ϵ+ϵ. By property 2, the players choose the same action with probability at least 1−√ϵ, and taking the same action yields a utility of at least 1, so they achieve expected utility at least 1−√ϵ. Due to the bound on uy, this is at least max(1,uy)−2√ϵ−ϵ.
Thus, the claim is proven. By setting ϵ sufficiently low, this pair of policies yields an expected utility arbitrarily close to max(1,uy) regardless of uy.
Conclusion and directions for further research
Together, these claims show that there is a simple set of noisy coordination games (namely, the set of games described in the beginning of this post for all possible uy values) that is impossible to solve with only independent randomness, but which can be solved using shared randomness. Some notes on this:
The proof of claim 1 only used the fact that τ1 is continuous. So even without noise, if the players' policies are a continuous function of the utility uy, the same problem occurs.
A pair of Bayesians playing this coordination game who have a prior over uy have no need for randomization, independent or joint. The goal of achieving a high expected utility regardless of uy most clearly makes sense if uy is selected adversarially (to maximize regret). It also makes sense if the environment is hard to model in a way that makes selection of a policy with a low rate of "ties" difficult. The comparison between the shared randomness solution and the Bayesian solution is similar to the comparison between randomized Quicksort and a "Bayesian" variant of Quicksort that selects pivots based on some expected distribution of inputs. While the Bayesian solution works better than the randomized solution when the prior over the input distribution is accurate, the randomized solution is simpler, is amenable to proofs, and works well even under adversarial conditions.
Where to go from here? Roughly, my overall "plan" for formal decision theory consists of 3 steps:
1. Solve Cartesian cooperative decision theory problems where players reason about each other using something like reflective oracles.
2. Extend the solution in 1 to Cartesian multiplayer game theory problems where players have different utility functions and reason about each other using something like reflective oracles.
3. Extend the solution in 2 to a logically uncertain naturalized setting.
Step 1 has still not been solved. Previous writing on this includes this post which studies a setting with a single memoryless player (which is similar to a set of players who have the same utility function). The post shows (redundantly with the paper introducing the absent-minded driver problem as I later found out) that, if the player's policy is globally optimal (i.e. it achieves the highest possible expected utility), then all actions that might be taken by that policy are CDT-optimal, assuming SIA probabilities. This second condition is a local optimality condition, so the post shows that global optimality implies local optimality.
It would be highly desirable to find a similar but different local optimality property that implies a global optimality property. That would essentially be a way of deriving collective self-interest from individual self-interest when individuals have the same utility function and common priors.
As this current post shows, the globally optimal policy is discontinuous as a function of the utilities involved, and no continuous approximation always yields a near-optimal expected utility. I expect this to hinder attempts to derive global optimality from local optimality, as it implies there is a valley of bad policies between decent policies and good ones.
Introducing shared randomness here may help by preserving continuity. So a natural next step is to find useful local optimality properties in a cooperative setting where players have shared randomness.
Appendix: extension to arbitrary normal-form cooperative games
The solution described in the section on claim 2 can be extended to arbitrary normal-form cooperative games where the players receive only noisy observations of the payoffs. Reading this appendix (which takes up the rest of this post) is unnecessary for understanding the main point of this post.
Let n≥1 be the number of players, Ai be player i's set of actions, and u:∏ni=1Ai→R be the shared unknown utility function over strategy profiles, whose minimum value is umin and whose maximum value is umax.
Fix 0<ϵ≤umax−umin. Let Vi:∏ni=1Ai→R be a "perturbed" version of u that player i observes; it must satisfy the property that for any strategy profile a, |Vi(a)−u(a)|≤ϵ.
To define the policies, we will first number the strategy profiles arbitrarily a1,…,am where m=∏ni=1|Ai|. Let c>0 be some number to be determined later. For j∈{1,…,m}, define
fj(v):=exp(cv(aj))∑mj′=1exp(cv(aj′)).
This defines, given any v∈∏ni=1Ai, a probability distribution over strategy profiles, since ∑mj=1fj(v)=1. Specifically, the distribution is a softmax. This distribution is more likely to select strategy profiles that v considers better, but has some chance of selecting every strategy profile.
Players will sample from this distribution using a source of shared randomness, R∼Uniform([0,1]). Define
g(v,r):=min{j∈{1,…,m}∣r≤∑jj′=1fj′(v)}
Now note that P(g(v,R)=j)=fj(v), i.e. g(v,R) is distributed according to fj(v). Define policies πi(v,r):=ag(v,r)i. That is, player i will play their appropriate action for strategy profile number g(Vi,R).
Roughly, we will now show 2 properties that are sufficient to establish that these policies are near-optimal:
With high probability, the players play according to ag(u,R), i.e. for all i, πi(Vi,R)=ag(u,R)i.
The strategy profile ag(u,R) is near-optimal in expectation, i.e. E[u(ag(u,R))] is close to umax.
Players play according to ag(u,R) with high probability
First, we will show that, for all i and j, fj(Vi) is close to fj(u).
For all j∈{1,…,m}, since |Vi(aj)−u(aj)|≤ϵ, we have
For it to be the case that g(u,R)≠g(Vi,R), it must be the case that for some j,
R≤∑jj′=1fj′(u)↮R≤∑jj′=1fj′(Vi)
which implies
R∈ConvexHull({∑jj′=1fj′(u),∑jj′=1fj′(Vi)})
Due to the bound on ∣∣∑jj′=1fj′(Vi)−∑jj′=1fj′(u)∣∣, the measure of this hull is at most exp(2cϵ)−1. Furthermore, there are m hulls of this form (one for each j value), so the total measure of these hulls is at most m(exp(2cϵ)−1).
So, player i chooses action ag(u,R)i with probability at least 1−m(exp(2cϵ)−1). Thus, by the union bound, the players jointly choose the actions ag(u,R) with probability at least 1−nm(exp(2cϵ)−1).
ag(u,R) is near-optimal in expectation
First we will prove a lemma.
Softmax lemma: For any c>0 and vector x of length m,
Players play according to ag(u,R) with probability at least 1−nm(exp(2cϵ)−1).
E[u(ag(u,R))]≥umax−mec.
The first fact lets us quantify the expected difference between u(ag(u,R)) and u(π1(Vi,R),…,πn(Vn,R)). Since these random variables are equal with probability at least 1−n
Consider a simple coordination game. In this game, two players (player 1 and player 2) each simultaneously choose an action, X or Y. If they both choose X, they both get 1 utility. If they both choose Y, they both get uy utility for some 0≤uy≤2 known to both players. If they choose different actions, they both get 0 utility. Which action should they each choose? Assume they get to communicate before knowing uy, but not after knowing uy.
An optimal policy pair is for each to pick X if uy<1, and Y otherwise. Unfortunately, this policy pair can break down in the presence of even a small amount of noise. Assume neither player observes uy, but instead each receives an independent observation Vi (player 1 sees V1, player 2 sees V2), each of which is drawn uniformly from the range [uy−ϵ,uy+ϵ] for some small ϵ>0. If both players follow the policy of choosing X if and only if their observation of uy is less than 1, then if uy is very close to 1, there is a significant likelihood that one player will receive an observation less than 1, while the other will receive one greater than 1. Thus, they have a significant chance of choosing different actions and receiving 0 utility. This issue is essentially the same as the one faced by Buridan's ass: both options are equally good, so there is no way to effectively decide between them.
In fact, as I will prove:
Claim 1: No pair of policies in which the players select their actions independently given their observations can simultaneously guarantee an expected utility greater than max(1,uy)−0.5 for all uy∈[0,2].
But things change when a shared source of randomness is allowed. Then, as I will also prove:
Claim 2: Suppose that, after observing their observation of uy, each player also gets to observe a random number R∼Uniform([0,1]). Then there is a pair of policies that achieves expected utility at least max(1,uy)−2√ϵ−ϵ regardless of uy .
This solution method extends to arbitrary cooperative normal-form games where players observe a perturbed version of the true utility function. This is shown in the appendix.
Why is this important? I expect progress in decision theory to come from studying very simple decision problems like this one. If it is possible to show that any solution to some simple problem has certain properties, then this usefully constrains the design space of possible decision theories. In this case, the result suggests that something like shared randomness may be required for achieving provable near-optimality even in cooperative settings.
Claim 1: impossibility of solving the game using independent randomness
Fix ϵ>0. Let π1,π2:R→[0,1] be Lesbegue measurable functions representing the two players' policies, which map Vi to the player's probability of choosing action Y.
Define q(u):=Uniform([u−ϵ,u+ϵ]) to be the function mapping uy to the resulting Vi distribution.
Define τi(u):=Ev∼q(u)[πi(v)]. This is defined so that τi(uy) equals player i's overall probability of choosing action Y.
For u1,u2∈[0,2], note that the total variation distance between q(u1) and q(u2) is at most |u1−u2|ϵ. Thus, since πi is bounded between 0 and 1, τi is 1ϵ-Lipschitz and therefore continuous.
I will now show that, for some uy, the players achieve expected utility at most max(1,uy)−0.5 by case analysis on τ1:
Thus, claim 1 is proven. This proof method bears resemblance to the problem faced by Buridan's ass: in the third case, some uy value is found so that the "policy" τ1 is equally compelled by both options, choosing between them with a 50/50 coin flip in a way that is disastrous for coordination.
Claim 2: solving the game using shared randomness
Define fϵ(v)=v−(1−√ϵ)2√ϵ. Consider a pair of policies in which player i chooses Y if R<fϵ(Vi), and otherwise chooses X, where R∼Uniform([0,1]). Intuitively, each of these policies increases its chance of taking action Y as its observation of uy increases in a smooth fashion, and they correlate their randomness so that they are likely to choose the same action. Now note a couple properties of this pair of policies:
1. For uy≥1+√ϵ+ϵ, it is guaranteed that Vi,Vj≥1+√ϵ, so both players always take action Y.
2. Since fϵ is 12√ϵ-Lipschitz, and |Vi−Vj|≤2ϵ, the probability of the players taking different actions is at most √ϵ.
I will now show that the players' expected utility is at least max(1,uy)−2√ϵ−ϵ by case analysis:
Thus, the claim is proven. By setting ϵ sufficiently low, this pair of policies yields an expected utility arbitrarily close to max(1,uy) regardless of uy.
Conclusion and directions for further research
Together, these claims show that there is a simple set of noisy coordination games (namely, the set of games described in the beginning of this post for all possible uy values) that is impossible to solve with only independent randomness, but which can be solved using shared randomness. Some notes on this:
Where to go from here? Roughly, my overall "plan" for formal decision theory consists of 3 steps:
1. Solve Cartesian cooperative decision theory problems where players reason about each other using something like reflective oracles.
2. Extend the solution in 1 to Cartesian multiplayer game theory problems where players have different utility functions and reason about each other using something like reflective oracles.
3. Extend the solution in 2 to a logically uncertain naturalized setting.
Step 1 has still not been solved. Previous writing on this includes this post which studies a setting with a single memoryless player (which is similar to a set of players who have the same utility function). The post shows (redundantly with the paper introducing the absent-minded driver problem as I later found out) that, if the player's policy is globally optimal (i.e. it achieves the highest possible expected utility), then all actions that might be taken by that policy are CDT-optimal, assuming SIA probabilities. This second condition is a local optimality condition, so the post shows that global optimality implies local optimality.
It would be highly desirable to find a similar but different local optimality property that implies a global optimality property. That would essentially be a way of deriving collective self-interest from individual self-interest when individuals have the same utility function and common priors.
As this current post shows, the globally optimal policy is discontinuous as a function of the utilities involved, and no continuous approximation always yields a near-optimal expected utility. I expect this to hinder attempts to derive global optimality from local optimality, as it implies there is a valley of bad policies between decent policies and good ones.
Introducing shared randomness here may help by preserving continuity. So a natural next step is to find useful local optimality properties in a cooperative setting where players have shared randomness.
Appendix: extension to arbitrary normal-form cooperative games
The solution described in the section on claim 2 can be extended to arbitrary normal-form cooperative games where the players receive only noisy observations of the payoffs. Reading this appendix (which takes up the rest of this post) is unnecessary for understanding the main point of this post.
Let n≥1 be the number of players, Ai be player i's set of actions, and u:∏ni=1Ai→R be the shared unknown utility function over strategy profiles, whose minimum value is umin and whose maximum value is umax.
Fix 0<ϵ≤umax−umin. Let Vi:∏ni=1Ai→R be a "perturbed" version of u that player i observes; it must satisfy the property that for any strategy profile a, |Vi(a)−u(a)|≤ϵ.
To define the policies, we will first number the strategy profiles arbitrarily a1,…,am where m=∏ni=1|Ai|. Let c>0 be some number to be determined later. For j∈{1,…,m}, define
fj(v):=exp(cv(aj))∑mj′=1exp(cv(aj′)).
This defines, given any v∈∏ni=1Ai, a probability distribution over strategy profiles, since ∑mj=1fj(v)=1. Specifically, the distribution is a softmax. This distribution is more likely to select strategy profiles that v considers better, but has some chance of selecting every strategy profile.
Players will sample from this distribution using a source of shared randomness, R∼Uniform([0,1]). Define
g(v,r):=min{j∈{1,…,m}∣r≤∑jj′=1fj′(v)}
Now note that P(g(v,R)=j)=fj(v), i.e. g(v,R) is distributed according to fj(v). Define policies πi(v,r):=ag(v,r)i. That is, player i will play their appropriate action for strategy profile number g(Vi,R).
Roughly, we will now show 2 properties that are sufficient to establish that these policies are near-optimal:
Players play according to ag(u,R) with high probability
First, we will show that, for all i and j, fj(Vi) is close to fj(u).
For all j∈{1,…,m}, since |Vi(aj)−u(aj)|≤ϵ, we have
exp(−cϵ)≤exp(cVi(aj))exp(cu(aj))≤exp(cϵ).
Therefore, for all j∈{1,…,m},
fj(Vi)fj(u)=exp(cVi(aj))exp(cu(aj))∑mj′=1exp(cu(aj′))∑mj′=1exp(cVi(aj′))≤exp(2cϵ).
By identical logic,
fj(u)fj(Vi)≤exp(2cϵ).
Now we will bound ∣∣∑jj′=1fj′(Vi)−∑jj′=1fj′(u)∣∣.
∣∣∑jj′=1fj′(Vi)−∑jj′=1fj′(u)∣∣
=∣∣∣∑jj′=1fj′(Vi)(1−fj′(u)fj′(Vi))∣∣∣
≤∑jj′=1fj′(Vi)∣∣∣1−fj′(u)fj′(Vi)∣∣∣
≤∑jj′=1fj′(Vi)max{exp(2cϵ)−1,1−exp(−2cϵ)}
≤∑jj′=1fj′(Vi)(exp(2cϵ)−1)
=exp(2cϵ)−1
For it to be the case that g(u,R)≠g(Vi,R), it must be the case that for some j,
R≤∑jj′=1fj′(u)↮R≤∑jj′=1fj′(Vi)
which implies
R∈ConvexHull({∑jj′=1fj′(u),∑jj′=1fj′(Vi)})
Due to the bound on ∣∣∑jj′=1fj′(Vi)−∑jj′=1fj′(u)∣∣, the measure of this hull is at most exp(2cϵ)−1. Furthermore, there are m hulls of this form (one for each j value), so the total measure of these hulls is at most m(exp(2cϵ)−1).
So, player i chooses action ag(u,R)i with probability at least 1−m(exp(2cϵ)−1). Thus, by the union bound, the players jointly choose the actions ag(u,R) with probability at least 1−nm(exp(2cϵ)−1).
ag(u,R) is near-optimal in expectation
First we will prove a lemma.
Softmax lemma: For any c>0 and vector x of length m,
∑mj=1exp(cxj)xj∑mj=1exp(cxj)≥maxj∈{1,…,m}xj−mec.
Proof:
Let x∗ be the maximum of x. Now:
∑mj=1exp(cxj)(x∗−xj)∑mj=1exp(cxj)≤∑mj=1exp(cxj)(x∗−xj)exp(cx∗)=∑mj=1exp(−c(x∗−xj))(x∗−xj)
For any y, we have y≥logy+1. Therefore for any y, cy≥log(cy)+1=logy+logc+1⇔logy−cy≤−logc−1. By exponentiating both sides, yexp(−cy)≤1ec.
Applying this to the term from the previous inequality:
∑mj=1exp(−c(x∗−xj))(x∗−xj)≤∑mj=11ec=mec
At this point we have
∑mj=1exp(cxj)xj∑mj=1exp(cxj)=x∗−∑mj=1exp(cxj)(x∗−xj)∑mj=1exp(cxj)≥x∗−mec.
□
At this point the implication is straightforward. Since g(u,R) is distributed according to fj(u), we have
E[u(ag(u,R))]=∑mj=1fj(u)u(aj)=∑mj=1exp(cu(aj))u(aj)∑mj=1exp(cu(aj))≥umax−mec .
Proving the result from these facts
At this point we have:
The first fact lets us quantify the expected difference between u(ag(u,R)) and u(π1(Vi,R),…,πn(Vn,R)). Since these random variables are equal with probability at least 1−n