Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

Reducing collective rationality to individual optimization in common-payoff games using MCMC

20th Aug 2018


12 comments

I like this, it's an explicit construction that demonstrates how you can play with the explore-exploit tradeoff in multiagent settings. Note that when α is set to be very high (the condition in which we get near-optimal outcomes in the limit), there is very little exploration, and so it will take a long time before we actually find the optimal outcome in the first place. It seems like this would make it hard to use in practice. Are you planning to use reflective oracles to replace the exploration here with reasoning about other agents? Or perhaps you think the slow exploration is not a problem for some reason?

The approach most similar to this that is actually practical is MCMC optimization, which is practical for, among other things, program synthesis. I did not use that for this post since it would require agents to compute counterfactuals, but it is significantly more efficient in practice.

I'm currently trying to solve the theoretical decision theory problem rather than create usable algorithms. It would be pretty great to have a mathematical formalization of the correct decision theory even if it takes exponential time to run.

The problem with using reflective oracles is that this means the agents' actions are independent, but, based on this post, the agents' actions need to be correlated with each other to get a high score, if each policy is a continuous function of the utility function. Correlated reflective oracles seem like they might help, but they are essentially a superset of ordinary reflective oracles, so using them can't ensure that actions are correlated.

There might be some variant of reflective oracles that does work; a problem with this is that the algorithm in this post is "sticky", which seems like an obstacle to deriving an agent's policy based on their beliefs about others.

You could generalize to games where all players can achieve their best possible utilities by coordinating, and these utilities don't have to be symmetric. For example, a coordination game with (5,5) and (10,20). Maybe your proof already works for this case.

Then you could try generalizing to games where some subset of players can jointly enforce their best possible utilities regardless of what other players do, but unfortunately that fails. For example, imagine three people, one of whom must be sacrificed to Cthulhu so the other two may live. The person to be killed is chosen by majority vote, and in case of tie they all die. Then any two people can ensure maximum utility for themselves by voting to kill the third, but the solution isn't unique.

I believe the same proof method works when players share the same ordering on outcomes; the important thing is that it's more likely to jump from a bad state to a good state than vice versa.

I think the main issue with the Cthulhu example is that, if person A is voting for person B and person B is voting for person A, then person C is indifferent between voting for person A or person B. But I expect that the same proof method from the post shows that, if any subset can ensure their *strict* maximum utility by coordinating, then they will.

What does "strict maximum utility" mean - achieving the unique outcome that maximizes utility? That seems too restrictive, e.g. in a coordination game with (10,10) and (10,10) I want the players to find an optimal outcome even though it's not unique. Can you make it precise?

Edit: your condition "all players share the same ordering on outcomes" also seems too restrictive. My condition "all players can achieve their best possible utilities by coordinating" is more lenient, because it doesn't care about ordering of non-optimal outcomes. So maybe you could push in that direction as well.

Yes, that is what I meant by strict maximum utility. Agree this is too restrictive. I don't know what similar criterion would be both good to have and achievable, given the Cthulhu example.

Regarding all players achieving their best possible utilities by coordinating: what about a case where player A's utility is always 0, and player B's utility is 1 if player A takes action 1 and 0 if player A takes action 2? They can both achieve their best possible utilities by coordinating but I don't think that means player A should necessarily take action 1.

Great example, I didn't think of that. Maybe the right criterion is that some achievable utility vector must be strictly superior to all other achievable utility vectors on all coordinates. But it shouldn't have to correspond to a unique outcome, so the (10,10) (10,10) example could still work. Would that be possible with your approach?

Do I understand it correctly that the approach is to adjust the decision probabilities in a way that penalizes low-utility outcomes and rewards high-utility outcomes? If so, do you recommend a specific way of doing so? How hard would it be to do what you propose constructively, say in the iterated Prisoner's dilemma with a bunch of bots following this strategy, and see if they converge to a Nash equilibrium, hopefully the best one?

> Do I understand it correctly that the approach is to adjust the decision probabilities in a way that penalizes low-utility outcomes and rewards high-utility outcomes?

Yes, the idea is that you switch actions more often when utility is low.

> If so, do you recommend a specific way of doing so?

Yes, the decision rule is in the post.

> How hard would it be to do what you propose constructively, say in the iterated Prisoner’s dilemma with a bunch of bots following this strategy, and see if they converge to a Nash equilibrium, hopefully the best one?

This doesn't guarantee a Pareto efficient outcome in games where players have different utility functions (see "Extension to non-common-payoff games"). Running these strategies in iterated games is difficult given that there is an infinite number of "actions" (policies). It's plausible that this approach yields something like $\epsilon$-approximate correlated equilibria, but I haven't worked out the details, and anyway I'd rather try to find something that gets Pareto efficient outcomes.

A well-known issue in game theory is that of multiple Nash equilibria in coordination games.

For example, consider the following game. Players 1 and 2 both choose either "high" or "low". If both choose "high", they both get 10 utility; if both choose "low", they both get 5 utility; if they choose different actions, then they get no utility. There are 3 Nash equilibria: one where both always play "high", one where both always play "low", and one where each plays "high" with 1/3 probability and "low" with 2/3 probability. While (high, high) is obviously the best Nash equilibrium, unilaterally switching to high when the current equilibrium is (low, low) is probably unwise.
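As a quick numerical check of the mixed equilibrium (an illustrative sketch I am adding; it is not part of the original post), a player mixes only if indifferent between its two actions given the other player's probability of "high":

```python
# Coordination game from above: both "high" -> 10, both "low" -> 5, mismatch -> 0.
# At the mixed Nash equilibrium each player plays "high" with probability p and
# must be indifferent between its two pure actions:
#   E[U | I play high] = 10 * p,   E[U | I play low] = 5 * (1 - p)
p = 1 / 3
eu_high = 10 * p        # expected utility of playing "high" against p
eu_low = 5 * (1 - p)    # expected utility of playing "low" against p
assert abs(eu_high - eu_low) < 1e-9  # indifference holds, so p = 1/3 is an equilibrium
print(f"both actions yield {eu_high:.4f} at p = 1/3")
```

Note that the mixed equilibrium yields expected utility 10/3, worse even than the (low, low) equilibrium.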

You could imagine TDT arguments based on symmetry between the different players to argue that "high" is the only rational choice, but it is easy to imagine asymmetric variants of this problem where the symmetry arguments are much less clear.

The presence of suboptimal Nash equilibria looks like a serious problem for reconciling individual and collective rationality, even when players have common payoffs. If everyone is going to play "low", then playing "low" is individually rational (at least according to CDT); however, the collective of agents fails to act rationally due to a coordination failure.

What I hoped to do before coming up with the idea in this post is to somehow reduce collective rationality to individual rationality. Can we prescribe individual rationality conditions to each player, such that if each player follows these conditions, then a near-optimal outcome for everyone is attained?

Somewhat surprisingly, the answer is "kind of." It certainly seems like solving coordination problems requires global optimization instead of just local optimization, but under appropriate conditions, local optimization is actually sufficient.

## Problem setup

We will consider the setting of normal-form common-payoff games. Let $n$ be the number of players. Let $A$ be some finite set of actions; each player $i$ will select the action $A_i$. Let $U : A^n \to \mathbb{R}$ be the shared utility function. Each player receives the same utility $U(a_1, \ldots, a_n)$.

I will now assume that the players play some number of practice rounds before the "real" round. These practice rounds are very similar to those in the fictitious play algorithm. Ontologically, what are these practice rounds? One interpretation is that they are actual rounds of the game that have been played in the past; another interpretation is that they are a sequence of analogous one-shot rounds played by agents who have more and more compute and are thus able to simulate previous rounds. (The second interpretation might seem strange, but this is essentially how logical inductors achieve generalization bounds in decision theory problems.)

Let $T$ be the number of practice rounds. Let $A^t_i$ be the action selected by player $i$ in practice round $t$. I will assume $A_i = A^T_i$, i.e. the last practice round is the "real" round.

We will assume that, in each round $t$, some shared random number $R_t \sim \mathrm{Uniform}(\{1, \ldots, n\})$ is generated, and that players' policies in round $t$ may depend on this number.

How are players' policies represented? We will assume that each player has only black-box access to the other players' actions in previous practice rounds. Let $\pi_i : \mathrm{List}(A^n) \times \{1, \ldots, n\} \to \Delta A$ be player $i$'s policy, which selects an action distribution for player $i$ in round $t$ given knowledge of the actions in all previous rounds and of the current shared random number.

The generative model is now obvious: for each $t$ in sequence, sample $R_t \sim \mathrm{Uniform}(\{1, \ldots, n\})$ and then sample each $A^t_i \sim \pi_i(A^{1:t-1}_{1:n}, R_t)$ independently, where the notation $A^{1:t-1}_{1:n}$ refers to the list $(A^1_{1:n}, \ldots, A^{t-1}_{1:n})$.

Given a set of policies, it is possible to compute, for each $t$, the expected utility at each time step (which is the expected utility attained if $T = t$):

$$v_t(\pi_{1:n}) := \mathbb{E}[U(A^t_{1:n}) \mid \pi_{1:n}]$$

The limit $\lim_{t \to \infty} v_t(\pi_{1:n})$ and the corresponding $\liminf$ and $\limsup$ are interesting; they indicate how well a set of policies does in expectation after a large number of practice rounds.

## Policies that solve the game

The policies we will consider are ones that usually stick with the same action they played in the last round, but sometimes switch to a different action, and are more likely to switch the worse the utility in the last round was.

Let $u_{\min}$ be a lower bound on $U$. Let $\alpha > 0$ be some "rationality parameter". Consider the following set of policies:

$$\pi^\alpha_i(A^{1:t-1}_{1:n}, R_t) = \begin{cases} \mathrm{Uniform}(A) & \text{if } t = 1 \\ \delta(A^{t-1}_i) & \text{if } t > 1, R_t \neq i \\ e^{\alpha (u_{\min} - U(A^{t-1}_{1:n}))} \cdot \mathrm{Uniform}(A) + \left(1 - e^{\alpha (u_{\min} - U(A^{t-1}_{1:n}))}\right) \cdot \delta(A^{t-1}_i) & \text{otherwise} \end{cases}$$

where $\delta(a)$ is a delta distribution on $a$, and probability distributions are taken to be vectors.

These policies start out with each player selecting a random action. In each round, a random player is selected (according to $R_t$); this player switches their action to some uniformly random action with a probability that is lower the higher the utility in the previous round was, and otherwise sticks with the same action. Players other than the randomly selected player always stick with the same action.
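A minimal simulation of these policies on the high/low coordination game (my own illustrative sketch; the post contains no code, and the parameters $\alpha = 1$, $u_{\min} = 0$ are my choices):

```python
import math
import random

random.seed(0)

# Two-player common-payoff game from the introduction.
U = {("high", "high"): 10.0, ("low", "low"): 5.0,
     ("high", "low"): 0.0, ("low", "high"): 0.0}
actions = ["high", "low"]
u_min = 0.0   # lower bound on U
alpha = 1.0   # rationality parameter

# Run the multi-round process: each round one random player may resample.
state = [random.choice(actions), random.choice(actions)]
counts = {}
T, burn_in = 50_000, 10_000
for t in range(T):
    i = random.randrange(2)  # shared random number R_t selects a player
    # The selected player resamples uniformly with prob e^{alpha (u_min - U)}.
    if random.random() < math.exp(alpha * (u_min - U[tuple(state)])):
        state[i] = random.choice(actions)
    if t >= burn_in:
        key = tuple(state)
        counts[key] = counts.get(key, 0) + 1

freq_hh = counts.get(("high", "high"), 0) / (T - burn_in)
print(f"empirical P(high, high) = {freq_hh:.3f}")
```

After burn-in, the empirical frequency of (high, high) should be close to the stationary value $e^{10}/(e^{10} + e^{5} + 2) \approx 0.993$, even though each player only reacts locally to last round's utility.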

An important aspect of these policies is that each player only does local optimization. That is, each player shifts their own action depending on the utility in the last round, such that over the course of many iterations they are more likely to take an action that attains a high utility given the other players' actions. Naively, one would expect a local optimization algorithm like this one to get stuck in traps such as the (low, low) Nash equilibrium, but this turns out not to be the case.

We will prove the following theorem:

**Theorem 1:** For all $\epsilon > 0$ there exists $\alpha$ such that $\lim_{t \to \infty} v_t(\pi^\alpha_{1:n}) \geq \max_{a_{1:n} \in A^n} U(a_{1:n}) - \epsilon$.

That is, by setting $\alpha$ high enough, these policies achieve an arbitrarily-close-to-optimal expected utility in the limit.

## Analyzing the process as a Markov chain

To prove this theorem, we can analyze the multi-round process with these policies as a Markov chain whose state is the action vector $A^t_{1:n}$. The transition function (giving the distribution of $A^{t+1}_{1:n}$ given $A^t_{1:n}$) is:

$$t^\alpha(a_{1:n})(a_{1:i-1}, a'_i, a_{i+1:n}) := \frac{e^{\alpha (u_{\min} - U(a_{1:n}))}}{n |A|} \quad \text{if } a'_i \neq a_i$$

$$t^\alpha(a_{1:n})(a_{1:n}) := 1 - \sum_{i=1}^{n} \sum_{a'_i \neq a_i} t^\alpha(a_{1:n})(a_{1:i-1}, a'_i, a_{i+1:n})$$

$$t^\alpha(a_{1:n})(a'_{1:n}) := 0 \quad \text{if } a_{1:n} \text{ and } a'_{1:n} \text{ differ in at least two coordinates}$$

Consider the following distribution over states:

$$p^\alpha(a_{1:n}) := \frac{e^{\alpha U(a_{1:n})}}{\sum_{a'_{1:n} \in A^n} e^{\alpha U(a'_{1:n})}}$$

i.e. it is a softmax that is more likely to select states with higher utilities.
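To make this concrete (an illustrative computation I am adding, not from the post), evaluating $p^\alpha$ with $\alpha = 1$ on the high/low game puts nearly all mass on the optimal state:

```python
import math

# Softmax stationary distribution p_alpha over joint action profiles, alpha = 1.
U = {("high", "high"): 10.0, ("low", "low"): 5.0,
     ("high", "low"): 0.0, ("low", "high"): 0.0}
alpha = 1.0
Z = sum(math.exp(alpha * u) for u in U.values())  # normalizing constant
p = {s: math.exp(alpha * u) / Z for s, u in U.items()}
for s, prob in sorted(p.items(), key=lambda kv: -kv[1]):
    print(s, f"{prob:.4f}")
# (high, high) gets e^10 / (e^10 + e^5 + 2) ≈ 0.9932 of the probability mass.
```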

We will show:

**Lemma 1:** $p^\alpha$ is the unique stationary distribution of the Markov chain defined by $t^\alpha$.

*Proof:* It is well-known (e.g. see this Wikipedia article) that an ergodic Markov chain satisfying detailed balance (a condition defined below) with respect to a probability distribution over states has that distribution as its unique stationary distribution.

The Markov chain is aperiodic, since every state has a nonzero probability of transitioning to itself. Additionally, it is irreducible (i.e. it is possible to get from any state to any other state), since any action in the state can change to any other action with nonzero probability while leaving the other actions fixed. Therefore, the Markov chain is ergodic.

This Markov chain satisfies a detailed balance condition relative to the state distribution $p^\alpha$. Specifically, for any states $a_{1:n}$ and $a'_{1:n}$:

$$p^\alpha(a_{1:n}) \, t^\alpha(a_{1:n})(a'_{1:n}) = p^\alpha(a'_{1:n}) \, t^\alpha(a'_{1:n})(a_{1:n})$$

i.e. if a state is sampled from $p^\alpha$ and then transitioned once, the probability of a transition from $a_{1:n}$ to $a'_{1:n}$ equals the probability of a transition from $a'_{1:n}$ to $a_{1:n}$.

The proof of this detailed balance property proceeds by case analysis on $a_{1:n}$ and $a'_{1:n}$. If they are equal the equation is trivial, and if they differ in at least two coordinates both sides are zero. In the remaining case, where they differ in exactly one coordinate $i$:

$$p^\alpha(a_{1:n}) \, t^\alpha(a_{1:n})(a'_{1:n}) = \frac{e^{\alpha U(a_{1:n})}}{\sum_{b_{1:n} \in A^n} e^{\alpha U(b_{1:n})}} \cdot \frac{e^{\alpha (u_{\min} - U(a_{1:n}))}}{n |A|} = \frac{e^{\alpha u_{\min}}}{n |A| \sum_{b_{1:n} \in A^n} e^{\alpha U(b_{1:n})}}$$

and similarly

$$p^\alpha(a'_{1:n}) \, t^\alpha(a'_{1:n})(a_{1:n}) = \frac{e^{\alpha U(a'_{1:n})}}{\sum_{b_{1:n} \in A^n} e^{\alpha U(b_{1:n})}} \cdot \frac{e^{\alpha (u_{\min} - U(a'_{1:n}))}}{n |A|} = \frac{e^{\alpha u_{\min}}}{n |A| \sum_{b_{1:n} \in A^n} e^{\alpha U(b_{1:n})}}.$$

Since the Markov chain is ergodic and satisfies detailed balance with respect to $p^\alpha$, it follows that $p^\alpha$ is the unique stationary distribution of the Markov chain.

□
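Lemma 1 can also be checked numerically on a small random common-payoff game. The following sketch (my addition; the transition function and $p^\alpha$ follow the definitions above, with arbitrary parameters $n = 2$, $|A| = 3$, $\alpha = 2$) verifies detailed balance and stationarity directly:

```python
import math
import random
from itertools import product

random.seed(1)

# Random common-payoff game: n = 2 players, 3 actions, utilities in [0, 1].
n, A = 2, [0, 1, 2]
states = list(product(A, repeat=n))
U = {s: random.random() for s in states}
u_min, alpha = 0.0, 2.0

def trans(s, s2):
    """Transition probability t_alpha(s)(s2) as defined above."""
    diff = [i for i in range(n) if s[i] != s2[i]]
    if len(diff) >= 2:
        return 0.0
    base = math.exp(alpha * (u_min - U[s])) / (n * len(A))
    if len(diff) == 1:
        return base
    # Staying put: 1 minus the mass of all single-coordinate switches.
    return 1.0 - n * (len(A) - 1) * base

Z = sum(math.exp(alpha * U[s]) for s in states)
p = {s: math.exp(alpha * U[s]) / Z for s in states}

# Detailed balance: p(s) t(s)(s') == p(s') t(s')(s) for all state pairs.
for s in states:
    for s2 in states:
        assert abs(p[s] * trans(s, s2) - p[s2] * trans(s2, s)) < 1e-12
# Hence p is stationary: (p T)(s') == p(s') for every state s'.
for s2 in states:
    assert abs(sum(p[s] * trans(s, s2) for s in states) - p[s2]) < 1e-12
print("detailed balance and stationarity hold")
```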

Now we will prove Theorem 1.

**Theorem 1:** For all $\epsilon > 0$ there exists $\alpha$ such that $\lim_{t \to \infty} v_t(\pi^\alpha_{1:n}) \geq \max_{a_{1:n} \in A^n} U(a_{1:n}) - \epsilon$.

*Proof:* Clearly, $\lim_{t \to \infty} v_t(\pi^\alpha_{1:n}) = \mathbb{E}_{a_{1:n} \sim p^\alpha}[U(a_{1:n})]$, since $p^\alpha$ is the unique stationary distribution of $A^t_{1:n}$ considered as a Markov chain.

Define $u^* := \max_{a_{1:n} \in A^n} U(a_{1:n})$, i.e. the highest achievable utility. Define $u'$ to be the second-highest achievable utility.

At this point, we need only set $\alpha$ large enough that $p^\alpha$ assigns almost all of its probability mass to optimal states. The argument that this is possible follows.

Fix $\epsilon > 0$. Set

$$\alpha := \frac{n \log |A| + \log(u^* - u_{\min}) - \log \epsilon}{u^* - u'}.$$

Let $a^*_{1:n}$ be an optimal action profile, and let $a'_{1:n}$ be a non-optimal action profile. We have

$$\frac{p^\alpha(a^*_{1:n})}{p^\alpha(a'_{1:n})} = e^{\alpha (U(a^*_{1:n}) - U(a'_{1:n}))} \geq e^{\alpha (u^* - u')} = \frac{|A|^n (u^* - u_{\min})}{\epsilon}$$

Since there is at least one optimal state and fewer than $|A|^n$ non-optimal states, the ratio between the probability (under $p^\alpha$) of the state being optimal and the probability of it being non-optimal is at least $\frac{u^* - u_{\min}}{\epsilon}$. Since in general $\frac{x}{1+x} \geq 1 - \frac{1}{x}$ for positive $x$, this means that the probability of the state being optimal is at least $1 - \frac{\epsilon}{u^* - u_{\min}}$. A non-optimal state has utility at least $u_{\min}$, so the overall expected suboptimality (i.e. the difference between $u^*$ and $U(a_{1:n})$) is at most $\epsilon$, as desired.

□
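The choice of $\alpha$ in the proof can be sanity-checked numerically (my own sketch, reusing the high/low game; here $u' = 5$ and I pick $\epsilon = 0.1$):

```python
import math
from itertools import product

# Check Theorem 1's bound on the high/low game: with the alpha from the proof,
# expected suboptimality under p_alpha should be at most eps.
A = ["high", "low"]
n = 2
U = {("high", "high"): 10.0, ("low", "low"): 5.0,
     ("high", "low"): 0.0, ("low", "high"): 0.0}
u_min, eps = 0.0, 0.1
u_star = max(U.values())                              # u* = 10
u_second = max(u for u in U.values() if u < u_star)   # u' = 5

alpha = (n * math.log(len(A)) + math.log(u_star - u_min)
         - math.log(eps)) / (u_star - u_second)

states = list(product(A, repeat=n))
Z = sum(math.exp(alpha * U[s]) for s in states)
expected_u = sum(math.exp(alpha * U[s]) / Z * U[s] for s in states)
print(f"alpha = {alpha:.3f}, suboptimality = {u_star - expected_u:.4f}")
assert u_star - expected_u <= eps  # the bound from the proof holds
```

The bound is loose here: the actual suboptimality comes out around an order of magnitude below $\epsilon$.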

## Extension to non-common-payoff games

The policies defined above can be extended to play non-common-payoff games. Specifically, each agent can have a higher likelihood of switching action in each round if their *own* utility is lower. Unfortunately, this results in mutual defection in a prisoner's dilemma; while I won't present a detailed analysis here, the gist of the argument is that (C, C) is much more likely to switch into (C, D) or (D, C) than (C, D) or (D, C) is to switch into (C, C). So a different approach than this one will be needed for achieving Pareto optimal outcomes in non-common-payoff games.

## Conclusion

In this post I have shown that, in common-payoff games under a multi-round structure, there is a set of policies that only does local optimization but which, together, yield a globally near-optimal outcome. This is important because getting global optimality from local optimization is required for decision theory (i.e. each agent taking an action based on its own expected utility) to produce Pareto optimal outcomes in common-payoff games.

In a previous post on coordination games, I presented a plan for solving formal decision theory:

1. Solve Cartesian common-payoff decision theory problems where players reason about each other using something like reflective oracles.

2. Extend the solution in 1 to Cartesian multiplayer game theory problems where players have different utility functions and reason about each other using something like reflective oracles.

3. Extend the solution in 2 to a logically uncertain naturalized setting.

This post has done something pretty close to solving 1, with the multi-round structure taking the place of reflective oracles. Unfortunately, step 2 fails since the solution method does not extend to non-common-payoff games. So it appears that, for my plan to work, 1 must be solved in a different way that can be extended to also solve non-common-payoff games.

Alternatively, the problem of non-common-payoff games may be attacked directly, though this seems difficult given that non-common-payoff games are more complex than common-payoff games, and include all the dynamics present in common-payoff games. But perhaps restricted subsets of general games, other than the set of common-payoff games, can be found and solved.