A well-known issue in game theory is that of multiple Nash equilibria in coordination games.
For example, consider the following game. Players 1 and 2 both choose either "high" or "low". If both choose "high", they both get 10 utility; if both choose "low", they both get 5 utility; if they choose different actions, then they get no utility. There are 3 Nash equilibria: one where both always play "high", one where both always play "low", and one where each plays "high" with 1/3 probability and "low" with 2/3 probability. While (high, high) is obviously the best Nash equilibrium, unilaterally switching to high when the current equilibrium is (low, low) is probably unwise.
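As a quick sanity check on the mixed equilibrium, the following snippet (illustrative only) verifies that an opponent playing "high" with probability 1/3 makes a player indifferent between the two actions:

```python
# Sanity check of the mixed equilibrium in the high/low game:
# payoffs are 10 for (high, high), 5 for (low, low), 0 otherwise.

def expected_payoff(p_self_high, p_other_high):
    """Expected payoff when playing 'high' with probability p_self_high
    against an opponent playing 'high' with probability p_other_high."""
    return (10 * p_self_high * p_other_high
            + 5 * (1 - p_self_high) * (1 - p_other_high))

# Against an opponent who plays 'high' with probability 1/3, both pure
# strategies yield the same payoff (10/3), so any mixture is a best response.
print(expected_payoff(1.0, 1 / 3), expected_payoff(0.0, 1 / 3))
```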
You could imagine TDT arguments based on symmetry between the different players to argue that "high" is the only rational choice, but it is easy to imagine asymmetric variants of this problem where the symmetry arguments are much less clear.
The presence of suboptimal Nash equilibria looks like a serious problem for reconciling individual and collective rationality, even when players have common payoffs. If everyone is going to play "low", then playing "low" is individually rational (at least according to CDT); however, the collective of agents fails to act rationally due to a coordination failure.
What I hoped to do before coming up with the idea in this post was to somehow reduce collective rationality to individual rationality. Can we prescribe individual rationality conditions to each player, such that if each player follows these conditions, then a near-optimal outcome for everyone is attained?
Somewhat surprisingly, the answer is "kind of." It certainly seems like solving coordination problems requires global optimization instead of just local optimization, but under appropriate conditions, local optimization is actually sufficient.
Problem setup
We will consider the setting of normal-form common-payoff games. Let $n$ be the number of players. Let $A$ be some finite set of actions; each player $i \in \{1, \ldots, n\}$ will select an action $a_i \in A$. Let $U : A^n \to \mathbb{R}$ be the shared utility function. Each player receives the same utility $U(a_1, \ldots, a_n)$.
I will now assume that the players play some number of practice rounds before the "real" round. These practice rounds are very similar to those in the fictitious play algorithm. Ontologically, what are these practice rounds? One interpretation is that they are actual rounds of the game that have been played in the past; another interpretation is that they are a sequence of analogous one-shot rounds played by agents who have more and more compute and are thus able to simulate previous rounds. (The second interpretation might seem strange, but this is essentially how logical inductors achieve generalization bounds in decision theory problems.)
Let $T$ be the number of practice rounds. Let $a^t_i$ be the action selected by player $i$ in practice round $t$. I will assume the "real" round is simply round $T$, i.e. the last practice round is the "real" round.
We will assume that, in each round $t$, some shared random number $r_t$ is generated, and that players' policies in round $t$ may depend on this number.
How are players' policies represented? We will assume that each player has only black-box access to each other's actions in previous practice rounds. Let $\pi_i$ be player $i$'s policy, which selects an action distribution for player $i$ in round $t$ given knowledge of the actions in all previous rounds and of the current shared random number.
The generative model is now obvious: for each $t \in \{1, \ldots, T\}$ in sequence, sample $r_t$ and then sample each $a^t_i \sim \pi_i(a^{1:t-1}, r_t)$ independently, where the notation $a^{1:t-1}$ refers to the list $(a^1, \ldots, a^{t-1})$.
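This generative model is straightforward to write down as code. Here is a minimal Python sketch in which, for simplicity, each policy returns a sampled action rather than a distribution (the helper names are my own):

```python
import random

def run_practice_rounds(policies, T, sample_r):
    """Sample the process: for each round t, draw the shared random number
    r_t, then sample every player's action given the history a^{1:t-1} and
    r_t. `policies` is a list of functions (history, r) -> action, each
    returning a sampled action rather than a distribution, for simplicity."""
    history = []  # history[t-1] is the action profile of round t
    for t in range(T):
        r = sample_r()
        profile = tuple(pi(history, r) for pi in policies)
        history.append(profile)
    return history

# toy usage: two players who both copy the shared random bit, so their
# actions are perfectly correlated in every round
bit_policies = [lambda history, r: r, lambda history, r: r]
print(run_practice_rounds(bit_policies, 3, lambda: random.randrange(2)))
```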
Given a set of policies, it is possible to compute, for each $t$, the expected utility at each time step (which is the expected utility attained if $T = t$):

$$u^t := \mathbb{E}\left[U(a^t_1, \ldots, a^t_n)\right]$$
The limit $u^\infty := \lim_{t \to \infty} u^t$ and the corresponding $\liminf$ and $\limsup$ are interesting; they indicate how well a set of policies does in expectation after a large number of practice rounds.
Policies that solve the game
The policies we will consider are ones that usually stick with the same action they played in the last round, but sometimes switch to a different action, and are more likely to switch the worse the utility in the last round was.
Let $u_{\min}$ be a lower bound on $U$. Let $\alpha > 0$ be some "rationality parameter". Take the shared random number $r_t$ to be uniform over $\{1, \ldots, n\}$, so it selects a single player each round. Consider the following set of policies:

$$\pi_i(a^{1:t-1}, r_t) := \begin{cases} \mathrm{Unif}(A) & \text{if } t = 1 \\ e^{\alpha(u_{\min} - U(a^{t-1}))}\,\mathrm{Unif}(A) + \left(1 - e^{\alpha(u_{\min} - U(a^{t-1}))}\right)\delta(a^{t-1}_i) & \text{if } t > 1 \text{ and } r_t = i \\ \delta(a^{t-1}_i) & \text{otherwise} \end{cases}$$

where $\delta(a)$ is a delta distribution on $a$, and probability distributions are taken to be vectors.
These policies start out with each player selecting a uniformly random action. In each round, a random player is selected (according to the shared random number $r_t$); this player switches their action to a uniformly random action with a probability that is lower the higher the utility in the previous round was, and otherwise sticks with the same action. Players other than the randomly selected player always stick with the same action.
An important aspect of these policies is that each player only does local optimization. That is, each player shifts their own action depending on the utility in the last round, such that over the course of many iterations they are more likely to take an action that attains a high utility given the other players' actions. Naively, one would expect a local optimization algorithm like this one to get stuck in traps such as the (low, low) Nash equilibrium, but this turns out not to be the case.
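To illustrate, here is a small simulation of these policies on the high/low game from the introduction. The parameter values are chosen for illustration; with a moderate $\alpha$, the process reliably escapes the (low, low) trap given enough practice rounds:

```python
import math
import random

def simulate(U, A, n, alpha, u_min, T, rng=None):
    """Run the switch-when-utility-is-low policies on a common-payoff game.
    U maps an action tuple to the shared utility, u_min is a lower bound
    on U, and alpha is the rationality parameter."""
    rng = rng or random.Random(0)
    a = tuple(rng.choice(A) for _ in range(n))  # round 1: uniform actions
    for _ in range(T - 1):
        i = rng.randrange(n)  # shared randomness selects one player
        # switch probability is high when last round's utility was low
        if rng.random() < math.exp(alpha * (u_min - U(a))):
            a = a[:i] + (rng.choice(A),) + a[i + 1:]
    return a

def U(a):  # the high/low game from the introduction
    if all(x == "high" for x in a): return 10.0
    if all(x == "low" for x in a): return 5.0
    return 0.0

# with a moderate alpha, the process usually escapes (low, low) and ends
# at the optimal profile after enough practice rounds
print(simulate(U, ["high", "low"], n=2, alpha=1.0, u_min=0.0, T=10000))
```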
We will prove the following theorem:
Theorem 1: For all $\epsilon > 0$ there exists $\alpha > 0$ such that $\lim_{t \to \infty} u^t \geq \max_{a \in A^n} U(a) - \epsilon$.
i.e. by setting $\alpha$ high enough, these policies achieve an arbitrarily-close-to-optimal expected utility in the limit.
Analyzing the process as a Markov chain
To prove this theorem, we can analyze the multi-round process with these policies as a Markov chain whose state is the action vector $a \in A^n$. The transition function $\mathcal{T}$ (giving the distribution of $a^{t+1}$ given $a^t$) is:

$$\mathcal{T}(a' \mid a) := \begin{cases} \frac{1}{n} \cdot \frac{e^{\alpha(u_{\min} - U(a))}}{|A|} & \text{if } a' \text{ and } a \text{ differ in exactly one player's action} \\ 1 - \frac{|A|-1}{|A|}\, e^{\alpha(u_{\min} - U(a))} & \text{if } a' = a \\ 0 & \text{otherwise} \end{cases}$$

(The $a' = a$ case accounts for both the selected player declining to switch and the selected player resampling their current action.)
Consider the following distribution over states:

$$\rho(a) := \frac{e^{\alpha U(a)}}{\sum_{a' \in A^n} e^{\alpha U(a')}}$$

i.e. it is a softmax that is more likely to select states with higher utilities.
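A short numerical sketch shows this softmax concentrating on the optimal state as $\alpha$ grows, using the high/low game as the example:

```python
import math
from itertools import product

def stationary(U, A, n, alpha):
    """The softmax distribution rho(a) proportional to exp(alpha * U(a))."""
    states = list(product(A, repeat=n))
    weights = {a: math.exp(alpha * U(a)) for a in states}
    Z = sum(weights.values())
    return {a: w / Z for a, w in weights.items()}

def U(a):  # the high/low game from the introduction
    if all(x == "high" for x in a): return 10.0
    if all(x == "low" for x in a): return 5.0
    return 0.0

# as alpha grows, rho concentrates on the optimal state (high, high)
for alpha in [0.1, 1.0, 3.0]:
    rho = stationary(U, ["high", "low"], 2, alpha)
    print(alpha, round(rho[("high", "high")], 4))
```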
We will show:
Lemma 1: $\rho$ is the unique stationary distribution of the Markov chain defined by $\mathcal{T}$.
Proof:
It is well-known (e.g. see this Wikipedia article) that an ergodic Markov chain satisfying detailed balance (a condition defined later) with respect to a probability distribution over states has this distribution as its unique stationary distribution.
The Markov chain is aperiodic, since every state has a nonzero probability of transitioning to itself. Additionally, it is irreducible (i.e. it is possible to get from any state to any other state), since any action in the state can change to any other action with nonzero probability while leaving the other actions fixed. Therefore, the Markov chain is ergodic.
This Markov chain satisfies a detailed balance condition relative to the state distribution $\rho$. Specifically, for any states $a$ and $a'$:

$$\rho(a)\,\mathcal{T}(a' \mid a) = \rho(a')\,\mathcal{T}(a \mid a')$$

i.e. if a state is sampled from $\rho$ and then transitioned once, the probability of a transition from $a$ to $a'$ equals the probability of a transition from $a'$ to $a$.
The proof of this detailed balance property proceeds by case analysis on $a$ and $a'$.
- If $a' = a$, then the condition is trivial.
- If $a$ and $a'$ differ on more than one element, then both transition probabilities are 0, so the condition is trivial.
- If they differ on a single element $i$, then:

$$\rho(a)\,\mathcal{T}(a' \mid a) = \frac{e^{\alpha U(a)}}{\sum_{a''} e^{\alpha U(a'')}} \cdot \frac{e^{\alpha(u_{\min} - U(a))}}{n|A|} = \frac{e^{\alpha u_{\min}}}{n|A|\sum_{a''} e^{\alpha U(a'')}}$$

and similarly

$$\rho(a')\,\mathcal{T}(a \mid a') = \frac{e^{\alpha U(a')}}{\sum_{a''} e^{\alpha U(a'')}} \cdot \frac{e^{\alpha(u_{\min} - U(a'))}}{n|A|} = \frac{e^{\alpha u_{\min}}}{n|A|\sum_{a''} e^{\alpha U(a'')}}$$

so the two sides are equal.
Since the Markov chain is ergodic and satisfies detailed balance with respect to $\rho$, it follows that $\rho$ is the unique stationary distribution of the Markov chain.
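The detailed balance computation can also be checked numerically. The sketch below hard-codes the transition probabilities implied by the policies (switch probability $e^{\alpha(u_{\min} - U(a))}$, a uniformly chosen player, uniform resampling) for the high/low game:

```python
import math
from itertools import product

def U(a):  # the high/low game from the introduction
    if all(x == "high" for x in a): return 10.0
    if all(x == "low" for x in a): return 5.0
    return 0.0

A, n, alpha, u_min = ["high", "low"], 2, 1.0, 0.0

def transition(a, b):
    """One-step transition probability from state a to state b: a uniformly
    chosen player resamples uniformly with prob exp(alpha*(u_min - U(a)))."""
    p = math.exp(alpha * (u_min - U(a)))
    diff = sum(x != y for x, y in zip(a, b))
    if diff == 0:
        return 1 - p * (len(A) - 1) / len(A)  # nobody ends up moving
    if diff == 1:
        return p / (n * len(A))  # right player chosen, right action drawn
    return 0.0

states = list(product(A, repeat=n))
Z = sum(math.exp(alpha * U(s)) for s in states)
rho = {s: math.exp(alpha * U(s)) / Z for s in states}

# check rho(a) T(b|a) == rho(b) T(a|b) for every pair of states
for a in states:
    for b in states:
        assert abs(rho[a] * transition(a, b) - rho[b] * transition(b, a)) < 1e-12
print("detailed balance verified")
```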
Now we will prove Theorem 1.
Theorem 1: For all $\epsilon > 0$ there exists $\alpha > 0$ such that $\lim_{t \to \infty} u^t \geq \max_{a \in A^n} U(a) - \epsilon$.
Proof:
Clearly, $\lim_{t \to \infty} u^t = \mathbb{E}_{a \sim \rho}[U(a)]$, since $\rho$ is the unique stationary distribution of $\mathcal{T}$ considered as a Markov chain.
Define $u^* := \max_{a \in A^n} U(a)$, i.e. the highest achievable utility. Define $u_2 := \max_{a \in A^n : U(a) < u^*} U(a)$ to be the second-highest achievable utility. (If every action profile is optimal, the theorem holds trivially, so assume $u_2$ exists.)
At this point, we need only set $\alpha$ to be large enough that $\rho$ assigns almost all of its probability mass to optimal states. The argument that this is possible follows.
Fix $\epsilon > 0$. Set $\alpha := \frac{\log\left(|A|^n (u^* - u_{\min}) / \epsilon\right)}{u^* - u_2}$.
Let $a^*$ be an optimal action profile, and let $a$ be a non-optimal action profile. We have

$$\frac{\rho(a^*)}{\rho(a)} = e^{\alpha (U(a^*) - U(a))} \geq e^{\alpha (u^* - u_2)}$$

Since there is at least one optimal state and fewer than $|A|^n$ non-optimal states, the ratio between the probability (under $\rho$) of a state being optimal and the probability of a state being non-optimal is at least $x := e^{\alpha(u^* - u_2)} / |A|^n$. Since in general $\frac{x}{1+x} \geq 1 - \frac{1}{x}$ for positive $x$, this means that the probability of a state being optimal is at least $1 - |A|^n e^{-\alpha(u^* - u_2)}$. A non-optimal state has utility at least $u_{\min}$, so the overall expected suboptimality (i.e. the difference between $u^*$ and $\lim_{t \to \infty} u^t$) is at most $|A|^n e^{-\alpha(u^* - u_2)} (u^* - u_{\min}) = \epsilon$, as desired.
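For concreteness, one can plug the high/low game into this bound. The particular formula for $\alpha$ below is one choice making the final bound come out to exactly $\epsilon$:

```python
import math
from itertools import product

def U(a):  # the high/low game from the introduction
    if all(x == "high" for x in a): return 10.0
    if all(x == "low" for x in a): return 5.0
    return 0.0

A, n = ["high", "low"], 2
states = list(product(A, repeat=n))
u_star = max(U(s) for s in states)                # best utility, 10
u_2 = max(U(s) for s in states if U(s) < u_star)  # second-best, 5
u_min, eps = 0.0, 0.01

# a value of alpha making the suboptimality bound
# |A|^n * exp(-alpha*(u* - u_2)) * (u* - u_min) equal to eps
alpha = math.log(len(A) ** n * (u_star - u_min) / eps) / (u_star - u_2)

Z = sum(math.exp(alpha * U(s)) for s in states)
expected_u = sum(U(s) * math.exp(alpha * U(s)) / Z for s in states)
print(u_star - expected_u)  # actual suboptimality, comfortably below eps
```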
Extension to non-common-payoff games
The policies defined above can be extended to play non-common-payoff games. Specifically, each agent can have a higher likelihood of switching their action in each round when their own utility is lower. Unfortunately, this results in mutual defection in a prisoner's dilemma; while I won't present a detailed analysis here, the gist of the argument is that (C, C) is much more likely to switch into (C, D) or (D, C) than (C, D) or (D, C) is to switch into (C, C). So a different approach will be needed for achieving Pareto optimal outcomes in non-common-payoff games.
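A quick simulation illustrates the failure mode. Below, each player switches based on their own payoff (the payoff values are standard prisoner's dilemma numbers, chosen for illustration); mutual defection ends up as the most common final state:

```python
import math
import random

# standard prisoner's dilemma payoffs (chosen for illustration): (row, col)
PAYOFF = {("C", "C"): (3, 3), ("C", "D"): (0, 5),
          ("D", "C"): (5, 0), ("D", "D"): (1, 1)}

def simulate_pd(alpha, T, rng):
    """Per-player analogue of the common-payoff policy: the selected player
    switches with probability exp(alpha * (0 - their own payoff))."""
    a = [rng.choice("CD"), rng.choice("CD")]
    for _ in range(T - 1):
        i = rng.randrange(2)
        if rng.random() < math.exp(alpha * (0 - PAYOFF[tuple(a)][i])):
            a[i] = rng.choice("CD")
    return tuple(a)

rng = random.Random(0)
counts = {}
for _ in range(500):
    outcome = simulate_pd(alpha=1.0, T=2000, rng=rng)
    counts[outcome] = counts.get(outcome, 0) + 1
print(counts)  # (D, D) is the most common final state; (C, C) is rare
```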
Conclusion
In this post I have shown that, in common-payoff games under a multi-round structure, there is a set of policies that only does local optimization but which, together, yields a globally near-optimal outcome. This is important because getting global optimality from local optimization is required for decision theory (i.e. each agent taking an action based on its own expected utility) to produce Pareto optimal outcomes in common-payoff games.
In a previous post on coordination games, I presented a plan for solving formal decision theory:

1. Solve Cartesian common-payoff decision theory problems where players reason about each other using something like reflective oracles.
2. Extend the solution in 1 to Cartesian multiplayer game theory problems where players have different utility functions and reason about each other using something like reflective oracles.
3. Extend the solution in 2 to a logically uncertain naturalized setting.
This post has done something pretty close to solving step 1, with the multi-round structure taking the place of reflective oracles. Unfortunately, step 2 fails, since the solution method does not extend to non-common-payoff games. So it appears that, for my plan to work, step 1 must be solved in a different way that can be extended to also solve non-common-payoff games.
Alternatively, the problem of non-common-payoff games may be attacked directly, though this seems difficult given that non-common-payoff games are more complex than common-payoff games, and include all the dynamics present in common-payoff games. But perhaps restricted subsets of general games, other than the set of common-payoff games, can be found and solved.
I like this; it's an explicit construction that demonstrates how you can play with the explore-exploit tradeoff in multi-agent settings. Note that when α is set to be very high (the condition in which we get near-optimal outcomes in the limit), there is very little exploration, and so it will take a long time before we actually find the optimal outcome in the first place. It seems like this would make it hard to use in practice. Are you planning to use reflective oracles to replace the exploration here with reasoning about other agents? Or perhaps you think the slow exploration is not a problem for some reason?
The approach most similar to this that is actually practical is MCMC optimization, which is used in practice for, among other things, program synthesis. I did not use that for this post since it would require agents to compute counterfactuals, but it is significantly more efficient in practice.
I'm currently trying to solve the theoretical decision theory problem rather than create usable algorithms. It would be pretty great to have a mathematical formalization of the correct decision theory even if it takes exponential time to run.
The problem with using reflective oracles is that this means the agents' actions are independent, but, based on this post, the agents' actions need to be correlated with each other to get a high score, if each policy is a continuous function of the utility function. Correlated reflective oracles seem like they might help, but they are essentially a superset of ordinary reflective oracles, so using them can't ensure that actions are correlated.
There might be some variant of reflective oracles that does work; a problem with this is that the algorithm in this post is "sticky", which seems like an obstacle to deriving an agent's policy based on their beliefs about others.
You could generalize to games where all players can achieve their best possible utilities by coordinating, and these utilities don't have to be symmetric. For example, a coordination game with (5,5) and (10,20). Maybe your proof already works for this case.
Then you could try generalizing to games where some subset of players can jointly enforce their best possible utilities regardless of what other players do, but unfortunately that fails. For example, imagine three people, one of whom must be sacrificed to Cthulhu so the other two may live. The person to be killed is chosen by majority vote, and in case of tie they all die. Then any two people can ensure maximum utility for themselves by voting to kill the third, but the solution isn't unique.
I believe the same proof method works when players share the same ordering on outcomes; the important thing is that it's more likely to jump from a bad state to a good state than vice versa.
I think the main issue with the Cthulhu example is that, if person A is voting for person B and person B is voting for person A, then person C is indifferent between voting for person A or person B. But I expect that the same proof method from the post shows that, if any subset can ensure their strict maximum utility by coordinating, then they will.
What does "strict maximum utility" mean? Achieving the unique outcome that maximizes utility? That seems too restrictive, e.g. in a coordination game with (10,10) and (10,10) I want the players to find an optimal outcome even though it's not unique. Can you make it precise?
Edit: your condition "all players share the same ordering on outcomes" also seems too restrictive. My condition "all players can achieve their best possible utilities by coordinating" is more lenient, because it doesn't care about ordering of nonoptimal outcomes. So maybe you could push in that direction as well.
Yes, that is what I meant by strict maximum utility. Agree this is too restrictive. I don't know what similar criterion would be both good to have and achievable, given the Cthulhu example.
Regarding all players achieving their best possible utilities by coordinating: what about a case where player A's utility is always 0, and player B's utility is 1 if player A takes action 1 and 0 if player A takes action 2? They can both achieve their best possible utilities by coordinating but I don't think that means player A should necessarily take action 1.
Great example, I didn't think of that. Maybe the right criterion is that some achievable utility vector must be strictly superior to all other achievable utility vectors on all coordinates. But it shouldn't have to correspond to a unique outcome, so the (10,10) (10,10) example could still work. Would that be possible with your approach?
Yes, I think that will work.
Two typos in the proof that confused me:
Sign should be flipped.
Should be "A non-optimal state has utility at least $u_{\min}$".
Thanks, fixed.
Do I understand it correctly that the approach is to adjust the decision probabilities in a way that penalizes low-utility outcomes and rewards high-utility ones? If so, do you recommend a specific way of doing so? How hard would it be to do what you propose constructively, say in the iterated Prisoner's Dilemma with a bunch of bots following this strategy, and see if they converge to a Nash equilibrium, hopefully the best one?
Yes, the idea is that you switch actions more often when utility is low.
Yes, the decision rule is in the post.
This doesn't guarantee a Pareto efficient outcome in games where players have different utility functions (see "Extension to non-common-payoff games"). Running these strategies in iterated games is difficult given that there is an infinite number of "actions" (policies). It's plausible that this approach yields something like ϵ-approximate correlated equilibria, but I haven't worked out the details, and anyway I'd rather try to find something that gets Pareto efficient outcomes.