My take on higher-order game theory

Nisan

40 My take on higher-order game theory

30th Nov 2021

6 min read

40 Ω 18

This is how I currently think about higher-order game theory, the study of agents thinking about agents thinking about agents....

This post doesn't add any new big ideas beyond what was already in the post by Diffractor linked above. I just have a slightly different perspective that emphasizes the "metathreat" approach and the role of nondeterminism.

This is a work in progress. There's a bunch of technical work that must be done to make this rigorous. I'll save the details for the last section.

Multiple levels of strategic thinking

Suppose you're an agent with accurate beliefs about your opponents. It doesn't matter where your beliefs come from; perhaps you have experience with these opponents, or perhaps you read your opponents' source code and thought about it. Your beliefs are accurate, although for now we'll be vague about what exactly "accurate" means.

In a game of Chicken, you may want to play Swerve, since that's the maximin strategy. This is zeroth-order thinking because you don't need to predict what your opponent will do.

Or maybe you predict what your opponent will do and play the best response to that, Swerving if they'll go Straight and going Straight if they'll Swerve. This is first-order thinking.

Or maybe you know your opponent will use first-order thinking. So you resolve to go Straight, your opponent will predict this, and they will Swerve. This is second-order thinking.

Beyond second-order, Chicken turns into a commitment race.

In Prisoner's Dilemma, zeroth- and first-order thinking recommend playing Defect, as that's a dominant strategy. Second-order thinking says that it's good to Cooperate on the margin if by doing so you cause your opponent to also Cooperate on the margin. Let's say it's worth it to Cooperate with probability if your opponent thereby Cooperates with probability more than $\frac{p}{2}$ (although the exact numbers depend on the payoff matrix).

If you're a third-order agent, you might commit to Cooperating with probability equal to 51% times the probability that your opponent Cooperates. This causes your opponent to Cooperate with certainty, and thus you're bound to Cooperate with probability 51%. This is a Pareto-optimal outcome that favors you heavily.

Fixed points

So far we've talked about agents of order $n + 1$ playing against agents of order $n$ . What about games between agents of equal order?

If two first-order agents play the best response to each other, then they play a Nash equilibrium. Let's see how this works in Matching Pennies, which has a mixed-strategy Nash equilibrium:

If the Even player thinks Odd is more likely to play Heads, then Even will play Heads. If the Even player thinks Odd is more likely to play Tails, then Even will play Tails. If Even thinks Odd is equally likely to play Heads or Tails, then Even will be indifferent. So if both players have well-calibrated beliefs, we must be in the equilibrium where both players are indifferent and both players play $\frac{1}{2}$ Heads, $\frac{1}{2}$ Tails.

(What if Even always plays Heads when it's indifferent? In that case, there is no equilibrium, which is awkward for our theory. But this behavior requires Even to represent its beliefs as real numbers. If we make the more realistic assumption that Even's beliefs are arbitrary-precision numbers, then it can never be sure that it's on the decision boundary. It gets stuck in a for loop, querying its belief for ever more precision in case it's not exactly $0.5000 \dots$ . (Then some other process, like infinitesimal errors in the beliefs, causes the agent to output the equilibrium strategy of $\frac{1}{2}$ Heads, $\frac{1}{2}$ Tails.))

The space of all agents

Now we're ready to define our types. (I'll leave some details to the last section.) The space $A_{0}$ of 0th-order agents is just the set of actions or pure strategies.^[1]

The space of $(n + 1)$ st-order agents is the space of functions on probability distributions on $A_{n}$ . That is, $A_{n + 1} = [Δ A_{n} \to Δ A_{n}]$ .

So an $(n + 1)$ st-order agent is a stochastic function from its $n$ th-order beliefs to a successor agent of order $n$ . Creating a successor agent is a form of currying.

Playing a game

Two agents $a_{n + 1}, b_{n + 1}$ in $A_{n + 1}$ interact as follows:

Find distributions $c_{n}, d_{n}$ on $A_{n}$ such that $c_{n} = a_{n + 1} (d_{n})$ and $d_{n} = b_{n + 1} (c_{n})$ .

So $(c_{n}, d_{n})$ is a fixpoint representing consistent $n$ th-order beliefs that the agents have about each other.^[2]
Sample $a_{n}$ from $c_{n}$ and sample $b_{n}$ from $d_{n}$ .
Repeat steps 1 and 2 on $a_{n}$ and $b_{n}$ until we end up with $a_{0}, b_{0}$ in $A_{0}$ .
Calculate the payoff $u (a_{0}, b_{0})$ to the first player.

Now take the expectation over all samples in step 2, and take the minimum over all choices of fixpoint in step 1. This gives a lower bound on the first player's expected payoff.

Some agents

Now we can write down formulas for the kind of higher-order reasoning discussed earlier.

The first-order agent that always plays the best response to its opponent is defined by $a_{1} (d_{0}) = {argmax}_{a_{0}} E [u (a_{0}, d_{0})]$ . It achieves a Nash equilibrium when playing against itself.

A second-order agent that always plays the best response to its opponent is $a_{2} (d_{1}) (d_{0}) = {argmax}_{c_{0}} E [u (c_{0}, d_{1} (c_{0}))]$ .^[3] This agent also achieves a Nash equilibrium when playing against itself. In a future post I'll construct a different 2nd-order agent that both plays the best response to any first-order agent, and also achieves a Pareto-optimal outcome when playing against itself.

Summary

Order | Beliefs about other agent | Strategic behavior | n vs. n equilibrium
----------------------------------------------------------------------------
0     | None                      | Maximin            | Maximin
----------------------------------------------------------------------------
1     | 0th order                 | Best response      | Nash equilibrium
----------------------------------------------------------------------------
2     | 1st order                 | Argmax/optimize    | Pareto optimum
----------------------------------------------------------------------------
...

Infinity

There's a sense in which the 0th-order maximin agent is an approximation of the 1st-order best-response agent, which is itself an approximation of the 2nd-order argmaxing agent. Standard domain-theoretical constructions should allow construction of infinite-order agents, including an infinite-order strategic agent that subsumes all finite-order best-response behavior. Fortunately the hierarchy ends at infinity; $A_{\infty}$ has type $[Δ A_{\infty} \to Δ A_{\infty}]$ . However, whether and how this works depends a lot on the technical details I haven't worked out, so I hesitate to say any more about it.

Correlated strategies

This framework doesn't have correlated strategies built in. But I have a hunch that agents of sufficiently high order can play correlated strategies anyways.

Technical details

Everything is a domain; in particular, a dcpo. $X_{0}$ is also a lower semilattice. The mapping spaces $[X \to Y]$ are spaces of Scott-continuous functions.

$Δ$ is some kind of probabilistic powerdomain. An approach similar to (Jones & Plotkin 1989) might just work: $Δ X$ would be the set of probability measures on the Borel $σ$ -algebra of $X$ which restrict to continuous evaluations on open sets.

But it's possible $Δ X$ will have to contain sets of probability measures, in order to express nondeterministic choice of fixpoint. Possible approaches include (Mislove 2000), (Tix, Keimel, Plotkin 2009), (Goubault-Larrecq 2007), and of course (Appel and Kosoy 2021). There's also the field of quantitative semantics, which I don't know anything about.

I'd like to have fixpoints $(c_{n}, d_{n})$ which are maximal elements of $Δ X_{n} \times Δ X_{n}$ . This might follow easily from the Kakutani fixed point theorem, but it really depends on how $Δ$ is defined.

Another issue is that higher-order strategic agents have to compute fixpoints themselves (in order to calculate the expected value of playing a lower-order policy). I'm not sure if this can be done continuously. It might require the use of a reflective oracle somehow.

This work was supported by a grant from the Center on Long-Term Risk. Thanks to Alex Appel, Adele Dewey-Lopez, and Patrick LaVictoire for discussions about these ideas.

Actually, we want the set of nonempty sets of actions, so an agent can express indifference between actions. ↩︎
Really we want $c_{n}$ and $d_{n}$ to be maximal elements of the poset $Δ A_{n}$ such that $c_{n} ⊒ a_{n + 1} (d_{n})$ and $d_{n} ⊒ b_{n + 1} (c_{n})$ . See the section on technical details. ↩︎
Actually you can't compute an argmax over a function of a continuous variable. The best you can do is a distribution that's biased towards higher-payoff outcomes, like quantilization or best-of- $k$ sampling. I'll write more about this in the future. ↩︎

Domain TheoryGame TheoryWorld Modeling

Frontpage

40 Ω 18

Mentioned in

64Game Theory without Argmax [Part 1]

55FixDT

17Master plan spec: needs audit (logic and cooperative AI)

My take on higher-order game theory

New Comment

8 comments, sorted by

top scoring

Click to highlight new comments since: Today at 3:58 AM

[-]Cleo Nardo11moΩ250

Hey Nisan. Check the following passage from Domain Theory (Samson Abramsky and Achim Jung). This might be helpful for equipping with an appropriate domain structure. (You mention [JP89] yourself.)

We should also mention the various attempts to define a probabilistic version of the powerdomain construction, see [SD80, Mai85, Gra88, JP89, Jon90].
[SD80] N. Saheb-Djahromi. CPO’s of measures for nondeterminism. Theoretical Computer Science, 12:19–37, 1980.
[Mai85] M. Main. Free constructions of powerdomains. In A. Melton, editor, Mathematical Foundations of Programming Semantics, volume 239 of Lecture Notes in Computer Science, pages 162–183. Springer Verlag, 1985.
[Gra88] S. Graham. Closure properties of a probabilistic powerdomain construction. In M. Main, A. Melton, M. Mislove, and D. Schmidt, editors, Mathematical Foundations of Programming Language Semantics, volume 298 of Lecture Notes in Computer Science, pages 213–233. Springer Verlag, 1988.
[JP89] C. Jones and G. Plotkin. A probabilistic powerdomain of evaluations. In Proceedings of the 4th Annual Symposium on Logic in Computer Science, pages 186–195. IEEE Computer Society Press, 1989.
[Jon90] C. Jones. Probabilistic Non-Determinism. PhD thesis, University of Edinburgh, Edinburgh, 1990. Also published as Technical Report No. CST63-90.

During my own incursion into agent foundations and game theory, I also bumped into this exact obstacle — namely, that there is no obvious way to equip $Δ$ with a least-fixed-point constructor ${Fix}_{X}^{Δ} : (X \to Δ X) \to Δ X$ . In contrast, we can equip $P$ with a LFP constructor ${Fix}_{X}^{P} : (X \to P X) \to P X, g \mapsto {x \in X : x \in g (x)}$ .

One trick is to define ${Fix}_{X}^{Δ} (g)$ to be the distribution $π \in Δ X$ which maximises the entropy $H (π)$ subject to the constraint $g^{Δ} (π) = π$ .

A maximum entropy distribution $π^{*}$ exists, because —
- For $g : X \to Δ X$ , let $g^{Δ} : Δ X \to Δ X$ be the lift via the $Δ$ monad, and let $G = {π \in Δ X | g^{Δ} (π) = π}$ be the set of fixed points of $g^{Δ}$ .
- $Δ X$ is Hausdorff and compact, and $g^{Δ} : Δ X \to Δ X$ is continuous, so $G = {π \in Δ X : π = g^{Δ} (π)}$ is compact.
- $H : Δ X \to R$ is continuous, and $G \subseteq Δ X$ is compact, so $H$ obtains a maximum $π^{*}$ in $G$ .
Moreover, $π^{*}$ must be unique, because —
- $G$ is a convex set, i.e. if $π_{1} = g^{Δ} (π_{1})$ and $π_{2} = g^{Δ} (π_{2})$ then $λ_{1} π_{1} + λ_{2} π_{2} = g^{Δ} (λ_{1} π_{1} + λ_{2} π_{2})$ for all $λ_{1} + λ_{2} = 1$ .
- $H : Δ X \to R$ is strictly concave, i.e. $H (λ_{1} π_{1} + λ_{2} π_{2}) \geq λ_{1} H (π_{1}) + λ_{2} H (π_{2})$ for all $λ_{1} + λ_{2} = 1$ , and moreover the inequality is strict if $π_{1} \neq π_{2}$ and $λ_{1}, λ_{2} > 0$ .
- Hence if $π_{1}^{*}, π_{2}^{*} \in G$ both obtain the maximum entropy, then $π_{1}^{*} \neq π_{2}^{*} ⟹ H (0.5 π_{1} + 0.5 π_{2}) > 0.5 H (π_{1}) + 0.5 H (π_{2})$ , a contradiction.

The justification here is the Principle of Maximum Entropy:

Given a set of constraints on a probability distribution, then the “best” distribution that fits the data will be the one of maximum entropy.

More generally, we should define ${Fix}_{X}^{Δ} (g)$ to be the distribution $π \in Δ X$ which minimises cross-entropy $D_{K L} (π | | π_{0})$ subject to the constraint $π = g (π)$ , where $π_{0}$ is some uninformed prior such as Solomonoff. The previous result is a special case by considering $π_{0}$ to be the uniform prior. The proof generalises by noting that $D_{K L} (- | | π_{0}) : Δ X \to R$ is continuous and strictly convex. See the Principle of Minimum Discrimination.

Ideally, we'd like ${Fix}^{P}$ and ${Fix}^{Δ}$ to "coincide" modulo the maps $Supp : Δ X \to P X$ , i.e. $Supp ({Fix}^{Δ} (g)) = {Fix}^{P} (Supp \circ g)$ for all $g : X \to Δ X$ . Unfortunately, this isn't the case — if $g : H \mapsto 0.5 \cdot | H ⟩ + 0.5 \cdot | T ⟩, T \mapsto | T ⟩$ then ${Fix}^{P} (Supp \circ g) = {H, T}$ but $Supp ({Fix}^{Δ} (g)) = {T}$ .

Alternatively, we could consider the convex sets of distributions over $X$ .

Let $C (X)$ denote the set of convex sets of distributions over $X$ . There is an ordering $\leq$ on $C (X)$ where $A \leq B ⟺ A \supseteq B$ . We have a LFP operator ${Fix}_{X}^{C} : (X \to C X) \to C X$ via $g \mapsto ⋃ {S \in C X : g_{X}^{C} (S) = S}$ where $g^{C} : C X \to C X, S \mapsto {\sum_{i = 1}^{n} α_{i} \cdot π_{i} | π_{i} \in g (x_{i}), \sum_{i = 1}^{n} α_{i} | x_{i} ⟩ \in S}$ is the lift of $g : X \to C X$ via the $C$ monad.

[-]Nisan10moΩ130

Thanks! For convex sets of distributions: If you weaken the definition of fixed point to , then the set ${S \in C X : g^{C} (S) = S}$ has a least element which really is a least fixed point.

[-]romeostevensit3yΩ230

Tangential, but did you ever happen to read statistical physics of human cooperation?

[-]Nisan3yΩ120

No, I just took a look. The spin glass stuff looks interesting!

[-]romeostevensit3yΩ130

Are we talking about the same thing?

https://www.sciencedirect.com/science/article/am/pii/S0370157317301424

[-]Nisan3yΩ120

Yep, I skimmed it by looking at the colorful plots that look like Ising models and reading the captions. Those are always fun.

[-]Charlie Steiner3yΩ120

I have a question about this entirely divorced from practical considerations. Can we play silly ordinal games here?

If you assume that the other agent will take the infinite-order policy, but then naively maximize your expected value rather than unrolling the whole game-playing procedure, this is sort of like . So I guess my question is, if you take this kind of dumb agent (that still has to compute the infinite agent) as your baseline and then re-build an infinite tower of agents (playing other agents of the same level) on top of it, does it reconverge to $A_{\infty}$ or does it converge to some weird $A_{ω 2}$ ?

[-]Nisan3yΩ120

I think you're saying , right? In that case, since $A_{0}$ embeds into $A_{ω}$ , we'd have $A_{ω + 1}$ embedding into $A_{ω}$ . So not really a step up.

If you want to play ordinal games, you could drop the requirement that agents are computable / Scott-continuous. Then you get the whole ordinal hierarchy. But then we aren't guaranteed equilibria in games between agents of the same order.

I suppose you could have a hybrid approach: Order $ω + 1$ is allowed to be discontinuous in its order- $ω$ beliefs, but higher orders have to be continuous? Maybe that would get you to $ω 2$ .

Moderation Log