Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

My take on higher-order game theory

30th Nov 2021

5Cleo Nardo

3Nisan

3romeostevensit

2Nisan

3romeostevensit

2Nisan

2Charlie Steiner

2Nisan

New Comment

8 comments, sorted by Click to highlight new comments since: Today at 2:04 PM

Hey Nisan. Check the following passage from *Domain Theory (Samson Abramsky and Achim Jung)**. *This might be helpful for equipping with an appropriate domain structure. (You mention **[JP89]** yourself.)

We should also mention the various attempts to define a probabilistic version of the powerdomain construction, see [SD80, Mai85, Gra88, JP89, Jon90].

[SD80]N. Saheb-Djahromi. CPO’s of measures for nondeterminism. Theoretical Computer Science, 12:19–37, 1980.[Mai85]M. Main. Free constructions of powerdomains. In A. Melton, editor, Mathematical Foundations of Programming Semantics, volume 239 of Lecture Notes in Computer Science, pages 162–183. Springer Verlag, 1985.[Gra88]S. Graham. Closure properties of a probabilistic powerdomain construction. In M. Main, A. Melton, M. Mislove, and D. Schmidt, editors, Mathematical Foundations of Programming Language Semantics, volume 298 of Lecture Notes in Computer Science, pages 213–233. Springer Verlag, 1988.[JP89]C. Jones and G. Plotkin. A probabilistic powerdomain of evaluations. In Proceedings of the 4th Annual Symposium on Logic in Computer Science, pages 186–195. IEEE Computer Society Press, 1989.[Jon90]C. Jones. Probabilistic Non-Determinism. PhD thesis, University of Edinburgh, Edinburgh, 1990. Also published as Technical Report No. CST63-90.

During my own incursion into agent foundations and game theory, I also bumped into this *exact* obstacle — namely, that there is no obvious way to equip with a least-fixed-point constructor . In contrast, we can equip with a LFP constructor .

One trick is to define to be the** **distribution which maximises the entropy subject to the constraint .

- A maximum entropy distribution exists, because —
- For , let be the lift via the monad, and let be the set of fixed points of .
- is Hausdorff and compact, and is continuous, so is compact.
- is continuous, and is compact, so obtains a maximum in .

- Moreover, must be unique, because —
- is a convex set, i.e. if and then for all .
- is strictly concave, i.e. for all , and moreover the inequality is strict if and .
- Hence if both obtain the maximum entropy, then , a contradiction.

The justification here is the **Principle of Maximum Entropy:**

Given a set of constraints on a probability distribution, then the “best” distribution that fits the data will be the one of maximum entropy.

More generally, we should define to be the** **distribution which minimises cross-entropy subject to the constraint , where is some uninformed prior such as Solomonoff. The previous result is a special case by considering to be the uniform prior. The proof generalises by noting that is continuous and strictly convex. See the Principle of Minimum Discrimination.

Ideally, we'd like and to "coincide" modulo the maps , i.e. for all . Unfortunately, this isn't the case — if then but .

Alternatively, we could consider the convex sets of distributions over .

Let denote the set of convex sets of distributions over . There is an ordering on where . We have a LFP operator via where is the lift of via the monad.

Thanks! For convex sets of distributions: If you weaken the definition of fixed point to , then the set has a least element which really is a least fixed point.

Are we talking about the same thing?

https://www.sciencedirect.com/science/article/am/pii/S0370157317301424

Yep, I skimmed it by looking at the colorful plots that look like Ising models and reading the captions. Those are always fun.

I have a question about this entirely divorced from practical considerations. Can we play silly ordinal games here?

If you assume that the other agent will take the infinite-order policy, but then naively maximize your expected value rather than unrolling the whole game-playing procedure, this is sort of like . So I guess my question is, if you take this kind of dumb agent (that still has to compute the infinite agent) as your baseline and then re-build an infinite tower of agents (playing other agents of the same level) on top of it, does it reconverge to or does it converge to some weird ?

I think you're saying , right? In that case, since embeds into , we'd have embedding into . So not really a step up.

If you want to play ordinal games, you could drop the requirement that agents are computable / Scott-continuous. Then you get the whole ordinal hierarchy. But then we aren't guaranteed equilibria in games between agents of the same order.

I suppose you could have a hybrid approach: Order is allowed to be discontinuous in its order- beliefs, but higher orders have to be continuous? Maybe that would get you to .

This is how I currently think about higher-order game theory, the study of agents thinking about agents thinking about agents....

This post doesn't add any new big ideas beyond what was already in the post by Diffractor linked above. I just have a slightly different perspective that emphasizes the "metathreat" approach and the role of nondeterminism.

This is a work in progress. There's a bunch of technical work that must be done to make this rigorous. I'll save the details for the last section.

## Multiple levels of strategic thinking

Suppose you're an agent with accurate beliefs about your opponents. It doesn't matter where your beliefs come from; perhaps you have experience with these opponents, or perhaps you read your opponents' source code and thought about it. Your beliefs are accurate, although for now we'll be vague about what exactly "accurate" means.

In a game of Chicken, you may want to play Swerve, since that's the maximin strategy. This is zeroth-order thinking because you don't need to predict what your opponent will do.

Or maybe you predict what your opponent will do and play the best response to that, Swerving if they'll go Straight and going Straight if they'll Swerve. This is first-order thinking.

Or maybe you know your opponent will use first-order thinking. So you resolve to go Straight, your opponent will predict this, and they will Swerve. This is second-order thinking.

Beyond second-order, Chicken turns into a commitment race.

In Prisoner's Dilemma, zeroth- and first-order thinking recommend playing Defect, as that's a dominant strategy. Second-order thinking says that it's good to Cooperate on the margin if by doing so you cause your opponent to also Cooperate on the margin. Let's say it's worth it to Cooperate with probability p if your opponent thereby Cooperates with probability more than p2 (although the exact numbers depend on the payoff matrix).

If you're a third-order agent, you might commit to Cooperating with probability equal to 51% times the probability that your opponent Cooperates. This causes your opponent to Cooperate with certainty, and thus you're bound to Cooperate with probability 51%. This is a Pareto-optimal outcome that favors you heavily.

## Fixed points

So far we've talked about agents of order n+1 playing against agents of order n. What about games between agents of equal order?

If two first-order agents play the best response to each other, then they play a Nash equilibrium. Let's see how this works in Matching Pennies, which has a mixed-strategy Nash equilibrium:

If the Even player thinks Odd is more likely to play Heads, then Even will play Heads. If the Even player thinks Odd is more likely to play Tails, then Even will play Tails. If Even thinks Odd is equally likely to play Heads or Tails, then Even will be indifferent. So if both players have well-calibrated beliefs, we must be in the equilibrium where both players are indifferent and both players play 12 Heads, 12 Tails.

(What if Even always plays Heads when it's indifferent? In that case, there is no equilibrium, which is awkward for our theory. But this behavior requires Even to represent its beliefs as real numbers. If we make the more realistic assumption that Even's beliefs are arbitrary-precision numbers, then it can never be sure that it's on the decision boundary. It gets stuck in a for loop, querying its belief for ever more precision in case it's not exactly 0.5000…. (Then some other process, like infinitesimal errors in the beliefs, causes the agent to output the equilibrium strategy of 12 Heads, 12 Tails.))

## The space of all agents

Now we're ready to define our types. (I'll leave some details to the last section.) The space A0 of 0th-order agents is just the set of actions or pure strategies.

^{[1]}The space of (n+1)st-order agents is the space of functions on probability distributions on An. That is, An+1=[ΔAn→ΔAn].

So an (n+1)st-order agent is a stochastic function from its nth-order beliefs to a successor agent of order n. Creating a successor agent is a form of currying.

## Playing a game

Two agents an+1,bn+1 in An+1 interact as follows:

Find distributions cn,dn on An such that cn=an+1(dn) and dn=bn+1(cn).

So (cn,dn) is a fixpoint representing consistent nth-order beliefs that the agents have about each other.

^{[2]}Sample an from cn and sample bn from dn.

Repeat steps 1 and 2 on an and bn until we end up with a0,b0 in A0.

Calculate the payoff u(a0,b0) to the first player.

Now take the expectation over all samples in step 2, and take the minimum over all choices of fixpoint in step 1. This gives a lower bound on the first player's expected payoff.

## Some agents

Now we can write down formulas for the kind of higher-order reasoning discussed earlier.

The first-order agent that always plays the best response to its opponent is defined by a1(d0)=argmaxa0E[u(a0,d0)]. It achieves a Nash equilibrium when playing against itself.

A second-order agent that always plays the best response to its opponent is a2(d1)(d0)=argmaxc0E[u(c0,d1(c0))].

^{[3]}This agent also achieves a Nash equilibrium when playing against itself. In a future post I'll construct a different 2nd-order agent that both plays the best response to any first-order agent, and also achieves a Pareto-optimal outcome when playing against itself.## Summary

## Infinity

There's a sense in which the 0th-order maximin agent is an approximation of the 1st-order best-response agent, which is itself an approximation of the 2nd-order argmaxing agent. Standard domain-theoretical constructions should allow construction of infinite-order agents, including an infinite-order strategic agent that subsumes all finite-order best-response behavior. Fortunately the hierarchy ends at infinity; A∞ has type [ΔA∞→ΔA∞]. However, whether and how this works depends a lot on the technical details I haven't worked out, so I hesitate to say any more about it.

## Correlated strategies

This framework doesn't have correlated strategies built in. But I have a hunch that agents of sufficiently high order can play correlated strategies anyways.

## Technical details

Everything is a domain; in particular, a dcpo. X0 is also a lower semilattice. The mapping spaces [X→Y] are spaces of Scott-continuous functions.

Δ is some kind of probabilistic powerdomain. An approach similar to (Jones & Plotkin 1989) might just work: ΔX would be the set of probability measures on the Borel σ-algebra of X which restrict to continuous evaluations on open sets.

But it's possible ΔX will have to contain sets of probability measures, in order to express nondeterministic choice of fixpoint. Possible approaches include (Mislove 2000), (Tix, Keimel, Plotkin 2009), (Goubault-Larrecq 2007), and of course (Appel and Kosoy 2021). There's also the field of quantitative semantics, which I don't know anything about.

I'd like to have fixpoints (cn,dn) which are maximal elements of ΔXn×ΔXn. This might follow easily from the Kakutani fixed point theorem, but it really depends on how Δ is defined.

Another issue is that higher-order strategic agents have to compute fixpoints themselves (in order to calculate the expected value of playing a lower-order policy). I'm not sure if this can be done continuously. It might require the use of a reflective oracle somehow.

This work was supported by a grant from the Center on Long-Term Risk. Thanks to Alex Appel, Adele Dewey-Lopez, and Patrick LaVictoire for discussions about these ideas.Actually, we want the set of nonempty sets of actions, so an agent can express indifference between actions. ↩︎

Really we want cn and dn to be maximal elements of the poset ΔAn such that cn⊒an+1(dn) and dn⊒bn+1(cn). See the section on technical details. ↩︎

Actually you can't compute an argmax over a function of a continuous variable. The best you can do is a distribution that's biased towards higher-payoff outcomes, like quantilization or best-of-k sampling. I'll write more about this in the future. ↩︎