Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

Superrationality in arbitrary games

3paulfchristiano

0Vanessa Kosoy

0Vanessa Kosoy

0Stuart_Armstrong

0Vanessa Kosoy

0paulfchristiano

New Comment

Another case which *is* important is games with infinite strategy spaces. In particular see my recent post where I use the notion of thermodynamic equilibrium in a way which cannot be trivially mimicked (I think) by a proper equilibrium.

Good question. One obvious difference is that thermodynamic equilibria are naturally defined for games with appropriate continuous strategy spaces, but it's probably not important for intelligent agent theory.

It seems plausible that for zero temperature they don't.

For positive temperature thermodynamic equilibria are more rigid than -proper profiles but I'm not sure it's important.

Very interesting! I'm eager to see where this will lead.

If the PD has different values (thus is non-symmetric), but still has the same ordinal relations on outcomes (player 1 prefers (D,C) to (C,C) to (D,D) to (C,D), etc...), is (C,C) still the unique level 1 metathreat equilibrium?

And what inspired this "thermodynamic" approach?

The metathreat equilibrium definitely depends on the cardinal utilities, for example if then the "nicer bots" are not an equilibrium since both agents want to lower the probability of cooperation.

The thermodynamic approach is needed since we need a way to rule out some of the metathreat Nash equilibria. A pair of defect bots *is* a Nash equilibrium in the level 1 metathreat game but this equilibrium is "unstable" since there are bots that are equally good against the defect bot but do better against other bots. So we need some notion of equilibrium stability, but the usual trembling hand is not good enough since for trembling hand equilibrium stability wrt an arbitrary totally mixed perturbation is sufficient (and it is possible to construct perturbations that stabilize "bad" equilibria). The thermodynamic equilibrium works by considering perturbations in which lower utility alternatives get higher order infinitesimal probabilities. Together with the ease to achieve this equilibrium in a natural decision theory, it becomes an attractive choice.

This post is an informal summary of ideas that Scott and I discussed on my recent visit to Los Angeles, meant to serve as a temporary "savepoint" and reference. The approach outlined here, albeit very promising in my opinion, requires a more work to become mature.## Abstract

We introduce a new type of game theoretic equilibrium, which is a candidate game theoretic desideratum for decision theory. This can be regarded as a generalization of Hofstadter's superrationality to arbitrary games. This equilibrium can be interpreted as the outcome of games between transparent agents running CDT with restricted self-modification but also as the outcome of games between transparent agents employing a certain type of logical counterfactuals. We suggest it should be possible to use the formalism of optimal predictors to rigorously construct a logical decision theory which implements the latter interpretation.

## Introduction

It is commonly assumed that two agents running the correct decision theory would cooperate in a transparent Prisoner's dilemma, that is, in a game of prisoner's dilemma in which each player knows the source code of the other player1. It is possible to argue this behavior will indeed arise in UDT, at least for sufficiently similar players, but the argument is unrigorous since an agreed upon formalisation of UDT is lacking. It is also possible to formally prove this behavior for certain set-ups of proof-based decision theory. However, it is not presently clear what is the correct desideratum for decision theory in arbitrary games, especially games that are asymmetric.

The correct decision theory is likely to use logical counterfactuals but current understanding of logical counterfactuals is poor. On the other hand, it is much easier to model causal counterfactuals. Although CDT often leads to bad outcomes (and in particular defects in prisoner's dilemma) it is reflectively unstable and given the chance is expected to self-modify into some kind of better decision theory2. We will now see that a naive formalisation of this idea fails to achieve satisfactory game-theoretic properties. This failure is later amended by restricting the "self-modification" to only produce programs of a certain form, arriving at what we hope is the correct desideratum.

## Self-modifying CDT

Fix a universal Turing machine to be used for representing Turing machines as words (programs).

Consider a game G in normal form between n players. Denote Σi the set of pure strategies for the i-th player and ui:∏nj=1Σj→R the utility function of the i-th player.

Given t∈N, we define the

associated self-modification gameGSM(t) as follows. The pure strategies of the i-th player are ΣSM(t)i:={0,1}t. These strategy represent programs that receive the strategies of the other players as input, run for t time steps and produce a strategy in the original game, possibly using random coin flips. The utility function of the i-th player is given byuSM(t)i(π1,π2…πn):=EUnt[ui(evt(π1;π2,π3…πn,x1),evt(π2;π1,π3…πn,x2)…evt(πn;π1,π2…πn−1,xn))]

For example, denoting the prisoner's dilemma GPD and taking t>>0, there are Nash equilibria in GSM(t)PD which corresponds to the outcome (C,C) in GPD. For example take q to be "cliquebot" program, constructed using quining:

q(π):={C if q is a prefix of πD otherwise

Then obviously (q1,q2) is a Nash equilibrium of GSM(t)PD for any q1,q2∈{0,1}t s.t. q is a prefix of q1 and q2. However, the "defectbot" d(π):=D also gives a Nash equilibrium. More generally, for any p∈[0,1] a dyadic rational we can define the p-cliquebot to be

qp(π):={C with probability p if qp is a prefix of πD otherwise

The defectbot equilibrium turns out to be unstable in the sense that it's not a thermodynamic equilibrium at zero temperature (see below). However p-cliquebots for p>0 yield equilibria which are stable in this sense. This is a deficiency of the formalism which is avoided by the metathreat hierarchy (see below).

## Thermodynamic Game Equilibria

Consider a game G in normal form as before. Denote Δi the set of mixed strategies of the i-th player i.e.

Δi:={μ:Σi→[0,1]∣∑a∈Σiμ(a)=1}

Consider T∈(R>0)n. A

thermodynamic equilibrium of G at temperature Tis μ:=(μ1,μ2…μn)∈∏ni=1Δi s.t.μi(a)=Z−1iexp(Eμ[ui(a1,a2…a…an)]Ti)

where Zi∈R are normalization constants. Alternatively, μ is a thermodynamic equilibrium iff

μi=argmaxμ∈Δi(Eμ[ui(a1,a2…an)]+TiS[μi])

where S[μi] is the information-theoretic entropy of μi.

At least one thermodynamic equilibrium exists for any game and temperature, as a consequence of Brouwer's fixed-point theorem.

μ∈∏ni=1Δi is called a

thermodynamic equilibrium of G at zero temperaturewhen there is a sequence {(μk∈∏ni=1Δi,Tk∈(R>0)n)}k∈N s.t. limk→∞μk=μ, limk→∞Tk=0 and μk is a thermodynamic equilibrium at temperature Tk.At least one thermodynamic equilibrium at zero temperature exists for any game, as a consequence of the Bolzano-Weierstrass theorem. Any thermodynamic equilibrium at zero temperature is a trembling hand equilibrium but the converse is false in general. We believe that for finite extensive-form games, a thermodynamic equilibrium at zero temperature is a subgame perfect equilibrium.

We believe (d,d) is a trembling hand equilibrium in GSM(t)PD but not a thermodynamic equilibrium at zero temperature. On the other hand (qp,qp) is a thermodynamic equilibrium at zero temperature for p>0 (when random bits are appended to qp until it becomes a string of length t).

Thermodynamic equilibria are interesting because it is possible to implement them using a natural decision theory (that is, a decision theory that treats everything as a 1-player game by modeling the other players). Games between transparent CDT agents using reflective oracles lead to Nash equilibria, and it should be straightforward to "thermalize" the decision theory and get thermodynamic equilibria at finite temperature. A similar result should be possible using optimal predictor schemes instead of reflective oracles.

## The Metathreat Hierarchy

The p-cliquebot can be informally regarded as an infinite tower of threats of increasing level of "meta". The 1st level says any strategy other than cooperation with probability p will be answered by defection. The k+1-st level says that failure to produce the k-th level threat will be answered by defection. This way the players blackmail each other into complying with this suboptimal strategy. However, if we had a way to truncate the tower at any finite level, it would apart: there would be no justification for each player to produce the highest level threat and it would unwind downwards recursively. The metathreat hierarchy is a formal method to implement this truncation.

Consider a game G in normal form as before. Fix m∈N. The

associated level 0 metathreat game with m coin-flipsis just G itself: GMT(0,m):=G. We define theassociated level k+1 metathreat game with m coin-flipsGMT(k+1,m) recursively. The pure strategies of the i-th player areΣMT(k+1,m)i:={π:∏j≠iΣMT(k,m)j×{0,1}m→ΣMT(k,m)i∣∃a∈ΣMT(k,m)i∀b∈∏j≠iΣMT(k,m)j∃x∈{0,1}m:π(b,x)=a}

Consider π∈∏ni=1ΣMT(k+1,m)i. Then there is μπ∈∏ni=1ΔMT(k,m)i unique s.t.

μπi(a)=Prμπ×Um[πi(b,x)=a]

It exists because of Brouwer's fixed-point theorem and is unique because of standard results about Markov chains (π defines a Markov chain on ∏ni=1ΣMT(k,m)i, μπ is a stationary distribution of this chain and it has to be unique because of the properties of the chain that follow from the condition on πi). This allows us to define the utility functions

uMT(k+1,m)i(π)=Eμπ[uMT(k,m)i(a)]

Denote ΓMT(k,m) the space of correlated mixed outcomes of GMT(k,m):

ΓMT(k,m):={ν:n∏i=1ΣMT(k,m)i→[0,1]∣∑π∈∏ni=1ΣMT(k,m)iν(π)=1}

There is a natural sense in which ∏ni=1ΔMT(k,m)i⊆ΓMT(k,m).

For k>j we define the unwinding operator ωMT(k→j,m):ΓMT(k,m)→ΓMT(j,m) recursively by

ωMT(k+1→k,m)(ν):=∑π∈∏ni=1ΣMT(k+1,m)iν(π)μπ

ωMT(k→j,m)(ν):=ωMT(j+1→j,m)(ωMT(k→j+1,m)(ν))

Consider ν∈Γ, a correlated mixed outcome of G. ν is called a

level k metathreat equilibriumwhen there is a sequence {(mi∈N,μi∈∏ni=1ΔMT(k,mi)i)}i∈N s.t. limi→∞mi=∞, limi→∞ωMT(k→0,mi)(μi)=ν and μi is a thermodynamic equilibrium of GMT(k,mi) at zero temperature. The Bolzano-Weierstrass theorem ensures such ν exists for any game and value of k.We believe that for m>>0, GMT(1,m)PD has a unique thermodynamic equilibrium at zero temperature which is a mixture of π s.t.

PrUm[π(C,x)=C]=1

PrUm[π(D,x)=C]=2−m

Therefore (C,C) is the unique level 1 metathreat equilibrium of GPD.

Some questions for further research:

One decision theoretic interpretation of metathreat equilibria is in terms of self-modifying CDT. The elements of ΣMT(k,m) can be interpreted as programs that use reflective oracles or optimal predictor schemes to make predictions about the opposing programs and decide on a strategy accordingly. Thus our self-modification is restricted by the programs in this class.

Another interpretation is in terms of logical decision theory, at least for k=1. Such a decision theory consists of two stages. In the first stage the agent computes logical conditional expected utility (using an optimal predictor scheme) where the condition corresponds to selection of a given element in π∈ΣMT(k,1). It then performs a sequence of logical coin flips to decide on π according to a probability distribution designed to yield a thermodynamic equilibrium. In the second stage the agent uses an optimal predictor scheme to guess the strategy of its opponent and acts according to π, this time providing so much resources to the optical predictor scheme that the result of the first stage for the opposing agent is certain. Such a decision theory is not entirely natural since it explicitly uses the concept of "opponent actions", but hopefully it can be used as a foundation for constructing a fully natural logical decision theory.

1 The historical origin of this idea is Hofstadter "superrationality".

2 We don't expect the limit of this self-modification process to be entirely reasonable since in some situations self-modification cannot save CDT, for example if self-modification only begins after the agent is inspected by Omega in Newcomb's paradox. However, it seems plausible to specify assumptions under which self-modifying CDT will make the right choices e.g. Newcomb's paradox where self-modification is allowed before inspection by Omega.