Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

This post is an informal summary of ideas that Scott and I discussed on my recent visit to Los Angeles, meant to serve as a temporary "savepoint" and reference. The approach outlined here, although very promising in my opinion, requires more work to become mature.

Abstract

We introduce a new type of game-theoretic equilibrium, which is a candidate game-theoretic desideratum for decision theory. It can be regarded as a generalization of Hofstadter's superrationality to arbitrary games. This equilibrium can be interpreted as the outcome of games between transparent agents running CDT with restricted self-modification, but also as the outcome of games between transparent agents employing a certain type of logical counterfactual. We suggest it should be possible to use the formalism of optimal predictors to rigorously construct a logical decision theory which implements the latter interpretation.

Introduction

It is commonly assumed that two agents running the correct decision theory would cooperate in the transparent Prisoner's Dilemma, that is, a Prisoner's Dilemma in which each player knows the source code of the other player. It is possible to argue that this behavior will indeed arise in UDT, at least for sufficiently similar players, but the argument is unrigorous since an agreed-upon formalisation of UDT is lacking. It is also possible to formally prove this behavior for certain set-ups of proof-based decision theory. However, it is not presently clear what the correct desideratum is for decision theory in arbitrary games, especially games that are asymmetric.

The correct decision theory is likely to use logical counterfactuals, but the current understanding of logical counterfactuals is poor. On the other hand, it is much easier to model causal counterfactuals. Although CDT often leads to bad outcomes (and in particular defects in the Prisoner's Dilemma), it is reflectively unstable and, given the chance, is expected to self-modify into some kind of better decision theory. We will now see that a naive formalisation of this idea fails to achieve satisfactory game-theoretic properties. This failure is later amended by restricting the "self-modification" to only produce programs of a certain form, arriving at what we hope is the correct desideratum.

Self-modifying CDT

Fix a universal Turing machine to be used for representing Turing machines as words (programs).

Consider a game $G$ in normal form between $N$ players. Denote $S_i$ the set of pure strategies of the $i$-th player and $u_i: \prod_{j=1}^N S_j \to \mathbb{R}$ the utility function of the $i$-th player.
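
As a concrete running example, here is a minimal encoding of a two-player game in normal form, instantiated with conventional Prisoner's Dilemma payoffs. The specific numbers (3, 0, 5, 1) and the dictionary representation are my own illustrative choices; the post does not fix them.

```python
# A two-player normal-form game as a payoff dictionary.
# Keys are pure strategy profiles (s1, s2); values are the pair (u1, u2).
PD_STRATEGIES = ["C", "D"]

PD_PAYOFFS = {
    ("C", "C"): (3, 3),
    ("C", "D"): (0, 5),
    ("D", "C"): (5, 0),
    ("D", "D"): (1, 1),
}

def utility(i, profile):
    """Utility of player i (0 or 1) at a pure strategy profile."""
    return PD_PAYOFFS[profile][i]

for s1 in PD_STRATEGIES:
    for s2 in PD_STRATEGIES:
        print((s1, s2), PD_PAYOFFS[(s1, s2)])
```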

Given $t \in \mathbb{N}$, we define the associated self-modification game $SM_t(G)$ as follows. The pure strategies of the $i$-th player are words of a fixed length, interpreted as programs for the universal Turing machine. These strategies represent programs that receive the programs of the other players as input, run for at most $t$ time steps and produce a strategy in the original game, possibly using random coin flips. The utility of the $i$-th player is the expected value of $u_i$ on the resulting strategy profile in $G$.
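
In symbols, one way to write this utility (my own reconstruction and notation, not a formula taken from the post: $\mathrm{ev}_t(q_i; q_{-i})$ denotes the random strategy in $S_i$ produced by running program $q_i$ on the other players' programs for at most $t$ steps) is

$$u_i^{SM_t}(q_1, \dots, q_N) = \mathbb{E}\big[\, u_i\big(\mathrm{ev}_t(q_1; q_{-1}), \dots, \mathrm{ev}_t(q_N; q_{-N})\big) \,\big],$$

where the expectation is over the programs' internal coin flips (with some default strategy substituted if a program fails to halt in time).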

For example, denoting $PD$ the Prisoner's Dilemma and taking $t$ sufficiently large, there are Nash equilibria of $SM_t(PD)$ which correspond to the outcome $(C,C)$ in $PD$. For example, take $Q$ to be a "cliquebot" program, constructed using quining: it compares the opponent's program to its own source code, cooperates on an exact match and defects otherwise.

Then obviously $(q,q)$ is a Nash equilibrium of $SM_t(PD)$ for any $q$ s.t. $Q$ is a prefix of $q$ and $t$ is sufficiently large. However, the "defectbot", which defects regardless of its input, also gives a Nash equilibrium. More generally, for any dyadic rational $\alpha$ we can define the $\alpha$-cliquebot, which cooperates with probability $\alpha$ against an exact copy of itself and defects otherwise.
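
A minimal sketch of these bots in Python, under the simplifying assumptions that a "program" is a function which is handed its own source as an argument (standing in for the quining construction) and that the time bound $t$ is ignored; the helper names below are mine, not the post's.

```python
import random

def cliquebot(my_source, opponent_source):
    """Cooperate iff the opponent's source is an exact copy of mine."""
    return "C" if opponent_source == my_source else "D"

def defectbot(my_source, opponent_source):
    """Defect unconditionally."""
    return "D"

def alpha_cliquebot(alpha):
    """Cooperate with probability alpha against an exact copy, else defect."""
    def bot(my_source, opponent_source):
        if opponent_source == my_source:
            return "C" if random.random() < alpha else "D"
        return "D"
    return bot

bot = alpha_cliquebot(0.75)
print(cliquebot("src_Q", "src_Q"), defectbot("src_D", "src_Q"), bot("src_A", "src_A"))
```

In the actual formalism each bot would obtain its own source via quining, respect the time bound $t$, and be padded with random bits to the full strategy length, as discussed below; all of this is elided here.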

The defectbot equilibrium turns out to be unstable in the sense that it is not a thermodynamic equilibrium at zero temperature (see below). However, $\alpha$-cliquebots for $\alpha < 1$ yield equilibria which are stable in this sense. This is a deficiency of the formalism which is avoided by the metathreat hierarchy (see below).

Thermodynamic Game Equilibria

Consider a game $G$ in normal form as before. Denote $\Delta S_i$ the set of mixed strategies of the $i$-th player, i.e.

$$\Delta S_i = \Big\{ \sigma_i: S_i \to [0,1] \;\Big|\; \sum_{s \in S_i} \sigma_i(s) = 1 \Big\}$$

Consider $T > 0$. A thermodynamic equilibrium of $G$ at temperature $T$ is $\sigma \in \prod_i \Delta S_i$ s.t. for every $i$ and $s \in S_i$

$$\sigma_i(s) = \frac{1}{Z_i} \exp\!\left( \frac{\mathbb{E}_{\sigma_{-i}}[u_i(s, \cdot)]}{T} \right)$$

where $Z_i$ are normalization constants. Alternatively, $\sigma$ is a thermodynamic equilibrium iff for every $i$

$$\sigma_i \in \operatorname*{arg\,max}_{\sigma_i' \in \Delta S_i} \left( \mathbb{E}_{\sigma_i' \times \sigma_{-i}}[u_i] + T \, H(\sigma_i') \right)$$

where $H(\sigma_i')$ is the information-theoretic entropy of $\sigma_i'$.
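
To see why the two characterizations agree (a short standard derivation in the notation above), maximize the entropy-regularized payoff of player $i$ over $\sigma_i'$ with a Lagrange multiplier $\lambda$ for normalization:

$$\frac{\partial}{\partial \sigma_i'(s)} \Big( \sum_{s'} \sigma_i'(s')\, \mathbb{E}_{\sigma_{-i}}[u_i(s',\cdot)] - T \sum_{s'} \sigma_i'(s') \log \sigma_i'(s') + \lambda \big( \textstyle\sum_{s'} \sigma_i'(s') - 1 \big) \Big) = \mathbb{E}_{\sigma_{-i}}[u_i(s,\cdot)] - T \big( \log \sigma_i'(s) + 1 \big) + \lambda = 0,$$

so $\sigma_i'(s) \propto \exp\big(\mathbb{E}_{\sigma_{-i}}[u_i(s,\cdot)]/T\big)$, which is exactly the Gibbs form above with normalization constant $Z_i$.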

At least one thermodynamic equilibrium exists for any game and temperature, as a consequence of Brouwer's fixed-point theorem.

$\sigma$ is called a thermodynamic equilibrium of $G$ at zero temperature when there is a sequence $\{(T_j, \sigma^{(j)})\}_{j \in \mathbb{N}}$ s.t. $T_j \to 0$, $\sigma^{(j)} \to \sigma$ and $\sigma^{(j)}$ is a thermodynamic equilibrium at temperature $T_j$.

At least one thermodynamic equilibrium at zero temperature exists for any game, as a consequence of the Bolzano-Weierstrass theorem. Any thermodynamic equilibrium at zero temperature is a trembling hand equilibrium but the converse is false in general. We believe that for finite extensive-form games, a thermodynamic equilibrium at zero temperature is a subgame perfect equilibrium.
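
As an illustration, the following Python sketch approximates a thermodynamic equilibrium of a two-player game at positive temperature by iterating the Gibbs response map, then lowers the temperature to approach the zero-temperature limit. The Prisoner's Dilemma payoffs and the plain fixed-point iteration are my own illustrative choices, not part of the post.

```python
import math

# Prisoner's Dilemma payoffs (illustrative): payoffs[i][(s1, s2)] = utility of player i.
S = ["C", "D"]
payoffs = [
    {("C", "C"): 3, ("C", "D"): 0, ("D", "C"): 5, ("D", "D"): 1},  # player 0
    {("C", "C"): 3, ("C", "D"): 5, ("D", "C"): 0, ("D", "D"): 1},  # player 1
]

def expected_utility(i, s, sigma_other):
    """Expected utility of player i playing pure strategy s against the
    opponent's mixed strategy sigma_other (dict: strategy -> probability)."""
    if i == 0:
        return sum(sigma_other[t] * payoffs[0][(s, t)] for t in S)
    return sum(sigma_other[t] * payoffs[1][(t, s)] for t in S)

def gibbs_response(i, sigma_other, T):
    """Gibbs (softmax) response of player i at temperature T."""
    eu = {s: expected_utility(i, s, sigma_other) for s in S}
    m = max(eu.values())  # subtract the max for numerical stability
    w = {s: math.exp((eu[s] - m) / T) for s in S}
    Z = sum(w.values())
    return {s: w[s] / Z for s in S}

def thermodynamic_equilibrium(T, iterations=2000):
    """Approximate a thermodynamic equilibrium at temperature T by fixed-point iteration."""
    sigma = [{s: 1.0 / len(S) for s in S} for _ in range(2)]
    for _ in range(iterations):
        sigma = [gibbs_response(0, sigma[1], T), gibbs_response(1, sigma[0], T)]
    return sigma

for T in [1.0, 0.1, 0.01]:
    eq = thermodynamic_equilibrium(T)
    print(T, {s: round(eq[0][s], 4) for s in S}, {s: round(eq[1][s], 4) for s in S})
# As T -> 0 the equilibrium concentrates on (D, D): in the one-shot Prisoner's
# Dilemma itself defection dominates, so the zero-temperature limit is mutual defection.
```

For games with multiple equilibria the plain iteration may fail to converge; it suffices here only because the Prisoner's Dilemma has a dominant strategy.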

We believe the defectbot pair is a trembling hand equilibrium in $SM_t(PD)$ but not a thermodynamic equilibrium at zero temperature. On the other hand, the cliquebot pair is a thermodynamic equilibrium of $SM_t(PD)$ at zero temperature for sufficiently large $t$ (when random bits are appended to $Q$ until it becomes a string of the full strategy length).

Thermodynamic equilibria are interesting because it is possible to implement them using a natural decision theory (that is, a decision theory that treats everything as a 1-player game by modeling the other players). Games between transparent CDT agents using reflective oracles lead to Nash equilibria, and it should be straightforward to "thermalize" the decision theory and get thermodynamic equilibria at finite temperature. A similar result should be possible using optimal predictor schemes instead of reflective oracles.

The Metathreat Hierarchy

The $\alpha$-cliquebot can be informally regarded as an infinite tower of threats of increasing level of "meta". The 1st level says any strategy other than cooperation with probability $\alpha$ will be answered by defection. The $(n+1)$-st level says that failure to produce the $n$-th level threat will be answered by defection. This way the players blackmail each other into complying with this suboptimal strategy. However, if we had a way to truncate the tower at any finite level, it would fall apart: there would be no justification for each player to produce the highest-level threat, and the tower would unwind downwards recursively. The metathreat hierarchy is a formal method to implement this truncation.

Consider a game $G$ in normal form as before. Fix $k \in \mathbb{N}$. The associated level 0 metathreat game with $k$ coin-flips is just $G$ itself: $MT^0_k(G) = G$. We define the associated level $n+1$ metathreat game with $k$ coin-flips $MT^{n+1}_k(G)$ recursively. The pure strategies of the $i$-th player are reaction functions which map the opposing players' pure strategies of $MT^n_k(G)$, together with the outcomes of $k$ coin flips, to pure strategies of the $i$-th player in $MT^n_k(G)$, subject to a non-degeneracy condition that makes the construction below well-defined.

Consider a pure strategy profile $f = (f_1, \ldots, f_N)$ of $MT^{n+1}_k(G)$. Then there is a unique probability distribution $\mu_f$ on the pure strategy profiles of $MT^n_k(G)$ which is invariant under the players' reaction functions.

It exists because of Brouwer's fixed-point theorem and is unique because of standard results about Markov chains ($f$ defines a Markov chain on the pure strategy profiles of $MT^n_k(G)$, $\mu_f$ is a stationary distribution of this chain and it has to be unique because of the properties of the chain that follow from the condition on the $f_i$). This allows us to define the utility functions of $MT^{n+1}_k(G)$ as the expectations of the utility functions of $MT^n_k(G)$ with respect to $\mu_f$.
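
To make the construction concrete at level 1, here is a toy Python sketch. It is my own formalization of the idea, under the assumptions that a level-1 metathreat is a stochastic reaction function from the opponent's pure strategy to one's own, and that a small amount of noise (standing in for the $k$ coin flips) makes the induced Markov chain ergodic; the bot names and payoff values are mine.

```python
import itertools

S = ["C", "D"]
payoffs = {("C", "C"): (3, 3), ("C", "D"): (0, 5), ("D", "C"): (5, 0), ("D", "D"): (1, 1)}

# A level-1 metathreat, modeled as a stochastic reaction function:
# opponent's pure strategy -> probability distribution over own pure strategies.
# The noise eps stands in for the k coin flips and keeps the chain ergodic.
def imitate_bot(eps):
    """Answer the opponent's last move in kind, with probability 1 - eps."""
    return {"C": {"C": 1 - eps, "D": eps}, "D": {"C": eps, "D": 1 - eps}}

def defect_bot(eps):
    """Defect with probability 1 - eps, whatever the opponent did."""
    return {"C": {"C": eps, "D": 1 - eps}, "D": {"C": eps, "D": 1 - eps}}

def transition(f1, f2):
    """Markov chain on outcome profiles induced by a pair of reaction functions."""
    states = list(itertools.product(S, S))
    return states, {
        ((s1, s2), (t1, t2)): f1[s2][t1] * f2[s1][t2]
        for (s1, s2) in states for (t1, t2) in states
    }

def stationary(states, P, iterations=5000):
    """Approximate the unique stationary distribution by power iteration."""
    mu = {s: 1.0 / len(states) for s in states}
    for _ in range(iterations):
        mu = {t: sum(mu[s] * P[(s, t)] for s in states) for t in states}
    return mu

states, P = transition(imitate_bot(0.01), defect_bot(0.01))
mu = stationary(states, P)
utilities = tuple(sum(mu[s] * payoffs[s][i] for s in states) for i in range(2))
print({s: round(p, 3) for s, p in mu.items()}, [round(x, 2) for x in utilities])
# An imitative reaction function paired with a noisy defectbot settles into mutual
# defection almost all of the time, so both level-1 utilities come out close to 1.
```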

Denote $\mathcal{O}(G)$ the space of correlated mixed outcomes of $G$:

$$\mathcal{O}(G) = \Delta\Big( \prod_{i=1}^N S_i \Big)$$

There is a natural sense in which a correlated mixed outcome of $MT^{n+1}_k(G)$ induces a correlated mixed outcome of $MT^n_k(G)$, via the stationary distributions $\mu_f$.

For $n \geq 0$ we define the unwinding operator $U^n_k: \mathcal{O}(MT^n_k(G)) \to \mathcal{O}(G)$ recursively: $U^0_k$ is the identity, and $U^{n+1}_k$ is $U^n_k$ composed with the map of the previous paragraph.

Consider $\mu$, a correlated mixed outcome of $G$. $\mu$ is called a level $n$ metathreat equilibrium when there is a sequence $\{(k_j, \sigma^{(j)})\}_{j \in \mathbb{N}}$ s.t. $k_j \to \infty$, $\sigma^{(j)}$ is a thermodynamic equilibrium of $MT^n_{k_j}(G)$ at zero temperature and $U^n_{k_j}(\sigma^{(j)}) \to \mu$ (regarding $\sigma^{(j)}$ as a correlated mixed outcome). The Bolzano-Weierstrass theorem ensures such a $\mu$ exists for any game and value of $n$.

We believe that for any $k$, $MT^1_k(PD)$ has a unique thermodynamic equilibrium at zero temperature, which is a mixture of metathreats whose unwinding approaches the outcome $(C,C)$ as $k \to \infty$.

Therefore $(C,C)$ is the unique level 1 metathreat equilibrium of $PD$.

Some questions for further research:

  • Does the set of level $n$ metathreat equilibria stabilize for sufficiently large $n$?
  • Are level $n$ metathreat equilibria guaranteed to be Pareto efficient for large $n$? Or even already for $n = 1$?
  • What do level $n$ metathreat equilibria look like in outcome space for other games?
  • Does anything change if we explicitly introduce shared random bits? It seems plausible that shared random bits can always be replaced by going to higher values of $n$.

One decision-theoretic interpretation of metathreat equilibria is in terms of self-modifying CDT. The elements of the metathreat strategy spaces can be interpreted as programs that use reflective oracles or optimal predictor schemes to make predictions about the opposing programs and decide on a strategy accordingly. Thus our self-modification is restricted to the programs in this class.

Another interpretation is in terms of logical decision theory, at least for level 1. Such a decision theory consists of two stages. In the first stage the agent computes logical conditional expected utility (using an optimal predictor scheme), where the condition corresponds to the selection of a given metathreat. It then performs a sequence of logical coin flips to decide on its metathreat according to a probability distribution designed to yield a thermodynamic equilibrium. In the second stage the agent uses an optimal predictor scheme to guess the strategy of its opponent and acts according to its chosen metathreat, this time providing enough resources to the optimal predictor scheme that the result of the first stage for the opposing agent is certain. Such a decision theory is not entirely natural since it explicitly uses the concept of "opponent actions", but hopefully it can be used as a foundation for constructing a fully natural logical decision theory.

The historical origin of this idea is Hofstadter's superrationality.

We don't expect the limit of this self-modification process to be entirely reasonable, since in some situations self-modification cannot save CDT, for example if self-modification only begins after the agent is inspected by Omega in Newcomb's paradox. However, it seems plausible that one can specify assumptions under which self-modifying CDT will make the right choices, e.g. a version of Newcomb's paradox where self-modification is allowed before inspection by Omega.

Comments

Do thermodynamic equilibria differ meaningfully from proper equilibria?

Another important case is games with infinite strategy spaces. In particular, see my recent post where I use the notion of thermodynamic equilibrium in a way which cannot be trivially mimicked (I think) by a proper equilibrium.

Good question. One obvious difference is that thermodynamic equilibria are naturally defined for games with appropriate continuous strategy spaces, but it's probably not important for intelligent agent theory.

It seems plausible that for zero temperature they don't.

For positive temperature, thermodynamic equilibria are more rigid than $\epsilon$-proper profiles, but I'm not sure it's important.

Very interesting! I'm eager to see where this will lead.

If the PD has different values (thus is non-symmetric), but still has the same ordinal relations on outcomes (player 1 prefers (D,C) to (C,C) to (D,D) to (C,D), etc...), is (C,C) still the unique level 1 metathreat equilibrium?

And what inspired this "thermodynamic" approach?

The metathreat equilibrium definitely depends on the cardinal utilities; for example, for certain payoff values the "nicer bots" are not an equilibrium, since both agents want to lower the probability of cooperation.

The thermodynamic approach is needed because we need a way to rule out some of the metathreat Nash equilibria. A pair of defectbots is a Nash equilibrium in the level 1 metathreat game, but this equilibrium is "unstable" since there are bots that are equally good against the defectbot but do better against other bots. So we need some notion of equilibrium stability, but the usual trembling hand is not good enough, since for a trembling hand equilibrium stability w.r.t. an arbitrary totally mixed perturbation is sufficient (and it is possible to construct perturbations that stabilize "bad" equilibria). The thermodynamic equilibrium works by considering perturbations in which lower-utility alternatives get higher-order infinitesimal probabilities. Together with the ease of achieving this equilibrium in a natural decision theory, this makes it an attractive choice.
