Preface

I wrote this post during my scholarship at MATS. My goal is to describe a research direction of the learning theoretic agenda (LTA). Namely, a natural infra-Bayesian learning algorithm proposal that arguably leads to Pareto-optimal solutions in repeated games. The idea originates from Vanessa, I'm expanding a draft of her into a more accessible description.

The expected audience are people who are interested in ongoing work on LTA. It is especially suitable for people who are looking for a research direction to pursue in this area.

Introduction

There has been much work on the theory of agents who cooperate in the Prisoner's dilemma and other situations. Some call this behavior superrationality. For example, functional decision theory (FDT) is a decision theory that prescribes such behavior. The strongest results so far are those coming from "modal combat", e.g. Critch. However, these results are of very limited scope: among other issues, they describe agents crafted for specific games, rather than general reasoners that produce superrational behavior in a "naturalized" manner (i.e. as a special case of the general rules of reasoning.)

At the same time, understanding multi-agent learning theory is another major open problem. Attempts to prove convergence to game-theoretic solution concepts (a much weaker goal than superrationality) in a learning-theoretic setting are impeded by the so-called grain-of-truth problem (originally observed by Kalai and Lehrer, its importance was emphasized by Hutter). An agent can learn to predict the environment via Bayesian learning only if it assigns non-zero prior probability to that environment, i.e. its prior contains a grain of truth. What's the grain-of-truth problem? Suppose Alice's environment contains another agent, Bob. If Alice is a Bayesian agent, she can learn to predict Bob's behavior only if her prior assigns a positive probability to Bob's behavior. (That is, Alice's prior contains a grain of truth.) If Bob has the same complexity level as Alice, then Alice is not able to represent all possible environments. Thus, in general, Alice's prior doesn't contain a grain of truth. A potential solution based on "reflective oracles" was proposed by Leike, Taylor and Fallenstein. However it involves arbitrarily choosing a fixed point out of an enormous space of possibilities, and requires that all agents involved choose the same fixed point. Approaches to multi-agent learning in the mainstream literature (see e.g. Cesa-Bianchi and Lugosi) also suffer from restrictive assumptions and are not naturalized.

Infra-Bayesianism (IB) was originally motivated by the problem of non-realizability, of which the multi-agent grain-of-truth problem is a special case. Moreover, it converges to FDT-optimal behavior in most Newcombian problems. Therefore, it seems natural to expect IB agents to have strong multi-agent guarantees as well, hopefully even superrationality.

In this article, we will argue that an infra-Bayesian agent playing a repeated game displays a behavior dubbed "infra-Bayesian haggling". For two-player games, this typically (but, strictly speaking, not always) leads to Pareto-efficient outcome. The latter can be a viewed as a form of superrationality. Currently, we only have an informal sketch, and even that in a toy model with no stochastic hypotheses. However, it seems plausible that it can be extended to a fairly general setting. Certainly it allows for asymmetric agents with different priors and doesn't have any strong mutual compatibility condition. The biggest impediment to naturality is the requirement that the learning algorithm is of a particular type (namely, Upper Confidence Bound). However, we believe that it should be possible to replace it by some natural abstract condition.

Preliminaries

Repeated games

As mentioned, we will focus our exploration on infinitely repeated games.

Notation: In a -player game, each player  can choose from a finite set of actions . Let  be the set of action profiles. Furthermore, each player  has a reward function  .

Contrary to classical game theory, we look at this setting as a reinforcement learning setup. At time step , player  can choose from action set , then receives an observation  consisting of the actions played by other players at time step , and reward . The goal of agent  is to maximize its utility   for some discount factor .

We will use the words "agent" and "player" interchangeably.

Remark: We suppose the agents know the rules of the game. Specifically, this means player  knows the reward function , but doesn't necessarily know  for .

Notation:  Let  be the set of possible observations for player  : .

Running example: Prisoner's dilemma

We will use the Prisoner's dilemma to illustrate some of the concepts later in the post. Here, the players can choose between cooperating and defecting and they get the following rewards:

Other notation

We will also use the following notations.

Notation: For set , let  denote the set of finite sequences of elements of .

Notation: For set , let X denote the set of probability distributions on .

Proposed algorithm

Each agent starts with a set of hypotheses  about how the other agents might behave. We will talk about the exact form of these hypotheses in the next section. For now we just assume  is finite and each agent has a prior distribution  over the hypotheses. Note that we neither require the agents to have the same prior over the hypothesis space nor having the same hypothesis space as the others.

The proposed algorithm is the following:

  1. Compute the set of hypotheses that are consistent with our experience so far. Let us denote it with .
  2. Find the hypotheses in  that are promising the highest utility, denote this set of hypotheses with .
  3. Sample an element  from  according to the prior  conditioned on .
  4. Compute the optimal policy  if  holds.
  5. Follow  until  gets falsified. When falsified, go to step 1.

We will discuss how a hypothesis looks like in the next section. Then we will discuss what we mean by "optimal policy"  and "utility" in the section after that.

How to describe hypotheses?

Our first try might be to include all hypotheses of type "Others will play ", where . That is way too restrictive. It excludes any dependence on what we have done in previous time steps. It also excludes any probabilistic behavior.

So our second try might be to include hypotheses of type "Others play according to a function , where  maps histories to distributions on actions of others. This is too fine-grained and intractable.

We need something in between these two extremes. A natural choice would be regular decision processes (RDPs).

Regular decision processes

We will define the hypotheses with the reinforcement learning setting in mind. The "observation" in reinforcement learning is what the agent observes from the environment. In this case, that's the actions the other agents played.

We will slightly abuse notation here. Previously  was an element of   in a -player game. From now on,  will sometimes be an element of the action set of the agent, denoted by . For player ,  .

Definition: A Regular Decision Process (RDP) is a tuple , where

  •  is a finite set of states
  •  is a finite set of actions
  •  is a finite set of observations
  •  is the initial state
  •  is the observation kernel that defines a probability distribution over observations for each (state, action) pair
  •  is the transition function describing the next state after taking action  in state  and observing 

Infra-Bayesianism lets us merge some of the RDPs together, which allows us to express beliefs about some aspects of the environment while remaining uncertain about others. (You can read more about infra-Bayesianism here.) The infra-Bayesian version of an RDP is an infra-RDP.

Definition: An Infra-RDP is a tuple , where

  •  have the same properties as in an RDP
  •  is the observation kernel, where  denotes the convex, closed subsets . So  defines a closed convex subset of probability distributions over observations for each (state, action) pair.

If you are not familiar with infra-Bayesianism, it is not clear how to interpret this kind of observation kernel. So suppose the environment is in state  and the agent takes action . The hypothesis claims that the next observation is determined by using one of the distributions in . This means there exists a probability distribution , such that , and the next observation is sampled from . Importantly, the hypothesis doesn't say anything about how  is chosen from : it has Knightian uncertainty about it.  could be chosen randomly or could be chosen adversarially by Murphy or using any other method.

In this post, we will focus on a special case of infra-RDPs called sharp infra RDPs. We restrict ourselves to sharp infra-RDPs now to start with an easier special case. However, the hope is that the method described here can be extended to infra-RDPs in future work.

Definition:  A sharp infra-RPD is a tuple , where

  •  have the same properties as in an RDP
  •  is the observation kernel that describes what observations are possible in a particular state given a particular action.

Remark: A sharp infra-RDP is an infra-RDP. The corresponding infra-RDP is defined by the observation kernel .

Remark: We denote the following special hypothesis with  (it stands for "top"). There's only one state . Whatever action the agent takes, any observation is possible, that is,  for any action .
We will assume  to be in the hypothesis space of each agent. This ensures the agent won't run out of hypotheses:  will always remain unfalsified. and thus the set of consistent hypotheses  will never be empty.

Remark: For sharp infra-RDPs, there is no ambiguity in deciding whether a hypothesis is falsified or not. For general infra-RDPs we will need to have a statistical notion of whether a hypothesis is falsified or not.

Example: Tit-for-tat is a well-known strategy in the iterated prisoner's dilemma. The strategy is to cooperate in the first round of the game, then do what the agent's opponent did in the previous round. the hypothesis "My opponent follows the tit-for tat strategy " can be represented as a sharp infra-RDP. Actually, this is just a deterministic RDP, there is nothing "infra" in it.

Example: Consider the -defect strategy in the prisoners dilemma (defect with probability  and cooperate with probability ). This can't be part of the hypothesis space on its own. It is represented as part of . The closes thing to the -defect bot in the agent's hypothesis space is a deterministic bot who cooperates  times and then defects once.

Resettable infra-RDPs

In real life, but even in repeated games, there might be irreversible actions, so-called traps. This is a prevalent problem in multi-agent cases, because other agents can have memory. For example, in the prisoners' dilemma, the other player might use the Grim strategy: start with cooperate, but once the other defects, defect forever. In this case, if you defect on the first try, you will never be able to get a reward bigger than 1.

However, for the proposed algorithm to make sense, the agents need to be able to try hypotheses without ruining their chances in any of their other hypotheses. There are weaker conditions that are sufficient, but resettability is enough for our purposes here.

Definition: An infra-RDP  is resettable if for all , there exists a policy , such that if the agent is in  and follows , it will eventually reach 

Which hypotheses are in the hypothesis space of the agents?

As mentioned previously, we suppose  is in the hypothesis space. There is an infinite number of sharp infra-RDPs that can potentially be in the agents' hypothesis spaces. We imagine the agents to have some kind of complexity prior and keep all the hypotheses with complexity under a specific threshold. The hope here is that if the prior assigns "enough" probability mass to a special kind of hypotheses, then the agents will end up close to a Pareto-optimal payoff.

Definition: Suppose there is a unique Pareto-optimal strategy profile  For  of length , we define a simple password hypothesis  as the following sharp infra-RDP.

  • for state  and action 
  • for state , action  and observation 

That is, if the agent enters the password , the others will play the Pareto-optimal  strategy as long as the agent itself plays the Pareto-optimal action.

Example: For the password  the sharp infra-RDP looks like this:

For games where there are multiple Pareto-optimal strategy profiles, a password hypothesis doesn't necessarily end in a unique state. It might end in a loop as well. The idea is that after "entering the password", the world gets into a state where if the agent plays a particular strategy, the other agents are restricted to play a particular strategy as well.

Definition: For a password  and a sequence , we define the password hypothesis  as the following sharp infra-RDP.

  • The possible set of observations after taking action  in state  are
  • The transition for state , action  and observation  is described by 

Example For the password  and sequence , the sharp infra-RDP  looks like this:

 

Learning-theoretic guarantee

What can we say about the optimality of this algorithm? In this section, we will show that if the true environment satisfies at least one of the hypotheses in the hypothesis space of the agent, then the agent's payoff converges to at least the optimal payoff of the hypothesis.

What is the optimal payoff of a hypothesis? For a hypothesis , the optimal payoff or value of  is the minimum utility the agent can get if the environment satisfies , and the agent plays the best strategy according to . To define it more formally, we will need to start with what a policy is.

Definition: A stationary policy the agent can play in hypothesis  is an element of . That means a policy is a function that maps states of the sharp infra-RDP to actions.

While following the algorithm, the agent will switch its hypothesis about the world from time to time. The previous definition isn't enough to talk about the agent's behavior including these switches.

Definition: A policy is a function from histories to actions:  

A copolicy describes the behavior of the other agents. Or in other words, it describes the environment.

Definition: A copolicy in hypothesis  is an element of .

Definition: The utility depends on the policy  played, the copolicy  and the discount factor 

Definition: For discount factor , the value of  is .

Proposition 1: Suppose there is a hypothesis  in the agent's hypothesis space such that the true environment satisfies . Following the proposed algorithm guarantees the expected utility of  in the  limit. Let  be the policy defined by the proposed algorithm. Suppose the prior over the hypotheses assigns non-zero weight to