Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

I wrote this post during the first part of the SERI MATS winter school, in Vanessa Kosoy's research stream. This is just an explanation of my intended research direction and the motivation behind it, and write down proofs for the statements in Vanessa's comment describing the research direction.

The motivation behind infra-Bayesianism is to develop a better theory of intelligence.

Why do we need a theory of intelligence? When we build superhuman AIs, we can have two kinds of reasons to believe in their safety:
- empirical evidence coming from experiments with less capable AIs;
- mathematically proven guarantees of its performance.

Empirical evidence on its own may be unreliable, because, as the system capabilities improve, results may fail to generalize as we expected. We need a theory to give us performance guarantees and tell us how to correctly extrapolate observed results to new conditions.

Why do we need a better theory? The ideal Bayesian agent, known as AIXI, has several shortcomings, and one of them is realizability assumption. In order for a Bayesian agent to have performance guarantees: the true hypothesis must have a positive probability in the agent's prior. However, this is unachievable in the real world, because the agent is contained in the environment, and so trying to have a true hypothesis about the environment runs into self-reference problems and computational constraints.

A related problem in game theory is the grain-of-truth problem. The term comes from the paper by Ehud Kalai and Ehud Lehrer [1], in which they prove that in an infinitely repeated game, subjectively rational players - i.e. players maximizing their expected reward according to their subjective beliefs about what strategies the opponent is likely to play - will learn to predict their opponents' moves and their strategies will converge to Nash equilibrium strategies, provided that the "grain of truth" requirement is met: each player must assign a positive prior probability to any set of histories that has a positive chance of occurringk in the game.

Unfortunately, the "grain of truth" requirement is very strict, and when it's not met, learning fails. Foster and Young [2] prove that in a repeated 2-player game with uncertain payoff, rational agents will fail to learn each other's strategies and fail to converge to a Nash equilibrium, except for a set of games of measure 0. John H Nachbar [3] demonstrates that, loosely speaking, for any sufficiently rich set of strategies, it is impossible for each player to both predict their opponent's strategy and have their own strategy be the best response to their beliefs about their opponent at the same time.

Infra-Bayesianism is a possible solution to the "grain of truth problem", as it allows agents to have partial models of their environment. In this way, infra-Bayesian beliefs are similar to physical laws. For example, Newton's laws of motion can predict the path of a falling object, but not the weather tomorrow or the outcome of an election. Instead of using a single probability distribution, Infra-Bayesianism defines hypotheses as convex sets of probability distributions. This allows Infra-Bayesian hypotheses to be agnostic about certain parts of the world.

Our hypothesis is that infra-Bayesian agents will have an advantage in iterated games over Bayesian agents. I want to try and compare their performance in iterated game setting. It will be interesting to see if infra-Bayesian agents have better convergence to Nash equilibria.

The folk theorem states that any payoff vector above the minimax of each player can be achieved by a subgame-perfect equilibrium of a repeated game. For 2-player games, minimax and maximin payoffs in mixed strategies are the same. Therefore the best guaranteed payoff in a 2-player game is the maximin payoff. However, achieving it would not be very interesting, because to receive a maximin payoff, the agent only needs to learn its own rewards, and doesn't need to learn about other agents' policies.

This is why it is interesting to consider another simple kind of repeated game: a population game. Instead of a fixed set of agents playing with each other, we consider  populations of agents. At each turn, a random agent is selected from each population to participate in a one-shot game with  players, which we refer to as the stage game. It is assumed that all populations are sufficiently large that the effect of any individual agent violating from their policy on the distribution of histories of other agents would be negligible.

In an immortal population game, the expected rewards of future games are discounted by a time factor of . In a mortal population game, the rewards are not discounted but instead each agent has a chance  of dying after each turn. Each turn all dead agents are replaced with new agents with empty histories.

Let us introduce some notations:

 is the set of pure strategies (actions) of the th agent in the game. We assume  are finite sets.

[I'll deal with deterministic rewards first, then see what happens in more complicated situations. For a deterministic reward, we can assume that a player knows his reward function.]

 is the set of pure outcomes of a stage game. We assume that each player knows his own deterministic reward function .

The set of -round outcome histories is  and  is the set of all finite-length histories. 

 is the length of a history. is the length of a history .

Definition (distance between histories)

The distance between partial histories  is defined as


where  is some constant,  is the first time the histories diverge, i.e. the smallest  such that exactly one of the histories has fewer than  steps, or they both have at least  steps and their step  is different. If the histories are identical, the distance is defined to be 0. It is easy to see that this is indeed a metric.

 represents the set of probability distributions on .

Definition (distance between measures)

The Kantorovich-Rubinstein (KR) metric on the space  is defined as follows:

where supremum is taken over all 1-Lipschitz 
functions .

 is the policy of player number i. It determines which mixed strategy player number  chooses, given the full history it has seen so far. Upon seeing history , the probability that player  will choose action  is , which I may also write down as  for brevity.

 is the distribution of histories in the i-th population at the th turn of the repeated game.

Note how it is entirely possible for distributions of histories for different players to be different. For example, suppose that player 1 has two possible actions, and his policy is  to take action in the first game randomly with probability 50% each, and then always repeat this action. Then histories of players in distribution 1 will have repeated actions for player 1, while histories of players in distribution 2 will have random actions for player 1, because player 1 is sampled randomly from the distribution 1, and each move is equally likely.


A set of policies is said to be in equilibrium if playing a different strategy  for any agent, while all other agents (including other agents in population ) stick to their policies, does not result in greater expected reward. 

What does an equilibrium look like? Populations are assumed to be sufficiently large that an effect of any individual agent's action on the distribution of histories in the population is negligible. At turn , opponent number  is going to play a mixed strategy  with probability . So his moves follow probability distribution . It means that the best policy for an individual agent in population  is to choose a best response policy to mixed policies  at turn . ("Bar" notation means the element in list is skipped). Since each agent follows a best response policy, their probabilistic mixture  is also a best response policy. So the weighted averages of the players' policies at every turn are in a Nash equilibrium strategy for the single-turn game. 

It is interesting to find out if choosing  to be learning strategies will result in convergence to a Nash equilibrium in the limit of .


The evolution rule   that produces  from  works as follows. 

For the first step, for each  we sample a history for player  from  Then we sample actions according to policies . We append the outcome of the game to the history of player number 

For the second step, each player dies with probability  and is replaced with a new player with an empty history.

The first step is described by the formula 

, where


where  stands for history  without the last step,  for the actions taken at the last step of h, and the sums on the right-hand side are taken over histories of all players except for player .

The second step is obtained by

where  stands for the Dirac delta measure of an empty history.


 is continuous.


 is obtained from  by a linear transformation, so it is enough to prove continuity of .

I am going to write down a proof for , it generalizes straightforwardly to any .

For any 1-Lipschitz function ,


Let us look at the second summand.

because    and .

Choose  such that 

Then the sum


Let us upper bound the first summand in (2):


where  if  and 0 otherwise.

Then  is a 1-Lipschitz function from  to [-1, 1]. 

Indeed, , so for any two histories  and ,  

Therefore, this expression can be bounded from above by

Now for the last summand in (2):


Bringing them together:

If follows that for any  such that , this sum is less than .

Since every history has  possible successors after 1 step,  

We proved that for any  such that  the second summand in (1) is bounded from above by 

The same argument proves that for any  such that , the first summand is bounded from above by .

Now let us upper bound the last summand in (1).

It can be written down as


 if ,

                    0 if 

is the probability that player  with a history  will end up with history  after one step. We can rewrite this expression as

because if  is not equal to the history  without the last step, the probability  is equal to zero. Now rewrite this as


where .

Let us prove that if  is a 1-Lipschitz function , the same is true of .

The inequalities  are obvious. For any two different histories  and , any their two continuation histories  and  are at the same distance , and therefore . Since the integrands are close, the integrals must be close too:

The same way,

and so  is 1-Lipschitz.

Therefore, the third summand is bounded from above by , so if we choose  so that  so that  and  so that , we'll get . Qed.

Notice how we also proved that if  and  are kept fixed,  does not increase the distance, and therefore  is a contraction.

Combining all components gives us the map .

We can extend a finite history to an infinite history by appending an infinite number of blank or "placeholder" turns represented by the symbol  at the end of the original history. With this notation,  is a subset of . The set   (with discrete topology) is compact, and therefore their product  is a compact space by the Tychonoff's theorem. can be seen as a subset of : we take a probability distribution  on  and for any subset  define.

The distance on  is defined just like on : the distance between two histories is  where  is the first time the two histories diverge. 


The topology induced on  by this metric is the product topology on .


The product topology on  can be generated by basis sets . The metric topology is generated by the unit balls . These two families coincide, since , where  is the smallest integer such that . Therefore the topologies also coincide.


Since  is compact, it must also be a complete space. The set of sequences with finitely many non- elements is countable and dense in . So  is a separable space. This means that  is a separable, completely metrizable topological space, also known as a Polish space.

Because  is compact, the space of probability distributions over it,  is compact in the weak* topology. (See theorem 10.2 in [3])

However, we are looking for fixed points among probability distributions on  and, unfortunately,  is not compact as  is not closed in (. (A sequence of finite histories that are continuations of each other has a limit in , but it's not a finite history.)

On the other hand, we can observe that a fixed distribution in  has the following property:

If the proportion of players in population with a history length of  is  in the th generation, then the proportion of players in population  with a history length of  in the next generation will be  x, since  is the proportion of players with a -long history that survive the turn. This implies that in a fixed distribution the age-distribution  must satisfy . This means that .

We call  the subset of  in which every population's age distribution is .


 is closed in the weak* topology.


Suppose  is a converging sequence of measures, . It means that for any continuous function 

For elements of , we can define length as the number of symbols before the first ocurrence of symbol  (if there is no symbol , the length is infinite).
For a fixed nonnegative integer , consider the function  which is equal to 1 if   has length , and 0 otherwise. It is continuous, because if length of a history  is . The integral  measures the proportion of histories in the distribution  which have length ; the limit distribution  then must preserve the age distribution and also belong in the set 

As a closed subset of a compact set,  is compact. It is also easy to see that  is convex. 


  has a fixed point.


The Schauder fixed-point theorem implies that a continuous map from a compact convex subset of a Hausdorff topological vector space to itself has a fixed point. We showed that P is compact and convex, and T is continuous. 

The set   is a subset of the topological vector space  of all finite signed measures on  with weak* topology. Let us show that this vector space is Hausdorff. Take any two signed measures  and , and suppose that no continuous function  on  separates them: . Then . The functional  is a positive linear functional on . The space  is metric and therefore Hausdorff; it is compact and therefore locally compact. Therefore, by the Riesz representation theorem, the measure corresponding to this functional is unique:  and therefore .

Another possibly interesting observation is that the age distribution of histories is going to converge to the limit distribution.


The age distribution of histories  converges to the distribution  in the norm .


Since the map  adds 1 step to every history and then replaces each agent with a newborn with probability , it induces a map on age distributions:



on age distributions.  For this map  is a fixed point and . It follows that any initial age distribution converges to .

What happens in the fixed points? In a fixed point, the distribution of opponents stays the same each term. It means that a player observes a sequence of independent identically distributed (IID) outcomes of their policy. Suppose that the policy  satisfies the following property: upon observing an IID sequence of games, it converges to the optimal response in the limit of . Then in this limit, fixed points will also be Nash equilibria. 
The property above can be satisfied even by very simple algorithms, for example, one that assumes that opponent's move is randomly sampled from the set of his previous moves. Any Bayesian agent that includes all IID processes in its prior will satisfy this property.

Some interesting questions for research are:

Are any/all of the fixed points attractors?

Does convergence to a fixed point occur for all or at least almost all initial conditions?

Do all Nash equilibria correspond to fixed points? 

Do stronger game theoretic solution concepts (e.g. proper equilibria) have corresponding dynamical properties?

[1] Kalai, Ehud, and Ehud Lehrer. "Rational learning leads to Nash equilibrium." Econometrica: Journal of the Econometric Society (1993): 1019-1045.
[2] Dean P Foster and H Peyton Young. "On the
impossibility of predicting the behavior of rational agents." Proceedings of the National
Academy of Sciences, 98(22):12848–12853,
[3] Nachbar, John H. "Beliefs in repeated games." Econometrica 73.2 (2005): 459-480.
[4] Hadfield-Menell D. et al. Cooperative inverse reinforcement learning //Advances in neural information processing systems. – 2016. – Т. 29.
[5]Hadfield-Menell D. et al. The off-switch game //Workshops at the Thirty-First AAAI Conference on Artificial Intelligence. – 2017.
[6]Ergodic Theory by Charles Walkden, lecture 10

New Comment