In this post, we describe a generalization of Markov decision processes (MDPs) and partially observable Markov decision processes (POMDPs) called crisp supra-MDPs and supra-POMDPs. The new feature of these decision processes is that the stochastic transition dynamics are multivalued, i.e. specified by credal sets. We describe how supra-MDPs give rise to crisp causal laws, the hypotheses of infra-Bayesian reinforcement learning. Furthermore, we discuss how supra-MDPs can approximate MDPs by a coarsening of the state space. This coarsening allows an agent to be agnostic about the detailed dynamics while still having performance guarantees for the full MDP.
Analogously to the classical theory, we describe an algorithm to compute a Markov optimal policy for supra-MDPs with finite time horizons. We also prove the existence of a stationary optimal policy for supra-MDPs with infinite time horizons and geometric time discount. Furthermore, we explain the connection between supra-MDPs and stochastic games and summarize known regret bounds. We conclude by showing that crisp supra-POMDPs are equivalent to crisp causal laws.
To motivate the generalization of MDPs using infra-Bayesianism, we consider a toy problem about the conservation of a Sumatran tiger population.[1] The population is either locally extinct or extant, and a conservation group that knows the status of the population can annually choose between managing the population with an anti-poaching strategy and doing nothing in order to divert resources elsewhere.
Suppose that when poaching is low, population dynamics are fairly predictable, and thus extinction probability can be reasonably estimated. On the other hand, when poaching is high, the extinction probability is highly dependent on the connectivity between core subpopulations. This connectivity is difficult to directly estimate and sensitive to factors that change over time such as prey abundance.
For example, the conservationists' best estimate may be that when nothing is done, the extinction probability is equal to 0.2 at 25% connectivity, whereas it is 0.08 at 50% connectivity. In this case, it may be reasonable to assume that the connectivity ranges between 25% and 50%, and that, correspondingly, the extinction probability ranges from 0.08 to 0.2. This range of probabilities can be used to define a credal set. This credal set is a valid hypothesis for the ground truth if the next state starting from extant under the "do nothing" action can be sampled according to some element of the set.
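Using the notation of the definition below (where $T$ is the transition infrakernel and $\Delta S$ denotes probability distributions over the states $S = \{\text{extant}, \text{extinct}\}$), one way to write this credal set explicitly (a restatement of the estimate above, not a formula from the original post) is

$$
T(\text{extant}, \text{do nothing}) \;=\; \bigl\{\, p \in \Delta S \;:\; 0.08 \le p(\text{extinct}) \le 0.2 \,\bigr\}.
$$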
This model can be formulated as a crisp supra-MDP, formally defined as follows. Recall that given a set $X$, $\square X$ denotes the set of credal sets over $X$, i.e. closed convex sets of probability distributions over $X$.
Definition: Crisp supra-MDP with geometric time discount
A crisp supra-Markov decision process (supra-MDP) is a tuple $M = (S, s_0, A, T, \mathcal{L}, \gamma)$ where
- $S$ is a set of states,
- $s_0 \in S$ is an initial state,
- $A$ is a set of actions,
- $T\colon S \times A \to \square S$ is a transition infrakernel,
- $\mathcal{L}\colon S \times A \to [0,1]$ is a loss function, and
- $\gamma \in [0,1)$ is a geometric time discount.
Note that the only difference between an MDP and a crisp supra-MDP is the type of the map that provides information about the state transitions: an MDP has a transition kernel $T\colon S \times A \to \Delta S$, whereas a crisp supra-MDP has a transition infrakernel $T\colon S \times A \to \square S$.
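As an illustration of this data type, here is a minimal sketch (hypothetical names, not code from the post) of a finite, polytopic crisp supra-MDP in which each credal set is stored by the vertices of its convex hull, together with the worst-case expectation of a function over such a set:

```python
from dataclasses import dataclass

# Minimal sketch of a finite, polytopic crisp supra-MDP (hypothetical names,
# not code from the post). Each credal set T(s, a) is stored by the vertices
# of its convex hull, i.e. a list of probability vectors over `states`.

@dataclass
class CrispSupraMDP:
    states: list        # S
    initial_state: str  # s0
    actions: list       # A
    transition: dict    # (s, a) -> list of dicts {next_state: probability}
    loss: dict          # (s, a) -> loss in [0, 1]
    discount: float     # gamma in [0, 1)

def worst_case_expectation(vertices, values):
    """Supremum over the credal set of the expected value of `values`.

    A linear functional over a polytope attains its maximum at a vertex,
    so it suffices to check the finitely many vertex distributions.
    """
    return max(sum(p * values[s] for s, p in dist.items()) for dist in vertices)
```

The helper reflects a fact used repeatedly below: maximizing a linear functional over a polytopic credal set reduces to checking its vertices.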
In the Sumatran tiger example, we have the supra-MDP $M = (S, s_0, A, T, \mathcal{L}, \gamma)$ where $S = \{\text{extant}, \text{extinct}\}$, $s_0 = \text{extant}$, and $A = \{\text{manage}, \text{do nothing}\}$, and $T$ is given by a table of transitions, where the numerical entries denote probabilities (of transitioning from states labeling the rows to states labeling the columns) and intervals, such as $[0.08, 0.2]$ for the transition from extant to extinct under "do nothing", denote credal sets.
Alternatively, Figure 1 summarizes $T$ via a state transition graph.
To summarize from a learning theory perspective, a supra-MDP models the following situation. At each time-step, a system is in a state $s \in S$. The state is known to the agent, who takes an action $a \in A$. It is assumed that the state transitions stochastically according to some element of the credal set $T(s, a)$; the ground truth is said to satisfy the supra-MDP if this assumption holds. With each state transition, the agent receives loss equal to $\mathcal{L}(s, a)$. The goal of the agent during online learning is to minimize the $\gamma$-discounted total loss while under uncertainty about the true state transition dynamics.
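The post later uses the worst-case (infra-Bayesian) decision rule; under that rule, one schematic way to write this objective (my notation, not a formula from the post) is

$$
\min_{\pi}\;\sup_{\text{ground truths satisfying } M}\;\operatorname*{\mathbb{E}}\Bigl[\sum_{t=0}^{\infty} \gamma^{t}\, \mathcal{L}(s_t, a_t)\Bigr],
$$

where $s_t$ and $a_t$ denote the state and action at time-step $t$, and the expectation is with respect to the interaction of the policy $\pi$ with the ground-truth dynamics.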
Recall that a multi-armed bandit is an MDP with one state. Similarly, supra-bandits are a special case of supra-MDPs. In particular, a supra-bandit can be described by a supra-MDP with state space $S = [0,1]$, where the state encodes the loss obtained from pulling the most recent arm. In this case, the transition infrakernel, which then takes the form $T\colon A \to \square S$, depends only on the action. The loss also only depends on the action.
Alternatively, we can equivalently define supra-MDPs so that the transition infrakernel also outputs the loss, i.e. has type signature $T\colon S \times A \to \square(S \times [0,1])$. Under this definition, a supra-bandit is a one-state supra-MDP. (We choose the former definition for the transition infrakernel so that the equivalence of supra-MDPs and stochastic games can be explained below more easily.)
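To make the bandit special case concrete, here is a tiny illustrative sketch (hypothetical numbers and names, not from the post) in which each arm's expected loss is only known up to an interval, and the agent picks the arm whose worst-case expected loss is smallest:

```python
# Illustrative supra-bandit sketch (hypothetical numbers and names, not from
# the post). Each arm's credal set is summarized by an interval of possible
# expected losses; for the agent, the worst case is the upper endpoint.

arms = {
    "arm_0": (0.1, 0.4),  # expected loss known only to lie in [0.1, 0.4]
    "arm_1": (0.2, 0.3),  # expected loss known only to lie in [0.2, 0.3]
}

def minimax_arm(arms):
    """Pick the arm minimizing the worst-case expected loss."""
    return min(arms, key=lambda arm: arms[arm][1])

print(minimax_arm(arms))  # -> "arm_1" (worst case 0.3 beats arm_0's 0.4)
```

This is the same best-case-over-actions, worst-case-over-the-credal-set rule that appears in the dynamic programming section below.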
Regret bounds for infra-bandits[2] are studied in Imprecise Multi-Armed Bandits (V. Kosoy).
Recall that a crisp causal law is a map $\Lambda$ that assigns to each policy $\pi$ a credal set $\Lambda(\pi) \in \square\bigl((A\times O)^\omega\bigr)$ over destinies, and that it is generated by a set of environments, meaning that there exists a set of environments $\mathcal{E}$ such that $\Lambda(\pi) = \mathrm{cl}\,\mathrm{conv}\,\{\, e \bowtie \pi : e \in \mathcal{E} \,\}$, where $e \bowtie \pi$ denotes the distribution over destinies resulting from the interaction of $\pi$ with $e$, $\mathrm{conv}$ denotes convex hull, and $\mathrm{cl}$ denotes closure.
Every crisp supra-MDP naturally gives rise to a set of environments, and thus gives rise to a crisp causal law generated by that set of environments. This is possible in two ways: in the simpler case, we assume that the observations are the states themselves, i.e. $O = S$. More generally, we can assume the existence of a state representation. In this subsection, we describe both constructions, and also discuss how (supra-)regular decision processes (RDPs) provide state representations. Unlike in the partially observable case (discussed below), not every crisp causal law arises from a supra-MDP.
With the goal of describing the first case, we define an $M$-copolicy, which is a map that is compatible with the transition infrakernel of a supra-MDP $M$.
Let $\lambda$ denote the empty string. Define $\mathrm{st}\colon (A\times S)^* \to S$ by $\mathrm{st}(\lambda) = s_0$ and $\mathrm{st}(h) = s$ for nonempty strings $h = h'(a,s)$ with prefix $h' \in (A\times S)^*$ and $(a,s) \in A\times S$.
Definition: Copolicy to a supra-MDP
Let $M = (S, s_0, A, T, \mathcal{L}, \gamma)$ be a supra-MDP. A map $\sigma\colon (A\times S)^* \times A \to \Delta S$ is an $M$-copolicy if $\sigma(h, a) \in T(\mathrm{st}(h), a)$ for all $h \in (A\times S)^*$ and $a \in A$.
If $O = S$, an $M$-copolicy defines an environment, i.e. a map of the form $(A\times O)^* \times A \to \Delta O$. Thus, a supra-MDP gives rise to the set of environments that are $M$-copolicies. This set of environments generates a crisp causal law.
Note that in the classical case, i.e. the case when $T(s,a)$ is a singleton for all $s \in S$ and $a \in A$, there exists a unique copolicy. Thus, an MDP defines an environment.
The transition dynamics of a crisp supra-MDP are (infra-)Markovian in the sense that the credal set $T(s,a)$ is determined entirely by the current state $s$ and action $a$. However, in general the copolicy can depend on the history. In turn, the environments compatible with a crisp supra-MDP are not necessarily Markovian.
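The following is an informal sketch (hypothetical names, not code from the post) of how a history-dependent copolicy interacts with a policy: at each step the copolicy must return a distribution over next states lying in the credal set for the current state and action.

```python
import random

# Informal sketch (hypothetical names, not code from the post) of a rollout
# in which a history-dependent copolicy chooses the transition distribution.
# `copolicy(history, state, action)` must return a distribution over next
# states, as a dict {state: probability}, belonging to the credal set
# T(state, action) of the supra-MDP.

def rollout(initial_state, policy, copolicy, horizon):
    history, state = [], initial_state
    for _ in range(horizon):
        action = policy(history, state)
        dist = copolicy(history, state, action)
        # Because the copolicy sees the whole history, it may pick a different
        # element of T(state, action) at every step; the induced environment
        # therefore need not be Markovian even though T itself is.
        next_state = random.choices(list(dist), weights=list(dist.values()))[0]
        history.append((action, next_state))
        state = next_state
    return history
```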
State representations
We now consider the more general case in which we have a state representation[3] given by functions $\varphi\colon (A\times O)^* \to S$ and $\omega\colon S \to O$. We require that $\varphi$ and $\omega$ satisfy a consistency condition. In particular, given a history of states and actions, we can compute the associated history of actions (determined by the given history) and observations (determined by $\omega$) in the natural way. To see this formally, define $g\colon (A\times S)^* \to (A\times O)^*$ recursively by
$$g(\lambda) = \lambda$$
and
$$g\bigl(h(a,s)\bigr) = g(h)\,\bigl(a, \omega(s)\bigr).$$
Then we say that $\varphi$ and $\omega$ are consistent if $\varphi\bigl(g(h)\bigr) = \mathrm{st}(h)$ for all $h \in (A\times S)^*$.
In this context, we say that $\sigma\colon (A\times S)^* \times A \to \Delta S$ is an $M$-copolicy if $\sigma(h, a) \in T(\mathrm{st}(h), a)$ for all $h \in (A\times S)^*$ and $a \in A$. An $M$-copolicy defines an environment in the following way. Similarly to the definition of $g$, given a history of actions and observations, we can compute the associated history of states (determined by $\varphi$) and actions (determined by the given history). More precisely, define $f\colon (A\times O)^* \to (A\times S)^*$ recursively by $f(\lambda) = \lambda$, where $\lambda$ denotes the empty string, and $f\bigl(h(a,o)\bigr) = f(h)\,\bigl(a, \varphi(h(a,o))\bigr)$.
Let $\omega_*\colon \Delta S \to \Delta O$ denote the pushforward along $\omega$; for instance, given $s \in S$, if $\delta_s$ denotes the Dirac delta distribution on $s$, then $\omega_*\delta_s = \delta_{\omega(s)}$. Define $e_\sigma\colon (A\times O)^* \times A \to \Delta O$ by $e_\sigma(h, a) = \omega_*\bigl(\sigma(f(h), a)\bigr)$.
Then $\{\, e_\sigma : \sigma \text{ is an } M\text{-copolicy} \,\}$ is a set of environments that generates a crisp causal law.
State representations are naturally induced by regular decision processes (RDPs). RDPs can be viewed in two ways. First, they can be viewed as a special case of POMDPs in which (as explained below) the state can always be deduced from an observation history. An RDP can also be thought of as an MDP that is coupled with a state representation, which is significant as this allows one to define an environment (or, in the supra-case, a crisp causal law).
Definition: Regular decision process (RDP)
A regular decision process (RDP) is a tuple $M = (S, s_0, A, O, T, \mathcal{O}, \mathcal{L}, \gamma)$ where
- $S$ is a set of states,
- $s_0 \in S$ is an initial state,
- $A$ is a set of actions,
- $O$ is a set of observations,
- $\mathcal{L}\colon S \times A \to [0,1]$ is a loss function, and
- $\gamma \in [0,1)$ is a geometric time discount.
Furthermore,
- $T\colon S \times A \to \Delta S$ is a transition kernel, and
- $\mathcal{O}\colon S \to O$ is an observation mapping,
which together satisfy that for all $s \in S$, $a \in A$, and $o \in O$, there exists at most one $s' \in \operatorname{supp} T(s,a)$ such that $\mathcal{O}(s') = o$, where $\operatorname{supp}$ denotes support.
Given an RDP $M$, the corresponding state representation $(\varphi, \omega)$ is defined as follows. We obtain $\varphi$ by recursively applying the transition kernel. In particular, let $\varphi(\lambda) = s_0$ for the empty string $\lambda$. Given $h \in (A\times O)^*$ and $(a, o) \in A \times O$, if there exists $s'$ such that $s' \in \operatorname{supp} T(\varphi(h), a)$ and $\mathcal{O}(s') = o$, then define $\varphi\bigl(h(a,o)\bigr) = s'$ (such an $s'$ is unique by the condition above). Otherwise, $\varphi\bigl(h(a,o)\bigr)$ can be defined arbitrarily. Define $\omega$ by $\omega(s) = \mathcal{O}(s)$ for all $s \in S$.
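A minimal sketch (hypothetical names, not code from the post) of how the state representation induced by a finite RDP can be computed: the current state is tracked by scanning the action-observation history and using the fact that at most one successor state is compatible with each observation.

```python
# Sketch of the state representation induced by a finite RDP (hypothetical
# names, not code from the post). `transition[s][a]` is a dict
# {next_state: probability} and `obs_map[s]` is the observation emitted
# in state s.

def rdp_state(history, s0, transition, obs_map):
    """Return phi(history), the state reached after an (action, observation) history."""
    state = s0
    for action, obs in history:
        support = [s2 for s2, p in transition[state][action].items() if p > 0]
        candidates = [s2 for s2 in support if obs_map[s2] == obs]
        if not candidates:
            # The history is impossible under the RDP; phi may be defined arbitrarily.
            return None
        # By the RDP condition there is at most one compatible next state.
        state = candidates[0]
    return state
```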
By replacing the transition kernel with a crisp infrakernel, we can define a crisp supra-RDP. A crisp supra-RDP has all of the data of a crisp supra-MDP, together with the observation mapping that can be used to define a state representation analogously to the classical case.
Definition: Crisp supra-RDP
A crisp supra-regular decision process (supra-RDP) is a tuple $M = (S, s_0, A, O, T, \mathcal{O}, \mathcal{L}, \gamma)$ where
- $S$ is a set of states,
- $s_0 \in S$ is an initial state,
- $A$ is a set of actions,
- $O$ is a set of observations,
- $\mathcal{L}\colon S \times A \to [0,1]$ is a loss function, and
- $\gamma \in [0,1)$ is a geometric time discount.
Furthermore,
- $T\colon S \times A \to \square S$ is a transition infrakernel, and
- $\mathcal{O}\colon S \to O$ is an observation mapping,
which together satisfy that for all $s \in S$, $a \in A$, $\theta \in T(s,a)$, and $o \in O$, there exists at most one $s' \in \operatorname{supp}\theta$ such that $\mathcal{O}(s') = o$.
In the case when a supra-RDP is given (rather than an RDP), there is more freedom in defining the state representation. Given $h \in (A\times O)^*$ and $(a, o) \in A \times O$, define $\varphi\bigl(h(a,o)\bigr) = s'$ if there exists $\theta \in T(\varphi(h), a)$ such that $s' \in \operatorname{supp}\theta$ and $\mathcal{O}(s') = o$. Otherwise, $\varphi\bigl(h(a,o)\bigr)$ can be defined arbitrarily as before. Therefore, a crisp supra-RDP can be thought of as a crisp supra-MDP that is coupled with a state representation.
A supra-MDP can be used to approximate an MDP via a coarsening of the state space. By approximate, we mean that if an MDP is the ground truth, then the ground truth satisfies (is consistent with) the hypothesis that is the supra-MDP. As a result, if a policy for the supra-MDP satisfies some loss bound, then the same bound is guaranteed for the policy interacting with the full MDP. This observation is significant in cases when learning algorithms for MDPs are intractable due to the size of the full state space. Furthermore, the coarsening process allows an agent to be agnostic about parts of the state space that may be difficult to precisely describe.
Consider an MDP $M = (S, s_0, A, T, \mathcal{L}, \gamma)$. Define a surjective coarsening map $\kappa\colon S \to \tilde{S}$ to a set $\tilde{S}$. For the coarsening to be nontrivial, $\kappa$ should be non-injective.
Conceptually, the coarsening map allows an agent to be agnostic about the difference between MDP states that are collapsed together. For example, Figure 2 depicts a coarsening in which an MDP with three states is collapsed to a supra-MDP with two states.
We now define the transition infrakernel of a supra-MDP with state space $\tilde{S}$. Recall that the transition kernel of $M$ is of type $T\colon S \times A \to \Delta S$. Conceptually, to define the transition infrakernel at a coarsened state $\tilde{s}$ and action $a$, we take the preimage of $\tilde{s}$ under the coarsening map, apply the transition kernel of the full MDP, obtain a credal set by taking the closed convex hull, and then project back to $\tilde{S}$ using the coarsening map. More precisely, let $\mathrm{conv}$ denote convex hull, $\mathrm{cl}$ denote closure, and $\kappa_*$ denote the pushforward of $\kappa$ extended to credal sets. Define the transition infrakernel $\tilde{T}\colon \tilde{S} \times A \to \square\tilde{S}$ by
$$\tilde{T}(\tilde{s}, a) \;=\; \kappa_*\Bigl(\mathrm{cl}\,\mathrm{conv}\,\bigl\{\, T(s, a) \;:\; s \in \kappa^{-1}(\tilde{s}) \,\bigr\}\Bigr).$$
If the full state space $S$ is finite, then each credal set $\tilde{T}(\tilde{s}, a)$ is the closed convex hull of a finite number of points in $\Delta\tilde{S}$ (itself a probability simplex) and thus is isomorphic to a closed convex polytope given by a vertex representation. When a transition infrakernel has this special property, the supra-MDP is said to be polytopic.
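The coarsening construction can be sketched in code as follows (hypothetical names, not code from the post): for each coarse state and action, collect the pushforwards of the full MDP's transition distributions over the preimage, and keep them as generating distributions whose closed convex hull is the credal set.

```python
from collections import defaultdict

# Sketch of the coarsening construction (hypothetical names, not code from
# the post). `transition[s][a]` is a dict {next_state: probability} for the
# full MDP, and `coarsen[s]` is the coarse state kappa(s).

def coarsened_infrakernel(states, actions, transition, coarsen):
    """For each (coarse state, action), return a list of distributions over
    coarse states whose closed convex hull is the credal set."""
    infrakernel = defaultdict(list)
    for s in states:
        for a in actions:
            pushforward = defaultdict(float)
            for s_next, p in transition[s][a].items():
                pushforward[coarsen[s_next]] += p
            # Each full state s in the preimage of its coarse state
            # contributes one generating distribution of the credal set.
            infrakernel[(coarsen[s], a)].append(dict(pushforward))
    return dict(infrakernel)
```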
A policy is said to be Markov if it depends on the current state and time-step, but not otherwise on the history. If the time horizon is finite, there is an efficient algorithm to compute a Markov optimal policy for a special class of supra-MDPs. In particular, we require that for all $s \in S$ and $a \in A$, there exists an efficient algorithm to maximize a linear functional over the credal set $T(s,a)$. For example, this holds for polytopic supra-MDPs.
To compute the optimal policy, we iterate backwards from the final time-step $H$. Given a state $s \in S$, define the cost at time $H$ to be the minimal achievable loss, i.e. $C_H(s) = \min_{a\in A}\mathcal{L}(s,a)$. We define the policy at time-step $H$ to achieve this minimal loss in every state. More precisely, let $\pi^*$ denote the policy being constructed, and define $\pi^*_H(s) \in \operatorname*{arg\,min}_{a\in A}\mathcal{L}(s,a)$.
Define the cost of state $s$ at time-step $t < H$ by
$$C_t(s) \;=\; \min_{a\in A}\Bigl(\mathcal{L}(s,a) + \gamma \max_{\theta\in T(s,a)}\operatorname*{\mathbb{E}}_{s'\sim\theta}\bigl[C_{t+1}(s')\bigr]\Bigr).$$
Informally speaking, the transition infrakernel associates to every state and action a credal set of distributions over next states. The inner maximum $\max_{\theta\in T(s,a)}\mathbb{E}_{s'\sim\theta}[C_{t+1}(s')]$ is the worst-case expected cost at the next time-step over the credal set, and the outer minimum takes the best case over actions; in this sense $C_t(s)$ is a minimax expected cost.
By assumption, there exists an algorithm to maximize a linear functional over $T(s,a)$ for all $s \in S$ and $a \in A$, and thus $C_t$ is computable. Define $\pi^*_t(s)$ to achieve the minimum, i.e.
$$\pi^*_t(s) \;\in\; \operatorname*{arg\,min}_{a\in A}\Bigl(\mathcal{L}(s,a) + \gamma \max_{\theta\in T(s,a)}\operatorname*{\mathbb{E}}_{s'\sim\theta}\bigl[C_{t+1}(s')\bigr]\Bigr).$$
Then $\pi^*$ is an optimal policy, and furthermore it is Markov.
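Assuming the finite, polytopic representation used in the earlier sketch (credal sets as lists of vertex distributions), the backward induction above can be written roughly as follows; the function and variable names are hypothetical, not from the post.

```python
# Sketch of finite-horizon backward induction for a finite, polytopic crisp
# supra-MDP (hypothetical names, not code from the post).
# `transition[(s, a)]` is a list of vertex distributions {next_state: prob},
# `loss[(s, a)]` lies in [0, 1], and `gamma` is the discount factor.

def markov_optimal_policy(states, actions, transition, loss, gamma, horizon):
    # C_H: minimal achievable loss at the final time-step.
    cost = {s: min(loss[(s, a)] for a in actions) for s in states}
    policy = {(horizon, s): min(actions, key=lambda a: loss[(s, a)])
              for s in states}
    for t in range(horizon - 1, -1, -1):
        def q(s, a):
            # Worst-case expected continuation cost over the credal set:
            # a linear functional is maximized at one of the vertices.
            worst = max(sum(p * cost[s2] for s2, p in dist.items())
                        for dist in transition[(s, a)])
            return loss[(s, a)] + gamma * worst
        new_cost = {}
        for s in states:
            best_action = min(actions, key=lambda a: q(s, a))
            policy[(t, s)] = best_action
            new_cost[s] = q(s, best_action)
        cost = new_cost
    return policy, cost
```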
A policy is said to be stationary if it depends only on the current state, meaning it has no dependence on the history or time-step. A stationary optimal policy always exists for an MDP. We have the following analogous result for supra-MDPs.
Proposition 1 [Alexander Appel (@Diffractor), Vanessa Kosoy (@Vanessa Kosoy)]: Let $M = (S, s_0, A, T, \mathcal{L}, \gamma)$ be a crisp supra-MDP with geometric time discount $\gamma$ such that $S$ and $A$ are finite. Then there exists a stationary optimal policy.
The detailed proof is contained in the proof section. The key idea is that given any policy, expected loss can only decrease by switching at some time to the policy that is optimal for the current state. A stationary optimal policy is obtained by iteratively applying this argument to an optimal policy.
The proof relies on the fact that expectation with respect to a credal set can be written in an iterated form. The iterated form comes from the observation that given any finite time-step $t$, the set of destinies can be decomposed into sets of destinies that are the same up to time $t$. To be more precise, given a history[4] $h$, let $D_h$ denote the set of destinies with prefix $h$. Let $\mathcal{H}_t$ denote the set of histories of length $t$. Then for any time $t$, the set of destinies can be decomposed into a finite, disjoint union of the sets $D_h$ for $h \in \mathcal{H}_t$.
Given a policy $\pi$, let $\pi^{(t)}$ denote the partial policy obtained by restricting $\pi$ to the first $t$ time-steps. More specifically, let $\mathcal{H}_{\le t}$ denote the set of histories of length at most $t$. Then $\pi^{(t)} = \pi|_{\mathcal{H}_{\le t}}$. Furthermore, let $\pi_h$ denote the continuation of $\pi$ after a history $h$, i.e. $\pi_h = \pi|_{\mathcal{H}_{\supseteq h}}$, where $\mathcal{H}_{\supseteq h}$ denotes the set of histories with prefix $h$.
Given a crisp causal law $\Lambda$, we restrict in the natural way to obtain $\Lambda^{(t)}$ (which assigns to $\pi^{(t)}$ a credal set over histories of length $t$) and $\Lambda_h$ (which assigns to $\pi_h$ a credal set over continuations of $h$), where $h$ is a history of length $t$. Then, using the decomposition above, the supra-expectation of a function $u$ of destinies (such as the $\gamma$-discounted total loss) can be written (in a manner reminiscent of Fubini's Theorem) as
$$\operatorname*{\mathbb{E}}_{\Lambda(\pi)}[u] \;=\; \operatorname*{\mathbb{E}}_{h\sim\Lambda^{(t)}(\pi^{(t)})}\Bigl[\operatorname*{\mathbb{E}}_{\Lambda_h(\pi_h)}[u_h]\Bigr],$$
where $u_h(d) := u(hd)$ and the supra-expectation with respect to a credal set denotes the supremum of expectations over its elements.
Every supra-MDP can be interpreted as a zero-sum single-controller stochastic game. A stochastic game is a game with a set of states in which, after each turn, the game transitions to a new state according to a distribution that depends on the current state and the actions taken by the players in that turn. If only one player influences the state transitions, the game has the single-controller property. The payoff of the players similarly depends on the state and actions. In a zero-sum game, the gain of one player is the loss of the other. See, e.g., Regret Minimization Algorithms for Single-Controller Zero-Sum Stochastic Games (G. Peng, M. Raginsky, R. Willett, and D. Zois) for a study of strategies for this type of game.
To introduce notation parallel to our definition of a supra-MDP, we formally define zero-sum single-controller games as follows.[5]
Definition: Zero-sum single-controller game
A zero-sum single-controller stochastic game is a tuple $G = (S, s_0, A_1, A_2, \mathcal{A}, T, \mathcal{L})$ where
- $S$ is a set of states,
- $s_0 \in S$ is an initial state,
- $A_1$ is the action set of Player 1,
- $A_2$ is the action set of Player 2,
- $\mathcal{A}\colon S \times A_1 \to 2^{A_2}$ is a function that returns, for each $s \in S$ and $a \in A_1$, a set $\mathcal{A}(s,a) \subseteq A_2$ of actions available to Player 2 in the turn,
- $T\colon S \times A_2 \to \Delta S$ is a state transition kernel, and
- $\mathcal{L}\colon S \times A_1 \times A_2 \to [0,1]$ is a loss function that Player 1 aims to minimize and Player 2 aims to maximize.
Consider a supra-MDP $M = (S, s_0, A, T, \mathcal{L}, \gamma)$. Let $A_1 = A$ and $A_2 = \Delta S$. In the corresponding game, Player 2 ("Nature", or "Murphy"[6]) controls the transition dynamics in a manner consistent with the transition infrakernel $T$. More precisely, let $\mathcal{A}(s, a) = T(s, a)$ for all $s \in S$ and $a \in A$. As a consequence, the action set of Player 2 in each turn is restricted by the transition infrakernel of the supra-MDP. Define the transition kernel of the game to simply return the second argument, i.e. $(s, \theta) \mapsto \theta$. Furthermore, let the loss of the game be $(s, a, \theta) \mapsto \mathcal{L}(s, a)$.
Using the notation of the supra-MDP, the game has the following protocol. Both players observe the state, which is initialized as $s_0$. Player 1 selects an action $a \in A$. Then Player 2 adversarially selects a distribution over next states. More specifically, constrained by a given $s$ and $a$, Player 2 selects $\theta \in T(s, a)$. Player 1 receives loss $\mathcal{L}(s, a)$ (and Player 2 receives an equal reward). The next state is sampled according to $\theta$.
If the supra-MDP is polytopic, then without loss of generality the corresponding game can be reduced to a game in which $A_2$ is finite. This is true because if $T(s,a)$ is the closed convex hull of a finite number of distributions, then any choice $\theta \in T(s,a)$ can be simulated by Player 2 randomizing over those finitely many distributions.
Since we assume that the transition dynamics of $M$ are unknown to the agent, the corresponding game is unknown (as opposed to informed), meaning that Player 1 does not observe the actions of Player 2. As discussed in the next section, this interpretation of supra-MDPs as stochastic games is significant because regret bounds for this class of games can be understood as regret bounds for supra-MDPs.
A supra-MDP is episodic with horizon $H$ if the state is guaranteed to transition to the initial state at the $H$-th time-step. In other words, the supra-MDP resets back to the initial state after each episode of length $H$.
In Online Learning in Unknown Markov Games, Tian et al. propose an online learning algorithm for episodic games of the type discussed in the previous section. The regret bound for this algorithm thus can be applied to supra-MDPs. The notion of regret in this work compares the expected reward of a policy to the minimax value, i.e. the expected reward of Player 1 at Nash equilibrium. Tian et al. show that it is possible to achieve sublinear regret in this setting with a computationally efficient algorithm. In particular, the regret bound scales sublinearly with the number of episodes $K$ and polynomially with the number of states $|S|$, the number of actions $|A|$, and the horizon $H$.
This regret bound was improved by Appel and Kosoy as a special case of the results in Regret Bounds for Robust Online Decision Making. Whether or not there exists a computationally efficient algorithm achieving the improved regret bound remains an open problem.
In a (supra-)MDP, the state is observed at each time-step. By contrast, in a (supra-)POMDP, an agent only has access to observations. To account for this, a (supra-)POMDP has the additional structure of an observation mapping from states to observations. Similarly to the supra-MDP case, the transition dynamics are Markovian and multivalued. Uncertainty about the initial state is also described by a credal set. We define supra-POMDPs formally as follows.
Definition: Crisp supra-POMDP with geometric time discount
A crisp supra-partially observable Markov decision process (supra-POMDP) is a tuple $M = (S, \Theta_0, A, O, T, \mathcal{O}, \mathcal{L}, \gamma)$ where
- $S$ is a set of states,
- $\Theta_0 \in \square S$ is an initial credal set over states,
- $A$ is a set of actions,
- $O$ is a set of observations,
- $T\colon S \times A \to \square S$ is a transition infrakernel,
- $\mathcal{O}\colon S \to O$ is an observation mapping,
- $\mathcal{L}\colon S \times A \to [0,1]$ is a loss function[7], and
- $\gamma \in [0,1)$ is a geometric time discount.
One might expect the observation map to have range $\Delta O$, since POMDPs are typically defined to have a stochastic observation map. These formulations are equivalent, since the state space can always be replaced by a state space that encodes observations as part of the state, i.e. $S' = S \times O$.
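As a sketch of this standard reduction (the kernel name $B$ and the construction details below are mine, not from the post), suppose the observation map is stochastic, given by $B\colon S \to \Delta O$. In the classical case, one can take the enlarged state space $S' = S \times O$, the deterministic observation map $\mathcal{O}'(s, o) = o$, and the transition kernel

$$
T'\bigl((s,o), a\bigr)(s', o') \;=\; T(s, a)(s') \cdot B(s')(o').
$$

In the supra case, the same substitution can be applied to each distribution in the credal set $T(s, a)$ to obtain the corresponding credal set over $S'$.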
Crisp causal laws naturally give rise to supra-POMDPs, and vice versa. The following proposition and the commutative diagram in Figure 3 make this equivalence explicit.
Proposition 2: Every crisp causal law specifies a crisp supra-POMDP via a map $F$. Furthermore, every crisp supra-POMDP specifies a crisp causal law via a map $G$. For all laws $\Lambda$, $G(F(\Lambda)) = \Lambda$.
We include a proof of Proposition 2 in the next section in order to illustrate the ideas of an environment compatible with a law and a copolicy to a supra-POMDP.
First we define $F$. Let $\Lambda$ denote a crisp causal law. Recall that given an environment $e$ and a policy $\pi$, $e \bowtie \pi$ denotes the distribution on destinies determined by the interaction of the policy with the environment.
Definition: Environment compatible with a law
An environment $e$ is compatible with a crisp causal law $\Lambda$ if $e \bowtie \pi \in \Lambda(\pi)$ for all $\pi$. If $e$ is compatible with $\Lambda$, we write $e \models \Lambda$.
Let $E_\Lambda = \{\, e : e \models \Lambda \,\}$.[8] Note that $E_\Lambda$ is convex and closed. Let $S = E_\Lambda \times (A\times O)^*$. So, the supra-POMDP we will construct has states that are environments together with a recorded history.
Let $\lambda$ denote the empty history. Define the initial credal set over states as $\Theta_0 = \mathrm{cl}\,\mathrm{conv}\,\{\, \delta_{(e,\lambda)} : e \in E_\Lambda \,\}$, where $\delta_{(e,\lambda)}$ denotes the Dirac delta distribution on $(e,\lambda)$. This credal set captures Knightian uncertainty with respect to the entire set of environments.
Define the observation mapping to simply return the most recent observation in the history, i.e. $\mathcal{O}\bigl(e, h(a,o)\bigr) = o$. Note that $\mathcal{O}(e, \lambda)$ can be chosen arbitrarily.
We now define the transition infrakernel. Towards that end, for a fixed history $h$ and action $a$, define $\iota_{h,a}$ to be the function that appends the action and observation to the history component of a state, i.e. $\iota_{h,a}(e, o) = \bigl(e, h(a,o)\bigr)$. Let $(\iota_{h,a})_*$ denote the pushforward along $\iota_{h,a}$. Then define the transition infrakernel to be $T\bigl((e,h), a\bigr) = \bigl\{\, (\iota_{h,a})_*\bigl(\delta_e \times e(h,a)\bigr) \,\bigr\}$. Note that here we use the natural embedding of $\Delta S$ in $\square S$, which identifies a distribution with the corresponding singleton credal set.
We now discuss how a crisp supra-POMDP determines a crisp causal law using copolicies.
Definition: Copolicy to a supra-POMDP
Let $M = (S, \Theta_0, A, O, T, \mathcal{O}, \mathcal{L}, \gamma)$ be a supra-POMDP. A map $\sigma\colon (S\times A)^* \to \Delta S$ is an $M$-copolicy if
- $\sigma(\lambda) \in \Theta_0$ for the empty string $\lambda$, and
- for all non-empty strings $h = h'(s, a)$, $\sigma(h) \in T(s, a)$.
Informally speaking, a copolicy is a map from state-action strings to distributions over next states that is consistent with the transition infrakernel. A copolicy $\sigma$ together with the observation mapping $\mathcal{O}$ induces an environment $e_\sigma$. Let $\mathcal{E}_M = \{\, e_\sigma : \sigma \text{ is an } M\text{-copolicy} \,\}$.
Define $G(M)$ to be the crisp causal law generated by $\mathcal{E}_M$. If $M = F(\Lambda)$, then $\mathcal{E}_M$ contains the environments compatible with $\Lambda$ and every element of $\mathcal{E}_M$ is itself compatible with $\Lambda$, so $\mathcal{E}_M$ generates the law $\Lambda$; thus $G(F(\Lambda)) = \Lambda$ for all laws $\Lambda$.
I'm grateful to Vanessa Kosoy and Alexander Appel for insightful discussions, and Marcus Ogren and Mateusz Bagiński for their valuable feedback on the initial draft.
This toy example is based on When to stop managing or surveying cryptic threatened species (I. Chadès, E. McDonald-Madden, M. A. McCarthy, B. Wintle, M. Linkie, H. P. Possingham).
Infra-bandits and supra-bandits are distinguished by whether expectation (of a reward function or a loss function respectively) is defined by an infimum or a supremum.
See, e.g. State Representation Learning for Control: An Overview (T. Lesort, N. Díaz-Rodríguez, J. Goudou, D. Filliat) for an introduction to state representation learning, the problem of learning a map from a history of interactions (e.g. actions and observations) to states.
Since the states of a supra-MDP are observable, this terminology is consistent with the previous usage.
These games can be defined more broadly; cf. Regret Minimization Algorithms for Single-Controller Zero-Sum Stochastic Games (G. Peng, M. Raginsky, R. Willett, and D. Zois).
This name references Murphy's Law and the infra-Bayesian decision rule.
Whenever relevant (such as below, where $S$ is a compact Polish space), we assume that $\mathcal{L}$ is continuous.
Alternatively, we could define $E_\Lambda$ to be a set of environments that generates $\Lambda$; such a set exists by definition but is not unique.