In this post, we describe a generalization of Markov decision processes (MDPs) and partially observable Markov decision processes (POMDPs) called crisp supra-MDPs and supra-POMDPs. The new feature of these decision processes is that the stochastic transition dynamics are multivalued, i.e. specified by credal sets. We describe how supra-MDPs give rise to crisp causal laws, the hypotheses of infra-Bayesian reinforcement learning. Furthermore, we discuss how supra-MDPs can approximate MDPs by a coarsening of the state space. This coarsening allows an agent to be agnostic about the detailed dynamics while still having performance guarantees for the full MDP.
Analogously to the classical theory, we describe an algorithm to compute a Markov optimal policy for supra-MDPs with finite time horizons. We also prove the existence of a stationary optimal policy for supra-MDPs with infinite time horizons and geometric time discount. Furthermore, we explain the connection between supra-MDPs and stochastic games and summarize known regret bounds. We conclude by showing that crisp supra-POMDPs are equivalent to crisp causal laws.
To motivate the generalization of MDPs using infra-Bayesianism, we consider a toy problem about the conservation of a Sumatran tiger population.[1] The population is either locally extinct or extant, and a conservation group that knows the status of the population can annually choose between managing the population with an anti-poaching strategy and doing nothing in order to divert resources elsewhere.
Suppose that when poaching is low, population dynamics are fairly predictable, and thus extinction probability can be reasonably estimated. On the other hand, when poaching is high, the extinction probability is highly dependent on the connectivity between core subpopulations. This connectivity is difficult to directly estimate and sensitive to factors that change over time such as prey abundance.
For example, the conservationists' best estimate may be that when nothing is done, the extinction probability is equal to 0.2 at 25% connectivity, whereas it is 0.08 at 50% connectivity. In this case, it may be reasonable to assume that the connectivity ranges between 25% and 50%, and that, correspondingly, the extinction probability ranges from 0.08 to 0.2. This range of probabilities can be used to define a credal set. This credal set is a valid hypothesis for the ground truth if the next state starting from extant under the "do nothing" action can be sampled according to some element of the set.
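Using the notation of the definition below (where $T$ is the transition infrakernel and $\Delta S$ denotes probability distributions over the states $S = \{\text{extant}, \text{extinct}\}$), one way to write this credal set explicitly (a restatement of the estimate above, not a formula from the original post) is

$$
T(\text{extant}, \text{do nothing}) \;=\; \bigl\{\, p \in \Delta S \;:\; 0.08 \le p(\text{extinct}) \le 0.2 \,\bigr\}.
$$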
This model can be formulated as a crisp supra-MDP, formally defined as follows. Recall that given a set $X$, $\square X$ denotes the set of credal sets over $X$, i.e. closed convex sets of probability distributions over $X$.
Definition: Crisp supra-MDP with geometric time discount
A crisp supra-Markov decision process (supra-MDP) is a tuple $M = (S, s_0, A, T, \mathcal{L}, \gamma)$ where
- $S$ is a set of states,
- $s_0 \in S$ is an initial state,
- $A$ is a set of actions,
- $T\colon S \times A \to \square S$ is a transition infrakernel,
- $\mathcal{L}\colon S \times A \to [0,1]$ is a loss function, and
- $\gamma \in [0,1)$ is a geometric time discount.
Note that the only difference between an MDP and a crisp supra-MDP is the type of the map that provides information about the state transitions: an MDP has a transition kernel $T\colon S \times A \to \Delta S$, whereas a crisp supra-MDP has a transition infrakernel $T\colon S \times A \to \square S$.
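As an illustration of this data type, here is a minimal sketch (hypothetical names, not code from the post) of a finite, polytopic crisp supra-MDP in which each credal set is stored by the vertices of its convex hull, together with the worst-case expectation of a function over such a set:

```python
from dataclasses import dataclass

# Minimal sketch of a finite, polytopic crisp supra-MDP (hypothetical names,
# not code from the post). Each credal set T(s, a) is stored by the vertices
# of its convex hull, i.e. a list of probability vectors over `states`.

@dataclass
class CrispSupraMDP:
    states: list        # S
    initial_state: str  # s0
    actions: list       # A
    transition: dict    # (s, a) -> list of dicts {next_state: probability}
    loss: dict          # (s, a) -> loss in [0, 1]
    discount: float     # gamma in [0, 1)

def worst_case_expectation(vertices, values):
    """Supremum over the credal set of the expected value of `values`.

    A linear functional over a polytope attains its maximum at a vertex,
    so it suffices to check the finitely many vertex distributions.
    """
    return max(sum(p * values[s] for s, p in dist.items()) for dist in vertices)
```

The helper reflects a fact used repeatedly below: maximizing a linear functional over a polytopic credal set reduces to checking its vertices.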
In the Sumatran tiger example, we have the supra-MDP $M = (S, s_0, A, T, \mathcal{L}, \gamma)$ where $S = \{\text{extant}, \text{extinct}\}$, $s_0 = \text{extant}$, and $A = \{\text{manage}, \text{do nothing}\}$, and $T$ is given by a table of transitions, where the numerical entries denote probabilities (of transitioning from states labeling the rows to states labeling the columns) and intervals, such as $[0.08, 0.2]$ for the transition from extant to extinct under "do nothing", denote credal sets.
Alternatively, Figure 1 summarizes $T$ via a state transition graph.
To summarize from a learning theory perspective, a supra-MDP models the following situation. At each time-step, a system is in a state $s \in S$. The state is known to the agent, who takes an action $a \in A$. It is assumed that the state transitions stochastically according to some element of the credal set $T(s, a)$; the ground truth is said to satisfy the supra-MDP if this assumption holds. With each state transition, the agent receives loss equal to $\mathcal{L}(s, a)$. The goal of the agent during online learning is to minimize the $\gamma$-discounted total loss while under uncertainty about the true state transition dynamics.
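The post later uses the worst-case (infra-Bayesian) decision rule; under that rule, one schematic way to write this objective (my notation, not a formula from the post) is

$$
\min_{\pi}\;\sup_{\text{ground truths satisfying } M}\;\operatorname*{\mathbb{E}}\Bigl[\sum_{t=0}^{\infty} \gamma^{t}\, \mathcal{L}(s_t, a_t)\Bigr],
$$

where $s_t$ and $a_t$ denote the state and action at time-step $t$, and the expectation is with respect to the interaction of the policy $\pi$ with the ground-truth dynamics.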
Recall that a multi-armed bandit is an MDP with one state. Similarly, supra-bandits are a special case of supra-MDPs. In particular, a supra-bandit can be described by a supra-MDP with state space $S = [0,1]$, where the state encodes the loss obtained from pulling the most recent arm. In this case, the transition infrakernel, which then takes the form $T\colon A \to \square S$, depends only on the action. The loss also only depends on the action.
Alternatively, we can equivalently define supra-MDPs so that the transition infrakernel also outputs the loss, i.e. has type signature $T\colon S \times A \to \square(S \times [0,1])$. Under this definition, a supra-bandit is a one-state supra-MDP. (We choose the former definition for the transition infrakernel so that the equivalence of supra-MDPs and stochastic games can be explained below more easily.)
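To make the bandit special case concrete, here is a tiny illustrative sketch (hypothetical numbers and names, not from the post) in which each arm's expected loss is only known up to an interval, and the agent picks the arm whose worst-case expected loss is smallest:

```python
# Illustrative supra-bandit sketch (hypothetical numbers and names, not from
# the post). Each arm's credal set is summarized by an interval of possible
# expected losses; for the agent, the worst case is the upper endpoint.

arms = {
    "arm_0": (0.1, 0.4),  # expected loss known only to lie in [0.1, 0.4]
    "arm_1": (0.2, 0.3),  # expected loss known only to lie in [0.2, 0.3]
}

def minimax_arm(arms):
    """Pick the arm minimizing the worst-case expected loss."""
    return min(arms, key=lambda arm: arms[arm][1])

print(minimax_arm(arms))  # -> "arm_1" (worst case 0.3 beats arm_0's 0.4)
```

This is the same best-case-over-actions, worst-case-over-the-credal-set rule that appears in the dynamic programming section below.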
Regret bounds for infra-bandits[2] are studied in Imprecise Multi-Armed Bandits (V. Kosoy).
Recall that a crisp causal law is a map $\Lambda$ that assigns to each policy $\pi$ a credal set $\Lambda(\pi) \in \square\bigl((A\times O)^\omega\bigr)$ over destinies, and that it is generated by a set of environments, meaning that there exists a set of environments $\mathcal{E}$ such that $\Lambda(\pi) = \mathrm{cl}\,\mathrm{conv}\,\{\, e \bowtie \pi : e \in \mathcal{E} \,\}$, where $e \bowtie \pi$ denotes the distribution over destinies resulting from the interaction of $\pi$ with $e$, $\mathrm{conv}$ denotes convex hull, and $\mathrm{cl}$ denotes closure.
Every crisp supra-MDP naturally gives rise to a set of environments, and thus gives rise to a crisp causal law generated by that set of environments. This is possible in two ways: in the simpler case, we assume that the observations are the states themselves, i.e. $O = S$. More generally, we can assume the existence of a state representation. In this subsection, we describe both constructions, and also discuss how (supra-)regular decision processes (RDPs) provide state representations. Unlike in the partially observable case (discussed below), not every crisp causal law arises from a supra-MDP.
With the goal of describing the first case, we define an $M$-copolicy, which is a map that is compatible with the transition infrakernel of a supra-MDP $M$.
Let $\lambda$ denote the empty string. Define $\mathrm{st}\colon (A\times S)^* \to S$ by $\mathrm{st}(\lambda) = s_0$ and $\mathrm{st}(h) = s$ for nonempty strings $h = h'(a,s)$ with prefix $h' \in (A\times S)^*$ and $(a,s) \in A\times S$.
Definition: Copolicy to a supra-MDP
Let $M = (S, s_0, A, T, \mathcal{L}, \gamma)$ be a supra-MDP. A map $\sigma\colon (A\times S)^* \times A \to \Delta S$ is an $M$-copolicy if $\sigma(h, a) \in T(\mathrm{st}(h), a)$ for all $h \in (A\times S)^*$ and $a \in A$.
If $O = S$, an $M$-copolicy defines an environment, i.e. a map of the form $(A\times O)^* \times A \to \Delta O$. Thus, a supra-MDP gives rise to the set of environments that are $M$-copolicies. This set of environments generates a crisp causal law.
Note that in the classical case, i.e. the case when $T(s,a)$ is a singleton for all $s \in S$ and $a \in A$, there exists a unique copolicy. Thus, an MDP defines an environment.
The transition dynamics of a crisp supra-MDP are (infra-)Markovian in the sense that the credal set $T(s,a)$ is determined entirely by the current state $s$ and action $a$. However, in general the copolicy can depend on the history. In turn, the environments compatible with a crisp supra-MDP are not necessarily Markovian.
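The following is an informal sketch (hypothetical names, not code from the post) of how a history-dependent copolicy interacts with a policy: at each step the copolicy must return a distribution over next states lying in the credal set for the current state and action.

```python
import random

# Informal sketch (hypothetical names, not code from the post) of a rollout
# in which a history-dependent copolicy chooses the transition distribution.
# `copolicy(history, state, action)` must return a distribution over next
# states, as a dict {state: probability}, belonging to the credal set
# T(state, action) of the supra-MDP.

def rollout(initial_state, policy, copolicy, horizon):
    history, state = [], initial_state
    for _ in range(horizon):
        action = policy(history, state)
        dist = copolicy(history, state, action)
        # Because the copolicy sees the whole history, it may pick a different
        # element of T(state, action) at every step; the induced environment
        # therefore need not be Markovian even though T itself is.
        next_state = random.choices(list(dist), weights=list(dist.values()))[0]
        history.append((action, next_state))
        state = next_state
    return history
```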
State representations
We now consider the more general case in which we have a state representation[3] given by functions $\varphi\colon (A\times O)^* \to S$ and $\omega\colon S \to O$. We require that $\varphi$ and $\omega$ satisfy a consistency condition. In particular, given a history of states and actions, we can compute the associated history of actions (determined by the given history) and observations (determined by $\omega$) in the natural way. To see this formally, define $g\colon (A\times S)^* \to (A\times O)^*$ recursively by
$$g(\lambda) = \lambda$$
and
$$g\bigl(h(a,s)\bigr) = g(h)\,\bigl(a, \omega(s)\bigr).$$
Then we say that $\varphi$ and $\omega$ are consistent if $\varphi\bigl(g(h)\bigr) = \mathrm{st}(h)$ for all $h \in (A\times S)^*$.
In this context, we say that $\sigma\colon (A\times S)^* \times A \to \Delta S$ is an $M$-copolicy if $\sigma(h, a) \in T(\mathrm{st}(h), a)$ for all $h \in (A\times S)^*$ and $a \in A$. An $M$-copolicy defines an environment in the following way. Similarly to the definition of $g$, given a history of actions and observations, we can compute the associated history of states (determined by $\varphi$) and actions (determined by the given history). More precisely, define $f\colon (A\times O)^* \to (A\times S)^*$ recursively by $f(\lambda) = \lambda$, where $\lambda$ denotes the empty string, and $f\bigl(h(a,o)\bigr) = f(h)\,\bigl(a, \varphi(h(a,o))\bigr)$.
Let $\omega_*\colon \Delta S \to \Delta O$ denote the pushforward along $\omega$; for instance, given $s \in S$, if $\delta_s$ denotes the Dirac delta distribution on $s$, then $\omega_*\delta_s = \delta_{\omega(s)}$. Define $e_\sigma\colon (A\times O)^* \times A \to \Delta O$ by $e_\sigma(h, a) = \omega_*\bigl(\sigma(f(h), a)\bigr)$.
Then $\{\, e_\sigma : \sigma \text{ is an } M\text{-copolicy} \,\}$ is a set of environments that generates a crisp causal law.
State representations are naturally induced by regular decision processes (RDPs). RDPs can be viewed in two ways. First, they can be viewed as a special case of POMDPs in which (as explained below) the state can always be deduced from an observation history. An RDP can also be thought of as an MDP that is coupled with a state representation, which is significant as this allows one to define an environment (or, in the supra-case, a crisp causal law).
Definition: Regular decision process (RDP)
A regular decision process (RDP) is a tuple $M = (S, s_0, A, O, T, \mathcal{O}, \mathcal{L}, \gamma)$ where
- $S$ is a set of states,
- $s_0 \in S$ is an initial state,
- $A$ is a set of actions,
- $O$ is a set of observations,
- $\mathcal{L}\colon S \times A \to [0,1]$ is a loss function, and
- $\gamma \in [0,1)$ is a geometric time discount.
Furthermore,
- $T\colon S \times A \to \Delta S$ is a transition kernel, and
- $\mathcal{O}\colon S \to O$ is an observation mapping,
which together satisfy that for all $s \in S$, $a \in A$, and $o \in O$, there exists at most one $s' \in \operatorname{supp} T(s,a)$ such that $\mathcal{O}(s') = o$, where $\operatorname{supp}$ denotes support.
Given an RDP $M$, the corresponding state representation $(\varphi, \omega)$ is defined as follows. We obtain $\varphi$ by recursively applying the transition kernel. In particular, let $\varphi(\lambda) = s_0$ for the empty string $\lambda$. Given $h \in (A\times O)^*$ and $(a, o) \in A \times O$, if there exists $s'$ such that $s' \in \operatorname{supp} T(\varphi(h), a)$ and $\mathcal{O}(s') = o$, then define $\varphi\bigl(h(a,o)\bigr) = s'$ (such an $s'$ is unique by the condition above). Otherwise, $\varphi\bigl(h(a,o)\bigr)$ can be defined arbitrarily. Define $\omega$ by $\omega(s) = \mathcal{O}(s)$ for all $s \in S$.
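A minimal sketch (hypothetical names, not code from the post) of how the state representation induced by a finite RDP can be computed: the current state is tracked by scanning the action-observation history and using the fact that at most one successor state is compatible with each observation.

```python
# Sketch of the state representation induced by a finite RDP (hypothetical
# names, not code from the post). `transition[s][a]` is a dict
# {next_state: probability} and `obs_map[s]` is the observation emitted
# in state s.

def rdp_state(history, s0, transition, obs_map):
    """Return phi(history), the state reached after an (action, observation) history."""
    state = s0
    for action, obs in history:
        support = [s2 for s2, p in transition[state][action].items() if p > 0]
        candidates = [s2 for s2 in support if obs_map[s2] == obs]
        if not candidates:
            # The history is impossible under the RDP; phi may be defined arbitrarily.
            return None
        # By the RDP condition there is at most one compatible next state.
        state = candidates[0]
    return state
```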
By replacing the transition kernel with a crisp infrakernel, we can define a crisp supra-RDP. A crisp supra-RDP has all of the data of a crisp supra-MDP, together with the observation mapping that can be used to define a state representation analogously to the classical case.
Definition: Crisp supra-RDP
A crisp supra-regular decision process (supra-RDP) is a tuple $M = (S, s_0, A, O, T, \mathcal{O}, \mathcal{L}, \gamma)$ where
- $S$ is a set of states,
- $s_0 \in S$ is an initial state,
- $A$ is a set of actions,
- $O$ is a set of observations,
- $\mathcal{L}\colon S \times A \to [0,1]$ is a loss function, and
- $\gamma \in [0,1)$ is a geometric time discount.
Furthermore,
- $T\colon S \times A \to \square S$ is a transition infrakernel, and
- $\mathcal{O}\colon S \to O$ is an observation mapping,
which together satisfy that for all $s \in S$, $a \in A$, $\theta \in T(s,a)$, and $o \in O$, there exists at most one $s' \in \operatorname{supp}\theta$ such that $\mathcal{O}(s') = o$.
In the case when a supra-RDP is given (rather than an RDP), there is more freedom in defining the state representation. Given $h \in (A\times O)^*$ and $(a, o) \in A \times O$, define $\varphi\bigl(h(a,o)\bigr) = s'$ if there exists $\theta \in T(\varphi(h), a)$ such that $s' \in \operatorname{supp}\theta$ and $\mathcal{O}(s') = o$. Otherwise, $\varphi\bigl(h(a,o)\bigr)$ can be defined arbitrarily as before. Therefore, a crisp supra-RDP can be thought of as a crisp supra-MDP that is coupled with a state representation.
A supra-MDP can be used to approximate an MDP via a coarsening of the state space. By approximate, we mean that if an MDP is the ground truth, then the ground truth satisfies (is consistent with) the hypothesis that is the supra-MDP. As a result, if a policy for the supra-MDP satisfies some loss bound, then the same bound is guaranteed for the policy interacting with the full MDP. This observation is significant in cases when learning algorithms for MDPs are intractable due to the size of the full state space. Furthermore, the coarsening process allows an agent to be agnostic about parts of the state space that may be difficult to precisely describe.
Consider an MDP $M = (S, s_0, A, T, \mathcal{L}, \gamma)$. Define a surjective coarsening map $\kappa\colon S \to \tilde{S}$ to a set $\tilde{S}$. For the coarsening to be nontrivial, $\kappa$ should be non-injective.
Conceptually, the coarsening map allows an agent to be agnostic about the difference between MDP states that are collapsed together. For example, Figure 2 depicts a coarsening in which an MDP with three states is collapsed to a supra-MDP with two states.
We now define the transition infrakernel of a supra-MDP with state space $\tilde{S}$. Recall that the transition kernel of $M$ is of type $T\colon S \times A \to \Delta S$. Conceptually, to define the transition infrakernel at a coarsened state $\tilde{s}$ and action $a$, we take the preimage of $\tilde{s}$ under the coarsening map, apply the transition kernel of the full MDP, obtain a credal set by taking the closed convex hull, and then project back to $\tilde{S}$ using the coarsening map. More precisely, let $\mathrm{conv}$ denote convex hull, $\mathrm{cl}$ denote closure, and $\kappa_*$ denote the pushforward of $\kappa$ extended to credal sets. Define the transition infrakernel $\tilde{T}\colon \tilde{S} \times A \to \square\tilde{S}$ by
$$\tilde{T}(\tilde{s}, a) \;=\; \kappa_*\Bigl(\mathrm{cl}\,\mathrm{conv}\,\bigl\{\, T(s, a) \;:\; s \in \kappa^{-1}(\tilde{s}) \,\bigr\}\Bigr).$$
If the full state space $S$ is finite, then each credal set $\tilde{T}(\tilde{s}, a)$ is the closed convex hull of a finite number of points in $\Delta\tilde{S}$ (itself a probability simplex) and thus is isomorphic to a closed convex polytope given by a vertex representation. When a transition infrakernel has this special property, the supra-MDP is said to be polytopic.
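The coarsening construction can be sketched in code as follows (hypothetical names, not code from the post): for each coarse state and action, collect the pushforwards of the full MDP's transition distributions over the preimage, and keep them as generating distributions whose closed convex hull is the credal set.

```python
from collections import defaultdict

# Sketch of the coarsening construction (hypothetical names, not code from
# the post). `transition[s][a]` is a dict {next_state: probability} for the
# full MDP, and `coarsen[s]` is the coarse state kappa(s).

def coarsened_infrakernel(states, actions, transition, coarsen):
    """For each (coarse state, action), return a list of distributions over
    coarse states whose closed convex hull is the credal set."""
    infrakernel = defaultdict(list)
    for s in states:
        for a in actions:
            pushforward = defaultdict(float)
            for s_next, p in transition[s][a].items():
                pushforward[coarsen[s_next]] += p
            # Each full state s in the preimage of its coarse state
            # contributes one generating distribution of the credal set.
            infrakernel[(coarsen[s], a)].append(dict(pushforward))
    return dict(infrakernel)
```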
A policy is said to be Markov if it depends on the current state and time-step, but not otherwise on the history. If the time horizon is finite, there is an efficient algorithm to compute a Markov optimal policy for a special class of supra-MDPs. In particular, we require that for all $s \in S$ and $a \in A$, there exists an efficient algorithm to maximize a linear functional over the credal set $T(s,a)$. For example, this holds for polytopic supra-MDPs.
To compute the optimal policy, we iterate backwards from the final time-step $H$. Given a state $s \in S$, define the cost at time $H$ to be the minimal achievable loss, i.e. $C_H(s) = \min_{a\in A}\mathcal{L}(s,a)$. We define the policy at time-step $H$ to achieve this minimal loss in every state. More precisely, let $\pi^*$ denote the policy being constructed, and define $\pi^*_H(s) \in \operatorname*{arg\,min}_{a\in A}\mathcal{L}(s,a)$.
Define the cost of state $s$ at time-step $t < H$ by
$$C_t(s) \;=\; \min_{a\in A}\Bigl(\mathcal{L}(s,a) + \gamma \max_{\theta\in T(s,a)}\operatorname*{\mathbb{E}}_{s'\sim\theta}\bigl[C_{t+1}(s')\bigr]\Bigr).$$
Informally speaking, the transition infrakernel associates to every state and action a credal set of distributions over next states. The inner maximum $\max_{\theta\in T(s,a)}\mathbb{E}_{s'\sim\theta}[C_{t+1}(s')]$ is the worst-case expected cost at the next time-step over the credal set, and the outer minimum takes the best case over actions; in this sense $C_t(s)$ is a minimax expected cost.
By assumption, there exists an algorithm to maximize a linear functional over $T(s,a)$ for all $s \in S$ and $a \in A$, and thus $C_t$ is computable. Define $\pi^*_t(s)$ to achieve the minimum, i.e.
$$\pi^*_t(s) \;\in\; \operatorname*{arg\,min}_{a\in A}\Bigl(\mathcal{L}(s,a) + \gamma \max_{\theta\in T(s,a)}\operatorname*{\mathbb{E}}_{s'\sim\theta}\bigl[C_{t+1}(s')\bigr]\Bigr).$$
Then $\pi^*$ is an optimal policy, and furthermore it is Markov.
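Assuming the finite, polytopic representation used in the earlier sketch (credal sets as lists of vertex distributions), the backward induction above can be written roughly as follows; the function and variable names are hypothetical, not from the post.

```python
# Sketch of finite-horizon backward induction for a finite, polytopic crisp
# supra-MDP (hypothetical names, not code from the post).
# `transition[(s, a)]` is a list of vertex distributions {next_state: prob},
# `loss[(s, a)]` lies in [0, 1], and `gamma` is the discount factor.

def markov_optimal_policy(states, actions, transition, loss, gamma, horizon):
    # C_H: minimal achievable loss at the final time-step.
    cost = {s: min(loss[(s, a)] for a in actions) for s in states}
    policy = {(horizon, s): min(actions, key=lambda a: loss[(s, a)])
              for s in states}
    for t in range(horizon - 1, -1, -1):
        def q(s, a):
            # Worst-case expected continuation cost over the credal set:
            # a linear functional is maximized at one of the vertices.
            worst = max(sum(p * cost[s2] for s2, p in dist.items())
                        for dist in transition[(s, a)])
            return loss[(s, a)] + gamma * worst
        new_cost = {}
        for s in states:
            best_action = min(actions, key=lambda a: q(s, a))
            policy[(t, s)] = best_action
            new_cost[s] = q(s, best_action)
        cost = new_cost
    return policy, cost
```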
A policy is said to be stationary if it depends only on the current state, meaning it has no dependence on the history or time-step. A stationary optimal policy always exists for an MDP. We have the following analogous result for supra-MDPs.
Proposition 1 [Alexander Appel (@Diffractor), Vanessa Kosoy (@Vanessa Kosoy)]: Let $M = (S, s_0, A, T, \mathcal{L}, \gamma)$ be a crisp supra-MDP with geometric time discount $\gamma$ such that $S$ and $A$ are finite. Then there exists a stationary optimal policy.
The detailed proof is contained in the proof section. The key idea is that given any policy, expected loss can only decrease by switching at some time to the policy that is optimal for the current state. A stationary optimal policy is obtained by iteratively applying this argument to an optimal policy.
The proof relies on the fact that expectation with respect to a credal set can be written in an iterated form. The iterated form comes from the observation that given any finite time-step $t$, the set of destinies can be decomposed into sets of destinies that are the same up to time $t$. To be more precise, given a history[4] $h$, let $D_h$ denote the set of destinies with prefix $h$. Let $\mathcal{H}_t$ denote the set of histories of length $t$. Then for any time $t$, the set of destinies can be decomposed into a finite, disjoint union of the sets $D_h$ for $h \in \mathcal{H}_t$.
Given a policy $\pi$, let $\pi^{(t)}$ denote the partial policy obtained by restricting $\pi$ to the first $t$ time-steps. More specifically, let $\mathcal{H}_{\le t}$ denote the set of histories of length at most $t$. Then $\pi^{(t)} = \pi|_{\mathcal{H}_{\le t}}$. Furthermore, let $\pi_h$ denote the continuation of $\pi$ after a history $h$, i.e. $\pi_h = \pi|_{\mathcal{H}_{\supseteq h}}$, where $\mathcal{H}_{\supseteq h}$ denotes the set of histories with prefix $h$.
Given a crisp causal law $\Lambda$, we restrict in the natural way to obtain $\Lambda^{(t)}$ (which assigns to $\pi^{(t)}$ a credal set over histories of length $t$) and $\Lambda_h$ (which assigns to $\pi_h$ a credal set over continuations of $h$), where $h$ is a history of length $t$. Then, using the decomposition above, the supra-expectation of a function $u$ of destinies (such as the $\gamma$-discounted total loss) can be written (in a manner reminiscent of Fubini's Theorem) as
$$\operatorname*{\mathbb{E}}_{\Lambda(\pi)}[u] \;=\; \operatorname*{\mathbb{E}}_{h\sim\Lambda^{(t)}(\pi^{(t)})}\Bigl[\operatorname*{\mathbb{E}}_{\Lambda_h(\pi_h)}[u_h]\Bigr],$$
where $u_h(d) := u(hd)$ and the supra-expectation with respect to a credal set denotes the supremum of expectations over its elements.
Every supra-MDP can be interpreted as a zero-sum single-controller stochastic game. A stochastic game is a game with a set of states in which, after each turn, the game transitions to a new state according to a distribution that depends on the current state and the actions taken by the players in that turn. If only one player influences the state transitions, the game has the single-controller property. The payoff of the players similarly depends on the state and actions. In a zero-sum game, the gain of one player is the loss of the other. See, e.g., Regret Minimization Algorithms for Single-Controller Zero-Sum Stochastic Games (G. Peng, M. Raginsky, R. Willett, and D. Zois) for a study of strategies for this type of game.
To introduce notation parallel to our definition of a supra-MDP, we formally define zero-sum single-controller games as follows.[5]
Definition: Zero-sum single-controller game
A zero-sum single-controller stochastic game is a tuple $G = (S, s_0, A_1, A_2, \mathcal{A}, T, \mathcal{L})$ where
- $S$ is a set of states,
- $s_0 \in S$ is an initial state,
- $A_1$ is the action set of Player 1,
- $A_2$ is the action set of Player 2,
- $\mathcal{A}\colon S \times A_1 \to 2^{A_2}$ is a function that returns, for each $s \in S$ and $a \in A_1$, a set $\mathcal{A}(s,a) \subseteq A_2$ of actions available to Player 2 in the turn,
- $T\colon S \times A_2 \to \Delta S$ is a state transition kernel, and
- $\mathcal{L}\colon S \times A_1 \times A_2 \to [0,1]$ is a loss function that Player 1 aims to minimize and Player 2 aims to maximize.
Consider a supra-MDP $M = (S, s_0, A, T, \mathcal{L}, \gamma)$. Let $A_1 = A$ and $A_2 = \Delta S$. In the corresponding game, Player 2 ("Nature", or "Murphy"[6]) controls the transition dynamics in a manner consistent with the transition infrakernel $T$. More precisely, let $\mathcal{A}(s, a) = T(s, a)$ for all $s \in S$ and $a \in A$. As a consequence, the action set of Player 2 in each turn is restricted by the transition infrakernel of the supra-MDP. Define the transition kernel of the game to simply return the second argument, i.e. $(s, \theta) \mapsto \theta$. Furthermore, let the loss of the game be $(s, a, \theta) \mapsto \mathcal{L}(s, a)$.
Using the notation of the supra-MDP, the game has the following protocol. Both players observe the state, which is initialized as $s_0$. Player 1 selects an action $a \in A$. Then Player 2 adversarially selects a distribution over next states. More specifically, constrained by a given $s$ and $a$, Player 2 selects $\theta \in T(s, a)$. Player 1 receives loss $\mathcal{L}(s, a)$ (and Player 2 receives an equal reward). The next state is sampled according to $\theta$.
If the supra-MDP is polytopic, then without loss of generality the corresponding game can be reduced to a game in which $A_2$ is finite. This is true because if $T(s,a)$ is the closed convex hull of a finite number of distributions, then any choice $\theta \in T(s,a)$ can be simulated by Player 2 randomizing over those finitely many distributions.
Since we assume that the transition dynamics of $M$ are unknown to the agent, the corresponding game is unknown (as opposed to informed), meaning that Player 1 does not observe the actions of Player 2. As discussed in the next section, this interpretation of supra-MDPs as stochastic games is significant because regret bounds for this class of games can be understood as regret bounds for supra-MDPs.
A supra-MDP is episodic with horizon $H$ if the state is guaranteed to transition to the initial state at the $H$-th time-step. In other words, the supra-MDP resets back to the initial state after each episode of length $H$.
In Online Learning in Unknown Markov Games, Tian et al. propose an online learning algorithm for episodic games of the type discussed in the previous section. The regret bound for this algorithm thus can be applied to supra-MDPs. The notion of regret in this work compares the expected reward of a policy to the minimax value, i.e. the expected reward of Player 1 at Nash equilibrium. Tian et al. show that it is possible to achieve sublinear regret in this setting with a computationally efficient algorithm. In particular, the regret bound scales sublinearly with the number of episodes $K$ and polynomially with the number of states $|S|$, the number of actions $|A|$, and the horizon $H$.
This regret bound was improved by Appel and Kosoy as a special case of the results in Regret Bounds for Robust Online Decision Making. Whether or not there exists a computationally efficient algorithm achieving the improved regret bound remains an open problem.
In a (supra-)MDP, the state is observed at each time-step. By contrast, in a (supra-)POMDP, an agent only has access to observations. To account for this, a (supra-)POMDP has the additional structure of an observation mapping from states to observations. Similarly to the supra-MDP case, the transition dynamics are Markovian and multivalued. Uncertainty about the initial state is also described by a credal set. We define supra-POMDPs formally as follows.
Definition: Crisp supra-POMDP with geometric time discount
A crisp supra-partially observable Markov decision process (supra-POMDP) is a tuple $M = (S, \Theta_0, A, O, T, \mathcal{O}, \mathcal{L}, \gamma)$ where
- $S$ is a set of states,
- $\Theta_0 \in \square S$ is an initial credal set over states,
- $A$ is a set of actions,
- $O$ is a set of observations,
- $T\colon S \times A \to \square S$ is a transition infrakernel,
- $\mathcal{O}\colon S \to O$ is an observation mapping,
- $\mathcal{L}\colon S \times A \to [0,1]$ is a loss function[7], and
- $\gamma \in [0,1)$ is a geometric time discount.
One might expect the observation map to have range $\Delta O$, since POMDPs are typically defined to have a stochastic observation map. These formulations are equivalent, since the state space can always be replaced by a state space that encodes observations as part of the state, i.e. $S' = S \times O$.
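As a sketch of this standard reduction (the kernel name $B$ and the construction details below are mine, not from the post), suppose the observation map is stochastic, given by $B\colon S \to \Delta O$. In the classical case, one can take the enlarged state space $S' = S \times O$, the deterministic observation map $\mathcal{O}'(s, o) = o$, and the transition kernel

$$
T'\bigl((s,o), a\bigr)(s', o') \;=\; T(s, a)(s') \cdot B(s')(o').
$$

In the supra case, the same substitution can be applied to each distribution in the credal set $T(s, a)$ to obtain the corresponding credal set over $S'$.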
Crisp causal laws naturally give rise to supra-POMDPs, and vice versa. The following proposition and the commutative diagram in Figure 3 make this equivalence explicit.
Proposition 2: Every crisp causal law specifies a crisp supra-POMDP via a map $F$. Furthermore, every crisp supra-POMDP specifies a crisp causal law via a map $G$. For all laws $\Lambda$, $G(F(\Lambda)) = \Lambda$.
We include a proof of Proposition 2 in the next section in order to illustrate the ideas of an environment compatible with a law and a copolicy to a supra-POMDP.
First we define $F$. Let $\Lambda$ denote a crisp causal law. Recall that given an environment $e$ and a policy $\pi$, $e \bowtie \pi$ denotes the distribution on destinies determined by the interaction of the policy with the environment.
Definition: Environment compatible with a law
An environment $e$ is compatible with a crisp causal law $\Lambda$ if $e \bowtie \pi \in \Lambda(\pi)$ for all $\pi$. If $e$ is compatible with $\Lambda$, we write $e \models \Lambda$.
Let $E_\Lambda = \{\, e : e \models \Lambda \,\}$.[8] Note that $E_\Lambda$ is convex and closed. Let $S = E_\Lambda \times (A\times O)^*$. So, the supra-POMDP we will construct has states that are environments together with a recorded history.
Let $\lambda$ denote the empty history. Define the initial credal set over states as $\Theta_0 = \mathrm{cl}\,\mathrm{conv}\,\{\, \delta_{(e,\lambda)} : e \in E_\Lambda \,\}$, where $\delta_{(e,\lambda)}$ denotes the Dirac delta distribution on $(e,\lambda)$. This credal set captures Knightian uncertainty with respect to the entire set of environments.
Define the observation mapping to simply return the most recent observation in the history, i.e. $\mathcal{O}\bigl(e, h(a,o)\bigr) = o$. Note that $\mathcal{O}(e, \lambda)$ can be chosen arbitrarily.
We now define the transition infrakernel. Towards that end, for a fixed history $h$ and action $a$, define $\iota_{h,a}$ to be the function that appends the action and observation to the history component of a state, i.e. $\iota_{h,a}(e, o) = \bigl(e, h(a,o)\bigr)$. Let $(\iota_{h,a})_*$ denote the pushforward along $\iota_{h,a}$. Then define the transition infrakernel to be $T\bigl((e,h), a\bigr) = \bigl\{\, (\iota_{h,a})_*\bigl(\delta_e \times e(h,a)\bigr) \,\bigr\}$. Note that here we use the natural embedding of $\Delta S$ in $\square S$, which identifies a distribution with the corresponding singleton credal set.
We now discuss how a crisp supra-POMDP determines a crisp causal law using copolicies.
Definition: Copolicy to a supra-POMDP
Let $M = (S, \Theta_0, A, O, T, \mathcal{O}, \mathcal{L}, \gamma)$ be a supra-POMDP. A map $\sigma\colon (S\times A)^* \to \Delta S$ is an $M$-copolicy if
- $\sigma(\lambda) \in \Theta_0$ for the empty string $\lambda$, and
- for all non-empty strings $h = h'(s, a)$, $\sigma(h) \in T(s, a)$.
Informally speaking, a copolicy is a map from state-action strings to distributions over next states that is consistent with the transition infrakernel. A copolicy $\sigma$ together with the observation mapping $\mathcal{O}$ induces an environment $e_\sigma$. Let $\mathcal{E}_M = \{\, e_\sigma : \sigma \text{ is an } M\text{-copolicy} \,\}$.
Define $G(M)$ to be the crisp causal law generated by $\mathcal{E}_M$. If $M = F(\Lambda)$, then $\mathcal{E}_M$ contains the environments compatible with $\Lambda$ and every element of $\mathcal{E}_M$ is itself compatible with $\Lambda$, so $\mathcal{E}_M$ generates the law $\Lambda$; thus $G(F(\Lambda)) = \Lambda$ for all laws $\Lambda$.
I'm grateful to Vanessa Kosoy and Alexander Appel for insightful discussions, and Marcus Ogren and Mateusz Bagiński for their valuable feedback on the initial draft.
This toy example is based on When to stop managing or surveying cryptic threatened species (I. Chadès, E. McDonald-Madden, M. A. McCarthy, B. Wintle, M. Linkie, H. P. Possingham).
Infra-bandits and supra-bandits are distinguished by whether expectation (of a reward function or a loss function respectively) is defined by an infimum or a supremum.
See, e.g. State Representation Learning for Control: An Overview (T. Lesort, N. Díaz-Rodríguez, J. Goudou, D. Filliat) for an introduction to state representation learning, the problem of learning a map from a history of interactions (e.g. actions and observations) to states.
Since the states of a supra-MDP are observable, this terminology is consistent with the previous usage.
These games can be defined more broadly; cf. Regret Minimization Algorithms for Single-Controller Zero-Sum Stochastic Games (G. Peng, M. Raginsky, R. Willett, and D. Zois).
This name references Murphy's Law and the infra-Bayesian decision rule.
Whenever relevant (such as below, where $S$ is a compact Polish space), we assume that $\mathcal{L}$ is continuous.
Alternatively, we could define $E_\Lambda$ to be a set of environments that generates $\Lambda$; such a set exists by definition but is not unique.