Quantilal control for finite MDPs

Vanessa Kosoy

Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

We introduce a variant of the concept of a "quantilizer" for the setting of choosing a policy for a finite Markov decision process (MDP), where the generic unknown cost is replaced by an unknown penalty term in the reward function. This is essentially a generalization of quantilization in repeated games with a cost independence assumption. We show that the "quantilal" policy shares some properties with the ordinary optimal policy, namely that (i) it can always be chosen to be Markov (ii) it can be chosen to be stationary when time discount is geometric (iii) the "quantilum" value of an MDP with geometric time discount is a continuous piecewise rational function of the parameters, and it converges when the discount parameter $λ$ approaches 1. Finally, we demonstrate a polynomial-time algorithm for computing the quantilal policy, showing that quantilization is not qualitatively harder than ordinary optimization.

Background

Quantilization (introduced in Taylor 2015) is a method of dealing with "Extremal Goodhart's Law". According to Extremal Goodhart, when we attempt to optimize a utility function $U^{*} : A \to \mathbb{R}$ by aggressively optimizing a proxy $U : A \to \mathbb{R}$, we are likely to land outside of the domain where the proxy is useful. Quantilization addresses this by assuming an unknown cost function $C : A \to [0, \infty)$ whose expectation $\mathrm{E}_{ζ} [C]$ w.r.t. some reference probability measure $ζ \in Δ A$ is bounded by 1. $ζ$ can be thought of as defining the "domain" within which $U$ is well-behaved (for example, it can be the probability measure of choices made by humans). We can then seek to maximize $\mathrm{E} [U]$ while constraining $\mathrm{E} [C]$ by a fixed bound $C_{max}$:

$\tilde{ξ}^{*} :\in \underset{ξ \in Δ A}{\arg\max} \left\{ \mathrm{E}_{ξ} [U] \,\middle|\, \forall C : A \to [0, \infty) : \mathrm{E}_{ζ} [C] \leq 1 \implies \mathrm{E}_{ξ} [C] \leq C_{max} \right\}$

Alternatively, we can choose some parameter $η \in (0, \infty)$ and maximize the minimal guaranteed expectation of $U - η C$ :

$ξ^{*} :\in \underset{ξ \in Δ A}{\arg\max} \inf_{C : A \to [0, \infty)} \left\{ \mathrm{E}_{ξ} [U - η C] \,\middle|\, \mathrm{E}_{ζ} [C] \leq 1 \right\}$

These two formulations are almost equivalent since both amount to finding a strategy that is Pareto efficient w.r.t. the two objectives $\mathrm{E} [U]$ and $- \mathrm{E} [C]$. For $\tilde{ξ}^{*}$ the tradeoff is governed by the parameter $C_{max}$ and for $ξ^{*}$ the tradeoff is governed by the parameter $η$. Indeed, it is easy to see that any $ξ^{*}$ is also optimal for the first criterion if we take $C_{max} = \mathrm{E}_{ξ^{*}} [C]$, and any $\tilde{ξ}^{*}$ is also optimal for the latter criterion for an appropriate choice of $η$ (it needs to be a subderivative of the Pareto frontier).

In the following, we will use as our starting point the second formulation, which can be thought of as a zero-sum game in which $U - η C$ is the utility function of an agent whose strategy set is $A$ , and $C$ is chosen by the adversary. The quantilal strategy $ξ^{*}$ is the Nash equilibrium of the game.

This formulation seems natural if we take $η := \mathrm{E}_{ζ} [\max (U - U^{*}, 0)]$ (a measure of how "optimistic" $U$ is in the domain $ζ$) and $C := η^{- 1} \max (U - U^{*}, 0)$. In particular, the quantilum value (the value of the game) is a lower bound on the expectation of $U^{*}$.

In principle, this formalism can be applied to sequential interaction between an agent and an environment, if we replace $A$ by the set of policies. However, if it is possible to make structural assumptions about $U$ and $C$ , we can do better. Taylor explores one such structural assumption, namely a sequence of independent games in which both $U$ and $C$ are additive across the games. We consider a more general setting, namely that of a finite Markov decision process (MDP).

Notation

Given a set $A$ , the notation $A^{*}$ will denote the set of finite strings over alphabet $A$ , i.e.

$A^{*} := \bigsqcup_{n = 0}^{\infty} A^{n}$

$A^{ω}$ denotes the space of infinite strings over alphabet $A$, equipped with the product topology and the corresponding Borel sigma-algebra. Given $x \in A^{ω}$ and $n \in \mathbb{N}$, $x_{n} \in A$ is the $n$-th symbol of the string $x$ (in our conventions, $0 \in \mathbb{N}$ so the string begins from the 0th symbol). Given $h \in A^{*}$ and $x \in A^{ω}$, the notation $h ⊏ x$ means that $h$ is a prefix of $x$.

Given a set $A$, $x \in A^{ω}$ and $n \in \mathbb{N}$, the notation $x_{: n}$ will indicate the prefix of $x$ of length $n$. That is, $x_{: n} \in A^{n}$ and $x_{: n} ⊏ x$.

Given a measurable space $X$ , we denote $Δ X$ the space of probability measures on $X$ .

Given measurable spaces $X$ and $Y$, the notation $K : X \xrightarrow{k} Y$ means that $K$ is a Markov kernel from $X$ to $Y$. Given $x \in X$, $K (x)$ is the corresponding probability measure on $Y$. Given $A \subseteq Y$ measurable, $K (A \mid x) := K (x) (A)$. Given $y \in Y$, $K (y \mid x) := K (\{y\} \mid x)$. Given $J : Y \xrightarrow{k} Z$, $J K : X \xrightarrow{k} Z$ is the composition of $J$ and $K$, and when $Y = X$, $K^{n}$ is the $n$-th composition power.

Results

A finite MDP is defined by a non-empty finite set of states $S$, a non-empty finite set of actions $A$, a transition kernel $T : S \times A \xrightarrow{k} S$ and a reward function $R : S \to \mathbb{R}$. To specify the utility function, we also need to fix a time discount function $γ \in Δ \mathbb{N}$. This allows defining $U : S^{ω} \to \mathbb{R}$ by

$U (x) := \mathrm{E}_{n \sim γ} [R (x_{n})]$

We also fix an initial distribution over states $ζ_{0} \in Δ S$ . In "classical" MDP theory, it is sufficient to consider a deterministic initial state $s_{0} \in S$ , since the optimal policy doesn't depend on the initial state anyway. However, quantilization is different since the worst-case cost function depends on the initial conditions.

We now assume that the cost function $C : S^{ω} \to \mathbb{R}$ (or, the true utility function $U^{*}$) has the same form. That is, there is some penalty function $P : S \to [0, \infty)$ s.t.

$C (x) = \mathrm{E}_{n \sim γ} [P (x_{n})]$

Given a policy $π : S^{*} \times S \xrightarrow{k} A$ (where the $S^{*}$ factor represents the past history and the factor $S$ represents the current state), we define $H π \in Δ S^{ω}$ in the usual way (the distribution over histories resulting from $π$). Finally, we fix $η \in (0, \infty)$. We are now ready to define quantilization in this setting.

Definition 1

$π^{*} : S^{*} \times S \xrightarrow{k} A$ is said to be quantilal relative to reference policy $σ : S^{*} \times S \xrightarrow{k} A$ when

$π^{*} :\in \underset{π : S^{*} \times S \xrightarrow{k} A}{\arg\max} \inf_{P : S \to [0, \infty)} \left\{ \mathrm{E}_{\substack{x \sim H π \\ n \sim γ}} [R (x_{n}) - η P (x_{n})] \,\middle|\, \mathrm{E}_{\substack{x \sim H σ \\ n \sim γ}} [P (x_{n})] \leq 1 \right\}$

We also define the quantilum value $QV \in \mathbb{R}$ by

$QV := \sup_{π : S^{*} \times S \xrightarrow{k} A} \inf_{P : S \to [0, \infty)} \left\{ \mathrm{E}_{\substack{x \sim H π \\ n \sim γ}} [R (x_{n}) - η P (x_{n})] \,\middle|\, \mathrm{E}_{\substack{x \sim H σ \\ n \sim γ}} [P (x_{n})] \leq 1 \right\}$

($QV$ cannot be $- \infty$ since taking $π = σ$ yields a lower bound of $\mathrm{E}_{\substack{x \sim H σ \\ n \sim γ}} [R (x_{n})] - η$.)

In the original quantilization formalism, the quantilal strategy can be described more explicitly, as sampling according to the reference measure $ζ$ from some top fraction of actions ranked by expected utility. Here, we don't have an analogous description, but we can in some sense evaluate the infimum over $P$ .

For any $π : S^{*} \times S \xrightarrow{k} A$, define $Z π \in Δ S$ by

$Z π (s) := \Pr_{\substack{x \sim H π \\ n \sim γ}} [x_{n} = s]$

For any $μ, ν \in Δ S$, the notation $D_{\infty} (μ | | ν)$ signifies the Renyi divergence of order $\infty$:

$D_{\infty} (μ | | ν) := \ln \max_{\substack{s \in S \\ μ (s) > 0}} \frac{μ (s)}{ν (s)}$

In general, $D_{\infty} (μ | | ν) \in [0, \infty]$ .
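For finite distributions the divergence above is straightforward to evaluate. The following is a minimal sketch, assuming distributions are given as Python lists indexed by state; the function name `renyi_inf` is introduced here for illustration only.

```python
import math

def renyi_inf(mu, nu):
    """D_inf(mu || nu) = ln max over states with mu(s) > 0 of mu(s) / nu(s)."""
    worst = 0.0
    for m, n in zip(mu, nu):
        if m > 0:
            if n == 0:
                return math.inf  # mu puts mass where nu has none
            worst = max(worst, m / n)
    return math.log(worst)
```

For probability distributions the maximal ratio is at least 1, so the result is non-negative, and it is zero when the two distributions coincide.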

Proposition 1

$π^{*} : S^{*} \times S \xrightarrow{k} A$ is quantilal relative to reference policy $σ : S^{*} \times S \xrightarrow{k} A$ if and only if

$π^{*} \in \underset{π : S^{*} \times S \xrightarrow{k} A}{\arg\max} \left( \mathrm{E}_{Z π} [R] - η \exp D_{\infty} (Z π | | Z σ) \right)$

Also, we have

$QV = \sup_{π : S^{*} \times S \xrightarrow{k} A} \left( \mathrm{E}_{Z π} [R] - η \exp D_{\infty} (Z π | | Z σ) \right)$

If the maximization in Proposition 1 was over arbitrary $ξ \in Δ S$ rather than $ξ$ of the form $Z π$ , we would get ordinary quantilization and sampling $ξ$ would be equivalent to sampling some top fraction of $ζ := Z σ$ . However, in general, the image of the $Z$ operator is some closed convex set which is not the entire $Δ S$ .
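To make the "top fraction" picture concrete, here is a minimal sketch of ordinary quantilization over a finite set, assuming $ζ$ is strictly positive. The names `top_fraction` and `guaranteed_value`, and the cap parameter `q` (meaning we require $ξ (a) \leq ζ (a) / q$, i.e. $\exp D_{\infty} (ξ | | ζ) \leq 1 / q$), are assumptions made for this illustration and do not appear in the text above.

```python
def top_fraction(U, zeta, q):
    """Distribution xi maximizing E_xi[U] subject to xi(a) <= zeta(a) / q."""
    xi = [0.0] * len(U)
    remaining = 1.0
    # Fill probability mass greedily onto the highest-utility elements.
    for a in sorted(range(len(U)), key=lambda a: U[a], reverse=True):
        xi[a] = min(zeta[a] / q, remaining)
        remaining -= xi[a]
        if remaining <= 0:
            break
    return xi

def guaranteed_value(U, xi, zeta, eta):
    """E_xi[U] - eta * max ratio xi/zeta; assumes zeta > 0 wherever xi > 0."""
    ratio = max(x / z for x, z in zip(xi, zeta) if x > 0)
    return sum(x * u for x, u in zip(xi, U)) - eta * ratio

U = [0.0, 1.0, 2.0, 3.0]
zeta = [0.25, 0.25, 0.25, 0.25]
xi = top_fraction(U, zeta, q=0.5)          # mass only on the top half of zeta
value = guaranteed_value(U, xi, zeta, eta=0.1)
```

In the MDP setting this greedy construction is not available in general, because only distributions of the form $Z π$ can be realized.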

So far we considered arbitrary (non-stationary) policies. From classical MDP theory, we know that an optimal policy can always be chosen to be Markov:

Definition 2

$π : S^{*} \times S \xrightarrow{k} A$ is said to be a Markov policy, when there is some $π_{M} : \mathbb{N} \times S \xrightarrow{k} A$ s.t. $π (h, s) = π_{M} (| h |, s)$.

Note that a priori it might be unclear whether there is even a non-stationary quantilal policy. However, we have

Proposition 2

For any $σ : S^{*} \times S \xrightarrow{k} A$, there exists a Markov policy which is quantilal relative to $σ$.

Now assume that our time discount function is geometric, i.e. there exists $λ \in [0, 1)$ s.t. $γ (n) = (1 - λ) λ^{n}$. Then it is known that an optimal policy can be chosen to be stationary:

Definition 3

$π : S^{*} \times S \xrightarrow{k} A$ is said to be a stationary policy, when there is some $π_{T} : S \xrightarrow{k} A$ s.t. $π (h, s) = π_{T} (s)$.

Once again, the situation for quantilal policies is analogous:

Proposition 3

If $γ$ is geometric, then for any $σ : S^{*} \times S \xrightarrow{k} A$, there exists a stationary policy which is quantilal relative to $σ$.

What is not analogous is that an optimal policy can be chosen to be deterministic whereas, of course, this is not the case for quantilal policies.
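The following toy example illustrates the point numerically. It is only a sketch under stated assumptions: the two-state MDP, the uniform reference policy, the parameter values, and the helper names `occupancy`, `objective` and `score` are all hypothetical choices made here, and the search is restricted to stationary policies that use the same action distribution in both states.

```python
def occupancy(T, policy, zeta0, lam, iters=200):
    """Z pi for a stationary policy, via the iteration Z <- (1-lam) zeta0 + lam T^pi Z."""
    n = len(zeta0)
    Z = list(zeta0)
    for _ in range(iters):
        nxt = [(1 - lam) * zeta0[t] for t in range(n)]
        for s in range(n):
            for a, pa in enumerate(policy[s]):
                for t in range(n):
                    nxt[t] += lam * Z[s] * pa * T[s][a][t]
        Z = nxt
    return Z

def objective(Zpi, Zsigma, R, eta):
    """E_{Z pi}[R] - eta * exp D_inf(Z pi || Z sigma), assuming Z sigma > 0."""
    ratio = max(zp / zs for zp, zs in zip(Zpi, Zsigma) if zp > 0)
    return sum(zp * r for zp, r in zip(Zpi, R)) - eta * ratio

# Hypothetical MDP: T[s][a][t], where action a moves deterministically to state a.
T = [[[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]]]
R, zeta0, lam, eta = [0.0, 1.0], [1.0, 0.0], 0.5, 0.5
Zsigma = occupancy(T, [[0.5, 0.5]] * 2, zeta0, lam)  # uniform reference policy

def score(p):
    """Objective of the stationary policy taking action 1 with probability p."""
    pi = [[1 - p, p]] * 2
    return objective(occupancy(T, pi, zeta0, lam), Zsigma, R, eta)

best_p = max((k / 100 for k in range(101)), key=score)
```

On this grid the best policy is the stochastic one with `p = 0.5`, which scores strictly higher than either deterministic member of the family (`p = 0` or `p = 1`).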

It is known that the value of an optimal policy depends on the parameters as a piecewise rational function, and in particular it converges as $λ \to 1$ and has a Taylor expansion at $λ = 1$ . The quantilum value has the same property.

Proposition 4

$QV$ is a piecewise rational continuous function of $λ$, $η$, the matrix elements of $T$, the values of $R$, the values of $ζ_{0}$ and the values of $Z σ$, with a finite number of "pieces".

Corollary 1

Assume that $σ$ is a stationary policy. Then, $QV$ converges as $λ \to 1$ , holding all other parameters fixed (in the sense that, $σ$ is fixed whereas $Z σ$ changes as a function of $λ$ ). It is analytic in $λ$ for some interval $[λ_{0}, 1]$ and therefore can be described by a Taylor expansion around $λ = 1$ inside this interval.

Note that for optimal policies, Proposition 4 holds for a simpler reason. Specifically, the optimal policy is piecewise constant (since it's deterministic) and there is a Blackwell policy i.e. a fixed policy which is optimal for any $λ$ sufficiently close to 1. And, it is easy to see that for a constant policy, the value is a rational function of $λ$ . On the other hand, the quantilal policy is not guaranteed to be locally constant anywhere. Nevertheless the quantilum value is still piecewise rational.

Finally, we address the question of the computational complexity of quantilization. We prove the following

Proposition 5

Assume geometric time discount. Assume further that $R (s)$ , $T (t | s, a)$ , $ζ_{0} (s)$ , $λ$ , $Z σ (s)$ and $η$ are rational numbers. Then:

a. There is an algorithm for computing $QV$ which runs in time polynomial in the size of the input $R$, $T$, $ζ_{0}$, $λ$, $Z σ$ and $η$. Also, if $σ$ is stationary and $σ (a | s)$ are rational, then $Z σ (s)$ are also rational and can be computed from $σ$, $T$, $ζ_{0}$ and $λ$ in polynomial time.

b. Given an additional rational input parameter $ϵ \in (0, 1)$ , there is an algorithm for computing a stationary policy which is an $ϵ$ -equilibrium in the zero-sum game associated with quantilization, which runs in time polynomial in the size of the input and $ln \frac{1}{ϵ}$ .

EDIT: In fact, it is possible to do better and compute an exact quantilal policy in polynomial time.

Future Directions

To tease a little, here are some developments of this work that I'm planning:

Apply quantilization to reinforcement learning, i.e. when we don't know the MDP in advance. In particular, I believe that the principles of quantilization can be used not only to deal with misspecified rewards, but also to deal with traps to some extent (assuming it is a priori known that the reference policy has at most a small probability of falling into a trap). This has some philosophical implications for how humans avoid traps.
Unify that formalism with DRL. The role of the reference policy will be played by the advisor (thus the reference policy is not known in advance but is learned online). This means we can drop the sanity condition for the advisor, at the price of (i) having a regret bound defined w.r.t. some kind of quantilum value rather than optimal value (ii) having a term in the regret bound proportional to the (presumably small) rate of falling into traps when following the reference (advisor) policy. It should be possible to further develop that by unifying it with the ideas of catastrophe DRL.
Deal with more general environments, e.g. POMDPs and continuous state spaces.

Proofs

Proposition A.1

$\inf_{P : S \to [0, \infty)} \left\{ \mathrm{E}_{\substack{x \sim H π \\ n \sim γ}} [R (x_{n}) - η P (x_{n})] \,\middle|\, \mathrm{E}_{\substack{x \sim H σ \\ n \sim γ}} [P (x_{n})] \leq 1 \right\} = \mathrm{E}_{Z π} [R] - η \exp D_{\infty} (Z π | | Z σ)$

Proof of Proposition A.1

The definition of $Z$ implies that

$\mathrm{E}_{\substack{x \sim H π \\ n \sim γ}} [R (x_{n}) - η P (x_{n})] = \mathrm{E}_{Z π} [R - η P]$

Also

$\mathrm{E}_{\substack{x \sim H σ \\ n \sim γ}} [P (x_{n})] = \mathrm{E}_{Z σ} [P]$

Observe that

$\mathrm{E}_{Z π} [P] = \sum_{s \in S} Z π (s) P (s) \leq \max_{\substack{s \in S \\ Z π (s) > 0}} \frac{Z π (s)}{Z σ (s)} \cdot \sum_{s \in S} Z σ (s) P (s) = \exp D_{\infty} (Z π | | Z σ) \cdot \mathrm{E}_{Z σ} [P]$

It follows that for any $P$ that satisfies the constraint $\mathrm{E}_{\substack{x \sim H σ \\ n \sim γ}} [P (x_{n})] \leq 1$, we have $\mathrm{E}_{Z σ} [P] \leq 1$ and therefore

$\mathrm{E}_{\substack{x \sim H π \\ n \sim γ}} [R (x_{n}) - η P (x_{n})] = \mathrm{E}_{Z π} [R] - η \mathrm{E}_{Z π} [P] \geq \mathrm{E}_{Z π} [R] - η \exp D_{\infty} (Z π | | Z σ)$

To show that the inequality can be arbitrarily close to equality, choose $s^{*} \in S$ s.t. $Z π (s^{*}) > 0$ and $\frac{Z π (s^{*})}{Z σ (s^{*})} = \exp D_{\infty} (Z π | | Z σ)$. If $Z σ (s^{*}) > 0$, we can define $P^{*}$ by

$P^{*} (s) := \begin{cases} Z σ (s^{*})^{- 1} & \text{if } s = s^{*} \\ 0 & \text{otherwise} \end{cases}$

Clearly $\mathrm{E}_{Z σ} [P^{*}] = 1$ and $\mathrm{E}_{Z π} [P^{*}] = \exp D_{\infty} (Z π | | Z σ)$. We get

$\mathrm{E}_{\substack{x \sim H π \\ n \sim γ}} [R (x_{n}) - η P^{*} (x_{n})] = \mathrm{E}_{Z π} [R] - η \mathrm{E}_{Z π} [P^{*}] = \mathrm{E}_{Z π} [R] - η \exp D_{\infty} (Z π | | Z σ)$

In the case $Z σ (s^{*}) = 0$, we can take any $M > 0$ and define $P^{M}$ by

$P^{M} (s) := \begin{cases} M & \text{if } s = s^{*} \\ 0 & \text{otherwise} \end{cases}$

Clearly $\mathrm{E}_{Z σ} [P^{M}] = 0 \leq 1$ and $\mathrm{E}_{Z π} [P^{M}] = M \cdot Z π (s^{*})$. We get

$\mathrm{E}_{\substack{x \sim H π \\ n \sim γ}} [R (x_{n}) - η P^{M} (x_{n})] = \mathrm{E}_{Z π} [R] - η \mathrm{E}_{Z π} [P^{M}] = \mathrm{E}_{Z π} [R] - η M \cdot Z π (s^{*})$

Since $Z π (s^{*}) > 0$, we can make this expression arbitrarily low by taking $M$ large, and therefore the infimum is $- \infty$, which is what we need since in this case $D_{\infty} (Z π | | Z σ) = \infty$.
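The finite-divergence case of the argument can be checked numerically. The sketch below assumes made-up occupancy measures and rewards (the numbers and the helper name `worst_case_penalty` are hypothetical) and verifies that the penalty $P^{*}$ from the proof saturates the constraint and attains the closed-form value.

```python
def worst_case_penalty(Zpi, Zsigma):
    """P* from the proof: weight 1/Zsigma(s*) on a state s* maximizing Zpi/Zsigma.

    Assumes Zsigma(s) > 0 wherever Zpi(s) > 0 (the finite-divergence case)."""
    s_star = max((s for s in range(len(Zpi)) if Zpi[s] > 0),
                 key=lambda s: Zpi[s] / Zsigma[s])
    return [1.0 / Zsigma[s] if s == s_star else 0.0 for s in range(len(Zpi))]

Zpi, Zsigma = [0.2, 0.8], [0.6, 0.4]   # hypothetical occupancy measures
R, eta = [0.0, 1.0], 0.3               # hypothetical rewards and eta
P = worst_case_penalty(Zpi, Zsigma)

constraint = sum(z * p for z, p in zip(Zsigma, P))               # E_{Z sigma}[P*]
attained = sum(z * (r - eta * p) for z, r, p in zip(Zpi, R, P))  # E_{Z pi}[R - eta P*]
closed_form = sum(z * r for z, r in zip(Zpi, R)) - eta * max(
    zp / zs for zp, zs in zip(Zpi, Zsigma) if zp > 0)
```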

Proposition 1 now follows immediately from Proposition A.1.

We will use the notation $Π := \{S^{*} \times S \xrightarrow{k} A\}$ (the space of all policies). We also have $Π_{M} := \{\mathbb{N} \times S \xrightarrow{k} A\}$ (the space of Markov policies) and $Π_{T} := \{S \xrightarrow{k} A\}$ (the space of stationary policies). Mildly abusing notation, we will view $Π_{M}$ as a subspace of $Π$ and $Π_{T}$ as a subspace of $Π_{M}$.

Proposition A.2

For any $σ : S^{*} \times S \xrightarrow{k} A$, there exists some policy which is quantilal relative to $σ$.

Proof of Proposition A.2

$Π$ is the product of a countable number of copies of $Δ A$ (indexed by $S^{*} \times S$). $Δ A$ is naturally a topological space (a simplex), and we can thereby equip $Π$ with the product topology. By Tychonoff's theorem, $Π$ is compact. Moreover, it is easy to see that $Z : Π \to Δ S$ is a continuous mapping. Finally, observe that $D_{\infty} (ξ | | Z σ)$ is lower semicontinuous in $ξ$ (since it is the maximum of a number of continuous functions) and therefore $\mathrm{E}_{ξ} [R] - η \exp D_{\infty} (ξ | | Z σ)$ is upper semicontinuous in $ξ$. By the extreme value theorem, it follows that a quantilal policy exists.

For any $n \in \mathbb{N}$, we define $Z_{n} : Π \xrightarrow{k} S$ by

$Z_{n} π (s) := \Pr_{x \sim H π} [x_{n} = s]$

Proof of Proposition 2

By Proposition A.2, there is a quantilal policy $π^{*} : S^{*} \times S \xrightarrow{k} A$. We define $π^{†} : \mathbb{N} \times S \xrightarrow{k} A$ by

$π^{†} (n, s) := \mathrm{E}_{x \sim H π^{*}} [π^{*} (x_{: n}, s) \mid x_{n} = s]$

We now prove by induction that for any $n \in \mathbb{N}$, $Z_{n} π^{*} = Z_{n} π^{†}$.

For $n = 0$ , we have $Z_{0} π^{*} = Z_{0} π^{†} = ζ_{0}$ .

For any $n \in \mathbb{N}$ and any $π : S^{*} \times S \xrightarrow{k} A$, we have

$Z_{n + 1} π (t) = \Pr_{x \sim H π} [x_{n + 1} = t] = \mathrm{E}_{s \sim Z_{n} π} \left[ \Pr_{x \sim H π} [x_{n + 1} = t \mid x_{n} = s] \right] = \mathrm{E}_{s \sim Z_{n} π} \left[ \mathrm{E}_{\substack{x \sim H π \\ a \sim π (x_{: n}, s)}} [T (t | s, a) \mid x_{n} = s] \right]$

To complete the induction step, observe that, by definition of $π^{†}$,

$\mathrm{E}_{\substack{x \sim H π^{*} \\ a \sim π^{*} (x_{: n}, s)}} [T (t | s, a) \mid x_{n} = s] = \mathrm{E}_{a \sim π^{†} (n, s)} [T (t | s, a)] = \mathrm{E}_{\substack{x \sim H π^{†} \\ a \sim π^{†} (n, s)}} [T (t | s, a) \mid x_{n} = s]$

Now, for any $π$, $Z π = \mathrm{E}_{n \sim γ} [Z_{n} π]$ and therefore $Z π^{†} = Z π^{*}$. We conclude that $π^{†}$ is also quantilal.

Proof of Proposition 3

By Proposition 2, there is $π^{*} : \mathbb{N} \times S \xrightarrow{k} A$ which is a Markov quantilal policy. We have

$Z_{n + 1} π^{*} (t) = \Pr_{x \sim H π^{*}} [x_{n + 1} = t] = \mathrm{E}_{s \sim Z_{n} π^{*}} \left[ \Pr_{x \sim H π^{*}} [x_{n + 1} = t \mid x_{n} = s] \right] = \mathrm{E}_{\substack{s \sim Z_{n} π^{*} \\ a \sim π^{*} (n, s)}} [T (t | s, a)]$

Taking the expected value of this equation w.r.t. $n \sim γ$ we get

$\mathrm{E}_{n \sim γ} [Z_{n + 1} π^{*}] = \mathrm{E}_{\substack{n \sim γ \\ s \sim Z_{n} π^{*} \\ a \sim π^{*} (n, s)}} [T (s, a)]$

Also, we have

$Z π^{*} = \mathrm{E}_{n \sim γ} [Z_{n} π^{*}] = (1 - λ) \sum_{n = 0}^{\infty} λ^{n} Z_{n} π^{*} = (1 - λ) \left( ζ_{0} + λ \sum_{n = 0}^{\infty} λ^{n} Z_{n + 1} π^{*} \right) = (1 - λ) ζ_{0} + λ \mathrm{E}_{n \sim γ} [Z_{n + 1} π^{*}]$

It follows that

$Z π^{*} = (1 - λ) ζ_{0} + λ \mathrm{E}_{\substack{n \sim γ \\ s \sim Z_{n} π^{*} \\ a \sim π^{*} (n, s)}} [T (s, a)]$

$Z π^{*} = (1 - λ) ζ_{0} + λ \mathrm{E}_{s \sim Z π^{*}} \left[ \mathrm{E}_{\substack{n \sim γ \\ t \sim Z_{n} π^{*} \\ a \sim π^{*} (n, s)}} [T (s, a) \mid t = s] \right]$

Define $π^{†} : S \xrightarrow{k} A$ by

$π^{†} (s) := \mathrm{E}_{\substack{n \sim γ \\ t \sim Z_{n} π^{*}}} [π^{*} (n, s) \mid t = s]$

We get

$Z π^{*} = (1 - λ) ζ_{0} + λ \mathrm{E}_{s \sim Z π^{*}} \left[ \mathrm{E}_{a \sim π^{†} (s)} [T (s, a)] \right]$

Define the linear operator $T^{†} : \mathbb{R}^{S} \to \mathbb{R}^{S}$ by the matrix

$T_{t s}^{†} := \mathrm{E}_{a \sim π^{†} (s)} [T (t | s, a)]$

Viewing $Δ S$ as a subset of $\mathbb{R}^{S}$, we get

$Z π^{*} = (1 - λ) ζ_{0} + λ T^{†} Z π^{*}$

$Z π^{*} = (1 - λ) {(1 - λ T^{†})}^{- 1} ζ_{0}$

On the other hand, we have

$Z π^{†} = (1 - λ) \sum_{n = 0}^{\infty} λ^{n} {(T^{†})}^{n} ζ_{0} = (1 - λ) {(1 - λ T^{†})}^{- 1} ζ_{0}$

Therefore, $Z π^{†} = Z π^{*}$ and $π^{†}$ is also quantilal.

Proposition A.3

Assume geometric time discount. Consider any $ζ \in Δ S$. Define the linear operators $I : \mathbb{R}^{S} \to \mathbb{R}^{S \times A}$ and $T : \mathbb{R}^{S} \to \mathbb{R}^{S \times A}$ by the matrices

$I_{s a, t} := [[t = s]]$

$T_{s a, t} := T (t | s, a)$

Define the linear operator $A : \mathbb{R}^{S} \oplus \mathbb{R}^{S} \to \mathbb{R}^{S \times A} \oplus \mathbb{R}^{S} \oplus \mathbb{R}$ by

$A (V, P) := ((I - λ T) V + η (1 - λ) I P, P, - \mathrm{E}_{ζ} [P])$

Define $B \in \mathbb{R}^{S \times A} \oplus \mathbb{R}^{S} \oplus \mathbb{R}$ by

$B := ((1 - λ) I R, 0, - 1)$

Define $D \subseteq \mathbb{R}^{S} \oplus \mathbb{R}^{S}$ as follows

$D := \left\{ X \in \mathbb{R}^{S} \oplus \mathbb{R}^{S} \,\middle|\, A X \geq B \right\}$

Here, vector inequalities are understood to be componentwise.

Then,

${QV}^{ζ} := \sup_{π : S^{*} \times S \xrightarrow{k} A} \left( \mathrm{E}_{Z π} [R] - η \exp D_{\infty} (Z π | | ζ) \right) = \inf_{(V, P) \in D} \mathrm{E}_{ζ_{0}} [V]$

In particular, if $ζ = Z σ$, the above expression describes $QV$.
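Unpacking the definitions of $A$, $B$ and $D$ componentwise, the right hand side is the following linear program in the $2 | S |$ variables $V (s)$, $P (s)$, with $| S | | A | + | S | + 1$ inequality constraints. This is only a restatement of the definitions above in coordinates:

```latex
\begin{aligned}
\text{minimize}\quad & \sum_{s \in S} ζ_{0}(s)\, V(s) \\
\text{subject to}\quad & V(s) - λ \sum_{t \in S} T(t \mid s, a)\, V(t) + η (1 - λ) P(s) \geq (1 - λ) R(s)
  && \forall s \in S,\ a \in A \\
& P(s) \geq 0 && \forall s \in S \\
& \sum_{s \in S} ζ(s)\, P(s) \leq 1
\end{aligned}
```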

Proof of Proposition A.3

Denote $Π_{det} := \{S^{*} \times S \to A\}$. As usual in extensive-form games, behavioral strategies are equivalent to mixed strategies and therefore the image of the pushforward operator $Z_{*} : Δ Π_{det} \to Δ S$ is the same as the image of $Z : Π \to Δ S$. Combining this with the evaluation of the infimum in Proposition A.1 (with $ζ$ in place of $Z σ$), it follows that

${QV}^{ζ} = \sup_{μ \in Δ Π_{det}} \inf_{P : S \to [0, \infty)} \left\{ \mathrm{E}_{Z_{*} μ} [R - η P] \,\middle|\, \mathrm{E}_{ζ} [P] \leq 1 \right\}$

$Π_{det}$ is a compact Polish space (in the sense of the product topology) and therefore $Δ Π_{det}$ is compact in the weak topology. By Sion's minimax theorem

${QV}^{ζ} = \inf_{P : S \to [0, \infty)} \max_{π \in Π_{det}} \left\{ \mathrm{E}_{Z π} [R - η P] \,\middle|\, \mathrm{E}_{ζ} [P] \leq 1 \right\}$

Now consider any $X = (V, P) \in D$. $A X \geq B$ implies (looking at the $\mathbb{R}^{S \times A}$ component) that

$(I - λ T) V + η (1 - λ) I P \geq (1 - λ) I R$

$(I - λ T) V \geq (1 - λ) I (R - η P)$

$V (s) - λ \sum_{t \in S} T (t | s, a) V (t) \geq (1 - λ) (R (s) - η P (s))$

$V (s) \geq (1 - λ) (R (s) - η P (s)) + λ \max_{a \in A} \sum_{t \in S} T (t | s, a) V (t)$

Therefore, there is some $R^{'} \geq R$ s.t.

$V (s) = (1 - λ) (R^{'} (s) - η P (s)) + λ \max_{a \in A} \sum_{t \in S} T (t | s, a) V (t)$

The above is the Bellman equation for a modified MDP with reward function $R^{'} - η P$. Denote $Z_{s}$ the version of the $Z$ operator for the initial state $s$ (instead of $ζ_{0}$). We get

$V (s) = \max_{π \in Π_{det}} \mathrm{E}_{Z_{s} π} [R^{'} - η P] \geq \max_{π \in Π_{det}} \mathrm{E}_{Z_{s} π} [R - η P]$

Observing that $\mathrm{E}_{s \sim ζ_{0}} [Z_{s} π] = Z π$, we get

$\mathrm{E}_{ζ_{0}} [V] \geq \mathrm{E}_{s \sim ζ_{0}} \left[ \max_{π \in Π_{det}} \mathrm{E}_{Z_{s} π} [R - η P] \right] \geq \max_{π \in Π_{det}} \mathrm{E}_{Z π} [R - η P]$

On the other hand, for any $P \in \mathbb{R}^{S}$ s.t. $P \geq 0$ and $\mathrm{E}_{ζ} [P] \leq 1$ (these inequalities correspond to the $\mathbb{R}^{S} \oplus \mathbb{R}$ component of $A X \geq B$), there is $(V, P) \in D$ s.t.

$\mathrm{E}_{ζ_{0}} [V] = \max_{π \in Π_{det}} \mathrm{E}_{Z π} [R - η P]$

Namely, this $V$ is the solution of the Bellman equation for the reward function $R - η P$. Therefore, we have

$\max_{π \in Π_{det}} \mathrm{E}_{Z π} [R - η P] = \min_{V \in \mathbb{R}^{S}} \left\{ \mathrm{E}_{ζ_{0}} [V] \,\middle|\, (V, P) \in D \right\}$

Taking the infimum of both sides over $P$ inside the domain $\{P \geq 0, \mathrm{E}_{ζ} [P] \leq 1\}$ we get

${QV}^{ζ} = \inf_{P : S \to [0, \infty)} \max_{π \in Π_{det}} \left\{ \mathrm{E}_{Z π} [R - η P] \,\middle|\, \mathrm{E}_{ζ} [P] \leq 1 \right\} = \inf_{(V, P) \in D} \mathrm{E}_{ζ_{0}} [V]$
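One direction of this argument is easy to try out numerically: for any feasible penalty $P$, solving the Bellman equation for the reward $R - η P$ gives an upper bound on ${QV}^{ζ}$. The sketch below uses plain value iteration on a hypothetical two-state MDP; the example data and the names `bellman_value` and `upper_bound` are assumptions introduced for illustration.

```python
def bellman_value(T, R, P, eta, lam, iters=500):
    """Iterate V(s) = (1-lam)(R(s) - eta P(s)) + lam max_a sum_t T(t|s,a) V(t)."""
    n = len(R)
    V = [0.0] * n
    for _ in range(iters):
        V = [(1 - lam) * (R[s] - eta * P[s])
             + lam * max(sum(T[s][a][t] * V[t] for t in range(n))
                         for a in range(len(T[s])))
             for s in range(n)]
    return V

def upper_bound(T, R, P, eta, lam, zeta0):
    """E_{zeta0}[V]: an upper bound on QV^zeta whenever P >= 0 and E_zeta[P] <= 1."""
    V = bellman_value(T, R, P, eta, lam)
    return sum(z * v for z, v in zip(zeta0, V))

# Hypothetical MDP: T[s][a][t], where action a moves deterministically to state a.
T = [[[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]]]
R, zeta0, lam, eta = [0.0, 1.0], [1.0, 0.0], 0.5, 0.5
zeta = [0.75, 0.25]  # reference occupancy measure; both penalties below are feasible

loose = upper_bound(T, R, [0.0, 0.0], eta, lam, zeta0)   # P = 0: ordinary optimal value
tight = upper_bound(T, R, [0.5, 2.5], eta, lam, zeta0)   # a better-chosen feasible P
```

Here `loose` is 0.5 and `tight` is -0.25, so the choice of penalty matters a great deal; the linear program of Proposition A.3 performs the minimization over $P$ and $V$ jointly.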

Proof of Proposition 4

Consider the characterization of $QV$ in Proposition A.3. By general properties of systems of linear inequalities, there is some $J \subseteq (S \times A) ⊔ S ⊔ \{∙\}$ s.t.

$D_{J}^{♯} := \left\{ X \in \mathbb{R}^{S} \oplus \mathbb{R}^{S} \,\middle|\, A_{J} X = B_{J} \right\} \subseteq \underset{(V, P) \in D}{\arg\min} \mathrm{E}_{ζ_{0}} [V]$

Here, the notation $A_{J}$ means taking the submatrix of $A$ consisting of the rows corresponding to $J$ . Similarly, $B_{J}$ is the subvector of $B$ consisting of the components corresponding to $J$ .

(To see this, use the fact that the minimum of a linear function on a convex set is always attained on the boundary, and apply induction by dimension.)

Moreover, $D_{J}^{♯}$ has to be a single point $X_{J} \in \mathbb{R}^{S} \oplus \mathbb{R}^{S}$. Indeed, if it has more than one point then it contains a straight line $L$. The projection of $L$ on the second $\mathbb{R}^{S}$ has to be a single point $P_{0}$, otherwise some point in $L$ violates the inequality $P \geq 0$, which would contradict $L \subseteq D$. Therefore, the projection of $L$ on the first $\mathbb{R}^{S}$ is also a straight line $L^{'}$. As in the proof of Proposition A.3, for any $(V, P) \in D$, $V (s)$ is an upper bound on the value of state $s$ in the MDP with reward function $R - η P$. In particular, if $(V, P_{0}) \in D$ then $V (s) \geq \min_{t \in S} (R (t) - η P_{0} (t))$. However, since $L^{'}$ is a line, there must be some $s^{*} \in S$ s.t. $V (s^{*})$ can be any real number for some $(V, P_{0}) \in L$. This is a contradiction.

Denote $Q : \mathbb{R}^{S} \oplus \mathbb{R}^{S} \to \mathbb{R}^{S}$ the projection operator on the first direct summand. It follows that we can always choose $J$ s.t. $| J | = 2 | S |$ and we have

$X_{J} = A_{J}^{- 1} B_{J}$

$QV = \mathrm{E}_{ζ_{0}} [Q A_{J}^{- 1} B_{J}]$

For each $J$, the expression on the right hand side is a rational function of $ζ_{0}$ and the matrix elements of $A$ and $B$ which, in turn, are polynomials in the parameters the dependence on which we are trying to establish. Therefore, this expression is a rational function in those parameters (unless $\det A_{J}$ vanishes identically, in which case this $J$ can be ignored). So, $QV$ always equals one of a finite set of rational functions (corresponding to different choices of $J$). The continuity of $QV$ also easily follows from its characterization as $\min_{(V, P) \in D} \mathrm{E}_{ζ_{0}} [V]$.

Proposition A.4

Assume geometric time discount and stationary $σ$ . Then, $Z σ$ is a rational function of $σ$ , $T$ , $ζ_{0}$ and $λ$ with rational coefficients.

Proof of Proposition A.4

Define the linear operator $T^{σ} : \mathbb{R}^{S} \to \mathbb{R}^{S}$ by the matrix

$T_{t s}^{σ} = \mathrm{E}_{a \sim σ (s)} [T (t | s, a)]$

We have

$Z σ = (1 - λ) \sum_{n = 0}^{\infty} λ^{n} {(T^{σ})}^{n} ζ_{0} = (1 - λ) {(1 - λ T^{σ})}^{- 1} ζ_{0}$
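Since this formula only involves solving a linear system, $Z σ$ can be computed exactly over the rationals. The following is a minimal sketch using Python's `fractions` module and Gauss-Jordan elimination; the function name `occupancy_exact`, the data layout `T[s][a][t]`, `sigma[s][a]`, and the example MDP are assumptions made for illustration.

```python
from fractions import Fraction as F

def occupancy_exact(T, sigma, zeta0, lam):
    """Solve (1 - lam T^sigma) Z = (1 - lam) zeta0 exactly, lam a Fraction in [0, 1)."""
    n = len(zeta0)
    # Augmented matrix; row t, column s holds [t = s] - lam * E_{a ~ sigma(s)} T(t|s,a).
    M = [[(F(1) if t == s else F(0))
          - lam * sum(sigma[s][a] * T[s][a][t] for a in range(len(sigma[s])))
          for s in range(n)] + [(1 - lam) * zeta0[t]]
         for t in range(n)]
    for col in range(n):  # Gauss-Jordan elimination with exact arithmetic
        piv = next(r for r in range(col, n) if M[r][col] != 0)
        M[col], M[piv] = M[piv], M[col]
        M[col] = [x / M[col][col] for x in M[col]]
        for r in range(n):
            if r != col and M[r][col] != 0:
                factor = M[r][col]
                M[r] = [x - factor * y for x, y in zip(M[r], M[col])]
    return [M[t][n] for t in range(n)]

# Hypothetical MDP: action a moves deterministically to state a; sigma is uniform.
T = [[[1, 0], [0, 1]], [[1, 0], [0, 1]]]
sigma = [[F(1, 2), F(1, 2)], [F(1, 2), F(1, 2)]]
Z = occupancy_exact(T, sigma, [F(1), F(0)], F(1, 2))
```

A pivot always exists because $1 - λ T^{σ}$ is invertible for $λ < 1$, as the convergent series above shows.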

Proof of Corollary 1

By Propositions 4 and A.4, $QV$ is a continuous piecewise rational function of $λ$ with a finite number of pieces. Two distinct rational functions can only coincide at a finite number of points (since a nonzero polynomial can only have a finite number of roots), therefore there is only a finite number of values of $λ$ at which $QV$ "switches" from one rational function to another. It follows that there is some $λ_{0} \in [0, 1)$ s.t. $QV$ is a fixed rational function of $λ$ in the interval $[λ_{0}, 1)$.

Moreover, it always holds that

$\min_{s \in S} R (s) - η \leq QV \leq \max_{s \in S} R (s)$

The first inequality holds since the guaranteed performance of the quantilal policy is at least as good as the guaranteed performance of $σ$. The second inequality is a consequence of the requirement that $P$ is non-negative.

It follows that $QV$ cannot have a pole at $λ = 1$ and therefore must converge to a finite value there.

Proof of Proposition 5

The algorithm for $QV$ is obtained from Proposition A.3 using linear programming.

The claim regarding $Z σ$ for stationary $σ$ follows from Proposition A.4.

We now describe the algorithm for computing an $ϵ$ -equilibrium.

For any $a \in A$, define $d_{a} : A ⊔ \{⊥\} \to A$ by

$d_{a} (b) := \begin{cases} a & \text{if } b = ⊥ \\ b & \text{otherwise} \end{cases}$

Consider any $β : S \xrightarrow{k} A ⊔ \{⊥\}$. We define $T_{β} : S \times A \xrightarrow{k} S$ by

$T_{β} (s, a) := \mathrm{E}_{b \sim β (s)} [T (s, d_{a} (b))]$

Let $Z_{β} : Π \xrightarrow{k} S$ be the $Z$-operator for the MDP with transition kernel $T_{β}$, and define

${QV}_{β} := \sup_{π : S^{*} \times S \xrightarrow{k} A} \left( \mathrm{E}_{Z_{β} π} [R] - η \exp D_{\infty} (Z_{β} π | | Z σ) \right)$

It is easy to see that ${QV}_{β} = QV$ if and only if there is $π : S \xrightarrow{k} A$ quantilal s.t. $π (a | s) \geq β (a | s)$. Indeed, the MDP with kernel $T_{β}$ is equivalent to the original MDP under the constraint that, when in state $s$, any action $a$ has to be taken with the minimal probability $β (a | s)$. In particular ${QV}_{β} \leq QV$ (since we constrained the possible policies but not the possible penalties). So, if $π$ as above exists, then we can use it to construct a stationary policy for the new MDP with guaranteed value $QV$. Conversely, if the new MDP has a stationary policy with guaranteed value $QV$ then it can be used to construct the necessary $π$.
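The construction of $T_{β}$ can be written out directly. In the sketch below, the encoding of $β (s)$ as a list whose last entry is the probability of $⊥$, the layout `T[s][a][t]`, and the function name `constrained_kernel` are conventions assumed here for illustration.

```python
def constrained_kernel(T, beta):
    """T_beta(t|s,a) = beta(⊥|s) T(t|s,a) + sum_b beta(b|s) T(t|s,b).

    beta[s] lists the forced probability of each action b in A, followed by the
    probability of ⊥ (in which case the agent's own action a is executed)."""
    n = len(T)
    out = []
    for s in range(n):
        k = len(T[s])
        free = beta[s][k]  # probability that the agent's choice is used
        forced = [sum(beta[s][b] * T[s][b][t] for b in range(k)) for t in range(n)]
        out.append([[free * T[s][a][t] + forced[t] for t in range(n)]
                    for a in range(k)])
    return out
```

With all of the mass of $β (s)$ on $⊥$ the kernel is unchanged, and with no mass on $⊥$ the agent's action has no effect on the transition.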

Using Proposition A.3, we can compute ${QV}_{β}$ for any rational $β$ by linear programming. This allows us to find the desired policy by binary search on $β$ , one (state,action) pair at a time.
