We introduce a variant of the concept of a "quantilizer" for the setting of choosing a policy for a finite Markov decision process (MDP), where the generic unknown cost is replaced by an unknown penalty term in the reward function. This is essentially a generalization of quantilization in repeated games with a cost independence assumption. We show that the "quantilal" policy shares some properties with the ordinary optimal policy, namely that (i) it can always be chosen to be Markov (ii) it can be chosen to be stationary when time discount is geometric (iii) the "quantilum" value of an MDP with geometric time discount is a continuous piecewise rational function of the parameters, and it converges when the discount parameter λ approaches 1. Finally, we demonstrate a polynomial-time algorithm for computing the quantilal policy, showing that quantilization is not qualitatively harder than ordinary optimization.

Background

Quantilization (introduced in Taylor 2015) is a method of dealing with "Extremal Goodhart's Law". According to Extremal Goodhart, when we attempt to optimize a utility function U∗:A→R by aggressively optimizing a proxyU:A→R, we are likely to land outside of the domain where the proxy is useful. Quantilization addresses this by assuming an unknown cost function C:A→[0,∞) whose expectation Eζ[C] w.r.t. some reference probability measure ζ∈ΔA is bounded by 1. ζ can be thought of as defining the "domain" within which U is well-behaved (for example it can be the probability measure of choices made by humans). We can then seek to maximize E[U] while constraining E[C] by a fixed bound Cmax:

Alternatively, we can choose some parameter η∈(0,∞) and maximize the minimal guaranteed expectation of U−ηC:

ξ∗:∈argmaxξ∈ΔAinfC:A→[0,∞){Eξ[U−ηC]∣∣∣Eζ[C]≤1}

These two formulations are almost equivalent since both amount to finding a strategy that is Pareto efficient w.r.t. to the two objectives E[U] and −E[C]. For ~ξ∗ the tradeoff is governed by the parameter Cmax and for ξ∗ the tradeoff is governed by the parameter η. Indeed, it is easy to see that any ξ∗ is also optimal for the first criterion if we take Cmax=Eξ∗[C], and any ~ξ∗ is also optimal for the latter criterion for an appropriate choice of η (it needs to be a subderivative of the Pareto frontier).

In the following, we will use as our starting point the second formulation, which can be thought of as a zero-sum game in which U−ηC is the utility function of an agent whose strategy set is A, and C is chosen by the adversary. The quantilal strategy ξ∗ is the Nash equilibrium of the game.

This formulation seems natural if we take η:=Eζ[max(U−U∗,0)] (a measure of how "optimistic" is U in the domain ζ) and C:=η−1max(U−U∗,0). In particular, the quantilum value (the value of the game) is a lower bound on the expectation of U∗.

In principle, this formalism can be applied to sequential interaction between an agent and an environment, if we replace A by the set of policies. However, if it is possible to make structural assumptions about U and C, we can do better. Taylor explores one such structural assumption, namely a sequence of independent games in which both U and C are additive across the games. We consider a more general setting, namely that of a finite Markov decision process (MDP).

Notation

Given a set A, the notation A∗ will denote the set of finite strings over alphabet A, i.e.

A∗:=∞⨆n=0An

Aω denotes the space of infinite strings over alphabet A, equipped with the product topology and the corresponding Borel sigma-algebra. Given x∈Aω and n∈N, xn∈A is the n-th symbol of the string x (in our conventions, 0∈N so the string begins from the 0th symbol.) Given h∈A∗ and x∈Aω, the notation h⊏x means that h is a prefix of x.

Given a set A, x∈Aω and n∈N, the notation x:n will indicate the prefix of x of length n. That is, x:n∈An and x:n⊏x.

Given a measurable space X, we denote ΔX the space of probability measures on X.

Given measurable spaces X and Y, the notation K:Xk→Y means that K is a Markov kernel from X to Y. Given x∈X, K(x) is the corresponding probability measure on Y. Given A⊆Y measurable, K(A∣x):=K(x)(A). Given y∈Y, K(y∣x):=K({y}|x). Given J:Yk→Z, JK:Xk→Z is the composition of J and K, and when Y=X, Kn is the n-th composition power.

Results

A finite MDP is defined by a non-empty finite set of states S, a non-empty finite set of actions A, a transition kernel T:S×Ak→S and a reward function R:S→R. To specify the utility function, we also need to fix a time discount function γ∈ΔN. This allows defining U:Sω→R by

U(x):=En∼γ[R(xn)]

We also fix an initial distribution over states ζ0∈ΔS. In "classical" MDP theory, it is sufficient to consider a deterministic initial state s0∈S, since the optimal policy doesn't depend on the initial state anyway. However, quantilization is different since the worst-case cost function depends on the initial conditions.

We now assume that the cost function C:Sω→R (or, the true utility function U∗) has the same form. That is, there is some penalty function P:S→[0,∞) s.t.

C(x)=En∼γ[P(xn)]

Given a policy π:S∗×Sk→A (where the S∗ factor represents the past history the factor S represents the current state), we define Hπ∈ΔSω in the usual way (the distribution over histories resulting from π). Finally, we fix η∈(0,∞). We are now ready to define quantilization in this setting

Definition 1

π∗:S∗×Sk→A is said to be quantilal relatively to reference policy σ:S∗×Sk→A when

(QV cannot be −∞ since taking π=σ yields a lower bound of Ex∼Hσn∼γ[R(xn)]−η.)

In the original quantilization formalism, the quantilal strategy can be described more explicitly, as sampling according to the reference measure ζ from some top fraction of actions ranked by expected utility. Here, we don't have an analogous description, but we can in some sense evaluate the infimum over P.

For any π:S∗×Sk→A, define Zπ∈ΔS by

Zπ(s):=Prx∼Hπn∼γ[xn=s]

For any μ,ν∈ΔS, the notation D∞(μ||ν) signifies the Renyi divergence of order ∞:

D∞(μ||ν):=lnmaxs∈Sμ(s)>0μ(s)ν(s)

In general, D∞(μ||ν)∈[0,∞].

Proposition 1

π∗:S∗×Sk→A is quantilal relatively to reference policy σ:S∗×Sk→A if and only if

π∗∈argmaxπ:S∗×Sk→A(EZπ[R]−ηexpD∞(Zπ||Zσ))

Also, we have

QV=supπ:S∗×Sk→A(EZπ[R]−ηexpD∞(Zπ||Zσ))

If the maximization in Proposition 1 was over arbitrary ξ∈ΔS rather than ξ of the form Zπ, we would get ordinary quantilization and sampling ξ would be equivalent to sampling some top fraction of ζ:=Zσ. However, in general, the image of the Z operator is some closed convex set which is not the entire ΔS.

So far we considered arbitrary (non-stationary) policies. From classical MDP theory, we know that an optimal policy can always be chosen to be Markov:

Definition 2

π:S∗×Sk→A is said to be a Markov policy, when there is some πM:N×Sk→A s.t. π(h,s)=πM(|h|,s).

Note that a priori it might be unclear whether there is even a non-stationary quantilal policy. However, we have

Proposition 2

For any σ:S∗×Sk→A, there exists a Markov policy which is quantilal relatively to σ.

Now assume that our time discount function is geometric, i.e. there exists λ∈[0,1) s.t. γ(n)=(1−λ)λn. Then it is known than an optimal policy can be chosen to be stationary:

Definition 3

π:S∗×Sk→A is said to be a stationary policy, when there is some πT:Sk→A s.t. π(h,s)=πT(s).

Once again, the situation for quantilal policies is analogous:

Proposition 3

If γ is geometric, then for any σ:S∗×Sk→A, there exists a stationary policy which is quantilal relatively to σ.

What is not analogous is that an optimal policy can be chosen to be deterministic whereas, of course, this is not the case for quantilal policies.

It is known that the value of an optimal policy depends on the parameters as a piecewise rational function, and in particular it converges as λ→1 and has a Taylor expansion at λ=1. The quantilum value has the same property.

Proposition 4

QV is a piecewise rational continuous function of λ, η, the matrix elements of T, the values of R, the values of ζ0 and the values of Zσ, with a final number of "pieces".

Corollary 1

Assume that σ is a stationary policy. Then, QV converges as λ→1, holding all other parameters fixed (in the sense that, σ is fixed whereas Zσ changes as a function of λ). It is analytic in λ for some interval [λ0,1] and therefore can be described by a Taylor expansion around λ=1 inside this interval.

Note that for optimal policies, Proposition 4 holds for a simpler reason. Specifically, the optimal policy is piecewise constant (since it's deterministic) and there is a Blackwell policy i.e. a fixed policy which is optimal for any λ sufficiently close to 1. And, it is easy to see that for a constant policy, the value is a rational function of λ. On the other hand, the quantilal policy is not guaranteed to be locally constant anywhere. Nevertheless the quantilum value is still piecewise rational.

Finally, we address the question of the computational complexity of quantilization. We prove the following

Proposition 5

Assume geometric time discount. Assume further that R(s), T(t|s,a), ζ0(s), λ, Zσ(s) and η are rational numbers. Then:

a. There is an algorithm for computing QV which runs in time polynomial in the size of the input R, T, ζ0, λ, Zσ and η. Also, if σ is stationary and σ(t|s,a) are rational, then Zσ(s) are also rational and can be computed from σ, T, ζ0 and λ in polynomial time.

b. Given an additional rational input parameter ϵ∈(0,1), there is an algorithm for computing a stationary policy which is an ϵ-equilibrium in the zero-sum game associated with quantilization, which runs in time polynomial in the size of the input and ln1ϵ.

EDIT: In fact, it is possible to do better and compute an exact quantilal policy in polynomial time.

Future Directions

To tease a little, here are some developments of this work that I'm planning:

Apply quantilization to reinforcement learning, i.e. when we don't know the MDP in advance. In particular, I believe that the principles of quantilization can be used not only to deal with misspecified rewards, but also to deal with traps to some extent (assuming it a priori known that the reference policy has at most a small probability of falling into a trap). This has some philosophical implications on how humans avoid traps.

Unify that formalism with DRL. The role of the reference policy will be played by the advisor (thus the reference policy is not known in advance but is learned online). This means we can drop the sanity condition for the advisor, at the price of (i) having a regret bound defined w.r.t. some kind of quantilum value rather than optimal value (ii) having a term in the regret bound proportional to the (presumably small) rate of falling into traps when following the reference (advisor) policy. It should be possible to further develop that by unifying it with the ideas of catastrophe DRL.

Deal with more general environments, e.g. POMDPs and continuous state spaces.

To show that the inequality can be arbitrarily close to equality, choose s∗∈S s.t. Zπ(s∗)>0 and Zπ(s∗)Zσ(s∗)=expD∞(Zπ||Zσ). If Zσ(s∗)>0, we can define P∗ by

P∗(s):={Zσ(s∗)−1 if s=s∗0 otherwise

Clearly EZσ[P∗]=1 and EZπ[P∗]=expD∞(Zπ||Zσ). We get

Since Zπ(s∗)>0, we can make this expression arbitrarily low and therefore the infimum is −∞ which is what we need since in this case D∞(Zπ||Zσ)=∞.

Proposition 1 now follows immediately from Proposition A.1.

We will use the notation Π:={S∗×Sk→A} (the space of all policies). We also have ΠM:={N×Sk→A} (the space of Markov policies) and ΠT:={Sk→A} (the space of stationary policies). Mildly abusing notation, we will view ΠM as a subspace of Π and ΠT as a subspace of ΠM.

Proposition A.2

For any σ:S∗×Sk→A, there exists some policy which is quantilal relatively to σ.

Proof of Proposition A.2

Π is the product of a countable number of copies of ΔA (indexed by S∗×S). ΔA is naturally a topological space (a simplex), and we can thereby equip Π by the product topology. By Tychonoff's theorem, Π is compact. Moreover, it is easy to see that Z:Π→ΔS is a continuous mapping. Finally, observe that D∞(ξ||Zσ) is lower semicontinuous in ξ (since it is the maximum of a number of continuous functions) and therefore Eξ[R]−ηexpD∞(ξ||Zσ) is upper semicontinuous in ξ. By the extreme value theorem, it follows that a quantilal policy exists.

For any n∈N, we define Zn:Πk→S by

Znπ(s):=Prx∼Hπ[xn=s]

Proof of Proposition 2

By Proposition A.2, there is a quantilal policy π∗:S∗×Sk→A. We define π†:N×Sk→A by

π†(n,s):=Ex∼Hπ∗[π∗(x:n,s)|xn=s]

We now prove by induction that for any n∈N, Znπ∗=Znπ†.

In particular, if ζ=Zσ, the above expression describes QV

Proof of Proposition A.3

Denote Πdet:={S∗×S→A}. As usual in extensive-form games, behavioral strategies are equivalent to mixed strategies and therefore the image of the pushforward operator Z∗:ΔΠdet→ΔS is the same as the image of Z:Π→ΔS. It follows that

QV=supμ∈ΔΠdetinfP:S→[0,∞){EZ∗μ[R−ηP]∣∣∣Eζ[P]≤1}

Πdet is a compact Polish space (in the sense of the product topology) and therefore ΔΠdet is compact in the weak topology. By Sion's minimax theorem

QV=infP:S→[0,∞)maxπ∈Πdet{EZπ[R−ηP]∣∣∣Eζ[P]≤1}

Now consider any X=(V,P)∈D. AX≥B implies (looking at the RS×A component) that

(I−λT)V+η(1−λ)IP≥(1−λ)IR

(I−λT)V≥(1−λ)I(R−ηP)

V(s)−λ∑t∈ST(t|s,a)V(t)≥(1−λ)(R(s)−ηP(s))

V(s)≥(1−λ)(R(s)−ηP(s))+λmaxa∈A∑t∈ST(t|s,a)V(t)

Therefore, there is some R′≥R s.t.

V(s)=(1−λ)(R′(s)−ηP(s))+λmaxa∈A∑t∈ST(t|s,a)V(t)

The above is the Bellman equation for a modified MDP with reward function R′−ηP. Denote Zs the version of the Z operator for the initial state s (instead of ζ0). We get

Consider the characterization of QV in Proposition A.3. By general properties of systems of linear inequalities, there is some J⊆(S×A)⊔S⊔{∙} s.t.

D♯J:={X∈RS⊕RS∣∣AJX=BJ}⊆argmin(V,P)∈DEζ0[V]

Here, the notation AJ means taking the submatrix of A consisting of the rows corresponding to J. Similarly, BJ is the subvector of B consisting of the components corresponding to J.

(To see this, use the fact that the minimum of a linear function on a convex set is always attained on the boundary, and apply induction by dimension.)

Moreover, D♯J has to be a single point XJ∈RS⊕RS. Indeed, if it has more than one point then it contains a straight line L. The projection of L on the second RS has to be a single point P0, otherwise some point in L violates the inequality P≥0 which would contradict L⊆D. Therefore, the projection of L on the first RS is also a straight line L′. As in the proof of Proposition A.3, for any (V,P)∈D, V(s) is an upper bound on the value of state s in the MDP with reward function R−ηP. In particular, if (V,P0)∈D then V(s)≥mint∈S(R(t)−ηP0(t)). However, since L′ is a line, there must be some s∗∈S s.t. V(s∗) can be any real number for some (V,P0)∈L. This is a contradiction.

Denote Q:RS⊕RS→RS the projection operator on the first direct summand. It follows that we can always choose J s.t. |J|=2|S| and we have

XJ=A−1JBJ

QV=Eζ0[QA−1JBJ]

For each J, the expression on the right hand side is a rational function of ζ0 and the matrix elements of A and B which, in turn, are polynomials in the parameters the dependence on which we are trying to establish. Therefore, this expression is a rational function in those parameters (unless detAJ vanishes identically, in which case this J can ignored). So, QV always equals one of a finite set of rational functions (corresponding to difference choices of J). The continuity of QV also easily follows from its characterization as min(V,P)∈DEζ0[V].

Proposition A.4

Assume geometric time discount and stationary σ. Then, Zσ is a rational function of σ, T, ζ0 and λ with rational coefficients.

Proof of Proposition A.4

Define the linear operator Tσ:RS→RS by the matrix

Tσts=Ea∼σ(s)[T(t|s,a)]

We have

Zσ=(1−λ)∞∑n=0λnTσnζ0=(1−λ)(1−λTσ)−1ζ0

Proof of Corollary 1

By Propositions 4 and A.4, QV is a continuous piecewise rational function of λ with a finite number of pieces. Two rational functions can only coincide at a finite number of points (since a polynomial can only have a finite number of roots), therefore there is only a finite number of values of λ in which QV "switches" from one rational function to another. It follows that there is some λ0∈[0,1) s.t. QV is a fixed rational function of λ in the interval [λ0,1).

Moreover, it always holds that

mins∈SR(s)−η≤QV≤maxs∈SR(s)

The first inequality holds since, the guaranteed performance of the quantilal policy is at least as good as the guaranteed performance of σ. The second inequality is a consequence of the requirement that P is non-negative.

It follows that QV cannot have a pole at λ=1 and therefore must converge to a finite value there.

Proof of Proposition 5

The algorithm for QV is obtained from Proposition A.3 using linear programming.

The claim regarding Zσ for stationary σ follows from Proposition A.4.

We now describe the algorithm for computing an ϵ-equilibrium.

For any a∈A, define da:A⊔{⊥}→A by

da(b):={a if b=⊥b otherwise

Consider any β:Sk→A⊔{⊥}. We define Tβ:S×Ak→S by

Tβ(s,a):=Eb∼β(s)[T(s,da(b))]

Let Zβ:Πk→S be the Z-operator for the MDP with transition kernel Tβ, and define

QVβ:=supπ:S∗×Sk→A(EZβπ[R]−ηexpD∞(Zβπ∣∣∣∣Zσ))

It is easy to see that QVβ=QV if an only if there is π:Sk→A quantilal s.t. π(a|s)≥β(a|s). Indeed, the MDP with kernel Tβ is equivalent to the original MDP under the constraint that, when in state s, any action a has to be taken with the minimal probability β(a|s). In particular QVβ≤QV (since we constrained the possible policies but not the possible penalties). So, if π as above exists, then we can use it to construct a stationary policy for the new MDP with guaranteed value QV. Conversely, if the new MDP has a stationary policy with guaranteed value QV then it can be used to construct the necessary π.

Using Proposition A.3, we can compute QVβ for any rational β by linear programming. This allows us to find the desired policy by binary search on β, one (state,action) pair at a time.

We introduce a variant of the concept of a "quantilizer" for the setting of choosing a policy for a finite Markov decision process (MDP), where the generic unknown cost is replaced by an unknown penalty term in the reward function. This is essentially a generalization of quantilization in repeated games with a cost independence assumption. We show that the "quantilal" policy shares some properties with the ordinary optimal policy, namely that (i) it can always be chosen to be Markov (ii) it can be chosen to be stationary when time discount is geometric (iii) the "quantilum" value of an MDP with geometric time discount is a continuous piecewise rational function of the parameters, and it converges when the discount parameter λ approaches 1. Finally, we demonstrate a polynomial-time algorithm for computing the quantilal policy, showing that quantilization is not qualitatively harder than ordinary optimization.

## Background

Quantilization (introduced in Taylor 2015) is a method of dealing with "Extremal Goodhart's Law". According to Extremal Goodhart, when we attempt to optimize a utility function U∗:A→R by aggressively optimizing a

proxyU:A→R, we are likely to land outside of the domain where the proxy is useful. Quantilization addresses this by assuming an unknown cost function C:A→[0,∞) whose expectation Eζ[C] w.r.t. some reference probability measure ζ∈ΔA is bounded by 1. ζ can be thought of as defining the "domain" within which U is well-behaved (for example it can be the probability measure of choices made by humans). We can then seek to maximize E[U] while constraining E[C] by a fixed bound Cmax:~ξ∗:∈argmaxξ∈ΔA{Eξ[U]∣∣∣∀C:A→[0,∞):Eζ[C]≤1⟹Eξ[C]≤Cmax}

Alternatively, we can choose some parameter η∈(0,∞) and maximize the minimal guaranteed expectation of U−ηC:

ξ∗:∈argmaxξ∈ΔAinfC:A→[0,∞){Eξ[U−ηC]∣∣∣Eζ[C]≤1}

These two formulations are almost equivalent since both amount to finding a strategy that is Pareto efficient w.r.t. to the two objectives E[U] and −E[C]. For ~ξ∗ the tradeoff is governed by the parameter Cmax and for ξ∗ the tradeoff is governed by the parameter η. Indeed, it is easy to see that any ξ∗ is also optimal for the first criterion if we take Cmax=Eξ∗[C], and any ~ξ∗ is also optimal for the latter criterion for an appropriate choice of η (it needs to be a subderivative of the Pareto frontier).

In the following, we will use as our starting point the

secondformulation, which can be thought of as a zero-sum game in which U−ηC is the utility function of an agent whose strategy set is A, and C is chosen by the adversary. The quantilal strategy ξ∗ is the Nash equilibrium of the game.This formulation seems natural if we take η:=Eζ[max(U−U∗,0)] (a measure of how "optimistic" is U in the domain ζ) and C:=η−1max(U−U∗,0). In particular, the quantilum value (the value of the game) is a lower bound on the expectation of U∗.

In principle, this formalism can be applied to sequential interaction between an agent and an environment, if we replace A by the set of

policies. However, if it is possible to make structural assumptions about U and C, we can do better. Taylor explores one such structural assumption, namely a sequence of independent games in which both U and C are additive across the games. We consider a more general setting, namely that of a finite Markov decision process (MDP).## Notation

Given a set A, the notation A∗ will denote the set of finite strings over alphabet A, i.e.

A∗:=∞⨆n=0An

Aω denotes the space of infinite strings over alphabet A, equipped with the product topology and the corresponding Borel sigma-algebra. Given x∈Aω and n∈N, xn∈A is the n-th symbol of the string x (in our conventions, 0∈N so the string begins from the 0th symbol.) Given h∈A∗ and x∈Aω, the notation h⊏x means that h is a prefix of x.

Given a set A, x∈Aω and n∈N, the notation x:n will indicate the prefix of x of length n. That is, x:n∈An and x:n⊏x.

Given a measurable space X, we denote ΔX the space of probability measures on X.

Given measurable spaces X and Y, the notation K:Xk→Y means that K is a Markov kernel from X to Y. Given x∈X, K(x) is the corresponding probability measure on Y. Given A⊆Y measurable, K(A∣x):=K(x)(A). Given y∈Y, K(y∣x):=K({y}|x). Given J:Yk→Z, JK:Xk→Z is the composition of J and K, and when Y=X, Kn is the n-th composition power.

## Results

A finite MDP is defined by a non-empty finite set of states S, a non-empty finite set of actions A, a transition kernel T:S×Ak→S and a reward function R:S→R. To specify the utility function, we also need to fix a time discount function γ∈ΔN. This allows defining U:Sω→R by

U(x):=En∼γ[R(xn)]

We also fix an initial distribution over states ζ0∈ΔS. In "classical" MDP theory, it is sufficient to consider a deterministic initial state s0∈S, since the optimal policy doesn't depend on the initial state anyway. However, quantilization is different since the worst-case cost function depends on the initial conditions.

We now assume that the cost function C:Sω→R (or, the true utility function U∗) has the same form. That is, there is some penalty function P:S→[0,∞) s.t.

C(x)=En∼γ[P(xn)]

Given a policy π:S∗×Sk→A (where the S∗ factor represents the past history the factor S represents the current state), we define Hπ∈ΔSω in the usual way (the distribution over histories resulting from π). Finally, we fix η∈(0,∞). We are now ready to define quantilization in this setting

## Definition 1

π∗:S∗×Sk→A is said to be

quantilal relatively to reference policy σ:S∗×Sk→Awhenπ∗:∈argmaxπ:S∗×Sk→AinfP:S→[0,∞){Ex∼Hπn∼γ[R(xn)−ηP(xn)]∣∣ ∣∣Ex∼Hσn∼γ[P(xn)]≤1}

We also define the

quantilum valueQV∈R byQV:=supπ:S∗×Sk→AinfP:S→[0,∞){Ex∼Hπn∼γ[R(xn)−ηP(xn)]∣∣ ∣∣Ex∼Hσn∼γ[P(xn)]≤1}

(QV cannot be −∞ since taking π=σ yields a lower bound of Ex∼Hσn∼γ[R(xn)]−η.)

In the original quantilization formalism, the quantilal strategy can be described more explicitly, as sampling according to the reference measure ζ from some top fraction of actions ranked by expected utility. Here, we don't have an analogous description, but we can in some sense evaluate the infimum over P.

For any π:S∗×Sk→A, define Zπ∈ΔS by

Zπ(s):=Prx∼Hπn∼γ[xn=s]

For any μ,ν∈ΔS, the notation D∞(μ||ν) signifies the Renyi divergence of order ∞:

D∞(μ||ν):=lnmaxs∈Sμ(s)>0μ(s)ν(s)

In general, D∞(μ||ν)∈[0,∞].

## Proposition 1

π∗:S∗×Sk→A is quantilal relatively to reference policy σ:S∗×Sk→A if and only if

π∗∈argmaxπ:S∗×Sk→A(EZπ[R]−ηexpD∞(Zπ||Zσ))

Also, we have

QV=supπ:S∗×Sk→A(EZπ[R]−ηexpD∞(Zπ||Zσ))

If the maximization in Proposition 1 was over arbitrary ξ∈ΔS rather than ξ of the form Zπ, we would get ordinary quantilization and sampling ξ would be equivalent to sampling some top fraction of ζ:=Zσ. However, in general, the image of the Z operator is some closed convex set which is

notthe entire ΔS.So far we considered arbitrary (non-stationary) policies. From classical MDP theory, we know that an

optimalpolicy can always be chosen to be Markov:## Definition 2

π:S∗×Sk→A is said to be a

Markovpolicy, when there is some πM:N×Sk→A s.t. π(h,s)=πM(|h|,s).Note that a priori it might be unclear whether there is even a non-stationary quantilal policy. However, we have

## Proposition 2

For any σ:S∗×Sk→A, there exists a Markov policy which is quantilal relatively to σ.

Now assume that our time discount function is geometric, i.e. there exists λ∈[0,1) s.t. γ(n)=(1−λ)λn. Then it is known than an optimal policy can be chosen to be stationary:

## Definition 3

π:S∗×Sk→A is said to be a

stationarypolicy, when there is some πT:Sk→A s.t. π(h,s)=πT(s).Once again, the situation for quantilal policies is analogous:

## Proposition 3

If γ is geometric, then for any σ:S∗×Sk→A, there exists a stationary policy which is quantilal relatively to σ.

What is

notanalogous is that an optimal policy can be chosen to bedeterministicwhereas, of course, this is not the case for quantilal policies.It is known that the value of an optimal policy depends on the parameters as a piecewise rational function, and in particular it converges as λ→1 and has a Taylor expansion at λ=1. The quantilum value has the same property.

## Proposition 4

QV is a piecewise rational continuous function of λ, η, the matrix elements of T, the values of R, the values of ζ0 and the values of Zσ, with a final number of "pieces".

## Corollary 1

Assume that σ is a stationary policy. Then, QV converges as λ→1, holding all other parameters fixed (in the sense that, σ is fixed whereas Zσ changes as a function of λ). It is analytic in λ for some interval [λ0,1] and therefore can be described by a Taylor expansion around λ=1 inside this interval.

Note that for optimal policies, Proposition 4 holds for a simpler reason. Specifically, the optimal policy is piecewise constant (since it's deterministic) and there is a Blackwell policy i.e. a fixed policy which is optimal for any λ sufficiently close to 1. And, it is easy to see that for a constant policy, the value is a rational function of λ. On the other hand, the quantilal policy is not guaranteed to be locally constant anywhere. Nevertheless the quantilum value is still piecewise rational.

Finally, we address the question of the computational complexity of quantilization. We prove the following

## Proposition 5

Assume geometric time discount. Assume further that R(s), T(t|s,a), ζ0(s), λ, Zσ(s) and η are rational numbers. Then:

a. There is an algorithm for computing QV which runs in time polynomial in the size of the input R, T, ζ0, λ, Zσ and η. Also, if σ is stationary and σ(t|s,a) are rational, then Zσ(s) are also rational and can be computed from σ, T, ζ0 and λ in polynomial time.

b. Given an additional rational input parameter ϵ∈(0,1), there is an algorithm for computing a stationary policy which is an ϵ-equilibrium in the zero-sum game associated with quantilization, which runs in time polynomial in the size of the input and ln1ϵ.

EDIT: In fact, it is possible to do better and compute an

exactquantilal policy in polynomial time.## Future Directions

To tease a little, here are some developments of this work that I'm planning:

Apply quantilization to

reinforcement learning, i.e. when we don't know the MDP in advance. In particular, I believe that the principles of quantilization can be used not only to deal with misspecified rewards, but also to deal with traps to some extent (assuming it a priori known that the reference policy has at most a small probability of falling into a trap). This has some philosophical implications on howhumansavoid traps.Unify that formalism with DRL. The role of the reference policy will be played by the advisor (thus the reference policy is not known in advance but is learned online). This means we can drop the sanity condition for the advisor, at the price of (i) having a regret bound defined w.r.t. some kind of

quantilumvalue rather than optimal value (ii) having a term in the regret bound proportional to the (presumably small) rate of falling into traps when following the reference (advisor) policy. It should be possible to further develop that by unifying it with the ideas of catastrophe DRL.Deal with more general environments, e.g. POMDPs and continuous state spaces.

## Proofs

## Proposition A.1

infP:S→[0,∞){Ex∼Hπn∼γ[R(xn)−ηP(xn)]∣∣ ∣∣Ex∼Hσn∼γ[P(xn)]≤1}=EZπ[R]−ηexpD∞(Zπ||Zσ)

## Proof of Proposition A.1

The definition of Z implies that

Ex∼Hπn∼γ[R(xn)−ηP(xn)]=EZπ[R−ηP]

Also

Ex∼Hσn∼γ[P(xn)]=EZσ[P]

Observe that

EZπ[P]=∑s∈SZπ(s)P(s)≤maxs∈SZπ(s)>0Zπ(s)Zσ(s)⋅∑s∈SZσ(s)P(s)=expD∞(Zπ||Zσ)⋅EZσ[P]

It follows that for any P that satisfies the constraint Ex∼Hσn∼γ[P(xn)]≤1, we have EZσ[P]≤1 and therefore

Ex∼Hπn∼γ[R(xn)−ηP(xn)]=EZπ[R]−ηEZπ[P]≥EZπ[R]−ηexpD∞(Zπ||Zσ)

To show that the inequality can be arbitrarily close to equality, choose s∗∈S s.t. Zπ(s∗)>0 and Zπ(s∗)Zσ(s∗)=expD∞(Zπ||Zσ). If Zσ(s∗)>0, we can define P∗ by

P∗(s):={Zσ(s∗)−1 if s=s∗0 otherwise

Clearly EZσ[P∗]=1 and EZπ[P∗]=expD∞(Zπ||Zσ). We get

Ex∼Hπn∼γ[R(xn)−ηP∗(xn)]=EZπ[R]−ηEZπ[P∗]=EZπ[R]−ηexpD∞(Zπ||Zσ)

In the case Zσ(s∗)>0, we can take any M>0 and define PM by

PM(s):={M if s=s∗0 otherwise

Clearly EZσ[PM]=0≤1 and EZπ[PM]=M⋅Zπ(s∗). We get

Ex∼Hπn∼γ[R(xn)−ηPM(xn)]=EZπ[R]−ηEZπ[PM]=EZπ[R]−M⋅Zπ(s∗)

Since Zπ(s∗)>0, we can make this expression arbitrarily low and therefore the infimum is −∞ which is what we need since in this case D∞(Zπ||Zσ)=∞.

Proposition 1 now follows immediately from Proposition A.1.

We will use the notation Π:={S∗×Sk→A} (the space of all policies). We also have ΠM:={N×Sk→A} (the space of Markov policies) and ΠT:={Sk→A} (the space of stationary policies). Mildly abusing notation, we will view ΠM as a subspace of Π and ΠT as a subspace of ΠM.

## Proposition A.2

For any σ:S∗×Sk→A, there exists

somepolicy which is quantilal relatively to σ.## Proof of Proposition A.2

Π is the product of a countable number of copies of ΔA (indexed by S∗×S). ΔA is naturally a topological space (a simplex), and we can thereby equip Π by the product topology. By Tychonoff's theorem, Π is compact. Moreover, it is easy to see that Z:Π→ΔS is a continuous mapping. Finally, observe that D∞(ξ||Zσ) is lower semicontinuous in ξ (since it is the maximum of a number of continuous functions) and therefore Eξ[R]−ηexpD∞(ξ||Zσ) is upper semicontinuous in ξ. By the extreme value theorem, it follows that a quantilal policy exists.

For any n∈N, we define Zn:Πk→S by

Znπ(s):=Prx∼Hπ[xn=s]

## Proof of Proposition 2

By Proposition A.2, there is a quantilal policy π∗:S∗×Sk→A. We define π†:N×Sk→A by

π†(n,s):=Ex∼Hπ∗[π∗(x:n,s)|xn=s]

We now prove by induction that for any n∈N, Znπ∗=Znπ†.

For n=0, we have Z0π∗=Z0π†=ζ0.

For any n∈N and any π:S∗×Sk→A, we have

Zn+1π(t)=Prx∼Hπ[xn+1=t]=Es∼Znπ[Prx∼Hπ[xn+1=t|xn=s]]=Es∼Znπ⎡⎢⎣Ex∼Hπa∼π(x:n,s)[T(t|s,a)|xn=s]⎤⎥⎦

To complete the induction step, observe that, by definition of π†

Ex∼Hπa∼π∗(x:n,s)[T(t|s,a)|xn=s]=Ea∼π†(n,s)[T(t|s,a)]=Ex∼Hπa∼π†(n,s)[T(t|s,a)|xn=s]

Now, for any π, Zπ=En∼γ[Znπ] and therefore Zπ†=Zπ∗. We conclude that π† is also quantilal.

## Proof of Proposition 3

By Proposition 2, there is π∗:N×Sk→A which is a Markov quantilal policy. We have

Zn+1π∗(t)=Prx∼Hπ∗[xn+1=t]=Es∼Znπ∗[Prx∼Hπ∗[xn+1=t|xn=s]]=Es∼Znπ∗a∼π∗(n,s)[T(t|s,a)]

Taking the expected value of this equation w.r.t. n∼γ we get

En∼γ[Zn+1π∗]=En∼γs∼Znπ∗a∼π∗(n,s)[T(s,a)]

Also, we have

Zπ∗=En∼γ[Znπ∗]=(1−λ)∞∑n=0λnZnπ∗=(1−λ)(ζ0+λ∞∑n=0λnZn+1π∗)=(1−λ)ζ0+λEn∼γ[Zn+1π∗]

It follows that

Zπ∗=(1−λ)ζ0+λEn∼γs∼Znπ∗a∼π∗(n,s)[T(s,a)]

Zπ∗=(1−λ)ζ0+λEs∼Zπ∗⎡⎢ ⎢ ⎢⎣En∼γt∼Znπ∗a∼π∗(n,s)[T(s,a)|t=s]⎤⎥ ⎥ ⎥⎦

Define π†:Sk→A by

π†(s):=En∼γt∼Znπ∗[π∗(n,s)|t=s]

We get

Zπ∗=(1−λ)ζ0+λEs∼Zπ∗[Ea∼π†(s)[T(s,a)]]

Define the linear operator T†:RS→RS by the matrix

T†ts:=Ea∼π†(s)[T(t|s,a)]

Viewing ΔS as a subset of RS, we get

Zπ∗=(1−λ)ζ0+λT†Zπ∗

Zπ∗=(1−λ)(1−λT†)−1ζ0

On the other hand, we have

Zπ†=(1−λ)∞∑n=0λnT†nζ0=(1−λ)(1−λT†)−1ζ0

Therefore, Zπ†=Zπ∗ and π† is also quantilal.

## Proposition A.3

Assume geometric time discount. Consider any ζ∈ΔS. Define the linear operators I:RS→RS×A and T:RS→RS×A by the matrices

Isa,t:=[[t=s]]

Tsa,t:=T(t|s,a)

Define the linear operator A:RS⊕RS→RS×A⊕RS⊕R by

A(V,P):=((I−λT)V+η(1−λ)IP, P, −Eζ[P])

Define B∈RS×A⊕RS⊕R by

B:=((1−λ)IR, 0, −1)

Define D⊆RS⊕RS as follows

D:={X∈RS⊕RS∣∣AX≥B}

Here, vector inequalities are understood to be componentwise.

Then,

QVζ:=supπ:S∗×Sk→A(EZπ[R]−ηexpD∞(Zπ||ζ))=inf(V,P)∈DEζ0[V]

In particular, if ζ=Zσ, the above expression describes QV

## Proof of Proposition A.3

Denote Πdet:={S∗×S→A}. As usual in extensive-form games, behavioral strategies are equivalent to mixed strategies and therefore the image of the pushforward operator Z∗:ΔΠdet→ΔS is the same as the image of Z:Π→ΔS. It follows that

QV=supμ∈ΔΠdetinfP:S→[0,∞){EZ∗μ[R−ηP]∣∣∣Eζ[P]≤1}

Πdet is a compact Polish space (in the sense of the product topology) and therefore ΔΠdet is compact in the weak topology. By Sion's minimax theorem

QV=infP:S→[0,∞)maxπ∈Πdet{EZπ[R−ηP]∣∣∣Eζ[P]≤1}

Now consider any X=(V,P)∈D. AX≥B implies (looking at the RS×A component) that

(I−λT)V+η(1−λ)IP≥(1−λ)IR

(I−λT)V≥(1−λ)I(R−ηP)

V(s)−λ∑t∈ST(t|s,a)V(t)≥(1−λ)(R(s)−ηP(s))

V(s)≥(1−λ)(R(s)−ηP(s))+λmaxa∈A∑t∈ST(t|s,a)V(t)

Therefore, there is some R′≥R s.t.

V(s)=(1−λ)(R′(s)−ηP(s))+λmaxa∈A∑t∈ST(t|s,a)V(t)

The above is the Bellman equation for a modified MDP with reward function R′−ηP. Denote Zs the version of the Z operator for the initial state s (instead of ζ0). We get

V(s)=maxπ∈ΠdetEZsπ[R′−ηP]≥maxπ∈ΠdetEZsπ[R−ηP]

Observing that Es∼ζ0[Zsπ]=Zπ, we get

Eζ0[V]≥Es∼ζ0[maxπ∈ΠdetEZsπ[R−ηP]]≥maxπ∈ΠdetEZπ[R−ηP]

On the other hand, for any P∈RS s.t. P≥0 and EZσ[P]≤1 (these inequalities correspond to the RS⊕R component of AX≥B), there is (V,P)∈D s.t.

Eζ0[V]=maxπ∈ΠdetEZπ[R−ηP]

Namely, this V is the solution of the Bellman equation for the reward function R−ηP. Therefore, we have

maxπ∈ΠdetEZπ[R−ηP]=minV∈RS{Eζ0[V]∣∣∣(V,P)∈D}

Taking the infimum of both sides over P inside the domain {P≥0, Eζ[P]≤1} we get

QVζ=infP:S→[0,∞)maxπ∈Πdet{EZπ[R−ηP]∣∣∣Eζ[P]≤1}=inf(V,P)∈DEζ0[V]

## Proof of Proposition 4

Consider the characterization of QV in Proposition A.3. By general properties of systems of linear inequalities, there is some J⊆(S×A)⊔S⊔{∙} s.t.

D♯J:={X∈RS⊕RS∣∣AJX=BJ}⊆argmin(V,P)∈DEζ0[V]

Here, the notation AJ means taking the submatrix of A consisting of the rows corresponding to J. Similarly, BJ is the subvector of B consisting of the components corresponding to J.

(To see this, use the fact that the minimum of a linear function on a convex set is always attained on the boundary, and apply induction by dimension.)

Moreover, D♯J has to be a single point XJ∈RS⊕RS. Indeed, if it has more than one point then it contains a straight line L. The projection of L on the second RS has to be a single point P0, otherwise some point in L violates the inequality P≥0 which would contradict L⊆D. Therefore, the projection of L on the first RS is also a straight line L′. As in the proof of Proposition A.3, for any (V,P)∈D, V(s) is an upper bound on the value of state s in the MDP with reward function R−ηP. In particular, if (V,P0)∈D then V(s)≥mint∈S(R(t)−ηP0(t)). However, since L′ is a line, there must be some s∗∈S s.t. V(s∗) can be

anyreal number for some (V,P0)∈L. This is a contradiction.Denote Q:RS⊕RS→RS the projection operator on the first direct summand. It follows that we can always choose J s.t. |J|=2|S| and we have

XJ=A−1JBJ

QV=Eζ0[QA−1JBJ]

For each J, the expression on the right hand side is a rational function of ζ0 and the matrix elements of A and B which, in turn, are polynomials in the parameters the dependence on which we are trying to establish. Therefore, this expression is a rational function in those parameters (unless detAJ vanishes identically, in which case this J can ignored). So, QV always equals one of a finite set of rational functions (corresponding to difference choices of J). The continuity of QV also easily follows from its characterization as min(V,P)∈DEζ0[V].

## Proposition A.4

Assume geometric time discount and stationary σ. Then, Zσ is a rational function of σ, T, ζ0 and λ with rational coefficients.

## Proof of Proposition A.4

Define the linear operator Tσ:RS→RS by the matrix

Tσts=Ea∼σ(s)[T(t|s,a)]

We have

Zσ=(1−λ)∞∑n=0λnTσnζ0=(1−λ)(1−λTσ)−1ζ0

## Proof of Corollary 1

By Propositions 4 and A.4, QV is a continuous piecewise rational function of λ with a finite number of pieces. Two rational functions can only coincide at a finite number of points (since a polynomial can only have a finite number of roots), therefore there is only a finite number of values of λ in which QV "switches" from one rational function to another. It follows that there is some λ0∈[0,1) s.t. QV is a fixed rational function of λ in the interval [λ0,1).

Moreover, it always holds that

mins∈SR(s)−η≤QV≤maxs∈SR(s)

The first inequality holds since, the guaranteed performance of the quantilal policy is at least as good as the guaranteed performance of σ. The second inequality is a consequence of the requirement that P is non-negative.

It follows that QV cannot have a pole at λ=1 and therefore must converge to a finite value there.

## Proof of Proposition 5

The algorithm for QV is obtained from Proposition A.3 using linear programming.

The claim regarding Zσ for stationary σ follows from Proposition A.4.

We now describe the algorithm for computing an ϵ-equilibrium.

For any a∈A, define da:A⊔{⊥}→A by

da(b):={a if b=⊥b otherwise

Consider any β:Sk→A⊔{⊥}. We define Tβ:S×Ak→S by

Tβ(s,a):=Eb∼β(s)[T(s,da(b))]

Let Zβ:Πk→S be the Z-operator for the MDP with transition kernel Tβ, and define

QVβ:=supπ:S∗×Sk→A(EZβπ[R]−ηexpD∞(Zβπ∣∣∣∣Zσ))

It is easy to see that QVβ=QV if an only if there is π:Sk→A quantilal s.t. π(a|s)≥β(a|s). Indeed, the MDP with kernel Tβ is equivalent to the original MDP under the constraint that, when in state s, any action a has to be taken with the minimal probability β(a|s). In particular QVβ≤QV (since we constrained the possible policies but not the possible penalties). So, if π as above exists, then we can use it to construct a stationary policy for the new MDP with guaranteed value QV. Conversely, if the new MDP has a stationary policy with guaranteed value QV then it can be used to construct the necessary π.

Using Proposition A.3, we can compute QVβ for any rational β by linear programming. This allows us to find the desired policy by binary search on β, one (state,action) pair at a time.