Earlier we established that the quantilal policy can be computed in polynomial time to any given approximation (see "Proposition 5"). Now we show that an *exact* quantilal policy can be computed in polynomial time (in particular there is always a rational quantilal policy).

We assume geometric time discount throughout.

# Lemma

Consider . Define the linear operators and by

(Note that this is the *transpose* of the defined in "Proposition A.3" of the previous essay.)

Then, if and only if there is s.t.

# Proof

(This is actually well known, but we spell out the proof to be self-contained.)

Suppose that . We already know that this implies that there is a *stationary* policy s.t. (we abuse notation in the obvious way): see the proofs of "Proposition 2" and "Proposition 3". Define the linear operator by

It follows that

Define by

We have

Also, obviously . We get

Conversely, suppose that is as above. Since , there is s.t. for any , if then

Again, we have

Also, for the same reason as before

By the assumption, the left hand side equals . We conclude

# Theorem

Assuming all parameters are rational like before, there is a polynomial time algorithm that computes a quantilal policy.

# Proof

The algorithm starts by solving the following linear program. The indeterminates are and . The goal is maximizing . The constraints are

Then, the algorithm computes s.t. for any , if then

For s.t. , is arbitrary.

Now we need to explain why this algorithm is correct.

Observe that, the first 3 constraints mean that defined by lies in (by Lemma 1) and, moreover, for s.t. . It remains to show that the linear program amounts to maximizing inside . Indeed, the 4th constraint just means that . The last constraint implies that we are actually maximizing

The latter is indeed , since every corresponds to a pure strategy of the adversary in the corresponding zero-sum game: namely, this strategy is setting the penalty function to

(Strategies that place non-vanishing penalty on states outside of become irrelevant after imposing the 4th constraint. The remaining penalty functions form a simplex with vertices as above.)