This post accompanies Crisp Supra-Decision Processes and contains the proof of the following proposition.
Proposition 1 [Alexander Appel (@Diffractor), Vanessa Kosoy (@Vanessa Kosoy)]: Let $M$ be a crisp supra-MDP with geometric time discount $\gamma \in [0,1)$ such that $\mathcal{S}$ and $\mathcal{A}$ are finite. Then there exists a stationary optimal policy.
Proof: We first recall some notation. Let $\mathcal{A}$ denote the set of actions, and let $\mathcal{S}$ denote the set of states. Let $\mathcal{H}$ denote the set of histories and $\mathcal{H}_n := (\mathcal{S} \times \mathcal{A})^n \times \mathcal{S}$ denote the set of histories of length $n$. For $h \in \mathcal{H}$, let $\mathcal{D}_h \subseteq \mathcal{D}$ denote the set of destinies with prefix $h$, where $\mathcal{D} := (\mathcal{S} \times \mathcal{A})^\omega$ denotes the set of destinies.
Throughout, we assume that $L$ is the sum of the momentary losses $\ell: \mathcal{S} \times \mathcal{A} \to [0,1]$ at each time-step with geometric time discount $\gamma$. More specifically, given $d = s_0 a_0 s_1 a_1 \ldots \in \mathcal{D}$, we write $d_t := (s_t, a_t)$. Then $L$ is given by
$$L(d) = (1-\gamma)\sum_{t=0}^{\infty} \gamma^t\, \ell(d_t).$$
Fix $n \in \mathbb{N}$. Recall that $\mathcal{D}$ can be written as the finite disjoint union
$$\mathcal{D} = \bigsqcup_{h \in \mathcal{H}_n} \mathcal{D}_h.$$
This fact, together with Fubini's theorem, implies that the expected loss can be written as an iterated expectation. More concretely, let $L_{<n}(h) := (1-\gamma)\sum_{t<n} \gamma^t \ell(h_t)$ denote the loss accumulated along $h \in \mathcal{H}_n$, and for $d \in \mathcal{D}_h$ let $d_{\geq n} \in \mathcal{D}$ denote the tail of $d$ starting at time $n$, so that $L(d) = L_{<n}(h) + \gamma^n L(d_{\geq n})$. Then the (classical) expected loss with respect to $\theta \in \Delta\mathcal{D}$ can be written as
$$\mathbb{E}_{d \sim \theta}[L(d)] = \mathbb{E}_{h \sim \theta_n}\Big[L_{<n}(h) + \gamma^n\, \mathbb{E}_{d \sim \theta|_h}\big[L(d_{\geq n})\big]\Big], \tag{1}$$
where $\theta_n \in \Delta\mathcal{H}_n$ denotes the marginal of $\theta$ on $\mathcal{H}_n$ and $\theta|_h$ denotes $\theta$ conditioned on $\mathcal{D}_h$.
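As a quick sanity check of the prefix/tail decomposition behind (1), here is a minimal numerical sketch (my own illustration, not part of the original proof): it truncates a single destiny at a large horizon, computes the discounted loss directly, and compares it with the $L_{<n}(h) + \gamma^n L(d_{\geq n})$ form. All names and numbers (`gamma`, `ell`, the random losses) are made up for the example.

```python
import random

gamma = 0.9   # geometric time discount
n = 5         # split point of the prefix/tail decomposition
T = 2000      # truncation horizon; the discarded tail contributes at most gamma**T

# Made-up momentary losses standing in for ell(s_t, a_t) along one destiny.
random.seed(0)
ell = [random.random() for _ in range(T)]

def L(seq):
    """Normalized discounted loss: (1 - gamma) * sum_t gamma^t * seq[t]."""
    return (1 - gamma) * sum(gamma**t * x for t, x in enumerate(seq))

direct = L(ell)                                # L(d) computed in one pass
split = L(ell[:n]) + gamma**n * L(ell[n:])     # L_{<n}(h) + gamma^n * L(d_{>=n})
assert abs(direct - split) < 1e-9
print(direct, split)
```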
Equation (1) implies the following claim. Let $\Lambda$ denote a crisp causal law. Given a policy $\pi$, let $\pi_{<n}$ denote the partial policy obtained by restricting $\pi$ to the first $n$ time steps. More specifically, let $\mathcal{H}_{<n} := \bigcup_{m<n} \mathcal{H}_m$ denote the set of histories of length at most $n-1$. Then $\pi_{<n} := \pi|_{\mathcal{H}_{<n}}$. Furthermore, let $\pi_h$ denote the continuation of $\pi$ after a history $h$, i.e. $\pi_h := \pi|_{\mathcal{H}^h}$, where $\mathcal{H}^h$ denotes the set of histories with prefix $h$. Given a policy $\pi$, let $\Lambda(\pi_{<n}) \subseteq \Delta\mathcal{H}_n$ and $\Lambda(\pi_h) \subseteq \Delta\mathcal{D}$ denote the credal sets induced by $\pi_{<n}$ and $\pi_h$ by restricting $\Lambda$ in the natural way.[1]
Then, using the decomposition above, the supra-expectation $\mathbb{E}_{\Lambda(\pi)}[L] := \max_{\theta \in \Lambda(\pi)} \mathbb{E}_{d \sim \theta}[L(d)]$ can be written (in a manner reminiscent of Fubini's Theorem) as
$$\mathbb{E}_{\Lambda(\pi)}[L] = \max_{\theta \in \Lambda(\pi_{<n})}\ \mathbb{E}_{h \sim \theta}\Big[L_{<n}(h) + \gamma^n\, \mathbb{E}_{\Lambda(\pi_h)}[L]\Big]. \tag{2}$$
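To make the shape of (2) concrete, the following sketch (again my own illustration, with made-up data) computes a supra-expectation as a maximum of classical expectations over a credal set, where each history contributes its accumulated loss plus the discounted supra-expected continuation loss. Representing the credal set by finitely many distributions is an assumption made for the example; in general $\Lambda(\pi_{<n})$ is a compact convex set, and the maximum of a linear functional over it is attained at an extreme point.

```python
# Minimal illustration of equation (2): a supra-expectation is a maximum of
# classical expectations over a credal set. The credal set over length-n
# histories is a hypothetical finite list of distributions, and each history
# h carries a made-up accumulated loss L_{<n}(h) and a made-up supra-expected
# continuation loss E_{Lambda(pi_h)}[L].

gamma, n = 0.9, 3

prefix_loss = {"h1": 0.2, "h2": 0.5, "h3": 0.1}    # L_{<n}(h)
continuation = {"h1": 0.4, "h2": 0.3, "h3": 0.6}   # E_{Lambda(pi_h)}[L]

# Hypothetical vertices of the credal set Lambda(pi_{<n}).
credal_set = [
    {"h1": 0.5, "h2": 0.3, "h3": 0.2},
    {"h1": 0.1, "h2": 0.6, "h3": 0.3},
]

def expectation(theta):
    """Classical expectation of L_{<n}(h) + gamma^n * E_{Lambda(pi_h)}[L] under theta."""
    return sum(p * (prefix_loss[h] + gamma**n * continuation[h])
               for h, p in theta.items())

# The supra-expectation of equation (2): maximize over the credal set.
supra_loss = max(expectation(theta) for theta in credal_set)
print(supra_loss)
```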
To prove (2), we use (1) and observe that, since $\Lambda$ is causal, a distribution $\theta$ lies in $\Lambda(\pi)$ exactly when its marginal $\theta_n$ lies in $\Lambda(\pi_{<n})$ and its conditionals $\theta|_h$ lie in the continuation credal sets $\Lambda(\pi_h)$, and these conditionals can be chosen independently for each $h \in \mathcal{H}_n$. Hence
$$\max_{\theta \in \Lambda(\pi)} \mathbb{E}_{d \sim \theta}[L(d)] = \max_{\theta_n \in \Lambda(\pi_{<n})} \mathbb{E}_{h \sim \theta_n}\Big[L_{<n}(h) + \gamma^n \max_{\theta^h \in \Lambda(\pi_h)} \mathbb{E}_{d \sim \theta^h}[L(d)]\Big],$$
where the inner maxima may be moved inside the outer expectation precisely because the conditionals $\theta^h$ vary independently across $h$.
We now prove the key claim, which states that, given any policy, the supra-expected loss cannot increase if we switch, at some time $n$, to a policy that is optimal for the state observed at that time.
Claim: Given a policy $\pi$ and $n \in \mathbb{N}$, let $\pi^{(n)}$ denote the policy obtained by following $\pi$ for the first $n$ time steps and then following a policy $\pi^{s}$ that is optimal for the supra-MDP initialized at the state $s$ observed at time $n$. Then for all $n \in \mathbb{N}$,
$$\mathbb{E}_{\Lambda(\pi^{(n)})}[L] \leq \mathbb{E}_{\Lambda(\pi)}[L].$$
Proof of claim: By Equation (2),
$$\mathbb{E}_{\Lambda(\pi^{(n)})}[L] = \max_{\theta \in \Lambda(\pi^{(n)}_{<n})}\ \mathbb{E}_{h \sim \theta}\Big[L_{<n}(h) + \gamma^n\, \mathbb{E}_{\Lambda(\pi^{(n)}_h)}[L]\Big].$$
Note that $\pi^{(n)}_h = \pi^{s(h)}$ for every $h \in \mathcal{H}_n$, where $s(h)$ denotes the last state of $h$. Thus,
$$\mathbb{E}_{\Lambda(\pi^{(n)})}[L] = \max_{\theta \in \Lambda(\pi^{(n)}_{<n})}\ \mathbb{E}_{h \sim \theta}\Big[L_{<n}(h) + \gamma^n\, \mathbb{E}_{\Lambda(\pi^{s(h)})}[L]\Big].$$
Since $\pi^{s(h)}$ is optimal for the state observed at time $n$,
$$\mathbb{E}_{\Lambda(\pi^{s(h)})}[L] \leq \mathbb{E}_{\Lambda(\pi_h)}[L] \quad \text{for every } h \in \mathcal{H}_n.$$
By monotonicity, we then have
$$\mathbb{E}_{h \sim \theta}\Big[L_{<n}(h) + \gamma^n\, \mathbb{E}_{\Lambda(\pi^{s(h)})}[L]\Big] \leq \mathbb{E}_{h \sim \theta}\Big[L_{<n}(h) + \gamma^n\, \mathbb{E}_{\Lambda(\pi_h)}[L]\Big] \quad \text{for every } \theta \in \Lambda(\pi^{(n)}_{<n}).$$
Repeating the same argument as above for the maximum over $\theta$,
$$\max_{\theta \in \Lambda(\pi^{(n)}_{<n})} \mathbb{E}_{h \sim \theta}\Big[L_{<n}(h) + \gamma^n\, \mathbb{E}_{\Lambda(\pi^{s(h)})}[L]\Big] \leq \max_{\theta \in \Lambda(\pi^{(n)}_{<n})} \mathbb{E}_{h \sim \theta}\Big[L_{<n}(h) + \gamma^n\, \mathbb{E}_{\Lambda(\pi_h)}[L]\Big].$$
By construction, the partial policy obtained by restricting $\pi^{(n)}$ to the first $n$ time steps is equal to $\pi_{<n}$, i.e. $\pi^{(n)}_{<n} = \pi_{<n}$. Then by Equation (2),
$$\max_{\theta \in \Lambda(\pi_{<n})} \mathbb{E}_{h \sim \theta}\Big[L_{<n}(h) + \gamma^n\, \mathbb{E}_{\Lambda(\pi_h)}[L]\Big] = \mathbb{E}_{\Lambda(\pi)}[L].$$
Therefore,
$$\mathbb{E}_{\Lambda(\pi^{(n)})}[L] \leq \mathbb{E}_{\Lambda(\pi)}[L],$$
which proves the claim. $\square$
We now apply the claim to finish the proof of the proposition. For ease of notation, let $\sigma$ denote the stationary policy defined by $\sigma(s) := \pi^s(s)$, i.e. the first action taken by the optimal policy for $s$. Let $\pi^*$ denote a given optimal policy. Define the sequence of policies $(\nu^k)$ recursively as follows: $\nu^0 := \pi^*$ and $\nu^{k+1} := (\nu^k)^{(k)}$. By induction using the claim, every $\nu^k$ is optimal, and $\nu^{k+1}$ agrees with $\nu^k$ on the first $k$ time steps while playing $\sigma$ at all times up to $k$. The limit of this sequence is therefore the stationary policy $\sigma$, which is optimal by the claim above: since $\sigma$ agrees with the optimal policy $\nu^{k+1}$ on the first $k+1$ time steps, their supra-expected losses differ by at most $\gamma^{k+1}$, which tends to $0$ as $k \to \infty$. $\blacksquare$
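Although the argument above is a limiting construction rather than an algorithm, the same structure suggests how a stationary optimal policy can be computed in the finite crisp setting: iterate a robust Bellman backup (minimum over actions, maximum over the credal set of transition distributions) to its fixed point and read off the minimizing action in each state. The sketch below is my own illustration on a made-up two-state supra-MDP, not code from the post; `credal`, `ell`, and all the numbers are hypothetical, and the credal sets are represented by finitely many distributions for simplicity.

```python
# Sketch: robust value iteration for a tiny crisp supra-MDP. Transition
# uncertainty is given per (state, action) as a finite set of candidate
# distributions over next states; the agent minimizes the worst-case
# (supra-) expected discounted loss.

gamma = 0.9
states, actions = [0, 1], [0, 1]

# ell[s][a]: momentary loss (made up).
ell = [[0.0, 1.0],
       [1.0, 0.2]]

# credal[(s, a)]: hypothetical finite credal set of next-state distributions.
credal = {
    (0, 0): [[0.9, 0.1], [0.5, 0.5]],
    (0, 1): [[0.2, 0.8]],
    (1, 0): [[0.7, 0.3], [0.4, 0.6]],
    (1, 1): [[0.1, 0.9]],
}

def backup(s, a, V):
    """Worst-case one-step value: max over the credal set of the discounted loss."""
    return max(
        (1 - gamma) * ell[s][a] + gamma * sum(p * V[t] for t, p in enumerate(theta))
        for theta in credal[(s, a)]
    )

V = [0.0 for _ in states]
for _ in range(1000):  # iterate the robust Bellman operator to its fixed point
    V = [min(backup(s, a, V) for a in actions) for s in states]

# Extract the stationary policy: the minimizing action in each state.
policy = [min(actions, key=lambda a, s=s: backup(s, a, V)) for s in states]
print("V* =", V, "stationary policy:", policy)
```

Because the backup is a $\gamma$-contraction in the sup norm, the iteration converges geometrically, mirroring the $\gamma^{k}$ bound used for the limit in the proof.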
[1] For ease of notation, we drop the superscripts on $\Lambda$ that appear in the main post.