Proof Section to Formalizing Newcombian Problems with Fuzzy Infra-Bayesianism

Brittany Gelb

This proof section accompanies Formalizing Newcombian problems with fuzzy infra-Bayesianism. We prove the following result.

Theorem [Alexander Appel (@Diffractor), Vanessa Kosoy (@Vanessa Kosoy)]:
Let $ν$ be a Newcombian problem of horizon $H \in \mathbb{N}$ that satisfies pseudocausality. Let $M_{ν} = (S, Θ_{0}, A, O, T, B)$ denote the associated supra-POMDP with infinite time horizon and time discount $γ \in [0, 1) .$ Then
$\lim_{γ \to 1} | \min_{π \in Π} E_{M_{ν}^{π}} [L^{γ}] - \min_{π \in Π} E_{ν^{π}} [L^{γ}] | = 0.$
Furthermore, if $\{π_{γ}\}_{γ \in [0, 1)}$ is a family of policies such that $\lim_{γ \to 1} ( E_{M_{ν}^{π_{γ}}} [L^{γ}] - \min_{π \in Π} E_{M_{ν}^{π}} [L^{γ}] ) = 0,$ then
$\lim_{γ \to 1} ( E_{ν^{π_{γ}}} [L^{γ}] - \min_{π \in Π} E_{ν^{π}} [L^{γ}] ) = 0.$

Proof: Let $λ$ denote the empty history. Given a supracontribution $Θ,$ let $\max (Θ)$ denote the set of maximal extreme points of $Θ .$ First we remark that for any supra-POMDP, without loss of generality, a set of copolicies $Ξ$ can always be replaced by

$\{ τ \in Ξ \mid τ (λ) \in \max (Θ_{0}), \ \forall h \neq λ \in (S \times A)^{*}, \ τ (h) \in \max (T (s, a)) \} .$

Given an episode policy $π \in Π_{H},$ let $τ_{π}$ denote the episode copolicy that initializes the state to $(π, λ),$ i.e. $τ_{π} (λ) = δ_{π \times λ} .$ Let $τ_{π}^{π} \in Δ (A \times O)^{H}$ denote the distribution over outcomes determined by the interaction of $π$ and $τ_{π} .$ Note that the expected loss with respect to $τ_{π}^{π}$ is equal to the expected loss for the Newcombian problem, i.e.

$E_{τ_{π}^{π}} [L] = E_{ν^{π}} [L] .$

Recall that throughout this sequence, we assume that $O$ is finite. By the remark at the beginning of the proof, the expected loss in one episode for the corresponding supra-POMDP can be written as a maximum expected loss over a finite set of $M_{ν}$ -copolicies $τ .$ Namely,

$E_{M_{ν}^{π}} [L] = \max_{τ} E_{τ^{π}} [L] .$

Then

$\max_{τ} E_{τ^{π}} [L] \geq E_{τ_{π}^{π}} [L] = E_{ν^{π}} [L],$

and thus for any episode policy $π \in Π_{H},$

$E_{M_{ν}^{π}} [L] \geq E_{ν^{π}} [L] .$

We now extend this analysis to the optimal loss over $n$ episodes for $n \in \mathbb{N} .$ ^[1] Let $L^{*} := \min_{π \in Π_{H}} E_{ν^{π}} [L]$ denote the episode optimal loss for $ν .$ Let $π_{n} \in Π$ be an arbitrary policy for $n$ episodes of $M_{ν} .$ Then as before,

$E_{M_{ν}^{π_{n}}} [L_{n}^{γ}] = \max_{τ_{n}} E_{{τ_{n}}^{π_{n}}} [L_{n}^{γ}]$

where the maximum is over a finite set of $n$ -episode copolicies $τ_{n} .$ By the single episode case,

$\max_{τ_{n}} E_{{τ_{n}}^{π_{n}}} [L_{n}^{γ}] \geq (1 - γ) \sum_{k = 0}^{n - 1} γ^{k} L^{*} = (1 - γ^{n}) L^{*},$

and thus $\min_{π_{n} \in Π} E_{M_{ν}^{π_{n}}} [L_{n}^{γ}] \geq (1 - γ^{n}) L^{*},$ which tends to $L^{*}$ as $n \to \infty .$

It remains to show that the opposite inequality holds in the many-episode and $γ \to 1$ limit.

Recall that given $π \in Π_{H},$ we define

$C_{π} := \{ h \in (A \times O)^{H} \mid \forall \text{ prefixes } h^{'} a \in (A \times O)^{*} \times A \text{ of } h, \ a = π (h^{'}) \} .$

Recall that since $ν$ satisfies pseudocausality, there exists a $ν$ -optimal policy $π^{*} \in Π_{H}$ such that for all $π \in Π_{H},$ if $supp (ν^{π}) \subseteq C_{π^{*}},$ then $π$ is also optimal for $ν .$ Consequently, for any episode copolicy $τ,$ either $E_{τ^{π^{*}}} [L] \leq L^{*}$ or $E_{τ^{π^{*}}} [1] < 1.$ To see this, suppose there exists an episode copolicy $τ$ such that $E_{τ^{π^{*}}} [1] = 1.$ Then there exists a policy $π^{†}$ such that $supp (ν^{π^{†}}) \subseteq C_{π^{*}}$ and $E_{τ^{π^{*}}} [L] = E_{ν^{π^{†}}} [L]$ . By pseudocausality, $E_{ν^{π^{†}}} [L] = L^{*} .$ Thus $E_{τ^{π^{*}}} [L] \leq L^{*} .$

Define

$α := \max_{τ \, : \, E_{τ^{π^{*}}} [L] > L^{*}} E_{τ^{π^{*}}} [1] .$

By the remark at the beginning of the proof, the relevant set of copolicies in the definition of $α$ is finite, and thus $α$ is well-defined. If $E_{τ^{π^{*}}} [L] > L^{*}$ then $E_{τ^{π^{*}}} [1] < 1.$ Thus $α \in [0, 1) .$
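As an illustration of the definition of $α$, here is a minimal Python sketch. It assumes each copolicy in the finite set has already been summarized by a pair (expected loss, total mass), that is, $(E_{τ^{π^{*}}} [L], E_{τ^{π^{*}}} [1])$. The function name, the numbers, and the convention of returning 0 when no copolicy exceeds $L^{*}$ are all hypothetical choices made for the example, not part of the original argument.

```python
def alpha_from_copolicies(stats, l_star):
    """Largest total mass among copolicies whose expected loss exceeds l_star.

    stats: iterable of (expected_loss, total_mass) pairs, one per copolicy.
    Returns 0.0 if no copolicy has expected loss above l_star (an assumed
    convention for the empty case).
    """
    masses = [mass for loss, mass in stats if loss > l_star]
    return max(masses, default=0.0)


# Hypothetical summary of four copolicies interacting with the optimal policy.
# Every copolicy with total mass 1 has expected loss at most l_star, as the
# pseudocausality argument requires.
stats = [(0.3, 1.0), (0.45, 0.8), (0.2, 1.0), (0.35, 0.5)]
alpha = alpha_from_copolicies(stats, l_star=0.3)
print(alpha)
```

In this toy data only the second and fourth copolicies have expected loss above $L^{*} = 0.3$, and both have total mass strictly below 1, so the resulting value lies in $[0, 1)$.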

Consider the iterated Newcombian problem over $n$ episodes. Let $π_{n}^{*}$ denote the multi-episode policy such that $π_{n}^{*}$ restricted to every episode is $π^{*} .$ Let $τ_{n}$ denote an arbitrary copolicy that interacts with $π_{n}^{*} .$ Furthermore, let $m \leq n$ denote the number of episodes for which the episode-restriction $τ$ of $τ_{n}$ interacting with $π_{n}^{*}$ satisfies $E_{τ^{π^{*}}} [L] > L^{*} .$ ^[2]

We have

$E_{τ_{n}^{π_{n}^{*}}} [L_{n}^{γ}] \leq E_{τ_{n}^{π_{n}^{*}}} [1] \leq α^{m} .$

Furthermore,

$\begin{matrix} E_{τ_{n}^{π_{n}^{*}}} [L_{n}^{γ}] & \leq (1 - γ) \left( \sum_{t = 0}^{m - 1} γ^{t} \cdot 1 + \sum_{t = m}^{n - 1} γ^{t} \cdot L^{*} \right) \\ & = (1 - γ) \left( \frac{1 - γ^{m}}{1 - γ} + L^{*} γ^{m} \cdot \frac{1 - γ^{n - m}}{1 - γ} \right) \\ & = 1 - γ^{m} + L^{*} (γ^{m} - γ^{n}) . \end{matrix}$
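The geometric-sum simplification in this bound can be sanity-checked numerically. The Python sketch below compares the explicit discounted sum (loss 1 in the first $m$ episodes, loss $L^{*}$ in the remaining $n - m$) with the closed form $1 - γ^{m} + L^{*} (γ^{m} - γ^{n})$. The function names and test values are illustrative assumptions, and the check is not a substitute for the algebra.

```python
import math


def explicit_bound(gamma, l_star, m, n):
    # Loss 1 in episodes 0..m-1 and loss l_star in episodes m..n-1,
    # weighted by gamma**t and normalized by (1 - gamma).
    high = sum(gamma ** t for t in range(m))
    low = sum(gamma ** t * l_star for t in range(m, n))
    return (1 - gamma) * (high + low)


def closed_form(gamma, l_star, m, n):
    # Closed form obtained from the finite geometric series.
    return 1 - gamma ** m + l_star * (gamma ** m - gamma ** n)


# Illustrative parameter choices, including the edge cases m = 0 and m = n.
cases = [(0.9, 0.3, 2, 10), (0.5, 0.0, 0, 5), (0.99, 0.7, 40, 40)]
for gamma, l_star, m, n in cases:
    assert math.isclose(explicit_bound(gamma, l_star, m, n),
                        closed_form(gamma, l_star, m, n), abs_tol=1e-12)
```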

We leave it to the reader to verify that

$\lim_{γ \to 1} \lim_{n \to \infty} E_{τ_{n}^{π_{n}^{*}}} [L_{n}^{γ}] \leq \lim_{γ \to 1} \lim_{n \to \infty} \max_{m \leq n} \min (α^{m}, 1 - γ^{m} + L^{*} (γ^{m} - γ^{n})) = L^{*} .$ □
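For readers who want a quick numerical sanity check of the final limit before verifying it by hand, the following Python sketch evaluates $\max_{m \leq n} \min (α^{m}, 1 - γ^{m} + L^{*} (γ^{m} - γ^{n}))$ for values of $γ$ approaching 1. The values of $α$ and $L^{*}$, the function names, and the choice of $n$ (taken large enough that $γ^{n}$ is negligible, standing in for the inner limit $n \to \infty$) are assumptions made for illustration. This is evidence, not a proof.

```python
def bound(alpha, gamma, l_star, m, n):
    # Smaller of the two upper bounds on the n-episode loss, for a given m.
    return min(alpha ** m, 1 - gamma ** m + l_star * (gamma ** m - gamma ** n))


def worst_case(alpha, gamma, l_star, n):
    # Maximize over the number m of "bad" episodes, 0 <= m <= n.
    return max(bound(alpha, gamma, l_star, m, n) for m in range(n + 1))


alpha, l_star = 0.5, 0.3  # illustrative values with alpha in [0, 1)
results = []
for gamma in (0.9, 0.99, 0.999, 0.9999):
    n = int(50 / (1 - gamma))  # large n so that gamma**n is negligible
    results.append(worst_case(alpha, gamma, l_star, n))
    print(gamma, results[-1])
```

With these illustrative values the printed bound decreases toward $L^{*} = 0.3$ as $γ$ approaches 1. The intuition is that a small $m$ keeps the second term close to $L^{*}$, while a large $m$ makes $α^{m}$ small.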

^[1] Recall that if $n \in \mathbb{N}$ and $h \in (A \times O)^{n H}$ is given by $h = a_{0} o_{0} \dots a_{n H - 1} o_{n H - 1},$ then the loss over $n$ episodes with geometric time discount $γ \in [0, 1)$ is defined by
$L_{n}^{γ} (h) = (1 - γ) \sum_{k = 0}^{n - 1} γ^{k} L (a_{k H} o_{k H} \dots a_{k H + H - 1} o_{k H + H - 1}) .$

^[2] A copolicy can depend on the past, meaning it can depend on the policy. Thus $m$ can depend on $π_{n}^{*}$ .
