Logical counterfactuals for random algorithms

Vanessa Kosoy

Updateless decision theory was informally defined by Wei Dai in terms of logical conditional expected utility, where the condition corresponds to an algorithm (the agent) producing a given output (action or policy). This kind of conditional expected values can be formalised by optimal predictors. However, since optimal predictor systems which are required to apply optimal predictors to decision theory generally have random advice, we need counterfactuals well-defined for random algorithms i.e. algorithms that produce different outputs with different probabilities depending on internal coin tosses. We propose to define these counterfactuals by a generalization of the notion of conditional expected utility which amounts to linear regression of utility with respect to the probabilities of different outputs in the space of "impossible possible worlds." We formalise this idea by introducing "relative optimal predictors," prove the analogue of the conditional probability formula (which takes matrix form) and uniqueness theorems.

Motivation

We start by explaining the analogous construction in classical probability theory and proceed to defining the logical counterpart in the Results section.

Consider $ζ$ a probability measure on some space, a random variable $u$ representing utility, a finite set $A$ representing possible actions and another random variable $p$ taking values in $[0, 1]^{A}$ and satisfying $\sum_{a \in A} p_{a} = 1$ representing the probabilities of taking different actions. For a deterministic algorithm, $p$ takes values ${0, 1}^{A}$ allowing defining conditional expected utility as

$u_{a} := E_{ζ} [u ∣ p_{a} = 1] = \frac{E_{ζ} [u p_{a}]}{E_{ζ} [p_{a}]}$

In the general case, it is tempting to consider

$E_{ζ ⋉ p} [u ∣ a] = \frac{E_{ζ} [u p_{a}]}{E_{ζ} [p_{a}]}$

where $ζ ⋉ p$ stands for the semidirect product of $ζ$ with $p$ , the latter regarded as a Markov kernel with target $A$ . However, this would lead to behavior similar to EDT since conditioning by $a$ is meaningful even for a single "world" (i.e. completely deterministic $u$ and $p$ ). Instead, we select $u^{*} \in R^{A}$ that minimizes $E_{ζ} [(u - p^{t} u^{*})^{2}]$ (we regard elements of $R^{A}$ as column vectors so $p^{t}$ is a row vector). This means $u^{*}$ has to satisfy the matrix equation

$E_{ζ} [p p^{t}] u^{*} = E_{ζ} [u p]$

The solution to this equation is only unique when $E_{ζ} [p p^{t}]$ is non-degenerate. This corresponds to requiring positive probability of the condition for usual conditional expected values. In case $p$ takes values in ${0, 1}^{A}$ , $u^{*}$ is the usual conditional expected value.

Preliminaries

Notation

We will suppress the $k, j$ indices associated with word ensembles and bischemes whenever possible to reduce clutter. When we do show $k, j$ explicitly, evaluation of functions $δ : N^{2} \to R$ will be written as $δ^{k j}$ rather than $δ (k, j)$ as before.

The concept of "proto-error space" is renamed to "error space" whereas the previous definition of "error space" is now called "radical error space".

We slightly extend the definition of "distributional estimation problem" as follows.

Definition 1

A distributional estimation problem is a pair $(μ, f)$ where $μ$ is a word ensemble and $f : supp μ \to R$ is bounded.

The corresponding notion of optimal predictor is the following.

Definition 2

Fix $E$ an space of rank 2 and $(μ, f)$ a distributional estimation problem. Consider $^P$ a $Q$ -valued $(p o l y, r l o g)$ -bischeme taking bounded values. $^P$ is called an $E (p o l y, r l o g)$ -optimal predictor for $(μ, f)$ when for any $Q$ -valued $(p o l y, r l o g)$ -bischeme $^Q$ , there is $δ \in E$ s.t.

$E_{μ \times {^σ}_{P}} [(^P - f)^{2}] \leq E_{μ \times {^σ}_{Q}} [(^Q - f)^{2}] + δ$

The theory of optimal predictors remains intact with these new definitions (although some care is wish to consider reflective systems with range $R$ instead of $[0, 1]$ ).

The following concept of "orthogonal predictor" is often more convenient than "optimal predictor", since we no longer require our error spaces to be radical.

Definition 3

Fix $E$ an error space of rank 2 and $(μ, f)$ a distributional estimation problem. Consider $^P$ a $Q$ -valued $(p o l y, r l o g)$ -bischeme taking bounded values.

$^P$ is called an $E (p o l y, r l o g)$ -orthogonal predictor for $(μ, f)$ when for any $Q$ -valued $(p o l y, r l o g)$ -bischeme $^Q$ s.t. ${^σ}_{Q}$ factors as ${^σ}_{P} \times τ$ , there is $δ \in E$ s.t.

$| E_{μ \times {^σ}_{P} \times τ} [Q (x, y, z) (P (x, y) - f (x))] | \leq (E_{μ \times {^σ}_{Q}} [{^Q}^{2}] + 1) δ$

The two concepts are tightly related due to the following

Theorem 1

Fix $E$ an error space of rank 2 and $(μ, f)$ a distributional estimation problem. Then any $E (p o l y, r l o g)$ -orthogonal predictor for $(μ, f)$ is an $E (p o l y, r l o g)$ -optimal predictor for $(μ, f)$ . Conversely any $E (p o l y, r l o g)$ -optimal predictor for $(μ, f)$ is an $E^{\frac{1}{2}} (p o l y, r l o g)$ -orthogonal predictor for $(μ, f)$ .

We skip the proof since it is almost identical to the orthogonality lemma.

Results

The proofs of the results are found in the Appendix.

Definition 4

A relative estimation problem is a quadruple $(A, μ, f, g)$ where $A$ is a finite set, $f : supp μ \to R$ bounded and $g : supp μ \to R^{A}$ bounded.

Definition 5

Fix $E$ an error space of rank 2. Consider $(A, μ, f, g)$ a relative estimation problem and $^P$ a $Q^{A}$ -valued $(p o l y, r l o g)$ -bischeme. $^P$ is called an $E (p o l y, r l o g)$ -optimal predictor for $(A, μ, f, g)$ when for any $Q^{A}$ -valued $(p o l y, r l o g)$ -bischeme $^Q$ s.t. ${^σ}_{Q}$ factors as ${^σ}_{P} \times τ$ , there is $δ \in E$ s.t.

$E_{μ \times {^σ}_{P}} [(g^{t}^P - f)^{2}] \leq E_{μ \times {^σ}_{Q}} [(g^{t}^Q - f)^{2}] + (E_{μ \times {^σ}_{P} \times τ} [(g^{t} (^P -^Q))^{2}] + 1) δ$

The appearance of the $E_{μ \times {^σ}_{P} \times τ} [(g^{t} (^P -^Q))^{2}]$ term might seem as making the condition substantially weaker. However, we the condition is still sufficient strong to produce the uniqueness theorem which appears as Theorem 1 below. Also, the novelty disappears in the corresponding orthogonality condition.

Definition 6

Fix $E$ an error space of rank 2. Consider $(A, μ, f, g)$ a relative estimation problem and $^P$ a $Q^{A}$ -valued $(p o l y, r l o g)$ -bischeme. $^P$ is called an $E (p o l y, r l o g)$ -orthogonal predictor for $(A, μ, f, g)$ when for any $Q^{A}$ -valued $(p o l y, r l o g)$ -bischeme $^Q$ s.t. ${^σ}_{Q}$ factors as ${^σ}_{P} \times τ$ , there is $δ \in E$ s.t.

$| E_{μ \times {^σ}_{P} \times τ} [g^{t} Q (x, y, z) (g^{t} P (x, y) - f (x))] | \leq (E_{μ \times {^σ}_{Q}} [(g^{t} Q)^{2}] + 1) δ$

Theorem 2

Fix $E$ an error space of rank 2 and $(A, μ, f, g)$ a relative estimation problem. Then any $E (p o l y, r l o g)$ -orthogonal predictor for $(A, μ, f, g)$ is an $E (p o l y, r l o g)$ -optimal predictor for $(A, μ, f, g)$ . Conversely any $E (p o l y, r l o g)$ -optimal predictor for $(A, μ, f, g)$ is an $E^{\frac{1}{2}} (p o l y, r l o g)$ -orthogonal predictor for $(A, μ, f, g)$ .

Definition 7

Consider $E$ an error space space of rank $n$ . A set $I \subseteq N^{n}$ is called $E$ -negligible when $χ_{I} \in E$ .

Note 1

The $E$ -negligible sets form a set-theoretic ideal on $N^{n}$ .

All finite sets are $E$ -negligible (thanks to the the condition $2^{- h} \in E$ ).

Definition 8

Fix $E$ an error space of rank 2. Consider $A$ a finite set, $μ$ a word ensemble, $g : supp μ \to R^{A}$ and ${^Q}_{1}, {^Q}_{2}$ $Q^{A}$ -valued $(p o l y, r l o g)$ -bischemes. We say ${^Q}_{1}$ is $E$ -similar to ${^Q}_{2}$ relative to $g$ (denoted ${^Q}_{1} g ≃ E {^Q}_{2}$ ) when there is an $E$ -negligible set $I$ and $δ \in E$ s.t.

$\forall (k, j) \in N^{2} ∖ I : E_{μ^{k} \times {^σ}_{1} \times {^σ}_{2}} [(g^{t} ({^Q}_{1}^{k j} - {^Q}_{2}^{k j}))^{2}] = δ^{k j}$

Note 2

It's impossible to remove $I$ from the definition since $E_{μ^{k} \times {^σ}_{1} \times {^σ}_{2}} [(g^{t} ({^Q}_{1}^{k j} - {^Q}_{2}^{k j}))^{2}]$ doesn't have to be bounded.

Theorem 3

Fix $E$ an error space of rank 2. Consider $(A, μ, f, g)$ a relative estimation problem and ${^P}_{1}, {^P}_{2}$ $E (p o l y, r l o g)$ -orthogonal predictors for $(A, μ, f, g)$ . Then ${^P}_{1} g ≃ E {^P}_{2}$ .

Theorem 4 below is the orthogonal predictor analogue of the matrix equation from the previous section.

Proposition 1

Consider $E$ an error space of rank $n$ , $ϵ > 0$ and $γ : N^{n} \to [ϵ, \infty)$ . Then $γ E := {γ δ ∣ δ \in E}$ is an error space.

Theorem 4

Fix $E$ an error space of rank 2. Consider $(A, μ, f, g)$ a relative estimation problem, ${^P}_{g g}$ an $End (Q^{A})$ -valued $(p o l y, r l o g)$ -bischeme, ${^P}_{f g}$ a $Q^{A}$ -valued bischeme, $ϵ > 0$ and $γ : N^{2} \to [ϵ, \infty)$ . Assume that for any $x \in supp μ^{k}$ , ${^P}_{g g}^{k j} (x)$ is always invertible and the operator norm of ${^P}_{g g}^{k j} (x)^{- 1}$ is at most $γ^{k j}$ . Assume further that ${^P}_{g g}$ is an $E (p o l y, r l o g)$ -orthogonal predictor for $g g^{t}$ (componentwise) and ${^P}_{f g}$ is a $γ^{2} E (p o l y, r l o g)$ -orthogonal predictor for $f g$ (componentwise). Define the $Q^{A}$ -valued bischeme ${^P}_{f ∣ g}$ by ${^P}_{f ∣ g} := {^P}_{g g}^{- 1} {^P}_{f g}$ . Then, ${^P}_{f ∣ g}$ is a $γ^{2} E (p o l y, r l o g)$ -orthogonal predictor for $(A, μ, f, g)$ .

Note 3

It is always possible to choose ${^P}_{g g}$ to be symmetric in which case the operator norm of ${^P}_{g g}^{- 1}$ is the inverse of the lowest absolute value of an eigenvalue of ${^P}_{g g}$ .

Finally, we extend the stability result for usual conditional probabilities to the matrix setting.

Definition 9

Fix $E$ an error space of rank 2. Consider $A$ a finite set, $μ$ a word ensemble and ${^Q}_{1}, {^Q}_{2}$ $Q^{A}$ -valued $(p o l y, r l o g)$ -bischemes. We say ${^Q}_{1}$ is $E$ -similar to ${^Q}_{2}$ relative to $μ$ (denoted ${^Q}_{1} μ ≃ E {^Q}_{2}$ ) when there is an $E$ -negligible set $I$ and $δ \in E$ s.t.

$\forall (k, j) \in N^{2} ∖ I : E_{μ^{k} \times {^σ}_{1} \times {^σ}_{2}} [∥ {^Q}_{1}^{k j} - {^Q}_{2}^{k j} ∥^{2}] = δ^{k j}$

Theorem 5

Fix $E$ an error space of rank 2. Consider $ϵ > 0$ and $γ : N^{2} \to [ϵ, \infty)$ , $(A, μ, f, g)$ a relative estimation problem, ${^P}_{g g}$ a positive-definite $E (p o l y, l o g)$ -orthogonal predictor for $g g^{t}$ and ${^P}_{1}$ , ${^P}_{2}$ $γ^{4} E (p o l y, l o g)$ -orthogonal predictors for $(A, μ, f, g)$ . Assume that for any $x \in supp μ^{k}$ , the lowest eigenvalue of ${^P}_{g g}^{k j} (x)$ is always at least $(γ^{k j})^{- 1}$ and $∥ {^P}_{1, 2}^{k j} (x) ∥ \leq γ^{k j}$ . Then, ${^P}_{1} μ ≃ γ^{5} E {^P}_{2}$ .

The Corollary below shows that in simple situations these counterfactuals produce the values we expect.

Definition 10

Fix $E$ an space of rank 2 and $(μ, f)$ a distributional estimation problem. Consider $^P$ a $Q$ -valued $(p o l y, r l o g)$ -bischeme taking bounded values. $^P$ is called an $E (p o l y, r l o g)$ -perfect predictor for $(μ, f)$ when $E_{μ \times {^σ}_{P}} [(^P - f)^{2}] \in E$ .

Corollary

Fix $E$ an error space of rank 2. Consider $ϵ > 0$ and $γ : N^{2} \to [ϵ, \infty)$ , $(A, μ, f^{*}, g)$ a relative estimation problem and $f : supp μ \to R^{A}$ bounded s.t. $f^{*} = g^{t} f$ . Suppose ${^P}_{f}$ is a $γ^{8} E^{2} (p o l y, r l o g)$ -perfect predictor for $(μ, f)$ . Suppose ${^P}_{g g}$ is a positive-definite $E (p o l y, r l o g)$ -orthogonal predictor for $g g^{t}$ s.t. for any $x \in supp μ^{k}$ , the lowest eigenvalue of ${^P}_{g g}^{k j} (x)$ is always at least $(γ^{k j})^{- 1}$ . Suppose ${^P}_{f g}$ is a $γ^{2} E (p o l y, r l o g)$ -orthogonal predictor for $f g$ . Then, ${^P}_{g g}^{- 1} {^P}_{f g} μ ≃ γ^{5} E {^P}_{f}$ .

Appendix

Proof of Theorem 2

Consider $^P$ an $E (p o l y, r l o g)$ -optimal predictor for $(A, μ, f, g)$ . Consider $^R$ a $Q^{A}$ -valued $(p o l y, r l o g)$ -bischeme with ${^σ}_{R} = {^σ}_{P} \times τ$ . For any $ζ : N^{2} \to {- 1, + 1}$ and $n : N^{2} \to Z$ s.t. $| n |$ can be bounded by a polynomial (and hence its number of digits is logarithmic), we can define $t^{k j} := ζ^{k j} 2^{- n^{k j}}$ and the $Q^{A}$ -valued $(p o l y, r l o g)$ -bischeme $^Q$ with ${^σ}_{Q} := {^σ}_{R}$ and $Q (x, y, z) := P (x, y) + t R (x, y, z)$ . We have $δ \in E$ s.t.

$E_{μ \times {^σ}_{P}} [(g^{t}^P - f)^{2}] \leq E_{μ \times {^σ}_{Q}} [(g^{t}^Q - f)^{2}] + (E_{μ \times {^σ}_{P} \times τ} [(g^{t} (^P -^Q))^{2}] + 1) δ$

$E_{μ \times {^σ}_{P}} [(g^{t}^P - f)^{2}] \leq E_{μ \times {^σ}_{P} \times τ} [(g^{t} (^P + t^R) - f)^{2}] + (E_{μ \times {^σ}_{R}} [(g^{t}^R)^{2}] t^{2} + 1) δ$

$E_{μ \times {^σ}_{P} \times τ} [(g^{t}^P - f)^{2} - (g^{t} (^P + t^R) - f)^{2}] \leq (E_{μ \times {^σ}_{R}} [(g^{t}^R)^{2}] t^{2} + 1) δ$

$- E_{μ \times {^σ}_{R}} [(g^{t}^R)^{2}] t^{2} - 2 E_{μ \times {^σ}_{P} \times τ} [g^{t}^R (g^{t}^P - f)] t \leq (E_{μ \times {^σ}_{R}} [(g^{t}^R)^{2}] t^{2} + 1) δ$

$- 2 E_{μ \times {^σ}_{P} \times τ} [g^{t}^R (g^{t}^P - f)] t \leq E_{μ \times {^σ}_{R}} [(g^{t}^R)^{2}] (1 + sup δ) t^{2} + δ$

By choosing $ζ := - sgn E_{μ \times {^σ}_{P} \times τ} [g^{t}^R (g^{t}^P - f)]$ we get

$| E_{μ \times {^σ}_{P} \times τ} [g^{t}^R (g^{t}^P - f)] | 2^{- n + 1} \leq E_{μ \times {^σ}_{R}} [(g^{t}^R)^{2}] (1 + sup δ) 4^{- n} + δ$

$| E_{μ \times {^σ}_{P} \times τ} [g^{t}^R (g^{t}^P - f)] | \leq E_{μ \times {^σ}_{R}} [(g^{t}^R)^{2}] (1 + sup δ) 2^{- n} + δ 2^{n}$

Take $n := min (- ⌊ \frac{1}{2} log δ ⌋, h)$ where $h : N^{2} \to N$ is a polynomial s.t. $2^{- h} \in E$ . We get

$| E_{μ \times {^σ}_{P} \times τ} [g^{t}^R (g^{t}^P - f)] | \leq E_{μ \times {^σ}_{R}} [(g^{t}^R)^{2}] (1 + sup δ) max (δ^{\frac{1}{2}}, 2^{- h}) + δ min (2 δ^{- \frac{1}{2}}, 2^{h})$

$| E_{μ \times {^σ}_{P} \times τ} [g^{t}^R (g^{t}^P - f)] | \leq E_{μ \times {^σ}_{R}} [(g^{t}^R)^{2}] (1 + sup δ) max (δ^{\frac{1}{2}}, 2^{- h}) + 2 δ^{\frac{1}{2}}$

$| E_{μ \times {^σ}_{P} \times τ} [g^{t}^R (g^{t}^P - f)] | \leq (E_{μ \times {^σ}_{R}} [(g^{t}^R)^{2}] + 1) (1 + sup δ) max (2 δ^{\frac{1}{2}}, 2^{- h})$

Conversely, suppose $^P$ an $E (p o l y, r l o g)$ -orthogonal predictor for $(A, μ, f, g)$ . Consider $^Q$ a $Q^{A}$ -valued $(p o l y, r l o g)$ -bischeme with ${^σ}_{Q} = {^σ}_{P} \times τ$ . We have

$E_{μ \times {^σ}_{P}} [(g^{t}^P - f)^{2}] - E_{μ \times {^σ}_{Q}} [(g^{t}^Q - f)^{2}] = E_{μ \times {^σ}_{P} \times τ} [g^{t} (^P -^Q) (g^{t} (^P +^Q) - 2 f)]$

$E_{μ \times {^σ}_{P}} [(g^{t}^P - f)^{2}] - E_{μ \times {^σ}_{Q}} [(g^{t}^Q - f)^{2}] = E_{μ \times {^σ}_{P} \times τ} [g^{t} (^P -^Q) (2 g^{t}^P + g^{t} (^Q -^P) - 2 f)]$

$E_{μ \times {^σ}_{P}} [(g^{t}^P - f)^{2}] - E_{μ \times {^σ}_{Q}} [(g^{t}^Q - f)^{2}] = 2 E_{μ \times {^σ}_{P} \times τ} [g^{t} (^P -^Q) (g^{t}^P - f)] - E_{μ \times {^σ}_{P} \times τ} [(g^{t} (^P -^Q))^{2}]$

$E_{μ \times {^σ}_{P}} [(g^{t}^P - f)^{2}] - E_{μ \times {^σ}_{Q}} [(g^{t}^Q - f)^{2}] \leq (E_{μ \times {^σ}_{P} \times τ} [(g^{t} (^P -^Q))^{2}] + 1) δ$

Proof of Theorem 3

Since ${^P}_{1}$ is orthogonal, we have $δ_{1} \in E$ s.t.

$| E_{μ \times {^σ}_{1} \times {^σ}_{2}} [g^{t} ({^P}_{1} - {^P}_{2}) (g^{t} {^P}_{1} - f)] | \leq (E_{μ \times {^σ}_{1} \times {^σ}_{2}} [(g^{t} ({^P}_{1} - {^P}_{2}))^{2}] + 1) δ_{1}$

Since ${^P}_{2}$ is orthogonal, we have $δ_{2} \in E$ s.t.

$| E_{μ \times {^σ}_{1} \times {^σ}_{2}} [g^{t} ({^P}_{1} - {^P}_{2}) (g^{t} {^P}_{2} - f)] | \leq (E_{μ \times {^σ}_{1} \times {^σ}_{2}} [(g^{t} ({^P}_{1} - {^P}_{2}))^{2}] + 1) δ_{2}$

It follows that

$E_{μ \times {^σ}_{1} \times {^σ}_{2}} [(g^{t} ({^P}_{1} - {^P}_{2}))^{2}] \leq (E_{μ \times {^σ}_{1} \times {^σ}_{2}} [(g^{t} ({^P}_{1} - {^P}_{2}))^{2}] + 1) (δ_{1} + δ_{2})$

Define $I := {(k, j) ∣ δ_{1}^{k j} + δ_{2}^{k j} \geq \frac{1}{2}}$ . $χ_{I} \leq 2 (δ_{1} + δ_{2})$ so $I$ is $E$ -negligible. We get

$\forall (k, j) \in N^{2} ∖ I : E_{μ^{k} \times {^σ}_{1}^{k j} \times {^σ}_{2}^{k j}} [(g^{t} ({^P}_{1}^{k j} - {^P}_{2}^{k j}))^{2}] \leq \frac{δ_{1}^{k j} + δ_{2}^{k j}}{1 - δ_{1}^{k j} - δ_{2}^{k j}}$

$\forall (k, j) \in N^{2} ∖ I : E_{μ^{k} \times {^σ}_{1}^{k j} \times {^σ}_{2}^{k j}} [(g^{t} ({^P}_{1}^{k j} - {^P}_{2}^{k j}))^{2}] \leq 2 (δ_{1}^{k j} + δ_{2}^{k j})$

Proof of Proposition 1

Given $δ_{1}, δ_{2} \in E$

$γ δ_{1} + γ δ_{2} = γ (δ_{1} + δ_{2}) \in γ E$

Given $δ_{1} \in E$

$δ_{2} \leq γ δ_{1} ⟹ γ^{- 1} δ_{2} \leq δ_{1} ⟹ γ^{- 1} δ_{2} \in E ⟹ δ_{2} \in γ E$

Given $h$ polynomial s.t. $2^{- h} \in E$

$2^{- h} \leq γ ϵ^{- 1} 2^{- h} \in γ E$

Proof of Theorem 4

We know that ${^σ}_{f ∣ g} = {^σ}_{g g} \times {^σ}_{f g}$ and

$P_{f g} (x, z) - f (x) g (x) = (g (x)^{t} P_{f ∣ g} (x, y, z) - f (x)) g (x) + (P_{g g} (x, y) - g (x) g (x)^{t}) P_{f ∣ g} (x, y, z)$

Consider a $Q^{A}$ -valued $(p o l y, r l o g)$ -bischeme $^Q$ s.t. ${^σ}_{Q}$ factors as ${^σ}_{g g} \times {^σ}_{f g} \times τ$ . We get

$| E_{μ \times {^σ}_{Q}} [g^{t} Q (g^{t} {^P}_{f ∣ g} - f)] | \leq | E_{μ \times {^σ}_{Q}} [Q^{t} ({^P}_{f g} - f g)] | + | E_{μ \times {^σ}_{Q}} [Q^{t} ({^P}_{g g} - g g^{t}) {^P}_{f ∣ g}] |$

Since ${^P}_{f g}$ is a $γ^{2} E (p o l y, r l o g)$ -orthogonal predictor for $f g$ , there is $δ_{f g} \in E$ s.t.

$| E_{μ \times {^σ}_{Q}} [Q^{t} ({^P}_{f g} - f g)] | \leq (E_{μ \times {^σ}_{Q}} [∥ Q ∥^{2}] + 1) γ^{2} δ_{f g}$

Since ${^P}_{g g}$ is an $E (p o l y, r l o g)$ -orthogonal predictor for $g g^{t}$ , there is $δ_{g g} \in E$ s.t.

$| E_{μ \times {^σ}_{Q}} [Q^{t} ({^P}_{g g} - g g^{t}) {^P}_{f ∣ g}] | \leq (E_{μ \times {^σ}_{Q}} [∥ Q ∥^{2} ∥ {^P}_{f ∣ g} ∥^{2}] + 1) δ_{g g}$

${^P}_{f g}$ is bounded and the operator norm of ${^P}_{g g}^{- 1}$ is at most $γ$ , therefore there is $c > 0$ s.t. $∥ {^P}_{f ∣ g} ∥^{2} \leq c γ^{2}$ and

$| E_{μ \times {^σ}_{Q}} [Q^{t} ({^P}_{g g} - g g^{t}) {^P}_{f ∣ g}] | \leq (E_{μ \times {^σ}_{Q}} [∥ Q ∥^{2}] c γ^{2} + 1) δ_{g g}$

$| E_{μ \times {^σ}_{Q}} [Q^{t} ({^P}_{g g} - g g^{t}) {^P}_{f ∣ g}] | \leq (E_{μ \times {^σ}_{Q}} [∥ Q ∥^{2}] + 1) max (ϵ^{- 2}, c) γ^{2} δ_{g g}$

Putting everything together we get

$| E_{μ \times {^σ}_{Q}} [g^{t} Q (g^{t} {^P}_{f ∣ g} - f)] | \leq (E_{μ \times {^σ}_{Q}} [∥ Q ∥^{2}] + 1) γ^{2} (δ_{f g} + max (ϵ^{- 2}, c) δ_{g g})$

Proof of Theorem 5

According to Theorem 3, there is $δ_{1} \in E$ and $I \subseteq N^{2}$ which is $γ^{4} E$ -negligible s.t.

$\forall (k, j) \in N^{2} ∖ I : E_{μ^{k} \times {^σ}_{1}^{k j} \times {^σ}_{2}^{k j}} [(g^{t} ({^P}_{1}^{k j} - {^P}_{2}^{k j}))^{2}] \leq (γ^{k j})^{4} δ_{1}^{k j}$

$\forall (k, j) \in N^{2} ∖ I : E_{μ^{k} \times {^σ}_{1}^{k j} \times {^σ}_{2}^{k j}} [({^P}_{1}^{k j} - {^P}_{2}^{k j})^{t} g g^{t} ({^P}_{1}^{k j} - {^P}_{2}^{k j})] \leq (γ^{k j})^{4} δ_{1}^{k j}$

On the other hand, since ${^P}_{g g}$ is an $E (p o l y, r l o g)$ -orthogonal predictor for $g g^{t}$ , there is $δ_{2} \in E$ s.t.

$| E_{μ^{k} \times {^σ}_{1}^{k j} \times {^σ}_{2}^{k j}} [({^P}_{1}^{k j} - {^P}_{2}^{k j})^{t} ({^P}_{g g}^{k j} - g g^{t}) ({^P}_{1}^{k j} - {^P}_{2}^{k j})] | \leq (E_{μ^{k} \times {^σ}_{1}^{k j} \times {^σ}_{2}^{k j}} [∥ {^P}_{1}^{k j} - {^P}_{2}^{k j} ∥^{4}] + 1) δ_{2}^{k j}$

Since $∥ {^P}_{1, 2}^{k j} (x) ∥ \leq γ^{k j}$ , there is $c > 0$ s.t.

$| E_{μ^{k} \times {^σ}_{1}^{k j} \times {^σ}_{2}^{k j}} [({^P}_{1}^{k j} - {^P}_{2}^{k j})^{t} ({^P}_{g g}^{k j} - g g^{t}) ({^P}_{1}^{k j} - {^P}_{2}^{k j})] | \leq (c (γ^{k j})^{4} + 1) δ_{2}^{k j}$

$| E_{μ^{k} \times {^σ}_{1}^{k j} \times {^σ}_{2}^{k j}} [({^P}_{1}^{k j} - {^P}_{2}^{k j})^{t} ({^P}_{g g}^{k j} - g g^{t}) ({^P}_{1}^{k j} - {^P}_{2}^{k j})] | \leq (c + ϵ^{- 4}) (γ^{k j})^{4} δ_{2}^{k j}$

Combining the two together, we conclude that

$\forall (k, j) \in N^{2} ∖ I : E_{μ^{k} \times {^σ}_{1}^{k j} \times {^σ}_{2}^{k j}} [({^P}_{1}^{k j} - {^P}_{2}^{k j})^{t} {^P}_{g g}^{k j} ({^P}_{1}^{k j} - {^P}_{2}^{k j})] \leq (γ^{k j})^{4} (δ_{1}^{k j} + (c + ϵ^{- 4}) δ_{2}^{k j})$

$\forall (k, j) \in N^{2} ∖ I : E_{μ^{k} \times {^σ}_{1}^{k j} \times {^σ}_{2}^{k j}} [∥ ({^P}_{g g}^{k j})^{\frac{1}{2}} ({^P}_{1}^{k j} - {^P}_{2}^{k j}) ∥^{2}] \leq (γ^{k j})^{4} (δ_{1}^{k j} + (c + ϵ^{- 4}) δ_{2}^{k j})$

$\forall (k, j) \in N^{2} ∖ I : (γ^{k j})^{- 1} E_{μ^{k} \times {^σ}_{1}^{k j} \times {^σ}_{2}^{k j}} [∥ {^P}_{1}^{k j} - {^P}_{2}^{k j} ∥^{2}] \leq (γ^{k j})^{4} (δ_{1}^{k j} + (c + ϵ^{- 4}) δ_{2}^{k j})$

$\forall (k, j) \in N^{2} ∖ I : E_{μ^{k} \times {^σ}_{1}^{k j} \times {^σ}_{2}^{k j}} [∥ {^P}_{1}^{k j} - {^P}_{2}^{k j} ∥^{2}] \leq (γ^{k j})^{5} (δ_{1}^{k j} + (c + ϵ^{- 4}) δ_{2}^{k j})$

Definition 11

Fix $E$ an space of rank 2 and $(A, μ, f, g)$ a relative estimation problem. Consider $^P$ a $Q^{A}$ -valued $(p o l y, r l o g)$ -bischeme. $^P$ is called an $E (p o l y, r l o g)$ -perfect predictor for $(A, μ, f, g)$ when $E_{μ \times {^σ}_{P}} [(g^{t}^P - f)^{2}] \in E$ .

Proposition 2

Fix $E$ an error space of rank 2. Consider $(A, μ, f^{*}, g)$ a relative estimation problem and $f : supp μ \to R^{A}$ bounded s.t. $f^{*} = g^{t} f$ . Suppose ${^P}_{f}$ is an $E (p o l y, r l o g)$ -perfect predictor for $(μ, f)$ . Then ${^P}_{f}$ is an $E (p o l y, r l o g)$ -perfect predictor for $(A, μ, f^{*}, g)$ .

Proof of Proposition 2

$E_{μ \times {^σ}_{P}} [(g^{t}^P - f^{*})^{2}] = E_{μ \times {^σ}_{P}} [(g^{t}^P - g^{t} f)^{2}]$

$E_{μ \times {^σ}_{P}} [(g^{t}^P - f^{*})^{2}] = E_{μ \times {^σ}_{P}} [(g^{t} (^P - f))^{2}]$

$E_{μ \times {^σ}_{P}} [(g^{t}^P - f^{*})^{2}] \leq E_{μ \times {^σ}_{P}} [∥ g ∥^{2} ∥^P - f ∥^{2}]$

$E_{μ \times {^σ}_{P}} [(g^{t}^P - f^{*})^{2}] \leq sup ∥ g ∥^{2} E_{μ \times {^σ}_{P}} [∥^P - f ∥^{2}]$

Proof of Corollary

By Proposition 2, ${^P}_{f}$ is a $γ^{8} E^{2} (p o l y, r l o g)$ -perfect predictor for $(A, μ, f^{*}, g)$ and in particular a $γ^{8} E^{2} (p o l y, r l o g)$ -optimal predictor for $(A, μ, f^{*}, g)$ . By Theorem 2 it is a $γ^{4} E (p o l y, r l o g)$ -orthogonal predictor for $(A, μ, f^{*}, g)$ . By Theorem 4, ${^P}_{g g}^{- 1} {^P}_{f g}$ is a $γ^{2} E (p o l y, r l o g)$ -orthogonal predictor for $(A, μ, f^{*}, g)$ . Applying Theorem 5, we get the desired result (the condition $∥ {^P}_{g g}^{- 1} {^P}_{f g} ∥ \leq γ$ holds because ${^P}_{f g}$ is bounded and ${^P}_{g g}$ 's lowest eigenvalue is at most $γ^{- 1}$ ).

5