Epistemic status: This first collaboration between Daniel Chiang (who is interested in the algorithmic information theory of incrementally constructed representations) and myself (Cole Wyeth) contains some fairly simple but elegant results that help illustrate differences between ordinary and reflective Oracle Solomonoff Induction.
Practical reasoning relies on context-specific prediction rules, which can be described as "incomplete models" or "partial experts." One simple example is "sleeping experts," which make predictions only at certain times (when they are "awake"), as discussed here: https://www.lesswrong.com/posts/CGzAu8F3fii7gdgMC/a-simple-explanation-of-incomplete-models
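To make the setting concrete, here is a minimal sketch of sleeping experts in Python. The class names, the multiplicative-weights update, and the even/odd example are our own illustrative choices (not taken from the linked post); the point is just that an expert is only scored, and only influences the aggregate prediction, while it is awake.

```python
import random

class SleepingExpert:
    """A prediction rule that only makes predictions at certain times."""
    def __init__(self, name, awake, predict):
        self.name = name
        self.awake = awake      # awake(t, history) -> bool
        self.predict = predict  # predict(t, history) -> P(next bit = 1)

def aggregate(experts, weights, t, history):
    """Mix the awake experts' predictions, renormalizing over who is awake."""
    awake = [(w, e) for w, e in zip(weights, experts) if e.awake(t, history)]
    if not awake:
        return 0.5  # nobody is awake: fall back to a uniform prediction
    total = sum(w for w, _ in awake)
    return sum(w * e.predict(t, history) for w, e in awake) / total

def update(weights, experts, t, history, bit):
    """Multiplicative update: only awake experts are scored this round."""
    new = []
    for w, e in zip(weights, experts):
        if e.awake(t, history):
            p = e.predict(t, history)
            new.append(w * (p if bit == 1 else 1 - p))
        else:
            new.append(w)  # a sleeping expert keeps its weight unchanged
    return new

# Example: one expert is only awake on even time steps, the other on odd ones.
experts = [
    SleepingExpert("even", lambda t, h: t % 2 == 0, lambda t, h: 0.9),
    SleepingExpert("odd",  lambda t, h: t % 2 == 1, lambda t, h: 0.1),
]
weights, history = [0.5, 0.5], []
for t in range(20):
    p = aggregate(experts, weights, t, history)
    bit = int(random.random() < (0.9 if t % 2 == 0 else 0.1))
    weights = update(weights, experts, t, history, bit)
    history.append(bit)
print("final weights:", weights)
```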
Can Solomonoff's universal distribution take advantage of sleeping experts? We argue that the ordinary universal distribution has only a weak version of this property, while reflective Oracle Solomonoff Induction (rOSI) is perfectly capable of representing sleeping experts.
Note on inequalities: Globally assume that $\lesssim$ holds up to a small additive constant, independent of all measures and sequences but possibly depending on the choice of UTM.
We first prove a simple lemma, which we call a "modular loss bound," for the ordinary universal distribution $\xi(x)$, where $x$ is a binary string. The rough idea is that if we have a true environment $\mu$, while $\nu$ is a simpler environment which agrees with $\mu$ on the first $n$ bits, then we should be able to bound the KL divergence between $\mu$ and $\xi$ for the first $n$ bits using $K(\nu)$, since both environments have identical behavior on the first $n$ bits. Let $\mathcal{M}$ be the set of lower semicomputable semimeasures.
Lemma 1 (Modular loss bound): Denote by $D_n(\mu \| \xi) := \sum_{x_{1:n}} \mu(x_{1:n}) \ln \frac{\mu(x_{1:n})}{\xi(x_{1:n})}$, where $x_{1:n}$ ranges over binary strings of length $n$, the KL divergence between $\mu$ and $\xi$ for the first $n$ bits. Let $\mu, \nu \in \mathcal{M}$ with $\nu(x_{1:n}) = \mu(x_{1:n})$ for all $x_{1:n}$. Then we have $D_n(\mu \| \xi) \lesssim \ln \frac{1}{w_\nu} = K(\nu) \ln 2$, with $w_\nu := 2^{-K(\nu)}$ by definition.
Proof:
Define $\xi'(x) := \sum_{\nu' \in \mathcal{M}} w_{\nu'} \nu'(x)$.
By universality, we have $\xi(x) = \xi'(x) \geq w_\nu \nu(x)$.
The first equality can be made exact to avoid introducing an additive constant on the right. This is possible by e.g. Sterkenburg's "Universal Prediction," Theorem 2.16. Unfortunately that result is high-context; a casual reader who still wants a rigorous proof can simply substitute $\xi'$ for $\xi$ in our theorem statement and proof, then proceed.
From the assumption that $\mu$ and $\nu$ agree on the first $n$ bits, the telescoping property of KL divergence gives $D_n(\mu \| \nu) = 0$, so we get:

$$D_n(\mu \| \xi) = \sum_{x_{1:n}} \mu(x_{1:n}) \ln \frac{\mu(x_{1:n})}{\xi(x_{1:n})} \leq \sum_{x_{1:n}} \mu(x_{1:n}) \ln \frac{\mu(x_{1:n})}{w_\nu \nu(x_{1:n})} = D_n(\mu \| \nu) + \ln \frac{1}{w_\nu} = K(\nu) \ln 2,$$

with $w_\nu = 2^{-K(\nu)}$ by definition.
Note that we can swap $\nu$ with any other $\nu' \in \mathcal{M}$ that agrees with $\mu$ on the first $n$ bits, with the $\nu'$ of minimal Kolmogorov complexity giving us the optimal bound.
QED
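As a sanity check on Lemma 1, here is a toy numerical illustration in Python. We replace the class of lower semicomputable semimeasures with a small finite class and the weights $2^{-K(\nu)}$ with explicit prior weights; all environments and numbers below are our own illustrative choices.

```python
from itertools import product
from math import log

def env(p_before, p_after, switch):
    """Environment: bits are i.i.d. Bernoulli(p_before) up to time `switch`,
    then i.i.d. Bernoulli(p_after) afterwards."""
    def prob(x):  # probability of the finite binary string x
        pr = 1.0
        for t, b in enumerate(x):
            p = p_before if t < switch else p_after
            pr *= p if b == 1 else 1.0 - p
        return pr
    return prob

# "True" environment mu: Bernoulli(0.7) for 8 steps, then it changes.
mu = env(0.7, 0.2, switch=8)
# Simpler environment nu: agrees with mu on the first 8 bits, differs afterwards.
nu = env(0.7, 0.9, switch=8)
# Bayes mixture xi over a small class containing both, where nu gets much more
# prior weight than mu (standing in for 2^{-K(nu)} >> 2^{-K(mu)}).
prior = [(0.50, env(0.5, 0.5, 0)), (0.45, nu), (0.05, mu)]
xi = lambda x: sum(w * rho(x) for w, rho in prior)

n = 8  # mu and nu agree on the first n bits
D_n = sum(mu(x) * log(mu(x) / xi(x)) for x in product((0, 1), repeat=n))
print(f"D_n(mu || xi)       = {D_n:.4f}")
print(f"ln(1/w_nu)  (bound) = {log(1 / 0.45):.4f}")  # modular bound via nu
print(f"ln(1/w_mu) (naive)  = {log(1 / 0.05):.4f}")  # bound via mu itself
```

The point is that $D_n(\mu \| \xi)$ comes in under the bound supplied by the heavily weighted $\nu$, even though $\mu$ itself has a tiny prior weight.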
We now consider the scenario where we have a sequence of independent sub-environments, where we switch to the next sub-environment only at particular time indices. By constructing an environment that implements the switching of sub-environments, we can obtain a modular loss bound based on the Kolmogorov complexity of the sequence of switching indices and of all the sub-environments.
Proposition 2 (Modular loss bound for independent sub-environments):
Let $n_1 < n_2 < \dots < n_k$ be a sequence of indices and $r := K(n_1, \dots, n_k)$ be the Kolmogorov complexity of the sequence, and let $\nu_1, \dots, \nu_k$ be a sequence of environments. We can construct an environment $\sigma$ such that for $n_{i-1} < t \leq n_i$ (with $n_0 := 0$) we have $\sigma(x_t \mid x_{1:t-1}) = \nu_i(x_t \mid x_{n_{i-1}+1:t-1})$. Then, for all environments $\mu$ that agree with $\sigma$ for the first $n_k$ bits, we have $D_{n_k}(\mu \| \xi) \lesssim \left( r + \sum_{i=1}^{k} K(\nu_i) \right) \ln 2$, where $D_{n_k}(\mu \| \xi)$ is the KL divergence between $\mu$ and $\xi$ on the first $n_k$ bits.
Proof:
We have $K(\sigma) \lesssim r + \sum_{i=1}^{k} K(\nu_i)$, since we can construct $\sigma$ by first computing the time indices, then implementing the switching between the sub-environments $\nu_i$. Then, for all true environments $\mu$ that agree with $\sigma$ on the first $n_k$ bits, Lemma 1 gives us $D_{n_k}(\mu \| \xi) \lesssim K(\sigma) \ln 2 \lesssim \left( r + \sum_{i=1}^{k} K(\nu_i) \right) \ln 2$.
QED
Note: Setting some $\nu_i = \nu_j$ for $i \neq j$ indicates a kind of fractal structure of $\xi$ at simple indices.
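For concreteness, here is a sketch of the switching environment $\sigma$ from Proposition 2 in Python; the function names and indexing conventions are our own illustrative choices. Because the sub-environments are independent, each $\nu_i$ is only shown the history since its own segment began.

```python
import bisect

def switching_env(switch_indices, sub_envs):
    """switch_indices = [n_1, ..., n_k] (increasing, 1-based times);
    sub_envs[i-1](segment_history) -> P(next bit = 1) plays the role of nu_i,
    governing the time steps n_{i-1} < t <= n_i (with n_0 = 0)."""
    def predict(t, history):                       # history = bits x_1 ... x_{t-1}
        i = bisect.bisect_left(switch_indices, t)  # 0-based segment index
        start = switch_indices[i - 1] if i > 0 else 0
        return sub_envs[i](history[start:])        # segment-local history only
    return predict

# Example: nu_1 = Bernoulli(0.9) on bits 1..5, nu_2 = Bernoulli(0.1) on bits 6..12.
sigma = switching_env([5, 12], [lambda h: 0.9, lambda h: 0.1])
print([sigma(t, [0] * (t - 1)) for t in range(1, 13)])
```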
There is a similar modular loss bound for rOSI, but now we are able to condition the "experts" that predict each segment of the sequence on the entire previous sequence, because the class of reflective oracle computable (rO-computable) measures is constructed in terms of Markov kernels, not joint distributions. Let $\zeta$ be a universal "mixture" in the class $\mathcal{M}^{\mathcal{O}}$ of rO-computable measures, constructed as in Algorithm 1 of "Limit-Computable Grains of Truth for Arbitrary Computable Extensive-Form (Un)Known Games," with the prior weight $w_\nu = 2^{-K(\nu)}$.
Proposition 3 (Modular loss bound for rO-computable sub-environments):
Let $n_1 < n_2 < \dots < n_k$ be a sequence of indices and $r := K(n_1, \dots, n_k)$ be the Kolmogorov complexity of the sequence, and let $\nu_1, \dots, \nu_k$ be a sequence of rO-computable environments. We can construct an environment $\sigma$ such that for $n_{i-1} < t \leq n_i$ (with $n_0 := 0$) we have $\sigma(x_t \mid x_{1:t-1}) = \nu_i(x_t \mid x_{1:t-1})$. Then, for all environments $\mu$ that agree with $\sigma$ for the first $n_k$ bits, we have $D_{n_k}(\mu \| \zeta) \lesssim \left( r + \sum_{i=1}^{k} K(\nu_i) \right) \ln 2$, where $D_{n_k}(\mu \| \zeta)$ is the KL divergence of $\zeta$ from $\mu$ on the first $n_k$ bits.
Proof: Unchanged from above.
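To highlight the difference from Proposition 2, it may help to write out both switching constructions as joint probabilities (our notation, with $n_0 := 0$):

$$\sigma_{\text{Prop. 2}}(x_{1:n_k}) = \prod_{i=1}^{k} \nu_i\!\left( x_{n_{i-1}+1:n_i} \right), \qquad \sigma_{\text{Prop. 3}}(x_{1:n_k}) = \prod_{i=1}^{k} \nu_i\!\left( x_{n_{i-1}+1:n_i} \mid x_{1:n_{i-1}} \right).$$

In the reflective-oracle case each sleeping expert is scored on its segment conditioned on the entire preceding history, which is what the kernel-based construction of rO-computable measures makes possible.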
If we do not wish to make any assumptions about a "true" environment $\mu$, it is also possible to bound the surprisal of $\zeta$ on an arbitrary prefix by the cumulative surprisals ($\log$ loss) of the experts $\nu_i$, along with Kolmogorov complexity terms as above. This is more in the spirit of prediction with expert advice. Note that we do not assume anything about a "true environment" in the following:
Proposition 4 (Uniform loss bound against rO-computable sleeping experts):
Let $n_1 < n_2 < \dots < n_k$ be a sequence of indices and $r := K(n_1, \dots, n_k)$ be the Kolmogorov complexity of the sequence, and let $\nu_1, \dots, \nu_k$ be a sequence of rO-computable environments. Then, for every $x_{1:n_k}$ (with $n_0 := 0$),

$$-\log \zeta(x_{1:n_k}) \lesssim r + \sum_{i=1}^{k} K(\nu_i) + \sum_{i=1}^{k} \left( -\log \nu_i\!\left( x_{n_{i-1}+1:n_i} \mid x_{1:n_{i-1}} \right) \right),$$

where the additive constant is independent of the sequence $x_{1:n_k}$ and the measures $\nu_i$.
Proof: Simply note that $\zeta(x_{1:n_k}) \geq w_\sigma \sigma(x_{1:n_k})$, where $\sigma$ is the switching environment constructed for Proposition 3, take $-\log$ of both sides, and expand.
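Spelling out that last step (our notation: $w_\sigma$ is $\sigma$'s prior weight in $\zeta$, so $-\log w_\sigma = K(\sigma) \lesssim r + \sum_i K(\nu_i)$ as in the construction for Proposition 3, and $n_0 := 0$):

$$-\log \zeta(x_{1:n_k}) \le -\log w_\sigma - \log \sigma(x_{1:n_k}) \lesssim r + \sum_{i=1}^{k} K(\nu_i) + \sum_{i=1}^{k} \left( -\log \nu_i\!\left( x_{n_{i-1}+1:n_i} \mid x_{1:n_{i-1}} \right) \right),$$

which is the claimed bound.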
A version of Proposition 4 for an infinite but algorithmically simple sequence of switches between experts is an easy extension. Though we have not explicitly proved that ordinary Solomonoff induction fails to satisfy any of these properties, the negative results in "Universal Prediction of Selected Bits" indicate that it at least does not satisfy the infinite switching extension.
Future work: An alternative proof by a market metaphor should be accessible by adapting a result of @SamEisenstat.