Epistemic status: This first collaboration between Daniel Chiang (who is interested in the algorithmic information theory of incrementally constructed representations) and myself (Cole Wyeth) contains some fairly simple but elegant results that help illustrate differences between ordinary and reflective Oracle Solomonoff Induction.
Practical reasoning relies on context-specific prediction rules, which can be described as "incomplete models" or "partial experts." One simple example is "sleeping experts," which make predictions only at certain times (when they are "awake"), as discussed here: https://www.lesswrong.com/posts/CGzAu8F3fii7gdgMC/a-simple-explanation-of-incomplete-models
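To make the setting concrete, here is a minimal sketch of sleeping experts in Python. The class names, the multiplicative-weights update, and the even/odd example are our own illustrative choices (not taken from the linked post); the point is just that an expert is only scored, and only influences the aggregate prediction, while it is awake.

```python
import random

class SleepingExpert:
    """A prediction rule that only makes predictions at certain times."""
    def __init__(self, name, awake, predict):
        self.name = name
        self.awake = awake      # awake(t, history) -> bool
        self.predict = predict  # predict(t, history) -> P(next bit = 1)

def aggregate(experts, weights, t, history):
    """Mix the awake experts' predictions, renormalizing over who is awake."""
    awake = [(w, e) for w, e in zip(weights, experts) if e.awake(t, history)]
    if not awake:
        return 0.5  # nobody is awake: fall back to a uniform prediction
    total = sum(w for w, _ in awake)
    return sum(w * e.predict(t, history) for w, e in awake) / total

def update(weights, experts, t, history, bit):
    """Multiplicative update: only awake experts are scored this round."""
    new = []
    for w, e in zip(weights, experts):
        if e.awake(t, history):
            p = e.predict(t, history)
            new.append(w * (p if bit == 1 else 1 - p))
        else:
            new.append(w)  # a sleeping expert keeps its weight unchanged
    return new

# Example: one expert is only awake on even time steps, the other on odd ones.
experts = [
    SleepingExpert("even", lambda t, h: t % 2 == 0, lambda t, h: 0.9),
    SleepingExpert("odd",  lambda t, h: t % 2 == 1, lambda t, h: 0.1),
]
weights, history = [0.5, 0.5], []
for t in range(20):
    p = aggregate(experts, weights, t, history)
    bit = int(random.random() < (0.9 if t % 2 == 0 else 0.1))
    weights = update(weights, experts, t, history, bit)
    history.append(bit)
print("final weights:", weights)
```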
Can Solomonoff's universal distribution take advantage of sleeping experts? We argue that the ordinary universal distribution has only a weak version of this property, while reflective Oracle Solomonoff Induction (rOSI) is perfectly capable of representing sleeping experts.
Note on inequalities: Globally assume that $\lesssim$ holds up to a small additive constant, independent of all measures and sequences but possibly depending on the choice of UTM.
We first prove a simple lemma, which we call a "modular loss bound," for the ordinary universal distribution $\xi(x)$, where $x$ is a binary string. The rough idea is that if we have a true environment $\mu$, while $\nu$ is a simpler environment which agrees with $\mu$ on the first $n$ bits, then we should be able to bound the KL divergence between $\mu$ and $\xi$ for the first $n$ bits using $K(\nu)$, since both environments have identical behavior on the first $n$ bits. Let $\mathcal{M}$ be the set of lower semicomputable semimeasures.
Lemma 1 (Modular loss bound): Denote by $D_n(\mu \| \xi) := \sum_{x_{1:n}} \mu(x_{1:n}) \ln \frac{\mu(x_{1:n})}{\xi(x_{1:n})}$, where $x_{1:n}$ ranges over binary strings of length $n$, the KL divergence between $\mu$ and $\xi$ for the first $n$ bits. Let $\mu, \nu \in \mathcal{M}$ with $\nu(x_{1:n}) = \mu(x_{1:n})$ for all $x_{1:n}$. Then we have $D_n(\mu \| \xi) \lesssim \ln \frac{1}{w_\nu} = K(\nu) \ln 2$, with $w_\nu := 2^{-K(\nu)}$ by definition.
Proof:
Define $\xi'(x) := \sum_{\nu' \in \mathcal{M}} w_{\nu'} \nu'(x)$.
By universality, we have $\xi(x) = \xi'(x) \geq w_\nu \nu(x)$.
The first equality can be made exact to avoid introducing an additive constant on the right. This is possible by e.g. Sterkenburg's "Universal Prediction," Theorem 2.16. Unfortunately that result is high-context; a casual reader who still wants a rigorous proof can simply substitute $\xi'$ for $\xi$ in our theorem statement and proof, then proceed.
From the assumption that $\mu$ and $\nu$ agree on the first $n$ bits, the telescoping property of KL divergence gives $D_n(\mu \| \nu) = 0$, so we get:

$$D_n(\mu \| \xi) = \sum_{x_{1:n}} \mu(x_{1:n}) \ln \frac{\mu(x_{1:n})}{\xi(x_{1:n})} \leq \sum_{x_{1:n}} \mu(x_{1:n}) \ln \frac{\mu(x_{1:n})}{w_\nu \nu(x_{1:n})} = D_n(\mu \| \nu) + \ln \frac{1}{w_\nu} = K(\nu) \ln 2,$$

with $w_\nu = 2^{-K(\nu)}$ by definition.
Note that we can swap $\nu$ with any other $\nu' \in \mathcal{M}$ that agrees with $\mu$ on the first $n$ bits, with the $\nu'$ of minimal Kolmogorov complexity giving us the optimal bound.
QED
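As a sanity check on Lemma 1, here is a toy numerical illustration in Python. We replace the class of lower semicomputable semimeasures with a small finite class and the weights $2^{-K(\nu)}$ with explicit prior weights; all environments and numbers below are our own illustrative choices.

```python
from itertools import product
from math import log

def env(p_before, p_after, switch):
    """Environment: bits are i.i.d. Bernoulli(p_before) up to time `switch`,
    then i.i.d. Bernoulli(p_after) afterwards."""
    def prob(x):  # probability of the finite binary string x
        pr = 1.0
        for t, b in enumerate(x):
            p = p_before if t < switch else p_after
            pr *= p if b == 1 else 1.0 - p
        return pr
    return prob

# "True" environment mu: Bernoulli(0.7) for 8 steps, then it changes.
mu = env(0.7, 0.2, switch=8)
# Simpler environment nu: agrees with mu on the first 8 bits, differs afterwards.
nu = env(0.7, 0.9, switch=8)
# Bayes mixture xi over a small class containing both, where nu gets much more
# prior weight than mu (standing in for 2^{-K(nu)} >> 2^{-K(mu)}).
prior = [(0.50, env(0.5, 0.5, 0)), (0.45, nu), (0.05, mu)]
xi = lambda x: sum(w * rho(x) for w, rho in prior)

n = 8  # mu and nu agree on the first n bits
D_n = sum(mu(x) * log(mu(x) / xi(x)) for x in product((0, 1), repeat=n))
print(f"D_n(mu || xi)       = {D_n:.4f}")
print(f"ln(1/w_nu)  (bound) = {log(1 / 0.45):.4f}")  # modular bound via nu
print(f"ln(1/w_mu) (naive)  = {log(1 / 0.05):.4f}")  # bound via mu itself
```

The point is that $D_n(\mu \| \xi)$ comes in under the bound supplied by the heavily weighted $\nu$, even though $\mu$ itself has a tiny prior weight.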
We now consider the scenario where we have a sequence of independent sub-environments, where we switch to the next sub-environment only at particular time indices. By constructing an environment that implements the switching of sub-environments, we can obtain a modular loss bound based on the Kolmogorov complexity of the sequence of switching indices and of all the sub-environments.
Proposition 2 (Modular loss bound for independent sub-environments):
Let $n_1 < n_2 < \dots < n_k$ be a sequence of indices and $r := K(n_1, \dots, n_k)$ be the Kolmogorov complexity of the sequence, and let $\nu_1, \dots, \nu_k$ be a sequence of environments. We can construct an environment $\sigma$ such that for $n_{i-1} < t \leq n_i$ (with $n_0 := 0$) we have $\sigma(x_t \mid x_{1:t-1}) = \nu_i(x_t \mid x_{n_{i-1}+1:t-1})$. Then, for all environments $\mu$ that agree with $\sigma$ for the first $n_k$ bits, we have $D_{n_k}(\mu \| \xi) \lesssim \left( r + \sum_{i=1}^{k} K(\nu_i) \right) \ln 2$, where $D_{n_k}(\mu \| \xi)$ is the KL divergence between $\mu$ and $\xi$ on the first $n_k$ bits.
Proof:
We have $K(\sigma) \lesssim r + \sum_{i=1}^{k} K(\nu_i)$, since we can construct $\sigma$ by first computing the time indices, then implementing the switching between the sub-environments $\nu_i$. Then, for all true environments $\mu$ that agree with $\sigma$ on the first $n_k$ bits, Lemma 1 gives us $D_{n_k}(\mu \| \xi) \lesssim K(\sigma) \ln 2 \lesssim \left( r + \sum_{i=1}^{k} K(\nu_i) \right) \ln 2$.
QED
Note: Setting some $\nu_i = \nu_j$ for $i \neq j$ indicates a kind of fractal structure of $\xi$ at simple indices.
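For concreteness, here is a sketch of the switching environment $\sigma$ from Proposition 2 in Python; the function names and indexing conventions are our own illustrative choices. Because the sub-environments are independent, each $\nu_i$ is only shown the history since its own segment began.

```python
import bisect

def switching_env(switch_indices, sub_envs):
    """switch_indices = [n_1, ..., n_k] (increasing, 1-based times);
    sub_envs[i-1](segment_history) -> P(next bit = 1) plays the role of nu_i,
    governing the time steps n_{i-1} < t <= n_i (with n_0 = 0)."""
    def predict(t, history):                       # history = bits x_1 ... x_{t-1}
        i = bisect.bisect_left(switch_indices, t)  # 0-based segment index
        start = switch_indices[i - 1] if i > 0 else 0
        return sub_envs[i](history[start:])        # segment-local history only
    return predict

# Example: nu_1 = Bernoulli(0.9) on bits 1..5, nu_2 = Bernoulli(0.1) on bits 6..12.
sigma = switching_env([5, 12], [lambda h: 0.9, lambda h: 0.1])
print([sigma(t, [0] * (t - 1)) for t in range(1, 13)])
```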
There is a similar modular loss bound for rOSI, but now we are able to condition the "experts" that predict each segment of the sequence on the entire previous sequence, because the class of reflective oracle computable (rO-computable) measures is constructed in terms of Markov kernels, not joint distributions. Let $\zeta$ be a universal "mixture" in the class $\mathcal{M}^{\mathcal{O}}$ of rO-computable measures, constructed as in Algorithm 1 of "Limit-Computable Grains of Truth for Arbitrary Computable Extensive-Form (Un)Known Games," with the prior weight $w_\nu = 2^{-K(\nu)}$.
Proposition 3 (Modular loss bound for rO-computable sub-environments):
Let $n_1 < n_2 < \dots < n_k$ be a sequence of indices and $r := K(n_1, \dots, n_k)$ be the Kolmogorov complexity of the sequence, and let $\nu_1, \dots, \nu_k$ be a sequence of rO-computable environments. We can construct an environment $\sigma$ such that for $n_{i-1} < t \leq n_i$ (with $n_0 := 0$) we have $\sigma(x_t \mid x_{1:t-1}) = \nu_i(x_t \mid x_{1:t-1})$. Then, for all environments $\mu$ that agree with $\sigma$ for the first $n_k$ bits, we have $D_{n_k}(\mu \| \zeta) \lesssim \left( r + \sum_{i=1}^{k} K(\nu_i) \right) \ln 2$, where $D_{n_k}(\mu \| \zeta)$ is the KL divergence of $\zeta$ from $\mu$ on the first $n_k$ bits.
Proof: Unchanged from above.
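To highlight the difference from Proposition 2, it may help to write out both switching constructions as joint probabilities (our notation, with $n_0 := 0$):

$$\sigma_{\text{Prop. 2}}(x_{1:n_k}) = \prod_{i=1}^{k} \nu_i\!\left( x_{n_{i-1}+1:n_i} \right), \qquad \sigma_{\text{Prop. 3}}(x_{1:n_k}) = \prod_{i=1}^{k} \nu_i\!\left( x_{n_{i-1}+1:n_i} \mid x_{1:n_{i-1}} \right).$$

In the reflective-oracle case each sleeping expert is scored on its segment conditioned on the entire preceding history, which is what the kernel-based construction of rO-computable measures makes possible.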
If we do not wish to make any assumptions about a "true" environment $\mu$, it is also possible to bound the surprisal of $\zeta$ on an arbitrary prefix by the cumulative surprisals ($\log$ loss) of the experts $\nu_i$, along with Kolmogorov complexity terms as above. This is more in the spirit of prediction with expert advice. Note that we do not assume anything about a "true environment" in the following:
Proposition 4 (Uniform loss bound against rO-computable sleeping experts):
Let $n_1 < n_2 < \dots < n_k$ be a sequence of indices and $r := K(n_1, \dots, n_k)$ be the Kolmogorov complexity of the sequence, and let $\nu_1, \dots, \nu_k$ be a sequence of rO-computable environments. Then, for every $x_{1:n_k}$ (with $n_0 := 0$),

$$-\log \zeta(x_{1:n_k}) \lesssim r + \sum_{i=1}^{k} K(\nu_i) + \sum_{i=1}^{k} \left( -\log \nu_i\!\left( x_{n_{i-1}+1:n_i} \mid x_{1:n_{i-1}} \right) \right),$$

where the additive constant is independent of the sequence $x_{1:n_k}$ and the measures $\nu_i$.
Proof: Simply note that $\zeta(x_{1:n_k}) \geq w_\sigma \sigma(x_{1:n_k})$, where $\sigma$ is the switching environment constructed for Proposition 3, take $-\log$ of both sides, and expand.
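Spelling out that last step (our notation: $w_\sigma$ is $\sigma$'s prior weight in $\zeta$, so $-\log w_\sigma = K(\sigma) \lesssim r + \sum_i K(\nu_i)$ as in the construction for Proposition 3, and $n_0 := 0$):

$$-\log \zeta(x_{1:n_k}) \le -\log w_\sigma - \log \sigma(x_{1:n_k}) \lesssim r + \sum_{i=1}^{k} K(\nu_i) + \sum_{i=1}^{k} \left( -\log \nu_i\!\left( x_{n_{i-1}+1:n_i} \mid x_{1:n_{i-1}} \right) \right),$$

which is the claimed bound.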
A version of Proposition 4 for an infinite but algorithmically simple sequence of switches between experts is an easy extension. Though we have not explicitly proved that ordinary Solomonoff induction fails to satisfy any of these properties, the negative results in "Universal Prediction of Selected Bits" indicate that it at least does not satisfy the infinite switching extension.
Future work: An alternative proof by a market metaphor should be accessible by adapting a result of @SamEisenstat.