Sleeping Experts in the (reflective) Solomonoff Prior

by Daniel C, Cole Wyeth
31st Aug 2025

Epistemic status: This first collaboration between Daniel Chiang (who is interested in the algorithmic information theory of incrementally constructed representations) and myself (Cole Wyeth) contains some fairly simple but elegant results that help illustrate differences between ordinary and reflective Oracle Solomonoff Induction.  

Practical reasoning relies on context-specific prediction rules, which can be described as "incomplete models" or "partial experts." One simple example is "sleeping experts," which make predictions only at certain times (when they are "awake"), discussed here: https://www.lesswrong.com/posts/CGzAu8F3fii7gdgMC/a-simple-explanation-of-incomplete-models

Can Solomonoff's universal distribution take advantage of sleeping experts? We argue that the ordinary universal distribution has only a weak version of this property, while reflective Oracle Solomonoff Induction (rOSI) is perfectly capable of representing sleeping experts.

Note on inequalities: Globally assume that $\stackrel{+}{\leq}$ denotes inequality up to a small additive constant (i.e. $f \stackrel{+}{\leq} g$ means $f \leq g + c$ for some constant $c$) independent of all measures and sequences, but possibly depending on the choice of UTM.

Independent Sub-Environments with M

We first prove a simple lemma, which we call the "modular loss bound," for the ordinary universal distribution $M(x)$, where $x$ is a binary string. The rough idea is that if the true environment is $\mu$, while $\nu$ is a simpler environment which agrees with $\mu$ on the first $n$ bits, then we should be able to bound the KL divergence between $\mu$ and $M$ for the first $n$ bits by $K(\nu)$, since both environments have identical behavior on the first $n$ bits. Let $\mathcal{M}$ be the set of lower semicomputable semimeasures.

Lemma 1 (Modular loss bound): Denote $D_n = \sum_{t=1}^{n} E_\mu[d_t(\cdot)]$, where $d_t(x_{<t}) := \sum_{x_t \in \mathcal{X}} \mu(x_t \mid x_{<t}) \ln\left(\frac{\mu(x_t \mid x_{<t})}{M(x_t \mid x_{<t})}\right)$, as the KL divergence between $\mu$ and $M$ for the first $n$ bits. Let $\mathcal{M}^n_\mu = \{\nu \in \mathcal{M} \mid \forall x_{1:n} \in \mathbb{B}^n,\ \nu(x_{1:n}) = \mu(x_{1:n})\}$ and $\nu = \operatorname{arg\,min}_{\nu' \in \mathcal{M}^n_\mu} K(\nu')$. Then we have $D_n \leq \ln(2) K(\nu)$, with $K(\nu) \leq K(\mu)$ by definition.

Proof:

Define $\xi(x) = \sum_{\nu \in \mathcal{M}} 2^{-K(\nu)} \nu(x)$.

By universality, we have

$$M(x) \stackrel{\times}{=} \xi(x) \geq 2^{-K(\nu)} \nu(x) \implies \ln\frac{\nu(x)}{M(x)} \leq \ln(2) K(\nu)$$

The first equality can be made exact to avoid introducing an additive constant on the right. This is possible by e.g. Sterkenberg's "Universal Prediction" Theorem 2.16. Unfortunately that result is high context; a casual reader who still wants a rigorous proof can simply substitute ξ for M in our theorem statement and proof then proceed.

From the assumption $\nu \in \mathcal{M}^n_\mu$ we have $\mu(x_{1:n}) = \nu(x_{1:n})$. Using the telescoping property of KL divergence, we get:

$$D_n = \sum_{t=1}^{n} E_\mu[d_t(\cdot)] = \sum_{x_{1:n}} \mu(x_{1:n}) \ln\frac{\mu(x_{1:n})}{M(x_{1:n})}$$

$$= \sum_{x_{1:n}} \mu(x_{1:n}) \ln\frac{\nu(x_{1:n})}{M(x_{1:n})} \leq \sum_{x_{1:n}} \mu(x_{1:n}) \ln(2) K(\nu)$$

$$= \ln(2) K(\nu)$$

with $K(\nu) \leq K(\mu)$ by definition.
 

Note that we can swap $\nu$ for any other $\nu' \in \mathcal{M}^n_\mu$, with $\nu$ giving us the optimal bound.

QED
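
For a toy illustration of the lemma (with $\lambda$ denoting the uniform Bernoulli(1/2) measure, a choice of ours purely for illustration): suppose the true environment $\mu$ is arbitrarily complex but happens to emit its first $n$ bits i.i.d. uniformly at random. Then $\lambda \in \mathcal{M}^n_\mu$, so Lemma 1 gives

$$D_n \leq \ln(2) K(\lambda) = O(1),$$

a constant bound that does not grow with $n$ or with $K(\mu)$.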

We now consider the scenario where we have a sequence of independent sub-environments $\nu_1, \ldots, \nu_k$, where we switch to the next sub-environment only at particular time indices. By constructing an environment $\nu^*$ that implements the switching between sub-environments, we can obtain a modular loss bound based on the Kolmogorov complexity of the index sequence and of all the sub-environments.

Proposition 2 (Modular loss bound for independent sub-environments):

Let $0 = t_0 < t_1 < \ldots < t_k$ be a sequence of indices and $r$ be the Kolmogorov complexity of the sequence, and let $\nu_1, \ldots, \nu_k \in \mathcal{M}$ be a sequence of environments. We can construct an environment $\nu^*$ such that for $t_j < n \leq t_{j+1}$ we have $\nu^*(x_{1:n}) = \nu_{j+1}(x_{t_j:n}) \prod_{i=1}^{j} \nu_i(x_{t_{i-1}:t_i})$. Then, for all environments $\mu$ that agree with $\nu^*$ for the first $n$ bits, we have $D_n \stackrel{+}{\leq} \ln(2)\left(r + \sum_{i=1}^{k} K(\nu_i)\right)$, where $D_n$ is the KL divergence between $\mu$ and $M$.

Proof
We have $K(\nu^*) \stackrel{+}{\leq} r + \sum_{i=1}^{k} K(\nu_i)$, since we can construct $\nu^*$ by first computing the time indices and then implementing the switching between the sub-environments $\nu_i$. Then, for all true environments $\mu$ such that $\nu^* \in \mathcal{M}^n_\mu$ and $K(\mu) \geq K(\nu^*)$, we have

$$D_n = \sum_{x_{1:n}} \mu(x_{1:n}) \ln\frac{\mu(x_{1:n})}{M(x_{1:n})}$$

$$= \sum_{x_{1:n}} \mu(x_{1:n}) \ln\frac{\nu^*(x_{1:n})}{M(x_{1:n})}$$

$$\leq \sum_{x_{1:n}} \mu(x_{1:n}) \ln(2) K(\nu^*) = \ln(2) K(\nu^*)$$

$$\stackrel{+}{\leq} \ln(2)\left(r + \sum_{i=1}^{k} K(\nu_i)\right)$$

QED
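
As a minimal worked instance (our own illustration, with two hypothetical sub-environments $\nu_1, \nu_2$ and a single switch at $t_1$): the construction gives

$$\nu^*(x_{1:n}) = \begin{cases} \nu_1(x_{1:n}) & n \leq t_1 \\ \nu_1(x_{1:t_1})\, \nu_2(x_{t_1:n}) & t_1 < n \leq t_2, \end{cases}$$

and for any $\mu$ agreeing with $\nu^*$ on the first $n$ bits, Proposition 2 yields $D_n \stackrel{+}{\leq} \ln(2)\left(r + K(\nu_1) + K(\nu_2)\right)$, with $r$ the complexity of the switch indices, regardless of how complex $\mu$ is beyond bit $n$.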

Note: Setting some $\nu_j = M$ for $j > 1$ indicates a kind of fractal structure of $M$ at simple indices.
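
For instance (a sketch spelling out the remark via the $k=2$ instance above with $\nu_2 = M$): since $M$ multiplicatively dominates the switching environment $\nu^*$, for $t_1 < n \leq t_2$ we get

$$M(x_{1:n}) \stackrel{\times}{\geq} 2^{-K(\nu^*)}\, \nu_1(x_{1:t_1})\, M(x_{t_1:n}),$$

so, up to a constant factor depending only on the switch indices and $\nu_1$, a copy of $M$ "restarts" at the simple index $t_1$.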

Sleeping Experts with rOSI

There is a similar modular loss bound for rOSI, but now we are able to condition the "experts" that predict each segment of the sequence on the entire previous sequence, because the class $\mathcal{M}^O_{\mathrm{refl}}$ of reflective oracle computable (rO-computable) measures is constructed in terms of Markov kernels, not joint distributions. Let $\xi^O \stackrel{\times}{\geq} \mathcal{M}^O_{\mathrm{refl}}$ be a universal "mixture" in the class, constructed as in Algorithm 1 of "Limit-Computable Grains of Truth for Arbitrary Computable Extensive-Form (Un)Known Games", with the prior weight $w_\nu := 2^{-K(\nu)}$.
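
To make the contrast with the previous section explicit (a restatement of the two constructions, using the segment convention that $\nu_{j+1}$ is awake on $t_j < n \leq t_{j+1}$):

$$\text{joint (Prop. 2):}\quad \nu^*(x_{1:n}) = \nu_{j+1}(x_{t_j:n}) \prod_{i=1}^{j} \nu_i(x_{t_{i-1}:t_i}) \qquad\qquad \text{kernel (Prop. 3):}\quad \nu^*(x_n \mid x_{<n}) = \nu_{j+1}(x_n \mid x_{<n}).$$

In the joint construction each sleeping expert only ever sees the segment it predicts, whereas in the kernel construction each expert's prediction may depend on the entire history, including the bits predicted by the other experts.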

Proposition 3 (Modular loss bound for rO-computable sub-environments):

Let $0 = t_0 < t_1 < \ldots < t_k$ be a sequence of indices and $r$ be the Kolmogorov complexity of the sequence, and let $\nu_1, \ldots, \nu_k \in \mathcal{M}^O_{\mathrm{refl}}$ be a sequence of environments. We can construct an environment $\nu^*$ such that for $t_j < n \leq t_{j+1}$ we have $\nu^*(x_n \mid x_{<n}) = \nu_{j+1}(x_n \mid x_{<n})$. Then, for all environments $\mu$ that agree with $\nu^*$ for the first $n$ bits, we have $D_n \stackrel{+}{\leq} \ln(2)\left(r + \sum_{i=1}^{k} K(\nu_i)\right)$, where $D_n$ is the KL divergence of $\xi^O$ from $\mu$ on the first $n$ bits.

Proof: Unchanged from above.

If we do not wish to make any assumptions about a "true" environment $\mu$, it is also possible to bound the surprisal of $\xi^O$ on an arbitrary prefix $x_{1:n}$ by the cumulative surprisals ($-\lg$ loss) of $\nu_1, \ldots, \nu_k$, along with Kolmogorov complexity terms as above. This is more in the spirit of prediction with expert advice; note that nothing is assumed about a "true" environment $\mu$ in the following:

Proposition 4 (Uniform loss bound against rO-computable sleeping experts):

Let $0 = t_0 < t_1 < \ldots < t_k$ be a sequence of indices and $r$ be the Kolmogorov complexity of the sequence, and let $\nu_1, \ldots, \nu_k \in \mathcal{M}^O_{\mathrm{refl}}$ be a sequence of environments. Then

$$-\lg \xi^O(x_{1:n}) \stackrel{+}{\leq} r + \sum_{i=1}^{k} K(\nu_i) + \sum_{j=1}^{k} \sum_{m=t_{j-1}}^{t_j} \left(-\lg \nu_j(x_m \mid x_{<m})\right)$$

where the additive constant is independent of the sequence $t_j$ and the measures $\nu_j$.

Proof: Simply note that $\xi^O \geq 2^{-K(\nu^*)} \nu^*$, take $-\lg(\cdot)$ of both sides, and expand.
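
Spelling out that expansion (a sketch, taking $n = t_k$, using the construction of $\nu^*$ from Proposition 3 together with $K(\nu^*) \stackrel{+}{\leq} r + \sum_i K(\nu_i)$):

$$-\lg \xi^O(x_{1:n}) \leq K(\nu^*) - \lg \nu^*(x_{1:n}) = K(\nu^*) + \sum_{j=1}^{k} \sum_{m=t_{j-1}+1}^{t_j} \left(-\lg \nu_j(x_m \mid x_{<m})\right) \stackrel{+}{\leq} r + \sum_{i=1}^{k} K(\nu_i) + \sum_{j=1}^{k} \sum_{m=t_{j-1}+1}^{t_j} \left(-\lg \nu_j(x_m \mid x_{<m})\right).$$

Shorter prefixes only decrease the left-hand side, and starting the inner sums at $m = t_{j-1}$ (as in the statement) only adds nonnegative terms to the right-hand side, so the stated bound follows.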

A version of Proposition 4 for an infinite but algorithmically simple sequence of switches between experts is an easy extension. Though we have not explicitly proved that ordinary Solomonoff induction fails to satisfy any of these properties, the negative results in "Universal Prediction of Selected Bits" indicate that it at least does not satisfy the infinite switching extension. 

Future work: An alternative proof by a market metaphor should be accessible by adapting a result of @SamEisenstat.