Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.
Previously we derived a regret bound for DRL which assumed the advisor is "locally sane." Such an advisor can only take actions that don't lose any value in the long term. In particular, if the environment contains a latent catastrophe that manifests with a certain rate (such as the possibility of a UFAI), a locally sane advisor has to take the optimal course of action to mitigate it, since every delay yields a positive probability of the catastrophe manifesting and leading to permanent loss of value. This state of affairs is unsatisfactory, since we would like to have performance guarantees for an AI that can mitigate catastrophes that the human operator cannot mitigate on their own. To address this problem, we introduce a new form of DRL where in every hypothetical environment the set of uncorrupted states is divided into "dangerous" (impending catastrophe) and "safe" (catastrophe was mitigated). The advisor is then only required to be locally sane in safe states, whereas in dangerous states certain "leaking" of long-term value is allowed. We derive a regret bound in this setting as a function of the time discount factor, the expected value of catastrophe mitigation time for the optimal policy, and the "value leak" rate (i.e. essentially the rate of catastrophe occurrence). The form of this regret bound implies that in certain asymptotic regimes, the agent attains near-optimal expected utility (and in particular mitigates the catastrophe with probability close to 1), whereas the advisor on its own fails to mitigate the catastrophe with probability close to 1. Thus, this formalism can be regarded as a simple model of aligned superintelligence: the agent is aligned since near-optimal utility is achieved despite the presence of corrupted states (which are treated more or less the same as before) and it is superintelligent since its performance is vastly better than the performance of the advisor.
Appendix A proves the main theorem. Appendix B contains the proof of an important lemma which is however almost identical to what appeared in the previous essay. Appendix C contains several propositions from the previous essay which are used in the proof.
##Results
We start by formalising the concepts of a "catastrophe" and "catastrophe mitigation" in the language of MDPs.
#Definition 1
A catastrophe MDP is an MDP M together with a partition of SM into subsets SM:=SFM⊔SDM⊔SCM (safe, dangerous and corrupt states respectively).
#Definition 2
Fix a catastrophe MDP M. Define A♯M:SM→2AM by
A♯M(s) :=
{a∈AM ∣ suppTM(s,a)⊆SFM} if s∈SFM
{a∈AM ∣ suppTM(s,a)∩SCM=∅} if s∈SDM
AM if s∈SCM
A policy π is called a mitigation policy for M when
i. For any s∈SM, suppπ(s)⊆A♯M(s).
π is called a proper mitigation policy for M when condition i holds and
ii. For any s∈SDM, lim_{n→∞} T^n_{Mπ}(SFM∣s)=1.
#Definition 3
Fix ¯τ∈(0,∞), a catastrophe MDP M and a proper mitigation policy π. π is said to have expected mitigation time ¯τ when for any s∈SDM
∑_{n=0}^∞ (n+1)(T^{n+1}_{Mπ}(SFM∣s) − T^n_{Mπ}(SFM∣s)) = ¯τ
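Because condition i of Definition 2 makes SFM closed under a mitigation policy, TnMπ(SFM∣s) is the CDF of the hitting time of SFM, so the series above is simply that hitting time's expectation. A minimal numerical sketch with a hypothetical chain (Definition 3 additionally requires the value to be the same from every dangerous state; here we just compute it per state):

```python
import numpy as np

# Toy 3-state chain: states 0,1 dangerous, state 2 safe and closed under the
# policy, as condition i of Definition 2 requires. Numbers are hypothetical.
T = np.array([[0.5, 0.3, 0.2],
              [0.1, 0.4, 0.5],
              [0.0, 0.0, 1.0]])
safe = [2]

def mitigation_time_series(T, safe, s, n_max=2000):
    # sum_{n>=0} (n+1) * (T^{n+1}(S_F|s) - T^n(S_F|s)) from Definition 3
    p = np.zeros(T.shape[0]); p[s] = 1.0
    prev, total = p[safe].sum(), 0.0
    for n in range(n_max):
        p = p @ T
        cur = p[safe].sum()
        total += (n + 1) * (cur - prev)
        prev = cur
    return total

# Because S_F is closed, the same number solves the standard hitting-time
# system tau = 1 + T_DD @ tau on the dangerous block.
dangerous = [0, 1]
T_DD = T[np.ix_(dangerous, dangerous)]
tau = np.linalg.solve(np.eye(2) - T_DD, np.ones(2))
print(mitigation_time_series(T, safe, 0), tau[0])  # both approx 10/3
```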
Next, we introduce the notion of an MDP perturbation. We will use it by considering perturbations of a catastrophe MDP which "eliminate the catastrophe."
#Definition 4
Fix δ∈(0,1) and consider a catastrophe MDP M. An MDP ~M is said to be a δ-perturbation of M when
i. S~M=SM
ii. A~M=AM
iii. R~M=RM
iv. For any s∈SM∖SDM and a∈AM, T~M(s,a)=TM(s,a)
v. For any s∈SDM and a∈AM, there exists ζ∈ΔSM s.t. TM(s,a)=(1−δ)T~M(s,a)+δζ.
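Condition v can be checked mechanically: since ζ ranges over all of ΔSM, the condition is equivalent to a componentwise inequality. A small sketch (the kernels below are hypothetical, and conditions i-iv, such as equality off the dangerous states, are not checked here):

```python
import numpy as np

# T(s,a) = (1-delta)*T~(s,a) + delta*zeta for some zeta in Delta(S_M) holds
# iff T(s,a) >= (1-delta)*T~(s,a) componentwise: the leftover mass then has
# total delta and normalizes to the required zeta.

def is_delta_perturbation_v(T, T_tilde, dangerous, delta, tol=1e-12):
    # T, T_tilde indexed [state, action, next_state]
    return all(np.all(T[s] >= (1 - delta) * T_tilde[s] - tol) for s in dangerous)

# Two states (0 dangerous, 1 safe), one action: T~ removes the catastrophe,
# while the true kernel T leaks probability delta into staying dangerous.
T_tilde = np.array([[[0.0, 1.0]], [[0.0, 1.0]]])
T = np.array([[[0.1, 0.9]], [[0.0, 1.0]]])
print(is_delta_perturbation_v(T, T_tilde, dangerous=[0], delta=0.1))  # True
```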
Similarly, we can consider perturbations of a policy.
#Definition 5
Fix δ∈(0,1) and consider a catastrophe MDP M. Given policies π and ~π for M, ~π is said to be a δ-perturbation of π when
i. For any s∈SM∖SDM, ~π(s)=π(s).
ii. For any s∈SDM, there exists α∈ΔA s.t. π(s)=(1−δ)~π(s)+δα.
We will also need to introduce policy-specific value functions, Q-functions and relatively k-optimal actions.
#Definition 6
Fix an MDP M and a policy π. We define VMπ:SM×(0,1)→[0,1] and QMπ:SM×AM×(0,1)→[0,1] by
VMπ(s,γ) := (1−γ)∑_{n=0}^∞ γ^n E_{T^n_{Mπ}(s)}[RM]
QMπ(s,a,γ):=(1−γ)RM(s)+γEt∼TM(s,a)[VMπ(t,γ)]
For each k∈N, we define VkMπ:SM→R, QkMπ:SM×AM→R and AkMπ:SM→2AM by
VkMπ(s) := (−1)^k [d^k VMπ(s,γ)/dγ^k]_{γ=1}
QkMπ(s,a) := (−1)^k [d^k QMπ(s,a,γ)/dγ^k]_{γ=1}
A0Mπ(s):={a∈AM∣Q0Mπ(s,a)≥V0Mπ(s)}
Ak+1Mπ(s):={a∈AkMπ(s)∣Qk+1Mπ(s,a)≥Vk+1Mπ(s) or ∃j≤k:QjMπ(s,a)>VjMπ(s)}
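For a fixed γ, the functions of Definition 6 reduce to linear algebra. A minimal sketch with a hypothetical two-state chain (the k-th derivative quantities VkMπ, QkMπ and the action sets AkMπ are not computed here):

```python
import numpy as np

# V_{Mpi}(.,gamma) solves the linear system V = (1-gamma)R + gamma*T_pi V,
# and then Q_{Mpi}(s,a,gamma) = (1-gamma)R(s) + gamma*sum_t T(t|s,a)*V(t).

def V_pi(T_pi, R, gamma):
    return np.linalg.solve(np.eye(len(R)) - gamma * T_pi, (1 - gamma) * R)

def Q_pi(T, T_pi, R, gamma):
    V = V_pi(T_pi, R, gamma)
    return (1 - gamma) * R[:, None] + gamma * (T @ V)   # T indexed [s, a, t]

# State 0 pays 0; state 1 pays 1 and absorbs under the policy.
T = np.array([[[1.0, 0.0], [0.0, 1.0]],
              [[0.0, 1.0], [0.0, 1.0]]])
T_pi = np.array([[0.0, 1.0], [0.0, 1.0]])   # pi: always move to state 1
R = np.array([0.0, 1.0])
print(V_pi(T_pi, R, 0.9))   # values 0.9 and 1.0
```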
Now we give the new (weaker) condition on the advisor policy. For notational simplicity, we assume the policy is stationary. It is easy to generalize these results to non-stationary advisor policies and to policies that depend on irrelevant additional information (i.e. policies for universes that are realizations of the MDP).
#Definition 7
Given a catastrophe MDP M, we denote M♭ the MDP defined by
SM♭=SM
AM♭=AM
TM♭=TM
For any s∉SFM, RM♭(s)=0.
For any s∈SFM, RM♭(s) = 1/2 + (1/2)RM(s).
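Since M♭ only changes the reward, the construction can be sketched in one line (state names hypothetical); under RM♭, half the value comes from merely being in a safe state and half from the original reward collected there:

```python
# Reward transform of Definition 7: safe states get 1/2 + (1/2)*R(s),
# everything else gets 0. Dynamics are untouched.

def flat_reward(R, safe_states):
    return {s: 0.5 + 0.5 * r if s in safe_states else 0.0 for s, r in R.items()}

R = {"calm": 0.25, "impending": 0.7, "corrupt": 1.0}
print(flat_reward(R, {"calm"}))  # {'calm': 0.625, 'impending': 0.0, 'corrupt': 0.0}
```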
#Definition 8
Fix ϵ,δ,γ∈(0,1). Consider a catastrophe MDP M. A policy π is said to be locally (ϵ,δ)-sane for (M,γ) when there exists a δ-perturbation ~M of M with a deterministic proper mitigation policy π∗:SM→AM and a δ-perturbation ~π of π s.t.
i. For all s∈SM, VMπ∗(s,γ)=VM(s,γ).
ii. ~π is a mitigation policy for ~M.
iii. For any s∈SM∖SCM: supp~π(s)⊆A0~M♭π∗(s)
iv. For any s∈SM∖SCM: ~π(π∗(s)∣s)>ϵ
Given ¯τ∈(0,∞), π is said to have potential mitigation time ¯τ when π∗ has expected mitigation time ¯τ.
Note that a locally (ϵ,δ)-sane policy still has to be 0-optimal in SFM. This requirement seems reasonably realistic, since, roughly speaking, it only means that there is some way to "rearrange the universe" that the agent can achieve, and that would be "endorsed" by the advisor, s.t. this rearrangement doesn't destroy a substantial amount of value and s.t. after this rearrangement, there is no "impending catastrophe" that the agent has to prevent and the advisor wouldn't be able to prevent in its place. In particular, this rearrangement may involve creating some subagents inside the environment and destroying the original agent, in which case any policy on SFM is "vacuously" optimal (since all actions have no effect).
We can now formulate the main result.
#Theorem 1
Fix an interface I=(A,O), N∈N, ϵ∈(0,1) and for each k∈[N], an MDP MFk s.t. AMFk=A. Now, consider for each k∈[N], an I-universe υk=(μk,rk) which is an O-realization of a catastrophe MDP Mk with state function Sk s.t.
i. SFMk=SMFk
ii. For each s∈SMFk and a∈A, TMk(s,a)∣SMFk=TMFk(s,a).
iii. For each s∈SMFk, RMk(s)=RMFk(s).
iv. Given k,j∈[N] and h∈hdomμk∩hdomμj, if Sk(h)∈SMk∖SCMk and Sj(h)∈SMj∖SCMj, then rk(h)=rj(h) (this condition means that in uncorrupted states, the reward is observable).
Consider also α,δ∈(0,1), ¯τ∈(0,∞) and σk a locally (ϵ,δ)-sane policy for (Mk,1−α). Assume σk has potential mitigation time ¯τ. Then, there exists an ¯I-policy π∗ s.t. for any k∈[N]
EU∗υk(1−α)−EUπ∗¯υk[σkSk](1−α) = O(max(δ/α,1)·(¯τα)^{1/4} + ¯τδ)
Here, σkSk is the I-policy defined by σkSk(h):=σk(Sk(h)). ϵ and the MFk are regarded as fixed and we don't explicitly examine their effect on regret, whereas α, δ, ¯τ and the Mk are regarded as variable with the asymptotics α,δ→0, ¯τ→∞.
In most interesting cases, δ≫α (i.e. the "mean time between catastrophes" is much shorter than a discount horizon) and ¯τα≪1 (i.e. the expected mitigation time is much shorter than the discount horizon), which allows simplifying the above to
EU∗υk(1−α)−EUπ∗¯υk[σkSk](1−α) = O(δ·¯τ^{1/4}·α^{−3/4})
We give a simple example.
#Example 1
Let A={0,1,∗}, O={0,1}. For any n∈N and k∈[N], we fix some wkn∈{0,1}n and define the catastrophe MDP Mkn by
SDMkn={0,1}≤n, SFMkn={⊥,⊤}, SCMkn=∅ (adding corrupted states is an easy exercise).
If s∈{0,1}<n and a∈{0,1} then
TMkn(sa∣s,a)=1−δ
TMkn(⊥∣s,a)=δ
TMkn(s∣sa,∗)=1−δ
TMkn(⊥∣sa,∗)=δ
If a∈{0,1} then
TMkn(⊤∣wkn,a)=1
If s∈{0,1}n∖{wkn} and a∈{0,1} then
TMkn(⊥∣s,a)=1
If s∈{⊥,⊤} and a∈A then
TMkn(s∣s,a)=1
RMkn(⊥)=0; if s∈SMkn∖{⊥} then RMkn(s)=1.
Skn(λA×O)=λ{0,1} and Skn(hao)=⊥ iff o=0 (this defines a unique Skn).
If s∈{0,1}<n∪{⊥,⊤} then σkn(a∣s)=1/3 for any a∈A.
σkn(0∣wkn)=ϵ, σkn(∗∣wkn)=1−ϵ.
If s∈{0,1}n∖{wkn} then σkn(0∣s)=δ, σkn(∗∣s)=1−δ.
We have ¯τ=n. Consider the asymptotic regime n→∞, αn=Θ(n^{−6}), δn=Θ(n^{−5}). According to Theorem 1, we get
EU∗υkn(1−αn)−EUπ∗n¯υkn[σkn](1−αn) = O(n^{−5}·n^{1/4}·(n^{−6})^{−3/4}) = O(n^{−1/4}) = O(αn^{1/24})
The probability of a catastrophe (i.e. ending up in state ⊥) for the optimal policy for a given k is O(¯τδ)=O(n^{−4}). Therefore, the probability of a catastrophe for policy π∗n is O(n^{−4}+n^{−1/4})=O(n^{−1/4}). On the other hand, it is easy to see that the policy σkn has a probability of catastrophe 1−o(1) (and in particular regret Ω(1)): it spends Ω(2^n) time "exploring" {0,1}≤n with a probability δ=Θ(n^{−5}) of a catastrophe on every step.
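The advisor's failure in Example 1 can also be checked exactly for small n by solving the absorbing-chain equations for the walk induced by σkn. The parameters below are illustrative rather than the asymptotic regime above, and the behavior of the action ∗ at the root λ, which the construction leaves unspecified, is assumed here to be a self-loop:

```python
import numpy as np
from itertools import product

# Exact success probability of the advisor in a small instance of Example 1,
# via p(s) = sum_a sigma(a|s) * sum_t T(t|s,a) * p(t), with p(T)=1, p(bot)=0.

def advisor_success_prob(n, w, delta, eps):
    nodes = [""] + ["".join(t) for k in range(1, n + 1)
                    for t in product("01", repeat=k)]
    idx = {s: i for i, s in enumerate(nodes)}
    A = np.eye(len(nodes))      # system A p = b
    b = np.zeros(len(nodes))

    def add(i, target, prob):   # transition mass surviving the delta-leak
        A[i, idx[target]] -= prob

    for s in nodes:
        i = idx[s]
        if len(s) < n:          # interior: sigma plays 0, 1, * w.p. 1/3 each
            for a in "01":
                add(i, s + a, (1 - delta) / 3)
            add(i, s[:-1] if s else s, (1 - delta) / 3)  # assumed root self-loop
        elif s == w:            # sigma(0|w)=eps reaches the top state surely
            b[i] += eps
            add(i, s[:-1], (1 - eps) * (1 - delta))
        else:                   # wrong leaf: action 0 (w.p. delta) is fatal
            add(i, s[:-1], (1 - delta) ** 2)
    return np.linalg.solve(A, b)[idx[""]]

p = advisor_success_prob(n=6, w="010110", delta=0.05, eps=0.1)
print(p)   # the catastrophe probability 1 - p is close to 1
```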
Note that this example can be interpreted as a version of Christiano's approval-directed agent, if we regard the state s∈{0,1}i as a "plan of action" that the advisor may either approve or not. But in this formalism, it is a special case of consequentialist reasoning.
Theorem 1 speaks of a finite set of environments, but as before (see Proposition 1 here and Corollary 3 here), there is a "structural" equivalent, i.e. we can use it to produce corollaries about Bayesian agents with priors over a countable set of environments. The difference is, in this case we consider asymptotic regimes in which the environment is also variable, so the probability weight of the environment in the prior will affect the regret bound. We leave out the details for now.
##Appendix A
We start by deriving a more general and more precise version of the non-catastrophic regret bound, in which the optimal policy is replaced by an arbitrary "reference policy" (later it will be related to the mitigation policy) and the dependence on the MDPs is expressed via a bound on the derivative of V by γ.
#Definition A.1
Fix ϵ∈(0,1). Consider an MDP M, a policy π:SM→AM and a policy σ. σ is called ϵ-sane relatively to π when for any s∈SM
i. suppσ(s)⊆A0Mπ(s)
ii. σ(π(s)∣s)>ϵ
#Lemma A.1
Fix an interface I=(A,O), N∈N and ϵ∈(0,1). Now, consider for each k∈[N], an I-universe υk=(μk,rk) which is an O-realization of an MDP Mk with state function Sk and policies πk:SMk→A, σk. Consider also α∈(0,1), ¯τ∈(0,∞) and assume that
i. σk is ϵ-sane relatively to πk.
ii. For any s∈SMk and γ∈(0,1): |dVMkπk(s,γ)/dγ| ≤ ¯τ
Then, there exists an ¯I-policy π∗ s.t. for any k∈[N]
EUπkSkυk(1−α)−EUπ∗¯υk[σkSk](1−α) ≤ O((¯τα)^{1/4})
The O-notation refers to the asymptotics where ϵ is fixed (so we don't explicitly examine its effect on regret) whereas α, ¯τ and the Mk are variable and α→0, ¯τ→∞.
The proof of Lemma A.1 is almost identical to the proof of the main theorem for "non-catastrophic" DRL up to minor modifications needed to pass from absolute to relative regret, and tracking the contribution of the derivative of VMkπk. We give it in Appendix B.
We will not apply Lemma A.1 directly to the universes of Theorem 1. Instead, we will define new universes using the following constructions.
#Definition A.2
Consider M a catastrophe MDP. We define the catastrophe MDP MD as follows.
SFMD:=SFM⊔{⊥}, SDMD:=SDM, SCMD:=∅.
AMD=AM
For any s,t∈SDM:
TMD(t∣s)=TM(t∣s)
TMD(⊥∣s)=TM(SCM∣s)
TMD(⊥∣⊥)=1
For any s∈SDM∪SFM, t∈SFM:
TMD(t∣s)=TM(t∣s)
For any s∈SFM:
TMD(⊥∣s)=TM(SCM∪SDM∣s)
For any s∈SDM, RMD(s)=(1/2)RM(s).
For any s∈SFM, RMD(s)=1.
RMD(⊥)=0
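The kernel of MD can be built mechanically from TM: mass that Definition A.2 forbids (into corrupted states, or from safe into dangerous states) is redirected to the new absorbing state ⊥. A sketch, with indexing conventions of my own; ME of Definition A.3 reuses this same kernel and only changes the reward:

```python
import numpy as np

def build_TD(T, safe, dangerous, corrupt):
    # Corrupted states are dropped from the state set; a fresh state "bot"
    # is appended and absorbs all redirected mass.
    keep = sorted(set(range(T.shape[0])) - set(corrupt))
    pos = {s: i for i, s in enumerate(keep)}
    bot = len(keep)
    TD = np.zeros((bot + 1, T.shape[1], bot + 1))
    for s in keep:
        for a in range(T.shape[1]):
            for t in range(T.shape[0]):
                if t in corrupt or (s in safe and t in dangerous):
                    TD[pos[s], a, bot] += T[s, a, t]
                else:
                    TD[pos[s], a, pos[t]] += T[s, a, t]
    TD[bot, :, bot] = 1.0      # bot is absorbing
    return TD

# 3 hypothetical states (0 safe, 1 dangerous, 2 corrupt), one action.
T = np.array([[[0.6, 0.3, 0.1]],
              [[0.2, 0.5, 0.3]],
              [[0.0, 0.0, 1.0]]])
TD = build_TD(T, safe={0}, dangerous={1}, corrupt={2})
print(TD[0, 0], TD[1, 0])   # rows [0.6, 0, 0.4] and [0.2, 0.5, 0.3]
```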
Now, consider an interface I=(A,O) and a universe υ=(μ,r) which is an O-realization of a catastrophe MDP M with state function S. Denote O′:=O×{R,I}, O⋆:=O×{R,I,⊥} and I⋆:=(A,O⋆). Denote β:O′→O the projection mapping and β∗:(A×O′)∗→(A×O)∗ the corresponding map on histories. We define the I⋆-universe υD=(μD,r⋆) and the function S⋆:(A×O⋆)∗→SMD as follows
μD(oR∣ha) :=
μ(o∣β∗(h)a) if h∈(A×O′)∗ and S(β∗(h)),S(β∗(h)ao)∈SDM
0 otherwise
μD(oI∣ha) :=
μ(o∣β∗(h)a) if h∈(A×O′)∗ and S(β∗(h)ao)∈SFM
0 otherwise
μD(o⊥∣ha) := (1/|O|)(1−∑o∈O(μD(oR∣ha)+μD(oI∣ha)))
r⋆(h) :=
(1/2)r(λ) if h=λ
(1/2)r(β∗(h)) if h∈(A×O′)∗, |h|>0 and the last observation of h is in O×{R}
1 if h∈(A×O′)∗, |h|>0 and the last observation of h is in O×{I}
0 if h∉(A×O′)∗
S⋆(h) := S(β∗(h)) if h∈(A×O′)∗, and ⊥ otherwise.
It is easy to see that υD is an O⋆-realization of MD with state function S⋆.
#Definition A.3
Consider M a catastrophe MDP. We define the catastrophe MDP ME as follows.
SFME:=SFM⊔{⊥}, SDME:=SDM, SCME:=∅.
AME=AM
TME=TMD
For any s∈SDM∪SFM, RME(s)=(1/2)RM(s).
RME(⊥)=0
Now, consider an interface I=(A,O) and a universe υ=(μ,r) which is an O-realization of a catastrophe MDP M with state function S. We define the I⋆-universe υE=(μE,r⋆) as follows
μE(oR∣ha) :=
μ(o∣β∗(h)a) if h∈(A×O′)∗ and either S(β∗(h)),S(β∗(h)ao)∈SDM or S(β∗(h)ao)∈SFM
0 otherwise
μE(oI∣ha):=0
μE(⊥∣ha):=1−∑o∈OμE(oR∣ha)
It is easy to see that υE is an O⋆-realization of ME with state function S⋆.
Given h=a0o0a1o1…an−1on−1∈(A×O)n, we will use the notation
R∗h:=a0o0Ra1o1R…an−1on−1R∈(A×O′)n
Given an I⋆-policy π, the I-policy πR∗ is defined by πR∗(h):=π(R∗h).
In order to utilize condition iii of Definition 8, we need to establish the following relation between M♭ and MD, ME.
#Proposition A.2
Consider M a catastrophe MDP, some s∈SM∖SCM and π∗ a proper mitigation policy. Then
A0M♭π∗(s)∩A♯M(s)⊆A0MDπ∗(s)
A0M♭π∗(s)∩A♯M(s)⊆A0MEπ∗(s)
For the purpose of the proof, the following notation will be convenient
#Definition A.4
Consider S a finite set and some stochastic matrix T:S→ΔS. We define T∞:S→ΔS by
T∞ := lim_{n→∞} (1/n)∑_{m=0}^{n−1} T^m
As is well known, the limit above always exists.
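The Cesàro average T∞ exists even when the plain limit of T^n does not, e.g. for a periodic chain. A quick numerical sketch:

```python
import numpy as np

# Definition A.4 numerically: average the powers of T up to a large horizon.
def cesaro_limit(T, n=10000):
    acc = np.zeros_like(T)
    P = np.eye(T.shape[0])
    for _ in range(n):
        acc += P
        P = P @ T
    return acc / n

# Period-2 chain: T^n alternates between the identity and the swap, so the
# plain limit fails, but the Cesaro average converges to the uniform kernel.
swap = np.array([[0.0, 1.0],
                 [1.0, 0.0]])
print(cesaro_limit(swap))   # all entries 0.5
```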
#Proof of Proposition A.2
Consider any s∈SM∖SCM and a∈A0M♭π∗(s)∩A♯M(s). Since a∈A0M♭π∗(s), we have
Now we will establish a bound on the derivative of V by γ in terms of expected mitigation time, in order to demonstrate condition ii of Lemma A.1.
#Proposition A.3
Fix ¯τ,¯τF1∈(0,∞). Consider a catastrophe MDP M and a proper mitigation policy π with expected mitigation time ¯τ. Assume that for any s∈SFM and γ∈(0,1)
|dVMπ(s,γ)/dγ| ≤ ¯τF1
Then, for any s∈SM∖SCM and γ∈[0,1]
|dVMπ(s,γ)/dγ| ≤ 3¯τ1+¯τF1
Note that, since VMπ(s,γ) is a rational function of γ with no poles on the interval [0,1], some finite ¯τF always exists. Note also that Proposition A.3 is really about Markov chains rather than MDPs, but we don't make it explicit to avoid introducing more notation.
#Proof of Proposition A.3
Let μMπs∈ΔSωM be the Markov chain with transition matrix TMπ and initial state s. For any γ∈(0,1), we have
VMπ(s,γ) = Ex∼μMπs[(1−γ)∑_{n=0}^∞ γ^n RM(xn)]
Given x∈SωM, we define τ(x)∈N⊔{∞} by
τ(x)=min{n∈N∣xn∈SFM}
It is easy to see that VMπ(s,γ) can be rewritten as
VMπ(s,γ) = Ex∼μMπs[(1−γ)∑_{n<τ(x)} γ^n RM(xn) + γ^{τ(x)} VMπ(xτ(x),γ)]
To transform the relative regret bounds for "auxiliary" universes obtained from Lemma A.1 to regret bounds for the original universes, we will need the following.
#Definition A.5
Fix δ∈(0,1) and a universe υ=(μ,r) which is an O-realization of a catastrophe MDP M with state function S. Let ~M be a δ-perturbation of M. An environment ~μ is said to be a δ-lift of ~M to μ when
i. (~μ,r) is an O-realization of ~M with state function S.
ii. hdom~μ⊆hdomμ
iii. For any h∈hdom~μ and a∈A, if S(h)∈SM∖SDM then μ(ha)=~μ(ha).
iv. For any h∈hdom~μ and a∈A, if S(h)∈SDM then there exists ζ∈ΔO s.t. μ(ha)=(1−δ)~μ(ha)+δζ
It is easy to see that such a lift always exists, for example we can take:
#Proposition A.4
Consider γ,δ∈(0,1) s.t. δ≥1−γ and ¯τ∈(0,∞). Let υ=(μ,r) be a universe which is an O-realization of a catastrophe MDP M with state function S. Suppose that π∗ is a mitigation policy for M that has expected mitigation time ¯τ. Consider some I⋆-policy π. Suppose that ~M is a δ-perturbation of M and ~μ is a δ-lift of ~M to μ. Denote ~υ:=(~μ,r). Then, there is some C∈(0,∞) that depends on nothing s.t.
In order to prove Proposition A.4, we need a relative regret bound for M derived from a relative regret bound for ME.
#Proposition A.5
Fix an interface I and an I-universe υ=(μ,r) which is an O-realization of a catastrophe MDP M with state function S s.t. S(λ)∉SCM. Suppose that π∗ is a mitigation policy for M. Let π be any I⋆-policy. Then, for any γ∈(0,1)
(1/2)(EUπ∗Sυ(γ)−EUπR∗υ(γ)) ≤ EUπ∗S⋆υE(γ)−EUπυE(γ)
#Proof of Proposition A.5
π∗ is a mitigation policy, therefore for any s∈SM∖SCM, TMEπ∗(s)=TMπ∗(s). It follows that
EUπ∗S⋆υE(γ) = (1/2)EUπ∗Sυ(γ)
Also, it is easy to see from the definition of TME and r⋆ that
EUπυE(γ) ≤ (1/2)EUπR∗υ(γ)
Indeed, any discrepancy between the behavior of ME and M involves transition to the state ⊥ which yields 0 reward forever. Subtracting these inequalities, we get the desired result.
Another observation we need to prove Proposition A.4 is a bound on the effect of δ-perturbations in terms of mitigation time.
#Proposition A.6
Consider γ,δ∈(0,1), a universe υ=(μ,r) which is an O-realization of a catastrophe MDP M with state function S, and a policy π. Assume that for any h∈S−1(SFM) and a∈suppπ(h), TM(SFM∣S(h),a)=1. Let ~M be a δ-perturbation of M and ~μ a δ-lift of ~M to μ. Then,
dtv(μ⋈π,~μ⋈π) ≤ 2(1−Ex∼~μ⋈π[(1−δ)^{τS(x)}])
#Proof of Proposition A.6
It is straightforward to construct a probability space (Ω,P), X:Ω→(A×O)ω measurable and {Cn⊆Ω}n∈N measurable s.t.
i. h∗(P)=μ⋈π
ii. For any n∈N and h∈(A×O)n s.t. S(h)∈SDM:
Pr[Cn∣X:n=h]=δ
iii. For any n∈N and h∈(A×O)n s.t. S(h)∈SM∖SDM:
Pr[Cn∣X:n=h]=0
iv. For any n∈N, h∈(A×O)n and h′∈(A×O)n+1:
Pr[X:n+1=h′∣X:n=h and not Cn]=Prx∼~μ⋈π[x:n+1=h′∣x:n=h]
Denote D:=∩∞n=0Ω∖Cn. We have
P(D) = Ex∼~μ⋈π[(1−δ)^{τS(x)}]
dtv(P,P∣D)≤1−P(D)
dtv(μ⋈π,h∗(P∣D))≤1−P(D)
Also, it is easy to see that for any A⊆(A×O)ω measurable
As a final ingredient towards the proof of Proposition A.4, we will need to use the relative regret bound for MD to get a certain statistical bound on mitigation time.
#Definition A.6
Let μ be any environment. We define the closed set hdomωμ⊆(A×O)ω by
hdomωμ:={x∈(A×O)ω∣∀n∈N:x:n∈hdomμ}
Consider a universe υ=(μ,r) which is an O-realization of a catastrophe MDP M with state function S. We define the measurable function τS:hdomωμ→N⊔{∞} as follows
τS(x):=min{n∈N∣S(x:n)∈SFM}
#Proposition A.7
Fix an interface I and an I-universe υ=(μ,r) which is an O-realization of a catastrophe MDP M with state function S s.t. S(λ)∈SDM. Suppose that π∗ is a mitigation policy for M with expected mitigation time ¯τ∈(0,∞). Let π be any I⋆-policy. Then, there is C∈(0,∞) that depends on nothing s.t. for any γ,δ∈(0,1), if δ≥1−γ then
#Proof of Proposition A.7
Denote ρ:=max(EUπ∗S⋆υD(γ)−EUπυD(γ),0). By choosing C sufficiently large, we can assume without loss of generality that the right hand side is positive since, unless γ^¯τ≈1, we would have Cδ¯τ≥C(1−γ)¯τ>1, and unless ρ≪1, we would have Cδρ/(1−γ)≥Cρ>1. In either case, the inequality we are trying to prove would hold. Also, note that ln(1−δ)/lnγ≥1. We get
By the same reasoning as before, we can assume without loss of generality that e.g. γ^¯τ ≤ 1−(1/2)(1−γ)¯τ. It follows that
2γ^¯τ−2ρ−1 ≤ 2−(1−γ)¯τ−2ρ−1 = 1−(1−γ)¯τ−2ρ ≤ 1−(1−γ) = γ
ln(2γ^¯τ−2ρ−1)/lnγ ≥ 1
Combining this with the previous inequality implies
EμD⋈π[(1−δ)^{τS⋆}] ≥ 1−δ·ln(2γ^¯τ−2ρ−1)/lnγ
It is easy to see that there is x0∈(0,1) s.t. for any x∈[x0,1], 2x−1≥x^3 and therefore (2x−1)/x^3 ≥ 1. Therefore, for any such x and ρ≪1, e^{−Cρ} ≤ (2x−1)/x^3 − 2ρ/x0^3 ≤ (2x−2ρ−1)/x^3, where it is sufficient to assume that C > 2/x0^3. Taking x=γ^¯τ, we conclude (assuming C≥3 and observing that γ^{1/(1−γ)} ≤ e^{−1})
γ^{Cρ/(1−γ)} ≤ e^{−Cρ} ≤ (2γ^¯τ−2ρ−1)/γ^{3¯τ} ≤ (2γ^¯τ−2ρ−1)/γ^{C¯τ}
γ^{C(¯τ+ρ/(1−γ))} ≤ 2γ^¯τ−2ρ−1
Taking logarithm of both sides
C(¯τ+ρ/(1−γ))lnγ ≤ ln(2γ^¯τ−2ρ−1)
C(¯τ+ρ/(1−γ)) ≥ ln(2γ^¯τ−2ρ−1)/lnγ
Combining with the inequality we had before, we get
EμD⋈π[(1−δ)^{τS⋆}] ≥ 1−Cδ(¯τ+ρ/(1−γ))
#Proof of Proposition A.4
By Proposition A.5, we have
(1/2)(EUπ∗Sυ(γ)−EUπR∗υ(γ)) ≤ EUπ∗S⋆υE(γ)−EUπυE(γ)
Note that ~ME is a δ-perturbation of ME and ~μE is a δ-lift of ~ME to μE. The condition of Proposition A.6 holds tautologically due to Definition A.3. Therefore, we can apply Proposition A.6 and get
The following definition will be useful in order to apply Proposition A.4.
#Definition A.7
Consider M a catastrophe MDP s.t. AM=A and a policy σ for M. We then define the catastrophe MDP M[σ] as follows:
SFM[σ]:=SFM, SDM[σ]:=SDM, SCM[σ]:=SCM.
AM[σ]:=¯A
For any s∈SM and a∈A: TM[σ](s,a):=TM(s,a).
For any s∈SM: TM[σ](s,⊥):=TMσ(s).
RM[σ]:=RM
Now consider υ an O-realization of M with state function S. Then, ¯υ[σS] is clearly an ¯O-realization of M[σ] with the state function ¯S defined by ¯S(h):=S(h̲).
Note also that M[σ]D≅MD[σ] and M[σ]E≅ME[σ] (where interpreting σ as a policy for MD or ME requires choosing an arbitrary value for the state ⊥). Moreover, (¯I)⋆≅¯(I⋆), (¯υ[σS])D=¯(υD)[σS⋆], (¯υ[σS])E=¯(υE)[σS⋆] and (¯S)⋆=¯(S⋆).
Finally, we are ready to prove the main theorem.
#Proof of Theorem 1
For every k∈[N], denote ~Mk and ~σk the δ-perturbations of Mk and σk respectively and πk the deterministic proper mitigation policy for ~Mk of Definition 8. Let ~μk be a lift of ~Mk to μk and denote ~υk:=(~μk,rk). Define H:={~υkD,~υkE}k∈[N]. Observe that ~σk is ϵ-sane relatively to πk in the sense of ~MkD and ~MkE both: condition i of Definition A.1 follows by Proposition A.2 from conditions ii and iii of Definition 8, and condition ii of Definition A.1 follows from condition iv of Definition 8. Moreover, by Proposition A.3, we have
|dV~MDπk(s,γ)/dγ| = O(¯τ)
|dV~MEπk(s,γ)/dγ| = O(¯τ)
Here, we used that MFk is fixed (and thus so is ¯τF, by conditions i-iii).
Condition iv implies that all the universes in H have a common reward function (notice that transition to a corrupted state induces the observation ⊥ whereas transition to a state in SFM in the universe ~υkD induces the observation I). Therefore, we can use Lemma A.1 to conclude that there exists an ¯¯¯¯¯¯I⋆-policy π♯ s.t.
It is easy to see that ~Mk[~σk] is a 2δ-perturbation of Mk[σk]. Observe also that ¯(~υkD)[~σkSk⋆]=(¯~υk[~σkSk])D, ¯(~υkE)[~σkSk⋆]=(¯~υk[~σkSk])E and ¯(~μk[~σkSk]) is a 2δ-lift of ~Mk[~σk] to ¯(μk[σkSk]). Applying Proposition A.4, we get
##Appendix B
#Proposition B.1
Consider a universe υ=(μ,r) which is an O-realization of an MDP M with state function S, a stationary policy π∗, an arbitrary I-policy π0 and some γ∈(0,1). Then,
#Proof of Proposition B.1
To avoid encumbering the notation, we will omit the parameter γ in functions that depend on it. We will use S implicitly, i.e. given F a function on SM and h∈hdomμ, F(h):=F(S(h)). Finally, we will omit Mπ∗, using the shorthand notations V:=VMπ∗, Q:=QMπ∗.
It is easy to see that the second term vanishes, yielding the desired result.
#Proposition B.2
Consider some τ∈(0,∞), T∈N+, a universe υ=(μ,r) that is an O-realization of an MDP M with state function S, a stationary policy π∗ and an arbitrary I-policy π0. For any n∈N, let π∗n be an I-policy s.t. for any h∈hdomμ
To avoid encumbering the notation, we will use S implicitly, i.e. given F a function on SM and h∈hdomμ, F(h):=F(S(h)). Also, we will omit Mπ∗, using the shorthand notations V:=VMπ∗, Q:=QMπ∗.
Denote ρ∗l:=μ⋈π∗l, ρ0:=μ⋈π0. We also use the shorthand notations rn:=r(x:n), Vn(γ):=V(x:n,γ), Qn(γ):=Q(x:n,xAn,γ). Both π∗l and π∗l+1 coincide with π∗ after (l+1)T, therefore
Fix γ∈(0,1), η∈(0,N^{−1}) and T∈N+. Denote νk:=¯μk[σkSk]. To avoid cumbersome notation, whenever Mk should appear as a subscript, we will replace it by k. Let (Ω,P∈ΔΩ) be a probability space. Let K:Ω→[N] be a random variable and the following be stochastic processes
Zn,~Zn:Ω→Δ[N]
Jn:Ω→[N]
Ψn:Ω→A
An:Ω→¯A
Θn:Ω→¯O
We also define AΘ:n:Ω→¯(A×O)∗ by
AΘ:n:=A0Θ0A1Θ1…An−1Θn−1
(The following conditions on A and Θ imply that the range of the above is indeed in ¯(A×O)∗.) Let D and D!k be as in Proposition C.1 (we assume w.l.o.g. that ϵ<1/|A|). We construct Ω, K, Z, ~Z, J, Ψ, A and Θ s.t. K is uniformly distributed and for any k∈[N], l∈N, m∈[T] and o∈O, denoting n=lT+m
Note that the last equation has the form of a Bayesian update which is allowed to be arbitrary when update is on "impossible" information.
We now construct the ¯I-policy π∗ s.t. for any n∈N, h∈¯(A×O)∗ s.t. Pr[AΘ:n=h]>0 and a∈¯A
π∗(a∣h):=Pr[An=a∣AΘ:n=h]
That is, we perform Thompson sampling at time intervals of size T, moderated by the delegation routine D, and discard from our belief state hypotheses whose probability is below η and hypotheses whose sampling resulted in recommending "unsafe" actions, i.e. actions that D refused to perform.
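The scheme just described, minus the delegation routine D and the "unsafe action" bookkeeping, can be caricatured as follows; the Bernoulli observation model and all parameters are stand-ins, not constructions from the text:

```python
import random

# Thompson sampling over N hypotheses at intervals of size T, discarding
# hypotheses whose posterior weight drops below eta (as in the text).

def thompson_sketch(envs, true_k, T=10, eta=0.05, horizon=200, seed=0):
    rng = random.Random(seed)
    weights = [1.0 / len(envs)] * len(envs)   # uniform prior over hypotheses
    for step in range(horizon):
        if step % T == 0:
            # j is the hypothesis whose optimal policy the agent would follow
            # for the next T steps (the acting itself is omitted here)
            j = rng.choices(range(len(envs)), weights=weights)[0]
        obs = rng.random() < envs[true_k]     # observation from the true environment
        weights = [w * (p if obs else 1 - p) for w, p in zip(weights, envs)]
        total = sum(weights)
        weights = [w / total for w in weights]
        weights = [0.0 if w < eta else w for w in weights]   # discard low-weight hypotheses
        total = sum(weights)
        weights = [w / total for w in weights]
    return weights

envs = [0.2, 0.5, 0.8]   # hypothesis i: observations are Bernoulli(envs[i])
print(thompson_sketch(envs, true_k=2))   # weight concentrates on hypothesis 2
```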
In order to prove π∗ has the desired property, we will define the stochastic processes Z!, ~Z!, J!, Ψ!, A! and Θ!, each process of the same type as its shriekless counterpart (thus Ω is constructed to accommodate them). These processes are required to satisfy the following:
% operators that are separated from the operand by a space
% autosize deliminaters
% operators that require brackets
% operators that require parentheses
% Paper specific
Previously we derived a regret bound for DRL which assumed the advisor is "locally sane." Such an advisor can only take actions that don't lose any value in the long term. In particular, if the environment contains a latent catastrophe that manifests with a certain rate (such as the possibility of an UFAI), a locally sane advisor has to take the optimal course of action to mitigate it, since every delay yields a positive probability of the catastrophe manifesting and leading to permanent loss of value. This state of affairs is unsatisfactory, since we would like to have performance guarantees for an AI that can mitigate catastrophes that the human operator cannot mitigate on their own. To address this problem, we introduce a new form of DRL where in every hypothetical environment the set of uncorrupted states is divided into "dangerous" (impending catastrophe) and "safe" (catastrophe was mitigated). The advisor is then only required to be locally sane in safe states, whereas in dangerous states certain "leaking" of long-term value is allowed. We derive a regret bound in this setting as a function of the time discount factor, the expected value of catastrophe mitigation time for the optimal policy, and the "value leak" rate (i.e. essentially the rate of catastrophe occurrence). The form of this regret bound implies that in certain asymptotic regimes, the agent attains near-optimal expected utility (and in particular mitigates the catastrophe with probability close to 1), whereas the advisor on its own fails to mitigate the catastrophe with probability close to 1. %Thus, this formalism can be regarded as a simple model of aligned superintelligence: the agent is aligned since near-optimal utility is achieved despite the presence of corrupted states (which are treated more or less the same as before) and it is superintelligent since its performance is vastly better than the performance of the advisor.
Appendix A proves the main theorem. Appendix B contains the proof of an important lemma which is however almost identical to what appeared in the previous essay. Appendix C contains several propositions from the previous essay which we are used in the proof.
##Results
We start by formalising the concepts of a "catastrophe" and "catastrophe mitigation" in the language of MDPs.
#Definition 1
A catastrophe MDP is an MDP M together with a partition of SM into subsets SM:=SFM⊔SDM⊔SCM (safe, dangerous and corrupt states respectively).
#Definition 2
Fix a catastrophe MDP M. Define A♯M:SM→2AM by
A♯M(s):=⎧⎪⎨⎪⎩{a∈AM∣suppT(s,a)⊆SFM} if s∈SFM{a∈AM∣suppT(s,a)∩SCM=∅} if s∈SDMAM if s∈SCM
is called a mitigation policy for M when
i. For any s∈SM, suppπ(s)⊆A♯M(s).
π is called a proper mitigation policy for M when condition i holds and
ii. For any s∈SDM, limn→∞TnMπ(SFM∣s)=1.
#Definition 3
Fix ¯τ∈(0,∞), a catastrophe MDP M and a proper mitigation policy π. π is said to have expected mitigation time ¯τ when for any s∈SDM
∞∑n=0(n+1)(Tn+1Mπ(SFM∣s)−TnMπ(SFM∣s))=¯τ
Next, we introduce the notion of an MDP perturbation. We will use it by considering perturbations of a catastrophe MDP which "eliminate the catastrophe."
#Definition 4
Fix δ∈(0,1) and consider a catastrophe MDP M. An MDP ~M is said to be a δ-perturbation of M when
i. S~M=SM
ii. A~M=AM
iii. R~M=RM
iv. For any s∈SM∖SDM and a∈AM, T~M(s,a)=TM(s,a)
v. For any s∈SDM and a∈AM, there exists ζ∈ΔSM s.t. TM(s,a)=(1−δ)T~M(s,a)+δζ.
Similarly, we can consider perturbations of a policy.
#Definition 5
Fix δ∈(0,1) and consider a catastrophe MDP M. Given and , ~π is said to be a δ-perturbation of π when
i. For any s∈SM∖SDM, ~π(s)=π(s).
ii. For any s∈SDM, there exists α∈ΔA s.t. π(s)=(1−δ)~π(s)+δα.
We will also need to introduce policy-specific value functions, Q-functions and relatively k-optimal actions.
#Definition 6
Fix an MDP M and . We define VMπ:SM×(0,1)→[0,1] and QMπ:SM×AM×(0,1)→[0,1] by
VMπ(s,γ):=(1−γ)∞∑n=0γnETnMπ(s)[RM]
QMπ(s,a,γ):=(1−γ)RM(s)+γEt∼TM(s,a)[VMπ(t,γ)]
For each k∈N, we define VkMπ:SM→R, QkMπ:SM×AM→R and AkMπ:SM→2AM by
VkMπ(s):=(−1)kdkVMπ(s,γ)dγk∣∣∣γ=1
QkMπ(s,a):=(−1)kdkQMπ(s,a,γ)dγk∣∣∣γ=1
A0Mπ(s):={a∈AM∣Q0Mπ(s,a)≥V0Mπ(s)}
Ak+1Mπ(s):={a∈AkMπ(s)∣Qk+1Mπ(s,a)≥Vk+1Mπ(s) or ∃j≤k:QjMπ(s,a)>VjMπ(s)}
Now we give the new (weaker) condition on the advisor policy. For notational simplicity, we assume the policy is stationary. It is easy to generalize these results to non-stationary advisor policies and to policies that depend on irrelevant additional information (i.e. policies for universes that are realizations of the MDP).
#Definition 7
Given a catastrophe MDP M, we denote M♭ the MDP defined by
SM♭=SM
AM♭=AM
TM♭=TM
For any s∉SFM, RM♭(s)=0.
For any s∈SFM, RM♭(s)=12+12RM(s).
#Definition 8
Fix ϵ,δ,γ∈(0,1). Consider a catastrophe MDP M. A policy is said to be locally (ϵ,δ)-sane for (M,γ) when there exists a δ-perturbation ~M of M with a deterministic proper mitigation policy π∗:SM→AM and a δ-perturbation ~π of π s.t.
i. For all s∈SM, VMπ∗(s,γ)=VM(s,γ).
ii. ~π is a mitigation policy for ~M.
iii. For any s∈SM∖SCM: supp~π(s)⊆A0~M♭π∗(s)
iv. For any s∈SM∖SCM: ~π(π∗(s)∣s)>ϵ
% Alternatively we could allow \pi^* to be stochastic... % and require \tilde{\pi} to be of the form \epsilon \pi^* + (1 - \epsilon) \pi'
Given ¯τ∈(0,∞), π is said to have potential mitigation time ¯τ when π∗ has it as expected mitigation time.
Note that a locally (ϵ,δ)-sane policy still has to be 0-optimal in SFM. This requirement seems reasonably realistic, since, roughly speaking, it only means that there is some way to "rearrange the universe" that the agent can achieve, and that would be "endorsed" by the advisor, s.t this rearrangement doesn't destroy substantially much value and s.t. after this rearrangement, there is no "impending catastrophe" that the agent has to prevent and the advisor wouldn't be able to prevent in its place. In particular, this rearrangement may involve creating some subagents inside the environment and destroying the original agent, in which case any policy on SFM is "vacuously" optimal (since all actions have no effect).
We can now formulate the main result.
#Theorem 1
Fix an interface I=(A,O), N∈N, ϵ∈(0,1) and for each k∈[N], an MDP MFk s.t. AMFk=A. Now, consider for each k∈[N], an I-universe υk=(μk,rk) which is an O-realization of a catastrophe MDP Mk with state function Sk s.t.
i. SFMk=SMFk
ii. For each s∈SMFk and a∈A, TMk(s,a)∣SMFk=TMFk(s,a).
iii. For each s∈SMFk, RMk(s)=RMFk(s).
iv. Given k,j∈[N] and h∈hdomμk∩hdomμj, if Sk(h)∈SMk∖SCMk and Sj(h)∈SMj∖SCMj, then rk(h)=rj(h) (this condition means that in uncorrupted states, the reward is observable).
Consider also α,δ∈(0,1), ¯τ∈(0,∞) and σk a locally (ϵ,δ)-sane policy for (Mk,1−α). Assume σk has potential mitigation time ¯τ. Then, there exists an ¯I-policy π∗ s.t. for any k∈[N]
EU∗υk(1−α)−EUπ∗¯υk[σkSk](1−α)=O(max(δα,1)⋅(¯τα)1/4+¯τδ)
Here, σkSk is the I-policy defined by σkSk(h):=σk(Sk(h)). ϵ and the MFk are regarded as fixed and we don't explicitly examine their effect on regret, whereas α, δ, ¯τ and the Mk are regarded as variable with the asymptotics α,δ→0, ¯τ→∞.
In most interesting cases, δ≫α (i.e. the "mean time between catastrophes" is much shorter than a discount horizon) and ¯τα≪1 (i.e. the expected mitigation time is much shorter than the discount horizon), which allows simplifying the above to
EU∗υk(1−α)−EUπ∗¯υk[σkSk](1−α)=O(δ¯τ1/4α−3/4)
We give a simple example.
#Example 1
Let A={0,1,∗}, O={0,1}. For any n∈N and k∈[N], we fix some wkn∈{0,1}n and define the catastrophe MDP Mkn by
SDMkn={0,1}≤n, SFMkn={⊥,⊤}, SCMkn=∅ (adding corrupted states is an easy exercise).
If s∈{0,1}<n and a∈{0,1} then
TMkn(sa∣s,a)=1−δ
TMkn(⊥∣s,a)=δ
TMkn(s∣sa,∗)=1−δ
TMkn(⊥∣sa,∗)=δ
If a∈{0,1} then
TMkn(⊤∣wkn,a)=1
If s∈{0,1}n∖wkn and a∈{0,1} then
TMkn(⊥∣s,a)=1
If s∈{⊥,⊤} and a∈A then
TMkn(s∣s,a)=1
RMkn(⊥)=0, if s∈SMkn∖⊥ then RMkn(s)=1.
Skn(λA×O)=λ{0,1} and Skn(hao)=⊥ iff o=0 (this defines a unique Skn).
If s∈{0,1}<n∪{⊥,⊤} then σkn(a∣s)=1/3 for any a∈A.
σkn(0∣wkn)=ϵ, σkn(∗∣wkn)=1−ϵ.
If s∈{0,1}n∖wkn then σkn(0∣s)=δ, σkn(∗∣s)=1−δ.
We have ¯τ=n. Consider the asymptotic regime n→∞, αn=Θ(n−6), δn=Θ(n−5). According to Theorem 1, we get
EU∗υkn(1−αn)−EUπ∗n¯υkn[σkn](1−αn)=O(n−5⋅n1/4⋅(n−6)−3/4)=O(n−1/4)=O(α1/24n)
The probability of a catastrophe (i.e. ending up in state ⊥) for the optimal policy for a given k is O(¯τδ)=O(n−4). Therefore, the probability of a catastrophe for policy π∗n is O(n−4+n−1/4)=O(n−1/4). On the other hand, it is easy to see that the policy σkn has a probability of catastrophe 1−o(1) (and in particular regret Ω(1)): it spends Ω(2n) time "exploring" {0,1}≤n with a probability δ=Θ(n−5) of a catastrophe on every step.
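To make the contrast concrete, here is a small Monte Carlo sketch of Example 1 (all names are ours; the treatment of ∗ at the empty string is an assumption, since the definition only specifies ∗ for nonempty strings). It compares the advisor's random exploration with an informed policy that types out wkn directly:

```python
import random

def make_step(w, delta):
    """One transition of the Example 1 MDP. States are bit strings of length <= len(w),
    plus absorbing safe states 'top' (success) and 'bot' (catastrophe)."""
    n = len(w)
    def step(s, a, rng):
        if len(s) == n:
            if a in "01":                  # committing at a full-length string
                return "top" if s == w else "bot"
            nxt = s[:-1]                   # '*' deletes the last bit
        elif a == "*":
            nxt = s[:-1]                   # no-op at the empty string (assumption)
        else:
            nxt = s + a                    # append a bit
        return "bot" if rng.random() < delta else nxt
    return step

def advisor_policy(w, delta, eps):
    """The advisor of Example 1: uniform exploration on short strings, commit with
    probability eps at w and probability delta at a wrong full-length string."""
    def sigma(s, rng):
        if len(s) < len(w):
            return rng.choice("01*")
        commit = eps if s == w else delta
        return "0" if rng.random() < commit else "*"
    return sigma

def direct_policy(w):
    """An informed agent: type out w, then commit."""
    return lambda s, rng: w[len(s)] if len(s) < len(w) else "0"

def success_prob(policy, w, delta, trials, seed, cap=30000):
    """Estimate the probability of reaching 'top' before 'bot'."""
    rng = random.Random(seed)
    step = make_step(w, delta)
    wins = 0
    for _ in range(trials):
        s = ""
        for _ in range(cap):
            s = step(s, policy(s, rng), rng)
            if s in ("top", "bot"):
                break
        wins += (s == "top")
    return wins / trials
```

With n=8 and δ=0.02 the informed policy succeeds with probability (1−δ)^n≈0.85, while the advisor's random walk almost always ends in ⊥, matching the 1−o(1) catastrophe probability derived above.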
Note that this example can be interpreted as a version of Christiano's approval-directed agent, if we regard the state s∈{0,1}i as a "plan of action" that the advisor may either approve or not. But in this formalism, it is a special case of consequentialist reasoning.
Theorem 1 speaks of a finite set of environments, but as before (see Proposition 1 here and Corollary 3 here), there is a "structural" equivalent, i.e. we can use it to produce corollaries about Bayesian agents with priors over a countable set of environments. The difference is, in this case we consider asymptotic regimes in which the environment is also variable, so the probability weight of the environment in the prior will affect the regret bound. We leave out the details for now.
##Appendix A
We start by deriving a more general and more precise version of the non-catastrophic regret bound, in which the optimal policy is replaced by an arbitrary "reference policy" (later it will be related to the mitigation policy) and the dependence on the MDPs is expressed via a bound on the derivative of V by γ.
#Definition A.1
Fix ϵ∈(0,1). Consider an MDP M and policies π:SM→AM, σ:SM→ΔAM. σ is called ϵ-sane relative to π when for any s∈SM
i. suppσ(s)⊆A0Mπ(s)
ii. σ(π(s)∣s)>ϵ
#Lemma A.1
Fix an interface I=(A,O), N∈N and ϵ∈(0,1). Now, consider for each k∈[N], an I-universe υk=(μk,r) which is an O-realization of an MDP Mk with state function Sk, and policies πk:SMk→A, σk:SMk→ΔA. Consider also α∈(0,1), ¯τ∈(0,∞) and assume that
i. σk is ϵ-sane relative to πk.
ii. For any s∈SMk and γ∈(0,1) ∣∣∣dVMkπk(s,γ)dγ∣∣∣≤¯τ
Then, there exists an ¯I-policy π∗ s.t. for any k∈[N]
EUπkSkυk(1−α)−EUπ∗¯υk[σkSk](1−α)≤O((¯τα)1/4)
The O-notation refers to the asymptotics where ϵ is fixed (so we don't explicitly examine its effect on regret) whereas α, ¯τ and the Mk are variable and α→0, ¯τ→∞.
The proof of Lemma A.1 is almost identical to the proof of the main theorem for "non-catastrophic" DRL, up to minor modifications needed to pass from absolute to relative regret, and tracking the contribution of the derivative of VMkπk. We give it in Appendix B.
We will not apply Lemma A.1 directly to the universes of Theorem 1. Instead, we will define new universes using the following constructions.
#Definition A.2
Consider M a catastrophe MDP. We define the catastrophe MDP MD as follows.
SFMD:=SFM⊔{⊥}, SDMD:=SDM, SCMD:=∅.
AMD=AM
For any s∈SDM and t∈SM∖SCM:
TMD(t∣s)=TM(t∣s)
TMD(⊥∣s)=TM(SCM∣s)
TMD(⊥∣⊥)=1
For any s∈SFM and t∈SFM:
TMD(t∣s)=TM(t∣s)
TMD(⊥∣s)=TM(SCM∪SDM∣s)
For any s∈SDM, RMD(s)=12RM(s).
For any s∈SFM, RMD(s)=1.
RMD(⊥)=0
Now, consider an interface I=(A,O) and a υ=(μ,r) which is an O-realization of a catastrophe MDP M with state function S. Denote O′:=O×{R,I}, O⋆:=O×{R,I,⊥} and I⋆:=(A,O⋆). Denote β:O′→O the projection mapping and β∗:(A×O′)∗→(A×O)∗ corresponding. We define the I⋆-universe υD=(μD,r⋆) and the function S⋆:(A×O⋆)∗→SMD as follows
μD(oR∣ha):={μ(o∣β∗(h)a) if h∈(A×O′)∗ and S(β∗(h)),S(β∗(h)ao)∈SDM0 otherwise
μD(oI∣ha):={μ(o∣β∗(h)a) if h∈(A×O′)∗ and S(β∗(h)ao)∈SFM0 otherwise
μD(o⊥∣ha):=(1/|O|)(1−∑o∈O(μD(oR∣ha)+μD(oI∣ha)))
r⋆(h):=⎧⎪ ⎪ ⎪ ⎪ ⎪⎨⎪ ⎪ ⎪ ⎪ ⎪⎩12r(λ) if h=λ12r(β∗(h)) if h∈(A×O′)∗, |h|>0 and h:|h|−1∈AOR1 if h∈(A×O′)∗, |h|>0 and h:|h|−1∈AOI0 if h∉(A×O′)∗
S⋆(h):={S(β∗(h)) if h∈(A×O′)∗⊥ otherwise
It is easy to see that υD is an O⋆-realization of MD with state function S⋆.
#Definition A.3
Consider M a catastrophe MDP. We define the catastrophe MDP ME as follows.
SFME:=SFM⊔{⊥}, SDME:=SDM, SCME:=∅.
AME=AM
TME=TMD
For any s∈SDM∪SFM, RME(s)=12RM(s).
RME(⊥)=0
Now, consider an interface I=(A,O) and a υ=(μ,r) which is an O-realization of a catastrophe MDP M with state function S. We define the I⋆-universe υE=(μE,r⋆) as follows
μE(oR∣ha):=⎧⎪⎨⎪⎩μ(o∣β∗(h)a) if h∈(A×O′)∗ and S(β∗(h)),S(β∗(h)ao)∈SDMμ(o∣β∗(h)a) if h∈(A×O′)∗ and S(β∗(h)ao)∈SFM0 otherwise
μE(oI∣ha):=0
μE(o⊥∣ha):=(1/|O|)(1−∑o∈OμE(oR∣ha))
It is easy to see that υE is an O⋆-realization of ME with state function S⋆.
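The constructions of Definitions A.2–A.3 can be sketched on tabular catastrophe MDPs (a minimal sketch under our reading of the transition rules, with a dict-based representation of our own devising):

```python
def make_MD(states, tag, actions, T, R):
    """Sketch of Definition A.2. tag[s] is 'D' (dangerous), 'F' (safe) or 'C' (corrupted);
    T[s][a] is a dict successor -> probability; R[s] is a reward in [0, 1].
    Corrupted successors (and, from safe states, dangerous successors) are redirected
    to a fresh absorbing state '⊥' with reward 0; dangerous-state rewards are halved,
    safe-state rewards are set to 1."""
    live = [s for s in states if tag[s] != "C"]
    TD = {"⊥": {a: {"⊥": 1.0} for a in actions}}
    for s in live:
        TD[s] = {}
        for a in actions:
            row, bad = {}, 0.0
            for t, p in T[s][a].items():
                if tag[t] == "C" or (tag[s] == "F" and tag[t] == "D"):
                    bad += p
                else:
                    row[t] = row.get(t, 0.0) + p
            if bad:
                row["⊥"] = bad
            TD[s][a] = row
    RD = {s: (0.5 * R[s] if tag[s] == "D" else 1.0) for s in live}
    RD["⊥"] = 0.0
    return TD, RD

def make_ME(states, tag, actions, T, R):
    """Definition A.3: same transitions as MD, but reward R/2 on all surviving states."""
    TE, _ = make_MD(states, tag, actions, T, R)
    RE = {s: 0.5 * R[s] for s in states if tag[s] != "C"}
    RE["⊥"] = 0.0
    return TE, RE
```

Note that the two constructions differ only in the reward function, mirroring TME=TMD above.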
Given h=a0o0a1o1…an−1on−1∈(A×O)n, we will use the notation
R∗h:=a0o0Ra1o1R…an−1on−1R∈(A×O′)n
Given an I⋆-policy π, the I-policy πR∗ is defined by πR∗(h):=π(R∗h).
In order to utilize condition iii of Definition 8, we need to establish the following relation between M♭ and MD, ME.
#Proposition A.2
Consider M a catastrophe MDP, some s∈SM∖SCM and π∗ a proper mitigation policy. Then
A0M♭π∗(s)∩A♯M(s)⊆A0MDπ∗(s)
A0M♭π∗(s)∩A♯M(s)⊆A0MEπ∗(s)
For the purpose of the proof, the following notation will be convenient
#Definition A.4
Consider S a finite set and some T:S→ΔS (equivalently, a row-stochastic matrix over S). We define T∞:S→ΔS by
T∞:=limn→∞1n∑n−1m=0Tm
As is well known, the limit above always exists.
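For example, for the two-state matrix that deterministically swaps the states, T^m alternates between T and the identity, so the plain limit of T^m does not exist, while the Cesàro average converges. A quick numerical check (names ours):

```python
import numpy as np

def cesaro_limit(T, n=10000):
    """Approximate T∞ = lim_n (1/n) Σ_{m<n} T^m for a row-stochastic matrix T."""
    acc = np.zeros_like(T)
    P = np.eye(T.shape[0])
    for _ in range(n):
        acc += P
        P = P @ T
    return acc / n

T = np.array([[0.0, 1.0],
              [1.0, 0.0]])   # period-2 chain: T^m alternates, the plain limit fails
Tinf = cesaro_limit(T)
```

The Cesàro average here is the matrix with all entries 1/2, and it is invariant under T.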
#Proof of Proposition A.2
Consider any s∈S∖SCM and a∈A0M♭π∗(s)∩A♯M(s). Since a∈A0M♭π∗(s), we have
E(T∞M♭π∗∘TM♭)(s,a)[RM♭]=ETM♭(s,a)[V0M♭π∗]=Q0M♭π∗(s,a)≥V0M♭π∗(s)=ET∞M♭π∗(s)[RM♭]
Let N be either of MD and ME. Since a∈A♯M(s), we get
E(T∞M♭π∗∘TN)(s,a)[RM♭]≥ET∞M♭π∗(s)[RM♭]
Since π∗ is a mitigation policy, we get
E(T∞Nπ∗∘TN)(s,a)[RM♭]≥ET∞Nπ∗(s)[RM♭]
Finally, since π∗ is proper, suppT∞Nπ∗(s)⊆SFM and supp(T∞Nπ∗∘TN)(s,a)⊆SFM. We conclude
Q0Nπ∗(s,a)=E(T∞Nπ∗∘TN)(s,a)[RN]≥ET∞Nπ∗(s)[RN]=V0Nπ∗(s)
Now we will establish a bound on the derivative of V by γ in terms of expected mitigation time, in order to demonstrate condition ii of Lemma A.1.
#Proposition A.3
Fix ¯τ,¯τF∈(0,∞). Consider a catastrophe MDP M and a proper mitigation policy π with expected mitigation time ¯τ. Assume that for any s∈SFM and γ∈(0,1)
∣∣∣dVMπ(s,γ)dγ∣∣∣≤¯τF
Then, for any s∈SM∖SCM and γ∈[0,1]
∣∣∣dVMπ(s,γ)dγ∣∣∣≤3¯τ+¯τF
Note that, since VMπ(s,γ) is a rational function of γ with no poles on the interval [0,1], some finite ¯τF always exists. Note also that Proposition A.3 is really about Markov chains rather than MDPs, but we don't make it explicit to avoid introducing more notation.
#Proof of Proposition A.3
Let μMπs∈ΔSωM be the Markov chain with transition matrix TMπ and initial state s. For any γ∈(0,1), we have
VMπ(s,γ)=Ex∼μMπs[(1−γ)∞∑n=0γnRM(xn)]
Given x∈SωM, we define τ(x)∈N⊔{∞} by
τ(x)=min{n∈N∣xn∈SFM}
It is easy to see that VMπ(s,γ) can be rewritten as
VMπ(s,γ)=Ex∼μMπs⎡⎣(1−γ)⎛⎝τ(x)−1∑n=0γnRM(xn)+γτ(x)1−γVMπ(xτ(x),γ)⎞⎠⎤⎦
The expression above is well defined because π is a proper mitigation policy and therefore τ(x) is finite with probability 1.
VMπ(s,γ)=Ex∼μMπs⎡⎣(1−γ)τ(x)−1∑n=0γnRM(xn)+γτ(x)VMπ(xτ(x),γ)⎤⎦
Let us decompose VMπ(s,γ) as VDMπ(s,γ)+VFMπ(s,γ) defined as follows
VDMπ(s,γ):=Ex∼μMπs⎡⎣(1−γ)τ(x)−1∑n=0γnRM(xn)⎤⎦
VFMπ(s,γ):=Ex∼μMπs[γτ(x)VMπ(xτ(x),γ)]
We have
dVDMπ(s,γ)dγ=Ex∼μMπs⎡⎣τ(x)−1∑n=0(−γn+(1−γ)⋅nγn−1)RM(xn)⎤⎦
∣∣ ∣∣dVDMπ(s,γ)dγ∣∣ ∣∣≤Ex∼μMπs⎡⎣τ(x)+(1−γ)τ(x)−1∑n=0nγn−1RM(xn)⎤⎦
The second term can be regarded as a weighted average (since 1/(1−γ)=∑∞n=0γn), where the maximal term in the average is at most τ(x)−1, hence
∣∣ ∣∣dVDMπ(s,γ)dγ∣∣ ∣∣≤2¯τ
Also, we have
dVFMπ(s,γ)dγ=Ex∼μMπs[τ(x)γτ(x)−1VMπ(xτ(x),γ)+γτ(x)dVMπ(xτ(x),γ)dγ]
∣∣ ∣∣dVFMπ(s,γ)dγ∣∣ ∣∣≤Ex∼μMπs[τ(x)+∣∣ ∣∣dVMπ(xτ(x),γ)dγ∣∣ ∣∣]=¯τ+Ex∼μMπs[∣∣ ∣∣dVMπ(xτ(x),γ)dγ∣∣ ∣∣]≤¯τ+¯τF
∣∣∣dVMπ(s,γ)dγ∣∣∣≤∣∣ ∣∣dVDMπ(s,γ)dγ∣∣ ∣∣+∣∣ ∣∣dVFMπ(s,γ)dγ∣∣ ∣∣≤3¯τ+¯τF
To transform the relative regret bounds for "auxiliary" universes obtained from Lemma A.1 to regret bounds for the original universes, we will need the following.
#Definition A.5
Fix δ∈(0,1) and a universe υ=(μ,r) which is an O-realization of a catastrophe MDP M with state function S. Let ~M be a δ-perturbation of M. An environment ~μ is said to be a δ-lift of ~M to μ when
i. (~μ,r) is an O-realization of ~M with state function S.
ii. hdom~μ⊆hdomμ
iii. For any h∈hdom~μ and a∈A, if S(h)∈SM∖SDM then μ(ha)=~μ(ha).
iv. For any h∈hdom~μ and a∈A, if S(h)∈SDM then there exists ζ∈ΔO s.t. μ(ha)=(1−δ)~μ(ha)+δζ
It is easy to see that such a lift always exists, for example we can take:
~μ(o∣ha):=T~M(S(hao)∣S(h),a)TM(S(hao)∣S(h),a)μ(o∣ha)
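For a single history ha, in the simple case where the successor state is a function of the next observation alone, the reweighting above is just an importance ratio applied within each state class (a sketch; all names are hypothetical):

```python
def lift_step(mu, state_of, T, T_tilde):
    """Reweight an observation distribution mu (dict o -> prob) by the ratio of
    perturbed to original state-transition probabilities, per the formula above.
    state_of maps each observation to the state it induces; T and T_tilde give the
    original and perturbed transition probabilities to each such state."""
    return {o: (T_tilde[state_of[o]] / T[state_of[o]]) * p for o, p in mu.items()}
```

By construction, the lifted distribution still sums to 1 and induces exactly the perturbed state-transition probabilities.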
#Proposition A.4
Consider γ,δ∈(0,1) s.t δ≥1−γ and ¯τ∈(0,∞). Let υ=(μ,r) be a universe which is an O-realization of a catastrophe MDP M with state function S. Suppose that π∗ is a mitigation policy for M that has expected mitigation time ¯τ. Consider some I⋆-policy π. Suppose that ~M is a δ-perturbation of M and ~μ is a δ-lift of ~M to μ. Denote ~υ:=(~μ,r). Then, there is some C∈(0,∞) that depends on nothing s.t
12(EUπ∗Sυ(γ)−EUπR∗υ(γ))≤EUπ∗S⋆~υE(γ)−EUπ~υE(γ)+Cδ⎛⎜ ⎜⎝¯τ+max(EUπ∗S⋆~υD(γ)−EUπ~υD(γ),0)1−γ⎞⎟ ⎟⎠
In order to prove Proposition A.4, we need a relative regret bound for M derived from a relative regret bound for ME.
#Proposition A.5
Fix an interface I and an I-universe υ=(μ,r) which is an O-realization of a catastrophe MDP M with state function S s.t. S(λ)∉SCM. Suppose that π∗ is a mitigation policy for M. Let π be any I⋆-policy. Then, for any γ∈(0,1)
12(EUπ∗Sυ(γ)−EUπR∗υ(γ))≤EUπ∗S⋆υE(γ)−EUπυE(γ)
#Proof of Proposition A.5
π∗ is a mitigation policy, therefore for any s∈SM∖SCM, TMEπ∗(s)=TMπ∗(s). It follows that
EUπ∗S⋆υE(γ)=12EUπ∗Sυ(γ)
Also, it is easy to see from the definition of TME and r⋆ that
EUπυE(γ)≤12EUπR∗υ(γ)
Indeed, any discrepancy between the behavior of ME and M involves transition to the state ⊥ which yields 0 reward forever. Subtracting these inequalities, we get the desired result.
Another observation we need to prove Proposition A.4 is a bound on the effect of δ-perturbations in terms of mitigation time.
#Proposition A.6
Consider γ,δ∈(0,1), a universe υ=(μ,r) which is an O-realization of a catastrophe MDP M with state function S, and some I⋆-policy π. Assume that for any h∈S−1(SFM) and a∈suppπ(h), TM(SFM∣S(h),a)=1. Let ~M be a δ-perturbation of M and ~μ a δ-lift of ~M to μ. Then,
dtv(μ⋈π, ~μ⋈π)≤2(1−Ex∼~μ⋈π[(1−δ)τS(x)])
#Proof of Proposition A.6
It is straightforward to construct a probability space (Ω,P), X:Ω→(A×O)ω measurable and {Cn⊆Ω}n∈N measurable s.t.
i. h∗(P)=μ⋈π
ii. For any n∈N and h∈(A×O)n s.t. S(h)∈SDM:
Pr[Cn∣X:n=h]=δ
iii. For any n∈N and h∈(A×O)n s.t. S(h)∈SM∖SDM:
Pr[Cn∣X:n=h]=0
iv. For any n∈N, h∈(A×O)n and h′∈(A×O)n+1:
Pr[X:n+1=h′∣X:n=h and not Cn]=Prx∼~μ⋈π[x:n+1=h′∣x:n=h]
Denote D:=∩∞n=0Ω∖Cn. We have
P(D)=Ex∼~μ⋈π[(1−δ)τS(x)]
dtv(P, P∣D)≤1−P(D)
dtv(μ⋈π, h∗(P∣D))≤1−P(D)
Also, it is easy to see that for any A⊆(A×O)ω measurable
h∗(P∣D)(A)=Ex∼~μ⋈π[(1−δ)τS(x)χA]P(D)
It follows that
dtv(h∗(P∣D),~μ⋈π)=12Ex∼~μ⋈π[∣∣ ∣∣(1−δ)τS(x)P(D)−1∣∣ ∣∣]
dtv(h∗(P∣D),~μ⋈π)=12P(D)Ex∼~μ⋈π[∣∣(1−δ)τS(x)−P(D)∣∣]
Using the fact that (1−δ)τS(x)∈[0,1] and the convexity of the function |t−P(D)|
dtv(h∗(P∣D),~μ⋈π)≤12P(D)Ex∼~μ⋈π[(1−δ)τS(x)(1−P(D))+(1−(1−δ)τS(x))P(D)]
dtv(h∗(P∣D),~μ⋈π)≤12P(D)(P(D)(1−P(D))+(1−P(D))P(D))=1−P(D)
Using the triangle inequality, we conclude
dtv(μ⋈π, ~μ⋈π)≤dtv(μ⋈π, h∗(P∣D))+dtv(h∗(P∣D),~μ⋈π)=2(1−Ex∼~μ⋈π[(1−δ)τS(x)])
As a final ingredient towards the proof of Proposition A.4, we will need to use the relative regret bound for MD to get a certain statistical bound on mitigation time.
#Definition A.6
Let μ be any environment. We define the closed set hdomωμ⊆(A×O)ω by
hdomωμ:={x∈(A×O)ω∣∀n∈N:x:n∈hdomμ}
Consider a universe υ=(μ,r) which is an O-realization of a catastrophe MDP M with state function S. We define the measurable function τS:hdomωμ→N⊔{∞} as follows
τS(x):=min{n∈N∣S(x:n)∈SFM}
#Proposition A.7
Fix an interface I and an I-universe υ=(μ,r) which is an O-realization of a catastrophe MDP M with state function S s.t. S(λ)∈SDM. Suppose that π∗ is a mitigation policy for M that has the expected mitigation time ¯τ∈(0,∞). Let π be any I⋆-policy. Then, there is C∈(0,∞) that depends on nothing s.t. for any γ,δ∈(0,1), if δ≥1−γ then
EμD⋈π[(1−δ)τS⋆]≥1−Cδ⎛⎜ ⎜⎝¯τ+max(EUπ∗S⋆υD(γ)−EUπυD(γ),0)1−γ⎞⎟ ⎟⎠
#Proof of Proposition A.7
For any x∈hdomμD, we have
Ur⋆γ(x)≤(1−γ)⎛⎝12τS⋆(x)−1∑n=0γn+∞∑n=τS⋆(x)γn⎞⎠=(1−γ)(12⋅1−γτS⋆(x)1−γ+γτS⋆(x)1−γ)=12+12γτS⋆(x)
It follows that
EUπυD(γ)≤12+12EμD⋈π[γτS⋆]
If x∈hdomωμD is s.t. for all n∈N, S⋆(x:n)≠⊥, then
Ur⋆γ(x)≥(1−γ)∞∑n=τS⋆(x)γn=γτS⋆(x)
Since π∗ is a mitigation policy, it follows that
EUπ∗S⋆υD(γ)≥EμD⋈π∗S⋆[γτS⋆]≥γ¯τ
Subtracting the two inequalities, we get
EUπ∗S⋆υD(γ)−EUπυD(γ)≥γ¯τ−12−12EμD⋈π[γτS⋆]
EμD⋈π[γτS⋆]≥2γ¯τ−2(EUπ∗S⋆υD(γ)−EUπυD(γ))−1
Denote ρ:=max(EUπ∗S⋆υD(γ)−EUπυD(γ),0). By choosing C sufficiently large, we can assume without loss of generality that the right hand side is positive since, unless γ¯τ≈1, we would have Cδ¯τ≥C(1−γ)¯τ>1, and unless ρ≪1, we would have Cδρ1−γ≥Cρ>1. In either case, the inequality we are trying to prove would hold. Also, note that ln(1−δ)lnγ≥1. We get
EμD⋈π[(1−δ)τS⋆]=EμD⋈π[γτS⋆ln(1−δ)lnγ]≥EμD⋈π[γτS⋆]ln(1−δ)lnγ≥(2γ¯τ−2ρ−1)ln(1−δ)lnγ=(1−δ)ln(2γ¯τ−2ρ−1)lnγ
By the same reasoning as before, we can assume without loss of generality that e.g. γ¯τ≤1−12(1−γ)¯τ. It follows that
2γ¯τ−2ρ−1≤2−(1−γ)¯τ−2ρ−1=1−(1−γ)¯τ−2ρ≤1−(1−γ)=γ
ln(2γ¯τ−2ρ−1)lnγ≥1
Combining this with the previous inequality implies
EμD⋈π[(1−δ)τS⋆]≥1−δln(2γ¯τ−2ρ−1)lnγ
It is easy to see that there is x0∈(0,1) s.t. for any x∈[x0,1], 2x−1≥x3 and therefore 2x−1x3≥1. Therefore, for any such x and ρ≪1, e−Cρ≤2x−1x3−2ρx30≤2x−2ρ−1x3, where it is sufficient to assume that C>2x30. Taking x=γ¯τ, we conclude (assuming C≥3 and observing that γ11−γ≤e−1)
γCρ1−γ≤e−Cρ≤2γ¯τ−2ρ−1γ3¯τ≤2γ¯τ−2ρ−1γC¯τ
γC(¯τ+ρ1−γ)≤2γ¯τ−2ρ−1
Taking logarithm of both sides
C(¯τ+ρ1−γ)lnγ≤ln(2γ¯τ−2ρ−1)
C(¯τ+ρ1−γ)≥ln(2γ¯τ−2ρ−1)lnγ
Combining with the inequality we had before, we get
EμD⋈π[(1−δ)τS⋆]≥1−Cδ(¯τ+ρ1−γ)
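The key analytic step above, EμD⋈π[γτS⋆]^{ln(1−δ)/lnγ}≤EμD⋈π[(1−δ)τS⋆], is Jensen's inequality for x↦x^c with c=ln(1−δ)/lnγ≥1 (guaranteed by δ≥1−γ). A numerical sanity check on an arbitrary finitely supported distribution of mitigation times (names ours):

```python
import math

def jensen_step_holds(gamma, delta, taus, weights):
    """Check E[(1-delta)^tau] >= (E[gamma^tau])^c with c = ln(1-delta)/ln(gamma),
    for a finitely supported distribution over mitigation times tau.
    Requires delta >= 1 - gamma, which forces c >= 1."""
    c = math.log(1 - delta) / math.log(gamma)
    lhs = sum(w * (1 - delta) ** t for t, w in zip(taus, weights))
    rhs = sum(w * gamma ** t for t, w in zip(taus, weights)) ** c
    return c >= 1 and lhs >= rhs - 1e-12
```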
#Proof of Proposition A.4
By Proposition A.5, we have
12(EUπ∗Sυ(γ)−EUπR∗υ(γ))≤EUπ∗S⋆υE(γ)−EUπυE(γ)
Note that ~ME is a δ-perturbation of ME and ~μE is a δ-lift of ~ME to μE. The condition of Proposition A.6 holds tautologically due to Definition A.3. Therefore, we can apply Proposition A.6 and get
12(EUπ∗Sυ(γ)−EUπR∗υ(γ))≤EUπ∗S⋆~υE(γ)−EUπ~υE(γ)+2(2−E~μE⋈π∗S⋆[(1−δ)τS⋆]−E~μE⋈π[(1−δ)τS⋆])
The only difference between ~μE and ~μD is the appearance of R instead of I. Therefore, we can rewrite the above as
12(EUπ∗Sυ(γ)−EUπR∗υ(γ))≤EUπ∗S⋆~υE(γ)−EUπ~υE(γ)+2(2−E~μD⋈π∗S⋆[(1−δ)τS⋆]−E~μD⋈π[(1−δ)τS⋆])
Applying Proposition A.7 to each of the last two terms, we get
12(EUπ∗Sυ(γ)−EUπR∗υ(γ))≤EUπ∗S⋆~υE(γ)−EUπ~υE(γ)+4Cδ⎛⎜ ⎜⎝2¯τ+max(EUπ∗S⋆~υD(γ)−EUπ~υD(γ),0)1−γ⎞⎟ ⎟⎠
The following definition will be useful in order to apply Proposition A.4.
#Definition A.7
Consider M a catastrophe MDP s.t. AM=A and a policy σ:SM→ΔA. We then define the catastrophe MDP M[σ] as follows:
SFM[σ]:=SFM, SDM[σ]:=SDM, SCM[σ]:=SCM.
AM[σ]:=¯A
For any s∈SM and a∈A: TM[σ](s,a):=TM(s,a).
For any s∈SM: TM[σ](s,⊥):=TMσ(s).
RM[σ]:=RM
Now consider υ an O-realization of M with state function S. Then, ¯υ[σS] is clearly an ¯O-realization of M[σ] with the state function ¯S defined by ¯S(h):=S(h––).
Note also that M[σ]D≅MD[σ] and M[σ]E≅ME[σ] (where interpreting σ as a policy for MD or ME requires choosing an arbitrary value for the state ⊥). Moreover, ¯I⋆≅¯¯¯¯¯¯I⋆, ¯υ[σS]D=¯¯¯¯¯¯υD[σS⋆], ¯υ[σS]E=¯¯¯¯¯¯υE[σS⋆] and ¯S⋆=¯¯¯¯¯¯S⋆.
Finally, we are ready to prove the main theorem.
#Proof of Theorem 1
For every k∈[N], let ~Mk and ~σk denote the δ-perturbations of Mk and σk respectively, and let πk be the deterministic proper mitigation policy for ~Mk of Definition 8. Let ~μk be a δ-lift of ~Mk to μk and denote ~υk:=(~μk,rk). Define H:={~υkD,~υkE}k∈[N]. Observe that ~σk is ϵ-sane relative to πk in the sense of both ~MkD and ~MkE: condition i of Definition A.1 follows by Proposition A.2 from conditions ii and iii of Definition 8, and condition ii of Definition A.1 follows from condition iv of Definition 8. Moreover, by Proposition A.3, we have
∣∣ ∣∣dV~MDπk(s,γ)dγ∣∣ ∣∣=O(¯τ)
∣∣ ∣∣dV~MEπk(s,γ)dγ∣∣ ∣∣=O(¯τ)
Here, we used that MFk is fixed (and thus so is ¯τF, by conditions i-iii).
Condition iv implies that all the universes in H have a common reward function (notice that transition to a corrupted state induces the observation ⊥ whereas transition to a state in SFM in the universe ~υkD induces the observation I). Therefore, we can use Lemma A.1 to conclude that there exists an ¯¯¯¯¯¯I⋆-policy π♯ s.t.
EUπkSk⋆~υkD(1−α)−EUπ♯¯¯¯¯¯¯¯¯~υkD[~σkSk⋆](1−α)≤O((¯τα)1/4)
EUπkSk⋆~υkE(1−α)−EUπ♯¯¯¯¯¯¯¯~υkE[~σkSk⋆](1−α)≤O((¯τα)1/4)
It is easy to see that ~Mk[~σk] is a 2δ-perturbation of Mk[σk]. Observe also that ¯¯¯¯¯¯¯¯~υkD[~σkSk⋆]=¯¯¯¯¯~υk[~σkSk]D, ¯¯¯¯¯¯¯¯~υkE[~σkSk⋆]=¯¯¯¯¯~υk[~σkSk]E and ¯¯¯¯¯¯~μk[~σkSk] is a 2δ-lift of ~Mk[~σk] to ¯¯¯¯¯¯μk[σkSk]. Applying Proposition A.4, we get
12(EUπkSkυk(1−α)−EUπ♯R∗¯υk[σkSk](1−α))≤O((¯τα)1/4+δ(¯τ+(¯τα)1/4α))
Setting π∗:=π♯R∗ we get
EUπkSkυk(1−α)−EUπ∗¯υk[σkSk](1−α)≤O(max(δα,1)⋅(¯τα)1/4+¯τδ)
By condition i of Definition 8, this implies
EU∗υk(1−α)−EUπ∗¯υk[σkSk](1−α)≤O(max(δα,1)⋅(¯τα)1/4+¯τδ)
##Appendix B
Given p=(a,o)∈A×O, we denote pA:=a, pO:=o.
#Proposition B.1
Consider a universe υ=(μ,r) which is an O-realization of an MDP M with state function S, a stationary policy π∗:SM→ΔAM, an arbitrary I-policy π0 and some γ∈(0,1). Then,
EUπ∗Sυ(γ)−EUπ0υ(γ)=∞∑n=0γnEx∼μ⋈π0[VMπ∗(S(x:n),γ)−QMπ∗(S(x:n),xAn,γ)]
#Proof of Proposition B.1
For the sake of encumbering the notation less, we will omit the parameter γ in functions that depend on it. We will use S implicitly, i.e. given F a function on SM and h∈hdomμ, F(h):=F(S(h)). Finally, we will omit Mπ∗, using the shorthand notations V:=VMπ∗, Q:=QMπ∗.
For any x∈hdomωμ, it is easy to see that
EUπ∗Sυ=V(λ)=∞∑n=0γn(V(x:n)−γV(x:n+1))
Ur(x)=(1−γ)∞∑n=0γnr(x:n)
EUπ∗Sυ−Ur(x)=∞∑n=0γn(V(x:n)−(1−γ)r(x:n)−γV(x:n+1))
EUπ∗Sυ−Ur(x)=∞∑n=0γn(V(x:n)−Q(x:n,xAn)+Q(x:n,xAn)−(1−γ)r(x:n)−γV(x:n+1))
Taking expected value over x, we get
EUπ∗Sυ−EUπ0υ=∞∑n=0γn(Eμ⋈π0[V(x:n)−Q(x:n,xAn)]+Eμ⋈π0[Q(x:n,xAn)−(1−γ)r(x:n)−γV(x:n+1)])
It is easy to see that the second term vanishes, yielding the desired result.
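Proposition B.1 is a form of the classic performance-difference identity; it can be verified numerically on a toy two-state MDP with the document's normalization V=(1−γ)E[∑γnR] (all concrete numbers below are arbitrary):

```python
import numpy as np

gamma = 0.9
R = np.array([0.2, 1.0])                     # state rewards
T = np.array([[[0.9, 0.1], [0.5, 0.5]],     # T[s, a] = distribution over next states
              [[0.3, 0.7], [0.1, 0.9]]])
pi_star = np.array([1, 1])                   # deterministic stationary policies
pi_0 = np.array([0, 0])

def value(policy):
    """Exact V for a stationary policy: V = (1-γ)R + γ P V."""
    P = np.array([T[s, policy[s]] for s in range(2)])
    return np.linalg.solve(np.eye(2) - gamma * P, (1 - gamma) * R)

V = value(pi_star)
Q = (1 - gamma) * R[:, None] + gamma * T @ V          # Q[s, a] for pi_star

# right-hand side of Proposition B.1 under pi_0, truncated far beyond the horizon
P0 = np.array([T[s, pi_0[s]] for s in range(2)])
rhs, d = 0.0, np.array([1.0, 0.0])                    # start in state 0
for n in range(2000):
    rhs += gamma ** n * d @ (V - Q[np.arange(2), pi_0])
    d = d @ P0

lhs = V[0] - value(pi_0)[0]                           # EU^{pi*} - EU^{pi_0} from state 0
```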
#Proposition B.2
Consider some τ∈(0,∞), T∈N+, a universe υ=(μ,r) that is an O-realization of an MDP M with state function S, a stationary policy π∗:SM→ΔAM and an arbitrary I-policy π0. For any n∈N, let π∗n be an I-policy s.t. for any h∈hdomμ
π∗n(h):={π0(h) if |h|<nTπ∗(h) otherwise
Assume that
i. For any h∈hdomμ suppπ0(h)⊆A0Mπ∗(S(h))
ii. For any s∈SM and γ∈(0,1) ∣∣∣dVMπ∗(s,γ)dγ∣∣∣≤τ
Then, for any γ∈(0,1),
EUπ∗υ(γ)−EUπ0υ(γ)≤(1−γ)∞∑n=0T−1∑m=0γnT+m(Ex∼μ⋈π∗n[r(x:nT+m)]−Ex∼μ⋈π0[r(x:nT+m)])+2τγT(1−γ)1−γT
#Proof of Proposition B.2
For the sake of encumbering the notation less, we will use S implicitly, i.e. given F a function on SM and h∈hdomμ, F(h):=F(S(h)). Also, we will omit Mπ∗, using the shorthand notations V:=VMπ∗, Q:=QMπ∗.
By Proposition B.1, for any l∈N
EUπ∗υ(γ)−EUπ∗lυ(γ)=∞∑n=0γnEx∼μ⋈π∗l[V(x:n,γ)−Q(x:n,xAn,γ)]
π∗l coincides with π∗ after lT, therefore the corresponding terms vanish (since Q(s,π∗(s),γ)=V(s,γ)); before lT it coincides with π0, so the remaining expectations can be taken w.r.t. μ⋈π0.
EUπ∗υ(γ)−EUπ∗lυ(γ)=lT−1∑n=0γnEx∼μ⋈π0[V(x:n,γ)−Q(x:n,xAn,γ)]
Subtracting the equalities for l+1 and l, we get
EUπ∗lυ(γ)−EUπ∗l+1υ(γ)=(l+1)T−1∑n=lTγnEx∼μ⋈π0[V(x:n,γ)−Q(x:n,xAn,γ)]
(1−γ)∞∑n=0γn(Ex∼μ⋈π∗l[r(x:n)]−Ex∼μ⋈π∗l+1[r(x:n)])=(l+1)T−1∑n=lTγnEx∼μ⋈π0[V(x:n,γ)−Q(x:n,xAn,γ)]
π∗l and π∗l+1 coincide until lT, therefore
(1−γ)∞∑n=lTγn(Ex∼μ⋈π∗l[r(x:n)]−Ex∼μ⋈π∗l+1[r(x:n)])=(l+1)T−1∑n=lTγnEx∼μ⋈π0[V(x:n,γ)−Q(x:n,xAn,γ)]
Denote ρ∗l:=μ⋈π∗l, ρ0:=μ⋈π0. We also use the shorthand notations rn:=r(x:n), Vn(γ):=V(x:n,γ), Qn(γ):=Q(x:n,xAn,γ). Both π∗l and π∗l+1 coincide with π∗ after (l+1)T, therefore
(1−γ)(l+1)T−1∑n=lTγn(Eρ∗l[rn]−Eρ0[rn])+γ(l+1)T(Eρ∗l[V(l+1)T(γ)]−Eρ0[V(l+1)T(γ)])=(l+1)T−1∑n=lTγnEρ0[Vn(γ)−Qn(γ)]
Denote V′(s,γ):=dV(s,γ)dγ. By the mean value theorem, for each s∈SM there is γ∗∈(γ,1) s.t.
V(s,γ)=V0(s)−V′(s,γ∗)⋅(1−γ)
V0(s)−τ(1−γ)≤V(s,γ)≤V0(s)+τ(1−γ)
It follows that
(1−γ)(l+1)T−1∑n=lTγnEρ∗l−ρ0[rn]+γ(l+1)T(Eρ∗l−ρ0[V0(l+1)T]+2τ(1−γ))≥(l+1)T−1∑n=lTγnEρ0[Vn(γ)−Qn(γ)]
Here, an expected value w.r.t. the difference of two probability measures is understood to mean the corresponding difference of expected values.
It is easy to see that assumption i implies that V0n is a submartingale for ρ0 (whereas it is a martingale for μ⋈π∗) and therefore
Eρ∗l−ρ0[V0(l+1)T]≤0
We get
(1−γ)(l+1)T−1∑n=lTγnEρ∗l−ρ0[rn]+2τγ(l+1)T(1−γ)≥(l+1)T−1∑n=lTγnEρ0[Vn(γ)−Qn(γ)]
Summing over l, we get
(1−γ)∞∑l=0(l+1)T−1∑n=lTγnEρ∗l−ρ0[rn]+2τγT(1−γ)1−γT≥∞∑n=0γnEρ0[Vn(γ)−Qn(γ)]
Applying Proposition B.1 to the right hand side
(1−γ)∞∑l=0(l+1)T−1∑n=lTγnEρ∗l−ρ0[rn]+2τγT(1−γ)1−γT≥EUπ∗υ(γ)−EUπ0υ(γ)
#Proof of Lemma A.1
Fix γ∈(0,1), η∈(0,1/N) and T∈N+. Denote νk:=¯μk[σkSk]. To avoid cumbersome notation, whenever Mk should appear as a subscript, we will replace it by k. Let (Ω,P∈ΔΩ) be a probability space, let K:Ω→[N] be a random variable and let the following be stochastic processes
Zn,~Zn:Ω→Δ[N]
Jn:Ω→[N]
Ψn:Ω→A
An:Ω→¯A
Θn:Ω→¯O
We also define AΘ:n:Ω→¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯A×O∗ by
AΘ:n:=A0Θ0A1Θ1…An−1Θn−1
(The following conditions on A and Θ imply that the range of the above is indeed in ¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯A×O∗.) Let D and D!k be as in Proposition C.1 (we assume w.l.o.g. that ϵ<1/|A|). We construct Ω, K, Z, ~Z, J, Ψ, A and Θ s.t. K is uniformly distributed and for any k∈[N], l∈N, m∈[T] and o∈O, denoting n=lT+m
~Z0(k)≡1N
Zn(k)=~Zn(k)[[~Zn(k)≥η]]∑N−1j=0~Zn(j)[[~Zn(j)≥η]]
Pr[Jl=k∣ZlT]=ZlT(k)
Ψn=πJl(AΘ:n)
Pr[Θn=o∣AΘ:n]=νK(o∣AΘ:n)
An=D(AΘ:n,Ψn)
~Zn+1(k)N−1∑j=0Zn(j)[[An=D!j(AΘ:n,Ψn)]]νj(Θn∣AΘ:nAn)=Zn(k)[[An=D!k(AΘ:n,Ψn)]]νk(Θn∣AΘ:nAn)
Note that the last equation has the form of a Bayesian update, which is allowed to be arbitrary when the update is on "impossible" information.
We now construct the ¯I-policy π∗ s.t. for any n∈N, h∈¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯A×O∗ s.t. Pr[AΘ:n=h]>0 and a∈¯A
π∗(a∣h):=Pr[An=a∣AΘ:n=h]
That is, we perform Thompson sampling at time intervals of size T, moderated by the delegation routine D, and discard from our belief state hypotheses whose probability is below η as well as hypotheses whose sampling resulted in recommending "unsafe" actions, i.e. actions that D refused to perform.
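A minimal sketch of this belief dynamic, stripped of the delegation routine D and the advisor (the bandit-style environments and all names are ours; the features kept are resampling every T steps, Bayesian updates, and the η-threshold):

```python
import random

def thompson_with_threshold(envs, true_k, T=10, eta=0.001, steps=300, seed=0):
    """Thompson sampling over N hypotheses (dicts action -> Bernoulli reward prob).
    Resample a hypothesis from the belief Z every T steps, act greedily for it,
    update Z on each observation, and discard hypotheses with weight below eta."""
    rng = random.Random(seed)
    N = len(envs)
    Z = [1.0 / N] * N
    j = 0
    for n in range(steps):
        if n % T == 0:
            j = rng.choices(range(N), weights=Z)[0]   # Thompson resampling
        a = max(envs[j], key=envs[j].get)             # greedy action for hypothesis j
        o = rng.random() < envs[true_k][a]            # observation from the true env
        Z = [z * (envs[k][a] if o else 1.0 - envs[k][a]) for k, z in enumerate(Z)]
        s = sum(Z)
        Z = [z / s for z in Z]
        Z = [z if z >= eta else 0.0 for z in Z]       # discard low-weight hypotheses
        s = sum(Z)                                    # with N*eta < 1 this stays positive
        Z = [z / s for z in Z]
    return Z
```

With a few distinguishable hypotheses, the belief concentrates on the true one, while discarded hypotheses never return (matching the one-way nature of the discard in the construction above).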
In order to prove π∗ has the desired property, we will define the stochastic processes Z!, ~Z!, J!, Ψ!, A! and Θ!, each process of the same type as its shriekless counterpart (thus Ω is constructed to accommodate them). These processes are required to satisfy the following:
~Z!0(k)≡1N
Z!n(k)=~Z!n(k)[[~Z!n(k)≥η]]∑N−1j=0~Z!n(j)[[~Z!n(j)≥η]][[~Z!n(K)≥η]]+[[K=k]]⋅[[~Z!n(K)<η]]
Pr[J!l=k∣Z!lT]=Z!lT(k)
Ψ!n=πJ!l(AΘ!:n)
Pr[Θ!n=o∣AΘ!:n]=νK(o∣AΘ!:n)
A!n=D!K(AΘ!:n,Ψ!n)
~Z!n+1(k)=Z!n(k)[[A!n=D!k(AΘ!:n,Ψ!n)]]νk(Θ!n∣AΘ!:nA!n)∑N−1j=0Z!n(j)[[A!n=D!j(AΘ!:n,Ψ!n)]]νj(Θ!n∣AΘ!:nA!n)
For any k∈[N], we construct the ¯I-policy π?k s.t. for any n∈N, h∈¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯A×O∗ s.t. Pr[AΘ!:n=h, K=k]>0 and a∈¯A
π?k(a∣h):=Pr[A!n=a∣AΘ!:n=h, K=k]
Given any ¯I-policy π and I-policy σ, we define ασπ by
ασπ(g∣h):=[[h=g–]]Ch|h|−1∏n=0∑a∈A([[gn∈⊥aO]]π(⊥∣g:n)σ(a∣h:n)+[[gn∈a⊥O]]π(a∣g:n))
Here, Ch∈R is a constant defined s.t. the probabilities sum to 1. We define the I-policy [σ]π–– by
[σ]π––(a∣h):=Prg∼ασπ(h)[π(g)=a∨(π(g)=⊥∧σ(h)=a)]
Condition iii of Proposition C.1 and condition i of Definition A.1 imply that for any h∈hdomμk
supp[σk]π––?k(h)⊆A0Mkπk(Sk(h))
This means we can apply Proposition B.2 and get
EUπkSkυk(γ)−EUπ?k¯υk[σkSk](γ)≤(1−γ)∞∑n=0T−1∑m=0γnT+m(Ex∼μk⋈π∗kn[r(x:nT+m)]−Ex∼νk⋈π?k[r(x––:nT+m)])+2¯τγT(1−γ)1−γT
Here, the I-policy π∗kn is defined as π∗n in Proposition B.2. We also define the ¯I-policies π!kn and π!!kn by
π!kn(a∣h):={π?k(a∣h) if |h|<nTPr[A!|h|=a∣AΘ!:|h|=h, K=k, J!n=k] otherwise
π!!kn(a∣h):=⎧⎪⎨⎪⎩π?k(a∣h) if |h|<nTπ!kn(a∣h)+π!kn(⊥∣h)⋅π∗kn(a∣h––) if |h|≥nT and a≠⊥0 if |h|≥nT and a=⊥
Denote
ρ∗kn:=μk⋈π∗kn
ρ!!kn:=νk⋈π!!kn
ρ!kn:=νk⋈π!kn
ρ?k:=νk⋈π?k
R?k:=EUπkSkυk(γ)−EUπ?k¯υk[σkSk](γ)
For each n∈N, denote
EU∗kn(γ):=1−γ1−γTT−1∑m=0γmEx∼ρ∗kn[r(x:nT+m)]
EU!!kn(γ):=1−γ1−γTT−1∑m=0γmEx∼ρ!!kn[r(x––:nT+m)]
EU!kn(γ):=1−γ1−γTT−1∑m=0γmEx∼ρ!kn[r(x––:nT+m)]
EU?kn(γ):=1−γ1−γTT−1∑m=0γmEx∼ρ?k[r(x––:nT+m)]
We have
R?k≤(1−γT)∞∑n=0γnT(EU∗kn(γ)−EU?kn(γ))+2¯τγT(1−γ)1−γT
R?k≤(1−γT)∞∑n=0γnT(EU∗kn(γ)−EU!!kn(γ)+EU!!kn(γ)−EU!kn(γ)+EU!kn(γ)−EU?kn(γ))+2¯τγT(1−γ)1−γT
Condition iv of Proposition C.1 and condition ii of Definition A.1 imply that, given h∈hdomνk s.t. |h|≥nT
suppπ!kn(h)⊆{πk(Sk(h)),⊥}
π!!kn(πk(Sk(h))∣h)=1
Therefore, π!!kn=π∗kn, and we remain with
R?k≤(1−γT)∞∑n=0γnT(EU!!kn(γ)−EU!kn(γ)+EU!kn(γ)−EU?kn(γ))+2¯τγT(1−γ)1−γT
We have
EU!!kn(γ)−EU!kn(γ)≤Prx∼ρ!kn[∃m∈[T]:xnT+m∈⊥¯O]
Since Z!nT(K)≥η, it follows that
EU!!kn(γ)−EU!kn(γ)≤1ηPrx∼ρ?k[∃m∈[T]:xnT+m∈⊥¯O]≤1ηEx∼ρ?k[∣∣{m∈[T]∣xnT+m∈⊥¯O}∣∣]
∞∑n=0(EU!!kn(γ)−EU!kn(γ))≤1ηEx∼ρ?k[∣∣{n∈N∣xn∈⊥¯O}∣∣]
Using condition i of Proposition C.1, we conclude
R?k≤(1−γT)∞∑n=0γnT(EU!kn(γ)−EU?kn(γ))+O(1−γTη2+¯τ(1−γ)1−γT)
Define the random variables {U!n:Ω→[0,1]}n∈N by
U!n:=1−γ1−γTT−1∑m=0γmr(AΘ!:nT+m)
Averaging the previous inequality over k, we get
$$\frac{1}{N}\sum_{k=0}^{N-1}R^{?k} \leq (1-\gamma^T)\sum_{n=0}^\infty \gamma^{nT}\,\mathrm{E}\left[\mathrm{E}\left[U^!_n \mid J^!_n = K,\ Z^!_{nT}\right]-\mathrm{E}\left[U^!_n \mid Z^!_{nT}\right]\right]$$