I've personally had a hard time accepting the 'seeking power' terminology used in some of the discussion around instrumental convergence. I can't ask my teachers about power or instrumental convergence. However, I do often ask questions about Rademacher complexity since I study machine learning algorithms. This is effectively a rough-draft articulation of why I do that.

## Background

Instrumental convergence is the idea that there are states or strategies that are somehow broadly applicable to achieving a broad set of rewards. A somewhat canonical example might be the paper-clip maximizer.

At a high level, paper-clip maximizers seem like a potential threat. First, we argue that some smaller collection of strategies would be useful for achieving a larger set of goal states. Second, we observe that this smaller collection of states/strategies is somehow 'bad'. In our example, this might look like turning Earth into a factory to produce paper-clips.

Of course, without a proper grounding of the various terms it's easy to overreach and prove too much. Formalizations of instrumental convergence have been previously proposed. Since optimal policies always converge to a limiting cycle on the graph, work on maximum mean-weight cycles is also relevant. Mathieu and Wilson study the distribution of such returns on complete graphs. In our context, they show the limiting cycles with high return have a constant-order length Θ(1), which means that the length does not scale with the size of the graph. A subsequent work clarifies that, with constant probability, the length is constant or otherwise has length O((log n)³).

Turner et al. mathematically formalize the relevant notions for AI and then propose something more: that optimal policies tend to seek power. Power is defined as the expected value of a state over a distribution of reward functions,
$$\mathrm{POWER}(s) \sim \mathbb{E}\big[\langle \rho_s, r\rangle \;\big|\; \rho_s \text{ is a valid occupancy measure for } s\big]$$
A useful example is to think of having an agent precommit between limiting cycles by choosing its initial state. If the MDP has n states s ∈ S and is deterministic, then we may assign a cycle function C : S → ℕⁿ, where C(s) records the number of cycles of each length reachable from s. It's clear enough that if one state s includes at least as many cycles of any given length as s′, then the power of s is at least that of s′. We can generalize this slightly,

Proposition: Let ΔC(s,s′)_i = C(s)_i − C(s′)_i, and suppose the rewards are iid. Then POWER(s) ≥ POWER(s′) whenever
$$\sum_{i=1}^{j} \Delta C(s,s')_i \ge 0, \quad \forall j \in \{1,\dots,n\},$$
with strict inequality whenever ΔC(s,s′)_i > 0 for some i ∈ {1,…,n}. This is just saying that shorter cycles are more valuable than longer cycles, in the sense that any cycle of length L can substitute for any cycle of length K ≥ L.
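To make the proposition concrete, here's a small Monte Carlo sketch (my own illustration, not from the proposition's proof): it estimates a state's power as the expected best mean reward over its available limiting cycles, under iid standard-normal rewards and the simplifying assumption that the cycles are disjoint.

```python
import numpy as np

rng = np.random.default_rng(0)

def estimated_power(cycle_counts, n_samples=20000):
    """Estimate E[max over cycles of the cycle's mean reward] by Monte Carlo.

    cycle_counts[i] is the number of cycles of length i+1 reachable from the
    state. Simplifying assumption: cycles are disjoint, so their rewards are
    independent draws.
    """
    best = np.full(n_samples, -np.inf)
    for length_idx, count in enumerate(cycle_counts):
        L = length_idx + 1
        for _ in range(count):
            # mean of L iid standard-normal rewards for this cycle, per sample
            cycle_mean = rng.standard_normal((n_samples, L)).mean(axis=1)
            best = np.maximum(best, cycle_mean)
    return best.mean()

# s has one 1-cycle and one 2-cycle; s' has two 2-cycles. The prefix sums of
# Delta C are nonnegative, so the proposition predicts POWER(s) >= POWER(s').
p_s = estimated_power([1, 1])
p_s_prime = estimated_power([0, 2])
```

With these profiles the estimate for s comes out above the one for s′, matching the prefix-sum condition: the shorter cycle's mean has higher variance, which raises the expected maximum.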

The general idea of measuring the average optimization ability of a learning algorithm is extremely well studied. In particular, these measures are commonly used to provide generalization guarantees for machine learning algorithms. For example, we can introduce unnormalized Σ-complexity measures defined on such sets as
$$C_\Sigma(A) := \mathbb{E}_{\sigma\sim\Sigma}\Big[\sup_{a\in A}\langle a,\sigma\rangle\Big],$$
where we'll typically take a and σ to be vectors in ℝᵐ. The most common Σ is the Rademacher distribution, given by
$$\mathbb{P}(\sigma_i = +1) = \mathbb{P}(\sigma_i = -1) = 1/2.$$
Using this distribution gives us the Rademacher complexity of the set. We could also use a Gaussian distribution, in which case we obtain the Gaussian complexity. These complexities are related:
$$\frac{C_{\mathrm{Gaussian}}(A)}{2\sqrt{\log n}} \;\le\; C_{\mathrm{Rad}}(A) \;\le\; \sqrt{\frac{\pi}{2}}\, C_{\mathrm{Gaussian}}(A),$$
which means that the different complexity measures roughly correlate.
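For a finite set A ⊂ ℝᵐ given as the rows of a matrix, the Σ-complexity is easy to estimate by sampling. Here's a hypothetical sketch (the example set and sample counts are my own choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)

def sigma_complexity(A, sampler, n_samples=50000):
    """Monte Carlo estimate of C_Sigma(A) = E_sigma[sup_{a in A} <a, sigma>].

    A: (k, m) array whose rows are the elements of a finite set in R^m.
    sampler: callable returning an array of sigma draws with the given shape.
    """
    sigma = sampler((n_samples, A.shape[1]))
    # inner products <a, sigma> for every a and every sample, then sup over A
    return (sigma @ A.T).max(axis=1).mean()

A = np.eye(3)  # the three standard basis vectors of R^3

def rademacher(shape):
    return rng.choice([-1.0, 1.0], size=shape)

def gaussian(shape):
    return rng.standard_normal(shape)

c_rad = sigma_complexity(A, rademacher)  # exactly 3/4 in expectation
c_gauss = sigma_complexity(A, gaussian)  # E[max of 3 std normals], about 0.85
```

Plugging both estimates into the displayed inequality (with n = 3) checks out numerically.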

## Applications to Instrumental Convergence

Here I'll model definitions similar to those used in the power-seeking paper and a recent post on broad-convergence.

Definition: The optimality probability of A relative to B under distribution Σ is
$$p_\Sigma(A \ge B) := \mathbb{P}_{\sigma\sim\Sigma}\Big(\sup_{a\in A}\langle a,\sigma\rangle \;\ge\; \sup_{b\in B}\langle b,\sigma\rangle\Big).$$

Definition: We'll say B is Σ-instrumentally convergent over A whenever pΣ(A≥B) ≤ 1/2.

For an MDP, we might take Σ to be a distribution over state-based reward functions, and take A and B to be the sets of feasible (discounted) state-occupancy measures with initial support on the states s_A and s_B, respectively.
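Here's a toy estimate of the optimality probability (a hypothetical example of mine, using a standard-Gaussian Σ so that ties have probability zero): A offers a single occupancy vector while B offers three, so B should be optimal more often.

```python
import numpy as np

rng = np.random.default_rng(2)

def optimality_probability(A, B, n_samples=100000):
    """Monte Carlo estimate of p_Sigma(A >= B) under a standard-Gaussian Sigma.

    A, B: arrays whose rows are the candidate vectors (e.g. occupancy measures).
    """
    sigma = rng.standard_normal((n_samples, A.shape[1]))
    sup_A = (sigma @ A.T).max(axis=1)
    sup_B = (sigma @ B.T).max(axis=1)
    return (sup_A >= sup_B).mean()

A = np.eye(3)[:1]  # the singleton {e_1}
B = np.eye(3)      # {e_1, e_2, e_3}
p = optimality_probability(A, B)
# By symmetry p is about 1/3 <= 1/2, so B is instrumentally convergent over A.
```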

Definition: Assuming the rewards are zero-mean, the Σ-power of a state is given as
$$\mathrm{POWER}_\Sigma(s) = C_\Sigma(I_s),$$
where I_s is the set of feasible (discounted) state-occupancy measures with initial support on the state s.

Theorem: Assume I_a and I_b have positive Σ-power. Then, if the Σ-power of I_b exceeds that of I_a, it is more likely to be optimal under the reward distribution Σ. Moreover, if I_b is only slightly more likely to be optimal than I_a, its Σ-power cannot exceed that of I_a by much.

We'll use the following lemma to help us,

Lemma: Let Δ(σ) = sup_{a∈A}⟨a,σ⟩ − sup_{b∈B}⟨b,σ⟩. Suppose that the distribution of Δ(σ) under σ ∼ Σ is sub-Gaussian with scale parameter ν_R², and that C_Σ(A) < C_Σ(B). Then we have
$$p_\Sigma(A \ge B) \le e^{-\frac{(C_\Sigma(A)-C_\Sigma(B))^2}{2\nu_R^2}}.$$

Proof (of the theorem): The lemma yields
$$p_\Sigma(I_a \ge I_b) \le e^{-\frac{(\mathrm{POWER}_\Sigma(b)-\mathrm{POWER}_\Sigma(a))^2}{2\nu_R^2}}
\;\Rightarrow\; p_\Sigma(I_b > I_a) \ge 1 - e^{-\frac{(\mathrm{POWER}_\Sigma(b)-\mathrm{POWER}_\Sigma(a))^2}{2\nu_R^2}}.$$
Taking logarithms in the first inequality,
$$\log p_\Sigma(I_a \ge I_b) \le -\frac{(\mathrm{POWER}_\Sigma(b)-\mathrm{POWER}_\Sigma(a))^2}{2\nu_R^2},$$
and rearranging,
$$\mathrm{POWER}_\Sigma(b) - \mathrm{POWER}_\Sigma(a) \;\le\; \nu_R\sqrt{2\log\frac{1}{p_\Sigma(I_a \ge I_b)}} \;=\; \nu_R\sqrt{2\log\frac{1}{1-p_\Sigma(I_b > I_a)}}.$$
Thus, relatively optimal policies have bounded power difference. □

All that remains is to prove the lemma,

Proof: For any λ > 0 we have
$$p_\Sigma(A \ge B) \le \mathbb{E}_{\sigma\sim\Sigma}\Big[\mathbb{I}\Big(\sup_{a\in A}\langle a,\sigma\rangle \ge \sup_{b\in B}\langle b,\sigma\rangle\Big)\Big] \le \mathbb{E}_{\sigma\sim\Sigma}\Big[e^{\lambda\left(\sup_{a\in A}\langle a,\sigma\rangle - \sup_{b\in B}\langle b,\sigma\rangle\right)}\Big] \quad \text{(0-1 loss)}$$
$$= \mathbb{E}_{\sigma\sim\Sigma}\big[e^{\lambda\Delta(\sigma)}\big] = \mathbb{E}_{\delta\sim\Delta(\Sigma)}\big[e^{\lambda\delta}\big] \quad \text{(change of variables)}$$
$$\le e^{\lambda\mathbb{E}[\delta]+\frac{\lambda^2\nu_R^2}{2}} = e^{\lambda(C_\Sigma(A)-C_\Sigma(B))+\frac{\lambda^2\nu_R^2}{2}} \quad \text{(Hoeffding's lemma)}$$
We may optimize the bound, obtaining λ = (C_Σ(B) − C_Σ(A))/ν_R², which is positive by assumption. This implies that
$$p_\Sigma(A \ge B) \le e^{-\frac{(C_\Sigma(B)-C_\Sigma(A))^2}{2\nu_R^2}}. \qquad \square$$
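As a numerical sanity check on the lemma (my own sketch, not part of the proof): with a standard-Gaussian Σ and sets of unit vectors, each supremum is a 1-Lipschitz function of σ, so Δ is 2-Lipschitz, and Gaussian concentration makes it sub-Gaussian with the (assumed, conservative) choice ν_R² = 4.

```python
import numpy as np

rng = np.random.default_rng(3)

m, n_samples = 5, 200000
A = np.eye(m)[:1]  # {e_1}: Gaussian complexity ~ 0
B = np.eye(m)      # {e_1, ..., e_5}: complexity = E[max of 5 std normals]
sigma = rng.standard_normal((n_samples, m))
sup_A = (sigma @ A.T).max(axis=1)
sup_B = (sigma @ B.T).max(axis=1)

c_A, c_B = sup_A.mean(), sup_B.mean()  # estimates of C_Sigma(A), C_Sigma(B)
p_hat = (sup_A >= sup_B).mean()        # estimate of p_Sigma(A >= B)

# Each sup over unit vectors is 1-Lipschitz in sigma, so Delta is 2-Lipschitz;
# by Gaussian concentration it is sub-Gaussian with nu_R^2 = 4 (assumed here).
nu_sq = 4.0
bound = np.exp(-(c_B - c_A) ** 2 / (2 * nu_sq))
```

By symmetry the observed frequency is about 1/5, which sits well under the (loose) exponential bound.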
