I've personally had a hard time accepting the 'seeking power' terminology used in some of the discussion around instrumental convergence. I can't ask my teachers about power or instrumental convergence. However, I do often ask questions about Rademacher complexity since I study machine learning algorithms. This is effectively a rough-draft articulation of why I do that.
Instrumental convergence is the idea that there are states or strategies that are somehow broadly applicable to achieving a broad set of rewards. A somewhat canonical example might the paper-clip maximizer.
At a high-level paper-clip maximizers seem like a potential threat. First, we argue that some smaller collection of strategies would be useful for achieving a larger set of goal states. Second, we observe that this smaller collection of states/strategies is somehow 'bad'. In our example, this might look like turning Earth into a factory to produce paper-clips.
Of course, without a proper grounding of the various terms it's easy to overeach prove too much. Formalizations of instrumental convergence have been previously proposed. Since optimal policies always converges to a limiting cycle on the graph work on maximum mean-weight cycles is also relevant. Mathieu and Wilson study the distribution of such returns on complete graphs. In our context, they show the limiting cycles with high return have a constant-order legnth which means that the length does not scale with the size of the graph. A subsequent work clarifies that with constant probability the length is constant or otherwise has length .
Turner et al. mathematically formalize the relevant notions for AI and then propose something more: that optimal policies tend to seek power. Power is defined as the expected value of a state over a distribution of reward functions, A useful example is to think of having an agent precommit between limiting cycles by choosing it's initial state. If the MDP has states and is deterministic then we may assign a cycle function which indicates the number of cylces of each length. It's clear enough that if one state includes at least as many cycles of any given length as then the power of is at least that of . We can generalize this slightly,
Proposition: Let . Suppose the rewards are iid then we have whenever, and strict inequality whenever for some . This is just saying that shorter cycles are more valuable than longer cycles in the sense that any cycle of length can substitute for any cycle of length .
The general idea of measuring the average optimization ability of a learning algorithm is extremely well studied. In particular, these measures are commonly used to provide generalization gurantees for machine learning algorithms. For example, we can introduce unormalized -complexity measures defined on such sets as, where we'll typically take and to be vectors in . The most common is the Rademacher distribution which is given as, Using this distribution gives us the Rademacher complexity of the set. We could also use a Gaussian distribution in which case we obtain the Gaussian complexity. These complexities are related, which means that the different complexity measures roughly correlate.
Applications to Instrumental Convergence
Definition: The optimality probability of relative to under distribution is Definition: We'll say is instrumentally convergent over whenever .
For an MDP we might take as a distribution of state-based reward functions and the sets and as indicating sets of feasible state-occupancy measures (discounted) that have initial full support on a state and , respectively.
Definition: Assuming the rewards are zero-mean the -power of a state is given as, where is the set of feasible state-occupancy measures (discounted) that have initial full-support on a state .
Theorem: Assume and have positive -power. Then, if the -power of exceeds it is more likely to be optimal under the reward distribution . Moreover, policies slightly optimal over cannot be too powerful.
We'll use the following lemma to help us,
Lemma: Let . Suppose that is sub-Gaussian with scale-parameter and then we have, Proof: The previous lemma yeilds, Thus, relatively optimal policies have bounded power difference.
All that remains is to prove the lemma,
Proof: For any we have, We may optimize the bound and obtain by assumption. This implies that,