Thanks to Rohin Shah, Michael Dennis, Josh Turner, and Evan Hubinger for comments.
It sure seems like gaining power over the environment is instrumentally convergent (optimal for a wide range of agent goals). You can turn this into math and prove things about it. Given some distribution over agent goals, we want to be able to formally describe how optimal action tends to flow through the future.
Does gaining money tend to be optimal? Avoiding shutdown? When? How do we know?
Optimal Farsighted Agents Tend to Seek Power proved that, when you distribute reward fairly and evenly across states (IID), it's instrumentally convergent to gain access to lots of final states (which are absorbing, in that the agent keeps on experiencing the final state). The theorems apply when you don't discount the future (you're "infinitely farsighted").
While it's good to understand the limiting case, what if the agent, you know, isn't infinitely farsighted? That's a pretty unrealistic assumption. Eventually, we want this theory to help us predict what happens after we deploy RL agents with high-performing policies in the real world.
Normal amounts of sightedness
But what if we care about the journey? What if the discount rate $\gamma$ lies strictly between 0 and 1?
We can view Frank as traversing a Markov decision process, navigating between states with his actions.
It sure seems like Frank is more likely to start with the blue or green gems. Those give him way more choices along the way, after all. But the previous theorems only said "at $\gamma = 0$, he's equally likely to pick each gem. At $\gamma = 1$, he's equally likely to end up in each terminal state".
Let me tell you, finding the probability that one tangled web of choices is optimal over another web is generally a huge mess. You're finding the measure of reward functions which satisfy some messy system of inequalities.
And that's in the simple tiny environments!
How do we reason about instrumental convergence – how do we find those sets of trajectories which are more likely to be optimal for a lot of reward functions?
We exploit symmetries.
The blue gem makes available all of the same options as the red gems, and then some. Since the blue gem gives you strictly more options, it's strictly more likely to be optimal! When you toss back in the green gem, avoiding the red gems becomes yet more likely.
So, we can prove that for all $\gamma \in (0, 1)$, most agents don't choose the red gems. Agents are more likely to pick blue than red. Easy.
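The "strictly more options" argument can be sanity-checked numerically. The sketch below is a Monte Carlo estimate on a hypothetical toy MDP (not Frank's actual environment, whose layout is an assumption here): the red gem leads to one absorbing state, while the blue gem leads to a choice between two. Rewards are drawn IID uniform on [0, 1], and the function and state names are made up for illustration.

```python
import random

def prob_blue_optimal(gamma, n_samples=20000, seed=0):
    """Estimate how often the 'blue' branch is optimal under IID uniform reward.

    Hypothetical toy MDP (an assumption, not the post's environment):
      start -> red  -> r1 (absorbing)
      start -> blue -> b1 (absorbing) or b2 (absorbing)
    Requires 0 <= gamma < 1.
    """
    rng = random.Random(seed)
    # Discounted weight of sitting in an absorbing state forever,
    # measured from the gem state: gamma + gamma^2 + ... = gamma / (1 - gamma).
    tail = gamma / (1 - gamma)
    wins = 0
    for _ in range(n_samples):
        red, r1, blue, b1, b2 = (rng.random() for _ in range(5))
        red_return = red + tail * r1
        # The blue branch lets the agent pick the better absorbing state.
        blue_return = blue + tail * max(b1, b2)
        wins += blue_return > red_return
    return wins / n_samples

estimate = prob_blue_optimal(gamma=0.9)
print(f"Estimated P(blue optimal) at gamma=0.9: {estimate:.3f}")
```

In this toy, the red subgraph embeds into the blue one, so the estimate should land above one half for any discount rate strictly between 0 and 1.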
Plus, this reasoning mirrors why we think instrumental convergence exists to begin with:
Sure, the goal could incentivize immediately initiating shutdown procedures. But if you stay active, you could still shut down later, plus there are all these other states the agent might be incentivized to reach.
This extends further. If the symmetry occurs twice over, then you can conclude the agent is at least twice as likely to do the instrumentally convergent thing.
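One reading of "the symmetry occurs twice over" is that the alternative contains two disjoint copies of the option being compared against. A minimal sketch under that assumption: three structurally identical branches (a hypothetical layout), each a gem state followed by one absorbing state, with IID uniform reward. By symmetry each branch is optimal about a third of the time, so "not the first branch" is about twice as likely as "the first branch".

```python
import random

def branch_choice_frequencies(gamma, n_branches=3, n_samples=30000, seed=0):
    """Frequency with which each of several identical branches is optimal.

    Each hypothetical branch is: start -> gem_i -> end_i (absorbing).
    Requires 0 <= gamma < 1.
    """
    rng = random.Random(seed)
    tail = gamma / (1 - gamma)  # weight of staying in the absorbing state forever
    counts = [0] * n_branches
    for _ in range(n_samples):
        returns = [rng.random() + tail * rng.random() for _ in range(n_branches)]
        counts[returns.index(max(returns))] += 1
    return [c / n_samples for c in counts]

freqs = branch_choice_frequencies(gamma=0.5)
p_first = freqs[0]
p_others = sum(freqs[1:])
print(f"P(first branch) ~ {p_first:.3f}, P(any other branch) ~ {p_others:.3f}")
```

If the other branches offered extra options on top of the copied structure, the ratio would only grow, which matches the "at least twice" phrasing.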
My initial work made a lot of simplifying assumptions:
- The agents are infinitely farsighted: they care about average reward over time, and don't prioritize the present over the future.
  - Relaxed. See above.
- The environment is deterministic.
  - Relaxed. The paper is already updated to handle stochastic environments. The new techniques in this post also generalize straightforwardly.
- Reward is distributed IID over states, where each state's reward distribution is bounded and continuous.
  - Relaxed. We can immediately toss out boundedness, as none of our reasoning about instrumental convergence relies on it. It just ensured certain unrelated equations were well-defined.
  - With a bit of work, I could probably toss out continuity in general (and instead require only non-degeneracy), but I haven't done that yet.
  - If you can prove instrumental convergence under IID reward, and then you have another reward function distribution which improves reward for instrumentally convergent trajectories while worsening reward for already-unlikely trajectories, then there's also instrumental convergence under that other distribution.
    - For example, if you double reward in instrumentally convergent states and halve it in unlikely states, then you still have instrumental convergence.
- The environment is Markov.
  - Relaxed. $k$-step Markovian environments are handled by conversion into isomorphic Markov environments.
- The agent is optimal.
- The environment is finite and fully observable.
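The conversion of multi-step Markovian environments mentioned in the list above can be sketched with the standard history-augmentation trick: treat the tuple of the last k observations as the state. This is an assumption about what the conversion looks like, not the paper's construction; `to_markov`, `fib_like`, and the step count `k` are hypothetical names.

```python
def to_markov(step_fn, k):
    """Wrap a transition function that reads the last k observations.

    The returned function is Markov over augmented states, where each
    augmented state is the tuple of the k most recent observations.
    """
    def markov_step(history, action):
        next_obs = step_fn(history, action)
        # Slide the window: drop the oldest observation, append the newest.
        return (history + (next_obs,))[-k:]
    return markov_step

# Hypothetical 2-step environment: the next observation depends on the
# last two observations and the action.
def fib_like(history, action):
    return (history[-1] + history[-2] + action) % 5

step = to_markov(fib_like, k=2)
state = (0, 1)
state = step(state, 0)
state = step(state, 1)
print(state)
```

The augmented transition depends only on the current tuple and the action, so anything proven for Markov environments can be applied to the wrapped one.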
With these relaxations, the power-seeking theorems apply to:
optimal policies in finite MDPs with respect to reward distributed independently, identically, and continuously over states.
We now have a few formally correct strategies for showing instrumental convergence, or lack thereof.
- In deterministic environments, there's no instrumental convergence at $\gamma = 0$ for IID reward.
- When $0 < \gamma < 1$, you're strictly more likely to navigate to parts of the future which give you strictly more options (in a graph-theoretic sense). Plus, these parts of the future give you strictly more power.
- When $\gamma = 1$, it's instrumentally convergent to access a wide range of terminal states.
  - This can be seen as a special case of having "strictly more options", but you no longer require an isomorphism on the paths leading to the terminal states.
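The two limiting cases in the list above can be illustrated on a hypothetical toy MDP where the paths are deliberately not isomorphic: the "narrow" branch reaches its single absorbing state in one step, while the "wide" branch passes through an extra state before choosing between two absorbing states. This is a sketch under assumed IID uniform reward; all names are made up.

```python
import random

def prob_wide_branch_optimal(gamma, n_samples=20000, seed=1):
    """Estimate how often the 'wide' branch is optimal.

    Hypothetical toy MDP:
      start -> narrow -> n_end (absorbing)
      start -> wide1 -> wide2 -> w_end1 or w_end2 (both absorbing)
    gamma == 1.0 is treated as average reward (only absorbing states matter).
    """
    rng = random.Random(seed)
    wins = 0
    for _ in range(n_samples):
        narrow, n_end, wide1, wide2, w_end1, w_end2 = (
            rng.random() for _ in range(6)
        )
        best_w_end = max(w_end1, w_end2)
        if gamma == 1.0:
            wins += best_w_end > n_end
        else:
            tail = 1 / (1 - gamma)
            narrow_return = narrow + gamma * tail * n_end
            wide_return = wide1 + gamma * wide2 + gamma**2 * tail * best_w_end
            wins += wide_return > narrow_return
    return wins / n_samples

p_myopic = prob_wide_branch_optimal(0.0)
p_farsighted = prob_wide_branch_optimal(1.0)
print(f"gamma=0: {p_myopic:.3f}, gamma=1: {p_farsighted:.3f}")
```

At a discount rate of 0 only the first state's reward matters, so the two branches should come out roughly even. At 1 the wide branch should win about two thirds of the time, because it reaches two of the three absorbing states.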
This work builds off of my initial paper on power-seeking.
Definition. Let be a visitation distribution of state , and let contain . denotes the measure of reward functions under distribution satisfying at discount rate .
Non-dominated visitation distributions have positive measure and "take" positive measure from every other non-dominated visitation distribution.
Lemma 1. If , then for all . Furthermore, such that , for all containing .
Proof. The first claim was proven in the initial power-seeking paper. The second claim follows by observing that visitation distributions which are non-dominated with respect to all of the visitation distributions at the state are also non-dominated with respect to subsets (as taking subsets winnows the set of constraints). Then, use the fact that non-dominated visitation distributions always have positive measure (in particular, with respect to the subset in question). QED.
Definition. The graph induced by a set $F$ of visitation distributions consists of the states visited and actions taken by at least one of the policies generating the visitation distributions. This is also referred to as the $F$-graph.
Theorem 2 [Strictly more meaningful options means strict instrumental convergence and strict power increase]. Let be subsets of non-dominated visitation distributions. If the -graph is isomorphic to a subgraph of the -graph, such that the isomorphism fixes , then for all . If the subgraph of the -graph is strict, then so is the inequality.
Furthermore, is more powerful in the -graph than in the -graph. If the subgraph of the -graph is strict, then is strictly more powerful in the -graph.
Proof. The claim follows from symmetry; measure must be invariant to state relabelling, because reward is IID. The strict inequality follows from lemma 1: adding another non-dominated visitation distribution must strictly increase and decrease .
Similarly, the first power claim follows from symmetry. Adding another non-dominated visitation distribution must strictly increase the power . QED.
Reward distribution generalization
We derive a sufficient condition for instrumental convergence for (certain) non-IID reward function distributions.
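Definition. Distribution $X$ (with CDF $F_X$) dominates distribution $Y$ (with CDF $F_Y$) when $F_X(x) \leq F_Y(x)$ for all $x$ (when $F_X$ minorizes $F_Y$). Similarly, distribution $X$ (with CDF $F_X$) is dominated by distribution $Y$ (with CDF $F_Y$) when $F_X(x) \geq F_Y(x)$ for all $x$ (when $F_X$ majorizes $F_Y$).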
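As a rough illustration of this definition, the sketch below checks dominance on samples by comparing empirical CDFs: the dominating distribution's CDF should sit at or below the other's everywhere. The helper names are hypothetical, and comparing empirical CDFs is only an approximation of the definition for the underlying distributions.

```python
import bisect
import random

def empirical_cdf(samples):
    """Return a function t -> fraction of samples that are <= t."""
    ordered = sorted(samples)
    n = len(ordered)
    return lambda t: bisect.bisect_right(ordered, t) / n

def dominates(xs, ys):
    """True if the empirical CDF of xs is <= that of ys at every jump point."""
    cdf_x, cdf_y = empirical_cdf(xs), empirical_cdf(ys)
    # Both CDFs are step functions, so checking the jump points suffices.
    grid = sorted(set(xs) | set(ys))
    return all(cdf_x(t) <= cdf_y(t) for t in grid)

rng = random.Random(0)
base = [rng.random() for _ in range(2000)]   # reward ~ uniform on [0, 1)
doubled = [2 * r for r in base]              # same draws, reward doubled

print(dominates(doubled, base))  # doubled reward dominates the original
print(dominates(base, doubled))  # the reverse does not hold
```

Doubling a non-negative reward shifts probability mass toward higher values, so it is a simple example of a "more generous" distribution in the sense used here.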
The following insight is simple: if you can prove instrumental convergence under IID rewards, and then you have another reward function distribution which improves reward for instrumentally convergent trajectories while worsening reward for already-unlikely trajectories, then there's also instrumental convergence under that other distribution.
The logic goes:
Suppose, e.g., that avoiding shutdown is instrumentally convergent under a more generous IID distribution. Realistic distributions are far less likely to reward shutdown, and a few other trajectories are even more likely to be rewarded. So, it's still instrumentally convergent to avoid shutdown under the more realistic task-based distribution we have in mind.
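This logic can be checked on a toy example. The sketch below reuses a hypothetical two-branch MDP (a "red" gem with one absorbing state, a "blue" gem with two) and compares IID uniform reward against a non-IID variant that doubles reward along the already-favored blue branch and halves it along the red branch. The scale factors and names are assumptions for illustration. Both estimates use the same random draws, so the comparison is sample by sample.

```python
import random

def prob_blue_optimal(gamma, blue_scale=1.0, red_scale=1.0,
                      n_samples=20000, seed=0):
    """Estimate P(blue branch optimal) with per-branch reward scaling.

    Hypothetical toy MDP:
      start -> red  -> r1 (absorbing)
      start -> blue -> b1 or b2 (both absorbing)
    Scales of 1.0 give IID uniform reward; other scales give a non-IID
    distribution. Requires 0 <= gamma < 1.
    """
    rng = random.Random(seed)
    tail = gamma / (1 - gamma)  # weight of staying in an absorbing state forever
    wins = 0
    for _ in range(n_samples):
        red, r1, blue, b1, b2 = (rng.random() for _ in range(5))
        red_return = red_scale * (red + tail * r1)
        blue_return = blue_scale * (blue + tail * max(b1, b2))
        wins += blue_return > red_return
    return wins / n_samples

p_iid = prob_blue_optimal(0.9)
p_shifted = prob_blue_optimal(0.9, blue_scale=2.0, red_scale=0.5)
print(f"IID: {p_iid:.3f}, shifted: {p_shifted:.3f}")
```

Because rewards are non-negative, every draw in which blue wins under IID reward is still a win after the rescaling, so the shifted estimate cannot come out lower.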
Theorem 3. Let . Suppose that under reward function distribution ,
If all have dominant distributions at discount rate under distribution compared to under , and all have dominated return distributions at discount rate under distribution compared to under , then under .
Proof. Consider the process of starting with the initial return distributions, and iteratively swapping them to their more generous counterparts. If any such swap strictly increases , it strictly increases and strictly decreases by lemma 1. Clearly such a swap cannot strictly decrease .
Similar logic applies to the less generous return distributions for under . QED.