Previously: Seeking Power is Often Provably Instrumentally Convergent in MDPs.

Circa 2021, the above post was revamped to supersede this one, so I recommend just reading that instead.

Thanks to Rohin Shah, Michael Dennis, Josh Turner, and Evan Hubinger for comments.

The original post contained proof sketches for non-IID reward function distributions. I think the actual non-IID theorems look different than I thought, and so I've removed the proof sketches in the meantime.

It sure seems like gaining power over the environment is instrumentally convergent (optimal for a wide range of agent goals). You can turn this into math and prove things about it. Given some distribution over agent goals, we want to be able to formally describe how optimal action tends to flow through the future.

Does gaining money tend to be optimal? Avoiding shutdown? When? How do we know?

Optimal Farsighted Agents Tend to Seek Power proved that, when you distribute reward fairly and evenly across states (IID), it's instrumentally convergent to gain access to lots of final states (which are absorbing, in that the agent keeps on experiencing the final state). The theorems apply when you don't discount the future (you're "infinitely farsighted").

Most reward functions for the Pac-Man game incentivize not dying immediately, so that the agent can loop around higher-scoring configurations.

Many ways of scoring Tic-Tac-Toe game states incentivize not losing immediately, in order to choose the highest-scoring final configuration.

All states have self-loops, left hidden to reduce clutter.

In *AI: A Modern Approach (3e)*, the agent starts at $1$ and receives reward for reaching $3$. The optimal policy for this reward function avoids $2$, and one might suspect that avoiding $2$ is instrumentally convergent. However, a skeptic might provide a reward function for which navigating to $2$ is optimal, and then argue that "instrumental convergence" is subjective and that there is no reasonable basis for concluding that $2$ is generally avoided.

We can do better... for any way of independently and identically distributing reward over states, $\frac{10}{11}$ of reward functions have farsighted optimal policies which avoid $2$. If we complicate the MDP with additional terminal states, this number further approaches $1$.
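One way to sanity-check the $\frac{10}{11}$ figure is a quick Monte Carlo estimate. The sketch below rests on assumptions that the text does not spell out: that the example has 11 states, each with a self-loop; that $2$ cannot be left once entered; and that an infinitely farsighted agent therefore heads to $2$ only when $2$ happens to carry the highest reward. It also assumes a continuous reward distribution, so ties are negligible. The names `fraction_avoiding` and `draw` are made up for illustration.

```python
import random


def fraction_avoiding(n_states=11, avoided=1, n_samples=100_000, draw=None, seed=0):
    """Estimate how often the highest-reward state is NOT the `avoided` state.

    Assumption (not from the post): the agent can reach and stay in any state,
    so an undiscounted optimal policy settles in the highest-reward state.
    """
    rng = random.Random(seed)
    if draw is None:
        # Default reward distribution: Uniform(0, 1), drawn IID per state.
        draw = lambda r: r.random()

    avoid_count = 0
    for _ in range(n_samples):
        rewards = [draw(rng) for _ in range(n_states)]
        best = max(range(n_states), key=rewards.__getitem__)
        if best != avoided:
            avoid_count += 1
    return avoid_count / n_samples


if __name__ == "__main__":
    # Uniform rewards over 11 states.
    print(fraction_avoiding())
    # A different IID distribution (exponential) over the same 11 states.
    print(fraction_avoiding(draw=lambda r: r.expovariate(1.0)))
    # More states: the fraction should move toward 1.
    print(fraction_avoiding(n_states=50))
```

Under these assumptions, both 11-state estimates should land near $\frac{10}{11} \approx 0.909$ regardless of which IID distribution is used, since each state is equally likely to be the best one. With $n$ states the same argument gives $\frac{n-1}{n}$, which is why adding states pushes the fraction toward $1$.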

If we suppose that the agent wil

...

Power Seeking (AI)
