Seeking Power is Often Provably Instrumentally Convergent in MDPs

by TurnTrout, elriggs · 11 min read · 5th Dec 2019 · 25 comments

Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.
This is a linkpost for https://arxiv.org/abs/1912.01683

In 2008, Steve Omohundro's foundational Basic AI Drives made important conjectures about what superintelligent goal-directed AIs might do, including gaining as much power as possible to best achieve their goals. Toy models have been constructed in which Omohundro's conjectures bear out, and the supporting philosophical arguments are intuitive. The conjectures have recently been the center of debate between well-known AI researchers.

Instrumental convergence has been heuristically understood as an anticipated risk, but not as a formal phenomenon with a well-understood cause. The goal of this post (and accompanying paper) is to change that.

My results strongly suggest that, within the Markov decision process formalism (the staple of reinforcement learning[1]), the structure of the agent's environment means that most goals incentivize gaining power over that environment. Furthermore, maximally gaining power over an environment is bad for other agents therein. That is, power seems constant-sum after a certain point.

I'm going to provide the intuitions for a mechanistic understanding of power and instrumental convergence, and then informally show how optimal action usually means trying to stay alive, gain power, and take over the world; read the paper for the rigorous version. Lastly, I'll talk about why these results excite me.

Intuitions

I claim that

The structure of the agent's environment means that most goals incentivize gaining power over that environment.

By environment, I mean the thing the agent thinks it's interacting with. Here, we're going to think about dualistic environments where you can see the whole state, where there are only finitely many states to see and actions to take. Also, future stuff gets geometrically discounted; at discount rate γ = 1/2, this means stuff in one turn is half as important as stuff now, stuff in two turns is a quarter as important, and so on. Pac-Man is an environment structured like this: you see the game screen (the state), you take an action, and then you get a result (another state). There's only finitely many screens, and only finitely many actions – they all had to fit onto the arcade controller, after all!
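
As a tiny illustration of how geometric discounting weighs the future (a minimal sketch; the reward values are made up for the example):

```python
# Discounted return: reward received t steps from now is weighted by gamma**t.
gamma = 0.5
rewards = [1.0, 1.0, 1.0]  # hypothetical reward stream over three turns
discounted = sum(gamma**t * r for t, r in enumerate(rewards))
print(discounted)  # 1.0 + 0.5 + 0.25 = 1.75
```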

When I talk about "goals", I'm talking about reward functions over states: each way-the-world-could-be gets assigned some point value. The canonical way of earning points in Pac-Man is just one possible reward function for the game.

Instrumental convergence supposedly exists for sufficiently wide varieties of goals, so today we'll think about the most variety possible: the distribution of goals where each possible state is uniformly randomly assigned a reward in the interval [0, 1] (although the theorems hold for a lot more distributions than this[2]). Sometimes, I'll say things like "most agents do X", which means "maximizing total discounted reward usually entails doing X when your goals are drawn from the uniform distribution". We say agents are "farsighted" when the discount rate γ is sufficiently close to 1 (the agent doesn't prioritize immediate reward over delayed gratification).

Power

You can do things in the world and take different paths through time. Let's call these paths "possibilities"; they're like filmstrips of how the future could go.

If you have more control over the future, you're usually[3] choosing among more paths-through-time. This lets you more precisely control what kinds of things happen later. This is one way to concretize what people mean when they use the word 'power' in everyday speech, and will be the definition used going forward: the ability to achieve goals in general.[4] In other words, power is the average attainable utility across a distribution of goals.
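
Schematically (my paraphrase; the paper's definition also normalizes by the discount factor, which I omit here):

$$\text{Power}(s) \;\propto\; \mathbb{E}_{R \sim \mathcal{D}}\!\left[\max_{\pi} V^{\pi}_{R}(s)\right]$$

That is: draw a goal R from the distribution D, compute the best value attainable from state s, and average over goals.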

This definition seems philosophically reasonable: if you have a lot of money, you can make more things happen and have more power. If you have social clout, you can spend that in various ways to better tailor the future to various ends. Dying means you can't do much at all, and all else equal, losing a limb decreases your power.

Exercise: spend a few minutes considering whether real-world intuitive examples of power are explained by this definition.

Once you feel comfortable that it's at least a pretty good definition, we can move on.


Imagine a simple game with three choices: eat candy, eat a chocolate bar, or hug a friend.

The power of a state is how well agents can generally do by starting from that state. It's important to note that we're considering power from behind a "veil of ignorance" about the reward function. We're averaging the best we can do for a lot of different individual goals.

Each reward function has an optimal possibility, or path-through-time. If chocolate has maximal reward, then the optimal possibility is the path ending in the chocolate state.

Since the distribution randomly assigns a value in [0, 1] to each state, an agent can expect to average 3/4 reward. This is because you're choosing between three choices, each of which has some value between 0 and 1. The expected maximum of n draws from uniform [0, 1] is n/(n+1); you have three draws here, so you expect to be able to get 3/4 reward. Now, some reward functions do worse than this, and some do better; but on average, they get 3/4 reward. You can test this out for yourself.
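
If you'd like to check the 3/4 figure numerically, here's a minimal Monte Carlo sketch (it just simulates drawing three uniform rewards and taking the best; nothing here is from the paper):

```python
import random

# Average of the best reward among three states with i.i.d. Uniform[0, 1] rewards.
samples = 100_000
total = sum(max(random.random() for _ in range(3)) for _ in range(samples))
print(total / samples)  # ~0.75, matching n/(n+1) with n = 3
```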

If you have no choices, you expect to average 1/2 reward: sometimes the future is great, sometimes it's not. Conversely, the more things you can choose between, the closer this expectation gets to 1 (i.e., you can do well by all goals, because each has a great chance of being able to steer the future how you want).

Instrumental convergence

Plans that help you better reach a lot of goals are called instrumentally convergent. To travel as quickly as possible to a randomly selected coordinate on Earth, one likely begins by driving to the nearest airport. Although it's possible that the coordinate is within driving distance, it's not likely. Driving to the airport would then be instrumentally convergent for travel-related goals.

We define instrumental convergence as optimal agents being more likely to take one action than another at some point in the future. I want to emphasize that when I say "likely", I mean from behind the veil of ignorance. Suppose I say that it's 50% likely that agents go left, and 50% likely they go right. This doesn't mean any agent has the stochastic policy of 50% left / 50% right. This means that, when drawing goals from our distribution, 50% of the time optimal pursuit of the goal entails going left, and 50% of the time it entails going right.

Consider either eating candy now, or earning some reward for waiting a second before choosing between chocolate and hugs.

Let's think about how optimal action tends to change as we start caring about the future more. Think about all the places you can be after just one turn:

We could be in two places. Imagine we only care about the reward we get next turn. How many goals choose Candy over Wait!? Well, it's 50-50 – since we randomly choose a number between 0 and 1 for each state, both states have an equal chance of being maximal. About half of nearsighted agents go to Candy and half go to Wait!. There isn't much instrumental convergence yet. Note that this is also why nearsighted agents tend not to seek power.

Now think about where we can be in two turns:

We could be in three places. Supposing we care more about the future, more of our future control is coming from our ability to Wait!. In other words, about two thirds of our power comes from the possibilities Wait! keeps open (Chocolate and Hug). But is waiting instrumentally convergent? If the agent is farsighted, the answer is yes (why?).

In the limit of farsightedness, the chance of each possibility being optimal approaches 1/3 (each terminal state has an equal chance to be maximal).
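
Here's a small Monte Carlo sketch of that trend, based on my reading of the waiting environment (assumed dynamics: terminal states repeat their reward forever, and Wait! pays its own reward before the Chocolate/Hug choice):

```python
import random

def best_first_move(gamma):
    """For one sampled reward function, is Candy or Wait! optimal?"""
    r_candy, r_wait, r_choc, r_hug = (random.random() for _ in range(4))
    v_candy = r_candy / (1 - gamma)                             # candy forever
    v_wait = r_wait + gamma * max(r_choc, r_hug) / (1 - gamma)  # wait, then best of the two
    return "wait" if v_wait > v_candy else "candy"

for gamma in (0.01, 0.5, 0.99):
    n = 50_000
    frac_wait = sum(best_first_move(gamma) == "wait" for _ in range(n)) / n
    print(gamma, round(frac_wait, 3))
```

When γ is near 0, Candy and Wait! each win about half the time; when γ is near 1, Wait! is optimal for roughly two thirds of sampled goals, matching the 1/3-per-terminal-state limit above.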

There are two important things happening here.

Important Thing #1

Instrumental convergence doesn't happen in all environments. An agent starting at blue isn't more likely to go up or down at any given point in time.

There's also never instrumental convergence when the agent doesn't care about the future at all (when γ = 0). However, let's think back to what happens in the waiting environment:

As the agent becomes farsighted, the Chocolate and Hug possibilities become more likely.

We can show that instrumental convergence exists in an environment if and only if a path through time becomes more likely as the agent cares more about the future.

Important Thing #2

The more control-at-future-timesteps an action provides, the more likely it is to be selected. What an intriguing "coincidence"!

Power-seeking

So, it sure seems like gaining power is a good idea for a lot of agents!

Having tasted a few hints for why this is true, we'll now walk through the intuitions a little more explicitly. This, in turn, will show some pretty cool things: most agents avoid dying in Pac-Man, keep the Tic-Tac-Toe game going as long as possible, and avoid deactivation in real life.[5]

Let's focus on an environment with the same rules as Tic-Tac-Toe, but considering the uniform distribution over reward functions. Once the game's done, the agent keeps experiencing the final state over and over. We bake the opponent's policy into the environment's rules: when you choose a move, the game automatically replies.

Whenever we make a move that ends the game, we can't reach anything else – we have to stay put. Since each final state has the same chance of being optimal, a move which doesn't end the game is more likely than a move which does. Let's look at part of the game tree, with instrumentally convergent moves shown in green.

Starting on the left, all but one move leads to ending the game, but the second-to-last move allows us to keep choosing between five more final outcomes. For reasonably farsighted agents at the first state, the green move is ~50% likely to be optimal, while each of the others is only best for ~10% of goals. So we see a kind of "self-preservation" arising, even in Tic-Tac-Toe.
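
To see where the ~50% and ~10% figures could come from, here's a back-of-the-envelope sketch (my own reconstruction of the branching in the figure: five moves each end the game immediately in a distinct final state, while the green move preserves access to five other final states; in the farsighted limit, whichever final state has the highest reward determines the optimal first move):

```python
import random

# Ten candidate final states: five reachable only via the "green" move,
# five reached by ending the game immediately.
n = 100_000
green_best = 0
for _ in range(n):
    finals = [random.random() for _ in range(10)]
    green_best += max(finals[:5]) > max(finals[5:])
print(green_best / n)  # ~0.5; each game-ending move is best for only ~10% of goals
```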

Remember how, as the agent gets more farsighted, more of its control comes from choosing between Chocolate and Hug, while also these two possibilities become more and more likely?

The same thing is happening in Tic-Tac-Toe. Let's think about what happens as the agent cares more about later and later time steps.

The initial green move contributes more and more control, so it becomes more and more likely as we become more farsighted. This isn't a coincidence.

Power-seeking is instrumentally convergent.

Reasons for excitement

The direct takeaway

I'm obviously not "excited" that power-seeking happens by default, but I'm excited that we can see this risk more clearly. I'm also planning on getting this work peer-reviewed before purposefully entering it into the aforementioned mainstream debate, but here are some of my preliminary thoughts.

Imagine you have good formal reasons to suspect that typing random strings will usually blow up your computer and kill you. Would you then say, "I'm not planning to type random strings", and proceed to enter your thesis into a word processor? No. You wouldn't type anything yet, not until you really, really understand what makes the computer blow up sometimes.

The overall concern raised by [the power-seeking theorem] is not that we will build powerful RL agents with randomly selected goals. The concern is that random reward function inputs produce adversarial power-seeking behavior, which can produce perverse incentives such as avoiding deactivation and appropriating resources. Therefore, we should have specific reason to believe that providing the reward function we had in mind will not end in catastrophe.

Speaking to the broader debate taking place in the AI research community, I think a productive posture here will be investigating and understanding these results in more detail, getting curious about unexpected phenomena, and seeing how the numbers crunch out in reasonable models. I think that even though the alignment community may have superficially understood many of these conclusions, there are many new concepts for the broader AI community to explore.

Incidentally, if you're a member of this broader community and have questions, please feel free to email me.

Explaining catastrophes

AI alignment research can often have a slippery feeling to it. We're trying hard to become less confused about basic concepts, and there's only everything on the line.

What are "agents"? Do people even have "values", and should we try to get the AI to learn them? What does it mean to be "corrigible", or "deceptive"? What are our machine learning models even doing? I mean, sometimes we get a formal open question (and this theory of possibilities has a few of those), but not usually.

We have to do philosophical work while in a state of significant confusion and ignorance about the nature of intelligence and alignment. We're groping around in the dark with only periodic flashes of insight to guide us.

In this context, we were like,

wow, it seems like every time I think of optimal plans for these arbitrary goals, the AI can best complete them by gaining a ton of power to make sure it isn't shut off. Everything slightly wrong leads to doom, apparently?

and we didn't really know why. Intuitively, it's pretty obvious that most agents don't have deactivation as their dream outcome, but we couldn't actually point to any formal explanations, and we certainly couldn't make precise predictions.

On its own, Goodhart's law doesn't explain why optimizing proxy goals leads to catastrophically bad outcomes, instead of just less-than-ideal outcomes.

I've heard that, from this state of ignorance, alignment proposals shouldn't rely on instrumental convergence being a thing (and I agree). If you're building superintelligent systems for which slight mistakes apparently lead to extinction, and you want to evaluate whether your proposal to avoid extinction will work, you obviously want to deeply understand why extinction happens by default.

We're now starting to have this kind of understanding. I suspect that power-seeking is the thing that makes capable goal-directed agency so dangerous.[6] If we want to consider more benign alternatives to goal-directed agency, then deeply understanding why goal-directed agency is bad is important for evaluating alternatives. This work lets us get a feel for the character of the underlying incentives of a proposed system design.

Formalizations

Defining power as "the ability to achieve goals in general" seems to capture just the right thing. I think it's good enough that I view important theorems about power (as defined in the paper) as philosophically insightful.

Considering power in this way seems to formally capture our intuitive notions about what resources are. For example, our current position in the environment means that having money allows us to exert more control over the future. That is, our current position in the state space means that having money allows more possibilities and greater power (in the formal sense). However, possessing green scraps of paper would not be as helpful if one were living alone near Alpha Centauri. In a sense, resource acquisition can naturally be viewed as taking steps to increase one's power.

Power might be important for reasoning about the strategy-stealing assumption (and I think it might be similar to what Paul means by "flexible influence over the future"). Evan Hubinger has already noted the utility of the distribution of attainable utility shifts for thinking about value-neutrality in this context (and power is another facet of the same phenomenon). If you want to think about whether, when, and why mesa optimizers might try to seize power, this theory seems like a valuable tool.

And, of course, we're going to use this notion of power to design an impact measure.

The formalization of instrumental convergence seems to be correct. We're able to now make detailed predictions about e.g. how the difficulty of getting reward affects the level of farsightedness at which seizing power tends to make sense. This also might be relevant for thinking about myopic agency, as the broader theory formally describes how optimal action tends to change with the discount factor.

Another useful conceptual distinction is that power and instrumental convergence aren't the same thing; we can construct environments where the state with the highest power is not instrumentally convergent from another state.

ETA: Here's an excerpt from the paper:

So, just because a state has more resources doesn't mean it holds great opportunity from the agent's current vantage point. In the above example, optimal action generally means going directly towards the optimal terminal state.

Here's what the relevant current results say: parts of the future allowing you to reach more terminal states are instrumentally convergent, and the formal POWER contributions of different possibilities are approximately proportionally related to instrumental convergence.

I think the Tic-Tac-Toe reasoning is helpful: it's instrumentally convergent to reach parts of the future which give you more control from your current vantage point. I'm working on expanding the formal results to include some version of this. I've since further clarified some claims made in the initial version of this post.

The broader theory of possibilities lends significant insight into the structure of Markov decision processes; it feels like a piece of basic theory that was never discovered earlier, for whatever reason. More on this another time.

Future deconfusion

What excites me the most is a little more vague: there's a new piece of AI alignment we can deeply understand, and understanding breeds understanding.

Acknowledgements

This work was made possible by the Center for Human-Compatible AI, the Berkeley Existential Risk Initiative, and the Long-Term Future Fund.

Logan Smith (elriggs) spent an enormous amount of time writing Mathematica code to compute power and measure in arbitrary toy MDPs, saving me from needing to repeatedly do quintuple+ integrations by hand. I thank Rohin Shah for his detailed feedback and brainstorming over the summer, and Tiffany Cai for the argument that arbitrary possibilities have expected value 1/2 (and so optimal average control can't be worse than this). Zack M. Davis, Chase Denecke, William Ellsworth, Vahid Ghadakchi, Ofer Givoli, Evan Hubinger, Neale Ratzlaff, Jess Riedel, Duncan Sabien, Davide Zagami, and TheMajor gave feedback on drafts of this post.


  1. It seems reasonable to expect the key results to generalize in spirit to larger classes of environments, but keep in mind that the claims I make are only proven to apply to finite MDPs. ↩︎

  2. Specifically, consider any continuous bounded distribution D, with each state's reward distributed identically according to D. The kind of power-seeking and Tic-Tac-Toe-esque instrumental convergence I'm gesturing at should also hold for discontinuous bounded nondegenerate D.

    The power-seeking argument works for arbitrary distributions over reward functions (with instrumental convergence also being defined with respect to that distribution) – identical distribution enforces "fairness" over the different parts of the environment. It's not as if instrumental convergence might not exist for arbitrary distributions – it's just that proofs for them are less informative (because we don't know their structure a priori).

    For example, without identical distribution, we can't say that agents (roughly) tend to preserve the ability to reach as many 1-cycles as possible; after all, you could just distribute reward on an arbitrary 1-cycle and 0 reward for all other states. According to this "distribution", only moving towards the 1-cycle is instrumentally convergent. ↩︎

  3. Power is not the same thing as number of possibilities! Power is average attainable utility; you might have a lot of possibilities, but not be able to choose between them for a long time, which decreases your control over the (discounted) future.

    Also, remember that we're assuming dualistic agency: the agent can choose whatever sequence of actions it wants. That is, there aren't "possibilities" it's unable to take. ↩︎

  4. Informal definition of "power" suggested by Cohen et al. ↩︎

  5. We need to take care when applying theorems to real life, especially since the power-seeking theorem assumes the state is fully observable. Obviously, this isn't true in real life, but it seems reasonable to expect the theorem to generalize appropriately. ↩︎

  6. I'll talk more in future posts about why I presently think power-seeking is the worst part of goal-directed agency. ↩︎


Comments

Strong upvote, this is amazing to me. On the post:

  • Another example of explaining the intuitions for formal results less formally. I strongly support this as a norm.
  • I found the graphics helpful, both in style and content.

Some thoughts on the results:

  • This strikes at the heart of AI risk, and to my inexpert eyes the lack of anything rigorous to build on or criticize as a mechanism for the flashiest concerns has been a big factor in how difficult it was and is to get engagement from the rest of the AI field. Even if the formalism fails due to a critical flaw, the ability to spot such a flaw is a big step forward.
  • The formalism of average attainable utility, and the explicit distinction from number of possibilities, provides powerful intuition even outside the field. This includes areas like warfare and business. I realize it isn't the goal, but I have always considered applicability outside the field as an important test because it would be deeply concerning for thinking about goal-directed behavior to mysteriously fail when applied to the only extant things which pursue goals.
  • I find the result aesthetically pleasing. This is not important, but I thought I would mention it.

This is great work, nice job!

Maybe a shot in the dark, but there might be some connection with that paper a few years back Causal Entropic Forces (more accessible summary). They define "causal path entropy" as basically the number of different paths you can go down starting from a certain point, which might be related to or the same as what you call "power". And they calculate some examples of what happens if you maximize this (in a few different contexts, all continuous not discrete), and get fun things like (what they generously call) "tool use". I'm not sure that paper really adds anything important conceptually that you don't already know, but just wanted to point that out, and PM me if you want help decoding their physics jargon. :-)

Yeah, this is a great connection which I learned about earlier in the summer. I think this theory explains what's going on when they say

They argue that simple mechanical systems that are postulated to follow this rule show features of “intelligence,” hinting at a connection between this most-human attribute and fundamental physical laws.

Basically, since near-optimal agents tend to go towards states of high power, and near-optimal agents are generally ones which are intelligent, observing an agent moving towards a state of high power is Bayesian evidence that it is intelligent. However, as I understand it, they have the causation wrong: instead of physical laws -> power-seeking and intelligence, intelligent goal-directed behavior tends to produce power-seeking.

I agree 100% with everything you said.

This means that if there's more than twice the power coming from one move than from another, the former is more likely than the latter. In general, if one set of possibilities contributes 2K times the power of another set of possibilities, the former set is at least K times more likely than the latter.

Where does the 2 come from? Why does one move have to have more than twice the power of another to be more likely? What happens if it only has 1.1x as much power?

What happens if it only has 1.1x as much power?

Then it won't always be instrumentally convergent, depending on the environment in question. For Tic-Tac-Toe, there's an exact proportionality in the limit of farsightedness (see theorem 46). In general, there's a delicate interaction between control provided and probability which I don't fully understand right now. However, we can easily bound how different these quantities can be; the constant depends on the distribution we choose (it's at most 2 for the uniform distribution). The formal explanation can be found in the proof of theorem 48, but I'll try to give a quick overview.

The power calculation is the average attainable utility. This calculation breaks down into the weighted sum of the average attainable utility when Candy is best, the average attainable utility when Chocolate is best, and the average attainable utility when Hug is best; each term is weighted by the probability that its possibility is optimal.[1] Each term is the power contribution of a different possibility.[2]

Let's think about Candy's contribution to the first (simple) example. First, how likely is Candy to be optimal? Well, each state has an equal chance of being optimal, so 1/3 of goals choose Candy. Next, given that Candy is optimal, how much reward do we expect to get? Learning that a possibility is optimal tells us something about its expected value. In this case, the expected reward is still 3/4; the higher this number is, the "happier" an agent is to have this as its optimal possibility.
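
Spelling out the arithmetic for this example (my gloss, not the paper's normalized notation), Candy's power contribution is

$$\Pr(\text{Candy optimal}) \times \mathbb{E}[\text{reward} \mid \text{Candy optimal}] = \tfrac{1}{3} \times \tfrac{3}{4} = \tfrac{1}{4},$$

and by symmetry the three contributions sum to the 3/4 average attainable reward computed earlier.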

In general,

If the agent can "die" in an environment, more of its "ability to do things in general" is coming from not dying at first. Like, let's follow where the power is coming from, and that lets us deduce things about the instrumental convergence. Consider the power at a state. Maybe 99% of the power comes from the possibilities for one move (like the move that avoids dying), and 1% comes from the rest. Part of this is because there are "more" goals which say to avoid dying at first, but part also might be that, conditional on not dying being optimal, agents tend to have more control.

By analogy, imagine you're collecting taxes. You have this weird system where each person has to pay at least 50¢, and pays no more than $1. The western half of your city pays $99, while the eastern half pays $1. Obviously, there have to be more people living in this wild western portion – but you aren't sure exactly how many more. Even so, you know that there are at least 99 people west, and at most 2 people east; so, there are at least 49.5 times as many people in the western half.

In the exact same way, the minimum possible average control is not doing better than chance (1/2 is the expected value of an arbitrary possibility), and the maximum possible is all agents being in heaven (the maximal reward of 1). So if 99% of the power comes from one move, then this move is at least 49.5 times as likely as any other move.
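
Spelling out that bound with the same numbers (my gloss of the argument, not the paper's exact statement): write each power contribution as $c_i = p_i e_i$, where $p_i$ is the probability that possibility $i$ is optimal and $e_i \in [\tfrac{1}{2}, 1]$ is its expected reward given that it's optimal. If $c_1 / c_2 = 99$, then

$$\frac{p_1}{p_2} = \frac{c_1 / e_1}{c_2 / e_2} \ge \frac{c_1}{c_2} \cdot \frac{1/2}{1} = \frac{99}{2} = 49.5.$$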


  1. opt(f,), in the terminology of the paper. ↩︎

  2. Power(f,); see definition 9. ↩︎

Thanks for this reply. In general when I'm reading an explanation and come across a statement like, "this means that...", as in the above, if it's not immediately obvious to me why, I find myself wondering whether I'm supposed to see why and I'm just missing something, or if there's a complicated explanation that's being skipped.

In this case it sounds like there was a complicated explanation that was being skipped, and you did not expect readers to see why the statement was true. As a point of feedback: when that's the case I appreciate when writers make note of that fact in the text (e.g. with a parenthetical saying, "To see why this is true, refer to theorem... in the paper.").

Otherwise, I feel like I've just stopped understanding what's being written, and it's hard for me to stay engaged. If I know that something is not supposed to be obvious, then it's easier for me to just mentally flag it as something I can return to later if I want, and keep going.

Remember how, as the agent gets more farsighted, more of its control comes from Chocolate and Hug, while also these two possibilities become more and more likely?

I don't understand this bit -- how does more of its control come from Chocolate and Hug? Wouldn't you say its control comes from Wait!? Once it ends up in Candy, Chocolate, or Hug, it has no control left. No?

Yeah, you could think of the control as coming from Wait!. Will rephrase.

We bake the opponent's policy into the environment's rules: when you choose a move, the game automatically replies.

And the opponent plays to win, with perfect play?

Yes in this case, although note that that only tells us about the rules of the game, not about the reward function - most agents we're considering don't have the normal Tic-Tac-Toe reward function.

Imagine we only care about the reward we get next turn. How many goals choose Candy over Wait? Well, it's 50-50 – since we randomly choose a number between 0 and 1 for each state, both states have an equal chance of being maximal.

I got a little confused at the introduction of Wait!, but I think I understand it now. So, to check my understanding, and for the benefit of others, some notes:

  • the agent gets a reward for the Wait! state, just like the other states
  • for terminal states (the three non-Wait! states), the agent stays in that state, and keeps getting the same reward for all future time steps
  • so, when comparing Candy vs Wait! + Chocolate, the rewards after three turns would be (R_candy + γ * R_candy + γ^2 * R_candy) vs (R_wait + γ * R_chocolate + γ^2 * R_chocolate)

(I had at first assumed the agent got no reward for Wait!, and also failed to realize that the agent keeps getting the reward for the terminal state indefinitely, and so thought it was just about comparing different one-time rewards.)

Yes. The full expansions (with no limit on the time horizon) are

R_candy / (1 - γ) vs R_wait + γ * R_chocolate / (1 - γ), where 0 ≤ γ < 1.
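
As a quick numerical sanity check of those closed forms (a sketch with made-up reward values; the long truncated sums stand in for the infinite ones):

```python
gamma, r_candy, r_wait, r_choc = 0.9, 0.6, 0.2, 0.8  # hypothetical values
closed_candy = r_candy / (1 - gamma)
closed_wait = r_wait + gamma * r_choc / (1 - gamma)
approx_candy = sum(gamma**t * r_candy for t in range(1000))
approx_wait = r_wait + sum(gamma**t * r_choc for t in range(1, 1000))
print(round(closed_candy - approx_candy, 9), round(closed_wait - approx_wait, 9))  # both ~0
```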

Thoughts after reading and thinking about this post

The thing that's bugging me here is that Power and Instrumental convergence seem to be almost the same.

In particular, it seems like Power asks [a state]: "how good are you across all policies" and Instrumental Convergence asks: "for how many policies are you the best?". In an analogy to tournaments where policies are players, power cares about the average performance of a player across all tournaments, and instrumental convergence about how many first places that player got. In that analogy, the statement that "most goals incentivize gaining power over that environment" would then be "for most tournaments, the first place finisher is someone with good average performance." With this formulation, the statement

formal POWER contributions of different possibilities are approximately proportionally related to instrumental convergence.

seems to be exactly what you would expect (more first places should strongly correlate with better performance). And to construct a counter-example, one creates a state with a lot of second places (i.e., a lot of policies for which it is the second best state) but few first places. I think the graph in the "Formalizations" section does exactly that. If the analogy is sound, it feels helpful to me.

(This is all without having read the paper. I think I'd need to know more of the theory behind MDP to understand it.)

Yes, this is roughly correct!

As an additional note: it turns out, however, that even if you slightly refine the notion of "power that this part of the future gives me, given that I start here", you have neither "more power ⇒ instrumental convergence" nor "instrumental convergence ⇒ more power" as logical implications.

Instead, if you're drawing the causal graph, there are many, many situations which cause both instrumental convergence and greater power. The formal task is then, "can we mathematically characterize those situations?". Then, you can say, "power-seeking will occur for optimal agents with goals from [such and such distributions] for [this task I care about] at [these discount rates]".

It seems a common reading of my results is that agents tend to seek out states with higher power. I think this is usually right, but it's false in some cases. Here's an excerpt from the paper:

So, just because a state has more resources, doesn't technically mean the agent will go out of its way to reach it. Here's what the relevant current results say: parts of the future allowing you to reach more terminal states are instrumentally convergent, and the formal POWER contributions of different possibilities are approximately proportionally related to their instrumental convergence. As I said in the paper,

The formalization of power seems reasonable, consistent with intuitions for all toy MDPs examined. The formalization of instrumental convergence also seems correct. Practically, if we want to determine whether an agent might gain power in the real world, one might be wary of concluding that we can simply "imagine" a relevant MDP and then estimate e.g. the "power contributions" of certain courses of action. However, any formal calculations of POWER are obviously infeasible for nontrivial environments.

To make predictions using these results, we must combine the intuitive correctness of the power and instrumental convergence formalisms with empirical evidence (from toy models), with intuition (from working with the formal object), and with theorems (like theorem 46, which reaffirms the common-sense prediction that more cycles means asymptotic instrumental convergence, or theorem 26, fully determining the power in time-uniform environments). We can reason, "for avoiding shutdown to not be heavily convergent, the model would have to look like such-and-such, but it almost certainly does not...".

I think the Tic-Tac-Toe reasoning is a better intuition: it's instrumentally convergent to reach parts of the future which give you more control from your current vantage point. I'm working on expanding the formal results to include some version of this.

Thanks for writing this! It always felt like a blind spot to me that we only have Goodhart's law that says "if X is a proxy for Y and you optimize X, the correlation breaks" but we really mean a stronger version: "if you optimize X, Y will actively decrease". Your paper clarifies that what we actually mean is an intermediate version: "if you optimize X, it becomes harder to optimize Y". My conclusion would be that the intermediate version is true but the strong version false then. Would you say that's an accurate summary?

My conclusion would be that the intermediate version is true but the strong version false then. Would you say that's an accurate summary?

I'm not totally sure I fully follow the conclusion, but I'll take a shot at answering - correct me if it seems like I'm talking past you.

Taking Y to be some notion of human values, I think it's both true that Y actively decreases and that Y becomes harder for us to optimize. Both of these are caused, I think, by the agent's drive to take power / resources from us. If this weren't true, we might expect to see only "evil" objectives inducing catastrophically bad outcomes.

I should've specified that the strong version is "Y decreases relative to a world where neither X nor Y is being optimized". Am I right that this version is not true?

I don't immediately see why this wouldn't be true as well as the "intermediate version". Can you expand?

If X is "number of paperclips" and Y is something arbitrary that nobody optimizes, such as the ratio of number of bicycles on the moon to flying horses, optimizing X should be equally likely to increase or decrease Y in expectation. Otherwise "1-Y" would go in the opposite direction which can't be true by symmetry. But if Y is something like "number of happy people", Y will probably decrease because the world is already set up to keep Y up and a misaligned agent could disturb that state.

That makes sense, thanks. I then agree that it isn't always true that Y actively decreases, but it should generally become harder for us to optimize. This is the difference between a utility decrease and an attainable utility decrease.

Update: I generalized these results to stochastic MDPs (before, I assumed determinism).

We explored similar idea in "Military AI as a Convergent Goal of Self-Improving AI". In that article we suggested that any advance AI will have a convergent goal to take over the world and because of this, it will have convergent subgoal of developing weapons in the broad sense of the word "weapon": not only tanks or drones, but any instruments to enforce its own will over others or destroy them or their goals.

We wrote in the abstract: "We show that one of the convergent drives of AI is a militarization drive, arising from AI’s need to wage a war against its potential rivals by either physical or software means, or to increase its bargaining power. This militarization trend increases global catastrophic risk or even existential risk during AI takeoff, which includes the use of nuclear weapons against rival AIs, blackmail by the threat of creating a global catastrophe, and the consequences of a war between two AIs. As a result, even benevolent AI may evolve into potentially dangerous military AI. The type and intensity of militarization drive depend on the relative speed of the AI takeoff and the number of potential rivals."

That paper seems quite different from this post in important ways.

In particular, the gist of the OP seems to be something like "showing that pre-formal intuitions about instrumental convergence persist under a certain natural class of formalisations". In particular, it does so using formalism closer to standard machine learning research.

The paper you linked seems to me to instead assume that this holds true, and then apply that insight in the context of military strategy. Without speculating about the merits of that, it seems like a different thing which will appeal to different readers, and if it is important, it will be important for somewhat different reasons.