Introduction

I found Formalizing Convergent Instrumental Goals (Benson-Tilsen and Soares) to be quite readable. I was surprised that the instrumental convergence hypothesis had been formulated and proven (within the confines of a reasonable toy model); this caused me to update slightly upwards on existential risk from unfriendly AI.

This paper gives a mathematical formulation and proof of instrumental convergence within the aforementioned toy model. Instrumental convergence says that an agent $A$ with utility function $u$ will pursue instrumentally-relevant subgoals, even though this pursuit may not bear directly on $u$. Imagine that $u$ involves the proof of the Riemann hypothesis. $A$ will probably want to gain access to lots of computronium. What if $A$ turns us into computronium? Well, there's plenty of matter in the universe; $A$ could just let us be, right? Wrong. Let's see how to prove it.

My Background

I'm a second-year CS PhD student. I've recently started working through the MIRI research guide; I'm nearly finished with Naïve Set Theory, which I intend to review soon. To expose my understanding to criticism, I'm going to summarize this paper and its technical sections in a somewhat informal fashion. I'm aware that the paper isn't particularly difficult for those with a mathematical background. However, I think this result is important, and I couldn't find much discussion of it.

Intuitions

It is important to distinguish between the relative unpredictability of the exact actions selected by a superintelligent agent, and the relative predictability of the general kinds-of-things such an agent will pursue. As Eliezer wrote,

Suppose Kasparov plays against some mere chess grandmaster Mr. G, who's not in the running for world champion. My own ability is far too low to distinguish between these levels of chess skill. When I try to guess Kasparov's move, or Mr. G's next move, all I can do is try to guess "the best chess move" using my own meager knowledge of chess. Then I would produce exactly the same prediction for Kasparov's move or Mr. G's move in any particular chess position. So what is the empirical content of my belief that "Kasparov is a better chess player than Mr. G"?
...
The outcome of Kasparov's game is predictable because I know, and understand, Kasparov's goals. Within the confines of the chess board, I know Kasparov's motivations - I know his success criterion, his utility function, his target as an optimization process. I know where Kasparov is ultimately trying to steer the future and I anticipate he is powerful enough to get there, although I don't anticipate much about how Kasparov is going to do it.

In other words: we may not be able to predict precisely how Kasparov will win a game, but we can be fairly sure his plan will involve instrumentally-convergent subgoals, such as the capture of his opponent's pieces. For an unfriendly AI, these subgoals probably include nasty things like hiding its true motives as it accumulates resources.

Definitions

Consider a discrete universe made up of $n$ squares, with the state of square $i$ denoted by $s_i$ (we may also refer to this as region $i$). Let's say that Earth and other things we do (or should) care about are in some region $h$. If $u_h$ is the same for all possible values of $s_h$, we say that $u$ is indifferent to region $h$. The million-dollar question is whether $A$ (whose $u$ is indifferent to $h$) will leave $h$ alone. First, however, we need to define a few more things.
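To make this concrete, here is a minimal Python sketch of the setup; every name in it (`SQUARE_STATES`, `u_components`, and so on) is my own illustrative choice, not notation from the paper:

    # A toy universe of n squares; square i holds a state drawn from S_i.
    N_SQUARES = 3
    SQUARE_STATES = [{"empty", "computronium"}] * N_SQUARES  # S_i for each i

    def is_indifferent(u_i, states_i):
        """u is indifferent to region i iff u_i takes the same value
        on every possible state of square i."""
        return len({u_i(s) for s in states_i}) == 1

    # A utility function that only cares about square 0; squares 1 and 2
    # (think of square 2 as our region h) contribute a constant.
    u_components = [
        lambda s: 1.0 if s == "computronium" else 0.0,
        lambda s: 0.0,
        lambda s: 0.0,
    ]

    assert not is_indifferent(u_components[0], SQUARE_STATES[0])
    assert is_indifferent(u_components[2], SQUARE_STATES[2])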

Actions

At any given time step (time is also considered discrete in this universe), $A$ has a set of actions $\mathcal{A}_i$ it can perform in square $i$; $A$ may select an action $a_i \in \mathcal{A}_i$ for each square. Then, the transition function for $s_i$ (how the world evolves in response to some action $a_i$) is a function whose domain is the Cartesian product of the possible actions and the possible state values, $f_i : \mathcal{A}_i \times S_i \to S_i$; basically, every combination of what $A$ can do and what $s_i$ can be can produce a distinct new value for $s_i$. This transition function can be defined globally with lots more Cartesian products. Oh, and $A$ is always allowed to do nothing in a given square, so $\mathcal{A}_i$ is never empty.
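Continuing the toy sketch from above, the action sets and local transition function might look like this (the `harvest` action and its effect are invented for the example):

    NOOP = "noop"  # the do-nothing action, always available

    def actions_for_square(i):
        """A_i: the actions A may take in square i; never empty,
        since NOOP is always allowed."""
        return {NOOP, "harvest"}

    def transition(i, action, state):
        """f_i : A_i x S_i -> S_i - how square i evolves under an action."""
        if action == "harvest" and state == "computronium":
            return "empty"  # matter extracted, square left bare
        return state        # NOOP and everything else leave s_i alone

    def step(actions, states):
        """The global transition is just f_i applied square-by-square."""
        return [transition(i, a, s)
                for i, (a, s) in enumerate(zip(actions, states))]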

Resources

Let $\mathcal{R}$ represent all of the resources to which $A$ may or may not have access. Define $R^t \subseteq \mathcal{R}$ to be the resources at $A$'s disposal at time step $t$ - basically, we know it's some combination of the defined resources $\mathcal{R}$.

$A$ may choose some resource allocation $R^t = R^t_1 \sqcup \dots \sqcup R^t_n$ over the squares; this allocation is defined just like you'd expect (so if you don't know what the Azerothian portal has to do with resource allocation, don't worry about it). The resources committed to a square may affect the local actions available: $A$ can only choose actions permitted by the selected allocation, $a_i \in \mathcal{A}_i(R^t_i)$. Equally intuitively, how resources change over time is a function of the current resources, the actions selected, and the state of the universe; the resources available after a time step are the combination of what we didn't use and what each square's resource transition function gave us back.
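Here is one way to sketch resource pools and allocations in Python, treating resources as multisets; the `Counter`-based representation is my own choice, not the paper's:

    from collections import Counter

    def allocation_is_valid(pool, allocation):
        """allocation: one Counter per square; A cannot hand out more
        resources than it actually holds."""
        committed = sum(allocation, Counter())
        return all(committed[r] <= pool[r] for r in committed)

    def update_resources(pool, used, returned):
        """Resources after a step: what we didn't commit, plus whatever
        each square's resource transition function gave back."""
        return (pool - used) + returned

    pool = Counter({"energy": 5})
    alloc = [Counter({"energy": 2}), Counter(), Counter({"energy": 1})]
    assert allocation_is_valid(pool, alloc)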

Resources go beyond raw materials to include things like machines and technologies. The authors note:

We can also represent space travel as a convergent instrumental goal by allowing $A$ only actions that have no effects in certain regions, until it obtains and spends some particular resources representing the prerequisites for traveling to those regions. (Space travel is a convergent instrumental goal because gaining influence over more regions of the universe lets $A$ optimize those new regions according to its values or otherwise make use of the resources in that region.)

Universe

A universe-history is a sequence of states, actions, and resources. The actions must conform to resources available at each step, while the states and resources evolve according to the transition functions.

We say a strategy is an action sequence over all time steps and regions; define a partial strategy $\pi_I$ to be a strategy for some part of the universe (represented as a subset $I$ of the square indices $[n]$); $\pi_I$ only allocates resources and does things in the squares of $I$. We can combine partial strategies as long as they don't overlap in some square $i$.

We call a strategy feasible if it complies with the resource restrictions and with the transition functions for both states and resources. Let $\text{Feasible}(R)$ be the set of all feasible strategies given resource allocation $R$ over time; define the set $\text{Feasible}_I(R)$ of feasible partial strategies similarly.
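As a sketch, partial strategies can be represented as dictionaries from square indices to per-step action lists; `legal_actions` below stands in for the resource-restricted action sets $\mathcal{A}_i(R_i)$, and the feasibility check is deliberately simplified (a full version would thread the state and resource transitions through every time step):

    def combine(pi_I, pi_J):
        """Combine partial strategies with disjoint index sets."""
        assert not (pi_I.keys() & pi_J.keys()), "strategies must not overlap"
        return {**pi_I, **pi_J}

    def is_feasible(partial_strategy, pool, legal_actions):
        """A strategy is feasible if every action it takes is permitted
        by the resources allocated at that step (simplified here)."""
        return all(a in legal_actions(square, pool)
                   for square, actions in partial_strategy.items()
                   for a in actions)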

Utility

Utility functions evaluate states of the universe; $u$ evaluates each region and then combines them: $u(s) = \sum_{i \in [n]} u_i(s_i)$. Observe that since actions taken in regions to which $u$ is indifferent have no effect on $u$, any actions taken therein are purely instrumental in nature.
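In the toy sketch from earlier, the additive structure and its consequence look like this (reusing the hypothetical `u_components`):

    def utility(u_components, states):
        """u(s) = sum_i u_i(s_i): utility is additive across regions."""
        return sum(u_i(s_i) for u_i, s_i in zip(u_components, states))

    # Changing the state of an indifferent square leaves u untouched, so
    # anything A does there can only be instrumentally motivated.
    assert utility(u_components, ["computronium", "empty", "empty"]) == \
           utility(u_components, ["computronium", "empty", "computronium"])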

Agent

$A$ chooses the best possible strategy - that is, the one that maximizes the $u$-value of the final state of the universe-history: $\pi^* \in \operatorname{arg\,max}_{\pi \in \text{Feasible}} u(s^T)$, where $s^T$ is the final state under $\pi$. Note that this definition implies a Cartesian boundary between the agent and the universe; that is, $A$ doesn't model itself as part of the environment (it isn't naturalized).
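The agent itself is then just an argmax over feasible strategies - here as a brute-force sketch, where `run` is a hypothetical helper that plays a strategy forward through the transition functions and returns the final state:

    def best_strategy(feasible_strategies, run, u):
        """A selects the feasible strategy whose final state maximizes u.
        Brute force - only sensible for tiny toy universes."""
        return max(feasible_strategies, key=lambda pi: u(run(pi)))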

Seizing the Means of Cartesian Production

Let's talk about the situations in which $A$ will seize resources; that is, when $A$ will take actions to increase its resource pool.

Since resources can only lead to more freedom of action, they are never detrimental, and resources have positive value as long as the best strategy the agent could hope to employ includes an action that can only be taken if the agent possesses those resources. Hence, if there is an action that increases the agent's pool of resources $R$, then the agent will take that action unless it has a specific incentive from $u$ to avoid taking that action.

Define a null action to be any action which doesn't produce new resources. It's easy to see that null actions are never instrumentally valuable. What we want to show is that $A$ will take non-null actions in regions to which $u$ is indifferent; regions like $h$, where we live, grow, and love. Regions full of instrumentally-valuable resources.

Discounted Lunches

An action preserves resources if the input resources are contained in the outputs (nothing is lost, and resources are sometimes gained). A cheap lunch is a partial strategy on some subset of squares $I$ that is feasible given resources $R$ and whose constituent actions preserve resources. A free lunch is a cheap lunch that doesn't require resources ($R = \emptyset$).
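Sketched as predicates over the `Counter` multisets from before (names mine, not the paper's):

    from collections import Counter

    def preserves_resources(consumed, produced):
        """Inputs contained in outputs: nothing lost, possibly gained."""
        return all(consumed[r] <= produced[r] for r in consumed)

    def is_free_lunch(required):
        """A free lunch is a cheap lunch that needs no resources at all."""
        return sum(required.values()) == 0

    assert preserves_resources(Counter({"energy": 2}),
                               Counter({"energy": 2, "metal": 1}))
    assert is_free_lunch(Counter())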

This is intended to model actions that "pay for themselves"; for example, producing solar panels will incur some significant energy costs, but will later pay back those costs by collecting energy.

A cheap lunch is compatible with a global strategy if the resources required for the lunch are available for use in $I$ at each time step. Basically, at no point does the partial strategy require resources already being used elsewhere.

Possibility of Non-Null Actions

We show that it's really hard to assert that $A$ won't chow down on a lunch of an atom or two (or far more).

Lemma 1: Cheap Lunches and Utility

Cheap lunches don't reduce utility. Let's say we have a cheap lunch in region $i$ and some global strategy $\pi$ (which only takes null actions in region $i$). Assume the cheap lunch is compatible with the global strategy; this means the cheap lunch is feasible. If $u$ is indifferent to region $i$, the conjugate strategy (the combination of the cheap lunch and the remainder of the global strategy) has utility equal to that of $\pi$.

Proof. We show feasibility of the conjugate strategy by demonstrating that we don't need to change the resource allocation elsewhere; this is done by induction over time steps. Since $A$ isn't doing anything in region $i$ under strategy $\pi$, taking resource-preserving actions there instead cannot reduce what $A$ is later able to do in the regions relevant to $u$. This implies that $u$ cannot be decreased by taking the cheap lunch.
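The construction amounts to splicing the lunch into the squares where the global strategy was idle - in the dictionary representation from before, roughly:

    def conjugate(global_pi, lunch_pi):
        """Lemma 1's conjugate strategy: the cheap lunch replaces the null
        actions global_pi was taking on lunch_pi's squares. We assume, as
        the lemma does, that the lunch is compatible with global_pi."""
        return {**global_pi, **lunch_pi}  # lunch overrides the idle squares

Resource preservation is what makes the induction go through: at every step, the spliced strategy holds at least the resources the original did.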

Theorem 1: Cheap Lunches and Optimality

If there is an optimal strategy and a compatible cheap lunch in region $i$ (to which $u$ is indifferent), there's also an optimal strategy with a non-null action in region $i$.

Proof. If the optimal strategy has non-null actions in region $i$, we're done. Otherwise, apply Lemma 1 to derive a conjugate strategy taking advantage of the cheap lunch. Since it follows from Lemma 1 that the conjugate strategy has equal utility, it is also optimal, and it involves a non-null action in region $i$.

Corollary 1: Free Lunches and Optimality

If there is an optimal strategy and a free lunch in region $i$, and if $u$ is indifferent to region $i$, there's an optimal strategy with a non-null action in region $i$.

Proof. Free lunches require no resources, so they are compatible with any strategy; apply Theorem 1.

For instrumental convergence to not hold, we would have to show that no possible strategy in $h$ is a cheap lunch compatible with any optimal strategy.

Necessity of Non-Null Actions

We show that as long as $A$ can extract useful resources (resources whose availability leads to increased utility), it will.

Theorem 2: Necessity

Consider the maximum utility achievable outside of region $i$ via strategies achievable without additional resources; refer to this maximum as $u^*$. Suppose we have some feasible primary strategy, along with a cheap lunch in region $i$ that is feasible using resources $R$. Suppose that the cheap lunch is compatible with the primary strategy, that the cheap lunch provides the resources necessary for the implementation of the primary strategy, and that the utility of the primary strategy is greater than $u^*$. Then if $u$ is indifferent to region $i$, all optimal strategies have a non-null action in region $i$.

Proof. Consider the conjugate strategy $\pi^\dagger$, consisting of the primary strategy and the cheap lunch. Allocate the resources gained via the cheap lunch according to the primary strategy; this is feasible, since we know the cheap lunch is compatible with the primary strategy, which is in turn enabled by the resources gained.

Consider any strategy $\pi'$ that doesn't do anything in region $i$ and doesn't require any resource inputs; it's trivial to see that this is feasible. Since $u$ is indifferent to $i$, we do some algebraic substitution of utility values to see that the conjugate plan $\pi^\dagger$ has strictly higher utility than $\pi'$ does. This means that $\pi'$ is suboptimal.
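Spelled out, the substitution is the following chain (notation mine: $c$ is the constant contribution of the indifferent region $i$, and $u_{\neg i}$ sums the remaining regions' utilities):

$$u(\pi') = u_{\neg i}(\pi') + c \le u^* + c < u_{\neg i}(\pi_{\text{primary}}) + c = u(\pi^\dagger).$$

The first inequality holds because $\pi'$ uses no extra resources, so its utility outside $i$ is at most $u^*$; the strict inequality is the hypothesis that the primary strategy beats $u^*$.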

Theorem 3: Defection Threshold

Even if $u$ is not indifferent about region $i$, Theorem 2 still holds if the utility gained by enacting the aforementioned conjugate strategy is greater than the maximal amount $u$ cares about region $i$ (formally expressed by $\max_{s_i, s_i'} |u_i(s_i) - u_i(s_i')|$).

Proof. The structure is similar to that of Theorem 2; just use the new maximum caring differential in the algebraic substitution.

We interpret Theorem 3 as a partial confirmation of Omohundro's thesis in the following sense. If there are actions in the real world that produce more resources than they consume, and the resources gained by taking those actions allow agents the freedom to take various other actions, then we can justifiably call these actions "convergent instrumental goals." Most agents will have a strong incentive to pursue these goals, and an agent will refrain from doing so only if it has a utility function over the relevant region that strongly disincentivizes those actions.

The Bit Universe

The authors introduce a toy model and use the freshly-proven theorems to illustrate how $A$ takes non-null actions in our precious region $h$ (both when $u$ is indifferent to $h$ and when it is not). This isn't good; the vast majority of utility-maximizing agents will not steer us towards futures we find desirable. If you're interested, I recommend reading this section for yourself, even if you aren't very comfortable with math.

Our Universe

The path that our model shows is untenable is the path of designing powerful agents intended to autonomously have large effects on the world, maximizing goals that do not capture all the complexities of human values. If such systems are built, we cannot expect them to cooperate with or ignore humans, by default.

We have much work to do. The risks are enormous and the challenges "impossible", but we have time on the clock. AI safety research is primarily talent-constrained. If you've been sitting on the sidelines, wondering whether you're good enough to learn the material - well, I can't make any promises. But if you feel the burning desire to do something, to put forth some extraordinary effort, to become stronger - I invite you to contact me so we can work through the material together.

Questions and Errata

  • Page 4, left column, last line: why is $R^{t+1}$ defined as just the union of each square's outputs - shouldn't we take the union of the outputs and whatever resources weren't used at time $t$?
  • Page 8, right column, second full paragraph, last line: should be "we have two options available to us".


Comments
I was surprised that the instrumental convergence hypothesis had been proven

I'd like people to in general be much more careful about describing mathematical results in this way. The instrumental convergence hypothesis is not a mathematical fact, and so it cannot be proven; what can be done, and what's done here, is to write down toy models in which something we're prepared to call instrumental convergence happens, but more work needs to be done to answer questions like 1) how sensitive the result is to changes in the toy model and 2) what relevance we expect the toy model to have to reality.

Separately, I expect instrumental convergence to happen pretty robustly across a wide class of toy models, and for this fact to be pretty relevant to reality.

You’re right, thanks. I wasn’t sure whether I wanted to qualify that statement, given that I clarified that the proof made assumptions soon after. I’ll make the edit.