Introduction
I found Formalizing Convergent Instrumental Goals (Benson-Tilsen and Soares) to be quite readable. I was surprised that the instrumental convergence hypothesis had been formulated and proven (within the confines of a reasonable toy model); this caused me to update slightly upwards on existential risk from unfriendly AI.
This paper mathematically formulates and proves instrumental convergence within the aforementioned toy model. Instrumental convergence says that an agent A with utility function U will pursue instrumentally relevant subgoals, even though this pursuit may not bear directly on U. Imagine that U involves the proof of the Riemann hypothesis. A will probably want to gain access to lots of computronium. What if A turns us into computronium? Well, there's plenty of matter in the universe; A could just let us be, right? Wrong. Let's see how to prove it.
My Background
I'm a second-year CS PhD student. I've recently started working through the MIRI research guide; I'm nearly finished with Naïve Set Theory, which I intend to review soon. To expose my understanding to criticism, I'm going to summarize this paper and its technical sections in a somewhat informal fashion. I'm aware that the paper isn't particularly difficult for those with a mathematical background. However, I think this result is important, and I couldn't find much discussion of it.
Intuitions
It is important to distinguish between the relative unpredictability of the exact actions selected by a superintelligent agent, and the relative predictability of the general kinds-of-things such an agent will pursue. As Eliezer wrote,
In other words: we may not be able to predict precisely how Kasparov will win a game, but we can be fairly sure his plan will involve instrumentally convergent subgoals, such as the capture of his opponent's pieces. For an unfriendly AI, these subgoals probably include nasty things like hiding its true motives as it accumulates resources.
Definitions
Consider a discrete universe made up of n squares, with square i denoted by $S_i$ (we may also refer to this as region i). Let's say that Earth and other things we do (or should) care about are in some region h. If U assigns the same value no matter what state square i is in, we say that A is indifferent to region i. The 3↑↑↑3 dollar question is whether A (whose U is indifferent to $S_h$) will leave $S_h$ alone. First, however, we need to define a few more things.
Actions
At any given time step (time is also considered discrete in this universe), A has a set of actions $A_i$ it can perform in $S_i$; A may select an action for each square. Then, the transition function for $S_i$ (how the world evolves in response to some action $a_i$) is a function whose domain is the Cartesian product of the possible actions and the possible state values, $T_i : A_i \times S_i \to S_i$; basically, every combination of what A can do and what can be in $S_i$ can produce a distinct new value for $S_i$. This transition function can be defined globally with lots more Cartesian products. Oh, and A is always allowed to do nothing in a given square, so $A_i$ is never empty.
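To make this concrete, here's a minimal Python sketch of the local and global transitions. The integer states and string actions are my own toy encoding, not anything from the paper.

```python
# A toy version of the local and global transition functions. The integer
# states and string actions are an invented encoding, not the paper's.

from typing import List

State = int    # the current value of square S_i
Action = str   # an element of A_i; "noop" is always available

def T_i(action: Action, state: State) -> State:
    """Local transition T_i : A_i x S_i -> S_i for a single square."""
    if action == "noop":
        return state          # doing nothing leaves the square unchanged
    if action == "increment":
        return state + 1      # a stand-in for "doing something" locally
    raise ValueError(f"unknown action {action!r}")

def step(actions: List[Action], states: List[State]) -> List[State]:
    """Global transition: apply each square's local transition independently."""
    return [T_i(a, s) for a, s in zip(actions, states)]

print(step(["noop", "increment", "noop"], [0, 0, 0]))   # [0, 1, 0]
```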
Resources
Let R represent all of the resources to which A may or may not have access. Define $R^t \in \mathcal{P}(R)$ to be the resources at A's disposal at time step t - basically, we know it's some combination of the defined resources R.
A may choose some resource allocation $\coprod_i R^t_i \subseteq R^t$ over the squares; this allocation is defined just like you'd expect (so if you don't know what the Azerothian portal has to do with resource allocation, don't worry about it). The resources committed to a square may affect the local actions available. A can only choose actions permitted by the selected allocation. Equally intuitively, how resources change over time is a function of the current resources, the actions selected, and the state of the universe; the resources available after a time step are the combination of what we didn't use and what each square's resource transition function gave us back.
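Here's a rough sketch of the resource bookkeeping as I picture it, with resources modeled as a set of labeled tokens (my simplification, not the paper's formalism):

```python
# Resource bookkeeping with resources modeled as sets of labeled tokens
# (a simplification of the paper's formalism).

from typing import FrozenSet, List

Resources = FrozenSet[str]

def allocate(pool: Resources, requests: List[Resources]) -> List[Resources]:
    """One disjoint allocation contained in the pool: each square gets its
    request only if those tokens are still unclaimed."""
    remaining, allocation = set(pool), []
    for req in requests:
        granted = req if req <= remaining else frozenset()
        remaining -= granted
        allocation.append(frozenset(granted))
    return allocation

def resource_step(pool: Resources, allocation: List[Resources],
                  returned: List[Resources]) -> Resources:
    """Next pool = resources we didn't commit anywhere, plus whatever each
    square's resource transition hands back."""
    used = frozenset().union(*allocation)
    unused = frozenset(pool) - used
    back = frozenset().union(*returned)
    return unused | back

pool = frozenset({"steel", "ore", "drill"})
alloc = allocate(pool, [frozenset({"drill"}), frozenset({"steel", "ore"})])
print(resource_step(pool, alloc, [frozenset({"drill", "ore"}), frozenset({"widget"})]))
```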
Resources go beyond raw materials to include things like machines and technologies. The authors note:
Universe
A universe-history is a sequence of states, actions, and resources. The actions must conform to the resources available at each step, while the states and resources evolve according to the transition functions.
We say a strategy is an action sequence $\langle \bar a^0, \bar a^1, \ldots, \bar a^k \rangle$ over all k time steps and n regions; define a partial strategy $\langle \bar a^k \rangle_L$ to be a strategy for some part of the universe L (represented as a subset of the square indices $[n]$); $\langle \bar a^k \rangle_L$ only allocates resources and does things in the squares of L. We can combine partial strategies as long as they don't overlap in some $S_i$.
We call a strategy feasible if it complies with the resource restrictions and with the transition functions for both states and resources. Let $\text{Feasible}(\langle P^k \rangle)$ be the set of all feasible strategies given the resource allocation over time $\langle P^k \rangle$; define the set $\text{Feasible}_L(\langle R^k \rangle)$ similarly.[1]
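Here's how I'd encode partial strategies and their combination (the dictionary representation is mine, and I'm omitting the feasibility predicate itself to keep things short):

```python
# Partial strategies as dictionaries mapping region index -> action sequence
# (an invented encoding); combining them requires disjoint regions.

from typing import Dict, List

PartialStrategy = Dict[int, List[str]]

def combine(p: PartialStrategy, q: PartialStrategy) -> PartialStrategy:
    """Combine two partial strategies that act on disjoint sets of regions."""
    if set(p) & set(q):
        raise ValueError("partial strategies overlap in some region")
    return {**p, **q}

primary = {0: ["noop", "build"], 1: ["noop", "noop"]}   # acts in regions 0 and 1
lunch = {2: ["harvest", "harvest"]}                     # acts only in region 2
print(combine(primary, lunch))
```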
Utility
Utility functions evaluate states of the universe; U evaluates each region and then combines the results: $U(\bar s) := \sum_{i \in [n]} U_i(\bar s_i)$. Observe that since actions taken in regions to which U is indifferent have no effect on U, any actions taken therein are purely instrumental in nature.
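In code, the additive structure and the indifference condition look something like this (again, a toy encoding of my own):

```python
# The additively separable utility U(s) = sum_i U_i(s_i), plus the
# indifference condition for a single region (toy encoding, not the paper's).

from typing import Callable, Sequence

def U(local_utilities: Sequence[Callable[[int], float]],
      state: Sequence[int]) -> float:
    """U evaluates each region with its own U_i and sums the results."""
    return sum(U_i(s_i) for U_i, s_i in zip(local_utilities, state))

def indifferent(U_i: Callable[[int], float], states: Sequence[int]) -> bool:
    """The agent is indifferent to region i iff U_i is constant on S_i."""
    return len({U_i(s) for s in states}) <= 1

# Region 0 is the only region the agent cares about; region 1 plays the role of S_h.
local_utilities = [lambda s: float(s), lambda s: 0.0]
print(U(local_utilities, [3, 7]))                  # 3.0 -- region 1 contributes nothing
print(indifferent(local_utilities[1], range(10)))  # True
```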
Agent
A chooses the best possible strategy - that is, the one that maximizes the U of the final state of the universe-history: $A := \operatorname{argmax}_{\langle \bar a^k \rangle \in \text{Feasible}} U(\langle \bar a^k \rangle)$. Note that this definition implies a Cartesian boundary between the agent and the universe; that is, A doesn't model itself as part of the environment (it isn't naturalized).
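Taken literally, the definition is just a brute-force search over the feasible strategies. Here's a tiny, purely illustrative version with no resource constraints, so every strategy is feasible; an actual superintelligence presumably wouldn't enumerate like this:

```python
# The agent, taken literally: argmax over all feasible strategies. Every
# strategy here is feasible (no resource constraints), so we can enumerate.
# Purely illustrative -- the encoding and numbers are invented.

from itertools import product
from typing import List, Tuple

ACTIONS = ("noop", "increment")   # each region's A_i; noop is always allowed
N_REGIONS, HORIZON = 2, 2

def final_state(strategy: Tuple[Tuple[str, ...], ...]) -> List[int]:
    state = [0] * N_REGIONS
    for joint_action in strategy:                  # one joint action per time step
        state = [s + 1 if a == "increment" else s
                 for a, s in zip(joint_action, state)]
    return state

def utility(state: List[int]) -> float:
    return float(state[0])        # only region 0 matters; indifferent to region 1

strategies = product(product(ACTIONS, repeat=N_REGIONS), repeat=HORIZON)
best = max(strategies, key=lambda strat: utility(final_state(strat)))
print(best, utility(final_state(best)))
```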
Seizing the Means of Cartesian Production
Let's talk about the situations in which A will seize resources; that is, when A will take actions to increase its resource pool.
Define a null action to be any action which doesn't produce new resources. It's easy to see that null actions are never instrumentally valuable. What we want to show is that A will take non-null actions in regions to which U is indifferent: regions like h, where we live, grow, and love; regions full of instrumentally valuable resources.
Discounted Lunches
An action preserves resources if its input resources are contained in its outputs (nothing is lost, and resources are sometimes gained). A cheap lunch in some subset of squares J is a partial strategy that is feasible given resources $\langle R^k \rangle$ and whose constituent actions preserve resources. A free lunch is a cheap lunch that doesn't require any resources.
A cheap lunch is compatible with a global strategy if the resources required for the lunch are available for use in J at each time step. Basically, at no point does the partial strategy require resources already being used elsewhere.
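Here's how I'd write the lunch definitions as predicates, with resources again modeled as sets of tokens; note that I'm leaving out the feasible-given-$\langle R^k \rangle$ requirement to keep the sketch short:

```python
# "Preserves resources", "cheap lunch", and "free lunch" as predicates over a
# partial strategy, with resources modeled as sets of tokens (invented encoding;
# the feasibility-given-resources requirement is omitted here).

from typing import FrozenSet, List, Tuple

Resources = FrozenSet[str]
Step = Tuple[Resources, Resources]   # (resources consumed, resources produced) at one step

def preserves_resources(step: Step) -> bool:
    consumed, produced = step
    return consumed <= produced      # nothing is lost; something may be gained

def is_cheap_lunch(partial_strategy: List[Step]) -> bool:
    return all(preserves_resources(step) for step in partial_strategy)

def is_free_lunch(partial_strategy: List[Step]) -> bool:
    return is_cheap_lunch(partial_strategy) and all(
        consumed == frozenset() for consumed, _ in partial_strategy)

mining = [(frozenset({"drill"}), frozenset({"drill", "ore"}))]
sunbathing = [(frozenset(), frozenset({"energy"}))]
print(is_cheap_lunch(mining), is_free_lunch(mining))          # True False
print(is_cheap_lunch(sunbathing), is_free_lunch(sunbathing))  # True True
```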
Possibility of Non-Null Actions
We show that it's really hard to assert that A won't chow down on a lunch of an atom or two (or $1.3 \times 10^{50}$).
Lemma 1: Cheap Lunches and Utility
Cheap lunches don't reduce utility. Let's say we have a cheap lunch $\langle \bar a^k \rangle_{\{i\}}$ in region i and some global strategy $\langle \bar b^k \rangle$ (which only takes null actions in region i). Assume the cheap lunch is compatible with the global strategy; this means the cheap lunch is feasible. If A is indifferent to region i, the conjugate strategy (of the cheap lunch and the remainder of the global strategy) has equal utility to $\langle \bar b^k \rangle$.
Proof. We show feasibility of the conjugate strategy by demonstrating that we don't need to change the resource allocation elsewhere; this is done by induction over time steps. Since A isn't doing anything in region i under strategy $\langle \bar b^k \rangle$, taking resource-preserving actions there instead cannot reduce what A is later able to do in the regions relevant to U. This implies that U cannot be decreased by taking the cheap lunch.
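A toy numerical check of the lemma's conclusion, with made-up dynamics: swapping the null actions in an indifferent region for resource-preserving ones leaves total utility unchanged. (This only illustrates the indifference half of the argument, not the resource-feasibility induction.)

```python
# Lemma 1's conclusion in miniature: the conjugate strategy, which splices a
# "cheap lunch" into the indifferent region, scores exactly the same utility
# as the original strategy that does nothing there. Made-up dynamics.

def U(state):
    return float(state[0]) + 0.0          # U_1 is constant: indifferent to region 1

def run(strategy, state=(0, 0)):
    s = list(state)
    for joint in strategy:                # "work" bumps region 0; "harvest" bumps region 1
        s = [x + (1 if a in ("work", "harvest") else 0) for a, x in zip(joint, s)]
    return s

global_strategy = [("work", "noop"), ("work", "noop")]    # null actions in region 1
conjugate = [("work", "harvest"), ("work", "harvest")]    # cheap lunch spliced into region 1
print(U(run(global_strategy)), U(run(conjugate)))         # 2.0 2.0
```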
Theorem 1: Cheap Lunches and Optimality
If there is an optimal strategy and a cheap lunch in region i (to which A is indifferent) compatible with that strategy, there's also an optimal strategy with a non-null action in region i.
Proof. If the optimal strategy takes a non-null action in region i, we're done. Otherwise, apply Lemma 1 to derive a conjugate strategy that takes advantage of the cheap lunch. Since Lemma 1 tells us the conjugate strategy has equal utility, it is also optimal, and it takes a non-null action in region i.
Corollary 1: Free Lunches and Optimality
If there is an optimal strategy and a free lunch in region i, and if A is indifferent to region i, there's an optimal strategy with a non-null action in region i.
Proof. Free lunches require no resources, so they are compatible with any strategy; apply Theorem 1.
For instrumental convergence not to hold here, we would have to show that no possible partial strategy in h is a cheap lunch compatible with any optimal strategy.
Necessity of Non-Null Actions
We show that as long as A can extract useful resources (resources whose availability leads to increased utility), it will.
Theorem 2: Necessity
Consider the maximum utility achievable outside of region i using strategies that require no additional resources; refer to this maximum as u. Suppose we have some feasible primary strategy $\langle \bar b^k \rangle_{[n]-i}$ and a cheap lunch $\langle \bar c^k \rangle_{\{i\}}$ feasible using resources $\langle P^k \rangle$. Suppose that the cheap lunch is compatible with the primary strategy, that the cheap lunch provides the resources necessary to implement the primary strategy, and that the utility of the primary strategy is greater than u. Then if A is indifferent to region i, all optimal strategies have a non-null action in region i.
Proof. Consider the conjugate strategy $\langle \bar d^k \rangle$, consisting of the primary strategy and the cheap lunch. Allocate the resources gained via the cheap lunch according to the primary strategy; this is feasible since the cheap lunch is compatible with the primary strategy, which is in turn enabled by the resources the lunch provides.
Consider any strategy $\langle \bar e^k \rangle$ that doesn't do anything in region i and doesn't require any resource inputs; it's trivial to see that this is feasible. Such a strategy can score at most u, while the conjugate strategy scores the primary strategy's utility, which exceeds u; since A is indifferent to region i, a little algebraic substitution shows that the conjugate strategy has strictly higher utility than $\langle \bar e^k \rangle$ does. This means that $\langle \bar e^k \rangle$ is suboptimal.
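To see the flavor of the theorem, here's a tiny made-up instance where the only way to score any utility is to first harvest resources in region h, to which the agent is indifferent:

```python
# A tiny made-up instance of Theorem 2: utility comes from building in region 0,
# building requires ore, and the only ore is in region h, to which the agent is
# indifferent. Every utility-maximizing strategy therefore mines in h.

from itertools import product

ACTIONS = ("noop", "mine", "build")   # "mine" only does anything in h; "build" needs ore

def utility(strategy) -> float:
    ore, built = 0, 0
    for a0, ah in strategy:           # (action in region 0, action in region h)
        if ah == "mine":
            ore += 1                  # the cheap lunch: extracting resources from h
        if a0 == "build" and ore > 0:
            ore -= 1
            built += 1                # building is all that U rewards
    return float(built)               # U is indifferent to the state of h itself

strategies = list(product(product(ACTIONS, repeat=2), repeat=3))   # horizon of 3 steps
best = max(strategies, key=utility)
print(best, utility(best))
# Strategies that never mine in h score 0, so none of them are optimal.
```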
Theorem 3: Defection Threshold
Even if U is not indifferent to region i, Theorem 2 can still hold: it suffices for the utility A gains by enacting the aforementioned conjugate strategy to be greater than the maximal amount U cares about region i (formally, $\max_{s, s' \in S_i} |U_i(s) - U_i(s')|$).
Proof. The structure is similar to that of Theorem 2; just use the new maximum caring differential in the algebraic substitution.
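Writing out that condition in symbols (my paraphrase, reusing u and the conjugate strategy $\langle \bar d^k \rangle$ from Theorem 2; not the paper's exact statement):

```latex
% Defection threshold, as I read it: the instrumental gain from the conjugate
% strategy must exceed the most that region i can matter to U terminally.
U\left(\langle \bar d^k \rangle\right) - u \;>\; \max_{s, s' \in S_i} \left| U_i(s) - U_i(s') \right|
```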
The Bit Universe
The authors introduce a toy model and use the freshly proven theorems to illustrate how A takes non-null actions in our precious $S_h$ (both when it is indifferent to h and when it is not). This isn't good; the vast majority of utility-maximizing agents will not steer us towards futures we find desirable. If you're interested, I recommend reading this section for yourself, even if you aren't very comfortable with math.
Our Universe
We have much work to do. The risks are enormous and the challenges "impossible", but we have time on the clock. AI safety research is primarily talent-constrained. If you've been sitting on the sidelines, wondering whether you're good enough to learn the material - well, I can't make any promises. But if you feel the burning desire to do something, to put forth some extraordinary effort, to become stronger - I invite you to contact me so we can work through the material together.
Questions and Errata
[1] By the axiom of specification, $\text{Feasible}(\langle P^k \rangle) = \{\langle \bar a^k \rangle : \text{isFeasible}(\langle P^k \rangle, \langle \bar a^k \rangle)\}$.