I found Formalizing Convergent Instrumental Goals (Benson-Tilsen and Soares) to be quite readable. I was surprised that the instrumental convergence hypothesis had been formulated and proven (within the confines of a reasonable toy model); this caused me to update slightly upwards on existential risk from unfriendly AI.

This paper involves the mathematical formulation and proof of instrumental convergence, within the aforementioned toy model. Instrumental convergence says that an agent $A$ with utility function $U$ will pursue instrumentally-relevant subgoals, even though this pursuit may not bear directly on $U$. Imagine that $U$ involves the proof of the Riemann hypothesis. $A$ will probably want to gain access to lots of computronium. What if $A$ turns us into computronium? Well, there's plenty of matter in the universe; $A$ could just let us be, right? Wrong. Let's see how to prove it.

My Background

I'm a second-year CS PhD student. I've recently started working through the MIRI research guide; I'm nearly finished with Naïve Set Theory, which I intend to review soon. To expose my understanding to criticism, I'm going to summarize this paper and its technical sections in a somewhat informal fashion. I'm aware that the paper isn't particularly difficult for those with a mathematical background. However, I think this result is important, and I couldn't find much discussion of it.


It is important to distinguish between the relative unpredictability of the exact actions selected by a superintelligent agent, and the relative predictability of the general kinds-of-things such an agent will pursue. As Eliezer wrote,

Suppose Kasparov plays against some mere chess grandmaster Mr. G, who's not in the running for world champion. My own ability is far too low to distinguish between these levels of chess skill. When I try to guess Kasparov's move, or Mr. G's next move, all I can do is try to guess "the best chess move" using my own meager knowledge of chess. Then I would produce exactly the same prediction for Kasparov's move or Mr. G's move in any particular chess position. So what is the empirical content of my belief that "Kasparov is a better chess player than Mr. G"?
The outcome of Kasparov's game is predictable because I know, and understand, Kasparov's goals. Within the confines of the chess board, I know Kasparov's motivations - I know his success criterion, his utility function, his target as an optimization process. I know where Kasparov is ultimately trying to steer the future and I anticipate he is powerful enough to get there, although I don't anticipate much about how Kasparov is going to do it.

In other words: we may not be able to predict precisely how Kasparov will win a game, but we can be fairly sure his plan will involve instrumentally-convergent subgoals, such as the capture of his opponent's pieces. For an unfriendly AI, these subgoals probably include nasty things like hiding its true motives as it accumulates resources.


Consider a discrete universe made up of $n$ squares, with square $i$ denoted by $S_i$ (we may also refer to this as region $i$). Let's say that Earth and other things we do (or should) care about are in some region $h$. If $U$ takes the same value for every possible value of $S_h$, we say that $U$ is indifferent to region $h$. The key question is whether $A$ (whose $U$ is indifferent to $h$) will leave $h$ alone. First, however, we need to define a few more things.
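To make indifference concrete, here is a minimal brute-force sketch in Python. It assumes a utility function defined over a single snapshot of the universe (a tuple with one state per square), which is a simplification of the paper's setup; the state names, the two example utilities, and `is_indifferent` are all hypothetical and not taken from the paper.

```python
from itertools import product

def is_indifferent(utility, state_spaces, region):
    """True if `utility` never changes when only `region`'s state changes."""
    others = [i for i in range(len(state_spaces)) if i != region]
    # Fix the states of all other squares, then vary only the chosen region.
    for rest in product(*(state_spaces[i] for i in others)):
        values = set()
        for s in state_spaces[region]:
            state = list(rest)
            state.insert(region, s)  # put the region's state back in its slot
            values.add(utility(tuple(state)))
        if len(values) > 1:
            return False
    return True

# Hypothetical toy universe: three squares, each "raw" or "computronium".
STATE_SPACES = [("raw", "computronium")] * 3

def count_outside_square_2(state):
    # Only looks at squares 0 and 1, so it should be indifferent to square 2.
    return sum(s == "computronium" for s in state[:2])

def count_everywhere(state):
    # Cares about every square, so it is indifferent to none of them.
    return sum(s == "computronium" for s in state)
```

Calling `is_indifferent(count_outside_square_2, STATE_SPACES, 2)` checks every assignment to the other squares, so it only scales to tiny universes; it is meant as an illustration of the definition, not a practical test.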


At any given time step $t$ (time is also considered discrete in this universe), $A$ has a set of actions $A_i$ it can perform in $S_i$; $A$ may select an action for each square. Then, the transition function for $S_i$ (how the world evolves in response to some action $a_i$) is a function $T_i$ whose domain is the Cartesian product of the possible actions and the possible state values, $A_i \times S_i$; basically, every combination of what $A$ can do and what can be in $S_i$ can produce a distinct new value for $S_i$. This transition function can be defined globally with lots more Cartesian products. Oh, and $A$ is always allowed to do nothing in a given square, so $A_i$ is never empty.
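As a rough illustration of local versus global transitions, the sketch below uses one hypothetical transition rule shared by every square (the paper allows a different one per square). The action names `"noop"` and `"convert"` and the two state names are assumptions made up for this example.

```python
NOOP = "noop"  # the always-available "do nothing" action

def transition(action, state):
    """Hypothetical local transition: (action, state) -> next state."""
    if action == "convert" and state == "raw":
        return "computronium"
    return state  # doing nothing (or a no-effect action) leaves the square alone

def global_transition(actions, states):
    """Global transition: apply the local rule square by square."""
    if len(actions) != len(states):
        raise ValueError("need exactly one action per square")
    return tuple(transition(a, s) for a, s in zip(actions, states))
```

The global function takes one tuple of actions and one tuple of states, which mirrors the "lots more Cartesian products" remark: its domain is the product of all the per-square action sets and state sets.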


Let $R$ represent all of the resources to which $A$ may or may not have access. Define $R^t$ to be the resources at $A$'s disposal at time step $t$ - basically, we know it's some combination of the defined resources $R$.

$A$ may choose some resource allocation over the squares; this allocation is defined just like you'd expect (so if you don't know what the Azerothian portal has to do with resource allocation, don't worry about it). The resources committed to a square may affect the local actions available. $A$ can only choose actions permitted by the selected allocation. Equally intuitively, how resources change over time is a function of the current resources, the actions selected, and the state of the universe; the resources available after a time step is the combination of what we didn't use and what each square's resource transition function gave us back.
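One time step of this resource bookkeeping can be sketched as follows. This assumes a single fungible resource counted in integer units, which is much simpler than the paper's general resource sets; the cost and yield numbers and all function names are hypothetical.

```python
CONVERT_COST = 2   # assumed units needed to convert a raw square
CONVERT_YIELD = 5  # assumed units a converted square hands back

def allowed_actions(committed):
    """Actions unlocked by the resources committed to one square."""
    return {"noop", "convert"} if committed >= CONVERT_COST else {"noop"}

def square_return(committed, action, state):
    """Hypothetical per-square resource transition: what the square gives back."""
    if action == "convert" and state == "raw":
        return committed - CONVERT_COST + CONVERT_YIELD
    return committed  # nothing was spent, so everything comes back

def step_resources(available, allocation, actions, states):
    """Next-step resources = what we didn't allocate + what each square returns."""
    if any(c < 0 for c in allocation) or sum(allocation) > available:
        raise ValueError("allocation exceeds available resources")
    for c, a in zip(allocation, actions):
        if a not in allowed_actions(c):
            raise ValueError("action not permitted by this allocation")
    unused = available - sum(allocation)
    returned = sum(square_return(c, a, s)
                   for c, a, s in zip(allocation, actions, states))
    return unused + returned
```

For example, starting with 3 units, committing 2 to the first square and converting it leaves 1 unused unit plus 5 returned, for 6 units at the next step.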

Resources go beyond raw materials to include things like machines and technologies. The authors note:

We can also represent space travel as a convergent instrumental goal by allowing $A$ only actions that have no effects in certain regions, until it obtains and spends some particular resources representing the prerequisites for traveling to those regions. (Space travel is a convergent instrumental goal because gaining influence over more regions of the universe lets $A$ optimize those new regions according to its values or otherwise make use of the resources in that region.)


A universe-history is a sequence of states, actions, and resources. The actions must conform to resources available at each step, while the states and resources evolve according to the transition functions.

We say a strategy is an action sequence over all time steps and regions; define a partial strategy to be a strategy for some part of the universe (represented as a subset of the square indices); a partial strategy only allocates resources and does things in the squares of that subset. We can combine partial strategies as long as they don't overlap in any square.

We call a strategy feasible if it complies with both resource restrictions and the transition functions for both states and resources. Given a resource allocation over time, we can then talk about the set of all feasible strategies; the analogous set for partial strategies is defined similarly.
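A feasibility check can be sketched by simulating a strategy step by step. The sketch below makes the same simplifying assumptions as a toy model would: one integer-valued resource, a single hypothetical `"convert"` action with made-up cost and yield, and states and resources computed directly from the transition rules (so only the resource restrictions can fail). None of these names come from the paper.

```python
COST, YIELD = 2, 5  # assumed cost and payoff of converting a raw square

def is_feasible(strategy, initial_resources, initial_states):
    """strategy: list of (allocation, actions) pairs, one pair per time step."""
    resources, states = initial_resources, tuple(initial_states)
    for allocation, actions in strategy:
        # Resource restriction 1: cannot allocate more than is available.
        if any(c < 0 for c in allocation) or sum(allocation) > resources:
            return False
        # Resource restriction 2: "convert" needs COST units in that square.
        if any(a == "convert" and c < COST for c, a in zip(allocation, actions)):
            return False
        unused = resources - sum(allocation)
        returned, next_states = 0, []
        for c, a, s in zip(allocation, actions, states):
            if a == "convert" and s == "raw":
                returned += c - COST + YIELD
                next_states.append("computronium")
            else:
                returned += c  # nothing spent; the square is unchanged
                next_states.append(s)
        resources, states = unused + returned, tuple(next_states)
    return True
```

With 2 starting units and two raw squares, converting the first square and then the second is feasible, since the first conversion pays for the second; trying to convert both at once is not.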