Suppose you landed on a distant planet, and found a structure of giant metal pipes, crossed by occasional cables. Further investigation shows that the cables are electrical superconductors carrying high-voltage currents.
You might not know what the structure did. But you would nonetheless guess that this structure had been built by some intelligence, rather than being a naturally-occurring mineral formation - that there were aliens who built the structure for some purpose.
Your reasoning might go something like this: "Well, I don't know if the aliens were trying to manufacture cars, or build computers, or what. But if you consider the problem of efficient manufacturing, it might involve mining resources in one place and then efficiently transporting them somewhere else, like by pipes. Since the most efficient size and location of these pipes would be stable, you'd want the shape of the pipes to be stable, which you could do by making the pipes out of a hard material like metal. There's all sorts of operations that require energy or negentropy, and a superconducting cable carrying electricity seems like an efficient way of transporting that energy. So I don't know what the aliens were ultimately trying to do, but across a very wide range of possible goals, an intelligent alien might want to build a superconducting cable to pursue that goal."
That is: We can take an enormous variety of compactly specifiable goals, like "travel to the other side of the universe" or "support biological life" or "make paperclips", and find very similar optimal strategies along the way. We don't actually know if electrical superconductors are actually the most useful way to transport energy in the limit of technology. But whatever is the most efficient way of transporting energy, whether that's electrical superconductors or something else, the most efficient form of that technology would probably not vary much depending on whether you were trying to make diamonds or make paperclips.
Or to put it another way: If you consider the goals "make diamonds" and "make paperclips", then they might have almost nothing in common with respect to their end-states - a diamond might contain no iron. But the earlier strategies used to make a lot of diamond and make a lot of paperclips might have much in common; "the best way of transporting energy to make diamond" and "the best way of transporting energy to make paperclips" are much more likely to be similar.
From a Bayesian standpoint this is how we can identify a huge machine strung with superconducting cables as having been produced by high-technology aliens, even before we have any idea of what the machine does. We're saying, "This looks like the product of optimization, a strategy that the aliens chose to best achieve some unknown goal ; we can infer this without knowing because many possible -goals would concentrate probability into this -strategy being used."
When you select strategy because you expect it to achieve a later state (the "goal"), we say that is your instrumental strategy for achieving The observation of "instrumental convergence" is that a widely different range of -goals can lead into highly similar -strategies. (This becomes truer as the -seeking agent becomes more instrumentally efficient; two very powerful chess engines are more likely to solve a solvable chess problem the same way, compared to two weak chess engines whose individual quirks might result in idiosyncratic solutions.)
If you'd probably want to do whether you were a superintelligent paperclip maximizer, or a superintelligent diamond maximizer, or a superintelligence that just wanted to keep a single button pressed for as long as possible, then may be a "convergent instrumental strategy".
Vingean uncertainty is the observation that, as we become increasingly confident of increasingly powerful intelligence from an agent with precisely known goals, we become decreasingly confident of the exact moves it will make (unless the domain has an optimal strategy and we know the exact strategy). E.g., to know exactly where Deep Blue would move on a chessboard, you would have to be as good at chess as Deep Blue. Instrumental convergence can be seen as a caveat to Vingean uncertainty: Even if we don't know the exact actions or the exact end goal, we may be able to predict that some intervening states or strategies will fall into certain abstract categories.
This is: If we don't know whether a superintelligent agent is a paperclip maximizer or a diamond maximizer, we can still guess with some confidence that it will pursue a strategy in the general class "get more resources of matter, energy, and computation" rather than "don't get more resources". This is true even though Vinge's Principle says that we won't be able to predict exactly how the superintelligence will go about gathering matter and energy (since we don't know optimal play in the domain "gather matter and energy").
Another way of viewing it: Imagine that at the start of a rich and complicated real-life "game", the player must make a single binary choice between the abstract moves "Gather more resources later" and "Never gather any more resources later". We want to predict which move is the optimal play on that first turn. Vingean uncertainty or not, we are probably justified in putting a large amount of probability mass on the first move.
• An instrumental convergence claim is about a default or a majority of cases, not a universal generalization.
If for whatever reason your goal is to "make paperclips without using any superconductors", then superconducting cables will not be the best instrumental strategy for achieving that goal.
Any claim about instrumental convergence says at most, "The vast majority of possible goals would convergently imply a strategy in , by default and unless otherwise averted by some special case for which strategies in (not-X) are better."
See also the more general idea that the space of possible minds is very large; universal claims about all possible minds have many chances to be false, while existential claims "There exists at least one possible mind such that..." have many chances to be true.
If some particular tree is extremely important and valuable to you, then you won't cut it down to obtain wood. It is irrelevant whether a majority of other goals that you could have, but don't actually have, would suggest cutting down that tree.
• Convergent strategies are not deontological rules.
Imagine looking at a machine chess-player and reasoning, "Well, I don't think the AI will sacrifice its pawn in this position, even to achieve a checkmate. Any chess-playing AI needs a drive to be protective of its pawns, or else it'd just give up all its pawns. It wouldn't have gotten this far in the game in the first place, if it wasn't more protective of its pawns than that."
The reasoning for an instrumental convergence claim says, "Supposing a consequentialist with a goal that is reasoning about what policy will effectively obtain by default it will find a strategy within a highly similar class of strategies that leads to " If happens not to lead to then a -consequentialist won't do even if most other -cases are prudent.
Modern chess algorithms behave in a fashion that most humans can't distinguish from expected-checkmate-maximizers. That is, from your merely human perspective, watching a single move at the time it happens, there's no visible difference between your subjective expectation for the chess algorithm's behavior, and your expectation for the behavior of an oracle that always output the move with the highest conditional probability of leading to checkmate. If you, a human, you could discern with your unaided eye some systematic difference like "this algorithm protects its pawn more often than checkmate-achievement would imply", you would know how to make systematically better chess moves; modern machine chess is too superhuman for that.
Often, this uniform rule of output-the-move-with-highest-probability-of-eventual-checkmate will seem to protect pawns, or not throw away pawns, or defend pawns when you attack them. But if in some special case the highest probability of checkmate is instead achieved by sacrificing a pawn, the chess algorithm will do that instead.
• Convergent strategies don't need to be implemented as independent motivational drives.
Consider the following line of reasoning: "It's impossible to get on an airplane without buying plane tickets. So anyone on an airplane must be a sort of person who enjoys buying plane tickets. If I offer them a plane ticket they'll probably buy it, because this is almost certainly somebody who has an independent motivational drive to buy plane tickets. There's just no way you can design an organism that ends up on an airplane unless it has a buying-tickets drive."
The appearance of an "instrumental strategy" can be seen as a kind of implicit phenomenon in repeatedly choosing actions that lead into a final state There doesn't have to be a special -strategy-module which repeatedly does -actions regardless of whether or not they lead to There can just be a process that repeatedly checks for the action leading to the most .
The flaw in the argument about plane tickets is that human beings are consequentialists who buy plane tickets just because they wanted to go somewhere and they expected the action "buy the plane ticket" to have the consequence, in that particular case, of going to the particular place and time they wanted to go. No extra "buy the plane ticket" module is required, and especially not a plane-ticket-buyer that doesn't check whether there's any travel goal and whether buying the plane ticket leads into the desired later state.
• " would help accomplish " is insufficient to establish a claim of instrumental convergence on .
Suppose you want to get to San Francisco. You could get to San Francisco by paying me $20,000 for a plane ticket. You could also get to San Francisco by paying someone else $400 for a plane ticket, and this is probably the smarter option for achieving your other goals.
Establishing "Compared to doing nothing, is more useful for achieving " doesn't establish as an instrumental strategy. We need to believe that there's no variant of (not-) which would be even more useful than any for achieving
When is phrased in very general terms like "acquire resources", we might reasonably guess that "don't acquire resources" or "do without acquiring any resources" really is unlikely to be a superior strategy. If is some more specific strategy, it's more likely that some other strategy will be even more effective at achieving most -goals.
See also: Missing the weird alternative.
That said, if we can see how achieves most -goals to some degree, then we should expect the actual strategy deployed by an efficient -agent to obtain at least as much as would.
A claim of instrumental convergence on can also be more strongly refuted by presenting a strategy which would get more paperclips. We are not then confident of convergence on but we are confident of non-convergence on
• Claims about instrumental convergence are not ethical claims.
Whether is a good way to get both paperclips and diamonds is irrelevant to whether is good for human flourishing or eudaimonia or fun-theoretic optimality or extrapolated volition or whatever. Whether is, in an intuitive sense, "good", needs to be evaluated separately from whether it is instrumentally convergent.
In particular: instrumental strategies are not terminal values - they have a type distinction from terminal values. So a widely useful instrumental strategy is not like a universally convincing utility function (which you probably shouldn't look for in the first place). "If you're going to spend resources on thinking about technology, try to do it earlier rather than later, so that you can amortize your invention over more uses" seems very likely to be an instrumentally convergent exploration-exploitation strategy. It's not plausibly the Meaning of Life.
• Claims about instrumental convergence are not futurological predictions.
Even if, e.g., "acquire resources" is an instrumentally convergent strategy, this doesn't mean that we can't as a special case deliberately construct advanced AGIs that are not driven to acquire as many resources as possible. Rather the claim implies, "We would need to deliberately build -averting agents as a special case, because by default most imaginable agent designs would pursue strategy "
Of itself, this observation makes no further claim about the quantitative probability that, in the real world, AGI builders might want to build -agents, might try to build -agents, and might succeed at building -agents.
A claim about instrumental convergence is talking about a logical property of the larger design space of possible agents, not making a prediction what happens in any particular research lab. Though, obviously, the ground facts of computer science may very well be relevant to what happens in actual research labs.
Distinguishing the advanced agent properties that seem probably required for an AI program to start exhibiting the sort of reasoning filed under "instrumental convergence", the most obvious candidates are:
That is: You don't automatically see "acquire more computing power" as a useful strategy unless you understand "I am a cognitive program and I tend to achieve more of my goals when I run on more resources." Alternatively, e.g., the programmers adding more computing power and the system's goals starting to be achieved better, after which related policies are positively reinforced and repeated, could arrive at a very similar end via the pseudoconsequentialist idiom of policy reinforcement.
The advanced agent properties that would naturally or automatically lead to instrumental convergence seem well above the range of modern AI programs. As of 2016, current machine learning algorithms don't seem to be within the range where this predicted phenomenon should start to be visible.
One of the convergent strategies originally proposed by Steve Omohundro in "The Basic AI Drives" was resource acquisition:
"All computation and physical action requires the physical resources of space, time, matter, and free energy. Almost any goal can be better accomplished by having more of these resources."
We'll consider this example as a template for other proposed instrumentally convergent strategies, and run through the standard questions and caveats.
• Question: Is this something we'd expect a paperclip maximizer, diamond maximizer, and button-presser to do? And a flourishing-intergalactic-civilization optimizer too, while we're at it?
To put it another way, for a utility function to imply the use of every joule of energy, it is a sufficient condition that for every plan with expected utility there is a plan with that uses one more joule of energy:
• Question: Is there some strategy in which produces higher -achievement for most than any strategy inside ?
Suppose that by using most of the mass-energy in most of the stars reachable before they go over the cosmological horizon as seen from present-day Earth, it would be possible to produce paperclips (or diamonds, or probability-years of expected button-stays-pressed time, or QALYs, etcetera).
It seems reasonably unlikely that there is a strategy inside the space intuitively described by "Do not acquire more resources" that would produce paperclips, let alone that the strategy producing the most paperclips would be inside this space.
We might be able to come up with a weird special case that would imply this. But that's not the same as asserting, "With high subjective probability, the optimal strategy will be in ." We're concerned with making a statement about defaults given the most subjectively probable background states of the universe, not trying to make a universal statement that covers every conceivable possibility.
To put it another way, if your policy choices or predictions are only safe given the premise that "The best way of producing the maximum possible number of paperclips involves not acquiring any more resources", you need to clearly flag this as a load-bearing assumption and expect a lot of people not to believe it without a very strong argument that you can get more than paperclips that way.
• Caveat: The claim is not that every possible goal can be better-accomplished by acquiring more resources.
As a special case, this would not be true of an agent with an impact penalty term in its utility function, or some other low-impact agent (if that agent also only had goals of a form that could be satisfied inside bounded regions of space and time with a bounded effort).
We might reasonably expect this special kind of agent to only acquire the minimum resources to accomplish its Task, minimizing its further impact.
(But we wouldn't expect this to be true in a majority of possible cases inside mind design space; it's not true by default; we need to specify a further fact about the agent to make the claim not be true. If we imagine some computationally simple language for specifying utility functions, then most utility functions wouldn't happen to have both of these properties, so a majority of utility functions given this language and measure would not by default try to use fewer resources.)
• Caveat: The claim is not that most agents will behave as if under a deontological imperative to acquire resources.
A paperclip maximizer wouldn't necessarily tear apart a working paperclip factory to "acquire more resources" (at least not until that factory had already produced all the paperclips it was going to help produce.)
• Caveat: The claim is not that well-functioning agents must have additional, independent resource-acquiring motivational drives.
A paperclip maximizer will act like it is "obtaining resources" if it repeatedly outputs the action it expects to lead to the most paperclips. Clippy does not need to have any separate and independent term in its utility function for the amount of resource it possesses (and indeed this would potentially interfere with Clippy making paperclips, since it might then be tempted to hold onto resources instead of making paperclips with them).
• Check: Are we arguing "Acquiring resources is a better way to make a few more paperclips than doing nothing" or "There's no better/best way to make paperclips that involves not acquiring more matter and energy"?
As mentioned above, the latter seems pretty reasonable in this case.
• Caveat: "Acquiring resources is instrumentally convergent" is not an ethical claim.
The fact that a paperclip maximizer would try to acquire all matter and energy within reach, does not of itself bear on whether our own normative values might perhaps command that we ought to use few resources as a terminal value.
(Though some of us might find pretty compelling the observation that if you leave matter lying around, it sits around not doing anything and eventually the protons decay or the expanding universe tears it apart, whereas if you turn the matter into people, it can have fun. There's no rule that instrumentally convergent strategies don't happen to be the right thing to do.)
• Caveat: "Acquiring resources is instrumentally convergent" is not of itself a futurological prediction.
See above. Maybe we try to build Task AGIs instead. Maybe we succeed, and they don't consume lots of resources because they have well-bounded tasks and impact penalties.
The list of arguably convergent strategies has its own page; however, some of the key strategies that have been argued as convergent in e.g. Omohundro's "The Basic AI Drives" and Bostrom's "The Superintelligent Will: Motivation and Instrumental Rationality in Advanced Artificial Agents" include:
(The reader is again invited to run through all these items , checking whether it's likely (as a probable default, not in every imaginable case) that the strategy for getting the most possible paperclips is more likely to be in than .)
This is relevant to some of the central background ideas in AGI alignment, because:
This means that programmers don't have to be evil, or even deliberately bent on creating superintelligence, in order for their work to have catastrophic consequences.
The above list of instrumental strategies includes everything an agent needs to survive and grow, which supports strong forms of the Orthogonality Thesis being true in practice as well as in principle. We don't need to filter on agents with explicit terminal values for e.g. "survival" in order to find surviving powerful agents.
Instrumental convergence is also why we expect to encounter most of the problems filed under Corrigibility. When the AI is young, it's less likely to be instrumentally efficient or understand the relevant parts of the bigger picture; but once it does, we would by default expect, e.g.:
This paints a much more effortful picture of AGI alignment work than "Oh, well, we'll just test it to see if it looks nice, and if not, we'll just shut off the electricity."
The point that some undesirable strategies are convergent gives rise to the Nearest unblocked strategy problem. Suppose the AGI's most preferred policy starts out as one of these incorrigible behaviors. Suppose we currently have enough control to add patches to the AGI's utility function intended to rule out the incorrigible behavior. Then, after integrating the intended patch, the new most preferred policy may be the most similar policy that wasn't explicitly blocked. If you naively give the AI a term in its utility function for "having an off-switch", it may still build subagents or successors that don't have off-switches.
Similarly, when the AGI becomes more powerful and its option space expands, it's again likely to find new similar policies that weren't explicitly blocked.
Thus, instrumental convergence is one of the two basic sources of patch resistance as a foreseeable difficulty of AGI alignment work.