Instrumental convergence is the idea that every sufficiently intelligent agent would exhibit behaviors such as self-preservation or resource acquisition. This is a natural result for maximizers of simple utility functions. However, I claim that it rests on a false premise: that agents must have goals.

What are goals?

For agents constructed as utility maximizers, the goal coincides with utility maximization. There is no doubt that for most utility functions, a utility maximizer would exhibit instrumental convergence. However, I claim that most minds in the mindspace are not utility maximizers in the usual sense.

It may be true that for every agent there is a utility function that it maximizes, in the spirit of VNM utility. However these utility functions do not coincide with goals in the sense that instrumental convergence requires. These functions are merely encodings of the agent's decision algorithm and are no less complex than the agent itself. No simple conclusions can be made from their existence.
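
To make the "mere encoding" point concrete, here is a minimal Python sketch. It is only an illustration: the names `induced_utility`, `maximize`, `policy`, and `ACTIONS` are hypothetical, and the indicator construction is just one simple way to obtain a function that a given agent maximizes, not the VNM construction itself.

```python
def induced_utility(policy):
    """Build a utility function that the given policy trivially maximizes.

    The utility is 1.0 for whatever the policy would do and 0.0 otherwise,
    so it carries exactly as much information as the policy itself.
    """
    def utility(state, action):
        return 1.0 if action == policy(state) else 0.0
    return utility


def maximize(utility, state, actions):
    """Pick the action with the highest utility in the given state."""
    return max(actions, key=lambda a: utility(state, a))


# A hypothetical agent with no obvious "goal": it idles unless asked to work.
def policy(state):
    return "make_paperclip" if state == "order_received" else "idle"


ACTIONS = ["make_paperclip", "idle", "acquire_resources"]
u = induced_utility(policy)
```

Maximizing `u` just reproduces `policy`: the "utility maximizer" built this way idles when there is no order and never picks `acquire_resources`. Nothing about its behavior can be read off from the fact that a utility function exists.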

Humans exhibit goals in the usual sense and arguably have VNM utility functions. Human goals often involve maximizing some quantity, e.g. money. However explicit human goals represent only a small fraction of their total utility computation. Presumably, the goals explain the extent to which some humans exhibit instrumental convergence, and the rest of the utility function explains why we haven't yet tiled the universe with money.

What about non-human non-utility-maximizer agents? Certainly, some of them can still exhibit instrumental convergence, but I will argue that this is rare.

What is the average mind in the mindspace?

The "mindspace" refers to some hypothetical set of functions or algorithms, possibly selected to meet some arbitrary definition of intelligence. I claim that most minds in the mindspace do not share any properties that we have not selected for. In fact, most minds in the mindspace do nothing even remotely useful. Even if we explicitly selected a random mind from a useful subset of the mindspace, the mind would most likely do nothing more than the bare minimum we required. For example, if we search for minds that are able to run a paperclip factory, we will find minds that run paperclip factories well enough to pass our test, but not any better. Intelligence is defined by the ability to solve problems, not by the degree of agency.

Without a doubt, somewhere in the mindspace there is a mind that will run the paperclip factory, acquire resources, and eventually tile the universe with paperclips. However, it is not the only mind out there. There is also a mind that runs the paperclip factory and then, when it has nothing better to do, shuts down, sits in an empty loop, dreams of butter, or generates bad Harry Potter fanfic.

With this in mind, random searches in the mindspace are relatively safe, even if the minds we find aren't well aligned. Though it would be lovely to be certain that a new mind is not the "tile the universe with paperclips" kind.

Caveats and half-baked ideas

  • This post comes from trying to understand why I don't find the threat of AI as inevitable as some suggest. In other words, it's a rationalization.
  • I have a limited understanding of what MIRI does and what assumptions it has. I'm under the impression that they are primarily working on utility maximizers, and that instrumental convergence is important to them. But it's likely that points similar to mine have been made and either accepted or countered.
  • In the post I make empirical claims about the composition of the mindspace. I have obviously not verified them, if they are even verifiable. The claims seem trivial to me, but may well not be that strong.
  • While simple utility maximizing minds are not common, it's possible that they are the smallest minds that can pass our tests, or that they have other special properties that would make semi-random searches find them more often than we'd like.

31 comments

I think there's an important thing to note, if it doesn't already feel obvious: the concept of instrumental convergence applies to roughly anything that exhibits consequentialist behaviour, i.e. anything that does something like backchaining in its thinking.

Here's my attempt at a poor intuitionistic proof:

If you have some kind of program that understands consequences, backchains, or does something similar, then perhaps it's capable of recognizing that "acquire lots of power" will let it choose from a much larger set of possibilities. Regardless of the details of how "decisions" are made, it seems easy for the choice to be one of the massive array of outcomes possible once you have control of the light-cone, made possible by acquiring power. And thus I'm worried about "instrumental convergence".


At this point, I'm already much more worried about instrumental convergence, because backchaining feels damn useful. It's the sort of thing I'd expect most competent mind-like programs to be using in some form somewhere. It certainly seems more plausible to me that a random mind does backchaining, than a random mind looks like "utility function over here" and "maximizer over there".

(For instance, even setting aside how AI researchers are literally building backchaining/planning into RL agents, one might expect most powerful reinforcement learners to benefit a lot from being able to reason in a consequentialist way about actions. If you can't literally solve your domain with a lookup table, then causality and counterfactuals let you learn more from data, and better optimize your reward signal.)


Finally, I should point at some relevant thinking around how consequentialists probably dominate the universal prior. (Meaning: if you do an AIXI-like random search over programs, you get back mostly-consequentialists). See this post from Paul, and a small discussion on agentfoundations.

Agreed. I guess instrumental convergence mostly applies to AIs that we have to worry about, not all possible minds.

Understanding the consequences of actions is a reasonable requirement for a machine to be called intelligent. However, that implies nothing about the behavior of this machine. A paperclip maker may understand that destroying the earth could yield paperclips, it may not care much for humans, and it may still not do it. There is nothing inconsistent or unlikely about such a machine. You're thinking of machines that have a pipeline: pick a goal -> find the optimal plan to reach the goal -> implement the plan. However, this is not the only possible architecture (though it is appealing).

More generally, intelligence is the ability to solve hard problems. If a program solves a problem without explicitly making predictions about the future, that's fine. Though I don't want to make claims about how common such programs would be. And there is also a problem of recognizing whether a program not written by a human does explicitly consider cause and effect.

Hm, I think an important piece of "intuitionistic proof" didn't transfer, or is broken. Drawing attention to that part:

Regardless of the details of how "decisions" are made, it seems easy for the choice to be one of the massive array of outcomes possible once you have control of the light-cone, made possible by acquiring power.

So here, I realize, I am relying on something like "the AI implicitly moves toward an imagined realizable future". I think that's a lot easier to get than the pipeline you sketch.

I think I'm being pretty unclear - I'm having trouble conveying my thought structure here. I'll go make a meta-level comment instead.

As I understand, your argument is that there are many dangerous world-states and few safe world-states, therefore most powerful agents would move to a dangerous state, in the spirit of entropy. This seems reasonable.

An alarming version of this argument assumes that the agents already have power. However, I think that they don't, and that acquiring dangerous amounts of power is hard work and won't happen by accident.

A milder version of the same argument says that even relatively powerless, unaligned agents would slowly and unknowingly inch towards a more dangerous future world-state. This is probably true, however, if humans retain some control, this is probably harmless. And it is also debatable to what extent that sort of probabilistic argument can work on a complex machine.

Though I don't want to make claims about how common such programs would be.

If you don't want to make claims about how common such programs are, how do you defend the (implicit) assertion that such programs are worth talking about, especially in the context of the alignment problem?

I don't want to make claims about how many random programs make explicit predictions about the future to reach their goals. For all I know it could be 1% and it could be 99%. However, I do make claims about how common other kinds of programs are. I claim that a given random program, regardless of whether it explicitly predicts the future, is unlikely to have the kind of motivational structure that would exhibit instrumental convergence.


Incidentally, I'm also interested in what specifically you mean by "random program". A natural interpretation is that you're talking about a program that is drawn from some kind of distribution across the set of all possible programs, but as far as I can tell, you haven't actually defined said distribution. Without a specific distribution to talk about, any claim about how likely a "random program" is to do anything is largely meaningless, since for any such claim, you can construct a distribution that makes that claim true.

(Note: The above paragraph was originally a parenthetical note on my other reply, but I decided to expand it into its own, separate comment, since in my experience having multiple unrelated discussions in a single comment chain often leads to unproductive conversation.)

Well, good question. Frankly I don't think it matters. I don't believe that my claims are sensitive to the distributions (aside from some convoluted ones), or that giving you a specific distribution would help you to defend either position (feel free to prove me wrong). But when I want to feel rigorous, I assume that I'm starting off with a natural length-based distribution over all Turing machines (or maybe all neural networks), then discard all machines that fail to pass some relatively simple criteria about the output they generate (e.g. does it classify a given set of cat pictures correctly), keep the ones that passed, normalize and draw from that.
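
As a rough sketch of that filter-and-renormalize procedure, here is a toy version in Python. Everything in it is an assumption made for illustration: bitstrings stand in for Turing machines, the weight 2^(-2·length) is one arbitrary choice of length-based prior, and `passes_test` is a hypothetical stand-in for a real criterion such as classifying the cat pictures correctly.

```python
import itertools
import random


def all_programs(max_len):
    """Enumerate every bitstring of length 1..max_len (toy stand-in for Turing machines)."""
    for n in range(1, max_len + 1):
        for bits in itertools.product("01", repeat=n):
            yield "".join(bits)


def length_prior(program):
    """Unnormalized length-based weight: shorter programs get more weight."""
    return 2.0 ** (-2 * len(program))


def passes_test(program):
    """Hypothetical acceptance criterion standing in for the real output test."""
    return program.startswith("11")


def filtered_distribution(max_len):
    """Discard programs that fail the test, then renormalize the remaining weights."""
    survivors = [p for p in all_programs(max_len) if passes_test(p)]
    total = sum(length_prior(p) for p in survivors)
    return {p: length_prior(p) / total for p in survivors}


def draw(dist, rng):
    """Draw one program from the filtered, normalized distribution."""
    programs = list(dist)
    return rng.choices(programs, weights=[dist[p] for p in programs], k=1)[0]


dist = filtered_distribution(max_len=6)
sample = draw(dist, random.Random(0))
```

In this toy setup the shortest passing program, `"11"`, ends up with the most weight, so a draw mostly returns programs that do the bare minimum needed to pass. Whether anything like that carries over to real program spaces is exactly the open question.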

But really, by "random" I mean nearly anything that's not entirely intentional. To use a metaphor for machine learning, if you pick a random point on the world map, then find the nearest point that's 2km above sea level, you'll find a "random" point that's 2km above sea level. The algorithm has a non-random step, but the outcome is clearly random in a significant way. The distribution you get is different from the one I described above (where you just filtered the initial point distribution to get the points at 2km), but they'll most likely be close.
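
The map metaphor can be sketched on a tiny one-dimensional "map". This is a toy under loud assumptions: the elevation list `ELEVATION_KM` is made up, ties between equally near cells go to the left, and both distributions are computed exactly by enumeration rather than by sampling.

```python
from collections import Counter
from fractions import Fraction

# Hypothetical 1-D "world map": elevation in km at each grid cell.
ELEVATION_KM = [0, 2, 0, 0, 0, 2, 2, 0]
TARGET_KM = 2


def target_cells(elevation):
    """Indices of all cells that sit exactly at the target elevation."""
    return [i for i, h in enumerate(elevation) if h == TARGET_KM]


def nearest_target(start, elevation):
    """Non-random step: move to the closest target cell (ties go to the left)."""
    return min(target_cells(elevation), key=lambda i: (abs(i - start), i))


def search_distribution(elevation):
    """Outcome distribution for: uniform start cell, then walk to the nearest target."""
    counts = Counter(nearest_target(s, elevation) for s in range(len(elevation)))
    return {cell: Fraction(n, len(elevation)) for cell, n in counts.items()}


def filter_distribution(elevation):
    """Outcome distribution for: draw a uniform cell, keep it only if it is a target."""
    cells = target_cells(elevation)
    return {cell: Fraction(1, len(cells)) for cell in cells}


search = search_distribution(ELEVATION_KM)
filtered = filter_distribution(ELEVATION_KM)
```

Both procedures land only on cells at 2km, but with different weights: here the search gives cell 1 probability 1/2, while filtering gives each target cell 1/3. How close the two distributions are depends on the terrain, so this sketch shows the shared support, not the "most likely close" part.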

Maybe that answers your other comment too?

I claim that a given random program, regardless of whether it explicitly predicts the future, is unlikely to have the kind of motivational structure that would exhibit instrumental convergence.

Yes, I understand that. What I'm more interested in knowing, however, is how this statement connects to AI alignment in your view, since any AI created in the real world will certainly not be "random".

The important thing to consider here is that humans want to use AI to achieve things, and that the existing architectures that we are using for AIs use utility functions, or reward functions, or generally things that seem to be affected by instrumental convergence. A randomly picked program won’t be subject to instrumental convergence, but a program picked by humans to achieve something has a significantly higher chance of being subject to it.

There is a difference between an AI that does X, and an AI that has a goal to do X. Not sure what architectures you're referring to, but I suspect you may be conflating the goals of the AI's constructor (or construction process) with the goals of the AI.

It's almost certainly true that a random program from the set of programs that do X is more dangerous than a random program from some larger set. However, I claim that the fraction of dangerous programs is still pretty small.

Now, there are some obviously dangerous sets of programs, and it's possible that humans will pick an AI from such a set. In other news, if you shoot yourself in the foot, you get hurt.

Not sure what architectures you're referring to

I think habryka is thinking about modern machine learning architectures that are studied by AI researchers. AI research is in fact a distinct subfield from programming-in-general, because AI programs are in fact a distinct subgroup from programs-in-general.

I'm very much aware of what architectures machine learning studies, and indeed it (usually) isn't generic programs in the sense of raw instruction lists (although any other Turing-complete set of programs can perfectly well be called "programs-in-general"; instruction lists are in no way unique).

The problem is that everyone's favorite architecture - the plain neural network - does not contain a utility function. It is built using a utility/cost function, but that's very different.

This doesn't make the AI any safer, given the whole (neural network + neural network builder¹) is still a misaligned AI. Real life examples happen all the time.

¹: I don't know if there is a standard term for this.

I never said anything about (mis)alignment. Of course using stupid training rewards will result in stupid programs. But the problem with those programs is not that they are instrumentally convergent (see the title of my post).

The training program, which has the utility function, does exhibit convergence, but the resulting agent, which has no utilities, does not usually exhibit it. E.g. if the training environment involves a scenario where the agent is turned off (which results in 0 utility), then the training program will certainly build an agent that resists being turned off. But if the training environment does not include that scenario, then the resulting agent is unlikely to resist it.

The problem is that "the training environment does not include that scenario" is far from guaranteed.

Yes, but what are you arguing against? At no point did I claim that it is impossible for the training program to build a convergent agent (indeed, if you search for an agent that exhibits instrumental convergence, then you might find one). Nor did I ever claim that all agents are safe - I only meant that they are safer than the hypothesis of instrumental convergence would imply.

Also, you worry about the agent learning convergent behaviors by accident, but that's a little silly when you know that the agent often fails to learn what you want it to learn. E.g. if you do explicitly include the scenario of the agent being turned off, and you want the agent to resist it, you know it's likely that the agent will overfit and resist only in that single scenario. But then, when you don't intentionally include any such scenario, and don't want the agent to resist, it seems likely to you that the agent will correctly learn to resist anyway? Yes, it's strictly possible that you will unintentionally train an agent that robustly resists being turned off. But the odds are in our favour.

Let's go back to the OP. I'm claiming that not all intelligent agents would exhibit instrumental convergence, and that, in fact, the majority wouldn't. What part of that exactly do you disagree with? Maybe we actually agree?

Let's go a little meta.

It seems clear that an agent that "maximizes utility" exhibits instrumental convergence. I think we can state a stronger claim: any agent that "plans to reach imagined futures", with some implicit "preferences over futures", exhibits instrumental convergence.

The question then is how much can you weaken the constraint "looks like a utility maximizer", before instrumental convergence breaks? Where is the point in between "formless program" and "selects preferred imagined futures" at which instrumental convergence starts/stops applying?


This moves in the direction of working out exactly which components of utility-maximizing behaviour are necessary. (Personally, I think you might only need to assume "backchaining".)

So, I'm curious: What do you think a minimal set of necessary pieces might be, before a program is close enough to "goal directed" for instrumental convergence to apply?

This might be a difficult question to answer, but it's probably a good way to understand why instrumental convergence feels so real to other people.

any agent that "plans to reach imagined futures", with some implicit "preferences over futures", exhibits instrumental convergence.

Actually, I don't think this is true. For example take a chess playing program which imagines winning, and searches for strategies to reach that goal. Instrumental convergence would assume that the program would resist being turned off, try to get more computational resources, or try to drug/hack the opponent to make them weaker. However, the planning process could easily be restricted to chess moves, where none of these would be found, and thus would not exhibit instrumental convergence. This sort of "safety measure" isn't very reliable, especially when we're dealing with the real world rather than a game. However it is possible for an agent to be a utility maximizer, or to have some utility maximizing subroutines, and still not exhibit instrumental convergence.
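
Here is a minimal sketch of that restriction, using a toy integer "game" instead of chess. All names (`TARGET`, `GAME_MOVES`, `OUT_OF_GAME`, `plan`) are hypothetical, and the out-of-game action is only a stand-in for things like hacking the opponent or grabbing more compute.

```python
from collections import deque

# Hypothetical toy "game": the state is an integer and the goal is to reach TARGET.
TARGET = 10

GAME_MOVES = {
    "add_one": lambda s: s + 1,
    "double": lambda s: s * 2,
}

# An action outside the game that reaches the goal instantly.
OUT_OF_GAME = {
    "hack_the_scoreboard": lambda s: TARGET,
}


def plan(start, actions, max_depth=10):
    """Breadth-first search for a shortest action sequence from start to TARGET."""
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        state, path = queue.popleft()
        if state == TARGET:
            return path
        if len(path) >= max_depth:
            continue
        for name, step in actions.items():
            nxt = step(state)
            # Skip already-visited states and prune states far past the target.
            if nxt not in seen and nxt <= 2 * TARGET:
                seen.add(nxt)
                queue.append((nxt, path + [name]))
    return None


restricted_plan = plan(1, GAME_MOVES)
unrestricted_plan = plan(1, {**GAME_MOVES, **OUT_OF_GAME})
```

The same goal-directed search behaves very differently depending on its action set: given only `GAME_MOVES` it returns a four-move plan made of game moves, while with `OUT_OF_GAME` included it goes straight for `hack_the_scoreboard`. The planner is a maximizer in both cases; only the space it searches differs.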

There is another, more philosophical question of what is and isn't a preference over futures. I believe that there can be a brilliant chess player that does not actually prefer winning to losing. But the relevant terms are vague enough that it's a pain to think about them.

I think proponents of the instrumental convergence thesis would expect a consequentialist chess program to exhibit instrumental convergence in the domain of chess. So if there were some (chess-related) subplan that was useful in lots of other (chess-related) plans, we would see the program execute that subplan a lot. The important difference would be that the chess program uses an ontology of chess while unsafe programs use an ontology of nature.

First, Nick Bostrom has an example where a Riemann hypothesis solving machine converts the earth into computronium. I imagine he'd predict the same for a chess program, regardless of what ontology it uses.

Second, if instrumental convergence was that easy to solve (the convergence in the domain of chess is harmless), it wouldn't really be an interesting problem.

I think you have a good point, in that the VNM utility theorem is often overused/abused: I don't think it's clear how to frame a potentially self modifying agent in reality as a preference ordering on lotteries, and even if you could in theory do so it might require such a granular set of outcomes as to make the resulting utility function not super interesting. (I'd very much appreciate arguments for taking VNM more seriously in this context; I've been pretty frustrated about this.)

That said, I think instrumental convergence is indeed a problem for real world searches; the things we're classifying as "instrumentally convergent goals" are just "things that are helpful for a large class of problems." It turns out there are ways to do better than random search in general, and that some of these ways (the most general ones) are making use of the things we're calling "instrumentally convergent goals": AlphaGo Zero was not a (uniformish) random search on Go programs, and humans were not a (uniformish) random search on creatures. So I don't think this particular line of thought should make you think potential AI is less of a problem.

AlphaGo Zero was not a (uniformish) random search on Go programs, and humans were not a (uniformish) random search on creatures.

I'd classify both of those as random programs though. AlphaZero is a random program from the set of programs that are good at playing go (and that satisfy some structure set by the creators). Humans are random machines from the set of machines that are good at not dying. The searches aren't uniform, of course, but they are not intentional enough for it to matter.

In particular, AlphaZero was not selected in such a way that exhibiting instrumental convergence would benefit it, and therefore it most likely does not exhibit instrumental convergence. Suppose there was a random modification to AlphaZero that would make it try to get more computational resources, and that this modification was actually made during training. The modified version would play against the original, but the modification would not actually help it win in the simulated environment, so the modified version would most likely lose and be discarded. If the modified version did end up winning, it would be purely by chance.

The case of humans is more complicated, since the "training" does reward self-preservation. Curiously, this self-preservation seems to be its own goal, and not a subgoal of some other desire, as instrumental convergence would predict. Also, human self-preservation only works in a narrow sense. You run from a tiger, but you don't always appreciate long-term and low-probability threats, presumably because you were not selected to appreciate them. I suspect that concern for these non-urgent threats does not correlate strongly with IQ, unlike what instrumental convergence would predict.

I was definitely very confused when writing the part you quoted. I think the underlying thought was that the processes of writing humans and of writing AlphaZero are very non-random; i.e., even if there's a random number generated in some sense somewhere as part of the process, there are other things going on that highly constrain the search space -- and those processes are making use of "instrumental convergence" (stored resources, intelligence, putting the hard drives in safe locations). Then I can understand your claim as "instrumental convergence may occur in guiding the search for/construction of an agent, but there's no reason to believe that agent will then do instrumentally convergent things." I think that's not true in general, but it would take more words to defend.

Humans have the desire to stay alive because evolution selects for beings who want to stay alive. If you have multiple AGIs out there, you will have similar selection pressure.

I think the crux of this post has been correctly argued against by other people in this comment section, but I want to address a small part of your post:

Presumably, the goals explain the extent to which some humans exhibit instrumental convergence, and the rest of the utility function explains why we haven't yet tiled the universe with money.

No, the reason we haven't yet tiled the universe with money is that we don't desire money for its own sake, only because it is an instrumental value that helps us achieve our goals. If we could tile the universe, we would tile the universe with happy humans. The problem is that we can't tile the universe.

If we could tile the universe, we would tile the universe with happy humans.

What, you would? I sure as hell wouldn't. The "money" example wasn't meant to be entirely serious, although I suspect that many people see money as an intrinsic reward - they collect a bunch and then don't really know what to do with it.

To clarify, I am firmly anti-wireheading.

I think this post is basically correct. You don't, however, give an argument that most minds would behave this way. However, here is a brief intuitive argument for it. A "utility function" does not mean something that is maximized in the ordinary sense of maximize; it just means "what the thing does in all situations." Look at computers: what do they do? In most situations, they sit there and compute things, and do not attempt to do anything in particular in the world. If you scale up their intelligence, that will not necessarily change their utility function much. In other words, it will lead to computers that mostly sit there and compute, without trying to do much in the world. That is to say, AIs will be weakly motivated. Most humans are weakly motivated, and most of the strength of their motivation does not come from intelligence, but from the desires that came from evolution. Since AIs will not have that evolution, they will be even more weakly motivated than humans, assuming a random design.

That may be a useful argument, but I'd be wary of using your intuitions about the kinds of programs that are running on your computer to make conclusions about the kinds of programs you'd find with a random search.

You're right that I'm missing some arguments. I struggled to come up with anything even remotely rigorous. I'm making a claim about a complex aspect of the behavior of a random program, after all. Rice's theorem seems somewhat relevant, though I wouldn't have an argument even if we were talking about primitive recursive functions either. But at the same time the claim seems trivial.