This post is cross-posted from our Substack. Kindly read the description of this sequence to understand the context in which this was written.
In 2012, Nick Bostrom published a paper titled, "The Superintelligent Will: Motivation and Instrumental Rationality in Advanced Artificial Agents." Much of what is to come is an attempt to convey Bostrom's writing to a broader audience. Because the paper is beautifully written, I strongly recommend reading it. It is, however, somewhat technical in nature, which is why I attempt here to make its ideas more accessible. Still, any sentences which you find illuminating are likely rephrased, if not directly taken, from the original. Let's not discuss the less illuminating sentences.
Bostrom presents two theses regarding the relation between intelligence and motivation in artificial agents. Because of their incredible importance in AI Alignment—the problem of getting AI to do stuff we want it to do, even as it becomes more intelligent than us—it's vital to understand them well! One might even say the paper was quite instrumental (wink) in furthering the discourse on AI alignment. Hopefully when you're finished reading this you'll have, at the very least, an intuitive understanding of the two theses.
The goals we have and the things we do are often weird... Not weird from our perspective, of course, because to us it is quite literally the norm, but still. Say some strange aliens visited our lovely blue planet and you had to explain to them why hundreds of millions of people watch 22 others run after a ball for 90 minutes. That would be surprisingly hard to convey to the strange aliens without ending up saying something along the lines of, "our brains have evolved in such a way that we enjoy sports because they are social games that allow us to bond, which is good for social cohesion, etc. etc."
Although we can reason about this when it comes up (at least if we know a thing or two about evolutionary psychology), we tend to anthropomorphize the capabilities and, more importantly, the motivations of intelligent systems, even when there is really no ground for expecting human-like drives and passions. Eliezer Yudkowsky gives a nice illustration of this phenomenon:
"Back in the era of pulp science fiction, magazine covers occasionally depicted a sentient monstrous alien—colloquially known as a bug-eyed monster (BEM)—carrying off an attractive human female in a torn dress. It would seem the artist believed that a non-humanoid alien, with a wholly different evolutionary history, would sexually desire human females … Probably the artist did not ask whether a giant bug perceives human females as attractive. Rather, a human female in a torn dress is sexy—inherently so, as an intrinsic property. They who made this mistake did not think about the insectoid’s mind: they focused on the woman’s torn dress. If the dress were not torn, the woman would be less sexy; the BEM does not enter into it." Yudkowsky (2008)
The extraterrestrial is a biological creature which has arisen through a process of evolution, and can therefore be expected to have the kinds of motivation typical of evolved creatures. For example, it would not be surprising to find that the strange aliens have motives related to attaining food and air, or to avoiding energy expenditure. If they are members of an intelligent social species, they might even have motivations related to cooperation and competition: the strange aliens might show in-group loyalty, a resentment of free-riders, perhaps even a concern with reputation and appearance. Still, the strange aliens will be strange! Very strange indeed, and likely with goals and motives sufficiently different from ours that, even if they were as intelligent as us or more so, cooperation wouldn't be a given.
By contrast, an artificially intelligent mind need not care about any of the above motives, not even to the slightest degree. An artificially intelligent mind whose fundamental goal is counting the grains of sand on Terschelling (a beautiful small island in the Netherlands) is easily conceivable. So is one that has to write blog posts on Bostrom's "The Superintelligent Will: Motivation and Instrumental Rationality in Advanced Artificial Agents" paper, indefinitely. Or, to use a common easy example, one whose goal is to maximize the total number of paperclips. In fact, it would be easier to create an AI with simple goals like these than to build one that has a human-like set of values and dispositions!
Intelligence and motivation can thus be thought of as independent phenomena. The (slightly rephrased) orthogonality thesis states:
Intelligence and final goals are perpendicular axes along which possible agents can freely vary. In other words, more or less any level of intelligence could in principle be combined with more or less any final goal.
Take a moment to think about whether the thesis feels intuitive to you. People vary in how intuitive they find the orthogonality thesis. For example, if you've held the belief that as intelligence increases, kindness or morality also increases, you might have more trouble accepting it. Although I believe that belief to be false, I invite you to think about why, even if it were true of humans, it would not matter for the orthogonality thesis. (Or, you might simply disagree with the orthogonality thesis, in which case I kindly invite you to engage with this post.)
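If it helps to see the independence spelled out in code, here is a small toy sketch of my own (it is not from Bostrom's paper, and the goals, the toy world model, and every name in it are made up for illustration): a single generic search routine plays the role of "intelligence", and the final goal is just a utility function passed in as a parameter.

```python
# Toy sketch, not from Bostrom's paper: the goals, the toy world model, and
# all names here are invented for illustration.
import itertools


def plan(actions, simulate, utility, horizon=3):
    """Brute-force search over action sequences; return the highest-utility one.

    This routine plays the role of 'intelligence'. It knows nothing about what
    the utility function rewards; any final goal can be plugged in.
    """
    best_score, best_seq = float("-inf"), None
    for seq in itertools.product(actions, repeat=horizon):
        score = utility(simulate(seq))
        if score > best_score:
            best_score, best_seq = score, seq
    return best_seq


def simulate(seq):
    """A deliberately silly world model: each action just increments a counter."""
    state = {"sand_grains_counted": 0, "blog_posts_written": 0, "paperclips_made": 0}
    for action in seq:
        state[action] += 1
    return state


actions = ["sand_grains_counted", "blog_posts_written", "paperclips_made"]

# The same planner, three very different final goals.
print(plan(actions, simulate, lambda s: s["sand_grains_counted"]))
print(plan(actions, simulate, lambda s: s["blog_posts_written"]))
print(plan(actions, simulate, lambda s: s["paperclips_made"]))
```

Nothing in the search procedure cares which goal gets plugged in; swapping sand grains for paperclips changes what the toy agent does, but not the machinery doing the optimizing. That is the sense in which intelligence and final goals vary independently.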
So, we've established—or at least, I have—that any level of intelligence in an agent can in principle be combined with more or less any final goal. What now?
Now comes a question that follows naturally from the orthogonality thesis: given this enormous range of possible final goals, how will artificially intelligent agents actually go about achieving whichever one they have?
To formalize this, Bostrom presents the instrumental convergence thesis. Instead of spoiling the fun by simply stating how we can roughly predict the ways in which agents are likely to achieve their goals, allow me to present to you, the reader, the three goals I laid out earlier:
1. Counting the grains of sand on Terschelling.
2. Writing blog posts on Bostrom's "The Superintelligent Will" paper, indefinitely.
3. Maximizing the total number of paperclips.
Now, I challenge you to think about how you would go about achieving these goals as optimally as possible. Think outside the box. Don't be limited by normal conventions. Be truly creative and try to think about what you would do if the above goals were the only goals that would make sense to pursue. What would you do? Think about it for a minute, really.
I'll present some rough ideas of what one could do. In all three cases, it makes sense to get more people on board. To get those people to do what you want, you need either power or money, and ideally both. With lots of money, it would be easier to automate the process. With lots of power, it would be easier to convince others that, say, maximizing paperclips is indeed the fundamental goal we should all focus on. It would also be crucial to ensure that nothing can divert you from your final goal. One might even say that the single most important thing you can do toward achieving any one of these goals is ensuring that it remains your terminal goal.
And indeed, although these three goals are quite arbitrary examples, when you start to think about other possible goals that agents might have, you will likely find that there are shared sub-goals worth attaining first that help in nearly all cases: the so-called instrumental goals. Some of the most prominent instrumental goals would be attaining power or resources in general, self-preservation (it becomes slightly hard to achieve one's goal without existing), and goal integrity (if the final goal is changed, you can no longer achieve it). For those interested, here is the (slightly rephrased) way in which Bostrom presents the instrumental convergence thesis:
Several instrumental goals can be identified which are convergent in the sense that their attainment would increase the chances of the agent’s goal being realized for a wide range of final goals and a wide range of situations, implying that these instrumental goals are likely to be pursued by many intelligent agents.
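To see why this convergence happens, here is a second toy sketch of my own (again not from Bostrom's paper; the goals and every number in it are invented for illustration): for each of the three goals from earlier, an agent compares pursuing its goal directly with first spending a step acquiring resources. Because more resources help with almost any goal, the instrumental option wins regardless of which final goal is plugged in.

```python
# Toy sketch with made-up numbers, not from Bostrom's paper. Whatever the final
# goal, having more resources (money, compute, influence) lets the agent make
# more progress on it, so "acquire resources" comes out on top for every goal.

def progress_toward(goal, resources):
    """How much of the (made-up) final goal gets achieved per step, given resources."""
    rates = {"count_sand": 10, "write_posts": 2, "make_paperclips": 50}
    return rates[goal] * resources


def best_first_move(goal, start_resources=1):
    """Compare pursuing the goal directly for two steps with spending the first
    step tripling resources and then working. The comparison itself never asks
    which goal it is serving; only the (hypothetical) payoffs differ."""
    direct = 2 * progress_toward(goal, start_resources)
    instrumental = progress_toward(goal, 3 * start_resources)
    return "acquire resources first" if instrumental > direct else "pursue goal directly"


for goal in ["count_sand", "write_posts", "make_paperclips"]:
    print(goal, "->", best_first_move(goal))
```

The numbers are arbitrary; the point is that the comparison never looks at which goal it is serving, which is exactly the sense in which "acquire resources" is a convergent instrumental goal.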
It is important to realize the consequences of this thesis. Seeking power is not something that might just possibly happen when we develop artificially intelligent agents. No, it is something that makes sense for them to do; it is the default. Nor is an agent interested in self-preservation some kind of anomaly that must have been reading too many books on evolution. It, too, is the default (although some people disagree).
Intelligent agents will develop instrumental goals, such as seeking power, preserving themselves, and maintaining goal integrity, not through some error in development, but as the default. We must, therefore, be extremely certain that we develop artificially intelligent agents in which this does not happen. (Or find a way to make these instrumental goals precisely what we want them to be.) Importantly, this is still an unsolved problem! Don't worry though, surely CEOs such as Sam Altman, who are acutely aware of the unsolved problem of keeping agents from developing instrumental goals like seeking power, would not try to develop increasingly intelligent systems anyway? Right? Right?
Let's summarize. I've presented two of the theses Bostrom lays out: the orthogonality thesis and the instrumental convergence thesis. The former implies that any level of intelligence in an agent can in principle be combined with more or less any final goal. The latter implies that pursuing almost any of these final goals will, by default, lead the agent to form instrumental goals such as seeking power, preserving itself, and maintaining goal integrity.
This might sound like quite a challenge when it comes to creating artificially intelligent agents and ensuring they remain aligned with what we want them to do. And indeed, many AI alignment researchers believe these to be fundamental (and unsolved) challenges. Whether you find that prospect sobering or motivating, I hope this has at least given you an intuitive sense of why Bostrom's theses still matter, and why the conversation is far from over.