May 15, 2012
Note: informally, the point of this paper is to argue against the instinctive "if the AI were so smart, it would figure out the right morality and everything will be fine." It is targeted mainly at philosophers, not at AI programmers. The paper succeeds if it forces proponents of that position to put forwards positive arguments, rather than just assuming it as the default position. This post is presented as an academic paper, and will hopefully be published, so any comments and advice are welcome, including stylistic ones! Also let me know if I've forgotten you in the acknowledgements.
Abstract: In his paper “The Superintelligent Will”, Nick Bostrom formalised the Orthogonality thesis: the idea that the final goals and intelligence levels of agents are independent of each other. This paper presents arguments for a (slightly narrower) version of the thesis, proceeding through three steps. First it shows that superintelligent agents with essentially arbitrary goals can exist. Then it argues that if humans are capable of building human-level artificial intelligences, we can build them with any goal. Finally it shows that the same result holds for any superintelligent agent we could directly or indirectly build. This result is relevant for arguments about the potential motivations of future agents.
The Orthogonality thesis, due to Nick Bostrom (Bostrom, 2011), states that:
It is analogous to Hume’s thesis about the independence of reason and morality (Hume, 1739), but applied more narrowly, using the normatively thinner concepts ‘intelligence’ and ‘final goals’ rather than ‘reason’ and ‘morality’.
But even ‘intelligence’, as generally used, has too many connotations. A better term would be efficiency, or instrumental rationality, or the ability to effectively solve problems given limited knowledge and resources (Wang, 2011). Nevertheless, we will be sticking with terminology such as ‘intelligent agent’, ‘artificial intelligence’ or ‘superintelligence’, as they are well established, but using them synonymously with ‘efficient agent’, artificial efficiency’ and ‘superefficient algorithm’. The relevant criteria is whether the agent can effectively achieve its goals in general situations, not whether its inner process matches up with a particular definition of what intelligence is.
Thus an artificial intelligence (AI) is an artificial algorithm, deterministic or probabilistic, implemented on some device, that demonstrates an ability to achieve goals in varied and general situations. We don’t assume that it need be a computer program, or a well laid-out algorithm with clear loops and structures – artificial neural networks or evolved genetic algorithms certainly qualify.
A human level AI is defined to be an AI that can successfully accomplish any task at least as well as an average human would (to avoid worrying about robot bodies and such-like, we may restrict the list of tasks to those accomplishable over the internet). Thus we would expect the AI to hold conversations about Paris Hilton’s sex life, to compose ironic limericks, to shop for the best deal on Halloween costumes and to debate the proper role of religion in politics, at least as well as an average human would.
A superhuman AI is similarly defined as an AI that would exceed the ability of the best human in all (or almost all) tasks. It would do the best research, write the most successful novels, run companies and motivate employees better than anyone else. In areas where there may not be clear scales (what’s the world’s best artwork?) we would expect a majority of the human population to agree the AI’s work is among the very best.
Nick Bostrom’s paper argued that the Orthogonality thesis does not depend on the Humean theory of motivation. This paper will directly present arguments in its favour. We will assume throughout that human level AIs (or at least human comparable AIs) are possible (if not, the thesis is void of useful content). We will also take the materialistic position that humans themselves can be viewed as non-deterministic algorithms: this is not vital to the paper, but is useful for comparison of goals between various types of agents. We will do the same with entities such as committees of humans, institutions or corporations, if these can be considered to be acting in an agent-like way.
The Orthogonality thesis, taken literally, is false. Some motivations are mathematically incompatible with changes in intelligence (“I want to prove the Gödel statement for the being I would be if I were more intelligent”). Some goals specifically refer to the intelligence of the agent, directly (“I want to be an idiot!”) or indirectly (“I want to impress people who want me to be an idiot!”). Though we could make a case that an agent wanting to be an idiot could initially be of any intelligence level, it won’t stay there long, and it’s hard to see how an agent with that goal could have become intelligent in the first place. So we will exclude from consideration those goals that intrinsically refer to the intelligence level of the agent.
We will also exclude goals that are so complex or hard to describe that the complexity of the goal becomes crippling for the agent. If the agent’s goal takes five planets worth of material to describe, or if it takes the agent five years each time it checks its goal, it’s obvious that that agent can’t function as an intelligent being on any reasonable scale.
Further we will not try to show that intelligence and final goals can vary freely, in any dynamical sense (it could be quite hard to define this varying). Instead we will look at the thesis as talking about possible states: that there exist agents of all levels of intelligence with any given goals. Since it’s always possible to make an agent stupider or less efficient, what we are really claiming is that there exist high-intelligence agents with any given goal. Thus the restricted Orthogonality thesis that we will be discussing is:
If we were to step back for a moment and consider, in our mind’s eyes, the space of every possible algorithm, peering into their goal systems and teasing out some measure of their relative intelligences, would we expect the Orthogonality thesis to hold? Since we are not worrying about practicality or constructability, all that we would require is that for any given goal system, there exists a theoretically implementable algorithm of extremely high intelligence.
At this level of abstraction, we can consider any goal to be equivalent with maximising a utility function. It is generally not that hard to translate given goals into utilities (many deontological systems are equivalent with maximising the expected utility of a function that gives 1 if the agent always makes the correct decision and 0 otherwise), and any agent making a finite number of decisions can always be seen as maximising a certain utility function.
For utility function maximisers, the AIXI is the theoretically best agent there is, more successful at reaching its goals (up to a finite constant) than any other agent (Hutter, 2005). AIXI itself is incomputable, but there are computable variants such as AIXItl or Gödel machines (Schmidhuber, 2007) that accomplish comparable levels of efficiency. These methods work for whatever utility function is plugged into them. Thus in the extreme theoretical case, the Orthogonality thesis seems trivially true.
There is only one problem with these agents: they require incredibly large amounts of computing resources to work. Let us step down from the theoretical pinnacle and require that these agents could actually exist in our world (still not requiring that we be able or likely to build it).
An interesting thought experiment occurs here. We could imagine an AIXI-like super-agent, with all its resources, that is tasked to design and train an AI that could exist in our world, and that would accomplish the super-agent’s goals. Using its own vast intelligence, the super-agent would therefore design a constrained agent maximally effective at accomplishing those goals in our world. Then this agent would be the high-intelligence real-world agent we are looking for. It doesn’t matter that this is a thought experiment – if the super-agent can succeed in the thought experiment, then the trained AI can exist in our world.
This argument generalises to other ways of producing the AI. Thus to deny the Orthogonality thesis is to assert that there is a goal system G, such that, among other things:
All of these seem extraordinarily strong claims to make! The last claims all derive from the first, and merely serve to illustrate how strong the first claim actually is. Thus until such time as someone comes up with such a G and strong arguments for why it must fulfil these conditions, we can assume the Orthogonality statement established in the theoretical case.
Of course, even if efficient agents could exist for all these goals, that doesn’t mean that we could ever build them, even if we could build AIs. In this section, we’ll look at the ground for assuming the Orthogonality thesis holds for human-level agents. Since intelligence isn’t varying much, the thesis becomes simply:
So, is this true? The arguments in this section are generally independent of each other, and can be summarised as:
The utility function picture of a rational agent maps perfectly onto the Orthogonality thesis: here have the goal structure, the utility function, packaged neatly and separately from the intelligence module (whatever part of the machine calculates which actions maximise expected utility). Demonstrating the Orthogonality thesis is as simple as saying that the utility function can be replaced with another. However, many putative agent designs are not utility function based, such as neural networks, genetic algorithms, or humans. Nor do we have the extreme calculating ability that we had in the purely theoretic case to transform any goals into utility functions. So from now on we will consider that our agents are not expected utility maximisers with clear and separate utility functions.
It seems a reasonable assumption that if there exists a human being with particular goals, then we can construct a human-level AI with similar goals. This is immediately the case if the AI was a whole brain emulation/upload (Sandberg & Bostrom, 2008), a digital copy of a specific human mind. Even for more general agents, such as evolved agents, this remains a reasonable thesis. For a start, we know that real-world evolution has produced us, so constructing human-like agents that way is certainly possible. Human minds remain our only real model of general intelligence, and this strongly direct and informs our AI designs, which are likely to be as human-similar as we can make them. Similarly, human goals are the easiest goals for us to understand, hence the easiest to try and implement in AI. Hence it seems likely that we could implement most human goals in the first generation of human-level AIs.
So how wide is the space of human motivations? Our race spans foot-fetishists, religious saints, serial killers, instinctive accountants, role-players, self-cannibals, firefighters and conceptual artists. The autistic, those with exceptional social skills, the obsessive compulsive and some with split-brains. Beings of great empathy and the many who used to enjoy torture and executions as public spectacles. It is evident that the space of possible human motivations is vast. For any desire, any particular goal, no matter how niche, pathological, bizarre or extreme, as long as there is a single human who ever had it, we could build and run an AI with the same goal.
But with AIs we can go even further. We could take any of these goals as a starting point, make them malleable (as goals are in humans), and push them further out. We could provide the AIs with specific reinforcements to push their goals in extreme directions (reward the saint for ever-more saintly behaviour). If the agents are fast enough, we could run whole societies of them with huge varieties of evolutionary or social pressures, to further explore the goal-space.
We may also be able to do surgery directly on their goals, to introduce more yet variety. For example, we could take a dedicated utilitarian charity worker obsessed with saving lives in poorer countries (but who doesn’t interact, or want to interact, directly with those saved), and replace ‘saving lives’ with ‘maximising paperclips’ or any similar abstract goal. This is more speculative, of course – but there are other ways of getting similar results.
If someone were to hold a gun to your head, they could make you do almost anything. Certainly there are people who, with a gun at their head, would be willing to do almost anything. A distinction is generally made between interim goals and terminal goals, with the former being seen as simply paths to the latter, and interchangeable with other plausible paths. The gun to your head disrupts the balance: your terminal goal is simply not to get shot, while your interim goals become what the gun holder wants them to be, and you put a great amount of effort into accomplishing the minute details of these interim goals. Note that the gun has not changed your level of intelligence or ability.
This is relevant because interim goals seem to be far more varied in humans than terminal goals. One can have interim goals of filling papers, solving equations, walking dogs, making money, pushing buttons in various sequences, opening doors, enhancing shareholder value, assembling cars, bombing villages or putting sharks into tanks. Or simply doing whatever the guy with gun at our head orders us to do. If we could accept human interim goals as AI terminal goals, we would extend the space of goals quite dramatically.
To do we would want to put the threatened agent, and the gun wielder, together into the same AI. Algorithmically there is nothing extraordinary about this: certain subroutines have certain behaviours depending on the outputs of other subroutines. The ‘gun wielder’ need not be particularly intelligent: it simply needs to be able to establish whether its goals are being met. If for instance those goals are given by a utility function then all that is required in an automated system that measure progress toward increasing utility and punishes (or erases) the rest of the AI if not. The ‘rest of AI’ is just required to be a human-level AI which would be susceptible to this kind of pressure. Note that we do not require that it even be close to human in any way, simply that it place a highest value on self-preservation (or on some similar small goal that the ‘gun wielder’ would have power over).
For humans, another similar model is that of a job in a corporation or bureaucracy: in order to achieve the money required for their terminal goals, some human are willing to perform extreme tasks (organising the logistics of genocides, weapon design, writing long detailed press releases they don’t agree with at all). Again, if the corporation-employee relationship can be captured in a single algorithm, this would generate an intelligent AI whose goal is anything measurable by the ‘corporation’. The ‘money’ could simply be an internal reward channel, perfectly aligning the incentives.
If the subagent is anything like a human, they would quickly integrate the other goals into their own motivation, removing the need for the gun wielder/corporation part of the algorithm.
There are further ways of extending the space of goals we could implement in human-level AIs. One simple way is simply to introduce noise: flip a few bits and subroutines, add bugs and get a new agent. Of course, this is likely to cause the agent’s intelligence to decrease somewhat, but we have generated new goals. Then, if appropriate, we could use evolution or other improvements to raise the agent’s intelligence again; this will likely undo some, but not all of effect of the noise. Or we could use some of the tricks above to make a smarter agent implement the goals of the noise-modified agent.
A more extreme example would be to create an anti-agent: an agent whose single goal is to stymie the plans and goals of single given agent. This already happens with vengeful humans, and we would just need to dial it up: have an anti-agent that would do all it can to counter the goals of a given agent, even if that agent doesn’t exist (“I don’t care that you’re dead, I’m still going to despoil your country, because that’s what you’d wanted me to not do”). This further extends the space of possible goals.
Different agents with different goals can also be combined into a single algorithm. With some algorithmic method for the AIs to negotiate their combined objective and balance the relative importance of their goals, this procedure would construct a single AI with a combined goal system. There would likely be no drop in intelligence/efficiency: committees of two can work very well towards their common goals, especially if there is some automatic penalty for disagreements.
This section started by emphasising the wide space of human goals, and then introduced tricks to push goal systems further beyond these boundaries. The list isn’t exhaustive: there are surely more devices and ideas one can use to continue to extend the space of possible goals for human-level AIs. Though this might not be enough to get every goal, we can nearly certainly use these procedures to construct a human-level AI with any human-comprehensible goal. But would the same be true for superhuman AIs?
We now come to the area where the Orthogonality thesis seems the most vulnerable. It is one thing to have human-level AIs, or abstract superintelligent algorithms created ex nihilo, with certain goals. But if ever the human race were to design a superintelligent AI, there would be some sort of process involved – directed evolution, recursive self-improvement (Yudkowsky, 2001), design by a committee of AIs, or similar – and it seems at least possible that such a process could fail to fully explore the goal-space. If we define the Orthogonality thesis in this context as:
There are two counter-theses. The weakest claim is:
A stronger claim is:
They should be distinguished; Incompleteness is all that is needed to contradict Orthogonality, but Convergence is often the issue being discussed. Often convergence is assumed to be to some particular model of metaethics (Müller, 2012).
The plausibility of the convergence thesis is highly connected with the connotations of the terms used in it. “All human-designed rational beings would follow the same morality (or one of small sets of moralities)” sound plausible; in contract “All human-designed superefficient algorithms would accomplish the same task” seems ridiculous. To quote an online commentator, how good at playing chess would a chess computer have to be before it started feeding the hungry?
Similarly, if there were such a convergence, then all self-improving or constructed superintelligence must fall prey to it, even if it were actively seeking to avoid it. After all, the lower-level AIs or the AI designers have certain goals in mind (as we’ve seen in the previous section, potentially any goals in mind). Obviously, they would be less likely to achieve their goals if these goals were to change (Omohundro, 2008) (Bostrom, 2012). The same goes if the superintelligent AI they designed didn’t share these goals. Hence the AI designers will be actively trying to prevent such a convergence, if they suspected that one was likely to happen. If for instance their goals were immoral, they would program their AI not to care about morality; they would use every trick up their sleeves to prevent the AI’s goals from drifting from their own.
So the convergence thesis requires that for the vast majority of goals G:
This makes the convergence thesis very unlikely. The argument also works against the incompleteness thesis, but in a weaker fashion: it seems more plausible that some goals would be unreachable, despite being theoretically possible, rather than most goals being unreachable because of convergence to a small set.
There is another interesting aspect of the convergence thesis: it postulates that certain goals G will emerge, without them being aimed for or desired. If one accepts that goals aimed for will not be reached, one has to ask why convergence is assumed: why not divergence? Why not assume that though G is aimed for, random accidents or faulty implementation will lead to the AI ending up with one of a much wider array of possible goals, rather than a much narrower one.
If the Orthogonality thesis is wrong, then it implies that Oracles are impossible to build. An Oracle is a superintelligent AI that accurately answers questions about the world (Armstrong, Sandberg, & Bostrom, 2011). This includes hypothetical questions about the future, which means that we can produce a superintelligent AI with goal G by wiring a human-level AI with goal G to an Oracle: the human level AI will go through possible actions, have the Oracle check the outcomes, and choose the one that best accomplishes G.
What makes the “no Oracle” position even more counterintuitive is that any superintelligence must be able to look ahead, design actions, predict the consequences of its actions, and choose the best one available. But the convergence thesis implies that this general skill is one that we can make available only to AIs with certain specific goals. Though the agents with those narrow goals are capable of doing these predictions, they automatically lose this ability if their goals were to change.
Just as with human-level AIs, one could construct a superintelligent AI by wedding together a superintelligence with a large dedicated committee of human-level AIs dedicated to implementing a goal G, and checking the superintelligence’s actions. Thus to deny the Orthogonality thesis requires that one believes that the superintelligence is always capable of tricking this committee, no matter how detailed and thorough their oversight.
This argument extends the Orthogonality thesis to moderately superintelligent AIs, or to any situation where there was a diminishing return to intelligence. It only fails if we take AI to be fantastically superhuman: capable of tricking or seducing any collection of human-level beings.
These are other tricks that can be used to create an AI with any goals. For any superintelligent AI, there are certain inputs that will make it behave in certain ways. For instance, a human-loving moral AI could be compelled to follow most goals G for a day, if they were rewarded with something sufficiently positive afterwards. But its actions for that one day are the result of a series of inputs to a particular algorithm; if we turned off the AI after that day, we would have accomplished moves towards goal G without having to reward its “true” goals at all. And then we could continue the trick the next day with another copy.
For this to fail, it has to be the case that we can create an algorithm which will perform certain actions on certain inputs as long as it isn’t turned off afterwards, but that we cannot create an algorithm that does the same thing if it was to be turned off.
Another alternative is to create a superintelligent AI that has goals in a fictional world (such as a game or a reward channel) over which we have control. Then we could trade interventions in the fictional world against advice in the real world towards whichever goals we desire.
These two arguments may feel weaker than the ones before: they are tricks that may or may not work, depending on the details of the AI’s setup. But to deny the Orthogonality thesis requires not only denying that these tricks would ever work, but denying that any tricks or methods that we (or any human-level AIs) could think up, would ever work at controlling the AIs. We need to assume superintelligent AIs cannot be controlled.
Denying the Orthogonality thesis thus requires that:
All the previous sections concern hypotheticals, but of a different kind. Section 2 touches upon what kinds of algorithm could theoretically exist. But sections 3 and 4 concern algorithms that could be constructed by humans (or from AIs originally constructed by humans): they refer to the future. As AI research advances, and certain approaches or groups start to show or lose prominence, we’ll start getting a better idea of how such an AI will emerge.
Thus the orthogonality thesis will narrow as we achieve better understanding of how AIs would work in practice, of what tasks they will be put to and of what requirements their designers will desire. Most importantly of all, we will get more information on the critical question as to whether the designers will actually be able to implement their desired goals in an AI. On the eve of creating the first AIs (and then the first superintelligent AIs), the Orthogonality thesis will likely have pretty much collapsed: yes, we could in theory construct an AI with any goal, but at that point, the most likely outcome is an AI with particular goals – either the goals desired by their designers, or specific undesired goals and error modes.
However, until that time arises, because we do not know any of this information currently, we remain in the grip of a Bayesian version of the Orthogonality thesis:
It is not enough to know that an agent is intelligent (or superintelligent). If we want to know something about its final goals, about the actions it will be willing to undertake to achieve them, and hence its ultimate impact on the world, there are no shortcuts. We have to directly figure out what these goals are, and cannot rely on the agent being moral just because it is superintelligent/superefficient.
It gives me great pleasure to acknowledge the help and support of Anders Sandberg, Nick Bostrom, Toby Ord, Owain Evans, Daniel Dewey, Eliezer Yudkowsky, Vladimir Slepnev, Viliam Bur, Matt Freeman, Wei Dai, Will Newsome, Paul Crowley, Alexander Kruel and Rasmus Eide, as well as those members of the Less Wrong online community going by the names shminux, Larks and Dmytry.
Armstrong, S., Sandberg, A., & Bostrom, N. (2011). Thinking Inside the Box: Controlling and Using an Oracle AI. forthcoming in Minds and Machines .
Bostrom, N. (2012). Superintelligence: Groundwork to a Strategic Analysis of the Machine Intelligence Revolution. to be published.
Bostrom, N. (2011). The Superintelligent Will: Motivation and Instrumental Rationality in Advance Artificial Agents. forthcoming in Minds and Machines .
de Fabrique, N., Romano, S. J., Vecchi, G. M., & van Hasselt, V. B. (2007). Understanding Stockholm Syndrome. FBI Law Enforcement Bulletin (Law Enforcement Communication Unit) , 76 (7), 10-15.
Hume, D. (1739). A Treatise of Human Nature.
Hutter, M. (2005). Universal algorithmic intelligence: A mathematical top-down approach. In B. Goertzel, & C. Pennachin (Eds.), Artiﬁcial General Intelligence. Springer-Verlag.
Müller, J. (2012). Ethics, risks and opportunities of superintelligences. Retrieved May 2012, from http://www.jonatasmuller.com/superintelligences.pdf
Omohundro, S. M. (2008). The Basic AI Drives. In P. Wang, B. Goertzel, & S. Franklin (Eds.), Artificial General Intelligence: Proceedings of the First AGI Conference (Vol. 171).
Sandberg, A., & Bostrom, N. (2008). Whole brain emulation: A roadmap. Future of Humanity Institute Technical report , 2008-3.
Schmidhuber, J. (2007). Gödel machines: Fully self-referential optimal universal self-improvers. In Artificial General Intelligence. Springer.
Wang, P. (2011). The assumptions on knowledge and resources in models of rationality. International Journal of Machine Consciousness , 3 (1), 193-218.
Yudkowsky, E. (2001). General Intelligence and Seed AI 2.3. Retrieved from Singularity Institute for Artificial Intelligence: http://singinst.org/ourresearch/publications/GISAI/
 We need to assume it has goals, of course. Determining whether something qualifies as a goal-based agent is very tricky (researcher Owain Evans is trying to establish a rigorous definition), but this paper will adopt the somewhat informal definition that an agent has goals if it achieves similar outcomes from very different starting positions. If the agent ends up making ice cream in any circumstances, we can assume ice creams are in its goals.
 Every law of nature being algorithmic (with some probabilistic process of known odds), and no exceptions to these laws being known.
 One could argue that we should consider the space of general animal intelligences – octopuses, supercolonies of social insects, etc... But we won’t pursue this here; the methods described can already get behaviours like this.
 Even today, many people have had great fun torturing and abusing their characters in games like “the Sims” (http://meodia.com/article/281/sadistic-ways-people-torture-their-sims/). The same urges are present, albeit diverted to fictionalised settings. Indeed games offer a wide variety of different goals that could conceivably be imported into an AI if it were possible to erase the reality/fiction distinction in its motivation.
 As can be shown by a glance through a biography of famous people – and famous means they were generally allowed to rise to prominence in their own society, so the space of possible motivations was already cut down.
 Of course, if we built an AI with that goal and copied it millions of times, it would no longer be niche.
 Such as the hostages suffering from Stockholm syndrome (de Fabrique, Romano, Vecchi, & van Hasselt, 2007).