AI prediction case study 5: Omohundro's AI drives

by Stuart_Armstrong, 15th Mar 2013


Kaj Sotala, Seán ÓhÉigeartaigh and I recently submitted a paper entitled "The errors, insights and lessons of famous AI predictions and what they mean for the future" to the conference proceedings of the AGI12/AGI Impacts Winter Intelligence conference. Sharp deadlines prevented us from following the ideal procedure of first presenting it here and getting feedback; instead, we'll present it here after the fact.

The prediction classification schemas can be found in the first case study.

What drives an AI?

  • Classification: issues and metastatements, using philosophical arguments and expert judgement.

Steve Omohundro, in his paper on 'AI drives', presented arguments aiming to show that generic AI designs would develop 'drives' that would cause them to behave in specific and potentially dangerous ways, even if these drives were not programmed in initially (Omo08). One of his examples was a superintelligent chess computer that was programmed purely to perform well at chess, but that was nevertheless driven by that goal to self-improve, to replace its goal with a utility function, to defend this utility function, to protect itself, and ultimately to acquire more resources and power.

This is a metastatement: generic AI designs would have this unexpected and convergent behaviour. This relies on philosophical and mathematical arguments, and though the author has expertise in mathematics and machine learning, he has none directly in philosophy. It also makes implicit use of the outside view: utility maximising agents are grouped together into one category and similar types of behaviours are expected from all agents in this category.

In order to clarify and reveal assumptions, it helps to divide Omohundro's thesis into two claims. The weaker one is that a generic AI design could end up having these AI drives; the stronger one that it would very likely have them.

Omohundro's paper provides strong evidence for the weak claim. It demonstrates how an AI motivated only to achieve a particular goal could nevertheless improve itself, become a utility maximising agent, reach out for resources and so on. Every step of the way, the AI becomes better at achieving its goal, so all these changes are consistent with its initial programming. This behaviour is very generic: only specifically tailored or unusual goals would safely preclude such drives.

The claim that AIs generically would have these drives needs more assumptions. There are no counterfactual resiliency tests for philosophical arguments, but something similar can be attempted: one can use humans as potential counterexamples to the thesis. It has been argued that AIs could have any motivation a human has (Arm, Bos13). Thus according to the thesis, it would seem that humans should be subject to the same drives and behaviours. This does not fit the evidence, however. Humans are certainly not expected utility maximisers (probably the closest would be financial traders who try to approximate expected money maximisers, but only in their professional work), they don't often try to improve their rationality (in fact some specifically avoid doing so (many examples of this are religious, such as the Puritan John Cotton who wrote 'the more learned and witty you bee, the more fit to act for Satan will you bee' (Hof62)), and some sacrifice cognitive ability to other pleasures (BBJ+03)), and many turn their backs on high-powered careers. Some humans do desire self-improvement (in the sense of the paper), and Omohundro cites this as evidence for his thesis. Some humans don't desire it, though, and this should be taken as contrary evidence (or as evidence that Omohundro's model of what constitutes self-improvement is overly narrow). Thus one hidden assumption of the model is:

  • Generic superintelligent AIs would have different motivations to a significant subset of the human race, OR
  • Generic humans raised to superintelligence would develop AI drives.

This position is potentially plausible, but no real evidence is presented for it in the paper.

A key assumption of Omohundro is that AIs will seek to re-express their goals in terms of a utility function. This is based on the von Neumann-Morgenstern expected utility theorem (vNM44). The theorem demonstrates that any decision process that cannot be expressed as expected utility maximising will be exploitable by other agents or by the environment. Hence in certain circumstances, the agent will predictably lose assets, to no advantage to itself.
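The kind of exploitation meant here is often illustrated with a "money pump". The following is a minimal Python sketch, not taken from Omohundro's paper: the item names, the fee and the function `run_money_pump` are all hypothetical. It assumes an agent with cyclic preferences (A over B, B over C, C over A), which no utility function can represent, and which will pay a small fee for any trade to an item it prefers.

```python
# Hypothetical illustration of a "money pump" (all names and numbers are assumptions).
# The agent prefers A to B, B to C, and C to A: a cycle, so its preferences
# cannot be represented by any utility function.

FEE = 1  # assumed price the agent pays to swap to an item it prefers

# PREFERRED_SWAP[x] is the item the agent would rather hold than x.
PREFERRED_SWAP = {"A": "C", "B": "A", "C": "B"}


def run_money_pump(start_item, money, rounds):
    """Offer the agent its preferred swap up to `rounds` times.

    The agent accepts each offer as long as it can afford the fee.
    Returns the item it ends up holding and its remaining money.
    """
    item = start_item
    for _ in range(rounds):
        if money < FEE:
            break  # the agent can no longer pay, so the pump stops
        item = PREFERRED_SWAP[item]
        money -= FEE
    return item, money


if __name__ == "__main__":
    # Three trades take the agent around the full cycle A -> C -> B -> A.
    print(run_money_pump("A", 10, 3))
```

After one full cycle the agent holds the item it started with but has paid three fees, which is the "predictable loss of assets, to no advantage to itself" described above. The sketch also shows why the size of the loss depends on how often the cycle is actually offered.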

That theorem does not directly imply, however, that the AI will be driven to become an expected utility maximiser (to become ''rational''). First of all, as Omohundro himself points out, real agents can only be approximately rational: fully calculating the expected utility of every action is too computationally expensive in the real world. Bounded rationality (Sim55) is therefore the best that can be achieved, and the benefits of becoming rational can only be partially realised.

Secondly, there are disadvantages to becoming rational: these agents tend to be ''totalitarian'', ruthlessly squeezing out anything not explicitly in their utility function, sacrificing everything to the smallest increase in expected utility. An agent that didn't start off as utility-based could plausibly make the assessment that becoming so might be dangerous. It could stand to lose values irrevocably, in ways that it could not estimate at the time. This effect would become stronger as its future self continues to self-improve. Thus an agent could conclude that it is too dangerous to become ''rational'', especially if the agent's understanding of itself is limited.

Thirdly, the fact that an agent can be exploited in theory doesn't mean that it will be much exploited in practice. Humans are relatively adept at not being exploited, despite not being rational agents. Though human 'partial rationality' is vulnerable to tricks such as extended warranties and marketing gimmicks, it generally doesn't end up losing money, again and again and again, through repeated blatant exploitation. The pressure to become fully rational would be weak for an AI similarly capable of ensuring it was exploitable for only small amounts. An expected utility maximiser would find such small avoidable losses intolerable; but there is no reason for a not-yet-rational agent to agree.

Finally, social pressure should be considered. The case for an AI becoming more rational is at its strongest in a competitive environment, where the theoretical exploitability is likely to actually be exploited. Conversely, there may be situations of social equilibriums, with different agents all agreeing to forgo rationality individually, in the interest of group cohesion (there are many scenarios where this could be plausible).

Thus another hidden assumption of the strong version of the thesis is:

  • The advantages of becoming less-exploitable outweigh the possible disadvantages of becoming an expected utility maximiser (such as possible loss of value or social disagreements). The advantages are especially large when the potentially exploitable aspects of the agent are likely to be exploited, such as in a highly competitive environment.

Any sequence of decisions can be explained as maximising a (potentially very complicated or obscure) utility function. Thus in the abstract sense, saying that an agent is an expected utility maximiser is not informative. Yet there is a strong tendency to assume such agents will behave in certain ways (see for instance the previous comment on the totalitarian aspects of expected utility maximisation). This assumption is key to the rest of the thesis. It is plausible that most agents will be 'driven' towards gaining extra power and resources, but this is only a problem if they do so dangerously (at the cost of human lives, for instance). Assuming that a realistic utility function based agent would do so is plausible but unproven.
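The opening claim of the paragraph above can be made concrete. This is a minimal sketch with hypothetical names throughout (`rationalising_utility`, `greedy_maximiser`, and the example actions are assumptions for illustration). Given any observed sequence of actions, it builds a utility function over action histories that is maximised by exactly that sequence, so an agent maximising it reproduces the behaviour step by step.

```python
# Hypothetical sketch: any fixed sequence of decisions can be "explained" as
# maximising some utility function, here a deliberately contrived one.


def rationalising_utility(observed_actions):
    """Build a utility function over action histories.

    A history scores 1.0 if it agrees with the observed sequence so far,
    and 0.0 otherwise.
    """
    observed = tuple(observed_actions)

    def utility(history):
        history = tuple(history)
        return 1.0 if history == observed[:len(history)] else 0.0

    return utility


def greedy_maximiser(utility, actions, steps):
    """At each step, pick the action whose extended history has highest utility."""
    history = []
    for _ in range(steps):
        best = max(actions, key=lambda a: utility(history + [a]))
        history.append(best)
    return history


if __name__ == "__main__":
    observed = ["left", "left", "right", "stay"]
    u = rationalising_utility(observed)
    # The maximiser of this contrived utility reproduces the observed behaviour.
    print(greedy_maximiser(u, ["left", "right", "stay"], len(observed)))
```

The construction is trivial by design: the utility function simply encodes the behaviour it is meant to explain. That is why the label 'utility maximiser' carries no predictive content on its own, and why the further assumptions about the form of the utility function are doing the real work.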

In general, generic statements about utility function based agents are only true for agents with relatively simple goals. Since human morality is likely very complicated to encode in a computer, and since most putative AI goals are very simple, this is a relatively justified assumption but is an assumption nonetheless. So there are two more hidden assumptions:

  • Realistic AI agents with utility functions will be in a category such that one can make meaningful, generic claims for (almost) all of them. This could arise, for instance, if their utility function is expected to be simpler than human morality.
  • Realistic AI agents are likely not only to have the AI drives Omohundro mentioned, but to have them in a very strong way, being willing to sacrifice anything else to their goals. This could happen, for instance, if the AIs were utility function based with relatively simple utility functions.

This simple analysis suggests that a weak form of Omohundro's thesis is nearly certainly true: AI drives could emerge in generic AIs. The stronger thesis, claiming that the drives would be very likely to emerge, depends on some extra assumptions that need to be analysed.

But there is another way of interpreting Omohundro's work: it presents the generic behaviour of simplified artificial agents (similar to the way that supply and demand curves present the generic behaviour of simplified human agents). Thus even if the model is wrong, it can still be of great use for predicting AI behaviour: designers and philosophers could explain how and why particular AI designs would deviate from this simplified model, and thus analyse whether that AI is likely to be safer than that in the Omohundro model. Hence the model is likely to be of great use, even if it turns out to be an idealised simplification.

 

Dangerous AIs and the failure of counterexamples

Another thesis, quite similar to Omohundro's, is that generic AIs would behave dangerously, unless they were exceptionally well programmed. This point has been made repeatedly by Roman Yampolskiy, Eliezer Yudkowsky and Marvin Minsky, among others (Yam12, Yud08, Min84). That thesis divides in the same fashion as Omohundro's: a weaker claim that any AI could behave dangerously, and a stronger claim that it would likely do so. The same analysis applies as for the 'AI drives': the weak claim is solid, the stronger claim needs extra assumptions (but describes a useful 'simplified agent' model of AI behaviour).

There is another source of evidence for both these theses: the inability of critics to effectively dismiss them. There are many counter-proposals to the theses (some given in question and answer sessions at conferences) in which critics have presented ideas that would 'easily' dispose of the dangers; every time, the authors of the theses have been able to point out flaws in the counter-proposals. This demonstrated that the critics had not grappled with the fundamental issues at hand, or at least not sufficiently to weaken the theses.

This should obviously not be taken as a proof of the theses. But it does show that the arguments are currently difficult to counter. Informally this is a reverse expert-opinion test: if experts often find false counter-arguments, then any given counter-argument is likely to be false (especially if it seems obvious and easy). Thus any counter-argument should have been subject to a degree of public scrutiny and analysis, before it can be accepted as genuinely undermining the theses. Until that time, both predictions seem solid enough that any AI designer would do well to keep them in mind in the course of their programming.

References:

  • [Arm] Stuart Armstrong. General purpose intelligence: arguing the orthogonality thesis. In preparation.
  • [ASB12] Stuart Armstrong, Anders Sandberg, and Nick Bostrom. Thinking inside the box: Controlling and using an Oracle AI. Minds and Machines, 22:299-324, 2012.
  • [BBJ+03] S. Bleich, B. Bandelow, K. Javaheripour, A. Muller, D. Degner, J. Wilhelm, U. Havemann-Reinecke, W. Sperling, E. Ruther, and J. Kornhuber. Hyperhomocysteinemia as a new risk factor for brain shrinkage in patients with alcoholism. Neuroscience Letters, 335:179-182, 2003.
  • [Bos13] Nick Bostrom. The superintelligent will: Motivation and instrumental rationality in advanced artificial agents. forthcoming in Minds and Machines, 2013.
  • [Cre93] Daniel Crevier. AI: The Tumultuous Search for Artificial Intelligence. NY: BasicBooks, New York, 1993.
  • [Den91] Daniel Dennett. Consciousness Explained. Little, Brown and Co., 1991.
  • [Deu12] D. Deutsch. The very laws of physics imply that artificial intelligence must be possible. what's holding us up? Aeon, 2012.
  • [Dre65] Hubert Dreyfus. Alchemy and AI. RAND Corporation, 1965.
  • [eli66] Eliza-a computer program for the study of natural language communication between man and machine. Communications of the ACM, 9:36-45, 1966.
  • [Fis75] Baruch Fischhoff. Hindsight is not equal to foresight: The effect of outcome knowledge on judgment under uncertainty. Journal of Experimental Psychology: Human Perception and Performance, 1:288-299, 1975.
  • [Gui11] Erico Guizzo. IBM's Watson jeopardy computer shuts down humans in final game. IEEE Spectrum, 17, 2011.
  • [Hal11] J. Hall. Further reflections on the timescale of AI. In Solomonoff 85th Memorial Conference, 2011.
  • [Han94] R. Hanson. What if uploads come first: The crack of a future dawn. Extropy, 6(2), 1994.
  • [Har01] S. Harnad. What's wrong and right about Searle's Chinese room argument? In M. Bishop and J. Preston, editors, Essays on Searle's Chinese Room Argument. Oxford University Press, 2001.
  • [Hau85] John Haugeland. Artificial Intelligence: The Very Idea. MIT Press, Cambridge, Mass., 1985.
  • [Hof62] Richard Hofstadter. Anti-intellectualism in American Life. 1962.
  • [Kah11] D. Kahneman. Thinking, Fast and Slow. Farrar, Straus and Giroux, 2011.
  • [KL93] Daniel Kahneman and Dan Lovallo. Timid choices and bold forecasts: A cognitive perspective on risk taking. Management science, 39:17-31, 1993.
  • [Kur99] R. Kurzweil. The Age of Spiritual Machines: When Computers Exceed Human Intelligence. Viking Adult, 1999.
  • [McC79] J. McCarthy. Ascribing mental qualities to machines. In M. Ringle, editor, Philosophical Perspectives in Artificial Intelligence. Harvester Press, 1979.
  • [McC04] Pamela McCorduck. Machines Who Think. A. K. Peters, Ltd., Natick, MA, 2004.
  • [Min84] Marvin Minsky. Afterword to Vernor Vinge's novel, "True Names." Unpublished manuscript. 1984.
  • [Moo65] G. Moore. Cramming more components onto integrated circuits. Electronics, 38(8), 1965.
  • [Omo08] Stephen M. Omohundro. The basic AI drives. Frontiers in Artificial Intelligence and Applications, 171:483-492, 2008.
  • [Pop] Karl Popper. The Logic of Scientific Discovery. Mohr Siebeck.
  • [Rey86] G. Rey. What's really going on in Searle's "Chinese room". Philosophical Studies, 50:169-185, 1986.
  • [Riv12] William Halse Rivers. The disappearance of useful arts. Helsingfors, 1912.
  • [San08] A. Sandberg. Whole brain emulations: a roadmap. Future of Humanity Institute Technical Report, 2008-3, 2008.
  • [Sea80] J. Searle. Minds, brains and programs. Behavioral and Brain Sciences, 3(3):417-457, 1980.
  • [Sea90] John Searle. Is the brain's mind a computer program? Scientific American, 262:26-31, 1990.
  • [Sim55] H.A. Simon. A behavioral model of rational choice. The quarterly journal of economics, 69:99-118, 1955.
  • [Tur50] A. Turing. Computing machinery and intelligence. Mind, 59:433-460, 1950.
  • [vNM44] John von Neumann and Oskar Morgenstern. Theory of Games and Economic Behavior. Princeton, NJ, Princeton University Press, 1944.
  • [Wal05] Chip Walter. Kryder's law. Scientific American, 293:32-33, 2005.
  • [Win71] Terry Winograd. Procedures as a representation for data in a computer program for understanding natural language. MIT AI Technical Report, 235, 1971.
  • [Yam12] Roman V. Yampolskiy. Leakproofing the singularity: artificial intelligence confinement problem. Journal of Consciousness Studies, 19:194-214, 2012.
  • [Yud08] Eliezer Yudkowsky. Artificial intelligence as a positive and negative factor in global risk. In Nick Bostrom and Milan M. Ćirković, editors, Global catastrophic risks, pages 308-345, New York, 2008. Oxford University Press.

Comments:

This does not fit the evidence, however. Humans are certainly not expected utility maximisers (probably the closest would be financial traders who try to approximate expected money maximisers, but only in their professional work),

Huh? Omohundro's thesis is not 'humans are expected-dollar maximizers'. (And pointing this out is not adopting an improbable convoluted utility function.)

they don't often try to improve their rationality (in fact some specifically avoid doing so (many examples of this are religious, such as the Puritan John Cotton who wrote 'the more learned and witty you bee, the more fit to act for Satan will you bee'(Hof62)),

Falls under the 'weird' criteria, no? These people are espousing a defensive tactic (more education correlates with less religiosity) by a socially-communicated meme; this is a weird self-sustaining belief which has had to gradually evolve new tactics over many millennia and probably stems from peculiar properties of evolved human consciousness re agent detection.

and some sacrifice cognitive ability to other pleasures (BBJ+03)), and many turn their backs on high-powered careers.

What part of "expected utility maximizer" don't you understand?

Some humans do desire self-improvement (in the sense of the paper), and Omohundro cites this as evidence for his thesis. Some humans don't desire it, though, and this should be taken as contrary evidence (or as evidence that Omohundro's model of what constitutes self-improvement is overly narrow).

Or it reflects utility-maximizing behavior under the constraints that humans - but not pretty much any AI - face: eg.

  • the high opportunity costs of learning (I've read lifetime income is maximized at the master's degree level - because PhDs take too much time!)
  • the limited lifespan of humans
  • the even more limited productive lifespan of a human (consider the decay of intelligence with age by age 40 or 50, and the simultaneous sharp decline in scientific achievement observed in Jones's samples)
  • and the high discount rates of almost everyone (rarely less than 5%, often double-digits)

Nitpick:

and some sacrifice cognitive ability to other pleasures (BBJ+03)), and many turn their backs on high-powered careers.

What part of "expected utility maximizer" don't you understand?

It's a bit confusing to quote across a bracket boundary like that. The bit about sacrificing cognitive ability for other pleasures is an example of "they don't often try to improve their rationality", whereas turning backs on careers was about expected utility maximization.

I agree that turning your back on a high-powered career is not a good example of failing to maximize utility, but trading cognition for pleasure seems like a reasonable example of not valuing, or failing to act on the value of, being more rational.

trading cognition for pleasure seems like a reasonable example of not valuing, or failing to act on the value of, being more rational.

I think it's the same thing as before. AI drives is about a particular set of behaviors being an instrumental value for a large subset of all plausible agents; rationality is one of these instrumental (and not terminal) drives.

Providing an instance where an agent trades off an instrumental good (rationality) for a terminal good (pleasure) is simply not a counter-example - what else would an agent do when offered such a tradeoff? It would be like saying "supposedly, people earn money so as to spend it on things they want; but look! they're spending money on things like trips to Tahiti! Clearly that is not why they really earn money..."

Another thesis, quite similar to Omohundro's, is that generic AIs would behave dangerously, unless they were exceptionally well programmed. This point has been made repeatedly by Roman Yampolskiy, Eliezer Yudkowsky and Marvin Minsky, among others (Yam12, Yud08, Min84).

Minsky's "unpublished manuscript" seems to be here.