Part of the series AI Risk and Opportunity: A Strategic Analysis.

(You can leave anonymous feedback on posts in this series here. I alone will read the comments, and may use them to improve past and forthcoming posts in this series.)

This post chronicles the story of humanity's growing awareness of AI risk and opportunity, along with some recent AI safety efforts. I will not tackle any strategy questions directly in this post; my purpose today is merely to "bring everyone up to speed."

I know my post skips many important events and people. Please suggest additions in the comments, and include as much detail as possible.

 

Early history

Late in the Industrial Revolution, Samuel Butler (1863) worried about what might happen when machines become more capable than the humans who designed them:

...we are ourselves creating our own successors; we are daily adding to the beauty and delicacy of their physical organisation; we are daily giving them greater power and supplying by all sorts of ingenious contrivances that self-regulating, self-acting power which will be to them what intellect has been to the human race. In the course of ages we shall find ourselves the inferior race.

...the time will come when the machines will hold the real supremacy over the world and its inhabitants...

This basic idea was picked up by science fiction authors, for example in the 1921 Czech play that introduced the term “robot,” R.U.R. In that play, robots grow in power and intelligence and destroy the entire human race, except for a single survivor.

Another exploration of this idea is found in John W. Campbell's (1932) short story The Last Evolution, in which aliens attack Earth; the humans and the aliens are killed, but their machines survive and inherit the solar system. Campbell's (1935) short story The Machine contained perhaps the earliest description of recursive self-improvement:

 

On the planet Dwranl, of the star you know as Sirius, a great race lived, and they were not too unlike you humans. ...they attained their goal of the machine that could think. And because it could think, they made several and put them to work, largely on scientific problems, and one of the obvious problems was how to make a better machine which could think.

The machines had logic, and they could think constantly, and because of their construction never forgot anything they thought it well to remember. So the machine which had been set the task of making a better machine advanced slowly, and as it improved itself, it advanced more and more rapidly. The Machine which came to Earth is that machine.

 

The concern for AI safety is most popularly identified with Isaac Asimov's Three Laws of Robotics, introduced in his short story Runaround (1942). Asimov used his stories, including those collected in the popular book I, Robot, to illustrate many of the ways in which such well-meaning and seemingly comprehensive rules for governing robot behavior could go wrong.

In the year of I, Robot’s release, mathematician Alan Turing (1950) noted that machines may one day be capable of whatever human intelligence can achieve:

I believe that at the end of the century... one will be able to speak of machines thinking without expecting to be contradicted.

Turing (1951) concluded:

...it seems probable that once the machine thinking method has started, it would not take long to outstrip our feeble powers... At some stage therefore we should have to expect the machines to take control...

Given the profound implications of machine intelligence, it is rather alarming that the early AI scientists who believed AI would be built during the 1950s-1970s showed so little interest in AI safety. We are lucky they were wrong about the difficulty of AI; had they been right, humanity would probably not have been prepared to protect its interests.

Later, statistician I.J. Good (1959), who had worked with Turing to crack Nazi codes in World War II, reasoned that the transition from human control to machine control may be unexpectedly sudden:

Once a machine is designed that is good enough… it can be put to work designing an even better machine. At this point an "explosion" will clearly occur; all the problems of science and technology will be handed over to machines and it will no longer be necessary for people to work. Whether this will lead to a Utopia or to the extermination of the human race will depend on how the problem is handled by the machines. The important thing will be to give them the aim of serving human beings.

The more famous formulation of this idea, and the origin of the phrase "intelligence explosion," is from Good (1965):

Let an ultraintelligent machine be defined as a machine that can far surpass all the intellectual activities of any man however clever. Since the design of machines is one of these intellectual activities, an ultraintelligent machine could design even better machines; there would then unquestionably be an "intelligence explosion," and the intelligence of man would be left far behind. Thus the first ultraintelligent machine is the last invention that man need ever make.

Good (1970) says that "...by 1980 I hope that the implications and the safeguards [concerning machine superintelligence] will have been thoroughly discussed," and argues that an association devoted to discussing the matter be created. Unfortunately, no such association was created until either 1991 (Extropy Institute) or 2000 (Singularity Institute), and we might say these issues have not to this day been "thoroughly" discussed.

Good (1982) proposed a plan for the design of an ethical machine:

I envisage a machine that would be given a large number of examples of human behaviour that other people called ethical, and examples of discussions of ethics, and from these examples and discussions the machine would formulate one or more consistent general theories of ethics, detailed enough so that it could deduce the probable consequences in most realistic situations.

Even critics of AI like Jack Schwartz (1987) saw the implications of intelligence that can improve itself:

If artificial intelligences can be created at all, there is little reason to believe that initial successes could not lead swiftly to the construction of artificial superintelligences able to explore significant mathematical, scientific, or engineering alternatives at a rate far exceeding human ability, or to generate plans and take action on them with equally overwhelming speed. Since man's near-monopoly of all higher forms of intelligence has been one of the most basic facts of human existence throughout the past history of this planet, such developments would clearly create a new economics, a new sociology, and a new history.

Ray Solomonoff (1985), founder of algorithmic information theory, speculated on the implications of full-blown AI:

After we have reached [human-level AI], it shouldn't take much more than ten years to construct ten thousand duplicates of our original [human-level AI], and have a total computing capability close to that of the computer science community...

The last 100 years have seen the introduction of special and general relativity, automobiles, airplanes, quantum mechanics, large rockets and space travel, fission power, fusion bombs, lasers, and large digital computers. Any one of these might take a person years to appreciate and understand. Suppose that they had all been presented to mankind in a single year!

Moravec (1988) argued that AI was an existential risk, but nevertheless one toward which we must run (pp. 100-101):

...intelligent machines... threaten our existence... Machines merely as clever as human beings will have enormous advantages in competitive situations... So why rush headlong into an era of intelligent machines? The answer, I believe, is that we have very little choice, if our culture is to remain viable... The universe is one random event after another. Sooner or later an unstoppable virus deadly to humans will evolve, or a major asteroid will collide with the earth, or the sun will expand, or we will be invaded from the stars, or a black hole will swallow the galaxy. The bigger, more diverse, and competent a culture is, the better it can detect and deal with external dangers. The larger events happen less frequently. By growing rapidly enough, a culture has a finite chance of surviving forever.

Ray Kurzweil's The Age of Intelligent Machines (1990) did not mention AI risk, and his follow-up, The Age of Spiritual Machines (1999), did so only briefly, in an "interview" between the reader and Kurzweil. The reader asks, "So we risk the survival of the human race for [the opportunity AI affords us to expand our minds and advance our ability to create knowledge]?" Kurzweil answers: "Yeah, basically."

Minsky (1984) pointed out the difficulty of getting machines to do what we want:

...it is always dangerous to try to relieve ourselves of the responsibility of understanding exactly how our wishes will be realized. Whenever we leave the choice of means to any servants we may choose then the greater the range of possible methods we leave to those servants, the more we expose ourselves to accidents and incidents. When we delegate those responsibilities, then we may not realize, before it is too late to turn back, that our goals have been misinterpreted, perhaps even maliciously. We see this in such classic tales of fate as Faust, the Sorcerer's Apprentice, or the Monkey's Paw by W.W. Jacobs.

[Another] risk is exposure to the consequences of self-deception. It is always tempting to say to oneself... that "I know what I would like to happen, but I can't quite express it clearly enough." However, that concept itself reflects a too-simplistic self-image, which portrays one's own self as [having] well-defined wishes, intentions, and goals. This pre-Freudian image serves to excuse our frequent appearances of ambivalence; we convince ourselves that clarifying our intentions is merely a matter of straightening-out the input-output channels between our inner and outer selves. The trouble is, we simply aren't made that way. Our goals themselves are ambiguous.

The ultimate risk comes when [we] attempt to take that final step — of designing goal-achieving programs that are programmed to make themselves grow increasingly powerful, by self-evolving methods that augment and enhance their own capabilities. It will be tempting to do this, both to gain power and to decrease our own effort toward clarifying our own desires. If some genie offered you three wishes, would not your first one be, "Tell me, please, what is it that I want the most!" The problem is that, with such powerful machines, it would require but the slightest accident of careless design for them to place their goals ahead of [ours]. The machine's goals may be allegedly benevolent, as with the robots of With Folded Hands, by Jack Williamson, whose explicit purpose was allegedly benevolent: to protect us from harming ourselves, or as with the robot in Colossus, by D. F. Jones, who itself decides, at whatever cost, to save us from an unsuspected enemy. In the case of Arthur C. Clarke's HAL, the machine decides that the mission we have assigned to it is one we cannot properly appreciate. And in Vernor Vinge's computer-game fantasy, True Names, the dreaded Mailman... evolves new ambitions of its own.

 

The Modern Era

Computer science professor and novelist Vernor Vinge (1993) popularized Good's "intelligence explosion" concept, and wrote the first novel about self-improving AI posing an existential threat: A Fire Upon the Deep (1992). Vinge probably did more than anyone else to spur discussions about AI risk, particularly in online communities like the extropians mailing list (since 1991) and SL4 (since 2000). Participants in these early discussions included several of today's leading thinkers on AI risk: Robin Hanson, Eliezer Yudkowsky, Nick Bostrom, Anders Sandberg, and Ben Goertzel. (Other posters included Peter Thiel, FM-2030, Robert Bradbury, and Julian Assange.) Proposals like Friendly AI, Oracle AI, and Nanny AI were discussed in these venues long before they were brought to greater prominence by academic publications (Yudkowsky 2008; Armstrong et al. 2012; Goertzel 2012).

Meanwhile, philosophers and AI researchers considered whether machines could have moral value, and how to ensure ethical behavior from less powerful machines or 'narrow AIs', a field of inquiry variously known as 'artificial morality' (Danielson 1992; Floridi & Sanders 2004; Allen et al. 2000), 'machine ethics' (Hall 2000; McLaren 2005; Anderson & Anderson 2006), 'computational ethics' (Allen 2002), 'computational metaethics' (Lokhorst 2011), and 'robo-ethics' or 'robot ethics' (Capurro et al. 2006; Sawyer 2007). This vein of research — what I'll call the 'machine ethics' literature — was recently summarized in two books: Wallach & Allen (2009) and Anderson & Anderson (2011). Thus far, there has been a significant communication gap between the machine ethics literature and the AI risk literature (Allen & Wallach 2011), excepting perhaps Muehlhauser & Helm (2012).

The topic of AI safety in the context of existential risk was left to the futurists who had participated in online discussions of AI risk and opportunity. Here, I must cut short my review and focus on just three (of many) important figures: Eliezer Yudkowsky, Robin Hanson, and Nick Bostrom. (Your author also apologizes for the fact that, because he works with Yudkowsky, Yudkowsky gets a more detailed treatment here than Hanson or Bostrom.)

Other figures in the modern era of AI risk research include Bill Hibbard (Super-Intelligent Machines) and Ben Goertzel ("Should Humanity Build a Global AI Nanny to Delay the Singularity Until It's Better Understood").

 

Eliezer Yudkowsky

According to "Eliezer, the person," Eliezer Yudkowsky (born 1979) was a bright kid — in the 99.9998th percentile of cognitive ability, according to the Midwest Talent Search. He read lots of science fiction as a child, and at age 11 read Great Mambo Chicken and the Transhuman Condition — his introduction to the impending reality of transhumanist technologies like AI and nanotech. The moment he became a Singularitarian was the moment he read page 47 of True Names and Other Dangers by Vernor Vinge:

Here I had tried a straightforward extrapolation of technology, and found myself precipitated over an abyss. It's a problem we face every time we consider the creation of intelligences greater than our own. When this happens, human history will have reached a kind of singularity - a place where extrapolation breaks down and new models must be applied - and the world will pass beyond our understanding.

Yudkowsky reported his reaction:

My emotions at that moment are hard to describe; not fanaticism, or enthusiasm, just a vast feeling of "Yep. He's right." I knew, in the moment I read that sentence, that this was how I would be spending the rest of my life.

(As an aside, I'll note that this is eerily similar to my own experience of encountering the famous I.J. Good paragraph about ultraintelligence (quoted above), before I knew what "transhumanism" or "the Singularity" was. I read Good's paragraph and thought, "Wow. That's... probably correct. How could I have missed that implication? … … … Well, shit. That changes everything.")

As a teenager in the mid 1990s, Yudkowsky participated heavily in Singularitarian discussions on the extropians mailing list, and in 1996 (at age 17) he wrote "Staring into the Singularity," which gained him much attention, as did his popular "FAQ about the Meaning of Life" (1999).

In 1998 Yudkowsky was invited (along with 33 others) by economist Robin Hanson to comment on Vinge (1993). Thirteen people (including Yudkowsky) left comments, then Vinge responded, and a final open discussion was held on the extropians mailing list. Hanson edited together these results here. Yudkowsky thought Max More's comments on Vinge underestimated how different from humans AI would probably be, and this prompted Yudkowsky to begin an early draft of "Coding a Transhuman AI" (CaTAI) which by 2000 had grown into the first large explication of his thoughts on "Seed AI" and "friendly" machine superintelligence (Yudkowsky 2000).

Around this same time, Yudkowsky wrote "The Plan to the Singularity" and "The Singularitarian Principles," and launched the SL4 mailing list.

At a May 2000 gathering hosted by the Foresight Institute, Brian Atkins and Sabine Stoeckel discussed with Yudkowsky the possibility of launching an organization specializing in AI safety. In July of that year, Yudkowsky formed the Singularity Institute and began his full-time research on the problems of AI risk and opportunity.

In 2001, he published two "sequels" to CaTAI, "General Intelligence and Seed AI" and, most importantly, "Creating Friendly AI" (CFAI) (Yudkowsky 2001).

The publication of CFAI was a significant event, prompting Ben Goertzel (the pioneer of the new Artificial General Intelligence research community) to say that "Creating Friendly AI is the most intelligent writing about AI that I've read in many years," and prompting Eric Drexler (the pioneer of molecular manufacturing) to write that "With Creating Friendly AI, the Singularity Institute has begun to fill in one of the greatest remaining blank spots in the picture of humanity's future."

CFAI was both frustrating and brilliant. It was frustrating because: (1) it was disorganized and opaque, (2) it invented new terms instead of using the terms being used by everyone else, for example speaking of "supergoals" and "subgoals" instead of final and instrumental goals, and speaking of goal systems but never "utility functions," and (3) it hardly cited any of the relevant works in AI, philosophy, and psychology — for example it could have cited McCulloch (1952), Good (1959, 1970, 1982), Cade (1966), Versenyi (1974), Evans (1979), Lampson (1979), the conversation with Ed Fredkin in McCorduck (1979), Sloman (1984), Schmidhuber (1987), Waldrop (1987), Pearl (1989), De Landa (1991), Crevier (1993, ch. 12), Clarke (1993, 1994), Weld & Etzioni (1994), Buss (1995), Russell & Norvig (1995), Gips (1995), Whitby (1996), Schmidhuber et al. (1997), Barto & Sutton (1998), Jackson (1998), Levitt (1999), Moravec (1999), Kurzweil (1999), Sobel (1999), Allen et al. (2000), Gordon (2000), Harper (2000), Coleman (2001), and Hutter (2001). These features still substantially characterize Yudkowsky's independent writing, e.g. see Yudkowsky (2010). As late as January 2006, he still wrote that "It is not that I have neglected to cite the existing major works on this topic, but that, to the best of my ability to discern, there are no existing major works to cite."

On the other hand, CFAI was in many ways brilliant, and it tackled many of the problems left mostly untouched by mainstream machine ethics researchers. For example, CFAI (but not the mainstream machine ethics literature) engaged the problems of: (1) radically self-improving AI, (2) AI as an existential risk, (3) hard takeoff, (4) the interplay of goal content, acquisition, and structure, (5) wireheading, (6) subgoal stomp, (7) external reference semantics, (8) causal validity semantics, and (9) selective support (which Bostrom (2002) would later call "differential technological development").

For many years, the Singularity Institute was little more than a vehicle for Yudkowsky's research. In 2002 he wrote "Levels of Organization in General Intelligence," which later appeared in the first edited volume on Artificial General Intelligence (AGI). In 2003 he wrote what would become the internet's most popular tutorial on Bayes' Theorem, followed in 2005 by "A Technical Explanation of Technical Explanation." In 2004 he explained his vision of a Friendly AI goal structure: "Coherent Extrapolated Volition." In 2006 he wrote two chapters that would later appear in the Global Catastrophic Risks volume from Oxford University Press (co-edited by Bostrom): "Cognitive Biases Potentially Affecting Judgment of Global Risks" and, what remains his "classic" article on the need for Friendly AI, "Artificial Intelligence as a Positive and Negative Factor in Global Risk."

In 2004, Tyler Emerson was hired as the Singularity Institute's executive director. Emerson brought on Nick Bostrom (then a postdoctoral fellow at Yale), Christine Peterson (of the Foresight Institute), and others as advisors. In February 2006, PayPal co-founder Peter Thiel donated $100,000 to the Singularity Institute, and, we might say, the Singularity Institute as we know it today was born.

From 2005-2007, Yudkowsky worked at various times with Marcello Herreshoff, Nick Hay and Peter de Blanc on the technical problems of AGI necessary for technical FAI work, for example creating AIXI-like architectures, developing a reflective decision theory, and investigating limits inherent in self-reflection due to Löb's Theorem. Almost none of this research has been published, in part because of the desire not to accelerate AGI research without having made corresponding safety progress. (Marcello also worked with Eliezer during the summer of 2009.)

Much of the Singularity Institute's work has been "movement-building" work. The institute's Singularity Summit, held annually since 2006, attracts technologists, futurists, and social entrepreneurs from around the world, bringing to their attention not only emerging and future technologies but also the basics of AI risk and opportunity. The Singularity Summit also gave the Singularity Institute much of its access to cultural, academic, and business elites.

Another key piece of movement-building work was Yudkowsky's "The Sequences," which were written during 2006-2009. Yudkowsky blogged, almost daily, on the subjects of epistemology, language, cognitive biases, decision-making, quantum mechanics, metaethics, and artificial intelligence. These posts were originally published on a community blog about rationality, Overcoming Bias (which later became Hanson's personal blog). Later, Yudkowsky's posts were used as the seed material for a new group blog, Less Wrong.

Yudkowsky's goal was to create a community of people who could avoid common thinking mistakes, change their minds in response to evidence, and generally think and act with an unusual degree of Technical Rationality. In CFAI he had pointed out that when it comes to AI, humanity may not have a second chance to get it right. So we can't run a series of intelligence explosion experiments and "see what works." Instead, we need to predict in advance what we need to do to ensure a desirable future, and we need to overcome common thinking errors when doing so. (Later, Yudkowsky expanded his "community of rationalists" by writing the most popular Harry Potter fanfiction in the world, Harry Potter and the Methods of Rationality, and is currently helping to launch a new organization that will teach classes on the skills of rational thought and action.)

This community demonstrated its usefulness in 2009 when Yudkowsky began blogging about some problems in decision theory related to the project of building a Friendly AI. Much like Tim Gowers' Polymath Project, these discussions demonstrated the power of collaborative problem-solving over the internet. The discussions led to a decision theory workshop and then a decision theory mailing list, which quickly became home to some of the most interesting work in decision theory anywhere in the world. Yudkowsky summarized some of his earlier results in "Timeless Decision Theory" (2010), and newer results have been posted to Less Wrong, for example "A model of UDT with a halting oracle" and "Formulas of arithmetic that behave like decision agents."

The Singularity Institute also built its community with a Visiting Fellows program that hosted groups of researchers for 1-3 months at a time. Together, both visiting fellows and newly hired research fellows produced several working papers between 2009 and 2011, including Machine Ethics and Superintelligence, Implications of a Software-Limited Singularity, Economic Implications of Software Minds, Convergence of Expected Utility for Universal AI, and Ontological Crises in Artificial Agents' Value Systems.

In 2011, then-president Michael Vassar left the Singularity Institute to help launch a personalized medicine company, and research fellow Luke Muehlhauser (the author of this document) took over leadership from Vassar, as Executive Director. During this time, the Institute underwent a major overhaul to implement best practices for organizational process and management: it published its first strategic plan, began to maintain its first donor database, adopted best practices for accounting and bookkeeping, updated its bylaws and articles of incorporation, adopted more standard roles for the Board of Directors and the Executive Director, held a series of strategic meetings to help decide the near-term goals of the organization, began to publish monthly progress reports to its blog, started outsourcing more work, and began to work on more articles for peer-reviewed publications: as of March 2012, the Singularity Institute has more peer-reviewed publications forthcoming in 2012 than it had published in all of 2001-2011 combined.

Today, the Singularity Institute collaborates regularly with its (non-staff) research associates, and also with researchers at the Future of Humanity Institute at Oxford University (directed by Bostrom), which as of March 2012 is the world's only other major research institute largely focused on the problems of existential risk.

 

Robin Hanson

Whereas Yudkowsky has never worked in the for-profit world and had no formal education after high school, Robin Hanson (born 1959) has a long and prestigious academic and professional history. Hanson took a B.S. in physics from U.C. Irvine in 1981, took an M.S. in physics and an M.A. in the conceptual foundations of science from U. Chicago in 1984, worked in artificial intelligence for Lockheed and NASA, got a Ph.D. in social science from Caltech in 1997, did a post-doctoral fellowship in health policy at U.C. Berkeley from 1997-1999, and finally was made an assistant professor of economics at George Mason University in 1999. In economics, he is best known for conceiving of prediction markets.

When Hanson moved to California in 1984, he encountered the Project Xanadu crowd and met Eric Drexler, who showed him an early draft of Engines of Creation. This community discussed AI, nanotech, cryonics, and other transhumanist topics, and Hanson joined the extropians mailing list (along with many others from Project Xanadu) when it launched in 1991.

Hanson has published several papers on the economics of whole brain emulations (what he calls "ems") and AI (1994, 1998a, 1998b, 2008a, 2008b, 2008c, 2012a). His writings at Overcoming Bias (launched November 2006) are perhaps even more influential, and cover a wide range of topics.

Hanson's views on AI risk and opportunity differ from Yudkowsky's. First, Hanson sees the technological singularity and the human-machine conflict it may produce not as a unique event caused by the advent of AI, but as a natural consequence of "the general fact that accelerating rates of change increase intergenerational conflicts" (Hanson 2012b). Second, Hanson thinks an intelligence explosion will be slower and more gradual than Yudkowsky does, denying Yudkowsky's "hard takeoff" thesis (Hanson & Yudkowsky 2008).

 

Nick Bostrom

Nick Bostrom (born 1973) received a B.S. in philosophy, mathematics, mathematical logic, and artificial intelligence from the University of Goteborg in 1994, setting a national record in Sweden for undergraduate academic performance. He received an M.A. in philosophy and physics from U. Stockholm in 1996, did work in astrophysics and computational neuroscience at King's College London, and received his Ph.D. from the London School of Economics in 2000. He went on to be a post-doctoral fellow at Yale University and in 2005 became the founding director of Oxford University's Future of Humanity Institute (FHI). Without leaving FHI, he became the founding director of Oxford's Programme on the Impacts of Future Technology (aka FutureTech) in 2011.

Bostrom had long been interested in cognitive enhancement, and in 1995 he joined the extropians mailing list and learned about cryonics, uploading, AI, and other topics.

Bostrom worked with British philosopher David Pearce to found the World Transhumanist Association (now called H+) in 1998, with the purpose of developing a more mature and academically respectable form of transhumanism than was usually present on the extropians mailing list. During this time Bostrom wrote "The Transhumanist FAQ" (now updated to version 2.1), with input from more than 50 others.

His first philosophical publication was "Predictions from Philosophy? How philosophers could make themselves useful" (1997). In this paper, Bostrom proposed "a new type of philosophy, a philosophy whose aim is prediction." On Bostrom's view, one role for the philosopher is to be a polymath who can engage in technological prediction and try to figure out how to steer the future so that humanity's goals are best met.

Bostrom gave three examples of problems this new breed of philosopher-polymath could tackle: the Doomsday argument and anthropics, the Fermi paradox, and superintelligence:

What questions could a philosophy of superintelligence deal with? Well, questions like: How much would the predictive power for various fields increase if we increase the processing speed of a human-like mind a million times? If we extend the short-term or long-term memory? If we increase the neural population and the connection density? What other capacities would a superintelligence have? How easy would it be for it to rediscover the greatest human inventions, and how much input would it need to do so? What is the relative importance of data, theory, and intellectual capacity in various disciplines? Can we know anything about the motivation of a superintelligence? Would it be feasible to preprogram it to be good or philanthropic, or would such rules be hard to reconcile with the flexibility of its cognitive processes? Would a superintelligence, given the desire to do so, be able to outwit humans into promoting its own aims even if we had originally taken strict precautions to avoid being manipulated? Could one use one superintelligence to control another? How would superintelligences communicate with each other? Would they have thoughts which were of a totally different kind from the thoughts that humans can think? Would they be interested in art and religion? Would all superintelligences arrive at more or less the same conclusions regarding all important scientific and philosophical questions, or would they disagree as much as humans do? And how similar in their internal belief-structures would they be? How would our human self-perception and aspirations change if [we] were forced to abdicate the throne of wisdom...? How would we individuate between superminds if they could communicate and fuse and subdivide with enormous speed? Will a notion of personal identity still apply to such interconnected minds? Would they construct an artificial reality in which to live? Could we upload ourselves into that reality? Could we then be able to compete with the superintelligences, if we were accelerated and augmented with extra memory etc., or would such profound reorganisation be necessary that we would no longer feel we were humans? Would that matter?

Bostrom went on to examine some philosophical issues related to superintelligence, in "Predictions from Philosophy" and in "How Long Before Superintelligence?" (1998), "Existential Risks: Analyzing Human Extinction Scenarios and Related Hazards" (2002), "Ethical Issues in Advanced Artificial Intelligence" (2003), "The Future of Human Evolution" (2004), and "The Ethics of Artificial Intelligence" (2012, coauthored with Yudkowsky). (He also played out the role of philosopher-polymath with regard to several other topics, including human enhancement and anthropic bias.)

Bostrom's industriousness paid off:

In 2009, [Bostrom] was awarded the Eugene R. Gannon Award (one person selected annually worldwide from the fields of philosophy, mathematics, the arts and other humanities, and the natural sciences). He has been listed in the FP 100 Global Thinkers list, the Foreign Policy Magazine's list of the world's top 100 minds. His writings have been translated into more than 21 languages, and there have been some 80 translations or reprints of his works. He has done more than 470 interviews for TV, film, radio, and print media, and he has addressed academic and popular audiences around the world.

The other long-term member of the Future of Humanity Institute, Anders Sandberg, has also published some research on AI risk. Sandberg was a co-author on the whole brain emulation roadmap and "Anthropic Shadow", and also wrote "Models of the Technological Singularity" and several other papers.

Recently, Bostrom and Sandberg were joined by Stuart Armstrong, who wrote "Anthropic Decision Theory" (2011) and was the lead author on "Thinking Inside the Box: Using and Controlling Oracle AI" (2012). He had previously written Chaining God (2007).

For more than a year, Bostrom has been working on a new book titled Superintelligence: A Strategic Analysis of the Coming Machine Intelligence Revolution, which aims to sum up and organize much of the (published and unpublished) work done in the past decade by researchers at the Singularity Institute and FHI on the subject of AI risk and opportunity, as well as contribute new insights.

 

AI Risk Goes Mainstream

In 1997, professor of cybernetics Kevin Warwick published March of the Machines, in which he predicted that within a couple decades, machines would become more intelligent than humans, and would pose an existential threat.

In 2000, Sun Microsystems co-founder Bill Joy published "Why the Future Doesn't Need Us" in Wired magazine. In this widely-circulated essay, Joy argued that "Our most powerful 21st-century technologies — robotics, genetic engineering, and nanotech — are threatening to make humans an endangered species." Joy advised that we relinquish development of these technologies rather than sprinting headlong into an arms race between destructive uses of these technologies and defenses against those destructive uses.

Many people dismissed Bill Joy as a "Neo-Luddite," but many experts expressed similar concerns about human extinction, including philosopher John Leslie (The End of the World), physicist Martin Rees (Our Final Hour), legal theorist Richard Posner (Catastrophe: Risk and Response), and the contributors to Global Catastrophic Risks (including Yudkowsky, Hanson, and Bostrom).

Even Ray Kurzweil, known as an optimist about technology, devoted a chapter of his 2005 bestseller The Singularity is Near to a discussion of existential risks, including risks from AI. Though he discussed the possibility of existential catastrophe at length, his take on AI risk was cursory (p. 420):

Inherently there will be no absolute protection against strong AI. Although the argument is subtle I believe that maintaining an open free-market system for incremental scientific and technological progress, in which each step is subject to market acceptance, will provide the most constructive environment for technology to embody widespread human values. As I have pointed out, strong AI is emerging from many diverse efforts and will be deeply integrated into our civilization's infrastructure. Indeed, it will be intimately embedded in our bodies and brains. As such, it will reflect our values because it will be us.

AI risk finally became a "mainstream" topic in analytic philosophy with Chalmers (2010) and an entire issue of the Journal of Consciousness Studies devoted to the topic.

The earliest popular discussion of machine superintelligence may have been in Christopher Evans' international bestseller The Mighty Micro (1979), pages 194-198, 231-233, and 237-246.


The Current Situation

Two decades have passed since the early transhumanists began to seriously discuss AI risk and opportunity on the extropians mailing list. (Before that, some discussions took place at the MIT AI lab, but that was before the web was popular, so they weren't recorded.) What have we humans done since then?

Lots of talking. Hundreds of thousands of man-hours have been invested into discussions on the extropians mailing list, SL4, Overcoming Bias, Less Wrong, the Singularity Institute's decision theory mailing list, several other internet forums, and also in meat-space (especially in the Bay Area near the Singularity Institute and in Oxford near FHI). These are difficult issues; talking them through is usually the first step to getting anything else done.

Organization. Mailing lists are a form of organization, as are organizations like The Singularity Institute and university departments like the FHI and FutureTech. Established organizations provide opportunities to bring people together, and to pool and direct resources efficiently.

Resources. Many people of considerable wealth, along with thousands of other "concerned citizens" around the world, have decided that AI is the most significant risk and opportunity we face, and are willing to invest in humanity's future.

Outreach. Publications (both academic and popular), talks, and interactions with major and minor media outlets have been used to raise awareness of AI risk and opportunity. This has included outreach to specific AGI researchers, some of whom now take AI safety quite seriously. This also includes outreach to people in positions of influence who can engage in differential technological development. It also includes outreach to the rapidly growing "optimal philanthropy" community; a large fraction of those associated with Giving What We Can take existential risk — and AI risk in particular — quite seriously.

Research. So far, most research on the topic has been concerned with trying to become less confused about what, exactly, the problem is, how worried we should be, and which strategic actions we should take. How do we predict technological progress? How can we predict AI outcomes? Which interventions, taken now, would probably increase the odds of positive AI outcomes? There has also been some "technical" research in decision theory (e.g. TDT, UDT, ADT), the math of AI goal systems ("Learning What to Value," "Ontological Crises in Artificial Agents' Value Systems," "Convergence of Expected Utility for Universal AI"), and Yudkowsky's unpublished research on Friendly AI.

Muehlhauser (2011) provides an overview of the categories of research problems we have left to solve. Most of the known problems aren't even well-defined at this point.

 


Comments

I laughed out loud as I was scrolling down, and the third major section item was "Eliezer Yudkowsky". And you even included a little biography, "According to "Eliezer, the person," Eliezer Yudkowsky (born 1979) was an exceptionally bright kid"

I'm not sure if that was honestly intended to be a joke, or if I'm completely missing it.

Elevating him to the pantheon of history seems a bit premature as of yet; why not retitle the section with a summary of his contributions to thinking on AI risk? The same goes for Robin Hanson. I'm just picking on Eliezer because people accuse us of worshipping Eliezer, and this is exactly why.


I had a very similar reaction. I think the problem is that the tone of this piece is strongly reminiscent of a high school history textbook: Bostrom, Hanson, and Yudkowsky are talked about as if they are famous historical figures, and the growing awareness of AI risks is discussed like it's a historical movement that happened hundreds of years ago. This tone is even more evident when Luke talks about Bostrom, Hanson, and especially Yudkowsky. Luke cites primary sources about Eliezer (e.g. his autobiography) and talks about his early life as if it's historically noteworthy, and adds commentary on his works (e.g. "CFAI was both frustrating and brilliant") in the same tone of voice that a history teacher would critique George Washington's battle plans. It just comes off as baffling to me, and probably even more so to an outside audience.

I'm not sure if that was honestly intended to be a joke, or if I'm completely missing it.

It'd be a little more obvious if the understatement was more exaggerated, eg. just 'was a bright kid'.

The discussions led to a decision theory workshop and then a decision theory mailing list, which quickly became home to some of the most advanced work in decision theory anywhere in the world.

Please, please, please stop calling the current theory "advanced", much less "the most advanced work in decision theory anywhere in the world". This is in many senses false, and in any case violates communication norms.

Yudkowsky summarized some of these results in "Timeless Decision Theory" (2010)

Yudkowsky's paper presents his own results, not the results produced by the discussion on LW/decision theory list. As I understand, LW had little impact on TDT (as it's presented in the paper), instead it produced the many variants of UDT (that build on some ideas in TDT, but don't act as its direct further development).

Fixed and fixed.

I would give Vernor Vinge a bit more credit. He was a professor of computer science as well as a novelist in 1993. His "A Fire Upon the Deep", published in 1992, featured a super-intelligent AI (called the Blight) that posed an existential threat to a galactic civilization. (I wonder whether, if Eliezer had been introduced to the Singularity through that book instead of "True Names", he would have invented the FAI idea several years earlier.)

To quote Schmidhuber:

It was Vinge, however, who popularized the technological singularity and significantly elaborated on it, exploring pretty much all the obvious related topics, such as accelerating change, computational speed explosion, potential delays of the singularity, obstacles to the singularity, limits of predictability and negotiability of the singularity, evil vs benign super-intelligence, surviving the singularity, etc.

Luke, can you please also add "professor of computer science" to the post to describe Vernor Vinge? Apparently "It's straight out of science fiction" is a major reason for people to dismiss AI risk, so adding "professor of computer science" to "novelist" would probably help a lot in that regard.

Reading a concrete summary like this reduced my amount of hero worship of Eliezer by between 20% to 40%. (It is still, according to my estimates, far above that of most religious fanatics. They don't tend to worship Eliezer very much. :p )

Why are Eliezer's old essays stored in the Wayback Machine? Wouldn't they be safer on the SIAI's servers?

Some of them are still publicly available, but like gwern says, embarrassing but not-worth-trying-to-hide.

They surely would be, but perhaps he regards them as juvenilia and embarrassing. (I would, for some of them.)

Which?

(I love 'Eliezer, the Person.')

This is a wonderful summary of events-to-date.

Well, there's a big issue right here: Eliezer didn't write any practical AIs of any kind, or do programming in general, and thus the 'mainstream' AI developer's view of him is that even though he's bright, he's also full of wrong intuitions about the topic. E.g. a great many people believe that their argumentation about 'agents' is very correct and reliable even when casually done. Do ten years of software development, please; you'll have ten years of a very simple and rude machine demonstrating to you, daily, that your reasoning is flawed, and you'll hopefully be a lot more calibrated.

Another thing people believe is that with more hardware, worse algorithms will work OK and models like 'utility maximization' will be more appropriate. When you have multiple-choice problems, maybe. When you have an immense number of choices - such as when searching solution spaces to find a way to self-improve - the more powerful the hardware, the more important the impact of the algorithms and heuristics that bring the complexity down from a naive O(2^n). One doesn't calculate arbitrary utilities and then compare them across the huge solution space; one starts from the utility and derives points of interest in the solution space. There's a wrong intuition that when there's more powerful hardware, such heuristics can be disposed of. Won't happen. It won't be the case even for an AI made of Dyson spheres around every star in the galaxy (running classical computation). A big number is not infinity. The infinity-based intuitions get more and more wrong as the AI gets larger, for all foreseeable sizes. The AI crowd who are not programmers making practically useful software are working from infinity-based intuitions.

edit: also, I would guess there's an 'AI-making lesion', by the name of curiosity, that would make the future AI maker also learn to program, at a very young age too. If you are curious whether a certain neat FAI idea is correct, you can code a test where there is something simple but concrete that the simple FAI should be stably friendly towards.

Another thing people believe is that with more hardware, worse algorithms will work OK and models like 'utility maximization' will be more appropriate.

Better than utility maximization? I'm inclined to facepalm.

Better than the naive "predict future worlds resulting from actions, calculate utilities of future worlds, choose one with largest" model, considering the imperfect information, highly inaccurate prediction, expensive-to-evaluate utility, and the time constraints.

Real agents: predict at varying levels of accuracy depending on available time, predict the sign of the difference of utilities instead of numeric utilities, substitute cheaper-to-evaluate utilities in place of the original utilities, employ reverse reasoning (start with the desired state and, in a sense, simulate backwards), employ general strategies (e.g. trying to produce actions that are more reversible, to avoid cornering themselves), and a zillion other approaches.
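
To make that contrast concrete, here is a minimal Python sketch of the two styles of agent being described above. Everything in it is made up for illustration (the toy action space, the "expensive" prediction function, the noisy proxy); it is a sketch of the general idea, not anyone's actual architecture.

```python
import random

random.seed(0)

ACTIONS = range(10_000)  # stand-in for a huge action space

def expensive_predict(action):
    # Pretend this is a costly world-model rollout returning the "true" utility.
    return (action * 37) % 101

def cheap_proxy(action):
    # A cheap, noisy stand-in for the utility, used only to shortlist candidates.
    return (action * 37) % 101 + random.uniform(-5.0, 5.0)

def naive_maximizer(actions):
    # Predict the outcome of every action and take the argmax.
    # Workable for 10,000 actions; hopeless for 10^1000.
    return max(actions, key=expensive_predict)

def resource_limited_agent(actions, budget=50):
    # 1. Shortlist candidates with the cheap proxy.
    shortlist = sorted(actions, key=cheap_proxy, reverse=True)[:budget]
    # 2. Spend expensive predictions only on the shortlist, and only compare
    #    the sign of utility differences between pairs of candidates.
    best, best_u = shortlist[0], expensive_predict(shortlist[0])
    for a in shortlist[1:]:
        u = expensive_predict(a)
        if u - best_u > 0:  # the sign of the difference is all we need
            best, best_u = a, u
    return best

print(naive_maximizer(ACTIONS), resource_limited_agent(ACTIONS))
```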

Evolution normally provides an underlying maximand for organisms: fitness. To maximise their fitnesses, many organisms use their own optimisation tool: a brain - which is essentially a hedonism maximiser. It's true that sometimes the best thing to do is to use a fast-and-cheap utility function to burn through some task. However, that strategy is normally chosen by a more advanced optimiser for efficiency reasons.

I think a reasonable picture is that creatures act so as to approximate utility maximization as best they can. Utility is an organism's proxy for its own fitness, and fitness is what really is being maximised.

"Complex reasoning" and "simulating backwards" are legitimate strategies for a utility maximizer to use - if they help to predict how their environment will behave.

Well, being a hedonism maximizer leads to wireheading.

I think a reasonable picture is that creatures act so as to approximate utility maximization as best they can. Utility is an organism's proxy for its own fitness, and fitness is what really is being maximised.

Nah, whenever a worse/flawed approximation to maximization of hedonistic utility results in better fitness, the organisms do that too.

"Complex reasoning" and "simulating backwards" are legitimate strategies for a utility maximizer to use - if they help to predict how their environment will behave.

They don't help predict, they help to pick actions out of the space of some >10^1000 actions that the agent really has to choose from. That results in an apparent preference for some actions but not others, a preference that has nothing to do with any utilities and everything to do with action generation. Choosing in the huge action space is hard.

The problem really isn't so much with utility maximization as a model - one could describe anything as utility maximization; in the extreme, define the utility so that when the action matches the one chosen by the agent, the utility is 1, and 0 otherwise. The problem is when inexperienced, uneducated people start taking utility maximization too literally, imagining as an ideal some very specific architecture - one that forecasts and compares the utilities it forecasts - and start modelling AI behaviour with this idea.

Note, by the way, that such a system is not maximizing a utility, but the system's prediction of the utility. Map vs territory confusion here. Maximizing your predicted utility gets you into checkmate when the opponent knows how your necessarily inexact prediction works.
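
In symbols (my notation, not the commenter's), the point is that such an agent selects

$$a^{*} = \arg\max_{a} \hat{U}(a), \qquad \hat{U}(a) = U(a) + \varepsilon(a),$$

so whatever shapes the estimation error $\varepsilon(a)$, including an opponent who knows how the prediction is computed, shifts which action wins, independently of the true utility $U$.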

Well, being a hedonism maximizer leads to wireheading.

Well, that gets complicated. Of course, we can see that there are non-wireheading hedonism maximizers - so there are ways around this problem.

I think a reasonable picture is that creatures act so as to approximate utility maximization as best they can. Utility is an organism's proxy for its own fitness, and fitness is what really is being maximised.

Nah, whenever a worse/flawed approximation to maximization of hedonistic utility results in better fitness, the organisms do that too.

Well again this is complicated territory. The evolutionary purpose of the brain is to maximise inclusive fitness - and to do that it models its expectations about its fitness and maximises those. If the environment changes in such a way that the brain's model is inaccurate then evolution could hack in non-brain-based solutions - but ultimately it is going to try and fix the problem by getting the organism's brain proxy for fitness and evolutionary fitness back into alignment with each other. Usually these two are fairly well aligned - though humans are in a rapidly-changing environment, and so are a bit of an exception to this.

Basically, having organisms fight against their own brains is wasteful - and nature tries to avoid it - by making organisms as harmonious as it can manage.

"Complex reasoning" and "simulating backwards" are legitimate strategies for a utility maximizer to use - if they help to predict how their environment will behave.

They don't help predict, they help to pick actions out of the space of some >10^1000 actions that the agent really has to choose from.

So, that is the point of predicting! It's a perfectly conventional way of traversing a search tree to run things backwards and undo - rather than attempt to calculate each prediction from scratch somehow or another. Something like that is not a deviation from a utility maximisation algorithm.

Note, by the way, that such a system is not maximizing a utility, but the system's prediction of the utility. Map vs territory confusion here. Maximizing your predicted utility gets you into checkmate when the opponent knows how your necessarily inexact prediction works.

Not necessarily - if you expect to face a vastly more powerful agent, you can sometimes fall back on non-deterministic algorithms - and avoid being outwitted in this particular way.

Anyway, you have to maximise your expectations of utility (rather than your utility). That isn't a map vs territory confusion, it's just the way agents have to work.

Well, that gets complicated. Of course, we can see that there are non-wireheading hedonism maximizers - so there are ways around this problem.

Not sure how non-wireheaded, though. One doesn't literally need a wire to the head to have a shortcut. Watching movies or reading fiction is a pretty wireheaded effort.

Well again this is complicated territory. The evolutionary purpose of the brain is to maximise inclusive fitness - and to do that it models its expectations about its fitness and maximises those.

I'm not quite sure that's how it works. The pain and the pleasure seem to be reinforcement values for neural network training, rather than actual utilities of any kind. Suppose you are training a dog not to chew stuff, by reinforcements. The reinforcement value is not proportional to the utility of the behaviour, but is set so as to optimize the training process.

So, that is the point of predicting! It's a perfectly conventional way of traversing a search tree to run things backwards and undo - rather than attempt to calculate each prediction from scratch somehow or another. Something like that is not a deviation from a utility maximisation algorithm.

See, if there are two actions, one resulting in utility 1000 and the other resulting in utility 100, this method can choose the one that results in utility 100, because it is reachable by imperfect backwards tracing while the 1000 one isn't (and is lost in giant space that one can't search). At that point, you could of course declare that being reachable by backwards tracing is a very desirable property of an action, and shift the goalposts so that this action has utility 2000, but it's clear that this is a screwy approach.
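
For what it's worth, the 1000-versus-100 point fits in a few lines of Python. This is a toy sketch under made-up assumptions (the goal names, utilities, and backward "recipes" are all hypothetical), not a model of any real planner:

```python
# Backward chaining only ever proposes goals it can trace, via known recipes,
# down to directly doable steps, so a higher-utility outcome with no known
# backward route is never even considered.

UTILITY = {
    "good_enough_plan": 100,        # reachable through known recipes
    "best_possible_outcome": 1000,  # better, but no recipe leads back to it
}

# Backward rules: "to achieve X, first achieve each of these subgoals."
RECIPES = {
    "good_enough_plan": ["subgoal_a"],
    "subgoal_a": ["subgoal_b"],
    "subgoal_b": [],  # directly doable
}

def reachable(goal, recipes):
    """True if the goal can be traced back to directly doable steps."""
    if goal not in recipes:
        return False
    return all(reachable(step, recipes) for step in recipes[goal])

candidates = [g for g in UTILITY if reachable(g, RECIPES)]
print(max(candidates, key=UTILITY.get))  # -> good_enough_plan (utility 100)
```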

Anyway, you have to maximise your expectations of utility (rather than your utility). That isn't a map vs territory confusion, it's just the way agents have to work.

And how do you improve the models you use for expectations?

Of course you can describe literally anything as 'utility maximization'; the issue is that an agent which is maximizing something doesn't really even need to know what it is maximizing, doesn't necessarily do any calculation of utilities, et cetera. You don't really have a model here, you just have a description, and if you are to model it as a utility maximizer, you'll be committing a fallacy, as with that blue-minimizing robot.

The reinforcement value is not proportional to the utility of the behaviour, but is set so as to optimize the training process.

Maybe - if the person rewarding the dog is doing it wrong. Normally, you would want those things to keep reasonably in step.

[...] lost in giant space that one can't search [...]

So, in the "forecasting/evaluation/tree pruning" framework, that sounds as though it is a consequence of tree pruning.

Pruning is inevitable in resource-limited agents. I wouldn't say it stopped them from being expected utility maximisers, though.

how do you improve the models you use for expectations?

a) get more data; b) figure out how to compress it better;

You don't really have a model here, you just have a description, and if you are to model it as a utility maximizer, you'll be committing a fallacy, as with that blue-minimizing robot.

Alas, I don't like that post very much. It is an attack on the concept of utility-maximization, which hardly seems productive to me. Anyway, I think I see your point here - though I am less clear about how it relates to the previous conversation (about expectations of utility vs utility - or more generally about utility maximisation being somehow "sub-optimal").

Maybe - if the person rewarding the dog is doing it wrong. Normally, you would want those things to keep reasonably in step.

If the dog chews something really expensive up, there is no point punishing the dog proportionally more for that. That would be wrong; some level of punishment is optimal for training; beyond this is just letting anger out.

Pruning is inevitable in resource-limited agents. I wouldn't say it stopped them from being expected utility maximisers, though.

It's not mere pruning. You need a person to be able to feed your pets, you need them to get through the door, so you need a key; you can get a key at a key-duplicating place, so you go to a key-duplicating place you know of to make a duplicate.

That stops them from being usefully modelled as 'choose the action that gives maximum utility'. You can't assume that the agent takes the action that results in maximum utility. You can say that it takes the action which results in as much utility as this agent, with its limitations, could get out of the situation, but that's almost tautological at this point. Also, see

http://en.wikipedia.org/wiki/Intelligent_agent

for terminology.

Anyway, I think I see your point here - though I am less clear about how it relates to the previous conversation (about expectations of utility vs utility - or more generally about utility maximisation being somehow "sub-optimal").

Well, the utility agent as per the wiki article is clearly stupid because it won't reason backwards. And the utility maximizers discussed by purely theoretical AI researchers, likewise.

a) get more data; b) figure out how to compress it better;

Needs something better than trying all models and seeing what fits, though. One should ideally be able to use the normal reasoning to improve models. It feels as though a better model has higher utility.

Maybe - if the person rewarding the dog is doing it wrong. Normally, you would want those things to keep reasonably in step.

If the dog chews up something really expensive, there is no point punishing the dog proportionally more for that. That would be wrong; some level of punishment is optimal for training, and anything beyond that is just venting anger.

You would probably want to let the dog know that some of your chewable things are really expensive. You might also want to tell it about the variance in the value of your chewable items. I'm sure there are some cases where the owner might want to manipulate the dog by giving it misleading reward signals - but honest signals are often best.

Well, the utility-based agent as per the wiki article is clearly stupid, because it won't reason backwards. And the utility maximizers discussed by purely theoretical AI researchers, likewise.

These are the researchers who presume no computational resource limitation? They have no need to use optimisation heuristics - such as the ones you are proposing - they assume unlimited computing resources.

a) get more data; b) figure out how to compress it better;

Needs something better than trying all models and seeing what fits, though. One should ideally be able to use the normal reasoning to improve models. [...]

Sure. Humans use "normal reasoning" to improve their world models.

You would probably want to let the dog know that some of your chewable things are really expensive. You might also want to tell it about the variance in the value of your chewable items. I'm sure there are some cases where the owner might want to manipulate the dog by giving it misleading reward signals - but honest signals are often best.

Well, I don't think that quite works; dogs aren't terribly clever. Back to humans: e.g. significant injuries hurt a lot less than you'd think they would; my guess is that small self-inflicted ones hurt so much for effective conditioning.

These are the researchers who presume no computational resource limitation? They have no need to use optimisation heuristics - such as the ones you are proposing - they assume unlimited computing resources.

The ugly part is when some go on to talk about AGIs certainly killing everyone unless designed in some way that isn't going to work, and otherwise paint wrong pictures of AGI.

Sure. Humans use "normal reasoning" to improve their world models.

Yeah. Sometimes it even results in breakage, when they modify their world models to fit some pre-existing guess.

significant injuries hurt a lot less than you'd think they would; my guess is that small self-inflicted ones hurt so much for effective conditioning.

The idea and its explanation both seem pretty speculative to me.

Pavlovian conditioning is settled science; pain being a negative utility value for the intelligence, etc. - not so much.

The "idea" was:

significant injuries hurt a lot less than you'd think they would

...and its explanation was:

my guess is that small self-inflicted ones hurt so much for effective conditioning.

I'm inclined towards scepticism - significant injuries often hurt a considerable amount - and small ones do not hurt by disproportionately large amounts - at least as far as I know.

There do seem to be some ceiling-like effects - to try and prevent people from passing out and generally going wrong. I don't think that has to do with your hypothesis.

The very fact that you can pass out from pain, and that pain otherwise interferes with thought and actions, implies that pain doesn't work remotely like utility should. Of course one does factor pain into the utility, but that interference is potentially dangerous for survival (you may, e.g., have to cut your hand off when it's stuck under a boulder and you've already determined that cutting the hand off is the best means of survival). You can expect interference along the lines of passing out from the network-training process. You can't expect interference from utility values being calculated.

edit:

Okay, for the middle ground: would you agree that pain has a Pavlovian conditioning role? The brain also assigns it negative utility, but the pain itself isn't utility; it evolved long before brains could think very well. And in principle you'd be better off assigning utility to lasting damage rather than to pain (and most people do at least try).

edit: that is to say, removing your own appendix would have to be easy (for surgeons, at least) if pain were just utility, properly summed with the other utilities, making you overall happy - through the entire process - that you've got the knife for the procedure and can save yourself. It would be like giving up an item worth $10 for $10,000,000: the values are properly summed first, rather than making you feel the loss and feel the gain separately.

The very fact that you can pass out from pain, and that pain otherwise interferes with thought and actions, implies that pain doesn't work remotely like utility should.

You don't think consciousness should be sacrificed - no matter what the degree of damage - in an intelligently designed machine? Nature sacrifices consciousness under a variety of circumstances. Can you defend your intuition about this issue? Why is nature wrong to permit fainting and passing out from excessive pain?

Of course pain should really hurt. It is supposed to distract you and encourage you to deal with it. Creatures in which pain didn't really, really hurt are likely to have left fewer descendants.

Well, insomuch as the intelligence is not distracted and can opt to sit still and play dead, there doesn't seem to be a point in fainting. Any time I have had a somewhat notable injury (falling off a bike, ripping my chin open, getting a nasty case of road rash), the pain has been less than the pain of minor injuries.

Contrast the anticipation of pain with actual pain. Those feel very different. Maybe it is fair to say that pain is instrumental in creating the anticipation of pain, which acts more like utility for an intelligent agent. It also serves as a warning signal, and for conditioning, and generally as something that stops you from eating yourself (and perhaps for telling the intelligence what is and isn't your body). The pain is supposed to encourage you to deal with the damage, but not to distract you from dealing with the damage.

Well, insomuch as the intelligence is not distracted and can opt to sit still and play dead, there doesn't seem to be a point in fainting.

I don't pretend to know exactly why nature does it - but I expect there's a reason. It may be that sometimes being conscious is actively bad. This is one of the reasons for administering anaesthetics - there are cases where a conscious individual in a lot of pain will ineffectually flail around and get themselves into worse trouble - where they would be better off being quiet and still - "playing dead".

As to why not "play dead" while remaining conscious - that's a bit like having two "off" switches. There's already an off switch. Building a second one that bypasses all the usual responses of the conscious mind while remaining conscious could be expensive. Perhaps not ideal for a rarely-used feature.

A lot of the time something is just a side effect. E.g. you select for less aggressive foxes, and you end up with foxes with floppy ears and white spots on their fur.

With regard to flailing around, that strikes me as more of a reflex than utility-driven behaviour. As for playing dead - I mean, I can sit still while having dental work done without anaesthesia.

The problem with just fainting is that it is a reflex - under these conditions, faint; under those conditions, flail around - not proper utility-maximizing agent behaviour, which would be: what are the consequences of flailing around, what are the consequences of sitting still - choose the one that has the better consequences.

It seems to me that originally pain was there to train the neural network not to eat yourself, and then it got re-used for other things that it is not very well suited for.

The problem with just fainting is that it is a reflex - under these conditions, faint; under those conditions, flail around - not proper utility-maximizing agent behaviour, which would be: what are the consequences of flailing around, what are the consequences of sitting still - choose the one that has the better consequences.

Well, it's a consequence of resource limitation. A supercomputer with moment-by-moment control over actions might never faint. However, when there's a limited behavioural repertoire with less precise control over which action to take, and a limited space in which to associate sensory stimuli and actions, occasionally fainting could become a more reasonable course of action.

It seems to me that originally pain was there to train the neural network not to eat yourself, and then it got re-used for other things that it is not very well suited for.

The pleasure-pain axis is basically much the same idea as a utility value - or perhaps the first derivative of a utility. The signal might be modulated by other systems a little, but that's the essential nature of it.

Then why does anticipated pain feel so different from actual ongoing pain?

Also, I think it's more a consequence of the resource limitations of a worm or a fish. We don't have such severe limitations.

Another issue: consider 10 hours of harmless but intense pain vs. a perfectly painless lobotomy. I think most of us would try harder to avoid the latter than the former, and would prefer the pain. edit: furthermore, we could consciously and wilfully take a painkiller, but not a lobotomy-fear-neutralizer.

Then why does anticipated pain feel so different from actual ongoing pain?

I'm not sure I understand why they should be similar. Anticipated pain may never happen. Combining anticipated pain with actual pain probably doesn't happen more, because that would "muddy" the reward signal. You want a "clear" reward signal to facilitate attributing the reward to the actions that led to it. Too much "smearing" of the reward signal over time doesn't help with that.
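
To illustrate the "smearing" point, here is a toy credit-assignment sketch: a reward is attributed to recent actions with an exponentially decaying weight, and the slower the decay, the more the signal smears over actions that had little to do with the outcome. The decay value is purely illustrative.

```python
# Toy credit assignment: spread a reward over recent actions with an
# exponentially decaying weight. A decay near 1.0 "smears" credit across
# many unrelated actions; a small decay keeps the signal "clear".
# The default decay of 0.5 is purely illustrative.

def assign_credit(action_history, reward, decay=0.5):
    """Return {action: credit}, with the most recent action credited most."""
    credit = {}
    weight = 1.0
    for action in reversed(action_history):
        credit[action] = credit.get(action, 0.0) + weight * reward
        weight *= decay
    return credit

# assign_credit(["a1", "a2", "a3"], reward=1.0)
# -> {"a3": 1.0, "a2": 0.5, "a1": 0.25}
```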

I think it's more a consequence of the resource limitations of a worm or a fish. We don't have such severe limitations.

Maybe - though that's probably not an easy hypothesis to test.

Well, the utility as in the 'utility function' of a utility-maximizing agent is something that's calculated over a predicted future state. Pain is only calculated in the now. That's a subtle distinction.

I think this lobotomy example (provided that the subject knows what a lobotomy is and what the brain is, and thus doesn't want the lobotomy) clarifies why I don't think pain works quite like a utility function's output. Fear does work like a proper utility function's output: when you fear something, you also don't want to get rid of that fear (with some exceptions among people whose fear basically doesn't work correctly). And fear is all about a future state.

Well, the utility as in the 'utility function' of a utility-maximizing agent is something that's calculated over a predicted future state. Pain is only calculated in the now. That's a subtle distinction.

It's better to think of future utility as an extrapolation of current utility - and of current utility as basically the same thing as the position on the pleasure-pain axis. Otherwise there is a danger of pointlessly duplicating concepts.

Distinguishing too sharply between utility and pleasure causes a lot of problems. The pleasure-pain axis is nature's attempt to engineer a utility-based system. It did a pretty good job.

Of course, you should not take "pain" too literally - in the case of humans. Humans have modulations on pain that feed into their decision circuitry - but the result is still eventually collapsed down into one dimension - like a utility value.

It's better to think of future utility as an extrapolation of current utility - and of current utility as basically the same thing as the position on the pleasure-pain axis. Otherwise there is a danger of pointlessly duplicating concepts.

The danger here is in inventing terminology that is at odds with normal usage, resulting in confusion when reading texts written in the standard terminology. I would rather describe human behaviour as a 'learning agent' as per Russell & Norvig (2003), where the pain is part of the 'critic'. You can see a diagram on Wikipedia:

http://en.wikipedia.org/wiki/Intelligent_agent

Ultimately, the overly broad definitions become useless.

We also have a bit of 'reflex agent' in us, where pain makes you flinch away, flail around, or faint (though I'd hazard a guess that most people don't faint even when the pain has saturated and can't increase any further).
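
For reference, a rough sketch of that Russell & Norvig "learning agent" decomposition, with the pain/pleasure signal cast as part of the critic. The class and method names here are illustrative choices, not a standard API, and the problem-generator component from the original diagram is omitted:

```python
# Rough sketch of a Russell & Norvig-style learning agent, with the
# pain/pleasure signal cast as part of the critic. Names are illustrative,
# not a standard API; the problem-generator component is omitted.

class LearningAgent:
    def __init__(self, performance_element, critic, learning_element):
        self.performance_element = performance_element  # maps percepts to actions
        self.critic = critic                            # scores how well things are going (e.g. pain)
        self.learning_element = learning_element        # adjusts the performance element

    def step(self, percept):
        action = self.performance_element.act(percept)
        feedback = self.critic.evaluate(percept)
        self.learning_element.update(self.performance_element,
                                     percept, action, feedback)
        return action
```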

Is there a more recent writeup on the history of AI safety anywhere?

"Eliezer, the Person" is one of my favorite things EY's written. I'd love to see an update to this. I found the description of EY's akrasia and "cleaning up the mental landscape" really helpful. I think you or someone posted that EY's energy problems had improved but were still there somewhat, IIRC?

About the only science fiction author you mention is Vernor Vinge. I think many of the earlier authors (such as Saberhagen) touched upon the threat to humanity from AIs that are smarter than we are. Possibly a good place to ask for more details is one of the science fiction discussion sites listed in the SFRG.

You can find fairly extensive coverage of the sci-fi that covers these issues in Broderick's article for JCS.

A passage from McCulloch (1952) that nearly suggests Good's intelligence explosion of 1959:

Von Neumann has already made... a most fascinating proposal... It is natural for us to suppose that if a machine of a given complexity makes another machine, that second machine cannot require any greater specification than was required for the first machine, and will in general be simpler. All our experience with simple machines has been of that kind. But when the complexity of a machine is sufficiently great, this limitation disappears. A generalized Turing machine, coupled with an assembling machine and a duplicator of its tape, could pick up parts from its environment, assemble a machine like itself and its assembling machine and its duplicator of program, put the program into it, and cut loose a new machine like itself... [And] as it is inherently capable of learning, it could make other machines better adapted to its environment or changing as the environment changed.

A passage from Hartree (1949) that could have been quoted in the opening of Complex Value Systems are Required to Realize Valuable Futures:

[Hartree describes a calculation a machine might fail to do if programmed naively, and then says:] A human [operator], faced with this unforeseen situation, would have exercised intelligence, almost automatically and unconsciously, and made the small extrapolation of the operating instructions required to deal with it. The machine without operating instructions for dealing with negative values of z, could not make this extrapolation...

The moral of this experience is that in programming a problem for the machine, it is necessary to try to take a "machine's-eye view" of the operating instructions, that is to look at them from the point of view of the machine which can only follow them literally, without introducing anything not explicitly [specified] by them, and try to foresee all the unexpected things that might occur in the course of the calculation, and to provide the machine with the means of identifying each one and with appropriate operating instructions in each case. And this is not so easy as it sounds; it is quite difficult to put oneself in the position of doing without any of the things which intelligence and experience would suggest to a human [operator] in such situations.

Not quite 'Nanny AI', but a similar idea dates back to D.F. Parkhill (1966, pp. 61-62), who imagined a "computer utility" that could sustain a global peace under the United Nations:

The new generation of on-line, multi-user military [computer] systems is rapidly coming to resemble the oft-heralded "total management systems" — an integrated, multi-user, multi-input, display, processing, communication, storage, and decision complex, incorporating in one man/machine package the total operations of an organization...

Such a total military management system is in fact now taking shape in the United States [and] will tie together into one gigantic integrated command and control complex all of the far-flung command-and-control systems of the Army, Navy, and Air Force as well as the information systems of other government agencies such as the Central Intelligence Agency, State Department, and Civil Defense organization... Conceivably, the Soviet Union also has her own version of [this] under development; thus it is not inconceivable that in some future more rational world the American and Soviet systems might be connected together into a global peace-control system under the United Nations. [Such a system has] obvious uses in policing an arms-control agreement and controlling the operations of international peace-keeping forces...

In the final chapters, Parkhill seems to have successfully predicted online, automatic banking, the universal encyclopedia, online shopping, and growing technological unemployment.