So You Want to Save the World

by lukeprog12 min read1st Jan 2012149 comments


Machine Intelligence Research Institute (MIRI)AI

This post is very out-of-date. See MIRI's research page for the current research agenda.

So you want to save the world. As it turns out, the world cannot be saved by caped crusaders with great strength and the power of flight. No, the world must be saved by mathematicians, computer scientists, and philosophers.

This is because the creation of machine superintelligence this century will determine the future of our planet, and in order for this "technological Singularity" to go well for us, we need to solve a particular set of technical problems in mathematics, computer science, and philosophy before the Singularity happens.

The best way for most people to save the world is to donate to an organization working to solve these problems, an organization like the Singularity Institute or the Future of Humanity Institute.

Don't underestimate the importance of donation. You can do more good as a philanthropic banker than as a charity worker or researcher.

But if you are a capable researcher, then you may also be able to contribute by working directly on one or more of the open problems humanity needs to solve. If so, read on...


At this point, I'll need to assume some familiarity with the subject matter. If you haven't already, take a few hours to read these five articles, and then come back:

  1. Yudkowsky (2008a)
  2. Sandberg & Bostrom (2008)
  3. Chalmers (2010)
  4. Omohundro (2011)
  5. Armstrong et al. (2011)

Or at the very least, read my shorter and more accessible summary of the main points in my online book-in-progress, Facing the Singularity.

Daniel Dewey's highly compressed summary of several key points is:

Hardware and software are improving, there are no signs that we will stop this, and human biology and biases indicate that we are far below the upper limit on intelligence. Economic arguments indicate that most AIs would act to become more intelligent. Therefore, intelligence explosion is very likely. The apparent diversity and irreducibility of information about "what is good" suggests that value is complex and fragile; therefore, an AI is unlikely to have any significant overlap with human values if that is not engineered in at significant cost. Therefore, a bad AI explosion is our default future.

The VNM utility theorem suggests that there is some formally stated goal that we most prefer. The CEV thought experiment suggests that we could program a metaethics that would generate a good goal. The Gandhi's pill argument indicates that goal-preserving self-improvement is possible, and the reliability of formal proof suggests that long chains of self-improvement are possible. Therefore, a good AI explosion is likely possible.

Next, I need to make a few important points:

1. Defining each problem is part of the problem. As Bellman (1961) said, "the very construction of a precise mathematical statement of a verbal problem is itself a problem of major difficulty." Many of the problems related to navigating the Singularity have not yet been stated with mathematical precision, and the need for a precise statement of the problem is part of the problem. But there is reason for optimism. Many times, particular heroes have managed to formalize a previously fuzzy and mysterious concept: see Kolmogorov on complexity and simplicity (Kolmogorov 1965; Grunwald & Vitanyi 2003; Li & Vitányi 2008), Solomonoff on induction (Solomonoff 1964a, 1964b; Rathmanner & Hutter 2011), Von Neumann and Morgenstern on rationality (Von Neumann & Morgenstern 1947; Anand 1995), and Shannon on information (Shannon 1948; Arndt 2004).

2. The nature of the problem space is unclear. Which problems will biological humans need to solve, and which problems can a successful Friendly AI (FAI) solve on its own (perhaps with the help of human uploads it creates to solve the remaining open problems)? Are Friendly AI (Yudkowsky 2001) and CEV (Yudkowsky 2004) coherent ideas, given the confused nature of human "values"? Should we aim instead for something like Oracle AI (Armstrong et al. 2011)? Which problems are we unable to state with precision because they are irreparably confused, and which problems are we unable to state due to a lack of insight?

3. Our intervention priorities are unclear. There are a limited number of capable researchers who will work on these problems. Which are the most important problems they should be working on, if they are capable of doing so? Should we focus on "control problem" theory (FAI, AI-boxing, oracle AI, etc.), or on strategic considerations (differential technological development, methods for raising the sanity waterline, methods for bringing more funding to existential risk reduction and growing the community of x-risk reducers, reducing the odds of AI arms races, etc.)? Is AI more urgent than other existential risks, especially synthetic biology? Is research the most urgent thing to be done, or should we focus on growing the community of x-risk reducers, raising the sanity waterline, bringing in more funding for x-risk reduction, etc.? Can we make better research progress in the next 10 years if we work to improve sanity and funding for 7 years and then have the resources to grab more and better researchers, or can we make better research progress by focusing on research now?

Problem Categories

There are many ways to categorize our open problems; I'll divide them into three groups:

Safe AI Architectures. This may include architectures for securely confined or "boxed" AIs (Lampson 1973), including Oracle AIs, and also AI architectures capable of using a safe set of goals (resulting in Friendly AI).

Safe AI Goals. What could it mean to have a Friendly AI with "good" goals?

Strategy. How do we predict the future and make recommendations for differential technological development? Do we aim for Friendly AI or Oracle AI or both? Should we focus on growing support now, or do we focus on research? How should we interact with the public and with governments?

The list of open problems on this page is very preliminary. I'm sure there are many problems I've forgotten, and many problems I'm unaware of. Probably all of the problems are stated poorly: this is only a "first step" document. Certainly, all listed problems are described at a very "high" level, far away (so far) from mathematical precision, and can themselves be broken down into several and often dozens of subproblems.

Safe AI Architectures

Is "rationally-shaped" transparent AI the only potentially safe AI architecture? Omohundro (2007, 2008, 2011) describes "rationally shaped" AI as AI that is as economically rational as possible given its limitations. A rationally shaped AI has beliefs and desires, its desires are defined by a utility function, and it seeks to maximize its expected utility. If an AI doesn't use a utility function, then it's hard to predict its actions, including whether they will be "friendly." The same problem can arise if the decision mechanism or the utility function is not transparent to humans. At least, this seems to be the case, but perhaps there are strong attractors that would allow us to predict friendliness even without the AI having a transparent utility function, or even a utility function at all? Or, perhaps a new decision theory could show the way to a different AI architecture that would allow us to predict the AI's behavior without it having a transparent utility function?

How can we develop a reflective decision theory? When an agent considers radical modification of its own decision mechanism, how can it ensure that doing so will keep constant or increase its expected utility? Yudkowsky (2011a) argues that current decision theories stumble over Löb's Theorem at this point, and that a new, "reflectively consistent" decision theory is needed.

How can we develop a timeless decision theory with the bugs worked out? Paradoxes like Newcomb's Problem (Ledwig 2000) and Solomon's Problem (Gibbard & Harper 1978) seem to show that neither causal decision theory nor evidential decision theory is ideal. Yudkowsky (2010) proposes an apparently superior alternative, timeless decision theory. But it, too, has bugs that need to be worked out, for example the "5-and-10 problem" (described here by Gary Drescher, who doesn't use the 5-and-10 example illustration).

How can we modify a transparent AI architecture to have a utility function over the external world? Reinforcement learning can only be used to define agents whose goal is to maximize expected rewards. But this doesn't match human goals, so advanced reinforcement learning agents will diverge from our wishes. Thus, we need a class of agents called "value learners" (Dewey 2011) that "can be designed to learn and maximize any initially unknown utility function" (see Hibbard 2011 for clarifications). Dewey's paper, however, is only the first step in this direction.

How can an agent keep a stable utility function through ontological shifts? An agent's utility function may refer to states of, or entities within, its ontology. As De Blanc (2011) notes, "If the agent may upgrade or replace its ontology, it faces a crisis: the agent's original [utility function] may not be well-defined with respect to its new ontology." De Blanc points toward some possible solutions for these problems, but they need to be developed further.

How can an agent choose an ideal prior? We want a Friendly AI's model of the world to be as accurate as possible so that it successfully does friendly things if we can figure out how to give it friendly goals. Solomonoff induction (Li & Vitanyi 2008) may be our best formalization of induction yet, but it could be improved upon.

First, we may need to solve the problem of observation selection effects or "anthropic bias" (Bostrom 2002b): even an agent using a powerful approximation of Solomonoff induction may, due to anthropic bias, make radically incorrect inferences when it does not encounter sufficient evidence to update far enough away from its priors. Several solutions have been proposed (Neal 2006; Grace 2010;, Armstrong 2011), but none are as yet widely persuasive.

Second, we need improvements to Solomonoff induction. Hutter (2009) discusses many of these problems. We may also need a version of Solmonoff induction in second-order logic because second-order logic with binary predicates can simulate higher-order logics with nth-order predicates. This kind of Solomonoff induction would be able to imagine even, for example, hypercomputers and time machines.

Third, we would need computable approximations for this improvement to Solomonoff induction.

What is the ideal theory of how to handle logical uncertainty? Even an AI will be uncertain about the true value of certain logical propositions or long chains of logical reasoning. What is the best way to handle this problem? Partial solutions are offered by Gaifman (2004), Williamson (2001), and Haenni (2005), among others.

What is the ideal computable approximation of perfect Bayesianism? As explained elsewhere, we want a Friendly AI's model of the world to be as accurate as possible. Thus, we need ideal computable theories of priors and of logical uncertainty, but we also need computable approximations of Bayesian inference. Cooper (1990) showed that inference in unconstrained Bayesian networks is NP-hard, and Dagum & Luby (1993) showed that the corresponding approximation problem is also NP-hard. The most common solution is to use randomized sampling methods, also known as "Monte Carlo" algorithms (Robert & Casella 2010). Another approach is variational approximation (Wainwright & Jordan 2008), which works with a simpler but similar version of the original problem. Another approach is called "belief propagation" — for example, loopy belief propagation (Weiss 2000).

Can we develop a safely confined AI? Can we develop Oracle AI? One approach to constraining a powerful AI is to give it "good" goals. Another is to externally constrain it, creating a "boxed" AI and thereby "leakproofing the singularity" (Chalmers 2010). A fully leakproof singularity is impossible or pointless: "For an AI system to be useful... to us at all, it must have some effects on us. At a minimum, we must be able to observe it." Still, there may be a way to constrain a superhuman AI such that it is useful but not dangerous. Armstrong et al. (2011) offer a detailed proposal for constraining an AI, but there remain many worries about how safe and sustainable such a solution is. The question remains: Can a superhuman AI be safely confined, and can humans managed to safely confine all superhuman AIs that are created?

What convergent AI architectures and convergent instrumental goals can we expect from superintelligent machines? Omohundro (2008, 2011) argues that we can expect that "as computational resources increase, there is a natural progress through stimulus-response systems, learning systems, reasoning systems, self-improving systems, to fully rational systems," and that for rational systems there are several convergent instrumental goals: self-protection, resource acquisition, replication, goal preservation, efficiency, and self-improvement. Are these claims true? Are there additional convergent AI architectures or instrumental goals that we can use to predict the implications of machine superintelligence?

Safe AI Goals

Can "safe" AI goals only be derived from contingent "desires" and "goals"? Might a single procedure for responding to goals be uniquely determined by reason? A natural approach to selecting goals for a Friendly AI is to ground them in an extrapolation of current human goals, for this approach works even if we assume the naturalist's standard Humean division between motives and reason. But might a sophisticated Kantian approach work, such that some combination of decision theory, game theory, and algorithmic information theory provides a uniquely dictated response to goals? Drescher (2006) attempts something like this, though his particular approach seems to fail.

How do we construe a utility function from what humans "want"? A natural approach to Friendly AI is to program a powerful AI with a utility function that accurately represents an extrapolation of what humans want. Unfortunately, humans do not seem to have coherent utility functions, as demonstrated by the neurobiological mechanisms of choice (Dayan 2011) and behavioral violations of the axioms of utility theory (Kahneman & Tversky 1979). Economists and computer scientists have tried to extract utility theories from human behavior with choice modelling (Hess & Daly 2010) and preference elicitation (Domshlak et al. 2011), but these attempts have focused on extracting utility functions over a narrow range of human preferences, for example those relevant to developing a particular decision support system. We need new more powerful and universal methods for preference extraction. Or, perhaps we must allow actual humans to reason about their own preferences for a very long time until they reach a kind of "reflective equilibrium" in their preferences (Yudkowsky 2004). The best path may be to upload a certain set of humans, which would allow them to reason through their preferences with greater speed and introspective access. Unfortunately, the development of human uploads may spin off dangerous neuromorphic AI before this can be done.

How should human values be extrapolated? Value extrapolation is an old subject in philosophy (Muehlhauser & Helm 2011), but the major results of the field so far have been to show that certain approaches won't work (Sobel 1994); we still have no value extrapolation algorithms that might plausibly work.

Why extrapolate the values of humans alone? What counts as a human? Do values converge if extrapolated? Would the choice to extrapolate the values of humans alone be an unjustified act of speciesism, or is it justified because humans are special in some way — perhaps because humans are the only beings who can reason about their own preferences? And what counts as a human? The problem is more complicated than one might imagine (Bostrom 2006; Bostrom & Sandberg 2011). Moreover, do we need to scan the values of all humans, or only some? These problems are less important if values converge upon extrapolation for a wide variety of agents, but it is far from clear that this is the case (Sobel 1999, Doring & Steinhoff 2009).

How do aggregate or assess value in an infinite universe? What can we make of other possible laws of physics? Our best model of the physical universe predicts that the universe is spatially infinite, meaning that all possible "bubble universes" are realized an infinite number of times. Given this, how do we make value calculations? The problem is discussed by Knobe (2006) and Bostrom (2009), but more work remains to be done. These difficulties may be exacerbated if the universe is infinite in a stronger sense, for example if all possible mathematical objects exist (Tegmark 2005).

How should we deal with normative uncertainty? We may not solve the problems of value or morality in time to build Friendly AI. Perhaps instead we need a theory of how to handle this normative uncertainty. Sepielli (2009) and Bostrom (2009) have made the initial steps, here.

Is it possible to program an AI to do what is "morally right" rather than give it an extrapolation of human goals? Perhaps the only way to solve the Friendly AI problem is to get an AI to do moral philosophy and come to the correct answer. But perhaps this exercise would only result in the conclusion that our moral concepts are incoherent (Beavers 2011).


What methods can we use to predict technological development? Predicting progress in powerful technologies (AI, synthetic biology, nanotechnology) can help us decide which existential threats are most urgent, and can inform our efforts in differential technological development (Bostrom 2002a). The stability of Moore's law may give us limited predictive hope (Lundstrom 2003; Mack 2011), but in general we have no proven method for long-term technological forecasting, including expert elicitation (Armstrong 1985; Woudenberg 1991; Rowe & Wright 2001) and prediction markets (Williams 2011). Nagy's performance curves database (Nagy 2010) may aid our forecasting efforts, as may "big data" in general (Weinberger 2011).

Which kinds of differential technological development should we encourage, and how? Bostrom (2002) proposes a course of differential technological development: "trying to retard the implementation of dangerous technologies and accelerate implementation of beneficial technologies, especially those that ameliorate the hazards posed by other technologies." Many examples are obvious: we should retard the development of technologies that pose an existential risk, and accelerate the development of technologies that help protect us from existential risk, such as vaccines and protective structures. Some potential applications are less obvious. Should we accelerate the development of whole brain emulation technology so that uploaded humans can solve the problems of Friendly AI, or will the development of WBEs spin off dangerous neuromorphic AI first? (Shulman & Salamon 2011)

Which open problems are safe to discuss, and which are potentially highly dangerous. There was a recent debate on whether a certain scientist should publish his discovery of a virus that "could kill half of humanity." (The answer in this case was "no.") The question of whether to publish results is particularly thorny when it comes to AI research, because most of the work in the "Safe AI Architectures" section above would, if completed, bring us closer to developing both uFAI and FAI, but in particular it would make it easier to develop uFAI. Unfortunately, it looks like that work must be done to develop any kind of FAI, while if it is not done then only uFAI can be developed (Dewey 2011).

What can we do to reduce the risk of an AI arms race? AGI is, in one sense, a powerful weapon for dominating the globe. Once it is seen by governments as a feasible technology goal, we may predict an arms race for AGI. Shulman (2009) gives several reasons to recommend "cooperative control of the development of software entities" over other methods for arms race risk mitigation, but these scenarios require more extensive analysis.

What can we do to raise the "sanity waterline," and how much will this help? The Singularity Institute is a strong advocate of rationality training, in part so that both AI safety researchers and supporters of x-risk reduction can avoid the usual thinking failures that occur when thinking about those issues (Yudkowsky 2008b). This raises the question of how well rationality can be taught, and how much difference it will make for existential risk reduction.

What can we do to attract more funding, support, and research to x-risk reduction and to specific sub-problems of successful Singularity navigation? Much is known about how to raise funding (Oppenheimer & Olivola 2010) and awareness (Kotler & Armstrong 2009), but applying these principles is always a challenge, and x-risk reduction may pose unique problems for these tasks.

Which interventions should we prioritize? There are limited resources available for existential risk reduction work, and for AI safety research in particular. How should these resources be allocated? Should the focus be on direct research, or on making it easier for a wider pool of researchers to contribute, or on fundraising and awareness-raising, or on other types of interventions?

How should x-risk reducers and AI safety researchers interact with governments and corporations? Governments and corporations are potential sources of funding for x-risk reduction work, but they may also endanger the x-risk reduction community. AI development labs will be unfriendly to certain kinds of differential technological development advocated by the AI safety community, and governments may face pressures to nationalize advanced AI research groups (including AI safety researchers) once AGI draws nearer.

How can optimal philanthropists get the most x-risk reduction for their philanthropic buck? Optimal philanthropists aim not just to make a difference, but to make the most possible positive difference. Bostrom (2011) makes a good case for existential risk reduction as optimal philanthropy, but more detailed questions remain. Which x-risk reduction interventions and organizations should be funded? Should new organizations be formed, or should resources be pooled in one or more of the existing organizations working on x-risk reduction?

How does AI risk compare to other existential risks? Yudkowsky (2008a) notes that AI poses a special kind of existential risk, for it can surely destroy the human species but, if done right, it also has the unique capacity to save our species from all other existential risks. But will AI come before other existential risks, especially the risks of synthetic biology? How should efforts be allocated between safe AI and the mitigation of other existential risks? Is Oracle AI enough to mitigate other existential risks?

Which problems do we need to solve, and which ones can we have an AI solve? Can we get an AI to do Friendly AI philosophy before it takes over the world? Which problems must be solved by humans, and which ones can we hand off to the AI?

How can we develop microeconomic models of WBEs and self-improving systems? Hanson (1994, 1998, 2008a, 2008b, 2008c, forthcoming) provides some preliminary steps. Might such models help us predict takeoff speed and the likelihood of monopolar (singleton) vs. multipolar outcomes?.