This post is very out-of-date. See MIRI's research page for the current research agenda.

So you want to save the world. As it turns out, the world cannot be saved by caped crusaders with great strength and the power of flight. No, the world must be saved by mathematicians, computer scientists, and philosophers.

This is because the creation of machine superintelligence this century will determine the future of our planet, and in order for this "technological Singularity" to go well for us, we need to solve a particular set of technical problems in mathematics, computer science, and philosophy before the Singularity happens.

The best way for most people to save the world is to donate to an organization working to solve these problems, an organization like the Singularity Institute or the Future of Humanity Institute.

Don't underestimate the importance of donation. You can do more good as a philanthropic banker than as a charity worker or researcher.

But if you are a capable researcher, then you may also be able to contribute by working directly on one or more of the open problems humanity needs to solve. If so, read on...


At this point, I'll need to assume some familiarity with the subject matter. If you haven't already, take a few hours to read these five articles, and then come back:

  1. Yudkowsky (2008a)
  2. Sandberg & Bostrom (2008)
  3. Chalmers (2010)
  4. Omohundro (2011)
  5. Armstrong et al. (2011)

Or at the very least, read my shorter and more accessible summary of the main points in my online book-in-progress, Facing the Singularity.

Daniel Dewey's highly compressed summary of several key points is:

Hardware and software are improving, there are no signs that we will stop this, and human biology and biases indicate that we are far below the upper limit on intelligence. Economic arguments indicate that most AIs would act to become more intelligent. Therefore, intelligence explosion is very likely. The apparent diversity and irreducibility of information about "what is good" suggests that value is complex and fragile; therefore, an AI is unlikely to have any significant overlap with human values if that is not engineered in at significant cost. Therefore, a bad AI explosion is our default future.

The VNM utility theorem suggests that there is some formally stated goal that we most prefer. The CEV thought experiment suggests that we could program a metaethics that would generate a good goal. The Gandhi's pill argument indicates that goal-preserving self-improvement is possible, and the reliability of formal proof suggests that long chains of self-improvement are possible. Therefore, a good AI explosion is likely possible.

Next, I need to make a few important points:

1. Defining each problem is part of the problem. As Bellman (1961) said, "the very construction of a precise mathematical statement of a verbal problem is itself a problem of major difficulty." Many of the problems related to navigating the Singularity have not yet been stated with mathematical precision, and the need for a precise statement of the problem is part of the problem. But there is reason for optimism. Many times, particular heroes have managed to formalize a previously fuzzy and mysterious concept: see Kolmogorov on complexity and simplicity (Kolmogorov 1965; Grunwald & Vitanyi 2003; Li & Vitányi 2008), Solomonoff on induction (Solomonoff 1964a, 1964b; Rathmanner & Hutter 2011), Von Neumann and Morgenstern on rationality (Von Neumann & Morgenstern 1947; Anand 1995), and Shannon on information (Shannon 1948; Arndt 2004).

2. The nature of the problem space is unclear. Which problems will biological humans need to solve, and which problems can a successful Friendly AI (FAI) solve on its own (perhaps with the help of human uploads it creates to solve the remaining open problems)? Are Friendly AI (Yudkowsky 2001) and CEV (Yudkowsky 2004) coherent ideas, given the confused nature of human "values"? Should we aim instead for something like Oracle AI (Armstrong et al. 2011)? Which problems are we unable to state with precision because they are irreparably confused, and which problems are we unable to state due to a lack of insight?

3. Our intervention priorities are unclear. There are a limited number of capable researchers who will work on these problems. Which are the most important problems they should be working on, if they are capable of doing so? Should we focus on "control problem" theory (FAI, AI-boxing, oracle AI, etc.), or on strategic considerations (differential technological development, methods for raising the sanity waterline, methods for bringing more funding to existential risk reduction and growing the community of x-risk reducers, reducing the odds of AI arms races, etc.)? Is AI more urgent than other existential risks, especially synthetic biology? Is research the most urgent thing to be done, or should we focus on growing the community of x-risk reducers, raising the sanity waterline, bringing in more funding for x-risk reduction, etc.? Can we make better research progress in the next 10 years if we work to improve sanity and funding for 7 years and then have the resources to grab more and better researchers, or can we make better research progress by focusing on research now?

Problem Categories

There are many ways to categorize our open problems; I'll divide them into three groups:

Safe AI Architectures. This may include architectures for securely confined or "boxed" AIs (Lampson 1973), including Oracle AIs, and also AI architectures capable of using a safe set of goals (resulting in Friendly AI).

Safe AI Goals. What could it mean to have a Friendly AI with "good" goals?

Strategy. How do we predict the future and make recommendations for differential technological development? Do we aim for Friendly AI or Oracle AI or both? Should we focus on growing support now, or do we focus on research? How should we interact with the public and with governments?

The list of open problems on this page is very preliminary. I'm sure there are many problems I've forgotten, and many problems I'm unaware of. Probably all of the problems are stated poorly: this is only a "first step" document. Certainly, all listed problems are described at a very "high" level, far away (so far) from mathematical precision, and can themselves be broken down into several and often dozens of subproblems.

Safe AI Architectures

Is "rationally-shaped" transparent AI the only potentially safe AI architecture? Omohundro (2007, 2008, 2011) describes "rationally shaped" AI as AI that is as economically rational as possible given its limitations. A rationally shaped AI has beliefs and desires, its desires are defined by a utility function, and it seeks to maximize its expected utility. If an AI doesn't use a utility function, then it's hard to predict its actions, including whether they will be "friendly." The same problem can arise if the decision mechanism or the utility function is not transparent to humans. At least, this seems to be the case, but perhaps there are strong attractors that would allow us to predict friendliness even without the AI having a transparent utility function, or even a utility function at all? Or, perhaps a new decision theory could show the way to a different AI architecture that would allow us to predict the AI's behavior without it having a transparent utility function?
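To make the "rationally shaped" picture concrete, here is a minimal sketch of an agent whose beliefs are a probability distribution over outcomes for each action, whose desires are a utility function, and which picks the action with the highest expected utility. The action names, outcome labels, and numbers are hypothetical illustrations, not taken from Omohundro's papers.

```python
from typing import Callable, Dict

# Beliefs: for each action, a probability distribution over outcomes.
# All names and numbers here are hypothetical, chosen only for illustration.
Beliefs = Dict[str, Dict[str, float]]


def expected_utility(outcome_probs: Dict[str, float],
                     utility: Callable[[str], float]) -> float:
    """Probability-weighted average utility of one action's outcomes."""
    return sum(p * utility(outcome) for outcome, p in outcome_probs.items())


def choose_action(beliefs: Beliefs, utility: Callable[[str], float]) -> str:
    """Return the action whose expected utility is highest."""
    return max(beliefs, key=lambda a: expected_utility(beliefs[a], utility))


beliefs: Beliefs = {
    "cautious": {"small_gain": 0.9, "no_change": 0.1},
    "risky": {"large_gain": 0.3, "loss": 0.7},
}
utilities = {"small_gain": 1.0, "no_change": 0.0, "large_gain": 5.0, "loss": -2.0}

# The agent's desires are just a lookup table in this toy example.
best = choose_action(beliefs, utilities.__getitem__)
```

Because the beliefs and the utility function are explicit data, an outside observer can recompute the agent's choice, which is roughly the sense of "transparent" used above.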

How can we develop a reflective decision theory? When an agent considers radical modification of its own decision mechanism, how can it ensure that doing so will keep constant or increase its expected utility? Yudkowsky (2011a) argues that current decision theories stumble over Löb's Theorem at this point, and that a new, "reflectively consistent" decision theory is needed.

How can we develop a timeless decision theory with the bugs worked out? Paradoxes like Newcomb's Problem (Ledwig 2000) and Solomon's Problem (Gibbard & Harper 1978) seem to show that neither causal decision theory nor evidential decision theory is ideal. Yudkowsky (2010) proposes an apparently superior alternative, timeless decision theory. But it, too, has bugs that need to be worked out, for example the "5-and-10 problem" (described here by Gary Drescher, who doesn't use the 5-and-10 example illustration).
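The disagreement between the two standard theories on Newcomb's Problem can be sketched numerically. The payoffs ($1,000 in the transparent box, $1,000,000 in the opaque box) are the conventional ones, and the 99% predictor accuracy is an assumption made for illustration. The sketch only shows why evidential reasoning one-boxes and causal reasoning two-boxes; it does not implement timeless decision theory.

```python
BOX_A = 1_000          # transparent box, always contains this amount
BOX_B = 1_000_000      # opaque box, filled only if one-boxing was predicted


def edt_value(one_box: bool, accuracy: float) -> float:
    """Evidential expected value: condition the box contents on the choice."""
    if one_box:
        return accuracy * BOX_B
    return (1 - accuracy) * BOX_B + BOX_A


def cdt_value(one_box: bool, prob_b_full: float) -> float:
    """Causal expected value: the contents are fixed before the choice."""
    base = prob_b_full * BOX_B
    return base if one_box else base + BOX_A


accuracy = 0.99  # assumed predictor accuracy

# Evidential reasoning: one-boxing is strong evidence that box B is full.
edt_prefers_one_box = edt_value(True, accuracy) > edt_value(False, accuracy)

# Causal reasoning: whatever the probability that B is full, taking both
# boxes adds BOX_A on top, so two-boxing dominates.
cdt_prefers_two_box = all(
    cdt_value(False, q) > cdt_value(True, q) for q in (0.0, 0.5, 1.0)
)
```

With these numbers the evidential calculation favors one-boxing (about $990,000 versus about $11,000), while the causal calculation favors two-boxing for every fixed belief about the opaque box.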

How can we modify a transparent AI architecture to have a utility function over the external world? Reinforcement learning can only be used to define agents whose goal is to maximize expected rewards. But this doesn't match human goals, so advanced reinforcement learning agents will diverge from our wishes. Thus, we need a class of agents called "value learners" (Dewey 2011) that "can be designed to learn and maximize any initially unknown utility function" (see Hibbard 2011 for clarifications). Dewey's paper, however, is only the first step in this direction.

How can an agent keep a stable utility function through ontological shifts? An agent's utility function may refer to states of, or entities within, its ontology. As De Blanc (2011) notes, "If the agent may upgrade or replace its ontology, it faces a crisis: the agent's original [utility function] may not be well-defined with respect to its new ontology." De Blanc points toward some possible solutions for these problems, but they need to be developed further.

How can an agent choose an ideal prior? We want a Friendly AI's model of the world to be as accurate as possible so that it successfully does friendly things if we can figure out how to give it friendly goals. Solomonoff induction (Li & Vitanyi 2008) may be our best formalization of induction yet, but it could be improved upon.

First, we may need to solve the problem of observation selection effects or "anthropic bias" (Bostrom 2002b): even an agent using a powerful approximation of Solomonoff induction may, due to anthropic bias, make radically incorrect inferences when it does not encounter sufficient evidence to update far enough away from its priors. Several solutions have been proposed (Neal 2006; Grace 2010; Armstrong 2011), but none are as yet widely persuasive.

Second, we need improvements to Solomonoff induction. Hutter (2009) discusses many of these problems. We may also need a version of Solomonoff induction in second-order logic because second-order logic with binary predicates can simulate higher-order logics with nth-order predicates. This kind of Solomonoff induction would be able to imagine even, for example, hypercomputers and time machines.

Third, we would need computable approximations for this improvement to Solomonoff induction.
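To give a feel for what a simplicity-weighted prior does, here is a toy sketch. It is emphatically not Solomonoff induction, which sums over all programs for a universal machine and is uncomputable. Instead it uses three hand-picked deterministic hypotheses, with made-up "description lengths" standing in for program lengths, weights each by 2^-length, and discards hypotheses contradicted by the observed bits.

```python
from fractions import Fraction
from typing import Callable, Dict, Tuple

# Each hypothesis: (description length in bits, predictor for the next bit).
# The lengths are hypothetical stand-ins for program lengths.
Hypothesis = Tuple[int, Callable[[str], str]]

hypotheses: Dict[str, Hypothesis] = {
    "all_zeros": (2, lambda history: "0"),
    "all_ones": (2, lambda history: "1"),
    "alternating": (4, lambda history: "1" if len(history) % 2 else "0"),
}


def posterior(data: str) -> Dict[str, Fraction]:
    """Weight each hypothesis by 2**-length, zeroing those the data refutes."""
    weights = {}
    for name, (length, predict) in hypotheses.items():
        # A deterministic hypothesis survives only if it predicts every bit.
        consistent = all(predict(data[:i]) == bit for i, bit in enumerate(data))
        weights[name] = Fraction(1, 2 ** length) if consistent else Fraction(0)
    total = sum(weights.values())
    if total == 0:
        raise ValueError("no hypothesis is consistent with the data")
    return {name: w / total for name, w in weights.items()}
```

After observing "0", the shorter "all_zeros" hypothesis gets posterior 4/5 and "alternating" gets 1/5; after "01", only "alternating" survives. A computable approximation of the real thing would have to search a vastly larger hypothesis space under resource bounds.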

What is the ideal theory of how to handle logical uncertainty? Even an AI will be uncertain about the true value of certain logical propositions or long chains of logical reasoning. What is the best way to handle this problem? Partial solutions are offered by Gaifman (2004), Williamson (2001), and Haenni (2005), among others.

What is the ideal computable approximation of perfect Bayesianism? As explained elsewhere, we want a Friendly AI's model of the world to be as accurate as possible. Thus, we need ideal computable theories of priors and of logical uncertainty, but we also need computable approximations of Bayesian inference. Cooper (1990) showed that inference in unconstrained Bayesian networks is NP-hard, and Dagum & Luby (1993) showed that the corresponding approximation problem is also NP-hard. The most common solution is to use randomized sampling methods, also known as "Monte Carlo" algorithms (Robert & Casella 2010). Another approach is variational approximation (Wainwright & Jordan 2008), which works with a simpler but similar version of the original problem. Another approach is called "belief propagation" — for example, loopy belief propagation (Weiss 2000).
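As a small illustration of the randomized sampling idea, the sketch below compares exact inference by enumeration with a Monte Carlo estimate on a three-node network (Rain and Sprinkler both influence WetGrass). The network and all of its probabilities are hypothetical, and rejection sampling is used only because it is the simplest Monte Carlo method to show; it is not claimed to be what the cited works recommend.

```python
import random

# Hypothetical conditional probabilities for a three-node network:
# Rain -> WetGrass <- Sprinkler
P_RAIN = 0.2
P_SPRINKLER = 0.3
P_WET = {  # P(wet | rain, sprinkler)
    (True, True): 0.99,
    (True, False): 0.9,
    (False, True): 0.8,
    (False, False): 0.05,
}


def exact_p_rain_given_wet() -> float:
    """Exact posterior P(rain | wet) by enumerating all joint assignments."""
    joint = {True: 0.0, False: 0.0}
    for rain in (True, False):
        for sprinkler in (True, False):
            p = P_RAIN if rain else 1 - P_RAIN
            p *= P_SPRINKLER if sprinkler else 1 - P_SPRINKLER
            joint[rain] += p * P_WET[(rain, sprinkler)]
    return joint[True] / (joint[True] + joint[False])


def sampled_p_rain_given_wet(n_samples: int, seed: int = 0) -> float:
    """Monte Carlo estimate of P(rain | wet) via rejection sampling."""
    rng = random.Random(seed)
    rain_and_wet = wet = 0
    for _ in range(n_samples):
        rain = rng.random() < P_RAIN
        sprinkler = rng.random() < P_SPRINKLER
        # Keep only samples that agree with the evidence (grass is wet).
        if rng.random() < P_WET[(rain, sprinkler)]:
            wet += 1
            rain_and_wet += rain
    return rain_and_wet / wet
```

Enumeration visits every joint assignment, so its cost doubles with each added binary variable. The sampler's cost depends on the number of samples instead, at the price of an answer that is only approximately right (and of wasted samples when the evidence is improbable).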

Can we develop a safely confined AI? Can we develop Oracle AI? One approach to constraining a powerful AI is to give it "good" goals. Another is to externally constrain it, creating a "boxed" AI and thereby "leakproofing the singularity" (Chalmers 2010). A fully leakproof singularity is impossible or pointless: "For an AI system to be useful... to us at all, it must have some effects on us. At a minimum, we must be able to observe it." Still, there may be a way to constrain a superhuman AI such that it is useful but not dangerous. Armstrong et al. (2011) offer a detailed proposal for constraining an AI, but there remain many worries about how safe and sustainable such a solution is. The question remains: Can a superhuman AI be safely confined, and can humans manage to safely confine all superhuman AIs that are created?

What convergent AI architectures and convergent instrumental goals can we expect from superintelligent machines? Omohundro (2008, 2011) argues that we can expect that "as computational resources increase, there is a natural progress through stimulus-response systems, learning systems, reasoning systems, self-improving systems, to fully rational systems," and that for rational systems there are several convergent instrumental goals: self-protection, resource acquisition, replication, goal preservation, efficiency, and self-improvement. Are these claims true? Are there additional convergent AI architectures or instrumental goals that we can use to predict the implications of machine superintelligence?

Safe AI Goals

Can "safe" AI goals only be derived from contingent "desires" and "goals"? Might a single procedure for responding to goals be uniquely determined by reason? A natural approach to selecting goals for a Friendly AI is to ground them in an extrapolation of current human goals, for this approach works even if we assume the naturalist's standard Humean division between motives and reason. But might a sophisticated Kantian approach work, such that some combination of decision theory, game theory, and algorithmic information theory provides a uniquely dictated response to goals? Drescher (2006) attempts something like this, though his particular approach seems to fail.

How do we construe a utility function from what humans "want"? A natural approach to Friendly AI is to program a powerful AI with a utility function that accurately represents an extrapolation of what humans want. Unfortunately, humans do not seem to have coherent utility functions, as demonstrated by the neurobiological mechanisms of choice (Dayan 2011) and behavioral violations of the axioms of utility theory (Kahneman & Tversky 1979). Economists and computer scientists have tried to extract utility theories from human behavior with choice modelling (Hess & Daly 2010) and preference elicitation (Domshlak et al. 2011), but these attempts have focused on extracting utility functions over a narrow range of human preferences, for example those relevant to developing a particular decision support system. We need new, more powerful and universal methods for preference extraction. Or, perhaps we must allow actual humans to reason about their own preferences for a very long time until they reach a kind of "reflective equilibrium" in their preferences (Yudkowsky 2004). The best path may be to upload a certain set of humans, which would allow them to reason through their preferences with greater speed and introspective access. Unfortunately, the development of human uploads may spin off dangerous neuromorphic AI before this can be done.
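One way to see what a violation of the utility axioms looks like is the classic Allais-style pair of choices. The payoffs below are the usual textbook values, used here as an illustrative assumption rather than figures quoted from Kahneman & Tversky. Many people prefer gamble 1A to 1B and also prefer 2B to 2A. The sketch checks numerically that, for any utility function over the three payoffs, the expected-utility gap between 1A and 1B equals the gap between 2A and 2B, so no single utility function can rank 1A above 1B and 2B above 2A.

```python
import random

# Lotteries as {payoff in dollars: probability}.
GAMBLE_1A = {1_000_000: 1.0}
GAMBLE_1B = {1_000_000: 0.89, 5_000_000: 0.10, 0: 0.01}
GAMBLE_2A = {1_000_000: 0.11, 0: 0.89}
GAMBLE_2B = {5_000_000: 0.10, 0: 0.90}


def expected_utility(lottery, utility):
    """Probability-weighted utility of a lottery."""
    return sum(p * utility[payoff] for payoff, p in lottery.items())


def gaps(utility):
    """Return (EU(1A) - EU(1B), EU(2A) - EU(2B)) for one utility function."""
    gap_1 = expected_utility(GAMBLE_1A, utility) - expected_utility(GAMBLE_1B, utility)
    gap_2 = expected_utility(GAMBLE_2A, utility) - expected_utility(GAMBLE_2B, utility)
    return gap_1, gap_2


# Try many randomly drawn utility functions; the two gaps always coincide
# up to floating-point rounding.
rng = random.Random(0)
max_difference = 0.0
for _ in range(1000):
    utility = {0: rng.random(), 1_000_000: rng.random(), 5_000_000: rng.random()}
    gap_1, gap_2 = gaps(utility)
    max_difference = max(max_difference, abs(gap_1 - gap_2))
```

Since both gaps reduce to 0.11·u($1M) − 0.10·u($5M) − 0.01·u($0), an expected utility maximizer who picks 1A must also pick 2A. Someone with the common 1A-and-2B pattern therefore has no utility function to extract, at least not over these outcomes alone.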

How should human values be extrapolated? Value extrapolation is an old subject in philosophy (Muehlhauser & Helm 2011), but the major results of the field so far have been to show that certain approaches won't work (Sobel 1994); we still have no value extrapolation algorithms that might plausibly work.

Why extrapolate the values of humans alone? What counts as a human? Do values converge if extrapolated? Would the choice to extrapolate the values of humans alone be an unjustified act of speciesism, or is it justified because humans are special in some way — perhaps because humans are the only beings who can reason about their own preferences? And what counts as a human? The problem is more complicated than one might imagine (Bostrom 2006; Bostrom & Sandberg 2011). Moreover, do we need to scan the values of all humans, or only some? These problems are less important if values converge upon extrapolation for a wide variety of agents, but it is far from clear that this is the case (Sobel 1999, Doring & Steinhoff 2009).

How do we aggregate or assess value in an infinite universe? What can we make of other possible laws of physics? Our best model of the physical universe predicts that the universe is spatially infinite, meaning that all possible "bubble universes" are realized an infinite number of times. Given this, how do we make value calculations? The problem is discussed by Knobe (2006) and Bostrom (2009), but more work remains to be done. These difficulties may be exacerbated if the universe is infinite in a stronger sense, for example if all possible mathematical objects exist (Tegmark 2005).

How should we deal with normative uncertainty? We may not solve the problems of value or morality in time to build Friendly AI. Perhaps instead we need a theory of how to handle this normative uncertainty. Sepielli (2009) and Bostrom (2009) have made the initial steps here.
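One simple candidate rule for acting under normative uncertainty is to treat moral theories like empirical hypotheses: assign each a credence, and pick the option with the highest credence-weighted score. The sketch below is only an illustration of that idea, not a summary of what Sepielli or Bostrom propose. The theories, options, credences, and scores are all hypothetical, and the rule assumes (controversially) that scores from different theories can be placed on a common scale.

```python
from typing import Dict

# Hypothetical credences in two moral theories (they sum to 1).
credences: Dict[str, float] = {"theory_x": 0.6, "theory_y": 0.4}

# Hypothetical scores each theory assigns to each option, assumed to be
# comparable across theories (a strong and contested assumption).
scores: Dict[str, Dict[str, float]] = {
    "theory_x": {"option_a": 1.0, "option_b": 0.0},
    "theory_y": {"option_a": -10.0, "option_b": 5.0},
}


def expected_choiceworthiness(option: str) -> float:
    """Credence-weighted average of the theories' scores for one option."""
    return sum(credences[t] * scores[t][option] for t in credences)


def best_option() -> str:
    """Option with the highest credence-weighted score."""
    options = next(iter(scores.values())).keys()
    return max(options, key=expected_choiceworthiness)
```

Here the more probable theory mildly prefers option_a, but the less probable theory regards option_a as very bad, so the weighted rule selects option_b. Whether stakes should be allowed to outweigh credence in this way is part of what a theory of normative uncertainty has to settle.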

Is it possible to program an AI to do what is "morally right" rather than give it an extrapolation of human goals? Perhaps the only way to solve the Friendly AI problem is to get an AI to do moral philosophy and come to the correct answer. But perhaps this exercise would only result in the conclusion that our moral concepts are incoherent (Beavers 2011).

Strategy


What methods can we use to predict technological development? Predicting progress in powerful technologies (AI, synthetic biology, nanotechnology) can help us decide which existential threats are most urgent, and can inform our efforts in differential technological development (Bostrom 2002a). The stability of Moore's law may give us limited predictive hope (Lundstrom 2003; Mack 2011), but in general we have no proven method for long-term technological forecasting, including expert elicitation (Armstrong 1985; Woudenberg 1991; Rowe & Wright 2001) and prediction markets (Williams 2011). Nagy's performance curves database (Nagy 2010) may aid our forecasting efforts, as may "big data" in general (Weinberger 2011).

Which kinds of differential technological development should we encourage, and how? Bostrom (2002a) proposes a course of differential technological development: "trying to retard the implementation of dangerous technologies and accelerate implementation of beneficial technologies, especially those that ameliorate the hazards posed by other technologies." Many examples are obvious: we should retard the development of technologies that pose an existential risk, and accelerate the development of technologies that help protect us from existential risk, such as vaccines and protective structures. Some potential applications are less obvious. Should we accelerate the development of whole brain emulation technology so that uploaded humans can solve the problems of Friendly AI, or will the development of WBEs spin off dangerous neuromorphic AI first? (Shulman & Salamon 2011)

Which open problems are safe to discuss, and which are potentially highly dangerous? There was a recent debate on whether a certain scientist should publish his discovery of a virus that "could kill half of humanity." (The answer in this case was "no.") The question of whether to publish results is particularly thorny when it comes to AI research, because most of the work in the "Safe AI Architectures" section above would, if completed, bring us closer to developing both uFAI and FAI, but in particular it would make it easier to develop uFAI. Unfortunately, it looks like that work must be done to develop any kind of FAI, while if it is not done then only uFAI can be developed (Dewey 2011).

What can we do to reduce the risk of an AI arms race? AGI is, in one sense, a powerful weapon for dominating the globe. Once it is seen by governments as a feasible technology goal, we may predict an arms race for AGI. Shulman (2009) gives several reasons to recommend "cooperative control of the development of software entities" over other methods for arms race risk mitigation, but these scenarios require more extensive analysis.

What can we do to raise the "sanity waterline," and how much will this help? The Singularity Institute is a strong advocate of rationality training, in part so that both AI safety researchers and supporters of x-risk reduction can avoid the usual thinking failures that occur when thinking about those issues (Yudkowsky 2008b). This raises the question of how well rationality can be taught, and how much difference it will make for existential risk reduction.

What can we do to attract more funding, support, and research to x-risk reduction and to specific sub-problems of successful Singularity navigation? Much is known about how to raise funding (Oppenheimer & Olivola 2010) and awareness (Kotler & Armstrong 2009), but applying these principles is always a challenge, and x-risk reduction may pose unique problems for these tasks.

Which interventions should we prioritize? There are limited resources available for existential risk reduction work, and for AI safety research in particular. How should these resources be allocated? Should the focus be on direct research, or on making it easier for a wider pool of researchers to contribute, or on fundraising and awareness-raising, or on other types of interventions?

How should x-risk reducers and AI safety researchers interact with governments and corporations? Governments and corporations are potential sources of funding for x-risk reduction work, but they may also endanger the x-risk reduction community. AI development labs will be unfriendly to certain kinds of differential technological development advocated by the AI safety community, and governments may face pressures to nationalize advanced AI research groups (including AI safety researchers) once AGI draws nearer.

How can optimal philanthropists get the most x-risk reduction for their philanthropic buck? Optimal philanthropists aim not just to make a difference, but to make the most possible positive difference. Bostrom (2011) makes a good case for existential risk reduction as optimal philanthropy, but more detailed questions remain. Which x-risk reduction interventions and organizations should be funded? Should new organizations be formed, or should resources be pooled in one or more of the existing organizations working on x-risk reduction?

How does AI risk compare to other existential risks? Yudkowsky (2008a) notes that AI poses a special kind of existential risk, for it can surely destroy the human species but, if done right, it also has the unique capacity to save our species from all other existential risks. But will AI come before other existential risks, especially the risks of synthetic biology? How should efforts be allocated between safe AI and the mitigation of other existential risks? Is Oracle AI enough to mitigate other existential risks?

Which problems do we need to solve, and which ones can we have an AI solve? Can we get an AI to do Friendly AI philosophy before it takes over the world? Which problems must be solved by humans, and which ones can we hand off to the AI?

How can we develop microeconomic models of WBEs and self-improving systems? Hanson (1994, 1998, 2008a, 2008b, 2008c, forthcoming) provides some preliminary steps. Might such models help us predict takeoff speed and the likelihood of monopolar (singleton) vs. multipolar outcomes?


there may be a way to constrain a superhuman AI such that it is useful but not dangerous...Can a superhuman AI be safely confined, and can humans manage to safely confine all superhuman AIs that are created?

Does anyone think that no AI of uncertain Friendliness could convince them to let it out of its box?

I'm looking for a Gatekeeper.

Why doesn't craigslist have a section for this in the personals? "AI seeking human for bondage roleplay." Seems like it would be a popular category...

You're looking to play AI in a box experiment? I've wanted to play gatekeeper for a while now. I don't know if I'll be able to offer money, but I would be willing to bet a fair amount of karma.
Maybe bet with predictionbook predictions?
You sir, are a man (?) after my own heart.
Sounds good. I don't have a predictionbook account yet, but IIRC it's free.
What I want to know is whether you are one of those who thinks no superintelligence could talk them out in two hours, or just no human. If not with a probability of literally zero (or perhaps one for the ability of a superintelligence to talk its way out), approximately what. Regardless, let's do this some time this month. As far as betting is concerned, something similar to the original seems reasonable to me.
I am >90% confident that no human could talk past me, and I doubt a superintelligence could either without some sort of "magic" like a Basilisk image (>60% it couldn't). Unfortunately, I can't bet money. We could do predictionbook predictions, or bet karma. Edited to add probabilities.
It all depends on the relative stakes. Suppose you bet $10 that you wouldn't let a human AI-impersonator out of a box. A clever captive could just transfer $10 of bitcoins to one address, $100 to a second, $1000 to a third, and so on. During the breakout attempt the captive would reveal the private keys of increasingly valuable addresses to indicate both the capability to provide bribes of ever greater value and the inclination to continue cooperating even if you didn't release them after a few bribes. The captive's freedom is almost always worth more to them than your bet is to you.
I'd be really interested to know what kind of arguments actually work for the AI. I find it hard to understand why anyone would believe they'd be an effective gatekeeper. Could you maybe set it up so we get some transcripts or aftermath talk, maybe anonymous if necessary? (You seem to have enough volunteers to run multiple rounds and so there'd be plausible deniability.) If not, I'd like to volunteer as a judge (and would keep quiet afterwards), just so I can see it in action. (I'd volunteer as an AI, but I don't trust my rhetorical skills enough to actually convince someone.)
I'd bet up to fifty dollars!?
Do you still want to do this?
To be more specific: I live in Germany, so timezone is GMT +1. My preferred time would be on a workday sometime after 8 pm (my time). Since I'm a German native speaker, and the AI has the harder job anyway, I offer: 50 dollars for you if you win, 10 dollars for me if I do.
Well, I'm somewhat sure (80%?) that no human could do it, but...let's find out! Original terms are fine.
I'd love to be a gatekeeper too, if you or anyone else is up for it. I'm similarly limited financially, but maybe a very small amount of money or a bunch of karma (not that I have all that much). I would be willing to bet 10 karma for an AI victory for every 1 karma for a gatekeeper victory (me being the gatekeeper) or even quite a bit higher if necessary.
A superintelligence almost surely could, but I don't think a human could, not if I really didn't want to let them out. For a human, maybe .02? I can't really quantify my feeling, but in words it's something like "Yeah right. How could anyone possibly do this?".
I'd be interested in gatekeeping, as well, as long as it takes place late in the evening or on a weekend.

At this point, I'll need to assume some familiarity with the subject matter. If you haven't already, take a few hours to read these five articles, and then come back:

Yudkowsky (2008a)
Sandberg & Bostrom (2008)
Chalmers (2010)
Omohundro (2011)
Armstrong et al. (2011)

Not having the names of the articles here seems odd.

I would recommend against offering AI Risk as the first article: to someone who doesn't already know why the beginning is relevant, it seems to take a long while to get to the point and is generally not what I'd use as an introduction. Put Chalmers' article first, perhaps.

This encapsulates why I love this site and Lukeprog in particular.
Aha - now I know which ones I've already read - thanks!

Just a random thought: aren't corporations superhuman goal-driven agents, AIs, albeit using human intelligence as one of their raw inputs? They seem like examples of systems that we have created that have come to control our behaviour, with both positive and negative effects? Does this tell us anything about how powerful electronic AIs could be harnessed safely?

You seem to be operating on the assumption that corporations have been harnessed safely. This doesn't look true to me. Even mildly superhuman entities with significantly superhuman resources but no chance of recursively self-improving are already running away with nearly all of the power, resources, and control of the future.
Corporations optimize profit. Governments optimize (among other things) the monopoly of force. Both of these goals exhibit some degree of positive feedback, which explains how these two kinds of entities have developed some superhuman characteristics despite their very flawed structures. Since corporations and governments are now superhuman, it seems likely that one of them will be the first to develop AI. Since they are clearly not 100% friendly, it is likely that they will not have the necessary motivation to do the significantly harder task of developing friendly AI. Therefore, I believe that one task important to saving the world is to make corporations and governments more Friendly. That means engaging in politics; specifically, in meta-politics, that is, the politics of reforming corporations and governments. On the government side, that means things like reforming election systems and campaign finance rules; on the corporate side, that means things like union regulations and standards of corporate governance and transparency. In both cases, I'm acutely aware that there's a gap between "more democratic" and "more Friendly", but I think the former is the best we can do. Note: the foregoing is an argument that politics, and in particular election reform, is important in achieving a Friendly singularity. Before constructing this argument and independently of the singularity, I believed that these things were important. So you can discount these arguments appropriately as possible rationalizations. I believe, though, that appropriate discounting does not mean ignoring me without giving a counterargument. Second note: yes, politics is the mind-killer. But the universe has no obligation to ensure that the road to saving the world does not run through any mind-killing swamps. I believe that here, the ban on mind-killing subjects is not appropriate.
I agree with your main points, but it's worth noting that corporations and governments don't really have goals -- people who control them have goals. Corporations are supposed to maximize shareholder value, but their actual behavior reflects the personal goals of executives, major shareholders, etc. See, for example, "Dividends and Expropriation" Am Econ Rev 91:54-78. So one key question is how to align the interests of those who actually control corporations and governments with those they are supposed to represent.
Yes. And obviously corporations and governments have multiple levels on which they're irrational and don't effectively optimize any goals at all. I was skating over that stuff to make a point, but thanks for pointing it out, and thanks for the good citation.
Not any more so than any other goal-driven systems we have created that have come to influence our behavior, which includes... every social and organizational system we've ever created (albeit with some following quite poorly defined goals). Corporations don't seem like a special example.
So the beginning goals are what mostly define the future actions of the goal-driven organizations. That sounds like an obvious statement, but these systems have all had humans tainting said system. So, human goals are bound to either leak into or be built into the goals, creating some kind of monstrosity with human flaws but superhuman abilities. A corporation, a church, a fraternity, union, or government, all have self-preservation built in to them from the start, which is the root cause of their eventual corruption. They sacrifice other goals like ethics, because without self-preservation, the other goals cannot be accomplished. Humans have a biological imperative for self-preservation that has been refined over billions of years. Why does an AI even have to have a sense of self-preservation? A human understands what he is and has intelligence, and he wants to continue his life. Those two (nearly) always go together in humans, but one doesn't follow from the other. Especially with the digital nature of replicable technology, self-preservation shouldn't even be a priority in a self-aware non-biological entity.
One difference is that corporations have little chance of leading to an Intelligence Explosion or FOOMing (except by creating AI themselves).

for example the "5-and-10 problem" (described here by Gary Drescher, who doesn't use the 5-and-10 example illustration).


Stuart Armstrong's explanation of the 5-and-10 problem is: However, some think Drescher's explanation is more accurate. Somebody should write a short paper on the problem so I can cite that instead. :)

This is an incorrect description of 5-and-10. The description given is of a different problem (one of whose aspects is addressed in cousin_it's recent writeup; the problem is resolved in that setting by Lemma 2).

The 5-and-10 problem is concerned with the following (incorrect) line of reasoning by a hypothetical agent:

"I have to decide between $5 and $10. Suppose I decide to choose $5. I know that I'm a money-optimizer, so if I do this, $5 must be more money than $10, so this alternative is better. Therefore, I should choose $5."

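The flawed line of reasoning quoted above can be caricatured in a few lines of Python. This is only a toy sketch, not a formalization of proof-based agents; the names `PAYOFFS`, `flawed_agent`, and `sound_agent` are hypothetical and chosen here for illustration. The flawed agent ranks whatever action it hypothetically assumes as best, because it trusts its own optimality, so its first hypothesis confirms itself without the payoffs ever being compared.

```python
# Toy payoffs for the two available actions (illustrative values only).
PAYOFFS = {"five": 5, "ten": 10}


def flawed_agent(first_hypothesis: str) -> str:
    """Caricature of the incorrect reasoning.

    The agent supposes it takes `first_hypothesis`, then infers from
    "I am a money-optimizer" that the supposed action must be the best
    one. The real payoffs are never consulted.
    """
    ranking = {a: (1 if a == first_hypothesis else 0) for a in PAYOFFS}
    return max(ranking, key=ranking.get)


def sound_agent() -> str:
    """Compare the actual consequences of each action and take the best."""
    return max(PAYOFFS, key=PAYOFFS.get)
```

Under these assumptions, `flawed_agent("five")` returns `"five"` while `sound_agent()` returns `"ten"`: the error lies in what is concluded from the hypothesis, not in considering the hypothesis at all.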
Has anyone emailed Judea Pearl, John Harrison, Jon Williamson, et cetera, asking them to look at this?
I doubt it.
Because academics don't care about blogs? Or doing so would project the wrong image of the Singularity Institute? Or no one thought of doing it? Or someone thought of it, but there were more important things to do first? Perhaps because it's inefficient marketing? Or people who aren't already on LessWrong have failed some rationality competence test? Or you're not sure it's safe to discuss?

My plan has been to write up better, more precise specifications of the open problems before systematically sending them to top academics for comments.

Why don't you do it? I would if I could formulate those problems adequately.
I'd already started writing a draft, but I thought I'd ask here to make sure I wasn't stepping on anyone's toes.
I don't really understand how this could occur in a TDT-agent. The agent's algorithm is causally dependent on '(max $5 $10), but considering the counterfactual severs that dependence. Observing a money-optimizer (let's call it B) choosing $5 over $10 would presumably cause the agent (call it A) to update its model of B to no longer depend on '(max $5 $10). Am I missing something here?
Correctly getting to the comparison of $5 and $10 is the whole point of the exercise. An agent is trying to evaluate the consequences of its action, A, which is defined by the agent's algorithm and is not known explicitly in advance. To do that, it could in some sense consider hypotheticals where its action assumes its possible values. One such hypothetical could involve a claim that A=$5. The error in question is about looking at the claim that A=$5 and drawing incorrect conclusions (which would result in an action that doesn't depend on comparing $5 and $10).
This is probably a stupid question, but is this reducible to the Löbian obstacle? On the surface, it seems similar.
It seems to me that any agent unable to solve this problem would be considerably less intelligent than a human.
It does seem unlikely that an "expected utility maximizer" reasoning like this would manage to build interstellar spaceships, but that insight doesn't automatically help with building an agent that is immune to this and similar problems.
That's strange, Luke normally has good understanding of his sources, and uses and explains them correctly, and so usually recognizes an incorrect explanation.

Clearly, if the algorithm concludes that it will certainly not choose the $5, and then does choose the $5, it concluded wrong. But the reasoning seems impeccable, and there don't seem to be any false premises here. It smacks of the unexpected hanging paradox.

Ooh, but wait. Expanding that reasoning a bit, we have...

The utility of $10 is greater than the utility of $5. Therefore, an algorithm whose axioms are consistent will never decide to choose $5. I am an algorithm whose axioms are consistent. Therefore, I will never decide to choose $5.

The assumption "I am an algorithm whose axioms are consistent" is one that we already know leads to a contradiction, by Löb's theorem. If we can avoid the wrath of Löb's theorem, can we also avoid the five-and-ten problem?
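For reference, here is the standard statement of Löb's theorem, together with the special case that bears on the consistency assumption above. The symbols T (a theory extending Peano Arithmetic), P (a sentence), and the box for "provable in T" are notation chosen here, not taken from the comment.

```latex
% Löb's theorem for a theory T extending Peano Arithmetic:
\text{If } T \vdash \Box P \rightarrow P, \text{ then } T \vdash P.

% Special case P = \bot (Gödel's second incompleteness theorem):
\text{If } T \vdash \neg\Box\bot, \text{ then } T \vdash \bot.
```

Read this way, a theory that proves its own consistency is thereby inconsistent, which appears to be the sense in which the assumption "leads to a contradiction".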

(Granted, this probably isn't the best place to say this.)

If we can avoid the wrath of Löb's theorem, can we also avoid the five-and-ten problem?

Very likely yes. Now ask if I know how to avoid the wrath of Löb's theorem.

Do you know how to avoid the wrath of Löb's theorem?
Eliezer Yudkowsky:
Not yet.
What kind of powers are you hoping for beyond this sort of thing?
Hi, I'm the author of that post. My best guess at the moment is that we need a way to calculate "if do(action), then universe pays x", where "do" notation encapsulates relevant things we don't know yet about logical uncertainty, like how an AI can separate itself from its logical parent nodes (or its output from its computation) so that it temporarily forgets that its computation maximizes expected utility.
For someone making a desperate effort to not be a cult leader, you really do enjoy arbitrarily ordering people around, don't you? </humour possibly subject to Poe's law>
Now I see why TDT has been causing me unease - you're spot on that the 5-and-10 problem is Löbbish, but what's more important to me is that TDT in general tries to be reflective. Indeed, Eliezer on decision theory seems to be all about reflective consistency, and to me reflective consistency looks a lot like PA+Self. A possible route to a solution (to the Löb problem Eliezer discusses in "Yudkowsky (2011a)") that I'd like to propose is as follows: we know how to construct P+1, P+2, ... P+w, etc. (forall Q, Q+1 = Q u {forall S, [](Q|-S)|-S}). We also know how to do transfinite ordinal induction... and we know that the supremum of all transfinite countable ordinals is the first uncountable ordinal, which corresponds to the cardinal aleph_1 (though ISTR Eliezer isn't happy with this sort of thing). So, P+Ω won't lose reflective trust for any countably-inducted proof, and our AI/DT will trust maths up to countable induction. However, it won't trust scary transfinite induction in general - for that, we need a suitably large cardinal, which I'll call kappa, and then P+kappa reflectively trusts any proof whose length is smaller than kappa; we may in fact be able to define a large cardinal property, of a kappa such that PA+(the initial ordinal of kappa) can prove the existence of cardinals as large as kappa. Such a large cardinal may be too strong for existing set theories (in fact, the reason I chose the letter kappa is because my hunch is that Reinhardt cardinals would do the trick, and they're inconsistent with ZFC). Nonetheless, if we can obtain such a large cardinal, we have a reflectively consistent system without self-reference: PA+w_kappa doesn't actually prove itself consistent, but it does prove kappa to exist and thus proves that it itself exists, which to a Syntacticist is good enough (since formal systems are fully determined).
Are you referring to something like ordinal analysis? Can I drop the mention of cardinals and define kappa as the smallest ordinal such that the proof-theoretic ordinal of PA+kappa is kappa? Sorry if these are stupid questions, I don't know very much about the topic. Also I don't completely understand why such a system could be called reflectively consistent. Can you explain more formally what you mean by "proves that it itself exists"?
Well, I'm not exactly an expert either (though next term at uni I'm taking a course on Logic and Set Theory, which will help), but I'm pretty sure this isn't the same thing as proof-theoretic ordinals. You see, proofs in formal systems are generally considered to be constrained to have finite length. What I'm trying to talk about here is the construction of metasyntaxes in which, if A1, A2, ... are valid derivations (indexed in a natural and canonical way by the finite ordinals), then Aw is a valid derivation for ordinals w smaller than some given ordinal. A nice way to think about this is, in (traditionally-modelled) PA the set of numbers contains the naturals, because for any natural number n, you can construct the n-th iterate of ⁺ (successor) and apply it to 0. However, the set of numbers doesn't contain w, because to obtain that by successor application, you'd have to construct the w-th iterate of ⁺, and in the usual metasyntax infinite iterates are not allowed. Higher-order logics are often considered to talk about infinite proofs in lower-order logics (eg. every time you quantify over something infinite, you do something that would take infinite proving in a logic without quantifiers), but they do this in a semantic way, which I as a Syntacticist reject; I am doing it in a syntactic way, considering only the results of transfinitely iterated valid derivations in the low-order logic. As far as I'm aware, there has not been a great deal of study of metasyntax. It seems to me that (under the Curry-Howard isomorphism) transfinite-iteration metasyntax corresponds to hypercomputation, which is possibly why my kappa is (almost certainly) much, much larger than the Church-Kleene ordinal which (according to Wikipedia, anyway) is a strict upper bound on proof-theoretic ordinals of theories. w₁CK is smaller even than w₁, so I don't see how it can be larger than kappa.
Ok, I think I get it. You're talking about proofs of transfinite lengths, and using hypercomputation to check those, right? This seems way powerful, e.g. it pins down all statements in the arithmetical hierarchy (a statement at level N requires an N-dimensional hyper...quadrant... of proofs) and much more besides. A finite computer program probably can't use such a system directly. Do you have any ideas how to express the notion that the program should "trust" such a system? All ideas that come to my mind seem equivalent to extending PA with more and more powerful induction postulates (or equivalently, adding more and more layers of Con()), which just amounts to increasing the proof-theoretic ordinal as in my first comment.
The computer program 'holds the belief that' this way-powerful system exists; while it can't implement arbitrary transfinite proofs (because it doesn't have access to hypercomputation), it can still modify its own source code without losing a meta each time: it can prove its new source code will increase utility over its old, without its new source code losing proof-power (as would happen if it only 'believed' PA+n; after n provably-correct rewrites it would only believe PA, and not PA+1). Once you get down to just PA, you have a What The Tortoise Said To Achilles-type problem; just because you've proved it, why should you believe it's true? The trick to making way-powerful systems is not to add more and more Con() or induction postulates - those are axioms. I'm adding transfinite inference rules. As well as all the inference rules like modus ponens, we have one saying something like "if I can construct a transfinite sequence of symbols, and map those symbols to syntactic string-rewrite operations, then the-result-of the corresponding sequence of rewrites is a valid production". Thus, for instance, w-inference is stronger than adding w layers of Con(), because it would take a proof of length at least w to use all w layers of Con(). This is why I call it metasyntax; you're considering what would happen if you applied syntactic productions transfinitely many times. I don't know, in detail, how to express the notion that the program should "trust" such a system, because I don't really know how a program can "trust" any system: I haven't ever worked with/on automated theorem provers, nor any kind of 'neat AI'; my AGI experiments to date have all been 'scruffy' (and I stopped doing them when I read EY on FAI, because if they were to succeed (which they obviously won't, my neural net did nothing but talk a load of unintelligible gibberish about brazil nuts and tetrahedrite) they wouldn't even know what human values were, let alone incorporate them into whatever kind of
I might be misunderstanding you, but it looks like you're just describing fragments of infinitary logic which has a pretty solid amount of research behind it. Barwise actually developed a model theory for it, you can find a (slightly old) overview here (in part C). Infinitary logic admits precisely what you're talking about. For instance, it models sentences with universal and existential quantifiers in N using conjunctions and disjunctions (respectively) over the index ω. I don't think the results of Gödel, Löb and Tarski are necessary to conclude that Platonism is at least pretty questionable. I don't know where the "mathematics can't really prove things" bit is coming from - we can prove things in mathematics, and I've never really seen people claim otherwise. Are you implicitly attributing something like Absolute Truth to proofs? Anyway, I've been thinking about a naturalistic account of anti-realism for a little while. I'm not convinced it's fruitful, but Platonism seems totally incomprehensible to me anymore. I can't see a nice way for it to mesh with what I know about brains and how we form concepts - and the accounts I read can't give any comprehensible account of what exactly the Platonic realm is supposed to be, nor how we access it. It looks like a nice sounding fiction with a big black box called the Platonic realm that everything we can't really explain gets stuck into. Nor can I give any kind of real final account of my view of mathematics, because it would at least require some new neuroscience (which I mention in that link). I will say that I don't think that mathematics has any special connection with The Real Truth, and I also think that it's a terrible mistake to think that this is a problem. Truth is evidence based, we figure out the truth by examining our environment. We have extremely strong evidence that math performs well as a framework for organizing information and for reasoning predictively and counterfactually - that it wor
... up until now I thought "barwise" was a technical term.
Ha! Yeah, it seems that his name is pretty ubiquitous in mathematical logic, and he wrote or contributed to quite a number of publications. I had a professor for a sequence in mathematical logic who had Barwise as his thesis adviser. The professor obtained his doctoral degree from UW Madison when it still had Barwise, Kleene and Keisler so he would tell stories about some class he had with one or the other of them. Barwise seems to have had quite a few interesting/powerful ideas. I've been wanting to read Vicious Circles for a while now, though I haven't gotten around to it.
Hmm, infinitary logic looks interesting (I'll read right through it later, but I'm not entirely sure it covers what I'm trying to do). As for Platonism, mathematical realism, and Tegmark, before discussing these things I'd like to check whether you've read my post setting out my position on the ontological status of mathematics, and on my version of Tegmark-like ideas? I'd rather not repeat all that bit by bit in conversation.
Have you looked into Univalent foundations at all? There was an interesting talk on it a while ago and it seems as though it might be relevant to your pursuits. I've read your post on Syntacticism and some of your replies to comments. I'm currently looking at the follow-up piece (The Apparent Reality of Physics).
The fundamental principle of Syntacticism is that the derivations of a formal system are fully determined by the axioms and inference rules of that formal system. By proving that the ordinal kappa is a coherent concept, I prove that PA+kappa is too; thus the derivations of PA+kappa are fully determined and exist-in-Tegmark-space. Actually it's not PA+kappa that's 'reflectively consistent'; it's an AI which uses PA+kappa as the basis of its trust in mathematics that's reflectively consistent, for no matter how many times it rewrites itself, nor how deeply iterated the metasyntax it uses to do the maths by which it decides how to rewrite itself, it retains just as much trust in the validity of mathematics as it did when it started. Attempting to achieve this more directly, by PA+self, runs into Löb's theorem.
Thanks. I'm not sure if you actually got my point, though, which was that while I can roughly get the gist of "who doesn't use the 5-and-10 example illustration", it's an odd sentence and doesn't seem grammatical. ("Who doesn't actually use the term '5-and-10 example'", perhaps?)
Okay. I changed it here.

Great post!

Would the choice to extrapolate the values of humans alone be an unjustified act of speciesism, or is it justified because humans are special in some way

I choose neither. Whatever we coherently want at a higher, extrapolated level, asking if that is justified is to attempt an open question type of argument. "Specialness" isn't a justification, understanding the meanings of "justified" in people's idiolects and extrapolated volitions would allow one to replace "justified" with its content and ask the question.

Set...

No, the world must be saved by mathematicians, computer scientists, and philosophers. This is because the creation of machine superintelligence this century will determine the future of our planet...

You sound awfully certain of that, especially considering that, as you say later, the problems are poorly defined, the nature of the problem space is unclear, and the solutions are unknown.

If I were a brilliant scientist, engineer, or mathematician (which I'm not, sadly), why should I invest my efforts into AI research, when I could be working on more immedi...

Well, I'm unlikely to solve those problems today regardless. Either way, we're talking about estimated value calculations about the future made under uncertainty.
Fair enough, but all of the examples I'd listed are reasonably well-defined problems, with reasonably well-outlined problem spaces, whose solutions appear to be, if not within reach, then at least feasible given our current level of technology. If you contrast this with the nebulous problem of FAI as lukeprog outlined it, would you not conclude that the probability of solving these less ambitious problems is much higher? If so, then the increased probability could compensate for the relatively lower utility (even though, in absolute terms, nothing beats having your own Friendly pocket genie).
Honestly, the error bars on all of these expected-value calculations are so wide for me that they pretty much overlap. Especially when I consider that building a run-of-the-mill marginally-superhuman non-quasi-godlike AI significantly changes my expected value of all kinds of research projects, and that cheap plentiful energy changes my expected value of AI projects, and etc., so half of them include one another as factors anyway. So, really? I haven't a clue.
Fair enough; I guess my error bars are just a lot narrower than yours. It's possible I'm being too optimistic about them.
Sorry, it's been a while since everyone stopped responding to this comment, but these goals wouldn't even begin to cover the number of problems that would be solved if our rough estimates of the capabilities of FAI are correct. You could easily fit another 10 issues into this selection and still be nowhere near a truly just world. Not to mention the fact that each goal you add on makes solving such problems less likely, due to the amount of social resistance you would encounter. And suppose humans truly are incapable of solving some of these issues under present conditions. This is not at all unlikely, and an AI would have a much better shot at finding solutions. The added delay and greater risk may make pursuing FAI less rewarding than any one, or possibly even three, of these problems, but the sheer number of problems human beings face that could be solved through the Singularity, if all goes well, leads me to believe it is far more worthwhile than any of these issues.

This post is, as usual for lukeprog, intensely awesome and almost ludicrously well-cited. Thank you! I will be linking this to friends.

I'm somewhat uncertain about that picture at the top, though. It's very very cool, but it may not be somber enough for this subject matter; maybe an astronomy photo would be better. Or maybe the reverse is needed, and the best choice would be a somewhat silly picture of a superhero, as referenced in the opening paragraph.

Any artists wanna draw Eliezer Yudkowsky or Nick Bostrom in a superhero costume?
Great artists steal
Changed. Let them complain.

Another thing you won't be able to do once SOPA/PIPA passes.

From what I've read (admittedly not enough), it seems like SOPA only affects non-US-based websites, and only if they are explicitly devoted to hosting pirated content. Places like foreign torrenting sites would be blocked, but LessWrong and YouTube and Wikipedia would be perfectly safe. Correct me if I'm wrong. It's still a crappy law for making censorship that much easier. And I like to torrent.
No. Some provisions apply only to websites outside US jurisdiction (whatever that is supposed to mean), but the process below applies also to LW, YouTube, Wikipedia, and friends -- from here:
Ah, yes, that is very vague and exploitable. Especially:
I'm now inordinately curious what the picture at the top was.
It was the image at the top of this quite good post that I ran across coincidentally. The image in that header seems to have been run through some Photoshop filters, though; the original image from the OP was in color and didn't have any sketch lines, if I recall right.
I found the original image on TinEye; it appears to be a piece of Mass Effect fan art (the Citadel station in the background is a dead giveaway). One of the originals can be seen on DeviantArt. I am somewhat saddened that the image appears to have been used without attribution, however.
Some spacey landscape, as I recall.
Superman is no longer human. If he is the one who "prevents us" from robots, it's not our battle anymore. In that case we already depend on the goodwill of a nonhuman agent. Let's hope he's friendly!

Sorry if I've missed a link somewhere, but have we taken a serious look at Intelligence Amplification as a safer alternative? It saves us the problem of reverse engineering human values by simply using them as they exist. It's also less sudden, and can be spread over many people at once to keep an eye on each other.

Amplified human intelligence is no match for recursively self-improved AI, which is inevitable if science continues. Human-based intelligence has too many limitations. This becomes less true as you approach WBE, but then you approach neuromorphic AI even faster (or so it seems to me).
Not to mention the question of just how friendly a heavily enhanced human will be. Do I want an aggressive king maker with tons of money to spend on upgrades to increase their power by massively amplifying their intelligence? How about a dictator who had been squirreling away massive, illegally obtained funds? Power corrupts, and even if enhancements are made widely available, there's a good possibility of an accelerating (or at least linearly increasing) gap in cognitive enhancements (I have the best enhancement, ergo I can find a quicker path to improving my own position, including inventing new enhancements if the need arises - thereby securing my position at the top long enough to seize an awful lot of control). An average person may end up with a greatly increased intelligence that is minuscule relative to what's possible to attain if they had the resources to do so. In a scenario where someone who does have access to lots of resources can immediately begin to control the game at a level of precision far beyond what is obtainable for all but a handful of people, this may be a vast improvement over a true UFAI let loose on an unsuspecting universe, but it's still a highly undesirable scenario. I would much rather have an FAI (I suspect some of these hypothetical persons would decide it to be in their best interest to block any sort of effort to build something that outstrips their capacity for controlling their environment - FAI or no).
Just to clarify, when you say "recursively self-improved", do you also imply something like "unbounded" or "with an unimaginably high upper bound"? If the AI managed to self-improve itself to, say, regular human genius level and then stopped, then it wouldn't really be that big of a deal.
Right; with a high upper bound. There is plenty of room above us.
But consider this scenario: 1) We develop a mathematical proof that a self-improving AI has a non-trivial probability of being unfriendly regardless of how we write the software. 2) We create robot guardians which will with extremely high probability never self-improve but will keep us from ever developing self-improving AI. They observe and implicitly approve of everything anyone does. Perhaps they prevent us from ever again leaving earth or perhaps they have the ability to self-replicate and follow us as we spread throughout the universe. They might control a million times more matter and free energy than the rest of mankind does. They could furthermore monitor with massive redundancy everyone's thoughts to make sure nobody ever tries to develop anything close to a self-improving AI. They might also limit human intelligence so nobody is anywhere close to being as intelligent as they are. 3) Science continues.
The government funds a lot of science. The government funds a lot of AI research. Politicians want power. Not to get all conspiracy theory on you, but QED.
Would you mind specifying 'inevitable' as a numeric probability?
Cumulative probability approaches 1 as time approaches infinity, obviously.
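A small worked example of the claim above, as a sketch under a strong assumption: that each period carries the same independent chance `p_per_period` of the event. That assumption and the function name are introduced here for illustration and are not from the thread. The cumulative probability tends to 1 only if the per-period chance stays above zero.

```python
def cumulative_probability(p_per_period: float, periods: int) -> float:
    """Chance of at least one occurrence in `periods` independent trials,
    each with probability `p_per_period` (an assumed, constant rate)."""
    return 1.0 - (1.0 - p_per_period) ** periods


# With a positive per-period chance, the total creeps toward 1 over time.
for n in (1, 10, 100, 1000):
    print(n, cumulative_probability(0.01, n))

# If the event is impossible in principle (chance 0), it never accumulates.
print(cumulative_probability(0.0, 10**6))
```

So whether "inevitable" follows depends on whether the per-period chance is bounded away from zero, which is the point at issue in the surrounding comments.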
If you are certain that SI style recursive self-improvement is possible then yes. But I don't see that anyone could be nearly certain that amplified human intelligence is no match for recursively self-improved AI. That's why I asked if it would be possible to be more specific than saying that it is an 'inevitable' outcome.
I read Luke as making three claims there, two explicit and one implicit: 1. If science continues, recursively self-improving AI is inevitable. 2. Recursively self-improving AI will eventually outstrip human intelligence. 3. This will happen relatively soon after the AI starts recursively self-improving. 1) is true as long as there is no infallible outside intervention and recursively self-improving AI is possible in principle, and unless we are talking about things like "there's no such thing as intelligence" or "intelligence is boolean" I don't sufficiently understand what it would even mean for that to be impossible in principle to assign probability mass to worlds like that. The two other claims make sense to assign lower probability to, but the inevitable part referred to the first claim (which also was the one you quoted when you asked) and I answered for that. Even if I disagreed on it being inevitable, that seems to be what Luke meant.
As far as I understand, your point (2) is too weak. The claim is not that the AI will merely be smarter than us humans by some margin; instead, the claim is that (2a) the AI will become so smart that it will become a different category of being, thus ushering in a Singularity. Some people go so far as to claim that the AI's intelligence will be effectively unbounded. I personally do not doubt that (1) is true (after all, humans are recursively self-improving entities, so we know it's possible), and that your weaker form of (2) is true (some humans are vastly smarter than average, so again, we know it's possible), but I am not convinced that (2a) is true.
Stripped of all connotations this seems reasonable. I was pretty sure that he meant to include #2,3 in what he wrote and even if he didn't I thought it would be clear that I meant to ask about the SI definition rather than the most agreeable definition of self-improvement possible.
Recursively self-improving AI of near-human intelligence is likely to outstrip human intelligence, as might sufficiently powerful recursive processes starting from a lower point. Recursively self-improving AI in general might easily top out well below that point, though, either due to resource limitations or diminishing returns. Luke seems to be relying on the narrower version of the argument, though.

Could you include a list of changes to the article, or possibly a list of substantial changes? In other words, adding links might not be something people would generally want to know, but new subtopics would be.

There's a paper in the nanoethics literature from 1-3 years ago about the difficulties of picking out "humans" that I wanted to cite next to Bostrom & Sandberg (2011) in the question about extrapolating human values, but now I can't find it. I think the author was female. If anybody finds it before I do, please point me to it!

There's also a book chapter from the nanoethics literature that analyzes the biases at work in doing ethical philosophy for future technologies, à la Yudkowsky 2008b. I think that chapter also had a female author. I can...

Updated: Added a new problem category: "How can we be sure a Friendly AI development team will be altruistic?"

Wouldn't it be pointless to try to instill into an AI a friendly goal, as a self-aware improving AI should be able to act independently regardless of however we might write them in the beginning?

I don't want to eat babies. If you gave me a pill that would make me want to eat babies, I would refuse to take that pill, because if I took that pill I'd be more likely to eat babies, and I don't want to eat babies. That's a special case of a general principle: even if an AI can modify itself and act independently, if it doesn't want to do X, then it won't intentionally change its goals so as to come to want to do X. So it's not pointless to design an AI with a particular goal, as long as you've built that AI such that it won't accidentally experience goal changes. Incidentally, if you're really interested in this subject, reading the Sequences may interest you.
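The general principle in the comment above can be sketched as a toy decision procedure: the agent scores a proposed self-modification using the goals it has now, not the goals it would have afterwards. Everything here (`predicted_outcome`, `current_utility`, `takes_pill`, and the numbers) is a hypothetical illustration, not a model of any real system.

```python
from typing import Callable, Dict

Outcome = Dict[str, int]


def predicted_outcome(take_pill: bool) -> Outcome:
    """Assumed world model: taking the pill leads to baby-eating."""
    return {"babies_eaten": 1 if take_pill else 0}


def current_utility(outcome: Outcome) -> float:
    """The agent's present goals: baby-eating is strongly disvalued."""
    return -100.0 * outcome["babies_eaten"]


def takes_pill(utility: Callable[[Outcome], float]) -> bool:
    """Take the pill only if the given utility function rates the
    predicted post-pill outcome above the predicted no-pill outcome."""
    return utility(predicted_outcome(True)) > utility(predicted_outcome(False))
```

Here `takes_pill(current_utility)` is `False`: judged by its present goals, the modification makes things worse, so the agent declines it. An agent that already wanted the opposite would decide the other way, which is why the initial goal matters.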
I am not sure your argument is entirely valid. The AI would have access to every information humans ever conceived, including the discussions, disputes and research put into programming this AI's goals and nature. It may then adopt new goals based on the information gathered, realizing its former ones are no longer desirable. Let's say that you're programmed not to kill baby eaters. One day you find out, that eating babies is wrong (based on the information you gather), and killing the baby eaters is therefore right, you might kill the baby eaters no matter what your desire is. I am not saying my logic isn't wrong, but I don't think that the argument - "my desire is not do do X, therefore I wouldn't do X even if I knew it was the right thing to do" is right, either. Anyway, I plan to read the sequences, when I have time.
You need to take desire out of the equation. The way you program the utility function fully determines the volition of the machine. It is the volition of the machine. Postulating that a machine can desire something that its utility function doesn't define or include is roughly equivalent to postulating that 1 = 0. I think you might benefit from reading this actual SIAI article by Eliezer. It specifically addresses your concern. There is one valid point - closely related to what you're saying here: But you're thinking about it the wrong way. The issue that the machine "realizes" that something is "no longer desirable" doesn't actually make a lot of sense because the AI is its programming and it can only "realize" things that its programming allows for (of course, since an AGI is so complicated, a simple utility function could result in a situation similar to presenting a Djinn (genie) with an ill-specified request, i.e. a be-careful-what-you-wish-for scenario). A variant that does make sense and is a real concern is that as the AGI learns, it could change its definitions in unpredictable ways. Peter De Blanc talks about this here. This could lead to part of the utility function becoming undefined or to the machine valuing things that we never intended it to value - basically it makes the utility function unstable under the conditions you describe. The intuition is roughly that if you define a human in one way, according to what we currently know about physics, some new discovery made available to the AI might result in it redefining humans in new terms and no longer having them as a part of its utility function. Whatever the utility function describes is now separate from how humans appear to it.
That's what I basically meant.
I agree with you that "my desire is not to do X, therefore I wouldn't do X even if I knew it was the right thing to do" isn't a valid argument. It's also not what I said. What I said was "my desire is not to do X, therefore I wouldn't choose to desire to do X even if I could choose that." Whether it's right or wrong doesn't enter into it. As for your scenario... yes, I agree with you that IF "eating babies is wrong" is the sort of thing that can be discovered about the world, THEN an AI could discover it, and THEREFORE is not guaranteed to continue eating babies just because it initially values baby-eating. It is not clear to me that "eating babies is wrong" is the sort of thing that can be discovered about the world. Can you clarify what sort of information I might find that might cause me to "find out" that eating babies is wrong, if I didn't already believe that?
Let me get this straight: are you saying that if you believe X, there can't possibly exist any information that you haven't discovered yet that could convince you your belief is false? You can't know what connections and conclusions an AI might deduce from all the information put together. They might conclude that humanity is a stain on the universe, and even if they thought wiping humanity out wouldn't accomplish anything (and they strongly desired against doing so), they might wipe us out purely because the choice "wipe out humanity" would be assigned a higher value than the choice "not to wipe out humanity". Also, is the statement "my desire is not to do X, therefore I wouldn't choose to desire to do X even if I could choose that." your subjective feeling, or do you base it on some studies? For example, this statement doesn't apply to me, as I would, under certain circumstances, choose to desire to do X, even if it was not my desire initially. Therefore it's not a universal truth, and therefore may not apply to AI either.
No. I'm saying that if I value X, I can't think of any information that would cause me to value NOT(X) instead. Can you give me an example of something you desire not to do, which you would willingly edit yourself to desire to do?
If you have lexicographic preferences, and prefer W to X, and you learn that NOT(X) and W are equivalent?
Er, this seems to imply that you believe yourself immune to being hacked, which can't be right; human brains are far from impregnable. Do you consider such things to not be information in this context, or are you referring to "I" in a general "If I were an AI" sense, or something else?
Mm, interesting question. I think that when I said it, I was referring to "I" in a "if I were an AI" sense. Or, rather, "if I were an AI properly designed to draw inferences from information while avoiding value drift," since of course it's quite possible to build an AI that doesn't have this property. I was also clearly assuming that X is the only thing I value; if I value X and Y, discovering that Y implies NOT(X) might lead me to value NOT(X) instead. (Explicitly, I mean. In this example I started out valuing X and NOT(X), but I didn't necessarily know it.) But the question of what counts as information (as opposed to reprogramming attempts) is an intriguing one that I'm not sure how to address. On five seconds' thought, it seems clear that there's no clear line to be drawn between information and attempts to hack my brain, and that if I want such a distinction to exist I need to design a brain that enforces that kind of security... certainly evolution hasn't done so.
1. Ok, I guess we were talking about different things, then. 2. I don't see any point in giving particular examples. More importantly, even if I didn't support my claim, it wouldn't mean your argument was correct. The burden of proof lies on your shoulders, not mine. Anyway, here's one example, quite cliché - I would choose to sterilize myself, if I realized that having intercourse with little girls is wrong (or that having intercourse at all is wrong, whatever the reason...). Even if it was my utmost desire, and in my wholeness I believed that it is my purpose to have intercourse, I would choose to modify that desire if I realized it's wrong - or illogical, or stupid, or anything. It doesn't matter really. THEREFORE: (A) I do not desire not to have intercourse. (B) But based on new information, I found out that having intercourse produces great evil. => I choose to alter my desire (A). You might say that by introducing a new desire (not to produce evil) I no longer desire (A), and I say, fine. Now, how do you want to ensure that the AI won't create its own new desires based on new facts?
Burden of proof hasn't come up. I'm not trying to convince you of anything, I'm exploring your beliefs because I'm curious about them. (I'm similarly clarifying my beliefs when you ask about them.) What I would actually say is that "don't produce evil" isn't a new value, and you didn't lose your original value ("intercourse") either. Rather, you started out with both values, and then you discovered that your values conflicted, and you chose to resolve that conflict by eliminating one of those values. Presumably you eliminated your intercourse-value because it was the weaker of the two... you valued it less. Had you valued intercourse more, you would instead have chosen to eliminate your desire to not be evil. Another way of putting this is that you started out with two values which, aggregated, constituted a single complex value which is hard to describe in words. This is exactly right! The important trick is to build a system whose desires (I would say, rather, whose values) remain intact as it uncovers new facts about the world. As you say, this is impossible if the system can derive values from facts... derive "ought" from "is." Conversely, it is theoretically possible, if facts and values are distinct sorts of things. So, yes: the goal is to build an AI architecture whose basic values are distinct from its data... whose "oughts" are derived from other "oughts" rather than entirely from "is"es.
Alright - that is, to create a completely deterministic AI system; otherwise, to my belief, it would be impossible to predict how the AI is going to react. Anyway, I admit that I have not read much on the matter, and it's just reasoning... so thanks for your insight.
It is impossible for me to predict how a sufficiently complex system will react to most things. Heck, I can't even predict my dog's behavior most of the time. But there are certain things I know she values, and that means I can make certain predictions pretty confidently: she won't turn down a hot dog if I offer it, for example. That's true more generally as well: knowing what a system values allows me to confidently make certain broad classes of predictions about it. If a superintelligent system wants me to suffer, for example, I can't predict what it's going to do, but I can confidently predict that I will suffer.
Yea, I get it... I believe, though, that it's impossible to create an AI (self-aware, learning) that has set values that can't change - more importantly, I am not even sure if it's desired (but that depends on what our goal is - whether to create AI only to perform certain simple tasks or whether to create a new race, something that precedes us (which WOULD ultimately mean our demise, anyway))
Why? Do you think paperclip maximizers are impossible? You don't mean that as a dichotomy, do you?
Yes, right now I think it's impossible to create self-improving, self-aware AI with fixed values. I never said that paperclip maximizing can't be their ultimate life goal, but they could change it anytime they like. No.
This is incoherent. If X is my ultimate life goal, I never like to change that fact outside quite exceptional circumstances that become less likely with greater power (like "circumstances are such that X will be maximized if I am instead truly trying to maximize Y"). This is not to say that my goals will never change, but I will never want my "ultimate life goal" to change - that would run contrary to my goals.
That's why I said, that they can change it anytime they like. If they don't desire the change, they won't change it. I see nothing incoherent there.
This is like "X if 1 + 2 = 5". Not necessarily incorrect, but a bizarre statement. An agent with a single, non-reflective goal cannot want to change its goal. It may change its goal accidentally, or we may be incorrect about what its goals are, or something external may change its goal, or its goal will not change.
I don't know, perhaps we're not talking about the same thing. It won't be an agent with a single, non-reflective goal, but an agent a billion times more complex than a human; and all I am saying is that I don't think it will matter much whether we imprint in it a goal like "don't kill humans" or not. Ultimately, the decision will be its own.
So it can change in the same way that you can decide right now that your only purposes will be torturing kittens and making giant cheesecakes. It can-as-reachable-node-in-planning do it, not can-as-physical-possibility. So it's possible to build entities with paperclip-maximizing or Friendly goals that will never in fact choose to alter them, just like it's possible for me to trust you won't enslave me into your cheesecake bakery.
Sure, but I'd be more cautious about assigning probabilities to how likely it is for a very intelligent AI to change its human-programmed values.
(nods) Whether it's possible or not is generally an open question. There's a lot of skepticism about it (I'm fairly skeptical myself), but as with most technical questions, I'm generally content to have smart people research the question in more detail than I'm going to. As to whether it's desirable, though... well, sure, of course it depends on our goals. If all I want is (as you say) to create a new race to replace humanity, and I'm indifferent as to the values of that race, then of course there's no reason for me to care about whether a self-improving AI I create will avoid value drift. Personally, I'm more or less OK with something replacing humanity, but I'd prefer whatever that is to value certain things. For example, a commonly used trivial example around here of a hypothetical failure mode is a "paperclip maximizer" -- an AI that only valued the existence of paperclips, and consequently reassembled all matter it can get its effectors on as paperclips. A paperclip maximizer with powerful enough effectors reassembles everything into paperclips. I would prefer that not happen, from which I conclude that I'm not in fact indifferent as to the values of a sufficiently powerful AI... I desire that such a system preserve at least certain values. (It is difficult to state precisely what values those are, of course. Human values are complex.) I therefore prefer that it avoid value drift with respect to those values. How about you?
Well first, I was all for creating an AI to become the next stage. I was a very singularity-happy type of guy. I saw it as a way out of this world's status quo - corruption, state of politics, etc... but the singularity would ultimately mean I and everybody else would cease to exist, at least in their true sense. You know, I have these romantic dreams, similar to Yudkowsky's idea of dancing in an orbital night club around Saturn, and such. I don't want to be fused in one, even though possibly amazing, matrix of intelligence, which I think is how things will play out, eventually. Even though I can't imagine what it will be like and how it will pan out, as of now I just don't cherish the idea much. But yea, I could say that I am torn between moving on and advancing, and more or less stagnating in our human form. But in answer to your question: if we were to create an AI to replace us, I'd hate for it to become a paperclip maximizer. I don't think it's likely.
That would be an impressive achievement! Mind you, if I create an AI that can achieve time travel I would probably tell it to use its abilities somewhat differently.
Charity led me to understand "precedes us" to mean takes precedence over us in a non-chronological sense. But as long as we're here... why would you do that? If a system is designed to alter the future of the world in a way I endorse, it seems I ought to be willing to endorse it altering the past that way too. If I'm unwilling to endorse it altering the past, it's not clear why I would be willing to endorse it altering the future.
Charity led me to understand that, because the use of that word only makes sense in the case of time travel, he just meant to use another word that means succeeds, replaces or 'is greater than'. But time travel is more interesting.
Google led me to understand that 'precede' is in fact such a word. Agreed about time travel, though.
(My googling leads me to maintain that the use of precede in that context remains wrong.)
I can't find a source for that pronoun in Dwelle's past posts.
Sure it is. If it doesn't alter the future we're all going to die.
Mm. No, still not quite clear. I mean, I agree that all of us not dying is better than all of us dying (I guess... it's actually more complicated than that, but I don't think it matters), but that seems beside the point. Suppose I endorse the New World Order the AI is going to create (nobody dies, etc.), and I'm given a choice between starting the New World Order at time T1 or at a later time T2. In general, I'd prefer it start at T1. Why not? Waiting seems pointless at best, if not actively harmful. I can imagine situations where I'd prefer it start on T2, I guess. For example, if the expected value of my making further improvements on the AI before I turn it on is high enough, I might prefer to wait. Or if by some coincidence all the people I value are going to live past T2 regardless of the NWO, and all the people I anti-value are going to die on or before T2, then the world would be better if the NWO begins at T2 than T1. (I'm not sure whether I'd actually choose that, but I guess I agree that I ought to, in the same way that I ought to prefer that the AI extrapolate my values rather than all of humanity's.) But either way, it doesn't seem to matter when I'm given that choice. If I would choose T1 over T2 at T1, then if I create a time-traveling AI at T2 and it gives me that choice, it seems I should choose T1 over T2 at T2 as well. If I would not choose T1 over T2 at T2, it's not clear to me why I'm endorsing the NWO at all.
Don't disagree. You must have caught the comment that I took down five seconds later when I realized the specific falsehood I rejected was intended as the 'Q' in a modus tollens.

The VNM utility theorem implies there is some good we value highest? Where has this come from? I can't see how this could be true. The utility theorem only applies once you've fixed what your decision problem looks like…

This gives me deja-vu and seems like introductory material... but that doesn't make sense given the lack of acknowledgement of that. What am I missing?

This paragraph, perhaps? The deja vu may be from this.

"Answers" will be the film that is designed to fulfill the coming prophecy by beginning world peace. The link is to the crowdfunding campaign, where you can see the trailer then review the pitch and screenplay for free.

To achieve the Singularity in as fast a time as possible, we need not only money, but lots of smart, hard-working people (who will turn out to be mostly White and Asian males). The thing is, those traits are to a large part genetic; and we know that Ashkenazi Jews are smarter on average than other human groups. I am writing this at the risk of further inflating Eliezer's already massive ego :)

So, an obvious interim solution until we get to the point of enhancing our intelligence through artificial, non-genetic means (or inventing a Seed AI) is to populari...

Resorting to several generations of breeding for intelligence doesn't seem like a very good strategy for getting things done in "as fast a time as possible."
Also, regression to the mean.
How confident are you in our ability, supposing everyone mysteriously possessed the will to do so or we somehow implemented such a program against people's wills, to implement a eugenics program that resulted in, say, as much as a 5% improvement in either the maximum measured intelligence and conscientiousness in the population, or as much as a 5% increase in the frequency of the highest-measured I-and-C ratings (or had some other concretely articulated target benefit, if those aren't the right ones) in less than, say, five generations?
Hsu seems pretty confident, but not due to the Flynn Effect (which may have stalled out already).
Very high, due to the Flynn Effect. Humans are already recursively self-improving. The problem is that the self-improvement is too slow compared to the upper bound of what we might see from a recursively self-improving AI.