TL;DR: let's visualize what the world looks like if we successfully prepare for the Singularity.
I remember reading once, though I can't remember where, about a technique called 'contrasting'. The idea is to visualize a world where you've accomplished your goals, and visualize the current world, and hold the two worlds in contrast to each other. Apparently there was a study about this; the experimental 'contrasting' group was more successful than the control in accomplishing its goals.
It occurred to me that we need some of this. Strategic insights about the path to FAI are neither robust nor likely to be highly reliable. And in order to find a path forward, you need to know where you're trying to go. Thus, some contrasting:
It's the year 20XX. The time is 10 AM, on the day that will thereafter be remembered as the beginning of the post-Singularity world. Since the dawn of the century, a movement has risen in defense of humanity's future. What began with mailing lists and blog posts became a slew of businesses, political interventions, infrastructure improvements, social influences, and technological innovations designed to ensure the safety of the world.
Against all odds, we exerted a truly extraordinary effort, and we did it. The AI research is done; we've laboriously tested and re-tested our code, and everyone agrees that the AI is safe. It's time to hit 'Run'.
And so I ask you, before we hit the button: what does this world look like? In the scenario where we nail it, which achievements enabled our success? Socially? Politically? Technologically? What resources did we acquire? Did we have superior technology, or a high degree of secrecy? Was FAI research highly prestigious, attractive, and well-funded? Did we acquire the ability to move quickly, or did we slow unFriendly AI research efforts? What else?
I had a few ideas, which I divided between scenarios where we did a 'fantastic', 'good', or 'sufficient' job at preparing for the Singularity. But I need more ideas! I'd like to fill this out in detail, with the help of Less Wrong. So if you have ideas, write them in the comments, and I'll update the list.
Some meta points:
- This speculation is going to be, well, pretty speculative. That's fine - I'm just trying to put some points on the map.
- However, I'd like to get a list of reasonable possibilities, not detailed sci-fi stories. Do your best.
- In most cases, I'd like to consolidate categories of possibilities. For example, we could consolidate "the FAI team has exclusive access to smart drugs" and "the FAI team has exclusive access to brain-computer interfaces" into "the FAI team has exclusive access to intelligence-amplification technology."
- However, I don't want too much consolidation. For example, I wouldn't want to consolidate "the FAI team gets an incredible amount of government funding" and "the FAI team has exclusive access to intelligence-amplification technology" into "the FAI team has a lot of power".
- Lots of these possibilities are going to be mutually exclusive; don't see them as aspects of the same scenario, but rather different scenarios.
Anyway - I'll start.
Visualizing the pre-FAI world
- Fantastic scenarios
- The FAI team has exclusive access to intelligence amplification technology, and uses it to ensure Friendliness & strategically reduce X-risk.
- The government supports Friendliness research, and contributes significant resources to the problem.
- The government actively implements legislation which FAI experts and strategists believe has a high probability of making AI research safer.
- FAI research becomes a highly prestigious and well-funded field, relative to AGI research.
- Powerful social memes exist regarding AI safety; any new proposal for AI research is met with a strong reaction (among the populace and among academics alike) asking about safety precautions. It is low status to research AI without concern for Friendliness.
- The FAI team discovers important strategic insights through a growing ecosystem of prediction technology: stables of experts, prediction markets, and opinion aggregation.
- The FAI team implements deliberate X-risk reduction efforts to stave off non-AI X-risks. Those might include a global nanotech immune system, cheap and rigorous biotech tests and safeguards, nuclear safeguards, etc.
- The FAI team implements the infrastructure for a high-security research effort, perhaps offshore, implementing the best available security measures designed to reduce harmful information leaks.
- Giles writes: Large amounts of funding are available, via government or through business. The FAI team and its support network may have used superior rationality to acquire very large amounts of money.
- Giles writes: The technical problem of establishing Friendliness is easier than expected; we are able to construct a 'utility function' (or a procedure for determining such a function) in order to implement human values that people (including people with a broad range of expertise) are happy with.
- Crude_Dolorium writes: FAI research proceeds much faster than AI research, so by the time we can make a superhuman AI, we already know how to make it Friendly (and we know what we really want that to mean).
- Pretty good scenarios
- Intelligence amplification technology access isn't exclusive to the FAI team, but it is differentially adopted by the FAI team and their supporting network, resulting in a net increase in FAI team intelligence relative to baseline. The FAI team uses it to ensure Friendliness and implement strategy surrounding FAI research.
- The government has extended some kind of support for Friendliness research, such as limited funding. No protective legislation is forthcoming.
- FAI research becomes slightly more high status than today, and additional researchers are attracted to answer important open questions about FAI.
- Friendliness and rationality memes grow at a reasonable rate, and by the time the Friendliness program occurs, society is more sane.
- We get slightly better at making predictions, mostly by refining our current research and discussion strategies. This allows us a few key insights that are instrumental in reducing X-risk.
- Some X-risk reduction efforts have been implemented, but with varying levels of success. Insights about which X-risk efforts matter are of dubious quality, and the success of each effort doesn't correlate well with the seriousness of the X-risk. Nevertheless, some X-risk reduction is achieved, and humanity survives long enough to implement FAI.
- Some security efforts are implemented, making it difficult but not impossible for pre-Friendly AI tech to be leaked. Nevertheless, no leaks happen.
- Giles writes: Funding is harder to come by, but small donations, limited government funding, or moderately successful business efforts suffice to fund the FAI team.
- Giles writes: The technical problem of aggregating values through a Friendliness function is difficult; people have contradictory and differing values. However, there is broad agreement as to how to aggregate preferences. Most people accept that FAI needs to respect values of humanity as a whole, not just their own.
- Crude_Dolorium writes: Superhuman AI arrives before we learn how to make it Friendly, but we do learn how to make an 'Anchorite' AI that definitely won't take over the world. The first superhuman AIs use this architecture, and we use them to solve the harder problems of FAI before anyone sets off an exploding UFAI.
- Sufficiently good scenarios
- Intelligence amplification technology is widespread, preventing any differential adoption by the FAI team. However, FAI researchers are able to keep up with competing efforts to use that technology for AI research.
- The government doesn't support Friendliness research, but the research group stays out of trouble and avoids government interference.
- FAI research never becomes prestigious or high-status, but the FAI team is able to answer the important questions anyway.
- Memes regarding Friendliness aren't significantly more widespread than today, but the movement has grown enough to attract the talent necessary to implement a Friendliness program.
- Predictive ability is no better than it is today, but the few insights we've gathered suffice to build the FAI team and make the project happen.
- There are no significant and successful X-risk reduction efforts, but humanity survives long enough to implement FAI anyway.
- No significant security measures are implemented for the FAI project. Still, via cooperation and because the team is relatively unknown, no dangerous leaks occur.
- Giles writes: The team is forced to operate on a shoestring budget, but succeeds anyway because the problem turns out to not be incredibly sensitive to funding constraints.
- Giles writes: The technical problem of aggregating values is incredibly difficult. Many important human values contradict each other, and we have discovered no "best" solution to those conflicts. Most people agree on the need for a compromise but quibble over how that compromise should be reached. Nevertheless, we come up with a satisfactory compromise.
- Crude_Dolorium writes: The problems of Friendliness aren't solved in time, or the solutions don't apply to practical architectures, or the creators of the first superhuman AIs don't use them, so the AIs have only unreliable safeguards. They're given cheap, attainable goals; the creators have tools to read the AIs' minds to ensure they're not trying anything naughty, and killswitches to stop them; they have an aversion to increasing their intelligence beyond a certain point, and to whatever other failure modes the creators anticipate; they're given little or no network connectivity; they're kept ignorant of facts more relevant to exploding than to their assigned tasks; they require special hardware, so it's harder for them to explode; and they're otherwise designed to be safer if not actually safe. Fortunately they don't encounter any really dangerous failure modes before they're replaced with descendants that really are safe.
I saw a post the other year about artists who asked Michael Vassar what they could do to ensure friendliness. "Modify the status of friendliness research" seems like it could be a good answer to that. Art can definitely impact cultural memes (e.g. Stranger in a Strange Land), so a reasonable proceduralization of "modify the status of activity X" might be to get some artist to do it. (Though, even if artists can create and spread powerful memes, figuring out exactly how to construct those memes in order to modify status could be a pain point.)
You're totally right. Art & Utopianism go together like a horse and carriage. This is an interesting blog on the subject:
Another dimension: value discovery.
I'm tempted to add:
*Not So Good: the FAI team, or one team member, takes over the world. (Imagine an Infinite Doom spell done right.)
I would much rather see any single human being's values take over the future light cone than a paperclip maximizer!
So would I. It's not so good, but it's not so bad either.
Agree with your Fantastic but disagree with how you arrange the others... it wouldn't be rational to favor a solution which satisfies others' values in larger measure at the expense of one's own values in smaller measure. If the solution is less than Fantastic, I'd rather see a solution which favors in larger measure the subset of humanity with values more similar to my own, and in smaller measure the subset of humanity whose values are more divergent from my own.
I know, I'm a damn, dirty, no good egoist. But you have to admit that in principle egoism is more rational than altruism.
OK - I wasn't too sure about how these ones should be worded.
Why are these part of the "fantastic scenario"? An asteroid defense system will almost certainly not be needed: the overwhelmingly likely case (backed up by telescope observations and outside view statistics) is that there won't be any big threatening asteroids over the relevant timescales.
Similarly, many of the other scenarios you list are concerned with differences that would slightly (or perhaps, for some, substantially) shift the probabilities of global outcomes, rather than being outcomes themselves. That's pretty different from a central requirement of a successful outcome. The framework here could be clearer.
I imagined the 'fantastic scenario' as being one in which "The good guys implement deliberate X-risk reduction efforts to stave off non-AI X-risks". I meant to cite "a global nanotech immune system, cheap and rigorous biotech tests and safeguards, an asteroid defense system, nuclear safeguards" as examples of "X-risk reduction efforts" in order to fill out the category, regardless of the individual relevance of any of the examples. Anyway, it's confusing, and I should remove it.
Yeah, I think I want a picture of what the world looks like where the probability of success was as high as possible, and then we succeeded. I think the central requirements of successful outcomes are far fewer, and less helpful for figuring out where to go.
I like how you've partitioned things up into IA/government/status/memes/prediction/xrisk/security and given excellent/good/ok options for each. This helps imagine mix-and-match scenarios, e.g. "FAI team has got good at security but sucks at memes and status".
A few quick points:
The fantastic list has 8 points and the others have 7 (as there are two "government" points). This brings me on to...
Should there be a category for "funding"? The fantastic/good/ok options could be something like:
Does it have to be the FAI team implementing the "other xrisk reduction efforts" or can it just be "such institutions exist"?
I'll add this, and the one from your other comment. (By the way, thank you for being the only person so far to actually answer the goddamn prompt!)
Answering the goddamn prompt is haaaard! I don't know if there's a relationship between the presence of non-answer comments and the absence of answer comments.
What's worse is I wasn't even consciously aware that I was doing that. I'll try and read posts more carefully in the future!
What are those "good guys" you speak of?
People pursuing a positive Singularity, with the right intentions, who understand the gravity of the problem, take it seriously, and do it on behalf of humanity rather than some smaller group.
I haven't offered a rigorous definition, and I'm not going to, but I think you know what I mean.
Right, but this is a public-facing post. A lot of readers might not know why you could think it was obvious that "good guys" would imply things like information security, concern for Friendliness so-named, etc., and they might think that the intuition you mean to evoke with a vague affect-laden term like "good guys" is just the same argument-disdaining groupthink that would be implied if they saw it on any other site.
To prevent this impression, if you're going to use the term "good guys", then at or before the place where you first use it, you should probably put an explanation, like
Okay, I'm convinced. I think I will just remove the term altogether, because it's confusing the issue.
I might have some inkling of what you want to mean, but on this forum, you ought to be able to define your terms to be taken seriously. I suspect that if you honestly try defining "good guys", you will find that it is harder than it looks and not at all obvious.
I'm not saying that the definition is obvious - I'm saying that it's beside the point. It was clearly detracting from the quality of the conversation, though, so I've removed the term.
What do the good guys look like? Do they look like a cabal with government sanction that performs research in secret facilities offshore, controls the asteroid deflection system (and therefore the space program), and prohibits anyone else from using the most effective mind (and presumably quality-of-life) enhancing techniques?
Basically, should one of the very first things a Friendly AI does be to wipe out the group of people who succeed in creating the first FAI?
If the "FAI is important" position is correct, but requires intelligence to understand, would widespread IA cause more people to become interested in working on FAI?
Yeah, I've heard that argument before. The idea is that intelligence not only makes you better at stuff, but also impacts how you make decisions about what to work on.
The alternate hypothesis is that intelligence-amplified people would just get better at being crazy. Perhaps one could start to tease apart the hypotheses by distinguishing 'intelligence' from 'reflectiveness' and 'altruism', and trying to establish how those quantities interact.
Related point: High-IQ folks are more likely to cooperate in the prisoner's dilemma. (See Section 3 of this article.) Which suggests that they'd be more inclined to do altruistic stuff like make an AI that's friendly to everyone rather than an AI that serves their individual wishes.
This list is focused on scenarios where FAI succeeds by creating an AI that explodes and takes over the world. What about scenarios where FAI succeeds by creating an AI that provably doesn't take over the world? This isn't a climactic ending (although it may be a big step toward one), but it's still a success for FAI, since it averts a UFAI catastrophe.
(Is there a name for the strategy of making an oracle AI safe by making it not want to take over the world? Perhaps 'Hermit AI' or 'Anchorite AI', because it doesn't want to leave its box?)
This scenario deserves more attention than it has been getting, because it doesn't depend on solving all the problems of FAI in the right order. Unlike Nanny AI that takes over the world but only uses its powers for certain purposes, Anchorite AI might be a much easier problem than full-fledged FAI, so it might be developed earlier.
In the form of the OP:
Thanks! I've added it to the post. I particularly like that you included the 'sufficiently good' scenario - I hadn't directly thought about that before.
Maybe even the five seconds before the AI takes over the world look just like any other day.
And then... self-replicating nanobots everywhere.
Great post! And there should also be one on visualizing the day after.
Enter 3d10 here, click roll, then get your results here.
Of course there won't be any "day before" or "day after". Machines reaching "human level" will be smooth and gradual - since the surpassing will happen one faculty at a time, and many faculties have already been surpassed (memory, arithmetic, etc). Historians won't assign a particular date. I'm pretty confident that anyone arguing for a particular "day" is just confused.
I disagree. Certain landmarks will seem more important than others, such as intelligence 'tests', like the day a machine unequivocally passes a Turing test. In hindsight, we should be able to further isolate and identify those important landmarks that led to, or directly caused, the singularity.
Furthermore, historians being historians, I am quite convinced a date WILL be included in the history books, regardless of merit.
Like the date of Google's birthday. Maybe a lot like that - since, of those who are trying, they are probably in the lead.
A more likely milestone is the moment the FAI team decides to hit "run" on a design that eventually takes over the world. There will be a day before and a day after that.
That is not going to happen, though. Designing intelligent machines is an incremental process involving teams of humans and machines - and the involvement of humans in the design process will diminish gradually - one human at a time - through automation.
Launching a particular design (version x.y.z) is a one-time event. From that perspective, AI is not a web site. It's a rocket.
Let me rephrase: the math behind AI will take some time to be discovered. The discovery process will be incremental, with some important milestones ("Consciousness Reduced", "Bypassing Löb's Theorem"…). At one point, the FAI team will have realized they have found the solution. They will know what to do to make Friendly AI happen. This realization will also be incremental. They will likely discuss the solution before everyone agrees it is the right one.
One key point about that solution is that it is provably safe. To prove a solution is safe, you have to describe it with mathematical precision. When you can do that, you can also write a computer program that implements that solution. And that's exactly what they will do, because they don't want to leave any room for human error.
At one point, the program that implements the solution will be built. Again, this will be incremental. There will be bugs, which will be corrected. Then there will be no bug left. Then the FAI team will, gradually, realize there is no bug.
And then, and only then, someone will push the red button, type "build && run" or something. The provably correct program will then execute the provably correct solution to make FAI happen. The ultimate milestone.
We're not finished. The program runs, but we don't have a super-intelligence capable of fixing our problems yet. But we will. Singularity will happen, gradually, self improvement by self improvement. Humans will probably be kept in the loop as well, for preference extraction. But they won't be holding the steering wheel.
That looks like a very detailed story, but the core point is very simple: design, write the program, and prove (possibly iteratively), and only then, launch. Because running a program that's still in development is not safe (the video is a joke, but its core point stands: why in the name of Azathoth did the guy launch a self-improving sentient program without knowing for sure it would be safe?). If we win, that sure won't be by trial and error.
We don't seem to agree. This isn't how technology gets built. Nobody proved the first aeroplane was safe. Nobody proved the first space rocket was safe. They weren't safe. No aeroplanes or spaceships have ever been proven safe. You may be able to prove some things about the design process, but not safety - security doesn't work like that.
There is something called "provable security" in cryptography - but it doesn't really mean what its title says. It means that you can prove something relating to security in a particular model - not that you can prove something is safe.
I made 2 assumptions here
You, on the other hand, say that we will most certainly not take those drastic precautions. And you know the worst part? I agree. Which takes us back to square one: by default, we're doomed. (There. You've done it. Now I'm scared.)
Evolution isn't really a win/lose game. Organisms succeed in perpetuating themselves - and the things they value - to varying degrees. Humans seem set to survive in the history books, but our overall influence on the future looks set to be quite marginal - due to the substantial influence of technological determinism - except in the unlikely case where we permanently destroy civilization. Of course we can still try - but I just don't think our best shot looks much like what you described. Of course it might be fun if we had time for all that stuff about provable security - but at the moment, the situation looks a lot like a frantic race, and that looks like exactly the sort of thing that will be first to go up against the wall.
Are you saying that you don't buy the scary idea?
I said I considered destroying "civilization" to be unlikely. Going by this:
...the scary idea claims to be about "the human race". I don't define "civilization" in a human-centric way - so I don't class those as being the same thing - for instance, I think that civilization might well continue after an "involuntary" robot takeover.
Well, a civilization with humanity all dead is pretty much certainly not what we want. I don't care if in the grand scheme of things, this isn't a win/lose game. I think I have something like a utility function, and I want it maximized, period.
Back to my question: do you see any other path to building a future we want than the one I described?
Well, humans will live on via historical simulations, with pretty good probability. Humans won't remain the dominant species, though. Those hoping for that have unrealistic expectations. Machines won't remain human tools, they are likely to be in charge.
Sure, but it's you and billions of other organisms - with often-conflicting futures in mind - and so most won't have things their way.
IIRC, your proposal put considerable emphasis on proof. We'll prove what we can, but proof often lags far behind the leading edge of computer science. There are many other approaches to building mission-critical systems incrementally - I expect we will make more use of those.
Historical simulations: assuming it preserves identity etc, why not…
Utility function: I know that my chances of maximizing my utility function are quite… slim, to say the least.
Path to best future (humanity): proofs do not lag so far behind right now. Modern type systems are now pretty good, and we have proof assistants that make the "prove your whole program" approach quite feasible, though not cheap yet. Plus, the leading edge is generally the easiest to prove, because it tends to lie on solid mathematical ground. We don't do proofs because they're generally expensive, and because we use ancient technologies that leak lots of low-level details and make proofs much harder. (I program for a living.)
But I see at least the possibility of a slightly different path: still take precautions, just don't prove the thing.
Oh, and I forgot: if we solve safety before capability, incrementally designing the AI by trial-and-error would be quite reasonable. The definite milestone will be harder to define in this case. I guess I'll have to update a bit.
Um.... it's called a "Singularity" precisely BECAUSE it has an "event horizon"! Dude, are you sleep-deprived or something? :P
(Sorry if I'm behaving inappropriately; some prescribed antidepressants rather unexpectedly just started working for me when I was expecting little effect.)
It's called mental contrasting. The relevant studies are cited in the optimizing optimism section of How to Beat Procrastination.
How about adding "international conflict (or lack thereof)" as another dimension? The space race, after all, occurred (and is discussed) largely in the context of the cold war.
So a fantastic scenario would be that there is no such conflict, and it's developed multinationally and across multinational blocs; a pretty good scenario would be that two otherwise politically-similar countries compete for prestige in being the first to develop FAI (which may positively affect funding and meme-status, but negatively affect security), and a sufficiently good scenario would be that the competition is between different political blocs, who nonetheless can recognize that the development of FAI means making their own political organizations obsolete.
Sure - if you can format your scenarios into an easily copy-pastable format like that in the post, I'd be happy to add it.
The sheer volume of the scenarios diminishes the likelihood of any one of them. The numerous variations indicate that the outcome is intractably hard to predict. While subject to conjunction bias, a more granular approach is the only feasible method to determine even a hint of the pre-FAI environment. Only a progressively refined model can provide information of value.
Change the question to 'what does an optimal FAI team look like' and many objections are answered. What the world looks like and what a (hopefully) FAI will do remain problematic. As a get-there-from-here strategy, stating optimals has been helpful to me.
Just a gut reaction, but this whole scenario sounds preposterous. Do you guys seriously believe that you can create something as complex as a superhuman AI, and prove that it is completely safe before turning it on? Isn't that as unbelievable as the idea that you can prove that a particular zygote will never grow up to be an evil dictator? Surely this violates some principles of complexity, chaos, quantum mechanics, etc.? And I would also like to know who these "good guys" are, and what will prevent them from becoming "bad guys" when they wield this much power. This all sounds incredibly naive and lacking in common sense!
The main way complexity of this sort would be addressable is if the intellectual artifact that you tried to prove things about were simpler than the process that you meant the artifact to unfold into. For example, the mathematical specification of AIXI is pretty simple, even though the hypotheses that AIXI would (in principle) invent upon exposure to any given environment would mostly be complex. Or for a more concrete example, the Gallina kernel of the Coq proof engine is small and was verified to be correct using other proof tools, while most of the complexity of Coq is in built-up layers of proof search strategies which don't need to themselves be verified, as the proofs they generate are checked by Gallina.
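To make the small-trusted-kernel idea concrete, here is a toy sketch in Python. It is not how Coq works internally: it assumes a tiny propositional logic whose only inference rule is modus ponens, and the names Imp, check_proof, and search_proof are hypothetical. The point is the division of labor: search_proof can be as clever or as buggy as you like, because nothing it produces is accepted until the few lines of check_proof have verified every step.

```python
from dataclasses import dataclass
from typing import List, Optional, Union


@dataclass(frozen=True)
class Imp:
    """An implication 'ante -> cons'. Atoms are plain strings."""
    ante: "Formula"
    cons: "Formula"


Formula = Union[str, Imp]


def check_proof(hypotheses: List[Formula], proof: List[Formula], goal: Formula) -> bool:
    """Small trusted kernel (hypothetical): accept only if every step is a
    hypothesis or follows from two earlier steps by modus ponens, and the
    final step is the goal."""
    seen: List[Formula] = []
    for step in proof:
        by_hypothesis = step in hypotheses
        by_modus_ponens = any(Imp(a, step) in seen for a in seen)
        if not (by_hypothesis or by_modus_ponens):
            return False
        seen.append(step)
    return bool(seen) and seen[-1] == goal


def search_proof(hypotheses: List[Formula], goal: Formula, max_rounds: int = 10) -> Optional[List[Formula]]:
    """Untrusted proof search (naive forward chaining). Its output is never
    believed directly; it is only ever passed to check_proof."""
    known = list(hypotheses)
    for _ in range(max_rounds):
        if goal in known:
            break
        new = [f.cons for f in known
               if isinstance(f, Imp) and f.ante in known and f.cons not in known]
        if not new:
            break
        known.extend(new)
    if goal not in known:
        return None
    # The derivation is every step up to and including the goal.
    return known[: known.index(goal) + 1]
```

For example, from the hypotheses p, p -> q, and q -> r, search_proof finds a derivation of r that check_proof accepts, while a bogus one-line "proof" of r is rejected. Only check_proof would need auditing; swapping in a smarter (still untrusted) search would not change what has to be trusted.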
Yes, any physical system could be subverted with a sufficiently unfavorable environment. You wouldn't want to prove perfection. The thing you would want to prove would be more along the lines of, "will this system become at least somewhere around as capable of recovering from any disturbances, and of going on to achieve a good result, as it would be if its designers had thought specifically about what to do in case of each possible disturbance?". (Ideally, this category of "designers" would also sort of bleed over in a principled way into the category of "moral constituency", as in CEV.) Which, in turn, would require a proof of something along the lines of "the process is highly likely to make it to the point where it knows enough about its designers to be able to mostly duplicate their hypothetical reasoning about what it should do, without anything going terribly wrong".
We don't know what an appropriate formalization of something like that would look like. But there is reason for considerable hope that such a formalization could be found, and that this formalization would be sufficiently simple that an implementation of it could be checked. This is because a few other aspects of decision-making which were previously mysterious, and which could only be discussed qualitatively, have had powerful and simple core mathematical descriptions discovered for cases where simplifying modeling assumptions perfectly apply. Shannon information was discovered for the informal notion of surprise (with the assumption of independent identically distributed symbols from a known distribution). Bayesian decision theory was discovered for the informal notion of rationality (with assumptions like perfect deliberation and side-effect-free cognition). And Solomonoff induction was discovered for the informal notion of Occam's razor (with assumptions like a halting oracle and a taken-for-granted choice of universal machine). These simple conceptual cores can then be used to motivate and evaluate less-simple approximations for situations where the assumptions about the decision-maker don't perfectly apply. For the AI safety problem, the informal notions (for which the mathematical core descriptions would need to be discovered) would be a bit more complex -- like the "how to figure out what my designers would want to do in this case" idea above. Also, you'd have to formalize something like our informal notion of how to generate and evaluate approximations, because approximations are more complex than the ideals they approximate, and you wouldn't want to need to directly verify the safety of any more approximations than you had to. (But note that, for reasons related to Rice's theorem, you can't (and therefore shouldn't want to) lay down universally perfect rules for approximation in any finite system.)
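For a sense of how small one of these conceptual cores is: Shannon's self-information ("surprise") of an outcome with probability p is -log2(p) bits, and entropy is the expected surprise over a distribution. A minimal Python sketch, assuming a known discrete distribution given as a list of probabilities (the function names are illustrative):

```python
import math
from typing import Iterable


def surprisal_bits(p: float) -> float:
    """Shannon self-information -log2(p), in bits, of an outcome with probability p."""
    if not 0 < p <= 1:
        raise ValueError("p must be in (0, 1]")
    return -math.log2(p)


def entropy_bits(dist: Iterable[float]) -> float:
    """Expected surprisal H = -sum(p * log2 p); zero-probability outcomes contribute nothing."""
    return sum(p * surprisal_bits(p) for p in dist if p > 0)
```

A fair coin flip carries 1 bit of surprise and a one-in-four outcome carries 2 bits. The i.i.d.-symbols-from-a-known-distribution assumption mentioned above is exactly what lets the definition be this short.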
Two other related points are discussed in this presentation: the idea that a digital computer is a nearly deterministic environment, which makes safety engineering easier for the stages before the AI is trying to influence the environment outside the computer, and the idea that you can design an AI in such a way that you can tell what goal it will at least try to achieve even if you don't know what it will do to achieve that goal. Presumably, the better your formal understanding of what it would mean to "at least try to achieve a goal", the better you would be at spotting and designing to handle situations that might make a given AI start trying to do something else.
(Also: Can you offer some feedback on which features of the site would have helped you become aware sooner that there were arguments behind the positions you felt were being asserted blindly in a vacuum? The "things can be surprisingly formalizable, here are some examples" argument can be found in lukeprog's "Open Problems Related to the Singularity" draft and the later "So You Want to Save the World", though the argument is very short, and its significance is hard to recognize if you don't already know most of the mathematical formalisms mentioned. A backup "you shouldn't just assume that there's no way to make this work" argument is in "Artificial Intelligence as a Positive and Negative Factor in Global Risk", pp. 12-13.)
That's a problem where successful/practically applicable formalizations are harder to hope for, so it's been harder for people to find things to say about it that pass the threshold of being plausible conceptual progress instead of being noisy verbal flailing. See the related "How can we ensure that a Friendly AI team will be sane enough?". But it's not like people aren't thinking about the problem.
This is actually one of the best comments I've seen on Less Wrong, especially this part:
Thanks for the clear explanation.
The idea is not "take an arbitrary superhuman AI and then verify it's destined to be well behaved" but rather "develop a mathematical framework that allows you from the ground up to design a specific AI that will remain (provably) well behaved, even though you can't, for arbitrary AIs, determine whether or not they'll be well behaved."
I think this comment is disingenuous, given your statements that the extinction of humanity is inevitable and the fact that you have a website using evil AI imagery. http://lesswrong.com/lw/b5i/a_primer_on_risks_from_ai/64dq
Whether the individual in question has other motivations doesn't by itself make the questions raised any less valid.
It could be evidence that the questioner isn't worth engaging, because the conversation is unlikely to be productive. The questioner might have significantly motivated cognition or have written the bottom line.
The scenarios you listed are absurd, incoherent, irrelevant, and wildly, wildly optimistic.
Actually I'd argue this entire exercise is an indictment of Eliezer's approach to Friendly AI. The notion of a formal, rigorous success of "Friendliness theory" coming BEFORE the Singularity is astronomically improbable.
What a Friendly Singularity will actually look like is an AGI researcher or researchers forging ahead at an insane risk to themselves and the rest of humanity, and then somehow managing the improbable task of not annihilating humanity through intensive, inherently faulty safety engineering, before eventually realizing a formal solution to Friendliness theory post-Singularity. And of course it goes without saying that the odds are heavily against any such safety mechanisms even succeeding, let alone the odds that they will ever even be attempted.
Suffice it to say that a world in which we are successfully prepared to implement Friendly AI is unimaginable at this point.
EDIT: How the hell do I un-retract this? See below.
Why is his pessimistic (realistic?) take downvoted without counterargument? Hankx isn't in negative karma, so I don't think he is usually disruptive, and I think he is making this argument in good faith.
Well I didn't really substantively defend my position with reasons, and heaping on all the extra adjectives didn't help :P
I was trying to figure out how to strike-through the unsupported adjectives, now I can't figure out how to un-retract the comment... bleh what a mess.
While I still agree with all the adjectives, I'll take them out to be less over the top. Here's what the edit should say:
And just to give some indication of where I'm coming from, I would say that this conclusion follows pretty directly if you buy Eliezer's arguments in the sequences and elsewhere about locality and hard takeoff, combined with his arguments that FAI is much harder than AGI. (see e.g. here)
Of course I have to wonder: is Eliezer holding out to try to "do the impossible" in some pipe-dream FAI scenario like the one the OP imagines, or does he agree with this argument but still think he's somehow working in the best way possible to support this more realistic scenario when it comes up?
Strikethrough by using double tildes on each side of the struck-through portion: ~~unsupported adjective~~ => unsupported adjective.