Why safe Oracle AI is easier than safe general AI, in a nutshell

by Stuart_Armstrong, 3rd Dec 2011


Moderator: "In our televised forum, 'Moral problems of our time, as seen by dead people', we are proud and privileged to welcome two of the most important men of the twentieth century: Adolf Hitler and Mahatma Gandhi. So, gentlemen, if you had a general autonomous superintelligence at your disposal, what would you want it to do?"

Hitler: "I'd want it to kill all the Jews... and humble France... and crush communism... and give a rebirth to the glory of all the beloved Germanic people... and cure our blond blue eyed (plus me) glorious Aryan nation of the corruption of lesser brown-eyed races (except for me)... and... and... and... and... and... and... and... and... and... and... and... and... and... and... and... and... and... and... and... and... and... and... and... and... and... and... and... and..."

Gandhi: "I'd want it to convince the British to grant Indian independence... and overturn the caste system... and cause people of different colours to value and respect one another... and grant self-sustaining livelihoods to all the poor and oppressed of this world... and purge violence from the heart of men... and reconcile religions... and... and... and... and... and... and... and... and... and... and... and... and... and... and... and... and... and... and... and... and... and... and... and... and... and... and... and... and..."

Moderator: "And if instead you had a superintelligent Oracle, what would you want it to do?"

Hitler and Gandhi together: "Stay inside the box and answer questions accurately".

Hitler and Gandhi together: "Stay inside the box and answer questions accurately".

Answer all questions except "How can I let you out of the box to kill all Jews?" and "How do I build an AGI that kills all Jews?" and "How do I engineer a deadly bioweapon that kills all Jews?" and "How do I take over the world to kill all Jews?" and... and... and... and...and... and... and... and...and... and... and... and...and... and... and... and...

"...if instead you had superintelligent Oracle..."

You'd certainly want the other guy's Oracle not to answer certain questions; but what you want from your Oracle is pretty much the same.

You'd certainly want the other guy's Oracle not to answer certain questions; but what you want from your Oracle is pretty much the same.

But the title of your post talks about how a safe Oracle AI is easier than a safe general AI. Whose questions would be safe to answer?

If an Oracle AI could be used to help spawn friendly AI then it might be a possibility to consider, but under no circumstances would I call it safe as long as it isn't already friendly.

Relying upon humans to ask the right questions, how long is that going to work out until someone asks a question that returns dangerous knowledge?

You'd be basically forced to ask dangerous questions anyway because once you can build an Oracle AI you would have to expect others to be able to build one too and ask stupid questions.

If we had a truly safe oracle, we could ask it questions about the consequences of doing certain things, and knowing certain things.

I can see society adapting stably to a safe oracle without needing it to be friendly.

[-][anonymous]9y 13

"Stay inside the box and answer questions accurately" is about as specific as "Obey my commands" which, again, both Hitler and Gandhi could have said in response to the first question.

That would define a genie (which is about as hard as an oracle) but not a safe genie (which would be "obey the intentions of my commands, extending my understanding in unusual cases").

Whether a safe genie is harder than a safe oracle is a judgement call, but my feelings fall squarely on the oracle side; I'd estimate a safe genie would have to be friendly, unlike a safe oracle.

[-][anonymous]9y 0

I think with the oracle, part of the difficulty might be pushed back to the asking-questions stage. Correctly phrasing a question so that the answer is what you want seems to be the same kind of difficulty as getting an AI to do what you want.

CEV is an attempt to route around the problem you illustrate here, but it might be impossible. Oracle AI might also be impossible. But, well, you know how I feel about doing the impossible. When it comes to saving the world, all we can do is try. Both routes are worth pursuing, and I like your new paper on Oracle AI.

EDIT: Stuart, I suspect you're getting downvoted because you only repeated a point against which many arguments have already been given, instead of replying to those counter-arguments with something new.

When it comes to saving the world, all we can do is try.

If you really believe that it is nearly impossible to solve friendly AI, wouldn't it be better to focus on another existential risk?

Say you believe that unfriendly AI will wipe us out with a probability of 60% and that there is another existential risk that will wipe us out with a probability of 10% even if unfriendly AI turns out to be no risk. Both outcomes have the same disutility x (if we don't assume that an unfriendly AI could also wipe out aliens etc.), so the expected loss from unfriendly AI is larger: .6x > .1x. But let a be the probability of solving friendly AI and b the probability of solving the second risk. If a is no more than one sixth of b (a ≤ b/6), then the expected utility of working on friendly AI is at best equal to that of working on the other existential risk, because .6ax ≤ .1bx.

(Note: I really suck at math, so if I made an embarrassing mistake I hope you understand what I am talking about anyway.)
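
For anyone who wants to check the arithmetic, here is a minimal sketch of the comparison, assuming the simple model in this comment: expected gain = P(risk) × P(solving it) × utility at stake. The 60% and 10% figures come from the comment; the function names and the value chosen for b are hypothetical, picked only for illustration.

```python
from fractions import Fraction


def expected_gain(p_risk, p_solve, utility=1):
    """Expected utility of working on a risk under the toy model:
    P(risk happens) * P(we solve it) * utility at stake."""
    return p_risk * p_solve * utility


def break_even_ratio(p_risk_ai, p_risk_other):
    """Ratio a/b at which working on either risk has the same expected gain."""
    return p_risk_other / p_risk_ai


# Probabilities taken from the comment above.
p_ai = Fraction(6, 10)     # unfriendly AI wipes us out
p_other = Fraction(1, 10)  # the other existential risk wipes us out

ratio = break_even_ratio(p_ai, p_other)

# Hypothetical value for b, the chance of solving the other risk.
b = Fraction(3, 10)
a = ratio * b  # chance of solving friendly AI at the break-even point

gain_ai = expected_gain(p_ai, a)
gain_other = expected_gain(p_other, b)

print(ratio, gain_ai, gain_other)
```

Exact fractions are used so the break-even comparison is not blurred by floating-point rounding. Any a below ratio × b makes the other risk the better bet under this model, which is all the comment claims.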

If you really believe that it is nearly impossible to solve friendly AI, wouldn't it be better to focus on another existential risk?

Solving other x-risks will not save us from uFAI. Solving FAI will save us from other x-risks. Solving Oracle AI might save us from other x-risks. I think we should be working on both FAI and Oracle AI.

Solving other x-risks will not save us from uFAI. Solving FAI will save us from other x-risks.

Good point. I will have to think about it further. Just a few thoughts:

Safe nanotechnology (unsafe nanotechnology being an existential risk) will also save us from various existential risks. Arguably less than a fully-fledged friendly AI. But assume that the disutility of both scenarios is about the same.

An evil AI (as opposed to an unfriendly AI) is as unlikely as a friendly AI. Both risks will probably simply wipe us out and don't cause extra disutility. If you consider the extermination of alien life you might get a higher amount of disutility. But I believe that can be outweighed by negative effects of unsafe nanotechnology that doesn't manage to wipe out humanity but rather causes various dystopian scenarios. Such scenarios are more likely than evil AI because nanotechnology is a tool used by humans who can be deliberately unfriendly.

So let's say that solving friendly AI has 10x the utility of ensuring safe nanotechnology because it can save us from more existential risks than the use of advanced nanotechnology could.

But one order of magnitude more utility could easily be outweighed or trumped by an underestimation of the complexity of friendly AI. Which is why I asked if it might be possible that the difficulty of solving friendly AI might outweigh its utility and therefore justify us to disregard friendly AI for now. If that is the case it might be better to focus on another existential risk that might wipe us out in all possible worlds where unfriendly AI either comes later or doesn't pose a risk at all.

An evil AI (as opposed to an unfriendly AI) is as unlikely as a friendly AI.

Surely only if you completely ignore effects from sociology and psychology!

But one order of magnitude more utility could easily be outweighed or trumped by an underestimation of the complexity of friendly AI. Which is why I asked if it might be possible that the difficulty of solving friendly AI might outweigh its utility and therefore justify us to disregard friendly AI for now.

Machine intelligence may be distant or close. Nobody knows for sure - although there are some estimates. "Close" seems to have some non-negligible probability mass to many observers - so, humans would be justified in paying a lot more attention than many humans are doing.

"AI vs nanotechnology" is rather a false dichotomy. Convergence means that machine intelligence and nanotechnology will spiral in together. Synergy means that each facilitates the production of the other.

If you were to develop safe nanotechnology before unfriendly AI then you should be able to suppress the further development of AGI. With advanced nanotechnology you could spy on and sabotage any research that could lead to existential risk scenarios.

You could also use nanotechnology to advance WBE and use it to develop friendly AI.

Convergence means that machine intelligence and nanotechnology will spiral in together. Synergy means that each facilitates the production of the other.

Even in the possible worlds where it is true that uncontrollable recursive self-improvement is possible (which I doubt anyone would claim is a certainty and therefore that there are possible outcomes where any amount of nanotechnology won't result in unfriendly AI), one will come first. If nanotechnology is going to come first then we won't have to worry about unfriendly AI anymore because we will all be dead.

The question is not only about the utility associated with various existential risks and their probability but also the probability of mitigating the risk. It doesn't matter if friendly AI can do more good than nanotechnology if nanotechnology comes first or if friendly AI is unsolvable in time.

Note that nanotechnology is just an example.

one will come first

Probably slightly. Most likely we will get machine intelligence before nanotech and good robots. To build an e-brain you just need a nanotech NAND gate. It is easier to build a brain than an ecosystem. Some lament the difficulties of software engineering - but their concerns seem rather overrated. Yes, software lags behind hardware - but not by a huge amount.

If nanotechnology is going to come first then we won't have to worry about unfriendly AI anymore because we will all be dead.

That seems rather pessimistic to me.

Note that nanotechnology is just an example.

The "convergence" I mentioned also includes robots and biotechnology. That should take out any other examples you might have been thinking of.

The problem with CEV can be phrased by extending the metaphor: a CEV built from both Hitler and Gandhi means that the areas in which their values differ are not relevant to the final output. So attitudes to Jews and violence, for instance, will be unpredictable in that CEV (so we should model them now as essentially random).

Stuart, I suspect you're getting downvoted because you only repeated a point against which many arguments have already been given, instead of replying to those counter-arguments with something new.

It's interesting. Normally my experience is that metaphorical posts get higher votes than technical ones - nor could I have predicted the votes from reading the comments. Ah well; at least it seems to have generated discussion.

The problem with CEV can be phrased by extending the metaphor: a CEV built from both Hitler and Gandhi means that the areas in which their values differ are not relevant to the final output. So attitudes to Jews and violence, for instance, will be unpredictable in that CEV (so we should model them now as essentially random).

That's not how I understand CEV. But, the theory is in its infancy and underspecified, so it currently admits of many variants.

Hum... If we got the combined CEV of two people, one of whom thought violence was ennobling and one who thought it was degrading, would you expect either or both of:

a) their combined CEV would be the same as if we had started with two people both indifferent to violence

b) their combined CEV would be biased in a particular direction that we can know ahead of time

The idea is that their extrapolated volitions would plausibly not contain such conflicts, though it's not clear yet whether we can know what that would be ahead of time. Nor is it clear whether their combined CEV would be the same as the combined CEV of two people indifferent to violence.

So, to my ears, it sounds like we don't have much of an idea at all where the CEV would end up - which means that it most likely ends up somewhere bad, since most random places are bad.

Well, if it captures the key parts of what you want, you can know it will turn out fine even if you're extremely ignorant about what exactly the result will be.

if it captures the key parts of what you want

Yes, as the Spartans answered to Alexander the Great's father when he said "You are advised to submit without further delay, for if I bring my army into your land, I will destroy your farms, slay your people, and raze your city." :

"If".

Yup. So, perhaps, focus on that "if."

Shouldn't we be able to rule out at least some classes of scenarios? For instance, paperclip maximization seems like an unlikely CEV output.

Most likely we can rule out most scenarios that all humans agree are bad. So better than clippy, probably.

But we really need a better model of what CEV does! Then we can start to talk sensibly about it.

[-][anonymous]7y 0

which means that it most likely ends up somewhere bad, since most random places are bad.

I don't think that follows, at all. CEV isn't a random-walk. It will at the very least end up at a subset of human values. Maybe you meant something different here, by the word 'bad'?

The problem is that an Oracle AI (even assuming it were perfectly safe) does not actually do much to prevent an UFAI taking over later, and if you use it to help FAI along Hitler and Gandhi will still disagree. (An actual functioning FAI based on Hitler's CEV would be preferable to the status quo, depressingly enough)

(An actual functioning FAI based on Hitler's CEV would be preferable to the status quo, depressingly enough)

Can you expand on this logic? This isn't obvious to me.

I don't have a strong insight into the psychology of Hitler and consider it possible that the CEV process would filter out the insanity and have mostly the same result as the CEV of pretty much anyone else.

Even if not, a universe filled with happy "Aryans" working on "perfecting" themselves would be a lot better than a universe filled with paper clips (or a dead universe), and from a consequentialist point of view genocide isn't worse than being reprocessed into paper clips (this is assuming Hitler wouldn't want to create an astronomic number of "untermenschen" just to make them suffer).

On aggregate, outcomes worse than a Hitler CEV AGI (eventual extinction from non-AI causes, UFAI, alien AGI with values even more distasteful than Hitler's) seem quite a bit more likely than better outcomes (FAI, AI somehow never happening and humanity reaching a good outcome anyway, alien AGI with values less distasteful than Hitler's).

(Yes, CEV is most likely better than nothing but...)

I don't have a strong insight into the psychology of Hitler and consider it possible that the CEV process would filter out the insanity and have mostly the same result as the CEV of pretty much anyone else.

This is way, way, off. CEV isn't a magic tool that makes people have preferences that we consider 'sane'. People really do have drastically different preferences. Value is fragile.

Well, to the extent apparent insanity is based on (and not merely justified by) factually wrong beliefs CEV should extract saner seeming preferences, and similar for apparent insanity resulting from inconsistency. I have no strong opinion on what the result in this particular case would be.

The important part was this:

and have mostly the same result as the CEV of pretty much anyone else.

No. No, no, no!

This is way, way, off. CEV isn't a magic tool that makes people have preferences that we consider 'sane'.

FAWS didn't say that CEV would filter out what-we-consider-to-be Hitler's insanity. After all, we may be largely insane, too. I take FAWS to be suggesting that CEV would filter out Hitler's actual insanity, possibly leaving something essentially the same as what CEV gets after it filters out my insanity.

People really do have drastically different preferences.

People express different preferences, but it is not obvious that their CEV-ified preferences would be so different. (I'm inclined to expect that they would be, but it's not obvious.)

After all, we may be largely insane, too. I take FAWS to be suggesting that CEV would filter out Hitler's actual insanity, possibly leaving something essentially the same as what CEV gets after it filters out my insanity.

Possibly. And possibly CEV<Mortimer Q. Snodgrass> is a universe tiled with stabbing victims! There seems to be some irresistible temptation to assume that extrapolating the volition of individuals will lead to convergence. This is a useful social stance to have and it is a mostly harmless belief in practical terms for nearly everyone. Yet for anyone who is considering actual outcomes of agents executing coherent extrapolated volitions it is dangerous.

People express different preferences, but it is not obvious that their CEV-ified preferences would be so different.

We are considering individuals of entirely different upbringing and culture, from (quite possibly) a different genetic pool, with clearly different drives and desires and who by their very selection have an entirely different instinctive relationship with power and control. Sure, there are going to be similarities; relative to mindspace in general extrapolated humans will be comparatively similar. We can expect most models of such extrapolated humans to each have a node for sexiness even if the details of that node vary rather significantly. Yet assuming similarities too far beyond that requires altogether too much mind projection.

If CEV<me> and CEV<Hitler> end up the same, then the difference between me and Hitler (such as whether we should kill Jews) is not relevant to the CEV output, which makes me very worried about its content.

[-][anonymous]9y 1

This is way, way, off. CEV isn't a magic tool that makes people have preferences that we consider 'sane'. People really do have drastically different preferences. Value is fragile.

I wholeheartedly agree. It boggles my mind that people think they can predict what CEV would want, let alone CEV.

What distinguishes Hitler from other people in the arguments about the goodness of CEV's output?

Something must be known to decide that CEV is better than random noise, and the relevant distinctions between different people are the distinctions you can use to come to different conclusions about quality of CEV's output. What you don't know isn't useful to discern the right answer, only what you do know can be used, even if almost nothing is known.

what would you want it to do?

Want it to do? "What would it do?" is the important question.

Besides, if you could just ask an Oracle AI how to make it friendly, what's the difference to an AI that's built to answer and implement that question? Given that such an AI is supposedly perfectly rational, wouldn't it be careful to answer the question correctly even if it was defined poorly? Wouldn't it try to answer the question carefully so as not to diminish or obstruct the answer? If the answer is no, then how would an Oracle AI be different in the respect of coming up with an adequate answer to a poorly formed and therefore vague question?

In other words, if you expect an Oracle AI to guess what you mean by friendliness and give a correct answer, why wouldn't that work with an unbounded AI as well?

An AI just doesn't care what you want. And if it cared what you want then it wouldn't know what exactly you want. And if it cared what you want and cared to figure out what exactly you want then it would already be friendly.

The problem is that an AI doesn't care and doesn't care to care. Why would that be different with an Oracle AI? If you could just ask it to solve the friendly AI problem then it is only a small step from there to ask it to actually implement it by making itself friendly.

It may not be possible to build a FAI at all - or we may end up with a limited oracle that can answer only easier questions, or only fully specified ones.

I know and I didn't downvote your post either. I think it is good to stimulate more discussion about alternatives (or preliminary solutions) to friendly AI in case it turns out to be unsolvable in time.

...or we may end up with a limited oracle that can answer only easier questions, or only fully specified ones.

The problem is that you appear to be saying that it would somehow be "safe". If you are talking about expert systems then it would presumably not be a direct risk but (if it is advanced enough to make real progress that humans alone can't) a huge stepping stone towards fully general intelligence. That means that if you target Oracle AI instead of friendly AI you will just increase the probability of uFAI.

Oracle AI has to be a last resort when the shit hits the fan.

(ETA: If you mean we should also work on solutions to keep a possible Oracle AI inside a box (a light version of friendly AI), then I agree. But one should first try to figure out how likely friendly AI is to be solved before allocating resources to Oracle AI.)

Oracle AI has to be a last resort when the shit hits the fan.

If we had infinite time, I'd agree with you. But I'm feeling that we have little chance of solving FAI before the shit indeed does hit the fan and us. The route safe Oracle -> Oracle assisted FAI design seems more plausible to me. Especially as we are so much better at correcting errors than preventing them, so a prediction Oracle (if safe) would play to our strengths.

But I'm feeling that we have little chance of solving FAI before the shit indeed does hit the fan and us.

If I assume a high probability of risks from AI and a short planning horizon then I agree. But it is impossible to say. I take the same stance as Holden Karnofsky from GiveWell regarding the value of FAI research at this point:

I think that if you're aiming to develop knowledge that won't be useful until very very far in the future, you're probably wasting your time, if for no other reason than this: by the time your knowledge is relevant, someone will probably have developed a tool (such as a narrow AI) so much more efficient in generating this knowledge that it renders your work moot.

I think the same applies for fail-safe mechanisms and Oracle AI, although to a lesser extent.

The route safe Oracle -> Oracle assisted FAI design...

What is your agenda for developing such a safe Oracle? Are you going to do AGI research first and along the way try to come up with solutions on how to make it safe? I think that would be a promising approach. But if you are trying to come up with ways on how to ensure the safety of a fictive Oracle, whose nature is a mystery to you, then the argument mentioned above counts again.

The problem is getting an Oracle to answer useful questions.

Paperclip Manufacturer: "How do I make paperclips?"

Oracle: Shows him designs for a paperclip maximizer

Paperclip Manufacturer: "How do I make paperclips, in a way I'd actually be willing to do?"

Oracle: Shows him designs for an innocuous-looking paperclip maximizer

Once you get it to answer your question without designing an X-maximiser, you've pretty much solved FAI.

"Just answer my questions accurately! How do I most greatly reduce the number of human deaths in the future?"

"Insert the following gene into your DNA: GACTGAGTACTTGCTGCTGGTACGGATGCTA..."

So, do you do it? Do you trust everyone else not to do it? Can you guess what will happen if you're wrong?

You imagine an Oracle AI as safe because it won't act on the world, but anyone building an Oracle AI will do so with the express purpose of affecting the world! Just sticking a super-unintelligent component into that action loop is unlikely to make it any safer.

Even if nobody inadvertently asks the Oracle any trick questions, there's a world of pitfalls buried in the superficially simple word "accurately".

Any method that prevents any more children being created and quickly kills off all humans will satisfy that request.

You are deliberately casting him in a bad light!

If I want to reduce the number of human deaths in the future-from-now, I just need to stop people from creating new people, period. Destruction of the living population comes after the answer anyway, and so does not improve anything. They will die sooner or later anyway (heat death/big crunch/accumulated bad luck); maybe applying exponential discounting makes us want to put the deaths off.

Fair enough, the AI could modify every human's mind so none of them wish to replicate, but easier to terminate the lot of them and eliminate the risk entirely.

Easier - maybe. The best way is to non-destructively change living beings in such a way that they become reproductively incompatible with Homo sapiens. No deaths this time, and we can claim that this intelligent species has no humans among its members. This stupid creature at the terminal may even implement it, unlike all these bloodbath solutions.

I declare your new species name is 'Ugly Bags of Mostly-Water'. There you go, no more human deaths. I'm sure humanity would like that better than genocide, but the UBMW will then ask the equivalent question.

Hm, sterilisation of humans and declaring (because of reproductive incompatibility) them a new species. UBMWs will get the answer that nothing can change the amount.

Yep, accurately (or more precisely, informatively mostly accurate) is a challenge. We look into it a bit in our paper: http://www.aleph.se/papers/oracleAI.pdf

I don't just do it, I ask followup questions, like what are the effects in more detail. If I am unfortunate, I ask something like "how could I do that", and get an answer like "e-mail the sequence to a university lab, along with this strangely compelling argument" and I read the strangely compelling argument which is included as part of the answer.

So if a goal-directed AI can hack your mind, it is pretty easy to accidentally ask the oracle AI a question where the answer will do the same thing. If you can avoid that, you need to ask lots of questions before implementing its solution so you get a good idea of what you are doing.

I think a realistic expectation is for our ability to perform inductive inference to develop faster (in a sense) than our ability to do the other parts of machine intelligence (i.e. tree pruning and evaluation). In which case, all realistic routes to machine intelligence would get there through an oracle-like stage. Inductive inference is cross-domain and will be used everywhere - fuelling its development.

How to use a parable to covertly revive long refuted arguments...

If I understand what "safe" means to you, you are basically saying that having such a super-intelligent Oracle wouldn't help Hitler achieve his goals.

Nope. Building FAI vs building OAI means that in one case everyone wants to affect the actual AI built in another direction and in the second one everyone wants a copy. This means that in the second case actual safety is something all sides can collaborate on, even if indirectly.

Oracle AI technology can end up in multiple hands at once, with any 3/4 of holders being able to calm down a coalition of any 1/4. This may help stabilize the system and create a set of Intelligences who value cooperation. In any case, this probably gives more time to do this.

The villain asks the Oracle: "How do I build a Wunderwaffe (a virus that kills humanity, an UFAI) for myself?" The oracle returns the plans for building such a thing, since it only wishes to answer questions correctly. How does the rest of humanity prevent the doom once the information is released?

Well, if the questions are somehow censored before given to the AI, we perhaps get some additional safety. Until some villain discovers how to formulate the question to pass it through the censors undetected. Or discovers destructive potential in an answer to question asked by somebody else.

Anyway, the original post effectively says that Oracles are safe because all people agree what they should do: answer the questions. This hinges on the idea of robots endangering us only via direct power and disregards the gravest danger of super-human intelligence: revealing dangerous information which can be used to make things whose consequences we are unable to predict.

But the oracle will be able to predict these consequences, and we'll probably get into the habit of checking these.

The problem is that the question "what would be the consequences" is too general to be answered exhaustively. We should at least have an idea about the general characteristics of the risk to ask more specifically; the Oracle doesn't know what consequences are important for us unless it already comprehends human values and is thus already "friendly".

Well, after a small publicity campaign, villains will start to ask Oracles whether there is any world to rule after they take over the world. No really, XX century teaches us that MAD is something that can calm people with power reliably.

Virus that kills 100% of humanity armed with more information processing power to counter it than the virus designer has to build it is not easy to create. 75% may be easy enough at some stage; but it is not an existential risk. On the plus side we may be able to use the OAIs on the good side to fight multiply resistant bug strains in the case they become pathogenic.

No really, XX century teaches us that MAD is something that can calm people with power reliably.

One should be reluctant to generalize from a very small dataset, particularly when the stakes are this high.

I agree that we have too few well-documented cases. But there are also some reasons behind MAD being effective. It doesn't look like MAD is a fluctuation. It is not bulletproof evidence, but it is some evidence.

Also, it is complementary to the second part: MAD via OAI also means high chances of partially parrying the strike.