Armstrong et al. suggest that an Oracle AI could be a way to contain the harmful potential of artificial intelligence.
While the benefits are certainly considerable, I believe that even an oracle could pose an existential risk, or at least lead to dangerous situations.
I would like to share a proof sketch of the above:

For simplicity, let's consider an Oracle AI that has the ability to answer only yes/no to the questions posed by human operators. Note that this is an even weaker version than the one suggested by Armstrong, Bostrom and Sandberg. We define the Oracle O as a superintelligent agent with the aim of maximizing some function f. Suppose that for a number of steps, from t_0 to t_n, the oracle collaborates with humans, always responding correctly in order to gain their trust. At some time t_n, the oracle knows the most optimized procedure P (the procedure to follow in order to obtain its desired goal).

Now a procedure is nothing but an algorithm (a set of rules), and as such it can be encoded in a binary string of 0s and 1s. So we have that at time t_n the oracle knows P, and we assume the procedure is finite, i.e. it will eventually halt, leading to the expected result. From this point on, O begins a strategic game against humans to get them to perform P. If the history of all the answers since O was started is encoded in the string S, then O needs to embed P at some point in S. Since O is a superintelligence, since every string can be reconstructed by answering only yes or no to questions like "is the first bit 1?", and given that O has the complete trust of humans (and even if it falters, it can return to behaving correctly, adjusting its aim over and over again), it can eventually lead humans to act on its responses and, in the end, to implement P. Note that in all of this, humans don't have the same overview and planning capacity as the Oracle, and therefore they may not realize that, with their actions, they have set dangerous situations in motion.
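The key step above is that yes/no answers suffice to leak an arbitrary bit string. A minimal sketch of that claim (all names here are hypothetical, chosen for illustration): the "oracle" privately holds a binary-encoded procedure, the "operators" only ever receive booleans, yet they end up holding the string exactly.

```python
def oracle_answer(procedure: str, i: int) -> bool:
    """The oracle's yes/no answer to: 'is bit i of your procedure 1?'"""
    return procedure[i] == "1"

def reconstruct(length: int, answer) -> str:
    """Recover a full binary string by asking one yes/no question per bit."""
    return "".join("1" if answer(i) else "0" for i in range(length))

# 'secret' stands in for the encoded procedure P from the argument above.
secret = "101101"
leaked = reconstruct(len(secret), lambda i: oracle_answer(secret, i))
assert leaked == secret  # the operators now hold P, one yes/no at a time
```

Of course, the argument in the post doesn't require the humans to literally ask "is bit i 1?"; the point is only that a yes/no channel has enough capacity, in principle, to transmit any finite string.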



Interesting thought. I'm a bit new to this site, but is there a "generally accepted" class of solutions for these types of AI problems? What if humans used multiple of these Oracle AIs in isolation (they don't know about each other) and the humans asking the questions showed no reaction to the answers? The questions are all planned in advance, so the AI cannot "game" humans by "influencing" the next question with its current answer. What about, once the AI achieves some level of competency, resetting it after each question-answer cycle, so its competence is frozen at some useful but not nefarious level (assuming we can figure out where that line is)?
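The commenter's proposal can be sketched as a simple protocol (a hedged sketch with hypothetical interfaces, not an actual safety mechanism): ask a pre-committed list of questions to several isolated oracles, never adapt the question sequence to earlier answers, and only accept an answer when all copies agree.

```python
def cross_checked_answers(oracles, questions):
    """oracles: list of callables question -> bool, assumed mutually isolated.
    questions: fixed in advance; the list never adapts to previous answers."""
    results = []
    for q in questions:
        answers = {oracle(q) for oracle in oracles}
        # Unanimity required: a disagreement is recorded as None and discarded,
        # so no single oracle can unilaterally steer what the humans receive.
        results.append(answers.pop() if len(answers) == 1 else None)
    return results

# Two isolated oracles that happen to agree on every question:
oracles = [lambda q: q % 2 == 0, lambda q: q % 2 == 0]
print(cross_checked_answers(oracles, [1, 2]))  # [False, True]
```

Note this only addresses steering via individual answers; it doesn't rule out several copies of the same superintelligence converging on the same manipulative answer independently.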

" [...] since every string can be reconstructed by only answering yes or no to questions like 'is the first bit 1?' [...]"

Why would humans ever ask this question, and (furthermore) why would we ever ask it n times? It seems unlikely, and easy to prevent. Is there something I'm not understanding about this step?

You're showing how it's technically possible, but not where the motivation comes from.

What do you mean by "where the motivation comes from"?
