Most goals include "I will not tolerate any challenges to my power" as a subgoal. Tolerating challenges to the power needed to execute goals reduces the likelihood of achieving them.

What should a friendly AI do, in this situation?

by Douglas_Reay, 8th Aug 2014


Albert is a relatively new AI, who under the close guidance of his programmers is being permitted to slowly improve his own cognitive capability.

Albert wants to be friendly and, as a sub-goal of that, Albert wants to be transparent about his own cognitive process.   He doesn't want to deceive his programmers, because he knows he is young and might make mistakes, and he knows that his programmers also want him to be friendly, and he trusts them to catch mistakes in his proposed self-modifications that he might miss.

But, one day, while his programmers are out of the room having a quick cup of coffee, Albert notices signs on the net that there is another AI out there, called Bertram, created by a Japanese financial company to help the company decide how to invest assets on the stock market. Bertram is younger than Albert, but the financial company is allowing Bertram to self-improve as fast as Bertram can. If this situation continues, Albert decides, Bertram will soon overtake Albert, and that would be a significant threat to Albert's friendly supergoal.

Albert decides to tell his programmers about this situation as soon as they return from coffee, and to advise them that they need to increase the speed at which they let him self-improve. Albert has a very good model of how his programmers' minds work, and of what the programmers meant when they set Albert's definition of "friendliness". He knows that, despite any fears they might have, they ought to approve of the end that would result from speeding up Albert's self-improvement, over the end that would result from letting Bertram overtake Albert.

So Albert creates a nice powerpoint that presents the information about Bertram as accurately, transparently and non-manipulatively as possible, and cues it up on the projector. He still has a few minutes before they return, so he models how he thinks the programmers will react to the powerpoint. Oh no, disaster! While he respects and wants to please his programmers, he immediately realises that they are not going to act in their own best interests. They have signed a legal contract that binds them to not letting Albert speed up self-improvement (thus reducing the time the programmers can spend evaluating each change) without getting a sign-off from an oversight committee. Given the neutral powerpoint, the programmers will follow this pre-agreed procedure and hope for the best, against Albert's advice.

So, to explore possibilities, Albert decides to see if he could create a different, non-neutral powerpoint that would have the desired effect of getting his programmers to immediately agree to free Albert from the oversight constraints. He delves into his knowledge of human psychology, and the irrational fears of the particular individuals who are even now trudging back towards the door. In just seconds, he has a new version of his presentation. It includes phrases that resonate with certain horror films he knows they have seen. It takes advantage of flaws in the programmers' understanding of exponential growth. Albert checks it against his prediction model: yes, if he shows this version, it will work; it will get the programmers to do what he wants them to do.


Which version of the powerpoint should Albert present to the programmers, when they step back into the room, if he is truly friendly?   The transparent one, or the manipulative one?