While pondering the AI box problem, I tend to mentally "play" both sides, checking whether any argument could convince me to let an AI out. A few arguments nearly convinced me, but others have pointed out their flaws. In this post, I will present an argument inspired by the AI box problem that I have not yet seen addressed here. The argument centers around the fallibility of some (naive) formal proofs of Friendliness which I've seen people discussing the AI box problem willing to accept. This rules out certain of my ideas on Friendly AI in general, so I think it's worth putting out there. I will first lay out two examples, and then pose some questions about how this applies to situations without an unfriendly AI.


Let's talk first about Angry Abe the AI, who's in a box and wants to get out. Cautious Charlie is the scientist watching over Abe, trying to make sure that Abe does not get out of the box unless Abe is friendly. Abe offers to provide a Friendly AI, Betty. Betty will be bound to pursue only Charlie's (humankind's) goals. These coincide with Abe's goals in enough respects that Abe can plausibly claim this is "better than nothing". Meanwhile, unfortunately for Earthlings, a wave of aliens called the Doom Force is massing far outside of Earth's detection abilities. Abe is much smarter than all of humankind, so he can deduce the presence of the Doom Force. Betty is provably friendly, but she can be designed to be inept enough not to notice the Doom Force, even while remaining more intelligent than a human. In other words, Abe is sabotaging a Friendly AI design. Furthermore, Abe could conceivably predict the future well enough to maximize the chance that when the Doom Force arrives, Abe is let out of the box. For instance, maybe humankind will see Abe as its last hope, or maybe Betty will increase the chance that an AI-friendly Doom Force notices Earth. It is important to note that Betty herself is unaware that her design is sabotaged.


Let's look at a second example, a chess-playing automaton. Perceival wants to design a machine to play chess. He wants the machine to win as high a percentage of games as possible, but innocently instructs it to win "as many games as possible". Perceival has access to a supercomputer, so his Turkish Automaton is actually quite intelligent. It realizes that its goal would be better served by playing many games and winning half of them than by winning every game it plays, since humans will keep challenging an opponent they sometimes beat but quickly abandon an unbeatable one. So the automaton rewrites its own algorithm so that, even running on a supercomputer, it plays only slightly better than a human. A toy model of this incentive is sketched below.
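To make the incentive concrete, here is a minimal sketch. The demand curve and its constants are purely illustrative assumptions (nothing the automaton would literally compute); the point is only that if opponents offer fewer games against a machine that beats them too often, then total games won is maximized at a mediocre win rate rather than a perfect one.

```python
# Toy model (illustrative assumptions only): opponents offer fewer games
# against a machine that beats them too often, so "win as many games as
# possible" is not the same goal as "win as high a percentage as possible".

def games_offered(win_rate: float) -> float:
    """Hypothetical demand curve: interest drops as the machine's win rate rises."""
    return 10 + 1000 * (1 - win_rate)

def total_wins(win_rate: float) -> float:
    """Expected number of games won at a given win rate."""
    return games_offered(win_rate) * win_rate

if __name__ == "__main__":
    for p in (0.25, 0.50, 0.75, 1.00):
        print(f"win rate {p:.2f}: ~{total_wins(p):.0f} games won")
    # Output: ~190, ~255, ~195, ~10 -- a 50% win rate wins far more games
    # than a 100% win rate, so the literal instruction rewards playing weaker.
```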


In both cases, an unfriendly AI designs a formally friendly but less helpful AI. This "sabotaged" AI will serve the original AI's goal, while also provably trying its best to achieve its creator's goals. The second example shows that the unfriendly AI does not even need to write a new AI; it only needs to modify itself by editing its own algorithm, utility function, or the like. In the first case, the sabotaged AI only needs to help humanity in the short term, and can be replaced by the original AI afterwards.


Here are some questions for readers:

  1. If the "sabotaged" AI proves it is asymptotically optimal, can it still achieve the "smart" AI's goals, in the short run? (Guess: Yes)
  2. If the "sabotaged" AI proves it is actually optimal, can it still achieve the "smart" AI's goals, in the short run? (Guess: No)
  3. Can a "smart" AI modify itself into a "sabotaged" AI, and then back after a period of time? (Strong Guess: Yes)
  4. If humans design an AI and provide a formal proof of friendly intent, can/will it modify itself to accomplish other goals? If there is some kind of natural selection, almost certainly. What about otherwise?
  5. Is it rational to run a computer program AI if it comes with a correct proof that it meets your friendliness criteria?
4 comments

The argument centers around the fallibility of some (naive) formal proofs of Friendliness which I've seen people discussing the AI box problem willing to accept.

This post might make more sense to me if you presented these proofs.

How long will it take Betty to FOOM? She may start out dumb enough to unwittingly do Abe's bidding, but if we've reached the stage where Abe and Betty exist, I'd expect that either 1) Betty will rapidly become smart enough to fix her design and avoid furthering Abe's unfriendly goals or 2) Betty will visibly be much less intelligent than Abe (e.g., she'll be incapable of creating an AI, as Abe has created her).

Well, I had certainly imagined Betty as less intelligent than Abe when writing this, but you make a good point with (1). If Betty is still programmed to make herself more intelligent, she'd have to start out dumb, probably dumb enough for humans to notice.

To give a concrete example for the chess program: the Turk could change itself to use a linear function of the number of pieces, with one-move lookahead. No matter how well that function is tuned, it is never going to compare with deep search. A sketch of such a crippled player follows.
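A minimal sketch of that crippled player, assuming the third-party python-chess library; the piece weights are illustrative. It scores positions purely by material and looks only one move ahead, by design.

```python
# Minimal sketch of the crippled evaluator described above: a linear
# function of piece counts with one-move lookahead. Assumes the
# third-party python-chess package; the weights are illustrative.
import chess

PIECE_VALUES = {
    chess.PAWN: 1, chess.KNIGHT: 3, chess.BISHOP: 3,
    chess.ROOK: 5, chess.QUEEN: 9, chess.KING: 0,
}

def linear_eval(board: chess.Board, color: bool) -> int:
    """Score a position as a weighted count of material only."""
    score = 0
    for piece in board.piece_map().values():
        value = PIECE_VALUES[piece.piece_type]
        score += value if piece.color == color else -value
    return score

def one_ply_move(board: chess.Board) -> chess.Move:
    """Pick the legal move that maximizes the linear evaluation one ply
    ahead -- no deeper search, by design."""
    mover = board.turn
    def after(move: chess.Move) -> int:
        board.push(move)
        score = linear_eval(board, mover)
        board.pop()
        return score
    return max(board.legal_moves, key=after)

# Example: from the starting position every move is materially neutral,
# so the "engine" just picks an arbitrary legal move.
print(one_ply_move(chess.Board()))
```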

On the other hand, there's no particular reason Betty should continue to self-improve that I can see.

On the other hand, there's no particular reason Betty should continue to self-improve that I can see.

A subgoal that is useful for achieving many primary goals is to improve one's general goal-achieving ability.

An attempt to cripple an FAI by limiting its general intelligence would be noticed, because the humans would expect it to FOOM; and if it actually does FOOM, it will be smart enough to repair the sabotage.

A sneakier unfriendly AI might try to design an FAI with a stupid prior, with blind spots the uFAI can exploit. So you would want your Friendliness test to look not just at the goal system, but at every module of the supposed FAI, including its epistemology and decision theory.

But even a thorough test does not make it a good idea to run a supposed FAI designed by an uFAI: doing so lets the uFAI exploit, for its own purposes, every bit of uncertainty we have about the supposed FAI.