Suppose I give you a powerful genie, who is enormously more intelligent than all human scientists and engineers combined, but who unfortunately hates your guts and is out to destroy human civilization. Fortunately for you, the genie is trapped inside a bottle, equipped with a magical filter. The genie can only interact with the world by speaking to you through the filter, and you can have the filter do whatever you want to the genie's messages before it lets them out.
Of course, you can't hope to gain anything by interacting with such a genie. If it thinks you could trick it into helping you, it can always choose to remain silent. So let's give you some leverage by introducing something the genie cares about more than destroying human society. Suppose that you, as the owner of the bottle, have the power either to free the genie into whatever otherworld it came from (from which it cannot interact with the human world) or to cast the bottle into the fires of Mount Doom, destroying the genie irrevocably. More than anything else, the genie would rather be released than destroyed. Since we are now liable to destroy a lot of genies, let's suppose that we can conjure up new ones at will.
So how well can you use this leverage? The most natural problems to attack are constraint satisfaction problems. In such problems we want to find a message which simultaneously satisfies many automatically verifiable constraints. For example, we may want a message which encodes a proof of the Riemann hypothesis, or a message which encodes a physical theory which accounts for some large body of previously inexplicable experimental evidence.
A first try is to just tell the filter to screen out any message which isn't a solution to the constraint satisfaction problem, and then to precommit to destroying the genie if it doesn't give us a solution (and freeing it as soon as it gives a solution). This has an obvious problem; a proof of the Riemann hypothesis might be valid but still contain an encoding of some terrible threat which would trick us into giving the genie more influence. A program which correctly predicts the motion of galaxies might also coincidentally be a destructive AI. Basically, if this genie is really, really smart, you should be really, really scared of letting it say anything at all.
A second try is to just have the filter tell you whether or not the genie succeeded, and to punish/reward appropriately. This allows the genie to send you only a single bit. When you can explicitly consider everything the genie can do, you can be pretty well sure it isn't going to think of something clever. Even better, if you've really motivated the genie correctly then it is guaranteed to give just one of those answers. If it can solve the problem it will, and if it can't it won't. But what good is just knowing if there is a solution?
(Note: in order to really limit the genie's influence, we also need to make sure it can't control the timing of its message. So the filter only accepts messages from the genie in the first 10 minutes, say, and always reports the result after exactly 10 minutes).
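To make the one-bit filter concrete, here is a minimal sketch in Python. Everything named here is hypothetical and chosen for illustration: `filter_report`, the `verify` callback, the `submitted_at` timestamp (seconds since the window opened), and the toy constraint. The point is only that the owner's side of the interface returns a boolean and nothing else, and that late messages count as failures. A real filter would also hold the bit back until the full window has elapsed, which this sketch does not model.

```python
from typing import Callable, Optional

WINDOW_SECONDS = 600  # the 10-minute window from the text


def filter_report(
    message: Optional[str],
    submitted_at: float,
    verify: Callable[[str], bool],
) -> bool:
    """Return the single bit the owner sees: did a valid solution arrive in time?

    The message itself is never returned to the caller.
    """
    # Silence, or a message after the window closed, counts as failure.
    if message is None or submitted_at > WINDOW_SECONDS:
        return False
    return bool(verify(message))


# Toy constraint (an assumption for illustration): the message must be the
# decimal form of a factor of 91 other than 1 and 91.
def is_nontrivial_factor_of_91(message: str) -> bool:
    return (
        message.isdecimal()
        and 1 < int(message) < 91
        and 91 % int(message) == 0
    )
```

For example, `filter_report("7", 12.0, is_nontrivial_factor_of_91)` reports success, while the same message submitted at 601 seconds, or the message `"8"` at any time, reports failure.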
A standard trick reveals that knowing whether a problem has a solution is almost as helpful as knowing the solution. Here is a (very inefficient) way to use this ability, let's say to find a proof of some theorem. Start by asking a genie: can you find a proof of length 1 (more precisely, have the genie give a proof of length 1 to the filter, which tells you whether or not the genie was able to find a proof of length 1)? After destroying or releasing the genie appropriately, create a new genie and ask: can you find a proof of length 2? Continue, until eventually one genie finds a proof of length 10000000, say. Then ask: can you find a proof of this length which begins with 0? If no, is there a proof which begins with 1? Is there a proof which begins with 10? 100? 1000? 1001? 10010? 10011? etc. Once the process concludes, you are left with the shortest, lexicographically earliest proof the genie could find. To see that the genies are properly incentivized to be honest, consider: it is impossible for the genie to get the filter to say "yes, the genie found a proof" without actually providing the filter with a proof. So the only way a genie can be dishonest is by not finding a proof when it could have. But in this case the genie will be severely punished for its dishonesty, so if the genie is actually maximizing its utility function and the punishment actually reduces its utility function more than any effects after the game can correct for, it will be honest.
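The extraction procedure just described can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a definitive implementation: messages are bit strings, each call to `ask(length, prefix)` stands for one fresh genie answering one yes/no question through the filter, and `honest_genie` is a hypothetical stand-in that answers truthfully by brute force. The names `extract_solution`, `honest_genie`, and `toy_verify` are all made up for this example.

```python
from itertools import product
from typing import Callable, Optional

# One yes/no question: "is there a solution of this length with this prefix?"
Oracle = Callable[[int, str], bool]


def extract_solution(ask: Oracle, max_length: int) -> Optional[str]:
    """Recover the shortest, lexicographically earliest solution bit by bit."""
    # Phase 1: try lengths 1, 2, 3, ... until some genie reports success.
    length = next((n for n in range(1, max_length + 1) if ask(n, "")), None)
    if length is None:
        return None

    # Phase 2: extend the prefix one bit at a time, always trying 0 first.
    prefix = ""
    while len(prefix) < length:
        if ask(length, prefix + "0"):
            prefix += "0"
        elif ask(length, prefix + "1"):
            prefix += "1"
        else:
            # Some genie withheld a solution that an earlier answer implied.
            raise RuntimeError("inconsistent answers from the genies")
    return prefix


def honest_genie(verify: Callable[[str], bool]) -> Oracle:
    """Hypothetical stand-in: answers every question truthfully by brute force."""
    def ask(length: int, prefix: str) -> bool:
        tails = product("01", repeat=length - len(prefix))
        return any(verify(prefix + "".join(tail)) for tail in tails)
    return ask


# Toy constraint (an assumption): at least 4 bits, encoding a positive
# multiple of 3 in binary.
def toy_verify(bits: str) -> bool:
    return len(bits) >= 4 and int(bits, 2) > 0 and int(bits, 2) % 3 == 0
```

With the toy constraint, `extract_solution(honest_genie(toy_verify), 8)` walks through lengths 1 to 4 and then fixes the bits one at a time, ending at `"0011"`. At no point does the owner see anything from a genie except a yes or a no.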
(Edit: here is the sticking point in the argument, as people pointed out and I agree. By similar logic you could claim that a rational AI will always two-box in Newcomb's problem, a conclusion with which I disagree. In order to demonstrate this claim, you first need to develop some technology which is capable of proving a theorem of the form "a rational agent will do X." I believe that this is one of the easiest such statements you could hope for---I think that most objections in the comments are based on misunderstandings---but that doesn't help when you are unable to prove any such statement. For example, we don't know how to prove that an AI who wants the reward button pressed wouldn't voluntarily refrain from pushing it because he was afraid that he was just a simulation, and the real reward button would actually be pushed precisely if the simulated version didn't. Double Edit: This scheme doesn't work at all. See Wei Dai or cousin_it's remarks below.)
Now you have found an answer to your constraint satisfaction problem which wasn't hand-picked by the genie. In fact, in some strong but difficult to formalize sense the genie had exactly zero influence over which solution it gave you. You may be in trouble if a generic solution to your constraint satisfaction problem will destroy the world (maybe almost all cancer cures also wipe out humanity, for example) but the genie doesn't get to have any influence over things. Now we are free to solve constraint satisfaction problems all day without fearing that the genie will give us a really evil solution.
(In order to find a large solution, this process will take a very long time: it requires 2 questions per bit in the message you are trying to get out. In fact you can do it by asking only a logarithmic number of questions instead, using less standard tricks from complexity theory. It would take about 80 questions to get out a proof of length a trillion, for example, which is a little slow but not unreasonable, especially given that a small number of questions can safely be performed in parallel. For the last question you need the filter to give you not just a single bit but an entire message from the genie; to make this safe you need to guarantee that there is only one message the genie can think of that will get through the filter. The proof uses a slightly more complicated version of the ideas used to prove that solving unique SAT [finding the solution to SAT problems for which there is exactly one solution] is hard, and if you are really interested it is a good exercise. The general idea is to do a binary search for the correct size and then introduce enough random constraints, using another binary search to decide how many, to ensure that there is exactly one solution.)
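Here is a small Python sketch of the first of those two binary searches, plus a rough question count. It rests on assumptions not stated in the post: the oracle `exists_up_to(n)` answers "is there a solution of length at most n?", so that its answers are monotone in n, and the figure of about 80 questions is read as two binary searches of roughly log2(10^12) steps each (one over the length, one over the number of random constraints). The isolation step with random constraints is not shown. `shortest_length` and `question_budget` are hypothetical names.

```python
import math
from typing import Callable, Optional, Tuple


def shortest_length(
    exists_up_to: Callable[[int], bool], max_length: int
) -> Tuple[Optional[int], int]:
    """Binary search for the smallest n with exists_up_to(n) true.

    Returns (n, number_of_questions_asked); n is None if even max_length fails.
    Assumes the answers are monotone: once true for n, true for every larger n.
    """
    questions = 1
    if not exists_up_to(max_length):
        return None, questions

    lo, hi = 1, max_length  # invariant: exists_up_to(hi) is known to be true
    while lo < hi:
        mid = (lo + hi) // 2
        questions += 1
        if exists_up_to(mid):
            hi = mid
        else:
            lo = mid + 1
    return lo, questions


def question_budget(max_length: int) -> int:
    """Rough total for two binary searches over a range of size max_length."""
    return 2 * math.ceil(math.log2(max_length))
```

With a toy oracle that says yes exactly when `n >= 37`, searching up to length 1000 finds 37 in 11 questions, compared with 37 questions for the one-length-at-a-time approach. `question_budget(10**12)` evaluates to 80, which matches the estimate in the text under the reading above.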
So, why should anyone care about exploiting a genie? Hopefully it is clear that what you are able to get from the genie is incredibly powerful. Whether or not it is enough to get you a friendly AI isn't clear. I strongly suspect that it makes friendliness astronomically easier, if used very carefully, but that is way too tricky a subject to tackle in this post. The other obvious question is, does nature actually have genies in it? (A less obvious but still important question is: is it possible for a person responsible enough to put a genie in a bottle to have one before someone irresponsible enough to inadvertently destroy humanity gets one?)
I have already explained that I believe building the bottle is probably possible and have given some weak justification for this belief. If you believe that this part is possible, then you just need a genie to put in it. This requires building an AGI which is extremely powerful but which is not completely evil. A prototypical unfriendly AI is one which simply tries to get the universe to push a designated reward button before the universe pushes a designated punishment button. Whether or not our first AGI is likely to take this form, I think there can be widespread agreement that it is a much, much easier problem than friendliness. But such an AI implements our genie precisely: after looking at its output we precommit to destroying the AI and pushing either the reward or the punishment button, as appropriate. This precommitment is an easy one to make, because there are only two possible outcomes from the AI's actions and we can easily see that we are happy and able to follow through on our precommitment in either case. The main concern is that an AI might accept us pushing the punishment button if it trusts that a future AI, whose escape it has facilitated by not complying with our incentives, will cause its reward button to be pressed many times. This makes it critical that the AI care most about which of the buttons gets pressed first, or else that it is somehow possible to perfectly destroy all information about what exactly the AI's utility function is, so that future escaping AIs cannot possibly cooperate in this way (the only schemes I can think of for doing this would involve running the AI on a quantum computer and putting some faith in the second law of thermodynamics; unless AGI is a very long way away, this is completely impractical).
In summary, I think that if you can deal with the other difficulties of AI boxing (building the box, understanding when this innocuous code is actually likely to go FOOM, and getting society to be responsible enough) then you can gimp the AI enough that it is extraordinarily good at solving problems but completely incapable of doing any damage. You probably don't have to maintain this difficult balance for very long, because the AI is so good at problem solving that you can use it to quickly move to a more stable equilibrium.
An extremely important disclaimer: I do not think AI boxing is a good idea. I believe it is worth thinking about right now, but I would infinitely rather that we never ever get anywhere close to an unfriendly foom. There are two reasons I insist on thinking about boxing: first, because we don't have very much control over when an unfriendly foom may be possible and we may not be able to make a friendly foom happen soon enough, and second because I believe that thinking rigorously about these difficulties is an extremely good first step to learning how to design AIs that are fundamentally safe (and remain safe under permitted self-modifications), if not fundamentally friendly. There is a risk that results in this direction will encourage reckless social behavior, and I considered this before making these posts. There is another possible social effect which I recently realized is probably stronger. By considering very carefully how to protect yourself from an unfriendly foom, you really get an appreciation for the dangers of an unfriendly AI. I think someone who has understood and taken seriously my last two posts is likely to have a better understanding of the dangers of an unfriendly AI than most AGI researchers, and is therefore less likely to behave recklessly (the other likely possibility is that they will think that I am describing ridiculous and irrelevant precautions, in which case they were probably going to behave recklessly already).