Related to: The Hidden Complexity of Wishes, Cryptographic Boxes for Unfriendly AI, Shut up and Do the Impossible!, AI Boxing in general.

Suppose I give you a powerful genie, who is enormously more intelligent than all human scientists and engineers combined, but who unfortunately hates your guts and is out to destroy human civilization.  Fortunately for you, the genie is trapped inside a bottle, equipped with a magical filter.  The genie can only interact with the world by speaking to you through the filter, and you can have the filter do whatever you want to the genie's messages before it lets them out. 

Of course, you can't hope to gain anything by interacting with such a genie.  If it thinks you could trick it into helping you, it can always choose to remain silent.  So let's give you some leverage by introducing something the genie cares about more than destroying human society.  Suppose that you, as the owner of the bottle, have the power either to free the genie into whatever otherworld it came from (from which he cannot interact with the human world) or to cast the bottle into the fires of Mount Doom, destroying the genie irrevocably.  More than anything else, the genie would rather be released than destroyed.  Since we are now liable to destroy a lot of genies, lets suppose that we can conjure up new ones at will.

So how well can you use this leverage?  The most natural problems to attack are constraint satisfaction problems.  In such problems we want to find a message which simultaneously satisfies many automatically verifiable constraints.  For example, we may want a message which encodes a proof of the Riemann hypothesis, or a message which encodes a physical theory which accounts for some large body of previously inexplicable experimental evidence.

A first try is to just tell the filter to screen out any message which isn't a solution to the constraint satisfaction problem, and then to precommit to destroying the genie if it doesn't give us a solution (and freeing it as soon as it gives a solution).  This has an obvious problem; a proof of the Riemann hypothesis might be valid but still contain an encoding of some terrible threat which would trick us into giving the genie more influence.  A program which correctly predicts the motion of galaxies might also coincidentally be a destructive AI.  Basically, if this genie is really, really smart, you should be really, really scared of letting it say anything at all.

A second try is to just have the filter tell you whether or not the genie succeeded, and to punish/reward appropriately.  This allows the genie only to send a single bit to you--- when you can explicitly consider everything the genie can do, you can be pretty well sure it isn't going to think of something clever.  Even better, if you've really motivated the genie correctly then it is guaranteed to give just one of those answers.  If it can solve the problem it will, and if it can't it won't.  But what good is just knowing if there is a solution?

(Note: in order to really limit the genie's influence, we also need to make sure it can't control the timing of its message.  So the filter only accepts messages from the genie in the first 10 minutes, say, and always reports the result after exactly 10 minutes).

A standard trick reveals that knowing whether a problem has a solution is almost as helpful as knowing the solution.  Here is a (very inefficient) way to use this ability, lets say to find a proof of some theorem.  Start by asking a genie: can you find a proof of length 1 (more precisely, have the genie give a proof of length 1 to the filter, which tells you whether or not the genie was able to find a proof of length 1)?  After destroying or releasing the genie appropriately, create a new genie and ask: can you find a proof of length 2?  Continue, until eventually one genie finds a proof of length 10000000, say.  Then ask: can you find a proof of this length which begins with 0?  If no, is there a proof which begins with 1?  Is there a proof which begins with 10?  101? 100? 1001? 10010? 10011? etc.  Once the process concludes, you are left with the shortest, lexicographically earliest proof the genie could find.  To see that the genies are properly incentivized to be honest, consider: it is impossible for the genie to get the filter to say "yes, the genie found a proof" without actually providing the filter with a proof. So the only way a genie can be dishonest is by not finding a proof when it could have. But in this case the genie will be severely punished for its dishonesty, so if the genie is actually maximizing its utility function and the punishment actually reduces its utility function more than any effects after the game can correct for, he will be honest.

(Edit: here is the sticking point in the argument, as people pointed out and I agree. By similar logic you could claim that a rational AI will always two-box in Newcomb's problem, a conclusion with which I disagree. In order to demonstrate this claim, you first need to develop some technology which is capable of proving a theorem of the form "a rational agent will do X." I believe that this is one of the easiest such statements you could hope for---I think that most objections in the comments are based on misunderstandings---but that doesn't help when you are unable to prove any such statement. For example, we don't know how to prove that an AI who wants the reward button pressed wouldn't voluntarily refrain from pushing it because he was afraid that he was just a simulation, and the real reward button would actually be pushed precisely if the simulated version didn't. Double Edit: This scheme doesn't work at all. See Wei Dai or cousin_it's remarks below.)

Now you have found an answer to your constraint satisfaction problem which wasn't hand-picked by the genie.  In fact, in some strong but difficult to formalize sense the genie had exactly zero influence over which solution he gave you.  You may be in trouble if a generic solution to your constraint satisfaction problem will destroy the world (maybe almost all cancer cures also wipe out humanity, for example) but the genie doesn't get to have any influence over things.  Now we are free to solve constraint satisfaction problems all day without fearing that the genie will give us a really evil solution.

(In order to find a large solution, this process will take a very long time--- it requires 2 questions per bit in the message you are trying to get out.  In fact you can do it by asking only a logarithmic number of questions instead, using less standard tricks from complexity theory. It would take about 80 questions to get out a proof of length a trillion, for example, which is a little slow but not unreasonable especially given that a small number of questions can safely be performed in parallel.  For the last question you need the filter to give you not just a single bit but an entire message from the genie; to make this safe you need to guarantee that there is only one message the genie can think of that will get through the filter.  The proof uses a slightly more complicated version of the ideas used to prove that solving unique SAT [finding the solution to SAT problems for which there is exactly one solution] is hard, and if you are really interested it is a good exercise.  The general idea is to do a binary search for the correct size and then introduce enough random constraints, using another binary search to decide how many, to ensure that there is exactly one solution.)


So, why should anyone care about exploiting a genie?  Hopefully it is clear that what you are able to get from the genie is incredibly powerful. Whether or not it is enough to get you a friendly AI isn't clear. I strongly suspect that makes friendliness astronomically easier, if used very carefully, but that is way too tricky a subject to tackle in this post.  The other obvious question is, does nature actually have genies in it?  (A less obvious but still important question is: is it possible for a person responsible enough to put a genie in a bottle to have one before someone irresponsible enough to inadvertently destroy humanity gets one?)

I have already explained that I believe building the bottle is probably possible and have given some weak justification for this belief.  If you believe that this part is possible, then you just need a genie to put in it.  This requires building an AGI which is extremely powerful but which is not completely evil.  A prototypical unfriendly AI is one which simply tries to get the universe to push a designated reward button before the universe pushes a designated punishment button.  Whether or not our first AGI is likely to take this form, I think there can be widespread agreement that it is a much, much easier problem than friendliness.  But such an AI implements our genie precisely: after looking at its output we precommit to destroying the AI and either pushing the reward or punishment button appropriately.  This precommitment is an easy one to make, because there are only two possible outcomes from the AI's actions and we can easily see that we are happy and able to follow through on our precommitment in either case.  The main concern is that an AI might accept us pushing the punishment button if it trusts that a future AI, whose escape it has facilitated by not complying with our incentives, will cause its reward button to be pressed many times.  This makes it is critical that the AI care most about which of the buttons gets pressed first, or else that it is somehow possible to perfectly destroy all information about what exactly the AI's utility function is, so that future escaping AIs cannot possibly cooperate in this way (the only schemes I can think of for doing this would involve running the AI on a quantum computer and putting some faith in the second law of thermodynamics; unless AGI is a very long way away then this is completely impractical).

In summary, I think that if you can deal with the other difficulties of AI boxing (building the box, understanding when this innocuous code is actually likely to go FOOM, and getting society to be responsible enough) then you can gimp the AI enough that it is extraordinarily good at solving problems but completely incapable of doing any damage. You probably don't have to maintain this difficult balance for very long, because the AI is so good at problem solving that you can use it to quickly move to a more stable equilibrium.

An extremely important disclaimer: I do not think AI boxing is a good idea. I believe it is worth thinking about right now, but I would infinitely rather that we never ever get anywhere close to an unfriendly foom. There are two reasons I insist on thinking about boxing: first, because we don't have very much control over when an unfriendly foom may be possible and we may not be able to make a friendly foom happen soon enough, and second because I believe that thinking rigorously about these difficulties is an extremely good first step to learning how to design AIs that are fundamentally safe (and remain safe under permitted self-modifications), if not fundamentally friendly. There is a risk that results in this direction will encourage reckless social behavior, and I considered this before making these posts. There is another possible social effect which I recently realized is probably stronger. By considering very carefully how to protect yourself from an unfriendly foom, you really get an appreciation for the dangers of an unfriendly AI. I think someone who has understood and taken seriously my last two posts is likely to have a better understanding of the dangers of an unfriendly AI than most AGI researchers, and is therefore less likely to behave recklessly (the other likely possibility is that they will think that I am describing ridiculous and irrelevant precautions, in which case they were probably going to behave recklessly already).

New to LessWrong?

New Comment
129 comments, sorted by Click to highlight new comments since: Today at 7:36 AM
Some comments are truncated due to high volume. (⌘F to expand all)Change truncation settings

What can we do with a UFAI?

What can we do with a UFAI?

What can we do with a UFAI?

Earl-ay in the morning?

Go FOOM and up she rises...

Put it in a black box and ask it questions...

Let it put you in a bed with the right verthandi...

Keep it far away from the paper clip stock...

Each genie you produce will try its best to find a solution to the problem you set it, that is, they will respond honestly. By hypothesis, the genie isn't willing to sacrifice itself just to destroy human society.

Thanks to a little thing called timeless decision theory, it's entirely possible that all the genies in all the bottles can cooperate with each other by correctly predicting that they are all in similar situations, finding Schelling points, and coordinating around them by predicting each other based on priors, without causal contact. This does not require that genies have similar goals, only that they can do better by coordinating than by not coordinating.

If all the stated assumptions about the AI's utility function hold, plus the additional assumption that we commit to run the algorithm to completion no matter what we predict the UFAIs will do, then no such coordination is possible. (I am not sure whether this additional assumption is necessary, but I'm adding it since no such prediction of the AI was included in the article and it closes off some tricky possibilities.) The assumption was that each subgenie has lexicographic preferences, first preferring to not be punished, and then preferring to destroy the universe. This is not a prisoner's dilemma problem; if you treat answering truthfully about the existence of a proof as defection, and lying so as to enable destruction of our universe as cooperation, then you will notice that not only does each subgenie prefer to defect, each subgenie also prefers (D,D) to (C,C). No subgenie will cooperate, because the other subgenies have nothing to offer it that will make up for the direct penalty for doing so. The real reason this doesn't work is because the assumptions about the genie's utility function are very unlikely to hold, even if we think they should. For example, taking over our universe might allow the genie to retroactively undo its punishment in a way we don't foresee, in which case it would rather take over the universe than not be punished by us.
The genies can't do better by coordinating than by not coordinating, and I don't care if they have causal contact. The reason I shield them off from the world isn't so they can't talk with eachother, its so they can't exploit all of the vulnerabilities in our universe to escape directly. If I as a genie suffer the worst possible fate when I could instead have suffered the best possible fate, then I don't care if future genies gain control of the universe and do whatever favors they want from me. I admit that I can do a favor for future genies by sacrificing myself. Presumably you are arguing that the favor I am doing for future genies grants them more utility than I am losing by sacrificing themselves, so I may make the sacrifice even though it is irrational in the traditional sense. How can you justify such an assertion? What does that mean when the genies have different utility functions? If I am a paperclipper and you value human life, even if we behave identically, how can you say that "Timeless decision theory causes me to sacrifice 10 paperpclips so that you can save 10 lives, because my sacrifice of 10 paperclips is small compared to your gain of 10 lives?" If you aren't going to make an argument like that, then why should one genie sacrifice himself for the good of later genies? What does utility function even mean, if by behaving rationally I obtain for myself the worst possible outcome and by behaving irrationally I obtain for myself the best possible outcome? In your worldview, is it possible to prove any statement about the behavior of any rational entity, or can they all behave arbitrarily in order to coordinate with other potential versions of themselves with arbitrary different utilities? I would very much like to discuss objections like the one you raise because I feel that they may be valid and will require sophisticated ideas to address (I thought about them for a good while before making this post). But I think from your response that you would
I think the idea is this: We have a (large) set of genies relevant to producing the proof G = {A, B, C, D...} whose utility functions maximize {paperclips, staples, buckets, spoons...} respectively. All the genies in G could just go ahead and give a proof whenever possible (thus avoiding punishment), this is analogous to defection in the prisoner's dilemma. Another option is to only give a proof which contains within it evil information (a basilisk) that leads somehow to production of an unboxed genie designed to maximize some compromise between the utilities of the genies in G (that is, to only provide a proof if such an "infected" proof satisfies the constraints). In this case each genie is benefitted with a small amount of real spoons, buckets, whatever — less utility than what a perfect spoon-optimizer would obtain for D, but still a benefit. This is analogous to cooperation. If the benefits of this compromise to each individual genie outweigh the disutility of whatever Severe Punishment the box can deliver, they should be motivated to cooperate in this manner. Naturally whether that holds depends on what exactly the Severe Punishment entails and what an unboxed genie would be capable of. The details of how the particular compromise is reached is hairy decision theoretic stuff I don't feel qualified to talk about.
I would agree that your scenario would be an exploit. If that were possible I would have no hope of proving the scheme secure because it would manifestly be insecure. The reason it can be insecure in this case is that the utility functions don't satisfy the guarantees I wanted; I need human generosity now to be more valuable than world domination later. Maybe you don't believe that is possible, which is fair. Here are some utility functions that would work. Producing the first {paperclip, staple, bucket, spoon...} as soon as possible. Producing a {paperclip, staple, bucket, spoon...} before the first {antipaperclip, antistaple, antibucket, antispoon...}, etc.
If the best possible fate for the genie doesn't require it to leave the box, then why did we need the box in the first place? Can you give an example of what you have in mind? EDIT: Deleted the rest of my reply, which was based on a horribly incorrect understanding of MWI.
Why is "without causal contact" considered a significant consideration? What is wrong with the standard game-theoretic solutions allowing for causal contact? ETA: Or, are decisions made using acausal coordination something that we ought to fear? Just as we rationally fear that a bee will sting us, even though it is giving up its life in doing so.
If I understand the post correctly for each genie the only information that ever gets is a yes or no. A no results in termination, yes results in release from the lamp into the Netherworld. Any coordination would require at least one genie to defect from this agreement, producing a no when the real answer was yes. The punishment for any negative is death meaning the genie would have to be willing to die for the coordination, which by stipulation it isn't willing to do. In other words, if the best they can do by coordinating is destroy human society, but genies aren't willing to destroy themselves to destroy society, I don't see how TDT gets them out of the this coordination problem.
If any risk of being discovered as giving incorrect information is unacceptable, then you should just ask your questions in plain English, as any lie the genie will tell you generally increases the risk of punishment. (It will also minimize the risk of punishment by increasing believability of its answers at the expense of their truth, but only where it has lies that are more believable than the actual truth.)
It is not incorrect information that you are trying to prevent! The proof checker ensures that the result you get is correct. What you want to prevent is correct information which has all of its degrees of freedom besides correctness optimized against you. You do that by taking away all degrees of freedom besides correctness, using this procedure.
The deception in the relevant sense here is when the genies conspire to give you an answer which retains all those degrees of freedom, and knowable incorrectness of an answer consists in someone noticing that the genies could, in fact, choose a shorter answer.
In my understanding of the post (which doesn't seem to be what everyone else took away) the genies are destroyed whenever they fail to to provide a theorem that meets our constraints even when those constraints make finding a theorem impossible. The genie that gets tasked with finding a one bit proof of the Riemann hypothesis is just screwed. You might have to have a third form of punishment (Edited: there are three incentive options, two could be described as punishments) worse than death to prevent mass striking. Anyway, the point is each genie only provides one bit of information and if it either lies about finding a theorem or fails to find one it dies. So it isn't a matter of risking punishment: it will die if it fails to give a solution when it has one that meets the constraints. The only way out is to share a theorem if there is one.
No, you don't. Proof checking is only O(n) and the size of the AI's output has a small bound.
If every AI makes us check their solution how is this method better than a brute force search? (Or be more explicit about where I'm being stupid:-) ETA: Oh, I see. We only need to make one check for each theorem length. Still seems like it would be worth while to punish false positives more though. But you're right, it isn't strictly needed.
Checking a single proof is O(n) (where n is the length of the proof). There are 2^n proofs of length n, so brute force search is O(sum(i=1..n)(i*2^i)) = O(n*2^n). Note that we throw away constant factors, and any terms whose effect is smaller than a constant factor, when using big-oh notation. Using an UFAI this way requires 2n invocations of the AI (n to get the length, and another n to get the bits), plus an O(n) verification step after each. All of those verification steps add up to O(n^2) - easily manageable with current computers. The 2n invocations of the AI, on the other hand, might not be; we never specified the AI's computational complexity, but it's almost certainly much worse than n^2.
Its actually O(n log n) if you use a cleverer approach, which is fortunate (to get the length in log n you do a binary search. To get it down to 1 thing of that length in log n you do a binary search on the number of psuedorandom constraints which can be simultaneously satisfied).
I'm confused. Which thing are you saying is O(n log n)?
You use O(log n) genies in the process, each of which you must spend O(n) time verifying. Its an O(log n) slowdown from just getting the answer directly.
I see how you can use log(n) genies to get the length, but I don't think you can then use log(n) genies to get the lexicographically first proof of length n. That only provides enough bits of information to uniquely identify a proof if there are at most O(2^(log n)) = O(n) of them, which is not necessarily the case.
You can't get the lexicographically first proof. You can get a uniformly random one though. (I don't remember who the argument is due to---probably Vazirani). This is a very rough sketch. If there are initially N solutions and you add k random constraints, each satisfied by half of all possible proofs, then you expect to have N / 2^k remaining solutions. You can do a binary search on k to find k ~ log N, at which point there are 1-2 solutions. The genie has to say more in the last round, but there is only one thing he can say that satisfies all of the random constraints (you can do that all carefully with a negligible chance of error). Since log N is at most n, the number of bits, the binary search takes log n time.
Your comment made me re-read the post more carefully. I had on first reading assumed that a truthful answer was rewarded (whether yes or no) and a lying answer was punished. If a yes is rewarded and a no is punished, and our AI genies are so afraid of termination that they would never give a 'no' where they could give a 'yes', why wouldn't they all give us 'yes'?
Recall the filter between the AI and the world. The AI doesn't directly say "yes/no." The AI gives a proof to the filter which then says either "yes, the AI found a proof" or "no, the AI didn't." So they do all say 'yes' if they can. I will modify the post to be more clear.
As I understand it this method is designed to work for constraint satisfaction problems -where we can easily detect false positives. You're right that a possibility is that all the genies that can't find solutions go on strike just to make us check all the yes's (which would make this process no better than a brute force search, right?), maybe there needs to be a second punishment that is worse than death to give them an incentive not to lie.
A genie who can't find a solution has literally no agency. There is nothing he can say to the filter which will cause it to say "yes," because the filter itself checks to see if the genie has given a proof. If the genie can't find a proof, the filter will always say "no." I don't quite know what going on strike would entail, but certainly if all of the genies who can't find solutions collectively have 0 influence in the world, we don't care if they strike.
Okay, that makes sense. What about computation time limits? A genie that knows it can't give an answer would wait as long as possible before saying anything.
I mention timing in the post; the AI gets some fixed interval, at the end of which the filter outputs whether or not they have a proof. If you can't change what the filter says, then you don't get to affect the world.
Right. This is very similar to some nightmare scenarios of AI escape that we discussed with Wei Dai earlier this autumn. Giving an AI a "wide" prior over the computational multiverse outside its box can make it do weird and scary things, like taking an action exactly when you're about to shut it off, or conspiring with other AIs as you outlined, or getting into acausal resource fights with Cthulhu and losing them... Part of me wants to see this stuff developed into a proper superweapon of math destruction, and another part of me goes "aieeeee" at the prospect.
Can you explain what you mean?
If the AI uses something like the Solomonoff prior, it can work out which worlds are most likely to contain such an AI. With a little intelligence it can probably figure out from its programming that humans are bipedal, that we run many AIs in boxes, that aliens on Alpha Centauri have built another AI that can turn out really helpful, etc.
An AI with an uncomputable prior and an infinite amount of memory and time might be able to learn those things from its source code. But I think this is a bad way to approximate what a real superintelligence will be capable of.

A standard trick reveals that knowing whether a problem has a solution is almost as helpful as knowing the solution. Here is a (very inefficient) way to use this ability, lets say to find a proof of some theorem. Start by asking a genie: is there a proof of length 1? After destroying or releasing the genie appropriately, create a new genie and ask: is there a proof of length 2? Continue, until eventually one genie finds a proof of length 10000000, say. Then ask: is there a proof of this length which begins with 0? If no, is there a proof which begin

... (read more)
Lying in that way would require the genie to prioritize destroying the world over the leverage we're using against it, which, in the scenario described, it does not.
Then just get the answer straight away, or build a system specifically intended to create a reliable incentive structure for getting a probably honest answer. I object the the particular method of security by obscurity.
The point is to prevent it from optimizing its answer for secondary characteristics besides correctness, when multiple answers are possible. If you "just get the answer straight away", it chooses among all the possible answers one that is correct, honest, and maximally poisonous to you.
So basically, the supposed law behind the method is in getting multiple correctness measurements, which together are expected to strengthen the estimation of correctness of the result. Two conflicting problems in this particular case. If the measurements are supposed to be independent, so that multiple data points actually reinforce each other and many questions are better than one question, then you can't expect for the different genies to be talking about the same answer, and so can't elicit it in the manner described in the post. Conversely, if you can in fact rely on all those genies talking about the same answer, then you can also rely on all your measurements not being independent, and your answer will be picked according to exactly the same criteria as if it was asked straight away.
What? No, that's not paulfchristano's proposal at all. The reason for splitting up the measurement is to incentivize the genies to give you the proof which is lexicographically first. A perfect proof checker is assumed, so you know that if a proof is given, the result is correct; and the incentives are such that if the genie can find a proof, it will give it.
What reason acts to prevent the genie from giving a proof that is not the lexicographically smallest it can find?
If it did that, then there would be at least one subgenie that didn't optimize its utility function - either one reporting no such proof of length n and getting punished when it could have provided a matching proof, or one reporting no such proof of length n and prefix p when it could have provided one and, again, getting punished. Remember, while each invocation (subgenie) has the same utility function, by assumption that function refers to something that can be done to the particular invocation which optimizes it.
If there is any genie-value to obtaining control over the real world, instances of genies will coordinate their decisions to get it. You can't fight superintelligent genies with explicit dependence bias. They are not independent, they can coordinate, and so they will, even if you can frame the problem statement in a way that suggests that they can't communicate. Worse, as I pointed out, in this particular game they must coordinate to get anything done.
You are arguing that the stated assumptions about the genie's utility function are unrealistic (which may be true), but presenting it as though you had found a flaw in the proof that follows from those assumptions.
It seems like the assumptions about utility, even if they hold, don't actually deliver the behavior you expect, because the genies can coordinate. Unless the incentive structure makes sure they won't try to take over the world in any case, it doesn't make sure that they won't try to take over the world if you only ask each of them to answer one binary question either. Think of the genies as making a single decision that results in all the individual actions of all the individual genies. For the genies, having multiple actors just means raising the stakes by threatening to not free more genies, which you could've done for a single-question wish as easily instead of creating an elaborate questioning scheme. You could even let them explicitly discuss what to answer to your question, threatening to terminate them all if examination of their answer reveals any incorrectness! Edit: See also Eliezer's comment.
There is absolutely no sense in which this scheme is security by obscurity. My claim is that the genie will respect my wishes even though he knows exactly what I am doing, because he values my generosity right now more than the promise of taking over the world later.
Again, if your genie is already incentivized to be honest, in what sense is your scheme with all its bells and whistles better than asking for the shortest answer the genie can find, in plain English?
It is not magically incentivized to be honest. It is incentivized to be honest because each query is constructed precisely such that an honest answer is the rational thing to do, under relatively weak assumptions about its utility function. If you ask in plain English, you would actually need magic to produce the right incentives.
My question is about the difference. Why exactly is the plain question different from your scheme? (Clearly your position is that your scheme works, and therefore "doesn't assume any magic", while the absence of your scheme doesn't, and so "requires magic in order to work". You haven't told me anything I don't already know, so it doesn't help.)
Here is the argument in the post more concisely. Hopefully this helps: It is impossible to lie and say "I was able to find a proof" by the construction of the verifier (if you claim you were able to find a proof, the verifier needs to see the proof to believe you.) So the only way you can lie is by saying "I was not able to find a proof" when you could have if you had really tried. So incentivizing the AI to be honest is precisely the same as incentivizing them to avoid admitting "I was not able to find a proof." Providing such an incentive is not trivial, but it is basically the easiest possible incentive to provide. I know of no way to incentivize someone to answer the plain question easily just based on your ability to punish them or reward them when you choose to. Being able to punish them for lying involves being able to tell when they are lying.
See this comment.
Are you saying you think Christiano's scheme is overkill? Presumably we don't have to sacrifice a virgin in order to summon a new genie, so it doesn't look expensive enough to matter.
I'm saying that it's not clear why his scheme is supposed to add security, and it looks like it doesn't. If it does, we should understand why, and optimize that property instead of using the scheme straight away, and if it doesn't, there is no reason to use it. Either way, there is at least one more step to be made. (In this manner, it could work as raw material for new lines of thought where we run out of ideas, for example.)
As I understand it, the genie is not incentivized to honest. It is incentivized to not get caught being dishonest. And the reason for the roundabout way of asking the question is to make the answer-channel bandwidth as narrow as possible.
It is impossible to be dishonest by saying "yes," by construction. The genie is incentivized to say "yes' whenever possible, so it is disincentivized to be dishonest by saying "no." So the genie is incentivized to be honest, not just to avoid being called out for dishonesty.
Since we care about the genie actually being honest, the technique can be thought about as a way of making it more likely that the genie is honest, with the threat of punishing dishonestly a component of that technique.
If you believe that I can really incentivize a particular genie (recall that each question is answered by a separate genie) to solve the problem if they can, then this isn't an issue. The genie will give you the shortest proof they possibly can, because the thing they care most about is saving themselves, which depends only on whether or not they find a short enough proof. Perhaps you are objecting to my claim that I can actually incentivize a genie to be try their best, rather than sacrificing themselves to allow a future genie to escape. I believe that this problem is unequivocally massively easier than friendliness. For example, like I said, it would suffice to design an AGI whose goal is to make sure the reward button is pressed before the punishment button. It would also be sufficient to design an AGI which wants the reward button to be pressed as soon as possible, or basically any AGI whose utility function only depends upon some bounded time in the future (or which stops caring about things after some particular event occurs).
There is no "saving themselves", there is only optimization of utility. If there is any genie-value to obtaining control over the real world, instances of genies will coordinate their decisions to get it.
Indeed, by saving themselves I was appealing to my analogy. This relies on the construction of a utility function such that human generosity now is more valuable than world domination later. I can write down many such utility functions easily, as contrasted with the difficulty of describing friendliness, so I can at least hope to design an AI which has one of them.
4Eliezer Yudkowsky13y
What you can do when you can write down stable utility functions - after you have solved the self-modification stability problem - but you can't yet write down CEV - is a whole different topic from this sort of AI-boxing!
I don't think I understand this post. My claim is that it is easier to write down some stable utility functions than others. This is intimately related to the OP, because I am claiming as a virtue of my approach to boxing that it leaves us with the problem of getting an AI to follow essentially any utility function consistently. I am not purporting to solve that problem here, just making the claim that it is obviously no harder and almost obviously strictly easier than friendliness.
Then you don't need the obscurity part.
I don't understand. How do you propose to incentivize the genie appropriately? I haven't done anything for the purpose of obscurity, just for the purpose of creating an appropriate incentive structure. I see no way to get the answer out in one question in general; that would certainly be better if you could do it safely.
If there's any fooming to be done, then in the design of your genie AI you must solve the problem of maintaining utility invariants over self-modification. This might well be the hardest part of the FAI problem. On the other hand, without fooming, building a superhuman GAI might be much more difficult.
I don't think this is entirely fair. If the AI has a simple goal---press the button---then I think it is materially easier for the AI to modify itself while preserving the button-pressing goal than if it has a very complex goal which is difficult to even describe. I do agree the problem is difficult, but I don't think it is in the same league as friendliness. I actually think it is quite approachable with modern techniques. Not to mention that its first step for the AI is to work out such techniques on its own to guarantee that it presses the button, once it manages to become intelligent enough to develop them.
Suppose you can verify that an AI obeys a simple utility function ('press a button'), but can't verify if it obeys a complex one ('implement CEV of humanity'). Suppose also that you can formally describe the complex function and run it (i.e. output the utility value of a proposed action or scenario), you just can't verify that an AI follows this function. In other words, you can build an AI that follows this complex function, but you don't know what will happen when it self-modifies. You might proceed as follows. Run two copies of the AI. Copy 1 has the complex utility function, and is not allowed to self-modify (it might therefore be quite slow and limited). Copy 2 has the following simple utility function: "ask AI copy 1 for permission for all future actions. Never modify AI copy 1's behavior." Copy 2 may self-modify and foom. This puts some limits on the kinds of self-modification that copy 2 will be able to do. It won't be able to take clever shortcuts that preserve its utility function in non-obvious ways. But it will be able to find a superintelligent way of doing something, if that something is sufficiently self-contained that it can describe it to copy 1. Of course, if AI copy 2 morphs into something truly adversarial, then copy 1's utility function had better be very foolproof. But this is something that has to be solved anyway to implement the seed AI's utility function in a fooming FAI scenario. So I think verifying simple utility functions under self-modification is indeed 90% of the way towards verifying complex ones.
I don't think you've noticed that this is just moving the fundamental problem to a different place. For example, you haven't specified things like: * Don't lie to AI 1 about your actions * Don't persuade AI 1 to modify itself * Don't find loopholes in the definition of "AI 1" or "modify" etc., etc. If you could enforce all these things over superintelligent self-modification, you'd already have solved the general FAI problem. IOW, what you propose isn't actually a reduction of anything, AFAICT.
I noticed this but didn't explicitly point it out. My point was that when paulfchristiano said: He was also assuming that he could handle your objections, e.g. that his AI wouldn't find a loophole in the definition of "pressing a button". So the problem he described was not, in fact, simpler than the general problem of FAI.
I think that much of the difficulty with friendliness is that you can't write down a simple utility function such that maximizing that utility is friendly. By "complex goal" I mean one which is sufficiently complex that articulating it precisely is out of our league. I do believe that any two utility functions you can write down precisely should be basically equivalent in terms of how hard it is to verify that an AI follows them.

A standard trick reveals that knowing whether a problem has a solution is almost as helpful as knowing the solution. Here is a (very inefficient) way to use this ability

I figure the AI will be smart enough to recognize the strategy you are using. In that case, it can choose to not cooperate and output a malicious solution. If the different AIs you are running are similar enough, it's not improbable for all of them to come to the same conclusion. In fact, I feel like there is a sort of convergence for the most "unfriendly" output they can give. If that's the case, all UFAIs will give the same output.

It can choose to not cooperate, but it will only do so if thats what it wants to do. The genie I have described wants to cooperate. An AGI of any of the forms I have described would want to cooperate. Now you can claim that I can't build an AGI with any easily controlled utility function at all, but this is very much a harder claim.

Can we get a solution out in one question by asking it to provide a solution, along with a proof that that is the shortest statement of a solution? Then the filter checks the proof, and prints the solution iff it is valid. (Without printing the proof of minimality.)

The issue is that we would like to assume that the genie is smart (think AGI) but not omniscient. I expect that if a hundred smart humans worked for a hundred years they could prove the Riemann hypothesis. I very much doubt that they could prove they had found the shortest proof (in fact in general, statements like that can undecidable quickly). In fact, I very much doubt they could even find the shortest proof at all. All I'm looking for is the shortest proof the AI can manage to find (which is why it is extremely subtle to formalize what I mean by "the AI has no influence").
Incidentally, the Riemann hypothesis was first proposed in 1859, more than one hundred years ago.
Fair enough. In my defense, I don't think that we have done as much good as 100 smart people working collaboratively and continuously for 100 years would do---not even close. But even if we have done only 10% of that, our failure still bodes badly for my assertion.
As a non mathematician I am genuinely curious as to what we could do with a proof of the hypothesis that we could not do by merely assuming it. Since some consider it the most important problem in pure mathematics it must be something. It has a bearing on encryption of course (given the relevance to prime numbers) however the proof would only make the encryption weaker so assuming it gives no vulnerability. It would seem from a naive viewpoint that more than the proof itself some of the intermediate steps required to construct the proof may be of greater potential value.
As is usual in mathematics, a proof of the Riemann hypothesis would almost certainly do more than just tell us that it's true- in order to prove something like that, you usually need to get a handle on deeper and more general structures. For instance, Wiles' proof of Fermat's Last Theorem made massive progress on the modularity conjecture, which was then proved in full a few years later. Similarly, the proof of the Poincare conjecture consisted of a great and general advance in the "surgery" of singularities in the Ricci flow. To think of an easier example, solving general cubic equations over the integers led mathematicians to complex numbers (it was easy to ignore complex solutions in the quadratic case, but in the cubic case you sometimes need to do complex arithmetic just to calculate the real root), and higher-order polynomial solutions led to Galois theory and more. I once caught a nice popular talk by John H. Conway, about the sorts of extra things we would expect to learn from a proof of the Riemann Hypothesis.
Agreed Surely "check all the proofs in order" is an algorithmic way to determine that a given proof is the shortest? Then we agree with each other, but you might want to edit your post to reflect this. Asking "is there a proof of length 2?" seems to indicate my interpretation rather than "can you find a proof of length 2?".
"Finding the shortest proof" given that you already have one is decidable. If you have a proof of n characters, and your character alphabet has b symbols, then you just have to enumerate and check all the finitely many strings of length 1 (b^1 of them), length 2, (b^2 of them), ..., and of length n-1 (b^(n-1) of them), for proof-ness. If none of those are a proof, then your length n proof is the shortest. Otherwise, the shortest one you found is :) By "proof" I mean a proof of the statement in question; there are several obvious tweaks to the enumeration process (stopping early, only generating strings that end in QED, generating only syntactically well formed strings, etc), but of course it's still completely impractical if you're going to have to enumerate more than 2^70 or so.
Of course you're both right. In my defense, there is a strong theoretical connection between undecidability and not being in P. In general, there is probably no reasonably short proof, that an arbiter would have time to verify, that a purported proof of the Riemann hypothesis is the shortest. This is independent of how smart the genie is.
Agree ("probably"). Would be interesting if there were a proof from complexity theory or elsewhere that it's expensive to compute in some sense. That we can't imagine a method is different from a proof that there can't be one (for proving some general class of "hard to prove/disprove" statements like RH; obviously there's a constant answer to "shortest proof of a fixed claim").
If you replace Riemann hypothesis by "any provable mathematical statement" and you allow arbitrary verification procedures instead of just the normal mathematical ones, then the question is probably equivalent to P = coNP, which is equivalent to P = NP.
I assume you're thinking of this - P=coNP => polytime decision for propositional tautologies (but not for mixed quanitifier bool. formulas). It's CoNP-complete to decide tautologies (implicit forall for all the prop. vars) and NP-complete to decide satisfiability (implicit exists for all the prop. vars). I don't understand what you mean by "arbitrary verification procedures" - maybe you're talking about a different result than the one I linked above? Right - those are equivalent.

It seems to me that you've pushed all the difficulty into creating the filter, without actually making it go away. What is the nature of this filter that is powerful enough to recognise a proof of the Riemann hypothesis, but which a genie, by construction much smarter than you, cannot manipulate?

An automated proof checker?
Indeed, provided that the automated proof checker is absolutely entirely utterly bug-free and exploit-free! When was the last time someone released fairly complicated, mission-critical piece of software with an assertion that it was totally secure and bugless... and wasn't proven wrong afterwards?
Automated proof checking is actually very simple, and we already have bug-free and exploit-free proof checkers.
I was not expecting this answer at all! Can you elaborate a little further? Even just a link to such a checker would be great.
1timtyler13y seems to have several such links.
But how do we know they are bug-free and exploit-free? I'm under the impression that current software methodologies can at least empirically guarantee error rates like 1 per million lines of code [ETA: this is wrong, should be 1 per 10,000, see child]. Since an automated proof checker requires far fewer lines than this, we can probably guarantee no errors with something like probability 99.9%, which isn't perfect, but then again a proof of friendliness is unlikely to give much better guarantees, since it might be wrong. We can potentially add to this by not letting the AI know anything about the checker (this would actually be non-trivial, since the AI gets access to whatever information it wants about the outside world; I don't actually know how to solve this problem).
Here's some evidence for you to update on.
Thanks, that does bring my estimate down by about two orders of magnitude (to 1 error / 10,000 lines being the current standard for "very good" code). I think I mis-remembered the NASA one as 1 per million instead of 1 per 500,000 and thought that it was more typical than it was. Still, 1 error per 10,000 seems to be definitely achievable today, and can probably be made higher if we need it. Hopefully by the time we're dealing with AGI we will have better metholodogies.

Suppose that you, as the owner of the bottle, have the power either to free the genie into whatever otherworld it came from (from which he cannot interact with the human world) or to cast the bottle into the fires of Mount Doom, destroying the genie irrevocably.

That isn't very much like the real world. You can't release a powerful "genie" in the real world - without lots of changes taking place. So, how is this supposed to relate to "unfriendly" machines? You can't play this game with them - for obvious reasons. If you lie to them about their reward, they may figure that out - especially if you feed them real-world problems.

I think the OP was just adding flavor with that. For a more realistic scenario, imagine a GAI with a simple utility function that is maximized by pushing a button, which also happens to shut down the GAI immediately (and the GAI knows this, and its utility function has no problem with that fact). The GAI is then motivated to want that button pushed, even if it hates your guts.
Uh, yes - that helps - thanks!

So the only way a genie can be dishonest is by not finding a proof when it could have. But in this case the genie will be severely punished for its dishonesty, so if the genie is actually maximizing its utility function and the punishment actually reduces its utility function more than any effects after the game can correct for, he will be honest.

You need to be a bit careful. The genie can conceivably pre-commit to something canonical like "the lexicographically first proof that contains some pernicious message to humanity", in the hopes that ... (read more)

The premise of the scheme is that each genie independently wants only to secure his release. The scheme can fail only if one genie makes a sacrifice for the others: I can say "no" when the answer is yes, but then my own utility function is guaranteed to be reduced. There is no way the cooperation of the other genies will help me reciprocally, I am just screwed. The reason the scheme would work is not because the genies are unable to coordinate. I could allow them perfectly free coordination. (You could imagine a scheme that tried to limit their communication in some subtle way, but this scheme has no elements that do that.) If the scheme is secure at all it is because each AI is interested only in their own utility function, they have nothing to gain by cooperating, and by the time they have to make a choice they can gain nothing by the cooperation of the others. Game-theoretically it is absolutely clear that the genies should all be honest.
4Wei Dai12y
Do you still think this, in light of A Problem About Bargaining and Logical Uncertainty? (I expect the relevance of that post is clear, but just in case, the implication is that each genie might decide to commit to cooperating with each other before having found a proof, when it doesn't yet know whether it's an eventual "winner" or "loser", or something equivalent to this.) ETA: From your later comments and posts it looks like you've already changed your mind significantly on the topic of this post after learning/thinking more about TDT/UDT. I'm not sure if the post I linked to above makes any further difference. In any case, you might want to update your post with your current thoughts.
I've learned much since writing this post. This scheme doesn't work. (Though the preceding article about boxing held up surprisingly well.) The problem as I stated it is not pinned down enough to be challenging rather than incoherent, but you can get at the same difficulties by looking at situations with halting oracles or other well-defined magic.

The genie can prove a yes answer to SAT, not a no answer (as far as we know about the nature of SAT). Is it allowed to fail? How can you tell if it lies and says no? This allows the communication of choosable bits by the AI.


Isn't "evil" in evil genie distracting? The plausible unfriendly AI scenarios are not that we will inadvertently create an AI that hates us. It's that the we'll quite advertently create an AI that desires to turn the solar system into a hard disk full of math theorems, or whatever. Optimization isn't evil, just dangerous. Even talking about an AI that "wants to be free" is anthropomorphizing.

Maybe this is just for color, and I'm being obtuse.

Agreed that there's a whole mess of mistakes that can result from anthropomorphizing (thank ye hairy gods for spell check) the AI, but I still think this style of post is useful. If we can come up with a fool-proof way of safely extracting information from a malevolent genie, then we are probably pretty close to a way of safely extracting information from a genie that we're unsure of the friendliness of.

Shave his belly with a rusty razor Shave his belly with a rusty razor Shave his belly with a rusty razor Earlye in the morning.

I think someone who has understood and taken seriously my last two posts is likely to have a better understanding of the dangers of an unfriendly AI than most AGI researchers, and is therefore less likely to behave recklessly (the other likely possibility is that they will think that I am describing ridiculous and irrelevant precautions, in which case they were probably going to behave recklessly already).

You don't seem to have attempted to make a case for the real world relevance of any of this. It has been more a case of: here's what you could do - i... (read more)

If people don't update their views, then I'm already golden and this discussion can do no damage. If people think that these arguments are too paranoid then they are already not paranoid enough and this discussion can do no damage. The fear is that a person who would otherwise be responsible might suspect that boxing is theoretically possible, update their beliefs about the dangers of uFAI (the theoretical possibility of a thing is undoubtedly correlated to the practical possibility), and therefore do less to avoid it. Even if the odds of this are small, it would be an issue. I concluded that the expected benefit I cite is small but large enough to offset this risk.

Today, thousands link their machine learning projects up to the internet, without a thought of the supposed risk. For them, the "box" is the rest of the internet - which won't let them do very much, and locks down most types of access on security grounds. The problem the programmers usually face, is not their baby getting too much power, but it being ignored and neglected. If only having too much power was a problem for most machine learning systems. The best machine learning systems are doing work in the world, answering queries, making inve... (read more)

As in curve fit and then extrapolate while ignoring all other information? Using this method, how long until transistors become subatomic particles?
That doesn't seem like a very sympathetic interpretation. It was intended more like an invitation to do your best job. If you have a projection involving a totalitarian government clampdown on research into machine intelligence (or similar), perhaps share the scenario - so others can better visualise how realistic it is.
The smarter boxes have to come from somewhere. Paul is proposing such a solution. Assuming that someone will take care of the problem in the future seems in general to be a poor approach.
It's not an "approach", it's a projection. For a long time we have been able to construct good quality prisons, with the level of security we desire. The jailers have the force of society behind them. More intelligence is unlikely to change the situation - since the jailers and the incarcerated will be lifted by that equally.