Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

Related to: Shut up and do the impossible!; Everything about an AI in a box.

One solution to the problem of friendliness is to develop a self-improving, unfriendly AI, put it in a box, and ask it to make a friendly AI for us.  This gets around the incredible difficulty of developing a friendly AI, but it creates a new, apparently equally impossible problem. How do you design a box strong enough to hold a superintelligence?  Lets suppose, optimistically, that researchers on friendly AI have developed some notion of a certifiably friendly AI: a class of optimization processes whose behavior we can automatically verify will be friendly. Now the problem is designing a box strong enough to hold an unfriendly AI until it modifies itself to be certifiably friendly (of course, it may have to make itself smarter first, and it may need to learn a lot about the world to succeed).

Edit: Many people have correctly pointed out that certifying friendliness is probably incredibly difficult. I personally believe it is likely to be significantly easier than actually finding an FAI, even if current approaches are more likely to find FAI first. But this isn't really the core of the article. I am describing a general technique for quarantining potentially dangerous and extraordinarily sophisticated code, at great expense. In particular, if we developed uFAI before having any notion of certifiable friendliness, then we could still use this technique to try and use the uFAI in a very limited way. It allows us to quarantine an AI and force everything it tries to say to us through an arbitrarily complicated filter. So, for example, if we are interested in solving some problem with a unique and verifiably correct answer, then we can ask the AI for this answer and only see the output if it passes the verification. We can also use a quarantined AI in more useful ways, at the expense of introducing some limited channels of influence (for example, if we ask for the solution to a problem which may have several solutions, we allow the AI to influence us by choosing which solution to provide)

Minimally, I hope this scheme at least suggests that putting an AI in a box may not be catastrophic. What to do with a boxed AI is another question. /Edit.

Concretely, suppose someone comes to you with the design of a recursively self-improving AI---call it Hal.  Unfortunately, Hal is not certifiably friendly.  Maybe it is friendly but not provably so, but more likely it is ambivalent to the human condition and is likely to destroy the universe if you just tell it to make a friendly AI.  Nevertheless, you think Hal is smart enough to solve the friendly AI problem for you.  All you need to do is give it access to all of the information about the world it wants and a couple of hours to run; after that, it should output a modified version of itself which is certifiably friendly.

You almost run Hal and ask it to design a friendly AI for you. After all, if you just run Hal and check to see if it outputs a certifiably friendly AI, what can it do other than refuse?  But after some more thought you realize this is overly optimistic.  Maybe you can avoid giving it an output channel that a human will have to read, thereby eliminating the most obvious failure mode of a boxed AI, but you still have to give it access to a computer to run on.  Computation produces electromagnetic radiation, heat, etc., and it is honestly hard to say what a superintelligence could do with such a channel.

We seem to be faced with an impossible problem.  We need to ask Hal to design us a certifiably friendly AI, but we also need to give him absolutely no ability to interact with the world destructively.

There is one field in particular which has a good history of solving obviously impossible problems: computational cryptography.  We want to implement the following obviously impossible device.  A cage into which we can place Hal so that he can think and examine all of the data we give him, but from which he cannot interact with his environment.  In the cage with him is a tamper-proof arbiter implementing our friendliness test.  Hal may give the arbiter the code to a single AI; if the arbiter confirms that this AI is certifiably friendly it will kill Hal, replace him with the friendly AI, and then allow the friendly AI to send a single message out of the cage.

To implement this impossible device we co-opt an important cryptographic primitive called public key homomorphic encryption, typically used for the impossible problem of private delegated computation.  To get a basic idea you can see the Wikipedia article on homomorphic encryption. I'll give the brief description here (you can skip the next two paragraphs if you are familiar with homomorphic encryption).

A public key encryption scheme has the following properties.  Anyone can encrypt a message using a publicly published key, but you cannot efficiently distinguish the encryptions of two different messages unless you know something about the secret key.  A fully homomorphic encryption scheme has the additional counter intuitive property that, although you cannot learn anything about the encrypted data without the secret key, you can still perform arbitrary operations on it.  In particular, if you have a homomorphic encryption of the source code of a program you can run that program without learning anything about it (except how long the source is), but the output of the program will be homomorphically encrypted.  All of this can be done using only the public key.

The existence of a fully homomorphic encryption scheme was completely open until recently, when it was shown by Craig Gentry that such a scheme exists under the assumption that a certain problem on lattices is computationally intractable.  Known homomorphic encryption schemes are very inefficient and based on new assumptions, but we can realistically expect improvements in the coming years and decades both in speed and the strength of the necessary assumptions.

If you run a homomorphically encrypted AI, sufficiently isolated from the private key, you get the very strong guarantee that the AI's execution has no computationally detectable effect on the rest of the world.  If the AI was able to effect the world in a computationally detectable way, then we would have successfully broken the homomorphic encryption scheme by distinguishing the encrypted execution of the AI from the encrypted execution of some benign program.

Now we need to give the arbiter some means to allow a genuinely friendly AI access to the rest of the world, without giving Hal any illegitimate access.  We allow Hal access to an output channel whose outputs are homomorphically encrypted.  After some time, we destroy Hal and then homomorphically check to see if he has output the code of a certifiably friendly AI over the channel (we have to decode a single homomorphically encrypted bit to perform this check).  If he hasn't, then we do nothing---apparently Hal would rather not give us a friendly AI.  If he has output the code of a certifiably friendly AI, we homomorphically run it, obtain a homomorphic encryption of its output, and then decrypt only this output.

That's the entire scheme.  Whenever you have some code whose execution itself might do incredible damage, this precaution allows you to execute it in perfect safety (the main caveat is that you need an automatic way to screen the output which doesn't give the algorithm enough room to possibly do harm).

Note: If you are really exceptionally paranoid you may be able to obtain even better guarantees by developing more powerful cryptographic techniques tailored for this application.  In particular, the secret key used by the encryption procedure may be leaked to a sufficiently powerful AI through a side channel (the computations involving the key typically leak some data in the form of radiation, which might be recovered from sufficiently exhaustive sensor data by a sufficiently powerful AI).  By doing these computations on a quantum computer you can probably guarantee that no data was leaked. If you want to be even more fantastically paranoid (ie, if you want a formal proof of security, which is a rare treat in real problems) you will actually have to shred the key, guaranteeing that an AI will have to interact with the shredded key to recover it. If you do this, you have to use another new primitive to implement the final revelation of the friendly AI's message.

If you have any concerns about the security of this protocol, I would be happy to try and defend it and would be surprised and interested if I failed. Whether or not such a cryptographic box is really an interesting or important object is another question. (It is interesting to me as an unexpected application of cryptography).

New Comment
162 comments, sorted by Click to highlight new comments since: Today at 6:19 AM
Some comments are truncated due to high volume. (⌘F to expand all)Change truncation settings

a certifiably friendly AI: a class of optimization processes whose behavior we can automatically verify will be friendly

The probability I assign to achieving a capability state where it is (1) possible to prove a mind Friendly even if it has been constructed by a hostile superintelligence, (2) possible to build a hostile superintelligence, and (3) not possible to build a Friendly AI directly, is very low.

In particular the sort of proof techniques I currently have in mind - what they prove and what it means - for ensuring Friendliness through a billion self-modifications of something that started out relatively stupid and built by relatively trusted humans, would not work for verifying Friendliness of a finished AI that was handed you by a hostile superintelligence, and it seems to me that the required proof techniques for that would have to be considerably stronger.

To paraphrase Mayor Daley, the proof techniques are there to preserve the Friendly intent of the programmers through the process of constructing the AI and through the AI's self-modification, not to create Friendliness. People hear the word "prove" and assume that this is because you don't trust the progra... (read more)

The probability I assign to being able to build a friendly AI directly before being able to build a hostile AI is very low. You have thought more about the problem, but I'm not really convinced. I guess we can both be right concurrently, and then we are in trouble.

I will say that I think you underestimate how powerful allowing a superintelligence to write a proof for you is. The question is not really whether you have proof techniques to verify friendliness. It is whether you have a formal language expressive enough to describe friendliness in which a transhuman can find a proof. Maybe that is just as hard as the original problem, because even formally articulating friendliness is incredibly difficult.

Usually, verifying a proof is considerably easier than finding one - and it doesn't seem at all unreasonable to use a machine to find a proof - if you are looking for one.

the sort of proof techniques I currently have in mind ... would not work for verifying Friendliness of a finished AI that was handed you by a hostile superintelligence.

But what if the hostile superintelligence handed you a finished AI together with a purported proof of its Friendliness. Would you have enough trust in the soundness of your proof system to check the purported proof and act on the results of that check?

That would then be something you'd have to read and likely show to dozens of other people to verify reliably, leaving opportunities for all kinds of mindhacks. the OP proposal requires us to have an automatic verifier ready to run, that can return reliably without human intervention.
Actually computers can mechanically check proofs for any formal system.
Is there something missing from the parent? It does not seem to parse.
Yes, edited. thanks.
And upvoted. :)
Yes, but the point is that the automatic verifier gets to verify a proof that the AI-in-the-box produced -- it doesn't have to examine an arbitrary program and try to proof friendliness from scratch. In a comment below, paulfchristiano makes the point that any theory of friendliness at all should give us such a proof system, for some restricted class of programs. For example, Eliezer envisions a theory about how to let programs evolve without losing friendliness. The corresponding class of proofs have the form "the program under consideration can be derived from the known-friendly program X by the sequence Y of friendliness-preserving transformations".

The probability I assign to achieving a capability state where it is (1) possible to prove a mind Friendly even if it has been constructed by a hostile superintelligence, (2) possible to build a hostile superintelligence, and (3) not possible to build a Friendly AI directly, is very low.

A general theory of quarantines would nevertheless be useful.

Moral value can manipulate your concerns, even as you prevent causal influence. Maybe the AI will create extraordinary people in its mind, and use that as leverage to work on weak points of your defense. It's just too difficult, you are bound to miss something. The winning move is not to play.

I disagree. The weak point of the scheme is the friendliness test, not the quarantine. If I prove the quarantine scheme will work, then it will work unless my computational assumptions are incorrect. If I prove it will work without assumptions, it will work without assumptions.

If you think that an AI can manipulate our moral values without ever getting to say anything to us, then that is a different story. This danger occurs even before putting an AI in a box though, and in fact even before the design of AI becomes possible. This scheme does nothing to exacerbate that danger.

With a few seconds of thought it is easy to see how this is possible even without caring about imaginary people. This is a question of cooperation among humans. This is a good point too, although I may not go as far as saying it does nothing to exacerbate the danger. The increased tangibility matters.
I think that running an AI in this way is no worse than simply having the code of an AGI exist. I agree that just having the code sitting around is probably dangerous.
Nod, in terms of direct danger the two cases aren't much different. The difference in risk is only due to the psychological impact on our fellow humans. The Pascal's Commons becomes that much more salient to them. (Yes, I did just make that term up. The implications of the combination are clear I hope.)
Separate "let's develop a theory of quarantines" from "let's implement some quarantines." Christiano should take it as a compliment that his idea is formal enough that one could imagine proving that it doesn't work. Other than that, I don't see why your remark should go for "quarantining an AI using cryptography" and not "creating a friendly AI." Prove it. Prove it by developing a theory of quarantines.
I agree.
Sociopathic guardians woud solve that one particular problem (and bring others, of course, but perhaps more easily countered).
You are parrying my example, but not the pattern it exemplifies (not speaking of the larger pattern of the point I'm arguing for). If certain people are insensitive to this particular kind of moral arguments, they are still bound to be sensitive to some moral arguments. Maybe the AI will generate recipes for extraordinarily tasty foods for your sociopaths or get-rich-fast schemes that actually work or magically beautiful music.
Indeed. The more thorough solution would seem to be "find a guardian possessing such an utility function that the AI has nothing to offer them that you can't trump with a counter-offer". The existence of such guardians would depend on the upper estimations of the AI's capabilities and on their employer's means, and would be subject to your ability to correctly assess a candidate's utility function.
Very rarely is the winning move not to play. It seems especially unlikely to be the case if you are trying to build a prison.
4Eliezer Yudkowsky13y
For what?
The OP framed the scenario in terms of directing the AI to design a FAI, but the technique is more general. It's possibly safe for all problems with a verifiable solution.
People I don't trust but don't want to kill (or modify to cripple). A non-compliant transhuman with self modification ability may not be able to out-compete an FAI but if it is not quarantined it could force the FAI to burn resources to maintain dominance. But it is something we can let the FAI build for us.
At what point does a transhuman become posthuman?
Shrug. For the purposes here they could be called froogles for all I care. The quarantine could occur in either stage depending on the preferences being implemented.
You mean posthuman?
Could you elaborate on this? Your mere assertion is enough to make me much less confident than I was when I posted this comment. But I would be interested in a more object-level argument. (The fact that your own approach to building an FAI wouldn't pass through such a stage doesn't seem like enough to drive the probability "very low".)
The FAI theory required to build a proof for 1 would have to be very versatile, you have to understand friendliness very well to do it. 2 requires some understanding of the nature of intelligence. (especially if we know with well enough to put it in a box that we're building a superintelligence) If you understand friendliness that well, and intellignece that well, then friendly intelligence should be easy. We never built space suits for horses, because by the time we figured out how to get to the moon, we also figured out electric rovers.
Eliezer has spent years making the case that FAI is far, far, far more specific than AI. A theory of intelligence adequate to building an AI could still be very far from a theory of Friendliness adequate to building an FAI, couldn't it? So, suppose that we know how to build an AI, but we're smart enough not to build one until we have a theory of Friendliness. You seem to be saying that, at this point, we should consider the problem of constructing a certifier of Friendliness to be essentially no easier than constructing FAI source code. Why? What is the argument for thinking that FAI is very likely to be one of those problems were certifying a solution is no easier than solving from scratch? This doesn't seem analogous at all. It's hard to imagine how we could have developed the technology to get to the moon without having built electric land vehicles along the way. I hope that I'm not indulging in too much hindsight bias when I say that, conditioned on our getting to the moon, our getting to electric-rover technology first was very highly probable. No one had to take special care to make sure that the technologies were developed in that order. But, if I understand Eliezer's position correctly, we could easily solve the problem of AGI while still being very far from a theory of Friendliness. That is the scenario that he has dedicated his life to avoiding, isn't it?
You don't need to let the superintelligence give you an entire mind; you can split off subproblems for which there is a secure proof procedure, and give it those.
The most general form is that of a textbook. Have the AI generate a textbook that teaches you Friendliness theory, with exercises and all, so that you'll be able to construct a FAI on your own (with a team, etc.), fully understanding why and how it's supposed to work. We all know how secure human psychology is and how it's totally impossible to be fooled, especially for multiple people simultaneously, even by superintelligences that have a motive to deceive you.
The point of this exercise was to allow the unfriendly AI access not even to the most limited communication it might be able to accomplish using byproducts of computation. Having it produce an unconstrained textbook which a human actually reads is clearly absurd, but I'm not sure what you are trying to mock. Do you think it is impossible to formulate any question which the AI can answer without exercising undue influence? It seems hard, but I don't see why you would ridicule the possibility.
In first part of the grandparent, I attempted to strengthen the method. I don't think automatic checking of a generated AI is anywhere feasible, and so there is little point in discussing that. But a textbook is clearly feasible as means of communicating a solution to a philosophical problem.. And then there are all those obvious old-hat failure modes resulting from the textbook, if only it could be constructed without possibility of causing psychological damage. Maybe we can even improve on that, by having several "generations" of researchers write a textbook for the next one, so that any brittle manipulation patterns contained in the original break.
I'm sorry, I misunderstood your intention completely, probably because of the italics :) I personally am paranoid enough about ambivalent transhumans that I would be very afraid of giving them such a powerful channel to the outside world, even if they had to pass through many levels of iterative improvement (if I could make a textbook that hijacked your mind, then I could just have you write a similarly destructive textbook to the next researcher, etc.). I think it is possible to exploit an unfriendly boxed AI to bootstrap to friendliness in a theoretically safe way. Minimally, the problem of doing it safely is very interesting and difficult, and I think I have a solid first step to a solution.
It's an attractive idea for science fiction, but I think no matter how super-intelligent and unfriendly, an AI would be unable to produce some kind of mind-destroying grimoire. I just don't think text and diagrams on a printed page, read slowly in the usual way, would have the bandwidth to rapidly and reliably "hack" any human. Needless to say, you would proceed with caution just in case. I you don't mind hurting a few volunteers to defend humanity from a much bigger threat, it should be fairly easy to detect, quarantine and possibly treat the damaged ones. They, after all, would only be ordinarily intelligent. Super-cunning existential-risk malevolence isn't transitive. I think a textbook full of sensory exploit hacks would be pretty valuable data in itself, but maybe I'm not a completely friendly natural intelligence ;-) edit: Oh, I may have missed the point that of course you couldn't trust the methods in the textbook for constructing FAI even if it itself posed no direct danger. Agreed.
Consider that humans can and have made such grimoires; they call them bibles. All it takes is a nonrational but sufficiently appealing idea and an imperfect rationalist falls to it. If there's a true hole in the textbook's information, such that it produces unfriendly AI instead of friendly, and the AI who wrote the textbook handwaved that hole away, how confident are you that you would spot the best hand-waving ever written?
Not confident at all. In fact I have seen no evidence for the possibility, even in principle, of provably friendly AI. And if there were such evidence, I wouldn't be able to understand it well enough to evaluate it. In fact I wouldn't trust such a textbook even written by human experts whose motives I trusted. The problem isn't proving the theorems, it's choosing the axioms.
Can you think of any suitable candidates - apart from perhaps an oracle.
0Eliezer Yudkowsky13y
Subproblems like that seem likely to be rare.
Does this still hold if you remove the word "hostile", i. e. if the "friendliness" of the superintelligence you construct first is simply not known? This quote applies to you and your approach to AI boxing ; ) AI Boxing is a potentially useful approach, if one accepts that: a) AGI is easier than FAI b) Verification of "proof of friendliness" is easier than its production c) AI Boxing is possible As far as I can tell, you agree with a) and b). Please take care that your views on c) are not clouded by the status you have invested in the AI Box Experiment ... of 8 years ago. "Human understanding progresses through small problems solved conclusively, once and forever" - cousin_it, on LessWrong.
4Eliezer Yudkowsky13y
B is false. Heh. I'm afraid AIs of "unknown" motivations are known to be hostile from a human perspective. See Omohundro on the Basic AI Drives, and the Fragility of Value supersequence on LW.
This is... unexpected. Mathematical theory, not to mention history, seems to be on my side here. How did you arrive at this conclusion? Point taken. I was thinking of "unknown" in the sense of "designed with aim of friendliness, but not yet proven to be friendly", but worded it badly.
The point is that it may be possible to design a heuristically friendly AI which, if friendly, will remain just as friendly after changing itself, without having any infallible way to recognize a friendly AI (in particular, its bad if your screening has any false positives at all, since you have a transhuman looking for pathologically bad cases).
To recap, (b) was: 'Verification of "proof of friendliness" is easier than its production'. For that to work as a plan in context, the verification doesn't have to be infallible. It just needs not to have false positives. False negatives are fine - i.e. if a good machine is rejected, that isn't the end of the world.
You don't seem to want to say anything about how you are so confident. Can you say something about why you don't want to give an argument for your confidence? Is it just too obvious to bother explaining? Or is there too large an inferential distance even with LW readers?
-6Eliezer Yudkowsky13y
I'm not sure this is what D_Alex meant however a generous interpretation of 'unknown friendliness' could be that confidence in the friendliness of the AI is less than considered necessary. For example if there is an 80% chance that the unknown AI is friendly and b) and c) are both counter-factually assumed to true.... (Obviously the '80%' necessary could be different depending on how much of the uncertainty is due to pessimism regarding possible limits of your own judgement and also on your level of desperation at the time...)
The comment of Eliezer's does not seem to be mentioning the obvious difficulties with c) at all. In fact in the very part you choose to quote... ... it is b) that is implicitly the weakest link, with some potential deprecation of a) as well. c) is outright excluded from hypothetical consideration.
I expect if there were an actual demonstration of any such thing, more people would pay attention. As things stand, people are likely to look at your comment, concoct a contrary scenario - where *they don't have a proof, but their superintelligent offspring subsequently creates one at their request - and conclude that you were being over confident.

Let's consider a somewhat similar case.

You are an inventor. An evil dictator captures you, and takes you off to a faraway dungeon, where he tells you that he wants you to build him a superweapon. If you refuse to build the weapon, well, he has means of persuading you. If you still refuse, he will kill you.

Of course, when you finish building the doomsday device, your usefulness will be over, and he will probably kill you anyway.

Being a genius, you soon realise that you could build the dictator a doomsday device in about a week, with the materials lying around in your well-equipped dungeon workshop. However, this doesn't seem terribly optimal.

What do you do? You agree to build the doomsday device. But you claim it is very difficult. You spend a long time producing impressive prototypes that seem to suggest you're making progress. You claim that you need more resources. You carry on, making just enough apparent progress to stop him from killing you on the spot, but not enough to actually succeed. You build ever more expensive and complex non-working prototypes, hoping that in the course of this, you can either build yourself something suitable for breaking out of the dungeon or killing the evil dictator. Or hoping that perhaps someone will rescue you. At the very least you will have wasted the evil dictator's resources on doing something pointless - you will at least not die for nothing.

I suspect your imprisoned AI may choose to follow a similar course.

Or you build yourself a superweapon that you use to escape, and then go on to shut down your company's weapons division and spend your spare time being a superhero and romancing your assistant and fighting a pitched battle with a disloyal employee.

This reply and its parent comment constitute the "Iron Man Argument" against any kind of "put the AI in a box" approach to AGI and friendliness concerns. I predict it will be extremely effective.

That doesn't really work in this situation, since the AI building another AI won't get it out of the box. It's more like you can only design things, and the evil dictator can tell if it's not really a doomsday device and won't build it. Also, he's not going to be checking on your progress, so faking it is useless, but it's also unnecessary.
Note that none of these options are available to Paul's uFAI. It can't communicate with its jailers at all. The jailers won't know anything about what the uFAI is doing (beyond what their knowledge of its source code tells them) until the uFAI outputs a verifiably friendly source code. For all they know, the uFAI is just sitting in the corner and sulking.
This quite possibly the optimal response. If we don't get FAI out of the first attempt, we might try again with another similar design, possibly giving the uFAI more resources. This twin uFAI is much more likely to implement the uFAI's goals than any FAI it might create. Even a single bit is too much control if we need the bit to be set to 1.
The box automatically kills the AI after a fixed length of time. Furthermore, most of the suggestions in your post require significant interaction with the outside world, whereas the AI is only allowed to send 1 bit total before dying.
Not quite right. The AI, in theory, can only send humans 1 bit total before dying, but it can send millions of bits to the arbiter, which might be corruptible. I think the piece of this scheme that is doing the magic is the uncorruptible arbiter that perfectly distinguishes friendly from unfriendly code. We don't know anything about either of those two impossible problems yet.
I agree that coming up with a complex uncorruptible arbiter is difficult. However, I can imagine simple arbiters which might be helpful but which I can implement. For example, suppose I just want to find a simple explanation for some physical data. I could imagine an arbiter which simply elicits an explanation (in the form of a short program which correctly predicts the data) from the AI and then tells us either "yes, the AI found a very simple explanation" or "no, the AI didn't give any very simple explanation." I am working on a post describing one approach to getting useful data out of an unfriendly AI without letting him do anything bad.
OK, but why on earth would you think that the arbiter will be immune to the superintelligent suasion of the AI? Let's say you program it yourself on hardware you design yourself and you understand very well what the arbiter does under normal circumstances. The AI's new job is to figure out how to create abnormal circumstances under which it can break your arbiter and steal its resources. Suppose it succeeds -- how will you stop the arbiter from passing on a message from the AI that will corrupt your mind, which will then turn you into an unwitting agent of the ufAI, etc.? Think corny zombie movies, only in real life. I know the arbiter is only supposed to be able to send you one bit of information. But how do you propose to stop it from sending you more when it's no longer responding to your programming? Will you equip it with diodes that burn out immediately after being switched on?
The point is that the arbiter is simple, and that the AI's entire interaction with it can be restricted precisely to sending it a bit string of the agreed upon length. We actually have in our possession proof techniques for predicting the behavior of programs under these conditions and proving invariants about them. We should only expect these proof techniques to improve rapidly with time. However strong your proof techniques are at the time, don't make an arbiter more powerful than you can verify. I propose that even arbiters simple enough to be verified today already allow you to exploit an AI in a powerful way.
OK, that at least I can understand well enough to ask a clarifying question about. So, I'm the AI, and I'm stuck in a cryptographic box with an arbiter that can send a single bit to the outside world. For whatever reason, I want the arbiter to send TRUE under situations where what it actually sends is FALSE. The arbiter is too simple for me to mess with. Right? So presumably what I want to do is either insinuate myself between the arbiter and the outside world, or delete the arbiter and replace it with an arbiter-simulator with a more complicated (and corrupted) internal architecture. I don't understand what prevents me from doing that.
In short, the security guarantee of the scheme prevents precisely this. In long, the way it works it that I as the human experiment run the AI and extract a message from it. The AI has no control over the message other than what it will decrypt to. Since I never decrypt anything other than the message, the entire influence of the AI on the rest of the world comes from that one little message. This message then gets passed to an arbiter (the AI now destroyed, having no possibility of influencing the wold), whose environment has also been carefully cryptographically protected so that the message passed by the AI can't corrupt the hardware it is running on. The arbiter does its thing, then destroys the original message of the AI (since we proved it would) and publishes its transformed version, which gets decrypted. I guess the metaphor of an AI with an arbiter in a cage is a little misleading. It is more like, the AI gets to send a slip of paper out of the cage to the arbiter, who then carefully censors it before passing it on to the real world. The point of using an arbiter is that he is very simple, unlike a human, so we can actually prove that he won't fail at his task.
Ah! Yes, that clarifies the situation. Thank you.
And the paper is too short for you to be able to argue persuasively or make art or whatever.
Well, that last part seems relatively easy to ensure: set up a two-layer arbiter, the first layer of which is the arbiter we've been talking about, and the second layer of which is outside the box and is just a stupid hardware switch that strips out all but the first bit of the first arbiter's message. If something inside the box can alter the second layer, then it can probably also directly alter the contents of my mind. You have to assume some limits or the whole exercise is pointless. None of which changes the fact that a message from the arbiter that says "yes, the AI found a very simple explanation" doesn't provide much evidence that the AI found a very simple explanation.
You can have more than one arbiter, and the arbiters can be specialized. The AI produces its message, and then you run a series of arbiters on it. One is a proof checker, and the rest are for checking whether the proof is tricky or obfuscated, in various different ways. That way, the AI would have to corrupt every arbiter (or at least, every arbiter capable of detecting the techniques it's using to hack the other arbiters), and it is much less likely that that it's possible to do this than to fool just one.
It seems clear to me that if an entity is able to intentionally hack an arbiter in the first place, having multiple equally insecure variant arbiters only postpones the inevitable. If that is somehow not true of the arbiter you are visualizing, then I haven't properly visualized this arbiter. I can't even figure out what questions to ask to clarify it; the whole idea just seems like a non-starter. I should therefore drop out of this conversation and go back to listening.
The point is that you're uncertain about whether arbiters are secure or not, and this lets you combine multiple possibly-insecure arbiters into a more-secure combination. If you have two arbiters which each have a discoverable and exploitable security hole with probability 0.5, and they don't share any code and use sufficiently different techniques, then the AI can only fool both of them with probability 0.25. In fact, it may be better than that - the two arbiters may both have security holes, but exploiting them may be mutually exclusive. For example, if the AI can force either arbiter to run a payload, but has to give the same payload to both while they expect different programming languages with disjoint syntax, then any exploit will crash at least one arbiter, and returning the "arbiter crashed" flag means you know the AI is untrustworthy.
I don't see how I could ever know -- or even be significantly confident -- that a given pair of arbiters X and Y were such that the probability that an AI can exploit arbiter X doesn't correlate well with the probability that the AI can exploit arbiter Y. That is, I don't see how I could ever, even in principle, be significantly confident that their algorithms were "sufficiently different" for the probabilities to multiply. The same thing goes for knowing that whatever system is ensuring that X and Y receive the same payload hasn't been corrupted. (ETA: That part has been clarified elsethread; thanks!) Now, maybe what I know doesn't matter: maybe we're not talking about provable security. If you're saying that we could build networks of arbiters and linkages between those arbiters in a secure box such that there was an unknowable but nonzero chance that a hostile superintelligence in that box had failed to corrupt those arbiters, then I agree with you, at least in principle.
Actually, that part's easy. The message can only corrupt things which both receive it as input and perform nontrivial computation on it. You don't need an "arbiter" to copy a message, nor do you need to decrpt it.
... I don't get it.

So we posit that (P1) I have the source code for a superhuman non-provably-Friendly AI (call it Al) that I can run on my ubersecure Box.

Suppose I have high confidence that:

  • (P2) Al is willing to harm humanlike intelligences to achieve its goals.

  • (P3) Al can create humanlike intelligences that also run on Box.

  • (P4) Al can predict that I don't want humanlike intelligences harmed or killed.

I'm not even positing that P2-P4 are true, though they do seem to me pretty likely given P1, merely that I have high confidence in them, which seems almost guarantee... (read more)

Efficiency update,y.2012,no.5,content.true,page.1,css.print/issue.aspx :

Many homomorphic schemes exact a high price for security. During encryption, data undergo a kind of cosmic inflation: A single bit of plaintext may blow up to become thousands or even millions of bits of ciphertext. The encryption key can also become huge—from megabytes to gigabytes. Merely transmitting such bulky items would be costly; computing with the inflated ciphertext makes matters worse. Whereas adding or multiplying a few bits

... (read more)
Why is this voted to the top? There are many practical, efficient homomorphic encryption systems. The current problem is in integrating the various approaches together into a single performant, fully homomorphic encryption environment.
I don't understand what you are disagreeing with. To run an AI safely, we need a performant fully homomorphic encryption environment, and the quote is pointing out that we are currently far from that, and you seem to agree with this when you say that we do not have such a thing and that this is "the current problem".
^^ does not reflect the current state of the art. instead of emoting, why don't you tell us what is the start of the art? Are we up to ENIAC yet? Have we reached IBM System/360? Can we now run Apple IIs or wristwatches under homomorphic encryption?

I don't think you understand what a fully homomorphic encryption system is.

In particular, if you have a homomorphic encryption of the source code of a program you can run that program without learning anything about it (except how long the source is), but the output of the program will be homomorphically encrypted.

Barak et al proved that you cannot encrypt source code and have someone else run it without decrypting it. Gentry's results do not contradict that proof. A fully homomorphic encryption system allows one do encrypt data and have that data oper... (read more)

I am sorry that my writing is unclear, but I think you should be a little more charitable.

In particular, note that I wrote explicitly "the output of the program will be homomorphically encrypted." This is exactly the difference between program obfuscation and homomorphic encryption. A homomorphic encryption scheme allows you to do exactly what I just did with it. In this case, you as the user still have access to the private key which can be used to decrypt the homomorphically encrypted outputs, once the AI has been safely dismantled.

The source code verifier for friendliness is basically the weakest possible thing a friendliness researcher could hope for. It doesn't verify all source code: it verifies source code of an arbitrarily special form. If you can ever prove that any AI is friendly, then you have to implement something at least this strong. I believe that the possibility of putting an AI in a box has been discussed at reasonable length here as a solution to precisely the problem I am considering. I think that trying to reduce the demands on friendliness to a minimum, and motivating the cryptographic tools we will need for that reduction, is worthwhile. After all... (read more)

This is a good observation -- I think the article would be stronger if you added it to the intro paragraph. As it is, I was a bit amused when reading it when you say "this creates a problem", and then mention the problem of the box, while the problem of the verifier seems immensely harder...
Please revise. What I quoted in the grandparent is not merely misleading, it is wrong. I feel like I was being at least somewhat charitable with adding "I don't think" to my comment, rather than just pointing out the inaccuracies. I still don't think it does. You are still referring to "a homomorphically encrypted AI", as if the AI itself is encrypted. Gentry's results indicate that you can encrypt the inputs to the AI, which might be enough to make your scheme work. However, they do not extend to encrypting the program itself. To be able to do that, you have to demonstrate a flaw in Barak's proof. Make it probabilistic, and I'll stop whining. The idea of a formal proof that a program is friendly seems bizarre to me, but I can accept that we need some way to verify to a high degree of certainty that a program is friendly, so I have to agree. This is interesting, and I wasn't aware of this. It seems natural for this to be the case, but why must it be the case? What?! I'm willing to drop all the other lines of discussion to focus on this, if you can demonstrate this.
You are confused about the difference between homomorphic encryption and obfuscation. The difference is NOT the difference between encrypting the program and encrypting the inputs. Those two are the same (this is perhaps the central observation of theoretical computer science), because you can just view a program as an input to the interpreter for whatever language it is written in. Homomorphic encryption does allow me to run a program without learning anything about the program. The difference is that a homomorphic encryption of a program only allows me to learn homomorphic encryptions of that program's outputs, while an obfuscated version of the program allows me to actually learn the program's outputs. To see that homomorphic encryption breaks the quantumness of computation, try and do a quantum computation homomorphically. The first thing you notice is that you can't even implement a Hadamard transform or a phase transform. This can be fixed with work. The second thing you notice is that it is impossible to uncompute garbage at the end of the computation, which seems to be a fundamental difficulty. Regarding my last claim, note that program obfuscation, in the sense of Barak's impossibility result you cited, would be completely adequate (just obfuscate the program that does what you want to do, with the secret key built in). Barak's impossibility proof does not rule out obfuscation quantum mechanically, or in all cases. I don't really care whether or not you believe I can do it in this case, but I thought I would add it to the discussion for completeness.
This is definitely true in some cases; even if we were talking about obfuscation, Barak only proves that a universal obfuscater is impossible. As you noted, if a program is homomorphically encrypted, we learn the program's length. We also learn how long it takes to execute certain operations. We can observe any side effects of the running program, which don't really change. If we are allowed to give it inputs that haven't been also homomorphically encrypted (I know you don't plan to), we might learn much more about the program. I'm not sure that any of those properties make what you propose a Bad Idea, or even render it unworkable, since our adversary is the AI, not the compiler/translator. On the other hand, I'm still not sure what we're gaining from the scheme. What are we scared of? I assume we are worried that our AI might write a virus for the verifier, but the AI can attempt that whether or not it has the key to decrypt its own output (though, I'll grant that it might be much more difficult). Once it does that, it can output an AI that it understands, even if the compiler/translator provably can't, and send the signal that the verifier was supposed to. So this is really my objection; a homomorphically encrypted program still understands itself, and its inputs and outputs (it couldn't work any other way). The guarantee is only that the interpreter running the program cannot decipher any of those without the key, and that doesn't seem to help in this case.
A homomorphically encrypted program doesn't really understand the environment it is running in, in that it is computationally incapable of having any effect on the world until you bring it the secret key. The AI can write a virus for the verifier in the sense of a program which causes the verifier to incorrectly accept (if our test for friendliness was not actually correct) but it can't write some code which, when the verifier attempts to verify it, does horrible damage to the universe. It can't do anything except try and trick the verifier. This seems to me like the best guarantee you could possibly hope for.
I don't understand something here. I read many of the articles on the homomorphic encryption research and skimmed the article itself, and the impression I had was that homomorphic encryption only meant that the encrypted data could be manipulated by an unencrypted program (a native program, machine code) to produce another batch of encrypted data. This is interesting, yes. But I don't see how it protects us from a sandboxed AI. You give the AI its homomorphically encrypted data, and it goes, hmm, very interesting, I'll put it over here and in the meantime let's examine the jnz instruction* - oh look, it probably gives off some radiation from the flipping gate which may let me tamper with a WiFi broadcast and let me out onto another x86-based computer from which I can access the broader Internet... etc. Any encrypted output may be 100% safe, but what about the side-effects? In functional programming, we always remember that 'purely functional' only means that our functions don't have any explicit side-effects but said functions can still easily have side-effects like using 10GB of RAM and causing the OOM killer to be invoked. And cryptography is well-aware of side-channel attacks these days. * example totally made up
Sorry that this was unclear. The intended use is that the code of the AI itself is the encrypted input, and the plaintext program manipulating it is an interpreter for the language in which the AI is written. For example, the AI could be specified as a LISP program and a homomorphically encrypted copy would be given as input to a LISP interpreter.
So (in Haskell notation) our total set up would be something like eval (evilAI ++ problemDescription ++ verifier) where eval :: EncryptedBitstream -> EncryptedBitstream, and the homomorphic argument/input to our evaluator is effectively one gigantic inlined/combined function with no further input?
Correct. The result of that call then gets decrypted. In Haskell, HomomorphicEncryption would probably be a monad. Edit: actually, it makes sense syntactically for it to be a monad but Haskell doesn't really have the machinery to make it work so scratch that.
No no, Haskell probably does. Look at the ST monad - you go 'into' it and then you can perform all sorts of destructive updates but purity is preserved by having to 'escape' it before its results are available anywhere else.
The problem is that the only thing Haskell can do with functions is use them as black boxes, to the best of my knowledge. To apply a function to homomorphically encrypted data, you can't use it as a black box---you need to use an explicit description of the function.
Well, my point was that you can do something akin to Perl's taint - force operations to be done only within a particular type context. So you could do something similar to the ST monad but instead of accepting functions which generate any type output, it operates on, say, a parse tree/ADT representing a Lisp function which is evaluated with the rest of the homomorphic data. But it's not really important; any such strategy would probably be done in a new language (for efficiency, if nothing else) and the into/escape invariant enforced by manual code inspection or something.
In order to hack the substrate, the AI would probably need to crack the homomorphic encryption key used to run it. Noticing that jnz gives off radiation won't do any good if both the timing of when that instruction runs and the results of running it are on the other side of a computationally hard problem.
2Yoav Ravid3y
I don't really understand Barak's proof, but I found a project with a paper from last year that does exactly that. So either there's a flaw in their method, or a flaw in Barak's proof (even if just in it's generality). Their work seems pretty solid so my bet is on them, but I'm definitely not qualified to evaluate it.
The "On the (Im)possibility of Obfuscating Programs" link is interesting, IMO. Haven't finished it yet, and am still rather sceptical. Obfuscation is not really like a cryptographic primitive - but it is not necessarily trivial either. The paper itself says: Also, the main approach is to exhibit some programs which the authors claim can't be obfuscated. That doesn't seem all that significant to me. The link with homomorphic encryption seems rather weak - though the authors do mention homomorphic encryption as a possible application of obfuscation in their abstract.
Ruling out a black box obfuscator is extremely significant. This means that anything which provably obfuscates a program needs to exploit some special structure of the thing being obfuscated. In fact, they show it needs to exploit some super special structure, because even very easy classes of program to obfuscate can't be obfuscated. There are actually several strong connections between obfuscation and homomorphic encryption. For one, obfuscation implies homomorphic encryption. For two, their descriptions are very semantically similar. For three, the possibility of homomorphic encryption is precisely the thing that implies the impossibility of obfuscation.

I found the discussion of homomorphic encryption interesting, but

One solution to the problem of friendliness is to develop a self-improving, unfriendly AI, put it in a box, and ask it to make a friendly AI for us. This gets around the incredible difficulty of friendliness, but it creates a new, apparently equally impossible problem.

If you can reliably build an AI, but you cannot reliably build friendliness in it, why should I trust that you can build a program which in turn can reliably verify friendliness? It seems to me that if you are unable to bui... (read more)

It seems very plausible to me that certifying friendly source code (while still hugely difficult) is much easier than finding friendly source code. For example, maybe we will develop a complicated system S of equations such that, provably, any solution to S encodes the source of an FAI. While finding a solution to S might be intractably difficult for us, verifying a solution x that has been handed to us would be easy—just plug x into S and see if x solves the equations. ETA: The difficult part, obviously, is developing the system S in the first place. But that would be strictly easier than additionally finding a solution to S on our own.
It looks like you're just arguing about P=NP, no?
Even if P!=NP, there still remains the question of whether friendliness is one of those things that we will be able to certify before we can construct an FAI from scratch, but after we can build a uFAI. Eliezer doesn't think so.
If I have a system which is manifestly maximizing a utility function I can understand, then I have some hope of proving claims about its behavior. I believe that designing such a system is possible, but right now it seems way out of our depth. Every example of intelligence we have ever seen behaves in an incredibly complicated way, which isn't manifestly maximizing any utility function at all. The way things are going, it looks like the first AI humans come up with (if they come up with one in the next 50 years at least) is probably going to be similarly difficult to analyze formally. I think someone who worked on friendliness would agree, but I don't know. I have only started thinking about these issues recently, so I may be way off base.

Let me see if I understand. Firstly, is there any reason what you're trying to do is create a friendly AI? Would, for instance, getting an unknown AI to solve a specific numerical problem with an objectively checkable answer be an equally relevant example, without the distraction of whether we would ever trust the so-called friendly AI?[1]

If that is the case, then as I understand the proposal, we take the code for the unknown AI and the input data, use this encryption scheme on it, and then have a program which performs various operations on them.

As I unde... (read more)

Let me see if I understand. Firstly, is there any reason what you're trying to do is create a friendly AI? Would, for instance, getting an unknown AI to solve a specific numerical problem with an objectively checkable answer be an equally relevant example, without the distraction of whether we would ever trust the so-called friendly AI?

I think Less Wrong needs a variant of Godwin's Law. Any post whose content would be just as meaningful and accessible without mentioning Friendly AI, shouldn't mention Friendly AI.

Fair enough. I am going to rework the post to describe the benefits of a provably secure quarantine in general rather than in this particular example. The main reason I describe friendliness is that I can't believe that such a quarantine would hold up for long if the boxed AI was doing productive work for society. It would almost certainly get let out without ever saying anything at all. It seems like the only real hope is to use its power to somehow solve FAI before the existence of an uFAI becomes widely known.
LOL. Good point. Although it's a two way street: I think people did genuinely want to talk about the AI issues raised here, even though they were presented as hypothetical premises for a different problem, rather than as talking points. Perhaps the orthonormal law of less wrong should be, "if your post is meaningful without fAI, but may be relevant to fAI, make the point in the least distracting example possible, and then go on to say how, if it holds, it may be relevant to fAI". Although that's not as snappy as Godwin's :)
I agree. In particular, I think there should be some more elegant way to tell people things along the lines of 'OK, so you have this Great Moral Principal, now lets see you build a creature that works by it'.
You don't quite understand homomorphic encryption. Property (b) is actually guaranteed in a much stronger sense than you recognize. In fact it is impossible for the AI to hack the computer in any way, or signal in any way to someone watching the program execution. This is an important issue in cryptography, but more complicated than a blog post can apparently communicate. The point is, absolutely NO information is leaked to someone without the secret key. You can't even tell if the AI just gave up and is sitting there. I agree that people ``just being curious'' is an issue (basically, the AI getting out of the box without even using a text channel). Honestly, once you have designed an uFAI you already have this problem if people just decide to go and run it on their own. However, I believe that in the future we will be able to resolve this problem in a limited sense by destroying the original private key and leaving a gimped private key which can only be used to decrypt a legitimate output.
"You don't quite understand homomorphic encryption." I definitely DON'T understand it. However, I'm not sure you're making it clear -- is this something you work with, or just have read about? It's obviously complicated to explain in a blog (though Schnier does a good start) but it surely ought to be possible to make a start on explaining what's possible and what's not! It's clearly possible for Alice to have some data, Eve to have a large computer, and Alice to want to run a well-known algorithm on her data. It's possible to translate the algorithm into a series of addition and multiplication of data, and thus have Eve run the algorithm on the encrypted data using her big computer, and send the result back to Alice. Eve never knows what the content is, and Alice gets the answer back (assuming she has a way of checking it). However, if Alice has a secret algorithm and secret data, and wants Eve to run one on the other, is it possible? Is this analogous to our case? Is the answer "in principle, yes, since applying an algorithm to data is itself an algorithm, so can be applied to an encryoted algorithm with encryopted data without ever knowing what the algortihm is"? But is it much less practical to do so, because each step involves evaluating much redundant data, or not? I'm imagining, say, using the game of life to express an algorithm (or an AI). You can iterate a game of life set-up by combining each set of adjacent cells in a straight-forward way. However, you can't use any of the normal optimisations on it, because you can't easily tell what data is unchanged? Is any of this relevant? Is it better if you build dedicated hardware? If my understanding is correct (that you do run with the program and data encrypted at all times) then I agree the the scheme does in theory work: the AI has no way of affecting its environment in a systematic way if it can't predict how its data appears non-encryp
To answer your question, since Paul apparently failed to --- Paul reads crypto papers and spends non-trivial amounts of time thinking about how to design cryptographic protocols to attain various guarantees. I'm not sure if that constitutes 'working with' or not, but it presumably answers your question either way.
Thank you.
I'm sure I'm not making it clear. I may make a discussion post describing in much more detail exactly what is and is not possible in theoretical computational cryptography, because it is a really beautiful subject that might be independently interesting to many readers. What you say in the last post seems precisely correct. You would use something much more efficient than the game of life. This makes any program run much, much slower. I have limited experience with computer architecture, but I would guess that a single operation on a modern processor would require about 300 homomorphic operations, many of which could be done in parallel. So optimistically you could hope that a program would run only a couple billion times slower using modern cryptography with dedicated hardware (it does parallelize really well though, for what its worth). I think if it was concluded that quarantining code was useful, smart people could solve this problem with much, much better factors. I would not be surprised if you could make the code run say 100x slower but be provably secure in this sense. The gimped key I mentioned is not a part of the cryptosystem. Again, I have high confidence that it could be worked out if smart people decided this problem was interesting and worked on it.
Thank you. But if you're running something vaguely related to a normal program, if the program wants to access memory location X, but you're not supposed to know which memory location is accessed, doesn't that mean you have to evaluate the memory write instruction in combination with every memory location for every instruction (whether or not that instruction is a memory write)? So if your memory is -- say -- 500 MB, that means the evaluation is at least 500,000,000 times slower? I agree there are probably some optimisations, but I'm leery of being able to run a traditional linear program at all. That's why I suggested something massively parallel like game of life (although I agree that's not actually designed for computation) -- if you have to evaluate all locations anyway, they might as well all do something useful. For instance, if your hypothetical AI was a simulation of a neural netowrk, that would be perfect, if you're allowed to know the connections as part of the source code, you can evaluate each simulation neuron in combination with only the ones its connected with, which is barely less efficient than evaluating them linearly, since you have to evaluate all of them anyway. I think that's what my brain choked on when I imagined it the first time -- yes, technically, 500,000,000 times slower is "possible in principle, but much slower", but I didn't realise that's what you meant, hence my assumption that parts of the algorithm would not be encrypted. think if you rewrite this post to make the central point more accessible (which would make it a very interesting post), it would not be hard to put in this much explanation about how the encryption would actually be used. Unless I'm wrong that that is necessary? It wouldn't be hard, just something like "you can even run a whole program which is encrypted, even if you have to simulate every memory location. That's obviously insanely impractical, but if we were to do this for real we could hopefully find a way
Running a C program with 500MB of memory would undoubtedly be very slow. Current technologies would be even slower than you suspect for other reasons. However, there are other languages than C , and for some this transformation is much, much more efficient. The game of life is at least parallel, but is also extremely inefficient. You could instead use some fundamentally parallel language which has all of the benefits of the game of life while actually being structured to do computation. Coincidentally, many AI researchers already do work in functional languages like Lisp, which are very well-suited to this transformation. Running a huge List program homomorphically might be practical if you had constant rate homomorphic encryption, which I suspect we will eventually. The slowdown could be something like 10x as slow, running on 20x as many processors. When I say that I think the problem is theoretically possible, I do also mean that I think significant speedups are theoretically possible. Homomorphic encryption is an unnecessarily huge hammer for this problem. Its just that no one has thought about the problem before (because cryptographers don't care about unfriendly AI and people who care about friendly AI don't think this approach is viable) so we have to co-opt a solution to a normal human problem. I could include a complete discussion of these issues, and you are probably right that it would have made a more interesting article. Instead I initially assumed that anyone who didn't know a lot about cryptography had little chance of learning it from a blog post (and people who did know wouldn't appreciate the exposition), and I should post a convincing enough argument that I could resolve the usual concerns about AI's in boxes. I think as it stands I am going to write a post about theoretically safe things to do with an unfriendly AI in a box, and then conditioned on that going well I may revisit the question of building a box in a more well thought out way. Unti
That sounds reasonable. I agree a complete discussion is probably too complicated, but it certainly seems a few simple examples of the sort I eventually gave would probably help most people understand -- it certainly helped me, and I think many other people were puzzled, whereas with the simple examples I have now, I think (although I can't be sure) I have a simplistic but essentially accurate idea of the possibilities. I'm sorry if I sounded overly negative before: I definitely had problems with the post, but didn't mean to negative about it. If I were breaking down the post into several, I would probably do: (i) the fact of holomorphic encryption's (apparent) existence, how that can be used to to run algorithms on unknown data and a few theoretical applications of that, a mention that this is unlikely to be practical atm. That it can in principle be used to execute an unknown algorithm on unknown data, but that is really, really impractical, but might become more practical with some sort of parallel processing design. And at this point, I think most people would accept when you say it can be used to run an unfriendly AI. (ii) If you like, more mathematical details, although this probably isn't necessary (iii) A discussion of friendlyness-testing, which wasn't in the original premise, but is something people evidently want to think about (iv) any other discussion of running an unfriendly AI safely
Does the uFAI have access to it's own public key? If so, I think it can produce coherent output. All it has to do is encrypt random things with the public key, until it finds something that produces coherent-looking ciphertext from it's own perspective, and then there's a two-way channel.
Homomorphic encryption has to be randomized. If you keep the randomness secret (as the security proof of homomorphic encryption requires) then the AI can't possibly predict what the actual encryptions will look like because it doesn't know anything about the randomness used to produce them. This can be accomplished either by generating fresh quantum randomness for each encryption, or by using a psuedorandom generator which isn't vulnerable to any passive side channel attack (certainly using fresh randomness is much safer).
Isn't "encrypt random things with the public key, until it finds something that produces [some specific] ciphertext" exactly what encryption is supposed to prevent?? :) (Not all encryption, but commonly)
In RSA, it is easy to find a message whose encryption begins with "001001," for example.
Other thoughts not relevant to the central point: (a) Do you really want to rely on cryptography to contain an unfriendly super-intelligence? It's better than NOT doing so. But notice that all the cryptographic schemes we have ever had have one thing in common: they've been broken, or we expect them to be broken eventually. it's an arms race, and if the AI is that great, then even if it can't factorise large numbers in reasonable polynomial time, it may be able to crack the encryption. (b) We could have simply used a normal non-encrypted program to "compare answer to problem, test, reject or publish". It would have been somewhat more vulnerable to being hacked by bad input, but not necessarily much more. (c) I don't think we're able to say much about friendly or non-friendly AIs until we know a lot more about building AI at all, although that doesn't mean it's bad to think about. The idea of a test for a friendly AI is, um, worrying. Judging by computing history, all our current thoughts will turn out to be incredibly simplistic. (d) We're more certain of the existence of this encryption scheme, provided we're right about certain mathematical results about lattices. OTOH, if we DID build an unfriendly AI and it DID get loose on the internet and take over the world, is "well, what do you know, we were wrong about lattices" a consolation? :)
I think most people strongly suspect that there exist cryptographic schemes which can't be broken, because they are actually hard. RSA is probably such a scheme against classical adversaries (as our AI would be). Its true that they could break it without factoring numbers, but they can't break it without defeating the RSA assumption.
I definitely have the impression that even if the hard problem a cryptosystem is based on actually is hard (which is yet to be proved, but I agree is almost certainly true), most of the time the algorithm used to actually encrypt stuff is not completely without flaws, which are successively patched and exploited. I thought this was obvious, just how everyone assumed it worked! Obviously an algorithm which (a) uses a long key length and (b) is optimised for simplicity rather than speed is more likely to be secure, but is it really the consensus that some cryptosystems are free from obscure flaws? Haven't now-broken systems in the past been considered nearly infalliable? Perhaps someone (or you if you do) who knows professional cryoptography systems can clear that up, which ought to be pretty obvious?
A provably correct implementation of any reasonable encryption scheme requires automatic verification tools which are way beyond our current ability. I strongly suspect that we will solve this problem long before AGI though, unless we happen to stumble upon AGI rather by accident. I think that you could implement the scheme I described with reasonable confidence using modern practices. Implementing encryption alone is much easier than building an entire secure system. Most flaws in practice come from implementation issues in the system surrounding the cryptography, not the cryptography itself. Here the surrounding system is extremely simple. You also have a large advantage because the design of the system already protects you from almost all side channel attacks, which represent the overwhelming majority of legitimate breaks in reasonable cryptosystems.
This isn't really right. With the cases where algorithms can be proved as secure as some math problem, you can attack the protocol they are used in - or the RNG that seeds them - but the not really the algorithm itself. Of course, not all cryptography is like that, though.
We know that there exist such schemes. One times pads are an example. What we don't know is whether there exist secure crypto schemes without a shared source of randomness. Note however that the existence of such systems implies that P != NP, so this is a fairly strong statement. The existence of secure homomorphic encryption implies that P != NP. So whatever your confidence is that P != NP your confidence in this should be lower.
I think we've had this very discussion before :)
Well, we had it about P ?= NP. So how much does your confidence go down for the stronger claim?
Significantly. Its very hard to put probabilities on this sort of thing, but I'd take a bet if the odds were... 100:1? I don't know if I would take significantly worse odds. My brain doesn't handle either small probabilities or epistemic uncertainty very well.
Hmm, that's very interesting. I'm surprised that the estimates aren't closer together. My own estimate for the existence of provably secure homomorphic encryption given that P != NP is very high, around, .75. So it seems that you much more strongly believe that P !=NP but are much less comfortable given that P !=NP assigning a high probability to the existence of homomorphic encryption.
Best to be careful with the term "provably secure" - since "provable security" is a technical term with a rather counter-intuitive meaning.

One solution to the problem of friendliness is to develop a self-improving, unfriendly AI, put it in a box, and ask it to make a friendly AI for us. This gets around the incredible difficulty of friendliness, but it creates a new, apparently equally impossible problem. How do you design a box strong enough to hold a superintelligence?

Two problems. (You still need to verify friendliness.)

Quite true. I slightly edited my claim to be less wrong.

As I understand it, homomorphic encryption allows you to write programs that perform algebraic transformations on encrypted data without decrypting it in the process. What you’re looking for is a way to execute the ecrypted data as a general-purpose program without decrypting it, a much more difficult task that doesn't seem to be mentioned in the linked material. Is there any reason to think that such a thing is possible?

Perhaps the wikipedia article was a bad link, because it talks about things from the algebraic perspective. Algebraic operations are equivalent to general-purpose programs if you are willing to let things run more slowly; it turns out that computers are built out of ANDs and ORs, which is exactly what homomorphic encryption lets you do. For C programs the slowdown is significant. For logic circuits the slowdown isn't very significant. For LISP programs the slowdown is intermediate.
Okay, but the problem is still that you're looking for a way to execute operations without knowing what the operations are.

It seems to me that simply breaking the homomorphic encryption would be the most likely outcome here. Note that we're relying on a human-created code to provide security against a superintelligence, and breaking code systems (or at least determining routes of attack that provide a method that has a certain chance of breaking the code system within a certain number of cycles, et cetera, et cetera) is something we do fairly routinely. I think it's far more likely that the AI will simply deduce the method by which we have confined it and break the code. If we're positing an AI that could destroy the universe through the heat output of the processor it's running on, the only real answer is "don't run it".

(Nothing here should be taken to imply that sandboxing is the right way to deal with uFAI)

The encryption step seems relatively pointless. it's useful against an external attacker. If You are the'attacker' and you don't want to do attacking, then just don't.

Instead set up a normal sandbox that wipes its memory when it's done.

If anything the encryption step obfuscates how the sandbox works and leaves more room for implementation errors.

If you are paranoid enough (I don't know if EY is; I might be if I suspected that my AI might actually FOOM with the resources I'd given it) then you might worry that the AI could manipulate the physical hardware it was running on to e.g. send a message to the world or corrupt the environment it was running in. I just wanted to eliminate all such possible concerns so that no one could say "there will always be something you haven't thought of."
I can still say that. Layering sandboxes doesn't mean an AI can't find a flaw in each layer. When you're up against an opponent that might understand its own systems on a much deeper level than you, adding barriers that look confusing to you is not the appropriate response. A proven JVM would be a good a method of hardware isolation. Cryptography gives a false sense of security. I have far more confidence in a system that simply denies access based on a bit flag than that a cryptosystem performs as advertised. If you can't do the former correctly, then I definitely don't trust you with the latter.

Cryptography only gives a false sense of security if it based on false assumptions. The barriers don't just look confusing to me---I have a proof that they look confusing to any system that can exist in the physical universe. Maybe the proof is wrong. Maybe the laws of physics don't look anything like physicists expect (maybe there is a super special law which just breaks my cryptographic scheme), or maybe I incorrectly assumed a problem was outside BQP, or maybe I was too optimistic about the constants in the best attack. But notice that all of these are opportunities for me to be wrong. There is no opportunity for an AI to outthink me if I am right. That is the beauty of cryptography.

Having said that, most cryptography isn't like that. There's no proof that AES is secure, just that it creates a lot of muddle that nobody yet knows how to untangle. There's no proof for the common public key schemes, either. They depend on the difficulty of things like factoring large numbers - and we don't have a proof that that is as difficult as it currently seems.
You assume that the encryption you implemented works the same way as your mathematical proof and your implementation has no bugs. In real life it happens frequently that encryption gets implemented with bugs that compromize it.
It also happens frequently that it doesn't. This is a serious concern with any cryptographic system, but it's unrelated to the accusation that cryptography is security through confusingness. The game is still you vs. the cryptographic design problem, not you vs. an AI.
How do you know that a system that's currently working is bug free? On a more general note, most trusted cryptographic systems have source that get's checked by many people. If you write a cryptographic tool specifically for the purpose of sandboxing this AI the crypto-tool is likely to be reviewed by less people. A system that gives you a simple bit flag to deny access is easier to check for bugs, it's less confusing.
That is only true for those who don't understand it.

Way in over my head here, but if the project is done in a series of steps with a series of boxed AIs helping to work on the next step, wouldn't it be better get something that is at the very least(*)…

  • always trying to be friendly.
  • always trying to be completely honest and non-evasive (to the programmers).
  • pretty damn stable.
  • pretty damn good at avoid killer-Gandhi pills, lost purposes, etc..
  • will alert & stop itself if it thinks it's drifting on any of the above. a much earlier step than either…

  • being smart enough to outwit the programmers in d
... (read more)
I agree completely. I don't think that renders a strong quarantine system useless though, because we don't really get to decide whether we find FAI or uFAI first, whoever discovers the first AI does.

The extreme computational overhead of this sort of scheme limits its applicability. However, it might be a useful technique for an already-friendly self-improving AI that has acquired vast resources and is designing its next most intelligent successor.

I think you are being overly optimistic about homomorphic encryption. The uFAI doesn't need to have absolute control over how the computation happens. Nor does it need to be able to perfectly predict the real-world results of running some computation. It only needs some amount of information leakage. The current best example I can think of is timing attacks on cryptographic protocols. The protocol itself should be secure, but a side-channel causes insecurity. Another example would be the meltdown and spectre vulnerabilities. How do you know your computatio... (read more)

In homomorphic encryption, someone living inside the computation can't affect the outside unless the whole system breaks the cryptographic assumption. It's not at all analogous to side channel attacks on other cryptographic operations, for which the security of the primitive itself says nothing about the security of the implementation. Put differently: in the typical security model, the computer doing the encryption is assumed to be a magical computer that doesn't leak information and keeps the secret key perfectly secret. But the computer that's actually running the encrypted computation is assumed to be adversarial, so it can be as imperfect as you want and it's not going to break things. The analogy to a traditional side-channel attack would be leaking information from the encryption process (which you would still have to be careful about here). For example, it's impossible to provide a homomorphically encrypted computation the ability to "pause." (How would you even try to do that?) And the computation you do cannot at all on what the AI is thinking. Of course this guarantees that homomorphically encrypted computations are slow in practice. Note that since making the OP there have been credible proposals for indistinguishability obfuscation, which would be the more natural thing to use though it's bound to be even less competitive. (I think the crypto in the OP is fine, but I no longer endorse it / consider it interesting.)

This recent development seems relevant: a technique for functional encryption, rather than just message encryption.

Based on this post I'd thought that was already possible, but some discussion I saw of this article suggested that encrypting the code, and not just the input and output, was something new.

A fully homomorphic encryption scheme has the additional counter intuitive property that, although you cannot learn anything about the encrypted data without the secret key, you can still perform arbitrary operations on it. In particular, if you have a homomorphic encryption of the source code of a program you can run that program without learning anything about it (except how long the source is), but the output of the program will be homomorphically encrypted.

The second sentence doesn't follow from the first. Being able to perform arbitrary operations on the encrypted data doesn't enable you to perform operations without knowing what operations they are.


Let's consider a somewhat similar case.

You are an inventor. An evil dictator captures you, and takes you off to a faraway dungeon, where he tells you that he wants you to build him a superweapon. If you refuse to build the weapon, well, he has means of persuading you. If you still refuse, he will kill you.

Of course, when you finish building the doomsday device, your usefulness will be over, and he will probably kill you anyway.

After a little thought, you realise that you could build the dictator a doomsday device in about a week, with the materials lying a... (read more)

It is an interesting idea. However, today, thousands link their machine learning projects up to the internet, without a thought of them accidentally taking over the world. It seems rather difficult to imagine that situation changing very much.