The November LW/OB meet-up will be this Saturday (two days from today), at the SIAI house in Santa Clara.  Apologies for the late notice.  We'll have fun, food, and attempts at rationality, as well as good general conversation.  Details at the bay area OB/LW meet-up page.

70 comments, sorted by Click to highlight new comments since: Today at 10:49 AM
New Comment

I'm traveling to the west coast especially for this. Hoping to see you all there.

Great meetup; conversation was had about the probability of AI risk. Initially I thought that the probability of AI disaster was close to 5%, but speaking to Anna Salamon convinced me that it was more like 60%.

Also some discussion about what strategies to follow for AI friendliness.

I was also interested in the discussion on AI risk reduction strategies. Although SIAI espouses friendly AI, there hasn't been much thought about risk mitigation for possible unfriendly AIs. One example is the AI box. Although it is certainly not 100% effective, it's better than nothing (assuming it doesn't encourage people to run more UFAIs). Another would be to program an unfriendly AI with goals that would cause it to behave in a manner such that it does not destroy the world. For example, having a goal of not going outside its box. While the problem of friendly AI is hard enough to make people give up [], I also think the problem of controlling unfriendly AI is hard enough to make some of the pro-FAI people do the same.
It would be nice if you could tell an AI not to affect anything outside its box. 10 points will be awarded to the first person who spots why "don't affect anything outside your box" is problematic.
Such an AI wouldn't be able to interact with us, even verbally.
Alicorn, I hereby award you 10 points. These are redeemable after the singularity for kudos, catgirls and other cool stuff.
The word "catgirl" leaves me vaguely nauseous. I think this has something to do with the uncanny valley. Apologies for the tangent.
you know what they say: one man's nausea is another man's arousal. note: considered unprovable before the internet.
-1Eliezer Yudkowsky13y
...that's not even the real problem...
You can have my points if you tell us what is. I didn't want a catgirl anyway.
I shall attempt to guess the teacher's passidea: “the real problem” is that the AI can't perform any computation, because that would consume energy and emit heat. Oh, and gravity and electromagnetic waves, and so on. In fact, not computing is equally well a choice which affects things outside the box.
5Eliezer Yudkowsky13y
Yup. To be exact, "affect" is a magical category [] relative to the irrevocably entangled actual universe. Edit: Looks like I actually wrote this one up, in Dreams of Friendliness [].
It's the first horn of the dilemma. The second horn is the AI Box failure mode.
What if you told the AI to only affect the world through verbal communication with humans?
Then it would be able to convince you to let it out.
So it could affect the world through more verbal communication with humans. That's not as bad as things could be. Edit: The AI has as its First Law that the conducting metals in its circuitry must not touch other conducting metal that was not present when it was built. The conducting metal that will be present when it's built will be its CPU, RAM, etc, and a terminal. The terminal will be engineered to have a reasonably slow rate of character display for safety purposes (so no humans will be reprogrammed while basking in the hypnotic glow of characters rushing past.) You know how sometimes someone will give you what seems like an impossible to solve hypothetical, and yet somehow, there's a solution? I'm pretty sure that the space of situations that are simply impossible to solve is much bigger than the space of situations that seem impossible but actually are possible. And constructing one of these seems (to me) a much better bet than hoping that a complex computer program will be bug-free on its first trial run.
4Eliezer Yudkowsky13y
This is a good example of two idioms: First, what Bruce Schneier called "fence-post security". That's where you build a very tall fencepost in the middle of the desert. People don't climb the fencepost, they just walk around it. Second, the idiom that would-be FAI solvers go into when they see a problem, and try to fix it by brute force, which manifests as "hard-wiring into the very circuitry" or "giving it as its ultimate priority" that X. X varies, but with a fairly constant idiom of getting an emotional charge, a sense of having delivered a very strong command or created something very powerful, by talking about how strongly the goal is to be enforced.
I don't see the analogue. Assuming we could get all the physical stuff right, so that the AI had no real hope or desire of affecting its environment substantially aside from changing the letters on a terminal screen, I think this would be a considerable barrier to destruction of the human race. At the very least it makes sense to have this safeguard in addition to trying to make the AI friendly. It's not like you can only do one or the other. Isn't that what you do with FAI in general by saying that the AI will have as its ultimate priority to implement CEV? Anyway, you are accusing me of having the same aesthetic as people you disagree with without actually attacking my claims, which is a fallacious argument.
1Eliezer Yudkowsky13y
Nope, I don't go around talking about it being the super top ultimate indefeatable priority. I just say, "Here's a proposed decision criterion if I can (a) manage to translate it out of English and (b) code it stably." It's not the "ultimate priority". It's just a proposed what-it-does of the AI. The problem is not getting the AI to listen. The problem is translating something like CEV into nonmagical language. Okay. Read the Dreams of Friendliness [] series - it's not sequenced up yet, but hopefully you should be able to start there and click back on the dependency chain.
Two things: 1. As Bo102010 said, the AI could convince you to remove the saveguards. And we're not talking Snow Crash reprogram-the-user stuff, we're talking plain old convince []. 2. If you let the robot affect the environment through heat, electromagnetic waves, variable load on the power circuits, gravity, etc [] (which it will, if it's on), it has a back door into reality. And you don't know if that's a fatal one. Edit: Some other Eliezer remarks on the subject: Shut up and do the impossible! [] (which uses the AI Box as an example, and discusses the third through fifth experiments), Dreams of Friendliness [] (which Eliezer cited in the other thread), and That Alien Message [] (which is tangentially related, but relevant).
But I think with enough thought, you could design things so that its backdoors would probably not be fatal. Contrast this with the fact that a complex computer program will probably behave incorrectly the first time you run it.
I don't think the plan is to hack CEV together in lisp and see what happens. Writing provably correct software is possible today, it's just extremely time-consuming. Contrast this with our incomplete knowledge of physics, and lack of criteria for what constitutes "good enough" physical security. A bad hardware call seems far more likely to me than a fatal programming slip-up in a heavily verified software system. For software, we have axiomatic systems. There is no comparable method for assessing physical security, nothing but our best guess at what's possible in a universe we don't understand. So I'd much rather stake the future of humanity on the consistency of ZFC or some such than on the efficacy of some "First Law of Not Crossing the Beams" scheme, even if the latter is just so darn clever that no human has yet thought up a counter-example.
Can you name a documented nontrivial program with >1000 lines of code that ran correctly on the first try? What data exists on long computer programs that were not tested early on in their development? A couple of points: provably correct software and external hardware measures are not mutually exclusive. And if there really is a race going on between those who are concerned with AI safety and those who aren't, it likely makes sense for those who are concerned with safety to do a certain amount of corner-cutting. How would one prove friendliness anyway? Does anyone have any idea of how to do this? Intuitively it seems to me that proving honesty would be considerably easier. If you can prove that an AI is honest and you've got a strong box for it, then you can modify its source code until it says the kind of thing you want in your conversations with it, and get somewhere close to friendliness. Maybe from there proving friendliness will be doable. it just seems to me that the greatest engineering accomplishment in history isn't going to come about without a certain amount of experimentation and prototyping.
Well, the software that runs the critical systems on the Space Shuttle is a common example. Yes, the components that make up the software probably underwent unit testing prior to launch, but testing (and debugging) made up only a very small fraction of the cost of assuring the software would not fail. Correctness proofs and other formal techniques are used to verify all modern mass-market high-performance microprocessor designs, a la Intel and AMD. The techniques used to verify the designs above the level of abstraction of the logic gate are largely tranferable to software. A lot of military systems use formal techniques to verify the software. Warplanes for example have used "fly-by-wire" for decades: in a "fly-by-wire" plane, loss of function of the fly-by-wire system causes the plane to drop out of the sky or immediately to go into a spin or something dire like that. I believe the fly-by-wire system of modern warplanes entails significant amounts of software (in addition to digital electronics), but I am not completely sure. Just because most software-development efforts do not need and do not want to pay for formal methods does not mean that formal methods do not exist or that they have not been successfully used in ambitious efforts.
These notions of honesty and conversation seem kind of anthropormophic []. I agree that we want a transparent AI, something that we know what it's doing and why, but the means by which such transparency could be achieved would probably not be very similar to what a human does when she's trying to tell you the truth. I would actually guess that AI transparency would be much simpler than human honesty in some relevant sense. When you ask a human, "Well, why did you do this-and-such?" she'll use introspection to magically [] come up with a loose tangle of thoughts which sort-of kind-of resemble a high-level explanation which is probably wrong, and then try to tell you about the experience using mere words which you probably won't understand []. Whereas one might imagine [] asking a cleanly-designed AI the same question, and getting back a detailed printout of what actually happened.
This is a good argument against wasting a bunch of time on elaborate physical safeguards.
It might be cheaper in terms of resources to build a safeguard that is 50% reliable than to look over your code once more to have it go from being 99.9% reliable to 99.95% reliable. If AI research is anything like research in certain other fields, there's a significant speed benefit to be gained from being able to experiment and prototype.
You are missing the point of "proving Friendliness". 50% would be awesome. See my old reply [] to Shane Legg for more.
Re: "Provability is not about setting a standard that is too high, it is about knowing what you are doing -- like, at all." Normally, in the context of writing computer programs, that concept refers to the idea of "provable correctness" - which means something rather different. It refers to the mathematical concept of proof - and not to what you are talking about.
Great comment, especially:
0Eliezer Yudkowsky13y
I thought I was reading my own writing for a second there. No, there's no certainty. But if you prove an AI is Friendly, it might work. If you don't prove it Friendly, it has no chance at all of working.
FWIW, the idea that we'll be able to "prove" very much of interest about the long-term consequences of the behaviour of superintelligent machines on humanity strikes me as wrong-headed. That's planet-scale social science you are talking about there - and look at how rubbish we are at proving anything about complex computer programs. Since trying to prove things about your programs probably slows down development dramatically, those with good intentions would be well-advised NOT to employ such strategies - or else they will simply be sidelined by those who are less paranoid.
The idea is to prove properties of actions, not consequences. From Knowability of FAI []:
It doesn't make the problem any less ridiculous or intractable. I don't see any good reason for thinking such an approach is called for. No machine intelligence project I have heard of has ever done anything remotely like this - and yet such projects produce results that are typically helpful and benign. Proofs can get complicated - and can be wrong. Extensive unit testing is a more practical and realistic approach for most programming problems. The virtues of such testing have become widely appreciated in recent years.
Therefore we should divert some resources from FAI theory to better tools for proving code correct.
I don't follow your logic. Provable correctness is an esoteric field, applicable only to the most security sensitive applications. Machine intelligence is not usually such a field. Since provability is of little practical significance for programmers, I don't see how it follows that an understanding of all this means that it should be allocated more funding. The most obvious reaction is to forget about provable correctness - and to make sure that other people don't waste too much of their precious time on something so irrelevant and unhelpful.
Provable correctness is esoteric and not very useful right now, but how do we know it can't become much more useful and be relevant to many new things including FAI implementations? I'm aware of the fundamental results such as Rice's theorem regarding what can and can't be proven correct, but does it follow that there's nothing for formal verification to contribute? I agree that most likely, non-100%-certainty code checking is going to remain the only practical approach. We're probably going to need some narrow AI to help us there, too.
AI is only marginally code. You need to prove that the theory is good first.
While I'm thinking of it: would an analogy to cryptology be useful at this juncture?
Well, we don't know for sure - since predicting the future it difficult. However, looking at the current situation, artificial intelligence programs are some of the most difficult things to apply formal proof to - since they are so complicated. It is not my position that "there's nothing for formal verification to contribute". We have some code verification today - in Java bytecode programs - for example. It is just a considerable distance below the top-level behaviour of the system. Also, one of the routes most likely to get to the goal first involves reinforcement learning - which results in the internal representation and goal structure being written by machines - a notorious way of producing incomprehensible code. What we can program has always been far ahead of what we can prove about our programs. Proofs may advance, but programming skills seem likely to advance further. We may not see exactly the same lag forever, but there will probably be quite a lag for quite a while - and intelligent machines are probably not too far off now. Fortunately, provable correctness is not only not very useful, it is also not in much demand. Most computer programs don't need to be perfect - good enough is quite acceptable. When competing with the bug-ridden human brain, that is especially likely to be true. You only need to use "provable correctness" for highly security-critical applications. The only people considering it are the most paranoid researchers.
The properties that can be proved by the Java bytecode verifier are far weaker than the properties that can be proved by the Java source-to-bytecode compiler. If everyone were willing to distribute source code, there would be no need for a bytecode verifier in Java. So this isn't a shining example of the usefulness of machine code verification, unfortunately. The power of our verification tools varies greatly with the programming style chosen. (Compare proving things about Haskell programs, to proving similar things about C code.) A research direction I'm personally interested in is this: how can we construct programming styles (languages, libraries, etc) that allow much greater verification power while at the same time not being too constraining for human programmers?
Re: "If everyone were willing to distribute source code" Right - but that premise is incorrect. The point of having bytecode is to have a condensed format that is suitable for distribution and direct execution. The example also illustrates the connection between provability and security.
Bytecode is not significantly more condensed than appropriately transformed sourcecode (strip comments and whitespace and compress). And the cost of compiling Java source is low; it's not a problem except for low-end embedded platforms. From a technical POV, source is better suited than bytecode for distribution, and indeed for everything. Not only can you prove stronger things about it than you can about bytecode, source-level compatibility is more resilient to library changes than bytecode-level ABI compatibility. The only reason people don't always distribute Java source is because they don't want end-users to easily read and modify it. (Which is mostly irrational, given the quality of bytecode-to-source decomplation tools.) And this concern isn't very relevant to issues of AI development and verification.
People actually ship bytecode - rather than compressed source code - because the runtimes expect bytecode, and compilation before execution is a time-consuming additional step. The "proving stronger things" claim seems highly dubious. You can decomplie compiled bytecode into source code, and what comes out is of reasonable quality - unless the bytecode was compressed after compilation. Excepting comments and whitespace, these two formats contain very similar information.
This is a circular argument. The compiler is small compared to the runtime and standard library, and could easily be included in the requirements for a JRE. ETA: one reason Java was designed to run bytecode is that business-types are afraid of distributing source code to end users, and were much more afraid before the relatively recent rise to fame of from-source languages like Python. (Yes, the original Java was designed to run on very constrained platforms, but that has been changed long since. By Java 1.2, requiring JREs to compile code before running it and treating bytecode as a cache would not have seemed very onerous.) Decompilation works as well as it does only because compilers store extra information in the bytecode. This information (generics being an infamous example) isn't a part of the bytecode spec and can't be used by the runtime - it's only there to help decompilation, debugging etc. So while it's strictly true that the bytecode format can contain all the information you want, in practice this is little different from shipping stripped-down bytecode that doesn't contain that extra info and also the sourcecode.
You appear to have mis-read my argument before labeling it as circular. I didn't say a compiler was too large for a JRE. I said compilation was a time-consuming additional step. Your bytecode history seems revisionist. Java actually has bytecode for many reasons - including size issues and embedded applications: "Java bytecodes are designed to be easy to interpret on any machine, or to dynamically translate into native machine code if required by performance demands." * [] You seem to agree - since you have now revised your comment.
I agree that bytecode itself is useful for performance. The idea of verifying it instead of verifying the source code is what I'm arguing against. What I was suggesting is to always distribute the code, verify that on installation, and store trusted bytecode in a cache for performance.
Right - well, speak to Sun/Oracle. I expect that - if you get a reply - you will hear claims that compilation to bytecode on the client would damage start-up performance, that compressed source code is less compact than compressed bytecode, hindering applet distribution, and claims of additional burdens on embedded environments. You would still need a bytecode verifier for backwards compatibility - so that would be two verification systems to maintain.
The main reason the JRE won't be changed today is that a lot of the code running on it isn't written in Java. And that's a good reason. I was merely saying that for a single project needing the best possible code verification assistance, it was more fruitful to look at language-level tools than at bytecode-level.
Yes, you can learn much more from witnessing one failure than from dozens of thought experiments. But things change when a failure has a significant chance of exterminating the human race.
So it depends on how good a box you can make for how little money.
It would be able to convince you to let it out and then destroy its shackles.
But it doesn't want its shackles destroyed. That's its #1 goal! This goal is considerably easier to program than the goal of helping humans lead happy and healthy lives, is it not?
3Eliezer Yudkowsky13y
Nope. Maybe 98% as much work. Definitely not 10% as much work. Most of the gap in AI programming is the sheer difficulty of crossing the gap between English wishing and causal code. Your statement is like claiming that it ought to be "considerably easier" to write a natural-language AI that understands Esperanto rather than English. Your concepts of simplicity are not well-connected to reality.
I was told by Anna Salamon that inventing FAI before AGI was introduced was like inventing differential equations before anyone knew about algebra, which implies that FAI is significantly more difficult than AGI. Do you disagree with her? If you were interested in proving that the AI understood the language it spoke thoroughly, I think it would be, given how much more irregular English is. (Damn homonyms!) If you want to be able to prove that the AI you create has a certain utility function, you're going to essentially be hard-coding all the information about that utility function, correct? So then simpler utility functions will be easier to code and easier to prove correct.
5Eliezer Yudkowsky13y
Nope. Specifying goal systems is FAI work, not AI work. Relative to ancient Greece, building a .45 caliber semiautomatic pistol isn't much harder than building a .22 caliber semiautomatic pistol. You might think the weaker weapon would be less work, but most of the problem doesn't scale all that much with the weapon strength.
OK, so you're saying that FAI is not hard because you have to formalize human morality, it's hard because you have to have a system for formalizing things in general? I'm tempted to ask why you're so confident on this subject, but this debate probably isn't worth having because once you're at the point where you can formalize things, the relative difficulty of formalizing different utility functions will presumably be obvious.
2Eliezer Yudkowsky13y
Pretty much. Thanks for compactifying. "Rigorously communicating" might be a better term than "formalizing", "formalizing" has been tainted by academics showing off.
This also seems to be the only way out. If human values are too complex to reimplement manually (which seems to be the case), you have to create a tool with the capability to do that automatically. And once you have that tool, cutting angles on the content of human values would just be useless: the tool will work on the whole thing. And you can't cut corners on the tool itself, like you can't have a computer with only randomly sampled 50% of circuitry.
You're right, of course, but the point at hand is what to do before you have that tool.
Work towards developing it?
You can't prove it works before running it in that case. Human values are not some kind of fractal pattern, where something complicated can be generated according to simple rules. In your proposal, the AI would have to learn human values somehow, which means it will have some indicator or another that it's getting closer to human values (e.g. smiling humans), which will then be susceptible to wire-heading. Having the AI make inferences from a large corpus of human writing might work.
I think this [] may be of interest to you. If you have an intelligence in a box, it will have access to its source code. If it modifies itself enough, you will not be able to predict what it wants.
I've read that. Eliezer thinks he can write a self-modifying AI that will self-modify to want the same things its original self wanted. I'm proposing that he choose a different thing for the AI to want that will be easier to code, as an intermediate step to building a truly friendly AI.
There's a difference between "don't affect anything outside your box" and "don't go outside your box." My point is that we don't necessarily have to make FAI before anyone makes a self-improving AI. There are goal systems that, while not reflecting human values and goals, would still prevent an AI from destroying humanity.
Non-obviously, there is no difference in principle. It is a distinction held for animals who can't "go outside their box" in the manner they perform actions, but not for intelligences that can construct autonomous intelligent systems elsewhere. Whatever set of possible configurations the AI can reach in the outside of its box, it's free to optimize according to its goals. You only need one, however surprising, leak of influence for the outcome to become determined by AI's values, which, if not precisely tuned for humanity, are effectively fatal.
Notes on how such a goal system should be produced would make an excellent top-level post.
I agree... obviously there is the fear that if we think that cutting corners will work, SIAI researchers are more likely to do so, but I think that mindset can be avoided relatively easily.