What a practical plan for Friendly AI looks like

I have seen too many discussions of Friendly AI, here and elsewhere (e.g. in comments at Michael Anissimov's blog), detached from any concrete idea of how to do it. Sometimes the issue is the lack of code, demos, or a practical plan from SIAI. SIAI is seen as a source of wishful thinking about magic machines that will solve all our problems for us, or as a place engaged in a forever quest for nebulous mathematical vaporware such as "reflective decision theory". You will get singularity enthusiasts who say, it's great that SIAI has given the concept of FAI visibility, but enough with the philosophy, let's get coding! ... does anyone know where to start? And you will get singularity skeptics who say, unfriendly AI is a bedtime ghost story for credulous SF fans, wake me up when SIAI actually ships a product. Or, within this subculture of rationalist altruists who want to do the optimal thing, you'll get people saying, I don't know if I should donate, because I don't see how any of this is supposed to happen.

So in this post I want to sketch what a "practical" plan for Friendly AI looks like. I'm not here to advocate this plan - I'm not saying this is the right way to do it. I'm just providing an example of a plan that could be pursued in the real world. Perhaps it will also allow people to better understand SIAI's indirect approach.

I won't go into the details of financial or technical logistics. If we were talking about how to get to the moon from Earth, then the following plan is along the lines of "Make a chemical-powered rocket big enough to get you there." Once you have that concept, you still have a lot of work to do, but you are at least on the right track - compared to people who want to make a teleportation device or a balloon that goes really high.

But I will make one remark about how the idea of Friendly AI is framed. At present, it is discussed in conjunction with a whole cornucopia of science fiction notions such as: immortality, conquering the galaxy, omnipresent wish-fulfilling super-AIs, good and bad Jupiter-brains, mind uploads in heaven and hell, and so on. Similarly, we have all these thought-experiments: guessing games with omniscient aliens, decision problems in a branching multiverse, "torture versus dust specks". Whatever the ultimate relevance of such ideas, it is clearly possible to divorce the notion of Friendly AI from all of them. If an FAI project were trying to garner mass support, it would first need to be comprehensible, and the simple approach would be to say it is simply an exercise in creating artificial intelligence that does the right thing. Nothing about utopia; nothing about dystopia caused by unfriendly AI; nothing about godlike superintelligence; just the scenario, already familiar in popular culture, of robots, androids, computers you can talk with. All that is coming, says the practical FAI project, and we are here to design these new beings so they will be good citizens, a positive rather than a negative addition to the world.

So much for how the project describes itself to the world at large. What are its guiding technical conceptions? What's the specific proposal which will allow educated skeptics to conclude that this might get off the ground? Remember that there are two essential challenges to overcome: the project has to create intelligence, and it has to create ethical intelligence; what we call, in our existing discussions, "AGI" - artificial general intelligence - and "FAI" - friendly artificial intelligence.

There is a very simple approach which - like the idea of a chemical-powered rocket which gets you to the moon - should be sufficient to get you to FAI, when sufficiently elaborated. It can be seen by stripping away some of the complexities peculiar to SIAI's strategy, complexities which tend to dominate the discussion. The basic idea should also be thoroughly familiar. We are to conceive of the AI as having two parts, a goal system and a problem-solving system. AGI is achieved by creating a problem-solving system of sufficient power and universality; FAI is achieved by specifying the right goal system.
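A toy sketch may make the two-part decomposition concrete. It is only an illustration under heavy simplifying assumptions: the "problem-solving system" is a brute-force search over a handful of actions, and the "goal system" is a scoring function handed to it. Every name here (choose_action, predict, near_ten, near_four) is hypothetical and is not taken from any SIAI or AGI design.

```python
from typing import Callable, Iterable


def choose_action(
    actions: Iterable[int],
    predict: Callable[[int], float],
    goal: Callable[[float], float],
) -> int:
    """Problem-solving system: return the action whose predicted outcome
    the goal system scores highest. It contains no values of its own."""
    return max(actions, key=lambda action: goal(predict(action)))


def predict(action: int) -> float:
    """Hypothetical world model: the outcome is twice the action taken."""
    return 2.0 * action


def near_ten(outcome: float) -> float:
    """One possible goal system: prefer outcomes close to 10."""
    return -abs(outcome - 10.0)


def near_four(outcome: float) -> float:
    """A different goal system: prefer outcomes close to 4."""
    return -abs(outcome - 4.0)


# Same problem solver, two different goal systems.
print(choose_action(range(8), predict, near_ten))   # 5
print(choose_action(range(8), predict, near_four))  # 2
```

The search procedure is identical in both calls; only the goal function changes what it ends up doing. That is the sense in which AGI and FAI are treated here as separable design problems.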

SIAI, in discussing the quest for the right goal system, emphasizes the difficulties of this process and the unreliability of human judgment. Their idea of a solution is to use artificial intelligence to neuroscientifically deduce the actual algorithmic structure of human decision-making, and to then employ a presently nonexistent branch of decision theory to construct a goal system embodying ideals implicit in the unknown human cognitive algorithms.

The practical approach would not bother with this attempt to outsource the task of designing the AI's morality, to a presently nonexistent neuromathematical cognitive bootstrap process. While fully cognizant of the fact that value is complex, as eloquently attested by Eliezer in many speeches, the practical FAI project would nonetheless choose the AI's goal system in the old-fashioned way, by human deliberation and consensus. You would get a team of professional ethicists, some worldly people like managers, some legal experts in the formulation of contracts, and together you would hammer out a mission statement for the AI. Then you would get your programmers and your cognitive scientists to implement that goal condition in a way such that the symbols have the meanings that they are supposed to have. End of story.

So far, all we've done is to make a wish. We've decided, after appropriate deliberation, what to wish for, and we have found a way to represent it in symbols. All that means nothing if we can't create AGI, the problem solver with at least a human level of intelligence. Here again, SIAI comes in for a lot of criticism, from two angles: it's said to have no ideas about how to create AGI, and it's said to actively discourage work on AGI, on the grounds that we need to solve the FAI problem first. Instead, it only discusses hopelessly impractical models of cognition like AIXI and exact Bayesian inference, that are mostly of theoretical interest.

Our practical FAI project has "solved" FAI by simply coming to an agreement on what to wish for, and by studying with legalistic care how to avoid pitfalls and loopholes in the finer details of the wish; but what is its approach to the hard technical problem of AGI? The answer is, first of all, heuristics and incremental improvement. Projects like Lenat's Cyc are on the right track. A newborn AI has to be seeded with useful knowledge, including useful knowledge of problem-solving methods. It doesn't have time to discover such things entirely unaided. We should not imagine AGI developing just from a simple architecture, like Schmidhuber's Gödel machine, but from a basic architecture plus a large helping of facts and heuristics which are meant to give it a head start.

So fine, the practical approach to AGI isn't a search for a single killer concept, it's a matter of incrementally increasing the power of a general-purpose problem solver with many diverse ingredients in its design, so that it becomes more and more capable and independent. Ben Goertzel's approach to AGI exhibits the sort of eclectic pluralism that I have in mind. Still, we do need a selling point, something which shows that we're different, that we're aiming for the stars and we have a plan to get there.

Here, I want to use Steve Omohundro's paper "The Basic AI Drives" in a slightly unusual way. The paper lists a number of behaviors that should be exhibited by a sufficiently sophisticated AI: it will try to model its own operation, clarify its goals, protect them from modification, protect itself from destruction, acquire resources and use them efficiently... The twist I propose is that Omohundro's list of drives should be used as a design specification. If your goal is AGI, then you want a cognitive architecture that will exhibit these emergent behaviors. They offer a series of milestones for your theorists and developers: a criterion of progress, and a set of intermediate goals sufficient to bridge the gap between a blank-slate beginning and an open-ended problem solver.

That's the whole plan. It's an anticlimax, I know, for anyone who might have imagined that there was a magic formula for superintelligence coming at the end of this post. But I do claim that what I have described is the skeleton of a plan which can be fleshed out, and which, if it were fleshed out and pursued, would produce goal-directed AGI. Whether the project as I have described it would really produce "friendly" AI is another matter. Anyone versed in the folk wisdom about FAI should be able to point out multiple points of potential failure. But I hope this makes it a little clearer, to people who just don't see how FAI is supposed to happen at all, how it might be pursued in the real world.

13 comments

Not to be too harsh, but this seems like several steps backwards in designing a Friendly AI. To extend your metaphor, it's as if NASA has started calculating the potential trajectory of a rocket and you're running in shouting "Guys, guys, all this math is going to make the project way too confusing! We can see the moon - let's just aim our rocket at that!".

I think part of the problem is determining the goal of your post. Are you trying to come up with a simple, non-weird sounding explanation of FAI that LWers can explain to people without sounding low-status? If so, good job. This sounds reassuring and intuitive, and I'm fairly confident I could explain it to anyone without inferential-gap problems. Or are you trying to create a blueprint for how FAI "might be pursued in the real world"? If this is your goal, I'm concerned.

Having managers, bureaucrats, and lawyers create a set of laws for an AI to follow just won't work. Human deliberation and consensus can't solve simple, human-created problems - why on earth do we think it'll solve AI problems? Constraining an AI with today's value system stifles all moral progress - possibly forever.

Then you would get your programmers and your cognitive scientists to implement that goal condition in a way such that the symbols have the meanings that they are supposed to have.

That's the whole problem. We don't really know what we mean by "right", and we don't even know what "happiness" and "freedom" and similar concepts mean to us. There's no reason to believe doing this would be any easier than CEV, and some evidence to suggest it's impossible.

Good post.

Let me describe to you how it felt to read one line of the post. While reading it, I felt like I was a prominent supporting character in a war movie, one who has to die so that suspension of disbelief is not broken, because otherwise the good guys would mow down hundreds of enemies with their small, rag-tag band not taking any casualties and that would just be silly.

It's a rare thing to have a good guy die, and since he has to be dead and out of the plot, the screenwriter might as well squeeze all the drama he or she can out of that moment. Have him survive a few wounds - not that the first few would be enough to kill a good guy anyway. The dying guy (that's me here) should shudder and convulse dramatically from the force of each hit.

You would get a team of professional ethicists,

BANG (lessdazed is hit by sniper fire.)

lessdazed: "Sir, I...I've been hit...I can't feel my legs..."

Commander: "MEDIC! MEDIC! Don't worry son, just hang in there we'll get you home."

some worldly people like managers,

BOOM (lessdazed is hit by a mortar.)

lessdazed: "...my left side...it's gone...I'll never make it...go on without me..."

Commander: "Stay with me dammit! I told your mother I'd get you home and by gawd I will!"

some legal experts

KABLOOEY (lessdazed has been obliterated by an artillery shell.)

lessdazed: "..." (much of lessdazed's former mass dies in the arms of the commander)

Commander: "Nooooooooooooo! I'll get you, you bastards!" (Commander goes on rampage, does not even stop to reload until 20 minutes and hundreds of dead bad guys later.)

In short, I am not sure that the team you described (but did not endorse, I note, I'm not saying you did) is the best to produce something that behaves ethically.

I have seen too many discussions of Friendly AI, here and elsewhere (e.g. in comments at Michael Anissimov's blog), detached from any concrete idea of how to do it....

At present, it is discussed in conjunction with a whole cornucopia of science fiction notions such as: immortality, conquering the galaxy, omnipresent wish-fulfilling super-AIs, good and bad Jupiter-brains, mind uploads in heaven and hell, and so on. Similarly, we have all these thought-experiments: guessing games with omniscient aliens, decision problems in a branching multiverse, "torture versus dust specks". Whatever the ultimate relevance of such ideas, it is clearly possible to divorce the notion of Friendly AI from all of them....

SIAI, in discussing the quest for the right goal system, emphasizes the difficulties of this process and the unreliability of human judgment. Their idea of a solution is to use artificial intelligence to neuroscientifically deduce the actual algorithmic structure of human decision-making, and to then employ a presently nonexistent branch of decision theory to construct a goal system embodying ideals implicit in the unknown human cognitive algorithms.

In short, there is a dangerous and almost universal tendency to think about FAI (and AGI generally) primarily in far mode. Yes!

However, I'm less enamored with the rest of your post. The reason is that building AGI is simply an altogether higher-risk activity than traveling to the moon. Using "build a chemical powered rocket" as your starting point for getting to the moon is reasonable in part because the worst that could plausibly happen is that the rocket will blow up and kill a lot of volunteers who knew what they were getting into. In the case of FAI, Eliezer Yudkowsky has taken great pains to show that the slightest, subtlest mistake, one which could easily pass through any number of rounds of committee decision making, coding, and code checking, could lead to existence failure for humanity. He has also taken pains to show that approaches to the problem which entire committees have in the past thought were a really good idea, would also lead to such a disaster. As far as I can tell, the LessWrong consensus agrees with him on the level of risk here, at least implicitly.

There is another approach. My own research pertains to automated theorem proving, and its biggest application, software verification. We would still need to produce a formal account of the invariants we'd want the AGI to preserve, i.e., a formal account of what it means to respect human values. When I say "formal", I mean it: a set of sentences in a suitable formal symbolic logic, carefully chosen to suit the task at hand. Then we would produce a mathematical proof that our code preserves the invariants, or, more likely, we would use techniques for producing the code and the proof at the same time. So we'd more or less have a mathematical proof that the AI is Friendly. I don't know how the SIAI is trying to think about the problem now, exactly, but I don't think Eliezer would be satisfied by anything less certain than this sort of approach.

Not that this outline, at this point, is satisfactory. The formalization of human value is a massive problem, and arguably where most of the trouble lies anyway. I don't think anyone's ever solved anything even close to this. But I'd argue that this outline does clarify matters a bit, because we have a better idea what a solution to this problem would look like. And it makes it clear how dangerous the loose approach recommended here is: virtually all software has bugs, and a non-verified recursively self-improving AI could magnify a bug in its value system until it no better approximates human values than does paperclip-maximizing. Moreover, the formal proof doesn't do anyone a bit of good if the invariants were not designed correctly.
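As a rough illustration of what "the code preserves the invariants" means, here is a minimal sketch on a deliberately tiny system. It is not a real verification workflow: exhaustive enumeration over a finite toy state space stands in for the proof, where actual software verification would use a proof assistant or a model checker. The state space, the LIMIT constant, and the functions invariant, step, buggy_step, and find_violation are all hypothetical names invented for this example.

```python
# Hypothetical toy system: the state is a single "capability level".
STATES = range(0, 11)
ACTIONS = ("grow", "shrink", "hold")
LIMIT = 8  # assumed bound that the invariant says must never be exceeded


def invariant(state: int) -> bool:
    """The property every reachable state is supposed to satisfy."""
    return 0 <= state <= LIMIT


def step(state: int, action: str) -> int:
    """Transition function that clamps the state to the allowed range."""
    if action == "grow":
        return min(state + 1, LIMIT)
    if action == "shrink":
        return max(state - 1, 0)
    return state


def buggy_step(state: int, action: str) -> int:
    """Same transition function, but 'grow' forgets to apply the cap."""
    if action == "grow":
        return state + 1
    if action == "shrink":
        return max(state - 1, 0)
    return state


def find_violation(step_fn):
    """Check every (state, action) pair: if the invariant holds before the
    step, it must hold after. Return the first counterexample, or None."""
    for state in STATES:
        if not invariant(state):
            continue
        for action in ACTIONS:
            if not invariant(step_fn(state, action)):
                return (state, action)
    return None


print(find_violation(step))        # None
print(find_violation(buggy_step))  # (8, 'grow')
```

Two caveats carry over from the comment above. The check says nothing about whether the invariant itself was chosen well, and nothing in this toy touches the hard part, which is writing down an invariant that captures human values at all.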

Whatever the ultimate relevance of such ideas, it is clearly possible to divorce the notion of Friendly AI from all of them.

Take for example Pascal's mugging, if you can't solve it then you need to implement a hack that is largely based on human intuition. Therefore in order to estimate the possibility of solving friendly AI, or to distinguish it from the implementation of fail-safe mechanisms, one needs to account for the difficulty in solving all those sub-problems you mentioned.

As multifoliaterose wrote, we don't even know "how one would start to research the problem of getting a hypothetical AGI to recognize humans as distinguished beings."

...it is simply an exercise in creating artificial intelligence that does the right thing.

Yes, but solving metaethics, to figure out what we mean when we use the word "right", already seems to be ridiculously difficult.

What you need to show is that there is a possibility to solve friendly AI before someone stumbles upon AGI. A possibility that would outweigh the (in my opinion) vastly less effective but easier possibility of creating fail-safe mechanisms that might prevent a full-scale extinction scenario or help us to employ an AGI to solve friendly AI.

Take for example Pascal's mugging, if you can't solve it then you need to implement a hack that is largely based on human intuition.

What's the problem? Someone tells you about huge utility - they are probably trying to manipulate you. Tell them to show you the utility. That does not seem to be much of a hack.

Simplicity has its value, but one shouldn't pursue plans any simpler than the simplest plan that might actually work. We don't know whether it's even in principle possible for us to knowingly construct an explicit goal system that matches our values and doesn't offer any room for a super-human AI to cleverly "optimize" in unexpected ways without referencing our values themselves. But I would be extremely skeptical of any attempt inspired by contract law. In your analogy this plan sounds more like "balloon that goes really high" to me. Perhaps the Singinst is pursuing Verneian cannons and no one has thought of rockets yet.

While fully cognizant of the fact that value is complex, as eloquently attested by Eliezer in many speeches, the practical FAI project would nonetheless choose the AI's goal system in the old-fashioned way, by human deliberation and consensus. You would get a team of professional ethicists, some worldly people like managers, some legal experts in the formulation of contracts, and together you would hammer out a mission statement for the AI. Then you would get your programmers and your cognitive scientists to implement that goal condition in a way such that the symbols have the meanings that they are supposed to have. End of story.

We believe that this won't result in a FAI.

Instead, it only discusses hopelessly impractical models of cognition like AIXI and exact Bayesian inference, that are mostly of theoretical interest.

That sounds backwards. We study these models because it's easier. If we find friendliness very hard to capture even after making strong simplifying assumptions, that means a team of lawyers trying to solve the messier real-life problem is guaranteed to fail.

the practical FAI project would nonetheless choose the AI's goal system in the old-fashioned way, by human deliberation and consensus

[...]

Our practical FAI project has "solved" FAI by simply coming to an agreement on what to wish for, and by studying with legalistic care how to avoid pitfalls and loopholes in the finer details of the wish

If we are making an AGI, then humans think too slowly, in comparison, to be able to completely consider every single possible aspect of a "wish", so I don't think legalistic is strong enough, given the large negative utility of a mistake. A mathematical proof of Friendliness should be required, and that is what the formalisations of "hopelessly impractical models of cognition" (e.g. TDT) are a step towards.

If your goal is AGI, then you want a cognitive architecture that will exhibit these behaviors.

FTFY. If you are designing something to have behaviour x then you want behaviour x to definitely occur, possibly being built out of other behaviours, but not just "emerging" out of other behaviours.

If you are designing something to have behaviour x then you want behaviour x to definitely occur, possibly being built out of other behaviours, but not just "emerging" out of other behaviours.

I think the problem with the proposal is the opposite of what I think you think it is.

Omohundro's universal AI instrumental values are things that, if absent in the final product, mean that you have failed. Their presence means little because one could simply design for them.

It's not that we want these behaviors to occur; if we don't know how they do, then "emerging" or "arising in a way I do not understand" are fine phrases to use. If you don't understand how they arise from the sub-units that you've carefully built, you're probably, but not certainly, in a lot of trouble. If you try too hard to design the unit to do these behaviors directly, you're hacking together a solution and are almost certainly failing: certain failure, less a "so you're saying there's a chance" margin.

If you don't understand how they arise from the sub-units that you've carefully built, you're probably, but not certainly, in a lot of trouble.

That's what I was trying to say, thanks :)

Upvoted for having at least some acquaintance with reality. To use the analogy in the post, I would put this into the category of "we should use chemical fuel, maybe in the form of a very large cannon," by contrast with proposals on the level of psychic levitation or breeding kangaroos for the ability to jump really high.

I want to use Steve Omohundro's paper "The Basic AI Drives" in a slightly unusual way. The paper lists a number of behaviors that should be exhibited by a sufficiently sophisticated AI: it will try to model its own operation, clarify its goals, protect them from modification, protect itself from destruction, acquire resources and use them efficiently... The twist I propose is that Omohundro's list of drives should be used as a design specification. If your goal is AGI, then you want a cognitive architecture that will exhibit these emergent behaviors.

IMHO, that doesn't help too much. We mostly know what we want - what we don't know is how to get there.

Incidentally, what most people don't want is just a bunch of universal instrumental values.