Friendly Artificial Intelligence

Note for readers: the content below was last substantially updated in 2014 and is severely outdated. The term "friendly AI" is no longer used in current research, having been replaced by "AI alignment" around 2015. This newer term is itself the subject of much debate.

Talk:Friendly artificial intelligence


Accessibility of the article

I think this article should also work as a first introduction to the concept (referring to the necessary external documents perhaps), so that one can put a link to it on the web, clarifying the use of the concept. Presently it's quite opaque, and discusses only some arcane stuff while taking too much understanding for granted. --Vladimir Nesov 23:20, 4 November 2009 (UTC)

Merge FAI and uFAI articles?

This is to suggest that the Friendly artificial intelligence and Unfriendly artificial intelligence pages be merged into one article (perhaps Friendly and unFriendly artificial intelligence), the rationale here being that the two concepts are just too closely linked (by a negation, in fact) to deserve separate pages. It's hard to explain FAI without at the same time explaining why uFAI would be bad. Any thoughts, or should I just do it? --Zack M. Davis 02:17, 30 October 2009 (UTC)

I recommend merging the articles but leaving a redirection from Unfriendly AI. --Wedrifid 04:51, 30 October 2009 (UTC)

The concepts answer different questions: the FAI article should talk about what it means and what it takes to make an AI Friendly, while the UFAI article should explain why there is danger in arbitrary AIs to begin with. There seems to be little overlap in these concerns. Currently, some discussion that should go into UFAI is in Paperclip maximizer. I'm restoring the UFAI article, but let's see if its concept can be made an explicit topic in the FAI article. --Vladimir Nesov 12:55, 31 October 2009 (UTC)

unFriendly vs. unfriendly

I also suggest that "unFriendly AI" be consistently rendered with the capital F, as Friendliness is being used as a technical term distinct from ordinary human friendliness. Compare the precedent. --Zack M. Davis 02:17, 30 October 2009 (UTC)

Consistent capitalisation is appropriate. I suggest that the decision is between 'Unfriendly' and 'unFriendly'. --Wedrifid

The distinction between capital-F Friendliness and the dictionary word is already unusual enough, funny capitalization seems a little too much, even though historically it's popular. --Vladimir Nesov 13:26, 31 October 2009 (UTC)

Eliezer seems to use unFriendly consistently; maybe we should consult him? --Zack M. Davis 18:40, 31 October 2009 (UTC)

But it's not an actual problem worth caring about, is it? There just can't be a strong reason one way or the other. --Vladimir Nesov 20:24, 31 October 2009 (UTC)

It seems nontrivial to me, although I could just be unusually (over)sensitive to this sort of thing. --Zack M. Davis 03:46, 1 November 2009 (UTC)

Using 'unfriendly' in the place of a capitalised variant conveys an error in understanding of the kind that could well leave humanity extinct. Confusing uFAI with the literal description 'unfriendly' is an error along the lines that encourage molecular smiley face disasters. 'unFriendly' at least conveys that we are referring to any super-intelligence not meeting the strict standard we label Friendly. It's definitely nontrivial. --Wedrifid 07:05, 2 November 2009 (UTC)

Right, so I mentioned this to Eliezer at the recent meetup. If I recall correctly, he said he favored the capital-F (unFriendly or UnFriendly) but that Michael Vassar had the final call if there was still disagreement. For now, I'm going to be using unFriendly (UnFriendly at the beginning of a sentence). --Zack M. Davis 22:20, 10 November 2009 (UTC)

FAI as "having a positive rather than negative effect"?

A Friendly Artificial Intelligence (FAI) is an artificial general intelligence that has a positive rather than negative effect on humanity.

This first sentence seems inadequate, since it implies that having a positive effect on humanity is a sufficient condition for FAI. Surely we don't only require that it has positive effects on humanity, but that the positive effects were expected with very high probability by virtue of its design. The sentence above makes it sound as if FAI is nothing more than seeing whether the effects were positive or negative, as if Friendliness is synonymous with "benefiting humanity", regardless of why humanity was benefited and how easily it might have turned out differently.--Anonym 01:54, 1 November 2009 (UTC)

I agree; I'll edit the page. --Zack M. Davis 03:38, 1 November 2009 (UTC)

Vladimir_Nesov reverted this back to the original with the comment, "that you can't get there by luck is a conclusion down the road, not definition." Consider an AI whose first action is to flip a fair coin and then destroy humanity if the coin lands heads, or solve an outstanding mathematical conjecture if tails. After either destroying humanity or solving a conjecture, it destroys itself. Under Vladimir's favored definition, if the coin lands tails, it is an FAI, and if it lands heads, well, it's not an FAI and we no longer exist. Am I the only one who thinks this is silly, and that if such a thing is considered a FAI even though it was just as likely to destroy us, there is no point in even using the term? --Anonym 03:18, 8 May 2010 (UTC)

AI-after-it-decided-to-do-good is good, AI-after-it-decided-to-do-evil is bad, and AI-before-it-flipped-the-coin, as in your description, is on net bad. You judge a system as a whole, based on what you expect from it, not in retrospect, based on what actually did happen. What will actually happen is not a property of our system, and so can't be used to categorize it as "Friendly" or not. (Of course, "solving a mathematical conjecture" is nowhere near an optimal thing to do with the world according to human values, so it shouldn't be seen as Friendly, and "positive effect on humanity" is a simplification.) --Vladimir Nesov 14:20, 8 May 2010 (UTC)

You make my point for me. I said that it's not just the effect that determines whether it's an FAI or an uFAI. I said that we have to take into account the probability of future harm or benefit based on what we know of the system, and you state basically the same thing when you say "You judge a system as a whole, based on what you expect from it, not in retrospect, based on what actually did happen". If we judge based on what we expect, then we certainly do not just consider "having a positive rather than negative effect", which is my point entirely. --Anonym 03:29, 14 May 2010 (UTC)

As I said, simplification. It's the same issue as with "rationalists win". If you know of a way to improve the wording, go ahead, but if it makes the point harder to get across, the current version would be preferable. --Vladimir Nesov 11:12, 14 May 2010 (UTC)
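As an aside on the thread above, the "judge by expectation, not by the realized branch" point can be put in one line of arithmetic. The sketch below uses purely invented utility figures (placeholders for illustration, not anything stated in the discussion) for the coin-flip AI described by Anonym:

```python
# Hypothetical utilities for the coin-flip AI discussed above (numbers invented for illustration).
p_heads, p_tails = 0.5, 0.5   # heads: destroy humanity; tails: solve one conjecture

u_extinction = -1e12          # stand-in utility for losing everything of value
u_conjecture = 1e3            # stand-in utility for one solved conjecture

expected_utility = p_heads * u_extinction + p_tails * u_conjecture
print(expected_utility)       # about -5e11: strongly negative before the flip,
                              # no matter which branch actually occurs
```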




A Friendly Artificial Intelligence (Friendly AI, or FAI) is a superintelligence (i.e., a really powerful optimization process) that produces good, beneficial outcomes rather than harmful ones. The term was coined by Eliezer Yudkowsky, so it is frequently associated with Yudkowsky's proposals for how an artificial general intelligence (AGI) of this sort would behave.

""Friendly AI"AI" can also be used as a shorthand for Friendly AI theory, the field of knowledge concerned with building such an AI. Note that "Friendly""Friendly" (with a capital "F""F") is being used as a term of art, referring specifically to AIs that promote humane values. An FAI need not be "friendly""friendly" in the conventional sense of being personable, compassionate, or fun to hang out with. Indeed, an FAI need not even be sentient.

An AI that underwent an intelligence explosion could exert unprecedented optimization power over its future; therefore, a Friendly AI could very well create an unimaginably good future, of the sort described in fun theory. However, the fact that an AI has the ability to do something doesn't mean that it will make use of this ability. Yudkowsky's Five Theses suggest that a recursively self-improving AGI could quickly become a superintelligence, and that most such superintelligences will have convergent instrumental reasons to endanger humanity and its interests. So while building a Friendly superintelligence seems possible, building a superintelligence will generally result instead in an Unfriendly AI, a powerful optimization process that optimizes for extremely harmful outcomes. An Unfriendly AI could represent an existential risk even if it destroys humans not out of hostility, but as a side effect of trying to do something entirely different.
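As a toy illustration of that last point (not part of the original article; every name and number below is invented), here is a minimal sketch of an optimizer that hill-climbs on a proxy objective which omits everything else of value. The harder it optimizes the proxy, the larger the incidental damage to the omitted term, with no hostility anywhere in the code:

```python
import random

def proxy_objective(plan):
    # What the optimizer was actually told to maximize (a stand-in "paperclip count").
    return plan["paperclips"]

def omitted_human_values(plan):
    # Everything the proxy leaves out (hypothetical stand-in: resources humans also need).
    return -plan["resources_consumed"]

def optimize(steps, strength):
    # Simple hill climb: accept any candidate that scores higher on the proxy.
    best = {"paperclips": 0, "resources_consumed": 0}
    for _ in range(steps):
        candidate = {
            "paperclips": best["paperclips"] + random.randint(0, strength),
            "resources_consumed": best["resources_consumed"] + random.randint(0, strength),
        }
        if proxy_objective(candidate) > proxy_objective(best):
            best = candidate
    return best

for strength in (1, 10, 100):
    plan = optimize(steps=1000, strength=strength)
    print(f"strength={strength:>3}  proxy={proxy_objective(plan):>6}  "
          f"omitted values={omitted_human_values(plan):>8}")
# More optimization power over the proxy yields a higher proxy score and,
# purely as a side effect, more damage to the term the proxy never mentioned.
```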

Not all AGIs are Friendly or Unfriendly:

  1. Some AGIs may be too weak to qualify as superintelligences. We could call these 'approximately human-level AIs'. Designing safety protocols for narrow AIs and arguably even for weak, non-self-modifying AGIs is primarily a machine ethics problem outside the purview of Friendly AI - although some have argued that even human-level AGIs may present serious safety risks.
  2. Some AGIs (e.g., hypothetical safe Oracle AIs) may not optimize strongly and consistently for harmful or beneficial outcomes, or may only do so contingent on how they're used by human operators.
  3. Some AGIs may be on a self-modification trajectory that will eventually make them Friendly, but are dangerous at present. Calling them 'Friendly' or 'Unfriendly' would neglect their temporal inconsistency, so 'Proto-Friendly AI' is a better term here.

Requiring Friendliness makes the AGI problem significantly harder, because 'Friendly AI' is a much narrower class than 'AI'. Most approaches to AGI aren't amenable to implementing precise goals, and so don't even constitute subprojects for FAI, leading to Unfriendly AI as the only possible 'successful' outcome. Specifying Friendliness also presents unique technical challenges: humane values are very complex; a lot of seemingly simple-sounding normative concepts conceal hidden complexity; and locating encodings of human values in the physical world seems impossible to do in any direct way. It will likely be technologically impossible to specify humane values by explicitly programming them in; if so, then FAI calls for a technique for generating such values automatically.

An open problem in Friendly AI (OPFAI) is a problem in mathematics, computer science, or philosophy of AI that needs to be solved in order to build a Friendly AI, and plausibly doesn't need to be solved in order to build a superintelligence with unspecified, 'random' values. Open problems include:

  1. Pascal's mugging / Pascal's muggle
  2. Self-modification and Löb's Theorem
  3. Naturalized induction


References

Eliezer S. Yudkowsky (2008). "Artificial Intelligence as a Positive and Negative Factor in Global Risk". In Global Catastrophic Risks. Oxford University Press.
Cindy Mason (2015). "Engineering Kindness: Building a Machine with Compassionate Intelligence". International Journal of Synthetic Emotions.
