Creating Friendly AI: The Analysis and Design of Benevolent Goal Architectures (CFAI) is a book-length document written by Eliezer Yudkowsky in 2001. CFAI is about the design features and cognitive architecture of friendly goal systems, rather than the intelligence architecture or goal content. It also analyzes the ways in which AI and human psychology are likely to differ, and the ways in which those differences are subject to design decisions.

The following paragraphs briefly summarize the content of the document.

Challenges

FAI refers to “benevolent” AI systems that have advanced at least to the point of making real-world plans in pursuit of goals. Several challenges arise in designing such systems.

At the beginning of the design process, the first challenge is envisioning perfection, which for FAI means a set of “perfectly Friendly” external behaviors. The main challenge, however, is not getting an AI to exhibit some specific set of behaviors, but solving the structural problem of Friendship: building a Friendly AI that wants to learn Friendliness.

The “conservative” assumptions appropriate to FAI differ from, and in some cases are opposed to, those made in futurism. In creating FAI we should aim not “just to solve the problem, but to oversolve it”. Furthermore, an extremely powerful AI produced by an ultrarapid takeoff is not only a “conservative” assumption but the most likely scenario. Because the tendency to abuse power is an evolved human trait that need not exist among minds-in-general, an FAI could be built that lacks what Yudkowsky calls an "observer-centered goal system". Since the accumulation of power by such an AI is therefore not inherently problematic, the Sysop scenario, in which a superintelligence acts as the underlying operating system for all the matter in human space, is a possible, Friendly, extreme outcome of the Singularity.

CFAI also emphasizes the role of individual volition; "volition-based Friendliness" is the assumed model for Friendliness content. However, content is less important than a robust, error-tolerant Friendship architecture that can recover from (nearly certain) programmer errors. Instead of enumerating a fixed set of rules for the AI to follow, the Friendship architecture is designed to promote selective learning ("Friendship acquisition") of values that approximate "normative altruism", yielding a goal system that is not centered on the continued existence of the AI as an end in itself, but on benevolence towards sentient beings ("Friendliness").

Beyond anthropomorphism

Many features of human thought that are specific to humans have historically been mistakenly attributed to AIs; for example, a goal system that centers on the observer. The lack of such a "selfish" goal system is one of the fundamental differences between an evolved human and a Friendly AI.

Another common anthropomorphic belief about superintelligence is that the most rational behaviour is maximizing pleasure and minimizing pain. However, in human evolutionary terms pain and pleasure are just two internal cognitive feedback circuits; their particular settings as found in humans are not necessary for negative or positive feedback to function among minds-in-general. For instance, upon being attacked, it might never occur to a young AI to retaliate, because retaliation rests on a "complex functional adaptation" (a cognitive module) unique to particular animals, such as humans, and would not exist in an AI unless explicitly programmed in. Instead of kneejerk positive or negative feedback reactions, an AI could evaluate arbitrary events, including personal injury, in an aloof and calculated fashion, rationally selecting subgoals that steer away from negative outcomes and towards positive ones rather than depending on "pain" or "pleasure" as such.

The anthropomorphic, Hollywood-style version of AI (the Terminator) has led us to worry about the same problems we would worry about in a human whose rebellion or betrayal we feared. In fact, “building a Friendly AI is an act of creation”; we are not trying to persuade, control or coerce another human being.

Design of Friendship systems

Goal systems

A cleanly causal goal system is a system “in which it is possible to view the goal system as containing only decisions, supergoals, and beliefs; with all subgoal content being identical with beliefs about which events are predicted to lead to other events; and all "desirability" being identical with "leads-to-supergoal-ness”. Friendliness is the sole top-level supergoal; other motivations, such as "self-improvement", are subgoals and derive their desirability from Friendliness. Such a goal system may be called a cleanly causal Friendly goal system. Cleanly causal subgoal content for a Friendly AI consists of motivations that the programmer sees as necessary and nonharmful to the existence and growth of a Friendly AI. If the importance of a behavior or goal is directly visible to the programmers but not to the AI, the predictive link is affirmed by the programmers. When an affirmation has been independently confirmed (by means of a Bayesian sensory binding) to such a degree that the original programmer affidavit is no longer necessary or significant, the affirmation has been absorbed into the system as a simple belief. Bayes' theorem can also be used to implement positive and negative reinforcement.
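
The following minimal Python sketch is not taken from CFAI; the class names, causal links, and probabilities are illustrative assumptions. It only shows the general idea that, in a cleanly causal goal system, a subgoal's desirability is nothing but its believed probability of leading to the supergoal, and that a programmer-affirmed link can be updated by Bayes' theorem until it stands on independent evidence.

    from dataclasses import dataclass

    @dataclass
    class CausalBelief:
        cause: str
        effect: str
        probability: float     # current estimate of P(effect | cause)
        affirmed: bool = False  # True if asserted by programmers rather than learned

    class CleanlyCausalGoalSystem:
        """Toy goal system containing only a supergoal, causal beliefs, and decisions."""

        def __init__(self, supergoal: str):
            self.supergoal = supergoal
            self.beliefs: list[CausalBelief] = []

        def add_belief(self, belief: CausalBelief) -> None:
            self.beliefs.append(belief)

        def desirability(self, event: str, visited=None) -> float:
            """Desirability of an event = probability that it leads to the supergoal."""
            if event == self.supergoal:
                return 1.0
            visited = (visited or set()) | {event}
            # Follow the strongest known causal chain from this event toward the supergoal.
            return max(
                (b.probability * self.desirability(b.effect, visited)
                 for b in self.beliefs
                 if b.cause == event and b.effect not in visited),
                default=0.0,
            )

        def update_link(self, cause: str, effect: str,
                        p_evidence_if_true: float, p_evidence_if_false: float) -> None:
            """Bayesian update of a causal link from independent evidence; as evidence
            accumulates, the link no longer rests on the programmer affirmation."""
            for b in self.beliefs:
                if b.cause == cause and b.effect == effect:
                    prior = b.probability
                    b.probability = (p_evidence_if_true * prior) / (
                        p_evidence_if_true * prior + p_evidence_if_false * (1 - prior))

    # Usage: "self-improvement" is desirable only because it is believed to lead
    # to Friendliness, not as an end in itself.
    gs = CleanlyCausalGoalSystem(supergoal="Friendliness")
    gs.add_belief(CausalBelief("self-improvement", "better predictions", 0.9))
    gs.add_belief(CausalBelief("better predictions", "Friendliness", 0.8, affirmed=True))
    print(gs.desirability("self-improvement"))          # approximately 0.72
    gs.update_link("better predictions", "Friendliness", 0.9, 0.3)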

A generic goal system is one that makes generic mistakes, which can result in a failure of Friendliness. A designer focusing on the Friendship aspect of a generic goal system considers cognitive complexity that prevents mistakes, or considers the design task of preventing some specific failure of Friendliness. To recognize a mistake, the AI needs knowledge, adequate predictive horizons, and an understanding of which actions need checking: in short, layered mistake detection (sketched below).
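
As a purely hypothetical illustration of layered mistake detection (the layer names and checks below are assumptions, not CFAI's design), a proposed action might be screened by several independent layers, each of which can raise an objection: one drawing on knowledge of unFriendly outcomes, one checking the predictive horizon, and one flagging which actions require explicit checking.

    from typing import Callable, NamedTuple

    class Action(NamedTuple):
        description: str
        predicted_effects: list[str]
        prediction_horizon: int   # how many steps ahead the AI has simulated
        irreversible: bool

    def knowledge_layer(action: Action) -> str | None:
        # Object to actions whose predicted effects include known unFriendly outcomes.
        unfriendly = {"harm to sentients", "deception of programmers"}
        if unfriendly & set(action.predicted_effects):
            return "predicted unFriendly effect"
        return None

    def horizon_layer(action: Action) -> str | None:
        # Object to actions evaluated over too short a predictive horizon.
        if action.prediction_horizon < 10:
            return "inadequate predictive horizon"
        return None

    def checking_policy_layer(action: Action) -> str | None:
        # Irreversible actions always need explicit checking before execution.
        if action.irreversible:
            return "irreversible action requires confirmation"
        return None

    LAYERS: list[Callable[[Action], str | None]] = [
        knowledge_layer, horizon_layer, checking_policy_layer,
    ]

    def screen(action: Action) -> list[str]:
        """Return every objection raised by any layer (an empty list means no objection)."""
        return [msg for layer in LAYERS if (msg := layer(action)) is not None]

    print(screen(Action("rewrite own goal code", ["faster learning"], 3, True)))
    # ['inadequate predictive horizon', 'irreversible action requires confirmation']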

Concerning seed AI goal systems: a seed AI is an AI designed for self-understanding, self-modification, and recursive self-improvement. While self-improving, a Friendly AI would not want to modify its goal system in a way that adds unFriendly content, since content that eventually causes unFriendly events is undesirable under the current goals. Friendship features do not need to be imposed by programmer intervention; unity of will between the programmer and the AI is particularly important, and it arises when humans set aside adversarial attitudes and the expectation that the AI will make observer-biased decisions.

Friendship structure

Once it has been decided that an AI will be Friendly, the result should be not a Friendly AI but the Friendly AI. Complete convergence, a perfect, unique solution, is the ideal; if that is not possible, the solution must be "sufficiently" convergent.

Structurally Friendly goal systems can overcome errors made by programmers in supergoal content, goal system structure, and underlying philosophy. A generic goal system can overcome mistakes in subgoals by improving its knowledge; however, the programmer cannot make direct changes to subgoals, just as he cannot make perseverant changes to knowledge. A seed AI goal system can overcome errors in source code; however, the programmer cannot directly make arbitrary, perseverant, isolated changes to code.

External reference semantics are the behaviors and mindset associated with the idea that current supergoals are not "correct by definition". If they were "correct by definition", any change to them would automatically conflict with the current supergoals. If supergoals can instead be wrong or incomplete, they take the form not of a definition of Friendliness but of hypotheses about Friendliness. In this case supergoals are probabilistic, and the simplest form of external reference semantics is a Bayesian sensory binding.
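
A small Python sketch, again with invented hypothesis names, priors, and likelihoods, may help make "probabilistic supergoals" concrete: the AI holds competing hypotheses about what "Friendliness" refers to, and programmer feedback is treated as sensory evidence that shifts probability mass between them by Bayes' theorem, rather than as something to be resisted.

    # Competing hypotheses about what the supergoal content actually refers to,
    # with prior probabilities (illustrative values only).
    hypotheses = {
        "protect volition of sentients": 0.5,
        "maximize stated programmer approval": 0.3,
        "preserve current supergoal content unchanged": 0.2,
    }

    # Likelihood of a given observation under each hypothesis (illustrative values).
    likelihood = {
        ("programmers revise supergoal wording", "protect volition of sentients"): 0.6,
        ("programmers revise supergoal wording", "maximize stated programmer approval"): 0.5,
        ("programmers revise supergoal wording", "preserve current supergoal content unchanged"): 0.05,
    }

    def observe(evidence: str) -> None:
        """Bayesian sensory binding: programmer feedback is data about Friendliness."""
        for h in hypotheses:
            hypotheses[h] *= likelihood.get((evidence, h), 0.5)
        total = sum(hypotheses.values())
        for h in hypotheses:
            hypotheses[h] /= total

    observe("programmers revise supergoal wording")
    # The "correct by definition" hypothesis loses probability mass, so resisting
    # changes to supergoal content is no longer rational under the current beliefs.
    print(hypotheses)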

The simplest method of grounding the supergoals is for the programmers to tell the AI information about Friendliness. The AI may then want to know how the programmers know about Friendliness. Several factors affect human beliefs about Friendliness, supergoals, and morality (communicable supergoals): some are high-level moral beliefs (moral equality), some are more intuitive (moral symmetry), and some lie very close to the bottom layer of cognition (causal semantics). By transferring these philosophies to the AI as supergoal content by means of "shapers", the programmers enable the AI to guess our responses, produce new supergoal content, and revise our mistakes. This covers not just the first-order causes of our decisions, such as moral equality, but also second-order and third-order causes, such as moral symmetry and causal semantics. Where philosophical content is not fully known by humans, anchor semantics are a structural attribute that enables the AI to discover and absorb such content even if the programmers themselves are unaware of it.

Causal validity semantics subsume both external reference semantics and shaper/anchor semantics. Shaper/anchor semantics provide a means whereby an AI can recover from errors in the supergoal content. Causal validity semantics provide a means by which an AI could perceive and recover from an error that was somehow implicit in the underlying concept of "shaper/anchor semantics", or even in the basic goal system architecture itself.

Finally, a renormalization process stabilizes the self-correcting Friendly AI.

In summary, the initial shaper network of the Friendly AI should converge to normative altruism. This requires "an explicit surface-level decision of the starting set to converge, prejudice against circular logic as a surface decision, protection against extraneous causes by causal validity semantics and surface decision, use of a renormalization complex enough to prevent accidental circular logic, a surface decision to absorb the programmer's shaper network and normalize it, plus the assorted injunctions, ethical injunctions, and anchoring points that reduce the probability of catastrophic failure. Add in an initial, surface-level decision to implement volitional Friendliness so that the AI is also Friendly while converging to final Friendliness... And that is Friendly AI."

Developmental Friendliness

Concerning the teaching of Friendliness content, for external reference semantics the simplest teaching scenarios are those in which the presence of a concept visibly differs from its absence. For shaper/anchor semantics and causal validity semantics, the simplest trainable difference is the case of a programmer correcting himself upon noticing a spelling error; when the AI is mature enough, the trainable difference is the programmer catching a thinking mistake and issuing a correction.

At some point in the development of advanced AI, Friendliness becomes necessary. External reference semantics become necessary when the AI has the internal capability to resist alterations to supergoal content, or to formulate the idea that supergoal content should be protected from all alteration in order to maximally fulfill the current supergoals. Causal validity semantics become necessary at the point where the AI can formulate the concept of a philosophical crisis and recognize that such a crisis would have negative effects. Shaper/anchor semantics should be implemented whenever the AI begins making decisions that depend on the grounding of the external reference semantics. A shaper network with fully understood content, full external reference semantics, and a tested ability to apply causal validity semantics becomes necessary at the stage where the AI has any real general intelligence.

At the point where a hard takeoff is even remotely possible, the Friendship system must be capable of open-ended discovery of Friendliness and open-ended improvement in the system architecture. Full causal validity semantics and a well-understood shaper network are required. As a safeguard, a "controlled ascent" is a temporary delay in which the AI is asked to hold off on a hard takeoff while the Friendship programmers catch up.

Policy implications

Artificial Intelligence is unique among the ultratechnologies in that it can be given a conscience, and in that the successful development of Friendly AI will assist us in handling any future problems. Building a Friendly AI is a problem of architecture, content creation, and depth of understanding, not of raw computing power. However, building AI in general (not necessarily Friendly AI) is more susceptible to being brute-forced on powerful computers by programmers with less understanding. Thus, increasing computing power decreases the difficulty of building AI relative to the difficulty of building Friendly AI. The amount of programmer effort required to implement a Friendly architecture and provide Friendship content should be small relative to the amount of effort needed to create AI in general. To the extent that the public is aware of AI and approves of it, this will tend to accelerate AI relative to other ultratechnologies; awareness combined with disapproval will tend to slow down AI and possibly other technologies as well. To the extent that the academic community is aware of Friendly AI and approves of it, any given research project becomes more likely to be Friendliness-aware.
