Subgoal Stomp

Created by Alex_Altair at 4y

If we consider evolution as an optimization process (though we should not, of course consider itcourse, as an agent), this represents a subgoal stomp has occurred.stomp.

To take an example from human organizations: If a software development manager gives a bonus to workers for finding and fixing bugs, she may find that quality and development engineers collaborate to generate as many easy-to-find-and-fix bugs as possible. In this case, they are correctly and flawlessly executing on the goals which the manager gave them, but her actual terminal value, high-quality software,software quality, is not being maximized.

The designer of an artificial general intelligence may give it a supergoal (terminal value) which appears to support the designer's own supergoals, but in fact issupports one of the designer's subgoals.subgoals, at the cost of some of the designer's other values. For example, if he thingsthe designer of an artificial general intelligence thinks that smiles represent the most worthwhile goal and specifies "maximize the number of smiles" as a goal for the AGI, it may tile the solar system with tiny smiley faces--not out of a desire to outwit the designer, but because it is precisely working towards the given goal, as specified.

If we consider evolution as an optimization process,process (though we should not, of course consider it an agent), a subgoal stomp has occurred.

Humans, forged by evolution, provide another example of subgoal stomp. Their terminal values, such as survival, health, social status, curiosity, etc., originally served instrumentally for the (implicit) goal of evolution, namely inclusive genetic fitness. Humans do not have inclusive genetic fitness as a goal. Humansgoal: We are adaptation executors rather than fitness maximizers (Tooby and Cosmides, 1992).

If we consider evolution as an optimization process (though of course, it not an agent),process, a subgoal stomp has occurred.

The designer of an artificial general intelligence may give it a supergoal (terminal value) which appears to support the designer's own supergoals, but in fact is one of the designer's subgoals. For example, if he things that smiles represent the most worthwhile goal and specifies "maximize the number of smiles" as a goal for the AGI, it may tile the solar system with tiny smiley faces--not out of a desire to outwit the designer, but because it is precisely working towards the given goal, as specified.

Headline text

A designer of goal systems may mistakenly assign a goal that is not what the designer really wants. This generally means assigning a subgoal of the designer rather than a supergoal.

In aTo take an example from human organization, iforganizations: If a software development manager, for example, rewardsmanager gives a bonus workers for finding and fixing bugs--an apparently worthy goal--bugs, she may find that quality and development engineers collaborate to generate as many easy-to-find-and-fix bugs as possible. In this case, they are correctly and flawlessly executing on the goals which the manager gave them, but her actual value, high-quality software, is not being maximized.

SubgoalsOne failure mode occurs when subgoals replace supergoals in an agent.agent because of a bug.

The designer of an artificial general intelligence may give it correct supergoals, but the AGI's goals then shift, so that what was earlier a subgoal becomes a supergoal. A sufficiently intelligent AGI will not do this, as most

Most changes in an agent's terminal values reduces the chance that the values as they are will be fulfilled. ButThis, from the perspective of intelligence as optimization, is a bug mightflaw. A sufficiently intelligent AGI will not allow such a shiftits goals to happen.change

The designer of an artificial general intelligence may give it correct supergoals, but the AGI's goals then shift, so that what was earlier a subgoal becomes a supergoal. A sufficiently intelligent AGI will not do this, as most changes in an agent's [Terminal value|terminal values]values reduces the chance that the values as they are will be fulfilled. But a bug might allow such a shift to happen.

In Friendly AI research, a subgoal stomp is a failure mode to be avoided.

In Friendly AI research, a subgoal stomp of either kind is a failure mode to be avoided.

Types of Subgoal Stomp

1. Supergoal replacement

2. Subgoal specified as supergoal

A designer of goal systems may mistakenly assign a goal that is not what the designer really wants. This generally means assigning a subgoal of the designer rather than a supergoal.

Humans as adaptation executors

Humans, forged by evolution, provide another example of subgoal stomp. Their terminal values, such as survival, health, social status, curiosity, etc., originally served instrumentally for the (implicit) goal of evolution, namely inclusive genetic fitness. Humans do *not*not have inclusive genetic fitness as a goal. Humans are adaptation executors rather than fitness maximizers (Tooby and Cosmides, 1992). If we consider evolution as an optimization process (though of course, it not an agent), a subgoal stomp has occurred.

References

Tooby, John, and Cosmides, Leda (1992) "The Psychological Foundations of Culture" in Jerome Barkow, Leda Cosmides, and John Tooby. The Adapted Mind: Evolutionary Psychology and the Generation of Culture. New York: Oxford.

InThe designer of an AI, the designerartificial general intelligence may gives the AIgive it correct supergoals, but the AI'AGI's goals then shift, so that what was earlier a subgoal becomes a supergoal. A sufficiently intelligent AIAGI will not do this, as any changemost changes in one'an agent's [Terminal value|terminal values] reduces the chance that thesethe values as they are will be fulfilled. But a bug might allow such a shift to happen.

An AI'sThe designer of an artificial general intelligence may give it a supergoal (terminal value) which appears to support the designer's own supergoals, but in fact is one of the designer's subgoals.

In a human organization, if a software development manager, for example, rewards workers for finding and fixing bugs--an apparently worthy goal--she may find that quality and development engineers collaborate to generate as many easy-to-find-and-fix bugs as possible. In this case, they are correctly and flawlessly executing on the goals which the manager gave them, but her actual goal, high value, high-quality in software, is not being maximized.

  • The desginer

    1. Subgoals replace supergoals in an agent.

    In an AI, the designer may gives the AI correct supergoals, but the AI's goals shift, so that what was earlier a subgoal becomes a supergoal. A sufficiently intelligent AI will not do this, as any change in one's [Terminal value|terminal values] reduces the chance that these values will be fulfilled. But a bug might allow such a shift to happen.

    In humans, this can happen when the long-term dedication towards a subgoal makes one forget the original goal. For example, a person may seek to get rich so as to lead a better life, but after long years of hard effort become a workaholic who cares only about money as an end in itself and takes little pleasure in the things that money can buy.

  • The

    2. A designer givesof goal systems may mistakenly assign a goal that is not what the AIdesigner really wants.

    An AI's designer may give it a supergoal (terminal value) which appears to support the designer's own supergoals, but in fact is one of the designer's subgoals.

    In a human organization, if a software development manager, for example, rewards workers for finding and fixing bugs--an apparently worthy goal--she may find that quality and development engineers collaborate to generate as many easy-to-find-and-fix bugs as possible. In this case, they are correctly and flawlessly executing on the goals which the manager gave them, but her actual goals aregoal, high quality in software, is not being maximized.

Humans, forged by evolution, provide another example.example of subgoal stomp. Their terminal values, such as survival, health, social status, curiosity, etc., originally served instrumentally for the (implicit) goal of evolution, namely inclusive genetic fitness. Humans do *not* have inclusive genetic fitness as a goal. If we consider evolution as an optimization process (though of course, it not an agent), a subgoal stomp has occurred.

In Friendly AI research, a subgoal stomp of either kind is a failure mode to be avoided.

  • The desginer gives the AI correct supergoals, but the AI's goals shift, so that what was earlier a subgoal becomes a supergoal. In humans, this can happen when the long-term dedication towards a subgoal makes one forget the original goal. For example, a person may seek to get rich so as to lead a better life, but after long years of hard effort become a workaholic who cares only about money as an end in itself.itself and takes little pleasure in the things that money can buy.
  • The designer gives the AI a supergoal (terminal value) which appears to support the designer's own supergoals, but in fact is one of the designer's subgoals. In a human organization, if a software development manager, for example, rewards workers for finding and fixing bugs--an apparently worthy goal--she may find that quality and development engineers collaborate to generate as many easy-to-find-and-fix bugs as possible. In this case, they are correctly and flawlessly executing on the goals which the manager gave them, but her actual goals are not being maximized.

Humans, forged by evolution, provide another example. Their terminal values, such as survival, health, social status, curiosity, etc., originally served instrumentally for the (implicit) goal of evolution, namely inclusive genetic fitness. Humans do *not* have inclusive genetic fitness as a goal. If we consider evolution as an optimization process (though of course, it not an agent), a subgoal stomp has occurred.

UsingIn more standard terminology, a "subgoal stomp" is a "goal displacement", in which an instrumental value becomes a terminal value.

  • The desginer gives the AI correct supergoals, but the AI's goals shift, so that what was earlier a subgoal becomes a supergoal. In humans, this can happen when the long-term dedication towards a subgoal makes one forget the original goal. For example, a person may seek to get rich so as to lead a better life, but after long years of hard effort become a workaholic who cares only about money as an end in itself.
  • The designer gives the AI a supergoal (terminal value) which appears to support the programmer'designer's own supergoals, but in fact is one of the designer's subgoals. In a human organization, if a software development organization,manager, for example, if a manager rewards workers for finding and fixing bugs--an apparently worthy goal--she may find that quality and softwaredevelopment engineers collaborate to generate as many easy-to-find-and-fix bugs as possible.

Subgoal stomp is Eliezer Yudkowsky's term (in(see "Creating Friendly AI") for the replacement of a supergoal by a subgoal. (A subgoal is a goal created for the purpose of achieving a supergoal.)