Literature Review on Goal-Directedness
Introduction: Questioning Goals

Goals play a central role in almost all thinking in AI existential risk research. Common scenarios assume misaligned goals, whether from a single AGI (the paperclip maximizer) or from multiple advanced AIs optimizing for things we don't want (Paul Christiano's What Failure Looks Like). Approaches to this issue include learning the right goals (value/preference learning), allowing a goal to be corrected on the fly (corrigibility), or even removing the incentives for forming goals (CAIS).

But what are goals, and what does it mean to pursue one? As far as we know, Rohin Shah's series of four posts was the first public and widely-read work questioning goals and their inevitability in AI alignment. These posts investigate the hypothesis that goals are necessary, and outline possible alternatives. Shah calls the property of following a goal "goal-directedness", but he doesn't define it:

> I think of this as a concern about long-term goal-directed behavior. Unfortunately, it’s not clear how to categorize behavior as goal-directed vs. not. Intuitively, any agent that searches over actions and chooses the one that best achieves some measure of “goodness” is goal-directed (though there are exceptions, such as the agent that selects actions that begin with the letter “A”). (ETA: I also think that agents that show goal-directed behavior because they are looking at some other agent are not goal-directed themselves -- see this comment.) However, this is not a necessary condition: many humans are goal-directed, but there is no goal baked into the brain that they are using to choose actions.

Later on, he explains that his “definition” of goal-directedness relies more on intuitions:

> Not all behavior can be thought of as goal-directed (primarily because I allowed the category to be defined by fuzzy intuitions rather than something more formal)

Clearly, fuzzy intuitions are not enough to decide whether or not to focus on less goal-directed alternatives.
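To make the quoted contrast concrete, here is a minimal sketch (not from Shah's posts; the action names and the `goodness=len` measure are made up for illustration) of the two selection rules he mentions: an agent that searches over actions for the one maximizing some measure of "goodness", and the letter-"A" agent whose selection rule looks like a search but encodes nothing about outcomes.

```python
from typing import Callable, List

def goal_directed_agent(actions: List[str], goodness: Callable[[str], float]) -> str:
    """Searches over actions and picks the one scoring highest on a goodness measure."""
    return max(actions, key=goodness)

def letter_a_agent(actions: List[str]) -> str:
    """Selects an action that begins with the letter 'A', regardless of consequences."""
    candidates = [a for a in actions if a.startswith("A")]
    return candidates[0] if candidates else actions[0]

if __name__ == "__main__":
    actions = ["Attack", "defend", "negotiate"]
    # Hypothetical goodness measure: longer action names score higher.
    print(goal_directed_agent(actions, goodness=len))  # -> "negotiate"
    print(letter_a_agent(actions))                     # -> "Attack"
```

Both agents run a selection procedure over actions, which is why the intuitive "searches and picks the best" criterion is neither a clean sufficient condition nor, as Shah notes for humans, a necessary one.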