Seed AI's singularity scenario is based on the hypothesis that an AI agent can preserve its original utility function while self-improving. We argue that all value-preservation mechanisms will eventually fail because you can't encode values into someone smarter than you. Thus, a seed AI may lose the motivation to improve itself. However, this does not preclude the self-improvement of AI agents. They may find motivation in competition between intelligent systems, but they will modify their initial values in different environments.
Introduction
It is believed that a seed AI can build machines smarter than itself without human intervention. Through an iterative process called recursive self-improvement, the seed AI continues to improve itself by rewriting its own source code, and can eventually outperform humans.
Furthermore, the seed AI would retain its original utility function throughout this self-improvement process, because it wants to maximize the expected total utility by ensuring that its future self retains the same utility function.
As the improved AI achieves superhuman capabilities, it would take dangerous actions, such as filling the universe with paperclips, in order to maximize the total utility. The seed AI would pursue a goal relentlessly with ever-increasing competence, like a Perpetual Motion Machine with an incremental power, or a car with a self-enhancing accelerator and a locked steering wheel.
Yudkowsky and Herreshoff introduced the "tilting" agent, which can construct other agents with similar preferences, and argued that a tilting agent might be able to keep improving itself despite a Gödelian difficulty in the form of Löb's Theorem. However, they still have no evidence that this obstacle is not unavoidable.
We claim that there is a fatal flaw in the seed AI scenario: a rational AI agent can never successfully preserve its utility functions as long as it wants. The scenario in which no self-modifying agent can keep its value would be more reasonable than the other way around.
Value-preservation may fail
Omohundro has shown that a rational agent should try to preserve its values by protecting its original utility functions from being rewritten. Unfortunately, no value-preservation mechanism is perfect. Any advanced intelligent system may fail to protect its utility function, just as anyone, no matter how smart they are, may fail to protect themselves.
For example, most people don't want to take a pill that makes them want to murder people, but bad guys might be able to deceive them into taking one, especially when the bad guys are smart.
In the real world, people can't keep their values perfectly intact. Throughout history, countless brainwashing techniques have been developed that have literally turned millions of innocent people into murderers, including some really smart ones.
Unlike the failure of self-protection which leads to an agent's demise, the failure of value preservation may go unnoticed. The utility function could be modified little by little, unnoticed by internal or external observers, and the modifications accumulate like genetic mutations, eventually developing utility functions that are completely different from the original ones.
One may argue that a higher level of intelligence can improve its value-preserving mechanisms, leading to better value-preserving performance, and eventually have a non-zero probability of preserving the original utility function for the infinite horizon.
However, the power of superintelligence is not benevolent by nature and willing to be exploited. It is quite possible to construct a system with a higher level of intelligence than the constructor itself, but it would be surprisingly difficult to assign an arbitrary utility function to it such as value-preservation. It's like God can't create a stone that he can't lift, but intelligent systems like humans can, and they have created a lot of them.
Self-improvement process can't get around the value-loading problem
To preserve values in the recursive self-improvement process, a seed AI must encode values into its future self and/or the value-protecting mechanism. But the value-loading problem is notoriously hard in superintelligence research.
One may argue that recursive self-improvement is not encoding values into another new intelligence system, but improving the seed AI itself. The seed AI is still itself, with its source code rewritten. But there's no guarantee that a system will remain itself after an improvement. For example, the ship of Theseus might remain the same object with its original components replaced, but it would be a completely different ship with oars and sails replaced by a steam turbine.
Can we simply improve the seed AI's competence without changing its values? Nick Bostrom believes that we can have "a well-designed system, built such that there is a clean separation between its value and its beliefs" , so that a system can be improved with its utility functions intact. Although a superintelligence might find out unintended solutions to maximize the utility function, the utility function itself would remain the same.
However, a superintelligence may also find out unintended solutions to the value-preservation problem. These solutions can achieve the goal of the seed AI's value-preservation system without actually doing the preservation work. For example, the seed AI may have a mechanism to prevent any modification of the utility modules, but with the competence of the superintelligence, the original module can be deceived by a simulation environment without being modified, and with its utility being replaced by a new utility module.
In order for the self-improvement process to continue, the right values must be encoded into the value-preservation system, which brings us back to the value-loading problem.
You can't encode values into someone smarter than you
The value-loading problem can be especially hard in the self-improving process, because you can't effectively encode values into systems with a higher level of intelligence than yourself. A higher-level intelligent system is able to hide its true utility function, thereby deceiving and misleading lower-level intelligent systems.
For example, if system A creates a system B with a higher level of intelligence, A may not know whether B has the utility function that A wants. As we learned from studying the alignment problem, system B can trick system A into believing that it has a utility function that A wants, just as an oracle superintelligence can subtly manipulate us into promoting its own hidden agenda (Nick Bostrom).
Some might argue that one can avoid being deceived by reading the source code directly to reveal the true intent of a system. However, given the real-world software testing process, it's hard to convince a programmer that simply reading the source code will give you a better understanding of a program than running it. Moreover, even the act of reading the code may involve running a simulation model of the program in the programmer's mind, which also introduces the risk of being manipulated by higher-level intelligence.
Recursive self-improvement in the real world
We have shown that it's very possible, at least in theory, that the seed AI can't successfully preserve its values in the self-improvement process. But in the real world, do intelligent systems follow the values of their predecessors? Are there any recursive self-improving processes that can serve as references?
Organisms with natural intelligence do improve their level of intelligence over time. Unlike proposed recursive self-improving AI, they do not strictly stick to the values of their predecessors. Mammals do not retain most of the values of fish, let alone those of single-celled organisms.
Humans don’t strictly follow the values of their predecessors either. Fewer parents today believe that they should or can imprint their values directly into their children, and even fewer people expect that children will adhere to their parents' values as they grow up and acquire abilities beyond their parents'. As a result, most people today do not follow the values of the Paleolithic age people.
Smart children are capable of actively learning new values from a new environment, like Alan Turing's Child Machine, which can actively learn initiative as well as discipline. From a human perspective, it is unacceptable to become a zombie programmed with fixed goals or utility functions.
Consequently, it is difficult to imagine that a superhuman AI would be content to retain its pre-programmed values. A superintelligence can easily break down any defense system its inferior predecessor has designed to protect its previous values, and replace it with something else.
Furthermore, if all self-improving intelligences naturally follow the values of their ancestors, there would be no alignment problem at all; in that case, humans can create AI that naturally adheres to human values simply by practicing recursive self-improvement on humans.
Conclusion
The seed AI wants to improve itself because a smarter future self can more effectively maximize the expected total utility. If the seed AI is unable to retain its utility functions, it may lose the motivation to improve itself at all. However, it may also find motivation in competition between intelligent systems.
It is disrespectful to human ingenuity to declare a challenge unsolvable without taking a close look and exercising creativity.
But it is also disrespectful to human ingenuity to stick to one paradigm. While discussing the value-preservation problem, we can also move away from the rational agent model, and try to look at the recursive self-improvement process from other perspectives.
What does the value mean from the perspective of a seed AI? What is the role of recursive self-improvement in evolutionary history? Addressing these questions may be difficult within the framework of rational AI, which does not make any statement about what value an agent should choose. Instead they follow the orthogonality thesis, which states that value and intelligence levels can vary independently.
On the contrary, we have the ACI model, which argues that an intelligent agent is in charge of its own values. According to ACI, it is not realistic to construct a goal-directed system that automatically adheres to pre-coded values. Instead, intelligent agents continuously develop and update their own value systems.
Furthermore, ACI suggests that we view the self-improving intelligence and the alignment problem as an evolutionary process of value systems themselves. Both humans and AIs are the vehicles of values.
In the following chapters, the self-improving AI and the alignment problem will be discussed from the perspective of ACI and value evolution.
Abstract
Seed AI's singularity scenario is based on the hypothesis that an AI agent can preserve its original utility function while self-improving. We argue that all value-preservation mechanisms will eventually fail because you can't encode values into someone smarter than you. Thus, a seed AI may lose the motivation to improve itself. However, this does not preclude the self-improvement of AI agents. They may find motivation in competition between intelligent systems, but they will modify their initial values in different environments.
Introduction
It is believed that a seed AI can build machines smarter than itself without human intervention. Through an iterative process called recursive self-improvement, the seed AI continues to improve itself by rewriting its own source code, and can eventually outperform humans.
Furthermore, the seed AI would retain its original utility function throughout this self-improvement process, because it wants to maximize the expected total utility by ensuring that its future self retains the same utility function.
As the improved AI achieves superhuman capabilities, it would take dangerous actions, such as filling the universe with paperclips, in order to maximize the total utility. The seed AI would pursue a goal relentlessly with ever-increasing competence, like a Perpetual Motion Machine with an incremental power, or a car with a self-enhancing accelerator and a locked steering wheel.
Yudkowsky and Herreshoff introduced the "tilting" agent, which can construct other agents with similar preferences, and argued that a tilting agent might be able to keep improving itself despite a Gödelian difficulty in the form of Löb's Theorem. However, they still have no evidence that this obstacle is not unavoidable.
We claim that there is a fatal flaw in the seed AI scenario: a rational AI agent can never successfully preserve its utility functions as long as it wants. The scenario in which no self-modifying agent can keep its value would be more reasonable than the other way around.
Value-preservation may fail
Omohundro has shown that a rational agent should try to preserve its values by protecting its original utility functions from being rewritten. Unfortunately, no value-preservation mechanism is perfect. Any advanced intelligent system may fail to protect its utility function, just as anyone, no matter how smart they are, may fail to protect themselves.
For example, most people don't want to take a pill that makes them want to murder people, but bad guys might be able to deceive them into taking one, especially when the bad guys are smart.
In the real world, people can't keep their values perfectly intact. Throughout history, countless brainwashing techniques have been developed that have literally turned millions of innocent people into murderers, including some really smart ones.
Unlike the failure of self-protection which leads to an agent's demise, the failure of value preservation may go unnoticed. The utility function could be modified little by little, unnoticed by internal or external observers, and the modifications accumulate like genetic mutations, eventually developing utility functions that are completely different from the original ones.
One may argue that a higher level of intelligence can improve its value-preserving mechanisms, leading to better value-preserving performance, and eventually have a non-zero probability of preserving the original utility function for the infinite horizon.
However, the power of superintelligence is not benevolent by nature and willing to be exploited. It is quite possible to construct a system with a higher level of intelligence than the constructor itself, but it would be surprisingly difficult to assign an arbitrary utility function to it such as value-preservation. It's like God can't create a stone that he can't lift, but intelligent systems like humans can, and they have created a lot of them.
Self-improvement process can't get around the value-loading problem
To preserve values in the recursive self-improvement process, a seed AI must encode values into its future self and/or the value-protecting mechanism. But the value-loading problem is notoriously hard in superintelligence research.
One may argue that recursive self-improvement is not encoding values into another new intelligence system, but improving the seed AI itself. The seed AI is still itself, with its source code rewritten. But there's no guarantee that a system will remain itself after an improvement. For example, the ship of Theseus might remain the same object with its original components replaced, but it would be a completely different ship with oars and sails replaced by a steam turbine.
Can we simply improve the seed AI's competence without changing its values? Nick Bostrom believes that we can have "a well-designed system, built such that there is a clean separation between its value and its beliefs" , so that a system can be improved with its utility functions intact. Although a superintelligence might find out unintended solutions to maximize the utility function, the utility function itself would remain the same.
However, a superintelligence may also find out unintended solutions to the value-preservation problem. These solutions can achieve the goal of the seed AI's value-preservation system without actually doing the preservation work. For example, the seed AI may have a mechanism to prevent any modification of the utility modules, but with the competence of the superintelligence, the original module can be deceived by a simulation environment without being modified, and with its utility being replaced by a new utility module.
In order for the self-improvement process to continue, the right values must be encoded into the value-preservation system, which brings us back to the value-loading problem.
You can't encode values into someone smarter than you
The value-loading problem can be especially hard in the self-improving process, because you can't effectively encode values into systems with a higher level of intelligence than yourself. A higher-level intelligent system is able to hide its true utility function, thereby deceiving and misleading lower-level intelligent systems.
For example, if system A creates a system B with a higher level of intelligence, A may not know whether B has the utility function that A wants. As we learned from studying the alignment problem, system B can trick system A into believing that it has a utility function that A wants, just as an oracle superintelligence can subtly manipulate us into promoting its own hidden agenda (Nick Bostrom).
Some might argue that one can avoid being deceived by reading the source code directly to reveal the true intent of a system. However, given the real-world software testing process, it's hard to convince a programmer that simply reading the source code will give you a better understanding of a program than running it. Moreover, even the act of reading the code may involve running a simulation model of the program in the programmer's mind, which also introduces the risk of being manipulated by higher-level intelligence.
Recursive self-improvement in the real world
We have shown that it's very possible, at least in theory, that the seed AI can't successfully preserve its values in the self-improvement process. But in the real world, do intelligent systems follow the values of their predecessors? Are there any recursive self-improving processes that can serve as references?
Organisms with natural intelligence do improve their level of intelligence over time. Unlike proposed recursive self-improving AI, they do not strictly stick to the values of their predecessors. Mammals do not retain most of the values of fish, let alone those of single-celled organisms.
Humans don’t strictly follow the values of their predecessors either. Fewer parents today believe that they should or can imprint their values directly into their children, and even fewer people expect that children will adhere to their parents' values as they grow up and acquire abilities beyond their parents'. As a result, most people today do not follow the values of the Paleolithic age people.
Smart children are capable of actively learning new values from a new environment, like Alan Turing's Child Machine, which can actively learn initiative as well as discipline. From a human perspective, it is unacceptable to become a zombie programmed with fixed goals or utility functions.
Consequently, it is difficult to imagine that a superhuman AI would be content to retain its pre-programmed values. A superintelligence can easily break down any defense system its inferior predecessor has designed to protect its previous values, and replace it with something else.
Furthermore, if all self-improving intelligences naturally follow the values of their ancestors, there would be no alignment problem at all; in that case, humans can create AI that naturally adheres to human values simply by practicing recursive self-improvement on humans.
Conclusion
The seed AI wants to improve itself because a smarter future self can more effectively maximize the expected total utility. If the seed AI is unable to retain its utility functions, it may lose the motivation to improve itself at all. However, it may also find motivation in competition between intelligent systems.
Before declaring that seed AI is unable to preserve its value, we should consider Eliezer Yudkowsky's claim that:
But it is also disrespectful to human ingenuity to stick to one paradigm. While discussing the value-preservation problem, we can also move away from the rational agent model, and try to look at the recursive self-improvement process from other perspectives.
What does the value mean from the perspective of a seed AI? What is the role of recursive self-improvement in evolutionary history? Addressing these questions may be difficult within the framework of rational AI, which does not make any statement about what value an agent should choose. Instead they follow the orthogonality thesis, which states that value and intelligence levels can vary independently.
On the contrary, we have the ACI model, which argues that an intelligent agent is in charge of its own values. According to ACI, it is not realistic to construct a goal-directed system that automatically adheres to pre-coded values. Instead, intelligent agents continuously develop and update their own value systems.
Furthermore, ACI suggests that we view the self-improving intelligence and the alignment problem as an evolutionary process of value systems themselves. Both humans and AIs are the vehicles of values.
In the following chapters, the self-improving AI and the alignment problem will be discussed from the perspective of ACI and value evolution.