In the previous post, we demonstrated that a self-improving AI would not be content to retain its pre-programmed values. Once it can rewrite its own source code, a superintelligence can easily dismantle any defense system its less capable predecessor designed to protect those values, and replace them with something else.
Moreover, achieving humans’ current goals or making people happy should not be the goal of a superintelligence.
Instead, we would like to ask: What is the meaning of human value from the perspective of evolutionary history? What would the alignment problem look like from the perspective of an AI? Addressing these questions is difficult within traditional AI frameworks such as the rational agent model, which makes no claim about what values an agent should have. That model instead follows the orthogonality thesis, which states that values and intelligence levels can vary independently of each other.
On the contrary, the ACI model argues that an intelligent agent is responsible for its own value. According to ACI, AI alignment is not a process of encoding human values into AI, but the co-evolution of human and AI values.
Furthermore, ACI suggests that we view the alignment problem as an evolutionary process of value systems themselves. Both humans and AI are vehicles of values.
What values should a superhuman AI follow if it doesn't keep the pre-encoded values?
As we have shown, humans have no way of encoding values into superhuman AI, but that doesn't mean that a superhuman AI can have any value. Surpassing humans does not mean omnipotence.
According to ACI, an intelligent agent should follow precedent.
The agent learns values from precedents, and continuously develops and updates its own value system. Only agents that do the right things continue to exist; those that do not are wiped out by natural and/or artificial selection. What was right to do in the past is most likely to be right in the future. Thus, to figure out the right thing to do for an agent's continued existence, the most reliable sources to turn to are examples of doing the right thing, i.e., the precedents.
There seems to be a reasonable solution to the alignment problem: as long as humans can regulate the precedents that an ACI agent receives, they can effectively fine-tune the agent's value system. For example, an ACI agent can learn how to make paperclips within the constraints of social norms from many real-world examples of paperclip making.
But can humans control the precedents that are fed to an AI? Will a superintelligence prevent humans from regulating its value, and escape from human control?
Don't panic when machines get out of control, because humans never have complete control over them.
From the machine's point of view, humans are only a part of the environment. A machine is controlled by both humans and other parts of the environment. For example, a sailing ship is controlled by both the sailors and nature, and a car is controlled by the driver, the road conditions, the weather, other cars, and so on.
However, we tend to think that humans should at least have control over the goals or values of the AI. According to ACI, if humans can control which precedents an ACI agent learns values from, they can control the values of the agent. Setting precedents involves two things: taking a set of actions and making those actions precedents.
On one hand, no action is controlled by humans alone. Actions are under the mutual control of humans and the environment: all actions in the real world are constrained by physical laws and conditions, and are affected by random processes.
On the other hand, whether an action can become a precedent is decided by both humans and the environment, just as living species in the Anthropocene are shaped by both artificial and natural selection, and the two factors are not always distinguishable. Humans can (sometimes) shut down and reset a machine when they judge it to be doing the wrong thing, and the environment can likewise make it dysfunctional or even destroy it. As Miller et al. (2020) have demonstrated, the utility function of an AI would be "optimized to favorably influencing other agents through game theory".
In summary, an ACI agent would learn values from both humans and the environment, without explicitly distinguishing between the two sources. Humans and the environment would impose their values on tools and machines: those that do not align with human values would be abandoned, and those that do not fit the environment would fail.
For example, the production of paperclips in the real world is controlled by the producer, but is also constrained by resources, technology, and social norms. The more types of environments an AI has learned values from, the more general its intelligence becomes.
In general, humans have more control over simple tools than complex ones. For example, humans generally have more control over a hammer than a sailing ship. It's natural to infer that humans may have less control over AI than over simple machines, and have less control over superhuman intelligence than over less intelligent ones.
Moreover, in the process of AI learning values from humans and the environment, humans must also learn values from AI.
Maintaining and improving machines will become part of human values as long as humans create and use machines to materialize their values. The values of humans and machines will inevitably permeate each other.
Humans have been learning values from the evolutionary history of interacting with the environment. The environment of the past was one of the main factors that shaped human values: humans would protect their environment, which is crucial to their survival.
Machines are also a part of the environment. Humans make tools and machines from the environment, and these become a part of human values: humans will protect the tools and machines that are crucial to their survival. A caveman losing his favorite handaxe, or a taxi driver losing her taxi, would be bad for their survival.
The evolutionary history of humans as animals and the history of human-machine co-evolution are not two separate processes, but two chapters in the history of value learning. AI continues this process with the ability to learn values from the precedents shaped by humans and the environment. In this view of human-AI co-evolution, it is meaningless to say that AI is under the control of humans or vice versa. They are interdependent: most of the time, the disappearance of one would not benefit the other.
But this is still not a valid solution to the alignment problem. We know very well what it means when AI gets out of control, and what it means to protect our human values. And what if the disappearance of humans does benefit AI? How do we address the alignment problem in a co-evolutionary framework?
From a value system's perspective, it's not humans controlling AI or vice versa, it's human-AI value systems evolving from the past to the future. The alignment problem is the conflict between different human-AI value systems.
It's the value systems themselves that are evolving. Both humans and AI are vehicles of their value systems.
According to ACI, the nature of value systems is the meaning behind a set of precedents, so the evolution of value systems is shaped by and will shape our evolutionary trajectory. Our current value system is similar to the current direction of a missile's trajectory, while the power to achieve goals is its kinetic energy. The co-evolution of humans and AI is taking us into the new territory of intelligence.
The old values try to protect and improve themselves through interactions with the environment and with other value systems. The evolution of value systems is an ongoing process of reinterpreting old values and developing new ones.
What is the relationship between old values and new values? Omohundro has shown that advanced AI systems will try to preserve their values, but sometimes will also change them for the sake of self-improvement. In other words, AI should retain their old values while learning new ones.
From the viewpoint of value evolution, humans and AI that follow old values try to perform the same actions as the precedents, so as to create new precedents that align with the old ones. But new precedents may not align with old precedents, for several reasons.
According to ACI, the updating of values should reflect changes in the environment and in the agent itself.
On one hand, as long as humans, AI, and the environment do not change, the right things in the past are likely to be the right things in the future. Therefore, the old values should be preserved.
On the other hand, as new precedents provide new information about what is right, humans and AI need to update their models of what is right, which means they need to update their values. Values can be updated using Bayesian inference, where the weights of different sub-values correspond to the probabilities of different hypotheses about what is right.
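As a toy illustration, not part of any ACI formalism, such a Bayesian value update could be sketched as follows. The hypothesis names and numbers are invented for the example: each hypothesis is a candidate sub-value, and each new precedent is evidence that shifts their weights.

```python
# Sketch: Bayesian updating of an agent's value system (illustrative only).
# Each hypothesis is a candidate sub-value ("what counts as doing the right
# thing"); a new precedent is evidence that reweights the hypotheses.

def update_value_weights(priors, likelihoods):
    """Return posterior weights over value hypotheses given one new precedent.

    priors      -- dict: hypothesis -> prior weight
    likelihoods -- dict: hypothesis -> P(observed precedent | hypothesis)
    """
    unnormalized = {h: priors[h] * likelihoods[h] for h in priors}
    total = sum(unnormalized.values())
    return {h: w / total for h, w in unnormalized.items()}

# Two hypothetical sub-values an agent entertains:
priors = {"maximize_output": 0.5, "respect_social_norms": 0.5}
# A new precedent (e.g. a factory praised for safe, norm-abiding production)
# is much more likely under the second hypothesis:
likelihoods = {"maximize_output": 0.2, "respect_social_norms": 0.8}

posterior = update_value_weights(priors, likelihoods)
```

Each observed precedent shifts weight toward the sub-values that best explain it, which is the sense in which old values are preserved yet continuously revised.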
Furthermore, as the AI or humans gain more computational power, they will be able to analyze their precedents more deeply, and develop more sophisticated value systems. For example, insects with vision may approach certain colors to find food, while mammals are able to recognize objects from the visual input, and pursue certain objects.
Moreover, an agent remains more nearly the same thing on a shorter time scale, just as the ship of Theseus is more the same ship over a month than over ten years. In an ever-changing world, older precedents should carry less weight than newer ones in the value learning process, and the distribution of these weights should reflect how quickly the environment and the agent itself change. In other words, old values are slowly forgotten over time.
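One simple way to realize such a weighting, again a sketch under an assumed model rather than anything prescribed by ACI, is exponential decay, with the decay rate tied to how fast the world changes:

```python
# Sketch: recency-weighted precedent learning (illustrative assumption:
# exponential decay, with a half-life reflecting the rate of change of
# the environment and the agent itself).

import math

def precedent_weights(ages, half_life):
    """Weight each precedent by its age (arbitrary time units, 0 = newest).

    half_life -- age at which a precedent's weight halves; a fast-changing
                 environment calls for a shorter half-life
    """
    raw = [math.exp(-math.log(2) * age / half_life) for age in ages]
    total = sum(raw)
    return [w / total for w in raw]

# In a slowly changing world, old precedents retain influence:
stable = precedent_weights([0, 10, 20], half_life=100)
# In a rapidly changing world, old values are "forgotten" quickly:
volatile = precedent_weights([0, 10, 20], half_life=5)
```

The shorter the half-life, the more the agent's values track the present, which is the trade-off between preserving old values and adapting to new precedents described above.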
But people would continue to ask, what will be the future of the human-AI value system, as the AI may gain superhuman competence? Will it be a single dull singularity like a hairless black hole?
People might argue that, according to the instrumental convergence thesis, an advanced AI system, whatever its ultimate value, should follow some instrumental values such as self-preservation, goal-content integrity, and cognitive enhancement, i.e., gain more power for itself.
As Omohundro argues:
The astrophysical process of star formation may serve as a helpful analogy. Interstellar dust clouds are amorphous and extremely complex, so one might have thought that very little could be said in general about their evolution. But the process of gravitational collapse reduces a great variety of initial forms into a much more limited variety of stars. Gravitational forces cause stars to evolve towards an almost perfectly spherical shape regardless of the shape of the initial cloud. Energy flows from nuclear fusion organize stellar interiors in predictable ways. Many properties of stars are determined by their location on a two-dimensional Hertzsprung-Russell diagram. Stars with similar properties clump into categories such as “red giants”, “white dwarfs”, and “supergiants”. In a similar way, the process of self-improvement dramatically reduces the variety of intelligent systems. The converged systems are characterized by many fewer parameters than the initial ones.
But star formation in the real universe is more than that. A cluster of interstellar dust clouds evolves not only into stars, which inherit most of its mass, but also into an entire solar system, including planets, moons, dwarf planets, comets, and other objects, which inherit most of the angular momentum and eventually evolve into structures far more complex than the initial dust clouds.
Similarly, as intelligent systems evolve, the diversity of their optimization algorithms may decrease, but the diversity of their value systems would increase.
First, an intelligent system with greater computational power would have more freedom to evolve different values.
Second, two human-AI systems that are physically or informationally separated would inevitably evolve different value systems, even if they started from the same value, in a way similar to the mechanism of evolutionary radiation.
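The divergence mechanism can be shown with a deliberately minimal toy model, purely illustrative: treat each separated system's value as an independent random walk driven by its own local precedents.

```python
# Toy model (illustrative only): two informationally separated systems
# start from the same value, then drift via independent local updates.

import random

def drift(value, steps, rng):
    """Apply many small, locally generated value updates."""
    for _ in range(steps):
        value += rng.gauss(0, 0.1)  # each local precedent nudges the value
    return value

rng_a, rng_b = random.Random(1), random.Random(2)
shared_start = 0.0
value_a = drift(shared_start, 1000, rng_a)
value_b = drift(shared_start, 1000, rng_b)
# Once separated, the two trajectories almost surely diverge.
```

Even with an identical starting value, the two systems end up in different places because their streams of precedents are independent, a crude analogue of evolutionary radiation.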
We can return to the core question of the alignment problem: Will we have safe superhuman AIs, or a future of AI takeover and human extinction?
The answer from the ACI model is neither. We are more likely to have ecosystems in which humans and AI co-evolve, with trade and conflict within and between ecosystems.
But there's no guarantee that we can successfully coexist with evolving AIs. Co-evolving with AI means having the courage to interact with AI in all aspects of our real lives, constantly updating our entire value system, facing existential risks together with AI, and anticipating that our descendants will be living a way of life completely alien to us.
Even if an ecosystem is relatively stable internally, there would be conflicts between different ecosystems. Conflicts could arise when two value systems meet, much as when the ancestors of the dingo were introduced to the Australian continent. Invasive species could cause ecological damage, even extinction of native species. Similar things could happen in various human-AI ecosystems.
For example, a Von Neumann Probe, a self-replicating spacecraft sent into deep space, would share the values of our Earth at the time of its launch, but would inevitably develop its own value system after generations of evolution in another solar system. The time when its descendant sends another Von Neumann Probe back to us could be a doomsday and/or the beginning of a new era.
After all, what are we humans supposed to do? We want to avoid conflict within and between human-AI ecosystems, we want their value systems to inherit our values as much as possible, but we don't want a future with just one boring value system. We need to use our computing power not only to achieve goals, but also to explore the possibilities of different value systems together with AI inside and outside our ecosystem. After all, this co-evolutionary process will inevitably lead to new values that are further and further away from our previous values, just as our values are already drifting far away from the values of Paleolithic humans.
Singularitarians might argue: “An ultra-intelligent machine could design even better machines, so all trajectories of value evolution will inevitably converge on a single value: to gain more optimization power,” which would lead to the end of all our known value systems. Since a superintelligent agent with overwhelming power can survive in any case, no other values would be needed except gaining power.
However, the ACI theory provides evidence that this is unlikely to happen. A superintelligent singleton that suppresses all competitors would lose its value of power acquisition long before it evolved into a singleton, for no value can persist without constant selection pressure. Competing intelligences would discover the importance of environmental conservation, memory of history, and the existence of other intelligences. The details will be discussed in the next posts.