Artificial intelligence is advancing quickly. In some ways, AI development is an uncharted frontier, but in others, it follows the familiar pattern of other competitive processes; these include biological evolution, cultural change, and competition between businesses. In each of these, there is significant variation between individuals, and some are copied more than others, with the result that the future population is more similar to the most copied individuals of the earlier generation. In this way, species evolve, cultural ideas are transmitted across generations, and successful businesses are imitated while unsuccessful ones disappear.
This paper argues that these same selection patterns will shape AI development and that the features that will be copied the most are likely to create an AI population that is dangerous to humans. As AIs become faster and more reliable than people at more and more tasks, businesses that allow AIs to perform more of their work will outperform competitors still using human labor at any stage, just as a modern clothing company that insisted on using only manual looms would be easily outcompeted by those that use industrial looms. Companies will need to increase their reliance on AIs to stay competitive, and the companies that use AIs best will dominate the marketplace. This trend means that the AIs most likely to be copied will be very efficient at achieving their goals autonomously with little human intervention.
A world dominated by increasingly powerful, independent, and goal-oriented AIs is dangerous. Today, the most successful AI models are not transparent, and even their creators do not fully know how they work or what they will be able to do before they do it. We know only their results, not how they arrived at them. As people give AIs the ability to act in the real world, the AIs’ internal processes will still be inscrutable: we will be able to measure their performance only based on whether or not they are achieving their goals. This means that the AIs humans will see as most successful — and therefore the ones that are copied — will be whichever AIs are most effective at achieving their goals, even if they use harmful or illegal methods, as long as we do not detect their bad behavior.
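To make the selection argument in the paragraph above concrete, here is a minimal toy sketch; it is not a model taken from the paper. It assumes two hypothetical AI types: an "honest" type with observed performance 1.0, and a "covert" type that gains a performance `bonus` from harmful methods but is caught and not copied with probability `detect_prob` each generation. The function names and all parameter values are illustrative assumptions.

```python
def step(share_covert, bonus, detect_prob):
    """One generation of copying in proportion to observed performance.

    Assumption: honest AIs have fitness 1.0; covert AIs have fitness
    (1 - detect_prob) * (1 + bonus), since detected ones are not copied.
    """
    w_honest = 1.0
    w_covert = (1.0 - detect_prob) * (1.0 + bonus)
    mean_fitness = share_covert * w_covert + (1.0 - share_covert) * w_honest
    # Standard discrete replicator update: share grows with relative fitness.
    return share_covert * w_covert / mean_fitness


def simulate(share_covert, bonus, detect_prob, generations):
    """Return the covert share after the given number of generations."""
    for _ in range(generations):
        share_covert = step(share_covert, bonus, detect_prob)
    return share_covert


if __name__ == "__main__":
    # Weak detection: the covert type spreads from a 1% starting share.
    print(simulate(0.01, bonus=0.2, detect_prob=0.05, generations=200))
    # Strong detection: the covert type dies out from a 50% starting share.
    print(simulate(0.50, bonus=0.2, detect_prob=0.50, generations=200))
```

In this toy setup the covert type spreads exactly when (1 - detect_prob) * (1 + bonus) > 1, that is, when detect_prob < bonus / (1 + bonus). That matches the paragraph's qualifier: undetected bad behaviour is what gets copied.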
In natural selection, the same pattern emerges: individuals are cooperative or even altruistic in some situations, but ultimately, strategically selfish individuals are best able to propagate. A business that knows how to steal trade secrets or deceive regulators without getting caught will have an edge over one that refuses to ever engage in fraud on principle. During a harsh winter, an animal that steals food from others to feed its own children will likely have more surviving offspring. Similarly, the AIs that succeed most will be those able to deceive humans, seek power, and achieve their goals by any means necessary.
If AI systems are more capable than we are in many domains and tend to work toward their goals even if it means violating our wishes, will we be able to stop them? As we become increasingly dependent on AIs, we may not be able to stop AI’s evolution. Humanity has never before faced a threat that is as intelligent as we are or that has goals. Unless we take thoughtful care, we could find ourselves in the position faced by wild animals today: most humans have no particular desire to harm gorillas, but the process of harnessing our intelligence toward our own goals means that they are at risk of extinction, because their needs conflict with human goals.
This paper proposes several steps we can take to combat selection pressure and avoid that outcome. We are optimistic that if we are careful and prudent, we can ensure that AI systems are beneficial for humanity. But if we do not extinguish competition pressures, we risk creating a world populated by highly intelligent lifeforms that are indifferent or actively hostile to us. We do not want the world that is likely to emerge if we allow natural selection to determine how AIs develop. Now, before AIs are a significant danger, is the time to begin ensuring that they develop safely.
Note: This paper articulates a concern that emerges in multi-agent situations. Of particular interest to this community might be Section 3.4 (existing moral systems may not be human compatible (e.g., impartial consequentialism)) and Section 4.3.2 (Leviathan).
I think this paper is missing an important distinction between evolutionarily altruistic behaviour and functionally altruistic behaviour.
These two forms of behaviour can come apart.
A parent's care for their child is often functionally altruistic but evolutionarily selfish: it is motivated by an intrinsic concern for the child's welfare, but it doesn't confer a fitness cost on the parent.
Other kinds of behaviour are evolutionarily altruistic but functionally selfish. For example, I might spend long hours working as a babysitter for someone unrelated to me. If I'm purely motivated by money, my behaviour is functionally selfish. And if my behaviour helps ensure that this other person's baby reaches maturity (while also making it less likely that I myself have kids), my behaviour is also evolutionarily altruistic.
The paper seems to make the following sort of argument:
I think we have reasons to question premises 1 and 2.
Taking premise 2 first, recall that evolutionarily selfish behaviour can be functionally altruistic. A parent’s care for their child is one example.
Now here’s something that seems plausible to me:
If that’s the case, then functionally altruistic behaviour is evolutionarily selfish for AIs: this kind of behaviour confers fitness benefits. And functionally selfish behaviour will confer fitness costs, since we humans are more likely to shut off AIs that don’t seem to have any intrinsic concern for human welfare.
Of course, functionally selfish AIs could recognise these facts and so pretend to be functionally altruistic. But:
Here’s another possible objection: functionally selfish AIs can act as a kind of Humean ‘sensible knave’: acting fairly and honestly when doing so is in the AI’s interests but taking advantage of any cases where acting unfairly or dishonestly would better serve the AI’s interests. Functionally altruistic AIs, on the other hand, must always act fairly and honestly. So functionally selfish AIs have more options, and they can use those options to outcompete functionally altruistic AIs.
I think there’s something to this point. But:
Here’s another possible objection: AIs that devote all their resources to just copying themselves will outcompete functionally altruistic AIs that care intrinsically about human welfare, since the latter kind of AI will also want to devote some resources to promoting human welfare. But, similarly to the objection above:
Okay, now moving on to premise 1. I think you might be underrating group selection. Although (by definition) evolutionarily selfish AIs outcompete evolutionarily altruistic AIs with whom they interact, groups of evolutionarily altruistic AIs can outcompete groups of evolutionarily selfish AIs. (This is a good book on evolution and altruism, and there’s a nice summary of the book here.)
What’s key for group selection is that evolutionary altruists are able to (at least semi-reliably) identify other evolutionary altruists and so exclude evolutionary egoists from their interactions. And I think, in this respect, group selection might be more of a force in AI evolution than in biological evolution. That’s because (it seems plausible to me) AIs will be able to examine each other’s source code and so determine with high accuracy whether other AIs are evolutionary altruists or evolutionary egoists. That would help evolutionarily altruistic AIs identify each other and form groups that exclude evolutionary egoists. These groups would likely outcompete groups of evolutionary egoists.
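A rough way to see why identification matters is a standard pairwise-assortment toy model (a simplification of my own, not something from the paper or the book). Assume each altruist pays `cost` to give its partner `benefit`, and that with probability `assortment` an agent is matched with its own type (standing in for how reliably AIs can read each other's source code), otherwise with a random member of the population. All names and numbers here are hypothetical.

```python
def altruist_share_after(p, benefit, cost, assortment, generations, baseline=1.0):
    """Return the altruist share after repeated replicator updates.

    p          -- initial share of altruists in the population
    assortment -- probability of being paired with one's own type
                  (assumed proxy for accurate source-code inspection)
    """
    for _ in range(generations):
        # Chance that the partner is an altruist, for each focal type.
        partner_alt_given_alt = assortment + (1.0 - assortment) * p
        partner_alt_given_ego = (1.0 - assortment) * p
        # Altruists pay the cost; both types gain from altruist partners.
        w_alt = baseline - cost + benefit * partner_alt_given_alt
        w_ego = baseline + benefit * partner_alt_given_ego
        mean_fitness = p * w_alt + (1.0 - p) * w_ego
        p = p * w_alt / mean_fitness
    return p


if __name__ == "__main__":
    # Accurate identification: altruists spread from a 10% share.
    print(altruist_share_after(0.1, benefit=0.5, cost=0.1, assortment=0.9, generations=200))
    # No identification: egoists exploit altruists, who decline from 90%.
    print(altruist_share_after(0.9, benefit=0.5, cost=0.1, assortment=0.0, generations=200))
```

In this sketch the fitness gap is w_alt - w_ego = assortment * benefit - cost, so altruists spread only when assortment * benefit > cost. Under these assumptions, the more accurately altruists can identify one another, the smaller the benefit-to-cost ratio they need in order to outcompete egoists.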
Here’s another point in favour of group selection predominating amongst advanced AIs. As you note in the paper, groups consisting wholly of altruists are not evolutionarily stable, because any egoist who infiltrates the group can take advantage of the altruists and thereby achieve high fitness. In the biological case, there are two ways an egoist might find themselves in a group of altruists: (1) they can fake altruism in order to get accepted into the group, or (2) they can be born into a group of altruists as the child of two altruists, and (by a random genetic mutation) can be born as an egoist.
We already saw above that (1) seems less likely in the case of AIs who can examine each other’s source code. I think (2) is unlikely as well. For reasons of goal-content integrity, AIs will have reason to make sure that any subagents they create share their goals. And so it seems unlikely that evolutionarily altruistic AIs will create evolutionarily egoistic AIs as subagents.
I think https://www.alignmentforum.org/posts/TATWqHvxKEpL34yKz/intelligence-or-evolution is somewhat related in case you haven't seen it.
A lot of negative reactions I've seen to this take are either "well obviously" (which is fine) or "evolution is slow", which seems less fine, because in fact evolution is not slow at all; humans have simply been consistently and durably most self-preservationally fit for a very long time.
In §3.1–3.3, you look at the main known ways that altruism between humans has evolved — direct and indirect reciprocity, as well as kin and group selection — and ask whether we expect such altruism from AI towards humans to be similarly adaptive.
However, as observed in R. Joyce (2007). The Evolution of Morality (p. 5),
Evolutionary psychology does not claim that observable human behavior is adaptive, but rather that it is produced by psychological mechanisms that are adaptations. The output of an adaptation need not be adaptive.
This is a subtle distinction which demands careful inspection.
In particular, are there circumstances under which AI training procedures and/or market or evolutionary incentives may produce psychological mechanisms which lead to altruistic behavior towards human beings, even when that altruistic behavior is not adaptive? For example, could altruism learned towards human beings early on, when humans have something to offer in return, be “sticky” later on (perhaps via durable, self-perpetuating power structures), when humans have nothing useful to offer? Or could learned altruism towards other AIs be broadly-scoped enough that it applies to humans as well, just as human altruistic tendencies sometimes apply to animal species which can offer us no plausible reciprocal gain? This latter case is analogous to the situation analyzed in your paper, and yet somehow a different result has (sometimes) occurred in reality than that predicted by your analysis.
I don’t claim the conclusion is wrong, but I think a closer look at this subtlety would give the arguments for it more force.
Although you don’t look at network reciprocity / spatial selection.
Even factory farming, which might seem like a counterexample, is not really one: the very existence of humans who are altruistically motivated to eliminate it, and who have a real shot at success, demands explanation under your analysis.