Epistemic Status: Very speculative, intended to provoke thought rather than convince.
 

A crux of much AI safety research is the theory that agents are powerful; that we should robustly expect agents that plan and seek power (operationalized as P2B loops) to outcompete other entities in the long term. Among agents, higher levels of agency are better - we should expect entities with planning and metacognition to outcompete entities that merely learn from direct experience or mimicry.

I claim this theory is not well-supported by empirical evidence, and that it’s at least as plausible to expect cooperation, not agency, to be the key property that determines which entities will survive and outcompete.

I don’t mean to say that cooperation and agency are mutually exclusive; in fact they are complements, and I expect future intelligent systems to exhibit high levels of both. I mean that power-seeking and cooperation are mutually exclusive, and if the world selects for cooperation more strongly than for agency, the instrumental convergence arguments for power-seeking may not go through.

Evidence from Biomass

A clear notion of power that has been the subject of extreme optimization pressure for billions of years is biomass. An organism’s biomass measures the amount of matter it directly controls (with a lot of corner cases and “extended phenotype” considerations we won’t get into - let’s just count carbon atoms).

The vast majority of the world’s biomass (>99%) is controlled by bacteria, plants, and fungi, entities which are not particularly agentic. But perhaps we should exclude them because they are not really in control of their component matter, not in the same sense that animals are - e.g. they are either much slower at moving their carbon atoms around (plants, fungi) or are so small and simple that they just don’t count (bacteria).

Let’s very roughly normalize for organism size and speed, then. Once we’ve normalized, the most outlier species in terms of biomass are

  1. Humans, with a single species composing something like 50% of all large animal biomass (most of the remainder being livestock under our direct control).
  2. Eusocial insects, especially ants and termites, which compose 2% of insect species but the majority of insect biomass.

What makes these species so powerful? Not agency, not their ability to plan. Their distinctive feature relative to comparable species is cooperation - their ability to coordinate with thousands and sometimes millions of other organisms from the same species. This is most obvious with eusocial insects, which clearly lack any ability to plan and have only a very limited ability to learn from experience. Human societies do sometimes have central nodes that perform limited planning, but this doesn’t seem essential to their success.

The related cultural intelligence hypothesis suggests that humans are dominant due to our ability to cooperate not just across space but across time via culture - our most powerful tools like language, mathematics, and the scientific method are the product of millions of individuals cooperating over thousands of years.

Evidence from Recent AI Progress

It’s notable that the most promising current AI systems are large language models - AIs that learn culturally (by reading things humans have written over our history), while more “agentic” approaches to AI like deep reinforcement learning have stagnated in relative significance.

It might be that this is a temporary trend due to a “cultural knowledge overhang”, and once we get to the frontier of what humans know, agentic approaches will begin to outperform. 

But at least based on simple trend extrapolation and the biological evidence, we should bet that the future belongs to entities that feature unusually high levels of cooperation, not unusually high levels of power-seeking.

Implications for AI Safety

Just because evolution on Earth so far has selected strongly for cooperation, doesn’t mean this will continue being the case.

A reframe of the AI alignment problem is “ensuring that software continues to be selected for cooperation with humans more strongly than for power-seeking”. This is certainly not guaranteed, and there are many plausible paths to ruin. But we know there exist real environments with this kind of selection pressure, so it can’t be impossible, and in fact might be easier than the instrumental convergence arguments suggest.

So what’s wrong with power-seeking?

The usual instrumental convergence arguments give us reasons to expect a world dominated by power-seeking, planning agents, but instead we observe a world dominated by cooperators. So what are those arguments missing?

The simplest explanation is that evolution just got stuck in a local optimum, and agents are like bicycles - vastly more efficient, but impossible to evolve by a local search algorithm. Then it seems likely that gradient descent won’t select for agents either, but human designers will.

Alternatively, some real-world constraint on computation has made planning uncompetitive (so far). You’re better off investing the marginal watt of energy into communicating with someone else, than into thinking more yourself. What could this constraint be?

The obvious candidate is the data bottleneck for learning systems, most recently seen in the Chinchilla results. For any fixed budget of compute and bandwidth, you’re better off pushing computation and decision making towards the edges. This constraint might be generated by a property like the high energy cost of error-free information transmission across physical space. Perhaps this constraint will be lifted as we transition from carbon-based computation to silicon-based computation, but it's not obvious why.
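To make the data bottleneck concrete, here is a rough back-of-the-envelope sketch. It rests on two rule-of-thumb assumptions that are not stated in this post: that training compute is approximately 6 x parameters x tokens, and that the compute-optimal ratio is roughly 20 training tokens per parameter. The function name and constants are purely illustrative, and the sketch speaks only to the data requirement, not to the broader claim about pushing computation towards the edges.

```python
import math

# Assumptions (rules of thumb, not taken from this post):
TOKENS_PER_PARAM = 20        # approximate compute-optimal tokens per parameter
FLOPS_PER_PARAM_TOKEN = 6    # training compute C is roughly 6 * N * D


def compute_optimal(compute_flops):
    """Return (parameters, tokens) that spend compute_flops at the assumed ratio."""
    # Solve C = 6 * N * D with D = 20 * N for N.
    params = math.sqrt(compute_flops / (FLOPS_PER_PARAM_TOKEN * TOKENS_PER_PARAM))
    tokens = TOKENS_PER_PARAM * params
    return params, tokens


if __name__ == "__main__":
    for budget in (5.88e23, 5.88e25):
        params, tokens = compute_optimal(budget)
        print(f"{budget:.2e} FLOPs -> {params:.2e} params, {tokens:.2e} tokens")
```

Under these assumptions, the data requirement grows with the square root of compute: a 100x larger budget calls for about 10x more tokens, so a learner with a fixed pool of data eventually stops benefiting from thinking harder on its own.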

The most optimistic possibility is that there exists an instrumentally convergent drive for cooperation, along the lines of Robert Axelrod’s Evolution of Cooperation. Depending on the specific environment, the selection pressures for cooperation might be stronger than for power-seeking. Properties which make an environment select more for cooperation include

  • The existence of other powerful entities (in a dead universe, there's nobody to cooperate with).
  • Iterated games with affordances for punishment (easy to hurt other entities, but hard to destroy them).
  • Overlap in values between entities
    • Genetic similarity.
    • Gains from trade or symbiosis.
  • High communication bandwidth between entities.
  • Affordances for making pre-commitments (e.g. via legal systems, via showing your code, via self-boxing).

Increasing the extent to which these properties hold for our world is an underrated path for reducing existential risk.
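As a toy illustration of the "iterated games with affordances for punishment" property, here is a minimal sketch of an Axelrod-style round-robin prisoner's dilemma. The payoff values (5, 3, 1, 0) are the conventional ones; the function names, population mix, and round counts are illustrative assumptions, not a reconstruction of any specific tournament.

```python
import itertools

# Payoffs (first player, second player) for one round of the prisoner's dilemma.
PAYOFF = {
    ("C", "C"): (3, 3),
    ("C", "D"): (0, 5),
    ("D", "C"): (5, 0),
    ("D", "D"): (1, 1),
}


def always_defect(own_history, other_history):
    return "D"


def tit_for_tat(own_history, other_history):
    # Cooperate first, then copy the opponent's previous move (punishing defection).
    return other_history[-1] if other_history else "C"


def play_match(strategy_a, strategy_b, rounds):
    """Play one repeated game and return the two total scores."""
    hist_a, hist_b = [], []
    score_a = score_b = 0
    for _ in range(rounds):
        move_a = strategy_a(hist_a, hist_b)
        move_b = strategy_b(hist_b, hist_a)
        pay_a, pay_b = PAYOFF[(move_a, move_b)]
        score_a += pay_a
        score_b += pay_b
        hist_a.append(move_a)
        hist_b.append(move_b)
    return score_a, score_b


def average_scores(counts, rounds):
    """Round-robin over a population; counts maps strategy -> number of individuals."""
    population = [s for s, n in counts.items() for _ in range(n)]
    totals = {s.__name__: 0 for s in counts}
    # combinations() pairs by position, so identical strategies still meet each other.
    for a, b in itertools.combinations(population, 2):
        score_a, score_b = play_match(a, b, rounds)
        totals[a.__name__] += score_a
        totals[b.__name__] += score_b
    return {s.__name__: totals[s.__name__] / n for s, n in counts.items()}


if __name__ == "__main__":
    mix = {tit_for_tat: 3, always_defect: 3}
    print("one-shot:", average_scores(mix, rounds=1))
    print("iterated:", average_scores(mix, rounds=200))
```

With this particular mix, defectors come out ahead when each pair meets only once, while tit-for-tat players come out well ahead over 200 rounds. The only thing that changes is how long the game is iterated, which gives the punishing cooperators time to make defection unprofitable.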

Thanks to Daniel Kokotajlo, Deger Turan, and TJ for helpful comments on an earlier draft.


 

7 comments

while this appears to me to be true, a species of cooperator-behavior mechanisms A that same-mechanism-cooperate A-with-A in order to defect against not-same-mechanism behaviors B can be catastrophically dangerous to the other species B, even if the other species B are themselves self-cooperators. if one subnetwork of self-cooperators does not participate in the eventual broad interspecies cooperation group, that subnetwork can potentially produce non-cooperative outcomes on the outer scale. it seems to me that at a given scale, agency that is local on that scale does tend to try non-cooperative routes first, and that this initially limits its success; see, for example, how in evo game theory simulations, semi-cooperators like tit-for-tat-with-forgiveness and such usually win eventually and produce a cooperative society (depending on the setup), but defector-against-outer-network subnetworks can have a significant foothold for a long time.

because of this, it is my current view that eventually, some unspecified species of durable semicooperators are likely to win nearly the entire universe. however, on the path to get there, species of non-cooperator behaviors may cause significant damage to other cooperators. this is despite the fact that a fully cooperative network appears to me to be eventually near-guaranteed, for efficiency reasons.

so if one network of self-cooperators A value their own defection against other self-cooperators B where [A, B] could be [one military, humanity], or [humanity, cows and chickens], or [integrated multi-component ASI, humanity], then the defense analysis necessary to produce a game tree that visibly pays out to semicooperators when they cooperate is not necessarily trivial.

That said, because I strongly agree that long term, any agent wants to become a universally pro-tolerance semicooperator and constrain its aesthetic-structure values to apply to only a finite amount of the universe's negentropy, I think we have the potential to teach agentic AIs from the start that bridging non-cooperative circumstances is a worthy and useful goal which results in durability for the agent's aesthetic intentions because of durable coprotection.

(coprotection is a word I've used for a while to refer to agentic cooperation, ie an agent that tries to find ways to produce durability for other agents' values, not just their own. there are probably terms of art that I should be using, but coprotection seems to mean the right thing semantically without further explanation in nontechnical conversation.)

Universal cooperation on all system levels means total optimisation in the universe as a neural network, and indeed this can be a "goal" (unattainable, though), but the maximum gradient according to this loss function doesn't necessarily mean removing optimisation frustrations with particular subsystems (humans and the society) first, or at all. Especially if the AI takes panpsychism seriously.

constrain its aesthetic-structure values to apply to only a finite amount of the universe's negentropy

I don't understand this phrase. (Neg)entropy is a numeric property of a physical system (including the whole universe), that is, a number. What does it mean to apply something to a "limited amount" of it?

I mean that we can assign a particular block of matter, as priced by the amount of negentropy it contains, to a computation trajectory (eg, a person, or an ai). that is, we would fuel the ai with that amount of unspent energy.

can you clarify what you mean by the comparison to the universe as a neural network? I'm having trouble understanding the paper due to insufficient physics background, but it seems like it's not drawing a very coherent connection. I do think there's a connection to be drawn, but I'm extremely suspicious about whether this is the correct one.

I'm not sure we've ever had cooperators that weren't subjects within agents.

In every case I can think of, systems of cooperators (eukaryotic cells, tribes, colonies) arose with boundaries, that distinguish them from their conspecifics, predators, or competing groups, who they were in fierce enough competition with that the systems of cooperators needed to develop some degree of collective agency to survive.

I think the tendency for sub-agents within a system to become highly cooperative is interesting and worth talking about though.

It's not obvious how to... pierce the boundary... and get them to cooperate with things outside of the system.

I think it might be interesting if you tried to characterize "cooperation" more, give it the same level of vividness or tangibility that agency holds for us. Until then, I'm not sure what you think cooperation is. It might just be complicity. A hammer is a very cooperative object. It goes along with everything an agent might want to do with it, for instance, cracking another agent's skull. Abundance of complicity often doesn't lead to conditions of peace. It can be like kindling for wildfires, a source of environmental instability.

Symbiosis is ubiquitous in the natural world, and is a good example of cooperation across what we normally would consider entity boundaries.

When I say the world selects for "cooperation" I mean it selects for entities that try to engage in positive-sum interactions with other entities, in contrast to entities that try to win zero-sum conflicts (power-seeking).

Agreed with the complicity point - as evo-sim experiments like Axelrod's showed us, selecting for cooperation requires entities that can punish defectors, a condition the world of "hammers" fails to satisfy.

Power-seeking conflict might be zero- or negative-sum in terms of its immediate effect, yet the order which is established after the conflict is over (perhaps, temporarily) is not necessarily zero-sum. Dictatorship is not a zero-sum order, it could be even more productive in the short run than democracy.

P2B seems related to the planning step in the Active Inference loop.

I mean that power-seeking and cooperation are mutually exclusive, and if the world selects for cooperation more strongly than for agency, the instrumental convergence arguments for power-seeking may not go through.

Power-seeking, cooperation, and agency are all such vague behaviour patterns that I think it makes little sense to talk about which of them "the world selects for".

I think cooperation should be considered as the construction of a higher-level system (whether this system is an "agent" or not is an unrelated question, if this question is scientifically meaningful at all, which I doubt). For example, cells in the human body cooperate to create a human. Also, using the examples from the post, humans form communities (also companies and societies) and ants form ant colonies in this way, all higher-level systems relative to individual people or ants.

Power-seeking is similar, in fact. Power-seeking can be conceived either as the power-seeking agent pretending to be a higher-level agent itself, or as that agent forming a higher-level system together with some other systems which it dominates. So it leads to the creation of a higher-level system with a different communication/control/governance structure than in the case of cooperation.

Then, which type of system (grassroots-cooperative or centrally controlled) is "[morally] better in the long-term" or "outcompetes" or "emerges from the current AI development trajectory, coupled with economic, cultural, and political trajectories of our civilisation" is a totally separate question, or, rather, multiple different questions with possibly different answers. The answers depend on the features of our world, available for inquiry today, and the emergent properties of these systems: agility/adaptivity, raw information processing power, etc.

But at least based on simple trend extrapolation and the biological evidence, we should bet that the future belongs to entities that feature unusually high levels of cooperation, not unusually high levels of power-seeking.

From what I wrote above, I would say this bet doesn't make much sense, or at least is not properly sharpened. You should focus on the properties of the emergent systems.


Active Inference tells us that instrumental convergence is not about power per se; it's about the predictability of both oneself and one's environment. Power is just one of the good precursors of predictability, but not the only one: balanced systems with many feedback loops (see John Doyle's work on "diversity-enabled sweet spots", e.g. https://ieeexplore.ieee.org/abstract/document/9867859) should expect to be predictable, including to themselves.

Thinking about the trajectories which lead to the selection for a cooperative system, I think we should revisit Drexler's comprehensive AI services.