The Governance Problem and the "Pretty Good" X-Risk

Zach Stein-Perlman

No longer endorsed.

Like a child suddenly given awesome strength, I could have pushed too hard, and left the world a broken toy I could never repair. —The Hero of Ages

I hope we create superintelligence, and I hope it does what we want. But this would not suffice to make the future great. For a great future from superintelligence, a third condition is necessary: what we want is great. This post is about what the controllers of a superintelligence will want and how we can improve that.

Epistemic status: confident about the problem, not confident about specific failure modes or solutions.

I. Introduction

If it were available to them, I think current human elites (or almost any other group of humans) would overwhelmingly choose this bargain over the status quo:

Earthly Utopia. Human civilization will last on Earth for ten billion years. During that time, almost every person's life will be better than the best life ever lived before now. Earth will be a place of great learning and discovery, personal excellence and achievement, expansive freedom, extraordinary beauty, wonderful experiences, deep relationships, harmony with nature, meaningful projects, and profound joy. However, we may never use resources outside our solar system.

It may be that more intelligent, thoughtful, and wise versions of ourselves would take such a bargain. But more likely, I think, our enlightened counterparts would regard it as the worst mistake and the greatest catastrophe ever. If an astronomical number of wonderful planets is astronomically better than one, we have a problem:

Our decisionmakers and decisionmaking institutions would take a terrible bargain because they neglect nontraditional sources of value (viz., optimizing the rest of the universe) that happen to be overwhelmingly important (in expectation). This is not surprising — these same processes egregiously fail to appreciate the value or risk of AI. So if we solve alignment and create superintelligence, it will not just take care of the rest. It's not enough to have nice people control the superintelligence. To adapt from The Rocket Alignment Problem:

Superintelligence may be developed by ill-intentioned people. But that's not the failure mode I'm worried about right now. I'm more worried that right now, even nice and generous potential-controllers-of-superintelligence would want it to do non-optimal things. Whether Google or the United States or North Korea is the one to develop superintelligence won't make a major difference to the probability of an existential win from my perspective, because right now potential controllers of superintelligence don't want what's optimal.

Aligned superintelligence is good to the extent that the operator wants good things. Longtermists/EAs/rationalists worry a lot about alignment, but little about aligning the eventual operator's instructions with what is best. I aim to investigate what could make us succeed or fail and explain why we should care a lot about this problem, which I call the governance problem since I expect governance issues to be vital.^[1] If we survive and create powerful aligned AI, I believe it is almost certain that the future will be wonderful by provincial human standards. But most such scenarios are still existential catastrophes.

II. Good Uses of Superintelligence

Some uses of superintelligence would be near-optimal.^[2] Call such uses "great," and call a possible future "great" if it involves great use of superintelligence. Under reasonable normative and empirical assumptions, whether a future is great depends overwhelmingly on how we use most of the matter available to us in the universe in the long run, not on the short-term future or the future of Earth.^[3] I make such assumptions here, although some of my conclusions hold without them.

A final use of superintelligence would be great if and only if we near-perfectly:

Precisely specify the object-level things the superintelligence should do,
Tell it what to optimize for,
Tell it how to choose what to optimize for, or
Etc.^[4]

Fortunately, we need not make a final decision—one that closes off most possible futures—immediately after creating superintelligence. The immediate problem of what to do with superintelligence is not how to optimize the universe but how to decide how to optimize the universe. A use of superintelligence is great if we:

Have the superintelligence do something great as a final use; i.e., do something on the previous list (level 0),
Delegate to a system that chooses and effects a level 0 system, near-certainly and without wasting much time/resources (level 1),
Delegate to a system that delegates to a level 1 system, near-certainly and with little waste (level 2), or
Etc.^[5]

Some prima facie great systems for choosing how to use superintelligence are:

Well-designed long reflection, likely involving intelligence amplification, using the superintelligence to understand what various possible futures would look like across many dimensions, and receiving and evaluating arguments from the superintelligence.
Well-designed indirect normativity.
Any system that would delegate to one of the above systems (near-certainly and with little waste).

But I won't speculate here on the details of great systems. Instead, I'll consider what affects the system that we will end up with.

III. Short-Term Issues

We will presumably be able to choose what precisely to do with aligned superintelligence after we create it (rather than the use being determined before or as we create the superintelligence). But what happens before superintelligence affects what we ultimately choose. Our short-term actions affect:

What kind of agent will control the superintelligence
How the controller will think about how to use it and the options that the controller will see as possible, wise, moral, legitimate, dangerous, reckless, etc.
The formal and informal constraints on the use of superintelligence

For example, a poorly-designed "AI constitution," whether hard (truly binding) or soft (with political and psychological power), would be bad. A well-designed one would be good.

I believe it is very likely that the controller of superintelligence will be a state, a group of states, or a new international organization. In particular, I expect that states will appreciate AI before we create superintelligence and will nationalize or oversee promising projects.

The organizations and ideas with power just before superintelligence will determine how we use it. I expect that at the beginning of a fast-takeoff intelligence explosion, how we use superintelligence will be predictable; it will appear very likely or very unlikely that we will use it well. So while the failures I will discuss manifest after superintelligence, and while the period during and directly after the intelligence explosion may be important, I think whether we succeed on governance will be mostly determined before the intelligence explosion.^[6]

IV. Unipolar Failure Modes

A world is "unipolar" if it is dominated by a single agent (such as an organization, human, or AI), called a singleton. A unipolar world order could arise if multiple powerful organizations (presumably states) unite. More plausibly, an organization could form a singleton if it becomes sufficiently powerful relative to others, such as by creating powerful aligned AI as the result of an intelligence explosion.

Suppose a single aligned superintelligence is much more powerful than the rest of the world, and its controller is a singleton, making decisions for the whole world. What does the controller do with this power? It depends on the controller's values and decisionmaking structure: we should expect different choices depending on the nature of the controller (individual, state, coalition of states, international organization), its decisionmaking structure, and popular ideas about AI and what to do with it. Here are some potential failure modes I brainstormed, in no particular order, with examples or subcases indented:

Flawed benevolence. The singleton tries to do good (with direct or indirect normativity) but chooses poorly, locking in a non-great outcome.
Flawed decisionmaking delegation.
- Decisionmaking (whether hard power or just the ability to make a recommendation that will probably be executed) is delegated to a reasonable-sounding and politically acceptable but uninformed group (e.g., poorly-chosen experts, the UN, a state's legislators, a state's voters, or all adults), and something big is decided without sufficient information (viz., without leveraging AI to make suggestions and comment on proposals, but more broadly deciding something before there's been time/opportunity to improve decisionmakers' normative beliefs).
Flawed power delegation.
- The world order breaks down into a multipolar scenario which then fails (e.g., powerful AI is controlled by a group of states, which decides to give copies of the AI to each state in the group).
- Power is delegated directly to individuals, and a great outcome is impossible due to coordination difficulties.
Other internal coordination failure. The controller is an organization without a specific mandate for using superintelligence and without a centralized, agenty process to decide.
- Formal or informal constraints on use of superintelligence mean that the best politically feasible use is not great.
  - Poor plans were formally locked in or informally became the default, even if the controller dislikes them now (or would dislike them if they had not been locked in).
- Deadlock. Decisionmakers cannot reach a required level of consensus on major issues and the decisionmaking structure responds poorly; the default outcome is not great.
Lack of benevolence. The singleton does not try to do good.
Lack of understanding/appreciation. The controller does not appreciate how superintelligence enables irrevocably locking out great possible futures and does not act with the necessary care.
- Initially, superintelligence is used in ways that are large only by prosaic standards. Perhaps due to popular misunderstanding of AI, superintelligence is treated as a tool for fulfilling object-level preferences rather than for planning or improving our preferences. Eventually, a controller of superintelligence decides to optimize for its current preferences, locking in a narrow set of possible futures, and they are not great.
- To prevent locking in undesired futures, the controller implements a poorly-designed constitution for superintelligence, locking out great possible futures.
Weak safeguards. A discontented faction seizes power, legally or illegally, and uses superintelligence poorly. (This is more relevant the more decentralized and disparate the controller is.)

But attempting to separate these possibilities may be analytically counterproductive by concealing the large-scale reason for concern. Few people want what's optimal, so political forces just don't push in that direction. Most people, interest groups, and policymakers will just have prosaic goals. Political institutions like ours would likely struggle to achieve anything meaningful: it is much easier to imagine decisive political support for something prosaic than for a plan for using our cosmic endowment. A radical (by our standards) plan for how to use our cosmic endowment—which is presumably necessary for a great future^[7]—is not politically feasible. Instead, we may end up with a "pretty good" future: one excellent along prosaic dimensions, like Earthly Utopia, but not great. Extremely democratic uses of AI are prima facie similarly problematic—most people don't want what's best—and also have coordination issues.

I think superintelligence will likely be governed by something like our current political institutions, and these institutions may fare poorly. We should not trust the long-term future to decisionmaking systems based on current humans' preferences; we prima facie should delegate this choice to another system. Any candidate system must be both politically feasible and likely to choose well. That is, I think we need something like long reflection to use our cosmic endowment well.^[8] If ideas like long reflection sound radical or just unreasonable immediately before the intelligence explosion, our future is in trouble.

"In the very near future, we are going to lift something to Heaven."^[9] And even if it's aligned with what we want, it might not be aligned with what's good. So what can we do? Some desiderata:

What is popularly perceived as Responsible and Respectable Opinion supports great uses of superintelligence. (Perhaps long reflection is seen as the only legitimate short-term macro-level use of superintelligence.)
The superintelligence's decisionmaking/governance structure is good.
- The controller's decisionmaking process tends to output something great.
  - The controller having a specific mandate for long reflection (or another kind of delegation of power) might help.
  - Intelligence augmentation for relevant humans might help.
- There are strong safeguards against outsiders or factions seizing the superintelligence.
Longtermists/EAs/rationalists are influential. Longtermist individuals and organizations are generally respected authorities on how to use superintelligence and are not associated with ideas that are unpopular or illegible.

V. Multipolar Failure Modes

Despite the parallel titles, this section is not the complement of section IV. That section was about the risk that an agent can do whatever it wants and fails to choose well. This section is about additional ways we could fail if nobody has such power after the intelligence explosion.

I am generally more pessimistic about multipolar scenarios (although it really depends on the specifics): roughly, multipolar scenarios include the risk of unipolar failure for each powerful agent. If an omnipotent organization can fail, two semi-omnipotent organizations can each fail in the same way. But multipolar scenarios have their own special failure modes as well:

Conflict. Empowered but unwise humans could act for the sake of relative power, retribution, or enforcing certain norms on others.
Competition. I would expect technologically mature civilizations to be good at coordinating even with those they disagree with, but this may not hold, especially in the short run.
- Cooperation is impossible. Perhaps factions will be unable to prove things about themselves or reliably simulate others and thus be unable to trust one another. Then competition over resources could consume all surplus.
- Agents fail to cooperate even though it is possible.
  - Relevant decisions happen too fast.
  - Humans irrationally decide not to coordinate.
  - AIs are constrained for nonstrategic reasons (due to poor decisions by the humans and institutions running the world) and unable to cooperate.
- Many agents each have power and creating value requires coordination (e.g., not racing to seize a common good, or not optimizing for influence or reproductive fitness) but some don't cooperate.
Catastrophic negotiation failure.
- Agents use brinkmanship, attrition, or precommitment to coerce one another. Due to a mistake in communication or in predicting others' responses, catastrophe ensues.
- At least one agent negotiates in a manner that is flawed even from its subjective vantage. Either directly or because others' behavior assumed that no agents would make that mistake, catastrophe ensues.
Accidents. These could occur in multipolar scenarios due to agents not being totally secure in their power (analogous to our near-misses in erroneous nuclear "retaliation"). Perhaps powerful AI systems will eliminate accidents, or perhaps they will foster uncertainty and instability, promote deception, and lead to stronger hair-trigger dynamics.

VI. Conclusion

The governance problem is the practical problem of getting the controller of superintelligence to use it near-optimally.

Here are some propositions I believe:

If we create aligned superintelligence, how we use it will involve political institutions and processes. Superintelligence will probably be controlled by a state or a group of states. This is more likely the more AI becomes popularly appreciated and the more legibly powerful AI is created before the intelligence explosion.

Aligned superintelligence enables directing the arbitrarily distant future. Consider what an intelligent but not omniscient observer would predict about the future of Earth and Earth-originating systems. Throughout human history, events almost always have had negligible effects on the observer's credences. In the last century, some events had non-negligible effects on the credences through their effects on extinction risk. But because of the possibility of superintelligence, it may soon become possible to lock in narrow classes of possible futures. This could happen intentionally (a singleton could optimize for any preferences) or unintentionally (if we create unaligned powerful AI or fail to coordinate to use aligned powerful AI well).

Accidental governance failure is possible. We could create aligned superintelligence but still end up with an outcome that nobody wants.

"Pretty good" governance failure is possible. We could end up with an outcome that many or most influential people want, but that wiser versions of ourselves would strongly disapprove of. This scenario is plausibly the default outcome of aligned superintelligence: great uses of power are a tiny subset of the possible uses of power, the people/institutions that currently want great outcomes constitute a tiny share of total influence, and neither will those who want non-great outcomes be persuaded nor will those who want great outcomes acquire influence much without us working to increase it.

The governance problem depends (in largely predictable ways) on various factors that we can affect before transformative AI (TAI). These include:

Ideas about what to do with powerful AI (among AI elites, elites in general, and people in general). For example, it would be good if people appreciated the value/responsibility of leaving our options open until we can wisely choose what to do with superintelligence. A popular mandate for more specific plans (long reflection or indirect normativity or something else) may be valuable.
The formal and informal constraints on AI research and development.
The level of international cooperation and competition for powerful AI.
Whether an AI constitution exists, and if so, what it looks like.

To improve our chances of achieving successful governance, we should think about what affects how superintelligence is used and how we can affect those factors, then do it.

Thanks to Daniel Kokotajlo for suggestions.

I am not aware of an existing name for the important problem of getting a superintelligence that does what its operator wants to do what is best. This problem roughly requires wisdom and caution to avoid locking in object-level values prematurely, and coordination among people with influence over using superintelligence.

Nick Bostrom defined the "political problem," complementing the control problem, as "how to achieve a situation in which individuals or institutions empowered by such AI use it in ways that promote the common good." To the extent that value is binary, it matters less whether AI promotes the common good on net and more whether AI does astronomical good. To the extent that superintelligence (not previous AI) is all that matters after superintelligence exists, it only matters how we use the superintelligence. I assume Bostrom used this less carving-at-the-joints-y definition for simplicity and to decrease inferential distance for people outside the community; I'm pretty sure that my "governance problem" is closer to how we should be thinking about the problem of using AI well.

Will MacAskill once called some related issues the "second-level alignment problem," but it's not clear what exactly he meant.

Note that, roughly, P(win) = P(aligned powerful AI) * P(great use) = P(survive until powerful AI) * P(powerful AI is aligned) * P(great use). This suggests a decomposition of the problem of achieving an existential win into three subproblems: the survival problem, the alignment problem, and the governance problem. ↩︎
That is, some uses of superintelligence would have near-optimal expected value, where optimal expected value is roughly what we would achieve if we were thoughtful, wise, coordinated, and successful, by our standards. ↩︎
Acausal trade is a conceivable source of value that does not necessarily require our colonizing the universe well. But it is prima facie even more politically challenging. Regardless, the prospect of it and other speculative, potentially radically effective strategies gives us additional reason to increase our collective ability to do unintuitive things with superintelligence. ↩︎
These look similar in practice — rather than just telling the superintelligence to optimize for X, we'd probably have it tell us what optimizing for X would look like first, so we're effectively hearing the object-level way to optimize for X and then telling it to pursue that path. ↩︎
Similarly to note 4, level number isn't really meaningful; it just matters that there's a chain of delegation that ends in something great. ↩︎
More uncertain scenarios would occur if (1) the controller of superintelligence does not make decisions in a predictable way (e.g., it's a group of states with different goals, or it's an international organization without a clear mandate for using superintelligence) or (2) there is a multipolar outcome of some sort — e.g., if there is slow takeoff (in particular, no threshold-y behavior) or superintelligence is not able to form a singleton. ↩︎
This is because using almost all of the resources eventually available to us involves colonizing the universe, and it's prima facie unlikely that what sounds normal to current humans is optimal. ↩︎
I am sympathetic to long reflection but will not defend it here. I merely use it as a prima facie example of a system that could have the two necessary properties for successful governance: acceptability and great decisionmaking. ↩︎
Scott Alexander's Meditations on Moloch. While superintelligence could kill Moloch dead, Moloch might choose how we use it. That would be ironic. ↩︎
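
The decomposition in note 1 can be written with explicit conditioning. This is only a sketch: it assumes a win requires all three events, and the labels S, A, and G are introduced here for illustration rather than taken from the post.

```latex
% S = we survive until powerful AI
% A = the powerful AI is aligned
% G = the aligned AI is put to great use
% (labels are illustrative assumptions, not the author's notation)
\begin{aligned}
P(\text{win}) &= P(S \cap A \cap G) \\
              &= P(S)\, P(A \mid S)\, P(G \mid S \cap A)
\end{aligned}
```

Read this way, each factor in note 1 is conditional on the ones before it: the survival problem targets P(S), the alignment problem P(A | S), and the governance problem P(G | S ∩ A). Because the factors multiply, a low value for any one of them caps the probability of a win no matter how well the other two go.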

Mitchell_Porter (5y)

Are you familiar with CEV?

Zach Stein-Perlman (5y)

Yes, I definitely consider (successful, philosophically sound) CEV to be a great use of superintelligence. An earlier draft mentioned CEV explicitly, but I decided to just mention the broader category "indirect normativity," which should include any sound method for specifying values indirectly.