This essay is based on the ideas of Roman Yampolskiy’s “AI: Unexplainable, Unpredictable, Uncontrollable.” I am not following his reasoning closely, but rather using it as a jumping-off point, and the last section is mostly my own reasoning.
TL;DR: Things break. Big things break more bigly. ASI double plus big.
Uncontrollability is a Neglected Research Area
Mainstream AI research focuses on capabilities, assuming problems will sort themselves out as they appear. Within the minority of AI research devoted to safety, most seem to assume that a solution exists, even if that solution is really hard to find and we are not on track to finding it. This neglect is understandable given that AI safety advocates have their hands full trying to convince AI labs and their supporters that there is a problem and, failing that, trying to educate the public about what is even going on with AI in order to bring political pressure to bear.
Uncontrollability is Big if True
The AI safety movement is currently split between technical research to make AI safer and policy advocacy. I mean this in the simple, direct sense of attention and resources being devoted to each, not implying a political division. One could further draw a conceptual distinction within technical research between defining and demonstrating the true extent of AI risks vs. finding ways to mitigate current or theoretical dangers. In the policy space, there is a relevant conceptual distinction between proposals to slow down development vs. promoting the use of best practices (relative to reckless ones).
To the extent that AI is likely to be uncontrollable, in theory or in practice, policy efforts should focus on slowing down development, and technical research should focus on clarifying risks in order to support those policy efforts. Since uncontrollability has implications for the optimal focus of both technically and politically minded people, there is some degree of competition for attention. Perhaps more importantly, technical and political efforts toward developing safe(r) forms of AI have capabilities externalities, which is a much less appealing trade-off if acceptable levels of safety are impossible.
That said, the relationship between these efforts is not zero-sum. Engaging with theories of uncontrollability clarifies the underlying problems in AI safety, which could ultimately point toward solutions if they exist. Further, slowing down capabilities buys time to get both the technical and regulatory infrastructure for safe development in place, if safe development is possible. In the other direction, policies that require provable safety as a prerequisite for further development will effectively become a permanent ban if provable safety turns out to be impossible (provided the standards are well chosen and remain uncorrupted), regardless of how optimistic the policymakers were that someone would eventually figure out a solution.
An Outside View Points to Uncontrollability
Given the lack of effort applied to AI safety relative to capabilities, it is not surprising that the problems of AI safety remain unsolved, but it is suspicious that there is not even a clear path forward. After years of nontrivial (if not sufficient) effort, there is no consensus on the nature or difficulty of the problem. Furthermore, safety research often surfaces new problems in a seemingly unending fractal of danger. These patterns do not by themselves prove that the problem is impossible (rather than just hard), but they suggest that the possibility of impossibility should not be dismissed out of hand.
What Does it Mean to Say Something is Impossible?
Prior to the invention of the airplane, flying machines seemed impossible, but they were not. Perpetual motion machines, however, are genuinely impossible. What’s the difference?
When people say something is “impossible,” they often mean “I don’t see a way to do this,” which is not a strong indicator of impossibility. “There is an inescapable contradiction here,” on the other hand, is far more meaningful. It is difficult, however, to tell an apparent contradiction from a genuine one. One typically resolves an apparent contradiction by going either up or down one or more levels of abstraction. By “going up a level of abstraction,” I mean thinking more broadly about a problem to expand the possibility space of solutions. By “going down a level of abstraction,” I mean breaking a problem into subparts to see if something changes at greater resolution. A contradiction is genuine if the principles involved are comprehensive enough that there is no room for such trickery.
Alignment is Control
Typical systems are controlled with some form of constraints. Intelligent systems route around obstacles. An AI system that is more intelligent than its human operators should therefore be assumed to be more capable of routing around whatever guardrails people use to contain it.
Ah, but if the system wants to fulfill the values of its operators, then it doesn’t need guardrails! We just need to specify these values correctly and the system will “control” itself!
But human values are a complex system, which collapses when reduced to a set of clearly defined rules.
Maybe we just haven’t found the right abstractions? Maybe there is a sweet spot of AI capability where it can help us out without yet being dangerous? Or maybe we can apply the bitter lesson to alignment?
Perhaps. But how much are we willing to bet? And who gets to decide?
Controlling Complex Systems is a Cursed Problem
Some form of restraint is necessary for an AI to remain safe, whether that restraint comes from the outside or from an internal value system. We can thus, at least conceptually, divide an AI into its control system and its goal-optimizing system. The control system must be designed to operate effectively in all conditions, forever. And if the scale is such that control from the outside is insufficient, then systems whose safety is fully dependent on internal control must work correctly on the first try, without bugs. But software always has bugs.
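To make that division concrete, here is a purely illustrative sketch in Python. The names, actions, and scoring rule are all hypothetical, not anyone’s actual architecture; the point is only that the control layer is itself ordinary software, with a finite list of anticipated conditions and its own potential bugs.

```python
# Conceptual sketch only: a "control system" wrapping a "goal-optimizing system".
# Every name and value here is made up for illustration.
from typing import Callable, Iterable

def goal_optimizer(actions: Iterable[str], score: Callable[[str], float]) -> str:
    """Pick whichever action scores highest on the goal, with no notion of safety."""
    return max(actions, key=score)

def control_system(action: str, is_permitted: Callable[[str], bool]) -> str:
    """Veto actions the designers anticipated as unsafe; pass everything else through.
    Any condition the predicate fails to cover (a bug, an omission, a novel situation)
    passes through unchanged."""
    return action if is_permitted(action) else "no-op"

# Toy run: the guardrail only blocks the failure modes its authors enumerated.
candidate_actions = ["write report", "copy self to new server", "acquire more compute"]
chosen = goal_optimizer(candidate_actions, score=len)  # toy goal: prefer "bigger" actions
vetted = control_system(chosen, is_permitted=lambda a: "copy self" not in a)
print(chosen, "->", vetted)  # the enumerated case is caught; unanticipated ones would not be
```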
In any case, “operating effectively” means that the control system must maintain control of the goal-optimizing system and apply that control in a direction consistent with human values. Which humans? That is out of scope for engineering. And that’s a problem because the biosphere does not care about job descriptions. Whether or not developers have the incentives to care about whose values are embedded in AI systems, the consequences of those values are real and enduring.
Control requires comprehension, but nontrivially intelligent systems are too big for one mind to understand in their entirety. Simplifying the problem to reduce scope is a standard practice in engineering because it lets you deal with problems in manageable chunks. You solve one, then the next, then the next, then tie the solutions together on a higher level of abstraction, which is just another manageable chunk. But this assumes problems that allow themselves to be neatly broken up, an assumption that does not hold for systems that are complex and chaotic. A system is complex when interactions among its parts create behaviors or properties that can’t be understood by examining the parts in isolation. It is chaotic if small changes in initial conditions radically affect its operation. Complex systems depend as much on the connections between parts as on the parts themselves (if not more), so treating the parts in isolation is insufficient; the best one can do is create a less complex system that approximately models the full version. But in chaotic systems, small losses of fidelity can mean huge losses in accuracy. Leaving out “unimportant” nodes misses edge cases, which feeds bad input to “important” nodes, which propagates error into common cases.
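To make the chaotic half of that claim concrete, here is a minimal, self-contained sketch using the textbook logistic map (not a model of any AI system): two trajectories that start one billionth apart soon bear no resemblance to each other.

```python
# Sensitivity to initial conditions in the logistic map x -> r*x*(1 - x), with r = 4.0
# (a standard chaotic regime). A difference of one billionth in the starting value
# grows until the two trajectories are completely unrelated.
def trajectory(x0: float, r: float = 4.0, steps: int = 60) -> list[float]:
    xs = [x0]
    for _ in range(steps):
        xs.append(r * xs[-1] * (1 - xs[-1]))
    return xs

a = trajectory(0.400000000)
b = trajectory(0.400000001)  # the "small loss of fidelity"

for step in (0, 15, 30, 45, 60):
    print(f"step {step:2d}: {a[step]:.6f} vs {b[step]:.6f} (diff {abs(a[step] - b[step]):.6f})")
```

The approximation error in a simplified model of a complex system plays the same role as the perturbed starting value here.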
Safety Has Boundaries
Even if a safe form of superintelligence exists in a discoverable region of possibility space, we need to (1) actually find it, (2) keep the ASI contained within that region, and (3) keep it there indefinitely, with all of this achieved through the system’s starting conditions. Failing any of these requirements represents an irrecoverable disaster.
In the above graph, the Y axis represents the capability of an AI, and the X axis represents the possibility space of AI systems at a given level of capability. Any given AI at any given time can be represented as a point on this graph. At all capability levels, there is a narrow band of ways in which an AI can behave that are beneficial to humanity and a much larger range in which it is not. At lower capability levels, most systems that fail to be helpful are harmless: they break, are unmarketable, or are destructive on a very small scale. As capabilities increase, however, designs that would have been harmless at a smaller scale become meaningfully destructive. Note that AI designs each occupy a region of the graph rather than a single point, so a given AI can span any combination of the beneficial, harmless, and destructive regions. Note also that drawing clear dividing lines between the regions, such as between harmless and destructive, assumes a specific threshold of concern, such as the difference between an AI that has the capacity to cause human extinction and one that doesn’t. A more accurate graph would replace such lines with gradients, but that is harder to draw and to conceptualize.
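For anyone who wants to reproduce a rough version of that diagram, here is a minimal matplotlib sketch. The region shapes are arbitrary functions chosen only to match the qualitative description above (a narrow beneficial band at every capability level, a harmless region that shrinks as capability grows), and it keeps the hard boundaries rather than the gradients a more accurate graph would use.

```python
# Illustrative only: region widths are made-up functions, not derived from any data.
import numpy as np
import matplotlib.pyplot as plt

capability = np.linspace(0.0, 1.0, 200)                  # Y axis: capability level
beneficial_half_width = np.full_like(capability, 0.05)   # narrow beneficial band at every level
harmless_half_width = 0.1 + 0.8 * (1.0 - capability)     # harmless region shrinks as capability grows

fig, ax = plt.subplots(figsize=(6, 4))
ax.fill_betweenx(capability, -1, 1, color="tab:red", alpha=0.3, label="destructive")
ax.fill_betweenx(capability, -harmless_half_width, harmless_half_width,
                 color="tab:gray", alpha=0.5, label="harmless")
ax.fill_betweenx(capability, -beneficial_half_width, beneficial_half_width,
                 color="tab:green", alpha=0.8, label="beneficial")

ax.set_xlabel("possibility space of AI systems")
ax.set_ylabel("capability")
ax.set_xticks([])          # the X axis is conceptual, not numeric
ax.set_xlim(-1, 1)
ax.legend(loc="upper right")
plt.tight_layout()
plt.show()
```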
Viewed through this lens, AI safety research is typically concerned with failure mode (1), where poor design results in an AI being developed in the destructive region of possibility space. The focus is thus on trying to find better designs that sit in the beneficial region. If one broadens one’s scope to consider how AI functions in society and the intentions of the diverse actors that might build it, then one is compelled to additionally consider failure mode (2), in which the world must contend with AIs developed by malicious actors like hackers and terrorists, tyrannical governments, and other sociopathic power-maximizing humans. If one broadens one’s scope further to include changes in an AI’s behavior over time, whether through unpredictable recursive drift or through competitive pressure, then (3) becomes an issue.
For AI to avoid becoming catastrophically destructive at ever-increasing levels of capability, the following must occur:
Of these paths, some seem immediately implausible at scale, including:
Three principles of wisdom are discernment, restraint, and humility. In the context of AI, the wise course of action is to discern which paths of development are destructive, to restrain ourselves from following those paths (building the societal capacity to do so where necessary), and to apply the precautionary principle out of a humble recognition of the limits of our ability to predict the impact of high-stakes technology.
In other words, rather than asking "How can AI be made safely?" we should instead be asking "Which AIs can be made safely?"