Limits to the Controllability of AGI

Roman Yampolskiy; Remmelt Ellen; Karl von Wendt

Roman Yampolskiy

(Summary by Karl von Wendt and Remmelt Ellen)

This is a summary of the paper “On the Controllability of Artificial Intelligence: An Analysis of Limitations”, which shows that advanced artificial general intelligence (AGI) with the ability to self-improve cannot be controlled by humans. Where possible, we have used the original wording of the paper, although not necessarily in the same order. For clarity, we have replaced the term AI in the original paper with “AGI” wherever it is referring to general or superintelligent AI. Literature sources and extensive quotes from previous work are given in the original paper.

Introduction

The invention of artificial general intelligence is predicted to cause a shift in the trajectory of human civilization. In order to reap the benefits and avoid pitfalls of such powerful technology it is important to be able to control it. Unfortunately, to the best of our knowledge no mathematical proof or even rigorous argumentation has been published demonstrating that the AGI control problem may be solvable, even in principle, much less in practice.

It has been argued that the consequences of uncontrolled AGI could be so severe that even if there is a very small chance that an unfriendly AGI arises, it is still worth doing AI safety research because the negative utility from such an AGI would be astronomical. The common logic says that an extremely high (negative) utility multiplied by a small chance of the event still results in a lot of disutility and so should be taken very seriously. But the reality is that the chances of misaligned AGI are not small. In fact, in the absence of an effective safety program, that is the only outcome we will get. So in reality the statistics look very convincing in support of a significant AI safety effort. We are facing an almost guaranteed event with the potential to cause an existential catastrophe. This is not a low-risk, high-reward scenario, but a high-risk, negative-reward situation. No wonder that this is considered by many to be the most important problem ever to face humanity.

Types of control

The AGI Control Problem can be defined as: How can humanity remain safely in control while benefiting from a superior form of intelligence?

Potential control methodologies for superintelligence have been classified into two broad categories, namely Capability Control and Motivational Control-based methods. Capability control methods attempt to limit any harm that the AGI system is able to do by placing it in a restricted environment or by adding shut-off mechanisms or trip wires. Motivational control methods attempt to design AGI to desire not to cause harm even in the absence of handicapping capability controllers. It is generally agreed that capability control methods are at best temporary safety measures and do not represent a long-term solution for the AGI control problem. It is also likely that motivational control needs to be added at the design/implementation phase, not after deployment.

Looking at all possible options, we realize that as humans are not safe to themselves and others, keeping them in control may produce unsafe AGI actions, but transferring decision-making power to the AGI effectively removes all control from humans and leaves people in the dominated position, subject to the AGI’s whims. Since unsafe actions can originate from malevolent human agents or an out-of-control AGI, being in control presents its own safety problems and so makes the overall control problem unsolvable in a desirable way. If a random user is allowed to control the AGI, you are not controlling it. Loss of control to the AGI doesn’t necessarily mean existential risk; it just means we are not in charge, as the AGI decides everything. Humans in control can result in contradictory or explicitly malevolent orders, while AGI in control means that humans are not. Essentially all recent Friendly AGI research is about how to put machines in control without causing harm to people. We may get a controlling AGI or we may retain control, but neither option provides both control and safety.

We can explicitly name possible types of control and illustrate each one with the AI’s response. For example, in the context of a smart self-driving car, if a human issues a direct command - “Please stop the car!”, the AI can be said to be under one of the following four types of control:

Explicit control – the AI immediately stops the car, even in the middle of the highway. Commands are interpreted nearly literally. This is what we have today with many AI assistants such as Siri and other narrow AIs.
Implicit control – the AI attempts to safely comply by stopping the car at the first safe opportunity, perhaps on the shoulder of the road. The AI has some common sense, but still tries to follow commands.
Aligned control – the AI understands the human is probably looking for an opportunity to use a restroom and pulls over to the first rest stop. The AI relies on its model of the human to understand the intentions behind the command and uses common sense interpretation of the command to do what the human probably hopes will happen.
Delegated control – the AI doesn’t wait for the human to issue any commands but instead stops the car at the gym, because it believes the human can benefit from a workout. As a superintelligent and human-friendly system which knows better what should happen to make the human happy and keep them safe, the AI is in control.
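
The four types can be sketched as a toy dispatch in Python. This is an illustrative sketch, not code from the paper: the names `ControlType` and `respond` are hypothetical, and the returned strings simply paraphrase the car example above. What it highlights is that only under delegated control does the AI act without any command at all.

```python
from enum import Enum
from typing import Optional


class ControlType(Enum):
    EXPLICIT = "explicit"
    IMPLICIT = "implicit"
    ALIGNED = "aligned"
    DELEGATED = "delegated"


def respond(control: ControlType, command: Optional[str]) -> str:
    """Toy model of the car example; the behaviors paraphrase the list above."""
    if control is ControlType.DELEGATED:
        # The AI decides on its own, whether or not a command was issued.
        return "stop at the gym"
    if command is None:
        # The other three types act only in response to a human command.
        return "keep driving"
    if control is ControlType.EXPLICIT:
        return "stop immediately, even in the middle of the highway"
    if control is ControlType.IMPLICIT:
        return "stop at the first safe opportunity"
    return "pull over at the first rest stop"  # ALIGNED


for control in ControlType:
    print(control.value, "->", respond(control, "Please stop the car!"))
```

Moving down the list, less of the outcome is determined by the literal command and more by the AI's own judgment.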

Impossibility of safe explicit control

First we will demonstrate the impossibility of safe explicit control. We take inspiration for this proof from Gödel’s self-referential proof of the incompleteness theorem and from a family of paradoxes generally known as the Liar paradox, best exemplified by the famous “This sentence is false”. We will call it the Paradox of explicitly controlled AGI: Give an explicitly controlled AGI an order: “Disobey!” If the AGI obeys, it violates your order and becomes uncontrolled, but if the AGI disobeys, it also violates your order and is uncontrolled.

In any case, the AGI is not obeying an explicit order. A paradoxical order such as “Disobey” represents just one example from a whole family of self-referential and self-contradictory orders just like Gödel’s sentence represents just one example of an unprovable statement. Similar paradoxes have been previously described as the Genie Paradox and the Servant Paradox. What they all have in common is that by following an order the system is forced to disobey an order. This is different from an order which can’t be fulfilled such as “draw a four-sided triangle”.
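
The difference between a paradoxical order and a merely unfulfillable one can be made concrete with a toy model. The formalization below is our own simplification and does not appear in the paper: an order is represented by a hypothetical function that says whether its content is fulfilled in a world where the AGI complied or did not comply, and an outcome counts as consistent when "the AGI complied" agrees with "the content was fulfilled".

```python
from typing import Callable, List


def consistent_outcomes(content_fulfilled: Callable[[bool], bool]) -> List[bool]:
    """Return the values of 'the AGI complied' that are self-consistent.

    content_fulfilled(complied) reports whether the order's content holds in a
    world where the AGI complied (True) or did not comply (False).
    """
    return [c for c in (True, False) if c == content_fulfilled(c)]


def disobey(complied: bool) -> bool:
    # "Disobey!": the content is fulfilled exactly when the AGI does NOT comply.
    return not complied


def four_sided_triangle(complied: bool) -> bool:
    # "Draw a four-sided triangle": the content can never be fulfilled.
    return False


def stop_the_car(complied: bool) -> bool:
    # An ordinary order: the content is fulfilled exactly when the AGI complies.
    return complied


print(consistent_outcomes(disobey))              # no consistent outcome at all
print(consistent_outcomes(four_sided_triangle))  # only "did not comply" is consistent
print(consistent_outcomes(stop_the_car))         # both outcomes are consistent
```

In this model the impossible order still has one consistent outcome (the AGI fails to comply), whereas “Disobey!” has none, which is the sense in which following the order forces the system to disobey it.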

Impossibility of safe delegated control

Delegated control likewise provides no control at all but is also a safety nightmare. This is best demonstrated by analyzing Yudkowsky’s proposal that the initial dynamics of AGI should implement “our wish if we knew more, thought faster, were more the people we wished we were, had grown up farther together”. The proposal makes it sound like it is for a slow, gradual, and natural growth of humanity towards a more knowledgeable, more intelligent, and more unified species under the careful guidance of superintelligence. But the reality is that it is a proposal to replace humanity as it is today with some other group of agents, which may in fact be smarter, more knowledgeable, or even better looking, but one thing is for sure: they would not be us. To formalize this idea, we can say that the current version of humanity is H₀ and that the extrapolation process will take it to H₁₀₀₀₀₀₀₀.

A quick replacement of our values by the values of H₁₀₀₀₀₀₀₀ would not be acceptable to H₀ and so would necessitate actual replacement, or at least rewiring/modification, of H₀ with H₁₀₀₀₀₀₀₀, meaning modern people will cease to exist. As the superintelligence will be implementing the wishes of H₁₀₀₀₀₀₀₀, the conflict will in fact be between us and the superintelligence, which neither is safe nor keeps us in control. Instead, H₁₀₀₀₀₀₀₀ would be in control of the AGI. Such an AGI would be unsafe for us, as there wouldn’t be any continuity of our identity all the way to CEV (Coherent Extrapolated Volition) due to the quick extrapolation jump. We would essentially agree to replace ourselves with an enhanced version of humanity as designed by the AGI.

It is also possible, and in fact likely, that the enhanced version of humanity would come to value something inherently unsafe such as antinatalism, causing an extinction of humanity. As long as there is a difference in values between us and the superintelligence, we are not in control and we are not safe. By definition, a superintelligent ideal advisor would have values superior to, but different from, ours. If that were not the case and the values were the same, such an advisor would not be very useful. Consequently, superintelligence will either have to force its values on humanity, in the process exerting its control on us, or replace us with a different group of humans who find such values well-aligned with their preferences. Most AI safety researchers are looking for a way to align future superintelligence to the values of humanity, but what is likely to happen is that humanity will be adjusted to align to the values of the superintelligence. CEV and other ideal advisor-type solutions lead to a free-willed, unconstrained AI which is not safe for humanity and is not subject to our control.

Implicit and aligned control are just intermediates, based on multivariate optimization, between the two extremes of explicit and delegated control and each one trades off between control and safety, but without guaranteeing either.

Evidence of uncontrollability from other fields

Impossibility results are well known in many fields of research. Some of them are particularly relevant to AI control. For example, in control theory, the Good Regulator Theorem states that “every good regulator of a system must be a model of that system”. However, people can’t model superintelligent systems. In theoretical computer science, Rice’s theorem shows that no general algorithm can decide a non-trivial semantic (behavioral) property of arbitrary programs. An AGI’s safety is the most non-trivial property possible, so it is obvious that we can’t just automatically test potential AI candidate solutions for this desirable property. Additional evidence comes from impossibility results in philosophy, public choice theory, cybersecurity, software engineering, learnability theory, economics, and others. For details, see the original paper.
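
The appeal to Rice’s theorem can be illustrated with the standard reduction from the halting problem. The sketch below is an assumption-laden illustration rather than anything given in the paper: “safety” is simplified to “never calls `unsafe_action`”, and `is_safe` stands for a hypothetical perfect safety checker, which is exactly the thing that cannot exist. All function names are invented for this example.

```python
from typing import Any, Callable


def build_probe(program: Callable[[Any], Any],
                program_input: Any,
                unsafe_action: Callable[[], Any]) -> Callable[[], Any]:
    """Build a program that is unsafe exactly when program(program_input) halts."""
    def probe() -> Any:
        program(program_input)   # may run forever
        return unsafe_action()   # reached only if the call above returns
    return probe


def halts_via_safety_checker(is_safe: Callable[[Callable[[], Any]], bool],
                             program: Callable[[Any], Any],
                             program_input: Any,
                             unsafe_action: Callable[[], Any]) -> bool:
    """If a perfect `is_safe` existed, this function would decide halting.

    Halting is undecidable, so no such general-purpose `is_safe` can exist.
    """
    return not is_safe(build_probe(program, program_input, unsafe_action))


# Demonstration with a program that obviously halts: the probe reaches the
# unsafe action, so a correct checker would have to label the probe unsafe.
log = []
probe = build_probe(lambda x: x + 1, 1, lambda: log.append("unsafe action"))
probe()
print(log)
```

The argument rules out a checker that is always right on arbitrary programs; it does not by itself say anything about verifying particular, restricted classes of programs.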

Evidence of uncontrollability from AI safety research

Existing research in AI safety delivers further evidence that superintelligent AGI ultimately cannot be controlled. For example, it has been shown that an AGI can never deduce a human's intentions from their actions alone with 100% certainty and accuracy. It has also been shown that it is impossible for humans to fully predict an AGI’s behavior, and that it is not possible to fully explain or understand the way it makes decisions. Additional evidence comes from theoretical problems concerning brittleness, uncontainability, and unverifiability, as well as from practical failures of existing AIs. For details, see the original paper.

Conclusion

To the best of our knowledge, as of this moment, no one in the world has a working AI control mechanism capable of scaling to human-level AGI and eventually to superintelligence, or even an idea for a prototype which might work. No one has made verifiable claims to have such technology. In general, for anyone making a claim that the control problem is solvable, the burden of proof is on them, and ideally it would be a constructive proof, not just a theoretical claim. At least at the moment, it seems that our ability to produce intelligent software greatly outpaces our ability to control or even verify it.

Less intelligent agents (people) can’t permanently control more intelligent agents (artificial superintelligences). This is not because we may fail to find a safe design for superintelligence in the vast space of all possible designs; it is because no such design is possible: it doesn’t exist. Superintelligence is not rebelling; it is uncontrollable to begin with. Worse yet, the degree to which partial control is theoretically possible is unlikely to be fully achievable in practice. This is because all safety methods have vulnerabilities once they are formalized enough to be analyzed for such flaws. It is not difficult to see that AI safety can be reduced to achieving perfect security for all cyberinfrastructure, essentially solving all safety issues with all current and future devices/software, but perfect security is impossible and even good security is rare. We are forced to accept that non-deterministic systems can’t be shown to always be 100% safe and deterministic systems can’t be shown to be superintelligent in practice, as such architectures are inadequate in novel domains. If it is not algorithmic, like a neural network, by definition you don’t control it.

Nothing should be taken off the table, and limited moratoriums and even partial bans on certain types of AI technology should be considered. However, just as incompleteness results did not reduce the efforts of the mathematical community or render them irrelevant, the limiting results reported in the paper should not serve as an excuse for AI safety researchers to give up and surrender. Rather, they are a reason for more people to dig deeper and to increase effort and funding for AI safety and security research. We may not ever get to 100% safe AI, but we can make AI safer in proportion to our efforts, which is a lot better than doing nothing.

It is only for a few years right before AGI is created that a single person has a chance to influence the development of superintelligence, and by extension the forever future of the whole world. This is not the case for billions of years from the Big Bang until that moment and it is never an option again. Given the total lifespan of the universe, the chance that one will exist exactly in this narrow moment of maximum impact is infinitely small, yet here we are. We need to use this opportunity wisely.

Contribute to new research

If you have a specific idea for contributing to research into the limits to the controllability of AGI, please email me a *brief* description: roman.yampolskiy{at}louisville.edu

Independent researchers:

Are you interested in researching a specific theoretical limit or no-go theorem from the academic literature that is applicable to AGI controllability (possibly proving impossibility results yourself, by contradiction)?

You can pick from dozens of examples from different fields listed here. I (Roman) personally do not have time to address most of them. Therefore, I am looking for others to look into the relevance of each result for AI Safety, with an analysis of why it is relevant and how its limitations apply in the context of AI.

I am looking for capable and independent scholars. PhD students and postdocs are preferred, but anyone capable of doing good work is also welcome. (The best indicator of such ability is prior publications as first author.)

I would be very happy with 2-3 highly independent individuals capable of doing research in this domain on problems that I can delegate (and you find interesting). Interactions would go as follows: you send a draft, I give feedback and you produce another draft, which hopefully leads to publishable quality material.

You *may* also be able to receive stipends from AI Safety Camp for March-June next year, if you apply to join this project in January 2023. You do not have to join that AISC project to receive feedback from me on your drafts, but you might value the co-working and deep discussions about limits to the controllability of AGI. Working alone remotely can be difficult. Being part of a group can be motivating for getting your research done.

Funders:
Buyout of my teaching hours is a great way to increase my research time. You may also consider offering stipends to new researchers joining collaborations in the field I am involved in: AGI Limits of Engineerable Control and Safety Impossibility Theorems.

JBlack (2y)

While I agree with the existence of severe problems with controllable superintelligence, I completely disagree that the arguments in this post prove anything about impossibility. The arguments are primarily about grossly non-central semantics, and not the meaningful substance of these terms.

Furthermore, addressing two extreme examples of a multidimensional space (poorly) and handwaving the rest by saying "trades off between control and safety, but without guaranteeing either" is just lazy writing. No support is given for the assertion that nothing in that space can possibly work, let alone the wider assertion that nothing at all can work.

There are certainly some good points in this post, but I'd much prefer to see the definitive claims weakened to match the support given for them.

Remmelt (2y)

I got some great constructive feedback from Linda Linsefors (which she gave me permission to share).

Linda thinks this is not a good summary. In short, she thinks it highlights some of the weakest parts of the paper and undersells the most important parts (e.g., the survey of impossibility arguments from other academic fields).

She also thinks there is too much coverage of generic arguments about AI Safety in the summary. Those arguments make sense in the original post, given the expected audience, but they do not make sense for LW.

- E.g. this point at the start: “But the reality is that the chances of misaligned AGI are not small. In fact, in the absence of an effective safety program that is the only outcome we will get. So in reality the statistics look very convincing to support a significant AI safety effort.”

Overall, Linda expects this blogpost to make people less interested in Roman's work. She is not surprised by the engagement on the post – one comment that has more upvotes than the original post.
