Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

Preface

The following text is my submission for the AI Safety Public Materials contest. In it, I try to lay out the importance of AI Safety Research to people who, according to the winning conditions of the contest, have not yet engaged with AI Safety, Lesswrong, or effective altruism. As such, I’m eager to receive feedback on how accessible the text is to people outside of these communities. Additionally, I’d like to know where you think my framings might be misguided.

In my text, I focus on the distribution shift problem since it is an inner alignment problem, which increasingly feels like the most central challenge of AI Safety to me. However, the text does not focus on explicit inner optimizers or agency, even though this would strengthen the argument. The reason is that I think the distribution shift problem can already motivate existential risk well enough; the value of keeping the discussion focused therefore seems larger than the value of painting a complete picture. Nevertheless, in the appendix, I include a short overview of other safety concerns.

I thank Gabriela Jiang, Tom Lieberum, Shos Tekofsky, and Magdalena Wache for their useful comments and feedback on this text.

Introduction

This text covers the importance of technical AI Safety research. I explain why current machine learning paradigms might lead to the disempowerment of humanity in this century. By “disempowerment of humanity”, I mean any situation in which humans lose the ability to steer the future development of society, leading in the worst case to a loss of everything that we find valuable.

The argument is based on the distribution shift problem of contemporary machine learning, the increased dependence on AI and stronger distribution shifts that High-Level Machine Intelligence (HLMI) entails, and the likely development of HLMI in this century. I hope this text can motivate people to work on AI Safety research to prevent humanity's disempowerment.

Note that the distribution shift problem is only one of several safety problems that have been discussed in recent years. You can find a short overview in the appendix. A good starting point for learning more about the distribution shift problem specifically is the 2016 paper on Concrete Problems in AI Safety.

It will be useful to have basic knowledge about machine learning to appreciate this text. I include a short overview at the beginning of the argument.

The Core Argument

Machine Learning

In the contemporary machine learning paradigm, a machine learning (ML) system receives inputs and produces outputs. It is trained to achieve some specified or implicit goal. The internals of the ML system are thereby changed in such a way that it will, in the future, be better at reaching the goal. The algorithm to achieve this internal change is typically some variation of gradient descent. The ML system itself is nowadays usually based on artificial neural networks. The subfield of machine learning that deals with large neural networks is called deep learning.

The most important sub-paradigms in machine learning are:

  • Supervised learning: The ML system receives inputs, like images, and produces outputs, like labels. There is a ground truth in the training data, typically produced by human labelers. The goal of the ML system is to assign a high probability to the true label. A famous example is AlexNet.
  • Autoregressive learning: The ML system receives time-dependent inputs, like text, and produces a continuation as output. The system is tasked with correctly predicting how the data continues, which makes this part of the “self-supervised” framework. A famous example is GPT-3.
  • Reinforcement learning: The ML system interacts with the world or a simulation and thereby receives, for example, sensory input. It outputs actions. The goal is to maximize some reward signal. A famous example is AlphaGo.

Note that autoregressive learning is a subclass of what’s often called unsupervised learning. I decided to focus the discussion above on autoregressive learning since it has been increasingly important in recent years.
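
To make the differences concrete, here is a toy sketch in Python (using only numpy; all numbers and data are invented for illustration) of the quantity each paradigm asks the ML system to optimize:

```python
import numpy as np

# --- Supervised learning: reward putting high probability on the true label.
predicted_probs = np.array([0.1, 0.7, 0.2])  # the system's output distribution over 3 classes
true_label = 1                               # ground truth from a human labeler
supervised_loss = -np.log(predicted_probs[true_label])  # cross-entropy; lower is better

# --- Autoregressive learning: reward predicting how the data continues.
# next_token_probs[t] is the probability the system assigned to the token that actually came next.
next_token_probs = np.array([0.4, 0.9, 0.05, 0.6])
autoregressive_loss = -np.log(next_token_probs).mean()

# --- Reinforcement learning: reward collecting as much reward as possible while acting.
rewards_per_step = np.array([0.0, 0.0, 1.0, -0.5])  # rewards received during one episode
rl_return = rewards_per_step.sum()                  # higher is better

print(f"supervised loss:     {supervised_loss:.3f}")
print(f"autoregressive loss: {autoregressive_loss:.3f}")
print(f"RL return:           {rl_return:.3f}")
```

In each case, note that the objective is computed from the system’s outputs by code outside the system, which is the point the next paragraph turns to.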

What’s common to all these paradigms is that the goal is not explicitly represented inside the ML system itself.[1] Instead, it simply receives inputs and produces outputs according to its initially random internals. If the outputs do not satisfy the goal, then an algorithm outside of the ML system, typically gradient descent, adjusts the internals of the ML system to better achieve the goal in the future. Thus, the ML system will over time behave as if it “cared for the goal”. The problem is that this may just be a superficial correlation, as we now illustrate with the distribution shift problem.
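
To illustrate this division of labor, here is a minimal sketch (again plain Python with numpy, on invented data) of training a tiny model with hand-written gradient descent. The loss, which stands in for the goal, is computed by the training code outside the model; the model itself is just an adjustable number mapping inputs to outputs.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: inputs x and targets y that follow y = 3*x + noise.
x = rng.normal(size=100)
y = 3.0 * x + 0.1 * rng.normal(size=100)

# The "ML system": just a parameter mapping inputs to outputs. No goal is stored here.
w = rng.normal()  # starts out random

learning_rate = 0.1
for step in range(200):
    predictions = w * x                      # the system produces outputs
    loss = np.mean((predictions - y) ** 2)   # the goal is evaluated *outside* the system
    gradient = np.mean(2 * (predictions - y) * x)
    w -= learning_rate * gradient            # gradient descent adjusts the internals

print(f"learned weight: {w:.3f} (true value: 3.0)")
```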

Distribution Shifts and Alignment Failures

With the terms “distribution” or “data distribution”, one typically refers to the properties, frequencies, and nature of the data that the ML system receives. A “distribution shift” is any situation in which the distribution changes, that is:

  • The ML system encounters data points it has never encountered before;
  • The ML system encounters data points with substantially different frequencies than it did before.

Very often, ML systems are trained on a well-curated data set or environment and then deployed in the real world, which may come with substantial distribution shifts. This may lead to problems, for example:

  • An autoregressive language model may be trained on a data set of questions and answers. Potentially, the question-answer-pairs come from an online community with very specific cultural norms that may allow one to be “rude” when writing an answer. The hope is that the language model can then be used as a bot that answers questions in online communities.
    In deployment, someone may try to use the bot in a different online community that highly values good manners and being nice to each other. The language model may then come across as rude in that community and not be appreciated.
  • A reinforcement learning agent may be trained to act as a factory robot. The robot may be trained in a simulation to allow for faster iteration and algorithmic feedback. Potentially, the simulation does not contain simulated humans. Deployed in the real world, it may then catastrophically fail and lead to accidents when it encounters humans. 
    Other examples of how a change in the deployment distribution can reveal that the reinforcement learning agent did not learn the right thing can be found here.

These examples constitute an alignment failure: the ML system’s actions are not aligned with its given goal, as revealed by the shift in distribution. One often specifically calls this an inner alignment problem, to contrast it with outer alignment, which is roughly about choosing the right goal for the ML system in the first place. More clarifications on these notions can be found in the appendix.
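
To see how a system can appear to “care for the goal” on the training distribution and still fail after a shift, consider the following toy sketch (plain Python with numpy; the data and features are invented and not related to the examples above). A classifier is trained where a spurious “shortcut” feature happens to track the label; when that correlation disappears at deployment, accuracy drops sharply.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_data(n, shortcut_correlated):
    """Labels are 0/1. Feature 0 is weakly but genuinely predictive everywhere.
    Feature 1 is a 'shortcut' that tracks the label only when shortcut_correlated is True."""
    labels = rng.integers(0, 2, size=n)
    genuine = labels + rng.normal(scale=2.0, size=n)         # noisy but real signal
    if shortcut_correlated:
        shortcut = labels + rng.normal(scale=0.1, size=n)    # almost equals the label
    else:
        shortcut = rng.integers(0, 2, size=n) + rng.normal(scale=0.1, size=n)  # unrelated to the label
    features = np.column_stack([genuine, shortcut, np.ones(n)])  # last column is a bias term
    return features, labels

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Train a logistic-regression classifier by gradient descent on the training distribution.
train_x, train_y = make_data(2000, shortcut_correlated=True)
weights = np.zeros(3)
for _ in range(500):
    probs = sigmoid(train_x @ weights)
    gradient = train_x.T @ (probs - train_y) / len(train_y)
    weights -= 0.5 * gradient

def accuracy(x, y):
    return np.mean((sigmoid(x @ weights) > 0.5) == y)

# Deployment: the shortcut no longer correlates with the label (a distribution shift).
test_x, test_y = make_data(2000, shortcut_correlated=False)
print(f"learned weights (genuine, shortcut, bias): {weights.round(2)}")  # typically dominated by the shortcut
print(f"training accuracy:   {accuracy(train_x, train_y):.2f}")
print(f"post-shift accuracy: {accuracy(test_x, test_y):.2f}")
```

Nothing in the training signal told the system which feature to rely on, so gradient descent was free to exploit the shortcut that happened to work during training.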

Usually, such alignment failures are recoverable: 

  • If a question-answer bot produces inappropriate answers in certain subcommunities, one may simply finetune it to predict answers in the subcommunities in question.
  • If a deployed reinforcement learning agent fails in situations it has not encountered before, one may shut it off. Then, one can try to either enrich the simulated environment to encompass the situation in question (e.g., by including simulated humans), or include a second training phase in the real world with sufficient safety measures. 

Therefore, it seems like we can proceed with trial and error: 

  • Train the ML system on a varied data set or simulation;
  • Deploy it with care to diminish harm;
  • Whenever issues appear, try to detect them early and improve the training setup.

However, there is to date no principled solution that can spot all potential issues under distribution shift in advance. This may lead to considerably worse issues once ML systems become very powerful, as we explain next.

High-Level Machine Intelligence Changes the Picture

Currently, ML systems have limited applicability. There have been many attempts to formalize the notion of artificial intelligence that goes beyond what’s currently possible and matches or exceeds the generality and power of human intelligence. One such notion is High-Level Machine Intelligence (HLMI), which Katja Grace defined as follows:

“High-level machine intelligence” (HLMI) is achieved when unaided machines can accomplish every task better and more cheaply than human workers.

There are similar concepts with somewhat different meanings, e.g. Artificial General Intelligence (AGI), Transformative AI (TAI), and Process for Automating Scientific and Technological Advancements (PASTA). All of them have in common that reaching them would likely constitute a significant historical event with an enormous impact on human society. Machines would then be able to do much of the intellectual work currently conducted by humans. In this section, we explain how the advent of HLMI would make humanity increasingly dependent on AI and lead to unprecedented levels of innovation. We then argue that this might lead to humanity’s disempowerment if the distribution shift problem is not solved.

A Push Toward AI Autonomy

Once we have HLMI, AI systems will, by definition, be able to do every task better and more cheaply than human workers. Consequently, humans will become the main bottleneck in the supply chain, which makes it highly economically valuable to let AI act more autonomously. This could take the form of giving AI more access to crucial computer systems and building up infrastructure that AI can use to act effectively in the world. For example, robot factories might be built that allow AI to quickly design and deploy robots that can do all manual tasks currently done by humans.

With humanity increasingly relying on powerful AI systems, it will be hard to unilaterally discontinue their use. States and companies that do so will simply be outcompeted by ones that make use of AI’s great innovative power. As a result, the share of intellectual work done by HLMI systems will grow over time relative to that done by humans, and it will become hard to maintain effective oversight over the combined behavior of all HLMI systems.

Increased Rate of Innovation

Once AI acts autonomously in the world and does all human jobs, it will by definition also do most or all of the innovation that drives our growth. AI will have two key advantages compared to humans that will allow it to increase the rate of innovation to unprecedented levels:

First of all, AI can simply be copied and scaled up: As long as there is enough raw material and electricity in the world, one can increase the number of computers that run AI systems. This is in contrast to humanity, whose growth slows considerably — the human population will likely peak in the 21st century. Additionally, not only the quantity but also the quality of artificial intelligence can increase through sustained AI research: HLMI can, by definition, do all human jobs, and will therefore be able to do AI research itself, leading to a positive feedback loop that increases artificial intelligence.

Since the amount and quality of intelligence is likely the key driver of innovation, these dynamics would lead to huge increases in the rate of innovation.

The Resulting Disempowerment of Humanity

In summary, if we reach HLMI, we will likely see AI being very autonomous and widespread, replacing most or all jobs currently done by humans. In particular, AI will replace jobs in research and development, leading to increased rates of innovation. Over longer timeframes, AI systems will then dominate the intellectual work on earth, and humanity will be subject to the decisions of AI systems that determine its future trajectory. In an ideal world, AI remains aligned with us, shares our values, and defers to humans for key decisions.

However, AI autonomy and increased rates of innovation will also lead to much larger distribution shifts: once an AI acts autonomously, it is put in situations it never encountered in its previously constrained environments, and with new innovations transforming the world, the environment will keep changing in unpredictable ways. This means that, over time, the distribution will drift away from the one the AI was trained on. If the distribution shift problem is not solved by then, AI systems will misbehave in ways that may be unaligned with humanity. This may lead to a disempowerment of humanity, whose fate would then be determined by AIs showing increasingly alien behavior in an increasingly changing world. In the worst case, the future may lose everything that is valuable to humans.

None of this would be a problem we need to focus on now if HLMI were never developed, or only in the very distant future. However, we cannot rely on that, as we explain below.

High-Level Machine Intelligence Might Arrive Soon

The last ten years have seen remarkable progress in artificial intelligence. In 2012, AlexNet made tremendous progress in classifying images. Since then, research into deep learning has sped up dramatically, leading to a cascade of spectacular successes such as AlphaGo and GPT-3.

Much of this success was driven by scaling up neural network architectures to enormous sizes. This is made possible by scalable architectures like the transformer, which show large and predictable improvements when scaled up.[2] Such models can have hundreds of billions of parameters and cost tens of millions of dollars to train. Another factor in the success of deep learning was the continued fall in the price of GPUs, the core processors used in deep learning. Additionally, though not independent of inventions like the transformer, large algorithmic efficiency gains made it possible to achieve better performance with the same compute budget.
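
To illustrate what “predictable improvements” means in practice, empirical scaling laws model test loss as a simple function of parameter count and training data. The sketch below uses a Chinchilla-style functional form, L(N, D) = E + A/N^α + B/D^β; treat the constants as placeholders chosen only for illustration (roughly in the range of published fits, but not the exact values).

```python
# Illustrative only: the functional form follows Chinchilla-style scaling laws,
# but the constants here are placeholders rather than the published fits.
def predicted_loss(n_params: float, n_tokens: float,
                   e: float = 1.7, a: float = 400.0, alpha: float = 0.34,
                   b: float = 400.0, beta: float = 0.28) -> float:
    """Predicted test loss as a function of parameter count and training tokens."""
    return e + a / n_params**alpha + b / n_tokens**beta

for n_params in [1e8, 1e9, 1e10, 1e11]:
    n_tokens = 20 * n_params  # a commonly cited rule of thumb of roughly 20 tokens per parameter
    print(f"{n_params:.0e} params, {n_tokens:.0e} tokens -> "
          f"predicted loss {predicted_loss(n_params, n_tokens):.3f}")
```

The point of such fits is that researchers can forecast, before spending the compute, roughly how much a larger training run will help.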

Where does that leave us for the future? Certainly, some trends cannot continue forever. For example, the price of GPUs might stagnate. Furthermore, current top-performing models already cost tens of millions of dollars. Projects in the billions of dollars are in principle possible, but the pace of scale-up will likely slow before then, and much more expensive networks might not be trained in the foreseeable future.

However, the development of new architectures, key insights, and training processes will likely continue, and it carries the usual unpredictability of basic research. This makes precise predictions inherently hard.

Nevertheless, HLMI might arrive earlier than many people think. In 2016, Katja Grace surveyed AI experts who had published at the 2015 NIPS and ICML conferences — two leading venues for machine learning. The median respondent predicted a 50% chance of HLMI by the year 2061. She repeated the survey in 2022, with the experts predicting the year 2059. A report by Ajeya Cotra makes comparisons to biological neural networks and predicts transformative AI by 2050; she has recently updated her prediction down towards 2040. Finally, forecasters at Metaculus predict artificial general intelligence (AGI) for 2041 — though note that this forecast fluctuates strongly over time.

While these predictions are about different operationalizations of strong AI — namely, HLMI, TAI, and AGI — they still overall point in the direction of HLMI arriving in the next few decades. This is within the lifetime of many readers of this post or their children.

Conclusion — AI Safety Should be a Key Research Priority of Our Time

In summary, the previous sections provide the following chain of reasoning leading to a potential disempowerment of humanity in this century:

  • In the current ML paradigm, ML systems receive data and produce outputs. Training an ML system means optimizing it so that the outputs achieve some goal.
  • Since the goal is a priori not explicitly encoded in the ML systems, the outputs may only be correlated with the goal in the training distribution. Under distribution shifts, ML systems can misbehave. 
  • HLMI would lead to more AI autonomy and innovation and consequently to more drastic distribution shifts. 
  • With humanity increasingly dependent on powerful AI systems, this could lead to a disempowerment of humanity and in the worst case a loss of all value.
  • HLMI could plausibly be developed in the next decades.

Thus, we might see a disempowerment of humanity in this century.

To mitigate our potential disempowerment, it seems crucial that more people work on AI Safety research. If you want to think about getting involved, I can recommend these three articles. If you want to go deeper, then consider reading through the AI Alignment Curriculum. I have written summaries for many texts in the curriculum that you may want to read as well. Finally, I want to clarify once again that the distribution shift problem is only one of many safety concerns, and I highly encourage you to become familiar with several of them before “getting to work”.

Appendix: A Broader Overview of Safety Concerns

This post focused on the distribution shift problem as the core motivator for research into AI Safety, but it is not the only concern. I now briefly summarize the bigger picture. I start with a discussion of inner and outer alignment, which together form what Paul Christiano calls (intent) alignment:

Inner Alignment

The distribution shift problem itself is part of the inner alignment problem. This is, very roughly, about any situation in which the AI’s behavior does not correspond to the specified goal.[3] A specific subclass concerns the situation in which the AI system itself becomes an optimizer for some goal, possibly acquired during training; the AI system is then called a mesa optimizer.

It is unclear whether AI systems actually acquire their own goals, but it seems plausible on evolutionary grounds: evolution selected humans according to the “goal” of inclusive genetic fitness, and yet, humans rarely explicitly maximize this goal. Rather, we have evolved many subgoals that have been useful for our fitness in the ancestral environment, including eating sugary food, gaining status, or finding deep enjoyment in life.

Recently, the inner alignment problem has been demonstrated in deep reinforcement learning. This work even argued that the inner-misaligned reinforcement learning systems optimize for unaligned goals instead of "merely" misbehaving. This is an example of a more general concern that capabilities might generalize further than alignment.

Outer Alignment

Another central concern is outer alignment, which I completely omitted from this article. An outer misalignment failure occurs when the goal we give to the AI system in the first place is not actually aligned with our own intentions. Thus, even if the AI perfectly satisfies its specified goal, one may still encounter problems. Victoria Krakovna wrote about this under the name of specification gaming. The problem is usually caused by our inability to fully specify all constraints we care about; the AI may then achieve its goal by causing potentially irreversible and harmful side effects.

There have been some attempts to solve the specification problem by learning the goal from human preferences. This, however, has the problem that humans’ stated preferences do not necessarily agree with their actual values. And even if human preferences did agree with the visible AI behavior, we would still face the problem that the AI might conceal harmful externalities of its actions.
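
For concreteness, one common way to learn a goal from human preferences is to fit a reward model to pairwise comparisons, a Bradley-Terry-style setup used in reward modeling for systems trained from human feedback. The sketch below shows the core idea on invented data; it is a simplification, not a description of any particular system.

```python
import numpy as np

rng = np.random.default_rng(0)

# Invented features describing pairs of AI behaviors that a human compared.
# In each pair, the human preferred the first behavior over the second.
preferred = rng.normal(size=(200, 4))
rejected = rng.normal(size=(200, 4)) - 0.5  # shifted so that there is something to learn

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# A linear "reward model": reward(features) = features @ weights.
weights = np.zeros(4)
learning_rate = 0.1
for _ in range(300):
    # Bradley-Terry model: P(preferred beats rejected) = sigmoid(r_preferred - r_rejected).
    margin = preferred @ weights - rejected @ weights
    p = sigmoid(margin)
    # Gradient of the mean negative log-likelihood -log(p) with respect to the weights.
    gradient = -((1 - p)[:, None] * (preferred - rejected)).mean(axis=0)
    weights -= learning_rate * gradient

print("learned reward weights:", weights.round(2))
print("mean P(preferred > rejected):", sigmoid(preferred @ weights - rejected @ weights).mean().round(2))
```

Even when such a model fits the comparisons well, it only captures what evaluators could see and judge, which is exactly the concern about concealed externalities raised above.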

Risks from Superintelligence

Prior to the rise of deep learning, there already were many discussions on AI Safety. They often revolved around the notion of a superintelligence — i.e., an AI that is much more intelligent than all of humanity combined. The canonical introduction is Nick Bostrom’s book Superintelligence. This book models the AI as being given an explicit representation of its goal that it tries to maximize in expectation.

While his discussion of the topic may seem archaic from today’s machine learning perspective, the text remains relevant to this day. Three very important notions are:

  • The Orthogonality Thesis: In principle, almost any level of intelligence can be combined with almost any final goal. Intelligence alone does not make the AI system moral.
  • Instrumental Convergence Thesis: There are instrumental goals — like goal preservation, resource acquisition, and cognitive enhancement — that are helpful for reaching almost any final goal. Therefore, we expect superintelligences to optimize them.
  • Intelligence Explosion: The subgoal of enhancing intelligence will lead to increased intelligence, which in turn increases the superintelligence’s ability to improve itself further. This may lead to a positive feedback loop that drastically increases intelligence in a short amount of time; a toy numerical sketch of this dynamic follows below.
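
As a purely illustrative toy model of this feedback loop (not a forecast, and with arbitrary numbers), the sketch below compares a capability that improves at a rate depending on its current level with one that improves at a fixed rate:

```python
# Toy model, for intuition only: one quantity improves at a rate that grows with its
# current level (recursive self-improvement), the other at a constant rate.
recursive, steady = 1.0, 1.0
for year in range(1, 11):
    recursive += 0.1 * recursive**2  # improvement speed depends on current capability
    steady += 0.1                    # constant rate of improvement
    print(f"year {year:2d}: recursive = {recursive:7.2f}, steady = {steady:.2f}")
```

The exact numbers are meaningless; the point is only that self-reinforcing improvement can outpace steady improvement very quickly.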

Joseph Carlsmith wrote a report that recasts many of these concerns in relation to modern machine learning.

AI Governance

There is also the question of how to govern AI. This is in itself an active research field about which you can learn in the AI Governance Curriculum. To quote from there:

AI governance can be thought of as a cluster of activities and professional fields that seek to best navigate the transition to a world with advanced AI systems. It involves not just government decisions but also corporate decisions, and not just formal policies but also institutions and norms.

For a short introduction, see this talk by Allan Dafoe.

  1. ^

    However, there have been concerns that a representation of a goal — equal to, or different from, the specified goal — may emerge in the ML system. See the section on inner alignment in the appendix.

  2. ^

    See also the Chinchilla paper for an updated view on scaling laws.

  3. ^

    Some would narrow this down further and replace the “AI’s behavior” with what the AI is “trying” to do; see Paul Christiano’s definition of the full alignment problem encompassing both inner and outer alignment.

Comments

Great post!

I like that you point out that we'd normally do trial and error, but that this might not work with AI. I think you could possibly make clearer where this fails in your story. You do point out how HLMI might become extremely widespread and how it might replace most human work. Right now it seems to me like you argue essentially that the problem is a large-scale accident that comes from a distribution shift. But this doesn't yet say why we couldn't e.g. just continue trial-and-error and correct the AI once we notice that something is going wrong. 

I think one would need to invoke something like instrumental convergence, goal preservation and AI being power-seeking, to argue that this isn't just an accident that could be prevented if we gave some more feedback in time. It is important for the argument that the AI is pursuing the wrong goals and thus wouldn't want to be stopped, etc.

Of course, one has to simplify the argument somehow in an introduction like this (and you do elaborate in the appendix), but maybe some argument about instrumental convergence should still be included in the main text.

Yes, after reflection I think this is correct. I think I had in mind a situation where with deployment, the training of the AI system simply stops, but of course, this need not be the case. So if training continues, then one either needs to argue stronger reasons why the distribution shift leads to a catastrophe (e.g., along the lines you argue) or make the case that the training signal couldn't keep up with the fast pace of the development. The latter would be an outer alignment failure, which I tried to avoid talking about in the text.