In this sequence of posts, I want to lay out why I am worried about risks from powerful AI and where I think the specific dangers come from. In general, I think it's good for people to be able to form their own inside views of what’s going on, and not just defer to people. There are surprisingly few descriptions of actual risk models written down. 

I think writing down your own version of the AI risk story is good for a few reasons

  • It makes you critically examine the risk models of other people, and work out what you actually believe.
  • It may be helpful for finding research directions which seem good. Solving a problem seems much more doable if you can actually point at it.
  • It seems virtuous to attempt to form your own views rather than just the consensus view (whatever that is).
  • People can comment on where your story may be weak or inconsistent, which will hopefully push towards the truth.

Additionally, I have some friends and family who are not in EA/AI-safety/Longtermism/Rationality, and it would be nice to be able to point them at something describing why I’m doing what I’m doing (technical AI-safety). Although, admittedly my views are more complicated than I initially thought, so this isn’t a great first introduction to AI-risk.

I don’t expect much or any of these posts to be original and there will be many missing links and references. This can maybe be viewed as a more quick and dirty AGI safety from first principles, with less big picture justification and more focus on my specific risk models. In general, I am concerned most about deceptively aligned AI systems, as discussed in Risks from Learned Optimization

How do we train neural networks?

In the current paradigm of AI, we train neural networks to be good at tasks, and then we deploy them in the real world to perform those tasks. 

We train neural networks on a training distribution

  • For image classifiers, this is a set of labeled images. E.g. a set of images of cats and dogs with corresponding labels.
  • For language models like GPT-3, this is a very large amount of text from the internet.
  • For reinforcement learning, this is some training environment where the AI agent can ‘learn’ to take actions. For example, balancing a pole on a cart or playing the Atari game Breakout. 

We start with an untrained AI system and modify it to perform better on the training distribution. For our cat vs dog classifier, we feed in images of cats and dogs (our training data) and modify the AI so that it is able to accurately label these images. For GPT-3, we feed in the start of a string of text and modify it such that it can accurately predict what will come next. For our reinforcement learning agent playing Breakout, the agent takes actions in the game (move the platform left and right), we reward actions which lead to high score, and use this reward to train the agent to do well in this game. 

If the training is good then our AI system will then be able to generalize and perform well in slightly different domains. 

  • Our cat vs dog classifier can correctly label images of cats and dogs it hasn’t seen before
  • Our language model can ‘predict’ the next word for a given sensible prompt, and use this to generate coherent text 
  • Our reinforcement learning agent can play the game, even if it hasn’t seen this exact game state before. Or even generalize to slightly different environments; what if there are more or differently positioned blocks?

This ability to generalize is limited by the task it was trained on; we can only really expect good generalization on data that is similar enough to the training distribution. Our cat vs dog classifier might struggle if we show it a lion. If our language model has only been trained on English and German text, it won’t be able to generate French. Our Breakout agent can really only play Breakout. 


When we train our AI systems we are optimizing them to perform well on the training distribution. By optimizing I mean that we are modifying these systems such that they do well on some objective. 

Optimized vs Optimizers

It is important to make a distinction between something which is optimized and something which is an optimizer. When we train our AI systems, we end up with an optimized system; the system has been optimized to perform well on a task, be that cat vs dog classification, predicting the next word in a sentence, or achieving a high score at Breakout. These systems have been optimized to do well on the objective we have given them, but they themselves (probably) aren’t optimizers; they don’t have any notion of improving on an objective. 

Our cat vs dog classifier likely just has a bunch of heuristics which influence the relative likelihood of ‘cat’ or ‘dog’. Our Breakout agent is probably running an algorithm which looks like “The ball is at position X, the platform is at position Y, so take action A”, and not something like “The ball is at position X, the platform is at position Y, if I take action A it will give me a better score than action B, so take action A”. 

We did the optimizing with our training and ended up with an optimized system. 

However, there are reasons to expect that we will get ‘optimizers’ as we build more powerful systems which operate in complex environments. AI systems can solve a task in 2 main ways (although the boundary here is fuzzy)

  • They can use a bunch of heuristics that mechanistically combine to choose an action. For example, “If there is <this texture> +3 to the dog number, if there is <this triangle shape> +4 to the cat number, if there is <this color> next to <this color> +2 to the dog number… etc”.
  • They can ‘do optimization’; search over some space of actions and find which one performs best on some criteria. For example, a language model evaluating outputs on “Which of these words is most likely to come next?”, or our Breakout AI asking “What will be my expected overall score if I take this action?”

As our tasks get more complex, if we are using the heuristic strategy, we will need to pile on more and more heuristics to perform well on the task. It seems like the optimization approach will become favored as things become more complex because the complexity of the optimization (search and evaluate) algorithm doesn’t increase as much with task complexity. If we are training on a very complex task and we want to achieve a certain level of performance on the training distribution, the heuristic-based algorithm will be more complex than the optimization-based algorithm. One intuition here is that for very varied and complex tasks, the AI may require some kind of “general reasoning ability” which is different from the pile of heuristics. One relatively simple way of doing “general reasoning” is to have an evaluation criterion for the task, and then evaluate possible actions on this criterion. 

If AI systems are capable of performing complex tasks, then it seems like there will be very strong economic pressures to develop them. I expect by default for these AI systems to be running some kind of optimization algorithm. 

What do we tell the optimizers to do?

Assuming that we get optimizers, we need to be able to tell them what to do. By this I mean when we train a system to achieve a goal, we want that goal to actually be one that we want. This is the “Outer Alignment Problem”. 

The classic example here is that we run a paperclip factory, so we tell our optimizing AI to make us some paperclips. This AI has no notion of anything else that we want or care about so it would sacrifice literally anything to make more paperclips. It starts by improving the factory we already have and making it more efficient. This still isn’t making the maximal number of paperclips, so it commissions several new factories. The human workers are slow, so it replaces them with toilless robots. At some point, the government gets suspicious of all these new factories, so the AI uses its powers of superhuman persuasion to convince them this is fine, and in fact, this is in the interest of National Security. This is still very slow compared to the maximal rate of paperclip production, so the AI designs some nanobots which convert anything made of metal into paperclips. At this point, it is fairly obvious to the humans that something is very, very wrong, but this feeling doesn’t last very long because soon the iron in the blood of every human is used to make paperclips (approximately 3 paperclips per person). 

This is obviously a fanciful story, but I think it points at an important point; it’s not enough to tell the AI what to do, we also have to be able to tell it what not to do. Humans have pretty specific values, and it seems extremely difficult to specify. 

There are more plausible stories we can tell which lead to similarly disastrous results. 

  • If we want our AI system to maximize the number in a bank account, if it is powerful enough it might hack into the bank’s computer system to modify the number and then take further actions to ensure the number is not modified back. 
  • If we tell our AI to maximize the number in a bank account but not ‘break any laws’, then it may aim to build factories which are technically legal but detrimental for the environment. Or it may just blackmail lawmakers to create legal loopholes for it to abuse. 
  • If we ask our AI to do any task where success is measured by a human providing a reward signal (e.g. every hour rating the AI’s actions out of 10 via a website), then the AI has a strong incentive to take control of the reward mechanism. For example, hacking into the website, or forcing the human to always provide a high reward. 

I think there are some methods for telling AI systems to do things, such that they might not optimize catastrophically. Often these methods involve the AI learning the humans’ preferences from feedback, rather than just being given a metric to optimize for. There is still a possibility that the AI learns an incorrect model of what the human wants, but potentially if the AI is appropriately uncertain about its model of human values then it can be made to defer to humans when it might be about to do something bad. 

Other strategies involve training an AI to mimic the behavior of a human (or many humans), but with some ‘amplification’ method which allows the AI to outperform what humans can actually achieve. For example, an AI may be trained to answer questions by mimicking the behavior of a human who can consult multiple copies of (previous versions of) the AI. At its core, this is copying the behavior of a human who has access to very good advisors, and so hopefully this will converge on a system which is aligned with what humans want. This approach also has the advantage that we have simply constructed a question-answering AI, rather than a “Go and do things in the world” AI, and so this AI may not have strong incentives to attempt to influence the state of the world. 

I think approaches like these (and others) are promising, and at least give me some hope that there might be some ways of specifying what we want an AI to do. 

How do we actually put the objective into the AI?

There is an additional (and maybe harder) problem: even if we knew how to specify the thing that we want, how do we put that objective into the AI? This is the ‘Inner Alignment Problem”. This is related to the generalization behavior of neural networks; a network could learn a wide range of functions which perform well on the training distribution, but it will only have learned what we want it to learn if it performs ‘well’ on unseen inputs.

Currently, neural networks generalize surprisingly well; 

  • Image classifiers can work on images they’ve never seen before
  • Language models can generate text and new ‘ideas’ which aren’t in the training corpus

In some sense this is obvious, if the models were only able to do well on things we already knew the answer to then they wouldn’t be very useful. I say “surprisingly” because there are many different configurations of the weights which lead to good performance on the training data without any guarantees about their performance on new, unseen data. Our training processes reasonably robustly find weight configurations which do well on both the training and test data. 

One of the reasons neural networks seem to generalize is because they are biased towards simple functions, and the data in the world is also biased in a similar way. The data generating processes in the real world (which map “inputs” to “outputs”) are generally “simple” in a similar way to the functions that neural networks learn. This means that if we train our AI to do well on the training data, when we show it some new data it doesn’t go too wild with its predictions and is able to perform reasonably well. 

This bias towards simplicity is also why we might expect to learn a function which acts as an optimizer rather than a pile of heuristics. For a very complex task, it is simpler to learn an optimization algorithm than a long mechanistic list of heuristics. If we learn an algorithm which does well on the training distribution by optimizing for something, the danger arises if we are not sure what the algorithm is optimizing for off the training distribution

There will be many objectives which are consistent with good performance on the training distribution but then cause the AI system to do wildly different things off distribution. Some of these objectives will generalize in ways that humans approve of, but many others will not. In fact, because human values are quite specific, it seems like the vast majority of objectives that an AI could learn will not be ones that humans approve of. It is an open question what kinds of objectives an AI will develop by default.

It does however seem like AIs will develop long term goals by default. If an AI is trained to do well on a task, it seems unlikely to arbitrarily not care about the future. For example, if we train an AI to collect apples, it will attempt to maximize the number of apples over all time (maybe with some temporal discount factor), rather than only maximize apples collected in a 10 minute interval. This is probably true even if the AI was only ever trained for 10 minute intervals. The objective “maximize apples” seems far less arbitrary than “maximize apples for 10 minutes and then don’t care about them”.


There is an additional danger if an AI system is ‘deliberately’ attempting to obscure its objective/intentions from the humans training it. The term ‘deception’ is often used to refer to two different things which could happen when we train AI systems, which I outline here. 

If the AI is being trained on a difficult task, it might be easier for the AI to trick the evaluator (maybe a human) into giving a high reward, rather than actually doing well on the task. I’ll call this ‘Goodhart deception’ because the AI is ‘Goodharting’ the reward rather than optimizing for what humans actually want. Importantly, this doesn’t require the AI to have any objective or be optimizing for anything, the behavior which led to high reward (tricking the human) was just reinforced. This seems bad, but not as catastrophically bad as the other type of deception might be.

The other type of deception is if an optimizing AI system intentionally deceives the humans about its true goals. In this scenario, the AI system develops an objective which is not aligned with the human objective. Here the objective is extended across time, which seems potentially like the default for learned objectives. The AI knows that if it attempts to directly go for its objective then it will either be turned off or be modified to remove this objective. So the AI will ‘pretend’ to not have this goal and instead ‘play along’ and do well on the task it is being trained for. After training, when the AI is deployed into the world, it is free to defect and pursue its own (misaligned) objective. 

I’ll call this ‘consequentialist deception’ because the AI is acting as a consequentialist (taking actions because of their consequences in the world, rather than just using mechanistic heuristics), or maybe just ‘deception’.

This requires 3 (possibly likely) things to happen

  • The AI system is an optimizer and develops an objective (not the human’s objective) which is extended across time
  • The AI’s ability to deceive[1] the humans is reinforced, or at least not removed by the training process
  • There will come a time when the system can ‘safely’ defect (or at least the AI thinks there is a chance that this might happen)

Consequentialist deception seems especially dangerous because it will lead to the same behavior on the training distribution as a non-deceptive AI. This makes it especially hard to tell if the model is deceptive. Even if the model performs well on the training distribution, we don’t know how it will behave when it is deployed. If the model is deceptive then it could perform arbitrarily badly in deployment, as it pursues its own objective. 

For a model to be deceptive it needs to develop a different objective than the objective it is being trained on. Currently, when we train neural networks, this looks approximately like we randomly pick a function from the set of functions which perform well on the training distribution. Functions which take fewer parameters to specify (have more free/unused parameters) have a higher likelihood of being chosen. Consider a function  which requires both  to do well on the training distribution, compared with a function  which only requires  to do well on the training distribution. If we are randomly sampling functions that do well on the training distribution, we are more likely to select  because we only need to get  correct, and any value of  works, while  requires us to pick both  and  correctly. 

It is pretty unclear how this “bias towards simple functions”/”bias towards functions with lots of free parameters” influences the probability of deception. Deceptive algorithms seem more complex than non-deceptive ones because they need to be able to both recognize they are in a training process, and then decide to do well on the training task. But deception might be the default behavior for any optimizing algorithm to develop, and so the simplicity of optimizing algorithms might mean that deception is more likely. 

Additionally, deceptive algorithms are persistent in a way that non-consequentialist algorithms are not. For normal neural network training, you will likely find a suboptimal algorithm early in training and then this will be modified into a different algorithm as training progresses; the algorithm does not ‘want’ to persist. But if you find a sufficiently capable deceptive algorithm early in training, then this will attempt to persist until the end of training. This means that if your AI becomes deceptive at any point during training, it will likely continue to be deceptive. This implies that the “randomly sample functions which perform well on the training distribution” lens may not be accurate, and in fact there is a lot of path-dependency in the development of deceptive algorithms. 


So to recap:

  1. Optimization is scary, and if we train an AI system to take actions in the world, as the task gets more complicated and our AI systems get more powerful we are more likely to develop optimizers.
    • By ‘optimizer’ I mean the AI runs a 'consequentialist' algorithm which looks something like “What actions can I take to maximize my objective?”, rather than just running through rote steps in a calculation.
  2. We don’t have strong guarantees that off distribution our AIs will do what we want them to do.
  3. There are reasons to expect that AIs we develop may have very different goals to our own, even if they perform well on the training distribution.
  4. Other than performing well on the training distribution, and having some sort of bias towards “simplicity", the AI could learn a whole range of objectives. 
  5. We already know that AI systems are capable of tricking humans, although in current systems this is a different phenomenon than deceiving humans for consequentialist reasons.
  6. If an AI develops its own (misaligned) objective during training, then it may simply ‘play along’ until it is able to safely defect and pursue its own objective.
    • The AI’s objective may be arbitrarily different from what we were training it for, and easily not compatible with human values or survival. 

In the next two posts I will lay out a more concrete story of how things go wrong, and then list some of my current confusions.

Thanks to Adam Jermyn and Oly Sourbut for helpful feedback on this post. 

  1. ^

    Including knowing that deception is even a possible strategy.

New Comment
2 comments, sorted by Click to highlight new comments since:

Nitpick: to the extent you want to talk about the classic example, paperclip maximisers are as much meant to illustrate (what we would now call) inner alignment failure.

See Arbital on Paperclip ("The popular press has sometimes distorted the notion of a paperclip maximizer into a story about an AI running a paperclip factory that takes over the universe. [...] The concept of a 'paperclip' is not that it's an explicit goal somebody foolishly gave an AI, or even a goal comprehensible in human terms at all.") or a couple of EY tweet threads about it: 1, 2

I'm getting more and more worried because the software I have dealt with in real life (as opposed to read about in scifi) is so defectively stupid it's actually evil (or else the programmers are). Of course often it's deliberately evil, like scanners that won't work because the incorporated printer is out of ink.