This is a brief distillation of Risks from Learned Optimization in Advanced Machine Learning Systems (Hubinger et al. 2019) with a focus on deceptive alignment. Watching The OTHER AI Alignment Problem: Mesa-Optimizers and Inner Alignment helped me better understand the paper and write up this post.
What is it that makes the alignment problem so challenging? The top reason is that it involves deception. Deception makes artificial agents overly capable and takes the game of intelligence to a whole new level of complexity. But let's start from the beginning.
In many cases, by alignment problem, we mean "outer alignment", i.e., how to have the base objective (the objective of the designer represented in the model) represent whatever humans want it to represent. It is about bridging the gap between my objective as a designer, and the base objective of the system. The system is the base optimizer, in other words, the model that optimizes according to the base objective. This is itself difficult since the base objective refers to events happening in a complex environment, the real world.
The base objective might be something like eradicating a disease. For example, suppose the task is to minimize the number of people who have cancer. How do you get this objective to not be represented along the following lines?
1. Cancer is something that happens to humans and other sentient beings.
2. The objective is to minimize the number of occurrences of cancer.
∴ Minimize the number of humans and other sentient beings that could get cancer.
Goals are difficult to represent because even humans disagree on what the same propositions mean and what is the best way to resolve a problem. Moreover, human values have . Our preferences cannot be described using a few simple rules, our interpretations of values and goals vary, and the current state of metaethical discourse does not promise substantial agreement or clarification on what has, for instance, intrinsic value. So, outer misalignment broadly captures this failure to transmit one or more human values to an artificial agent.
As if this weren't problematic enough, there is also an alignment problem that concerns the internal structure of the system and it's called "inner alignment". This is the focus of this post and will get us to the crucial point about deceptive agents.
Suppose you train a neural network to complete a task. The task, in this case, is to find the exit of a maze (base objective). There are also apples in the maze, but merely for decoration; the objective is simply to get to the exit that happens to be green in this training environment.
When the training is complete, you deploy the model in a different environment which looks like this:
The base objective has not changed: the neural network has to solve the maze by reaching the exit. This change of environment known as distributional shift, however, does not go unnoticed. There are three possible outcomes:
In this scenario, let's suppose that the system acquired a maze-solving ability, but instead of optimizing for "exit" it learned to optimize for "green". The exit in the new environment is grey, therefore, the model will complete the task whenever it reaches a green apple.
The process of training, of course, involves fixing the system's mistakes. This is part of adversarial training which will force the system not to commit the mistake of, in this case, optimizing for "green". Adversarial training is a way of training ML systems that involves focusing on the system's weaknesses; it forces the system to learn not to have such weaknesses, like optimizing fo the wrong objective.
Side note: it might be intuitive to assume that if the system's behavior is not in accordance with human values, we can always do something about this, e.g., turn it off. But there's something even more intuitive and familiar to every agent trying to survive in a challenging environment: for the purpose of self-preservation, the model will do everything it can to prevent you from:
This includes deceiving the designer. Deception is a highly effective method for the system to pursue its mesa-objective while exhibiting outputs that make the designer think that the system is trying to accomplish the base objective.
Perhaps it sounds particularly bad in this context, but if you think about it, deception is a technique most human agents employ to survive in different contexts from the ancestral environment to contemporary professional and social circles. It should then make sense why the system is more likely to pretend it's behaving according to your values so that you don't do anything about it; once it has enough resources not to be threatened by your ability to shut it down or change it, it will go ahead and pursue its goal.
Since the system is intelligent enough to predict that if it doesn't optimize for the base objective, gradient descent will make sure it is modified, it chooses to behave as if it were actually optimizing for the base objective in the training environment. Gradient descent is something like the "evolution algorithm" for artificial agents; it is an optimization technique that evaluates the performance of a model and modifies the model accordingly. Once the model is deployed in the real world (and can no longer be modified) it will optimize for the mesa-objective.
In the example above from (Hubinger et al. 2019), the base-objective is for the little robot to go to A and the mesa-objective is to go to B. In the two training episodes, the robot goes to A making the designer think it's optimizing for the base objective and it's good to be deployed without further modifications. In the three deployment episodes, the little robot optimizes for the mesa-objective. The system has managed to deceive the designer during training and it's perfectly capable of getting what it wants except that what it wants is not the same as what the designer wants.
The conclusion is that while we gave the system the right objective and in that sense outer alignment was successful, the mesa-optimizer (an optimizer optimized by an optimizer) optimized for the mesa-objective leading to inner misalignment.
1 and 2 are not stable states; the mesa-optimizer will eventually have to either become properly aligned or find another deception strategy.