*Developing a theory of values to solve extrapolation issues, and an approach to train AI systems to reason well*

Find all Alignment Newsletter resources here. In particular, you can sign up, or look through this spreadsheet of all summaries that have ever been in the newsletter.

## Highlights

**Why we need a theory of human values**

*(Stuart Armstrong)*: There are many different sources of information for human values, such as behavior, speech, facial expressions/emotions, and extrapolations of what a human would do. These have all been honed to produce similar preferences in our current environment. However, the environment will change a lot due to the AI's actions, inducing a distributional shift, after which we should no longer expect the values inferred from these different methods to agree with each other. In addition, we also have the problem that each method only applies in some circumstances -- for example, people are likely to misrepresent their values if asked in a courtroom. We could try to patch these problems by having a meta-method that chooses from the various methods of value learning, as well as predicting human judgments about when each method applies. However, then we'd likely have similar issues with the meta-method and predictions, which are also likely to be specific to the current environment. Instead, we should have a

*theory*of human values, from which we can get principled approaches to resolve these problems.

**Rohin's opinion:** I strongly agree that due to the problems mentioned in this post, we shouldn't be trying to mix and match value learning methods to infer a static notion of human preferences that is then optimized over the long term. I don't personally work on building a theory of human values because it seems like I could apply the same critiques to the result: the theory is specialized to our current situation and won't capture changes in the future. But of course this is hard to predict without knowing the theory. My preferred solution is not to build a system that is optimizing a goal-directed utility function over the long term, and to find other ways of making an AI system that are still just as useful.

# Technical AI alignment

### Iterated amplification sequence

Factored Cognition *(Andreas Stuhlmüller)*: This was previously summarized in AN #12. While this post describes a project from Ought, it explains many of the ideas behind iterated amplification (which is why it is in this sequence). There are two important and distinct topics addressed in this post: first, whether it is possible in principle to achieve good performance on tasks when you must use *explicit reasoning*, and second, how to use explicit reasoning to train an aligned AI system via iterated amplification.

Currently, most AI research is supervised based on *behavior*: we specify a domain and data or reward functions that identify good behavior, and hope that the AI develops a good decision-making system during training. We don't provide training data for decision-making itself. In contrast, Factored Cognition aims to make the decision-making algorithm itself explicit, in a form that could be used to train an AI system, which we might call internal supervision. Note that this is *not* a required feature of iterated amplification, which can work with external (i.e. behavioral) supervision, as in recursive reward modeling (AN #34). However, we might want to use internal supervision because then we get to control the decision-making process itself, and that is what determines whether the AI system is aligned or not.

Research currently does not use internal supervision because it requires training data on *how* to solve the problem, whereas a reward function only needs to tell *whether* the problem is solved. However, for sufficiently general AI systems, we could provide training data on general problem-solving, which could then be applied to many tasks. This could be competitive with the alternative of providing training data or reward functions for each task separately.

How might we make general problem-solving explicit enough that we could train AI systems to replicate it? Factored Cognition hypothesizes that we could do this by providing *decompositions* of tasks into simpler subtasks. While this is not how humans solve tasks (for example, expert Go players use a lot of intuition), it seems plausible that we could get similar or better performance using decomposition as long as we were willing to wait a long time (for example, by implementing minimax using decomposition to solve Go). Ought wants to test this hypothesis, by studying whether humans are in fact capable of providing decompositions in challenging problem domains, such as math and programming puzzles, textbook problems, fact-checking, interactive dialog, and task prioritization. (There will likely be no ML here, just humans decomposing problems for other humans to then solve.)

The post makes this concrete by considering a particular way in which humans might provide decompositions. They develop a "cognitive workspace" where a human can take "actions" such as editing the text in the workspace, each of which can only involve a small amount of cognitive work. There are a lot of details that I won't get into, but broadly it seems to me like what you would get if you tried to create a functional programming language where the primitive forms are things that a human can do. Note that the examples in the post use task-specific decompositions for clarity but in the long term we would need general decomposition strategies.

We could train an AI system to mimic this explicit reasoning. It seems like this would achieve human performance but no more, since after all we are imitating human decision-making algorithms. However, now we can use the iterated amplification trick. If a human got to use explicit reasoning over a long time, we would expect significantly better decisions. So, we can first distill an agent A that mimics a human using explicit reasoning over a short time period, then amplify it to get H[A], where a human performs a decomposition into subquestions that are answered by A, then distill that into A', etc. The post has a nice visualization of how this allows you to mimic an implicit exponentially large tree of decompositions in polynomial time. As before, distillation could be done using imitation learning, or RL with some good inferred reward.

**Rohin's opinion:** It seems a lot easier to generalize well to novel situations if you've learned the right decision-making algorithms. Since it's way easier to learn something if you have direct supervision of it, internal supervision seems particularly interesting as a method of generalizing well. For example, in Making Neural Programming Architectures Generalize via Recursion, the learned programs generalize perfectly, because they learn from execution traces, which show *how* to solve a task, and the learned algorithm is forced to use a recursive structure. (Typical approaches to neural program induction only use input-output examples, and often fail to generalize to inputs longer than those seen during training.)

I also like the point that iterated amplification allows you to mimic an implicit exponential-time computation. This is a key reason for my optimism: it seems like well-honed human intuition beats short explicit reasoning almost always. In fact, I think it is reasonable to view human intuition as the result of iterated amplification (see these posts).

### Learning human intent

**Why we need a theory of human values**

*(Stuart Armstrong)*: Summarized in the highlights!

Reinforcement learning and inverse reinforcement learning with system 1 and system 2 *(Alexander Peysakhovich)*

### Interpretability

Explicability? Legibility? Predictability? Transparency? Privacy? Security? The Emerging Landscape of Interpretable Agent Behavior *(Tathagata Chakraborti et al)*

### Adversarial examples

On the Geometry of Adversarial Examples *(Marc Khoury et al)*: This paper analyzes adversarial examples based off a key idea: even if the data of interest forms a low-dimensional manifold, as we often assume, the ϵ-tube *around* the manifold is still high-dimensional, and so accuracy in an ϵ-ball around true data points will be hard to learn.

For a given L_p norm, we can define the optimal decision boundary to be the one that maximizes the margin from the true data manifold. If there exists some classifier that is adversarially robust, then the optimal decision boundary is as well. Their first result is that the optimal decision boundary can change dramatically if you change p. In particular, for concentric spheres, the optimal L_inf decision boundary provides an L_2 robustness guarantee √d times smaller than the optimal L_2 decision boundary, where d is the dimensionality of the input. This explains why a classifier that is adversarially trained on L_inf adversarial examples does so poorly on L_2 adversarial examples.

I'm not sure I understand the point of the next section, but I'll give it a try. They show that a nearest neighbors classifier can achieve perfect robustness if the underlying manifold is sampled sufficiently densely (requiring samples exponential in k, the dimensionality of the manifold). However, a learning algorithm with a particular property that they formalize would require exponentially more samples in at least some cases in order to have the same guarantee. I don't know why they chose the particular property they did -- my best guess is that the property is meant to represent what we get when we train a neural net on L_p adversarial examples. If so, then their theorem suggests that we would need exponentially more training points to achieve perfect robustness with adversarial training compared to a nearest neighbor classifier.

They next turn to the fact that the ϵ-tube around the manifold is d-dimensional instead of k-dimensional. If we consider ϵ-balls around the training set X, this covers a very small fraction of the ϵ-tube, approaching 0 as d becomes much larger than k, even if the training set X covers the k-dimensional manifold sufficiently well.

Another issue is that if we require adversarial robustness, then we severely restrict the number of possible decision boundaries, and so we may need significantly more expressive models to get one of these decision boundaries. In particular, since feedforward neural nets with Relu activations have piecewise linear decision boundaries, it is hard for them to separate concentric spheres. Suppose that the spheres are separated by a distance d. Then for accuracy on the manifold, we only need the decision boundary to lie entirely in the shell of width d. However, for ϵ-tube adversarial robustness, the decision boundary must lie in a shell of width d - 2ϵ. They prove a lower bound on the number of linear regions for the decision boundary that grows as τ^(-d), where τ is the width of the shell, suggesting that adversarial robustness would require more parameters in the model.

Their experiments show that for simple learning problems (spheres and planes), adversarial examples tend to be in directions orthogonal to the manifold. In addition, if the true manifold has high codimension, then the learned model has poor robustness.

**Rohin's opinion:** I think this paper has given me a significantly better understanding of how L_p norm balls work in high dimensions. I'm more fuzzy on how this applies to adversarial examples, in the sense of any confident misclassification by the model on an example that humans agree is obvious. Should we be giving up on L_p robustness since it forms a d-dimensional manifold, whereas we can only hope to learn the smaller k-dimensional manifold? Surely though a small enough perturbation shouldn't change anything? On the other hand, even humans have *some* decision boundary, and the points near the decision boundary have some small perturbation which would change their classification (though possibly to "I don't know" rather than some other class).

There is a phenomenon where if you train on L_inf adversarial examples, the resulting classifier fails on L_2 adversarial examples, which has previously been described as "overfitting to L_inf". The authors interpret their first theorem as contradicting this statement, since the optimal decision boundaries are very different for L_inf and L_2. I don't see this as a contradiction. The L_p norms are simply a method of label propagation, which augments the set of data points for which we know labels. Ultimately, we want the classifier to reproduce the labels that we would assign to data points, and L_p propagation captures some of that. So, we can think of there as being many different ways that we can augment the set of training points until it matches human classification, and the L_p norm balls are such methods. Then an algorithm is more robust as it works with more of these augmentation methods. Simply doing L_inf training means that by default the learned model only works on one of the methods (L_inf norm balls) and not all of them as we wanted, and we can think of this as "overfitting" to the imperfect L_inf notion of adversarial robustness. The meaning of "overfitting" here is that the learned model is too optimized for L_inf, at the cost of other notions of robustness like L_2 -- and their theorem says basically the same thing, that optimizing for L_inf comes at the cost of L_2 robustness.

Adversarial Vulnerability of Neural Networks Increases With Input Dimension *(Carl-Johann Simon-Gabriel et al)*: The key idea of this paper is that imperceptible adversarial vulnerability happens when small changes in the input lead to large changes in the output, suggesting that the gradient is large. They first recommend choosing ϵ_p to be proportional to d^(1/p). Intuitively, this is because larger values of p behave more like maxing instead of summing, and so using the same value of ϵ across values of p would lead to more points being considered for larger p. They show a link between adversarial robustness and regularization, which makes sense since both of these techniques aim for better generalization.

Their main point is that the norm of the gradient increases with the input dimension d. In particular, a typical initialization scheme will set the variance of the weights to be inversely proportional to d, which means the absolute value of each weight is inversely proportional to √d. For a single-layer neural net (that is, a perceptron), the gradient is exactly the weights. For L_inf adversarial robustness, the relevant norm for the gradient is the L_1 norm. This gives the sum of the d weights, which will be proportional to √d. For L_p adversarial robustness, the corresponding gradient is L_q with q larger than 1, which decreases the size of the gradient. However, this is exactly offset by the increase in the size of ϵ_p that they proposed. Thus, in this simple case the adversarial vulnerability increases with input dimension. They then prove theorems that show that this generalizes to other neural nets, including CNNS (albeit still only at initialization, not after training). They also perform experiments showing that their result also holds after training.

**Rohin's opinion:** I suspect that there is some sort of connection between the explanation given in this paper and the explanation that there are many different perturbation directions in high-dimensional space which means that there are lots of potential adversarial examples, which increases the chance that you can find one. Their theoretical result comes primarily from the fact that weights are initialized with variance inversely proportional to d. We could eliminate this by having the variance be inversely proportional to d^2, in which case their result would say that adversarial vulnerability is constant with input dimension. However, in this case the variance of the activations would be inversely proportional to d, making it hard to learn. It seems like adversarial vulnerability should be the product of "number of directions", and "amount you can search in a direction", where the latter is related to the variance of the activations, making the connection to this paper.

Intrinsic Geometric Vulnerability of High-Dimensional Artificial Intelligence *(Luca Bortolussi et al)*

# Other progress in AI

### Reinforcement learning

AlphaZero: Shedding new light on the grand games of chess, shogi and Go *(David Silver et al)*: If you didn't already believe that AlphaZero is excellent at Go, Chess and Shogi, this post and the associated paper show it more clearly with a detailed evaluation. A few highlights:

- AlphaZero can beat Stockfish starting from common human openings, suggesting that it generalizes well

- The amount of computation given to AlphaZero to choose a move has a larger effect on the win probability than I was expecting

- I always wondered why they use MCTS and not alpha-beta search. They speculate that alpha-beta search with a neural net evaluation function succumbs to the winner's curse since alpha-beta involves a lot of maxes and mins, whereas MCTS averages over evaluations and so is more robust. In contrast, evaluation functions designed by humans are much more likely to generalize well, and alpha-beta outperforms MCTS.

Visual Model-Based Reinforcement Learning as a Path towards Generalist Robots *(Frederik Ebert, Chelsea Finn et al)*: How can we get general robots that can perform a diverse array of tasks? We could collect a lot of data from robots acting randomly, train a dynamics model on pixels, and then use model-predictive control to plan. The dynamics model is a neural net trained to predict the next image given the current image and action. It helps to use temporal skip connections, because this allows the robot to get some object permanence since it can now "remember" objects it saw in the past that are currently blocked by something else. Model predictive control then samples sequences of actions (called plans), predicts the final image achieved, chooses the plan that best achieves the goal, and takes the first action of that plan. This is then repeated to choose subsequent actions. (Their method is slightly more sophisticated but this is the basic idea.) We can specify the goal by choosing a particular pixel and asking that the object at that pixel be moved to some other pixel. Alternatively, Few-Shot Goal Inference for Visuomotor Learning and Planning (AN #27) trains a classifier that can take a few demonstrations and output a goal.

**Rohin's opinion:** This is probably the easiest way to get a robot to do interesting things, since you just need it to collect experience autonomously with very little human involvement, you don't need to have good object detection, and in many cases goal specification can be done without too much effort. I'm surprised that using random actions is enough -- how does the robot get enough examples of picking up an object with random actions? Maybe the robot's random strategy is actually coded up in such a way that it is particularly likely to do interesting things like picking up an object.

It does seem like this approach will need something else in order to scale to more advanced capabilities, especially hierarchical tasks -- for example, you'll never have an example of picking up a napkin, getting it wet, and wiping down a table. But perhaps we can iterate the process, where after we learn how to grasp and push, we start collecting data again using grasping and pushing instead of random low-level actions. Safe exploration would become more of a concern here.

An Introduction to Deep Reinforcement Learning *(Vincent Francois-Lavet et al)*

Quantifying Generalization in Reinforcement Learning *(Karl Cobbe)*

### Applications

AlphaFold: Using AI for scientific discovery *(Andrew Senior et al)*: This post briefly describes AlphaFold, a system that does well at the protein folding problem. They train neural networks that can be used to evaluate how good a particular proposed protein structure is. This can be used to guide an evolutionary search that repeatedly replaces pieces of a protein structure with new protein fragments from a generative model. Alternatively, it can be used to construct a loss function for entire proteins, which allows us to use gradient descent to optimize the protein structure.

**Rohin's opinion:** The approach here is to learn heuristics that can guide a top-level search algorithm, which is the sort of thing that I think deep learning is particularly well poised to improve right now. Note that gradient descent is a top-level search algorithm here, because a separate loss function is constructed *for every protein*, rather than having a single loss function that is used to train a network that works on all proteins. However, unlike other applications such as SMT solvers, the top-level search algorithm does not have some sort of "correctness" guarantee.

*Copyright © 2018 Rohin Shah, All rights reserved.*