The Alignment Newsletter #2: 04/16/18

Rohin Shah

Highlights

OpenAI Charter: In their words, this is "a charter that describes the principles we use to execute on OpenAI’s mission".

My opinion: I'm very excited by this charter. It's a good sign suggesting that we can get the important actors to cooperate in building aligned AI, and in particular to avoid a competitive race. Key quote: "if a value-aligned, safety-conscious project comes close to building AGI before we do, we commit to stop competing with and start assisting this project".

Lessons Learned Reproducing a Deep Reinforcement Learning Paper (Matthew Rahtz): It's exactly what the title says. There were a lot of points that I can't easily summarize.

My opinion: If you do deep RL research regularly, you probably won't get too much out of it (though you might still get some handy tips on things you can do with TensorFlow), but I think everyone else should read it to get a more concrete sense of what deep RL research actually looks like and to be able to communicate more effectively with deep RL researchers.

A voting theory primer for rationalists and 5 voting pathologies: lesser names of Moloch (Jameson Quinn): Voting theory, or social choice theory, studies voting methods, which take a set of preferences over outcomes from voters and decide which outcome should occur. The field is littered with impossibility results and difficult problems that are hard to resolve, but there are voting methods that do reasonably well in practice. The second post elaborates more on the 5 problems that are hard to resolve.
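To make "a voting method takes preferences and decides an outcome" concrete, here is a minimal sketch of two standard methods, plurality and Borda count, run on the same ballots. The ballots and function names are made up for illustration and are not taken from the posts; ties are broken alphabetically purely for simplicity.

```python
from collections import Counter

def plurality(ballots):
    """Each ballot is a ranking (best first); most first-place votes wins."""
    counts = Counter(ballot[0] for ballot in ballots)
    # sorted() makes ties resolve alphabetically (an arbitrary choice).
    return max(sorted(counts), key=lambda c: counts[c])

def borda(ballots):
    """Each ballot gives n-1 points to its top choice, n-2 to the next, and so on."""
    scores = Counter()
    for ballot in ballots:
        n = len(ballot)
        for rank, candidate in enumerate(ballot):
            scores[candidate] += n - 1 - rank
    return max(sorted(scores), key=lambda c: scores[c])

# Hypothetical electorate: 9 voters ranking three outcomes.
ballots = (
    [["A", "B", "C"]] * 4
    + [["B", "C", "A"]] * 3
    + [["C", "B", "A"]] * 2
)

print("plurality winner:", plurality(ballots))
print("borda winner:", borda(ballots))
```

The two methods pick different winners from identical preferences (A has the most first-place votes, but B is ranked first or second by everyone), which is the kind of disagreement that makes preference aggregation hard.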

My opinion: A major challenge for an AI would be to figure out how to aggregate preferences across humans. This is not necessarily a problem that we have to solve immediately -- it's possible that we build an AI that mimics what we do initially and then develops its own theory of voting. However, if you are doing any kind of reward learning (such as inverse reinforcement learning), you will have to confront this problem head on. This article points out a lot of problems that we would have to be aware of in this case. The solutions seem less likely to transfer, because they are optimized for a different scenario (such as presidential elections).

Read more: Will MacAskill's PhD thesis, Normative Uncertainty, considers how to combine different moral theories that have different preferences over world states.

Technical AI alignment

Problems

Clarifying “AI alignment” (Paul Christiano): As Paul uses the term, "AI alignment" refers only to the problem of figuring out how to build an AI that is trying to do what humans want. In particular, an AI can be aligned but still make mistakes due to e.g. an incorrect understanding of when it's okay to interrupt humans. While it is important to also make sure that an AI doesn't make catastrophic mistakes, this is less urgent than the problem of aligning the AI in the first place.

My opinion: It's short and readable, though if you nodded along with the summary, then maybe you won't get much out of it.

Iterated distillation and amplification

Can corrigibility be learned safely? (Wei Dai): We hope that iterated distillation and amplification is sufficient to learn to be corrigible, because the subagents in the amplification procedure are given simple enough tasks that the tasks are not "attacks" that lead to incorrigible behavior. However, if we are forced to break down all of our tasks into simple subtasks that are solved without access to any of the surrounding context, then we will end up with an AI that interprets us very literally and is unable to use "common sense", which could lead to incorrigible behavior. You might think that we could get around this by learning about corrigible behavior from the environment (e.g. by observing how humans interact with each other), but then we have the same problem: if you learn "large" chunks from the environment you could be attacked, and if you learn "small" chunks you will be too literal.

My opinion: I agree with the criticism that breaking tasks down into subtasks causes you to lose a lot of capability. I'm mostly confused by the analysis in terms of "small" and "large" chunks and whether there are "attacks" or not, so I don't want to comment on it, but I think this is addressing an important open question and I'd like to see people try to understand what's going on and have more diversity of thought here.

The limits of corrigibility (Stuart Armstrong): There are situations in which an AI would be able (or even required) to take some actions that can influence the human's values. For example, if a billionaire asked the AI to help him figure out how to prioritize between charities to give to, based on its answer the billionaire could move towards effective altruism, or focus more on administrative costs, or help animals instead of humans -- and all of these are "compatible" with the billionaire's current values. In such a situation, it's unclear what "corrigibility" means. It seems like the only "corrigible" behavior here is to explicitly figure out what the billionaire values, and then help him optimize those values -- but then in order to get corrigible behavior, we need to learn human values, which we know is hard.

My opinion: This feels to me like a misunderstanding of (Paul's version of) corrigibility. The post takes a perspective where it looks at the outcomes of an AI acting in the world, whereas my understanding is that corrigibility is also about the motivation underlying an AI's choices, regardless of what outcomes actually happen.

Two guarantees (Paul Christiano): The "minimum viable product" of AI alignment research would be an AI that satisfies two guarantees -- first, that it achieves good average-case behavior (the performance guarantee), and second, that it achieves reasonable worst-case behavior (the control guarantee). There's then some discussion of how we might establish these guarantees inductively about amplification.
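One way to read the difference between the two guarantees is as an average-case bound versus a worst-case bound. The notation below is an illustrative assumption and does not come from the post: \pi is the AI's policy, D is the distribution of situations it faces, U(\pi, x) is how well it does in situation x, U^{*} is the best achievable average, and \epsilon and c are tolerances.

```latex
% Illustrative notation only; none of these symbols appear in the post.
\begin{align*}
\text{Performance guarantee (average case):} \quad
  & \mathbb{E}_{x \sim D}\big[\, U(\pi, x) \,\big] \;\ge\; U^{*} - \epsilon \\
\text{Control guarantee (worst case):} \quad
  & \min_{x} \; U(\pi, x) \;\ge\; -c
\end{align*}
```

The first line can hold even if the AI does terribly on a few rare inputs; the second line rules that out, but says nothing about how useful the AI is on average.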

My opinion: I like this framing of what guarantees we want to achieve. Hopefully we can apply this to other AI systems as well.

Interpretability

Programmatically Interpretable Reinforcement Learning (Abhinav Verma et al): This work uses program synthesis in order to get interpretable reinforcement learning policies. Some of you can probably guess that I'm very excited by this paper :P As with most program synthesis techniques, they define a space of possible programs (policies), and then search through the space for the program that achieves the highest reward. Since they are using program synthesis, they can take advantage of standard tricks such as sketching. They also train a deep RL agent and use the agent to give feedback to the program synthesis algorithm, so that the algorithm produces the program whose outputs are closest to the outputs of the deep RL policy. They evaluate on TORCS (a racecar simulator) and find that the policy does almost as well as deep RL. However, it has a few major advantages over deep RL. Since it is a program, it is much more interpretable -- a human can actually look at the resulting program and understand it (and hence the title of the paper). It is also possible to use formal verification methods to prove properties about the program (whereas neural nets are often too large for these techniques to work). But perhaps most importantly, restricting your class of functions to the space of (small) programs is often a very useful inductive bias, and it is no different in this case -- the learned programs perform much better than deep RL when run on a new unseen track, showing good generalization.
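The "define a space of programs, then search it using feedback from a trained agent" idea can be sketched in a few lines. This is a toy illustration under heavy assumptions, not the paper's algorithm: `oracle_policy` is a hand-written stand-in for a deep RL policy, the program space is just clipped proportional steering rules, and the search is brute-force enumeration.

```python
import itertools
import math

def oracle_policy(offset):
    """Hypothetical stand-in for a trained deep RL policy: steer back to centre."""
    return math.tanh(-1.5 * offset)

def make_program(gain, limit):
    """One candidate programmatic policy: proportional steering, clipped."""
    def program(offset):
        return max(-limit, min(limit, -gain * offset))
    # Keep a human-readable form of the program for inspection.
    program.source = f"steer = clip(-{gain} * offset, -{limit}, {limit})"
    return program

def distance_to_oracle(program, observations):
    """Mean squared difference between the program's and the oracle's outputs."""
    return sum((program(o) - oracle_policy(o)) ** 2 for o in observations) / len(observations)

def synthesize(observations, gains, limits):
    """Enumerate the (tiny) program space and keep the closest imitator."""
    candidates = (make_program(g, l) for g, l in itertools.product(gains, limits))
    return min(candidates, key=lambda p: distance_to_oracle(p, observations))

# Made-up observations: lateral offsets from the track centre.
observations = [i / 10 for i in range(-20, 21)]
GAINS = [0.5, 1.0, 1.5, 2.0]
LIMITS = [0.5, 1.0]

best = synthesize(observations, GAINS, LIMITS)
print(best.source)
```

The result is a one-line rule a human can read directly, which is the interpretability benefit described above; real program synthesis would use a far richer program space and a smarter search than enumeration.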

My opinion: I want people to read this paper, but I'm a bit biased here -- I've thought about this area a lot, and I expect that we'll need to have inductive biases of the form "something like an algorithm/program" to get good generalization from AI, and this paper is some evidence towards that. It's also the first paper I know of that makes RL-learned policies look like simple programs.

Near-term concerns

Adversarial examples

Fortified Networks: Improving the Robustness of Deep Networks by Modeling the Manifold of Hidden Representations (Alex Lamb et al)

AI strategy and policy

OpenAI Charter: Summarized in the highlights!

The Advent of Huang's Law (Bharath Ramsundar): See Import AI

The Deep Roots and Long Branches of Chinese Technonationalism (Evan Feigenbaum): See Import AI

China Now Has the Most Valuable AI Startup in the World: See Import AI

My opinion: It's a short, interesting piece, and it's got some actual numbers and quotes from Xu Li (one of the co-founders of the startup, SenseTime), so you should read it.

AI capabilities

Reinforcement learning

Latent Space Policies for Hierarchical Reinforcement Learning (Tuomas Haarnoja, Kristian Hartikainen et al)

Lessons Learned Reproducing a Deep Reinforcement Learning Paper (Matthew Rahtz): Summarized in the highlights!

Deep learning

Spherical CNNs (Taco Cohen, Mario Geiger et al)

Learning Unsupervised Learning Rules (Luke Metz et al)

News

MIRI's April 2018 Newsletter (Rob Bensinger): Lots of links to things MIRI has done, and some links to other people's work as well.
