The Alignment Newsletter #3: 04/23/18

Rohin Shah

Highlights

Incomplete Contracting and AI Alignment (Dylan Hadfield-Menell et al): This paper explores an analogy between AI alignment and incomplete contracting. In human society, we often encounter principal-agent problems, where we want to align the incentives of the agent with those of the principal. In theory, we can do this with a "complete" contract, that is an enforceable contract that fully specifies the optimal behavior in every possible situation. Obviously in practice we cannot write such contracts, and so we end up using incomplete contracts instead. Similarly, in AI alignment, in theory we could perfectly align an AI with humans by imbuing it with the true human utility function, but in practice this is impossible -- we cannot consider every possible situation that could come up. The difference between the behavior implied by the reward function we write down and the utility function we actually want leads to misalignment. The paper then talks about several ideas from incomplete contracting and their analogues in AI alignment. The main conclusion is that our AI systems will have to learn and use a "common sense" understanding of what society will and will not sanction, since that is what enables humans to solve principal-agent problems (to the extent that we can).

My opinion: I'm excited to see what feels like quite a strong connection to an existing field of research. I especially liked the section about building in "common sense" (Section 5).

Understanding Iterated Distillation and Amplification: Claims and Oversight (William_S): The post introduces a distinction between flavors of iterated distillation and amplification -- whether the overseer is low bandwidth or high bandwidth. Let's think of IDA as building a deliberation tree out of some basic overseer. In the high bandwidth case, we can think of the overseer as a human who can think about a problem for 15 minutes, without access to the problem's context. However, there could be "attacks" on such overseers. In order to solve this problem, we can instead use low-bandwidth overseers, who only look at a sentence or two of text, and verify through testing that there are no attacks on such overseers. However, it seems much less clear that such an overseer would be able to reach high levels of capability.

My opinion: This is an excellent post that improved my understanding of Paul Christiano's agenda, which is not something I usually say about posts not written by Paul himself. I definitely have not captured all of the important ideas in my summary, so you should read it.

Prerequisites: Iterated Distillation and Amplification

Announcement: AI alignment prize round 2 winners and next round (cousin_it): The winners of the second round of the AI alignment prize have been announced! All of the winning entries have already been covered in this newsletter, except for the first place winner, "The Alignment Problem for History-Based Bayesian Reinforcement Learners". The deadline for the next iteration of the AI alignment prize is June 30, 2018.

Technical AI alignment

Problems

Implicit extortion (Paul Christiano): Explicit extortion occurs when an attacker makes an explicit threat to harm you if you don't comply with their demands. In contrast, in implicit extortion, the attacker always harms you if you don't do the thing that they want, which leads you to learn over time to do what the attacker wants. Implicit extortion seems particularly hard to deal with because you may not know it is happening.

My opinion: Implicit extortion sounds like a hard problem to solve, and the post argues that humans don't robustly solve it. I'm not sure whether this is a problem we need to solve in order to get good outcomes -- if you can detect that implicit extortion is happening, you can take steps to avoid being extorted, and so it seems that a successful implicit extortion attack would have to be done by a very capable adversary that knows how to carry out the attack so that it isn't detected. Perhaps we'll be in the world where such adversaries don't exist.

Technical agendas and prioritization

Incomplete Contracting and AI Alignment (Dylan Hadfield-Menell et al): Summarized in the highlights!

Iterated distillation and amplification

Understanding Iterated Distillation and Amplification: Claims and Oversight (William_S): Summarized in the highlights!

My confusions with Paul's Agenda (Vaniver)

Agent foundations

Computing an exact quantilal policy (Vadim Kosoy)

Reward learning

Shared Autonomy via Deep Reinforcement Learning (Siddharth Reddy et al): In shared autonomy, an AI system assists a human to complete a task. The authors implement shared autonomy in a deep RL framework by simply extending the state with the control input from the human, and then learning a policy that chooses actions given the extended state. They show that the human-AI team performs better than either one alone in the Lunar Lander environment.

My opinion: Shared autonomy is an interesting setting because the human is still necessary in order to actually perform the task, whereas in typical reward learning settings, once you have learned the reward function and the AI is performing well, the human does not need to be present in order to execute a good policy.
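
To make the "extend the state with the human's control input" idea concrete, here is a minimal sketch. It is not the authors' implementation: the function names, the one-hot encoding of the human's action, and the toy linear scoring rule standing in for a learned deep RL policy are all assumptions made for illustration.

```python
from typing import List, Sequence


def extend_state(observation: Sequence[float], human_action: int,
                 num_actions: int) -> List[float]:
    """Append a one-hot encoding of the human's control input to the observation.

    The one-hot encoding is an assumption; any fixed encoding would do.
    """
    if not 0 <= human_action < num_actions:
        raise ValueError("human_action out of range")
    one_hot = [0.0] * num_actions
    one_hot[human_action] = 1.0
    return list(observation) + one_hot


def greedy_action(extended_state: Sequence[float],
                  weights: Sequence[Sequence[float]]) -> int:
    """Hypothetical stand-in for a learned policy: one linear score per action."""
    scores = [sum(w * x for w, x in zip(row, extended_state)) for row in weights]
    return max(range(len(scores)), key=scores.__getitem__)
```

Because the policy sees the human's input as part of its state, it can learn to defer to the human in some states and override them in others, which is why the human has to stay in the loop at execution time.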

Handling groups of agents

Multi-winner Voting: a question of Alignment (Jameson Quinn)

On the Convergence of Competitive, Multi-Agent Gradient-Based Learning (Eric Mazumdar et al)

Near-term concerns

Security

Adversarial Attacks Against Medical Deep Learning Systems (Samuel G. Finlayson et al)

AI strategy and policy

Game Changers: AI Part III, AI and Public Policy (Subcommittee on Information Technology)

AI capabilities

Reinforcement learning

Evolved Policy Gradients (Rein Houthooft et al): In this meta-learning approach for reinforcement learning, the outer optimization loop proposes a new loss function for the inner loop to optimize (in contrast to e.g. MAML, where the outer optimization leads to better initializations for the policy parameters). The outer optimization is done using evolution strategies, while the inner optimization is stochastic gradient descent. The authors see good results on generalization to out-of-distribution tasks, which other algorithms such as RL^2 don't achieve.
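
The two-loop structure can be sketched on a toy problem. Everything below is an assumption made for illustration, not the paper's setup: the "policy" is a single scalar theta, the learned loss is the one-parameter family (theta - phi)^2, the inner loop uses exact gradients rather than stochastic ones, and the outer loop uses an antithetic (mirrored) evolution strategies estimator to reduce variance.

```python
import random

TARGET = 3.0  # hypothetical optimum of the toy task


def true_return(theta: float) -> float:
    """Task return; only the outer loop ever looks at this."""
    return -(theta - TARGET) ** 2


def inner_loop(phi: float, steps: int = 50, lr: float = 0.1) -> float:
    """Gradient descent on the learned loss (theta - phi)^2, starting from theta = 0."""
    theta = 0.0
    for _ in range(steps):
        grad = 2.0 * (theta - phi)  # derivative of the learned loss w.r.t. theta
        theta -= lr * grad
    return theta


def outer_loop(iterations: int = 200, population: int = 20,
               sigma: float = 0.1, lr: float = 0.05, seed: int = 0) -> float:
    """Evolution strategies over the loss parameter phi (mirrored sampling)."""
    rng = random.Random(seed)
    phi = 0.0
    for _ in range(iterations):
        grad_estimate = 0.0
        for _ in range(population):
            eps = rng.gauss(0.0, 1.0)
            plus = true_return(inner_loop(phi + sigma * eps))
            minus = true_return(inner_loop(phi - sigma * eps))
            grad_estimate += (plus - minus) * eps
        grad_estimate /= 2.0 * population * sigma
        phi += lr * grad_estimate  # ascend the return achieved by the inner loop
    return phi
```

The point of the sketch is that the inner learner never sees the true return: it only follows the loss that the outer loop hands it, and the outer loop is judged by how well the inner learner ends up doing.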

On Learning Intrinsic Rewards for Policy Gradient Methods (Zeyu Zheng et al): To get better performance on deep RL tasks, we can learn an "intrinsic reward" (intuitively, a shaped reward function), in contrast to the "extrinsic reward" which is the true reward function associated with the task. The policy is trained to maximize the sum of the intrinsic and extrinsic reward, and at the same time the intrinsic reward is optimized to lead to good performance on the extrinsic reward.

My opinion: I'm somewhat surprised that this method works -- it seems like the proposed algorithm does not leverage any new information that was not already present in the extrinsic reward function, and I don't see any obvious reasons why learning an intrinsic reward would lead to a good inductive bias that lets you learn faster. If anyone has an explanation I'd love to hear it!
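
For readers who want to see the shape of the two coupled updates, here is a toy sketch on a two-armed bandit. It is only an illustration under several assumptions: the extrinsic rewards, the per-arm intrinsic reward table eta, the exact (expectation-based) policy gradient, and the finite-difference estimate of the intrinsic reward's gradient are all choices made here, and the last of these stands in for the gradient computation used in the paper.

```python
import math

EXTRINSIC = [0.0, 1.0]  # hypothetical true reward of each arm


def softmax(logits):
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]


def expected_reward(logits, rewards):
    return sum(p * r for p, r in zip(softmax(logits), rewards))


def policy_step(logits, eta, lr=1.0):
    """One exact policy-gradient step on the sum of extrinsic and intrinsic reward."""
    probs = softmax(logits)
    combined = [e + i for e, i in zip(EXTRINSIC, eta)]
    baseline = sum(p * r for p, r in zip(probs, combined))
    # d(expected combined reward)/d(logit_a) = p_a * (r_a - baseline)
    return [l + lr * p * (r - baseline)
            for l, p, r in zip(logits, probs, combined)]


def intrinsic_step(logits, eta, lr=0.5, delta=1e-4):
    """Nudge eta so that the policy AFTER its next update earns more extrinsic reward."""
    grads = []
    for a in range(len(eta)):
        up, down = list(eta), list(eta)
        up[a] += delta
        down[a] -= delta
        j_up = expected_reward(policy_step(logits, up), EXTRINSIC)
        j_down = expected_reward(policy_step(logits, down), EXTRINSIC)
        grads.append((j_up - j_down) / (2 * delta))
    return [e + lr * g for e, g in zip(eta, grads)]


def train(iterations=50):
    logits, eta = [0.0, 0.0], [0.0, 0.0]
    for _ in range(iterations):
        eta = intrinsic_step(logits, eta)   # intrinsic reward update
        logits = policy_step(logits, eta)   # policy update on the summed reward
    return logits, eta
```

In this toy the intrinsic reward simply learns to exaggerate the gap between the arms, which speeds up the policy update; whether anything analogous explains the results in the paper is not something the sketch can tell us.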

Deep learning

DAWNBench: This is a collection of statistics for time and compute costs, both for training and inference, for various common models and benchmarks.

My opinion: It's worth skimming through the page to get a sense of concrete numbers for various benchmarks used in the ML community.

Large scale distributed neural network training through online distillation (Rohan Anil et al)

Capsules for Object Segmentation (Rodney LaLonde et al)

Machine learning

Introducing TensorFlow Probability (Josh Dillon et al): TensorFlow now also supports probabilistic programming.

My opinion: Probabilistic programming is becoming more and more important in machine learning, and is in some sense a counterpart to deep learning -- it lets you have probability distributions over parameters (as opposed to the point estimates provided by neural nets), but inference is often intractable and must be performed approximately, and even then you are often limited to smaller models than with deep learning. It's interesting to have both of these provided by a single library -- hopefully we'll see applications that combine both approaches to get the best of both worlds. In particular, probabilistic programming feels more principled and amenable to theoretical analysis, which may make it easier to reason about safety.
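
As a small illustration of "a distribution over parameters" versus "a point estimate", here is a sketch that deliberately does not use TensorFlow Probability or any other library. It assumes a coin with unknown bias and a Beta prior, which is one of the rare conjugate cases where the posterior is available exactly; most realistic models need the approximate inference mentioned above. The function names are made up for this example.

```python
from typing import Sequence, Tuple


def point_estimate(flips: Sequence[int]) -> float:
    """Maximum-likelihood estimate of the heads probability: a single number."""
    return sum(flips) / len(flips)


def beta_posterior(flips: Sequence[int], prior_a: float = 1.0,
                   prior_b: float = 1.0) -> Tuple[float, float]:
    """Exact posterior Beta(a, b) over the heads probability under a Beta prior."""
    heads = sum(flips)
    tails = len(flips) - heads
    return prior_a + heads, prior_b + tails


def beta_mean_and_variance(a: float, b: float) -> Tuple[float, float]:
    """Mean and variance of a Beta(a, b) distribution."""
    mean = a / (a + b)
    variance = (a * b) / ((a + b) ** 2 * (a + b + 1.0))
    return mean, variance
```

With the flips [1, 1, 1, 0], the point estimate is 0.75 and says nothing about how sure we should be, while the posterior under a uniform prior is Beta(4, 2), whose variance shrinks as more flips come in.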

Deep Probabilistic Programming Languages: A Qualitative Study (Guillaume Baudart): This is an overview paper of deep probabilistic programming languages, giving examples of how to use them and considering their pros and cons.

My opinion: I read this after writing the summary for TensorFlow Probability, and it talks about the advantages and tradeoffs between deep learning and PPLs in much more detail than I did there, so if that was interesting I'd recommend reading this paper too. It did seem pretty accessible but I used to do research with PPLs so I'm not the best judge of its accessibility.

AGI theory

Believable Promises (Douglas_Reay)

Critiques

Artificial Intelligence — The Revolution Hasn’t Happened Yet (Michael Jordan): There is a lot of hype at the moment around AI, particularly around creating AI systems that have human intelligence, since the thrill (and fear) of creating human intelligence in silicon causes overexuberance and excessive media attention. However, we actually want to create AI systems that can help us improve our lives, often by doing things that humans are not capable of. In order to accomplish this, it is likely better to work directly on these problems, since human-like intelligence is neither necessary nor sufficient to build such systems. However, as with all new technologies, there are associated challenges and opportunities with these AI systems, and we are currently at risk of not seeing these because we are too focused on human intelligence in particular.

My opinion: There certainly is a lot of hype around both putting human intelligence in silicon and the risks that surround such an endeavor. Even though I focus on such risks, I agree with Jordan that these are overhyped in the media and we would benefit from having more faithful coverage of them. I do disagree on some specific points. For example, he says that human-imitative AI is not sufficient to build some AI systems such as self-driving cars, but why couldn't an AI with human intelligence just do whatever humans would do to build self-driving cars? (I can think of answers, such as "we don't know how to give the AI system access to all the data that humans have access to", but I wish he had engaged more with this argument.) I do agree with the overall conclusion that in the near future humans will make progress on building such systems, and not by trying to give the systems "human intelligence". I also suspect that we disagree either on how close we are to human-imitative AI, or at what point it is worth it to start thinking about the associated risks, but it's hard to tell more from the article.

Miscellaneous (Capabilities)

Talk to Books: See Import AI.

News

Announcement: AI alignment prize round 2 winners and next round (cousin_it): Summarized in the highlights!
