The Alignment Newsletter #11: 06/18/18

by rohinmshah, 18th Jun 2018


Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

It turns out the survey link in the last email was broken; sorry about that, and thanks to everyone who reported it to me. Here's the correct link.

Highlights


Learning to Follow Language Instructions with Adversarial Reward Induction (Dzmitry Bahdanau et al): Adversarial Goal-Induced Learning from Examples (AGILE) is a way of training an agent to follow instructions. The authors consider a 5x5 gridworld environment with colored shapes that the agent can manipulate. The agent is given an instruction in a structured domain-specific language. Each instruction can correspond to many goal states -- for example, the instruction corresponding to "red square south of the blue circle" has many different goal states, since only the relative orientation of the shapes matters, not their absolute positions.

The key idea is to learn two things simultaneously -- an encoding of what the agent needs to do, and a policy that encodes how to do it -- and to use these two modules to train each other. The "what" is encoded by a discriminator that classifies (state, instruction) pairs as either being a correct goal state or not, and the "how" is encoded by a policy. They assume they have some human-annotated goal states for instructions. The discriminator is then trained with supervised learning, where the positive examples are the human-annotated goal states, and the negative examples are states that the policy achieves during training (which are usually failures). The policy is trained using A3C with a reward function that is 1 if the discriminator says the state is more likely than not to be a goal state, and 0 otherwise. Of course, if the policy actually achieves the goal state, there is no way of knowing this apart from the discriminator -- so by default all of the states that the policy achieves (including goal states) are treated as negative examples for the discriminator. This leads to the discriminator getting slightly worse over time as the policy becomes better, since it is incorrectly told that certain states are not goal states. To fix this issue, the authors drop from the negative examples the top 25% of states achieved by the policy that have the highest probability of being a goal state (according to the discriminator).
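To make the two moving parts concrete, here is a minimal sketch of the thresholded reward and the "drop the top 25%" filter, assuming the discriminator's goal probabilities are already available as plain floats. The function names `policy_reward` and `filter_negatives` are hypothetical and not taken from the paper, and the actual implementation may differ in details such as batching.

```python
from typing import List, Sequence


def policy_reward(goal_probability: float, threshold: float = 0.5) -> float:
    """Return 1 if the discriminator rates the state as more likely than not a goal."""
    return 1.0 if goal_probability > threshold else 0.0


def filter_negatives(
    states: Sequence,
    goal_probabilities: Sequence[float],
    drop_fraction: float = 0.25,
) -> List:
    """Drop the highest-scoring fraction of policy states.

    The states that remain are the ones that would be used as negative
    examples when training the discriminator.
    """
    if len(states) != len(goal_probabilities):
        raise ValueError("states and goal_probabilities must have equal length")
    n_drop = int(len(states) * drop_fraction)
    # Indices ordered from lowest to highest goal probability.
    order = sorted(range(len(states)), key=lambda i: goal_probabilities[i])
    # Keep the lowest-scoring states, restored to their original order.
    keep = sorted(order[: len(states) - n_drop])
    return [states[i] for i in keep]
```

For example, with four policy states scored 0.9, 0.1, 0.4 and 0.6, the state scored 0.9 is dropped and the other three are kept as negatives.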

The authors compare AGILE against A3C with the true reward function (i.e. the reward function implied by a perfect discriminator) and find that AGILE actually performs better, implying that the inaccuracy of the discriminator actually helps with learning. The authors hypothesize that this is because when the discriminator incorrectly rewards non-goal states, it is actually providing useful reward shaping that rewards progress towards the goal, leading to faster learning. Note though that A3C with an auxiliary reward prediction objective performed best. They have several other experiments that look at individual parts of the system.

My opinion: I like the idea of separating "what to do" from "how to do it", since the "what to do" is more likely to generalize to new circumstances. Of course, this can also be achieved by learning a reward function, which is one way to encode "what to do". I'm also happy to see progress on the front of learning what humans want where we can take advantage of adversarial training that leads to a natural curriculum -- this has been key in many systems, most notably AlphaZero.

I'm somewhat surprised that dropping the top 25% of states ranked highly by the discriminator works. I would have guessed that states that are "near" the goal states might be misclassified by the discriminator, and the mistake will never be fixed because those states will always be in the top 25% and so will never show up as negative examples. I don't know whether I should expect this problem to show up in other environments, or whether there's a good reason to expect it won't happen.

I'm also surprised at the results from one of their experiments. In this experiment, they trained the agent in the normal environment, but then made red squares immovable in the test environment. This only changes the dynamics, and so the discriminator should work just as well (about 99.5% accuracy). The policy performance tanks (from 98% to 52%), as you'd expect when changing dynamics, but if you then finetune the policy, it only gets back to 69% success. Given that the discriminator should be just as accurate, you'd expect the policy to get back to 98% success. Part of the discrepancy is that some tasks become unsolvable when red squares are immovable, but they say that this is a small effect. My hypothesis is that before finetuning, the policy is very certain of what to do, and so doesn't explore enough during finetuning, and can't learn new behaviors effectively. This would mean that if they instead retrained the policy starting from a random initialization, they'd achieve better performance (though likely requiring many more samples).

A general model of safety-oriented AI development (Wei Dai): A general model for developing safe powerful AI systems is to have a team of humans and AIs, which continually develops and adds more AIs to the team, while inductively preserving alignment.

My opinion: I'm glad this was finally written down -- I've been calling this the "induction hammer" and have used it a lot in my own thinking. Thinking about this sort of a model, and in particular what kinds of properties we could best preserve inductively, has been quite helpful for me.

AGI Strategy - List of Resources: Exactly what it sounds like.

Technical AI alignment

Agent foundations

Counterfactual Mugging Poker Game (Scott Garrabrant): This is a variant of counterfactual mugging, in which an agent doesn't take the action that is locally optimal, because that would provide information in the counterfactual world where one aspect of the environment was different that would lead to a large loss in that setting.

My opinion: This example is very understandable and very short -- I haven't summarized it because I don't think I can make it any shorter.

Weak arguments against the universal prior being malign (X4vier): In an earlier post, Paul Christiano has argued that if you run Solomonoff induction and use its predictions for important decisions, most of your probability mass will be placed on universes with intelligent agents that make the right predictions so that their predictions will influence your decisions, and then use that influence to manipulate you into doing things that they value. This post makes a few arguments that this wouldn't actually happen, and Paul responds to the arguments in the comments.

My opinion: I still have only a fuzzy understanding of what's going on here, so I'm going to abstain from an opinion on this one.

Prerequisites: What does the universal prior actually look like?

Learning human intent

Learning to Follow Language Instructions with Adversarial Reward Induction (Dzmitry Bahdanau et al): Summarized in the highlights!

An Efficient, Generalized Bellman Update For Cooperative Inverse Reinforcement Learning (Dhruv Malik, Malayandi Palaniappan et al): Previously, Cooperative Inverse Reinforcement Learning (CIRL) games were solved by reducing them to a POMDP with an exponentially-sized action space, and then solving with POMDP algorithms that are exponential in the size of the action space, leading to a doubly-exponential algorithm. This paper leverages the fact that the human has perfect information to create a modified Bellman update that still computes the optimal policy, but no longer requires an exponential action space. The modified Bellman update works with the human's policy, and so we can now swap in more accurate models of the human, including e.g. noisy rationality (whereas previously the human had to be exactly optimal). They show huge speedups in experiments, and discuss some interesting qualitative behavior that arises out of CIRL games -- for example, sometimes the human waits instead of making progress on the task, because it is a good signal to the robot of what the human wants.

My opinion: I'm excited by this improvement, since now we can actually solve non-trivial CIRL games -- one of the games they solve has around 10 billion states. With this we can run experiments with real humans, which seems really important, and the paper does mention a very preliminary pilot study run with real humans.

Prerequisites: Cooperative Inverse Reinforcement Learning

Learning a Prior over Intent via Meta-Inverse Reinforcement Learning (Kelvin Xu et al): For complex rewards, such as reward functions defined on pixels, standard IRL methods require a large number of demonstrations. However, many tasks are very related, and so we should be able to leverage demonstrations from one task to learn rewards for other tasks. This naturally suggests that we use meta learning. The authors adapt MAML to work with maximum entropy IRL (which requires differentiating through the MaxEnt IRL gradient). They evaluate their approach, called MandRIL, on a navigation task whose underlying structure is a gridworld, but the state is represented as an image so that the reward function is nonlinear and requires a convnet.

My opinion: In one of the experiments, the baseline of running IRL from scratch performed second best, beating out two other methods of meta-learning. I'd guess that this is because both MandRIL and standard IRL benefit from assuming the MaxEnt IRL distribution over trajectories (which I believe is how the demonstrations were synthetically generated), whereas the other two meta-learning baselines do not have any such assumption, and must learn this relationship.

Imitating Latent Policies from Observation (Ashley D. Edwards et al): Typically in imitation learning, we assume that we have access to demonstrations that include the actions that the expert took. However, in many realistic settings we only have access to state observations (e.g. driving videos). In this setting, we could still infer a reward function and then use reinforcement learning (RL) to imitate the behavior, but this would require a lot of interaction with the environment to learn the dynamics of the environment. Intuitively, even demonstrations with only states and no actions should give us a lot of information about the dynamics -- if we can extract this information, then we would need much less environment interaction during RL. (For example, if you watch a friend play a video game, you only see states, not actions; yet you can infer a lot about the game rules and gameplay.)

The key idea is that each action probably causes similar effects on different states. So, they create a model with hidden action nodes z, and use the state observations to learn a policy P(z | s) and dynamics s' = g(s, z) (they assume deterministic dynamics). This is done end-to-end with neural nets, but essentially the net is looking at the sequence of states and figuring out how to assign actions z to each s (this is P(z | s)), such that we can learn a function g(s, z) that outputs the next observed state s'. Once this is trained, intuitively g(s, z) will already have captured most of the dynamics, and so now we only require a small number of environment actions to figure out how the true actions a correspond to the hidden actions z -- concretely, we train a model P(a | s, z). Then, in any state s, we first choose the most likely hidden action z* according to P(z | s), and then the most likely action a according to P(a | s, z*).
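The final two-step action selection can be sketched as follows, assuming for illustration that both learned distributions are available as lookup tables (dicts) rather than neural nets. The names `latent_policy`, `action_remap` and `select_action` are hypothetical.

```python
from typing import Dict, Hashable, Tuple

Distribution = Dict[Hashable, float]


def argmax_key(dist: Distribution) -> Hashable:
    """Return the key with the highest probability."""
    return max(dist, key=dist.get)


def select_action(
    state: Hashable,
    latent_policy: Dict[Hashable, Distribution],
    action_remap: Dict[Tuple[Hashable, Hashable], Distribution],
) -> Hashable:
    """Pick z* = argmax_z P(z | s), then a = argmax_a P(a | s, z*)."""
    z_star = argmax_key(latent_policy[state])
    return argmax_key(action_remap[(state, z_star)])
```

In this tabular picture, `latent_policy[s]` plays the role of P(z | s), which is learned from state-only demonstrations, while `action_remap[(s, z)]` plays the role of P(a | s, z), which is the only part that needs environment interaction.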

My opinion: The intuition behind this method makes a lot of sense to me, but I wish the experiments were clearer in showing how the method compares to other methods. They show that, on Cartpole and Acrobot, they can match the results of behavioral cloning with 50,000 state-action pairs using 50,000 state observations and 200 environment interactions, but I don't know if behavioral cloning actually needed that many state-action pairs. Similarly, I'm not sure how much environment interaction would be needed if you inferred a reward function but not the dynamics, since they don't compare against such a method. I'm also unclear on how hard it is to assign transitions to latent actions -- they only test on MDPs with at most 3 actions, and it's plausible to me that with more actions it becomes much harder to figure out which hidden action a state transition should correspond to.

Preventing bad behavior

Worrying about the Vase: Whitelisting (TurnTrout): It's really hard to avoid negative side effects because explicitly listing out all possible side effects the agent should avoid would be far too expensive. The issue is that we're trying to build a blacklist of things that can't be done, and that list will never be complete, and so some bad things will still happen. Instead, we should use whitelists, because if we forget to add something to the whitelist, that only limits the agent, it doesn't lead to catastrophe. In this proposal, we assume that we have access to the agent's ontology (in current systems, this might be the output of an object detection system), and we operationalize an "effect" as the transformation of one object into another (i.e. previously the AI believed an object was most likely an A, and now it believes it is most likely a B). We then whitelist allowed transformations -- for example, it is allowed to transform a carrot into carrot slices. If the agent causes any transformations not on the whitelist (such as "transforming" a vase into a broken vase), it incurs a negative reward. We also don't have to explicitly write down the whitelist -- we can provide demonstrations of acceptable behavior, and any transitions in these demonstrations can be added to the whitelist. The post and paper have a long list of considerations on how this would play out in a superintelligent AI system.
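As a rough illustration of the mechanics, the sketch below builds a whitelist from demonstrations and penalizes observed transformations that are not on it. It assumes object classes are plain strings, that an object keeping its class does not count as an effect, and that each violation costs a fixed penalty; the function names and the penalty value are assumptions made for this example, not details from the post.

```python
from typing import Iterable, Set, Tuple

# A transformation is (class believed before, class believed after).
Transition = Tuple[str, str]


def whitelist_from_demonstrations(
    demonstrations: Iterable[Iterable[Transition]],
) -> Set[Transition]:
    """Collect every transformation seen in demonstrations of acceptable behavior."""
    return {t for demo in demonstrations for t in demo}


def side_effect_penalty(
    observed: Iterable[Transition],
    whitelist: Set[Transition],
    penalty: float = 1.0,
) -> float:
    """Return a negative reward for each observed transformation off the whitelist."""
    violations = [
        (before, after)
        for before, after in observed
        # Assumption: an unchanged object is not an "effect".
        if before != after and (before, after) not in whitelist
    ]
    return -penalty * len(violations)
```

With a single demonstration containing ("carrot", "carrot slices"), slicing a carrot is free, while ("vase", "broken vase") incurs the penalty. Forgetting to whitelist something only makes the agent more conservative, which is the safe-by-default property the proposal relies on.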

My opinion: Whitelisting seems like a good thing to do, since it is safe by default. (Computer security has a similar principle of preferring to whitelist instead of blacklist.) I was initially worried that we'd have the problems of symbolic approaches to AI, where we'd have to enumerate far too many transitions for the whitelist in order to be able to do anything realistic, but since whitelisting could work on learned embedding spaces, and the whitelist itself can be learned from demonstrations, this could be a scalable method. I'm worried that it presents generalization challenges -- if you are distinguishing between different colors of tiles, to encode "you can paint any tile" you'd have to whitelist transitions (redTile -> blueTile), (blueTile -> redTile), (redTile -> yellowTile) etc. Those won't all be in the demonstrations. If you are going to generalize there, how do you not generalize (redLight -> greenLight) to (greenLight -> redLight) for an AI that controls traffic lights? On another note, I personally don't want to assume that we can point to a part of the architecture as the AI's ontology. I hope to see future work address these challenges!

Handling groups of agents

Adaptive Mechanism Design: Learning to Promote Cooperation (Tobias Baumann et al)

Multi-Agent Deep Reinforcement Learning with Human Strategies (Thanh Nguyen et al)

Interpretability


Neural Stethoscopes: Unifying Analytic, Auxiliary and Adversarial Network Probing (Fabian B. Fuchs et al)

Explaining Explanations: An Approach to Evaluating Interpretability of Machine Learning (Leilani H. Gilpin et al)

Miscellaneous (Alignment)

A general model of safety-oriented AI development (Wei Dai): Summarized in the highlights!

To Trust Or Not To Trust A Classifier (Heinrich Jiang, Been Kim et al): The confidence scores given by a classifier (be it logistic regression, SVMs, or neural nets) are typically badly calibrated, and so it is hard to tell whether or not we should trust our classifier's prediction. The authors propose that we compute a trust score to tell us how much to trust the classifier's prediction, computed from a training set of labeled datapoints. For every class, they filter out some proportion of the data points, which removes outliers. Then, the trust score for a particular test point is the ratio of (distance to nearest non-predicted class) to (distance to predicted class). They have theoretical results showing that a high trust score means that the classifier likely agrees with the Bayes-optimal classifier, as well as empirical results showing that this method does better than several baselines for determining when to trust a classifier. One cool thing about this method is that it can be done with any representation of the input data points -- they find that working with the activations of deeper layers of a neural net improves the results.
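The ratio at the heart of the trust score can be sketched in a few lines. This is a simplified illustration that assumes Euclidean distance and measures the distance to a class as the distance to its nearest training point; it omits the outlier-filtering step described above, and the names `distance_to_class` and `trust_score` are hypothetical.

```python
import math
from typing import Dict, Hashable, List, Sequence

Point = Sequence[float]


def distance_to_class(x: Point, points: List[Point]) -> float:
    """Distance from x to the nearest training point of one class."""
    return min(math.dist(x, p) for p in points)


def trust_score(
    x: Point,
    predicted_label: Hashable,
    points_by_class: Dict[Hashable, List[Point]],
) -> float:
    """Ratio of (distance to nearest non-predicted class) to (distance to predicted class)."""
    d_pred = distance_to_class(x, points_by_class[predicted_label])
    d_other = min(
        distance_to_class(x, pts)
        for label, pts in points_by_class.items()
        if label != predicted_label
    )
    if d_pred == 0:
        # x coincides with a training point of the predicted class.
        return math.inf
    return d_other / d_pred
```

A score well above 1 means the test point sits much closer to the predicted class than to any other class; a score below 1 means some other class is closer, which is a reason to distrust the prediction. The points can live in any representation, including the activations of a deeper layer.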

My opinion: I'm a big fan of trying to understand when our AI systems work well, and when they don't. However, I'm a little confused by this -- ultimately the trust score is just comparing the given classifier with a nearest neighbor classifier. Why not just use the nearest neighbor classifier in that case? This paper is a bit further out of my expertise than I'd like to admit, so perhaps there's an obvious answer I'm not seeing.

Podcast: Astronomical Future Suffering and Superintelligence with Kaj Sotala (Lucas Perry)

Near-term concerns

Adversarial examples

Defense Against the Dark Arts: An overview of adversarial example security research and future research directions (Ian Goodfellow)

AI strategy and policy

AGI Strategy - List of Resources: Summarized in the highlights!

Accounting for the Neglected Dimensions of AI Progress (Fernando Martinez-Plumed et al)

Artificial Intelligence and International Affairs: Disruption Anticipated (Chatham House)

India's National Strategy for Artificial Intelligence

AI capabilities

Reinforcement learning

Self-Imitation Learning (Junhyuk Oh et al)

Deep learning

Neural scene representation and rendering (S. M. Ali Eslami, Danilo J. Rezende et al)

Improving Language Understanding with Unsupervised Learning (Alec Radford et al)

Meta learning

Unsupervised Meta-Learning for Reinforcement Learning (Abhishek Gupta et al)

Bayesian Model-Agnostic Meta-Learning (Taesup Kim et al)

News


Research Scholars Programme: From the website: "The Future of Humanity Institute is launching a Research Scholars Programme, likely to start in October 2018. It is a selective, two-year research programme, with lots of latitude for exploration as well as significant training and support elements. We will offer around six salaried positions to early-career researchers who aim to answer questions that shed light on the big-picture questions critical to humanity’s wellbeing. We are collecting formal applications to the programme from now until 11 July, 2018."

Announcing the second AI Safety Camp (Anne Wissemann): I forgot to mention last week that the second AI safety camp will be held Oct 4-14 in Prague.

Human-aligned AI Summer School: The first Human-aligned AI Summer School will be held in Prague from 2nd to 5th August, with a focus on “learning from humans” (in particular, IRL and models of bounded rationality). Applications are open till July 14, but may close sooner if spots are filled up.