Alignment Newsletter #14


Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

I've created a public database of almost all of the papers I've summarized in the Alignment Newsletter! Most of the entries will have all of the data I put in the emails.


One-Shot Imitation from Watching Videos (Tianhe Yu and Chelsea Finn): Can we get a robot to learn a task by watching a human do it? This is very different from standard imitation learning. First, we want to do it with a single demonstration, and second, we want to do it by watching a human -- that is, we're learning from a video of a human, not a trajectory where the robot actions are given to us. To start, consider how we could do this if we had demonstrations from a teleoperated robot. In this case, we do actually have demonstrations in the form of trajectories, so normal imitation learning techniques (behavioral cloning in this case) work fine. We can then take the behavioral cloning loss and use it with MAML to learn, from a large dataset of tasks and demonstrations, how to perform a new task given a single demonstration. But this still requires the demonstration to be collected by teleoperating the robot. What if we want to learn from a video of a human demonstrating? The authors propose learning a loss function that, given the human video, provides a loss from which gradients can be calculated to update the policy. Note that at training time there are still teleoperation demonstrations, so the hard work of learning how to perform tasks is done then. At test time, the loss function inferred from the human video is primarily used to identify which objects to manipulate.

My opinion: This is cool, it actually works on a real robot, and it deals with the issue that a human and a robot have different action spaces.

Prerequisites: Some form of meta-learning (ideally MAML).
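To make the "behavioral cloning loss plus MAML" step in the summary above concrete, here is a minimal toy sketch. It is not the paper's method: the policy is a single scalar gain (action = w * state), each "task" is a hypothetical target gain, and the learned loss for human videos is left out entirely. All function names and hyperparameters are my own assumptions.

```python
import random

def bc_loss_grad(w, demo):
    """Behavioral cloning loss (mean squared error) and its gradient
    for the toy policy action = w * state."""
    n = len(demo)
    loss = sum((w * s - a) ** 2 for s, a in demo) / n
    grad = sum(2 * (w * s - a) * s for s, a in demo) / n
    return loss, grad

def adapt(w, demo, inner_lr):
    """Inner loop: one gradient step on a single demonstration."""
    _, grad = bc_loss_grad(w, demo)
    return w - inner_lr * grad

def sample_demo(gain, rng, n=5):
    """A hypothetical teleoperated demonstration: (state, action) pairs."""
    states = [rng.uniform(-1, 1) for _ in range(n)]
    return [(s, gain * s) for s in states]

def meta_train(steps=2000, inner_lr=0.5, outer_lr=0.05, seed=0):
    """Outer loop: choose w so that ONE inner step on a new demo works well."""
    rng = random.Random(seed)
    w = 0.0
    for _ in range(steps):
        gain = rng.uniform(1.0, 3.0)        # sample a task (assumed range)
        demo = sample_demo(gain, rng)       # demo used for adaptation
        held_out = sample_demo(gain, rng)   # second demo of the same task
        w_adapted = adapt(w, demo, inner_lr)
        _, outer_grad = bc_loss_grad(w_adapted, held_out)
        # Chain rule through the inner step: d(w_adapted)/dw = 1 - inner_lr * hessian
        hessian = sum(2 * s * s for s, _ in demo) / len(demo)
        w -= outer_lr * outer_grad * (1 - inner_lr * hessian)
    return w
```

At test time you would call `adapt(meta_train(), new_demo, inner_lr)` with a single demonstration of a new task. In the paper's setting, the inner-loop loss would instead be the learned loss computed from the human video.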

Capture the Flag: the emergence of complex cooperative agents (Max Jaderberg, Wojciech M. Czarnecki, Iain Dunning et al): DeepMind has trained FTW (For The Win) agents that can play Quake III Arena Capture The Flag from raw pixels, given only the signal of whether they win or not. They identify three key ideas that enable this -- population based training (instead of self play), learning an internal reward function, and operating at two timescales (enabling better use of memory). Their ablation studies show that all of these are necessary, and in particular the full agent even outperforms population based training with manual reward shaping. The trained agents can cooperate and compete with a wide range of agents (thanks to the population based training), including humans.

But why are these three techniques so useful? This isn't as clear, but I can speculate. Population based training works well because the agents are trained against a diversity of collaborators and opponents, which can fix the issue of instability that afflicts self-play. Operating at two timescales gives the agent a better inductive bias. They say that it enables the agent to use memory more effectively, but my story is that it lets it do something more hierarchical, where the slow RNN makes "plans", while the fast RNN executes on those plans. Learning an internal reward function flummoxed me for a while: it really seemed like that should not outperform manual reward shaping. But then I found out that the internal reward function is computed from the game points screen, not from the full trajectory. This gives it a really strong inductive bias (since the points screen provides really good features for defining reward functions) that allows it to quickly learn an internal reward function that's more effective than manual reward shaping. It's still somewhat surprising, since it's still learning this reward function from the pixels of the points screen (I assume), but more believable.

My opinion: This is quite impressive, since they are learning from the binary win-loss reward signal. I'm surprised that the agents generalized well enough to play alongside humans -- I would have expected that to cause a substantial distributional shift preventing good generalization. They only had 30 agents in their population, so it seems unlikely a priori that this would induce a distribution that included humans. Perhaps Quake III is simple enough strategically that there aren't very many viable strategies, and most strategies are robust to having slightly worse allies? That doesn't seem right though.

DeepMind did a lot of different things to analyze what the agents learned and how they are different from humans -- check out the paper for details. For example, they showed that the agents are much better at tagging (shooting) at short ranges, while humans are much better at long ranges.
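For readers who haven't seen population based training, the core loop is an exploit/explore step applied to a population of agents that are trained in parallel. The sketch below is a generic version of that step, not DeepMind's implementation: the dictionary layout, the bottom-quarter cutoff, and the 20% perturbation are all assumptions chosen for illustration.

```python
import random

def pbt_step(population, rng, perturb=0.2):
    """One exploit/explore step of population based training.

    population: list of dicts with keys 'weights', 'lr', 'score'
    (a hypothetical layout). Agents are modified in place.
    """
    ranked = sorted(population, key=lambda agent: agent["score"], reverse=True)
    cutoff = max(1, len(ranked) // 4)
    top, bottom = ranked[:cutoff], ranked[-cutoff:]
    for agent in bottom:
        parent = rng.choice(top)
        # Exploit: copy the parameters of a better-performing agent.
        agent["weights"] = list(parent["weights"])
        # Explore: randomly perturb the copied hyperparameter.
        agent["lr"] = parent["lr"] * rng.choice([1 - perturb, 1 + perturb])
    return population
```

Between calls to `pbt_step`, every agent would keep training by playing with and against other members of the population, which is where the diversity of teammates and opponents comes from.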

Technical AI alignment

Technical agendas and prioritization

An introduction to worst-case AI safety (Tobias Baumann): Argues that people with suffering-focused ethics should focus on "worst-case AI safety", which aims to find technical solutions to risks of AIs creating vast amounts of suffering (which would be much worse than extinction).

My opinion: If you have strongly suffering-focused ethics (unlike me), this seems mostly right. The post claims that suffering-focused AI safety should be more tractable than AI alignment, because it focuses on a subset of risks and only tries to minimize them. However, it's not necessarily the case that focusing on a simpler problem makes it easier to solve. It feels easier to me to figure out how to align an AI system to humans, or how to enable human control of an AI system, than to figure out all the ways in which vast suffering could happen, and solve each one individually. You can make an analogy to mathematical proofs and algorithms -- often, you want to try to prove a stronger statement than the one you are looking at, because when you use induction or recursion, you can rely on a stronger inductive hypothesis.

Learning human intent

One-Shot Imitation from Watching Videos (Tianhe Yu and Chelsea Finn): Summarized in the highlights!

Learning Montezuma’s Revenge from a Single Demonstration (Tim Salimans et al): Montezuma's Revenge is widely considered to be one of the hardest Atari games to learn, because the reward is so sparse -- it takes many actions to reach the first positive reward, and if you're using random exploration, it will take exponentially many actions (in N, the number of actions till the first reward) to find any reward. A human demonstration should make the exploration problem much easier. In particular, we can start the agent just before the end of the demonstration, and train it with RL until it gets at least as much score as the demonstration. Once it learns that, we can start it slightly earlier in the demonstration, and do it again. Repeating this, we eventually get an agent that can perform the whole demonstration from start to finish, and it takes time linear in the length of the demonstration. Note that the agent must be able to generalize a little bit to states "around" the human demonstration -- when it takes random actions it will eventually reach a state that is similar to a state it saw earlier, but not exactly the same, and it needs to generalize properly. It turns out that this works for Montezuma's Revenge, but not for other Atari games like Gravitar and Pitfall.

My opinion: Here, the task definition continues to be the reward function, and the human demonstration is used to help the agent effectively optimize the reward function. Such agents are still vulnerable to misspecified reward functions -- in fact, the agent discovers and exploits a bug in the emulator, which wouldn't have happened if it were only trying to imitate the human. I would still expect the agent to be more human-like than one trained with standard RL, since it only learns the environment near the human policy.
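The linear-versus-exponential claim in the summary above can be seen in a toy version of the backward curriculum. This is only an illustration under strong assumptions: the "game" is a hypothetical chain where exactly one action per state is correct and any mistake ends the episode, the demonstration is used only to provide reset points (state indices), and the "agent" is a lookup table rather than a neural net trained with RL.

```python
import random

def run_episode(policy, start, secret, rng, n_actions):
    """Play from `start` to the end of the chain.

    Returns (reward, tried): reward is 1 only if every action was correct;
    `tried` holds the random actions taken in states the policy didn't know.
    """
    tried = {}
    for state in range(start, len(secret)):
        action = policy.get(state)
        if action is None:
            action = rng.randrange(n_actions)  # random exploration
            tried[state] = action
        if action != secret[state]:
            return 0, tried                    # mistake: episode ends, no reward
    return 1, tried

def backward_curriculum(secret, n_actions, rng):
    """Start near the end of the demonstration and move the start point back."""
    policy, episodes = {}, 0
    for start in reversed(range(len(secret))):
        while True:
            episodes += 1
            reward, tried = run_episode(policy, start, secret, rng, n_actions)
            if reward == 1:
                policy.update(tried)           # keep what worked, then move earlier
                break
    return policy, episodes
```

With 4 actions and a chain of length 20, each start point needs about 4 episodes on average, so the whole curriculum takes on the order of 80 episodes, whereas random exploration from the initial state succeeds with probability 4^-20 per episode.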

Atari Grand Challenge (Vitaly Kurin): This is a website crowdsourcing human demonstrations for Atari games, which means that the dataset will be very noisy, with demonstrations from humans of vastly different skill levels. Perhaps this would be a good dataset to evaluate algorithms that aim to learn from human data?

Beyond Winning and Losing: Modeling Human Motivations and Behaviors Using Inverse Reinforcement Learning (Baoxiang Wang et al): How could you perform IRL without access to a simulator, or a model of the dynamics of the game, or the full human policy (only a set of demonstrations)? In this setting, as long as you have a large dataset of diverse human behavior, you can use Q-learning on the demonstrations to estimate a separate Q-function for each feature. Then, for a given set of demonstrations, you can infer the reward using a linear program that attempts to make all of the human actions optimal given the reward function. They define (manually) five features for World of Warcraft Avatar History (WoWAH) that correspond to different motivations and kinds of human behavior (hence the title of the paper) and infer the weights for those rewards. This isn't really an evaluation of the method, since there is no ground truth reward to compare against.

Preventing bad behavior

Overcoming Clinginess in Impact Measures (TurnTrout): In their previous post, TurnTrout proposed a whitelisting approach that required the AI not to cause any side effects that are not on the whitelist. One criticism was that it made the AI clingy: that is, the AI would also prevent any other agents in the world from causing non-whitelisted effects. In this post, they present a solution to the clinginess problem. As long as the AI knows all of the other agents in the environment, and their policies, the AI can be penalized for the difference in effects between its behavior and what the human(s) would have done. There's analysis in a few different circumstances, where it's tricky to get the counterfactuals exactly right. However, this sort of impact measure means that while the AI is punished for causing side effects itself, it can manipulate humans to perform those side effects on its behalf with no penalty. This appears to be a tradeoff in the impact measure framework -- either the AI will be clingy, where it prevents humans from causing prohibited side effects, or it could cause the side effects through manipulation of humans.

My opinion: With any impact measure approach, I'm worried that there is no learning of what humans care about. As a result I expect that there will be issues that won't be handled properly (similarly to how we don't expect to be able to write down a human utility function). In the previous post, this manifested as a concern for generalization ability, which I'm still worried about. I think the tradeoff identified in this post is actually a manifestation of this worry -- clinginess happens when your AI overestimates what sorts of side effects humans don't want to happen in general, while manipulation of humans happens when your AI underestimates what side effects humans don't want to happen (though with the restriction that only humans can perform these side effects).

Prerequisites: Worrying about the Vase: Whitelisting
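The clinginess-versus-manipulation tradeoff can be illustrated with a deliberately crude model. This is my own simplification, not the formalism from the post: each effect is tagged with the actor who directly caused it, a "clingy" penalty counts every non-whitelisted effect in the world, and an "attributed" penalty counts only those directly caused by the AI. The `Effect` type and both penalty functions are hypothetical.

```python
from typing import NamedTuple

class Effect(NamedTuple):
    name: str
    actor: str  # who directly caused the effect: "ai" or "human"

def clingy_penalty(effects, whitelist):
    """Penalize every non-whitelisted effect, whoever caused it."""
    return sum(1 for e in effects if e.name not in whitelist)

def attributed_penalty(effects, whitelist):
    """Penalize only non-whitelisted effects that the AI caused directly."""
    return sum(1 for e in effects
               if e.actor == "ai" and e.name not in whitelist)

whitelist = {"floor_cleaned"}

# The human breaks a vase on their own while the AI cleans the floor.
human_acts_freely = [Effect("floor_cleaned", "ai"), Effect("vase_broken", "human")]

# The AI talks the human into breaking the vase on its behalf.
ai_manipulates_human = [Effect("vase_broken", "human")]

# The AI breaks the vase itself.
ai_acts_directly = [Effect("vase_broken", "ai")]
```

Under `clingy_penalty`, the first scenario is penalized, so the AI has an incentive to stop the human. Under `attributed_penalty`, the first scenario is free, but so is the manipulation scenario, which is the loophole described in the summary.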

Game theory

Modeling Friends and Foes (Pedro A. Ortega et al): Multiagent scenarios are typically modeled using game theory. However, it is hard to capture the intuitive notions of "adversarial", "neutral" and "friendly" agents using standard game theory terminology. The authors propose that we model the agent and environment as having some prior mixed strategy, and then allow them to "react" by changing the strategies to get a posterior strategy, but with a term in the objective function for the change (as measured by the KL divergence). The sign of the environment's KL divergence term determines whether it is friendly or adversarial, and the magnitude determines the magnitude of friendliness or adversarialness. They show that there are always equilibria, and give an algorithm to compute them. They then show some experiments demonstrating that the notions of "friendly" and "adversarial" they develop actually do lead to behavior that we would intuitively call friendly or adversarial.

Some notes to understand the paper: while normally we think of multiagent games as consisting of a set of agents, in this paper there is an agent that acts, and an environment in which it acts (which can contain other agents). The objective function is neither minimized nor maximized -- the sign of the environment's KL divergence changes whether the stationary points are maxima or minima (which is why it can model both friendly and adversarial environments). There is only one utility function, the agent's utility function -- the environment is only modeled as responding to the agent, rather than having its own utility function.

My opinion: This is an interesting formalization of friendly and adversarial behavior. It feels somewhat weird to model the environment as having a prior strategy that it can then update. This has the implication that a "somewhat friendly" environment is unable to change its strategy to help the agent, even though it would "want" to, whereas when I think of a "somewhat friendly" environment, I think of a group of agents that share some of your goals but not all of them, so a limited amount of cooperation is possible. These feel quite different.
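To see how a signed KL term can produce both friendly and adversarial reactions, consider the standard closed form for a KL-regularized objective. Assuming the environment's term looks like E_Q[U] - (1/beta) * KL(Q || Q0) (this parametrization is my assumption and may differ from the paper's notation), the stationary point is Q(y) proportional to Q0(y) * exp(beta * U(y)). For beta > 0 this is a maximum of the objective, and for beta < 0 it is a minimum.

```python
import math

def react(prior, expected_utility, beta):
    """Posterior strategy Q(y) proportional to prior(y) * exp(beta * U(y)).

    prior: the environment's prior mixed strategy Q0 over its actions.
    expected_utility: the AGENT's expected utility for each environment action.
    beta: beta > 0 tilts toward actions that help the agent (friendly),
          beta < 0 tilts toward actions that hurt it (adversarial),
          beta = 0 leaves the prior unchanged (neutral).
    """
    weights = [p * math.exp(beta * u) for p, u in zip(prior, expected_utility)]
    total = sum(weights)
    return [w / total for w in weights]

prior = [0.5, 0.5]
utility = [1.0, 0.0]   # environment action 0 is better for the agent

friendly = react(prior, utility, beta=2.0)
adversarial = react(prior, utility, beta=-2.0)
neutral = react(prior, utility, beta=0.0)
```

Here `friendly` puts more mass on action 0 than the prior does, `adversarial` puts less, and `neutral` equals the prior. Note that only the agent's utility appears, matching the point above that the environment has no utility function of its own.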


This looks like that: deep learning for interpretable image recognition (Chaofan Chen, Oscar Li et al)


Towards Mixed Optimization for Reinforcement Learning with Program Synthesis (Surya Bhupatiraju, Kumar Krishna Agrawal et al): This paper proposes a framework in which policies are represented in two different ways -- as neural nets (the usual way) and as programs. To go from neural nets to programs, you use program synthesis (as done by VIPER and PIRL, both summarized in previous newsletters). To go from programs to neural nets, you use distillation (basically use the program to train the neural net with supervised training). Given these transformations, you can then work with the policy in either space. For example, you could optimize the policy in both spaces, using standard gradient descent in neural-net-space, and program repair in program-space. Having a program representation can be helpful in other ways too, as it makes the policy more interpretable, and more amenable to formal verification of safety properties.

My opinion: It is pretty nice to have a program representation. This paper doesn't delve into specifics (besides a motivating example worked out by hand), but I'm excited to see an actual instantiation of this framework in the future!

Near-term concerns

Adversarial examples

Adversarial Reprogramming of Neural Networks (Gamaleldin F. Elsayed et al)

AI strategy and policy

Shaping economic incentives for collaborative AGI (Kaj Sotala): This post considers how to encourage a culture of cooperation among AI researchers. Then, when researchers try to create AGI, this culture of cooperation may make it more likely that AGI is developed collaboratively, instead of with race dynamics, making it more likely to be safe. It specifically poses the question of what external economic or policy incentives could encourage such cooperation.

My opinion: I am optimistic about developing AGI collaboratively, especially through AI researchers cooperating. I'm not sure whether external incentives from government are the right way to achieve this -- it seems likely that such regulation would be aimed at the wrong problems if it originated from government and not from AI researchers themselves. I'm more optimistic about AI researchers developing guidelines and incentive structures themselves, which other researchers buy into voluntarily, and which might later get codified into law by governments or adopted by companies for their AI research.

An Overview of National AI Strategies (Tim Dutton): A short reference on the AI policies released by various countries.

My opinion: Reading through this, it seems that countries are taking quite different approaches towards AI. I don't know what to make of this -- are they acting close to optimally given their geopolitical situation (which must then vary a lot by country), or does no one know what's going on and as a result all of the strategies are somewhat randomly chosen? (Here by "randomly chosen" I mean that the strategies that one group of analysts would select are only weakly correlated with the strategies another group would select.) It could also be that the approaches are not actually that different.

Joint Artificial Intelligence Center Created Under DoD CIO (Sydney J. Freedberg Jr.)

AI capabilities

Reinforcement learning

Capture the Flag: the emergence of complex cooperative agents (Max Jaderberg, Wojciech M. Czarnecki, Iain Dunning et al): Summarized in the highlights!

Ranked Reward: Enabling Self-Play Reinforcement Learning for Combinatorial Optimization (Alexandre Laterre et al)

Procedural Level Generation Improves Generality of Deep Reinforcement Learning (Niels Justesen et al)