[AN #73]: Detecting catastrophic failures by learning how agents tend to break

Rohin Shah

Find all Alignment Newsletter resources here. In particular, you can sign up, or look through this spreadsheet of all summaries that have ever been in the newsletter. I'm always happy to hear feedback; you can send it to me by replying to this email.

Audio version here (may not be up yet).

Highlights

Rigorous Agent Evaluation: An Adversarial Approach to Uncover Catastrophic Failures (Jonathan Uesato, Ananya Kumar, Csaba Szepesvari et al) (summarized by Nicholas): An important problem in safety-critical domains is accurately estimating slim probabilities of catastrophic failures: one in a million is very different from one in a billion. A standard Monte Carlo approach requires millions or billions of trials to find a single failure, which is prohibitively expensive. This paper proposes using agents from earlier in the training process to provide signals for a learned failure probability predictor. For example, with a Humanoid robot, failure is defined as the robot falling down. A neural net is trained on earlier agents to predict the probability that the agent will fall down from a given state. To evaluate the final agent, states are importance-sampled based on how likely the neural network believes they are to cause failure. This relies on the assumption that the failure modes of the final agent are similar to some failure mode of earlier agents. Overall, the approach reduces the number of samples required to accurately estimate the failure probability by multiple orders of magnitude.

Nicholas's opinion: I am quite excited about the focus on preventing low likelihood catastrophic events, particularly from the standpoint of existential risk reduction. The key assumption in this paper, that earlier in training the agent will fail in related ways but more frequently, seems plausible to me and in line with most of my experience training neural networks, and the experiments demonstrate a very large increase in efficiency.

I’d be interested to see theoretical analysis of what situations would make this assumption more or less likely in the context of more powerful future agents. For example, one situation where the failure modes might be distinct later in training is if an agent learns how to turn on a car, which then makes states where the agent has access to a car have significantly higher likelihood of catastrophic failures than they did before.

Technical AI alignment

Learning human intent

AI Alignment Podcast: Synthesizing a human’s preferences into a utility function (Lucas Perry and Stuart Armstrong) (summarized by Rohin): Stuart Armstrong's agenda (AN #60) involves extracting partial preferences from a human and synthesizing them together into an adequate utility function. Among other things, this podcast goes into the design decisions underlying the agenda:

First, why even have a utility function? In practice, there are many pressures suggesting that maximizing expected utility is the "right" thing to do -- if you aren't doing this, you're leaving value on the table. So any agent that isn't maximizing a utility function will want to self-modify into one that is using a utility function, so we should just use a utility function in the first place.

Second, why not defer to a long reflection process, as in Indirect Normativity, or some sort of reflectively stable values? Stuart worries that such a process would lead to us prioritizing simplicity and elegance, but losing out on something of real value. This is also why he focuses on partial preferences: that is, our preferences in "normal" situations, without requiring such preferences to be extrapolated to very novel situations. Of course, in any situation where our moral concepts break down, we will have to extrapolate somehow (otherwise it wouldn't be a utility function) -- this presents the biggest challenge to the research agenda.

Full toy model for preference learning (Stuart Armstrong) (summarized by Rohin): This post applies Stuart's general preference learning algorithm to a toy environment in which a robot has a mishmash of preferences about how to classify and bin two types of objects.

Rohin's opinion: This is a nice illustration of the very abstract algorithm proposed before; I'd love it if more people illustrated their algorithms this way.

Forecasting

AlphaStar: Impressive for RL progress, not for AGI progress (orthonormal) (summarized by Nicholas): This post argues that while it is impressive that AlphaStar can build up concepts complex enough to win at StarCraft, it is not actually developing reactive strategies. Rather than scouting what the opponent is doing and developing a new strategy based on that, AlphaStar just executes one of a predetermined set of strategies. This is because AlphaStar does not use causal reasoning, and that keeps it from beating any of the top players.

Nicholas's opinion: While I haven’t watched enough of the games to have a strong opinion on whether AlphaStar is empirically reacting to its opponents' strategies, I agree with Paul Christiano’s comment that in principle causal reasoning is just one type of computation that should be learnable.

This discussion also highlights the need for interpretability tools for deep RL so that we can have more informed discussions on exactly how and why strategies are decided on.

Addendum to AI and Compute (Girish Sastry et al) (summarized by Rohin): Last year, OpenAI wrote (AN #7) that since 2012, the amount of compute used in the largest-scale experiments has been doubling every 3.5 months. This addendum to that post analyzes data from 1959-2012, and finds that during that period the trend was a 2-year doubling time, approximately in line with Moore's Law, and not demonstrating any impact of previous "AI winters".

Rohin's opinion: Note that the post is measuring compute used to train models, which was less important in past AI research (e.g. it doesn't include Deep Blue), so it's not too surprising that we don't see the impact of AI winters.

Etzioni 2016 survey (Katja Grace) (summarized by Rohin): Oren Etzioni surveyed 193 AAAI fellows in 2016 and found that 67.5% of them expected that ‘we will achieve Superintelligence’ someday, but in more than 25 years. Only 7.5% thought we would achieve it sooner than that.

AI strategy and policy

GPT-2: 1.5B Release (Irene Solaiman et al) (summarized by Rohin): Along with the release of the last and biggest GPT-2 model, OpenAI explains their findings with their research in the time period that the staged release bought them. While GPT-2 can produce reasonably convincing outputs that are hard to detect and can be finetuned for e.g. generation of synthetic propaganda, so far they have not seen any evidence of actual misuse.

Rohin's opinion: While it is consistent to believe that OpenAI was just generating hype since GPT-2 was predictably not going to have major misuse applications, and this has now been borne out, I'm primarily glad that we started thinking about publication norms before we had dangerous models, and it seems plausible to me that OpenAI was also thinking along these lines.

Other progress in AI

Reinforcement learning

AlphaStar: Grandmaster level in StarCraft II using multi-agent reinforcement learning (AlphaStar Team) (summarized by Nicholas): AlphaStar (AN #43), DeepMind’s StarCraft II AI, has now defeated a top professional player and is better than 99.8% of players. While previous versions were limited to only a subset of the game, it now plays the full game and has limitations on how quickly it can take actions similar to top human players. It was trained initially via supervised learning on human players and then afterwards trained using RL.

A challenge in learning StarCraft via self-play is that strategies exhibit non-transitivity: Stalker units beat Void Rays, Void Rays beat Immortals, but Immortals beat Stalkers. This can lead to training getting stuck in cycles. In order to avoid this, they set up a League of exploiter agents and main agents. The exploiter agents train only against the current iteration of main agents, so they can learn specific counter-strategies. The main agents then train against a mixture of current main agents, past main agents, and exploiters, prioritizing opponents that they have a lower win rate against.

Nicholas's opinion: I think this is a very impressive display of how powerful current ML methods are at a very complex game. StarCraft poses many challenges that are not present in board games such as chess and go, such as limited visibility, a large state and action space, and strategies that play out over very long time horizons. I found it particularly interesting how they used imitation learning and human examples to avoid trying to find new strategies by exploration, but then attained higher performance by training on top of that.

I do believe progress on games is becoming less correlated with progress on AGI. Most of the key innovations in this paper revolve around the League training, which seems quite specific to StarCraft. In order to continue making progress towards AGI, I think we need to focus on being able to learn in the real world on tasks that are not as easy to simulate.

Deep Dynamics Models for Dexterous Manipulation (Anusha Nagabandi et al) (summarized by Flo): For hard robotic tasks like manipulating a screwdriver, model-free RL requires large amounts of data that are hard to generate with real-world hardware. So, we might want to use the more sample-efficient model-based RL, which has the additional advantage that the model can be reused for similar tasks with different rewards. This paper uses an ensemble of neural networks to predict state transitions, and plans by sampling trajectories for different policies. With this, they train a real anthropomorphic robot hand to be able to rotate two balls in its hand somewhat reliably within a few hours. They also trained for the same task in a simulation and were able to reuse the resulting model to move a single ball to a target location.

Flo's opinion: The videos look impressive, even though the robot hand still has some clunkiness to it. My intuition is that model-based approaches can be very useful in robotics and similar domains, where the randomness in transitions can easily be approximated by Gaussians. In other tasks where transitions follow more complicated, multimodal distributions, I am more sceptical.

Integrating Behavior Cloning and Reinforcement Learning for Improved Performance in Sparse Reward Environments (Vinicius G. Goecks et al) (summarized by Zach): This paper contributes to the effort of combining imitation and reinforcement learning to train agents more efficiently. The current difficulty in this area is that imitation and reinforcement learning proceed under rather different objectives which presents a significant challenge to updating a policy learned from a pure demonstration. A major portion of this difficulty stems from the use of so-called "on-policy" methods for training which require a significant number of environment interactions to be effective. In this paper, the authors propose a framework dubbed "Cycle-of-Learning" (CoL) that allows for the off-policy combination of imitation and reinforcement learning. This allows the two approaches to be combined much more directly which grounds the agent's policy in the expert demonstrations while simultaneously allowing for RL to fine-tune the policy. The authors show that CoL is an improvement over the current state of the art by testing their algorithm in several environments and performing an ablation study.

Zach's opinion: At first glance, it would seem as though the idea of using an off-policy method to combine imitation and reinforcement learning is obvious. However, the implementation is complicated by the fact that we want the value functions being estimated by our agent to satisfy the optimality condition for the Bellman equation. Prior work, such as Hester et al. 2018 uses n-step returns to help pre-training and make use of on-policy methods when performing RL. What I like about this paper is that they perform an ablation study and show that simple sequencing of imitation learning and RL algorithms isn't enough to get good performance. This means that combining the imitation and reinforcement objectives into a single loss function is providing a significant improvement over other methods.

News

Researcher / Writer job (summarized by Rohin): This full-time researcher / writer position would involve half the time working with Convergence on x-risk strategy research and the other half with Normative on environmental and climate change analysis documents.

11