[AN #100]: What might go wrong if you learn a reward function while acting

Rohin Shah

Newsletter #100 (!!)

Alignment Newsletter is a weekly publication with recent content relevant to AI alignment around the world. Find all Alignment Newsletter resources here. In particular, you can look through this spreadsheet of all summaries that have ever been in the newsletter.

Audio version here (may not be up yet).

HIGHLIGHTS

Pitfalls of learning a reward function online (Stuart Armstrong et al) (summarized by Rohin): It can be dangerous to learn the metric that you are trying to optimize: if you don't set it up correctly, you may end up incentivizing the agent to "update in a particular direction" in the metric learning for the sake of future optimization (a point previously made in Towards Interactive Inverse Reinforcement Learning). This paper analyzes the problems that can arise when an agent simultaneously learns a reward function, and optimizes that reward function.

The agent may have an incentive to "rig" the reward learning process, such that it finds a reward that is easy to optimize. For example, consider a student Sandra who must figure out the deadline and evaluation criteria for a project from a teacher Trisha. Sandra expects that if she asks Trisha when the deadline is, she will say that the deadline is later this week. So, Sandra might cleverly ask, "Is the project due next week, or the week after", to which Trisha might respond "next week". In this way, Sandra can rig the deadline-learning process in order to obtain a more favorable deadline.

Worse, in such scenarios the need to rig the learning process can destroy value for every reward function you are considering. For example, let's suppose that if Trisha couldn't be manipulated, Sandra's optimal policy would be to start the project today, regardless of when the actual deadline is. However, given that Trisha can be manipulated, Sandra will spend today manipulating Trisha into setting a later deadline -- an action that seems clearly suboptimal from the perspective of any fixed deadline. The paper describes this as sacrificing reward with certainty.

To avoid such situations, we need unriggable learning processes, that is, ones where at all times, the expected final learned reward (deadline) is independent of the agent's (Sandra's) policy. This unriggability property is nearly equivalent to the property of uninfluenceability, in which we must be able to posit some background variables in the environment such that the learning process can be said to be "learning" these variables. Technically, an unriggable process need not be uninfluenceable, though it usually is (see the paper for details).
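To make the definition concrete, here is a minimal sketch in Python of what checking unriggability could look like for the Sandra example. The representation (a dict from policy to a distribution over the finally learned deadline), the policy names, and the probabilities in the second process are assumptions made for illustration; they are not the paper's formalism.

```python
def is_unriggable(process, tol=1e-9):
    """Return True if every policy induces the same distribution over learned deadlines.

    process: dict mapping policy name -> dict mapping deadline -> probability.
    """
    dists = list(process.values())
    outcomes = set().union(*dists)  # every deadline that any policy can lead to
    first = dists[0]
    return all(
        abs(d.get(o, 0.0) - first.get(o, 0.0)) <= tol
        for d in dists
        for o in outcomes
    )


# Trisha can be manipulated: the learned deadline depends on how Sandra asks.
RIGGABLE = {
    "ask_openly": {"this_week": 1.0},
    "ask_leading": {"next_week": 1.0},
}

# Hypothetical alternative: the deadline was fixed in advance, so Sandra's
# expectation of what she will learn is the same whichever way she asks.
UNRIGGABLE = {
    "ask_openly": {"this_week": 0.5, "next_week": 0.5},
    "ask_leading": {"this_week": 0.5, "next_week": 0.5},
}

print(is_unriggable(RIGGABLE))    # False
print(is_unriggable(UNRIGGABLE))  # True
```

Note that this check only compares the distribution over learned deadlines across policies, which is exactly why it cannot detect the "shuffling" described in the next paragraph.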

However, these properties only constrain the expectation over environments of the final reward distribution: they don't prevent the agent from somehow shuffling around reward functions to be matched with suitable environments. For example, without knowing which projects are easy or hard, Sandra could manipulate Trisha into giving early deadlines for easy projects, and late deadlines for hard projects, in a manner that preserved the distribution over early and late deadlines; this would satisfy the unriggable property (and probably also the uninfluenceable property, depending on the exact formalization).

The authors demonstrate these problems in a simple gridworld example. They also point out that there's a simple way to make any learning process uninfluenceable: choose a specific policy π that gathers information about the reward, and then define the new learning process to be "whatever the original learning process would have said if you executed π".

Rohin's opinion: I would explain this paper's point somewhat differently than the paper does. Consider an AI system in which we build in a prior over rewards and an update rule, and then have it act in the world. At the end of the trajectory, it is rewarded according to the expected reward of the trajectory under the inferred posterior over rewards. Then, the AI system is incentivized to choose actions under which the resulting posterior is easy to maximize.

This doesn't require the reward function to be ambiguous; it just requires that the update rule isn't perfect. For example, imagine that Alice has a real preference for apples over bananas, and you use the update rule "if Alice eats an apple, infer that she likes apples; if Alice eats a banana, infer that she likes bananas". The robot finds it easier to grasp the rigid apple, and so can get higher expected reward in the worlds where Alice likes apples. If you train a robot in the manner above, then the robot will learn to throw away the bananas, so that Alice's only choice is an apple (that we assume she then eats), allowing the robot to "infer" that Alice likes apples, which it can then easily maximize. This sort of problem could happen in most current reward learning setups, if we had powerful enough optimizers.
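The incentive in the apple/banana example can be sketched in a few lines of Python. Everything numeric here is an assumption chosen for illustration (the achievable rewards, the 90% chance that Alice picks an apple when offered both), as are the policy names; only the update rule itself comes from the example above.

```python
# Reward the robot can achieve under each inferred preference (assumed numbers):
# the rigid apple is easy to grasp, the banana is not.
ACHIEVABLE_REWARD = {"apples": 1.0, "bananas": 0.6}

# Assumed: Alice really prefers apples and picks one 90% of the time if both are offered.
P_ALICE_PICKS_APPLE = 0.9


def inferred_preference(policy):
    """Distribution over what the naive update rule infers, given the robot's policy."""
    if policy == "offer_both":
        return {"apples": P_ALICE_PICKS_APPLE, "bananas": 1 - P_ALICE_PICKS_APPLE}
    if policy == "discard_bananas":
        # Alice can only eat an apple, so the rule always infers "likes apples".
        return {"apples": 1.0, "bananas": 0.0}
    if policy == "discard_apples":
        return {"apples": 0.0, "bananas": 1.0}
    raise ValueError(f"unknown policy: {policy}")


def expected_learned_reward(policy):
    """Expected reward, scored under whichever preference ends up being inferred."""
    inferred = inferred_preference(policy)
    return sum(prob * ACHIEVABLE_REWARD[pref] for pref, prob in inferred.items())


policies = ["offer_both", "discard_bananas", "discard_apples"]
best = max(policies, key=expected_learned_reward)
print(best)  # discard_bananas
```

Under these assumptions the robot does strictly better by removing Alice's choice than by observing it, even though Alice's preference was never ambiguous.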

It seems to me that the problem is that you are training the actor, but not training the update rule, and so the actor learns to "trick" the update rule. Instead, it seems like we should train both. This is kind of what happens with assistance games / CIRL (AN #69), in which you train a policy to maximize expected reward under the prior, and so the policy is incentivized to take the best information gathering actions (which, if you squint, is like "training to update well"), and to maximize what it thinks is the true reward. Of course, if your prior / update rule within the game are misspecified, then bad things can happen. See also Stuart's reactions here and here, as well as my comments on those posts.

TECHNICAL AI ALIGNMENT

INTERPRETABILITY

Evaluating Explainable AI: Which Algorithmic Explanations Help Users Predict Model Behavior? (Peter Hase et al) (summarized by Robert): In this paper the authors perform user tests on 5 different model-agnostic interpretability methods: LIME, Anchor, Decision Boundary, Prototype Model and a Composite model (LIME, Anchor and Decision Boundary). The use cases they test are a tabular dataset predicting income, and a movie-review dataset predicting sentiment of the review from a single sentence.

Their experimental setup consists of 2 tests: forward prediction and counterfactual prediction. In forward prediction, the user is shown 16 examples of inputs and corresponding outputs and explanations, and then must predict the model’s output on new inputs (without the explanation, which often gives away the answer). In counterfactual prediction, after seeing 16 examples, the user is given an input-output-explanation triple, and then must predict how the output changes for a specific perturbation of the input.

Throughout the results they use a significance threshold of p < 0.05 (they don't use Bonferroni corrections). Their study has responses from 32 different students who'd taken at least 1 computer science course, with some screened out for outliers or low accuracy during training. There are approximately 200 individual predictions for each method/dataset-type combination, and each method/prediction-type combination.

Overall, their results show that only LIME (Local Interpretable Model-agnostic Explanation) helps improve performance with statistical significance on the tabular dataset across both prediction settings, and only the Prototype model in counterfactual prediction across both datasets. No other result was statistically significant. The improvement in accuracy for the statistically significant results is around 10% (from 70% to 80% in the Tabular dataset with LIME, and 63% to 73% for Prototype in counterfactual prediction).

They also showed that users' ratings of an explanation method didn't correlate in a statistically significant way with the improvement that method gave to their predictions.

Robert's opinion: I'm happy a paper like this exists, because I think this kind of work is crucial in evaluating whether interpretability methods we're building are actually useful. I'm not surprised by the results, because this hasn't been done rigorously before, so researchers have never had any idea whether their method has produced good explanations or not.

The study is weakened by the low sample size, which makes many of the p-values not significant. My intuition says a few more of the methods would produce statistically significant positive results in one of the domains/prediction settings if the sample size was bigger, but it seems like some settings (forward prediction, and textual data) are very hard to improve, with none of the methods getting a better improvement in performance than 5.7% (which had a p-value of 0.197).

A really interesting point is the lack of strong correlation between user-preference and performance improvement. This could be explained by the fact that most of the methods are ineffective at performance improvement, but it seems plausible (to me) that it could hold even if some methods were effective: If the model behaviour being explained can't be explained cleanly, then methods which do explain the behaviour might produce messy and confusing (but true) explanations and hence get lower ratings from users than methods which give clean and clear (but false) explanations. I think this stems from the problem of a lack of definition of what exactly the goal is for these interpretation methods. Without a goal in mind, it's impossible to measure whether the method achieves this goal. I think working towards some form of quantifiable measurement is useful particularly for comparing methods as, if this study's evidence is anything to go by, asking humans to evaluate the model's output might not be the most useful evaluation.

Towards Interpretable Reinforcement Learning Using Attention Augmented Agents (Alexander Mott et al) (summarized by Robert): In this paper the authors train a reinforcement learning agent with a soft attention module built into it. The attention module forms a bottleneck between the visual input and the network choosing the next action, which forces the model to learn to attend to only important parts of the scene. This means they can visualise which parts of the input the model thinks are important, as those are the parts of the input that the model is attending to. The queries to the attention model are determined by a top level recurrent network, without input from the current image, so act as a form of "top down" attention, where the top controller can be imagined to be querying the processed image for various locations and objects.

Having trained this agent (which still gets competitive performance with SOTA RL models on a fair few ATARI games), they qualitatively evaluate the attention visualisation on a variety of games. They find several common strategies in the attention schemes, such as the agents attending to specific points until an object crosses the point ("Tripwires"). The attention is computed over both regular pixels, as well as Fourier-based positional encoding. Thanks to this and other aspects of their architecture, the authors can check whether the queries are focused on pixel values (i.e. looking for a specific pattern of pixels anywhere) or on location features (i.e. asking what pixels are present at a specific location). For example, they find that the agent often queries the location where the score is displayed, presumably because it is useful for calculating the value function. They also compare their method with self-attention based models, and with other saliency methods.

The best way to get a feel for the visualisations is to go to the paper's website and watch the example videos.

Read more: The paper's website

Robert's opinion: This paper isn't revolutionary in its approach, but it's interesting to see work on interpreting RL agents, and the fact that the interpretability is built-in is interesting: it gives us a harder guarantee that this visualisation is actually showing us the parts of the input that the model thinks of as important, as they actually are important in its processing. It's promising to see that the in-built interpretability also didn't seem to penalise the performance much - it would be interesting to see this method applied to other, stronger kinds of models and see whether it still produces useful visualisations and how it affects their performance.

FIELD BUILDING

AI Governance Career Paths for Europeans (Anonymous) (summarized by Rohin): Exactly what it sounds like.

MISCELLANEOUS (ALIGNMENT)

A Guide to Writing the NeurIPS Impact Statement (Carolyn Ashurst et al) (summarized by Nicholas): NeurIPS 2020 requires paper submissions to include a statement on the broader impact of their work. This post provides a guide for how to write an effective impact statement. They recommend focusing on the most significant, neglected, and tractable impacts, both positive and negative, while also conveying the uncertainties involved. They also suggest integrating this into the research process by reading the tech governance literature and building institutional structures, and including this information in introductions.

Their guide then recommends considering 3 questions:

How could your research affect ML applications?

What are the societal implications of these applications?

What research or other initiatives could improve social outcomes?

There is more information in the guide on how to go about answering those questions, along with some examples.

Nicholas's opinion: I am definitely in favor of considering the impacts of ML research before conducting or publishing it. I think the field is currently either at or near a threshold where papers will start having significant real world effects. While I don’t think this requirement will be sufficient for ensuring positive outcomes, I am glad NeurIPS is trying it out.

I think the article makes very strong points and will improve the quality of the impact statements that get submitted. I particularly liked the point about communicating uncertainty, which is a norm that I think the ML community would benefit from greatly. One thing I would add here is that giving explicit probabilities is often more helpful than vague words like “might” or “could”.

OTHER PROGRESS IN AI

REINFORCEMENT LEARNING

"Other-Play" for Zero-Shot Coordination (Hengyuan Hu et al) (summarized by Rohin): How can we build AI systems that can coordinate with humans? While past work (AN #70) has assumed access to some amount of human data, this paper aims to coordinate without any human data at all, which they call zero-shot coordination. In order to develop an algorithm, they assume that their partner is also "trained" for zero-shot coordination.

Their key idea is that in zero-shot coordination, since you can't break symmetries by agreeing upon a protocol in advance (i.e. you can't agree on things like "we'll drive on the left, not the right"), you need a policy that is robust to relabelings that preserve these symmetries. This is easy to train for: you just train in self-play, but randomly relabel the states, actions and observations separately for each side in a way that preserves the MDP structure (i.e. uses one of the symmetries). Thus, each side must play a policy that works well without knowing how the other agent's observations and actions have been relabeled. In practice, for an N-player game you only need to randomize N-1 of the relabelings, and so in the two player games they consider they only randomly relabel one side of the self-play.
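A toy example may help show why robustness to relabelings matters. The sketch below uses a hypothetical three-lever coordination game (the payoffs and function names are assumptions for illustration, and the policies are restricted to deterministic lever choices): both players are rewarded only if they pull the same lever, and two of the levers are interchangeable.

```python
from itertools import permutations

# Hypothetical payoffs: reward is earned only when both players pull the same lever.
LEVER_REWARD = [1.0, 1.0, 0.9]


def payoff(a, b):
    return LEVER_REWARD[a] if a == b else 0.0


def symmetries(rewards):
    """All lever relabelings that leave the game's payoffs unchanged."""
    n = len(rewards)
    return [
        p for p in permutations(range(n))
        if all(rewards[p[i]] == rewards[i] for i in range(n))
    ]


def self_play_value(lever):
    """Payoff when the partner is an exact copy of this policy."""
    return payoff(lever, lever)


def other_play_value(lever):
    """Average payoff when the partner's copy is relabeled by a random symmetry."""
    syms = symmetries(LEVER_REWARD)
    return sum(payoff(lever, p[lever]) for p in syms) / len(syms)


for lever in range(3):
    print(lever, self_play_value(lever), other_play_value(lever))
# 0 1.0 0.5
# 1 1.0 0.5
# 2 0.9 0.9
```

Self-play is happy with either of the 1.0 levers, but that choice is an arbitrary convention a new partner cannot know. Scoring policies under random relabelings instead favors the 0.9 lever, which is the only choice that does not depend on breaking a symmetry.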

They evaluate this in Hanabi (where the game is invariant to relabeling of the colors), and show that the resulting agents are better at playing with other agents trained on different seeds or with slightly different architectures, and also that they play better with humans, achieving an average score of 15.75 with non-expert human players, compared to 9.15 for agents trained via regular self-play.

Rohin's opinion: For comparison, I think I get around 17-22 when playing with new players, out of a max of 25, so 15.75 is quite a healthy score given that it doesn't use any human data. That being said, it seems hard to use this method in other settings -- even in the relatively simple Overcooked environment (AN #70), there aren't any obvious symmetries to use for such training. Perhaps future work will allow us to find approximate symmetries in games somehow, that we can then train to be robust to?

Towards Learning Multi-agent Negotiations via Self-Play (Yichuan Charlie Tang) (summarized by Rohin): While the previous paper introduces other-play to become robust to unknown partners, this paper takes the other approach of simply training an agent that is robust to a wide, diverse population of possible agents. In particular, it studies a self-driving car "zipper merge" environment, and trains an agent to be robust to a variety of rule-based agents, as well as past versions of itself, and finds that this leads to a much more successful merging policy. However, this is evaluated against the population it is trained with, and not against any previously unseen agents.

Building AI that can master complex cooperative games with hidden information (Adam Lerer et al) (summarized by Flo): This paper improves on the state of the art for AI agents playing Hanabi (AN #45), a cooperative multiplayer game that is challenging because of distributed hidden information and restricted communication.

The approach works by improving a baseline policy using search. In the simplest case, only one agent performs search while all other agents follow a fixed policy, such that the problem is reduced to search in a POMDP. This alone leads to relevant improvements, even when the search is very shallow. The fixed policies help because they allow the searching agent to correctly update its belief about hidden information when it sees other agents behaving (as it knows how other agents would behave given different states of the hidden information). This idea can be generalized to the case where all agents perform search by letting the agents simulate each other's search process. This can get expensive quickly as agent A's beliefs in the second round also depend on agent B's search process in counterfactual scenarios in the first round, such that agent B's search in round two also has to simulate these counterfactuals. A computation budget is introduced to make this computationally feasible and all agents know that the other agents will only use search in a turn if the cost of this is below the budget.
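The belief update that a known partner policy enables is ordinary Bayesian filtering. Below is a minimal sketch, assuming a two-valued hidden state ("is my card playable?") and made-up action probabilities for the partner; the function and variable names are hypothetical and this is not the paper's implementation.

```python
def update_belief(belief, partner_policy, observed_action):
    """Bayes update of a belief over hidden states after seeing the partner act.

    belief: dict mapping hidden state -> prior probability.
    partner_policy: function mapping hidden state -> dict of action -> probability.
    """
    unnormalized = {
        state: prior * partner_policy(state).get(observed_action, 0.0)
        for state, prior in belief.items()
    }
    total = sum(unnormalized.values())
    if total == 0:
        raise ValueError("observed action is impossible under the assumed policy")
    return {state: weight / total for state, weight in unnormalized.items()}


def partner_policy(state):
    # Assumed behaviour: the partner usually hints when my card is playable.
    if state == "playable":
        return {"hint": 0.9, "discard": 0.1}
    return {"hint": 0.2, "discard": 0.8}


prior = {"playable": 0.5, "not_playable": 0.5}
posterior = update_belief(prior, partner_policy, "hint")
print(posterior["playable"] > prior["playable"])  # True
```

If the partner's policy were unknown, the likelihood term `partner_policy(state)` would not be available and the observation could not be used this way, which is why fixing the other agents' policies makes single-agent search tractable.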

As search can be performed on top of any policy and makes it possible to leverage compute during inference, not just training, it nicely complements more direct approaches using deep RL, which is a theme that has also been observed in Go and Poker.

Flo's opinion: This solution seems stunningly obvious in retrospect. While the authors informally report that their approach improves robustness to replacing other agents by humans, the example they give seems to indicate that this is because search prevents obvious mistakes in novel situations induced by human behaviour. Thus, I still expect (implicit) human models (AN #52) to be a vital component of human-machine cooperation.

DEEP LEARNING

Growing Neural Cellular Automata (Alexander Mordvintsev et al) (summarized by Zach): The process of an organism's shape development (morphogenesis) is an active area of research. One central problem is determining how cells decide how to grow and when to stop. One popular model for investigating this is Cellular Automata (CA). CA model cells as living on a grid and interacting with each other via rules that depend on their nearest neighbors. The authors contribute to this research direction by introducing rule-sets that depend continuously on their local surroundings. The central insight connecting CA and deep learning is that because the same rule-set is applied at every cell, the update rule works similarly to a convolutional filter. This allows the authors to take advantage of methods available to train neural networks to simulate CA. Using this insight, the authors train CA that can form into images that are resistant to perturbations and deletions. In other words, the CA are capable of regeneration.
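The connection to convolution can be seen in a small sketch: one CA step applies the same 3x3 set of weights at every cell. The code below is a plain-Python illustration with a hand-written kernel and wrap-around edges, both of which are assumptions; the paper's CA use a learned update rule rather than a fixed kernel like this one.

```python
def clamp01(value):
    return max(0.0, min(1.0, value))


def ca_step(grid, kernel, activation=clamp01):
    """One update of a continuous CA: the same 3x3 kernel is applied at every cell."""
    rows, cols = len(grid), len(grid[0])
    new_grid = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            total = 0.0
            for dr in (-1, 0, 1):
                for dc in (-1, 0, 1):
                    # Wrap around at the edges (toroidal grid), an assumption of this sketch.
                    neighbor = grid[(r + dr) % rows][(c + dc) % cols]
                    total += kernel[dr + 1][dc + 1] * neighbor
            new_grid[r][c] = activation(total)
    return new_grid


# Hand-written kernel: each cell copies its left neighbor, so patterns drift right.
SHIFT_RIGHT = [
    [0.0, 0.0, 0.0],
    [1.0, 0.0, 0.0],
    [0.0, 0.0, 0.0],
]

grid = [
    [0.0, 0.0, 0.0],
    [1.0, 0.0, 0.0],
    [0.0, 0.0, 0.0],
]
print(ca_step(grid, SHIFT_RIGHT))
# [[0.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 0.0]]
```

Because the kernel weights are shared across cells, they can be treated as trainable parameters and optimized with gradient-based methods, which is the property the summary refers to.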

Zach's opinion: The main relevance of an approach like this is that it provides proof-of-concept that complex goals, such as shape formation, can be programmed in an embarrassingly parallel fashion amenable to deep learning methodology. This naturally has implications in multi-agent settings where communication is expensive. I'd recommend checking out the main web app which allows you to watch and interact with the CA while they're growing. They also have a code repository that is easily adaptable to training on your own patterns. For example, I grew a regenerating Patrick Star here.

META LEARNING

Gradient Surgery for Multi-Task Learning (Tianhe Yu et al) (summarized by Nicholas): In multi-task learning, an algorithm is given data from multiple tasks and tries to learn them all simultaneously, ideally sharing information across them. This paper identifies a tragic triad of conditions that can prevent gradient descent from finding a good minimum when all three are present:

Conflicting gradients occur when the gradient from one task points in a different direction from another.

Dominating gradients occur when the gradient from one task is much larger in magnitude than another.

High curvature is when the multi-task curvature is high in the direction of the gradient.

In this situation, the linear approximation of the gradient to the high curvature area leads to an overestimation of the increase in performance on the dominant gradient’s task and an underestimation of the performance degradation from the conflicting gradient’s task. I find picturing the parabola y=x^2 and seeing that a gradient descent step overestimates progress while a gradient ascent step underestimates to be helpful in understanding this.
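The parabola intuition can be checked numerically. This is a small sketch with an arbitrarily chosen starting point and step size (both assumptions), comparing the change predicted by the linear approximation with the actual change in y = x^2.

```python
def f(x):
    return x * x


def grad_f(x):
    return 2 * x


x0 = 1.0      # starting point (arbitrary)
step = 0.25   # distance moved along the x-axis (arbitrary)

# The linear approximation predicts the same magnitude of change in both directions.
predicted_change = grad_f(x0) * step       # 0.5

# Moving downhill gains less than predicted ...
actual_decrease = f(x0) - f(x0 - step)     # 0.4375
# ... while moving uphill costs more than predicted.
actual_increase = f(x0 + step) - f(x0)     # 0.5625

print(actual_decrease < predicted_change < actual_increase)  # True
```

In the multi-task setting, the "downhill" direction corresponds to the dominant task and the "uphill" direction to the conflicting task, so a step that looks like a fair trade under the linear approximation is actually worse than it appears.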

To solve this, they propose PCGrad, which projects all gradients into the normal plane of the others in a pairwise fashion. Their theoretical analysis establishes convergence properties of PCGrad, and they empirically show it can be combined with other multi-task algorithms to improve performance and that it makes optimization easier for multi-task supervised learning and RL. They also show plots confirming that the necessary conditions for their theorems appear in these contexts.
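The pairwise projection can be sketched as follows. This is a minimal plain-Python version written from the description above, with gradients as flat lists of floats; the function names are hypothetical. One detail assumed here beyond the summary is that a projection is applied only when two gradients conflict (negative dot product), with the other tasks visited in random order.

```python
import random


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def pcgrad(task_grads, rng=None):
    """Combine per-task gradients after removing pairwise conflicting components."""
    rng = rng or random.Random(0)
    projected = []
    for i, grad in enumerate(task_grads):
        g = list(grad)
        others = [j for j in range(len(task_grads)) if j != i]
        rng.shuffle(others)
        for j in others:
            g_j = task_grads[j]
            overlap = dot(g, g_j)
            if overlap < 0:
                # Remove the component of g that points against g_j, leaving g
                # in the plane normal to g_j.
                scale = overlap / dot(g_j, g_j)
                g = [a - scale * b for a, b in zip(g, g_j)]
        projected.append(g)
    # The final update direction is the sum of the projected task gradients.
    return [sum(components) for components in zip(*projected)]


conflicting = [[1.0, 0.0], [-1.0, 1.0]]
print(pcgrad(conflicting))                   # [0.5, 1.5]
print(pcgrad([[1.0, 0.0], [0.0, 1.0]]))      # [1.0, 1.0]
```

In the first example a plain sum would give [0.0, 1.0], cancelling the first task's progress entirely; after projection, each task's contribution no longer points against the other task's original gradient.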

Nicholas's opinion: I like how this paper analyzes the loss landscape of a particular problem, multi-task learning, and uses that knowledge to derive a new algorithm. One thing I always find tricky in ML papers is that it is hard to establish that the theory of why an algorithm works (typically shown on toy models) is also the reason it improves performance (typically shown using complex neural networks). I appreciate that this paper checks for the conditions of their theorem in the multi-task RL models that they train. That said, I think that in order to confirm that the tragic triad they describe is the mechanism by which PCGrad improves performance, they would require some way to toggle each element of the triad while keeping everything else fixed.

FEEDBACK

I'm always happy to hear feedback; you can send it to me, Rohin Shah, by replying to this email.

PODCAST

An audio podcast version of the Alignment Newsletter is available, recorded by Robert Miles.
