[AN #110]: Learning features from human feedback to enable reward learning