[AN #64]: Using Deep RL and Reward Uncertainty to Incentivize Preference Learning



Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

Find all Alignment Newsletter resources here. In particular, you can sign up, or look through this spreadsheet of all summaries that have ever been in the newsletter. I'm always happy to hear feedback; you can send it to me by replying to this email.

Audio version here (may not be up yet).


Learning to Interactively Learn and Assist (Mark Woodward et al) (summarized by Zachary Robertson): Cooperative Inverse Reinforcement Learning proposed a model in which an AI assistant would help a human principal, where only the principal knows the task reward. This paper explores this idea in the context of deep reinforcement learning. In their grid-world environment, two agents move around and pick up lemons or plums. The principal is penalized for moving, but is the only one who knows whether plums or lemons should be picked up. The authors hypothesize that simply by jointly training the two agents to maximize rewards, they will automatically learn to interact in order for the assistant to learn the task, rather than requiring an explicit mechanism like comparisons or demonstrations.

Recurrent Q-networks are used for the agents, which are then trained via deep Q-learning. The authors run several experiments that show emergent interaction. In the first experiment, when the principal is penalized for moving it learns to demonstrate the task to the assistant, and then let the assistant finish the job. In the second experiment, when the assistant has a restricted field of view, it learns to follow the principal to see what it does, until it can infer whether the principal wants plums or lemons. In the third, they tell the assistant the task 50% of the time, and so the principal is initially unsure whether the agent needs any direction (and due to the motion cost, the principal would rather not do anything). When the agent knows the task, it performs it. When the agent doesn't know the task, it moves closer to the principal, in effect "asking" what the reward is, and the principal moves until it can see the object, and then "answers" by either moving towards the object (if it should be collected) or doing nothing (if not). Finally, the authors run an experiment using pixels as input. While they had to switch to dueling DQNs instead of vanilla DQNs, they show that the joint reward is competitive with the grid approach. They also run an experiment with human principals and show that the human/assistant pair outperforms the solo-human setup.
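The summary mentions a switch to dueling DQNs for the pixel experiment. As a rough illustration of what "dueling" means (not the paper's actual network, whose details are not given here), the standard dueling head splits the Q-value into a state value and per-action advantages, then recombines them with the mean advantage subtracted. The function names below are hypothetical.

```python
def dueling_q_values(state_value, advantages):
    """Combine a scalar state value V(s) with per-action advantages A(s, a).

    Uses the common aggregation Q(s, a) = V(s) + A(s, a) - mean_a A(s, a).
    Subtracting the mean keeps V and A identifiable.
    """
    mean_advantage = sum(advantages) / len(advantages)
    return [state_value + a - mean_advantage for a in advantages]


def greedy_action(q_values):
    """Return the index of the action with the highest Q-value."""
    return max(range(len(q_values)), key=lambda i: q_values[i])


# Illustrative numbers only: three actions, e.g. "left", "right", "stay".
q = dueling_q_values(1.0, [0.5, -0.5, 0.0])
best = greedy_action(q)
```

In a real agent the value and advantages would come from two output streams of the (recurrent) network rather than being passed in by hand.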

Zach's opinion: Overall, I found the idea expressed in this paper to be well-articulated. While I think that the grid-world environment is a bit simplistic, their results are interesting. Being able to learn intent in an online manner is an important problem to solve if we’re interested in robust collaboration between humans and autonomous agents. However, the authors point out that training on pixel input fails in the majority of cases (64% of the time), which raises concerns about how well the method would generalize to non-trivial environments.

Rohin's opinion: I'm excited that the ideas from CIRL are making their way to deep RL. Ultimately I expect we'll want an agent that takes all of its sensory data as evidence about "what the human wants", rather than relying on a special reward channel, or a special type of data called "comparisons" or "demonstrations", and this work takes that sort of approach.

For these simple environments, an agent trained to perform well with another artificial agent will generalize reasonably well to real humans, because there are only a few reasonable strategies for the principal to take. However, with more complex environments, when there are many ways to interact, we can't expect such generalization. (I'll have a paper and blog post coming out soon about this phenomenon.)

Technical AI alignment

Technical agendas and prioritization

Four Ways An Impact Measure Could Help Alignment (Matthew Barnett) (summarized by Asya Bergal): Much recent (AN #25) work (AN #49) has focused on quantifying the effect an AI has on the world, aka measuring impact, though some are skeptical. This post presents four potential ways impact measures could help with AI alignment. First, impact could act as a regularizer: an untrained AI attempting to do value learning could have an impact penalty that prevents it from taking dangerous actions before it is confident it has learned the right utility function. Second, impact could act as a safety protocol: if our training process is dangerous, e.g. due to mesa optimization (AN #58), we can penalize impact during training to safely test models that may be misaligned. Third, impact could act as an influence-limiter: impact measures could help us construct AIs with intentionally limited scope that won’t heavily optimize the world as a side effect. Fourth, impact could help us with deconfusion: even if impact measures themselves aren't used, conceptual clarity about impact could help us gain conceptual clarity about other important concepts such as corrigibility, mild optimization, etc.
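To make the first and third cases concrete, here is a minimal sketch under loud assumptions: the post gives no formula, so the scalar `impact`, the `confidence` value in [0, 1], and the linear fade-out schedule are all hypothetical. It treats an impact penalty as a term subtracted from the task reward, and contrasts a regularizer-style weight that shrinks as the AI becomes confident in its learned utility function with an influence-limiter-style weight that stays fixed.

```python
def penalized_reward(task_reward, impact, penalty_weight):
    """Task reward minus a weighted impact penalty (impact is a hypothetical scalar)."""
    return task_reward - penalty_weight * impact


def regularizer_weight(base_weight, confidence):
    """Hypothetical schedule: the penalty fades linearly as confidence goes from 0 to 1."""
    confidence = min(1.0, max(0.0, confidence))
    return base_weight * (1.0 - confidence)


def influence_limiter_weight(base_weight):
    """Influence-limiter reading: the penalty weight never decays."""
    return base_weight


# Same action (task reward 1.0, impact 2.0) early and late in value learning.
early = penalized_reward(1.0, 2.0, regularizer_weight(0.5, confidence=0.0))
late = penalized_reward(1.0, 2.0, regularizer_weight(0.5, confidence=1.0))
limited = penalized_reward(1.0, 2.0, influence_limiter_weight(0.5))
```

Under these assumptions the high-impact action is discouraged early on and permitted once the agent is confident, while the influence-limited agent keeps paying the penalty throughout.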

Asya's opinion: I am most excited about impact as a regularizer and impact as a safety protocol. I feel like AIs that are impact-limited at runtime (the influence-limiter case) are unlikely to be competitive with other AIs that have no impact penalty (this is discussed in the post). I found the argument that impact could be particularly useful for deconfusion uncompelling.

Rohin's opinion: It seems to me like the safety protocol argument is for limited actions at training time, while the influence limiter argument is for limited actions at test time. I don't really get how the regularizer is supposed to be different from these two cases -- perhaps the idea is that it is a regularizer specifically on the distribution over utility functions that the AI is optimizing? This is still confusing: I would have expected the influence limiter case to also be a change to the utility function. Like Asya, I am worried about competitiveness: see the post about reversible changes below.

Preventing bad behavior

Reversible changes: consider a bucket of water (Stuart Armstrong) (summarized by Rohin): This post argues that impact regularization methods require preference information in order to work well. Consider a robot that has to navigate to a location, and the fastest way of doing so involves kicking a bucket of water into a pool to get it out of the way. Kicking the bucket is acceptable even though it is irreversible, but it may not be if the water has a special mixture of salts used for an industrial process. In order to determine the appropriate penalty for kicking the bucket, we need preference information -- it is not enough to think about anything value-agnostic like reversibility.

Rohin's opinion: I agree with this -- as I've said before, it seems hard to simultaneously avoid catastrophes, be useful, and be value agnostic. This post is arguing that if we want to avoid catastrophes and be useful, then we can't be value agnostic.

Adversarial examples

Natural Adversarial Examples (Dan Hendrycks et al) (summarized by Flo Dorner): This paper introduces a new dataset to evaluate the worst-case performance of image classifiers. ImageNet-A consists of unmodified natural images that are consistently misclassified by popular neural-network architectures trained on ImageNet. Based on some concrete misclassifications, like a dragonfly on a yellow plastic shovel being classified as a banana, the authors hypothesize that current classifiers rely too much on color, texture and background cues. Neither classical adversarial training nor training on a version of ImageNet designed to reduce the reliance on texture helps a lot, but modifying the network architecture can increase the accuracy on ImageNet-A from around 5% to 15%.

Flo's opinion: This seems to show that current methods and/or training sets for image classification are still far away from allowing for robust generalization, even in naturally occurring scenarios. While not too surprising, the results might convince those who have heavily discounted the evidence provided by classical adversarial examples due to the reliance on artificial perturbations.

Rohin's opinion: I'm particularly excited about this dataset because it seems like a significantly better way to evaluate new techniques for robustness: it's much closer to a "real world" test of the technique (as opposed to e.g. introducing an artificial perturbation that classifiers are expected to be robust to).

Field building

AI Reading List (Vishal Maini)

AI strategy and policy

AI Alignment Podcast: China’s AI Superpower Dream (Lucas Perry and Jeffrey Ding) (summarized by Rohin): See also these (AN #55) three (AN #61) podcasts (AN #63).

Other progress in AI

Reinforcement learning

On Inductive Biases in Deep Reinforcement Learning (Matteo Hessel, Hado van Hasselt et al) (summarized by Sudhanshu Kasewa): The fewer inductive biases we use, the more general our algorithms will be. But how much does it really help to have fewer inductive biases? This paper replaces several hand-engineered components of an A2C agent with generic or adaptive variants to empirically answer this question.

Specifically, they compared: 1) reward clipping vs. reward normalization via PopArt (AN #24), 2) handpicked discount factor vs. online adaptive discounting via meta-learning, 3) fixed action repeats vs. learned action-commitment, and 4) standard Atari observation preprocessing vs. passing raw observations to a recurrent network. Over 57 Atari tasks, they found that the tuned algorithm outperformed the adaptive method only in (1). Performance was similar for (2) and (3), and the proposed method outperformed the baseline for (4). When the fully adaptive agent was compared to the vanilla agent (with heuristics designed for Atari) over 28 unseen continuous control tasks, the adaptive agent performed better in 14 of them, worse in one, and about the same in the rest, providing evidence that fewer inductive biases do lead to more general agents.
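For comparison (1), the contrast between a fixed heuristic and an adaptive one can be sketched as follows. This is not PopArt itself: PopArt normalizes value targets and also rescales the network's output layer so that predictions are preserved when the statistics change. The sketch below only shows the adaptive-statistics half, using a running mean and variance; the class and function names are hypothetical.

```python
import math


def clip_reward(reward, low=-1.0, high=1.0):
    """Hand-engineered heuristic: squash every reward into [low, high]."""
    return max(low, min(high, reward))


class RunningNormalizer:
    """Adaptive alternative: track mean and variance online (Welford's method)."""

    def __init__(self, eps=1e-8):
        self.count = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations from the mean
        self.eps = eps

    def update(self, x):
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (x - self.mean)

    def normalize(self, x):
        variance = self.m2 / self.count if self.count > 0 else 0.0
        return (x - self.mean) / (math.sqrt(variance) + self.eps)


normalizer = RunningNormalizer()
for target in [0.0, 10.0]:
    normalizer.update(target)
scaled = normalizer.normalize(10.0)
```

Clipping discards magnitude information (a reward of 100 and a reward of 1 look the same), whereas the running statistics keep relative magnitudes at the cost of extra machinery.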

Sudhanshu's opinion: On net, I am quite happy to see work which argues in favour of reducing time spent hand-tuning and hand-crafting parts of a complex pipeline, and demonstrates the alternatives that currently exist to do so.

However, I feel the work did not fully compare the trade-off between tuning hyperparameters, and increasing the complexity of the pipeline by adding the adaptive components. I agree, though, that the latter is a one-time effort (per inductive bias), and is thus far more scalable than the former which needs to be repeated for each bias for every new task.

It would also be interesting to see how adaptive agents fare on problems where we care more about failures than successes, or if they are better/worse suited for safe exploration than baseline agents. My intuition is that adaptive internals of the agent cause it to behave more noisily/unpredictably, and it may not fare as well as our current efforts for such problems.

Rohin's opinion: While it's certainly true that fewer inductive biases imply more general agents, they also usually mean greater compute and data requirements. For action repetition and learned discount factors, only one new parameter has to be learned, so it doesn't make much of a difference either way (and in fact performance on Atari doesn't change much). Clipped rewards do in fact learn faster than PopArt. I don't know why a recurrent network improves upon standard observation preprocessing for Atari -- perhaps initially RNNs were hard to train, and it became a de facto standard to use observation preprocessing, and no one checked about using recurrent networks later when RNNs became easier to train?

Miscellaneous (AI)

Stand-Alone Self Attention in Vision Models (Prajit Ramachandran et al) (summarized by Cody): Continuing with the more general rise of attention models across disciplines, this paper argues that attention-only models can perform comparably to convolutional networks on image classification tasks, a domain where convolution has been the reigning default method for years now. Because attention doesn't scale parameter-wise as you increase spatial scale, this comparable performance can be achieved with a notably lower number of parameters and FLOPs. The authors make a few interesting modifications to attention. Firstly, it's canonical, with attention, to include a representation of a pixel's position in the image, in addition to the vector storing the content of the image. In this paper, they found that storing this position information in relative terms (i.e. "how close is this pixel to the center one where attention is being calculated") performs better.

This can be seen as a sort of generalized form of convolution, where instead of having fixed weights for pixels in a kernel indexed by their relative position, attention takes both content and relative position as an input and generates a weight dynamically. Another modification is, at the lower parts of the network, to somewhat modify the attention paradigm such that the "value" at each location isn't just a neutrally transformed version of the input at that location, but rather one transformed differently according to the pixel's position relative to the anchor point where attention is being calculated. At the lower levels of the network, convolutions tend to outperform attention, but attention performs better at later layers of the network. This makes sense, the authors claim, because in early layers each individual pixel doesn't contain much content information that an attention mechanism could usefully leverage, whereas later the learned features at a given spatial location are richer, and more productively leveraged by attention.
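The "generalized convolution" view can be sketched in one dimension. This is a simplified illustration, not the paper's implementation: it uses a single head, a 1D sequence instead of a 2D image, and hand-supplied query, key, and value vectors. The relative-position table `rel_embeddings` (offset to vector) and all function names are hypothetical. The weight for each neighbour depends on both content (query times key) and relative offset (query times relative embedding), and is computed dynamically rather than stored as a fixed kernel weight.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def local_attention_1d(queries, keys, values, rel_embeddings, radius):
    """Attend from each position to neighbours within `radius`.

    The logit for neighbour j of position i is
    q_i . k_j + q_i . r_(j - i), where r is looked up by relative offset.
    Positions outside the sequence are skipped.
    """
    n = len(queries)
    dim = len(values[0])
    outputs = []
    for i in range(n):
        neighbours = [j for j in range(i - radius, i + radius + 1) if 0 <= j < n]
        logits = [
            dot(queries[i], keys[j]) + dot(queries[i], rel_embeddings[j - i])
            for j in neighbours
        ]
        weights = softmax(logits)
        outputs.append(
            [sum(w * values[j][d] for w, j in zip(weights, neighbours)) for d in range(dim)]
        )
    return outputs
```

If the content term were dropped and the weights depended only on the offset, this would reduce to something like a fixed-kernel convolution; if the relative term were dropped, it would be position-free attention over the neighbourhood.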

Cody's opinion: I enjoyed and appreciated the way this paper questions the obvious default of convolutional models for image processing, and in particular the way it highlights various reachable points (neighborhood-aware value transformation, relative position encoding, etc) on the interpolation path between weights based purely on relative distance (convolution) and weights based purely on content similarity (attention without any position representation). I'd be interested in the future to see more work in this space, exploring different places in a network architecture where content-focused or position-focused computation is most valuable.



8 comments
I don't really get how the regularizer is supposed to be different from these two cases -- perhaps the idea is that it is a regularizer specifically on the distribution over utility functions that the AI is optimizing?

To clarify, the difference was that in the case of the regularizer, the impact penalty would gradually be discarded as the AI learned more. In the case of an influence limiter, the AI would always be restricted to low impact actions, until it is retired.

In that case, why isn't this equivalent to impact as a safety protocol? The period during which we use the regularizer is "training".

(Perhaps the impact as a safety protocol point was meant to be about mesa optimization specifically?)

The period during which we use the regularizer is "training".

This is misleading. In the case of mesa optimization, there are plausibly two regimes which could both legitimately be considered "training." To use the analogy of human evolution, there is the training regime which allowed human brains to evolve, and there's the training regime where humans grow up from babies to adults. Both of these use some sort of learning algorithm, but the second uses the learned learning algorithms.

In the case of "Impact as a regularizer" I meant it in the sense of a baby growing up to be an adult. Whereas the safety protocol was meant to imply safety over the regime of developing brains to begin with.

Also to note: I think I could have been substantially more clear in that post specifically. I wrote it during a brainstorming session where I was trying to think of ways an impact measure could help alignment. I will consider adding a note on the post saying "I don't currently think about impact measures this way anymore."

[EDIT: See my other comment which explains my reply much better] You're right to say that they are very similar. The only real difference is in the conceptual framing. In the safety protocol case I imagined creating a testing environment, which may include potentially misaligned mesa optimizers. The case of regularizer is one where we have given it autonomy and it is no longer in a regime for us to perform tests on.

Neither classical adversarial training nor training on a version of ImageNet designed to reduce the reliance on texture helps a lot, but modifying the network architecture can increase the accuracy on ImageNet-A from around 5% to 15%.

(Section link.)

Wow, 15% sounds really low. How well do people perform on said dataset?

This reminds me of:


David Ha's most recent paper, Weight Agnostic Neural Networks, looks at what happens when you do architecture search over neural nets initialized with random weights to try and better understand how much work structure is doing in neural nets.

Given that there was a round of manual review, I would expect human accuracy to be over 80% and probably over 90%.

How well do people perform on said dataset?

You can download the dataset here and see how well you can classify them yourself.