"Learning to Summarize with Human Feedback" - OpenAI

by Rekrul1 min read7th Sep 20202 comments


OpenAIConservatism (AI)Machine LearningAI
2 comments, sorted by Highlighting new comments since Today at 1:56 AM
New Comment

Planned summary for the Alignment Newsletter:

OpenAI has been working on <@finetuning language models from human preferences@>(@Fine-Tuning GPT-2 from Human Preferences@). This blog post and paper show the progress they have made on text summarization in particular since their last release.

As a reminder, the basic setup is similar to that of [Deep RL from Human Preferences](https://deepmind.com/blog/learning-through-human-feedback/): we get candidate summaries by executing the policy, have humans compare which of two summaries is better, and use this feedback to train a reward model that can then be used to improve the policy. The main differences in this paper are:

1. They put in a lot of effort to ensure high data quality. Rather than having MTurk workers compare between summaries, they hire a few contractors who are paid a flat hourly rate, and they put a lot of effort into communicating what they care about to ensure high agreement between labelers and researchers.

2. Rather than collecting preferences in an online training setup, they collect large batches at a time, and run a relatively small number of iterations of alternating between training the reward model and training the policy. My understanding is that this primarily makes it simpler from a practical perspective, e.g. you can look at the large batch of data you collected from humans and analyze it as a unit.

3. They initialize the policy from a model that is first pretrained in an unsupervised manner (as in <@GPT-3@>(@Language Models are Few-Shot Learners@)) and then finetuned on the reference summaries using supervised learning.

On the Reddit task they train on, their summaries are preferred over the reference summaries (though since the reference summaries have varying quality, this does not imply that their model is superhuman). They also transfer the policy to summarize CNN / DailyMail news articles and find that it still outperforms the supervised model, despite not being trained at all for this setting (except inasmuch as the unsupervised pretraining step saw CNN / DailyMail articles).

An important ingredient to this success is that they ensure their policy doesn’t overoptimize the reward, by adding a term to the reward function that penalizes deviation from the supervised learning baseline. They show that if they put a very low weight on this term, the model overfits to the reward model and starts producing bad outputs.

Planned opinion:

This paper is a great look at what reward learning would look like at scale. The most salient takeaways for me were that data quality becomes very important and having very large models does not mean that the reward can now be optimized arbitrarily.

Various thoughts -- focused on critique because I find that most interesting to write down. (I didn't have a strong negative or positive reaction to the paper.)


IMO there are two almost unrelated ideas going on in OpenAI's preference learning work (this paper and the original one).

  • First, the idea of collecting binary preference annotations on LM samples, and (in some way) tuning the LM so its samples are better aligned with the preferences.
  • Second, a specific method for tuning the sampling behavior of LMs to maximize an (arbitrary) score function defined over entire samples.

As their first step, they do supervised learning on the data from the first idea to produce a "reward model." (In this paper, this happens roughly once, with little active learning of the reward model over successive batches of annotation.)

This model assigns a score to an entire sample of N tokens, but for LM finetuning, you want something that tells you how good each token is individually. The second idea is the way they choose to bridge the gap, with a specific RL technique.

The overall results look good, but it's not clear how to attribute that across the two ideas, and OpenAI's discussion tends to blur the two together. They can perhaps learn high-quality reward models from preference data (first idea), but it's less clear they are using these models to tune sampling in a good way (gwern said the same thing after trying it).

On the flipside, their RL approach to sampling treats the reward as a black box, so it has nothing to do with preference data per se; you could apply it with any score function.


As far as I can tell, their final "human evaluation" was done by the same labelers who provided the preference annotations. This makes me concerned about a variant of "evaluating on training data." It's not surprising that a model tuned on someone's annotations agrees with that person more than a model which wasn't.

For example, in Fig. 3, it looks like the "supervised" baseline tuned on tl;dr was rated about as highly as true examples from tl;dr itself (!), but not as well as the final model. I think the authors are arguing that tl;dr and any other supervised dataset will have flaws, and preference data lets you get closer to what people actually want.

This seems true, but is a familiar observation from supervised learning, motivating e.g. active learning. It would be nice to see how much the difference can be mitigated by just augmenting tl;dr with annotations but otherwise doing supervised learning, vs. using their RL approach.

Compared to tl;dr, the story for CNN/DM is more complicated, but again the models they outperform have not seen any data from their labelers, so maybe it is no surprise they have flaws according to those same labelers.


The importance of annotation quality, close relationships with annotators, clear guidelines, etc. will be familiar to anyone with experience in annotation for ML. It's good that OpenAI is doing the right things here, but this is not a new result -- rather, other researchers resort to MTurk and similar due to time/money constraints, while OpenAI has the freedom to do the right things everyone else wants to do. (That includes building their own internal annotation platform for contracted annotators, which is costly but better in the long term than relying on a janky 3rd party product.)