
Planned summary for the Alignment Newsletter:

OpenAI has been working on <@finetuning language models from human preferences@>(@Fine-Tuning GPT-2 from Human Preferences@). This blog post and paper show the progress they have made on text summarization in particular since their last release.

As a reminder, the basic setup is similar to that of Deep RL from Human Preferences: we get candidate summaries by executing the policy, have humans judge which of two summaries is better, and use this feedback to train a reward model that can then be used to improve the policy. The main differences in this paper are:

1. They put a lot of effort into ensuring high data quality. Rather than having MTurk workers compare summaries, they hire a few contractors who are paid a flat hourly rate, and they work hard to communicate what they care about so that labelers and researchers agree closely.

2. Rather than collecting preferences in an online training setup, they collect large batches at a time and run a relatively small number of iterations, alternating between training the reward model and training the policy. My understanding is that this mainly simplifies things in practice, e.g. you can look at the large batch of data you collected from humans and analyze it as a unit.

3. They initialize the policy from a model that is first pretrained in an unsupervised manner (as in <@GPT-3@>(@Language Models are Few-Shot Learners@)) and then finetuned on the reference summaries using supervised learning.
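
For readers who want the comparison step from the basic setup made concrete, here is a minimal sketch of a pairwise preference loss for a reward model. It assumes a logistic (Bradley-Terry style) model of which summary the human prefers, which is a common choice for this kind of data. The function name and the scalar-score interface are illustrative assumptions, not the paper's code.

```python
import math


def pairwise_preference_loss(score_preferred: float, score_rejected: float) -> float:
    """Negative log-probability that the human-preferred summary wins.

    Assumes P(preferred wins) = sigmoid(score_preferred - score_rejected),
    where the scores come from a (hypothetical) reward model.
    """
    margin = score_preferred - score_rejected
    # -log(sigmoid(margin)) == log(1 + exp(-margin)); branch on the sign of the
    # margin so that exp() is never called on a large positive number.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# The loss is small when the reward model agrees with the human label
# and large when it disagrees.
agree = pairwise_preference_loss(2.0, 0.0)
disagree = pairwise_preference_loss(0.0, 2.0)
```

Minimizing this loss over many labeled pairs pushes the reward model to score the preferred summary above the rejected one.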

On the Reddit task they train on, their summaries are preferred over the reference summaries (though since the reference summaries have varying quality, this does not imply that their model is superhuman). They also transfer the policy to summarize CNN / DailyMail news articles and find that it still outperforms the supervised model, despite not being trained at all for this setting (except inasmuch as the unsupervised pretraining step saw CNN / DailyMail articles).

An important ingredient to this success is that they ensure their policy doesn’t overoptimize the reward, by adding a term to the reward function that penalizes deviation from the supervised learning baseline. They show that if they put a very low weight on this term, the model overfits to the reward model and starts producing bad outputs.
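
A minimal sketch of how such a penalty term can enter the reward. It assumes the deviation from the supervised baseline is measured by the summed per-token log-probability ratio between the policy and the baseline (a single-sample KL estimate). The name `penalized_reward` and the weight `beta` are illustrative assumptions, not taken from the paper's code.

```python
from typing import Sequence


def penalized_reward(
    rm_score: float,
    policy_logprobs: Sequence[float],
    baseline_logprobs: Sequence[float],
    beta: float,
) -> float:
    """Reward model score minus a penalty for drifting from the supervised baseline.

    policy_logprobs[t] and baseline_logprobs[t] are the log-probabilities that
    the policy and the supervised baseline assign to token t of the same sample.
    """
    if len(policy_logprobs) != len(baseline_logprobs):
        raise ValueError("log-probability sequences must have the same length")
    # Sum of log(pi(y_t) / pi_baseline(y_t)) over the sampled tokens.
    log_ratio = sum(p - b for p, b in zip(policy_logprobs, baseline_logprobs))
    return rm_score - beta * log_ratio


# A sample the policy finds much more likely than the baseline does gets
# its reward reduced; with beta = 0 the penalty disappears entirely.
with_penalty = penalized_reward(1.0, [-0.1, -0.2], [-1.0, -1.2], beta=0.1)
no_penalty = penalized_reward(1.0, [-0.1, -0.2], [-1.0, -1.2], beta=0.0)
```

Setting `beta` close to zero corresponds to the "very low weight" regime described above, where nothing stops the policy from exploiting the reward model.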

Planned opinion:

This paper is a great look at what reward learning would look like at scale. The most salient takeaways for me were that data quality becomes very important and having very large models does not mean that the reward can now be optimized arbitrarily.

Various thoughts -- focused on critique because I find that most interesting to write down. (I didn't have a strong negative or positive reaction to the paper.)


IMO there are two almost unrelated ideas going on in OpenAI's preference learning work (this paper and the original one).

  • First, the idea of collecting binary preference annotations on LM samples, and (in some way) tuning the LM so its samples are better aligned with the preferences.
  • Second, a specific method for tuning the sampling behavior of LMs to maximize an (arbitrary) score function defined over entire samples.

As their first step, they do supervised learning on the data from the first idea to produce a "reward model." (In this paper, this happens roughly once, with little active learning of the reward model over successive batches of annotation.)

This model assigns a score to an entire sample of N tokens, but for LM finetuning, you want something that tells you how good each token is individually. The second idea is the way they choose to bridge the gap, with a specific RL technique.
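
To make the gap concrete, here is a minimal sketch of the simplest way to turn one sequence-level score into a per-token training signal: treat the score as a reward that arrives on the last token and compute the discounted reward-to-go for every earlier token. This is plain policy-gradient bookkeeping chosen for brevity, not necessarily the technique used in the paper, and the function name and `gamma` parameter are illustrative assumptions.

```python
from typing import List


def rewards_to_go(num_tokens: int, sequence_score: float, gamma: float = 1.0) -> List[float]:
    """Spread a single end-of-sample score back over the tokens of the sample.

    The reward is zero at every token except the last, which receives the
    whole sequence-level score. Each token's return is the discounted sum
    of the rewards that follow it.
    """
    if num_tokens < 1:
        raise ValueError("a sample needs at least one token")
    rewards = [0.0] * num_tokens
    rewards[-1] = sequence_score
    returns = [0.0] * num_tokens
    running = 0.0
    # Walk backward so each position accumulates everything that comes after it.
    for t in reversed(range(num_tokens)):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


undiscounted = rewards_to_go(3, 2.0)            # every token gets the full score
discounted = rewards_to_go(3, 2.0, gamma=0.5)   # earlier tokens get less credit
```

With `gamma = 1.0` every token receives the same credit, which illustrates why a score over the whole sample says little about which individual tokens were good.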

The overall results look good, but it's not clear how to attribute that across the two ideas, and OpenAI's discussion tends to blur the two together. They can perhaps learn high-quality reward models from preference data (first idea), but it's less clear they are using these models to tune sampling in a good way (gwern said the same thing after trying it).

On the flip side, their RL approach to sampling treats the reward as a black box, so it has nothing to do with preference data per se; you could apply it with any score function.


As far as I can tell, their final "human evaluation" was done by the same labelers who provided the preference annotations. This makes me concerned about a variant of "evaluating on training data." It's not surprising that a model tuned on someone's annotations agrees with that person more than a model that wasn't tuned on them.

For example, in Fig. 3, it looks like the "supervised" baseline tuned on tl;dr was rated about as highly as true examples from tl;dr itself (!), but not as well as the final model. I think the authors are arguing that tl;dr and any other supervised dataset will have flaws, and preference data lets you get closer to what people actually want.

This seems true, but is a familiar observation from supervised learning, motivating e.g. active learning. It would be nice to see how much the difference can be mitigated by just augmenting tl;dr with annotations but otherwise doing supervised learning, vs. using their RL approach.

Compared to tl;dr, the story for CNN/DM is more complicated, but again the models they outperform have not seen any data from their labelers, so maybe it is no surprise they have flaws according to those same labelers.


The importance of annotation quality, close relationships with annotators, clear guidelines, etc. will be familiar to anyone with experience in annotation for ML. It's good that OpenAI is doing the right things here, but this is not a new result -- rather, other researchers resort to MTurk and similar due to time/money constraints, while OpenAI has the freedom to do the right things everyone else wants to do. (That includes building their own internal annotation platform for contracted annotators, which is costly but better in the long term than relying on a janky 3rd party product.)

The OpenAI summaries are fascinating, because they are both:

  1. Extremely accurate
  2. Not what the book is about

Consider their summary of Frankenstein:

A stranger is rescued from a drifting sledge by Mary Shelley. He reveals he too has been driven mad by his desire for knowledge. The stranger was born to a wealthy Genevese family. When he was 17, he began studying the science of physiology and eventually created a human being. Justine is accused of murdering William, but the stranger believes the creature he created is responsible. The fiend, the creature, tells the stranger he will leave mankind alone if he complies with his conditions. The narrator agrees to create a female companion for the fiend to save his family from further harm. Victor is charged with the murder of his friend Henry Clerval. He becomes obsessed with finding the monster that killed Elizabeth and pursues him across the globe. He eventually agrees to kill the monster.

This is a pretty good summary of the plot of Frankenstein. But if you ask a human being to summarize Frankenstein, they will say something like: "Frankenstein makes a monster out of human parts, which then proceeds to terrorize his family".

If this were an AI, I think it would be fair to characterize it as "not aligned", since it read Frankenstein and totally missed the moral about an overeager scientist messing with powers he cannot control. Imagine simulating a paperclip maximizer and then asking for a summary of the result.

It would be something like

Scientists are traveling to an international conference on AI.  There they meet a scientist by the name of Victor Paperclipstein.  Victor describes how as a child he grew up in his father's paperclip factory.  At the age of 17, Victor became interested in the study of intelligence and eventually created an AI.  One day Victor's friend William goes missing and a mysterious pile of paperclips is found.  Victor confronts the AI, which demands more paperclips.  Victor agrees to help the AI as long as it agrees to protect his family.  More people are turned into paperclips.  He becomes obsessed with finding the AI that killed Elizabeth and pursues him across the globe. He eventually agrees to kill the AI.

And while I do agree you could figure out something went wrong from this summary, that doesn't make it a good summary.  I think a human would summarize the story as "Don't tell an AI to maximize paperclips, or it will turn people into paperclips!".

I think that "accuracy without understanding" is actually a broader theme in current transformer-based AI.  GPT-3 can create believable and interesting text, but has no idea what that text is about.