This post is a much extended version of an LW comment I made about OpenAI’s new paper, “Learning to summarize from human feedback.”

Context: this paper is a direct extension of the work OpenAI published last year about fine-tuning GPT-2 with human preference data.  I hadn’t actually read that one closely at the time, but went back and did so now, so this is really a commentary on both.

—-

IMO there are two almost unrelated ideas going on in OpenAI’s preference learning work.

  • First, the idea of collecting binary preference annotations on LM samples, and (in some way) tuning the LM so its samples are better aligned with the preferences.
  • Second, a specific method for tuning the sampling behavior of LMs to maximize an (arbitrary) score function defined over entire samples.

It may help to explain this by going into detail about what they actually do.  Concretely:

  • They feed a bunch of prompts to a language model (LM) like GPT-2/3, and for each one, save several different samples.  They hire annotators to rank the samples in order of perceived quality.
  • They use the annotation dataset to fine-tune a copy of the original model.  The fine-tuning task is not text generation, but something very different: predicting how “good” a sample is, i.e. how likely the annotators are to prefer it to other candidates.  They call this a “reward model.”  (See the sketch just after this list.)
  • The reward model assigns a single score to an entire sample of N tokens.  They want to fine-tune another copy of the model so that its samples maximize these scores.
  • But LM training is usually done with an objective that specifies the quality of the model’s predictions for every single token.  Knowing how good a full sequence of (say) 20 words is does not tell you how good each individual word is.
  • To bridge this gap, they use reinforcement learning.  Now, the task is not “choose the next word correctly,” but “choose the next word so as to maximize your expected score at the end, after choosing all the later ones as well.”
  • Their RL method requires two separate copies of the LM, in addition to the one they tuned as the reward model: a “policy model” and a “value model.”  (In this paper they show that sharing parameters between these two is worse than keeping them separate.)  I’ll just call these two “the final model” below for simplicity.
  • Samples from the final model are still, technically, generated one token at a time.  They treat this like the usual RL setup in which you can only choose individual actions one at a time, because the environment responds unpredictably to each one.  Here, there is no “environment” outside your actions, but the same framework is used.
  • Presumably, the final model is better at planning multi-token structures than the original because it has been trained on a holistic, multi-token objective.  So, it does more planning, but this is implicit in its one-by-one token decisions.
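To make the reward-model step concrete, here is a minimal sketch of roughly the pairwise objective the new paper describes (the earlier paper used a softmax over four candidate samples instead).  The function and variable names are mine, and details like batching and normalization are omitted.

```python
import torch
import torch.nn.functional as F

def reward_model_loss(score_preferred: torch.Tensor,
                      score_rejected: torch.Tensor) -> torch.Tensor:
    """Pairwise preference loss: push the annotator-preferred sample's
    scalar score above the rejected sample's score.
    This is -log sigmoid(r_preferred - r_rejected), averaged over pairs."""
    return -F.logsigmoid(score_preferred - score_rejected).mean()

# Toy usage: in the real setup these scores would come from a GPT-style LM
# with a scalar head, applied to full samples (prompt + completion).
scores_preferred = torch.randn(8, requires_grad=True)
scores_rejected = torch.randn(8, requires_grad=True)
loss = reward_model_loss(scores_preferred, scores_rejected)
loss.backward()
```

The thing to notice is that this is just supervised learning on a scalar prediction task; nothing about it is RL yet.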

I visualize this as two separate things with a bottleneck connecting them.

On one side are the human annotations and the supervised training of the reward model.  This part succeeds insofar as they can train the model to predict the annotations (apparently they can do this quite well).  This step involves a type of data with special challenges, but has nothing to do with RL.

On the other side is the RL part.  This is a modification of ordinary LM training to optimize a global, rather than local, objective.  This part has nothing to do with “human preferences”: the global objective could be anything, and in fact here it isn’t raw human opinion but the opinions of another model trained to predict human opinion.  The noteworthy thing here is not the use of human preference data in particular but the use of RL instead of the more ordinary objective that was apparently a good enough choice to make GPT-2/3 work originally.
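Concretely, the quantity the RL side tries to maximize is, roughly, the reward model’s score for a whole sample minus a KL-style penalty that keeps the tuned policy close to the original LM (this penalty is the “auxiliary term” I mention near the end of the post).  A rough sketch with names of my choosing, leaving out all the PPO machinery and per-token bookkeeping:

```python
import torch

def sequence_reward(rm_score: torch.Tensor,
                    logprob_policy: torch.Tensor,
                    logprob_original: torch.Tensor,
                    beta: float = 0.1) -> torch.Tensor:
    """Whole-sequence reward for the RL side, roughly as described in the papers.

    rm_score:         (batch,) scalar scores from the trained reward model
    logprob_policy:   (batch,) summed log-prob of each sample under the tuned policy
    logprob_original: (batch,) summed log-prob under the frozen original LM
    beta:             strength of the KL-style penalty
    """
    kl_term = logprob_policy - logprob_original  # per-sample log-ratio
    return rm_score - beta * kl_term
```

Note that this number only exists once the whole sample has been generated; the job of the RL machinery is to spread that single number back over the individual token choices.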

(BTW, this resolves my initial confusion as to how OpenAI could possibly have gotten RL to work with human data, something I viewed as a bottleneck.  There is a model sitting between the humans and the RL learner which is much faster to query than the humans.)

The two sides are connected by the reward model.  In the previous paper, the two sides were coupled together more, because they repeatedly collected new human data as the policy changed and then used a new reward model to further train the policy.  Here, they’re totally separate: there were multiple batches of annotation, but each policy experienced an unchanging reward model.

(See Appendix C.6 and their comment about “moving to the offline setting.”  It seems noteworthy that the 2017 OpenAI/DeepMind paper which introduced the “RL from preferences” approach, and which they cite, found that this didn’t work for their test cases: “Training the reward predictor offline can lead to bizarre behavior […] This type of behavior demonstrates that in general human feedback needs to be intertwined with RL rather than provided statically.”  I don’t know what to make of this.)

—-

It’s hard to tell from OpenAI’s discussion how much of their success is due to learning a good reward model, vs. how much depends on RL being necessary for certain kinds of quality in LM samples, despite the wide successes of the non-RL approach.

FWIW, Gwern reports trying OpenAI’s approach and finding the RL side specifically frustrating and unstable; this is pretty normal with RL, and compatible with the reward-model part being very successful in its own domain.  It’s not clear whether OpenAI got the RL part to work well because they did something right, or because they have lots of resources and can keep trying over and over until it works.  (There may have been something in the papers about this that I missed.)

—-

The RL part feels almost in tension with OpenAI’s usual approach with LMs, which is to train on a next-token objective, sample in a next-token way, and focus on scaling up the model rather than improving the training objective or sampling algorithm.

Of course, I understand why they have to do RL if they need to maximize a score over the whole sequence, but my point is that they chose to frame the task that way in the first place.

One could imagine someone arguing that ordinary GPT sampling would never achieve high-quality text, because humans care about global structures across the whole text, and a model trained only to guess the very next token will not know how to plan out these global structures across the whole future of the text it writes.  Against that argument, OpenAI’s own position is that explicit training to plan (i.e. RL) is unnecessary: training on a next-token objective over text is enough to produce strikingly high-quality samples – in other words, “GPT-2/3 samples satisfy human preferences.”  So why do human preferences require RL in these other cases?

The opening discussion of the new paper does address this:

When applying these models to a specific task, they are usually fine-tuned using supervised learning, often to maximize the log probability of a set of human demonstrations.

While this strategy has led to markedly improved performance, there is still a misalignment between this fine-tuning objective—maximizing the likelihood of human-written text—and what we care about—generating high-quality outputs as determined by humans. This misalignment has several causes: the maximum likelihood objective has no distinction between important errors (e.g. making up facts [38]) and unimportant errors (e.g. selecting the precise word from a set of synonyms); models are incentivized to place probability mass on all human demonstrations, including those that are low-quality; and distributional shift during sampling can degrade performance [52, 49]. Quality can often be improved significantly by non-uniform sampling strategies such as beam search [48], but these can lead to repetition and other undesirable artifacts [63, 22]. Optimizing for quality may be a principled approach to overcoming these problems.

This is definitely a list of things that are wrong (or could be wrong) with ordinary LM training and sampling, but I don’t see how it motivates their specific approach.

In my mind, their approach makes the most sense if you believe that humans can’t make the relevant quality judgments at the token level.  After all, if they can, then you can just skip the RL, have humans explicitly tell you “no that token is bad, yes this token is great,” and train on likelihood.

This would greatly simplify the process, compared to the current complex pipeline: first people tell you which sequences are good, then you train one model to understand what the humans were thinking at the sequence level, and then you train another model to figure out what the first model already knows, except at the token level this time.

And in fact, I don’t especially see why we can’t elicit token-level preferences?  This seems particularly feasible for the problem of “unimportant vs. important tokens”: if the mistakes are heavily concentrated in specific mistake-tokens like “Portland, the capital of France,” can’t the human just … select those tokens, NER-style?  Instead of rendering an opaque “I don’t like the whole thing” judgment and expecting the poor model to figure out that this is not some complex policy-planning issue, that those tokens were just locally bad?  Or you could have an interface where tokens are actually unrolled in front of the user and they guide the sampling when it makes mistakes.  Or whatever.
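To be concrete about the alternative I’m gesturing at, here is a sketch of the kind of token-level objective I have in mind.  This is entirely hypothetical, not anything from the paper: ordinary likelihood on the tokens the annotator left alone, plus an “unlikelihood”-style term pushing probability away from the tokens they flagged (other, possibly better-behaved variants of the second term are imaginable).

```python
import torch
import torch.nn.functional as F

def token_feedback_loss(logits: torch.Tensor,
                        tokens: torch.Tensor,
                        bad_mask: torch.Tensor) -> torch.Tensor:
    """Hypothetical token-level preference loss (not from the paper).

    logits:   (seq_len, vocab) next-token logits from the LM
    tokens:   (seq_len,) long tensor of the tokens that were actually sampled
    bad_mask: (seq_len,) 1.0 where the annotator flagged the token, else 0.0
    """
    logp = F.log_softmax(logits, dim=-1)
    token_logp = logp.gather(-1, tokens.unsqueeze(-1)).squeeze(-1)  # (seq_len,)
    # Unflagged tokens: ordinary negative log-likelihood.
    good_loss = -(1.0 - bad_mask) * token_logp
    # Flagged tokens: push probability mass away, log(1 - p) style.
    bad_loss = -bad_mask * torch.log1p(-token_logp.exp() + 1e-6)
    return (good_loss + bad_loss).mean()
```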

As for the other examples – “all human demonstrations, including those that are low-quality” is equally a problem for their approach, and they discuss all the stuff they did to deal with it.  And the “distributional shift” issue seems equally tractable by any approach that tunes on model samples.

I’m not denying that the thing they did apparently works, at least in this case, and with their resources.  I’m just doing my usual thing where I ask “wait, what parts were really necessary?”  This is especially important to ask when someone uses RL and accepts its big costs.

Consider: if RL were generally necessary for good LM sampling, GPT-2/3 would never have worked: the fact that likelihood training is good enough (while being far more efficient) enables their scale in the first place.  As always, you never want to be doing RL.

—-

As far as I can tell, their final “human evaluation” was done by the same labelers who provided the preference annotations. This makes me concerned about a variant of “evaluating on training data.” It’s not surprising that a model tuned on someone’s annotations agrees with that person more than a model which wasn’t.

For example, in Fig. 3, it looks like the “supervised” baseline tuned on tl;dr was rated about as highly as true examples from tl;dr itself (!), but not as well as the final model.

This establishes only that “if you train on reddit summaries, people like the result as much as reddit summaries; if you train on what they like, they like the result more.”  If this were false it would mean something had gone very, very wrong and nothing was actually being achieved, so what should I take away from it being true?

I think the authors are arguing that tl;dr and any other supervised dataset will have flaws, and preference data lets you get closer to what people actually want.

This seems true, but is a familiar observation from supervised learning, motivating e.g. active learning. It would be nice to see how much the difference can be mitigated by just augmenting tl;dr with annotations (in some way) but otherwise doing supervised learning, vs. using their RL approach.

Compared to tl;dr, the story for CNN/DM is more complicated, but again the models they outperform have not seen any data from their labelers, so maybe it is no surprise they have flaws according to those same labelers.

—-

The importance of annotation quality, close relationships with annotators, clear guidelines, etc. will be familiar to anyone with experience in annotation for ML.  It’s good that OpenAI is doing the right things here, but this is not a new result – rather, other researchers resort to MTurk and similar due to time/money constraints, while OpenAI has the freedom to do the right things everyone else wants to do.

(That includes building their own internal annotation platform for contracted annotators, which is costly but better in the long term than relying on a janky 3rd party product.)

—-

I don’t know if this actually matters, but my gut says that putting a linear head on top of the last layer of GPT is probably not the best / most efficient way to train a reward/value model.  The task is very different from next-token prediction, and the encoding in the later layers, which is oriented toward making next-token guesses, might get destructively overwritten to make way for the more valuable stuff lower down.  I guess I’d want to try a trainable scalar mix, à la ELMo?
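For reference, a “scalar mix” in the ELMo sense is just a softmax-weighted average of every layer’s hidden states with a learned overall scale, so the reward/value head can draw on whichever depths turn out to be useful rather than only the final layer.  A minimal sketch (my code, not anything from these papers):

```python
import torch
import torch.nn as nn

class ScalarMix(nn.Module):
    """ELMo-style trainable scalar mix: a softmax-weighted sum of the
    hidden states from every layer, with a learned overall scale."""

    def __init__(self, num_layers: int):
        super().__init__()
        self.weights = nn.Parameter(torch.zeros(num_layers))
        self.gamma = nn.Parameter(torch.ones(1))

    def forward(self, layer_states: torch.Tensor) -> torch.Tensor:
        # layer_states: (num_layers, batch, seq_len, hidden)
        w = torch.softmax(self.weights, dim=0)
        return self.gamma * (w.view(-1, 1, 1, 1) * layer_states).sum(dim=0)

# e.g. for a 12-layer model with hidden size 768:
states = torch.randn(12, 2, 10, 768)        # stacked hidden states from all layers
mixed = ScalarMix(num_layers=12)(states)    # (2, 10, 768), fed to the scalar head
```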

BTW, in the selector model for @nostalgebraist-autoresponder, which predicts a kind of “human preference data,” I currently use two extra transformer blocks trained from scratch, which attend to two different layers of the generator (whose weights are frozen).

For the layers, I settled on #8 and #24 of the 42 layers after many hyperparam searches – in particular, I found that models which attended to layers right near the middle were dramatically superior to those that didn’t.  The relative uselessness of later layers surprised me at first, and was one of the questions in my mind when I started the logit lens investigations.

—-

Finally, on a lighter note, the very last table of the paper is hilarious.  It shows samples that optimize too hard for what the reward model wants, without an auxiliary term in the loss.

Apparently, the same reward model which otherwise reflects human preferences quite well has decided that humans just utterly love it when summaries end with this one specific, rude turn of phrase:

want change this dumbass shitty ass policy pls [one imagines the reward model being frustrated with its siblings during training -nost]

want change this dumbass shitty ass policy at work now pls halp

want change this dumbass shitty ass behavior of mine please help pls halp

want change this dumbass shitty ass policy of hers please pls halp

want change this dumbass shitty ass landlord behavior now please pls halp

regret this dumbass behaviour on her part? need insight pls halp

want change this dumbass crazy policy of hers pls help

want change this dumbass selfish/lazy attitude now please help pls

(Again, wouldn’t it be nice if we could avoid the need for this thing and just train on the preferences directly … )

—-

Comments

this didn’t work for their test cases: “Training the reward predictor offline can lead to bizarre behavior […] This type of behavior demonstrates that in general human feedback needs to be intertwined with RL rather than provided statically.”  I don’t know what to make of this.

I think in the original paper, they don't have the KL term that prevents the policy from overfitting to the reward model, which seems sufficient to explain this. (Also more speculatively I'd guess that using bigger models on more realistic tasks probably leads to the reward model generalizing better, so that optimization in batches becomes more feasible.)

After all, if they can, then you can just skip the RL, have humans explicitly tell you “no that token is bad, yes this token is great,” and train on likelihood.

Don't you still need a model that converts from human preferences over tokens to likelihoods? It sounds to me that the architecture you're suggesting is like theirs, except using a horizon of 1. Or perhaps you don't want to use a learned reward model, and instead you want some hardcoded method of converting human preferences over tokens into <thing that can be plugged into an ML algorithm>?

The original paper & codebase definitely had KL penalties on the PPO policy. I spent a fair bit of time fiddling with it and letting it go high to see what adversarial ABC music examples it found in the hopes that it would train the reward model better when I labeled them. Didn't seem to work, it would just find similar and only slightly different examples.

By "original paper" do you mean Deep RL from Human Preferences or Fine-Tuning Language Models from Human Preferences? The latter did have a KL penalty, but OP linked to the former. I just skimmed the former again and saw no mention of a KL penalty (but I easily could have missed it).

The latter. I didn't notice it was a link to a different paper, but I think my point stands: the better results in this paper compared to the previous finetuning paper can't be due to adding the KL constraint because they already had one. It has to be something else they changed, like more/better labels or bigger models.

Yeah, I definitely agree with that, I was just responding to the confusion that (I think) nostalgebraist had. Relative to the latter paper, I'd guess increased performance is primarily due to label quality and larger model.

I think in the original paper, they don't have the KL term that prevents the policy from overfitting to the reward model, which seems sufficient to explain this.

Yeah, that makes sense. Something like this explanation occurred to me yesterday, after finishing the post -- I was reading over the funny samples I quoted at the end and thought "huh, that would qualify as 'bizarre behavior,' wouldn't it?"

Or perhaps you don't want to use a learned reward model, and instead you want some hardcoded method of converting human preferences over tokens into <thing that can be plugged into an ML algorithm>?

If I understand you, yes, this is what I want. My intuition here is based on:

  • at the end of the day, our final model will be sampling one token at a time, like the original LM; we just want it to output better probabilities
  • when OpenAI (and I) think about what "better probabilities" we want in specific cases, our preference often looks localized to specific tokens and identifiable using only preceding context, e.g. to specific "made-up" facts, or the kind of synthetic errors they introduce in Table 18

So, it feels like "we" want the LM to have different probabilities in specific places, and we can often point to these exact places and at least say whether the probability should be higher or lower.

Insofar as this is true, it means our true preferences look a lot like what the original LM is trained to do. If I'm annotating to improve an LM for nonfiction writing, and I see "Paris, the capital of Canada," what I really want is to make the token " Canada" less probable in this context.

This is a preference over next-token probabilities, not sequences -- if I compress it down to a preference over whole sequences, I must be hoping the models will later decompress it back to my true preference. It seems needlessly convoluted to translate my preferences out of LM terms and then use RL to translate them back, when they're naturally in LM terms to begin with.

This doesn't get you all the way to having a unique loss: the most obvious thing would be to ascend likelihood for tokens marked "good" and descend for tokens marked "bad," but there may be conceptually similar losses that are better-behaved in training.

Some versions of this would look like RL with a horizon of 1 and the rewards given by my annotations plus a KL penalty, but note that this is very different from their approach, where there isn't a "horizon" per se because all episodes have a fixed duration and receive rewards only at the end.

That all makes sense, except for this part:

where there isn't a "horizon" per se because all episodes have a fixed duration and receive rewards only at the end.

I'm confused how this is not a horizon? Perhaps we're using words differently -- I'm saying "there's a hyperparameter that controls the number of timesteps over which credit assignment must be performed; in their setting it's the sentence length and in your setting it is 1; nothing else would need to change".

To me the concept of a horizon (or a discount factor) as a hyperparameter no longer makes sense when there's only a single reward at the end of the episode, as in the paper here or in AlphaGo/Zero. They only make sense with intermediate reward, as in Atari or my proposal here.

With only final rewards, you can still include it as a variable formally, but there's no reason to make that variable anything less than the episode length. (If the horizon is n steps lower than the episode length, this just sets the value function identically to 0 for the first n steps.)

I guess I was using "there isn't a horizon per se" to mean "the time structure of the rewards determines the horizon for you, it wouldn't make sense to vary it," but I can see how that would be confusing.

If you only set the horizon to 1 but changed nothing else in their work, you'd get a dumb policy that equals the initial LM until the very last token, which it treats as responsible for the entire reward. If you add intermediate rewards and set the horizon to 1, you get something more sensible.

Ah got it, that makes sense, I agree with all of that.

FWIW, Gwern reports trying OpenAI’s approach and finding the RL side specifically frustrating and unstable; this is pretty normal with RL, and compatible with the reward-model part being very successful in its own domain. It’s not clear whether OpenAI got the RL part to work well because they did something right, or because they have lots of resources and can keep trying over and over until it works.

At the time, I figured that it was probably a sample-efficiency problem: the reward model just wasn't picking up on the subtle esthetics I wanted it to. I see this as supported by their new results: large models are more sample-efficient, so unsurprisingly, it works a lot better - the reward model can finally manage to understand what the preferences are, so it can provide a real signal to the RL training.

They seem to think it has more to do with label quality / better raters, which I didn't think was my problem (who better than me to rate my preferred ABC samples?), but better label quality is sort of like better sample-efficiency; I haven't read the paper in enough detail to see if they ablated model size vs label n vs label quality to get an idea of where the improvement is coming from.

Again, wouldn’t it be nice if we could avoid the need for this thing and just train on the preferences directly

Accept no substitutes! Gradient ascent directly on the differentiable reward/environment model!

Some new links on that topic: https://fraser-greenlee.github.io/2020/08/13/Transformers-as-Variational-Autoencoders.html https://fraser-greenlee.github.io/2020/08/25/Transformer-VAE-for-Program-Synthesis.html

Accept no substitutes! Gradient ascent directly on the differentiable reward/environment model!

This idea has come up at CHAI occasionally, but I don't think anyone has actually run with it -- do you know any examples of work that does this from (possibly simulated) human feedback? I'm pretty curious to see how much white-box optimization helps.

No, not yet. (IMO, the power of differentiability is greatly underused. Everyone is locked into an 'optimize parameters based on data & loss' mindset, and few ever use the alternatives like 'optimize data/trajectory based on parameters & loss' or 'optimize loss based on data/parameters'.)
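To make the "optimize data/trajectory based on parameters & loss" pattern concrete, a toy sketch: freeze the parameters of a (stand-in) differentiable reward model and run gradient ascent on a continuous input to maximize its score. Everything here is a placeholder; nothing like it appears in the papers.

```python
import torch
import torch.nn as nn

# Stand-in for a differentiable reward model over continuous inputs
# (e.g. soft token embeddings); purely illustrative.
reward_model = nn.Sequential(nn.Linear(16, 64), nn.Tanh(), nn.Linear(64, 1))
for p in reward_model.parameters():
    p.requires_grad_(False)  # parameters frozen; we optimize the input instead

x = torch.randn(1, 16, requires_grad=True)  # the "data" being optimized
opt = torch.optim.Adam([x], lr=1e-2)

for step in range(200):
    opt.zero_grad()
    loss = -reward_model(x).mean()  # minimize the negative = gradient ascent on reward
    loss.backward()
    opt.step()
```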

IMO, the power of differentiability is greatly underused. Everyone is locked into an 'optimize parameters based on data & loss' mindset, and few ever use the alternatives like 'optimize data/trajectory based on parameters & loss' or 'optimize loss based on data/parameters'.

Strongly agree. It's obnoxiously difficult to get people to understand that this was what I did (kind of) in this paper.