This post is a much extended version of an LW comment I made about OpenAI’s new paper, “Learning to summarize from human feedback.”

Context: this paper is a direct extension of the work OpenAI published last year about fine-tuning GPT-2 with human preference data.  I hadn’t actually read that one closely at the time, but went back and did so now, so this is really a commentary on both.

—-

IMO there are two almost unrelated ideas going on in OpenAI’s preference learning work.

  • First, the idea of collecting binary preference annotations on LM samples, and (in some way) tuning the LM so its samples are better aligned with the preferences.
  • Second, a specific method for tuning the sampling behavior of LMs to maximize an (arbitrary) score function defined over entire samples.

It may help to explain this by going into detail about what they actually do.  Concretely:

  • They feed a bunch of prompts to a language model (LM) like GPT-2/3, and for each one, save several different samples.  They hire annotators to rank the samples in order of perceived quality.
  • They use the annotation dataset to fine-tune a copy of the original model.  The fine-tuning task is not text generation, but something very different: predicting how “good” a sample is, i.e. how likely the annotators are to prefer it to other candidates.  They call this a “reward model.”  (See the sketch just after this list.)
  • The reward model assigns a single score to an entire sample of N tokens.  They want to fine-tune another copy of the model so that its samples maximize these scores.
  • But LM training is usually done with an objective that specifies the quality of the model’s predictions for every single token.  Knowing how good a full sequence of (say) 20 words is does not tell you how good each individual word is.
  • To bridge this gap, they use reinforcement learning.  Now, the task is not “choose the next word correctly,” but “choose the next word so as to maximize your expected score at the end, after choosing all the later ones as well.”
  • Their RL method requires two separate copies of the LM, in addition to the one they tuned as the reward model: a “policy model” and a “value model.”  (In this paper they show that sharing parameters between these two is worse than keeping them separate.)  I’ll just call these two “the final model” below for simplicity.
  • Samples from the final model are still, technically, generated one token at a time.  They treat this like the usual RL setup in which you can only choose individual actions one at a time, because the environment responds unpredictably to each one.  Here, there is no “environment” outside your actions, but the same framework is used.
  • Presumably, the final model is better at planning multi-token structures than the original because it has been trained on a holistic, multi-token objective.  So, it does more planning, but this is implicit in its one-by-one token decisions.
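To make the reward-model step concrete, here is a minimal sketch of roughly the pairwise objective the new paper describes (the earlier paper used a softmax over four candidate samples instead).  The function and variable names are mine, and details like batching and normalization are omitted.

```python
import torch
import torch.nn.functional as F

def reward_model_loss(score_preferred: torch.Tensor,
                      score_rejected: torch.Tensor) -> torch.Tensor:
    """Pairwise preference loss: push the annotator-preferred sample's
    scalar score above the rejected sample's score.
    This is -log sigmoid(r_preferred - r_rejected), averaged over pairs."""
    return -F.logsigmoid(score_preferred - score_rejected).mean()

# Toy usage: in the real setup these scores would come from a GPT-style LM
# with a scalar head, applied to full samples (prompt + completion).
scores_preferred = torch.randn(8, requires_grad=True)
scores_rejected = torch.randn(8, requires_grad=True)
loss = reward_model_loss(scores_preferred, scores_rejected)
loss.backward()
```

The thing to notice is that this is just supervised learning on a scalar prediction task; nothing about it is RL yet.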

I visualize this as two separate things with a bottleneck connecting them.

On one side are the human annotations and the supervised training of the reward model.  This part succeeds insofar as they can train the model to predict the annotations (apparently they can do this quite well).  This step involves a type of data with special challenges, but has nothing to do with RL.

On the other side is the RL part.  This is a modification of ordinary LM training to optimize a global, rather than local, objective.  This part has nothing to do with “human preferences”: the global objective could be anything, and in fact here it isn’t raw human opinion but the opinions of another model trained to predict human opinion.  The noteworthy thing here is not the use of human preference data in particular but the use of RL instead of the more ordinary objective that was apparently a good enough choice to make GPT-2/3 work originally.
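Concretely, the quantity the RL side tries to maximize is, roughly, the reward model’s score for a whole sample minus a KL-style penalty that keeps the tuned policy close to the original LM (this penalty is the “auxiliary term” I mention near the end of the post).  A rough sketch with names of my choosing, leaving out all the PPO machinery and per-token bookkeeping:

```python
import torch

def sequence_reward(rm_score: torch.Tensor,
                    logprob_policy: torch.Tensor,
                    logprob_original: torch.Tensor,
                    beta: float = 0.1) -> torch.Tensor:
    """Whole-sequence reward for the RL side, roughly as described in the papers.

    rm_score:         (batch,) scalar scores from the trained reward model
    logprob_policy:   (batch,) summed log-prob of each sample under the tuned policy
    logprob_original: (batch,) summed log-prob under the frozen original LM
    beta:             strength of the KL-style penalty
    """
    kl_term = logprob_policy - logprob_original  # per-sample log-ratio
    return rm_score - beta * kl_term
```

Note that this number only exists once the whole sample has been generated; the job of the RL machinery is to spread that single number back over the individual token choices.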

(BTW, this resolves my initial confusion as to how OpenAI could possibly have gotten RL to work with human data, something I viewed as a bottleneck.  There is a model sitting between the humans and the RL learner which is much faster to query than the humans.)

The two sides are connected by the reward model.  In the previous paper, the two sides were coupled together more, because they repeatedly collected new human data as the policy changed and then used a new reward model to further train the policy.  Here, they’re totally separate: there were multiple batches of annotation, but each policy experienced an unchanging reward model.

(See Appendix C.6 and their comment about “moving to the offline setting.”  It seems noteworthy that the 2017 OpenAI/DeepMind paper which introduced the “RL from preferences” approach, and which they cite, found that this didn’t work for their test cases: “Training the reward predictor offline can lead to bizarre behavior […] This type of behavior demonstrates that in general human feedback needs to be intertwined with RL rather than provided statically.”  I don’t know what to make of this.)

—-

It’s hard to tell from OpenAI’s discussion how much of their success is due to learning a good reward model, vs. how much depends on RL being necessary for certain kinds of quality in LM samples, despite the wide successes of the non-RL approach.

FWIW, Gwern reports trying OpenAI’s approach and finding the RL side specifically frustrating and unstable; this is pretty normal with RL, and compatible with the reward-model part being very successful in its own domain.  It’s not clear whether OpenAI got the RL part to work well because they did something right, or because they have lots of resources and can keep trying over and over until it works.  (There may have been something in the papers about this that I missed.)

—-

The RL part feels almost in tension with OpenAI’s usual approach with LMs, which is to train on a next-token objective, sample in a next-token way, and focus on scaling up the model rather than improving the training objective or sampling algorithm.

Of course, I understand why they have to do RL if they need to maximize a score over the whole sequence, but my point is that they chose to frame the task that way in the first place.

One could imagine someone arguing that ordinary GPT sampling would never achieve high-quality text, because humans care about global structures across the whole text, and a model trained only to guess the very next token will not know how to plan out these global structures across the whole future of the text it writes.  Against that argument, OpenAI’s own position is that explicit training to plan (i.e. RL) is unnecessary: training on a next-token objective over text is enough to produce strikingly high-quality samples – in other words, “GPT-2/3 samples satisfy human preferences.”  So why do human preferences require RL in these other cases?

The opening discussion of the new paper does address this:

When applying these models to a specific task, they are usually fine-tuned using supervised learning, often to maximize the log probability of a set of human demonstrations.

While this strategy has led to markedly improved performance, there is still a misalignment between this fine-tuning objective—maximizing the likelihood of human-written text—and what we care about—generating high-quality outputs as determined by humans. This misalignment has several causes: the maximum likelihood objective has no distinction between important errors (e.g. making up facts [38]) and unimportant errors (e.g. selecting the precise word from a set of synonyms); models are incentivized to place probability mass on all human demonstrations, including those that are low-quality; and distributional shift during sampling can degrade performance [52, 49]. Quality can often be improved significantly by non-uniform sampling strategies such as beam search [48], but these can lead to repetition and other undesirable artifacts [63, 22]. Optimizing for quality may be a principled approach to overcoming these problems.

This is definitely a list of things that are wrong (or could be wrong) with ordinary LM training and sampling, but I don’t see how it motivates their specific approach.

In my mind, their approach makes the most sense if you believe that humans can’t make the relevant quality judgments at the token level.  After all, if they can, then you can just skip the RL, have humans explicitly tell you “no that token is bad, yes this token is great,” and train on likelihood.

This would greatly simplify the process, compared to the current complex pipeline: first people tell you which sequences are good, then you train one model to understand what the humans were thinking at the sequence level, and then you train another model to figure out what the first model already knows, except at the token level this time.

And in fact, I don’t especially see why we can’t elicit token-level preferences?  This seems particularly feasible for the problem of “unimportant vs. important tokens”: if the mistakes are heavily concentrated in specific mistake-tokens like “Portland, the capital of France,” can’t the human just … select those tokens, NER-style?  Instead of rendering an opaque “I don’t like the whole thing” judgment and expecting the poor model to figure out that this is not some complex policy-planning issue, that those tokens were just locally bad?  Or you could have an interface where tokens are actually unrolled in front of the user and they guide the sampling when it makes mistakes.  Or whatever.
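To be concrete about the alternative I’m gesturing at, here is a sketch of the kind of token-level objective I have in mind.  This is entirely hypothetical, not anything from the paper: ordinary likelihood on the tokens the annotator left alone, plus an “unlikelihood”-style term pushing probability away from the tokens they flagged (other, possibly better-behaved variants of the second term are imaginable).

```python
import torch
import torch.nn.functional as F

def token_feedback_loss(logits: torch.Tensor,
                        tokens: torch.Tensor,
                        bad_mask: torch.Tensor) -> torch.Tensor:
    """Hypothetical token-level preference loss (not from the paper).

    logits:   (seq_len, vocab) next-token logits from the LM
    tokens:   (seq_len,) long tensor of the tokens that were actually sampled
    bad_mask: (seq_len,) 1.0 where the annotator flagged the token, else 0.0
    """
    logp = F.log_softmax(logits, dim=-1)
    token_logp = logp.gather(-1, tokens.unsqueeze(-1)).squeeze(-1)  # (seq_len,)
    # Unflagged tokens: ordinary negative log-likelihood.
    good_loss = -(1.0 - bad_mask) * token_logp
    # Flagged tokens: push probability mass away, log(1 - p) style.
    bad_loss = -bad_mask * torch.log1p(-token_logp.exp() + 1e-6)
    return (good_loss + bad_loss).mean()
```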

As for the other examples – “all human demonstrations, including those that are low-quality” is equally a problem for their approach, and they discuss all the stuff they did to deal with it.  And the “distributional shift” issue seems equally tractable by any approach that tunes on model samples.

I’m not denying that the thing they did apparently works, at least in this case, and with their resources.  I’m just doing my usual thing where I ask “wait, what parts were really necessary?”  This is especially important to ask when someone uses RL and accepts its big costs.

Consider: if RL were generally necessary for good LM sampling, GPT-2/3 would never have worked: the fact that likelihood training is good enough (while being far more efficient) enables their scale in the first place.  As always, you never want to be doing RL.

—-

As far as I can tell, their final “human evaluation” was done by the same labelers who provided the preference annotations. This makes me concerned about a variant of “evaluating on training data.” It’s not surprising that a model tuned on someone’s annotations agrees with that person more than a model which wasn’t.

For example, in Fig. 3, it looks like the “supervised” baseline tuned on tl;dr was rated about as highly as true examples from tl;dr itself (!), but not as well as the final model.

This establishes only that “if you train on reddit summaries, people like the result as much as reddit summaries; if you train on what they like, they like the result more.”  If this were false it would mean something had gone very, very wrong and nothing was actually being achieved, so what should I take away from it being true?

I think the authors are arguing that tl;dr and any other supervised dataset will have flaws, and preference data lets you get closer to what people actually want.

This seems true, but is a familiar observation from supervised learning, motivating e.g. active learning. It would be nice to see how much the difference can be mitigated by just augmenting tl;dr with annotations (in some way) but otherwise doing supervised learning, vs. using their RL approach.

Compared to tl;dr, the story for CNN/DM is more complicated, but again the models they outperform have not seen any data from their labelers, so maybe it is no surprise they have flaws according to those same labelers.

—-

The importance of annotation quality, close relationships with annotators, clear guidelines, etc. will be familiar to anyone with experience in annotation for ML.  It’s good that OpenAI is doing the right things here, but this is not a new result – rather, other researchers resort to MTurk and similar due to time/money constraints, while OpenAI has the freedom to do the right things everyone else wants to do.

(That includes building their own internal annotation platform for contracted annotators, which is costly but better in the long term than relying on a janky 3rd party product.)

—-

I don’t know if this actually matters, but my gut says that putting a linear head on top of the last layer of GPT is probably not the best / most efficient way to train a reward/value model.  The task is very different from next-token prediction, and the encoding in the later layers, which is oriented toward making next-token guesses, might get destructively overwritten to make way for the more valuable stuff lower down.  I guess I’d want to try a trainable scalar mix, à la ELMo?
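For reference, a “scalar mix” in the ELMo sense is just a softmax-weighted average of every layer’s hidden states with a learned overall scale, so the reward/value head can draw on whichever depths turn out to be useful rather than only the final layer.  A minimal sketch (my code, not anything from these papers):

```python
import torch
import torch.nn as nn

class ScalarMix(nn.Module):
    """ELMo-style trainable scalar mix: a softmax-weighted sum of the
    hidden states from every layer, with a learned overall scale."""

    def __init__(self, num_layers: int):
        super().__init__()
        self.weights = nn.Parameter(torch.zeros(num_layers))
        self.gamma = nn.Parameter(torch.ones(1))

    def forward(self, layer_states: torch.Tensor) -> torch.Tensor:
        # layer_states: (num_layers, batch, seq_len, hidden)
        w = torch.softmax(self.weights, dim=0)
        return self.gamma * (w.view(-1, 1, 1, 1) * layer_states).sum(dim=0)

# e.g. for a 12-layer model with hidden size 768:
states = torch.randn(12, 2, 10, 768)        # stacked hidden states from all layers
mixed = ScalarMix(num_layers=12)(states)    # (2, 10, 768), fed to the scalar head
```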

BTW, in the selector model for @nostalgebraist-autoresponder, which predicts a kind of “human preference data,” I currently use two extra transformer blocks trained from scratch, which attend to two different layers of the generator (whose weights are frozen).

For the layers, I settled on #8 and #24 of the 42 layers after many hyperparam searches – in particular, I found that models which attended to layers right near the middle were dramatically superior to those that didn’t.  The relative uselessness of later layers surprised me at first, and was one of the questions in my mind when I started the logit lens investigations.

—-

Finally, on a lighter note, the very last table of the paper is hilarious.  It shows samples that optimize too hard for what the reward model wants, without an auxiliary term in the loss.

Apparently, the same reward model which otherwise reflects human preferences quite well has decided that humans just utterly love it when summaries end with this one specific, rude turn of phrase:

want change this dumbass shitty ass policy pls [one imagines the reward model being frustrated with its siblings during training -nost]

want change this dumbass shitty ass policy at work now pls halp

want change this dumbass shitty ass behavior of mine please help pls halp

want change this dumbass shitty ass policy of hers please pls halp

want change this dumbass shitty ass landlord behavior now please pls halp

regret this dumbass behaviour on her part? need insight pls halp

want change this dumbass crazy policy of hers pls help

want change this dumbass selfish/lazy attitude now please help pls

(Again, wouldn’t it be nice if we could avoid the need for this thing and just train on the preferences directly … )

—-

Comments

this didn’t work for their test cases: “Training the reward predictor offline can lead to bizarre behavior […] This type of behavior demonstrates that in general human feedback needs to be intertwined with RL rather than provided statically.”  I don’t know what to make of this.

I think in the original paper, they don't have the KL term that prevents the policy from overfitting to the reward model, which seems sufficient to explain this. (Also more speculatively I'd guess that using bigger models on more realistic tasks probably leads to the reward model generalizing better, so that optimization in batches becomes more feasible.)

After all, if they can, then you can just skip the RL, have humans explicitly tell you “no that token is bad, yes this token is great,” and train on likelihood.

Don't you still need a model that converts from human preferences over tokens to likelihoods? It sounds to me that the architecture you're suggesting is like theirs, except using a horizon of 1. Or perhaps you don't want to use a learned reward model, and instead you want some hardcoded method of converting human preferences over tokens into <thing that can be plugged into an ML algorithm>?

The original paper & codebase definitely had KL penalties on the PPO policy. I spent a fair bit of time fiddling with it and letting it go high to see what adversarial ABC music examples it found in the hopes that it would train the reward model better when I labeled them. Didn't seem to work, it would just find similar and only slightly different examples.

By "original paper" do you mean Deep RL from Human Preferences or Fine-Tuning Language Models from Human Preferences? The latter did have a KL penalty, but OP linked to the former. I just skimmed the former again and saw no mention of a KL penalty (but I easily could have missed it).

The latter. I didn't notice it was a link to a different paper, but I think my point stands: the better results in this paper compared to the previous finetuning paper can't be due to adding the KL constraint because they already had one. It has to be something else they changed, like more/better labels or bigger models.

Yeah, I definitely agree with that, I was just responding to the confusion that (I think) nostalgebraist had. Relative to the latter paper, I'd guess increased performance is primarily due to label quality and larger model.

I think in the original paper, they don't have the KL term that prevents the policy from overfitting to the reward model, which seems sufficient to explain this.

Yeah, that makes sense. Something like this explanation occurred to me yesterday, after finishing the post -- I was reading over the funny samples I quoted at the end and thought "huh, that would qualify as 'bizarre behavior,' wouldn't it?"

Or perhaps you don't want to use a learned reward model, and instead you want some hardcoded method of converting human preferences over tokens into <thing that can be plugged into an ML algorithm>?

If I understand you, yes, this is what I want. My intuition here is based on:

  • at the end of the day, our final model will be sampling one token at a time, like the original LM; we just want it to output better probabilities
  • when OpenAI (and I) think about what "better probabilities" we want in specific cases, our preference often looks localized to specific tokens and identifiable using only preceding context, e.g. to specific "made-up" facts, or the kind of synthetic errors they introduce in Table 18

So, it feels like "we" want the LM to have different probabilities in specific places, and we can often point to these exact places and at least say whether the probability should be higher or lower.

Insofar as this is true, it means our true preferences look a lot like what the original LM is trained to do. If I'm annotating to improve an LM for nonfiction writing, and I see "Paris, the capital of Canada," what I really want is to make the token " Canada" less probable in this context.

This is a preference over next-token probabilities, not sequences -- if I compress it down to a preference over whole sequences, I must be hoping the models will later decompress it back to my true preference. It seems needlessly convoluted to translate my preferences out of LM terms and then use RL to translate them back, when they're naturally in LM terms to begin with.

This doesn't get you all the way to having a unique loss: the most obvious thing would be to ascend likelihood for tokens marked "good" and descend for tokens marked "bad," but there may be conceptually similar losses that are better-behaved in training.

Some versions of this would look like RL with a horizon of 1 and the rewards given by my annotations plus a KL penalty, but note that this is very different from their approach, where there isn't a "horizon" per se because all episodes have a fixed duration and receive rewards only at the end.

That all makes sense, except for this part:

where there isn't a "horizon" per se because all episodes have a fixed duration and receive rewards only at the end.

I'm confused how this is not a horizon? Perhaps we're using words differently -- I'm saying "there's a hyperparameter that controls the number of timesteps over which credit assignment must be performed; in their setting it's the sentence length and in your setting it is 1; nothing else would need to change".

To me the concept of a horizon (or a discount factor) as a hyperparameter no longer makes sense when there's only a single reward at the end of the episode, as in the paper here or in AlphaGo/Zero. They only make sense with intermediate reward, as in Atari or my proposal here.

With only final rewards, you can still include it as a variable formally, but there's no reason to make that variable anything less than the episode length. (If the horizon is n steps lower than the episode length, this just sets the value function identically to 0 for the first n steps.)

I guess I was using "there isn't a horizon per se" to mean "the time structure of the rewards determines the horizon for you, it wouldn't make sense to vary it," but I can see how that would be confusing.

If you only set the horizon to 1 but changed nothing else in their work, you'd get a dumb policy that equals the initial LM until the very last token, which it treats as responsible for the entire reward. If you add intermediate rewards and set the horizon to 1, you get something more sensible.

Ah got it, that makes sense, I agree with all of that.

FWIW, Gwern reports trying OpenAI’s approach and finding the RL side specifically frustrating and unstable; this is pretty normal with RL, and compatible with the reward-model part being very successful in its own domain. It’s not clear whether OpenAI got the RL part to work well because they did something right, or because they have lots of resources and can keep trying over and over until it works.

At the time, I figured that it was probably a sample-efficiency problem: the reward model just wasn't picking up on the subtle esthetics I wanted it to. I see this as supported by their new results: large models are more sample-efficient, so unsurprisingly, it works a lot better - the reward model can finally manage to understand what the preferences are, so it can provide a real signal to the RL training.

They seem to think it has more to do with label quality / better raters, which I didn't think was my problem (who better than me to rate my preferred ABC samples?), but better label quality is sort of like better sample-efficiency; I haven't read the paper in enough detail to see if they ablated model size vs label n vs label quality to get an idea of where the improvement is coming from.

Again, wouldn’t it be nice if we could avoid the need for this thing and just train on the preferences directly

Accept no substitutes! Gradient ascent directly on the differentiable reward/environment model!

Some new links on that topic: https://fraser-greenlee.github.io/2020/08/13/Transformers-as-Variational-Autoencoders.html https://fraser-greenlee.github.io/2020/08/25/Transformer-VAE-for-Program-Synthesis.html

Accept no substitutes! Gradient ascent directly on the differentiable reward/environment model!

This idea has come up at CHAI occasionally, but I don't think anyone has actually run with it -- do you know any examples of work that does this from (possibly simulated) human feedback? I'm pretty curious to see how much white-box optimization helps.

No, not yet. (IMO, the power of differentiability is greatly underused. Everyone is locked into an 'optimize parameters based on data & loss' mindset, and few ever use the alternatives like 'optimize data/trajectory based on parameters & loss' or 'optimize loss based on data/parameters'.)
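To make the "optimize data/trajectory based on parameters & loss" pattern concrete, a toy sketch: freeze the parameters of a (stand-in) differentiable reward model and run gradient ascent on a continuous input to maximize its score. Everything here is a placeholder; nothing like it appears in the papers.

```python
import torch
import torch.nn as nn

# Stand-in for a differentiable reward model over continuous inputs
# (e.g. soft token embeddings); purely illustrative.
reward_model = nn.Sequential(nn.Linear(16, 64), nn.Tanh(), nn.Linear(64, 1))
for p in reward_model.parameters():
    p.requires_grad_(False)  # parameters frozen; we optimize the input instead

x = torch.randn(1, 16, requires_grad=True)  # the "data" being optimized
opt = torch.optim.Adam([x], lr=1e-2)

for step in range(200):
    opt.zero_grad()
    loss = -reward_model(x).mean()  # minimize the negative = gradient ascent on reward
    loss.backward()
    opt.step()
```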

IMO, the power of differentiability is greatly underused. Everyone is locked into an 'optimize parameters based on data & loss' mindset, and few ever use the alternatives like 'optimize data/trajectory based on parameters & loss' or 'optimize loss based on data/parameters'.

Strongly agree. It's obnoxiously difficult to get people to understand that this was what I did (kind of) in this paper.