RL with KL penalties is better seen as Bayesian inference

Ethan Perez

I broadly agree with this perspective, and think the modeling vs inference distinction is a valuable one to make.

That said, it seems to me that in practice you often should be trying to converge to a zero entropy policy. The optimal action is not random; we want the model to be picking the output that looks best in light of its beliefs. (Old Eliezer post.)

For some applications you are sampling from your model multiple times and ensembling. In this case randomness can help if you have no memory. But in the same setting, optimizing the correct reward function ("solve the problem given that none of your previous tries worked") wouldn't involve converging to zero entropy, because the quality of an output decreases naturally as it becomes more similar to your previous outputs. (Though I think this is a minority of language model sampling in practice.)
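A toy calculation can make the ensembling point concrete. It is only a sketch under assumptions not in the comment: exactly one of a few candidate answers is correct, `beliefs` gives the model's probability that each one is the correct answer, and the three function names are hypothetical.

```python
def p_success_deterministic(beliefs, k):
    """Zero-entropy policy: every one of the k tries is the same top candidate."""
    return max(beliefs)


def p_success_memoryless(beliefs, k):
    """k i.i.d. samples from the belief distribution, with no memory of past tries.

    If candidate i is the correct one (probability q), at least one of the
    k samples hits it with probability 1 - (1 - q) ** k.
    """
    return sum(q * (1 - (1 - q) ** k) for q in beliefs)


def p_success_with_memory(beliefs, k):
    """Policy that knows its previous tries failed: try the k likeliest candidates in order."""
    return sum(sorted(beliefs, reverse=True)[:k])


beliefs = [0.5, 0.3, 0.2]  # assumed toy beliefs
k = 3
print("deterministic:", p_success_deterministic(beliefs, k))
print("memoryless sampling:", p_success_memoryless(beliefs, k))
print("conditioned on failures:", p_success_with_memory(beliefs, k))
```

In this toy setting, memoryless sampling beats repeating the single best guess, and a policy that conditions on its earlier failures beats both without needing any randomness.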

It seems like the ideal way forward is to more accurately capture what you actually care about, then optimize that; staying close to the original distribution feels like more of a hack to me. It seems like you view the original distribution of webtext as more principled or fundamental than I do, but I'm not sure what accounts for that difference. For example, if you want to have a model that creates 10 diverse samples when run 10 times in parallel, I don't see why "stay close to webtext" would be a particularly good way to do that. It seems like incorporating diversity directly into the reward function is the way you'd like to go in the long run; it poses some challenges in the short run, and "match the webtext distribution" is a shortcut you might use before solving the fundamental problem.

I'm personally mostly interested in KL minimization in order to ensure you continue exploring and to ensure the policy changes slowly enough for the reward model to keep up (avoiding Goodhart destroying the approximation quality of the preference model, and perhaps more speculatively avoiding it destroying the approximation quality of a limited overseer human). But as Jacob Hilton points out, right now it seems like a wide range of options will work out OK for that.

evhub (3y)

I think I broadly disagree with this—using RLHF in a way that actually corresponds to some conditional of the original policy seems very important to me.

That said, it seems to me that in practice you often should be trying to converge to a zero entropy policy. The optimal action is not random; we want the model to be picking the output that looks best in light of its beliefs.

I think you probably expect that we'll eventually be able to give good feedback even for quite complex tasks like “actually say the truth”—but if you don't expect that to ever succeed (as I think I don't), then staying close to existing data seems like a pretty important tool to prevent yourself from Goodharting on bad feedback.

It seems like the ideal way forward is to more accurately capture what you actually care about, then optimize that

What if what you actually care about is predicting the world well? In that case, staying close to existing data actually seems like a very principled thing to do. In particular, the KL penalty here lets you interpret the feedback as just extracting a particular conditioned distribution, which is precisely the thing I think you want to do with such a predictor.
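One way to spell out the "conditioned distribution" reading, in notation assumed here rather than taken from the comment ($\pi_0$ for the pretrained LM, $r$ for the reward, $\beta > 0$ for the KL coefficient): the KL-regularised objective

$$J(\pi) = \mathbb{E}_{x \sim \pi}\big[r(x)\big] - \beta \, \mathrm{KL}(\pi \,\|\, \pi_0)$$

is maximised by

$$\pi^{*}(x) = \frac{1}{Z} \, \pi_0(x) \exp\big(r(x)/\beta\big), \qquad Z = \sum_{x} \pi_0(x) \exp\big(r(x)/\beta\big).$$

If $\exp(r(x)/\beta)$ is read as being proportional to the likelihood of some evidence given $x$, then $\pi^{*}$ is Bayes' rule applied to the prior $\pi_0$, which is the sense in which the feedback extracts a conditional of the original distribution.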

Tomek Korbak (4y)

Thanks for sharing your thoughts, I found these remarks extremely insightful!

It seems like the ideal way forward is to more accurately capture what you actually care about, then optimize that; staying close to the original distribution feels like more of a hack to me. It seems like you view the original distribution of webtext as more principled or fundamental than I do, but I'm not sure what accounts for that difference.

A reply that comes to mind is that maybe being grounded in human knowledge, reasoning rules and values represented in web text has inherent value? Maybe web text is already approximately aligned with human preferences and you only want to tweak that distribution a bit to match true human preferences? Assume that's the case. Then, we can decompose LM alignment into (i) learning the web text distribution and (ii) learning how to warp the web text distribution. It seems that (ii) is easier than just learning aligned behaviour from scratch: your reward model doesn't have to work well on arbitrary text but only on text from distributions similar to web text.

Another way of phrasing that point: maybe the assumption that you can have a perfect reward model is unrealistic and we can offload some of the complexity of learning a reward model to a prior given by web text? Or more philosophically, if you're a Bayesian, you shouldn't trust your reward model blindly, you should still have some prior.

adamShimi (4y)

Thanks for this post, it's clear and insightful about RLHF.

From an alignment perspective, would you say that your work gives evidence that we should focus most of the energy on finding guarantees about the distribution that we're aiming for and debugging problems there, rather than thinking about the guarantees of the inference?

(I still expect that we want to understand the inference better and how it can break, but your post seems to push towards a lesser focus on that part)

Tomek Korbak (4y)

I'm glad you found our post insightful!

I'm not sure what the best allocation of energy between modelling and inference is here. I think, however, that the modelling part is more neglected (the target distribution is rarely even considered as something that can be written down and analysed). Moreover, designing good target distributions can be quite alignment-specific, whereas designing algorithms for inference in probabilistic graphical models is an extremely generic research problem, so we can expect progress there anyway.

evhub (3y)

In the first equation under “KL-regularised RL as variational inference,” I think should be $\pi_{0}(x)$.

Tomek Korbak (3y)

fixed, thanks!

Jacob_Hilton (4y)

Great post! This seems like a useful perspective to keep in mind.

Somewhat orthogonally to the theoretical picture, I expect that in the current regime (only optimizing the policy a small amount), any method that does a reasonable job of maximizing reward while controlling how much the policy changes can be made to work in practice. For example, if PPO is tuned appropriately, the KL penalty term can be removed from the reward entirely - instead, PPO's implicit "local" KL penalty controls the rate of policy change.

If we were in the regime of optimizing the policy significantly more, experience from traditional RL suggests that there would be an exploration-exploitation trade-off, which is something that the RL perspective may again offer insight into.

Tomek Korbak (4y)

I expect that in the current regime (only optimizing the policy a small amount), any method that does a reasonable job of maximizing reward while controlling how much the policy changes can be made to work in practice

Yes, that seems plausible. Though as you said, most methods that only change the policy a bit (early stopping, clipping in PPO) do that via implicit KL penalties and still can be seen as updating a prior.
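For readers who haven't seen it, here is a minimal sketch of the per-sample clipped surrogate from one common form of PPO. It is an illustration under assumptions, not necessarily the variant either commenter has in mind: `ratio` stands for the new policy's probability of an action divided by the old policy's, `advantage` is an advantage estimate, and the function name and the default `epsilon=0.2` are hypothetical choices. Clipping is not literally a KL term; it removes the incentive to push the ratio outside `[1 - epsilon, 1 + epsilon]`, which is one way to read "implicit local KL penalty".

```python
def ppo_clipped_term(ratio, advantage, epsilon=0.2):
    """Per-sample clipped surrogate: min(ratio * A, clip(ratio, 1 - eps, 1 + eps) * A)."""
    # Restrict the probability ratio to the trust interval around 1.
    clipped_ratio = max(1.0 - epsilon, min(1.0 + epsilon, ratio))
    # Taking the minimum makes the objective pessimistic: gains from moving
    # far from the old policy are capped, while losses are not.
    return min(ratio * advantage, clipped_ratio * advantage)


# A good action (A > 0): no extra credit for raising its probability beyond 1 + eps.
print(ppo_clipped_term(ratio=1.5, advantage=1.0))
# A bad action (A < 0): no extra credit for lowering its probability below 1 - eps.
print(ppo_clipped_term(ratio=0.5, advantage=-1.0))
```

Because the objective is flat once the ratio leaves the interval in the favourable direction, each update stays near the policy that collected the data, even with no explicit KL term in the reward.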

there would be an exploration-exploitation trade-off, which is something that the RL perspective may again offer insight into.

Definitely exploration-exploitation issues could make the distribution collapse more severe and traditional RL tricks could help with that. But I still believe distribution collapse does not reduce to insufficient exploration and good exploration alone won't solve it. In this specific instance, failing to find the optimal policy is not the problem, the optimal policy itself is the problem.

Charlie Steiner (4y)

You mention converging to a deterministic policy is bad because of repetition, but did I miss you addressing that it's also bad because we want diversity? (Edit: now that I reread that sentence, it makes no sense. Sorry!) In some sense we don't want RL in the limit, we want something a little more aware that we want to sample from a distribution and get lots of different continuations that are all pretty good.

Lucius Bushnaq (4y)

Is a(x) in the formulas supposed to be pi_0(x)?

Tomek Korbak (4y)

good catch, yes, thanks!

Nina Panickssery (2y)

The arguments around RL here could equally apply to supervised fine-tuning.

Methods such as supervised fine-tuning also risk distributional collapse when the objective is to maximize the prediction's correctness without preserving the model's original distributional properties.

Nina Panickssery (2y)

even if $r$ is a smooth, real-valued function and it perfectly captures human preferences across the whole space of possible sequences $X$ and if $x^{*}$ is truly the best thing, we still wouldn’t want the LM to generate only $x^{*}$

Is this fundamentally true? I understand why this is the case in practice, as a model can only capture limited information given finite parameters and compute. Trying to model the optimal output exactly is therefore too hard, and you need to include some entropy/uncertainty in your model, which means you should aim to capture an accurate probability distribution over answers. However, if we were able to perfectly predict the optimal output at all times, surely this would be good?

As an analogy, if we are trying to model the weather, which is notoriously hard and chaotic, with a limited number of parameters, we should aim to output a probability distribution over weather conditions.

However, if we want to predict the shortest path through a maze, getting the exact correct shortest path is better than spreading probabilities over the top n shortest paths.

RobertKirk (4y)

Do you think these insights would generalise to the case where the language model may be interacting with some system during this fine-tuning phase? For example, if it generates queries to an external search engine or API, or has dialogue with a human, then the optimal policy is no longer equivalent to just generating the correct output distribution, as it now also involves environment observations. This setting makes the distinction between a generative model and a policy clearer, and maybe changes the relevance of this statement:

The problem with the RL objective is that it treats the LM as a policy, not as a generative model. While a generative model is supposed to capture a diverse distribution of samples, a policy is supposed to choose the optimal action.

That is, in these settings we do want an optimal policy and not a generative model. This is quite similar to what Paul was saying as well.

Further, if we see aligning language models as a proxy for aligning future strong systems, it seems likely that these systems will be taking multiple steps of interaction in some environment, rather than just generating one (or several) sequences without feedback.

Tomek Korbak (4y)

Do you think these insights would generalise to the case where the language model may be interacting with some system during this fine-tuning phase? For example, if it generates queries to an external search engine or API, or has dialogue with a human, then the optimal policy is no longer equivalent to just generating the correct output distribution, as it now also involves environment observations.

That's a good point and helps to make a distinction between generative models and policies. In the interactive case, your policy pi(a|s) is a conditional distribution. You can equivalently view it as a collection of unconditional distributions {pi_s(a)}, one for each s, and for each of these you are likely to also have distribution collapse (a single best action for a given state). Arguably, that's what you want in RL.

So I think it mostly comes down to a philosophical difference. Do you want your LM to be a decision-maker acting in a world or a model of some probability distribution over texts? If you want a decision-maker and training on language is just a scaffolding to get you there, maybe indeed staying close to the original distribution only has instrumental value?

But what if what you want is just an oracle-type conversational AI: a knowledge base and a common-sense reasoner. Maybe in this case staying close to human knowledge and inference rules represented in language is of inherent value?

RobertKirk (4y)

So I think it mostly comes down to a philosophical difference. Do you want your LM to be a decision-maker acting in a world or a model of some probability distribution over texts? If you want a decision-maker and training on language is just a scaffolding to get you there, maybe indeed staying close to the original distribution only has instrumental value?

But what if what you want is just an oracle-type conversational AI: a knowledge base and a common-sense reasoner. Maybe in this case staying close to human knowledge and inference rules represented in language is of inherent value?

I feel like the post/paper didn't exactly make clear that this was the motivation. You state that distribution collapse is bad without really justifying it (in my reading). Clearly distributional collapse to a degenerate bad output is bad, and it will also often stall learning, so it is bad from an optimisation perspective as well (as it makes exploration much less likely), but this seems different from distributional collapse to the optimal output. For example, when you say

if x∗ is truly the best thing, we still wouldn’t want the LM to generate only x∗

I think I just disagree, in the case where we're considering LLMs as a model for future agent-like systems that we will have to align, which to me is the reason they're useful for alignment research. If there's a normative claim that diversity is important, then you should just have that in your objective/algorithm.

I think KL divergence is included in RLHF as an optimisation hack to make sure it does well. Maybe that has revealed that for alignment we actually wanted the Bayesian posterior distribution you describe rather than just the optimal distribution according to the reward function (i.e. a hardmax rather than a softmax on the reward over trajectories), although whether that was our preference all along or it's just useful in the current regime seems to be an empirical question.
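The hardmax/softmax contrast can be seen numerically in a small sketch. It assumes the standard closed form for the KL-regularised optimum, pi*(x) proportional to pi_0(x) * exp(r(x) / beta); the function name, the three-outcome prior and the rewards are made up for illustration.

```python
import math


def kl_regularised_optimum(prior, rewards, beta):
    """Return pi*(x) proportional to prior(x) * exp(r(x) / beta) over a finite set."""
    # Subtract the max reward before exponentiating for numerical stability;
    # the shift cancels in the normalisation.
    r_max = max(rewards)
    weights = [p * math.exp((r - r_max) / beta) for p, r in zip(prior, rewards)]
    z = sum(weights)
    return [w / z for w in weights]


prior = [0.6, 0.3, 0.1]    # assumed toy prior pi_0 over three outputs
rewards = [0.0, 1.0, 2.0]  # assumed toy rewards r(x)

# Large beta stays close to the prior; small beta approaches a hardmax on r.
for beta in (100.0, 1.0, 0.01):
    policy = kl_regularised_optimum(prior, rewards, beta)
    print(beta, [round(p, 3) for p in policy])
```

With a large `beta` the result is nearly the prior, with `beta = 1` it is a reweighted prior (a softmax over reward plus log-prior), and as `beta` shrinks nearly all mass moves to the highest-reward output, which is the hardmax case.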
