Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

As a writing exercise, I'm writing an AI Alignment Hot Take Advent Calendar - one new hot take, written every day (well, some days) for 25 days. I have now procrastinated enough that I probably have enough hot takes.

Edit - I should have cited Buck's recent post somewhere.

Hyperbolic title, sorry. But seriously, conditioning is better than RLHF for current language models. For agents navigating the real world, both have issues and it's not clear-cut where progress will come from.

By "conditioning", I mean the decision transformer trick to do conditional inference: get human ratings of sequences of tokens, and then make a dataset where you append the ratings to the front of the sequences. A model trained on this dataset for next-token prediction will have to learn the distribution of text conditional on the rating - so if you prompt it with a high rating and then the start of an answer, it will try to continue the answer in a way humans would rate highly.

This can be very similar to RLHF - especially if you augment the training data by building a model of human ratings, and train a model to do conditional inference by finetuning a model trained normally. But in the right perspective, the resulting AIs are trying to do quite different things.
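Concretely, the augmentation might look something like this sketch, where rating_model stands in for any learned model of human ratings (a placeholder, not a particular library's API):

```python
# Sketch: augment the conditioned dataset using a learned model of human ratings.
# `rating_model` is any callable text -> estimated rating (a placeholder here).

def augment_with_rating_model(unlabeled_texts, rating_model):
    """Attach estimated ratings to extra text so it can be conditioned on too."""
    return [(text, rating_model(text)) for text in unlabeled_texts]

# conditioned = build_conditioned_examples(
#     human_rated_texts + augment_with_rating_model(unlabeled_texts, rating_model)
# )
# Fine-tuning a normally-pretrained LM on `conditioned` with the usual LM loss
# then gives a model you can query by prepending a rating token.
```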

RLHF is sorta training the AI to be an agent. Not an agent that navigates the real world, but an agent that navigates the state-space of text. It learns to prefer certain trajectories of the text, and takes actions (outputs words) to steer the text onto favored trajectories. Conditioning, on the other hand, is trying to faithfully learn the distribution of possible human responses - it's getting trained to be a simulator that can predict many different sorts of agents.

The difference is stark in their reactions to variance. RLHF wants to eliminate variance that might make a material difference to the trajectory (at least when its explicit KL penalty is small relative to the implicit KL cost of Bayesian-updating on the rating), while conditioning on a rating still tries to produce something that looks like the training distribution.
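For reference, the usual KL-regularized RLHF objective (a standard formulation, not quoted from this post) is:

```latex
% Maximize expected reward while penalizing divergence from the
% pretrained distribution \pi_0; the smaller \beta is, the more any
% variance that affects reward gets squeezed out.
\max_{\pi} \;\; \mathbb{E}_{x \sim \pi}\!\left[ r(x) \right]
  \;-\; \beta \, \mathrm{KL}\!\left( \pi \,\|\, \pi_0 \right)
```

Conditioning instead targets the conditional distribution p(text | rating), which keeps whatever spread the training distribution has at that rating.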

This makes conditioning way better whenever you care about the diversity of options produced by a language model - e.g. if you're trying to get the AI to generate something specific yet hard to specify, and you want to be able to sift through several continuations. Or if you're building a product that works like souped-up autocorrect, and want to automatically get a diversity of good suggestions.
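For instance, getting several continuations out of a rating-conditioned model is just ordinary sampling. A sketch, assuming a model fine-tuned as above and loaded with Hugging Face transformers (the model name is a placeholder):

```python
# Sketch: sample several diverse continuations from a rating-conditioned model.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("your-rating-conditioned-model")  # placeholder
model = AutoModelForCausalLM.from_pretrained("your-rating-conditioned-model")

prompt = "<|rating_9|> Write a tagline for a neighborhood bakery:"
inputs = tokenizer(prompt, return_tensors="pt")

# Ordinary ancestral sampling; since the model was trained to match the
# conditional distribution, the samples stay diverse rather than collapsing.
outputs = model.generate(
    **inputs,
    do_sample=True,
    top_p=0.95,
    max_new_tokens=40,
    num_return_sequences=5,
)
for seq in outputs:
    print(tokenizer.decode(seq, skip_special_tokens=True))
```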

Another benefit is quantilization. RLHF is trying to get the highest score available, even if it means exploiting human biases. If instead you condition on a score that's high but still regularly gotten by humans, it's like you're sampling policies that get this high-but-not-too-high score, which are less exploitative of human raters than the absolute maximum-score policy.
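A quick sketch of what "high but still regularly gotten by humans" could mean operationally (the 90th percentile is an arbitrary illustrative choice):

```python
# Sketch: pick a conditioning target that's high but still regularly achieved
# by human raters, rather than the maximum of the rating scale.
import numpy as np

def target_rating(human_ratings, quantile=0.9):
    """A high-but-not-extreme rating to condition on (90th percentile here)."""
    return float(np.quantile(human_ratings, quantile))

# Prompting with rating_token(target_rating(human_ratings)) rather than the top
# bucket is roughly sampling from policies humans actually produce at that level.
```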

This isn't a free lunch. Fine-tuning for conditional inference has less of an impact on what sort of problem the AI is solving than RLHF does, but it makes that problem way harder. Unsurprisingly, performance tends to be worse on harder problems. Still, research on decision transformers is full of results that are somewhat competitive with other methods.

It also still exploits the human raters some amount, increasing with the extremity of the score. Sam Marks has talked about a scheme using online decision transformers to improve performance without needing to make the score extreme relative to the distribution seen so far, which is definitely worth a read, but this seems like a case of "optimality is the tiger." Whether found by RLHF or conditioning, the problem is with the policies that get the highest scores.

Looking out to the future, I'm uncertain about how useful conditioning will really be. For an AI that chooses policies to affect the real world (as opposed to generating text), it doesn't seem nearly so important to be able to produce a variety of on-distribution policies. On the other hand, maybe we'll come up with ways to leverage that capability that are useful for alignment.

Currently, I expect that many of the shared problems between RLHF and conditioning will be tackled by developing the capability for models to receive meta-level feedback that directly affects generalization properties. This capability is more consonant with RLHF, and is discordant with conditioning, because it means departing from the generator of the training distribution in order to do the right thing.

4 comments

This was definitely a hot take. 🔥

RLHF is sorta training the AI to be an agent

I think this might be widely believed and also wrong, or at least not well-supported. What makes training with RL(HF) inherently produce "agents" in a way that finetuning on high-capability trajectories, or pretraining on datasets from human agents does not? IMO the community would think more clearly if we just talked about policies and policy optimization algorithms (of which teacher-forced finetuning, decision-transformer stuff, PPO, etc. are examples) directly rather than leaning on our confused understanding of the agency concept.

The difference is stark in their reactions to variance. RLHF wants to eliminate variance that might make a material difference in the trajectory (when the KL penalty is small relative to the Bayesian-updating KL penalty), while conditioning on rating still tries to produce something that looks like the training distribution.

Why is it that you think that decision transformer-style conditioning preserves variance better than PPO on a reward model? I don't recall seeing any LLMs that use that method. The closest we have is the text-davinci-002 "FeedMe" model, where they train the model to imitate the distribution of highly-rated samples. That model seems to produce mode collapse all the same, so much so that we mistakenly attributed the phenomenon to RLHF. If what you say is true, we should see stark differences in variability/creativity between text-davinci-002 and text-davinci-003, no?

Another benefit is quantilization. RLHF is trying to get the highest score available, even if it means exploiting human biases. If instead you condition on a score that's high but still regularly gotten by humans, it's like you're sampling policies that get this high-but-not-too-high score, which are less exploitative of human raters than the absolute maximum-score policy.

I disagree. In both cases you get a model that outputs a shaped categorical distribution over next tokens that you can pick from however you'd like (for ex. via top-p, which is a quantilization-flavored sampling algorithm). Also not sure in what sense you meant that "RLHF is trying to get the highest score possible". This seems like it's false in reference to the actual policies found by the algorithm (reward is not the optimization target yadda yadda), questionable in reference to the learning algorithm absent something like a convergence guarantee, and false in reference to the overall humans-training-a-model process (i.e. researchers in practice check to see whether the model is just producing bad samples that happen to trick the reward model).
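(For concreteness, a minimal top-p sampler over a next-token distribution, written as my own sketch rather than any particular library's implementation, looks like this:)

```python
# Minimal nucleus (top-p) sampling: sample only from the smallest set of tokens
# whose probabilities sum to at least p, renormalized. This is the sense in
# which it has a quantilization flavor: it truncates the tail of the
# distribution and samples from the remaining high-probability region.
import numpy as np

def top_p_sample(probs, p=0.9, rng=None):
    rng = rng or np.random.default_rng()
    order = np.argsort(probs)[::-1]                  # tokens, most likely first
    sorted_probs = probs[order]
    cutoff = np.searchsorted(np.cumsum(sorted_probs), p) + 1
    kept, kept_probs = order[:cutoff], sorted_probs[:cutoff]
    return int(rng.choice(kept, p=kept_probs / kept_probs.sum()))

# e.g. top_p_sample(np.array([0.5, 0.3, 0.15, 0.05]), p=0.9) samples among the
# first three tokens with probabilities proportional to 0.5, 0.3, 0.15.
```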

Thanks for this great comment.

 The closest we have is the text-davinci-002 "FeedMe" model, where they train the model to imitate the distribution of highly-rated samples. That model seems to produce mode collapse all the same, so much so that we mistakenly attributed the phenomenon to RLHF.

I don't actually understand the text-davinci-002 training method, so I'll have to look into it more later - thanks for the push! I think it's possible that fine-tuning on ratings of the model's own outputs was done in a way equivalent to an RL step, with effective reward such that it makes sense why it would converge on stuff like "97 is the most random number."

I would say the closest we have is people finetuning language models on more narrow decision transformer tasks, like chess. But I admit I've never seen anyone check for mode collapse in such a case, which now seems like a useful thing to check.

In both cases you get a model that outputs a shaped categorical distribution over next tokens that you can pick from however you'd like (for ex. via top-p, which is a quantilization-flavored sampling algorithm). 

I mean, they both output lists of numbers. But interpreting those numbers as a probability distribution makes a lot more sense when you're training something with a proper scoring rule.

Here's a very slightly more specific way of thinking about what I mean by "agent acting in a text-universe" (which is related to how I think of the output type as something other than a probability distribution). When you have a KL loss term relative to the prior distribution, that gives you a "budget" that you get to spend on some function $f$ that multiplies the prior $P_0$. The more $f$ deviates from 1, the more it costs (something like $\mathbb{E}_{P_0}[f \log f]$). And the RLHF loss is doing something like trying to get a good score while staying under budget on $f$.

This suggests that you can think of $f$ as the policy of an agent - i.e. how "it" wants to spend its budget in each state - rather than $f \cdot P_0$, which is what the whole system actually outputs at each state. This is what I think of colorfully as "an agent living in a text-universe," where the "text-universe" is the dynamics $P_0$ that the "agent" has only limited ability to push around.
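Spelling that out a bit more (my notation; $P_0$ is the prior/base distribution, $r$ the reward, and the fine-tuned output is written as a multiplier $f$ on $P_0$):

```latex
% The fine-tuned distribution as a multiplier on the base distribution
% (with the normalization constraint E_{P_0}[f] = 1):
\pi(x) \;=\; f(x)\, P_0(x),
\qquad
\mathrm{KL}\!\left( \pi \,\|\, P_0 \right)
  \;=\; \mathbb{E}_{x \sim P_0}\!\left[ f(x) \log f(x) \right]
% so the KL-penalized RLHF objective is roughly a constrained problem in f:
\max_{f} \;\; \mathbb{E}_{x \sim f \cdot P_0}\!\left[ r(x) \right]
\quad \text{s.t.} \quad
\mathbb{E}_{x \sim P_0}\!\left[ f(x) \log f(x) \right] \;\le\; \text{budget}
```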

Also not sure in what sense you meant that "RLHF is trying to get the highest score possible". This seems like it's false 

Yeah, this was sloppy.

I think it's possible that fine-tuning on ratings of the model's own outputs was done in a way equivalent to an RL step, with effective reward such that it makes sense why it would converge on stuff like "97 is the most random number."

I guess that's possible, but I don't see any particular reason to entertain that hypothesis. (Other than wanting to rescue our original hunch that RL is what's causing this issue 😛). I'm a lot more inclined to believe that this issue is either some general symptom of over-finetuning language models, or some technical detail like weight decay etc.

I mean, they both output lists of numbers. But interpreting those numbers as a probability distribution makes a lot more sense when you're training something with a proper scoring rule.

I agree that with the pure self-supervised case, it's clearer which distribution we're approximating. But in both cases, the network really does output a probability distribution. GPT outputs a probability distribution over next tokens conditioned on the current context. A policy function outputs a probability distribution over actions conditioned on the current state. In the case of InstructGPT, those actions are next-tokens and the current state is the current context.

Agreed that we can interpret these things as navigating a world of text. It is helpful IMO to realize that under that interpretation, GPT training is behavioral cloning (BC), where the Internet is the expert policy that produced our training trajectories. Both BC and PPO are ways of producing "agents".

I'm a lot more inclined to believe that this issue is either some general symptom of over-finetuning language models, or some technical detail like weight decay etc.

Yeah, after looking into it, the details they released recently are super sparse. But I did see one quote to the effect that it was already overfitting after 1 epoch, yet they kept training for some larger number of epochs because the test score kept improving.

under that interpretation, GPT training is behavioral cloning (BC), where the Internet is the expert policy that produced our training trajectories. Both BC and PPO are ways of producing "agents".

I think this is a different interpretation, actually, because it's a different division of what's "environment" and what's "agent." Talking about base GPT as doing behavior cloning means taking the "environment" to be a passive recorder of the output. In such an environment, everything is easily interpreted as an agent (whose "goal" is producing whatever output it produced), but the tradeoff is the "agent" abstraction isn't helping you compress your data.