I’ve been thinking about Reinforcement Learning from Human Feedback (RLHF) a lot lately, mostly as a result of my AGISF capstone project attempting to use it to teach a language model to write better responses to Reddit writing prompts, a la Learning to summarize from human feedback.

RLHF has generated some impressive outputs lately, but there seems to be a significant amount of disagreement regarding its potential as a partial or complete solution to alignment: some are excited to extend the promising results we have so far, while others are more pessimistic and perhaps even opposed to further work along these lines. I find myself optimistic about the usefulness of RLHF work, but far from confident that all of the method’s shortcomings can be overcome.

How it Works

At a high level, RLHF learns a reward model for a certain task based on human feedback and then trains a policy to optimize the reward received from the reward model. In practice, the reward model learned is likely overfit - the policy can thus benefit from interpolating between a policy that optimizes the reward model’s reward and a policy trained through pure imitation learning. 
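To make the interpolation concrete: one common way to do it (the one used in Learning to summarize from human feedback) is to subtract a KL-style penalty from the reward model's score, so the policy pays a price for drifting away from the imitation-trained policy. Below is a minimal sketch of that shaped reward for a single sample; the function name, argument names, and the value of `beta` are my own illustrative choices, not anything from a particular codebase.

```python
def kl_shaped_reward(reward_model_score, logp_policy, logp_reference, beta=0.1):
    """Reward-model score minus a penalty for drifting from the reference policy.

    logp_policy and logp_reference are the log-probabilities that the trained
    policy and the imitation-learned reference policy assign to the same sample.
    All names here are hypothetical, chosen for illustration.
    """
    # The log-ratio is a single-sample estimate of KL(policy || reference).
    log_ratio = logp_policy - logp_reference
    return reward_model_score - beta * log_ratio


# The policy likes this sample much more than the reference does, so it is penalized.
print(kl_shaped_reward(2.0, logp_policy=-1.0, logp_reference=-3.0, beta=0.5))
```

With `beta = 0` the policy just optimizes the (possibly overfit) reward model; as `beta` grows, the penalty increasingly pulls the policy back toward the imitation-learned one.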

A key advantage of RLHF is the ease of gathering feedback and the sample efficiency of training the reward model. For many tasks, it’s significantly easier to provide feedback on a model’s performance than to teach the model through imitation. We can also conceive of tasks that humans remain incapable of completing themselves, but for which they can evaluate various completions and provide feedback on them. This feedback can be as simple as picking the better of two sample completions, but it’s plausible that other forms of feedback might be more appropriate and/or more effective than this. The ultimate goal is to get a reward model that represents human preferences for how a task should be done: this is also known as Inverse Reinforcement Learning. The creators of the method, Andrew Ng and Stuart Russell, believe that “the reward function, rather than the policy, is the most succinct, robust, and transferable definition of the task.” Think about training an AI to drive a car: we might not want it to learn to imitate human drivers, but rather learn what humans value in driving behavior in the abstract and then optimize against those preferences.
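For the simplest kind of feedback mentioned above (picking the better of two completions), the reward model is typically fit with a Bradley-Terry style loss: the probability that the preferred completion wins is the sigmoid of the difference in rewards. Here is a toy sketch with a linear reward model over made-up feature vectors; the features, learning rate, and function names are all assumptions for illustration, and a real reward model would be a neural network on top of a language model.

```python
import math


def preference_loss(reward_preferred, reward_rejected):
    """Negative log-likelihood that the preferred completion wins (Bradley-Terry)."""
    margin = reward_preferred - reward_rejected
    # -log(sigmoid(margin)), written in a form that is stable for either sign.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


def train_linear_reward_model(comparisons, dim, lr=0.1, epochs=200):
    """Fit weights w so that reward(x) = w . x ranks preferred above rejected.

    comparisons is a list of (preferred_features, rejected_features) pairs.
    """
    w = [0.0] * dim
    for _ in range(epochs):
        for preferred, rejected in comparisons:
            diff = [p - r for p, r in zip(preferred, rejected)]
            margin = sum(wi * di for wi, di in zip(w, diff))
            # Gradient of the loss w.r.t. w is -(1 - sigmoid(margin)) * diff,
            # so gradient descent adds (1 - sigmoid(margin)) * diff.
            scale = 1.0 - 1.0 / (1.0 + math.exp(-margin))
            w = [wi + lr * scale * di for wi, di in zip(w, diff)]
    return w


# Hypothetical data: the labeler consistently prefers completions with more of feature 0.
comparisons = [([1.0, 0.0], [0.0, 1.0]), ([2.0, 0.5], [0.5, 0.5])]
weights = train_linear_reward_model(comparisons, dim=2)
print(weights)
```

Note that nothing here requires the human to be able to produce a good completion, only to say which of two is better.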

Outer Alignment Concerns

If a reward model trained through human feedback properly encoded human preferences, we might expect RLHF to be a plausible path to Outer Alignment. But this seems like a tall order, considering that humans can be assigned any values whatsoever, that the easy goal inference problem is still hard, and that it’s easy to misspecify any model that attempts to correct for human biases or irrationality. Ambitious value learning is hard, and I’m not particularly confident that RLHF makes it significantly more tractable.

It’s also plausible that this approach of inferring a reward function for a task is just fundamentally misguided and that the way to get an outer aligned system is through the assistance-game or CIRL framework instead. There are definite advantages of this paradigm over the more standard reward learning setup that RLHF leverages. By treating humans as pieces of the environment and the reward function as a latent variable in the environment, an AI system can merge the reward learning and policy training functions that RLHF separates and thereby “take into account the reward learning process when selecting actions.” This makes it easier to make plans conditional on future feedback, only gather feedback as and when it becomes necessary, and more fluidly learn from different forms of feedback.

Scalable oversight is hard

RLHF also relies upon humans being able to evaluate the outputs of models. This will likely be impossible for the kinds of tasks we want to scale AI to perform - it’s just going to be too hard for a human to understand why one output should be preferred over another. We’d simply have to hope that the reward model generalization we’d seen previously, when oversight was still possible, continued to hold. Even if we thought we’d figured out how to evaluate our models’ outputs, there’s always the chance of an inner alignment failure or other deceptive behavior evading our oversight - we’d want to be absolutely certain that our reward and policy models were actually doing what we wanted them to do.

The solutions to the scalable oversight problem seem to primarily rely on AI-assistance and/or breakthroughs in interpretability techniques. I think it’s clear how the latter might be useful: if we could just look at any model and be certain of its optimization objective, we’d probably feel pretty comfortable understanding the reward models and policy models we trained. AI-assistance might look something like recursive reward modeling: break the task that’s too hard to oversee into more manageable chunks that a human can oversee and train a model to optimize those tasks. Using the models trained on the narrower subtasks might make the original task possible to oversee: this is an idea that has been used for the task of summarizing books. It’s plausible that there are many tasks that resist this kind of decomposition, but the factored cognition approach might get us very far indeed.

Why I think RLHF is valuable

I’ll quote Paul Christiano here:

We are moving rapidly from a world where people deploy manifestly unaligned models (where even talking about alignment barely makes sense) to people deploying models which are misaligned because (i) humans make mistakes in evaluation, (ii) there are high-stakes decisions so we can't rely on average-case performance.

This seems like a good thing to do if you want to move on to research addressing the problems in RLHF: (i) improving the quality of the evaluations (e.g. by using AI assistance), and (ii) handling high-stakes objective misgeneralization (e.g. by adversarial training).

In addition to "doing the basic thing before the more complicated thing intended to address its failures," it's also the case that RLHF is a building block in the more complicated things.

I think that (a) there is a good chance that these boring approaches will work well enough to buy (a significant amount) time for humans or superhuman AIs to make progress on alignment research or coordination, (b) when they fail, there is a good chance that their failures can be productively studied and addressed.

I generally agree with this. Solving problems that crop up in RLHF seems likely to transfer to other alignment methods, or at least be productive mistakes. The interpretability techniques we develop, outer or inner alignment failures we find, and latent knowledge we elicit from our reward and policy models all seem broadly applicable to future AI paradigms. In other words, I think the textbook from the future on AI Alignment is likely to speak positively of RLHF, at the very least as an early alignment approach.

Promising RLHF Research Directions (according to me)

I’d like to see different kinds of feedback be used in addition to preference orderings over model outputs. This paper specifies a formalism for reward learning in general and considers several different kinds of feedback that might be appropriate for different tasks, e.g. demonstration, correction, natural language feedback, etc. A reward model that can gracefully learn from a wide array of feedback types seems like a desirable goal. This kind of exploration might also help us figure out better and worse forms of feedback and what kinds of generalization arise from each type.

Relatedly, I think it might be interesting to see how the assistance game paradigm performs in settings where the RLHF paradigm has been applied, like text summarization. On a theoretical level it seems clear that the assistance game setup offers some unique benefits and it would be cool to see those realized. 

As we continue to scale RLHF work up, I want to see how we begin to decompose tasks so that we can apply methods like Recursive Reward Modeling. For book summarization, OpenAI used a fixed chunking algorithm to break the text down into manageable pieces, but it seems likely that other kinds of decomposition won’t be as trivial. We might need AI assistance to decompose tasks that we can’t oversee into tasks that we can. Training decomposition models that can look at a task and identify overseeable subtasks seems like a shovel-ready problem, perhaps one that we might even apply RLHF to. 
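To illustrate what the "trivial" kind of decomposition looks like, here is a rough sketch of fixed-size chunking followed by recursive summarization. This is my own simplification, not OpenAI's actual pipeline: the function names and chunk size are assumptions, and `first_words` is a placeholder standing in for a learned summarization policy (which is the part RLHF would actually train).

```python
def chunk(text, chunk_size):
    """Split text into consecutive pieces of at most chunk_size words."""
    words = text.split()
    return [" ".join(words[i:i + chunk_size]) for i in range(0, len(words), chunk_size)]


def summarize_recursively(text, summarize, chunk_size=200):
    """Summarize each chunk, join the summaries, and repeat until one chunk remains.

    Assumes summarize() returns something much shorter than a full chunk;
    otherwise the recursion would not terminate.
    """
    pieces = chunk(text, chunk_size)
    if len(pieces) <= 1:
        # Small enough for a human to oversee the summary directly.
        return summarize(text)
    combined = " ".join(summarize(piece) for piece in pieces)
    return summarize_recursively(combined, summarize, chunk_size)


def first_words(text, n=20):
    """Placeholder 'summarizer': keeps the first n words. Not a real model."""
    return " ".join(text.split()[:n])


book = " ".join(str(i) for i in range(1000))  # stand-in for a long text
summary = summarize_recursively(book, first_words, chunk_size=200)
print(len(summary.split()))
```

The interesting open problem is replacing `chunk` with something learned, for tasks where there is no obvious fixed way to split the work into overseeable pieces.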

5 comments

This was a nice post! I appreciate the effort you're making to get your inside view out there.

A correction:

The ultimate goal is to get a reward model that represents human preferences for how a task should be done: this is also known as Inverse Reinforcement Learning.

Based on this sentence, you might be conflating value learning (the broad class of approaches to outer alignment that involve learning reward models) with IRL, which is the particular sub-type of value learning in which the ML model tries to infer a reward function by observing the behavior of some agent whose behavior is assumed (approximately) optimal for said reward function. So, for example, IRL includes learning how to fly a helicopter by watching an expert, but not the approach used in "Learning to summarize from human feedback," in which a reward model was trained via supervised learning from pairwise comparisons.

Relatedly, I'll note that much (though not all) of the skepticism about value learning linked in the "Outer alignment concerns" section is IRL-specific. In more detail, many of the linked posts revolve around the IRL-specific issue of "How do you correct for your 'expert demonstration' actually being performed by a suboptimal human?[1]" But this concern doesn't seem to apply to all types of value learning; for example, RLHF doesn't require humans to be approximately optimal at the task, only that we are able to judge completions of the task. (That said, I haven't read the "Value Learning" sequence in detail, so it's possible I'm misunderstanding and they actually explain how this concern generalizes to all value learning approaches?[2])

Unrelated to the point about IRL, my inside view agrees with yours that an important next step in RLHF is making it possible for humans to give richer feedback, e.g. natural language feedback, trajectory corrections, etc. I, too, was excited by the Reward-rational choice paper (if not the particular formalism proposed there, then the general thrust that we should have a framework for giving lots of different types of feedback to our AI systems). Conversely, my inside view finds CIRL less promising than yours does.


  1. Human irrationality is one example of suboptimal human behavior, but there are others too. For example, a perfect IRL agent watching a human playing a video game in which perfect play requires super-human reflexes would infer that the human wanted to react slowly. So suboptimal behavior is an obstruction to both correctly inferring human values and producing a super-human agent via IRL.

  2. According to me, the generalized version of this concern is "How do you get an agent whose reward function was learned via some value learning approach to have super-human performance?" The avatar of this for RLHF is scalable oversight, as you address in your post.

Thanks for the feedback and corrections! You're right, I was definitely confusing IRL, which is one approach to value learning, with the value learning project as a whole. I think you're also right that most of the "Outer alignment concerns" section doesn't really apply to RLHF as it's currently written, or at least it's not immediately clear how it does. Here's another attempt:

RLHF attempts to infer a reward function from human comparisons of task completions. But it's possible that a reward function learned from these stated preferences might not be the "actual" reward function - even if we could perfectly predict the human preference ordering on the training set of task completions, it's hard to guarantee that the learned reward model will generalize to all task completions. We also have to consider that the stated human preferences might be irrational: they could be intransitive or cyclical, for instance. It seems possible to me that a reward model learned from human feedback still has to account for human biases, just as a reward function learned through IRL does.

How's that for a start?

(I should clarify that I'm not an expert. In fact, you might even call me "an amateur who's just learning about this stuff myself"! That said...)

RLHF attempts to infer a reward function from human comparisons of task completions.

I believe that RLHF more broadly refers to learning reward models via supervised learning, not just the special case where the labelled data is pairwise comparisons of task completions. So, for example, I think that RLHF would include e.g. learning a reward model for text summaries based on scalar 1-10 feedback from humans, rather than just pairwise comparisons of summaries.
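As a toy illustration of the scalar-feedback case, here is a sketch that fits a reward model to 1-10 ratings by ordinary least squares. Everything about it is a simplifying assumption on my part: a single hand-picked numeric feature per summary, a linear model, and made-up ratings. The point is only that the supervised-learning step works the same way whether the labels are comparisons or scalar scores.

```python
def fit_scalar_reward_model(features, ratings):
    """Least-squares fit of rating ~ slope * feature + intercept.

    features: one number per summary (a hypothetical hand-picked feature).
    ratings: the human's scalar score for each summary.
    Returns a function mapping a feature value to a predicted reward.
    """
    n = len(features)
    mean_x = sum(features) / n
    mean_y = sum(ratings) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(features, ratings))
    variance = sum((x - mean_x) ** 2 for x in features)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return lambda x: slope * x + intercept


# Made-up data: ratings rise linearly with the feature.
reward_model = fit_scalar_reward_model([0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0])
print(reward_model(4.0))
```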

On the topic of whether human biases present an issue for RLHF, I think it might be somewhat subtle. To tease apart a few different concerns you might have:

  1. What if human preferences aren't representable by a utility function (e.g. because they're intransitive)? This doesn't seem like an essential obstruction to RLHF, since whatever sort of data type human preferences are (e.g. mappings from triples (history of the world state, option 1, option 2) to {0,1}) I would expect them to still be learnable in principle via supervised learning. Of course, the more unconstrained our assumptions on human preferences, the harder it is to learn them (utility functions are easier to learn than mappings like the above), so we might run into practical issues. But I guess I don't strongly expect that to happen -- I feel like human preferences shouldn't be so unconstrained as to sink RLHF.
  2. What if RLHF can never learn our true values because the feedback we give it is biased? In this case, I would expect RLHF to learn "biased human values" which ... I guess I'm okay with? Like if we get an AI which is aligned with human values as revealed by stated preferences instead of the reflective equilibrium of human values that we get after correcting for our biases, I still expect that to keep us safe and buy us time to figure out our true values/build an aligned AI that can figure out our values for us. So if this is the biggest issue with RLHF then I feel like we've averted the worst outcomes.
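To make the first point concrete, here is a small sketch showing that a cyclic set of stated preferences can't be captured by any utility ranking, while a plain pairwise mapping from (option 1, option 2) to {0,1} represents it without trouble. The option names and helper functions are made up for illustration, and I've dropped the "history of the world state" argument to keep it short.

```python
from itertools import permutations


def utility_ranking_exists(options, prefers):
    """True if some strict ranking of options agrees with every pairwise preference.

    prefers is a set of (a, b) pairs meaning "a is preferred to b".
    Brute force over all rankings, so only suitable for tiny examples.
    """
    for ranking in permutations(options):
        position = {option: i for i, option in enumerate(ranking)}  # lower = better
        if all(position[a] < position[b] for a, b in prefers):
            return True
    return False


def pairwise_choice(prefers, option_1, option_2):
    """The less constrained representation: 1 if option_1 is preferred, else 0."""
    return 1 if (option_1, option_2) in prefers else 0


options = ["A", "B", "C"]
transitive = {("A", "B"), ("B", "C"), ("A", "C")}
cyclic = {("A", "B"), ("B", "C"), ("C", "A")}

print(utility_ranking_exists(options, transitive))
print(utility_ranking_exists(options, cyclic))
print(pairwise_choice(cyclic, "C", "A"))
```

The cost of the pairwise table is that it has to be learned entry by entry: knowing A beats B and B beats C tells you nothing about A versus C, whereas under a utility function it would.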

It's possible I've misunderstood the central concern about how RLHF interacts with human irrationality, so feel free to say if there's a consideration I've missed!

What if human preferences aren't representable by a utility function

I'm responding to this specifically, rather than the question of RLHF and 'human irrationality'.

I'm not saying this is the case, but what if 'human preferences' are representable by something more complicated, perhaps an array or vector? Can it learn something like that?

Reinforcement Learning from Human Feedback (RLHF)

If you don't want all of that in the title, something like

RL with Human Feedback (RLHF)

makes it more clear what it's about, and I think a lot of people will know what RL is here. (Although with and from may have different meanings and be different things in this space.)