Recommend HAIST resources for assessing the value of RLHF-related alignment research

Sam Marks; Xander Davies

Overview: we're trying to form inside views about the value of alignment research related to RLHF. Recommend us reading material expressing relevant arguments.

The Harvard AI Safety Team (HAIST) is the AI safety student group at Harvard. Like the broader AI safety community, HAIST members have a variety of inside views about how useful research on RL from human feedback (RLHF) is. Broadly speaking, we can divide people into RLHF optimists and pessimists:

RLHF optimists, generally speaking, think that doing research in areas like the following is net positive:

Avoiding reward model overfitting, e.g. by improving reward model sample efficiency or querying the humans on more informative task completions.
Improving human oversight, e.g. with IDA or debate-esque schemes.
Making it possible for humans to give richer feedback, e.g. natural language feedback instead of pairwise comparisons of task completions.
Getting more aligned agents out of RLHF-esque paradigms by modifying RLHF, e.g. tweaks which result in more robust or conservative agents.

RLHF pessimists, on the other hand, tend to think that research like the above is net neutral or negative. Pessimists might have beliefs like:

RLHF-esque schemes don't fundamentally address outer alignment, e.g. because of Goodhart's Law.
Even if RLHF can produce agents which are aligned for sufficiently simple tasks, amplification schemes will either fail to produce agents which are capable on more difficult tasks or will result in unaligned agents.
Improving RLHF has the effect of sweeping misaligned behavior under the rug, making it harder to iterate on alignment failures and causing people to take alignment less seriously.

Of course, many people will have some optimistic views and some pessimistic views, or more complicated mixtures of these views. For example, someone might think that most RLHF research is net negative unless there's a breakthrough in improving human oversight, so that the only net positive research agenda listed above is improving human oversight for now.

An important clarification: many people think that RLHF is a viable approach to outer alignment but that most AI x-risk is due to inner alignment. Let's agree to call such a person an RLHF optimist even if they think that inner alignment agendas are more valuable to work on than RLHF agendas. One possible definition here is that we're calling Alice an RLHF optimist if, given a button which creates a new person who is only able to do RLHF research, Alice would pay money to push the button. (For example, as great as this Ajeya post is, it doesn't seem to imply a position on RLHF optimism/pessimism, only a position that inner alignment is important.)

For a while, HAIST has had an idea floating around of doing "the great RLHF debate," where we all try to converge with each other on the usefulness of RLHF research. This seems useful to do since at some point our members will graduate and actually begin alignment jobs, and they'll need some way of deciding which alignment jobs to take. But if we're going to do the great RLHF debate, it seems important to do it well, i.e. in a way that actually tracks the truth.

One thing that makes having this debate tricky is that -- despite general community disagreement -- there don't seem to be many examples of people writing out their full reasoning about the usefulness of RLHF research. So this post is a call for recommendations for reading materials which speak to the usefulness of RLHF research. (We'd also be quite happy for people to leave comments explaining their views.)

Our current plan for the debate is to spend some time reading material in small groups and trying to precisely articulate the main points of disagreement; then one week later spend some more time debating the points of disagreement we identified and trying to converge with each other. Before and after the debate phase, we'll probably poll people about their views to see whether convergence happened and get a sense of how people's views changed. (We'd also be quite happy to get feedback on this format.)

Here are reading materials that we're already aware of:

Comment thread about whether RLHF is bad because it sweeps misalignment under the rug.
Comment thread about whether Goodhart's Law is a fundamental obstruction to any RLHF-esque outer alignment approach.
Section of a post making the pessimistic argument regarding both of the above.
Comment thread discussing concerns about RLHF agents learning to model and manipulate humans, and hashing out disagreements about how to interpret the behavior of current models.
Post presenting an argument that Goodharting might not be a big deal if value learning (including learning from human feedback) can keep up with capabilities.
Post arguing that IDA and debate are nonviable amplification strategies.

We're planning on doing the reading phase of the debate next Tuesday (11/8), so materials that we know about by Tuesday are much more useful than materials we're made aware of after Tuesday.

I'd be especially excited if this debate produced an adversarial-collaboration-style synthesis document, laying out the various perspectives and cruxes. I think that collapsing onto an optimism/pessimism binary loses a lot of important nuance; but also that HAIST reading, summarizing, and clearly communicating the range of views on RLHF could help people holding each of those views more clearly understand each other's concerns and communicate with each other.

I agree that something like this would be excellent. I unfortunately doubt that anything so cool will come out of this experiment. (The most important constraint is finding a HAIST member willing to take on the project of writing something like this up.)

If things go well, we are tentatively planning on sharing the list of core disagreements we identify (these will probably look like cruxes and subquestions) as well as maybe data about our members' distribution of views before and after the debate.

(if funding would get someone excited to do a great job of this, I'll help make that happen)

Worlds Where Iterative Design Fails, especially the section Why RLHF Is Uniquely Terrible.

I'd add that this bullet in the OP states part of the problem, but misses what I'd consider a central part:

Improving RLHF has the effect of sweeping misaligned behavior under the rug, causing people to take alignment less seriously which e.g. causes large labs to underinvest in safety teams.

It's not just that hiding misbehavior causes people to take alignment less seriously. It's that, insofar as misbehavior is successfully hidden, it cannot be fixed by further iteration. We cannot iterate on a problem we cannot see.

Thanks, that's a useful clarification; I'll edit it into the post.

I'd recommend the “AI safety via conditioning predictive models” doc my coauthors and I are working on right now—it's not quite ready to be published publicly yet, but we have a full draft that we're looking for comments on right now. Messaged to both of you privately; feel free to share with other HAIST members.

Could I get a link to this as well?

Would also love to have a look.

This recent comment thread discussing whether RLHF makes any progress beyond the classical "reward the agent when humans press the reward button" idea.