Overview: we're trying to form inside views about the value of alignment research related to RLHF. Please recommend reading material that expresses relevant arguments.
The Harvard AI Safety Team (HAIST) is the AI safety student group at Harvard. Like the broader AI safety community, HAIST members have a variety of inside views about how useful research on RL from human feedback (RLHF) is. Broadly speaking, we can divide people into RLHF optimists and pessimists:
RLHF optimists, generally speaking, think that doing research in areas like the following is net positive:
- Avoiding reward model overfitting, e.g. by improving reward model sample efficiency or querying the humans on more informative task completions.
- Improving human oversight, e.g. with IDA or debate-esque schemes.
- Making it possible for humans to give richer feedback, e.g. natural language feedback instead of pairwise comparisons of task completions.
- Getting more aligned agents out of RLHF-esque paradigms by modifying RLHF, e.g. tweaks which result in more robust or conservative agents.
RLHF pessimists, on the other hand, tend to think that research like the above is net neutral or negative. Pessimists might have beliefs like:
- RLHF-esque schemes don't fundamentally address outer alignment, e.g. because of Goodhart's Law.
- Even if RLHF can produce agents which are aligned for sufficiently simple tasks, amplification schemes will either fail to produce agents which are capable on more difficult tasks or will result in unaligned agents.
- Improving RLHF has the effect of sweeping misaligned behavior under the rug, making it harder to iterate on alignment failures and causing people to take alignment less seriously.
Of course, many people will have some optimistic views and some pessimistic views, or more complicated mixtures of these views. For example, someone might think that most RLHF research is net negative unless there's a breakthrough in improving human oversight, so that the only net positive research agenda listed above is improving human oversight for now.
An important clarification: many people think that RLHF is a viable approach to outer alignment but that most AI x-risk is due to inner alignment. Let's agree to call such a person an RLHF optimist even if they think that inner alignment agendas are more valuable to work on than RLHF agendas. One possible definition here is that we're calling Alice an RLHF optimist if, given a button which creates a new person who is only able to do RLHF research, Alice would pay money to push the button. (For example, as great as this Ajeya post is, it doesn't seem to imply a position on RLHF optimism/pessimism, only a position that inner alignment is important.)
For a while, HAIST has had an idea floating around of doing "the great RLHF debate," where we all try to converge with each other on the usefulness of RLHF research. This seems useful to do since at some point our members will graduate and actually begin alignment jobs, and they'll need some way of deciding which alignment jobs to take. But if we're going to do the great RLHF debate, it seems important to do it well, i.e. in a way that actually tracks the truth.
One thing that makes having this debate tricky is that -- despite general community disagreement -- there don't seem to be many examples of people writing out their full reasoning about the usefulness of RLHF research. So this post is a call for recommendations for reading materials which speak to the usefulness of RLHF research. (We'd also be quite happy for people to leave comments explaining their views.)
Our current plan for the debate is to spend some time reading material in small groups and trying to precisely articulate the main points of disagreement; then one week later spend some more time debating the points of disagreement we identified and trying to converge with each other. Before and after the debate phase, we'll probably poll people about their views to see whether convergence happened and get a sense of how people's views changed. (We'd also be quite happy to get feedback on this format.)
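As a rough illustration of how the before/after polls could be compared, here is a minimal Python sketch. Everything in it is an assumption made for the example rather than part of our actual plan: the 1-7 response scale, the participant ids, the `convergence_summary` helper, and the choice to measure convergence as a drop in the standard deviation of responses.

```python
from statistics import mean, pstdev


def convergence_summary(before, after):
    """Compare the spread of poll responses before and after the debate.

    `before` and `after` map a (hypothetical) participant id to a score on an
    assumed 1-7 scale, where 1 = strong RLHF pessimist and 7 = strong optimist.
    Only participants who answered both polls are compared.
    """
    shared = sorted(set(before) & set(after))
    if not shared:
        raise ValueError("no participant answered both polls")
    pre = [before[p] for p in shared]
    post = [after[p] for p in shared]
    return {
        "n": len(shared),
        "spread_before": pstdev(pre),   # population standard deviation
        "spread_after": pstdev(post),
        "mean_shift": mean(post) - mean(pre),
        # Assumed criterion: views "converged" if the spread shrank.
        "converged": pstdev(post) < pstdev(pre),
    }


# Made-up responses, purely for illustration.
before = {"a": 2, "b": 6, "c": 4, "d": 7}
after = {"a": 3, "b": 5, "c": 4, "d": 5}
summary = convergence_summary(before, after)
print(summary)
```

A shrinking spread only says that views moved closer together, not that they moved toward the truth, so a number like this would complement rather than replace reading people's written explanations of why they updated.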
Here are reading materials that we're already aware of:
- Comment thread about whether RLHF is bad because it sweeps misalignment under the rug.
- Comment thread about whether Goodhart's Law is a fundamental obstruction to any RLHF-esque outer alignment approach.
- Section of a post making the pessimistic argument regarding both of the above.
- Comment thread discussing concerns about RLHF agents learning to model and manipulate humans, and hashing out disagreements about how to interpret the behavior of current models.
- Post presenting an argument that Goodharting might not be a big deal if value learning (including learning from human feedback) can keep up with capabilities.
- Post arguing that IDA and debate are nonviable amplification strategies.
We're planning on doing the reading phase of the debate next Tuesday (11/8), so materials that we know about by Tuesday are much more useful than materials we're made aware of after Tuesday.