Compendium of problems with RLHF

[-][anonymous]3y16-3

I do AI policy, and it is extremely rare for me to see something as valuable as this. I don't think you're aware how incredibly valuable your gifs are for describing problems with RLHF. The human brain is structured to glaze over, ignore, or forget text, while considering visual evidence with the full force of its reasoning faculties. Visual information is vastly more intuitive and digestible than written language; it is like "getting your foot in the door" to someone's limited daily allowance of attention and higher reasoning. Although it requires some buy-in to figure out ways to correctly and honestly describe the problems with gifs, the communication value (to other ML researchers) is massive, and the ratio of honest, successful communication to effort is extremely high. Once the first gifs are made, using gifs to describe the problems with RLHF scales really well with more time, effort, and cognition spent on making them accurate.

Most people don't notice or seriously consider the problem with RLHF because they don't feel like digesting text criticizing RLHF, and more gifs will give experts a fair chance to have accurate thoughts about the problems with RLHF. I'm not an expert and I don't know how difficult it is to actually make gifs that can describe the complex problems, but I do know that if the bare minimum is managed at making such gifs, the paper has a much higher chance of causing a paradigm shift among ML researchers than the average ML researcher might think. Even if describing most of the RLHF problems with gifs is impossible or systematically fails, it will still have its effect amplified by allowing ML researchers to begin communicating problems to tech executives and policymakers who manage their projects and funding.

I hope that this isn't your final compendium on RLHF (I'm bookmarking it either way) and that you spend at least a little bit of time evaluating whether it's possible to describe problems with RLHF to ML researchers using gifs. This is a gold mine, and it never occurred to me that it could be done until I saw that gif. If you can't figure out a way to do it yourself, I recommend asking around about funding, such as the EA future fund, and asking about funds and grantmaking at rationalist events. People will probably be able to connect you to large sources of funding, teams of artists and animators to handle the grunt work, or even other ML researchers who know a lot about how to accurately depict RLHF problems visually (think rob bensinger or eliezer yudkowsky).

[-]Charbel-Raphaël3y98

Thank you! Yes, for most of these issues, it's possible to create GIFs or at least pictograms. I can see the value this could bring to decision-makers.

However, even though I am quite honored, having written this post doesn't make me the best person to do this kind of work. So, if anyone is inspired to work on this, feel free to send me a private message.

[-]momom23y60

Disclaimer: This comment was written as part of my application process to become an intern supervised by the author of this post.

Potential uses of the post:

This post is an excellent summary, and I think it has great potential for several purposes, in particular being used as part of a sequence on RLHF. It is a good introduction for many reasons:

It’s very useful to have lists like those, easily accessible to serve as reminders or pointers when you discuss with other people.
For aspiring RLHF understanders, it can provide minimum information to quickly prioritize what to learn about.
It can be used to generate ideas of research (“which of these problems could I solve?”) or superficially check that an idea is not promising (“it looks fancy, but actually it does not help against this problem”).
It can be used as a gateway to more in-depth articles. To that end, I would really appreciate it if you put links for each point, or mention that you are not aware of any specific article on the subject.

Meta-level criticism:

If it is taken as an introduction to RLHF risks, you should make clear whether this list is exhaustive (to the best of your knowledge). This will allow readers who are aware it isn’t to easily propose additions.
To facilitate its improvement, you should make explicit calls to the reader to point out where you suspect the post might fail; in particular, there could be a class of readers who are experts in a specific problem with RLHF not listed here, who come only to get a glimpse of related failure modes. They should be encouraged to participate.

As Daniel Kokotajlo and trevor have pointed out, the main value of this post is to provide an easy way to learn more about the problems with RLHF (as opposed to e.g. LOL which tries to be an insightful, comprehensive compilation on its own), thanks to the format and the organization.

The epistemic status of each point is unclear, which I think is a big issue. You give your thoughts after each section, but there is a big lack of systematic evaluation. You should separate for each point:

your opinion,
its severity,
its likelihood,
whether we have empirical evidence, theoretical evidence, or only abstract reasons to expect it to happen.

This has not been done in a systematic fashion, and it could be organized more clearly.
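One way to picture the systematic evaluation suggested above is a small record per problem. The following is only a sketch: the class and field names, the 1 to 5 severity scale, and the `priority` heuristic are all assumptions made for illustration, and the example values are placeholders rather than assessments of any real problem from the post.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class Evidence(Enum):
    """Kinds of support a claimed problem can have (hypothetical categories)."""
    EMPIRICAL = "empirical"
    THEORETICAL = "theoretical"
    ABSTRACT = "abstract reasoning"


@dataclass
class ProblemAssessment:
    """One row of a per-problem evaluation table (illustrative only)."""
    name: str
    author_opinion: str
    severity: int          # assumed scale: 1 (minor) to 5 (catastrophic)
    likelihood: float      # subjective probability in [0, 1]
    evidence: List[Evidence]

    def priority(self) -> float:
        # Hypothetical heuristic: severity weighted by likelihood.
        return self.severity * self.likelihood


# Placeholder entry; the numbers are made up to show the format.
example = ProblemAssessment(
    name="Example problem",
    author_opinion="Placeholder opinion",
    severity=3,
    likelihood=0.5,
    evidence=[Evidence.ABSTRACT],
)
print(example.name, example.priority())
```

Keeping the opinion separate from severity, likelihood, and evidence makes it easier for readers to disagree with one field without rejecting the whole entry.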

More specific criticism:

I am unsatisfied with how 7) is described. It is not a problem on the same level as others, more the destruction of a quality that fortunately seems to happen by default on GPTs. It could use a more in-depth explanation, especially since the linked article is mostly speculation.
I also think 11) belongs to this category of ‘not quite a problem’, because it is not obvious that direct human feedback would be better than learning a model of it.
Maybe an easy way to predict humans noticing misalignment is to have a fully general model of what it means to be misaligned? Unlikely, but it deserves a longer discussion.
9) is another point that requires a longer discussion. Since it seems to be your own work, maybe you could write an article and link to it?
What are the costs of RLHF (money and manpower) and how do they compare to scaling laws? Maybe it’s an issue… but maybe not. Data is needed here.
Talking about the Strawberry Problem is a bit unfair, because RLHF was never meant to solve it. Not only is it unsurprising that RLHF provides little insight into the Strawberry Problem, I also don’t expect that a solution to the Strawberry Problem would relate at all to RLHF. It seems like a different paradigm altogether.
More generally, RLHF is exactly the kind of method that a security mindset warns against. It is an ad hoc method that afaik provides no theoretical guarantee of working at all. The issues with superficial alignment and the inability to generalize alignment in case of a distributional shift are related to that.
Why would we have any reason a priori to expect good behavior from RLHF? In the first section, you give empirical reasons to count RLHF as progress but a discussion of the reasons RLHF was even considered in the first place is noticeably lacking.
To be honest, I am very surprised there is no mention of that. Did OpenAI not disclose how they invented RLHF? Did they randomly imagine the process and it happened to work?

In conclusion, I believe that there is a strong need for this kind of post, but that it could be polished more for the potential purposes proposed above.

[-]Daniel Kokotajlo3y*63

It's gonna take me a while to digest this post, but in the meantime, thank you! This is the sort of content I love to see. (ETA: I strong-upvoted this post)

[-]Daniel Kokotajlo3y42

My updated thoughts are: Still a great post, not as polished as it should be though. That's OK. The important thing is that it compiles a big list of problems and alleged problems for RLHF, with links.

[-]Charbel-Raphaël2y20

Here is the polished version from our team led by Stephen Casper and Xander Davies: Open Problems and Fundamental Limitations of Reinforcement Learning from Human Feedback :)

[-]Charbel-Raphaël3y51

I find it useful to work on a spreadsheet to think about the severity of these problems. Here is my template.

[-]Charlie Steiner3y41

If we had a robot with the same cognitive performance as ChatGPT, it would be easy to fine-tune it to be corrigible.

This is false, and the reason may be a bit subtle. Basically "agency" is not a bare property of programs, it's a property of how programs interact with their environment. ChatGPT is corrigible relative to the environment of the real world, in which it just sits around outputting text. This is easy because it's not really an agent relative to the real world! However, ChatGPT is an agent relative to the text environment - it's trying to steer the text in a preferred direction ^[1].

A robot that literally had the same cognitive performance as ChatGPT would just move the robot body in a way that encoded text, not in a way that had any skill at navigating the real world. But a robot that had cognitive capabilities analogous to ChatGPT's, except suited for navigating the real world, would be able to navigate the real world quite well. It would also have corrigibility problems that were never present in ChatGPT, because ChatGPT was never trying to navigate the real world.

^[1]
AFAIK this is only precisely true for the KL-penalty regularized version of RLHF, where you can think of the finetuned model as trying to strategically spend its limited ability to update the base transition function, in order to steer the trajectory to higher reward. For early stopping regularized RLHF you probably get something mathematically messier.
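To make the KL-penalty picture concrete, here is a toy sketch. It assumes a single-step setting with a handful of discrete outputs, which is much simpler than the trajectory-level case the footnote describes. In that setting the policy maximizing E_pi[r] - beta * KL(pi || base) has the known closed form pi(y) ∝ base(y) * exp(r(y) / beta). The function names and the numbers are illustrative, not taken from any RLHF codebase.

```python
import math


def kl_regularized_policy(base_probs, rewards, beta):
    """Maximizer of E_pi[r] - beta * KL(pi || base) over discrete outputs.

    Closed form: pi(y) is proportional to base(y) * exp(r(y) / beta).
    """
    max_r = max(rewards)  # subtract the max reward for numerical stability
    weights = [p * math.exp((r - max_r) / beta)
               for p, r in zip(base_probs, rewards)]
    total = sum(weights)
    return [w / total for w in weights]


def kl_divergence(p, q):
    """KL(p || q) for discrete distributions on the same support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Toy base model and reward; the values are made up for illustration.
base = [0.7, 0.2, 0.1]
rewards = [0.0, 1.0, 2.0]

for beta in (10.0, 1.0, 0.1):
    tuned = kl_regularized_policy(base, rewards, beta)
    print(beta, [round(x, 3) for x in tuned],
          round(kl_divergence(tuned, base), 3))
```

A large `beta` keeps the tuned distribution close to the base one, while a small `beta` lets it spend more KL "budget" to concentrate on high-reward outputs, which is the sense of "strategically spending a limited ability to update" in this simplified setting.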

[-]Charbel-Raphaël3y32

Thanks, I overlooked this and it makes sense to me. However, I'm not as certain about your last sentence:

"and would also have corrigibility problems that were never present in ChatGPT because ChatGPT was never trying to navigate the real world."

I agree with the idea of "steering the trajectory," and this is a possibility we must consider. However, if we train the robot to use the "Shut Down" token when it hears "Hi RobotGPT, please shut down," I don't see why it wouldn't work.

It seems to me that we're comparing a second-order effect with a first-order effect.

[-]TurnTrout3y30

The model has been shaped to maximize its reward by any means necessary^[2], even if it means suddenly delivering an invitation to a wedding party. This is weak evidence towards the "playing the training game" scenario.

This conclusion seems unwarranted. What we have observed is (Paul claiming the existence of) an optimized model which ~always brings up weddings. On what basis does one infer that "the model has been shaped to maximize its reward by any means necessary"? This is likewise not weak evidence for playing the training game.

[-]jefftk3y30

The "Davidad suggests" link is broken, and maybe should go to this comment?

[-]Charbel-Raphaël3y30

Fixed, thanks!

[-]Yitz3y30

This was really helpful, thanks for the post!

[-]Closed Limelike Curves2y10

Using RL(AI)F may offer a solution to all the points in this section: By starting with a set of established principles, AI can generate and revise a large number of prompts, selecting the best answers through a chain-of-thought process that adheres to these principles. Then, a reward model can be trained and the process can continue as in RLHF. This approach is potentially better than RLHF as it does not require human feedback.

I'd like to say that I fervently disagree with this. Giving an unaligned AI the opportunity to modify its own weights (by categorizing its own responses to questions), then politely asking it to align itself, is quite possibly the worst alignment plan I've ever heard; it's penny-wise, pound-foolish. (Assuming it even is penny-wise; I can think of several ways to generate a self-consistent AI that would cost less.)

[-][anonymous]3y10

I found this quite helpful, even if some points could use a more thorough explanation.

[-][anonymous]3y10

the public was not happy with the fact that the AI kept repeating "I am an AI developed by OpenAI", which pushed OpenAI to release the January 9 version that is again much more hackable than the December 15 patch version (benchmark coming soon).

Wow, that sounds bad. Do you have any source for this?

^{^}

I tend to agree with Katja Grace that values are not fragile in the sense imagined in the sequences [link].

^{^}

Modulo Models Don't "Get Reward".

^{^}

It's actually not that expensive, I'm willing to buy an aligned AI for a lot more than that. But it gives a lower bound on the order of magnitude of the alignment fee for RLHF.

^{^}

This does not seem to be a problem per se if your model of the human giving feedback is robust. But your model has to be robust. Also keep in mind that even pure human feedback is likely to lead to AI takeover.

^{^}

As a first approximation, I suspect we can consider that only the upper layers of the model have been refined, the lower intermediate layers having not been modified. In Sparrow, only the upper 16 layers have been fine-tuned.
