I don't fully understand RLHF, so let me just put my idea out there and someone can tell me how I am confused.

Assume that RLHF can in fact "align" an otherwise dangerous system.  Here's my question:  You train the system, and then you start applying RLHF, right?  Doesn't the dangerous system already exist, then, before the RLHF is applied?  If so, isn't that dangerous?

Anyway I don't know any technical details so I assume I simply don't understand how these things work.

Charlie Steiner · Aug 12, 2023

For the general idea of RLHF, no, the dangerous system never has to exist: the reward model can be learned in tandem with improvements in the AI's capabilities, as in the original paper.
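
A minimal sketch of what "learned in tandem" can look like, for concreteness. Everything in it is invented for illustration: a toy random-walk environment, a scripted `human_prefers_a` oracle in place of a real rater, and a plain REINFORCE update rather than the RL algorithms used in practice.

```python
# Minimal sketch of learning the reward model in tandem with the policy.
# All names here (environment, oracle, update rule) are illustrative stand-ins.
import torch
import torch.nn as nn

torch.manual_seed(0)
OBS_DIM, N_ACTIONS, HORIZON = 4, 2, 8

policy = nn.Sequential(nn.Linear(OBS_DIM, 32), nn.Tanh(), nn.Linear(32, N_ACTIONS))
reward_model = nn.Sequential(nn.Linear(OBS_DIM, 32), nn.Tanh(), nn.Linear(32, 1))
policy_opt = torch.optim.Adam(policy.parameters(), lr=1e-2)
rm_opt = torch.optim.Adam(reward_model.parameters(), lr=1e-2)

def rollout():
    """Collect one short trajectory with the current policy."""
    obs, logps = [], []
    x = torch.randn(OBS_DIM)
    for _ in range(HORIZON):
        dist = torch.distributions.Categorical(logits=policy(x))
        action = dist.sample()
        logps.append(dist.log_prob(action))
        obs.append(x)
        x = x + 0.1 * (action.float() - 0.5)  # toy dynamics
    return torch.stack(obs), torch.stack(logps)

def human_prefers_a(obs_a, obs_b):
    """Scripted stand-in for a human rater: prefers trajectories that stay near zero."""
    return bool(obs_a.pow(2).sum() < obs_b.pow(2).sum())

for _ in range(200):
    # 1. Sample a pair of trajectories from the *current* (partially trained) policy.
    obs_a, _ = rollout()
    obs_b, _ = rollout()

    # 2. Update the reward model on the comparison (Bradley-Terry loss), so it
    #    improves alongside the policy instead of being fit to a finished system.
    r_a, r_b = reward_model(obs_a).sum(), reward_model(obs_b).sum()
    target = torch.tensor([0 if human_prefers_a(obs_a, obs_b) else 1])
    rm_loss = nn.functional.cross_entropy(torch.stack([r_a, r_b]).unsqueeze(0), target)
    rm_opt.zero_grad(); rm_loss.backward(); rm_opt.step()

    # 3. Update the policy against the *learned* reward (REINFORCE-style).
    obs, logps = rollout()
    with torch.no_grad():
        ret = reward_model(obs).sum()
    policy_loss = -(logps.sum() * ret)
    policy_opt.zero_grad(); policy_loss.backward(); policy_opt.step()
```

At no point does a fully capable system get trained before the reward model sees any human judgment; the two grow together.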

But for RLHF on large language models, like what OpenAI is doing a lot of, yes, there will be an un-RLHFed system that's very smart. That system wouldn't necessarily be dangerous, though: in the case of an LLM, the base model would probably just be very good at predicting the next word. RLHF fine-tuning would make it "safer" in the sense of making it produce text positively rated by humans, but also "more dangerous" in the sense of making it generate text that's better at achieving objectives in the real world.
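
To make that ordering concrete: pretraining runs to completion first, so the full un-RLHFed base model exists before any RLHF step is taken, and fine-tuning then reshapes what it already knows. Below is a toy sketch of the two stages, assuming a tiny invented language model, random tokens standing in for a corpus, and a `scripted_reward` function standing in for a reward model trained on human ratings.

```python
# Toy sketch of the two-stage LLM pipeline: pretrain a base model first,
# then start RLHF-style fine-tuning on a copy of it. The model, the random
# "corpus", and scripted_reward() are all invented stand-ins.
import copy
import torch
import torch.nn as nn

VOCAB = 64

class TinyLM(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, 32)
        self.rnn = nn.GRU(32, 64, batch_first=True)
        self.head = nn.Linear(64, VOCAB)
    def forward(self, tokens):
        hidden, _ = self.rnn(self.embed(tokens))
        return self.head(hidden)  # next-token logits

base_model = TinyLM()

# Stage 1: pretraining. Once this loop finishes, the full "un-RLHFed" model
# exists; it is just a next-token predictor with no notion of human approval.
pretrain_opt = torch.optim.Adam(base_model.parameters(), lr=1e-3)
for _ in range(100):
    tokens = torch.randint(0, VOCAB, (8, 16))  # stand-in for a corpus batch
    logits = base_model(tokens[:, :-1])
    loss = nn.functional.cross_entropy(logits.reshape(-1, VOCAB), tokens[:, 1:].reshape(-1))
    pretrain_opt.zero_grad(); loss.backward(); pretrain_opt.step()

# Stage 2: RLHF-style fine-tuning starts only now, on top of the finished base model.
tuned_model = copy.deepcopy(base_model)
rlhf_opt = torch.optim.Adam(tuned_model.parameters(), lr=1e-4)

def scripted_reward(sample_tokens):
    """Stand-in for a reward model trained on human ratings of the text."""
    return (sample_tokens < VOCAB // 2).float().mean()

for _ in range(100):
    prompt = torch.randint(0, VOCAB, (1, 4))
    tokens, logps = prompt, []
    for _ in range(12):  # sample a continuation token by token
        dist = torch.distributions.Categorical(logits=tuned_model(tokens)[:, -1])
        nxt = dist.sample()
        logps.append(dist.log_prob(nxt))
        tokens = torch.cat([tokens, nxt.unsqueeze(1)], dim=1)
    reward = scripted_reward(tokens[:, prompt.shape[1]:])
    loss = -(torch.stack(logps).sum() * reward)  # REINFORCE-style update
    rlhf_opt.zero_grad(); loss.backward(); rlhf_opt.step()
```

The question's worry applies to the object produced at the end of stage 1; the answer's point is that this object is "smart" only in the narrow sense of predicting text well.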

2 comments

This article has a section on "Challenges of Reinforcement Learning from Human Feedback".

I don't understand it either, but going by Wikipedia, it seems that the dangers are the following:

  • the human feedback will be inconsistent, especially when it comes to things like ethics (different humans will give different answers to the trolley problem, so what lesson is the AI supposed to learn here? probably something random, unrelated to what any of those humans thought; see the small sketch at the end of this comment);
  • humans can make mistakes, like failing to notice something important in a complicated or misleading situation (in which case the lack of negative feedback would be interpreted as "this is ok");

plus there is the practical problem of deciding how much feedback is enough to consider the AI sufficiently aligned.
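
A tiny, invented sketch of that first point: if raters contradict each other on the same comparison, a Bradley-Terry-style preference model just averages the disagreement away and ends up indifferent, rather than learning what either rater actually valued.

```python
# Invented sketch of the "inconsistent feedback" failure: two raters who always
# disagree about outputs A and B, fit with a Bradley-Terry-style preference model.
import torch

rewards = torch.zeros(2, requires_grad=True)  # learnable rewards for outputs A and B
opt = torch.optim.SGD([rewards], lr=0.5)

labels = torch.tensor([0, 1] * 50)  # rater 1 always picks A (0), rater 2 always picks B (1)

for label in labels:
    loss = torch.nn.functional.cross_entropy(rewards.unsqueeze(0), label.unsqueeze(0))
    opt.zero_grad(); loss.backward(); opt.step()

print(torch.softmax(rewards, dim=0))  # ~[0.5, 0.5]: the learned preference is indifferent
```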