I don't fully understand RLHF, so let me just put my idea out there and someone can tell me how I am confused.

Assume that RLHF can in fact "align" an otherwise dangerous system.  Here's my question:  You train the system, and then you start applying RLHF, right?  Doesn't the dangerous system already exist, then, before the RLHF is applied?  If so, isn't that dangerous?

Anyway I don't know any technical details so I assume I simply don't understand how these things work.

For the general idea of RLHF, no: the dangerous system never has to exist. The reward model can be learned in tandem with improvements in the AI's capabilities, as in the original paper.

But for RLHF on large language models, like what OpenAI is doing a lot of, yes, there will be an un-RLHFed system that's very smart. That system wouldn't necessarily be dangerous, though. In the case of an LLM, the base model would probably just be very good at predicting the next word. RLHF fine-tuning then makes it "safer" in the sense of producing text that humans rate positively, but also "more dangerous" in the sense of producing text that is better at achieving objectives in the real world.
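To make the two-stage picture concrete, here is a toy sketch of the RLHF loop in plain Python. The "policy" is just a distribution over a few canned responses, the reward model is fit from pairwise human preferences with a Bradley-Terry-style logistic update (the same family real reward models use), and the "fine-tuning" step is a softmax over the learned rewards standing in for PPO. All names and the tiny setup are illustrative assumptions, not any real library's API.

```python
import math

# Candidate outputs from a (hypothetical) pretrained base model.
responses = ["helpful answer", "rude answer", "off-topic answer"]

# Step 1: human feedback, recorded as pairwise preferences (winner, loser).
preferences = [(0, 1), (0, 2), (2, 1), (0, 1)]

# Step 2: fit a scalar reward per response by gradient ascent on the
# Bradley-Terry log-likelihood of the observed preferences.
reward = [0.0, 0.0, 0.0]
lr = 0.5
for _ in range(200):
    for winner, loser in preferences:
        # probability the reward model assigns to the observed preference
        p = 1.0 / (1.0 + math.exp(reward[loser] - reward[winner]))
        grad = 1.0 - p
        reward[winner] += lr * grad
        reward[loser] -= lr * grad

# Step 3: shift the policy toward high-reward responses
# (a softmax over learned rewards stands in for PPO fine-tuning).
z = sum(math.exp(r) for r in reward)
policy = [math.exp(r) / z for r in reward]
```

Note that step 2 only starts after the base model exists, which is exactly the questioner's point: in this pipeline the un-RLHFed model is there first, and RLHF reshapes which of its outputs get probability mass.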

Thanks, very helpful and illuminating.


The Wikipedia article on RLHF has a section on "Challenges of Reinforcement Learning from Human Feedback".

I don't understand it either, but going by Wikipedia, it seems that the dangers are the following:

  • the human feedback will be inconsistent, especially on topics like ethics (different humans give different answers to the trolley problem; what lesson is the AI supposed to learn from that? probably something random, unrelated to what any of those humans thought);
  • humans can make mistakes, such as failing to notice something important in a complicated or misleading situation (in which case the lack of negative feedback would be interpreted as "this is ok");

plus there is the practical problem of deciding how much feedback is enough to conclude that the AI is now sufficiently aligned.
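The first bullet can be illustrated with a toy experiment: fit the same kind of Bradley-Terry reward model, but on contradictory human feedback. The setup and names are illustrative assumptions, not a real API.

```python
import math

# Two responses; the labelers disagree, preferring each over the other
# equally often (think: opposing answers to the trolley problem).
inconsistent = [(0, 1), (1, 0), (0, 1), (1, 0)]

reward = [0.0, 0.0]
lr = 0.5
for _ in range(200):
    for winner, loser in inconsistent:
        # logistic update toward whichever preference was just observed
        p = 1.0 / (1.0 + math.exp(reward[loser] - reward[winner]))
        grad = 1.0 - p
        reward[winner] += lr * grad
        reward[loser] -= lr * grad

# The learned rewards never separate: the updates keep canceling out,
# so whatever small gap remains reflects update order, not a lesson
# that any of the labelers intended to teach.
```

With the consistent preference data from the earlier sketch, the same update cleanly orders the responses; here it just oscillates around a tie, which is one concrete sense in which inconsistent feedback teaches "something random".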