Ran W — LessWrong

Why do we need RLHF? Imitation, Inverse RL, and the role of reward

This is a cross-post from my Substack, shared here to get feedback from the alignment community. I have condensed some sections to get the message across. The mainstream rarely questions why we need RLHF in the first place. If we can do supervised fine-tuning (SFT) sufficiently well, then can...

Feb 3, 2024•16