Maybe the reward models are expressive enough to capture all patterns in human preferences, but it seems nice to get rid of this assumption if we can. Scaling laws suggest that larger models perform better (in the Gao et al. paper there is a gap between the 3B and 6B reward models), so it seems reasonable that even the current largest reward models are not optimal.
I guess it hasn't been tested whether DPO scales better than RLHF. I don't have enough experience with these techniques to have a view on whether it does.
The Direct Preference Optimization (DPO) paper promises a simpler and more efficient alternative to PPO that avoids the reward modeling phase and instead optimizes directly for the preferences expressed in the preference data. This is achieved by the loss function:
$$\mathcal{L}_{\text{DPO}}(\pi_\theta; \pi_{\text{ref}}) = -\mathbb{E}_{(x, y_w, y_l)\sim \mathcal{D}}\left[\log \sigma\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)}\right)\right]$$
Where:

- $\pi_\theta$ is the policy being optimized and $\pi_{\text{ref}}$ is the frozen reference (typically the SFT) model,
- $(x, y_w, y_l) \sim \mathcal{D}$ is a prompt with a preferred completion $y_w$ and a dispreferred completion $y_l$ drawn from the preference dataset,
- $\sigma$ is the logistic (sigmoid) function, and
- $\beta$ is a temperature parameter controlling how far the policy may deviate from the reference model.
In essence, DPO computes the log probabilities of the preferred and dispreferred completions under both the current policy and the reference model, and updates the parameters to increase the relative likelihood of the preferred completions and decrease that of the dispreferred ones.
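To make the loss concrete, here is a minimal PyTorch sketch (my own, not the authors' reference implementation); the argument names and the `beta=0.1` default are illustrative assumptions, and each input is assumed to be the summed token log-probabilities of a completion for a batch of preference pairs:

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_logps_w, policy_logps_l, ref_logps_w, ref_logps_l, beta=0.1):
    """Sketch of the DPO loss.

    `_w` tensors are log-probabilities of the preferred completions,
    `_l` tensors of the dispreferred ones, under the trainable policy
    and the frozen reference model respectively.
    """
    # Log-ratios of policy vs. reference for each completion
    log_ratio_w = policy_logps_w - ref_logps_w
    log_ratio_l = policy_logps_l - ref_logps_l

    # Logit of the implicit Bradley-Terry preference model
    logits = beta * (log_ratio_w - log_ratio_l)

    # -log sigmoid(logits), averaged over the batch
    return -F.logsigmoid(logits).mean()
```

Minimizing this pushes the policy's log-ratio up on preferred completions relative to dispreferred ones, while the reference log-probabilities and $\beta$ keep it anchored to the starting model.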
The authors share the following results:
Emphasis mine.
I have created a prediction market forecasting whether DPO will be adopted by a Frontier Lab before Jan 1, 2025. These results should be reproduced; I might attempt this on the EleutherAI Discord, in which case I will update this section of the post. Contact me if this is of interest to you.