[Preprint] Pretraining Language Models with Human Preferences

by Giulio
21st Feb 2023
1 min read

This is a linkpost for https://arxiv.org/abs/2302.08582

Surprised no one has posted about this work from Anthropic, NYU, and the University of Sussex yet:

  • Instead of fine-tuning on human preferences, they incorporate human feedback directly in the pre-training phase, conditioning the model on <good> or <bad> feedback tokens placed at the beginning of the training sequences (see the sketch after this list). 
  • They find this to be Pareto-optimal among the five pre-training objectives considered: it greatly reduces the rate of undesired outputs while retaining standard LM pre-training downstream performance, AND it outperforms RLHF fine-tuning in terms of preference satisfaction.

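A minimal sketch of what this looks like in practice (the name score_fn and the 0.5 threshold are illustrative placeholders for however the corpus gets scored, not the paper's exact setup):

```python
# Conditional pretraining sketch: prepend a <good>/<bad> feedback token to
# each training sequence, then train with the ordinary next-token objective.
GOOD, BAD = "<good>", "<bad>"

def annotate(texts, score_fn, threshold=0.5):
    """Prepend a feedback token to each training sequence based on its score."""
    return [
        (GOOD if score_fn(text) >= threshold else BAD) + text
        for text in texts
    ]

# At inference time, condition on desirable behaviour by prefixing the prompt.
def conditioned_prompt(user_prompt: str) -> str:
    return GOOD + user_prompt
```

Note that <good>/<bad> would need to be added to the tokenizer as special tokens so they aren't split into subwords, which connects to the tokenizer point further down.
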
This conditioning is very reminiscent of the Decision Transformer, where scalar return-to-go tokens are prepended to the input. I believe CICERO also does something similar, conditioning on Elo scores during dialogue-generation training.

From a discussion with James Chua on AISS's Slack, we noted similarities between this work and Charlie Steiner's Take 13: RLHF bad, conditioning good. James is developing a library ("conditionme") specifically for rating-conditioned language modelling and was looking for feedback, which is what prompted the discussion. We figured a natural piece of future work is extending the conditioning from the discrete <good> vs <bad> tokens to scalar rewards; James pointed out that this requires some care with the tokenizer, which he hopes to address in part with conditionme. A rough illustration of the issue follows.
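
To give a feel for that tokenizer caveat, here is a sketch (assuming a Hugging Face GPT-2 style BPE tokenizer; the reward-bucketing scheme is just one possible workaround and not necessarily what conditionme does): a scalar reward prepended as raw text fragments into several subword tokens, whereas discrete control tokens stay intact.

```python
from transformers import GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")

# A raw scalar prefix gets split into several subword pieces, so the model
# never sees the reward as a single unit.
print(tok.tokenize("0.87 Some training text"))  # e.g. ['0', '.', '87', ...]

# One possible workaround: bucket rewards into dedicated special tokens.
buckets = [f"<|reward_{i}|>" for i in range(10)]
tok.add_special_tokens({"additional_special_tokens": buckets})

def reward_token(r: float) -> str:
    """Map a reward in [0, 1] to one of ten discrete control tokens."""
    return buckets[min(int(r * 10), 9)]

# The bucketed control token survives tokenization as a single unit.
print(tok.tokenize(reward_token(0.87) + " Some training text"))
```

The model's embedding matrix would also need resizing to cover any new tokens (e.g. resize_token_embeddings in transformers), and how to expose a continuous reward to the model rather than coarse buckets is presumably the kind of design question conditionme aims to handle.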