Mysteries of mode collapse

The OpenAI style of RLHF progresses in 3 stages: 

  1. Fine-tuned (SFT) model: fine-tune GPT-3 with supervised learning on demonstrations.
  2. Reward model: make a copy of the fine-tuned model, remove the last layer, add a linear layer, and train this as a reward model on human preference comparisons.
  3. Policy: make another copy of the fine-tuned model and train it with RL against the reward model.

Steps 2 and 3 can be alternated, and the policy from 3. is what is used as the final InstructGPT (rough sketch of the pipeline below).
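To fix ideas, here is a minimal toy sketch of the three stages in PyTorch. Everything in it (the tiny stand-in model, the fake data, the hyperparameters) is made up for illustration; it is not OpenAI's actual setup, just the shape of the pipeline described above:

```python
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB, HIDDEN = 100, 32

class TinyLM(nn.Module):
    """Stand-in for GPT-3: embeds tokens and predicts the next token."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, HIDDEN)
        self.body = nn.GRU(HIDDEN, HIDDEN, batch_first=True)
        self.lm_head = nn.Linear(HIDDEN, VOCAB)  # the "last layer"

    def hidden(self, tokens):                    # (B, T) -> (B, T, HIDDEN)
        out, _ = self.body(self.embed(tokens))
        return out

    def forward(self, tokens):                   # (B, T) -> (B, T, VOCAB)
        return self.lm_head(self.hidden(tokens))

# --- Stage 1: supervised fine-tuning on (fake) demonstration data ----------
sft = TinyLM()
demos = torch.randint(0, VOCAB, (8, 16))
sft_opt = torch.optim.Adam(sft.parameters(), lr=1e-3)
logits = sft(demos[:, :-1])
sft_loss = F.cross_entropy(logits.reshape(-1, VOCAB), demos[:, 1:].reshape(-1))
sft_loss.backward(); sft_opt.step()

# --- Stage 2: reward model = copy of SFT model, LM head -> scalar head -----
reward_model = copy.deepcopy(sft)
reward_model.lm_head = nn.Linear(HIDDEN, 1)      # linear layer giving a scalar reward

def reward(tokens):
    return reward_model.lm_head(reward_model.hidden(tokens))[:, -1, 0]

chosen, rejected = torch.randint(0, VOCAB, (2, 8, 16))   # fake preference pairs
rm_opt = torch.optim.Adam(reward_model.parameters(), lr=1e-3)
rm_loss = -F.logsigmoid(reward(chosen) - reward(rejected)).mean()
rm_loss.backward(); rm_opt.step()

# --- Stage 3: policy = another copy of the SFT model, trained with RL ------
policy = copy.deepcopy(sft)
# A PPO loop maximizing `reward(...)` with a KL penalty toward `sft` would go here.
```

The point of the head swap in stage 2 is just that the reward model starts from the same representation as the fine-tuned model but outputs a scalar score instead of next-token logits.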

My question: which of these stages contributes "the most" to the observed mode collapse / collapsed behavior?

  • If I skip 1. and just use GPT-3 without fine-tuning, how does this impact mode collapse?
  • What if I skip 2. and 3. and just use 1.?
  • What if I don't alternate 2. and 3.? (Maybe the more you alternate, the more you exacerbate the collapse?) A rough grid of these variants is sketched after this list.
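To make the ablations concrete, here is a hypothetical grid of variants built on the toy pipeline above; every flag name is made up for illustration, not from OpenAI's codebase:

```python
# Hypothetical ablation grid for the questions above.
ablations = {
    "no_sft":        {"run_sft": False, "run_rm_and_rl": True,  "alternations": 1},  # skip 1.: RLHF on raw GPT-3
    "sft_only":      {"run_sft": True,  "run_rm_and_rl": False, "alternations": 0},  # skip 2. and 3.
    "single_pass":   {"run_sft": True,  "run_rm_and_rl": True,  "alternations": 1},  # no alternation
    "alternated_x3": {"run_sft": True,  "run_rm_and_rl": True,  "alternations": 3},  # repeat 2. and 3.
}

for name, cfg in ablations.items():
    # In a real experiment: train each variant, then measure mode collapse
    # (e.g. entropy of samples at a fixed temperature) on a common prompt set.
    print(name, cfg)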

The fine-tuned model from 1. also enters the loss function for 3. I didn't find the performance difference from this to be super convincing; I wonder if it also comes with a cost...
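For reference, my understanding of the stage-3 objective from the InstructGPT paper is roughly:

$$
\text{objective}(\phi) \;=\; \mathbb{E}_{(x,y)\sim \pi^{\mathrm{RL}}_{\phi}}\!\left[\, r_{\theta}(x,y) \;-\; \beta \,\log \frac{\pi^{\mathrm{RL}}_{\phi}(y \mid x)}{\pi^{\mathrm{SFT}}(y \mid x)} \,\right] \;+\; \gamma\; \mathbb{E}_{x \sim D_{\mathrm{pretrain}}}\!\left[\, \log \pi^{\mathrm{RL}}_{\phi}(x) \,\right]
$$

The KL term (scaled by β) is where the fine-tuned model from 1. enters directly; the γ term mixes in gradients from the pretraining distribution. Dropping either term would be another natural ablation for the cost mentioned above.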
