The OpenAI style of RLHF progresses in 3 stages:

1. Supervised fine-tuning (SFT) of the pretrained model on human demonstrations.
2. Training a reward model on human comparisons of model outputs.
3. Optimizing the policy against the reward model with RL (PPO).

Stages 2. and 3. can be alternated, and the policy from 3. is what is used as the final InstructGPT model.

My question: I wonder which of these stages contributes "the most" to the observed mode collapse/behavior.
- If I skip 1. and just use GPT-3 without fine-tuning, how do ...
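To make the question concrete, here is a deliberately toy sketch of the three-stage pipeline described above, with stages 2. and 3. alternated. All names and the probability-table "model" are my own illustrative stand-ins (the real pipeline uses large transformers and PPO, not tables and reweighting); the point is only to show where each stage sits so one could ablate stage 1.

```python
# Toy sketch of the 3-stage InstructGPT-style RLHF pipeline.
# Illustrative only: a "policy" here is a probability table over canned
# responses, not a language model; "PPO" is a crude reweighting stand-in.

PROMPTS = ["greet"]
RESPONSES = ["hi", "hello there", "go away"]

def pretrain():
    # Stand-in for the GPT-3 base model: uniform over responses.
    return {p: {r: 1.0 / len(RESPONSES) for r in RESPONSES} for p in PROMPTS}

def sft(policy, demos):
    # Stage 1: supervised fine-tuning on human demonstrations --
    # shift probability mass toward the demonstrated response.
    for prompt, response in demos:
        for r in policy[prompt]:
            policy[prompt][r] = 0.7 if r == response else 0.3 / (len(RESPONSES) - 1)
    return policy

def train_reward_model(comparisons):
    # Stage 2: fit a reward model from pairwise human preferences
    # (winner gains reward each time it is preferred, loser loses it).
    rm = {r: 0.0 for r in RESPONSES}
    for winner, loser in comparisons:
        rm[winner] += 1.0
        rm[loser] -= 1.0
    return rm

def rl_step(policy, rm, lr=0.5):
    # Stage 3: optimize the policy against the reward model
    # (crude stand-in for PPO): upweight high-reward responses.
    for prompt in policy:
        scores = {r: policy[prompt][r] * (1.0 + lr * rm[r]) for r in policy[prompt]}
        floor = 1e-6  # keep every response at nonzero mass, then renormalize
        scores = {r: max(s, floor) for r, s in scores.items()}
        z = sum(scores.values())
        policy[prompt] = {r: s / z for r, s in scores.items()}
    return policy

# The ablation in the question would replace the sft(...) call with the
# raw pretrained policy and run stages 2./3. directly on it.
policy = pretrain()
policy = sft(policy, demos=[("greet", "hello there")])
for _ in range(3):  # stages 2. and 3. alternated, as noted above
    rm = train_reward_model([("hello there", "go away"), ("hi", "go away")])
    policy = rl_step(policy, rm)

best = max(policy["greet"], key=policy["greet"].get)
print(best)  # -> hello there
```

Even in this toy, the "mode collapse" flavor of the question is visible: repeated RL steps concentrate probability mass on the reward model's favorite, and skipping stage 1. changes only the starting distribution that RL then sharpens.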