OpenAI has just released a description of how their models work here.
text-davinci-002 is trained with "FeedME" and text-davinci-003 is trained with RLHF (PPO).
"FeedME" is what they call supervised fine-tuning on human-written demonstrations or model samples rated 7/7 by human labelers. So basically fine-tuning on high-quality data.
I think your findings are still very interesting, because they imply that even further finetuning changes the distribution significantly! Given all this information, one could now actually run a systematic comparison of davinci, text-davinci-002 (finetuning), and text-davinci-003 (RLHF) and see how the distribution changes on various tasks.
Let me know if you want help on this, I'm interested in this myself.
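For what it's worth, the systematic comparison suggested above could start from something as simple as the sketch below. It assumes you have already collected many completions per model for the same prompt; the helper names and the placeholder sample lists are made up for illustration and are not real model outputs. It measures how concentrated each model's output distribution is (entropy) and how far apart two models are (Jensen-Shannon divergence).

```python
import math
from collections import Counter


def empirical_distribution(samples):
    """Map each distinct completion to its relative frequency."""
    counts = Counter(samples)
    total = sum(counts.values())
    return {text: n / total for text, n in counts.items()}


def entropy_bits(dist):
    """Shannon entropy in bits; 0 means every sample was identical."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)


def js_divergence_bits(p, q):
    """Jensen-Shannon divergence in bits: 0 for identical, 1 for disjoint."""
    support = set(p) | set(q)
    mix = {x: 0.5 * (p.get(x, 0.0) + q.get(x, 0.0)) for x in support}

    def kl_to_mix(a):
        return sum(a[x] * math.log2(a[x] / mix[x]) for x in a if a[x] > 0)

    return 0.5 * kl_to_mix(p) + 0.5 * kl_to_mix(q)


# Hypothetical placeholder completions for one prompt (NOT real model outputs).
base_samples = ["17", "42", "3", "88", "42", "61", "7", "95"]
tuned_samples = ["42", "42", "42", "42", "42", "42", "42", "7"]

base_dist = empirical_distribution(base_samples)
tuned_dist = empirical_distribution(tuned_samples)

print("base entropy (bits): ", entropy_bits(base_dist))
print("tuned entropy (bits):", entropy_bits(tuned_dist))
print("JS divergence (bits):", js_divergence_bits(base_dist, tuned_dist))
```

Exact string matching only makes sense for short, constrained answers (e.g. "pick a random number"); for free-form text one would need some coarser bucketing of completions, which this sketch does not attempt.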
In this thread, I asked Jan Leike what kind of model generates the samples that go into the training data if rated 7/7, and he answered "A mix of previously trained models. Probably very few samples from base models if any" (emphasis mine).
I'm curious to know whether/which of the behaviors described in this post appear in the models that generated the samples vs emerge at the supervised finetuning step.
Hypothetically, if a model trained with RLHF generates the samples and that model has the same modes/attractors, it probably makes sense to say that RLHF was responsible for shaping those behaviors and finetuning only "cloned" them.
(Note that it wasn't specified how the previously trained non-base models were actually trained, leaving open the possibilities of RLHF models, models fine-tuned on only human data, earlier iterations of FeedME models, or something entirely different.)
That's interesting!
Yeah, I agree with that assessment. One important difference between RLHF and fine-tuning is that the former basically generates the training distribution it then trains on: the LM generates an output and updates its weights based on the reward for that output. So intuitively I think it is more likely to be optimized towards certain unwanted attractors, since the reward model shapes the future outputs it then learns from.
With fine-tuning you are just cloning a fixed distribution, not influencing it (as you say). So I tend to agree that unwanted attractors could well be due to the outputs of RLHF-trained models. I think we need empirical evidence for this though (to be certain).
Given your statement, I also think that doing those experiments with GPT-3 models is gonna be hard because we basically have no way of telling what data it learned from, how it was generated, etc. So one would need to be more scientific and train various models with various optimization schemes, on known data distributions.
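As a very rough illustration of the "known data distribution" kind of experiment, here is a toy sketch contrasting the two regimes on a 4-outcome softmax policy. Everything in it is an assumption for illustration: the target distribution, the reward values, and the function names are made up, the on-policy update is the exact expected policy gradient rather than PPO, and there is no KL penalty to a reference model (which real RLHF setups typically include). It only shows the qualitative point: cloning a fixed distribution recovers that distribution, while on-policy reward optimization keeps reweighting its own outputs and drifts towards a single mode.

```python
import math


def softmax(logits):
    top = max(logits)
    exps = [math.exp(x - top) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy_bits(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)


def clone_fixed_distribution(target, steps=5000, lr=0.5):
    """Fine-tuning analogue: minimise cross-entropy against a FIXED target."""
    logits = [0.0] * len(target)
    for _ in range(steps):
        probs = softmax(logits)
        # Gradient of cross-entropy w.r.t. the logits is (probs - target).
        logits = [l - lr * (p - t) for l, p, t in zip(logits, probs, target)]
    return softmax(logits)


def optimise_reward_on_policy(rewards, steps=5000, lr=0.5):
    """RL analogue: the expectation is taken under the CURRENT policy."""
    logits = [0.0] * len(rewards)
    for _ in range(steps):
        probs = softmax(logits)
        baseline = sum(p * r for p, r in zip(probs, rewards))
        # Expected policy gradient for logit i: p_i * (r_i - baseline).
        logits = [l + lr * p * (r - baseline)
                  for l, p, r in zip(logits, probs, rewards)]
    return softmax(logits)


# Hypothetical numbers: a spread-out target and a reward with mild preferences.
target = [0.4, 0.3, 0.2, 0.1]
rewards = [1.0, 0.9, 0.8, 0.7]

cloned = clone_fixed_distribution(target)
optimised = optimise_reward_on_policy(rewards)

print("cloned:   ", [round(p, 3) for p in cloned], entropy_bits(cloned))
print("optimised:", [round(p, 3) for p in optimised], entropy_bits(optimised))
```

Note that this says nothing about whether real RLHF-trained models behave this way; it is only the simplest setting in which the "generates its own training distribution" intuition can be checked.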
Very useful update, thanks.
Though I notice they don't say anything about how ada and text-ada-* models were trained.
Thanks for catching this and spreading the word!
Do we know if the following other models from OpenAI use true RLHF or also use this RLHF-like mystery method? (or something else!)
I (and many others) did not realize this before, but:
text-davinci-002 and text-davinci-001, the InstructGPT models on the OpenAI API, were not trained with RLHF (reinforcement learning from human feedback) as described in the InstructGPT paper, but a "similar but slightly different"[1] method that uses the same human feedback data. Apparently, this other method is not technically RLHF.
Since this update has potentially nontrivial implications for interpreting the phenomena exhibited by text-davinci-002 described in Mysteries of mode collapse (formerly titled "Mysteries of mode collapse due to RLHF"), I'm making this separate post for a signal boost.
I have not corrected the original text of "Mysteries of mode collapse due to RLHF", but I've added a section at the beginning with further details on this update, copied here:
Sidenote on OpenAI's blog post, Aligning Language Models to Follow Instructions
the lack of epistemic vigilantes attacking an unsubstantiated assumption in the very title of this post on LessWrong is truly unbelievable!
which seems to confirm my suspicion about outcome-supervision