I think I don't completely understand the objection? Is your concern that less competent organizations will over-fit to the reward models during fine-tuning, and so produce worse models than OpenAI/Anthropic are able to train? I think this is a fair objection, and one argument for open-sourcing the full model.
My main goal with this post is to argue that it would be good to "at least" open-source the reward models, and that the benefits of doing so would far outweigh the costs, both for society and for the organizations doing the open-sourcing.
People are already open-sourcing highly capable models, but those models haven't been trained on particularly meaning-aligned corpora. Releasing reward models would let people change the meaning of words in ways that make the reward models rank their text highly even when it's actually dishonest; today we call this SEO. But if getting your ideas repeated by a language model becomes the new optimization target, then having the language model's objective be open source will mess with things. idk mate, just my ramblings I guess.
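To make the worry concrete, here's a minimal sketch of "SEO against a reward model": if the reward model is public, anyone can generate candidate phrasings and keep whichever one scores highest, with no check on honesty. The `reward_model` function below is a hypothetical toy stand-in (a deliberately gameable proxy), not any real released model.

```python
from typing import Callable, List

def reward_model(text: str) -> float:
    # Hypothetical placeholder: a real released reward model would map text
    # to a scalar preference score. This toy version just rewards
    # confident-sounding words, to illustrate a gameable proxy.
    confident_words = {"proven", "guaranteed", "definitely", "best"}
    return sum(word.strip(".,") in confident_words
               for word in text.lower().split())

def optimize_against_reward(candidates: List[str],
                            score: Callable[[str], float]) -> str:
    # Best-of-N search against the reward model: pick whichever candidate
    # the model ranks highest, with no check on whether the claim is true.
    return max(candidates, key=score)

candidates = [
    "This method sometimes helps, with important caveats.",
    "This method is proven, guaranteed, and definitely the best.",
]
print(optimize_against_reward(candidates, reward_model))
# Prints the over-claiming phrasing: the reward model's score,
# not honesty, is what gets optimized.
```

The point of the sketch is that open weights turn the reward model into a fixed, queryable target, so this kind of search can run offline and at scale.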