I think I don't completely understand the objection? Is your concern that organizations who are less competent will over-fit to the reward models during fine-tuning, and so give worse models than OpenAI/Anthropic are able to train? I think this is a fair objection, and one argument for open-sourcing the full model.

My main goal with this post is to advocate for it being good to "at least" open source the reward models, and that the benefits of doing this would far outweigh the costs, both societally and for the organizations doing the open-sourcing. I tend to think that completely unaligned models will get more backlash than imperfectly aligned ones, but maybe this is incorrect. I haven't thought deeply about whether it is safe/good for everyone to open source the underlying capability model.

Reply