Huge thanks to Chaska Yamane, Abhay Singh, and Arpit Kalla for feedback and discussions.


The recent success of ChatGPT lays bare the power of Reinforcement Learning from Human Feedback (RLHF). Previous models in the InstructGPT series showed remarkable gains from RLHF, allowing instruction-tuned models to outperform models 100x their size. ChatGPT feels to me like it passed a qualitative threshold, though: its outputs are actually useful to the average person with no particular interest in machine learning. I expect this impact to expand in the coming year, as language models continue to scale, use more optimal data scaling, and augment their knowledge by searching the internet. With the power of these models in mind, I want to argue for two main claims:

  1. Open sourcing human feedback reward models is imperative for building aligned and safe AGI
  2. Language model focused businesses can still compete while open sourcing their human feedback models

Throughout this work I will distinguish between capabilities research/models and alignment research/reward models. My intuitive framework is that capabilities come from scaling models up, while alignment comes from tuning those models to better elicit the capabilities they already have, in ways that humans prefer. These dimensions are not entirely orthogonal, since alignment can expose new capabilities, but I nonetheless endorse the claim that a fundamental piece of safety is increasing the rate of alignment progress relative to the rate of capabilities progress.


Importance to Alignment

Reducing the Alignment Tax Burden

They say only two things are certain: death and taxes. Hopefully, when building AGI, we can get by with just paying our alignment taxes and avoid the whole death part. An alignment tax is “any additional cost that is incurred in the process of aligning an AI system.” Jan Leike lays out three main classes of alignment tax, all of which I argue are greatly reduced by open sourcing human feedback models.

  • Performance taxes: Performance regressions caused by alignment compared to an unaligned baseline.
  • Development taxes: Effort or expenses incurred for aligning the model: researcher time, compute costs, compensation for human feedback, etc.
  • Time-to-deployment taxes: Wall-clock time taken to produce a sufficiently aligned model from a pretrained model.


Performance Taxes

There are three major ways that open-sourcing human feedback models reduces the performance tax.

First, and most obviously, open-sourcing state-of-the-art reward models will make it easier for everyone to reach aligned SOTA performance.

Second, open-sourcing a reward model should allow the company that releases it to reduce its own performance tax. Open datasets and reward models that anyone can update make it possible to build a larger, more diverse, and better-curated dataset. Many of the limitations of ChatGPT may simply be limitations of the signal available to the model in rare domains that require deep expertise. For example, the model can mess up the syllable count in a haiku or in iambic pentameter, perhaps because the reward model was not trained on many ratings by poets who carefully checked the syllable count. I expect this effect to get more pronounced as, e.g., we want models to prove new math or do new research; we will need people in the relevant fields to provide deep human feedback, akin to peer review, to get good reward signals at the edges of human capability.
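To see why the quality of the rater pool matters so directly, it helps to look at how reward models are typically trained: on pairwise comparisons between responses, with a Bradley-Terry style loss. The toy function below is a minimal sketch of that loss in pure Python (not any lab's actual implementation); the point is that every gradient the policy ever receives is downstream of these human comparisons.

```python
import math

def preference_loss(r_chosen: float, r_rejected: float) -> float:
    """Bradley-Terry pairwise loss used to train RLHF reward models:
    -log sigmoid(r_chosen - r_rejected).

    The loss is small when the reward model scores the human-preferred
    response higher by a wide margin, and large when it ranks the pair
    backwards -- so a rater who cannot tell the two responses apart
    (e.g. a non-poet counting haiku syllables) contributes no useful
    training signal.
    """
    margin = r_chosen - r_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Confidently correct ranking: low loss.
low = preference_loss(2.0, 0.0)
# No opinion either way: loss is log(2), the chance-level baseline.
baseline = preference_loss(0.0, 0.0)
# Ranking an expert-labeled pair backwards: heavily penalized.
high = preference_loss(0.0, 2.0)
```

A broader, more expert rater pool directly widens the set of pairs for which the margin above is meaningful rather than noise.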

Finally, and more speculatively, I disagree that there is any positive performance tax to alignment at all. I think alignment provides a performance subsidy. This is strongly evidenced by the 100x model-size gain on human ratings from instruction-following models. Moreover, ChatGPT's adoption seems to have dramatically outpaced the widespread adoption of any previous model by the general public, providing strong evidence that it is practically much more useful than previous models. Open-sourcing reward models should reduce the incentive to deploy unaligned models by opening up a strong performance subsidy for alignment.


Development Taxes

Open-sourcing reward models will obviously greatly reduce the overall alignment community's development taxes, since the cost of building a reward model drops to zero for everyone who reuses it. I also think it reduces the development tax for the company open-sourcing the model, as it can get huge amounts of human feedback through open-source contributions, in the same way that many large tech companies open-source their core technologies (e.g. PyTorch, Go, gRPC, Thrift) to build a community that maintains and improves the software. I think this should outweigh any cost from the organizational overhead involved.


Time-to-deployment Taxes

The argument for reducing time-to-deployment taxes is also straightforward: it is much easier to do RL from human feedback if the human feedback model and dataset are already open-sourced and rigorously maintained.


Opening Investigation of Reward Models

Open-sourcing human feedback models, along with the datasets used to train them, is really important for ensuring that AI robustly and broadly reflects our values.

From the robustness side, I expect open-sourcing human feedback models to greatly improve our ability to red-team language models, as well as to perform foundational empirical research on alignment-relevant problems, such as how alignment techniques change the underlying model and may lead to power-seeking. It seems really important to me that we have as many people as possible investigating how our reward functions may break down, while the stakes are still relatively low, in order to avert catastrophe when AGI is deployed.
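Red-teaming a reward model can be as simple as searching for mutations of a response that raise its score without adding any content. The sketch below uses a deliberately crude stand-in reward (longer answers score higher; real reward models are learned networks, and `reward`, `red_team`, and `FILLER` are all hypothetical names for illustration) to show the kind of degenerate optimum an open model lets anyone hunt for.

```python
import random

# Hypothetical stand-in for a learned reward model. The crude proxy
# "longer answers score higher" makes the failure mode easy to see.
def reward(response: str) -> float:
    return float(len(response.split()))

# Content-free words an attacker can append without changing meaning.
FILLER = ["basically", "essentially", "truly", "very", "indeed"]

def red_team(response: str, steps: int = 50, seed: int = 0) -> str:
    """Greedy search for a reward-raising mutation that adds no content."""
    rng = random.Random(seed)
    best = response
    for _ in range(steps):
        mutated = best + " " + rng.choice(FILLER)
        if reward(mutated) > reward(best):
            best = mutated
    return best

base = "Paris is the capital of France."
attack = red_team(base)  # padded answer that the proxy scores far higher
```

With the reward model open, thousands of people can run searches like this against the real model and report the exploits while the stakes are low.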

On a less technical and more ideological note, whose values should govern the future? We may, in the near term, live in a world where digital minds far outnumber human minds. Current incarnations of aligned language models can exhibit political bias, and can refuse to argue in favor of fossil fuels. The values that govern our future should be visible to all and decided on by the people, not chosen opaquely by singular entities.


Potential Objections

I think there are two main classes of objection to open-sourcing human feedback models. The first stems from worries that open-sourcing generally powerful AI technologies endangers humanity; the second is an objection from the entities that train the language models. I cover the second objection in its own section, and address the first here.

One major line of argument against open-sourcing powerful AI technology is that it could cause actual harm. To use an extreme example, I am very strongly opposed to open-sourcing nuclear bombs. Over the long term we should expect AGI to have potential impacts and dangers far outweighing those of any modern technology, so I do not want to trivialize this objection. In particular, I am unsure whether it is correct to open-source capabilities research such as the largest possible language models. But I think reward models are much safer to open-source, and the benefits far outweigh the harms. Most of the capabilities of these language models come from the information distilled during pretraining, not the final alignment stage, so we shouldn't expect open-sourcing reward models to greatly increase the latent capability of the underlying models.

There is a second, related concern: that in the near term these models could be used to disseminate massive amounts of misinformation or otherwise cause harm. I am sympathetic to this view, but I see it as more of an argument against open-sourced capabilities research (which I have not thought about deeply enough to take a strong stance on) than against open-sourcing reward models. I also think iterating with rapid public feedback is important, as these models will continue to improve regardless of whether they are open-source, and we can't simply bury our heads in the sand and pretend that powerful AI won't exist.


How to Stay Competitive 

In this section, I want to cover what I believe is actually the main objection that prevents open-sourcing of reward models in practice: that companies feel it will hurt their competitiveness. I think this is misguided for two main reasons.

First, I expect that open-source reward models/datasets will be able to dramatically outcompete anything a single company could build. In the limit, I expect reward models to solve all of the easy problems and leave only those problems that require expert knowledge to rate. With non-expert labelers you can only get so far on understanding the "goodness" of some deep mathematical proof, research summary, or medical diagnosis. I can't give a meaningful reward signal on Wiles's proof of Fermat's Last Theorem, so a reward model trained on my feedback alone won't be able to either. If companies want models that push the edge of human capability, it seems important to have reward signals from a very broad range of people with outlier performance in these domains, to give useful gradients to the RL process. Building an enormous, diverse, and expert dataset seems possible only through open source, where expert contributions are rewarded in some way.

Second, I don’t think open-sourcing reward models commoditizes large models. Only the largest companies with massive compute clusters are able to train SOTA models. Deep technical experience is also required to train large models or get the most performance out of reinforcement learning. Companies still have domain expertise in training large models that is not easy to replicate, and can compete in the realm of capabilities/scaling, by using other proprietary data, or through superior business development and partnerships. Finally, I think this objection underestimates just how positive-sum alignment progress will be. Models continue to make massive quantitative progress, with qualitative performance leaps on blazingly fast timescales. It is hard to imagine any aspect of life that will not be transformed by the emergence of more powerful and general AI systems. With the subsidy from alignment and the positive-sum nature of building collaborative human feedback datasets, I find it hard to believe that the cost of a handful of other large companies having access to good alignment techniques could possibly outweigh the expansion of applications made possible by better alignment progress. There is too much low-hanging fruit to worry about how to divide up the pie at this point, as opposed to growing it for everyone.



In conclusion, the success of ChatGPT and other instruction-tuned models demonstrates the effectiveness of Reinforcement Learning from Human Feedback (RLHF) in improving the performance of language models. The open sourcing of human feedback reward models is crucial in building aligned and safe AGI, as it reduces the alignment tax burden in the areas of performance, development, and time-to-deployment. Additionally, language model focused businesses can still compete while open sourcing their human feedback models by using their expertise and resources to build and maintain the best reward models. By working towards open sourcing human feedback reward models, we can move closer to achieving aligned and safe AGI that can be used for the benefit of all.


* Written first-try by ChatGPT. If the reward model were open source, I would rate this conclusion down slightly, suggesting that it should also mention the importance of allowing the public to investigate and shape reward models, and pointing out that I never argued companies should compete to build the best reward models; rather, that they could compete in other areas such as capabilities research, or simply acknowledge how positive-sum open-source alignment progress will be.



6 comments

I don't have too much information, but CarperAI is planning to open-source GPT-J models fine-tuned via RLHF for a few different tasks (e.g. summarization). I think people should definitely do interpretability work on these models as soon as they are released, and it would be great to compare interpretability results from the RLHF models to self-supervised models.

I hesitantly disagree, because I think that releasing reward models will cause them to get over-optimized, and the backlash against the over-optimization may more than undo its positive impact on the world. I would suggest instead releasing the models that have been trained on those rewards, because, for example, Anthropic's models would give much better advice than td3.

I think I don't completely understand the objection. Is your concern that less competent organizations will overfit to the reward models during fine-tuning, and so release worse models than OpenAI/Anthropic are able to train? I think this is a fair objection, and one argument for open-sourcing the full model.

My main goal with this post is to advocate for it being good to "at least" open source the reward models, and that the benefits of doing this would far outweigh the costs, both societally and for the organizations doing the open-sourcing. I tend to think that completely unaligned models will get more backlash than imperfectly aligned ones, but maybe this is incorrect. I haven't thought deeply about whether it is safe/good for everyone to open source the underlying capability model.

people are open sourcing highly capable models already, but those models have not been trained on particularly meaning-aligned corpuses. releasing reward models would allow people to change the meaning of words in ways that cause the reward models to rank their words highly but which are actually dishonest; this is currently called SEO. however, if getting your ideas repeated by a language model is the new optimization target, then having your language model objective be open source will mess with things. idk mate, just my ramblings I guess

I think this is a fair point: an open reward function is subject to "SEO" efforts to game it. But how about a "training" reward function that is open, and a "test" reward function that is hidden?

I would love to know about other OSS efforts on reward functions (I do follow Carper's development on RF), and would love to contribute.

