Huge thanks to Chaska Yamane, Abhay Singh and Arpit Kalla for feedback and discussions.

Introduction

The recent success of ChatGPT lays bare the power of Reinforcement Learning from Human Feedback (RLHF). Previous models in the instruct series showed remarkable gains from RLHF, allowing instruction-tuned models to outperform models 100x their size. ChatGPT feels to me like it passed a qualitative threshold, though: its outputs are genuinely useful to the average person with no interest in machine learning. I expect this impact to expand in the coming year, as language models continue to scale, use more optimal data scaling, and augment their knowledge by searching the internet. With the power of these models in mind, I want to argue for two main claims:

Open sourcing human feedback reward models is imperative for building aligned and safe AGI
Language model focused businesses can still compete while open sourcing their human feedback models

Throughout this work I will distinguish between capabilities research/models and alignment research/reward models. My intuitive framework is that capabilities come from scaling models up, while alignment comes from tuning these models to better elicit the capabilities they already have in ways that humans prefer. These dimensions are not entirely orthogonal, since alignment exposes new capabilities, but I nonetheless endorse the claim that a fundamental piece of safety is to increase the rate of alignment progress relative to the rate of capabilities progress.

Importance to Alignment

Reducing the Alignment Tax Burden

They say only two things are certain: death and taxes. Hopefully, when building AGI, we can get by with just paying our alignment taxes and avoid the whole death part. An alignment tax is “any additional cost that is incurred in the process of aligning an AI system.” Jan Leike lays out three main classes of alignment tax, all of which I argue are greatly reduced by open sourcing human feedback models.

Performance taxes: Performance regressions caused by alignment, compared to an unaligned baseline.
Development taxes: Effort or expenses incurred for aligning the model: researcher time, compute costs, compensation for human feedback, etc.
Time-to-deployment taxes: Wall-clock time taken to produce a sufficiently aligned model.

...

ZZ Si
