In this post I’m going to describe my basic justification for working on RLHF in 2017-2020, which I still stand behind. I’ll discuss various arguments that RLHF research had an overall negative impact and explain why I don’t find them persuasive.
I'll also clarify that I don't think research on RLHF is automatically net positive; alignment research should address real alignment problems, and we should reject a vague association between "RLHF progress" and "alignment progress."
Background on my involvement in RLHF work
Here are some background views about alignment I held in 2015 and still hold today. I expect disagreements about RLHF will come down to disagreements about this background:
- The simplest plausible strategies for alignment involve humans (maybe with the assistance of AI systems) evaluating a model’s actions based on how much we expect to like their consequences, and then training the models to produce highly-evaluated actions. (This is in contrast with, for example, trying to formally specify the human utility function, or notions of corrigibility / low-impact / etc, in some way.)
- Simple versions of this approach are expected to run into difficulties, and potentially to be totally unworkable, because:
  - Evaluating consequences is hard.
  - A treacherous turn can cause trouble too quickly to detect or correct even if you are able to do so, and it’s challenging to evaluate treacherous turn probability at training time.
- It’s very unclear if those issues are fatal before or after AI systems are powerful enough to completely transform human society (and in particular the state of AI alignment). Even if they are fatal, many of the approaches to resolving them still have the same basic structure of learning from expensive evaluations of actions.
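The basic structure described above (an overseer evaluates actions, a model learns from those evaluations, and training then favors highly-evaluated actions) can be sketched in a few lines. This is only an illustrative toy, not the method used in any of the projects discussed here: the linear reward model, the synthetic "overseer" with hidden weights, the pairwise-comparison (Bradley-Terry) loss, and the best-of-n selection standing in for RL are all assumptions made for the example.

```python
import math
import random

def reward(w, x):
    """Linear reward: dot product of weights and action features."""
    return sum(wi * xi for wi, xi in zip(w, x))

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def make_comparisons(hidden_w, n, rng):
    """Hypothetical overseer: returns (preferred, rejected) pairs of actions."""
    pairs = []
    for _ in range(n):
        a = [rng.uniform(-1, 1) for _ in hidden_w]
        b = [rng.uniform(-1, 1) for _ in hidden_w]
        if reward(hidden_w, a) >= reward(hidden_w, b):
            pairs.append((a, b))
        else:
            pairs.append((b, a))
    return pairs

def train_reward_model(comparisons, dim, lr=0.1, epochs=200):
    """Fit reward weights by gradient ascent on the Bradley-Terry log-likelihood."""
    w = [0.0] * dim
    for _ in range(epochs):
        for good, bad in comparisons:
            p_good = sigmoid(reward(w, good) - reward(w, bad))
            # Gradient of log p_good with respect to w is (1 - p_good) * (good - bad).
            for i in range(dim):
                w[i] += lr * (1.0 - p_good) * (good[i] - bad[i])
    return w

rng = random.Random(0)
hidden_w = [2.0, -1.0, 0.5]  # assumed "true" overseer preferences
comparisons = make_comparisons(hidden_w, 200, rng)
learned_w = train_reward_model(comparisons, dim=3)

# Stand-in for policy optimization: pick the candidate the learned reward likes best.
candidates = [[rng.uniform(-1, 1) for _ in range(3)] for _ in range(50)]
best = max(candidates, key=lambda x: reward(learned_w, x))
```

Note that the sketch inherits the difficulties listed above: the learned reward is only as good as the overseer's comparisons, and nothing in it addresses behavior that looks good to the overseer but is not.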
In order to overcome the fundamental difficulties with RLHF, I have long been interested in techniques like iterated amplification and adversarial training. However, prior to 2017 most researchers I talked to in ML (and many researchers in alignment) thought that the basic strategy of training AI with expensive human evaluations was impractical for more boring reasons, and so weren't interested in these difficulties. On top of that, we obviously weren’t able to actually implement anything fancier than RLHF, since all of these methods involve learning from expensive feedback. I worked on RLHF to try to facilitate and motivate work on fixes.
The history of my involvement:
- My first post on this topic was in 2015.
- When I started full-time at OpenAI in 2017, implementing RLHF seemed to me like it would be an impactful project; I considered doing a version with synthetic human feedback (showing that we could learn from a practical amount of algorithmically-defined feedback) but my manager Dario Amodei convinced me it would be more compelling to immediately go for human feedback. The initial project was surprisingly successful and published here.
- I then intended to implement a version with language models aiming to be complete in the first half of 2018 (aiming to build an initial amplification prototype with LMs around end of 2018; both of these timelines were about 2.5x too optimistic). This seemed like the most important domain to study RLHF and alignment more broadly. In mid-2017 Alec Radford helped me do a prototype with LSTM language models (prior to the release of transformers); the prototype didn’t look promising enough to scale up.
- In mid-2017 Geoffrey Irving joined OpenAI and was excited about starting with RLHF and then going beyond it using debate; he also thought language models were the most important domain to study and had more conviction about that. In 2018 he started a larger team working on fine-tuning on language models, which completed its initial RLHF project in 2019. This required building significant infrastructure for scaling and working with language models, since this work was happening in parallel with GPT-2.
- Geoffrey later left for DeepMind and I took over the team. We wrote a follow-up paper polishing the result to the point where it seemed to be production-ready. Some people on the team started working on applying these results in production; Ryan Lowe ultimately led this effort which spun out into a different team (see paper). We also began working on simple settings where humans needed to use AI systems to solve subtasks (see paper). I left OpenAI at the start of 2021 to return to focusing on theory and Jan Leike took over the team.
The case for a positive impact
Overall, I think that early work on RLHF had significant value:
- I think it is hard to productively work on more challenging alignment problems without first implementing basic solutions.
  - “Solve real problems one at a time” seems like a good way to make progress and is how most fields work. Trying to justify research on problem X by saying “well we could do RLHF, but it wouldn’t fix speculative problem X” is uncompelling to most audiences if no one has implemented RLHF or observed problem X. It’s even worse if they have plenty of more mundane examples of unaligned behavior unrelated to X.
  - Without implementing basic solutions it’s much harder to empirically validate your hypotheses about risks. We can make reasonable arguments about what failures will eventually occur with RLHF, but you can learn more by building the system and studying it. I think there are real, huge uncertainties here, and the safety community is taking weak arguments too seriously.
  - A lot of historical work on alignment seems like it addresses subsets of the problems solved by RLHF, but doesn’t actually address the important ways in which RLHF fails. In particular, a lot of that work is only necessary if RLHF is prohibitively sample-inefficient. Determining whether RLHF has fundamental difficulties seems like a good way to improve research prioritization.
  - Many more complex alignment proposals involve the same technical ingredients as RLHF, especially learning a reward from an expensive overseer. I think that debate and recursive reward modeling in particular are plausible approaches to alignment for mildly superhuman systems, and they build directly on RLHF.
- Taking ideas from theory to practice helps build expertise about how to do so, which both informs alignment research and facilitates future implementation.
  - For example, a major point of disagreement between me and Eliezer is that Eliezer often dismisses plans as “too complicated to work in practice,” but that dismissal seems divorced from experience with getting things to work in practice (e.g. some of the ideas that Eliezer dismisses are not much more complex than RLHF with AI assistants helping human raters). In fact I think that you can implement complex things by taking small steps; almost all of these implementation difficulties do improve with empirical feedback.
  - Moreover, this kind of expertise is directly relevant when implementing future alignment proposals even if they are very different from RLHF. The implicit alternative seems to be an alignment community that deliberately avoids any problems that would be helpful for making AI systems useful, and potentially avoids doing any engineering work at all, creating predictable and potentially huge problems with implementation.
The case for a negative impact
People in the safety community make some arguments that research on RLHF has costs larger than these benefits. I don’t currently find these arguments persuasive:
- RLHF (and other forms of short-term “alignment” progress) make AI systems more useful and profitable, hastening progress towards dangerous capabilities.
  - RLHF is just not that important to the bottom line right now. Imitation learning works nearly as well, other hacky techniques can do quite a lot to fix obvious problems, and the whole issue is mostly second order for the current bottom line. RLHF is increasingly important as time goes on, but it also becomes increasingly overdetermined that people would have done it. In general I think your expectation should be that incidental capabilities progress from safety research is a small part of total progress, given that it’s a small fraction of people, very much not focused on accelerating things effectively, in a domain with diminishing returns to simultaneous human effort. This can be overturned by looking at details in particular cases, but I think safety people making this argument mostly aren’t engaging with details in a realistic way.
  - Trying to delay AI progress by avoiding making AI systems better at doing what people want feels holistically unwise. RLHF does not appear to increase the kind of capabilities that are directly relevant to risk, but instead has an indirect effect via making AI systems more useful. My intuitive reaction is similar to a proposal to lobby against improvements to the tax code so that taxes will be more painful and the public will be more opposed to new taxes. It might be OK if your goal is to reduce tax burden, but probably counterproductive for reducing the social cost of taxes.
  - Avoiding RLHF at best introduces an important overhang: people will implicitly underestimate the capabilities of AI systems for longer, slowing progress now but leading to faster and more abrupt change later as people realize they’ve been wrong. Similarly, to the extent you successfully slow scaling, you are then in for faster scaling later from a lower initial amount of spending—I think it’s significantly better to have a world where TAI training runs cost $10 billion than a world where they cost $1 billion. A key background view is that the great majority of effective safety work will come when people are working with systems that are much closer to posing a risk, e.g. so they can actually exhibit and study interesting forms of reward hacking and deceptive alignment. Overall in expectation I think these effects claw back most of the benefits of slowing down progress by avoiding RLHF.
- RLHF “covers up problems” so that you can’t or won’t fix them in other ways.
  - RLHF lets you produce models that don’t do bad-looking things, but there are some things which look fine but are actually bad. So you might worry that RLHF makes problems harder to study by covering up their symptoms. But we can (and do) still train models without RLHF, or using a weak overseer where outputs can be validated by stronger overseers. It seems that RLHF makes it much easier to produce realistic examples of problems, both because it facilitates settings with the kind of realistic failure modes you actually want to study (namely overpowering or misleading overseers) and because without RLHF there are going to be a thousand other hacks to try first to fix the problems.
  - You might argue that RLHF gives people a way to cover up problems and so lets them avoid fixing them in deeper ways, or gives them a “false sense of security.” But in practice if people run into problems that can be fixed with RLHF, it looks like they will just do RLHF later (which is getting easier and easier over time). And in practice most of the problems that can be addressed with RLHF can be addressed in other hackier ways as well. This potential objection seems to rest on an unreasonably optimistic model about how superficial problems force people into pursuing deep fixes.
- RLHF is less safe than imitation or conditioning generative models.
  - If we’re considering the danger posed by a model of a fixed level of usefulness, I think this is probably false though it’s a complicated question and I’m uncertain. The AI safety community makes various informal arguments about this which I find unpersuasive (though I mostly haven’t seen them laid out carefully). I suspect the differences are small and require empirical investigation. (While I appreciate many of the investigations in this paper and think it is good to improve our understanding, I don’t think they let us tell what’s up with risk.) This could be the subject of a much longer post and maybe will be discussed in the comments.
  - If RLHF poses distinctive risks, we are overwhelmingly more likely to avoid those risks by understanding them rather than by hoping no one ever implements RLHF. It’s unrealistic and deeply unstable to hope that no one uses RLHF because they didn’t think of it.
- This entire alignment approach is impractical, and therefore all the arguments about “taking the first step in the right direction” are wrong. On top of that, working on RLHF obfuscates that fact and dilutes what should be a robust community consensus.
  - To the extent this is true, I think it would be a pretty powerful argument against RLHF (largely because it implies that most of the benefits aren’t real). But I don’t agree that the approach can’t work. I’ve talked about this a lot with people, but feel like the arguments just aren’t holding together. The weak links are (i) arguments about the timing of difficulties relative to e.g. radically superhuman models: almost all of the arguments kick in after human level and it’s just not clear how far after; (ii) the probability of deceptive alignment emerging despite simple countermeasures, which I think of as a completely open empirical question: existing arguments are fine for arguing plausibility, but definitely can’t get you to 90% rather than 50%; (iii) the feasibility of fundamental improvements to RLHF.
Overall, I think it was valuable to use RLHF to fix the kind of basic alignment problems that are ubiquitous with pre-trained models. I think it has had a real impact facilitating work on more fundamental challenges, and helped move the community one step closer towards the kind of alignment solutions I expect to ultimately be successful.
Future work
I remain excited about "straightforward" approaches to improving RLHF, like devising better feedback (using combinations of human and AI work) and improving robustness by adversarial training. I think this work will continue to make ML systems more useful in practice, and so will be subject to the same kinds of objections as above. I still tentatively think this work is net positive and don't find arguments against persuasive.
I think this follow-up research will also not need to solve the “fundamentally confusing” problems for a long time, but that solving tractable problems gives you a good chance of aligning modestly superhuman AI and facilitates future work on the remaining more challenging problems.
That said, I don’t think that improving or studying RLHF is automatically “alignment” or necessarily net positive. Research should be justified by an argument that it actually helps address important failures. Here are some types of work in this space that I’m particularly excited about:
- Work that addresses robustness in cases where we cannot train on deployment examples, or where we care about failure rates that are small relative to fine-tuning dataset size. In practice this would happen if failures are very high-stakes, but we can also study synthetic domains where we artificially aim at very low failure rates.
- Training AI systems to give more correct answers in domains where human overseers can’t easily judge results and there is no other source of end-to-end feedback during training. That may involve giving humans better tools, studying and improving generalization from domains that do have feedback, or other methods.
- Anything that addresses clear examples of alignment failures, for which we have good reasons to believe that models “know” things they aren’t telling us, or “know” what we want them to do but nevertheless do something else. Many of these will fall into the first two categories, but it’s also interesting to fix more mundane failures (e.g. obvious untruths) if they can be clearly identified as alignment problems.
- Creating in vitro examples of problems analogous to the ones that will ultimately kill us, e.g. by showing agents engaging in treacherous turns due to reward hacking or exhibiting more and more of the core features of deceptive alignment.
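One way to see why failure rates that are small relative to dataset size (the first item above) are hard is a standard statistical bound; it is added here as an illustration and is not part of the original argument. If a model is checked on n independent samples and no failures are observed, the 95% upper confidence bound on its failure rate is 1 - 0.05^(1/n), roughly 3/n (the "rule of three"). The function name and the independence assumption below are choices made for this sketch.

```python
import math

def zero_failure_upper_bound(n, confidence=0.95):
    """Upper confidence bound on the failure rate after n independent trials
    with zero observed failures: the p solving (1 - p)**n = 1 - confidence."""
    # expm1 keeps precision when the bound is very small.
    return -math.expm1(math.log(1.0 - confidence) / n)

for n in (100, 10_000, 1_000_000):
    bound = zero_failure_upper_bound(n)
    print(f"n={n}: failure rate could still be as high as {bound:.2e} (3/n = {3 / n:.2e})")
```

Under these assumptions, ordinary sampling on a dataset of size n cannot by itself certify a failure rate much below about 3/n, which is one reason other techniques are needed when failures are very high-stakes.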
I am very confused why you think this, right after the success of ChatGPT, where approximately the only difference from GPT-3 was the presence of RLHF.
My current best guess is that ChatGPT alone, via sparking an arms race between Google and Microsoft, and by increasing OpenAI's valuation, should be modeled as the equivalent of something on the order of $10B of investment into AI capabilities research, completely in addition to the gains from GPT-3.
And my guess is most of that success is attributable to the work on RLHF, since that was really the only substantial difference between ChatGPT and GPT-3. We also should not think this was overdetermined, since 1.5 years passed between the release of GPT-3 and the release of ChatGPT (with some updates to GPT-3 in the meantime, but my guess is no major ones), and no other research lab focused on capabilities had set up their own RLHF pipeline (except Anthropic, which I don't think makes...
I think the qualitative difference between the supervised tuning done in text-davinci-002 and the RLHF in text-davinci-003 is modest (e.g. I've seen head-to-head comparisons suggesting real but modest effects on similar tasks).
I think the much more important differences are:
I think the effect would have been very similar if it had been trained via supervised learning on good dialogs.
...
I don't currently think this is the case, and this seems like the likely crux. In general it seems that RLHF is substantially more flexible in what kind of target task it allows you to train for, which is the whole reason why you are working on it, and at least my model of the difficulty of generating good training data for supervised learning here is that it would have been a much greater pain, and would have been much harder to control in various fine-grained ways (including preventing the AI from saying controversial things), which had been the biggest problem with previous chat bot attempts.
I find a comparison with John Schulman here unimpressive if you want to argue progress on this was overdetermined, given the safety motivation by John, and my best guess being that if you had argued forcefully that RLHF was pushing on commercialization bottlenecks, that John would have indeed not worked on it.
Seeing RLHF teams in other organizations not directly downstream of your organizational involvement,...
I bet they did generate supervised data (certainly they do for InstructGPT), and supervised data seems way more fine-grained in what you are getting the AI to do. It's just that supervised fine-tuning is worse.
I think the biggest problem with previous chat-bot attempts is that the underlying models are way way weaker than GPT-3.5.
...
My guess is total investment was around the $200B - $500B range, with about $100B of that into new startups and organizations, and around $100-$400B of that in organizations like Google and Microsoft outside of acquisitions. I have pretty high uncertainty on the upper end here, since I don't know what fraction of Google's revenue gets reinvested again into AI, how much Tesla is investing in AI, how much various governments are investing, etc.
Variance between different years depending on market condition and how much products take off seems like on the order of 50% to me. Like, different years have pretty hugely differing levels of investment.
My guess is about 50% of that variance is dependent on different products taking off, how much traction AI is getting in various places, and things like Chat-GPT existing vs. not existing.
So this gives around $50B - $125B of variance to be explained by product-adjacent things like Chat-GPT.
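The arithmetic behind this estimate can be reproduced directly. The inputs below are the guesses stated in the comment (total investment of $200B to $500B, year-to-year variation of about 50%, and about half of that variation attributed to products); the variable names are made up for this sketch, and the code only checks the multiplication, not the guesses themselves.

```python
# All dollar figures are in billions and are the commenter's guesses.
total_low, total_high = 200, 500   # guessed range of total AI investment
yearly_variation = 0.5             # guessed year-to-year variation in investment
product_share = 0.5                # guessed share of that variation due to products

low = total_low * yearly_variation * product_share
high = total_high * yearly_variation * product_share
print(f"${low:.0f}B - ${high:.0f}B")  # prints "$50B - $125B"
```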
...
I don't think this is right -- the main hype effect of ChatGPT over previous models feels like it's just because it was in a convenient chat interface that was easy to use and free. My guess is that if you did a head-to-head comparison of RLHF and kludgey random hacks involving imitation and prompt engineering, they'd seem similarly cool to a random journalist / VC, and generate similar excitement.
I don't have extensive relevant expertise, but as a personal datapoint: I used Davinci-002 multiple times to generate an interesting dialogue in order to test its capabilities. I ran several small-scale Turing tests, and the results were quite unimpressive in my opinion. When ChatGPT came out, I tried it out (on the day of its release) and very quickly felt that it was qualitatively better at dialogue. Of course, I could have simply been prompting Davinci-002 poorly, but overall I'm quite skeptical that the main reason for ChatGPT hype was that it had a more convenient chat interface than GPT-3.
Yep, and text-davinci-002 was trained with supervised finetuning / written demos, while 003 was trained with RLHF via PPO. Hypothetically, the clearest illustration of RLHF's capabilities gains should be from comparing 002 to 003. However, OpenAI could have also used other methods to improve 003, such as with Transcending Scaling Laws with 0.1% Extra Compute.
This page also says that:
So I guess 003 could also have different base pretraining data?
People seem pretty impressed with CharacterAI, which seems to get most of its character-specific info from prompting and having finetuned on roleplay dialog. However, it's also possible that CharacterAI's base models are RLHF'd to be consistent roleplayers.
Thanks for this post! I wanted to write a post about my disagreements with RLHF in a couple weeks, but your treatment is much more comprehensive than what I had in mind, and from a more informed standpoint.
I want to explain my position on a couple points in particular though - they would've been a central focus of what I imagined my post to be, points around which I've been thinking a lot recently. I haven't talked to a lot of people about this explicitly so I don't have high credence in my take, but it seems at least worth clarifying.
My picture on why taking ordinary generative models and conditioning them to various ends (like accelerating alignment, for example) is useful relies on a key crux that the intelligence we're wielding is weighted by our world prior. We can expect it to be safe insofar as things normally sampled from the distribution underlying our universe are, modulo arbitrarily powerful conditionals (which degrade performance to an extent anyway) while moving far away from the default world state.
So here's one of my main reasons for not liking RLHF: it removes this very satisfying property. Models tha...
I mostly care about how an AI selected to choose actions that lead to high reward might select actions that disempower humanity to get a high reward, or about how an AI pursuing other ambitious goals might choose low loss actions instrumentally and thereby be selected by gradient descent.
Perhaps there are other arguments for catastrophic risk based on the second-order effects of changes from fine-tuning rippling through an alien mind, but if so I either want to see those arguments spelled out or more direct empirical evidence about such risks.
I'll start with a pretty uncontroversial example that's neither RLHF nor conditioning but tries to point at a shared intuition; two different models:
1. LLM fine tuned with RL, where reward comes from some kind of activation-reading truth probes.
2. LLM that trains on the output of the first model to the point where it ~perfectly matches its final output, but does not undergo any additional fine tuning.
Despite having identical final outputs, I would expect the first model to have higher probe-reported truthiness because it was optimized against that metric.
With the way I was using the word "fighting", I would say that the first model is fighting you (a little bit), and the second one isn't. The first model itself has learned adversarial weights that directly interfere with efforts to understand it.
Next, an impractical and extreme example, again with two models:
1. LLM fine tuned with RLHF for apparent honesty, but (for the purposes of the hypothetical) it ended up deceptive somehow.
2. "LLM" operating at an intractably low level of simulation, closer to physics, without fine tuning, which was conditioned to output a sequence which maps to the exact same deceptive behavior as the first ...
I think this argument, if true, mostly says that your work on RLHF must have been net-neutral, because people would have done RLHF even if nobody did it for the purposes of alignment. If false, then RLHF was net-negative because of its capabilities externalities. I also don't buy your argument about relative numbers of people working on capabilitie...
Doing things sooner and in a different way matters.
This argument is like saying that scaling up language models is net-neutral for AGI, because people would have done it anyway for non-AGI purposes. Doing things sooner matters a lot. I think in most of science and engineering that's the main kind of effect that anything has.
No, if false then it has a negative effect which must be quantitatively compared against positive effects.
Most things have some negative effects (e.g. LW itself).
This doesn't seem relevant---we were asking how large an accelerating effect alignment researchers have relative to capabilities researchers (since that determines how many days of speed-up they cause), so if capabilities progress is easier then that seems to increase both numerator and denominator.
...
I don't think this is true. Transformers were introduced by normal NLP researchers at Google. Generative pre-training is a natural thing to do with them, introduced at OpenAI by Alec Radford (blog post here) with no relationship to alignment.
A central version of this seems to straightforwardly advance capabilities. The strongest (ISTM) sort of analogy between a current system and a future lethal system would be that they use an overlapping set of generators of capabilities. Trying to find an agent that does a treacherous turn, for the same reasons as a f...
So for example, say Alice runs this experiment:
Alice observes that A learns to hack B. Then she solves this as follows:
Alice observes that A doesn't hack B. Then Bob looks at Alice's results and says,
"Cool. But this won't generalize to future lethal systems because it doesn't account for how A can combine innocuous understanding that it gains. Future systems, to be very competent, will probably do something functionally equivalent to exploring their environment to understand parts of the environment without necessarily trying to achieve some big goal (such as hacking B) along the way. This creates a 'capabilities overhang' relative to the overseer: there's no behavior that's clearly aimed at something B considers dangerous, but A accumulates ability to put together plans that do more and more effective stuff, compared to what A has actually previously acted out and gotten direct reinforcement on. ...
But it would be better if we collectively then decided not to rush forward anyway, right?
And I still don't get why you expect the future environment, where somewhat-aligned superhuman AIs are available, to be better for alignment work. Like, sure, an automatic idea generator and verifier may be useful, but it's also useful for reckless people. And, intuitively, the more advanced an AI is, the less I would trust it. So "let's try as hard as we can to advance civilization, because a more advanced civilization will be better at alignment" seems like a very risky plan.
Do you have examples of such historical work that you're happy to name? I'm really unsure what you're referring to (probably just because I haven't been involved in alignment for long enough).
It seems like most/all large models (especially language models) will be first trained in a similar way, using self-supervised learning on large unlabelled raw datasets (such as web text), and it looks like there is limited room for manoeuvre/creativity in shaping the objective or training process when it comes to this stage. Fundamentally, this stage is just about developing a really good compression algorithm for all the training data.
The next stage, when we try and direct the model to perform a certain task (either trivially, via prompting, or via...
Has ARC got a written policy for if/when similar experiments generate inconclusive but possible evidence of dangerous behaviour?
If so, would you consider sharing it (or a non-confidential version) for other organisations to use?
Do you mean they don't tell us what's up with the difference in risks of the measured techniques, or that they don't tell us much about AI risk in general? (I'd at least benefit from learning more about your views here)
typo?