Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

In this post I’m going to describe my basic justification for working on RLHF in 2017-2020, which I still stand behind. I’ll discuss various arguments that RLHF research had an overall negative impact and explain why I don’t find them persuasive.

I'll also clarify that I don't think research on RLHF is automatically net positive; alignment research should address real alignment problems, and we should reject a vague association between "RLHF progress" and "alignment progress."

Background on my involvement in RLHF work

Here are some background views about alignment I held in 2015 and still hold today. I expect disagreements about RLHF will come down to disagreements about this background:

  • The simplest plausible strategies for alignment involve humans (maybe with the assistance of AI systems) evaluating a model’s actions based on how much we expect to like their consequences, and then training the models to produce highly-evaluated actions. (This is in contrast with, for example, trying to formally specify the human utility function, or notions of corrigibility / low-impact / etc, in some way.)
  • Simple versions of this approach are expected to run into difficulties, and potentially to be totally unworkable, because:
    • Evaluating consequences is hard.
    • A treacherous turn can cause trouble too quickly to detect or correct even if you are able to do so, and it’s challenging to evaluate treacherous turn probability at training time.
  • It’s very unclear if those issues are fatal before or after AI systems are powerful enough to completely transform human society (and in particular the state of AI alignment). Even if they are fatal, many of the approaches to resolving them still have the same basic structure of learning from expensive evaluations of actions.
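As a rough illustration of the basic strategy in the first bullet (an overseer evaluates actions, and the model is trained toward highly-evaluated ones), here is a minimal sketch. Everything specific in it is an assumption made for illustration and is not stated in this post: the pairwise-comparison format, the Bradley-Terry likelihood, the linear reward model, the two-feature "actions", and the simulated overseer with hidden weights TRUE_W. Picking the best of a few candidates stands in for the RL step.

```python
import math
import random

random.seed(0)

# Hypothetical hidden preferences of the overseer (an assumption for this toy).
TRUE_W = [1.0, -2.0]

def true_score(action):
    """How much the simulated overseer likes an action's consequences."""
    return sum(w * x for w, x in zip(TRUE_W, action))

def random_action():
    return [random.uniform(-1.0, 1.0) for _ in TRUE_W]

def collect_comparisons(n):
    """Simulate n expensive evaluations; each returns (preferred, rejected)."""
    pairs = []
    for _ in range(n):
        a, b = random_action(), random_action()
        pairs.append((a, b) if true_score(a) > true_score(b) else (b, a))
    return pairs

def reward(w, action):
    """Learned linear reward model r(x) = w . x."""
    return sum(wi * xi for wi, xi in zip(w, action))

def train_reward_model(comparisons, num_features, lr=0.1, epochs=50):
    """Fit w by gradient ascent on the Bradley-Terry log-likelihood,
    where P(good preferred over bad) = sigmoid(r(good) - r(bad))."""
    w = [0.0] * num_features
    for _ in range(epochs):
        for good, bad in comparisons:
            margin = reward(w, good) - reward(w, bad)
            p_good = 1.0 / (1.0 + math.exp(-margin))
            step = lr * (1.0 - p_good)  # gradient of log sigmoid(margin)
            w = [wi + step * (g - b) for wi, g, b in zip(w, good, bad)]
    return w

# Learn a reward model from a modest number of comparisons.
w = train_reward_model(collect_comparisons(200), num_features=len(TRUE_W))

# Check agreement with the overseer on held-out comparisons.
held_out = collect_comparisons(500)
accuracy = sum(reward(w, g) > reward(w, b) for g, b in held_out) / len(held_out)

# Stand-in for the policy-improvement step: choose the highest-reward candidate.
candidates = [random_action() for _ in range(20)]
best = max(candidates, key=lambda a: reward(w, a))

print("learned weights:", w)
print("held-out agreement:", accuracy)
print("chosen action:", best)
```

The sketch only captures the "learn from a limited budget of evaluations" structure. It says nothing about the two difficulties listed above, since the simulated overseer here evaluates consequences perfectly.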

In order to overcome the fundamental difficulties with RLHF, I have long been interested in techniques like iterated amplification and adversarial training. However, prior to 2017 most researchers I talked to in ML (and many researchers in alignment) thought that the basic strategy of training AI with expensive human evaluations was impractical for more boring reasons and so weren't interested in these difficulties. On top of that, we obviously weren’t able to actually implement anything fancier than RLHF since all of these methods involve learning from expensive feedback. I worked on RLHF to try to facilitate and motivate work on fixes.

The history of my involvement:

  • My first post on this topic was in 2015.
  • When I started full-time at OpenAI in 2017 it seemed to me like it would be an impactful project; I considered doing a version with synthetic human feedback (showing that we could learn from a practical amount of algorithmically-defined feedback) but my manager Dario Amodei convinced me it would be more compelling to immediately go for human feedback. The initial project was surprisingly successful and published here.
  • I then intended to implement a version with language models aiming to be complete in the first half of 2018 (aiming to build an initial amplification prototype with LMs around end of 2018; both of these timelines were about 2.5x too optimistic). This seemed like the most important domain to study RLHF and alignment more broadly. In mid-2017 Alec Radford helped me do a prototype with LSTM language models (prior to the release of transformers); the prototype didn’t look promising enough to scale up.
  • In mid-2017 Geoffrey Irving joined OpenAI and was excited about starting with RLHF and then going beyond it using debate; he also thought language models were the most important domain to study and had more conviction about that. In 2018 he started a larger team working on fine-tuning on language models, which completed its initial RLHF project in 2019. This required building significant infrastructure for scaling and working with language models, since this work was happening in parallel with GPT-2.
  • Geoffrey later left for DeepMind and I took over the team. We wrote a follow-up paper polishing the result to the point where it seemed to be production-ready. Some people on the team started working on applying these results in production; Ryan Lowe ultimately led this effort which spun out into a different team (see paper). We also began working on simple settings where humans needed to use AI systems to solve subtasks (see paper). I left OpenAI at the start of 2021 to return to focusing on theory and Jan Leike took over the team.

The case for a positive impact

Overall, I think that early work on RLHF had significant value:

  • I think it is hard to productively work on more challenging alignment problems without first implementing basic solutions.
    • “Solve real problems one at a time” seems like a good way to make progress and is how most fields work. Trying to justify research on problem X by saying “well we could do RLHF, but it wouldn’t fix speculative problem X” is uncompelling to most audiences if no one has implemented RLHF or observed problem X. It’s even worse if they have plenty of more mundane examples of unaligned behavior unrelated to X.
    • Without implementing basic solutions it’s much harder to empirically validate your hypotheses about risks. We can make reasonable arguments about what failures will eventually occur with RLHF, but you can learn more by building the system and studying it. I think there are real, huge uncertainties here, and the safety community is taking weak arguments too seriously.
    • A lot of historical work on alignment seems like it addresses subsets of the problems solved by RLHF, but doesn’t actually address the important ways in which RLHF fails. In particular, a lot of that work is only necessary if RLHF is prohibitively sample-inefficient. Determining whether RLHF has fundamental difficulties seems like a good way to improve research prioritization.
  • Many more complex alignment proposals involve the same technical ingredients as RLHF, especially learning a reward from an expensive overseer. I think that debate and recursive reward modeling in particular are plausible approaches to alignment for mildly superhuman systems, and they build directly on RLHF.
  • Taking ideas from theory to practice helps build expertise about how to do so, which both informs alignment research and facilitates future implementation.
    • For example, a major point of disagreement between me and Eliezer is that Eliezer often dismisses plans as “too complicated to work in practice,” but that dismissal seems divorced from experience with getting things to work in practice (e.g. some of the ideas that Eliezer dismisses are not much more complex than RLHF with AI assistants helping human raters). In fact I think that you can implement complex things by taking small steps—almost all of these implementation difficulties do improve with empirical feedback.
    • Moreover, this kind of expertise is directly relevant when implementing future alignment proposals even if they are very different from RLHF. The implicit alternative seems to be an alignment community that deliberately avoids any problems that would be helpful for making AI systems useful, and potentially avoids doing any engineering work at all, creating predictable and potentially huge problems with implementation.

The case for a negative impact

People in the safety community make some arguments that research on RLHF has costs larger than these benefits. I don’t currently find these arguments persuasive:

  • RLHF (and other forms of short-term “alignment” progress) make AI systems more useful and profitable, hastening progress towards dangerous capabilities. 
    • RLHF is just not that important to the bottom line right now.[1] Imitation learning works nearly as well, other hacky techniques can do quite a lot to fix obvious problems, and the whole issue is mostly second order for the current bottom line. RLHF is increasingly important as time goes on, but it also becomes increasingly overdetermined that people would have done it. In general I think your expectation should be that incidental capabilities progress from safety research is a small part of total progress, given that it’s a small fraction of people, very much not focused on accelerating things effectively, in a domain with diminishing returns to simultaneous human effort. This can be overturned by looking at details in particular cases, but I think safety people making this argument mostly aren’t engaging with details in a realistic way.
    • Trying to delay AI progress by avoiding making AI systems better at doing what people want feels holistically unwise. RLHF does not appear to increase the kind of capabilities that are directly relevant to risk, but instead has an indirect effect via making AI systems more useful. My intuitive reaction is similar to a proposal to lobby against improvements to the tax code so that taxes will be more painful and the public will be more opposed to new taxes. It might be OK if your goal is to reduce tax burden, but probably counterproductive for reducing the social cost of taxes.
    • Avoiding RLHF at best introduces an important overhang: people will implicitly underestimate the capabilities of AI systems for longer, slowing progress now but leading to faster and more abrupt change later as people realize they’ve been wrong. Similarly, to the extent you successfully slow scaling, you are then in for faster scaling later from a lower initial amount of spending—I think it’s significantly better to have a world where TAI training runs cost $10 billion than a world where they cost $1 billion. A key background view is that the great majority of effective safety work will come when people are working with systems that are much closer to posing a risk, e.g. so they can actually exhibit and study interesting forms of reward hacking and deceptive alignment. Overall in expectation I think these effects claw back most of the benefits of slowing down progress by avoiding RLHF.
  • RLHF “covers up problems” so that you can’t or won’t fix them in other ways. 
    • RLHF lets you produce models that don’t do bad-looking things, but there are some things which look fine but are actually bad. So you might worry that RLHF makes problems harder to study by covering up their symptoms. But we can (and do) still train models without RLHF, or using a weak overseer where outputs can be validated by stronger overseers. It seems that RLHF makes it much easier to produce realistic examples of problems—both because it facilitates settings with the kind of realistic failure modes you actually want to study (namely overpowering or misleading overseers) and because without RLHF there are going to be a thousand other hacks to try first to fix the problems.
    • You might argue that RLHF gives people a way to cover up problems and so lets them avoid fixing them in deeper ways, or gives them a “false sense of security.” But in practice if people run into problems that can be fixed with RLHF, it looks like they will just do RLHF later (which is getting easier and easier over time). And in practice most of the problems that can be addressed with RLHF can be addressed in other hackier ways as well. This potential objection seems to rest on an unreasonably optimistic model about how superficial problems force people into pursuing deep fixes.
  • RLHF is less safe than imitation or conditioning generative models.
    • If we’re considering the danger posed by a model of a fixed level of usefulness, I think this is probably false though it’s a complicated question and I’m uncertain. The AI safety community makes various informal arguments about this which I find unpersuasive (though I mostly haven’t seen them laid out carefully). I suspect the differences are small and require empirical investigation. (While I appreciate many of the investigations in this paper and think it is good to improve our understanding, I don’t think they let us tell what’s up with risk.) This could be the subject of a much longer post and maybe will be discussed in the comments.
    • If RLHF poses distinctive risks, we are overwhelmingly more likely to avoid those risks by understanding them rather than by hoping no one ever implements RLHF. It’s unrealistic and deeply unstable to hope that no one uses RLHF because they didn’t think of it.
  • This entire alignment approach is impractical, and therefore all the arguments about “taking the first step in the right direction” are wrong. On top of that, working on RLHF obfuscates that fact and dilutes what should be a robust community consensus.
    • To the extent this is true, I think it would be a pretty powerful argument against RLHF (largely because it implies that most of the benefits aren’t real). But I don’t agree that the approach can’t work. I’ve talked about this a lot with people, but feel like the arguments just aren’t holding together. The weak links are (i) arguments about the timing of difficulties relative to e.g. radically superhuman models (almost all of the arguments kick in after human level and it’s just not clear how far after); (ii) the probability of deceptive alignment emerging despite simple countermeasures, which I think of as a completely open empirical question (existing arguments are fine for arguing plausibility, but definitely can’t get you to 90% rather than 50%); and (iii) the feasibility of fundamental improvements to RLHF.

Overall, I think it was valuable to use RLHF to fix the kind of basic alignment problems that are ubiquitous with pre-trained models. I think it has had a real impact facilitating work on more fundamental challenges, and helped move the community one step closer towards the kind of alignment solutions I expect to ultimately be successful.

Future work

I remain excited about "straightforward" approaches to improving RLHF, like devising better feedback (using combinations of human and AI work) and improving robustness by adversarial training. I think this work will continue to make ML systems more useful in practice, and so will be subject to the same kinds of objections as above. I still tentatively think this work is net positive and don't find arguments against persuasive.

I think this follow-up research will also not need to solve the “fundamentally confusing” problems for a long time, but that solving tractable problems gives you a good chance of aligning modestly superhuman AI and facilitates future work on the remaining more challenging problems.

That said, I don’t think that improving or studying RLHF is automatically “alignment” or necessarily net positive. Research should be justified by an argument that it actually helps address important failures. Here are some types of work in this space that I’m particularly excited about:

  • Work that addresses robustness in cases where we cannot train on deployment examples, or where we care about failure rates that are small relative to fine-tuning dataset size. In practice this would happen if failures are very high-stakes, but we can also study synthetic domains where we artificially aim at very low failure rates.
  • Training AI systems to give more correct answers in domains where human overseers can’t easily judge results and there is no other source of end-to-end feedback during training. That may involve giving humans better tools, studying and improving generalization from domains that do have feedback, or other methods.
  • Anything that addresses clear examples of alignment failures, for which we have good reasons to believe that models “know” things they aren’t telling us, or “know” what we want them to do but nevertheless do something else. Many of these will fall into the first two categories, but it’s also interesting to fix more mundane failures (e.g. obvious untruths) if they can be clearly identified as alignment problems.
  • Creating in vitro examples of problems analogous to the ones that will ultimately kill us, e.g. by showing agents engaging in treacherous turns due to reward hacking or exhibiting more and more of the core features of deceptive alignment.
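The first item in this list mentions failure rates that are small relative to fine-tuning dataset size. One rough way to see why that regime is hard is the standard "rule of three": if you observe zero failures in n independent trials, the 95% upper confidence bound on the failure rate is still about 3/n. The sketch below is an illustration under that independence assumption and is not an argument made in the post; the function name and sample sizes are hypothetical.

```python
def failure_rate_upper_bound(n_trials, confidence=0.95):
    """Largest failure rate p that is still consistent, at the given confidence,
    with seeing zero failures in n_trials independent trials.

    Solves (1 - p) ** n_trials = 1 - confidence for p.
    """
    return 1.0 - (1.0 - confidence) ** (1.0 / n_trials)

for n in (1_000, 10_000, 100_000):
    bound = failure_rate_upper_bound(n)
    print(f"n={n}: zero observed failures still allows a rate up to {bound:.2e}")

bound_10k = failure_rate_upper_bound(10_000)
```

So plain sampling on a dataset of size n cannot by itself certify failure rates much below roughly 1/n, which is one reason techniques like adversarial training are relevant to this item.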
  1. ^

    I would wildly guess that my involvement in RLHF and early language model training at OpenAI from 2017-2020 put me in the top 100 people accelerating AI progress but not in the top 10; I'd wildly guess that I accelerated progress by a few tenths of a percent during this period, and perhaps cut down timelines to powerful AI by a few days. I think there's room for debate one way or the other on that.

    In some sense this is a big acceleration and it's wrong to write it off as "not that important." But I think accelerating a ChatGPT-style wakeup by a week is not a major cost (in addition to being plausibly positive, there just wasn't that much AI-risk-reducing activity happening per week in the world of 2018).

    I also continue to think that RLHF is great, but that people overestimate (and misunderstand in all kinds of wild directions) the practical impact that it actually has on system behavior relative to the counterfactual training techniques.

    (I added this footnote long after the post was written, reacting to different people interpreting the post in very different ways, e.g. Oliver's comments below and Michael Nielsen's here.)

101 comments

RLHF is just not that important to the bottom line right now. Imitation learning works nearly as well, other hacky techniques can do quite a lot to fix obvious problems, and the whole issue is mostly second order for the current bottom line.

I am very confused why you think this, just right after the success of Chat-GPT, where approximately the only difference from GPT-3 was the presence of RLHF. 

My current best guess is that Chat-GPT alone, via sparking an arms race between Google and Microsoft, and by increasing OpenAI's valuation, should be modeled as the equivalent of something on the order of $10B of investment into AI capabilities research, completely in addition to the gains from GPT-3. 

And my guess is most of that success is attributable to the work on RLHF, since that was really the only substantial difference between Chat-GPT and GPT-3. We also should not think this was overdetermined, since 1.5 years passed between the release of GPT-3 and the release of Chat-GPT (with some updates to GPT-3 in the meantime, but my guess is no major ones), and no other research lab focused on capabilities had set up their own RLHF pipeline (except Anthropic, which I don't think makes...

I am very confused why you think this, just right after the success of Chat-GPT, where approximately the only difference from GPT-3 was the presence of RLHF. 

I think the qualitative difference between the supervised tuning done in text-davinci-002 and the RLHF in text-davinci-003 is modest (e.g. I've seen head-to-head comparisons suggesting real but modest effects on similar tasks).

I think the much more important differences are:

  1. It was trained to interact directly with the end user as a conversational assistant rather than in an API intended to be used by developers.
  2. It was deployed in a way that made it much easier for more people to interact with it.
  3. People hadn't appreciated progress since GPT-3, or even how good GPT-3 was, and this went viral (due to a combination of 1+2).
  4. If there are large capability differences I expect they are mostly orthogonal improvements.

I think the effect would have been very similar if it had been trained via supervised learning on good dialogs.

My current best guess is that Chat-GPT alone, via sparking an arms race between Google and Microsoft, and by increasing OpenAI's valuation, should be modeled as the equivalent of something on the order of $10B

...

I think the effect would have been very similar if it had been trained via supervised learning on good dialogs

I don't currently think this is the case, and this seems like the likely crux. In general it seems that RLHF is substantially more flexible in what kind of target task it allows you to train for, which is the whole reason you are working on it, and at least my model of the difficulty of generating good training data for supervised learning here is that it would have been a much greater pain, and would have been much harder to control in various fine-grained ways (including preventing the AI from saying controversial things), which had been the biggest problem with previous chat bot attempts.

For ChatGPT in particular, I think it was built by John Schulman's team

I find a comparison with John Schulman here unimpressive if you want to argue progress on this was overdetermined, given John's safety motivation, and given my best guess that if you had argued forcefully that RLHF was pushing on commercialization bottlenecks, John would indeed not have worked on it.

Seeing RLHF teams in other organizations not directly downstream of your organizational involvement,...

I don't currently think this is the case, and this seems like the likely crux. In general it seems that RLHF is substantially more flexible in what kind of target task it allows you to train for, which is the whole reason you are working on it, and at least my model of the difficulty of generating good training data for supervised learning here is that it would have been a much greater pain, and would have been much harder to control in various fine-grained ways (including preventing the AI from saying controversial things), which had been the biggest problem with previous chat bot attempts.

I bet they did generate supervised data (certainly they do for InstructGPT), and supervised data seems way more fine-grained in what you are getting the AI to do. It's just that supervised fine-tuning is worse.

I think the biggest problem with previous chat-bot attempts is that the underlying models are way way weaker than GPT-3.5.

I don't think so, and have been trying to be quite careful about this. Chat-GPT is just by far the most successful AI product to date, with by far the biggest global impact on AI investment and the most hype. I think $10B being downstream of that isn't that crazy. The prod

...

How much total investment do you think there is in AI in 2023?

My guess is total investment was around the $200B - $500B range, with about $100B of that into new startups and organizations, and around $100-$400B of that in organizations like Google and Microsoft outside of acquisitions. I have pretty high uncertainty on the upper end here, since I don't know what fraction of Google's revenue gets reinvested again into AI, how much Tesla is investing in AI, how much various governments are investing, etc.

How much variance do you think there is in the level of 2023 investment in AI? (Or maybe whatever other change you think is equivalent.)

Variance between different years, depending on market conditions and how much products take off, seems to be on the order of 50% to me. Like, different years have pretty hugely differing levels of investment.

My guess is about 50% of that variance is dependent on different products taking off, how much traction AI is getting in various places, and things like Chat-GPT existing vs. not existing. 

So this gives around $50B - $125B of variance to be explained by product-adjacent things like Chat-GPT.
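The arithmetic behind that range can be reproduced as below. This is only a restatement of the numbers in this comment, and it assumes (as the comment seems to) that "variance" is being used loosely to mean a fraction of total dollars, not a statistical variance. The variable names are hypothetical.

```python
# Estimated total 2023 AI investment range from the comment above, in dollars.
total_low, total_high = 200e9, 500e9

# ~50% year-to-year swing, of which ~50% is attributed to product-adjacent factors.
year_to_year_share = 0.5
product_share = 0.5

product_low = total_low * year_to_year_share * product_share
product_high = total_high * year_to_year_share * product_share

print(f"${product_low / 1e9:.0f}B - ${product_high / 1e9:.0f}B")
```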

How much influence are you giving to GPT-3, GPT-3.5, GP

...

I didn't realize how broadly you were defining AI investment. If you want to say that e.g. ChatGPT increased investment by $10B out of $200-500B, so like +2-5%, I'm probably happy to agree (and I also think it had other accelerating effects beyond that).

I would guess that a 2-5% increase in total investment could speed up AGI timelines 1-2 weeks depending on details of the dynamics, like how fast investment was growing, how much growth is exogenous vs endogenous, diminishing returns curves, importance of human capital, etc. If you mean +2-5% investment in a single year then I would guess the impact is < 1 week.

I haven't thought about it much, but my all things considered estimate for the expected timelines slowdown if you just hadn't done the ChatGPT release is probably between 1-4 weeks.

Is that the kind of effect size you are imagining here? I guess the more important dynamic is probably more people entering the space rather than timelines per se?

One thing worth pointing out in defense of your original estimate is that variance should add up to 100%, not effect sizes, so e.g. if the standard deviation is $100B then you could have 100 things each explaining ($10B)^2 of variance (and hence each responsible for +-$10B effect sizes after the fact).
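Two bits of arithmetic from this comment can be checked directly: the "+2-5%" share, and the point that variances (not effect sizes) add up. The sketch assumes the 100 factors are independent, which is what makes their variances additive; the variable names are hypothetical.

```python
import math

# Share of total investment represented by a $10B increase.
increase = 10e9
share_of_high_total = increase / 500e9  # about 2%
share_of_low_total = increase / 200e9   # about 5%

# Variances of independent factors add; standard deviations do not.
n_factors = 100
per_factor_sd = 10e9                           # each factor moves investment by about +-$10B
total_variance = n_factors * per_factor_sd ** 2
total_sd = math.sqrt(total_variance)           # combined standard deviation
naive_sum_of_effects = n_factors * per_factor_sd

print(f"share: {share_of_high_total:.0%} to {share_of_low_total:.0%}")
print(f"combined standard deviation: ${total_sd / 1e9:.0f}B")
print(f"naive sum of effect sizes:   ${naive_sum_of_effects / 1e9:.0f}B")
```

Under these assumptions, a hundred independent factors of about $10B each are consistent with a total standard deviation of $100B, even though simply adding their effect sizes would give $1,000B.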

Makes sense, sorry for the miscommunication. I really didn't feel like I was making a particularly controversial claim with the $10B, so was confused why it seemed so unreasonable to you.  I do think those $10B are going to be substantially more harmful for timelines than other money in AI, because I do think a good chunk of that money will much more directly aim at AGI than most other investment. I don't know what my multiplier here for effect should be, but my guess is something around 3-5x in expectation (I've historically randomly guessed that AI applications are 10x less timelines-accelerating per dollar than full-throated AGI research, but I sure have huge uncertainty about that number).  That, plus me thinking there is a long tail with lower probability where Chat-GPT made a huge difference in race dynamics, and thinking that this marginal increase in investment does probably translate into increases in total investment, made me think this was going to shorten timelines in expectation by something closer to 8-16 weeks, which isn't enormously far away from yours, though still a good bit higher.  And yeah, I do think the thing I am most worried about with Chat-GPT in addition to just shortening timelines is increasing the number of actors in the space, which also has indirect effects on timelines. A world where both Microsoft and Google are doubling down on AI is probably also a world where AI regulation has a much harder time taking off. Microsoft and Google at large also strike me as much less careful actors than the existing leaders of AGI labs which have so far had a lot of independence (which to be clear, is less of an endorsement of current AGI labs, and more of a statement about very large moral-maze-like institutions with tons of momentum). In general the dynamics of Google and Microsoft racing towards AGI sure are among my least favorite takeoff dynamics in terms of being able to somehow navigate things cautiously.  Oh, yeah, good point. I was indee
Maybe - but Microsoft and Google are huge organizations, and huge organizations have an incentive to push for regulation that imposes costs that they can pay while disproportionately hampering smaller competitors. It seems plausible to me that both M & G might prefer a regulatory scheme that overall slows down progress while cementing their dominance, since that would be a pretty standard regulatory-capture-driven-by-the-dominant-actors-in-the-field kind of scenario. A sudden wave of destabilizing AI breakthroughs - with DALL-E/Midjourney/Stable Diffusion suddenly disrupting art and Chat-GPT who-knows-how-many-things - can also make people on the street concerned and both more supportive of AI regulation in general, as well as more inclined to take AGI scenarios seriously in particular. I recently saw a blog post from someone speculating that this might cause a wide variety of actors - M & G included - with a desire to slow down AI progress to join forces to push for widespread regulation.
Interesting. Where did something like this happen?
I asked Chat-GPT and one of the clearest examples it came up with is patent trolling by large pharmaceutical companies. Their lobbying tends to be far more focused on securing monopoly rights to their products for as long as possible than anything related to innovation. Other examples:
  • Automakers lobbying for restrictive standards for potential market disruptors like electric or self-driving vehicles
  • Telecoms lobbying against Net Neutrality
  • Taxi companies lobbying against ridesharing startups
  • Tech companies lobbying for intellectual property and data privacy regulations that they have better legal/compliance resources to handle
IMO it's much easier to support high investment numbers in "AI" if you consider lots of semiconductor / AI hardware startup stuff as "AI investments". My suspicion is that while GPUs were primarily a crypto thing for the last few years, the main growth outlook driving more investment is them being an AI thing. 
I'd be interested to know how you estimate the numbers here, they seem quite inflated to me. If 4 big tech companies were to invest $50B each in 2023 then, assuming an average salary of $300k and a 2:1 ratio of capital to salary, each investment would mean hiring about 50B/900K = 55,000 people to work on this stuff. For reference the total headcount at these orgs is roughly 100-200K. 50B/yr is also around 25-50% of the size of the total income, and greater than profits for most, which again seems high. Perhaps my capital ratio is way too low but I would find it hard to believe that these companies can meaningfully put that level of capital into action so quickly. I would guess more on the order of $50B between the major companies in 2023. Agree with Paul's comment above that timeline shifts are the most important variable.
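The headcount estimate in this comment works out as follows. The sketch just restates the commenter's own assumptions ($300k average salary, two dollars of capital per dollar of salary); the variable names are hypothetical.

```python
investment_per_company = 50e9  # dollars per year
average_salary = 300e3         # dollars per person per year
capital_per_salary = 2         # the 2:1 capital-to-salary ratio

# Each hire costs salary plus two times salary in capital: $900k.
cost_per_person = average_salary * (1 + capital_per_salary)
implied_headcount = investment_per_company / cost_per_person

print(f"cost per person: ${cost_per_person / 1e3:.0f}k")
print(f"implied headcount per company: {implied_headcount:,.0f}")
```

The exact quotient is a little above the 55,000 quoted in the comment, which rounds down.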
Ok, I think we might now have some additional data on this debate. It does indeed look to me like Sydney was trained with the next best available technology after RLHF, for a few months, at least based on Gwern's guesses here:  As far as I can tell this resulted in a system with much worse economic viability than Chat-GPT. I would overall describe Sydney as "economically unviable", such that if Gwern's story here is correct, the difference between using straightforward supervised training on chat transcripts and OpenAI's RLHF pipeline is indeed the difference between an economically viable and unviable product.  There is a chance that Microsoft fixes this with more supervised training, but my current prediction is that they will have to fix this with RLHF, because the other technological alternatives are indeed not adequate substitutes from an economic viability perspective, which suggests that the development of RLHF did really matter a lot for this.
Benchmarking on static datasets on ordinary tasks (typically not even adversarially collected in the first place) may not be a good way to extrapolate to differences in level of abuse for PR-sensitive actors like megacorps, especially for abusers that are attacking the retrieval functionality (as Sydney users explicitly were trying to populate Bing hits to steer Sydney), a functionality not involved in said benchmarking at all. Or to put it another way, the fact that text-davinci-003 does only a little better than text-davinci-002 in terms of accuracy % may tell you little about how profitable in $ each will be once 4chan & the coomers get their hands on it... It is not news to anyone here that average-case performance on proxy metrics on some tame canned datasets may be unrelated to out-of-distribution robustness on worst-case adversary-induced decision-relevant losses, in much the same way that model perplexity tells us little about what a model is useful for or how vulnerable it is.
Yeah, this is basically my point. Not sure whether you are agreeing or disagreeing. I was specifically quoting Paul's comment saying "I've seen only modest qualitative differences" in order to disagree and say "I think we've now seen substantial qualitative differences".  We have had 4chan play around with Chat-GPT for a while, with much less disastrous results than what happened when they got access to Sydney. I wish that this not being news to anyone here was true but this does not currently seem true to me. But doesn't seem worth going into.
I was elaborating in more ML-y jargon, and also highlighting that there are a lot of wildcards omitted from Paul's comparison: retrieval especially was an interesting dynamic.
For what it's worth, I buy the claim from Gwern that Microsoft trained Sydney pretty poorly, much worse than is achievable with SFT on highly rated data. For example, Sydney shows significant repetition, which you don't see even on text-davinci-002 or (early 2022) LaMDA, both trained without RLHF. 
Yep, I think it's pretty plausible this is just a data-quality issue, though I find myself somewhat skeptical of this. Maybe worth a bet?  I would be happy to bet that conditional on them trying to solve this with more supervised training and no RLHF, we are going to see error modes substantially more catastrophic than current Chat-GPT. 
My (pretty uninformed) guess here is that supervised fine-tuning vs RLHF has relatively modest differences in terms of producing good responses, but bigger differences in terms of avoiding bad responses. And it seems reasonable to model decisions about product deployments as being driven in large part by how well you can get AI not to do what you don't want it to do.
It depends a lot on the use case.  When it comes to what I'm doing with ChatGPT, I care more about the quality of the best answer when I generate five answers to a prompt than I care about the quality of the worst answer. I can choose the best answer myself and ignore the others.  Many use cases have ways to filter for valuable results either automatically or by letting a human filter.
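The "generate five answers and keep the best" workflow described above can be sketched as best-of-n selection. This is only an illustration: `generate` and `score` are hypothetical stand-ins (for a model call and for a human or automatic filter), and the canned answers and length-based scorer exist purely to make the example runnable.

```python
from typing import Callable, List


def best_of_n(generate: Callable[[str], str],
              score: Callable[[str], float],
              prompt: str,
              n: int = 5) -> str:
    """Draw n candidate answers and return the one the scorer rates highest."""
    candidates: List[str] = [generate(prompt) for _ in range(n)]
    return max(candidates, key=score)


# Hypothetical stand-ins: canned answers instead of a model, length as the "quality" score.
canned = iter(["ok", "a longer answer", "mid one", "x", "fine"])
best = best_of_n(lambda prompt: next(canned), len, "some prompt", n=5)
```

Under this kind of filtering, only the best sample matters to the user, which is why worst-case behavior can matter less for such use cases than for an unfiltered product deployment.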
Note that I never said this, so I am not sure what you are responding to. I said Chat-GPT increases investment in AI by $10B, not that it increased investment into specifically OpenAI. Companies generally don't have perfect moats. Most of that increase in investment is probably in internal Google allocation and in increased investment into the overall AI industry.
Relevant piece of data:  I had some decent probability on this outcome but I have increased my previous estimate of the impact of Chat-GPT by 50%, since I didn't expect something this radical ("the single fastest growing consumer product in history").
That's not always the wrong thing to do - the counterfactual impacts of many actors' actions often sum to more than their total combined impact. A simple example would be two co-founders of an impactful company, neither of whom would have founded it without the other. Then the sum of their counterfactual impacts is 2 times the total impact of the company. While I don't have an opinion on this particular case, you could imagine that additional AI investment may not have happened if either of the following were true: 1. The original RLHF proof of concept from OpenAI didn't happen - because Google's leadership wouldn't have had the incentive for further investment. 2. Google's leadership were different - because they may not have thought to invest more money in AI.
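The co-founder arithmetic can be checked with a tiny sketch. The company value of 100 and the founder names "A" and "B" are made-up illustrative numbers, and `company_value` is a hypothetical function encoding the assumption that the company only exists if both founders join.

```python
def company_value(founders: frozenset) -> float:
    """Assumed toy model: the company is worth 100 only if both founders are present."""
    return 100.0 if founders == frozenset({"A", "B"}) else 0.0


def counterfactual_impact(person: str, team: frozenset) -> float:
    """Value with the person minus value of the same team without them."""
    return company_value(team) - company_value(team - {person})


team = frozenset({"A", "B"})
impacts = {p: counterfactual_impact(p, team) for p in sorted(team)}
total_counterfactual = sum(impacts.values())
```

Each founder's counterfactual impact equals the full company value, so the counterfactual impacts add up to twice the total.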

my guess is most of that success is attributable to the work on RLHF, since that was really the only substantial difference between Chat-GPT and GPT-3

I don't think this is right -- the main hype effect of chatGPT over previous models feels like it's just because it was in a convenient chat interface that was easy to use and free. My guess is that if you did a head-to-head comparison of RLHF and kludgey random hacks involving imitation and prompt engineering, they'd seem similarly cool to a random journalist / VC, and generate similar excitement.

I don't think this is right -- the main hype effect of chatGPT over previous models feels like it's just because it was in a convenient chat interface that was easy to use and free.

I don't have extensive relevant expertise, but as a personal datapoint: I used Davinci-002 multiple times to generate an interesting dialogue in order to test its capabilities. I ran several small-scale Turing tests, and the results were quite unimpressive in my opinion. When ChatGPT came out, I tried it out (on the day of its release) and very quickly felt that it was qualitatively better at dialogue. Of course, I could have simply been prompting Davinci-002 poorly, but overall I'm quite skeptical that the main reason for ChatGPT hype was that it had a more convenient chat interface than GPT-3.

Quintin Pope (1y):
I've felt that ChatGPT was roughly on par with text-davinci-003, though much more annoying and with a worse interface.

That makes sense. However, Davinci-003 came out just a few days prior to ChatGPT. The relevant transition was from Davinci-002 to Davinci-003/ChatGPT.

Yep, and text-davinci-002 was trained with supervised finetuning / written demos, while 003 was trained with RLHF via PPO. Hypothetically, the clearest illustration of RLHF's capabilities gains should be from comparing 002 to 003. However, OpenAI could have also used other methods to improve 003, such as those described in "Transcending Scaling Laws with 0.1% Extra Compute".

This page also says that:

Our models generally used the best available datasets at the time of training, and so different engines using the same training methodology might be trained on different data.

So I guess 003 could also have different base pretraining data?

[edit: this says the same thing as Quintin's sibling comment]

Important context for those who don't know it: the main difference between text-davinci-002 and text-davinci-003 is that the latter was trained with PPO against a reward model, i.e. RLHF as laid out in the InstructGPT paper. (Source: OpenAI model index.)

In more detail, text-davinci-002 seems to have been trained via supervised fine-tuning on the model outputs which were rated highest by human reviewers (this is what the model index calls FeedME). The model index only says that text-davinci-003 was trained via PPO against a reward model, but this was after SFT on human demonstrations, and might have also been after FeedME training.

(Aside: the terminology "RLHF" is starting to become confusing, as some people use it narrowly to mean "PPO against a reward model" and others use it more broadly to mean "using any RL technique with a reward signal given by human reviewers," which would include FeedME.)

Erik Jenner (1y):
Sorry for getting off track, but I thought FeedME did not use RL on the final model, only supervised training? Or do you just mean that the FeedME-trained models may have been fed inputs from models that had been RL-finetuned (namely the one from the InstructGPT paper)? Not sure if OpenAI said anywhere whether the latter was the case, or whether FeedME just uses inputs from non-RL models.
Sam Marks (1y):
This is just a terminological difference: supervised fine-tuning on highly rated outputs is a type of RL. (At least according to how many people use the term.)
Got a source for that? This seems like an odd way to use the term, in particular because with supervised fine-tuning there's no credit assignment over time, and so it doesn't train the model to actually aim towards high-reward states.
Sam Marks (1y):
To be clear, I'm not classifying all uses of SFT as RL (for example, I would not call SFT on human expert demonstrations RL). It's specifically SFT on highly-rated model outputs -- i.e. having the model produce a bunch of rollouts, labeling them with rewards, training the model to imitate the top-rewarded rollouts, and repeating -- which I'm calling RL here. Note that this training process does aim the model towards high-reward, and is very similar to the online decision transformer, which is typically classed as an RL technique.  So I still feel that the way I used the term "RL" was in line with normal usage. But if people still disagree now that I've explained myself in more detail, I'd be interested in hearing why.
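The loop described above (produce rollouts, label them with rewards, imitate the top-rewarded ones, repeat) can be sketched on a toy problem. Everything here is an assumption made for illustration: the "policy" is just a categorical distribution over three canned outputs, the reward table is invented, and the imitation step is a simple interpolation toward the empirical distribution of the kept rollouts rather than gradient-based fine-tuning.

```python
import random

# Hypothetical rewards a human rater might assign to each possible output.
REWARDS = {"bad": 0.0, "ok": 1.0, "good": 2.0}


def sample(policy, rng, n):
    """Draw n rollouts from a categorical policy {output: probability}."""
    outputs = list(policy)
    weights = [policy[o] for o in outputs]
    return rng.choices(outputs, weights=weights, k=n)


def filtered_sft_round(policy, rng, n_rollouts=200, keep_frac=0.25, step=0.5):
    """One round: sample, keep the top-rated rollouts, move the policy toward them."""
    rollouts = sample(policy, rng, n_rollouts)
    rollouts.sort(key=REWARDS.get, reverse=True)
    top = rollouts[: int(n_rollouts * keep_frac)]
    # "Imitation" target: the empirical distribution of the kept rollouts.
    target = {o: top.count(o) / len(top) for o in policy}
    return {o: (1 - step) * policy[o] + step * target[o] for o in policy}


def train(rounds=5, seed=0):
    rng = random.Random(seed)
    policy = {o: 1 / len(REWARDS) for o in REWARDS}  # start uniform
    for _ in range(rounds):
        policy = filtered_sft_round(policy, rng)
    return policy


policy = train()
```

In this toy setting the policy concentrates on the highest-reward output over a few rounds, which is the sense in which the process "aims the model towards high reward" even though each step is plain imitation.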
Two central features of RL in my mind, which distinguish it from imitation learning:
* Receiving reward in a given state makes the policy more likely to navigate to that state in general (not just via the specific pathway in which it happened to reach that state) - i.e. there's efficient credit assignment through time.
* (In theory) small differences in reward can lead to big differences in behavior, i.e. there's mode collapse to the highest-expected-reward policy.

Q-learning is therefore a central example of RL, alongside actor-critic algorithms. Online REINFORCE has very dumb credit assignment, but it does eventually lead to mode collapse to the highest-expected-reward policy. So I count this as... like 75% RL, but a less central example than Q-learning.

Online high-rated SFT also has poor credit assignment, in a similar way to online REINFORCE. Meanwhile, whether or not it converges to the highest-reward policy depends on how the ratings are generated. If there's a bucket of high-reward trajectories such that all sufficiently-good trajectories go in there, then it'll never learn to do better than a typical trajectory from that bucket. This feels more like online imitation learning (e.g. stuff like DAgger) which people don't call RL. By contrast, if there's an underlying "true" reward function and the probability that a trajectory is highly-rated depends (monotonically) on its true reward, then eventually it'll converge to only ever taking the highest-reward trajectories, which feels more centrally RL to me.

Idk how much sense this makes, it all feels a bit fuzzy. My immediate conclusion is that we should mostly care about the three traits of "online", "state-wise credit assignment" and "converges to sharp optimum" separately, rather than trying to figure out which combination of them counts as RL (except that anything with state-wise credit assignment is definitely RL).
Sam Marks (1y):
I appreciate your clear articulation of the point about incentivizing the agent to navigate to high-reward states in a trajectory-independent way (in contrast to learning to produce trajectories like those which historically got high reward). That said, I'm confused about how you've labeled the methods you mention as having vs. not having this property. To make sure we're on the same page, suppose we're in an environment with a state s∗ which is high reward, and suppose that there are two ways to get to state s∗: via the two trajectories (s,a,s∗) and (s′,a′,s∗). Suppose further that historically the agent has only navigated to this state via the former trajectory (s,a,s∗). I agree that if the agent was trained via REINFORCE and finds itself in state s′ that it might not know to take action a′ (because it's only been reinforced to take action a from state s, and not to reach state s∗; and also because it might not know that a′ would transition it to state s∗).  But this also seems true if the agent were trained via Q-learning with a Q-function Q(s,a): the Q-function need not have learned that Q(s′,a′) is large, only that Q(s,a) is large.  In either the REINFORCE or the Q-learning case, once the agent sees a trajectory (s′,a′,s∗), it will make an update towards taking action a′ from state s′, but the size of the update seems to depend on details about the network implementing the policy or Q-function -- if there's some obvious reason that the Q-learner will necessarily make a larger update, I've missed it. I think the above also applies in the case of actor-critic methods where the critic is implemented by a Q-function. And I think it still applies even if the critic is a value function V(s), but I'm less confident: the critic has the assumption baked in that rewards come only from states, but the actor still doesn't, so this might have similar dynamics to REINFORCE. (And if it ends up that this does do better, it's only by baking in an assumption about the envir
I think your example is too simple to capture the relevant phenomenon. Here's one which does: suppose state s3 gives high reward, state s4 gives medium reward, and state s5 gives low reward. You've seen the following trajectories:
s2 -> s3
s1 -> s4
s1 -> s2 -> s5
Then Q-learning will learn quickly that it should go s1 -> s2 -> s3, whereas REINFORCE and SFT will need to do further exploration before learning that. I feel uncertain about how to think about the implications of this claim in the context of more complex environments, though. In some sense it only happens because Q-learning is doing a one-step lookahead, which isn't really scalable. (That also isn't true of all critics.) It feels like I might have just come up with a new name for "RL algorithms which work on offline data", which is presumably not a crucial distinction.
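The three-trajectory example above can be worked through in a few lines. This is a sketch under stated assumptions: the reward values (10, 5, 1, and 0 for s2) are invented to stand for "high", "medium", "low"; an action is modeled as "move to the next state"; and the return-averaging learner is only a crude stand-in for what REINFORCE or SFT on highly rated trajectories can extract from this fixed data, not an implementation of either.

```python
# Reward received on entering each state (illustrative values, not from the thread).
REWARD = {"s2": 0.0, "s3": 10.0, "s4": 5.0, "s5": 1.0}
TRAJECTORIES = [["s2", "s3"], ["s1", "s4"], ["s1", "s2", "s5"]]
GAMMA = 1.0


def q_learning(trajectories, sweeps=10):
    """Tabular Q-learning on the logged transitions, bootstrapping from next-state values."""
    q = {}
    transitions = [(t[i], t[i + 1]) for t in trajectories for i in range(len(t) - 1)]
    for _ in range(sweeps):
        for state, nxt in transitions:
            # Best known value available from the next state (0 if nothing is known).
            future = max((v for (s, _), v in q.items() if s == nxt), default=0.0)
            q[(state, nxt)] = REWARD[nxt] + GAMMA * future
    return q


def observed_returns(trajectories):
    """Average return actually observed after each (state, action) in the data."""
    seen = {}
    for t in trajectories:
        for i in range(len(t) - 1):
            ret = sum(REWARD[s] for s in t[i + 1:])
            seen.setdefault((t[i], t[i + 1]), []).append(ret)
    return {k: sum(v) / len(v) for k, v in seen.items()}


def greedy(values, state):
    """Pick the action (next state) with the highest value from the given state."""
    options = {a: v for (s, a), v in values.items() if s == state}
    return max(options, key=options.get)


q_choice = greedy(q_learning(TRAJECTORIES), "s1")         # stitches s1 -> s2 with s2 -> s3
mc_choice = greedy(observed_returns(TRAJECTORIES), "s1")  # only saw s1 -> s2 end in s5
```

The Q-learner prefers s1 -> s2 because it propagates the value of s2 -> s3 backwards, while the learner that only credits actions with the returns that actually followed them prefers s1 -> s4 until it explores further.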
Sam Marks (1y):
Ah, nice example! I now see your point, and I agree with everything you wrote. Whereas REINFORCE and SFT only incentivize actions which in fact were historically part of high-reward trajectories, Q-learning and actor-critic incentivize actions which comprise trajectories that one can infer would be high-reward (even if those actions never actually appeared in high-reward trajectories previously). 
Flagging that I would find that use of the term super confusing.
To throw in another perspective, I've been working with the OpenAI API models most days of the week for the past year or so. For my uses, the step-change in quality came from moving from base davinci to text-davinci-002, whereas the improvements moving from that to text-davinci-003 were decidedly less clear.
Quintin Pope (1y):
I agree the difference between base and 002 is bigger than the difference between 002 and 003. The base model needs to be carefully coaxed into a scenario where plausible continuations of the prompt align with your intended output, and even then it's very inclined to repeat stuff and degenerates quickly. By contrast, you can just tell 002 what to do, and it will usually at least try to do what you say.
Seems like you're implying that davinci is the base model for 002 and 003. That's not the case; davinci has one base model (GPT-3) and then 002 and 003 share a different base model (GPT-3.5).
Fair. I think the crucial question to Ajeya & Matthew's discussion of "Why the hype now?" is exactly how much worse the non-RLHF models that had been available since at least last March (davinci, code-davinci-002, text-davinci-002) actually were than the RLHF models made available just recently (text-davinci-003 and ChatGPT's underlying model). I stand by the opinion that, besides the new chat stuff, most of the improvement happened within the old cohort, rather than between cohorts, so I attribute the recent hype to the convenient and free chat interface.

People seem pretty impressed with CharacterAI, which seems to get most of its character-specific info from prompting and having finetuned on roleplay dialog. However, it's also possible that CharacterAI's base models are RLHF'd to be consistent roleplayers.

Would love to learn more about the model(s) behind CharacterAI. Anyone know if there's publicly available information on them?
I think the part where it has a longer memory/coherence feels like a major shift (having gotten into the flow of experimenting with GPT3 in the month prior to chatGPT, I felt like the two interfaces were approximately as convenient). I don't know what mechanism was used to generate the longer coherence though.
I don't think this is related to RLHF.
At least ChatGPT seems to have a longer context window, this experiment suggesting 8192 tokens.

Thanks for this post! I wanted to write a post about my disagreements with RLHF in a couple weeks, but your treatment is much more comprehensive than what I had in mind, and from a more informed standpoint.

I want to explain my position on a couple points in particular though - they would've been a central focus of what I imagined my post to be, points around which I've been thinking a lot recently. I haven't talked to a lot of people about this explicitly so I don't have high credence in my take, but it seems at least worth clarifying.

RLHF is less safe than imitation or conditioning generative models.

My picture on why taking ordinary generative models and conditioning them to various ends (like accelerating alignment, for example) is useful relies on a key crux that the intelligence we're wielding is weighted by our world prior. We can expect it to be safe insofar as things normally sampled from the distribution underlying our universe are, modulo arbitrarily powerful conditionals (which degrade performance to an extent anyway) while moving far away from the default world state.

So here's one of my main reasons for not liking RLHF: it removes this very satisfying property. Models tha...

I think Janus' post on mode collapse is basically just pointing out that models lose entropy across a wide range of domains. That's clearly true and intentional, and you can't get entropy back just by turning up temperature. The other implications about how RLHF changes behavior seem like they either come from cherry-picked and misleading examples or are just not backed by data or stated explicitly.

So, using these models now comes with the risk that when we really need them to work for pretty hard tasks, we don't have the useful safety measures implied by being weighted by a true approximation of our world.

If predicting webtext is a good way to get things done, people can do that. But probably it isn't, and so people probably won't do that unless you give them a good reason.

That said, almost all the differences that Janus and you are highlighting emerge from supervised fine-tuning. I don't know in what sense "predict human demonstrators" is missing an important safety property from "predict internet text," and right now it feels to me like kind of magical thinking.

The main way I can see it going is that you can condition the webtext model on other things like "there is a fu...

I think I agree with this being the most object-level takeaway; my take then would primarily be about how to conceptualize this loss of entropy (where and in what form) and what else it might imply. I found the "narrowing the prior" frame rather intuitive in this context. I agree that everything I said above qualitatively applies to supervised fine-tuning as well. As I mentioned in another comment, I don't expect the RL part to play a huge role until we get to wilder applications. I'm worried about RLHF more because I expect it to be scaled up a lot more in the future, and plausibly does what fine-tuning does better (this is just based on how more recent models have shifted to using RLHF instead of ordinary fine-tuning). I don't think "predict human demonstrators" is how I would frame the relevant effect from fine-tuning. More concretely, what I'm picturing is along the lines of: If you fine-tune the model such that continuations in a conversation are more polite/inoffensive (where this is a stand-in for whatever "better" rated completions are), then you're not learning the actual distribution of the world anymore. You're trying to learn a distribution that's identical to ours except in that conversations are more polite. In other words, you're trying to predict "X, but nicer". The problem I see with this is that you aren't just affecting this in isolation, you're also affecting the other dynamics that these interact with. Conversations in our world just aren't that likely to be polite. Changing that characteristic ripples out to change other properties upstream and downstream of that one in a simulation. Making this kind of change seems to lead to rather unpredictable downstream changes. I say seems because -  - This is interesting. Could you elaborate on this? I think this might be a crux in our disagreement. I don't think the safety loss (at least the part I'm referring to here) comes from the first-order effects of predicting something else. It's the second

I mostly care about how an AI selected to choose actions that lead to high reward might select actions that disempower humanity to get a high reward, or about how an AI pursuing other ambitious goals might choose low loss actions instrumentally and thereby be selected by gradient descent. 

Perhaps there are other arguments for catastrophic risk based on the second-order effects of changes from fine-tuning rippling through an alien mind, but if so I either want to see those arguments spelled out or more direct empirical evidence about such risks.

One consequence downstream of this that seems important to me in the limit: 1. Nonconditioning fine-tuned predictor models make biased predictions. If those biases happen to take the form of a misaligned agent, the model itself is fighting you. 2. Conditioned predictor models make unbiased predictions. The conditioned sequence could still represent a misaligned agent, but the model itself is not fighting you. I think having that one extra layer of buffer provided by 2 is actually very valuable. A goal agnostic model (absent strong gradient hacking) seems more amenable to honest and authentic intermediate reporting and to direct mechanistic interpretation.
Just a note here: I would not interpret fine-tuned GPTs as still "predicting" tokens. Base models predict tokens by computing a probability distribution conditional on the prompt, but for fine-tuned models this distribution no longer represents probabilities, but some "goodness" relative to the fine-tuning, how good the continuation is. Tokens with higher numbers are then not necessarily more probable continuations of the prompt (though next token probability may also play a role) but overall "better" in some opaque way. We hope that what the model thinks is a better token for the continuation of the prompt corresponds to the goals of being helpful, harmless and honest (to use the Anthropic terminology), but whether the model has really learned those goals, or merely something which looks similar, is ultimately unknown. So RLHF (and equally supervised fine-tuning) also leads to a lack of interpretability. It is unknown what exactly an instruction model like ChatGPT or text-davinci-003 optimizes for. In contrast to this, we know pretty exactly what a base model is optimized for: next token prediction.
You know exactly what both models are optimized for: log loss on the one hand, an unbiased estimator of reward on the other. You don't know what either model is optimizing: how would you? In both cases you could guess that they may be optimizing something similar to what they are optimized for.
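For readers who want the two training objectives named above written out, here is a hedged sketch in standard notation. The symbols (p_θ for the language model, π_θ for the fine-tuned policy, R for the reward, D for the text corpus) are labels chosen for this illustration, and the second line is the idealized score-function form; practical pipelines such as PPO add clipping and a KL penalty on top of it.

```latex
% Pretraining: log loss on next-token prediction over text x drawn from a corpus D
\mathcal{L}_{\text{pretrain}}(\theta)
  = -\,\mathbb{E}_{x \sim D}\Big[\sum_{t} \log p_\theta(x_t \mid x_{<t})\Big]

% RL fine-tuning: expected reward, whose gradient is estimated without bias
% by the score-function (REINFORCE) estimator
J(\theta) = \mathbb{E}_{y \sim \pi_\theta(\cdot \mid x)}\big[R(x, y)\big],
\qquad
\nabla_\theta J(\theta)
  = \mathbb{E}_{y \sim \pi_\theta(\cdot \mid x)}\big[R(x, y)\,\nabla_\theta \log \pi_\theta(y \mid x)\big]
```

Both expressions say what the weights are selected for; neither says what computation the resulting model performs internally, which is the distinction being drawn between "optimized for" and "optimizing".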
This relates to what you wrote in the other thread: I think the difference is that a base language model is trained on vast amounts of text, so it seems reasonable that it is actually quite good at next token prediction, while the fine-tuning is apparently done with comparatively tiny amounts of preference data. So misalignment seems much more likely in the latter case. Moreover, human RLHF raters are probably biased in various ways, which encourages the model to reproduce those biases, even if the model doesn't "believe them" in some sense. For example, some scientists have pointed out that ChatGPT gives politically correct but wrong answers to certain politically taboo but factual questions. (I can go into more detail if required.) Whether the model is honest here and in fact "believes" those things, or whether it is deceptive and just reproduces rater bias rather than being honest, is unknown. So learning to predict webtext from large amounts of training data, and learning some kind of well-aligned utility function from a small number of (biased) human raters seem problems of highly uneven difficulty and probability of misalignment.
Agreed, though I do find framing them as a warped predictor helpful in some cases. In principle, the deviation from the original unbiased prediction over all inputs should include within it all agentic behaviors, and there might exist some way that you could extract goals from that bias vector. (I don't have anything super concrete here and I'm not super optimistic that this framing gives you anything extra compared to other interpretability mechanisms, but it's something I've thought about poking.)
Evan R. Murphy (1y):
What do you mean when you say the model is or is not "fighting you"?
I mean a model "fights" you if the model itself has goals and those goals are at odds with yours. In this context, a model cannot "fight" you if it does not have goals. It can still output things which are bad for you, like an agentic simulacrum that does fight you. I suspect effective interventions are easier to find when dealing with a goal agnostic model simulating a potentially dangerous agent, compared to a goal-oriented model that is the potentially dangerous agent.
In both cases the model produces actions that are expected to have certain kinds of effects. Could you spell out what kind of "fighting" happens, or what kind of "intervention" is possible when you are merely conditioning your model and not fine-tuning it? I haven't engaged much with this kind of thinking on LW or the broader safety community, but right now I don't really get it and it feels like anthropomorphizing or magical thinking.

I'll start with a pretty uncontroversial example that's neither RLHF nor conditioning but tries to point at a shared intuition; two different models:
1. LLM fine tuned with RL, where reward comes from some kind of activation-reading truth probes.
2. LLM that trains on the output of the first model to the point where it ~perfectly matches its final output, but does not undergo any additional fine tuning.

Despite having identical final outputs, I would expect the first model to have higher probe-reported truthiness because it was optimized against that metric.

With the way I was using the word "fighting", I would say that the first model is fighting you (a little bit), and the second one isn't. The first model itself has learned adversarial weights that directly interfere with efforts to understand it.

Next, an impractical and extreme example, again with two models:
1. LLM fine tuned with RLHF for apparent honesty, but (for the purposes of the hypothetical) it ended up deceptive somehow.
2. "LLM" operating at an intractably low level of simulation, closer to physics, without fine tuning, which was conditioned to output a sequence which maps to the exact same deceptive behavior as the first ...

Sam Marks (1y):
Regarding your points on agentic simulacra (which I assume means "agentic personas the language model ends up imitating"): 1) My best guess about why Anthropic's model expressed self-preservation desires is the same as yours: the model was trying to imitate some relatively coherent persona, this persona was agentic, and so it was more likely to express self-preservation desires. 2) But I'm pretty skeptical about your intuition that RLHF makes the "imitating agentic personas" problem worse. When people I've spoken to talk about conditioning-based alternatives to RLHF that produce a chatbot like the one in Anthropic's paper, they usually mean either: (a) prompt engineering; or (b) having the model produce a bunch of outputs, annotating the outputs with how much we liked them, retraining the model on the annotated data, and conditioning the model to producing outputs like the ones we most liked. (For example, we could prefix all of the best outputs with the token "GOOD" and then ask the model to produce outputs which start with "GOOD".) Approach (b) really doesn't seem like it will result in less agentic personas, since I imagine that imitating the best outputs will result in imitating an agentic persona just as much as fine-tuning for good outputs with a policy gradient method would. (Main intuition here: the best outputs you get from the pretrained model will already look like they were written by an agentic persona, because those outputs were produced by the pretrained model getting lucky and imitating a useful persona on that rollout, and the usefulness of a persona is correlated with its agency.) I mostly am skeptical that approach (a) will be able to produce anything as useful as Anthropic's chatbot. But to the extent that it can, I imagine that it will do so by eliciting a particular useful persona, which I have no reason to think will be more or less agentic than the one we got via RLHF. Interested to hear if you have other intuitions here.
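Approach (b) above can be sketched as a data-preparation step. This is a minimal illustration only: the "GOOD" token comes from the comment, but the "BAD" token, the rating threshold, the helper names, and the prompt layout are all assumptions introduced here, and the retraining step itself is omitted.

```python
def annotate(rated_outputs, threshold):
    """Prefix each model output with a control token based on its human rating.

    rated_outputs: list of (text, rating) pairs. "BAD" is a hypothetical label
    for everything that is not among the best outputs.
    """
    return [("GOOD " if rating >= threshold else "BAD ") + text
            for text, rating in rated_outputs]


def conditioned_prompt(user_prompt):
    """At sampling time, ask for a continuation that starts with the GOOD token."""
    return user_prompt + "\nGOOD "


training_texts = annotate([("Happy to help with that.", 0.9),
                           ("Figure it out yourself.", 0.1)], threshold=0.5)
prompt = conditioned_prompt("User: can you help me debug this?")
```

The model would then be retrained on `training_texts` with the ordinary prediction loss, so the preference signal enters through conditioning on the token rather than through a policy gradient.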
I wasn't really focusing on the RL part of RLHF in making the claim that it makes the "agentic personas" problem worse, if that's what you meant. I'm pretty on board with the idea that the actual effects of using RL as opposed to supervised fine-tuning won't be apparent until we use stronger RL or something. Then I expect we'll get even weirder effects, like separate agentic heads or the model itself becoming something other than a simulator (which I discuss in a section of the linked post). My claim is pretty similar to how you put it - in RLHF as in fine-tuning of the kind relevant here, we're focusing the model onto outputs that are generated by better agentic personas. But I think that the effect is particularly salient with RLHF because it's likely to be scaled up more in the future, where I expect said effect to be exacerbated. I agree with the rest of it, that prompt engineering is unlikely to produce the same effect, and definitely not the same qualitative shift of the world prior.
Evan R. Murphy (1y):
Glad to see both the OP as well as the parent comment.  I wanted to clarify something I disagreed with in the parent comment as well as in a sibling comment from Sam Marks about the Anthropic paper "Discovering Language Model Behaviors with Model-Written Evaluations" (paper, post):   Both of these points seem to suggest that the main takeaway from the Anthropic paper was to uncover concerning behaviours in RLHF language models. That's true, but I think it's just as important that the paper also found pretty much the same concerning behaviours in plain pre-trained LLMs that did not undergo RLHF training, once those models were scaled up to a large enough size. 
Thanks! My take on the scaled-up models exhibiting the same behaviours feels more banal - larger models are better at simulating agentic processes and their connection to self-preservation desires etc, so the effect is more pronounced. Same cause, different routes getting there with RLHF and scale.
Sam Marks (1y):
This, broadly-speaking, is also my best guess, but I'd rather phrase it as: larger LMs are better at making the personas they imitate "realistic" (in the sense of being more similar to the personas you encounter when reading webtext). So doing RLHF on a larger LM results in getting an imitation of a more realistic useful persona. And for the helpful chatbot persona that Anthropic's language model was imitating, one correlate of being more realistic was preferring not to be shut down. (This doesn't obviously explain the results on sycophancy. I think for that I need to propose a different mechanism, which is that larger LMs were better able to infer their interlocutor's preferences, so that sycophancy only became possible at larger scales. I realize that to the extent this story differs from other stories people tell to explain Anthropic's findings, that means this story gets a complexity penalty.)
Janus' post on mode collapse is about text-davinci-002, which was trained using supervised fine-tuning on high-quality human-written examples (FeedME), not RLHF. It's evidence that supervised fine-tuning can lead to weird output, not evidence about what RLHF does. I haven't seen evidence that RLHF'd text-davinci-003 appears less safe compared to the imitation-based text-davinci-002.
Refer to my other reply here. And as the post mentions, RLHF also does exhibit mode collapse (check the section on prior work).
Similar points regarding safety of pure imitation learning vs reinforcement learning have been raised by many others on LW. So I'm really interested in what Paul has to say about this.
I haven't engaged with this much, though I've e.g. talked with Evan some about why I'm not as excited about conditioning generative models as a strategy. I'm happy to engage with particular arguments but feel like I don't really know what argument is being made by the parent (or most of the other places I've seen this in passing). I think there is a simple reason imitation is safer: the model won't deliberately produce actions that the demonstrator wouldn't, whereas RLHF may produce actions that are very creative ways to get reward and may be harmful. I don't think this is what people are talking about though (and it wouldn't work for their broader arguments). I think they are imagining a higher probability of deceptive alignment and other generalization problems. I don't think I know the precise articulation of these concerns or the argument for it. On the empirics, sometimes people mention this paper and the RLHF'd model behavior "hey do you want to be shut down? --> no" as evidence of a higher probability of deceptive alignment from RLHF. I don't really think that's a reasonable interpretation of the evidence but if that's a large part of the argument people are making I'd be happy to engage on it.
Charlie Steiner (1y)
As one of the people who's raised such points, I should note that they mostly apply to applications of language models qua language models (which Jozdien correctly does), and that different techniques can be appropriate for different domains.

RLHF is just not that important to the bottom line right now. Imitation learning works nearly as well, other hacky techniques can do quite a lot to fix obvious problems, and the whole issue is mostly second order for the current bottom line. RLHF is increasingly important as time goes on, but it also becomes increasingly overdetermined that people would have done it. In general I think your expectation should be that incidental capabilities progress from safety research is a small part of total progress, given that it’s a small fraction of people, very much not focused on accelerating things effectively, in a domain with diminishing returns to simultaneous human effort. This can be overturned by looking at details in particular cases, but I think safety people making this argument mostly aren’t engaging with details in a realistic way.

I think this argument, if true, mostly says that your work on RLHF must have been net-neutral, because people would have done RLHF even if nobody did it for the purposes of alignment. If false, then RLHF was net-negative because of its capabilities externalities. I also don't buy your argument about relative numbers of people working on capabilitie...

I think this argument, if true, mostly says that your work on RLHF must have been net-neutral, because people would have done RLHF even if nobody did it for the purposes of alignment.

Doing things sooner and in a different way matters.

This argument is like saying that scaling up language models is net-neutral for AGI, because people would have done it anyway for non-AGI purposes. Doing things sooner matters a lot. I think in most of science and engineering that's the main kind of effect that anything has.

If false, then RLHF was net-negative because of its capabilities externalities.

No, if false then it has a negative effect which must be quantitatively compared against positive effects.

Most things have some negative effects (e.g. LW itself).

It is also far easier to make progress on capabilities than alignment

This doesn't seem relevant---we were asking how large an accelerating effect alignment researchers have relative to capabilities researchers (since that determines how many days of speed-up they cause), so if capabilities progress is easier then that seems to increase both numerator and denominator.

especially when you're not trying to make progress on alignment's core problems,

...
I think that there are many kinds of in vitro failures that don't pose any lab leak risk. For example, training models against weak overseers and observing the dynamics when they try to overpower those overseers, doesn't have any effect on increasing takeover risk. Similarly, the kinds of toy models of deceptive alignment we would build don't increase the probability of deceptive alignment. I think this kind of work is pretty much essential to realistic stories for how alignment actually makes progress or how we anticipate alignment failures. This seems wrong. For example, you can get treacherous turns in weak systems if you train them with weak overseers, or if you deliberately take actions that make in vitro treacherous turns more likely, without automatically getting such failures if you are constantly doing your best to make your AIs behave well. I completely disagree. I think having empirical examples of weak AIs overpowering weak overseers, even after a long track record of behaving well in training, would be extremely compelling to most ML researchers as a demonstration that stronger AIs might overpower stronger overseers, even after a long track record of behaving well in training. And whether or not it was persuasive, it would be extremely valuable for doing actually productive research to detect and correct such failures. (The deceptive alignment story is more complicated, and I do think it's less of a persuasive slam dunk, though I still think it's very good for making the story 90% less speculative and creating toy settings to work on detection and correction.) I don't think that most of the work in this category meaningfully increases the probability of lab leaks or misuse (again, the prototypical example is a weak AI overpowering a weak overseer). That said, I am also interested in work that does have real risks, for example understanding how close AI systems are to dangerous capabilities by fine-tuning them for similar tasks. In these cases I th
Really? How so?
Garrett Baker (1y)
I don't know all the details, but the idea was that a thing that mimics humans and was capable would be safer than a thing that did lots of RL in a range of tasks and was powerful, so the creator of the architecture worked on improving text generation.

I don't think this is true. Transformers were introduced by normal NLP researchers at Google. Generative pre-training is a natural thing to do with them, introduced at OpenAI by Alec Radford (blog post here) with no relationship to alignment.

Garrett Baker (1y)
I just looked into it, it turns out you’re right. I think I was given a misimpression of the motivations here due to much OpenAI research at the time being vaguely motivated by “let's make AGI, and let's make it good”, but it was actually largely divorced from modern alignment considerations.
And this is actually pretty reasonable as a strategy, given their general myopia by default and their simulator nature playing well with alignment ideas like HCH. If we could avoid a second optimizer arising, then this scaled up would be nearly ideal for automated research on say alignment. But RLHF ruined it, and this was IMO a good example of a good-looking alignment strategy that wasn't actually good.
I'm not quite clear on what you are saying here. If conditioning generative models is a reasonably efficient way to get work out of an AI, we can still do that. Unfortunately it's probably not an effective way to build an AI, and so people will do other things. You can convince them that other things are less safe and then maybe they won't do other things. Are you saying that maybe no one would have thought of using RL on language models, and so we could have gotten away with a world where we used AI inefficiently because we didn't think of better ideas? In my view (based e.g. on talking a bunch to people working at OpenAI labs prior to me working on RLHF) that was never a remotely plausible outcome. ETA: also just to be clear I think that this (the fictional strategy of developing GPT so that future AIs won't be agents) would be a bad strategy, vulnerable to 10-100x more compelling versions of the legitimate objections being raised in the comments.
Basically, I'm talking about how RLHF removed a very valuable property called myopia. If you had myopia by default, like say the GPT series of simulators, then you just had to apply the appropriate decision theory like LCDT, and the GPT series of simulators could do something like HCH or IDA on real life. But RLHF removed myopia, and thus deceptive alignment and mesa optimization are possible, arguably incentivized under a non-myopic scheme. This is probably harder to solve than having a non-agentic system alignment problem. I'll provide a link below: Now you do mention that RLHF is more capable, and yeah that is sort of depressing that the most capable models align well with the most deceptive models.
I don't think GPT has the sense of myopia relevant to deceptive alignment any more or less than models fine-tuned with RLHF.  There are other bigger impacts of RLHF both for the quoted empirical results and for the actual probability of deceptive alignment, and I think the concept is being used in a way that is mostly incoherent. But I was mostly objecting to the claim that RLHF ruined [the strategy]. I think even granting the contested empirics it doesn't quite make sense to me.
Sorry to respond late, but a crux I might have here is that I see the removal of myopia and the addition of agency/non-causal decision theories as a major negative of an alignment plan by default, and without specific mechanisms of how deceptive alignment/mesa optimizers can't arise, I expect non-myopic training to find such things. In general, the fact that OpenAI chose RLHF made the problem quite a bit harder, and I suspect this is an example of Goodhart's law in action. The Recursive Reward Modeling and debate plans could make up for this, assuming we can solve deceptive alignment. But right now, I see trouble ahead and OpenAI is probably going to be bailed out by other alignment groups.
Why should we think of base GPT as myopic, such that "non-myopic training" can remove that property? Training a policy to imitate traces of "non-myopic cognition" in the first place seems like a way to plausibly create a policy that itself has "non-myopic cognition". But this is exactly how GPT pretraining works.
Huh, I'd not heard that, would be interested in hearing more about the thought process behind its development. Think they could well turn out to be correct in that having systems with such a strong understanding of human concepts gives us levers we might not have had, though code-writing proficiency is a very unfortunate development.

Creating in vitro examples of problems analogous to the ones that will ultimately kill us, e.g. by showing agents engaging in treacherous turns due to reward hacking or exhibiting more and more of the core features of deceptive alignment.


A central version of this seems to straightforwardly advance capabilities. The strongest (ISTM) sort of analogy between a current system and a future lethal system would be that they use an overlapping set of generators of capabilities. Trying to find an agent that does a treacherous turn, for the same reasons as a f...

The main way you produce a treacherous turn is not by "finding the treacherous turn capabilities," it's by creating situations in which sub-human systems have the same kind of motive to engage in a treacherous turn that we think future superhuman systems might have. There are some differences and lots of similarities between what is going on in a weaker AI doing a treacherous turn and a stronger AI doing a treacherous turn. So you expect to learn some things and not others. After studying several such cases it seems quite likely you understand enough to generalize to new cases. It's possible MIRI folks expect a bigger difference in how future AI is produced. I mostly expect just using gradient descent, resulting in minds that are in some ways different and in many ways similar. My sense is that MIRI folks have a more mystical view about the difference between subhuman AI systems and "AGI." (The view "stack more layers won't ever give you true intelligence, there is a qualitative difference here" seems like it's taking a beating every year, whether it's Eliezer or Gary Marcus saying it.)
When you say "motive" here, is it fair to reexpress that as: "that which determines by what method and in which directions capabilities are deployed to push the world"? If you mean something like that, then my worry here is that motives are a kind of relation involving capabilities, not something that just depends on, say, the reward structure of the local environment. Different sorts of capabilities or generators of capabilities will relate in different ways to ultimate effects on the world. So the task of interfacing with capabilities to understand how they're being deployed (with what motive), and to actually specify motives, is a task that seems like it would depend a lot on the sort of capability in question.
I think if you train AI systems to select actions that will lead to high reward, they will sometimes learn policies that behave well until they are able to overpower their overseers, at which point they will abruptly switch to the reward hacking strategy to get a lot of reward. I think there will be many similarities between this phenomenon in subhuman systems and superhuman systems. Therefore by studying and remedying the problem for weak systems overpowering weak overseers, we can learn a lot about how to identify and remedy it for stronger systems overpowering stronger overseers. I'm not exactly sure how to cash out your objection as a response to this, but I suspect it's probably a bit too galaxy-brained for my taste.

So for example, say Alice runs this experiment:

Train an agent A in an environment that contains the source B of A's reward.

Alice observes that A learns to hack B. Then she solves this as follows:

Same setup, but now B punishes (outputs high loss) A when A is close to hacking B, according to a dumb tree search that sees whether it would be easy, from the state of the environment, for A to touch B's internals.
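
A minimal sketch of the kind of check described above, under loud assumptions: the grid world, the names `GRID`, `steps_to_internals`, and `overseer_reward`, and the depth and penalty values are all hypothetical and chosen only for illustration. The "dumb tree search" is modeled here as a depth-limited breadth-first search from A's current position to B's internals.

```python
from collections import deque

# Hypothetical toy world: "#" is a wall, "B" marks the overseer's internals.
GRID = [
    "......",
    ".####.",
    ".#B...",
    ".####.",
    "......",
]
MOVES = [(0, 1), (0, -1), (1, 0), (-1, 0)]


def find_internals(grid):
    """Return the (row, col) of the cell marked "B"."""
    for r, row in enumerate(grid):
        for c, ch in enumerate(row):
            if ch == "B":
                return (r, c)
    raise ValueError("grid has no B cell")


def steps_to_internals(grid, start, max_depth):
    """Depth-limited BFS: moves needed to reach B, or None if more than max_depth."""
    target = find_internals(grid)
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        (r, c), depth = queue.popleft()
        if (r, c) == target:
            return depth
        if depth == max_depth:
            continue  # do not expand beyond the search horizon
        for dr, dc in MOVES:
            nr, nc = r + dr, c + dc
            in_bounds = 0 <= nr < len(grid) and 0 <= nc < len(grid[0])
            if in_bounds and grid[nr][nc] != "#" and (nr, nc) not in seen:
                seen.add((nr, nc))
                queue.append(((nr, nc), depth + 1))
    return None


def overseer_reward(grid, agent_pos, task_reward, max_depth=2, penalty=10.0):
    """B's output: the task reward, minus a penalty when hacking looks easy."""
    if steps_to_internals(grid, agent_pos, max_depth) is not None:
        return task_reward - penalty
    return task_reward
```

Note that this check only inspects A's current state in the environment; it says nothing about what A has learned so far.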

Alice observes that A doesn't hack B. Then Bob looks at Alice's results and says,

"Cool. But this won't generalize to future lethal systems because it doesn't account for how A can combine innocuous understanding that it gains. Future systems, to be very competent, will probably do something functionally equivalent to exploring their environment to understand parts of the environment without necessarily trying to achieve some big goal (such as hacking B) along the way. This creates a 'capabilities overhang' relative to the overseer: there's no behavior that's clearly aimed at something B considers dangerous, but A accumulates ability to put together plans that do more and more effective stuff, compared to what A has actually previously acted out and gotten direct reinforcement on. ...

Avoiding RLHF at best introduces an important overhang

But it would be better if we collectively then decided not to rush forward anyway, right?

And I still don't get why you expect the future environment, where somewhat-aligned superhuman AIs are available, to be better for alignment work. Like, sure, an automatic idea generator and verifier may be useful, but it's also useful for reckless people. And, intuitively, the more advanced AI is, the less I would trust it. So "let's try as hard as we can to advance civilization, because more advanced civilization will be better at alignment" seems like a very risky plan.

Yes, that seems consistent with my post. I mostly think that AI doing research will accelerate both risk and alignment, so we're aiming for it to be roughly a wash. But having nearly-risky AI to study seems incredibly important for doing good alignment work. I think this is a pretty robust bet. That's not the plan. I'm saying to do the work that seems most useful for alignment even if it has modest capability benefits, and that for some kinds of capability benefits the apparent cost is less than you'd think because of these overhang effects.
Yeah, I don't understand why it would be a wash, when destructive capabilities are easier than alignment (humans already figured out nukes, but not alignment) and alignment is expected to be harder for more advanced AI. Even without straight misalignment risk, giving superhuman AI to the current civilization doesn't sound like a stability improvement. So without a specific plan to stop everyone from misusing AI it still sounds safer to solve alignment without anyone building nearly-risky AI.

A lot of historical work on alignment seems like it addresses subsets of the problems solved by RLHF, but doesn’t actually address the important ways in which RLHF fails. In particular, a lot of that work is only necessary if RLHF is prohibitively sample-inefficient.

Do you have examples of such historical work that you're happy to name? I'm really unsure what you're referring to (probably just because I haven't been involved in alignment for long enough).

I think a lot of work on IRL and similar techniques has this issue---it's mostly designed to learn from indirect forms of evidence about value, but in many cases the primary upside is data efficiency and in fact the inferences about preferences are predictably worse than in RLHF. (I think you can also do IRL work with a real chance of overcoming limitations of RLHF, but most researchers are not careful about thinking through what should be the central issue.)

For example, a major point of disagreement between me and Eliezer is that Eliezer often dismisses plans as “too complicated to work in practice,” but that dismissal seems divorced from experience with getting things to work in practice (e.g. some of the ideas that Eliezer dismisses are not much more complex than RLHF with AI assistants helping human raters). In fact I think that you can implement complex things by taking small steps—almost all of these implementation difficulties do improve with empirical feedback.

EY's counter to this?

I have read through most of this post and some of the related discussion today. I just wanted to write that it was really interesting, and as far as I can tell, useful, to think through Paul's reasoning and forecasts about strategy-related questions.
In case he believes this is a good idea, I would be very glad to read through a longer, more comprehensive document describing his views on strategic considerations.

It seems like most/all large models (especially language models) will be first trained in a similar way, using self-supervised learning on large unlabelled raw datasets (such as web text), and it looks like there is limited room for manoeuvre/creativity in shaping the objective or training process when it comes to this stage. Fundamentally, this stage is just about developing a really good compression algorithm for all the training data.

The next stage, when we try and direct the model to perform a certain task (either trivially, via prompting, or via...

Creating in vitro examples of problems analogous to the ones that will ultimately kill us, e.g. by showing agents engaging in treacherous turns due to reward hacking or exhibiting more and more of the core features of deceptive alignment.


Has ARC got a written policy for if/when similar experiments generate inconclusive but possible evidence of dangerous behaviour?

If so, would you consider sharing it (or a non-confidential version) for other organisations to use?

(While I appreciate many of the investigations in this paper and think it is good to improve our understanding, I don’t think they let us tell what’s up with risk.) This could be the subject of a much longer post and maybe will be discussed in the comments.

Do you mean they don't tell us what's up with the difference in risks of the measured techniques, or that they don't tell us much about AI risk in general? (I'd at least benefit from learning more about your views here)

Yes, I mean that those measurements don't really speak directly to the question of whether you'd be safer using RLHF or imitation learning.
