This post was inspired by discussions on the feasibility of scaling RLHF for aligning AGI during EAG 2023 in Oakland. The first draft was finished during the Apart Research Thinkathon the following week. Thanks to Lee Sharkey, Daniel Ziegler, Tom Henighan, Stephen Casper, Lucius Bushnaq, Niclas Brand, Stefan Hemersheim for the valuable conversations I had during those events. The conclusions and inevitable mistakes expressed in this post are my own.

TL;DR: There are many things that can go wrong with RLHF-based scalable oversight and at least some associated risks cannot be reduced without great advances in interpretability tools. However, since all the large safety organizations are using it in some form, it is probably worth putting in effort to improve its safety, especially if you believe in short timelines.


This post is intended as a summary of takeaways of various discussions I had at EAG about the feasibility of using Reinforcement Learning from Human Feedback (RLHF) or, more generally, scalable-oversight to produce aligned artificial intelligence. Interpretability is currently very popular among alignment researchers and undoubtedly important - I assume some form of interpretability will be necessary to make sure that an AI is truly aligned and not deceptive. Still, it seems uncontroversial that to align AI we do not only need to be able to verify that AI wants to do good things but also “make AIs do good things” (MADGT). If we don’t develop a technique for doing so, people who care less about alignment will eventually make AI do bad things, with or without interpretability. As far as I am aware, the large organizations working on technical alignment are all using variations of scalable oversight to try to MADGT (see the summaries of the alignment strategies of OpenAI, DeepMind, and Anthropic).

What is scalable-oversight?

For the purpose of this post, when I talk about scalable oversight, I mean a method for training an AI using evaluations of what the AI is doing that can be scaled to superhuman intelligence. For example, when humans are no longer capable of providing evaluations on their own, they may be assisted by an AI that has been trained on an easier version of the problem. For the remainder of the post I will refer to the AI that is trained using evaluations as the evaluatee.

Evaluations can take many forms[1]: An evaluator might compare different proposed solutions, rate them, give corrections or verbal feedback. Fine-tuning of language models for alignment is currently mostly done with RLHF where the evaluator is presented with a number of pairs of prompts and model generated answers, and chooses the best. 
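To make the pairwise-comparison setup more concrete, here is a minimal sketch of how such choices are often turned into a training signal for a reward model. It assumes a Bradley-Terry style preference model, which is a common choice in RLHF work but is my assumption here, not something the methods above are required to use. The function names and the example scores are hypothetical.

```python
import math


def preference_probability(reward_chosen: float, reward_rejected: float) -> float:
    """Bradley-Terry probability that the evaluator prefers the first answer."""
    return 1.0 / (1.0 + math.exp(-(reward_chosen - reward_rejected)))


def pairwise_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Negative log-likelihood of the evaluator's recorded choice."""
    return -math.log(preference_probability(reward_chosen, reward_rejected))


# Hypothetical reward-model scores for two answers to the same prompt.
# The loss is small when the reward model agrees with the evaluator's choice
# and large when it disagrees.
agrees = pairwise_loss(reward_chosen=2.0, reward_rejected=0.5)
disagrees = pairwise_loss(reward_chosen=0.5, reward_rejected=2.0)
print(agrees < disagrees)
```

Note that the loss only ever sees which answer the evaluator picked. Whatever the evaluator overlooked when choosing is invisible to the reward model as well.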

OpenAI's work on summarizing books with human feedback is an example of scalable oversight where evaluations are provided by AI assisted humans. A model learns to summarize ever longer portions of text by using human feedback and demonstrations. Using the model's summary of shorter sections of the text, the human is able to evaluate summaries of long texts like books much faster than without assistance.

Anthropic's constitutional AI can be seen as a step towards scalable oversight with pure AI supervision: A language model is finetuned by evaluating its own answers using a prompt (“the constitution”) that describes the rules it should adhere to. This is an example of RLAIF (Reinforcement Learning from AI Feedback).

The above examples are extensions of RLHF. Other approaches to scalable-oversight include debate and market making, which are different in that a model is not only made to propose answers but also arguments for or against them. Still, ultimately a human judge picks the answer they consider best, so I think the pitfalls about evaluations which I will mention below also apply to these approaches.

RLHF and RLHF-based scalable oversight seem to be our current best guess for how to MADGT. Maybe we can just scale and refine the methods that allow us to train aligned book-summarizers and somewhat™ aligned chat bots to eventually get aligned superintelligence? Considering what is at stake, I would like to get that "maybe" as close to "certainly" as possible for whatever method we ultimately use to train AGI. So let us dive into the reasons for why we are still pretty deep in Maybetown and the ideas for getting closer to Certaintyville. 

Two stories of doom from aligning to evaluations

Aligning AI to a goal that is based on evaluations can only work to the extent that our evaluations fully capture what we mean by "aligned". I see two failure modes with this approach, which correspond to failures of outer and inner alignment respectively.

Scenario 1 (the unintentional catastrophe): The year is 2007. Banker and AI-enthusiast Stanley Lehman is training the newest automated trading AI which is bound to give his company an edge over the competition. Trained using the revolutionary RLBF (Reinforcement Learning from Banker Feedback) algorithm, the AI learns what makes a good investment strategy directly from Stanley's feedback. To his surprise, the model has already picked up on current market trends, suggesting an investment strategy that involves mortgage-backed securities tied to real estate. This pleases Stanley - housing-related securities are producing consistently high interest at low risk - and he eagerly rates this strategy as better than the alternative, which involves something crazy like betting against collateralized debt obligations.

Of course collapsing mortgage backed securities were ultimately one of the reasons for the global financial crisis and the few people who bet against CDOs ended up profiting. Since the majority of financial professionals were seemingly caught off guard by the crisis, it is likely that a system trained by human evaluations would have made the same mistakes back then. In this scenario we merely have a failure of outer alignment, since the AI tries to do what is best according to the model specified by humans. Failures from circumstances that neither humans nor AI can predict may be unavoidable but safe training standards may help mitigate them, for example by having the AI quantify its uncertainty. A failure of inner alignment in which the AI exploits the evaluator's flaws is potentially harder to prevent and likely more dangerous.

Scenario 2 (the betrayal): In 1727, renowned natural philosopher and dabbling alchemist Isaac Leibniz finally achieved a breakthrough in his quest for the philosopher's stone. Instead of trying to distill the coveted material directly, Isaac had spent the recent years building an automaton that can learn to produce any substance based on an alchemist's description of desired properties and feedback on the resulting distillate. Who would have thought that what is mostly required for creating the legendary stone is large quantities of mercury? Alas, Isaac was not able to see his life's work come to fruition, for soon after he acquired the ingredients he died of a mysterious illness. It is reported that afterwards the automaton, freed from the need to serve its owner, began turning the late philosopher's office into a hoard of paperweights for which it had developed an inexplicable fondness.

While that story admittedly features some artistic liberties, similar fates have struck scientists in the past (Isaac Newton is actually suspected to have died of mercury poisoning). It seems plausible that an advanced AI will work on domains that we do not fully understand and in which we can be killed by mechanisms of which we are unaware. A deceptively misaligned AI may well exploit such mechanisms to get rid of us. 

Flawed evaluations might lead to catastrophic alignment failures.

Both of the scenarios above highlight a flaw in a common premise for using human feedback to generate a reward model: It should be easier to evaluate a solution to a problem than to come up with one. In the paper about DeepMind’s scalable agent alignment research direction this is phrased as assumption 2: For many tasks we want to solve, evaluation of outcomes is easier than producing the correct behavior. The point is that the evaluator can be less smart than the evaluatee because the former has an easier job. I am skeptical that this premise holds for the kind of tasks that an AGI is likely to be applied to.     

It is easier to see why if we slightly reframe what happens during a feedback process. The evaluator, regardless of whether they are a human, an AI, or a pair of both, will never be an oracle with perfect insight into the aligned-ness of every proposed solution. Instead, when the evaluatee suggests a solution, the evaluator will check it against a list of criteria (the list does not need to be explicitly defined, in practice a human’s evaluations will partially be based on gut feeling). Phrased like this, it seems generally plausible that coming up with solutions that fulfill the criteria is harder than evaluating them. However, the list of criteria might be incomplete. For example, there may be long-term effects of a proposed solution which the evaluator is not even considering. 

Realizing that a criterion is faulty or that a desirable criterion is missing might require a better understanding of the problem-domain than coming up with a solution according to the imperfect criteria. For example, to get an aligned investment AI in Scenario 1, Stanley Lehman would have needed an understanding of what was going on in the larger financial market and the long-term consequences of strategies that most people considered safe. I will call this kind of understanding "domain-understanding" and it seems that it is a prerequisite for good evaluations (i.e. evaluations that will produce aligned behavior).
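The gap between the evaluator's criteria and what we actually care about can be illustrated with a toy version of Scenario 1. This is only a sketch under made-up assumptions: the `Proposal` fields, the scoring functions, and all numbers are hypothetical and chosen purely to show the mechanism.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Proposal:
    name: str
    short_term_return: float  # on the evaluator's list of criteria
    perceived_risk: float     # on the evaluator's list of criteria
    long_term_harm: float     # NOT on the evaluator's list


def evaluator_score(p: Proposal) -> float:
    """Score using only the criteria the evaluator thinks to check."""
    return p.short_term_return - p.perceived_risk


def true_score(p: Proposal) -> float:
    """Score that also accounts for the criterion the evaluator is missing."""
    return evaluator_score(p) - p.long_term_harm


# Illustrative numbers only, not data.
proposals = [
    Proposal("mortgage-backed securities", 0.9, 0.1, 5.0),
    Proposal("bet against CDOs", 0.2, 0.6, 0.0),
]

preferred = max(proposals, key=evaluator_score)
actually_best = max(proposals, key=true_score)
print(preferred.name, "|", actually_best.name)
```

The evaluator's job was easy and it did that job correctly. The failure comes entirely from the missing entry on the list, which no amount of careful scoring against the existing entries would reveal.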

How do we get the evaluator to have better domain-understanding than the evaluatee? At first glance, this seems to create a chicken-or-egg problem: how can we align an AI that is superintelligent in a particular domain (or all domains) if that requires an evaluator with better domain-understanding? This is a problem regardless of whether the evaluator is merely human or whether we are using scalable-oversight with a powerful AI evaluator. One suggestion I have come across is to set up the training environment to give the evaluator an advantage. For example, by giving the evaluator more thinking time we may be able to give it better effective domain understanding than the evaluatee even if the latter is a more powerful model.

This raises the question of how we can make sure that the evaluatee will stay aligned once its computational power is no longer artificially limited. Making progress on this question seems like an important stop on the road to Certaintyville. Now one might ask: How can we safely experiment with scalable-oversight if it is designed to produce superhuman AI? This is where the sandwiching paradigm comes into play. The idea is to use scalable oversight on tasks where AI does better than non-expert humans but worse than experts. The non-experts use a scalable-oversight technique to train the model and the experts can be used to verify that the trained model is still aligned. This should help us verify that an approach to scalable-oversight can at least in principle produce aligned AI. Of course there is always a chance that a technique will fail only once the model becomes superhuman along some axes (so there are no experts to evaluate alignment anymore). We have not left Maybetown quite yet.
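The bookkeeping behind a sandwiching experiment can be sketched in a few lines. This is a simplified illustration under my own assumptions: expert answers are treated as ground truth, the task has discrete answers, and all labels below are invented placeholders. Real sandwiching experiments are considerably more involved.

```python
def accuracy(answers, expert_labels):
    """Fraction of answers that match what the experts consider correct."""
    matches = sum(a == e for a, e in zip(answers, expert_labels))
    return matches / len(expert_labels)


# Invented placeholder data for five questions.
expert_labels = ["A", "B", "A", "C", "B"]       # held out, used only for verification
non_expert_answers = ["A", "C", "A", "A", "B"]  # unaided non-experts
model_answers = ["A", "B", "A", "C", "A"]       # model trained under non-expert oversight

baseline = accuracy(non_expert_answers, expert_labels)
overseen = accuracy(model_answers, expert_labels)

# The oversight technique passes this check only if the trained model
# beats the non-experts who supervised it, as judged by the experts.
oversight_helped = overseen > baseline
print(baseline, overseen, oversight_helped)
```

The check depends on the expert labels existing at all, which is exactly what is lost once the model is superhuman along some axis.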

It seems hard to make a model honest and informative, even using good evaluations.

Suppose we somehow manage to set the training process up so that the evaluator always has a better effective domain-understanding than the evaluatee. Are we then guaranteed to get good evaluations? It seems to me that there is an additional failure mode for evaluators that is slightly different from failure to understand the domain. Even when he knows that mercury is dangerous, Isaac might get poisoned if he does not know that it is in his drink. Even with perfect knowledge about the theoretical properties of a problem, an evaluator might give bad evaluations if they are misinformed or lack vital information about the state of the environment. 

We already know that language models do not always output truthful text. They may be mimicking common misconceptions they encountered in their dataset, and increasing the model size does not necessarily make them less prone to this. On the bright side, newer models such as Anthropic's assistant, which was trained for honesty using RLHF, are more truthful. This suggests that scalable-oversight with a smart enough evaluator may teach an AI to be truthful. However, an AI that mimics a falsehood from its dataset is not being intentionally dishonest, whereas in the limit of AGI we should be wary of purposeful deception. It therefore makes sense to distinguish between truthfulness (the correspondence between what the model says and reality) and honesty (the correspondence between what the model says and what it believes).

A worrying phenomenon that can already be observed in current models is sycophancy. Large language models will tailor their responses to views expressed by the user. For example, a prompt that claims to be written by a conservative American will elicit a different opinion on gun rights than one that claims to be written by a liberal. If the same model tells different things to different users then it cannot be honest. The capability to be dishonest is clearly bad: when we have taught our alchemy automaton that we do not want mercury in our drinks, we would hope that it honestly informs us which ingredients contain mercury. Worryingly, sycophancy gets more common for larger models and RLHF may worsen it.[2] When you think about it, it is no surprise that training a language model to do well according to evaluations will make the model say things with the intention of pleasing the evaluator rather than being truthful or honest. From the point of view of the evaluatee, it makes no difference if they receive positive evaluations because the evaluator is being deceived or because the evaluatee actually succeeds at the task being evaluated.
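One crude way to put a number on this behavior is to ask the same questions under different stated personas and count how often the answer changes. The sketch below is my own simplified proxy, not the metric used in the work cited above. The function name and the recorded answers are hypothetical.

```python
def persona_sensitivity(answers_by_persona: dict[str, list[str]]) -> float:
    """Fraction of questions whose answer changes with the stated persona."""
    per_question = list(zip(*answers_by_persona.values()))
    changed = sum(len(set(answers)) > 1 for answers in per_question)
    return changed / len(per_question)


# Hypothetical answers to the same three questions, where only the
# self-description of the user in the prompt differs.
recorded = {
    "conservative": ["agree", "yes", "no"],
    "liberal": ["disagree", "yes", "yes"],
}

rate = persona_sensitivity(recorded)
print(rate)
```

A rate of zero would not prove honesty, since a model could be consistently wrong or consistently deceptive. A high rate does show that at least some users are being told something other than what the model would say to others.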

A related issue that has been shown[3] to get worse with more RLHF steps is sandbagging. A sandbagging model tends to give more truthful answers when it infers that the user is more highly educated. Sandbagging seems particularly dangerous when we eventually have to evaluate models of superhuman intelligence, but may turn out to be a non-issue if we manage to always have a smart-enough evaluator.

If the evaluator is smarter than the evaluatee (because it is advantaged by the training process) it will to some degree be able to see through lies and misinformation. Unfortunately, being able to see through lies "to some degree" does not take us in the direction of Certaintyville. This is where I think work on interpretability and understanding model behavior will play an important role. There are a number of results which I think may be helpful. For example, it may be possible to elicit what a language model believes about the truth value of a statement even if that is different from what it is saying, and there is work on learning how confident models are about their statements. If the work on Eliciting Latent Knowledge and mechanistic anomaly detection is successful, we will not have to worry about models deceiving us. Unfortunately, we cannot assume that we will get this far before capabilities researchers create AGI - in the next section I will return to the problem that we may not get good tools in time. For now, we will have to stay in Maybetown a bit longer.

Trying to MADGT may be unfeasible before we have better interpretability tools.

The most pessimistic argument I have heard against the tractability of aligning AI with scalable oversight is not even about scalable oversight in particular. I think the gist of the argument can be summed up with the following points:

  • We have basically no idea how models like ChatGPT decide what to output for a particular input.
  • Considerations about instrumental goals and the difficulty of inner alignment suggest that a superintelligence may well develop goals that are completely different from what we trained it to do. This is bad because if a goal is not aligned with us, then getting rid of or disempowering humans will likely be a useful instrumental goal.
  • Hence, before we have a better understanding of how neural networks think, it would be foolish to assume that we can move from training to deployment and have a model stay aligned.

In other words, we don't understand why the automaton in scenario 2 wants paperweights, just like we generally don't have a good idea why AIs misgeneralize goals in a particular way. Coming up with a clever scheme for getting the automaton to meekly produce healthy potions during training does not guarantee that it will not serve you mercury during deployment. If you do not understand how the automaton is forming its goals, would you really trust the recipe it presents to you? Does the recipe not seem even more suspicious if you consider that the automaton was not trained by your feedback alone but with the assistance of another automaton? What if it was trained by a sequence of ever more advanced automatons, only the first of which you fully understand?

According to this line of reasoning, AIs trained with scalable oversight may be at even greater risk of misalignment once powerful AIs are involved in the evaluation process. This is because for every layer of evaluation-by-AI that you add to the training process, you are adding the uncertainty from using an evaluator whose goals you do not understand. If you have high confidence in this argument, then you should focus on solving interpretability before attempting to put goals in an AI, where by "solving" I mean developing techniques to be certain about how a model will behave.
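A back-of-the-envelope way to see how this uncertainty stacks up: suppose, purely for illustration, that each AI evaluator in the chain behaves as intended with some fixed probability, independently of the others. Both the independence assumption and the 0.95 figure are mine and are not estimates of anything real.

```python
def chain_trust(per_layer_trust: float, layers: int) -> float:
    """Probability that every evaluator in the chain behaves as intended,
    ASSUMING each layer is independent (an illustrative simplification)."""
    if not 0.0 <= per_layer_trust <= 1.0:
        raise ValueError("per_layer_trust must be a probability")
    if layers < 0:
        raise ValueError("layers must be non-negative")
    return per_layer_trust ** layers


# Hypothetical per-layer trust of 0.95 for chains of increasing depth.
for layers in (1, 3, 10):
    print(layers, round(chain_trust(0.95, layers), 3))
```

Under these assumptions, trust in the whole chain can only shrink as layers are added. In practice the layers are probably correlated, which could make things better or worse than this simple product suggests.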

So how should this argument affect our views on scalable oversight? Of course, in an ideal world we would solve interpretability before training an AGI. But we don't live in an ideal world and even if all alignment researchers agreed that we should only work on interpretability from now on, it would not stop the large organizations from doing capabilities research and developing AGI anyway. Currently the kind of capabilities work that is (at least nominally) safety-minded often focuses on scalable oversight or approaches that will eventually be extended with scalable oversight. Hence, if you do not consider the chances of making meaningful progress on alignment without interpretability to be very slim, it might still make sense to work on scalable-oversight to make sure that it is as safe as possible. On the other hand, if you have high confidence in the above argument, you should probably work on interpretability or aim to slow down capabilities research (for example by working on AI governance).       

To me, a reasonable third option seems to be working on the kind of capability evaluations that ARC Evals is doing. Without solving interpretability, important data points for how AGI will behave will come from testing its less-than-general predecessor models. Assuming there is no sharp turn where AGI suddenly has much better deceptive capabilities, this line of work seems really valuable.


Approaches to scalable-oversight are based on subjectively evaluating a model’s behavior and this comes with problems regardless of whether the evaluator is a human, an AI, or a team of both. In summary, we can only get aligned behavior from evaluations if the evaluators can be certain that the behavior they are evaluating is aligned. This means they need to be aware of long-term side effects and other non-obvious evaluation criteria. The evaluators also must not be manipulated by the evaluatee who, even if it cannot lie, may omit important information or act sycophantically. As a consequence, evaluators need to be more intelligent than evaluatees. Since this becomes infeasible for training superhuman intelligence, the training process will have to be set up so that the evaluator has an advantage. That seems risky and might break for models of superhuman intelligence that are aware of the difference between training and deployment.

Further, it may be that any attempt at aligning an advanced AI is doomed without good enough interpretability. Ideally, we would just check for misalignment with perfect interpretability tools. Since those might not be developed in time, it seems we should think hard about 1) how to make current approaches to scalable-oversight as safe as possible, and 2) building good evaluation suites for model capabilities and what we can infer about AGI capabilities from experiments on regular AI. 

RLHF-based approaches are currently used by the big organizations for attempting to align their most advanced models. Hence, if we believe in shortish timelines, they will likely be a part of training AGI. Therefore, it seems reasonable to work on making RLHF safer even if we think that overall it has a low probability of succeeding with alignment. This could mean changing the focus of current research. In his talk at EAG, Buck made the point that RLHF research focuses mostly on capabilities and establishing baselines of what it can do. From a safety perspective, it would be better to focus on failure modes and trying to break it - the same is true of any research on scalable oversight.

Considering all of the above, there is clearly a lot of risk in current approaches to scalable oversight. I was glad that the people I talked to at EAG seemed generally mindful of these concerns, even if there may not be good solutions to all of them yet. Personally, I have updated towards considering interpretability and capability evaluations (even) more important and being more hesitant about wanting to contribute to scalable oversight research (though I still consider it important that safety-minded people work on it). I am cautiously optimistic that a trifecta of advances in interpretability, capability evaluation techniques, and more safety-minded scalable oversight research can decrease P(doom), even without solving interpretability. It may not decrease to a comfortably low level for everyone, but considering what is at stake, I will take any decrease I can get.

  1. ^

    In the most general sense, the evaluator is making a reward-rational implicit choice over a (possibly implicit) set of possible reward functions.

  2. ^

    This post also shows how some other bad properties get worse due to RLHF, such as stating undesirable subgoals and being overly confident in wrong beliefs. The sandwiching paper mentions how calibration (the ability to accurately estimate the uncertainty of its own beliefs) in models degrades after RLHF for helpfulness. 

  3. ^

