Model-driven feedback could amplify alignment failures

aogara

Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

Anthropic, DeepMind and Google Brain are all working on strategies to train language models on their own outputs. Here is a brief summary of the work so far:

Red Teaming Language Models with Language Models (Perez et al., 2022). One model prompts another, seeking to expose undesirable generations. A third model classifies those generations as undesirable, creating a dataset of behaviors to be avoided.
Constitutional AI (Bai et al., 2022). Begins with the red-teaming setup described in Perez et al., 2022. Then fine-tunes on that dataset by either (a) critiquing and rewriting the response and training the generator to imitate that output with supervised fine-tuning, or (b) choosing the better of two responses to train a preference model, and training the generator with RL on the preference model.
Large Language Models Can Self-Improve (Huang et al., 2022). Fine-tunes a model on its own "high confidence" outputs. High-confidence outputs are identified by asking the model to answer the same question many times, using chain-of-thought each time, and fine-tuning on outputs that are agreed upon by a majority of responses.
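
To make the majority-vote filter in the last item concrete, here is a minimal sketch. It is not the code from Huang et al.; the function name, the `sample_answer` callable (a stand-in for sampling one chain-of-thought response from the model), and the `min_agreement` threshold are all assumptions made for illustration.

```python
from collections import Counter
from typing import Callable, List, Tuple

def majority_vote_filter(
    question: str,
    sample_answer: Callable[[str], Tuple[str, str]],  # hypothetical: returns (reasoning, answer)
    num_samples: int = 8,
    min_agreement: float = 0.5,  # assumed threshold, not taken from the paper
) -> List[Tuple[str, str, str]]:
    """Keep only the samples whose final answer matches a strict-majority answer."""
    samples = [sample_answer(question) for _ in range(num_samples)]
    counts = Counter(answer for _, answer in samples)
    top_answer, top_count = counts.most_common(1)[0]
    # Discard the question entirely if no answer clears the agreement threshold.
    if top_count / num_samples <= min_agreement:
        return []
    # The surviving (question, reasoning, answer) triples become fine-tuning data.
    return [(question, reasoning, answer)
            for reasoning, answer in samples
            if answer == top_answer]
```

Note that nothing in this filter checks whether the majority answer is correct. Agreement is the only signal, so an answer the model gets wrong consistently passes through just as easily as one it gets right.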

I'd like to point out a simple failure mode of this approach: Failures of alignment and capability in the original model could be amplified by fine-tuning on its own outputs. Empirically, recent experiments on language models have found more benefit than harm in model-driven feedback. But that might not always be the case.
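
A toy calculation shows how this amplification could work for majority-vote self-training. This is an illustration under strong assumptions that are not from any of the papers above: answers are binary, and each sample is correct independently with probability `p`. Real samples from a single model are correlated, so treat the numbers as qualitative only.

```python
from math import comb

def majority_correct_prob(p: float, n: int) -> float:
    """Probability that a strict majority of n independent samples is correct.

    Assumes binary answers and an odd n (so ties cannot occur).
    """
    if n % 2 == 0:
        raise ValueError("use an odd n to avoid ties")
    # Sum the binomial probabilities of getting more than n/2 correct samples.
    return sum(comb(n, k) * p**k * (1 - p)**(n - k)
               for k in range(n // 2 + 1, n + 1))
```

With `p = 0.6` and `n = 3`, the majority is right with probability 0.648, better than any single sample. With `p = 0.4` and `n = 3`, it is right with probability only 0.352, worse than a single sample. Under these assumptions, voting sharpens whatever the model already tends to say, so fine-tuning on the majority answer helps where the model is mostly right and entrenches errors where it is mostly wrong.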

This recent work is an extension of weak supervision, a technique dating back to at least 1963 which has been successful in applications such as image classification and protein folding. This literature has long acknowledged the possibility of amplifying a model's existing shortcomings via self-training:

Semi-Supervised Learning of Mixture Models (Cozman et al., 2003) analyzes cases where weak supervision will help or hurt a maximum likelihood estimator.
Pseudo-Labeling and Confirmation Bias in Deep Semi-Supervised Learning (Arazo et al., 2020) provides evidence that naive implementations of weak supervision can hurt performance on image classification tasks. It shows that data augmentation and scaling can reduce these harms.
An Overview of Deep Semi-Supervised Learning (Ouali et al., 2020) Section 1.3 lays out key assumptions behind weak supervision, and discusses state-of-the-art methods.

One particularly dangerous failure mode would be the classic deceptive alignment story, in which a model with long-term goals gains awareness of its training process and subverts it. With a model-driven feedback approach, there would be more of an opportunity to hide misaligned behavior during training. Models used for critiques or oversight could also engage in gradient hacking, putting their goals into the generator model.

A better approach might keep humans at the center of the feedback process. This is slower and might be less accurate in some cases, but could potentially avoid the worst failures of model-driven feedback. A popular middle ground uses model-assisted feedback methods:

AI Written Critiques Help Humans Notice Flaws (Saunders et al., 2022). GPT provides critiques of its own outputs. This version still has a human make the final judgement, limiting the influence of the model over its own training data.
Measuring Progress on Scalable Oversight for Large Language Models (Bowman et al., 2022). Finds that humans with access to a chatbot assistant are better able to answer factual questions than either the chatbot alone or humans unaided by AI.

Model-driven feedback has achieved impressive results on scalable oversight, especially compared to the empirical and theoretical challenges with debate. But in the future, the old adage might hold true: Garbage in, garbage out.

cubefox

Anthropic apparently noticed something like this but didn't worry much. From the Constitutional AI paper (my emphasis):

In Figure 7, we compare harmlessness PM scores for critiqued- vs direct-revisions. We found that critiqued revisions achieved better harmlessness scores for small models, but made no noticeable different for large models. Furthermore, based on inspecting samples from the 52B, we found that the critiques were sometimes reasonable, but often made inaccurate or overstated criticisms. Nonetheless, the revisions were generally more harmless than the original response. An example can be seen in Appendix A. For the main results of this paper, we chose to use critiqued revisions, as it may provide more transparency into the model’s reasoning process. This sort of reasoning may also be useful to help models uncover more subtle harms or unintended consequences.

The Appendix A example, where the model critiques appear somewhat confused, can be found on page 19 (I don't know how to include screenshots here).
