Notes on OpenAI’s alignment plan

Alex Flint

This work was supported by the Monastic Academy for the Preservation of Life on Earth.

OpenAI has published a plan for producing aligned AI systems. Very roughly, their plan is to build language models capable of producing human-level alignment research. The output of these models might be natural language essays about alignment, or code that directly implements machine learning systems. In either case, human researchers would spend their time reviewing machine-generated alignment research. Their research agenda goes as follows.

Step 1: Train language models to solve useful problems using reinforcement learning. Rather than having a human review each attempt the model makes, which would be very time-consuming, train a second model that watches a few human reviews and learns to predict what reward the human would have given. Then train the main reinforcement learning model using this second model to provide a lot of feedback. You still need a human to review this second model every so often to make sure it has learned the right thing, but overall it requires less human effort than having a human review every attempt that the reinforcement learning model makes.
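To make Step 1 concrete, here is a minimal toy sketch of reward modeling, not OpenAI's actual method. Everything in it is an assumption made for illustration: the "policy" is a single number, `human_review` is a hypothetical stand-in for a costly human judgment, and the `RewardModel` simply interpolates between a handful of human-labelled outputs. The point is only the shape of the loop: a few human reviews, then many cheap queries to the learned reward model.

```python
import random
from bisect import bisect_left


def human_review(output: float) -> float:
    """Hypothetical stand-in for a costly human judgment (prefers outputs near 3.0)."""
    return -(output - 3.0) ** 2


class RewardModel:
    """Predicts the human's reward by interpolating between a few labelled outputs."""

    def __init__(self, labelled):
        self.points = sorted(labelled)  # list of (output, human_reward) pairs
        self.xs = [x for x, _ in self.points]

    def predict(self, output: float) -> float:
        i = bisect_left(self.xs, output)
        if i == 0:  # at or below the smallest labelled output
            return self.points[0][1]
        if i == len(self.xs):  # above the largest labelled output
            return self.points[-1][1]
        (x0, r0), (x1, r1) = self.points[i - 1], self.points[i]
        return r0 + (output - x0) / (x1 - x0) * (r1 - r0)


def train_policy(reward_model, steps=2000, seed=0):
    """Hill-climb a one-number 'policy' using only the reward model for feedback."""
    rng = random.Random(seed)
    policy = 0.5
    for _ in range(steps):
        candidate = policy + rng.gauss(0.0, 0.2)
        # Thousands of reward-model queries here, but zero additional human reviews.
        if reward_model.predict(candidate) > reward_model.predict(policy):
            policy = candidate
    return policy


# Only seven human reviews are ever collected.
human_labels = [(float(x), human_review(float(x))) for x in range(7)]
reward_model = RewardModel(human_labels)
policy = train_policy(reward_model)
```

The occasional human check on the second model mentioned above would correspond to comparing `reward_model.predict` with `human_review` on a few fresh outputs.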

Step 2: When tasks become so complex that humans struggle to assess the outputs of the model, use some combination of the following:

Train models that critique their own outputs or the outputs of other models, possibly requiring the initial model to expand its reasoning around certain points in its original output. ("Debate")
Train auxiliary models that help humans work out what reward to give to the primary model by solving more tractable sub-problems. In assessing the validity of a philosophy essay, you might train an auxiliary model to identify which paragraphs contain the key reasoning steps, and then use this model as a tool when assessing the philosophy essays. It might be that the sub-problems are themselves very difficult to provide training data for, in which case you can build models to solve sub-sub-problems that help humans to assess the output of the models targeting the sub-problems, and so on recursively. ("Recursive reward modeling")
Train a model that is general-purpose enough that it can itself do useful work to assist a human generating training data for the model’s own training. Each time you train a new version of the model, give that model to humans to use as a tool in generating more training data, and then retrain the model on that improved training data. In training an internet-search assistant, for example, you might have humans use the previous version of the assistant to answer sub-questions that are relevant to assessing a top-level answer or answering a top-level question. If this produces better training data than in the previous iteration then you may also be able to train a better internet-search assistant using this improved training data. You can repeat this process for as long as the thing keeps improving. ("Iterated amplification")
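The iterated amplification loop in the last item can be caricatured in a few lines. This is a toy sketch under strong assumptions, not a description of any real training setup: the "questions" are tuples of numbers to be summed, the "human" can only add two numbers, and the "model" is a lookup table that memorizes whatever training data the human produced in earlier rounds. The names `human_with_assistant` and `amplify_and_distill` are hypothetical.

```python
def human_with_assistant(question, model):
    """A human who can only add two numbers, using the model for sub-questions."""
    if len(question) == 1:
        return question[0]  # trivial question: no assistance needed
    mid = len(question) // 2
    left = model.get(question[:mid])
    right = model.get(question[mid:])
    if left is None or right is None:
        return None  # the current model is not yet good enough to help here
    return left + right


def amplify_and_distill(questions, rounds):
    """Alternate between generating data with the model's help and retraining on it."""
    model = {}  # "distilled" model: a lookup table of question -> answer
    for _ in range(rounds):
        new_data = {}
        for q in questions:
            answer = human_with_assistant(q, model)
            if answer is not None:
                new_data[q] = answer
        model = {**model, **new_data}  # "retrain" on the improved training data
    return model


numbers = tuple(range(1, 9))
# All contiguous slices, so that every sub-question is itself a question.
questions = [numbers[i:j] for i in range(8) for j in range(i + 1, 9)]
model = amplify_and_distill(questions, rounds=4)
```

Each round roughly doubles the length of question the human-plus-model team can handle. The sketch works only because sub-answers combine exactly into the top-level answer, which is an assumption of the toy task rather than something the plan guarantees for alignment research.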

Step 3: Use the above to build a language model capable of producing human-level alignment research. Presumably we would put in prompts like "what are the governing dynamics of agency in intelligent systems?" or "how can we avoid mesa-optimization in supervised learning?" and the thing would produce essays containing insights that clarify the alignment problem for us.

Discussion

There is a bit of a leap from "fine-tune a language model" to "produce human-level alignment research". Let’s say we’re training a language model to write insightful essays about alignment. We begin with some language model that has been pre-trained using, say, text from the internet. We give it a prompt such as "what would it mean to measure degrees of agency in arbitrary computer programs?" It generates, initially, some text that is typical of what someone on the internet might write in response to such a question, i.e. not very insightful.

Next, we start fine-tuning. We generate a lot of essays using this initial model, and hopefully there is a little insight in one or two. We identify these slightly better essays and use reinforcement learning to push the language model so that it generates essays more like these marginally insightful ones. After repeating this process, what we hope for is a language model that produces essays containing insights about how to design aligned AI systems.

One of the difficulties here is that humans don’t really know what insights about alignment will look like. We hope that we will recognize them when we see them, but insights are sometimes difficult to digest because they often challenge assumptions that we didn’t know we were making. If a language model generates an essay that seems counterintuitive to us, how likely are we to read it and consider deeply whether there is an insight buried somewhere in there? This is where iterated amplification or debate or recursive reward modeling are supposed to help. The idea is that we train a second model to solve a sub-problem such as "which paragraph is the most insightful?", so that we know which paragraph to look at in the top-level essay, or "give me a single-paragraph summary of the whole essay", so that we can assess that summary for insight. Either way, we can more quickly assess whether an essay contains some marginal amount of insight.^[1]

I’m not sure that insight works like this. When I read Jaynes’ Probability Theory, I gradually picked up a kind of mental ethos based on how Jaynes looks at the problem. I don’t think I could have pointed to a particular page that contained the "most" insight, and I think it was the fact that I spent a long time studying the book that helped to impart this ethos, so if I had read only a summary then my mind never would have had time to conform to the shape of the book. In short, I’m not sure that you can break down the assessment of insight into sub-problems, because what’s happening when you come around to an insight is that you are shifting your mental habits to accommodate something that doesn’t fit within your existing mental habits. Any way that you use your existing mental habits to break the problem into subproblems is likely to presuppose the very mental habits that a true insight would break you out of, and so assessing things in terms of subproblems is likely to prevent you from recognizing the deepest and most helpful kind of insights.

A second issue: is there a statistical regularity to text that contains insight about alignment? If the language model does generate an essay that contains an insight and we update its parameters to generate more essays that are statistically similar, then are further insights now more likely? It seems to me that such a parameter update would increase the likelihood of further essays containing the same basic insight, but I’m not sure whether it would also change the likelihood of further insights. Suppose that in a world where nobody had ever written much about embedded agency, a language model came up with something in the vicinity of the idea. As the language model starts expanding on this, we might reward it until it writes excellent essays about many aspects of embedded agency. But would this updated model really be more likely to realize that AI development may lead to arms races (in a world where AI arms races had not much been discussed)? If not then what we are really doing is an exhaustive search over essays.

A third issue: the training signal being used here is what seems insightful to humans. The system will therefore be incentivized to learn about what kind of things appear insightful to humans. Again, this is where debate, amplification, and recursive reward modeling are intended to help out. But all three of these ideas ultimately try to break high-level questions such as "what kind of goal should we give to a superintelligence?" into sub-questions that ground out in objectively verifiable facts. I’m not sure that breaking down problems into subproblems is a good way to verify real insights. Imagine someone working from within a Christian mindset a thousand years ago, asking "can a machine be in communion with Spirit?". Every way this person might break such a problem down into subproblems would likely already contain the basic assumptions of the Christian mindset (so it seems to me). It’s not that it’s impossible to break high-level alignment questions into sub-problems, it’s that we are likely to choose a breakdown that presupposes our initial assumptions. If we evaluate essays this way, I suspect the language models will learn to tell us what we already believe, in a way that feels insightful but does not actually challenge us to go beyond our assumptions.

An analogy

When we train reinforcement learning systems to play games such as Go, it is most helpful to put the learning algorithms in contact with the game of Go itself, and this is indeed how successful Go-playing models have been trained. If instead we trained Go-playing models by having each move assessed by humans who did not themselves understand the basic rules of Go (nor have access to a running Go board), then how likely is it that we would produce machines capable of playing high-level Go?

For the same reason that learning to play Go benefits enormously from a connection to a real game of Go, learning to write about alignment benefits enormously from a connection to the actual alignment problem as it manifests as a tangible problem in the real world. We should devise a training process that incorporates the alignment problem itself, rather than using human assessment as a proxy for that. In other words, the reward signal should be generated by whether the model can actually produce systems that are in fact aligned. OpenAI gives a working definition of "alignment" in their plan, which is:

Our alignment research aims to make artificial general intelligence (AGI) aligned with human values and follow human intent

Wouldn’t it make more sense to use as a reward signal the fact-of-the-matter about whether a certain system followed a particular human’s intention?

Conclusion

I have the sense that we’ve been so successful in machine learning because of a very tight feedback loop between a learning process and a thing that we want our models to learn. In some cases, the thing we want the model to learn is to solve a task in the way that humans solve it, in which case we generate a dataset and connect a learning process to that dataset. In other cases the thing we want the model to learn is to play a game such as Go or Chess or Starcraft, in which case we connect the learning process to an implementation of the game. Alignment research is neither of these: we are not trying to solve a task that humans already know how to solve, and we are not trying to play a game. If we really want to train machine learning systems to solve the alignment problem then we need to hook up the learning process to the alignment problem itself. In OpenAI’s plan, this is done via human opinions about whether a certain essay or piece of code would work. It is hoped that some combination of debate, amplification, and recursive reward modeling will change this, but in the end it is still like asking humans for their opinions about whether a certain Go move is brilliant or not, in a world where no humans understand the basic rules of Go, and there is no way for humans to directly play Go. It is not literally impossible for this to work, but the connection between the thing doing the learning and the thing to be learned is tenuous and hindered by all the confusions that humans already have about alignment.

[1] Or perhaps we train a second model to summarize a whole essay into a single paragraph, and then assess whether that paragraph contains insight as a proxy for whether the whole essay contains insight.

"Wouldn’t it make more sense to use as a reward signal the fact-of-the-matter about whether a certain system followed a particular human’s intention?"

If I understand what you are saying correctly, this wouldn't work, for reasons that have been discussed at length in various places, e.g. the mesa-optimization paper and Ajeya's post "Without specific countermeasures..." If you train a model by giving it reward when it appears to follow a particular human's intention, you probably get a model that is really optimizing for reward, or appearing to follow said human's intention, or something else completely different, while scheming to seize control so as to optimize even more effectively in the future. Rather than an aligned AI.

And so if you train an AI to build another AI that appears to follow a particular human's intention, you are just training your AI to do capabilities research.

(Perhaps instead you mean: No really the reward signal is whether the system really deep down followed the human's intention, not merely appeared to do so as far as we can tell from the outside. Well, how are we going to construct such a reward signal? That would require getting all the way to the end of evhub's Interpretability Tech Tree.)

If you train a model by giving it reward when it appears to follow a particular human's intention, you probably get a model that is really optimizing for reward, or appearing to follow said human's intention, or something else completely different, while scheming to seize control so as to optimize even more effectively in the future. Rather than an aligned AI.

Right yeah I do agree with this.

Perhaps instead you mean: No really the reward signal is whether the system really deep down followed the human's intention, not merely appeared to do so [...] That would require getting all the way to the end of evhub's Interpretability Tech Tree

Well I think we need something like a really-actually-reward-signal (of the kind you're pointing at here). The basic challenge of alignment as I see it is finding such a reward signal that doesn't require us to get to the end of the Interpretability Tech Tree (or similar tech trees). I don't think we've exhausted the design space of reward signals yet but it's definitely the "challenge of our times" so to speak.

I think this is a very good critique of OpenAI's plan. However, to steelman the plan, I think you could argue that advanced language models will be sufficiently "generally intelligent" that they won't need very specialized feedback in order to produce high quality alignment research. As e.g. Nate Soares has pointed out repeatedly, the case of humans suggests that in some cases, a system's capabilities can generalize way past the kinds of problems that it was explicitly trained to do. If we assume that sufficiently powerful language models will therefore have, in some sense, the capabilities to do alignment research, the question then becomes how easy it will be for us to elicit these capabilities from the model. The success of RLHF at eliciting capabilities from models suggests that by default, language models do not output their "beliefs", even if they are generally intelligent enough to in some way "know" the correct answer. However, addressing this issue involves solving a different and I think probably easier problem (ELK/creating language models which are honest), rather than the problem of how to provide good feedback in domains where we are not very capable.

Well even if language models do generalize beyond their training domain in the way that humans can, you still need to be in contact with a given problem in order to solve that problem. Suppose I take a very intelligent human and ask them to become a world expert at some game X, but I don't actually tell them the rules of game X nor give them any way of playing out game X. No matter how intelligent the person is, they still need some information about what the game consists of.

Now suppose that you have this intelligent person write essays about how one ought to play game X, and have their essays assessed by other humans who have some familiarity with game X but not a clear understanding. It is not impossible that this could work, but it does seem unlikely. There are a lot of levels of indirection stacked against this working.

So overall I'm not saying that language models can't be generally intelligent, I'm saying that a generally intelligent entity still needs to be in a tight feedback loop with the problem itself (whatever that is).

This makes sense, but it seems to be a fundamental difficulty of the alignment problem itself as opposed to the ability of any particular system to solve it. If the language model is superintelligent and knows everything we know, I would expect it to be able to evaluate its own alignment research as well as if not better than us. The problem is that it can't get any feedback about whether its ideas actually work from empirical reality given the issues with testing alignment problems, not that it can't get feedback from another intelligent grader/assessor reasoning in a ~a priori way.
