In this post, I am going to recast the recent “OpenAI’s Alignment Plans” post in terms of Evan Hubinger’s Training Stories framework. I found this somewhat useful for understanding the research agenda, and hopefully it will be useful for others too!

I am going to focus on the approach to alignment titled “Training AI systems to do alignment research”, as this is their current approach to actually creating an aligned AGI system. I think OpenAI sees the agendas “Training AI systems using human feedback” and “Training models to assist human evaluation” as being useful for this goal, but not ways of aligning AGI systems themselves, so I will only talk about these approaches as they relate to training an AI research assistant.

Also note that OpenAI’s post is pretty vague. This is maybe to be expected since the agenda seems quite new, and they appear some way away from implementing these ideas. I will lay out the optimistic training story I see being told here, but the actual story could diverge in some aspects.

What training story is being told here?

Training goal

Specification: The idea is to build a Large Language Model that can function as an “AI alignment researcher”. This system will not be goal-directed, and will only be useful in the “narrow” sense of being able to do alignment research.

Desirability: This model will not cause safety problems because it is only able to do alignment research, and it is not goal-directed. Hence, it will avoid the problems caused by AGI systems with agency.

If successful, this model will accomplish the desired goal of helping to produce aligned AGI systems by default: that is what it is intended to do! This approach is not aimed at fixing any policy issues such as “how do we deploy an AGI system”.

Training rationale

Constraints: The “AI research assistant” will be trained using a training signal which is hopefully “scalable and aligned”. This training signal will itself be generated by training models to assist human evaluation, which will allow us to evaluate the assistant’s behaviour, and so to train it, even when its behaviour and outputs are complex. We will do this through methods such as IDA, Debate, and Recursive Reward Modelling (RRM), and we will improve these methods through the skills we have developed aligning current systems through methods such as human feedback.

Nudges: Because we are only training a “narrow” system, in the sense that we are only training a system to assist with alignment research, we might expect a robust training signal to be enough to ensure we have an aligned research assistant.

Alongside this, we will use weaker AI alignment assistants to develop and test the AI alignment assistant. If the research agenda “training models to assist human evaluation” is successful, this could be enough to produce an aligned AI alignment assistant.

Evaluating the training story

Training goal evaluation 

Alignment: The biggest problem with the training goal is that it is not currently specified mechanistically at all. There is no attempt to suggest what kind of algorithm the model might be running, and indeed little mention of any potential model besides it being an LLM. The post only describes the intended behaviour of the model, not how the model might actually generate this behaviour. We want mechanistic explanations because of concerns about issues such as inner misalignment: many in the community think it is necessary to know what the model is doing mechanistically (at least to an extent) before we can be confident in its alignment.

This is maybe harsh, as the whole point of developing a research assistant is that we cannot currently answer all questions about alignment! I think a case could be made that we don’t need a mechanistic understanding of weaker assistants in order to use them with confidence, and that these weaker assistants could help us to gain a mechanistic understanding of more advanced systems. I think the key here is: can we train a system X which can safely help us to pursue alignment research, before we have a mechanistic understanding of how system X is working? If the OpenAI team believes this, a better argument for this would be useful.

Competitiveness: There may be an issue here with timing. It seems plausible that if we have the capabilities to build an aligned “AI research assistant”, then we may also be able to build a very powerful ML system. We may then not have long to use the research assistant before AGI is developed.

However, if successful, I think this approach could be very competitive with other approaches to alignment research. I don’t think it needs justification that an aligned AGI system could do better alignment research than humans can currently.

Training rationale evaluation

Alignment: To me, the training rationale nudges section is very weak, and I think this is because it doesn’t tell a precise enough story about what a “scalable and aligned” training signal might look like, and why this would be sufficient for us to produce an aligned model (it’s worth noting that there may be better justification for this elsewhere that I’m not aware of). It’s also hindered by the fact that there is no clear algorithm being aimed at here, just a vague notion of an alignment assistant.

One argument given in the post is that we can use weaker AI assistants to help us to evaluate more capable research assistants. This would involve using ideas from their research agenda “training models to assist human evaluation”, which might mean that the crux of “how do we produce a system we are confident in?” boils down to evaluating Recursive Reward Modelling, IDA, Debate, or a similar system to help humans evaluate systems with complex behaviour. I think the key here could be something like “is aligning increasingly powerful systems continuous in some sense?”, but I think other resources are better placed to answer this question than I am.

Another argument made in the post is that aligning an AI research assistant might be easier than an AGI because it is “narrower” in scope. This is similar to work by Christiano and Shah regarding why narrow value learning could prove tractable, yet still useful. However, it doesn’t seem obvious that a system which is able to meaningfully advance progress in alignment research will be “narrow” in any sense. For example, if the models we create have increasing control over the research process, the scope of these models starts to feel pretty broad.

I think this is again related to how continuous we expect the challenge of alignment to be: will we be able to align a powerful AI alignment model by iteratively improving and expanding the scope of more basic AI alignment models, or will there be some discontinuity in difficulty?

A separate question specific to LLMs is to what extent they are not goal-directed, and to what extent we can expect them to not be goal-directed even as they become more powerful and we use them in more contexts. I suspect discussions around this question exist somewhere already.

Competitiveness: I think the greatest potential barrier to competitiveness, which the authors recognise, is that “It might not be fundamentally easier to align models that can meaningfully accelerate alignment research than it is to align AGI”. If this is the case, this approach may only introduce unnecessary barriers to training an aligned AGI system. However, I am by no means convinced of this claim, and actually think there’s a good chance the cost of developing this approach could be competitive with other approaches in terms of expense, if it works.


To my current understanding, there are two main issues with the path to alignment OpenAI have specified. The first is that they haven’t explained at all how their system might work mechanistically. It might be the case that we can safely use a weaker system we don’t understand mechanistically to eventually produce a more powerful system we do understand mechanistically, but this is not obvious to me. Some alignment researchers, such as Evan Hubinger, think a mechanistic understanding is necessary to have confidence in a system’s alignment.

The second problem is that although they have focussed greatly on how to produce an “aligned and scalable” training signal, it is not at all clear that this will be sufficient to produce an aligned system. I think the core assumption here is that we can iteratively create more powerful alignment assistants which are themselves aligned, although I’m less certain of this than I am about the first issue.
