In response to Eliezer Yudkowsky's challenge, I will show how the alignment research approach outlined by OpenAI lacks common desiderata for effective plans. Most of the deficiencies appear to be difficult or impossible to fix, and we should thus expect the plan to fail.

Meta-level descriptions of what makes a plan good place great emphasis on the goals/objectives of the plan. George T. Doran suggests that goals be S.M.A.R.T.: Specific, Measurable, Achievable, Relevant, and Time-Bound.

Specific: OpenAI describes its goals as AI that is "value aligned" and "follows human intent". These five words could be elaborated in much greater detail, yet making them specific is no easy task: no definition of these terms exists in sufficient detail to be put into computer code, nor does an informal consensus exist.

Measurable: There currently exists no good way to quantify value alignment and intent-following. It is an open question if such quantification is even possible to do in an adequate way, and OpenAI does not seem to focus on resolving philosophical issues such as those required to make value alignment measurable.

Achievable: The plan suggests a relatively narrow AI would be sufficient to contribute to alignment research, while being too narrow to be dangerous. This seems implausible: Much easier problems than alignment research have been called AGI-complete, and general reasoning ability is widely thought to be a requirement for doing research.

Relevant: The plan acknowledges existential risk from advanced AI, but the proposed goal is insufficient to end the period of acute risk. This gap in the plan must be closed. I do not think this is trivial, as OpenAI rejects MIRI-style pivotal acts. My impression of OpenAI is that they hope alignment can be solved to the extent that the "Alignment tax" becomes negative to such an overwhelming degree that deceptively aligned AI is not built in practice by anyone.

Time-bound: The plan is not time-bound. OpenAI's plan is conceptualized as consisting of 3 pillars (Reinforcement Learning from Human Feedback, AI-assisted evaluation, AI doing alignment research), but the dependencies mean this will need to be 3 partially overlapping phases. The plan lacks information about timing or criteria for progressing from each phase to the next: When does OpenAI intend to pivot towards the last phase, AI-based alignment research?

Other problems

Robustness: The 3 steps outlined in the plan have the property that steps 1 and 2 actively push humanity closer to extinction, and only if step 3 succeeds will the damage be undone. "Making things worse to make them better" is sometimes necessary, but it should not be done blindly. I suspect OpenAI disagrees, but I do not know what their objection would be.

Resource allocation: This is a central part of good plans in general, and is absent except for a few scattered remarks about underinvestment in robustness and interpretability. To the extent that the problem is ensuring alignment is ahead of capability (“Two progress bars”), this is crucial. The departure of key personnel from OpenAI’s alignment team suggests that OpenAI is working more on capability than alignment.

Risk analysis: OpenAI acknowledges that it may turn out that the least capable AI that can do alignment research is capable enough to be dangerous. This is described as a limitation but not further discussed. A better plan would go into details on analyzing such weak points.

Review: The plan calls for OpenAI to be transparent about how well the alignment techniques actually work in practice. From the outside, it is unclear if the plan is on track. The launch of ChatGPT seems to have not gone the way OpenAI expected, but the actual results and evaluation have not been published (yet).


The challenge was not framed as a request for a defensible, impartial analysis - we were asked for our thoughts. The thoughts I present below are honestly held, but are derived more from intuition than from rigorous analysis.

Having “a realistic plan for solving alignment” is a high bar that OpenAI is far from meeting. No one else can meet the bar either, but “reality doesn't grade on a curve”: either we pass the inflexible criteria or we die. The recent alignment work from OpenAI seems far from the level required to solve alignment.

OpenAI calls the article "Our Approach to Alignment Research", and not "Our Plan for Existential Safety from AGI". This does not invalidate criticizing the article as a plan, but instead shows that OpenAI chose the wrong subject to write about.

A better plan probably exists for the development of GPT-4.

The capability work done by OpenAI is actively harmful by limiting our time to come up with a better plan.

I would also question the degree to which OpenAI management is committed to following this plan. The internal power structure of OpenAI is opaque, but worrying signs include the departure of key alignment researchers and statements from the CEO.

In my mind, I envision a scenario where the Alignment Team finds evidence that either:

  • Evaluating adversarially generated research is too hard
  • Building AI assistants in practice does not lead to insights about recognizing deceptive plans
  • The Alignment Team is unable to build an AI that can productively work on alignment research without the AI being potentially dangerous
  • There is a fundamental problem with RLHF (e.g., it only learns hijackable sense-data, without reference to reality)

In this scenario, I doubt the CEO would oblige if the Alignment Team requests that OpenAI stop capability work.


I personally (weakly) don’t think the purpose of the plan is to end the period of acute risk from unaligned AI:

I strongly object to the expansive definition of "Alignment" being used in the document. If InstructGPT fails to follow simple instructions, this is a lack of capability and not a lack of alignment. If it keeps producing unwanted toxic and biased answers, this is not misalignment either. The goal of alignment is that the AI does not kill everyone, and this focus should not be diluted.

Microsoft is a major partner in OpenAI, and is well-known for their “Embrace, Extend, Extinguish” business strategy. I worry that a similar strategy may be used by OpenAI, where Alignment is extended to be about a large number of other factors, which are irrelevant but where OpenAI has a competitive advantage.

Stated very bluntly, the alignment work done in OpenAI may be a fig leaf similar to greenwashing, which could be called “Alignment-washing”.

This post is a summary of my presentation in the Reading Group session 264.


The goal of alignment is that the AI does not kill everyone

It's worth pointing out that there was no time when alignment meant "AI doesn't kill everyone:"

  • I first encountered the term "alignment" as part of Stuart Russell's "value alignment" (e.g. here) by which he means something like "aligning the utility function of your AI with the values of the human race." This is clearly broader than not killing everyone.
  • As far as I know MIRI first used the term in this paper which defined "aligned" as "reliably pursues beneficial goals" (though I think they only defined it for a smarter than human AI). This is also broader than not killing everyone.
  • I used to say "AI control" to mean "getting an AI to try to do what you want it to do." In 2017 I switched to using "AI alignment" at the suggestion of Rob Bensinger and MIRI, who proposed using "alignment" as a synonym for Bostrom's "control problem." The control problem is defined in Superintelligence as the principal-agent problem between a human and the AI system they built, which is broader than not killing everyone. I have tried to offer a more precise definition of how I use the term AI alignment: "figuring out how to build AI systems that are trying to do what their operators want them to do."
  • Eliezer has used AI alignment (later than Russell AFAIK) to mean the whole area of research relevant to building sufficiently advanced AIs such that "running them produces good outcomes in the real world." This makes it an incredibly complicated empirical and normative question what counts as alignment, and AFAICT practically anything might count. I think this is an absolutely terrible definition. You should define the problem you want to work on, not define alignment as "whatever really actually matters" and then argue that empirically the technical problems you care about are the ones that really actually matter. That's quite literally an invitation to argue about what the term refers to. I still honestly find it hard to believe that people at MIRI considered this a reasonable way of defining and using the term.

So I would say the thing you are describing as "the goal" and "focus" of alignment is just a special case that you care a lot about. (I also care a lot about this problem! See discussion in AI alignment is distinct from its near-term applications.) This isn't a case of a term being used in a pure and clear way by one community and then co-opted or linguistically corrupted by another; I think it's a case of a community being bad at defining and using terms, equivocating about definitions, and smuggling complicated empirical claims into proposed "definitions." I've tried to use the term in a consistent way over the last 6 years.

I think it's reasonable to use "AI safety" to refer to reducing the risk of negative impacts from AI and "AI existential safety" to refer to reducing the risk of existential catastrophes from AI.

I am sympathetic to the recent amusing proposal of "AI notkilleveryoneism" for the particular area of AI existential safety that's about reducing the risk that your AI deliberately kills everyone. (Though I find Eliezer's complaints about linguistic drift very unsympathetic.) I usually just describe the goal in a longer way like "figuring out how to build an AI that won't deliberately kill us" and then have shorter words for particular technical problems that I believe are relevant to that goal (like alignment, robustness, interpretability...)

(Sorry for a bit of a rant, but I've seen a lot of people complaining about this in ways I disagree with. OP happened to be the post that I replied to.)

Bostrom's definition of the control problem in 'Superintelligence' only refers to "harming the project's interests", which, as you point out, is broader than existential risk. However, the immediate context makes it clear that Bostrom is discussing existential risk. The "harm" referred to does not include things like gender bias.

On reflection, I don't actually believe that AI Alignment has ever exclusively referred to existential risk from AI. I do believe that talk about "AI Alignment" on LessWrong has usually primarily been about existential risk. I further think that the distinction from "Value Alignment" (and if that is related to existential risk) has been muddled and debated.

I think the term "The Alignment Problem" is used because this community agrees that one problem (not killing everyone) is far and away more central than the rest (e.g. designing an AI to refuse to tell you how to make drugs).

Apart from the people here from OpenAI/DeepMind/etc, I expect general agreement that the task "Getting GPT to better understand and follow instructions" is not AI Alignment, but AI Capability. Note that I am moving my goalpost from defending the claim "AI Alignment = X-Risk" to defending "Some of the things OpenAI call AI Alignment is not AI Alignment".

At this point I should repeat my disclaimer that all of this is my impression, and not backed by anything rigorous. Thank you for engaging anyway - I enjoyed your "rant".

The control problem is initially introduced as: "the problem of how to control what the superintelligence would do." In the chapter you reference it is presented as the principal agent problem that occurs between a human and the superintelligent AI they build (apparently the whole of that problem).

It would be reasonable to say that there is no control problem for modern AI because Bostrom's usage of "the control problem" is exclusively about controlling superintelligence. On this definition either there is no control research today, or it comes back to the implicit controversial empirical claim about how some work is relevant and other work is not.

If you are teaching GPT to better understand instructions I would also call that improving its capability (though some people would call it alignment, this is the de dicto vs de re distinction discussed here). If it already understands instructions and you are training it to follow them, I would call that alignment.

I think you can use AI alignment however you want, but this is a lame thing to get angry at labs about and you should expect ongoing confusion.

The launch of ChatGPT seems to have not gone the way OpenAI expected

Why do we think it didn't go as expected?

Eliezer: OpenAI probably thought they were trying hard at precautions; but they didn't have anybody on their team who was really creative about breaking stuff, let alone anyone as creative as the combined Internet; so it got jailbroken in like a day after something smarter looked at it.

I think this is very weak evidence. As far as I know, "jailbreaking it" did no damage - at least I haven't seen anybody point to any damage created. On the other hand, it did give OpenAI training data it could use to fix many of the holes.

Even if you don't agree with that strategy, I see no evidence that this wasn't the planned strategy.

On reflection, I agree that it is only weak evidence. I agree we know nothing about damage. I agree that we have no evidence that this wasn't the planned strategy. Still, the evidence the other way (that this was deliberate to gather training data) is IMHO weaker.

My point in the "Review" section is that OpenAI's plan committed them to transparency about these questions, and yet we have to rely on speculations.

I find the fact that they used the training data within a short time to massively reduce the "jailbreak" cases to be evidence that the point of the exercise was to gather training data.

ChatGPT has a mode where it labels your question as illegitimate and colors it red but still gives you an answer. Then there's the feedback button to tell OpenAI if it made a mistake. This behavior prioritizes gathering training data over not giving any problematic answers.

Maybe the underlying reason why we are interpreting the evidence in different ways is because we are holding OpenAI to different standards:

Compared to a standard company, having a feedback button is evidence of competence. Quickly incorporating training data is also a positive update, as is having an explicit graphical representation of illegitimate questions.

I am comparing OpenAI to the extremely high standard of "being able to solve the alignment problem". Against this standard, having a feedback button is absolutely expected, and even things like Eliezer's suggestion (publishing hashes of your gambits) should be obvious to companies competent enough to have a chance of solving the alignment problem.

It's important to be able to distinguish factual questions from questions about judgments. "Did the OpenAI release happen the way OpenAI expected?" is a factual question that has nothing to do with the question of what standards we should have for OpenAI.

If you get the factual questions wrong, it's very easy for people within OpenAI to dismiss your arguments.

I fully agree that it is a factual question, and OpenAI could easily shed light on the circumstances around the launch if they chose to do so.

That's not even an assertion that it didn't go as they expected, let alone an explanation of why one would assume that.

Seems to me Yudkowsky was (way) too pessimistic about OpenAI there. They probably knew something like this would happen.

Thanks for writing this up. I agree with several of the subpoints you make about how the plan could be more specific, measurable, etc. 

I'm not sure where I stand on some of the more speculative (according to me) claims about OpenAI's intentions. Put differently, I see your post making two big-picture claims: 

  1. The 1-2 short blog posts about the OpenAI plan failed to meet several desired criteria. Reality doesn't grade on a curve, so even though the posts weren't intended to spell out a bunch of very specific details, we should hold the world's leading AGI company to high standards, and we should encourage them to release SMARTer and more detailed plans. (I largely agree with this)  
  2. OpenAI is alignment-washing, and their safety efforts are not focused on AI x-risk. (I'm much less certain about this claim and I don't think there's much evidence presented here to support it). 

IMO,  the observation that OpenAI's plan isn't "SMART" could mean that they're alignment-washing. But it could also simply mean that they're working toward making their plan SMARTer, and they're working toward making their plans more specific/measurable, but they wanted to share what they had so far (which I commend them for). Similarly, the fact that OpenAI is against pivotal-acts could mean that they're not taking the "we need to escape the critical risk period" goal seriously or it could mean that they reject one particular way of escaping the acute risk period, and they're trying to find alternatives. 

I also think I have some sort of prior that goes something like "you should have a high bar for confidently claiming that someone isn't pursuing the same goal as you, just because their particular approach to achieving that goal isn't yet specific/solid."

I'm also confused though, because you probably have a bunch of other pieces of evidence going into your model of OpenAI, and I don't believe that everyone should have to write up a list of 50 reasons in order to criticize the intentions of a lab.

All things considered, I think I land somewhere like "I think it's probably worth acknowledging more clearly that the accusations about alignment-washing are speculative, and the evidence in the post could be consistent with an OpenAI that really is trying hard to solve the alignment problem. Or acknowledge that you have other reasons for believing the alignment-washing claims that you've decided not to go into in the post."

I'll do both:

  1. I (again) affirm that this is very speculative.
  2. A substantial part of my private evidence is my personal evaluation of the CEO of OpenAI. I am really uneasy about stating this in public, but I now regret keeping my very negative evaluation of SBF private. Speak the truth, even if your voice trembles. I think a full "Heel turn" is more likely than not.

"The plan suggests a relatively narrow AI would be sufficient to contribute to alignment research, while being too narrow to be dangerous. This seems implausible: Much easier problems than alignment research have been called AGI-complete, and general reasoning ability is widely thought to be a requirement for doing research" - Could you clarify what some of these easier problems are and why they are AGI-complete?


I haven't seen a rigorous treatment of the concept of AGI-completeness, though various problems have been suggested as AGI-complete.

I don't have a solid answer, but I would be surprised if the task "Write the book 'Superintelligence'" required less general intelligence than "full self-driving from NY to SF".

I'm interested why you would think that writing "Superintelligence" would require less GI than full self-driving from NY to SF. The former seems like a pretty narrow task compared to the latter.

I was unclear. Let me elaborate:

"AGI-Completeness" is the idea that a large class of tasks have the same difficulty, roughly analogous to "Turing-Completeness" and "NP-Completeness".

My claim in the post is that I doubt OpenAI's hope that the task "Alignment Research" will turn out to be strictly easier than any dangerous task.

My claim in my comment above refers to the relative difficulty of 2 tasks:

  1. Make a contribution to Alignment Research comparable to the contribution of the book 'Superintelligence'.
  2. Drive from NY to SF without human intervention except for filling the gas tank etc.

I am willing to bet there won't be a level of AI capability persisting more than 3 months where 1) is possible, but 2) is not possible.

I can't give a really strong argument for this intuition. I could see a well-trained top-percentile chimpanzee having a 0.1% probability of making the car trip. I could not see any chimpanzee coming up with anything comparable to 'Superintelligence', no matter the circumstances.