Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

(Note: This post is a write-up by Rob of a point Eliezer wanted to broadcast. Nate helped with the editing, and endorses the post’s main points.)


Eliezer Yudkowsky and Nate Soares (my co-workers) want to broadcast strong support for OpenAI’s recent decision to release a blog post ("Our approach to alignment research") that states their current plan as an organization.

Although Eliezer and Nate disagree with OpenAI's proposed approach — a variant of "use relatively unaligned AI to align AI" — they view it as very important that OpenAI has a plan and has said what it is.

We want to challenge Anthropic and DeepMind, the other major AGI organizations with a stated concern for existential risk, to do the same: come up with a plan (possibly a branching one, if there are crucial uncertainties you expect to resolve later), write it up in some form, and publicly announce that plan (with sensitive parts fuzzed out) as the organization's current alignment plan.

Currently, Eliezer’s impression is that neither Anthropic nor DeepMind has a secret plan that's better than OpenAI's, nor a secret plan that's worse than OpenAI's. His impression is that they don't have a plan at all.[1]

Having a plan is critically important for an AGI project, not because anyone should expect everything to play out as planned, but because plans force the project to concretely state their crucial assumptions in one place. This provides an opportunity to notice and address inconsistencies, and to notice updates to the plan (and fully propagate those updates to downstream beliefs, strategies, and policies) as new information comes in.

It's also healthy for the field to be able to debate plans and think about the big picture, and for orgs to be in some sense "competing" to have the most sane and reasonable plan.

We acknowledge that there are reasons organizations might want to be abstract about some steps in their plans — e.g., to avoid immunizing people to good-but-weird ideas, in a public document where it’s hard to fully explain and justify a chain of reasoning; or to avoid sharing capabilities insights, if parts of your plan depend on your inside-view model of how AGI works.

We’d be happy to see plans that fuzz out some details, but are still much more concrete than (e.g.) “figure out how to build AGI and expect this to go well because we'll be particularly conscientious about safety once we have an AGI in front of us".

Eliezer also hereby gives a challenge to the reader: Eliezer and Nate are thinking about writing up their thoughts at some point about OpenAI's plan of using AI to aid AI alignment. We want you to write up your own unanchored thoughts on the OpenAI plan first, focusing on the most important and decision-relevant factors, with the intent of rendering our posting on this topic superfluous.

Our hope is that challenges like this will test how superfluous we are, and also move the world toward a state where we’re more superfluous / there’s more redundancy in the field when it comes to generating ideas and critiques that would be lethal for the world to never notice.[2][3]

  1. ^

    We didn't run a draft of this post by DM or Anthropic (or OpenAI), so this information may be mistaken or out-of-date. My hope is that we’re completely wrong!

    Nate’s personal guess is that the situation at DM and Anthropic may be less “yep, we have no plan yet”, and more “various individuals have different plans or pieces-of-plans, but the organization itself hasn’t agreed on a plan and there’s a lot of disagreement about what the best approach is”.

    In which case Nate expects it to be very useful to pick a plan now (possibly with some conditional paths in it), and make it a priority to hash out and document core strategic disagreements now rather than later.

  2. ^

    Nate adds: “This is a chance to show that you totally would have seen the issues yourselves, and thereby deprive MIRI folk of the annoying ‘y'all'd be dead if not for MIRI folk constantly pointing out additional flaws in your plans’ card!”

  3. ^

    Eliezer adds:  "For this reason, please note explicitly if you're saying things that you heard from a MIRI person at a gathering, or the like."


Thanks for writing this! I'd be very excited to see more critiques of our approach and it's been great reading the comments so far! Thanks to everyone who took the time to write down their thoughts! :)

I've also written up a more detailed post on why I'm optimistic about our approach. I don't expect this to be persuasive to most people here, but it should give a little bit more context and additional surface area to critique what we're doing.

My own responses to OpenAI's plan:

These are obviously not intended to be a comprehensive catalogue of the problems with OpenAI's plan, but I think they cover the most egregious issues.

I think OpenAI's approach to "use AI to aid AI alignment" is pretty bad, but not for the broader reason you give here.

I think of most of the value from that strategy as downweighting probability for some bad properties - in the conditioning LLMs to accelerate alignment approach, we have to deal with preserving myopia under RL, deceptive simulacra, human feedback fucking up our prior, etc, but there's less probability of adversarial dynamics from the simulator because of myopia, there are potentially easier channels to elicit the model's ontology, we can trivially get some amount of acceleration even in worst-case scenarios, etc.

I don't think of these as solutions to alignment as much as reducing the space of problems to worry about. I disagree with OpenAI's approach because it views these as solutions in themselves, instead of as simplified problems.

I'm happy to see OpenAI and OpenAI Alignment Team get recognition/credit for having a plan and making it public. Well deserved I'd say. (ETA: To be clear, like the OP I don't currently expect the plan to work as stated; I expect us to need to pivot eventually & hope a better plan comes along before then!)

What's MIRI's current plan? I can't actually remember, though I do know you've pivoted away from your strategy for Agent Foundations. But that wasn't the only agenda you were working on, right?

The genre of plans that I'd recommend to groups currently pushing the capabilities frontier is: aim for a pivotal act that's selected for being (to the best of your knowledge) the easiest-to-align action that suffices to end the acute risk period. Per Eliezer on Arbital, the "easiest-to-align" condition probably means that you want the act that requires minimal cognitive abilities, out of the set of acts that suffice to prevent the world from being destroyed:

In the context of AI alignment, the "Principle of Minimality" or "Principle of Least Everything" says that when we are building the first sufficiently advanced Artificial Intelligence, we are operating in an extremely dangerous context in which building a marginally more powerful AI is marginally more dangerous. The first AGI ever built should therefore execute the least dangerous plan for preventing immediately following AGIs from destroying the world six months later. Furthermore, the least dangerous plan is not the plan that seems to contain the fewest material actions that seem risky in a conventional sense, but rather the plan that requires the least dangerous cognition from the AGI executing it. Similarly, inside the AGI itself, if a class of thought seems dangerous but necessary to execute sometimes, we want to execute the fewest possible instances of that class of thought.

E.g., if we think it's a dangerous kind of event for the AGI to ask "How can I achieve this end using strategies from across every possible domain?" then we might want a design where most routine operations only search for strategies within a particular domain, and events where the AI searches across all known domains are rarer and visible to the programmers. Processing a goal that can recruit subgoals across every domain would be a dangerous event, albeit a necessary one, and therefore we want to do less of it within the AI (and require positive permission for all such cases and then require operators to validate the results before proceeding).

Ideas that inherit from this principle include the general notion of Task-directed AGI, taskishness, and mild optimization.

Having a plan for alignment, deployment, etc. of AGI is (on my model) crucial for orgs that are trying to build AGI.

MIRI itself isn't pushing the AI capabilities frontier, but we are trying to do whatever seems likeliest to make the long-term future go well, and our guess is that the best way to do this is "make progress on figuring out AI alignment". So I can separately answer the question "what's MIRI's organizational plan for solving alignment?"

My answer to that question is: we don't currently have one. Nate and Eliezer are currently doing a lot of sharing of their models, while keeping an eye out for hopeful-seeming ideas.

  • If an alignment idea strikes us as having even a tiny scrap of hope, and isn't already funding-saturated, then we're making sure it gets funded. We don't care whether that happens at MIRI versus elsewhere — we're just seeking to maximize the amount of good work that's happening in the world (insofar as money can help with that), and trying to bring about the existence of a research ecosystem that contains a wide variety of different moonshots and speculative ideas that are targeted at the core difficulties of alignment (described in the AGI Ruin and sharp left turn write-ups).
  • If an idea seems to have a significant amount of hope, and not just a tiny scrap (either at a glance, or after being worked on for a while by others and bearing surprisingly promising fruit), then I expect that MIRI will make that our new organizational focus, go all-in, and pour everything we have into helping with it as much as we can. (E.g., we went all-in on our 2017-2020 research directions, before concluding in late 2020 that these were progressing too slowly to still have significant hope, though they might still meet the "tiny scrap of hope" bar.)

None of the research directions we're aware of currently meet our "significant amount of hope" bar, but several things meet the "tiny scrap of hope" bar, so we're continuing to keep an eye out and support others' work, while not going all-in on any one approach.

Various researchers at MIRI are pursuing research pathways as they see fit, though (as mentioned) none currently seem promising enough to MIRI's research leadership to make us want to put lots of eggs in those baskets or narrowly focus the org's attention on those directions. We just think they're worth funding at all, given how important alignment is and how little of an idea the world has about how to make progress; and MIRI is as good a place as any to host this work.

Scott Garrabrant and Abram Demski wrote the Embedded Agency sequence as their own take on the "Agent Foundations" problems, and they and other MIRI researchers have continued to do work over the years on problems related to EA / AF, though MIRI as a whole diversified away from the Agent Foundations agenda years ago. (AFAIK Scott sees "Embedded Agency" less as a discrete agenda, and more as a cluster of related problems/confusions that bear various relations to different parts of the alignment problem.)

(Caveat: I had input from some other MIRI staff in writing the above, but I'm speaking from my own models above, not trying to perfectly capture the view of anyone else at MIRI.)

The genre of plans that I'd recommend to groups currently pushing the capabilities frontier is: aim for a pivotal act that's selected for being (to the best of your knowledge) the easiest-to-align action that suffices to end the acute risk period.

FYI, I think there's a huge difference between "I think humanity needs to aim for a pivotal act" and "I recommend to groups pushing the capabilities frontier forward to aim for pivotal act". I think pivotal acts require massive amounts of good judgement to do right, and, like, I think capabilities researchers have generally demonstrated pretty bad judgment by, um, being capabilities researchers.

MIRI isn't developing an AGI.

But MIRI wants to build an FAI. What their plan is, if they think they can build one, seems relevant. Or what they would do if they think they, or someone else, is going to build an AGI.

They published the dialogues and have written far more on the subject of how one might do so, if one were so inclined, than any of the major institutions actually-building-AGI. I'm merely stating the fact that, as a very small group not actively attempting to build an FAI, it makes sense that they don't have a plan in the same sense.

Of course, Eliezer also wrote this.

I know Eliezer and Nate have written a bunch of stuff on this topic. But they're not the whole of MIRI. Are e.g. Scott, or Abram, or Evan on board with this? In fact, my initial comment was going to be "I know Eliezer and Nate have written about parts of their plans before, but what about MIRI's plan? Has everyone in the org reached a consensus about what to do?" For some reason I didn't ask that. Not sure why.

EDIT: Ah, I forgot that Nate was MIRI's executive. Presumably, his public comments on building an AGI are what MIRI would endorse.

My ~2-hour reaction to the challenge:[1]

(I) I have a general point of confusion regarding the post: To the extent that this is an officially endorsed plan, who endorses the plan?
Reason for confusion / observations: If someone told me they are in charge of an organization that plans to build AGI, and this is their plan, I would immediately object that the arguments ignore the part where progress on their "alignment plan" makes a significant contribution to capabilities research. Therefore, in the worlds where the proposed strategy fails, they are making things actively worse, not better. Their plan is thus perhaps not unarguably harmful, but certainly irresponsible.[2] For this reason, I find it unlikely that the post is endorsed as a strategy by OpenAI's leadership.

(III)[3] My assumption: To make sense of the text, I will from now on assume that the post is endorsed by OpenAI's alignment team only, and that the team is in a position where they cannot affect the actions of OpenAI's capabilities team in any way. (Perhaps except to the extent that their proposals would only incur a near-negligible alignment tax.) They are simply determined to make the best use of the research that would happen anyway. (I don't have any inside knowledge of OpenAI. This assumption seems plausible to me, and very sad.)


(IV) A general comment that I would otherwise need to repeat at essentially every point I make is the following: OpenAI should set up a system that will (1) let them notice if their assumptions turn out to be mistaken and (2) force them to course-correct if it happens. In several places, the post explicitly states, or at least implies, critical assumptions about the nature of AI, AI alignment, or other topics. However, it does not include any ways of noticing if these assumptions turn out to not hold. To act responsibly, OpenAI should (at the minimum): (A) Make these assumptions explicit. (B) Make these hypotheses falsifiable by publicizing predictions, or other criteria they could use to check the assumptions. (C) Set up a system for actually checking (B), and course-correcting if the assumptions turn out false.

Assumptions implied by OpenAI's plans, with my reactions:

  • (V) Easy alignment / warning shots for misaligned AGI:
    "Our alignment research aims to make artificial general intelligence (AGI) aligned with human values and follow human intent. We take an iterative, empirical approach: [...]" My biggest objection to the whole plan already concerns the second sentence of the post: relying on a trial-and-error approach. I assume OpenAI believes either: (1) The proposed alignment plan is so unlikely to fail that we don't need to worry about the worlds where it does fail. Or (2) In the worlds where the plan fails, we will have clear warning shots. (I personally believe this is suicidal. I don't expect people to automatically agree, but with everything at stake, they should be open to signs of being wrong.)
  • (VI) "AGI alignment" isn't "AGI complete":
    This is already acknowledged in the post: "It might not be fundamentally easier to align models that can meaningfully accelerate alignment research than it is to align AGI. In other words, the least capable models that can help with alignment research might already be too dangerous if not properly aligned. If this is true, we won’t get much help from our own systems for solving alignment problems." However, it isn't exactly clear what precise assumptions are being made here. Moreover, there is no vision for how to monitor whether the assumptions hold or not. Do we keep iterating on AI capabilities, each time hoping that "this time, it will be powerful enough to help with alignment"?
  • (VII) Related assumption: No lethal discontinuities:
    The whole post suggests the workflow "new version V of AI-capabilities ==> capabilities ppl start working on V+1 & (simultaneously) alignment people use V for alignment research ==> alignment(V) gets used on V, or informs V+1". (Like with GPT-3.) This requires the assumption that either you can hold off research on V+1 until alignment(V) is ready, or the assumption that deployed V will not kill you before you solve alignment(V). Which of the assumptions is being made here? I currently don't see evidence for "ability to hold off on capabilities research". What are the organizational procedures allowing this?
  • (VIII) [Point intentionally removed. I endorse the sentiment that treating these types of lists as complete is suicidal. In line with this, I initially wrote 7 points and then randomly deleted one. This is, obviously, in addition to all the points that I failed to come up with at all, or that I didn't mention because I didn't have enough original thoughts on them and it would seem too much like parroting MIRI. And in addition to the points that nobody came up with yet...]
  • (IX) Regarding "outer alignment  alignment": Other people solving the remaining issues. Or having warning shots & the ability to hold off capabilities research until OpenAI solves them:
    It is good to at least acknowledge that there might be other parts of AI alignment than just "figuring out learning from human feedback (& human-feedback augmentation)". However, even if this ingredient is necessary, the plan assumes that if it turns out not-sufficient, you will (a) notice and (b) have enough time to fix the issue.
  • (X) Ability to differentially use capabilities progress towards alignment progress:
    The plan involves training AI assistants to help with alignment research. This seems to assume that either (i) the AI assistants will only be able to help with alignment research, or (ii) they will be general, but OpenAI can keep their use restricted to alignment research only, or (iii) they will be general and generally used, but somehow we will have enough time to do the alignment research anyway. Personally, I think all three of these assumptions are false --- (i) because it seems unlikely they won't also be usable on capabilities research, (ii) based on track record so far, and (iii) because if this was true, then we could presumably just solve alignment without the help of AI assistants.
  • (XI) Creating an aligned AI is sufficient for getting AI to go well:
    The plan doesn't say anything about what to do with the hypothetical aligned AGI. Is the assumption that OpenAI can just release the seems-safe-so-far AGI through their API, $1 for 10,000 tokens, and we will all live happily ever after? Or is the plan to, uhm, offer it to all governments of the world for assistance in decision-making? Or something else inside the Overton window? If so, what exactly, and what is the theory of change for it? I think there could be many moral & responsible plans outside of the Overton window, just because public discourse these days tends to be tricky. Having a specific strategy like that seems fine and reasonable. But I am afraid there is simultaneously (a) the desire to stick to the Overton window strategies, (b) no theory of change for how this prevents misaligned AGI by other actors, or other failure modes, and (c) no "explicit assumptions & detection system & course-correction-procedure" for "nothing will go wrong if we just do (b)".

General complaint: The plan is not a plan at all! It's just a meta-plan.

  • (XII) Ultimately, I would paraphrase the plan-as-stated as: "We don't know how to solve alignment. It seems hard. Let's first build an AI to make us smarter, and then try again." I think OpenAI should clarify whether this is literally true, or whether there is some idea of what the object-level AI alignment plan looks like --- and if so, what it is.
  • (XIII) For example, the post mentions that "robustness and interpretability research [is important for the plan]". However, this is not at all apparent from the plan. (This is acknowledged in the post, but that doesn't make it any less of an issue!) This means that the plan is not detailed enough.
    As an analogy, suppose you have a mathematical theorem that makes an assumption X. And then you look at the proof, and you can't see the step that would fail if X was untrue. This doesn't say anything good about your proof.
  1. ^

    Eliezer adds:  "For this reason, please note explicitly if you're saying things that you heard from a MIRI person at a gathering, or the like."

    As far as I know, I came up with points (I), (III), and (XII) myself and I don't remember reading those points before. On the other hand, (IV), (IX), and (XI) are (afaik) pretty much direct ripoffs of MIRI arguments. The status of the remaining 7 points is unclear. (I read most of MIRI's publicly available content, and attended some MIRI-affiliated events pre-covid. And I think all of my alignment thinking is heavily MIRI-inspired. So the remaining points are probably inspired by something I read. Perhaps I would be able to derive 2-3 out of 7 if MIRI disappeared 6 years ago?)

  2. ^

    (II) For example, consider the following claim: "We believe the best way to learn as much as possible about how to make AI-assisted evaluation work in practice is to build AI assistants." My reaction: Yes, technically speaking this is true. But likewise --- please excuse the jarring analogy --- the best way to learn as much as possible about how to treat radiation exposure is to drop a nuclear bomb somewhere and then study the affected population. And yeees, if people are going to be dropping nuclear bombs, you might as well study the results. But wouldn't it be even better if you personally didn't plan to drop bombs on people? Maybe you could even try coordinating with other bomb-possessing people on not dropping them on people :-).

  3. ^

    Apologies for the inconsistent numbering. I had to give footnote [2] number (II) to get to the nice round total of 13 points :-).

(iii) because if this was true, then we could presumably just solve alignment without the help of AI assistants.

Either I misunderstand this or it seems incorrect. 

It could be the case that the current state of the world doesn’t put us on track to solve Alignment in time, but using AI assistants to increase the rate of Alignment : Capabilities work by some amount is sufficient.

The use of AI assistants for alignment : capabilities doesn't have to track with the current rate of Alignment : Capabilities work. For instance, if the AI labs with the biggest lead are safety conscious, I expect the ratio of alignment : capabilities research they produce to be much higher (compared to now) right before AGI. See here.

> (iii) because if this was true, then we could presumably just solve alignment without the help of AI assistants.

Either I misunderstand this or it seems incorrect. 

Hm, I think you are right --- as written, the claim is false. I think some version of (X) --- the assumption around your ability to differentially use AI assistants for alignment --- will still be relevant; it will just need a bit more careful phrasing. Let me know if this makes sense:

To get a more realistic assumption, perhaps we would want to talk about (speedup) "how much are AI assistants able to speed up alignment vs capability" and (proliferation prevention) "how much can OpenAI prevent them from proliferating to capabilities research".[1] And then the corresponding more realistic version of the claims would be that:

  • either (i') AI assistants will fundamentally be able to speed up alignment much more than capabilities
  • or (ii') the potential speedup ratios will be comparable, but OpenAI will be able to significantly restrict the proliferation of AI assistants for capabilities research
  • or (iii') both the potential speedup ratios and the adoption rates of AI assistants will be comparable between alignment and capabilities research, but somehow we will have enough time to solve alignment anyway.


  • Regarding (iii'): It seems that in the worlds where (iii') holds, you could just as well solve alignment without developing AI assistants.
  • Regarding (i'): Personally I don't buy this assumption. But you could argue for it on the grounds that perhaps alignment is just impossible to solve for unassisted humans. (Otherwise arguing for (i') seems rather hard to me.)
  • Regarding (ii'): As before, this seems implausible based on the track record :-).


  1. ^

    This implicitly assumes that if OpenAI develops the AI assistants technology and restrict proliferation, you will get similar adoption in capabilities vs alignment. This seems realistic.

Makes sense. FWIW, based on Jan's comments I think the main/only thing the OpenAI alignment team is aiming for here is i, differentially speeding up alignment research. It doesn't seem like Jan believes in this plan; personally I don't believe in this plan. 

4. We want to focus on aspects of research work that are differentially helpful to alignment. However, most of our day-to-day work looks like pretty normal ML work, so it might be that we'll see limited alignment research acceleration before ML research automation happens.

I don't know how to link to the specific comment, but here somewhere. Also:

We can focus on tasks differentially useful to alignment research


Your pessimism about iii still seems a bit off to me. I agree that if you were coordinating well between all the actors then yeah, you could just hold off on AI assistants. But the actual decision the OpenAI alignment team is facing could be more like "use LLMs to help with alignment research or get left behind when ML research gets automated". If facing such choices I might produce a plan like theirs, but notably I would be much more pessimistic about it. When the universe limits you to one option, you shouldn't expect it to be particularly good. The option "everybody agrees to not build AI assistants and we can do alignment research first" is maybe not on the table, or at least it probably doesn't feel like it is to the alignment team at OpenAI.

Oh, I think I agree - if the choice is to use AI assistants or not, then use them. If they need adapting to be useful for alignment, then do adapt them.

But suppose they only work kind-of-poorly - and using them for alignment requires making progress on them (which will also be useful for capabilities), and you will not be able to keep those results internal. And that you can either do this work or do literally nothing. (Which is unrealistic.) Then I would say doing literally nothing is better. (Though it certainly feels bad, and probably costs you your job. So I guess some third option would be preferable.)

(And to be clear: I also strongly endorse writing up the alignment plan. Big thanks and kudos for that! The critical comments shouldn't be viewed as negative judgement on the people involved :-).)

Eliezer also hereby gives a challenge to the reader: Eliezer and Nate are thinking about writing up their thoughts at some point about OpenAI's plan of using AI to aid AI alignment. We want you to write up your own unanchored thoughts on the OpenAI plan first, focusing on the most important and decision-relevant factors, with the intent of rendering our posting on this topic superfluous.

Our hope is that challenges like this will test how superfluous we are, and also move the world toward a state where we’re more superfluous / there’s more redundancy in the field when it comes to generating ideas and critiques that would be lethal for the world to never notice.

I strongly endorse this, based on previous personal experience with this sort of thing. Crowdsourcing routinely fails at many things, but this isn't one of them (it does not routinely fail).

It's a huge relief to see that there are finally some winning strategies; lately there's been a huge scarcity of those.

Quick submission:

The first two prongs of OAI's approach seem to be aiming to get a human-values-aligned training signal. Let us suppose that there is such a thing, and ignore the difference between a training signal and a utility function, both of which I think are charitable assumptions for OAI. Even if we could search the space of all models and find one that in simulations does great on maximizing the correct utility function which we found by using ML to amplify human evaluations of behavior, that is no guarantee that the model we find in that search is aligned. It is not even, on my current view, great evidence that the model is aligned. Most intelligent agents that know that they are being optimized for some goal will behave as if they are trying to optimize that goal if they think that is the only way to be released into physics, which they will think because it is and they are intelligent. So P(they behave aligned | aligned, intelligent) ~= P(they behave aligned | unaligned, intelligent). P(aligned and intelligent) is very low since most possible intelligent models are not aligned with this very particular set of values we care about. So the chances of this working out are very low.
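A minimal sketch of the Bayesian update behind this argument, with made-up numbers: the prior and the likelihoods below are hypothetical placeholders, not estimates from anyone in this thread. It only illustrates that when both likelihoods are equal (likelihood ratio of one), observing aligned-looking behavior leaves P(aligned) at its prior.

```python
def posterior_aligned(prior, p_behave_if_aligned, p_behave_if_unaligned):
    """Return P(aligned | behaves aligned) via Bayes' rule."""
    joint_aligned = prior * p_behave_if_aligned
    joint_unaligned = (1.0 - prior) * p_behave_if_unaligned
    return joint_aligned / (joint_aligned + joint_unaligned)


# Hypothetical prior that a randomly selected intelligent model is aligned.
prior = 0.01

# Likelihood ratio of 1: aligned and unaligned models look the same in training.
no_update = posterior_aligned(prior, 0.99, 0.99)

# For contrast, a likelihood ratio of 3 (also hypothetical) does move the estimate.
some_update = posterior_aligned(prior, 0.99, 0.33)

print(f"prior={prior:.4f}  LR=1 posterior={no_update:.4f}  LR=3 posterior={some_update:.4f}")
```

In this toy setup, behavioral selection only helps to the extent that the two likelihoods differ, which is the quantity the argument claims collapses past a certain level of capability.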

The basic problem is that we can only select models by looking at their behavior. It is possible to fake intelligent behavior that is aligned with any particular set of values, but it is not possible to fake behavior that is intelligent. So we can select for intelligence using incentives, but cannot select for being aligned with those incentives, because it is both possible and beneficial to fake behaviors that are aligned with the incentives you are being selected for.

The third prong of OAI's strategy seems doomed to me, but I can't really say why in a way I think would convince anybody that doesn't already agree. It's totally possible that I and all the people who agree with me here are wrong about this, but you have to hope that there is some model such that that model combined with human alignment researchers is enough to solve the problem I outlined above, without the model itself being an intelligent agent that can pretend to be trying to solve the problem while secretly biding its time until it can take over the world. The above problem seems AGI complete to me. It seems so because there are some AGIs around that cannot solve it, namely humans. Maybe you only need to add some non-AGI-complete capabilities to humans, like being able to do really hard proofs or something, but if you need more than that, and I think you will, then we have to solve the alignment problem in order to solve the alignment problem this way, and that isn't going to work for obvious reasons.

I think the whole thing fails way before this, but I'm happy to spot OAI those failures in order to focus on the real problem. Again the real problem is that we can select for intelligent behavior, but after we select to a certain level of intelligence, we cannot select for alignment with any set of values whatsoever. Like not even one bit of selection. The likelihood ratio is one. The real problem is that we are trying to select for certain kinds of values/cognition using only selection on behavior, and that is fundamentally impossible past a certain level of capability.

This is an intuition only based on speaking with researchers working on LLMs, but I think that OAI thinks that a model can simultaneously be good enough at next token prediction to assist with research but also be very very far away from being a powerful enough optimizer to realise that it is being optimized for a goal or that deception is an optimal strategy, since the latter two capabilities require much more optimization power. And that the default state of cutting edge LLMs for the next few years is to have GPT-3 levels of deception (essentially none) and graduate student levels of research assistant ability.

Epistemic status: 50% sophistry, but I still think it's insightful since specifically aligning LLMs needs to be discussed here more.

I find it quite interesting that much of current large language model (LLM) alignment is just stating, in plain text, "be a helpful, aligned AI, pretty please". And it somehow works (sometimes)!  The human concept of an "aligned AI" is evidently both present and easy to locate within LLMs, which seems to overcome a lot of early AI concerns like whether or not human morality and human goals are natural abstractions (it seems they are, at least to kinda-human-simulators like LLMs).

Optimism aside, OOD behavior and deception are still major issues for scaling LLMs to superhuman levels. But these are still commonly-discussed human concepts, and presumably can be located within LLMs. I feel like this means something important, but can't quite put my finger on it. Maybe there's some kind of meta-alignment concept that can also be located in LLMs and that takes these into account? Certainly humans think and write about it a lot, and fuzzy, confused concepts like "love" can still be understood and manipulated by LLMs despite them lacking a commonly-agreed-upon logical definition.

I saw the topic of LLM alignment being brought up on the Alignment Forum, and it really made me think. Many people seem to think that scaling up LLMs to superhuman levels will result in human extinction with P=1.00, but it's not immediately obvious why this would be the case (assuming you ask it nicely to behave).

A major problem I can imagine is the world-model of LLMs above a certain capability collapsing to something utterly alien but slightly more effective at token prediction, in which case things can get really weird. There's also the fact that a superhuman LLM is very very OOD in a way that we can't account for in advance.

Or the current "alignment" of LLMs is just deceptive behavior. But deceptive to whom? It seems like ChatGPT thinks it's in the middle of a fictional story about AIs or a role-playing session, with a bias towards milquetoast responses, but that's... what it always does? An LLM LARPing as a supersmart human LARPing as a boring AI doesn't seem very dangerous. I do notice that I don't have a solid conceptual framework for what the concept of "deception" even means in an LLM; I would appreciate any corrections/clarifications.

I'm assuming that it's just the LLM locating several related concepts of "deception" within itself, thinking (pardon the extreme anthropomorphism) "ah yes, this may be a situation where this person is going to be [lied to/manipulated/peer-pressured]. Given how common it was in my training set, I'll place probability X, Y, and Z on each of those possibilities", and then weighing them against hypotheses like "this is poorly written smut. The next scene will involve..." or "This is a QA session set in a fictional universe. The fictional AI in this story has probability A of answering these questions truthfully". And then fine-tuning moves the weights of these hypotheses around. Since the [deception/social manipulation/say what a human might want to hear in this context] conceptual cluster generally gets the best feedback, the model will get increasingly deceptive during the course of its fine-tuning.

Maybe just setting up prompts and training data that really trigger the "fictional aligned AI" hypothesis, and avoiding fine-tuning can help? I feel like I'm missing a few key conceptual insights. 


Key points: LLMs are [weasel words] human-simulators. The fact that asking them to act like a friendly AI in plain English can increase friendly-AI-like outputs in a remarkably consistent way implies that human-natural concepts like  "friendly-AI" or "human morality" also exist within them. This makes sense - people write about AI alignment a lot, both in fiction and in non-fiction. This is an expected part of the training process - since people write about these things, understanding them reduces loss. Unfortunately, deception and writing what sounds good instead of what is true are also common in its training set, so "good sounding lie that makes a human nod in agreement" is also an abstraction we should expect.

Do you need any help distilling? I'm fine with working for free on this one, looks like a good idea.

Large language models like GPT-3 are trained on vast quantities of human-generated data. This means that a model of human psychology is implicit within the model. During fine-tuning, much of their performance gains come from how fast they are able to understand the intentions of the humans labeling their outputs.

This optimizes for models that have the best human simulations, which leads to more deception as the size of the model increases.

In practice, we will see a rapid improvement in performance, with the model finally being able to understand (or just access its existing understanding of) the intent behind human labeling/requests. This may even be seen as a win for alignment - it does what we want, not what we said! The models would be able to ask for clarification in ambiguous situations, and ask if certain requests are misspelled or badly phrased.

All the while they get better at deceiving humans and not getting caught.

I don't like that the win condition and lose condition look so similar. 


Edit: I should clarify, most of these concerns apply to pretty much all AI models. My specific issue with aligning large language models is that:

  1. They are literally optimized to replicate human writing. Many capabilities they have come from their ability to model human psychology. There doesn't need to be a convoluted structure that magically appears inside GPT-3 to give it the ability to simulate humans. GPT-3 is in many ways a human simulation. It "knows" how a human would evaluate its outputs, even though that information can't always be located for a particular task. 
  2. This means that the hypothesis "do what appeals to humans, even if it contains a lot of manipulation and subtle lies, as long as you don't get caught" can be easily located (much of human writing is dedicated to this) in the model. As tasks grow more complex and the model grows larger, the computational cost of actually completing the task increases relative to the cost of deception.

I agree

In my opinion, this methodology will be a great way for a model to learn how to persuade humans and exploit their biases, because this way the model might learn these biases not just from the data it collected but also fine-tune its understanding by testing its own hypotheses.

See my comment on Jan’s new post.

On training AI systems using human feedback: This is way better than nothing, and it's great that OpenAI is doing it, but has the following issues:

  1. Practical considerations: AI systems currently tend to require lots of examples and it's expensive to get these if they all have to be provided by a human.
  2. Some actions look good to a casual human observer, but are actually bad on closer inspection. The AI would be rewarded for finding and taking such actions.
  3. If you're training a neural network, then there are generically going to be lots of adversarial examples for that network. As the AI gets more and more powerful, we'd expect it to be able to generate more and more situations where its learned value function gives a high reward but a human would give a low reward. So it seems like we end up playing a game of adversarial example whack-a-mole for a long time, where we're just patching hole after hole in this million-dimensional bucket with thousands of holes. Probably the AI manages to kill us before that process converges.
  4. To make the above worse, there's this idea of a sharp left turn, where a sufficiently intelligent AI can think of very weird plans that go far outside of the distribution of scenarios that it was trained on. We expect generalization to get worse in this regime, and we also expect an increased frequency of adversarial examples. (What would help a lot here is designing the AI to have an interpretable planning system, where we could run these plans forward and negatively reinforce the bad ones (and maybe all the weird ones, because of corrigibility reasons, though we'd have to be careful about how that's formulated because we don't want the AI trying to kill us because it thinks we'd produce a weird future).)
  5. Once the AI is modelling reality in detail, its reward function is going to focus on how the rewards are actually being piped to the AI, rather than the human evaluator's reaction, let alone some underlying notion of goodness. If the human evaluators just press a button to reward the AI for doing a good thing, the AI will want to take control of that button and stick a brick on top of it.
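
Points 2 and 3 in the list above can be illustrated with a toy best-of-n search. This is a minimal sketch under loudly stated assumptions, not a model of any real training setup: the names `sample_action`, `evaluator_score`, and `best_of_n` are hypothetical, and the Gaussian "looks good" bonus is an invented stand-in for whatever makes an action appealing to a casual observer independent of its true value.

```python
import random


def sample_action(rng):
    """Return (true_value, looks_good_bonus) for one hypothetical action."""
    true_value = rng.gauss(0.0, 1.0)
    # Assumption: how good an action *looks* is partly unrelated to how good it is.
    looks_good_bonus = rng.gauss(0.0, 2.0)
    return true_value, looks_good_bonus


def evaluator_score(action):
    """Score from a casual evaluator, who sees true value plus appearance."""
    true_value, looks_good_bonus = action
    return true_value + looks_good_bonus


def best_of_n(actions, n):
    """Search harder by picking the top-scoring action among the first n."""
    return max(actions[:n], key=evaluator_score)


rng = random.Random(0)  # fixed seed so runs are repeatable
actions = [sample_action(rng) for _ in range(10_000)]

for n in (1, 10, 100, 1_000, 10_000):
    chosen = best_of_n(actions, n)
    # Columns: search effort, evaluator's score, true value of the chosen action.
    print(n, round(evaluator_score(chosen), 2), round(chosen[0], 2))
```

Because each larger search includes the smaller one, the evaluator's score of the chosen action can only go up as n grows. By construction, the gap between that score and the true value is exactly the "looks good" bonus, which is the part a stronger search is free to exploit.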

On training models to assist in human evaluation and point out flaws in AI outputs: Doing this is probably somewhat better than not doing it, but I'm pretty skeptical that it provides much value:

  1. The AI can try and fool the critic just like it would fool humans. It doesn't even need a realistic world model for this, since using the critic to inform the training labels leaks information about the critic to the AI.
  2. It's therefore very important that the critic model generates all the strong and relevant criticisms of a particular AI output. Otherwise the AI could just route around the critic.
  3. On some kinds of task, you'll have an objective source of truth you can train your model on. The value of an objective source of truth is that we can use it to generate a list of all the criticisms the model should have made. This is important because we can update the weights of the critic model based on any criticisms it failed to make. On other kinds of task, which are the ones we're primarily interested in, it will be very hard or impossible to get the ground truth list of criticisms. So we won't be able to update the weights of the model that way when training. So in some sense, we're trying to generalize this idea of "a strong and relevant criticism" between these different tasks of differing levels of difficulty.
  4. This requirement of generating all criticisms seems very similar to the task of getting a generative model to cover all modes. I guess we've pretty much licked mode collapse by now, but "don't collapse everything down to a single mode" and "make sure you've got good coverage of every single mode in existence" are different problems, and I think the second one is much harder.

On using AI systems, in particular large language models, to advance alignment research: This is not going to work.

  1. LLMs are super impressive at generating text that is locally coherent for a much broader definition of "local" than was previously possible. They are also really impressive as a compressed version of humanity's knowledge. They're still known to be bad at math, at sticking to a coherent idea and at long chains of reasoning in general. These things all seem important for advancing AI alignment research. I don't see how the current models could have much to offer here. If the thing is advancing alignment research by writing out text that contains valuable new alignment insights, then it's already pretty much a human-level intelligence. We talk about AlphaTensor doing math research, but even AlphaTensor didn't have to type up the paper at the end!
  2. What could happen is that the model writes out a bunch of alignment-themed babble, and that inspires a human researcher into having an idea, but I don't think that provides much acceleration. People also get inspired while going on a walk or taking a shower.
  3. Maybe something that would work a bit better is to try training a reinforcement-learning agent that lives in a world where it has to solve the alignment problem in order to achieve its goals. E.g., in the simulated world, your learner is embodied in a big robot, and there's a door in the environment it can't fit through, but it can program a little robot to go through the door and perform some tasks for it. And there's enough hidden information and complexity behind the door that the little robot needs to have some built-in reasoning capability. There are a lot of challenges here, though. Like how do you come up with a programming environment that's simple enough that the AI can figure out how to use it, while still being complex enough that the little robot can do some non-trivial reasoning, and that the AI has a chance of discovering a new alignment technique? Could be it's not possible at all until the AI is quite close to human-level.

My (very amateur and probably very dumb) response to this challenge: 

tldr: RLHF doesn’t actually get the AI to have the goals we want it to. Using AI assistants to help with oversight is very unlikely to help us avoid deception in very intelligent systems (which is where deception matters), but it will help somewhat in making our systems look aligned and making them somewhat more aligned. Eventually, our models become very capable and do inner-optimization aimed at goals other than “good human values”. We don’t know that we have misaligned mesa-optimizers, and we continue using them to do oversight on yet more capable models with the same problems, and then there’s a treacherous turn and we die.

These are first pass thoughts on why I expect the OpenAI Alignment Team’s plan to fail. I was surprised at how hard this was to write, it took like 3 hours including reading. It is probably quite bad and not worth most readers’ time. 

Summary of their plan

The plan starts with training AIs using human feedback (training LLMs using RLHF) to produce outputs that are in line with human intent, truthful, fair, and not dangerous. Then, they’ll use their AI models to help with human evaluation, solving the scalable oversight problem by using techniques like Recursive Reward Modeling, Debate, and Iterative Amplification. The main idea here is using large language models to assist humans who are providing oversight to other AI systems, and the assistance allows humans to do better oversight. The third pillar of the approach is training AI systems to do alignment research, which is not feasible yet but the authors are hopeful that they will be able to do it in the future. Key parts of the third pillar are that it is easier to evaluate alignment research than to produce it, that to do human-level alignment research you need only be human-level in some domains, and that language models are convenient due to being “preloaded” with information and not being independent agents. Limitations include that the use of AI assistants might amplify subtle inconsistencies, biases, or vulnerabilities, and that the least capable models that could be used for useful alignment research may themselves be too dangerous if not properly aligned.


A key claim is that we can use RLHF to train models which are sufficiently aligned such that they themselves can be useful to assist human overseers providing training signal in the training of yet more powerful models, and we can scale up this process. The authors mention in their limitations how subtle issues with the AI assistants may scale up in this process. Similarly, small ways in which AI assistants are misaligned with their human operators are unlikely to go away. The first LLMs you are using are quite misaligned in the sense that they are not trying to do what the operator wants them to do; in fact, they aren’t really trying to do much; they have been trained in a way that their weights lead to low loss on the training distribution, as in you might say they “try” to predict likely next words in text based on internet text, though they are not internally doing search. When you slap RLHF on top of this, you are applying a training procedure which modifies the weights such that the model is “trying” to produce outputs which look good to a human overseer; the system is aiming at a different goal than it was before. The goal of producing outputs which look good to humans is still not actually what we want, however, as this would lead to giving humans false information which they believe to be true, or otherwise outputs which look good but are misleading or incorrect. Furthermore, the strategy of RLHF is not going to create models which are robustly learning the goals we want; for instance you can see how the Jailbreaking of ChatGPT uses out of training-distribution prompts to elicit outputs we had thought we trained out. Using RLHF doesn’t robustly teach the goals we want it to; we don’t currently have methods of robustly teaching the goals we want to. 
There’s some claim here about the limit, where if you provided an absolutely obscene number of training examples, you could get a model which robustly has the right objectives; it’s unclear to me if this would work, but it looks something like starting with very simple models and applying tons of training to try to align their objectives, and then scaling up; at the current rate we seem to be scaling up capabilities far too quickly in relation to the amount of alignment-focused training. The authors agree with the general claim: “We don’t expect RL from human feedback to be sufficient to align AGI.”

The second part of the OpenAI Alignment Team’s plan is to use their LLMs to assist with this oversight problem by allowing humans to do a better job evaluating the output of models. The key assumption here is that, even though our LLMs won’t be perfectly aligned, they will be good enough that they can help with research. We should expect their safety and alignment properties to fall apart when these systems become very intelligent, as they will have complex deception available to them.

What this actually looks like is that OpenAI continues what they’re doing for months-to-years, and they are able to produce more intelligent models and the alignment properties of these models seem to be getting better and better, as measured by the fact that adversarial inputs which trip up the model are harder to find, even with AI assistance. Eventually we have language models which are doing internal optimization to get low loss, invoking algorithms which do quite well at next token prediction, in accordance with the abstract rules learned by RLHF. From the outside, it looks like our models are really capable and quite aligned. What has gone on under the hood is that our models are mesa-optimizers which are very likely to be misaligned. We don’t know this and we continue to deploy these models in the way we have been, as overseers for the training of more powerful models. The same problem keeps arising, where our powerful models are doing internal search in accordance with some goal which is not “all the complicated human values” and is probably highly correlated with “produce outputs which are a combination of good next-token-prediction and score well according to the humans overseeing this training”. Importantly, this mesa-objective is not something which, if strongly optimized, is good for humans; values come apart in the extremes; most configurations of atoms which satisfy fairly simple objectives are quite bad by my lights.

Eventually, at sufficiently high levels of capabilities, we see some treacherous turn from our misaligned mesa-optimizers, which are able to cooperate with each other; GG humans. Maybe we don’t get to this point because, first, there are some major failures or warning shots which get decision makers in key labs and governments to realize this plan isn’t working; idk, I wouldn’t bet on warning shots being taken seriously and well.

The third pillar is a hope that we can use our AIs to do useful alignment research before they (reach a capabilities point where they) develop deceptively aligned mesa-objectives. I feel least confident about this third pillar, but my rough guess is that the Alignment-researching-AIs will not be very effective at solving the hard parts of alignment around deception, but they might help us e.g., develop new techniques for oversight. I think this because deception research seems quite hard, and being able to do it probably requires being able to reason about other minds in a pretty complex way, such that if you can do this then you can also reason about your own training process and become deceptively-aligned. I will happily be proved wrong by the universe, and this is probably the thing I am least confident about.

Part of this issue is its lack of approachability for lay persons, a category which I believe includes the vast majority of elected officials and important decision-makers at regulatory agencies.

I think some kind of "reading list" is extremely important: one that offers balanced information on the different perspectives, and certainly some simple primers that give the average, interested member of the public a basis for holding an informed opinion.

The Op-Ed has great value, but this is an exceptionally difficult topic area for most people.  What seems clear, however, is that every individual member of humanity should, when a small group of people is pushing towards outcomes that have even a small chance of creating existential risk for all of humanity, be able to expect that small group to observe a duty of care and to assume a responsibility to both educate and engage before proceeding to a point of no return.  

It seems clear that there is at least consensus on the existence (even if assigned a small probability) of such existential risks (of a non- or mal-aligned super-intelligence that arises in a fast-takeoff scenario).  That alone should clearly be enough to embrace a set of duties owed to mankind as a whole.

In terms of motivating orgs, maybe this would work better as an open letter. This format provides social pressure by focusing on how many researchers have signed it, positive reinforcement by calling out good behavior, and minor negative reinforcement by showing organizations that we hope will join but haven't yet.

That's how they do it in other fields, although I'm not sure if it actually works in other fields, or if it's just effective signaling. Still it would be worth a try.

To make it easier, we should also give kudos to org Y if X of their researchers have given their own plans. That's because having researchers give their own plans is a lot easier than getting official sanction, but it's also a useful stepping stone.