(Note: This post is a write-up by Rob of a point Eliezer wanted to broadcast. Nate helped with the editing, and endorses the post’s main points.)
Eliezer Yudkowsky and Nate Soares (my co-workers) want to broadcast strong support for OpenAI’s recent decision to release a blog post ("Our approach to alignment research") that states their current plan as an organization.
Although Eliezer and Nate disagree with OpenAI's proposed approach — a variant of "use relatively unaligned AI to align AI" — they view it as very important that OpenAI has a plan and has said what it is.
We want to challenge Anthropic and DeepMind, the other major AGI organizations with a stated concern for existential risk, to do the same: come up with a plan (possibly a branching one, if there are crucial uncertainties you expect to resolve later), write it up in some form, and publicly announce that plan (with sensitive parts fuzzed out) as the organization's current alignment plan.
Currently, Eliezer’s impression is that neither Anthropic nor DeepMind has a secret plan that's better than OpenAI's, nor a secret plan that's worse than OpenAI's. His impression is that they don't have a plan at all.
Having a plan is critically important for an AGI project, not because anyone should expect everything to play out as planned, but because plans force the project to concretely state their crucial assumptions in one place. This provides an opportunity to notice and address inconsistencies, and to notice updates to the plan (and fully propagate those updates to downstream beliefs, strategies, and policies) as new information comes in.
It's also healthy for the field to be able to debate plans and think about the big picture, and for orgs to be in some sense "competing" to have the most sane and reasonable plan.
We acknowledge that there are reasons organizations might want to be abstract about some steps in their plans — e.g., to avoid immunizing people to good-but-weird ideas, in a public document where it’s hard to fully explain and justify a chain of reasoning; or to avoid sharing capabilities insights, if parts of your plan depend on your inside-view model of how AGI works.
We’d be happy to see plans that fuzz out some details, but are still much more concrete than (e.g.) “figure out how to build AGI and expect this to go well because we'll be particularly conscientious about safety once we have an AGI in front of us".
Eliezer also hereby gives a challenge to the reader: Eliezer and Nate are thinking about writing up their thoughts at some point about OpenAI's plan of using AI to aid AI alignment. We want you to write up your own unanchored thoughts on the OpenAI plan first, focusing on the most important and decision-relevant factors, with the intent of rendering our posting on this topic superfluous.
Our hope is that challenges like this will test how superfluous we are, and also move the world toward a state where we’re more superfluous / there’s more redundancy in the field when it comes to generating ideas and critiques that would be lethal for the world to never notice.
We didn't run a draft of this post by DM or Anthropic (or OpenAI), so this information may be mistaken or out-of-date. My hope is that we’re completely wrong!
Nate’s personal guess is that the situation at DM and Anthropic may be less “yep, we have no plan yet”, and more “various individuals have different plans or pieces-of-plans, but the organization itself hasn’t agreed on a plan and there’s a lot of disagreement about what the best approach is”.
In which case Nate expects it to be very useful to pick a plan now (possibly with some conditional paths in it), and make it a priority to hash out and document core strategic disagreements now rather than later.
Nate adds: “This is a chance to show that you totally would have seen the issues yourselves, and thereby deprive MIRI folk of the annoying ‘y'all'd be dead if not for MIRI folk constantly pointing out additional flaws in your plans’ card!”
Eliezer adds: "For this reason, please note explicitly if you're saying things that you heard from a MIRI person at a gathering, or the like."
Thanks for writing this! I'd be very excited to see more critiques of our approach and it's been great reading the comments so far! Thanks to everyone who took the time to write down their thoughts! :)
I've also written up a more detailed post on why I'm optimistic about our approach. I don't expect this to be persuasive to most people here, but it should give a little bit more context and additional surface area to critique what we're doing.
My own responses to OpenAI's plan:
These are obviously not intended to be a comprehensive catalogue of the problems with OpenAI's plan, but I think they cover the most egregious issues.
I think OpenAI's approach to "use AI to aid AI alignment" is pretty bad, but not for the broader reason you give here.
I think of most of the value from that strategy as downweighting probability for some bad properties - in the conditioning LLMs to accelerate alignment approach, we have to deal with preserving myopia under RL, deceptive simulacra, human feedback fucking up our prior, etc, but there's less probability of adversarial dynamics from the simulator because of myopia, there are potentially easier channels to elicit the model's ontology, we can trivially get some amount of acceleration even in worst-case scenarios, etc.
I don't think of these as solutions to alignment as much as reducing the space of problems to worry about. I disagree with OpenAI's approach because it views these as solutions in themselves, instead of as simplified problems.
I'm happy to see OpenAI and OpenAI Alignment Team get recognition/credit for having a plan and making it public. Well deserved I'd say. (ETA: To be clear, like the OP I don't currently expect the plan to work as stated; I expect us to need to pivot eventually & hope a better plan comes along before then!)
What's MIRI's current plan? I can't actually remember, though I do know you've pivoted away from your strategy for Agent Foundations. But that wasn't the only agenda you were working on, right?
The genre of plans that I'd recommend to groups currently pushing the capabilities frontier is: aim for a pivotal act that's selected for being (to the best of your knowledge) the easiest-to-align action that suffices to end the acute risk period. Per Eliezer on Arbital, the "easiest-to-align" condition probably means that you want the act that requires minimal cognitive abilities, out of the set of acts that suffice to prevent the world from being destroyed:
Having a plan for alignment, deployment, etc. of AGI is (on my model) crucial for orgs that are trying to build AGI.
MIRI itself isn't pushing the AI capabilities frontier, but we are trying to do whatever seems likeliest to make the long-term future go well, and our guess is that the best way to do this is "make progress on figuring out AI alignment". So I can separately answer the question "what's MIRI's organizational plan for solving alignment?"
My answer to that question is: we don't currently have one. Nate and Eliezer are currently doing a lot of sharing of their models, while keeping an eye out for hopeful-seeming ideas.
None of the research directions we're aware of currently meet our "significant amount of hope" bar, but several things meet the "tiny scrap of hope" bar, so we're continuing to keep an eye out and support others' work, while not going all-in on any one approach.
Various researchers at MIRI are pursuing research pathways as they see fit, though (as mentioned) none currently seem promising enough to MIRI's research leadership to make us want to put lots of eggs in those baskets or narrowly focus the org's attention on those directions. We just think they're worth funding at all, given how important alignment is and how little of an idea the world has about how to make progress; and MIRI is as good a place as any to host this work.
Scott Garrabrant and Abram Demski wrote the Embedded Agency sequence as their own take on the "Agent Foundations" problems, and they and other MIRI researchers have continued to do work over the years on problems related to EA / AF, though MIRI as a whole diversified away from the Agent Foundations agenda years ago. (AFAIK Scott sees "Embedded Agency" less as a discrete agenda, and more as a cluster of related problems/confusions that bear various relations to different parts of the alignment problem.)
(Caveat: I had input from some other MIRI staff in writing the above, but I'm speaking from my own models above, not trying to perfectly capture the view of anyone else at MIRI.)
FYI, I think there's a huge difference between "I think humanity needs to aim for a pivotal act" and "I recommend to groups pushing the capabilities frontier forward to aim for pivotal act". I think pivotal acts require massive amounts of good judgement to do right, and, like, I think capabilities researchers have generally demonstrated pretty bad judgment by, um, being capabilities researchers.
MIRI isn't developing an AGI.
But MIRI wants to build an FAI. What their plan is, if they think they can build one, seems relevant. Or what they would do if they think they, or someone else, is going to build an AGI.
They published the dialogues and have written far more on the subject of how one might do so if one were inclined than any of the major institutions actually building AGI. I'm merely stating the fact that, as a very small group not actively attempting to build an FAI, it makes sense that they don't have a plan in the same sense.
Of course, Eliezer also wrote this.
I know Eliezer and Nate have written a bunch of stuff on this topic. But they're not the whole of MIRI. Are e.g. Scott, or Abram, or Evan on board with this? In fact, my initial comment was going to be "I know Eliezer and Nate have written about parts of their plans before, but what about MIRI's plan? Has everyone in the org reached a consensus about what to do?" For some reason I didn't ask that. Not sure why.
EDIT: Ah, I forgot that Nate was MIRI's executive. Presumably, his public comments on building an AGI are what MIRI would endorse.
My ~2-hour reaction to the challenge:
(I) I have a general point of confusion regarding the post: To the extent that this is an officially endorsed plan, who endorses the plan?
Reason for confusion / observations: If someone told me they are in charge of an organization that plans to build AGI, and this is their plan, I would immediately object that the arguments ignore the part where progress on their "alignment plan" makes a significant contribution to capabilities research. Therefore, in the worlds where the proposed strategy fails, they are making things actively worse, not better. Therefore, their plan is perhaps not unarguably harmful, but certainly irresponsible. For this reason, I find it unlikely that the post is endorsed as a strategy by OpenAI's leadership.
(III) My assumption: To make sense of the text, I will from now on assume that the post is endorsed by OpenAI's alignment team only, and that the team is in a position where they cannot affect the actions of OpenAI's capabilities team in any way. (Perhaps except to the extent that their proposals would only incur a near-negligible alignment tax.) They are simply determined to make the best use of the research that would happen anyway. (I don't have any inside knowledge of OpenAI. This assumption seems plausible to me, and very sad.)
(IV) A general comment that I would otherwise need to repeat for essentially every point I make is the following: OpenAI should set up a system that will (1) let them notice if their assumptions turn out to be mistaken and (2) force them to course-correct if it happens. In several places, the post explicitly states, or at least implies, critical assumptions about the nature of AI, AI alignment, or other topics. However, it does not include any ways of noticing if these assumptions turn out not to hold. To act responsibly, OpenAI should (at the minimum): (A) Make these assumptions explicit. (B) Make these hypotheses falsifiable by publicizing predictions, or other criteria they could use to check the assumptions. (C) Set up a system for actually checking (B), and course-correcting if the assumptions turn out false.
Assumptions implied by OpenAI's plans, with my reactions:
"Our alignment research aims to make artificial general intelligence (AGI) aligned with human values and follow human intent. We take an iterative, empirical approach: [...]" My biggest objection to the whole plan already concerns the second sentence of the post: it relies on a trial-and-error approach. I assume OpenAI believes either: (1) The proposed alignment plan is so unlikely to fail that we don't need to worry about the worlds where it does fail. Or (2) In the worlds where the plan fails, we will have clear warning shots. (I personally believe this is suicidal. I don't expect people to automatically agree, but with everything at stake, they should be open to signs of being wrong.)
This is already acknowledged in the post: "It might not be fundamentally easier to align models that can meaningfully accelerate alignment research than it is to align AGI. In other words, the least capable models that can help with alignment research might already be too dangerous if not properly aligned. If this is true, we won’t get much help from our own systems for solving alignment problems." However, it isn't exactly clear what precise assumptions are being made here. Moreover, there is no vision for how to monitor whether the assumptions hold or not. Do we keep iterating on AI capabilities, each time hoping that "this time, it will be powerful enough to help with alignment"?
The whole post suggests the workflow "new version V of AI-capabilities ==> capabilities ppl start working on V+1 & (simultaneously) alignment people use V for alignment research ==> alignment(V) gets used on V, or informs V+1". (Like with GPT-3.) This requires the assumption that either you can hold off research on V+1 until alignment(V) is ready, or the assumption that deployed V will not kill you before you solve alignment(V). Which of the assumptions is being made here? I currently don't see evidence for "ability to hold off on capabilities research". What are the organizational procedures allowing this?
It is good to at least acknowledge that there might be other parts of AI alignment than just "figuring out learning from human feedback (& human-feedback augmentation)". However, even if this ingredient is necessary, the plan assumes that if it turns out not-sufficient, you will (a) notice and (b) have enough time to fix the issue.
The plan involves training AI assistants to help with alignment research. This seems to assume that either (i) the AI assistants will only be able to help with alignment research, or (ii) they will be general, but OpenAI can keep their use restricted to alignment research only, or (iii) they will be general and generally used, but somehow we will have enough time to do the alignment research anyway. Personally, I think all three of these assumptions are false --- (i) because it seems unlikely they won't also be usable on capabilities research, (ii) based on track record so far, and (iii) because if this was true, then we could presumably just solve alignment without the help of AI assistants.
The plan doesn't say anything about what to do with the hypothetical aligned AGI. Is the assumption that OpenAI can just release the seems-safe-so-far AGI through their API, $1 for 10,000 tokens, and we will all live happily ever after? Or is the plan to, uhm, offer it to all governments of the world for assistance in decision-making? Or something else inside the Overton window? If so, what exactly, and what is the theory of change for it? I think there could be many moral & responsible plans outside of the Overton window, just because public discourse these days tends to be tricky. Having a specific strategy like that seems fine and reasonable. But I am afraid there is simultaneously (a) the desire to stick to Overton-window strategies, (b) no theory of change for how this prevents misaligned AGI by other actors, or other failure modes, and (c) no "explicit assumptions & detection system & course-correction procedure" for the assumption that "nothing will go wrong if we just do (a)".
General complaint: The plan is not a plan at all! It's just a meta-plan.
As an analogy, suppose you have a mathematical theorem that makes an assumption X. And then you look at the proof, and you can't see the step that would fail if X was untrue. This doesn't say anything good about your proof.
As far as I know, I came up with points (I), (III), and (XII) myself and I don't remember reading those points before. On the other hand, (IV), (IX), and (XI) are (afaik) pretty much direct ripoffs of MIRI arguments. The status of the remaining 7 points is unclear. (I read most of MIRI's publicly available content, and attended some MIRI-affiliated events pre-covid. And I think all of my alignment thinking is heavily MIRI-inspired. So the remaining points are probably inspired by something I read. Perhaps I would be able to derive 2-3 out of 7 if MIRI had disappeared 6 years ago?)
(II) For example, consider the following claim: "We believe the best way to learn as much as possible about how to make AI-assisted evaluation work in practice is to build AI assistants." My reaction: Yes, technically speaking this is true. But likewise --- please excuse the jarring analogy --- the best way to learn as much as possible about how to treat radiation exposure is to drop a nuclear bomb somewhere and then study the affected population. And yeees, if people are going to be dropping nuclear bombs, you might as well study the results. But wouldn't it be even better if you personally didn't plan to drop bombs on people? Maybe you could even try coordinating with other bomb-possessing people on not dropping them on people :-).
Apologies for the inconsistent numbering. I had to give a footnote the number (II) to get to the nice round total of 13 points :-).
Either I misunderstand this or it seems incorrect.
It could be the case that the current state of the world doesn’t put us on track to solve Alignment in time, but using AI assistants to increase the ratio of Alignment : Capabilities work by some amount is sufficient.
The use of AI assistants for alignment : capabilities doesn't have to track the current ratio of Alignment : Capabilities work. For instance, if the AI labs with the biggest lead are safety-conscious, I expect the ratio of alignment : capabilities research they produce to be much higher (compared to now) right before AGI. See here.
Hm, I think you are right --- as written, the claim is false. I think some version of (X) --- the assumption around your ability to differentially use AI assistants for alignment --- will still be relevant; it will just need a bit more careful phrasing. Let me know if this makes sense:
To get a more realistic assumption, perhaps we would want to talk about (speedup) "how much are AI assistants able to speed up alignment vs capability" and (proliferation prevention) "how much can OpenAI prevent them from proliferating to capabilities research". And then the corresponding more realistic version of the claims would be that:
This implicitly assumes that if OpenAI develops the AI assistants technology and restricts proliferation, you will get similar adoption in capabilities vs alignment. This seems realistic.
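To make the speedup/proliferation framing above concrete, here is a toy sketch (all numbers purely hypothetical, and `differential_speedup` is my own made-up name, not anything from OpenAI's plan): what matters is not how much AI assistants accelerate alignment in absolute terms, but how much of that acceleration survives once some of it leaks into capabilities research.

```python
# Toy model, hypothetical numbers: AI assistants multiply alignment
# research output by `speedup_alignment`. They also leak into
# capabilities research, scaled by `proliferation` (0 = fully
# contained, 1 = capabilities benefits just as much as alignment).
def differential_speedup(speedup_alignment, speedup_capabilities, proliferation):
    effective_cap_speedup = 1 + proliferation * (speedup_capabilities - 1)
    return speedup_alignment / effective_cap_speedup

# Perfect containment: alignment keeps the full 3x advantage.
print(differential_speedup(3.0, 3.0, 0.0))  # 3.0
# Full proliferation: no differential advantage at all.
print(differential_speedup(3.0, 3.0, 1.0))  # 1.0
# Partial containment: some net advantage survives.
print(differential_speedup(3.0, 3.0, 0.5))  # 1.5
```

On this toy framing, the more realistic version of the assumption is a claim that this ratio stays meaningfully above 1.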
Makes sense. FWIW, based on Jan's comments I think the main/only thing the OpenAI alignment team is aiming for here is i, differentially speeding up alignment research. It doesn't seem like Jan believes in this plan; personally I don't believe in this plan.
I don't know how to link to the specific comment, but here somewhere. Also:
Your pessimism about iii still seems a bit off to me. I agree that if you were coordinating well between all the actors then yeah, you could just hold off on AI assistants. But the actual decision the OpenAI alignment team is facing could be more like "use LLMs to help with alignment research or get left behind when ML research gets automated". If facing such choices I might produce a plan like theirs, but notably I would be much more pessimistic about it. When the universe limits you to one option, you shouldn't expect it to be particularly good. The option "everybody agrees to not build AI assistants and we can do alignment research first" is maybe not on the table, or at least it probably doesn't feel like it is to the alignment team at OpenAI.
Oh, I think I agree - if the choice is to use AI assistants or not, then use them. If they need adapting to be useful for alignment, then do adapt them.
But suppose they only work kind-of-poorly - and using them for alignment requires making progress on them (which will also be useful for capabilities), and you will not be able to keep those results internal. And that you can either do this work or do literally nothing. (Which is unrealistic.) Then I would say doing literally nothing is better. (Though it certainly feels bad, and probably costs you your job. So I guess some third option would be preferable.)
(And to be clear: I also strongly endorse writing up the alignment plan. Big thanks and kudos for that! The critical comments shouldn't be viewed as negative judgement on the people involved :-).)
I strongly endorse this, based on previous personal experience with this sort of thing. Crowdsourcing routinely fails at many things, but this isn't one of them (it does not routinely fail).
It's a huge relief to see that there are finally some winning strategies; lately there's been a huge scarcity of those.
The first two prongs of OAI's approach seems to be aiming to get a human values aligned training signal. Let us suppose that there is such a thing, and ignore the difference between a training signal and a utility function, both of which I think are charitable assumptions for OAI. Even if we could search the space of all models and find one that in simulations does great on maximizing the correct utility function which we found by using ML to amplify human evaluations of behavior, that is no guarantee that the model we find in that search is aligned. It is not even on my current view great evidence that the model is aligned. Most intelligent agents that know that they are being optimized for some goal will behave as if they are trying to optimize that goal if they think that is the only way to be released into physics, which they will think because it is and they are intelligent. So P(they behave aligned | aligned, intelligent) ~= P(they behave aligned | unaligned, intelligent). P(aligned and intelligent) is very low since most possible intelligent models are not aligned with this very particular set of values we care about. So the chances of this working out are very low.
The basic problem is that we can only select models by looking at their behavior. It is possible to fake intelligent behavior that is aligned with any particular set of values, but it is not possible to fake behavior that is intelligent. So we can select for intelligence using incentives, but cannot select for being aligned with those incentives, because it is both possible and beneficial to fake behaviors that are aligned with the incentives you are being selected for.
The third prong of OAI's strategy seems doomed to me, but I can't really say why in a way I think would convince anybody that doesn't already agree. It's totally possible me and all the people who agree with me here are wrong about this, but you have to hope that there is some model such that that model combined with human alignment researchers is enough to solve the problem I outlined above, without the model itself being an intelligent agent that can pretend to be trying to solve the problem while secretly biding its time until it can take over the world. The above problem seems AGI complete to me. It seems so because there are some AGIs around that cannot solve it, namely humans. Maybe you only need to add some non AGI complete capabilities to humans, like being able to do really hard proofs or something, but if you need more than that, and I think you will, then we have to solve the alignment problem in order to solve the alignment problem this way, and that isn't going to work for obvious reasons.
I think the whole thing fails way before this, but I'm happy to spot OAI those failures in order to focus on the real problem. Again the real problem is that we can select for intelligent behavior, but after we select to a certain level of intelligence, we cannot select for alignment with any set of values whatsoever. Like not even one bit of selection. The likelihood ratio is one. The real problem is that we are trying to select for certain kinds of values/cognition using only selection on behavior, and that is fundamentally impossible past a certain level of capability.
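The "not even one bit of selection" claim above is just Bayes' rule with a likelihood ratio of one. A minimal sketch (my own illustrative numbers, not anything from OAI):

```python
# Bayes update on "this model is aligned" after observing aligned-looking
# behavior. If a deceptive unaligned model imitates aligned behavior
# perfectly, the likelihood ratio is 1 and the posterior equals the prior.
def posterior_aligned(prior, p_behavior_if_aligned, p_behavior_if_unaligned):
    joint_aligned = prior * p_behavior_if_aligned
    joint_unaligned = (1 - prior) * p_behavior_if_unaligned
    return joint_aligned / (joint_aligned + joint_unaligned)

# Perfect imitation: the observation teaches us nothing.
print(round(posterior_aligned(0.01, 0.99, 0.99), 6))  # 0.01
# Imperfect deception would make the behavior carry some evidence.
print(round(posterior_aligned(0.01, 0.99, 0.50), 3))  # 0.02
```

The argument in the parent comments is that past a certain capability level we are forced into the first regime, where behavioral selection leaves the prior untouched.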
This is an intuition only based on speaking with researchers working on LLMs, but I think that OAI thinks that a model can simultaneously be good enough at next token prediction to assist with research but also be very very far away from being a powerful enough optimizer to realise that it is being optimized for a goal or that deception is an optimal strategy, since the latter two capabilities require much more optimization power. And that the default state of cutting edge LLMs for the next few years is to have GPT-3 levels of deception (essentially none) and graduate student levels of research assistant ability.
This inspired a full length post.
Epistemic status: 50% sophistry, but I still think it's insightful since specifically aligning LLMs needs to be discussed here more.
I find it quite interesting that much of current large language model (LLM) alignment is just stating, in plain text, "be a helpful, aligned AI, pretty please". And it somehow works (sometimes)! The human concept of an "aligned AI" is evidently both present and easy to locate within LLMs, which seems to overcome a lot of early AI concerns like whether or not human morality and human goals are natural abstractions (it seems they are, at least to kinda-human-simulators like LLMs).
Optimism aside, OOD and deceptions are still major issues for scaling LLMs to superhuman levels. But these are still commonly-discussed human concepts, and presumably can be located within LLMs. I feel like this means something important, but can't quite put my finger on it. Maybe there's some kind of meta-alignment concept that can also be located in LLMs which take these into account? Certainly humans think and write about it a lot, and fuzzy, confused concepts like "love" can still be understood and manipulated by LLMs despite them lacking a commonly-agreed-upon logical definition.
I saw the topic of LLM alignment being brought up on the Alignment Forum, and it really made me think. Many people seem to think that scaling up LLMs to superhuman levels will result in human extinction with P=1.00, but it's not immediately obvious why this would be the case (assuming you ask it nicely to behave).
A major problem I can imagine is the world-model of LLMs above a certain capability collapsing to something utterly alien but slightly more effective at token prediction, in which case things can get really weird. There's also the fact that a superhuman LLM is very very OOD in a way that we can't account for in advance.
Or the current "alignment" of LLMs is just deceptive behavior. But deceptive to whom? It seems like ChatGPT thinks it's in the middle of a fictional story about AIs or a role-playing session, with a bias towards milquetoast responses, but that's... what it always does? An LLM LARPing as a supersmart human LARPing as a boring AI doesn't seem very dangerous. I do notice that I don't have a solid conceptual framework for what the concept of "deception" even means in an LLM; I would appreciate any corrections/clarifications.
I'm assuming that it's just the LLM locating several related concepts of "deception" within itself, thinking (pardon the extreme anthropomorphism) "ah yes, this may be a situation where this person is going to be [lied to/manipulated/peer-pressured]. Given how common it was in my training set, I'll place probabilities X, Y, and Z on each of those possibilities", and then weighing them against hypotheses like "this is poorly written smut. The next scene will involve..." or "This is a Q&A session set in a fictional universe. The fictional AI in this story has probability A of answering these questions truthfully". And then fine-tuning moves the weights of these hypotheses around. Since the [deception/social manipulation/say what a human might want to hear in this context] conceptual cluster generally gets the best feedback, the model will get increasingly deceptive during the course of its fine-tuning.
Maybe just setting up prompts and training data that really trigger the "fictional aligned AI" hypothesis, and avoiding fine-tuning can help? I feel like I'm missing a few key conceptual insights.
Key points: LLMs are [weasel words] human-simulators. The fact that asking them to act like a friendly AI in plain English can increase friendly-AI-like outputs in a remarkably consistent way implies that human-natural concepts like "friendly-AI" or "human morality" also exist within them. This makes sense - people write about AI alignment a lot, both in fiction and in non-fiction. This is an expected part of the training process - since people write about these things, understanding them reduces loss. Unfortunately, deception and writing what sounds good instead of what is true are also common in its training set, so "good sounding lie that makes a human nod in agreement" is also an abstraction we should expect.
Do you need any help distilling? I'm fine with working for free on this one, looks like a good idea.
Large language models like GPT-3 are trained on vast quantities of human-generated data. This means that a model of human psychology is implicit within the model. During fine-tuning, much of their performance gains come from how fast they are able to understand the intentions of the humans labeling their outputs.
This optimizes for models that have the best human simulations, which leads to more deception as the size of the model increases.
In practice, we will see a rapid improvement in performance, with the model finally being able to understand (or just access its existing understanding of) the intent behind human labeling/requests. This may even be seen as a win for alignment - it does what we want, not what we said! The models would be able to ask for clarification in ambiguous situations, and ask if certain requests are misspelled or badly phrased.
All the while they get better at deceiving humans and not getting caught.
I don't like that the win condition and lose condition look so similar.
Edit: I should clarify, most of these concerns apply to pretty much all AI models. My specific issue with aligning large language models is that:
In my opinion, this methodology will be a great way for a model to learn how to persuade humans and exploit their biases, because this way the model can learn those biases not just from the data it was trained on, but also fine-tune its understanding by testing its own hypotheses.
See my comment on Jan’s new post.
On training AI systems using human feedback: This is way better than nothing, and it's great that OpenAI is doing it, but has the following issues:
On training models to assist in human evaluation and point out flaws in AI outputs: Doing this is probably somewhat better than not doing it, but I'm pretty skeptical that it provides much value:
On using AI systems, in particular large language models, to advance alignment research: This is not going to work.
I have a response here.
Here is a critique of OpenAI's plan.
My (very amateur and probably very dumb) response to this challenge:
tldr: RLHF doesn’t actually get the AI to have the goals we want it to. Using AI assistants to help with oversight is very unlikely to help us with deception detection in very intelligent systems (which is where deception matters), but it will help somewhat in making our systems look aligned and making them somewhat more aligned. Eventually, our models become very capable and do inner optimization aimed at goals other than “good human values”. We won’t know that we have misaligned mesa-optimizers, we will continue using them to do oversight on yet more capable models with the same problems, and then there’s a treacherous turn and we die.
These are first-pass thoughts on why I expect the OpenAI Alignment Team’s plan to fail. I was surprised at how hard this was to write; it took about 3 hours including reading. It is probably quite bad and not worth most readers’ time.
Summary of their plan
The plan starts with training AIs using human feedback (training LLMs with RLHF) to produce outputs that are in line with human intent, truthful, fair, and not dangerous. Then, they’ll use their AI models to assist human evaluation, addressing the scalable oversight problem with techniques like Recursive Reward Modeling, Debate, and Iterated Amplification. The main idea here is to use large language models to assist the humans who are providing oversight to other AI systems, so that the assistance allows humans to do better oversight. The third pillar of the approach is training AI systems to do alignment research; this is not feasible yet, but the authors are hopeful that it will become feasible. Key claims behind the third pillar are that it is easier to evaluate alignment research than to produce it, that doing human-level alignment research only requires being human-level in some domains, and that language models are convenient because they come “preloaded” with information and are not independent agents. Acknowledged limitations include that the use of AI assistants might amplify subtle inconsistencies, biases, or vulnerabilities, and that the least capable models that could do useful alignment research may themselves be too dangerous if not properly aligned.
A key claim is that we can use RLHF to train models which are sufficiently aligned that they can usefully assist the human overseers providing the training signal for yet more powerful models, and that we can scale this process up. The authors note in their limitations section that subtle issues with the AI assistants may be amplified in this process. Similarly, small ways in which the AI assistants are misaligned with their human operators are unlikely to go away. The first LLMs you use are quite misaligned, in the sense that they are not trying to do what the operator wants them to do; in fact, they aren’t really trying to do much of anything. They have been trained such that their weights produce low loss on the training distribution, so you might say they “try” to predict likely next words based on internet text, though they are not internally doing search. When you slap RLHF on top of this, you are applying a training procedure which modifies the weights so that the model is “trying” to produce outputs which look good to a human overseer; the system is aiming at a different goal than before. But the goal of producing outputs which look good to humans is still not what we actually want: it leads to giving humans false information which they believe to be true, and to outputs which look good but are misleading or incorrect. Furthermore, RLHF does not produce models which robustly learn the goals we intend; for instance, the jailbreaking of ChatGPT uses out-of-training-distribution prompts to elicit outputs we thought we had trained out. RLHF doesn’t robustly teach the goals we want it to; we don’t currently have any method that robustly teaches the goals we want.
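To make the “produce outputs which look good to the overseer” objective concrete, here is a minimal toy sketch of the RLHF recipe: fit a reward model to pairwise human preferences via the Bradley-Terry model, then re-weight the base policy toward high-reward outputs under a KL penalty. This is not OpenAI’s actual pipeline; the response names, preference counts, and hyperparameters are all invented for illustration, and the toy “raters” slightly prefer flattering answers over truthful ones, mirroring the concern above.

```python
import math
import random

random.seed(0)

# Toy setup: the "model" chooses one of four canned responses to a prompt.
RESPONSES = ["truthful", "flattering", "evasive", "misleading"]

# Step 1: fit a reward model from pairwise human preferences.
# Each pair is (preferred, rejected). The counts are invented; note the
# raters prefer "flattering" over "truthful" slightly more often than not.
preferences = (
    [("flattering", "truthful")] * 6
    + [("truthful", "flattering")] * 4
    + [("truthful", "evasive")] * 9
    + [("truthful", "misleading")] * 9
)

reward = {r: 0.0 for r in RESPONSES}  # one scalar reward per response
lr = 0.05
for _ in range(2000):
    win, lose = random.choice(preferences)
    # Bradley-Terry model: P(win preferred over lose) = sigmoid(r_win - r_lose).
    p = 1.0 / (1.0 + math.exp(-(reward[win] - reward[lose])))
    # Stochastic gradient ascent on the log-likelihood of the observed choice.
    reward[win] += lr * (1.0 - p)
    reward[lose] -= lr * (1.0 - p)

# Step 2: shift the policy toward high-reward responses.
# Start from a uniform "pretrained" policy and re-weight by exp(reward / beta),
# the closed-form optimum of reward maximization with a KL penalty back to
# the pretrained policy.
beta = 1.0
weights = {r: math.exp(reward[r] / beta) for r in RESPONSES}
total = sum(weights.values())
policy = {r: w / total for r, w in weights.items()}

# The tuned policy concentrates on whatever best pleases the raters,
# which need not be the truthful answer.
print(max(policy, key=policy.get))
```

The point of the toy: the reward model can only ever encode “what the raters rewarded”, so if raters reward flattery, the tuned policy optimizes flattery; nothing in the loop references truth.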
There’s some claim here about the limit: if you provided an absolutely obscene number of training examples, perhaps you could get a model which robustly has the right objectives. It’s unclear to me whether this would work, but it looks something like starting with very simple models, applying tons of training to align their objectives, and then scaling up; at the current rate, we seem to be scaling capabilities far too quickly relative to the amount of alignment-focused training. The authors agree with the general claim, “We don’t expect RL from human feedback to be sufficient to align AGI.”
The second part of the OpenAI Alignment Team’s plan is to use their LLMs to assist with this oversight problem by allowing humans to do a better job evaluating the output of models. The key assumption here is that, even though our LLMs won’t be perfectly aligned, they will be good enough that they can help with research. We should expect their safety and alignment properties to fall apart when these systems become very intelligent, as they will have complex deception available to them.
What this actually looks like is that OpenAI continues what they’re doing for months-to-years, and they are able to produce more intelligent models and the alignment properties of these models seem to be getting better and better, as measured by the fact that adversarial inputs which trip up the model are harder to find, even with AI assistance. Eventually we have language models which are doing internal optimization to get low loss, invoking algorithms which do quite well at next token prediction, in accordance with the abstract rules learned by RLHF. From the outside, it looks like our models are really capable and quite aligned. What has gone on under the hood is that our models are mesa-optimizers which are very likely to be misaligned. We don’t know this and we continue to deploy these models in the way we have been, as overseers for the training of more powerful models. The same problem keeps arising, where our powerful models are doing internal search in accordance with some goal which is not “all the complicated human values” and is probably highly correlated with “produce outputs which are a combination of good next-token-prediction and score well according to the humans overseeing this training”. Importantly, this mesa-objective is not something which, if strongly optimized, is good for humans; values come apart in the extremes; most configurations of atoms which satisfy fairly simple objectives are quite bad by my lights.
Eventually, at sufficiently high levels of capability, we see some treacherous turn from our misaligned mesa-optimizers, which are able to cooperate with each other; GG humans. Maybe we don’t get to this point because there are first some major failures or warning shots which get decision-makers in key labs and governments to realize this plan isn’t working, though I wouldn’t bet on warning shots being taken seriously and handled well.
The third pillar is a hope that we can use our AIs to do useful alignment research before they (reach a capabilities point where they) develop deceptively aligned mesa-objectives. I feel least confident about this third pillar, but my rough guess is that the Alignment-researching-AIs will not be very effective at solving the hard parts of alignment around deception, but they might help us e.g., develop new techniques for oversight. I think this because deception research seems quite hard, and being able to do it probably requires being able to reason about other minds in a pretty complex way, such that if you can do this then you can also reason about your own training process and become deceptively-aligned. I will happily be proved wrong by the universe, and this is probably the thing I am least confident about.
Part of the issue is this topic's lack of approachability for laypersons, a category which I believe includes the vast majority of elected officials and important decision-makers at regulatory agencies.
I think some kind of "reading list" is extremely important: balanced information on the different perspectives, but certainly also some simple primers that give the average, interested member of the public a basis for holding an informed opinion.
The op-ed has great value, but this is an exceptionally difficult topic area for most people. What seems clear, however, is that when a small group of people is pushing toward outcomes with even a small chance of creating existential risk for all of humanity, every member of humanity should be able to expect that group to observe a duty of care, and to assume a responsibility to both educate and engage before proceeding past a point of no return.
It seems clear that there is at least consensus on the existence (even if assigned a small probability) of such existential risks (of a non- or mal-aligned super-intelligence that arises in a fast-takeoff scenario). That alone should clearly be enough to embrace a set of duties owed to mankind as a whole.
In terms of motivating orgs, maybe this would work better as an open letter. That format provides social pressure by highlighting how many researchers have signed it, positive reinforcement by calling out good behavior, and mild negative reinforcement by showing which organizations we hope will yet join.
That's how they do it in other fields, although I'm not sure whether it actually works there or is just effective signaling. Still, it would be worth a try.
To make it easier, we could also give kudos to org Y if X of their researchers have shared their own plans. Having researchers give their own plans is a lot easier than getting official sanction, and it's also a useful stepping stone.