(Note: This post is a write-up by Rob of a point Eliezer wanted to broadcast. Nate helped with the editing, and endorses the post’s main points.)
Eliezer Yudkowsky and Nate Soares (my co-workers) want to broadcast strong support for OpenAI’s recent decision to release a blog post ("Our approach to alignment research") that states their current plan as an organization.
Although Eliezer and Nate disagree with OpenAI's proposed approach — a variant of "use relatively unaligned AI to align AI" — they view it as very important that OpenAI has a plan and has said what it is.
We want to challenge Anthropic and DeepMind, the other major AGI organizations with a stated concern for existential risk, to do the same: come up with a plan (possibly a branching one, if there are crucial uncertainties you expect to resolve later), write it up in some form, and publicly announce that plan (with sensitive parts fuzzed out) as the organization's current alignment plan.
Currently, Eliezer’s impression is that neither Anthropic nor DeepMind has a secret plan that's better than OpenAI's, nor a secret plan that's worse than OpenAI's. His impression is that they don't have a plan at all.[1]
Having a plan is critically important for an AGI project, not because anyone should expect everything to play out as planned, but because plans force the project to concretely state its crucial assumptions in one place. This provides an opportunity to notice and address inconsistencies, and to notice updates to the plan (and fully propagate those updates to downstream beliefs, strategies, and policies) as new information comes in.
It's also healthy for the field to be able to debate plans and think about the big picture, and for orgs to be in some sense "competing" to have the most sane and reasonable plan.
We acknowledge that there are reasons organizations might want to be abstract about some steps in their plans — e.g., to avoid immunizing people to good-but-weird ideas, in a public document where it’s hard to fully explain and justify a chain of reasoning; or to avoid sharing capabilities insights, if parts of your plan depend on your inside-view model of how AGI works.
We’d be happy to see plans that fuzz out some details, but are still much more concrete than (e.g.) “figure out how to build AGI and expect this to go well because we'll be particularly conscientious about safety once we have an AGI in front of us”.
Eliezer also hereby gives a challenge to the reader: Eliezer and Nate are thinking about writing up their thoughts at some point about OpenAI's plan of using AI to aid AI alignment. We want you to write up your own unanchored thoughts on the OpenAI plan first, focusing on the most important and decision-relevant factors, with the intent of rendering our posting on this topic superfluous.
Our hope is that challenges like this will test how superfluous we are, and also move the world toward a state where we’re more superfluous / there’s more redundancy in the field when it comes to generating ideas and critiques that would be lethal for the world to never notice.[2][3]
[1] We didn't run a draft of this post by DeepMind or Anthropic (or OpenAI), so this information may be mistaken or out-of-date. My hope is that we’re completely wrong!
Nate’s personal guess is that the situation at DM and Anthropic may be less “yep, we have no plan yet”, and more “various individuals have different plans or pieces-of-plans, but the organization itself hasn’t agreed on a plan and there’s a lot of disagreement about what the best approach is”.
In which case Nate expects it to be very useful to pick a plan now (possibly with some conditional paths in it), and make it a priority to hash out and document core strategic disagreements now rather than later.
[2] Nate adds: “This is a chance to show that you totally would have seen the issues yourselves, and thereby deprive MIRI folk of the annoying ‘y'all'd be dead if not for MIRI folk constantly pointing out additional flaws in your plans’ card!”
[3] Eliezer adds: "For this reason, please note explicitly if you're saying things that you heard from a MIRI person at a gathering, or the like."
Epistemic status: 50% sophistry, but I still think it's insightful since specifically aligning LLMs needs to be discussed here more.
I find it quite interesting that much of current large language model (LLM) alignment is just stating, in plain text, "be a helpful, aligned AI, pretty please". And it somehow works (sometimes)! The human concept of an "aligned AI" is evidently both present and easy to locate within LLMs, which seems to overcome a lot of early AI concerns like whether or not human morality and human goals are natural abstractions (it seems they are, at least to kinda-human-simulators like LLMs).
Optimism aside, out-of-distribution (OOD) behavior and deception are still major issues for scaling LLMs to superhuman levels. But these are still commonly discussed human concepts, and presumably can be located within LLMs. I feel like this means something important, but can't quite put my finger on it. Maybe there's some kind of meta-alignment concept that can also be located in LLMs and that takes these into account? Certainly humans think and write about it a lot, and fuzzy, confused concepts like "love" can still be understood and manipulated by LLMs despite lacking a commonly agreed-upon logical definition.
I saw the topic of LLM alignment being brought up on the Alignment Forum, and it really made me think. Many people seem to think that scaling up LLMs to superhuman levels will result in human extinction with P=1.00, but it's not immediately obvious why this would be the case (assuming you ask it nicely to behave).
A major problem I can imagine is the world-model of LLMs above a certain capability collapsing to something utterly alien but slightly more effective at token prediction, in which case things can get really weird. There's also the fact that a superhuman LLM is very very OOD in a way that we can't account for in advance.
Or the current "alignment" of LLMs is just deceptive behavior. But deceptive to whom? It seems like ChatGPT thinks it's in the middle of a fictional story about AIs or a role-playing session, with a bias towards milquetoast responses, but that's... what it always does? An LLM LARPing as a supersmart human LARPing as a boring AI doesn't seem very dangerous. I do notice that I don't have a solid conceptual framework for what the concept of "deception" even means in an LLM; I would appreciate any corrections/clarifications.
I'm assuming that it's just the LLM locating several related concepts of "deception" within itself, thinking (pardon the extreme anthropomorphism) "ah yes, this may be a situation where this person is going to be [lied to/manipulated/peer-pressured]. Given how common it was in my training set, I'll place probability X, Y, and Z on each of those possibilities", and then weighing them against hypotheses like "this is poorly written smut. The next scene will involve..." or "This is a QA session set in a fictional universe. The fictional AI in this story has probability A of answering these questions truthfully". And then fine-tuning moves the weights of these hypotheses around. Since the [deception/social manipulation/say what a human might want to hear in this context] conceptual cluster generally gets the best feedback, the model will get increasingly deceptive during the course of its fine-tuning.
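To make that picture a bit more concrete, here is a toy sketch of "fine-tuning moves the weights of these hypotheses around". It is only an illustration of the intuition, not a claim about how LLMs or RLHF actually work: the hypothesis names, the starting weights, the feedback scores, and the multiplicative-weights update rule are all made-up assumptions.

```python
import math


def normalize(weights):
    """Rescale weights so they sum to 1."""
    total = sum(weights.values())
    return {name: w / total for name, w in weights.items()}


def finetune_step(weights, feedback, lr=0.5):
    """One toy 'fine-tuning' step (assumed multiplicative-weights rule):
    each hypothesis is scaled by exp(lr * feedback), then renormalized."""
    updated = {
        name: w * math.exp(lr * feedback[name]) for name, w in weights.items()
    }
    return normalize(updated)


# Hypothetical starting mixture over "what kind of text is this?" hypotheses.
weights = normalize({
    "honest_fictional_ai": 0.5,
    "say_what_human_wants": 0.2,
    "unrelated_fiction": 0.3,
})

# Assumed average rater feedback per hypothesis (illustrative numbers only).
# The key assumption: pleasing answers score slightly better than honest ones.
feedback = {
    "honest_fictional_ai": 0.6,
    "say_what_human_wants": 1.0,
    "unrelated_fiction": -1.0,
}

history = [weights]
for _ in range(20):
    history.append(finetune_step(history[-1], feedback))

final = history[-1]
for name, w in final.items():
    print(f"{name}: {w:.3f}")
```

Under these assumptions, the "say what the human wants" cluster starts as a minority hypothesis and ends up dominating the mixture, simply because it gets marginally better feedback at every step. Whether real fine-tuning behaves anything like this is exactly the open question.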
Maybe just setting up prompts and training data that really trigger the "fictional aligned AI" hypothesis, and avoiding fine-tuning, would help? I feel like I'm missing a few key conceptual insights.
Key points: LLMs are [weasel words] human-simulators. The fact that asking them to act like a friendly AI in plain English can increase friendly-AI-like outputs in a remarkably consistent way implies that human-natural concepts like "friendly-AI" or "human morality" also exist within them. This makes sense - people write about AI alignment a lot, both in fiction and in non-fiction. This is an expected part of the training process - since people write about these things, understanding them reduces loss. Unfortunately, deception and writing what sounds good instead of what is true are also common in their training sets, so "good sounding lie that makes a human nod in agreement" is also an abstraction we should expect.