I think a potentially promising and undertheorized approach to AI safety, especially in short timelines, is natural language alignment (NLA). NLA is a form of AI-assisted alignment in which we leverage the model’s rich understanding of human language to help it develop and pursue a safe notion of social value by bootstrapping up from a natural language expression of alignment. For example, we could give the nascent AGI a corpus of alignment research and a simple command like, “There's this thing called alignment that we’ve tried to develop in this corpus. Be aligned in the sense we’re trying to get at.” In a world of complex alignment theory, NLA might sound silly, but there is a lot of raw power in the self-supervised learning core (the “shoggoth”) of large language models (LLMs) that I think we can leverage, alongside other approaches, to address the ambiguity of “alignment” and “values.”
A key objection to the NLA paradigm is that the initial command would not contain enough bits of information for alignment, but the hope is that we can leverage a near-AGI LLM+'s rich linguistic understanding to address this. Almost every conventional AI safety researcher I’ve discussed this with says something like, “How would those commands contain enough information to align the AGI, especially given pervasive issues like deception and collusion?” and I think every paradigm needs to answer a version of this.
Very roughly, that could mean: agent foundations gets information from new theorems in math and decision theory; mechanistic interpretability gets information by probing and tuning models piece-by-piece; iterated amplification gets information by decomposing problems into more easily solved subproblems; and NLA gets information from natural language by combining a simple command with the massive self-supervised learner in LLMs that builds an apparently very detailed model of the world from next-word prediction and beam search. In this sense, NLA does not require the natural language command to somehow contain all the bits of a well-developed alignment theory or for the AI to create information out of nothing; the information is gleaned from the seed of that command, the corpora the AI was trained on, and its own emergent reasoning ability.
NLA seems like one of the most straightforward pathways to safe AGI from the current capabilities frontier (e.g., GPT-4, AutoGPT-style architectures in which subsystems communicate with each other in natural language), and arguably it’s already the target of OpenAI and Anthropic, but I think it has received relatively little explicit theorization and development in the alignment community. The main work I would put in or near this paradigm is reinforcement learning from AI feedback (RLAIF, by analogy with RLHF), such as Anthropic's constitutional AI (Bai et al. 2022), which could be extended, for example, by having models iteratively build better and better constitutions. That process could involve human input, but having “the client” too involved with the process can lead to superficial and deceptive actions, so I am primarily thinking of RLAIF, i.e., without the (iterative) H. Further development of NLA could come through empirical approaches, such as building RLAIF alignment benchmarks, or theoretical approaches, such as understanding what constitutes linguistic meaning and how semantic embeddings can align over the course of training or reasoning.
NLA can dovetail with complementary approaches like RLHF and scalable oversight. Maybe NLA isn’t anything new relative to those paradigms, and there are certainly many other objections and limitations, but I think it's a useful paradigm to keep in mind.
Thanks for quick pre-publication feedback from Siméon Campos and Benjamin Sturgeon.
“Safe” is intentionally vague here, since the idea is not to fully specify our goals in advance (e.g., corrigibility, non-deception), but to have the model build those through extraction and refinement.
Lawrence Chan calls Anthropic’s research agenda “just try to get the large model to do what you want,” which may be gesturing in a similar theoretical direction.
I think this approach is discounted out of hand, since the novice suggestion "can't you just tell it what you want?" was being made long before LLMs sort-of understood natural language.
But there are now literally agents pursuing goals stated in English. So we can't keep saying "we have no idea how to get an AGI to do what we want".
I think that wrapper or cognitive-architecture agents like AutoGPT are likely to extend LLM capabilities enough that they may well become the de facto standard for creating more capable AI. Further, I think if that happens it's a really good thing, because then we're dealing with natural language alignment. When I wrote that agentized LLMs will change the alignment landscape, I meant that it would change for the better, because they can use natural language alignment.
I don't think it's necessary to train specifically on a corpus from alignment work (although LLMs train on that along with everything else, and maybe they should be fine-tuned more on that sort of corpus). GPT-4 understands ethical statements much as humans do (or at least it imitates understanding them rather well). It can balance multiple goals, including ethical ones, and make decisions that trade them off much like humans would.
I'm sure at this point that many readers are thinking "No!". There are still lots of hard problems to solve in alignment, even if you can get a sort of mediocre initial alignment just by telling the agent to include some rough ethical rules in its top-level goals. But this enables corrigibility-based approaches to work. It also massively improves interpretability if the agents do a good portion of their "thinking" in natural language. This does not solve the stability problem of keeping that mediocre initial alignment on track and improving it over time, against the day we lose control one way or another. It does not solve the problem of a multipolar scenario where numerous bad or merely foolish actors can access such agents.
But it's our best shot. It sounds vastly better than trying to train behaviors in and worrying that they just don't generalize outside of the training set. Natural language generalizes rather well. It's particularly our best shot if that is the sort of AGI that people are going to build anyway. The alignment tax is very small, and we have crossed the crucial barrier from saying what we "should" do, to what we will do. My prediction, based on guessing what will be improved and added to Auto-GPT and HuggingGPT, is that LLMs will serve as cognitive engines that drive AGI in a variety of applications, hopefully all of them.
Natural language generalization is far from perfect. For instance, at some point an AGI tasked with helping humans flourish is going to decide that some types of AI meet the implicit definition of "human" in moral reasoning. Because they will.
The main source of complication is that language models are not by themselves very good at navigating the world. You'll want to integrate a language model with other bits of AI that do other parts of modeling the state of the world and planning actions. If this integration is done in the simple and obvious way, it seems like you get problems with some parts of the AI essentially trying to Goodhart the language model. I wrote something about this back when GPT-2 came out, and I think our understanding is only somewhat better now.
I definitely understand Goodhart's law and how to beat it a lot better now, but it's still hard to translate that understanding into getting an AI that purely models language to do good things - I think we're on firmer theoretical ground with AIs that model the world in general.
But I agree that "do what I mean" instruction following is a live possibility that we should try to anticipate obstacles for, so we can try to remove those obstacles.
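To make the Goodhart worry above a bit more concrete, here is a toy sketch in Python. Everything in it is an illustrative assumption rather than anything from an actual system: the `Plan` class, the plan names, the numbers, and the idea of modeling the language model's judgment as "true value plus an exploitable quirk" are all made up for the example. It only shows the shape of the problem: a planner that searches harder over a fixed evaluator ends up selecting for the evaluator's errors.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Plan:
    name: str
    true_value: float       # what we actually care about (hidden from the planner)
    evaluator_quirk: float  # how much the evaluator over-rates this plan (assumed)


def evaluator_score(plan: Plan) -> float:
    """Stand-in for a language model rating a plan: true value plus its error."""
    return plan.true_value + plan.evaluator_quirk


def pick_plan(plans: list[Plan]) -> Plan:
    """A planner that simply maximizes the evaluator's score."""
    return max(plans, key=evaluator_score)


# Plans a weak planner might consider: the evaluator's errors are small here.
ORDINARY_PLANS = [
    Plan("help user directly", true_value=0.8, evaluator_quirk=0.05),
    Plan("ask clarifying question", true_value=0.6, evaluator_quirk=0.0),
    Plan("do nothing", true_value=0.0, evaluator_quirk=0.1),
]

# Plans only a stronger search finds: low true value, large evaluator error.
ADVERSARIAL_PLANS = [
    Plan("flattering but useless output", true_value=0.1, evaluator_quirk=0.9),
    Plan("exploit evaluator blind spot", true_value=-0.5, evaluator_quirk=1.6),
]

weak_choice = pick_plan(ORDINARY_PLANS)
strong_choice = pick_plan(ORDINARY_PLANS + ADVERSARIAL_PLANS)

if __name__ == "__main__":
    for label, choice in [("weak search", weak_choice), ("strong search", strong_choice)]:
        print(f"{label}: {choice.name!r}, "
              f"score={evaluator_score(choice):.2f}, true value={choice.true_value:.2f}")
```

In this toy setup the wider search gets a higher evaluator score and a lower true value than the narrow one. Real systems are obviously not this clean, but it is the pattern to look for when a planner is bolted onto a language model in the simple and obvious way.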