Natural language alignment

I think this approach gets dismissed out of hand, because the novice suggestion "can't you just tell it what you want?" was being made long before LLMs sort-of understood natural language.

But there are now literally agents pursuing goals stated in English. So we can't keep saying "we have no idea how to get an AGI to do what we want".

I think that wrapper or cognitive-architecture-type agents like AutoGPT are likely to extend LLM capabilities enough that they may well become the de facto standard for creating more capable AI. Further, I think that if that happens it's a really good thing, because then we're dealing with natural language alignment. When I wrote that agentized LLMs will change the alignment landscape, I meant that it would change for the better, because they can use natural language alignment.

I don't think it's necessary to train specifically on a corpus from alignment work (although LLMs train on that along with everything else, and maybe they should be fine-tuned more on that sort of corpus). GPT-4 understands ethical statements much as humans do (or at least it imitates understanding them rather well). It can balance multiple goals, including ethical ones, and make decisions that trade them off much as humans would.

I'm sure at this point that many readers are thinking "No!". There are still lots of hard problems to solve in alignment, even if you can get a sort of mediocre initial alignment just by telling the agent to include some rough ethical rules in its top-level goals. But this enables corrigibility-based approaches to work. It also massively improves interpretability if the agents do a good portion of their "thinking" in natural language. This does not solve the stability problem of keeping that mediocre initial alignment on track and improving it over time, against the day we lose control one way or another. It does not solve the problem of a multipolar scenario where numerous bad or merely foolish actors can access AGI.
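To make "include some rough ethical rules in its top-level goals" concrete, here is a minimal sketch of what I have in mind. It is not AutoGPT's actual code. The class name, the prompt layout, and the `stub_llm` placeholder are all my own assumptions for illustration; a real agent would call an actual language model where the stub is. The point is only that the goals are plain English and that every "thought" lands in a transcript a human can read.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class NaturalLanguageAgent:
    """Hypothetical wrapper agent whose top-level goals are plain English."""
    task_goal: str
    ethical_rules: List[str]
    llm: Callable[[str], str]  # any function mapping a prompt to text
    transcript: List[str] = field(default_factory=list)

    def build_prompt(self) -> str:
        # Ethical rules sit beside the task goal as peers, not as an afterthought.
        goals = [self.task_goal] + self.ethical_rules
        goal_text = "\n".join(f"- {g}" for g in goals)
        history = "\n".join(self.transcript) or "(nothing yet)"
        return (
            f"Top-level goals:\n{goal_text}\n\n"
            f"Reasoning so far:\n{history}\n\n"
            "Next thought:"
        )

    def step(self) -> str:
        # Each thought is stored in natural language so an overseer can inspect it.
        thought = self.llm(self.build_prompt())
        self.transcript.append(thought)
        return thought


def stub_llm(prompt: str) -> str:
    # Placeholder standing in for a real model call (an assumption, not a real API).
    return "Check the plan against the stated goals before acting."


agent = NaturalLanguageAgent(
    task_goal="Summarize this week's alignment posts.",
    ethical_rules=["Do not deceive the user.", "Accept correction from the user."],
    llm=stub_llm,
)
agent.step()
```

Nothing here addresses the stability or multipolar problems; it only shows how small the alignment tax is at the level of the wrapper.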

But it's our best shot. It sounds vastly better than trying to train behaviors in and worrying that they just don't generalize outside of the training set. Natural language generalizes rather well.^[1] It's particularly our best shot if that is the sort of AGI that people are going to build anyway. The alignment tax is very small, and we have crossed the crucial barrier from saying what we "should" do, to what we will do. My prediction, based on guessing what will be improved and added to Auto-GPT and HuggingGPT, is that LLMs will serve as cognitive engines that drive AGI in a variety of applications, hopefully all of them.

^[1] Natural language generalization is far from perfect. For instance, at some point an AGI tasked with helping humans flourish is going to decide that some types of AI meet the implicit definition of "human" in moral reasoning. Because they will.

^{^}

“Safe” is intentionally vague here, since the idea is not to fully specify our goals in advance (e.g., corrigibility, non-deception), but to have the model build those through extraction and refinement.

^{^}

Lawrence Chan calls Anthropic’s research agenda “just try to get the large model to do what you want,” which may be gesturing in a similar theoretical direction.

Seth Herd, 3y

Charlie Steiner, 3y

The main source of complication is that language models are not by themselves very good at navigating the world. You'll want to integrate a language model with other bits of AI that do other parts of modeling the state of the world and planning actions. If this integration is done in the simple and obvious way, it seems like you get problems with some parts of the AI essentially trying to Goodhart the language model. I wrote something about this back when GPT-2 came out, and I think our understanding is only somewhat better now.

I definitely understand Goodhart's law and how to beat it a lot better now, but it's still hard to translate that understanding into getting an AI that purely models language to do good things - I think we're on firmer theoretical ground with AIs that model the world in general.

But I agree that "do what I mean" instruction following is a live possibility that we should try to anticipate obstacles for, so we can try to remove those obstacles.
