There are a few different ways the term "alignment" is used by people working on AI safety. This leads to important confusions, which are the main point of this post. But some background comes first, so some readers may want to skip ahead to the section on how different meanings of alignment cause confusion.
As I mentioned in the previous post, the term “alignment” was invented to pick out the hard technical problem of AI existential safety -- how do you make an AI system that is so aligned with your preferences/interests/values/intentions/goals/… that you can safely delegate to it and trust it not to act against you?
At the time it was introduced, most AI researchers weren’t thinking about this problem. A lot of them were skeptical that it was a real problem, or thought it was silly to talk about AI systems having their own intentions or goals.
This changed with GPT-3, the precursor to ChatGPT. This AI and other "large language models" (LLMs) demonstrated that alignment -- getting the AI to want to do what you want -- was clearly a real problem, and a separate problem from making the AI more capable of doing what you want.
GPT-3 was very unpredictable, because it was just trained to predict the next word (or “token”) of text scraped from the internet. It didn’t follow instructions. But if you were clever in how you primed it, you could get it to do basically all the same things that ChatGPT could.
For instance, you could get it to continue a list of translated fruit, if you input:
Strawberry -> Fraise
Orange -> Orange
Apple ->
You could expect GPT-3 to output “Pomme”.
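This "priming" trick is now usually called few-shot prompting: you concatenate worked examples and let the model complete the pattern. As a minimal sketch (the helper function name here is my own, for illustration, not part of any real API), building such a prompt is just string formatting:

```python
def make_few_shot_prompt(examples, query):
    """Build a few-shot completion prompt from (input, output) pairs.

    The model is expected to continue the pattern after the final '->'.
    (Hypothetical helper for illustration only.)
    """
    lines = [f"{x} -> {y}" for x, y in examples]
    lines.append(f"{query} ->")
    return "\n".join(lines)

prompt = make_few_shot_prompt(
    [("Strawberry", "Fraise"), ("Orange", "Orange")],
    "Apple",
)
print(prompt)
# Strawberry -> Fraise
# Orange -> Orange
# Apple ->
```

Fed a prompt like this, a base model that has seen enough bilingual text will typically continue with the pattern's next completion.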
Some people enjoyed finding clever ways to prime or “prompt” GPT-3 to get it to perform different tasks. But it was alignment techniques that made it into a product you could use without any cleverness. The AI could already do the tasks, but it had to be taught to act like it “wanted” to follow instructions instead of just predicting text.
With LLMs, alignment became a very practical problem, and researchers realized it. The technical problems that AI x-safety researchers such as myself had been obsessing about for years went from being dismissed as nonsense to being central to AI almost overnight.
AI researchers started to use "alignment" as a phrase that basically just meant "getting LLMs to do what we want". But this is different from "getting LLMs to want to do what we want". Alignment is only about what the AI wants, not what it's capable of: an AI can fail to do what you want simply because it doesn't know how.
How different meanings of alignment cause confusion and make things seem safer than they are
"Alignment" was introduced to pick out the technical problem described above. But before it became mainstream, it was also often used to refer to the existential safety community in which it originated, or to the motivating problem of how to keep AI from destroying humanity. And it was also used as a name for any technical work related to that goal.
People in AI existential safety often conflate safety and alignment, or assume that “solving alignment” is all that is required to ensure that “AI goes well”. There are a few problems with this.
Is assurance part of alignment?
While many of the relevant technical problems can be viewed as alignment problems, there's an important separate problem that often gets lumped in: can we tell if we've succeeded? Is the AI trustworthy? This, "the assurance problem", is a really hard problem -- potentially a much harder one -- because the way AIs are made makes it difficult to understand what they want. It's not like we're programming their goals; we're using "machine learning" to "teach" the AI what's good and bad by trial and error. It's actually quite similar to training a dog by giving it treats when it does the tricks you want.
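The dog-treats analogy can be made concrete with a toy sketch. This is a deliberate caricature of reward-based training, not how LLM training actually works: the point is that no goal is ever written down anywhere, so what the system ends up "wanting" is just an artifact of its reward history, and there is no obvious place to look to check what it learned.

```python
import random

def train(behaviors, episodes, reward_fn, lr=1.0, seed=0):
    """Toy reward-based trainer: no explicit goal is ever programmed.

    Each episode, a behavior is sampled and its score is nudged up by
    whatever reward it happens to receive.
    """
    rng = random.Random(seed)
    scores = {b: 0.0 for b in behaviors}
    for _ in range(episodes):
        b = rng.choice(behaviors)
        scores[b] += lr * reward_fn(b)  # treat (+1.0) or no treat (0.0)
    return scores

# Reward "fetch" the way a dog trainer hands out treats.
scores = train(["fetch", "bark", "chew"], episodes=300,
               reward_fn=lambda b: 1.0 if b == "fetch" else 0.0)
print(max(scores, key=scores.get))  # "fetch" ends up with the top score
```

Nowhere in this code is "fetch the ball" stated as a goal; it is only implicit in the reward signal. Reading the learned scores tells you what was reinforced, not what the system would do in situations the training never covered.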
When researchers say “Our AI is very aligned” or “alignment is going well”, it’s not clear if they are including the assurance problem. This can, and does, lead to false assurance. We should not believe AI developers’ claims that their AIs are aligned without strong justification, which they are unwilling or (I believe) unable to provide. When AI researchers or companies say a model is aligned, what they really mean is that it seems that way to them, based on their judgment, not that they have any convincing proof that it is aligned. The assurance problem is clearly not solved.
How aligned do AIs need to be?
AIs are not and have never been perfectly aligned. They misbehave. This, again, has to do with how they are "trained", and it's not a problem that will go away any time soon. Talking about "solving alignment" doesn't make sense in a context where our alignment methods are known to be unreliable in this way. The real question is "how aligned is aligned enough?" Nobody knows the answer.
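One way to see why "how aligned is aligned enough?" matters: even a tiny per-action misbehavior rate compounds over many actions. Under the simplifying, purely illustrative assumption that each action fails independently with probability p, the chance of at least one failure across n actions is 1 - (1 - p)^n:

```python
def prob_any_failure(p, n):
    """P(at least one misbehavior) over n independent actions,
    each misbehaving with probability p (a simplifying assumption)."""
    return 1 - (1 - p) ** n

# An AI that is "99.9% reliable" per action, taking 1,000 actions:
print(round(prob_any_failure(0.001, 1000), 2))  # -> 0.63
```

So a 0.1% per-action failure rate yields roughly a 63% chance of at least one failure over a thousand actions. Real failures are not independent like this, but the qualitative point stands: the more we delegate, the closer to perfect the alignment has to be.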
Intent alignment or value alignment?
Alignment can mean (1) "the AI behaves as intended" ("intent alignment") or (2) "the AI is acting in accordance with my values" ("value alignment"). These are different things. We don't expect a tool like a translation app to solve all of our problems, just to translate things when we ask it to. But we might also want to build AI agents that autonomously, or even proactively, do things we want, or like, or think are good, or useful. Aligning an agent could be a lot harder. If you are handing over the keys to the kingdom to an AI, and it has values that are somewhat different from yours, it might do things you don't like, and you might not be able to get the keys back. I and others have done a lot of research trying to figure out how close to perfect the value alignment of an AI would need to be to prevent this sort of thing, but it's an open question.
This is important because a lot of researchers mean only intent alignment when they say things like "this AI is pretty aligned". Today's AIs, even the AI "agents", still function more like tools that follow instructions and then await the next command. But I expect this to change, because it requires too much "human-in-the-loop". An AI that can guess what you would want next and do it is a lot more powerful, and those kinds of AIs are going to become more popular than the more passive, tool-like AIs of today, even if they aren't trustworthy -- because we haven't solved the assurance problem.
Superalignment
A commonly recognized concern is that all of our techniques for alignment and assurance may break down as AIs get smarter and smarter. Part of the reason is that the AIs may be able to trick us and “play nice”. Another reason is that the way AI companies plan to make “superintelligent” AI is by putting AI in charge of building smarter AI… that then builds even smarter AI, et cetera. This means the superintelligent AI could function completely differently than today’s AIs and require completely different alignment and assurance techniques.
Is alignment really sufficient?
Even if we “solve alignment” there’s still the question of which intentions or values we align AIs with. The answer might end up having more to do with competitive pressures than what we actually care about as humans. AI developers are recklessly racing to build smarter and smarter AI as fast as they can, and increasingly putting AI in charge of the process instead of trying to steer it themselves.
There are many ways that this could lead to disasters up to and including the end of humanity.
Our paper on gradual disempowerment argues that AIs might end up aligned with institutions, like companies, that fundamentally care about profit rather than people. There are other concerns as well, such as sudden coups by people or organizations that lack legitimacy and act in their own self-interest instead of the broader interests of humanity as a whole. In general, humans and human organizations tend to be somewhat selfish and short-sighted, and AIs might inherit those properties through alignment.
When researchers treat “solve alignment” as identical to “make sure AI doesn’t kill everyone or otherwise cause terrible future outcomes”, they assume away such problems, which I think are actually critically important.
Summary:
In summary, for historical reasons, the word “alignment” is used for a wide range of things. This can cause a bunch of problems, such as:
Conflating alignment and assurance.
Talking about AIs being “aligned” instead of how aligned they are, which is never perfect.
Confusing the problem of “getting AI tools to behave as intended” with the problem of “getting AI agents to understand your values well enough that you’d be comfortable handing them the keys to the kingdom”.
Suggesting that current alignment techniques will scale to superintelligence.
Assuming solving the technical alignment problem is the same thing as preventing catastrophically bad outcomes from AI.
There is a lot more that could be said, but these are the biggest problems I see in the way people use the word “alignment” these days. It’s important to notice that all of these point in the direction of making the situation seem better than it is.
Thanks for reading The Real AI! Subscribe for free to receive new posts and support my work.