Disambiguating "alignment" and related notions

by capybaralet, 1 min read, 5th Jun 2018, 21 comments



EDIT: renamed CAIS --> TAIS to avoid acronym clash with Drexler's Comprehensive AI Services.

I recently had an ongoing and undetected inferential gap with someone over our use of the term "value aligned".

Holistic alignment vs. parochial alignment

I was talking about "holistic alignment":
Agent R is holistically aligned with agent H iff R and H have the same terminal values.
This is the "traditional AI safety (TAIS)" (as exemplified by Superintelligence) notion of alignment, and the TAIS view is roughly: "a superintelligent AI (ASI) that is not holistically aligned is an Xrisk"; this view is supported by the instrumental convergence thesis.

My friend was talking about "parochial alignment". I'm lacking a satisfyingly crisp definition of parochial alignment, but intuitively, it refers to how you'd want a "genie" to behave:
R is parochially aligned with agent H and task T iff R's terminal values are to accomplish T in accordance with H's preferences over the intended task domain.
We both agreed that a parochially aligned ASI is not safe by default (it might paperclip), but that it might be possible to make one safe using various capability control mechanisms (for instance, anything that effectively restricts it to operating within the task domain).

Sufficient alignment

We might further consider a notion of "sufficient alignment":
R is sufficiently aligned with H iff optimizing R's terminal values would not induce a nontrivial Xrisk (according to H's definition of Xrisk).
For example, an AI whose terminal values are "maintain meaningful human control over the future" is plausibly sufficiently aligned. A sufficiently aligned ASI is safe in the absence of capability control. It's worth considering what might constitute sufficient alignment, short of holistic alignment. For instance, Paul seems to argue that corrigible agents are sufficiently aligned. As another example, we don't know how bad of a perverse instantiation to expect from an ASI whose values are almost correct (e.g. within epsilon in L-infinity norm over possible futures).
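
One way to read "within epsilon in L-infinity norm over possible futures" is sketched below. It assumes that R's and H's values can be written as utility functions over a set of possible futures; the symbols U_R, U_H and F are introduced here for illustration and do not appear in the post.

```latex
\[
\| U_R - U_H \|_\infty
  \;=\; \sup_{f \in \mathcal{F}} \bigl| U_R(f) - U_H(f) \bigr|
  \;\le\; \epsilon
\]
```

Under this reading, "almost correct" means that R and H never disagree by more than epsilon about the value of any single future.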

Intentional alignment and "benignment"

Paul Christiano's version of alignment, I'm calling "intentional alignment":
R is intentionally aligned with H if R is trying to do what H wants it to do.
Although it feels intuitive, I'm not satisfied with the crispness of this definition, since we don't have a good way of determining a black box system's intentions. We can apply the intentional stance, but that doesn't provide a clear way of dealing with irrationality.

Paul also talks about benign AI, which is about what an AI is optimized for (which is closely related to what it "values"). Inspired by this, I'll define a complementary notion to Paul's notion of alignment:
R is benigned with H if R is not actively trying to do something that H doesn't want it to do.


1) Be clear what notion of alignment you are using.

2) There might be sufficiently aligned ASIs that are not holistically aligned.

3) Try to come up with crisper definitions of parochial and intentional alignment.



Great classification!

Holistically aligned AI is not safe: the last thing we need is an AI that has sexual desire for humans. Instead, AI should correctly and safely understand our commands, which is close to "sufficient alignment" from the above definitions.

This is not what I meant by "the same values", but the comment points towards an interesting point.

When I say "the same values", I mean the same utility function, as a function over the state of the world (and the states of "R is having sex" and "H is having sex" are different).

The interesting point is that states need to be inferred from observations, and it seems like there are some fundamentally hard issues around doing that in a satisfying way.

Related to the distinction between 2-place and 1-place words. We want the AI to have the "curried" version of human values, not a symmetric version where the word "me" now refers to the AI itself.
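
A minimal sketch of the curried-versus-symmetric distinction, assuming a toy world in which a state only records who benefits. The names `enjoys`, `human_values` and `indexical_values` are hypothetical and chosen for illustration; they are not from the post or the thread.

```python
from functools import partial

# Hypothetical toy model: a world state only records which agent benefits.
def enjoys(agent: str, state: dict) -> float:
    """2-place notion: how good `state` is *for `agent`*."""
    return 1.0 if state["beneficiary"] == agent else 0.0

# "Curried" human values: the first argument is fixed to the human H,
# leaving a 1-place function of the world state.
human_values = partial(enjoys, "H")

# Symmetric version: whoever holds the values substitutes itself for "me".
def indexical_values(me: str):
    return partial(enjoys, me)

state_h = {"beneficiary": "H"}  # a world that is good for H
state_r = {"beneficiary": "R"}  # a world that is good for R

# R holding the curried function ranks worlds exactly as H does.
r_curried = human_values
# R holding the symmetric function prefers worlds that are good for R.
r_indexical = indexical_values("R")

print(r_curried(state_h), r_curried(state_r))
print(r_indexical(state_h), r_indexical(state_r))
```

In this sketch, "the same utility function over states of the world" corresponds to `r_curried`, while "the same preferences centered around a different agent" corresponds to `r_indexical`.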

Can you please explain the distinction more succinctly, and say how it is related?

In your definition this distinction about the state of the world is not obvious, as usually humans use the words "have the same values" not about the state of the world, but meaning having the same set of preferences but centered around a different person.

In that case, the situation becomes creepy: for example, I want to have sex with human females. An AI holistically aligned with me will also want to have sex with human females - but I am not happy about it, and the females will also find it creepy. Moreover, if such an AI has 1000 times my capabilities, it will completely dominate the sexual market, getting almost all sex in the world, and humans will go extinct.

An AI with avturchin-like values centered around the AI would completely dominate the sexual market if and only if avturchin would completely dominate the sexual market given the opportunity. More generally, having an AI with human-like values centered around itself is only as bad as having an AI holistically aligned (in the original, absolute, sense) with a human who is not you.

As an aside, why would I find it creepy if a godlike superintelligence wants to have sex with me? It's kinda hot actually :)

I could imagine the following conversation with holistically aligned AI:

-Mom, I decided to become homosexual.

-No, you will not do it, because heterosexuality was your terminal value at the moment of my creation.


the states of “R is having sex” and “H is having sex” are different)

In that case, "paradise for R" and "paradise for H" are different. You need to check out "centered worlds".

Although it feels intuitive, I'm not satisfied with the crispness of this definition, since we don't have a good way of determining a black box system's intentions

Are you also unsatisfied with "holistic alignment" or "parochial alignment" as crisp concepts, since we don't have a way of determining a black box system's terminal values?

I don't think I was very clear; let me try to explain.

I mean different things by "intentions" and "terminal values" (and I think you do too?)

By "terminal values" I'm thinking of something like a reward function. If we literally just program an AI to have a particular reward function, then we know that its terminal values are whatever that reward function expresses.

Whereas "trying to do what H wants it to do" I think encompasses a broader range of things, such as when R has uncertainty about the reward function, but "wants to learn the right one", or really just any case where R could reasonably be described as "trying to do what H wants it to do".

Talking about a "black box system" was probably a red herring.

Do I terminally value reproduction? What if we run an RL algorithm with genetic fitness as its reward function?

If we define terminal values in the way you specify, then "aligned" (whether parochially or holistically) does not seem like a strong enough condition to actually make us happy about an AI. Indeed, most people at MIRI seem to think that most of the difficulty of alignment is getting from "has X as explicit terminal goal" to "is actually trying to achieve X."

I realized it's unclear to me what "trying" means here, and in your definition of intentional alignment. I get the sense that you mean something much weaker than MIRI does by "(actually) trying", and/or that you think this is a lot easier to accomplish than they do. Can you help clarify?

It seems like you are referring to daemons.

To the extent that daemons result from an AI actually doing a good job of optimizing the right reward function, I think we should just accept that as the best possible outcome.

To the extent that daemons result from an AI doing a bad job of optimizing the right reward function, that can be viewed as a problem with capabilities, not alignment. That doesn't mean we should ignore such problems; it's just out of scope.

Indeed, most people at MIRI seem to think that most of the difficulty of alignment is getting from "has X as explicit terminal goal" to "is actually trying to achieve X."

That seems like the wrong way of phrasing it to me. I would put it like "MIRI wants to figure out how to build properly 'consequentialist' agents, a capability they view us as currently lacking".

For what it's worth I've also tried to specify what we mean by AI alignment formally. From my perspective these are interesting different kinds of alignment but all still suffer from insufficient precision. For example, what is a "terminal value" exactly and whence does it arise? I think nailing these things down will prove important as we make further progress towards engineering aligned AI and discover the need to ground out some of our ideas in the implementation.

How does a parochially aligned AI paperclip? My preferences over the task T definitely include "don't paperclip the universe". Is the idea that the AI incorrectly infers my preferences about the task T, and then tries to optimize that?

It doesn't *necessarily*. But it sounds like what you're thinking of here is some form of "sufficient alignment".

The point is that you could give an AI a reward function that leads it to be a good personal assistant program, so long as it remains restricted to doing the sort of things we expect a personal assistant program to do, and isn't doing things like manipulating the stock market when you ask it to invest some money for you (unless that's what you expect from a personal assistant). If it knows it could do things like that, but doesn't want to, then it's more like something sufficiently aligned. If it doesn't do such things because it doesn't realize they are possibilities (yet), or because it hasn't figured out a good way to use its actuators to have that kind of effect (yet), or because you've done a good job boxing it, then it's more like "parochially aligned".

Another way of putting it: A parochially aligned AI (for task T) needs to understand task T, but doesn't need to have common sense "background values" like "don't kill anyone".

Narrow AIs might require parochial alignment techniques in order to learn to perform tasks that we don't know how to write a good reward function for. And we might try to combine parochial alignment with capability control in order to get something like a genie without having to teach it background values. When/whether that would be a good idea is unclear ATM.

Got it, thanks. The part of the definition that I didn't grasp was "H's preferences over the intended task domain".

So I discovered that Paul Christiano already made a very similar distinction to the holistic/parochial one here:


ambitious ~ holistic

narrow ~ parochial

Someone also suggested simply using general/narrow instead of holistic/parochial.

Those seem like really important distinctions. I have the feeling people who don't think AI Alignment is super important either implicitly or explicitly only think of parochial alignment and not of holistic alignment and just don't consider the former to be so difficult.

People who don't know much about AI have two templates, neither of which is AI: the first is a conventional computer, the second is a human being. Following the first template it seems obvious that AGIs will be obedient and slavish.