Kelsey Piper and I just launched a new blog about AI futurism and AI alignment called Planned Obsolescence. If you’re interested, you can check it out here.

Both of us have thought a fair bit about what we see as the biggest challenges in technical work and in policy to make AI go well, but a lot of our thinking isn’t written up, or is embedded in long technical reports. This is an effort to make our thinking more accessible. That means it’s mostly aiming at a broader audience than LessWrong and the EA Forum, although some of you might still find some of the posts interesting.

So far we have seven posts.

Thanks to ilzolende for formatting these posts for publication. Each post has an accompanying audio version generated by a voice synthesis model trained on the author's voice using Descript Overdub.

You can submit questions or comments to mailbox@planned-obsolescence.org.


I'm not sure how best to engage with individual posts, but I had thoughts on the "Alignment" !== "Good" post.

I agree it's useful for alignment not to be a fully-general-goodness word, and to have specific terminology that makes it clear what you're talking about.

But I think there are desiderata the word was originally meant to capture in this context, and I think it's an important technical point that the line between "generically good" and "outer aligned" is kinda vague. I do think there are important differences between them, but I think some of the confusion lives in the territory here.

The "outer alignment" problem, as I understand it, is that it's hard to specify a goal that won't result in very bad things that you didn't intend happening. You note:

Perfect alignment techniques (as I use the term) would allow a company like Google to train very smart models[4] so that if Google asks them to increase ad revenue or optimize TPU load balancing or sell all their customers’ private info or censor YouTube videos criticizing Xi Jinping, then they’d be fully motivated to do all those things and whatever else Google asks them to do (good or bad!).[5] With perfect alignment techniques, Google would be able to instill in its AIs a pure desire to be as helpful as possible to Google.[6]

There are inner-alignment failures where the AI ends up not even trying to do anything that remotely corresponds to "maximize ad revenue" (some other process takes over, and it starts squiggling the universe because that's what the sub-process wanted).

But from an outer alignment perspective, it's nontrivial to specify this such that, say, it doesn't convert all the earth to computronium running instances of Google ad servers, and bots that navigate Google clicking on ads all day.

And when you specify "okay, don't kill any humans in the process, and ensure that the board / CEO of Google remain in control of Google and are actually happy with the outcome," you get new problems, like the AI being so good at hacking the minds of the Google CEO / board that your ordinary conception of "the CEO has remained in control" breaks down. And most plans that increase Google ad revenue probably cause the number of human deaths to go up or down somewhat, due to flow-through effects. If the result is that millions of additional humans are dying, you probably want the AI to notice and stop. If the result is that ~10 humans maybe accidentally die... maybe that's fine, because it's within the normal range of "stuff happening to humans as a response to complicated plans"... but do you then want to encourage AIs to treat "okay, I'm allowed to spend up to 10 human lives accelerating Google's ad revenue" as a business-as-usual resource?

And then, well, there are subtler questions about whether Google increasing ad revenue is warping human behavior, mind-hacking people into clicking Google ads in ways that are clearly unintended. If you convert all of the earth to computronium with bots clicking ads, that seems like you clearly failed to outer-align the thing. If you convince all humans to voluntarily join pods where they click Google ads all day, is that an outer alignment failure? Probably.

What if your ad-agency AI is pretty persuasive, but at the upper range of how persuasive human ad agencies could have been in the first place? Well, maybe that's fine.

These were all pretty central use cases of the earlier uses of "aligned" AFAICT, and I think it makes sense for them to continue to be treated as fairly central for what "outer alignment" means, if we keep using that framework.

But once you've hit that narrow target... well, yeah, I agree there's some room left over between "solving outer alignment to this degree" and "fully solving general goodness," but I think it's not that much further. Once you've gotten here, you're pretty close to solving goodness in general. If we have transformative AIs with this degree of philosophical sophistication who are still just... doing what corporations tell them to do in a fairly business-as-usual way, something feels like it's gone wrong. (I agree that "the part where something has gone wrong here" isn't necessarily an AI alignment problem. But it still feels like "you picked a kinda dumb outer-alignment optimization target.")

I think there's something meaningful in saying "okay, we outer-aligned a thing by virtue of having it operating in a very narrow domain with very limited effects on humans" (which is still hard to specify, but I can imagine solving that sort of thing in a way that doesn't require getting close to 'solving goodness'). But, I think that's pretty narrowly scoped compared to anything a business might want to use a powerful AI for.

But from an outer alignment perspective, it's nontrivial to specify this such that, say, it doesn't convert all the earth to computronium running instances of Google ad servers, and bots that navigate Google clicking on ads all day.

But Google didn't want their AIs to do that, so if the AIs do that then the AIs weren't aligned. Same with the mind-hacking.

In general, your AI has some best guess at what you want it to do, and if it's aligned it'll do that thing. If it doesn't know what you meant, then maybe it'll make some mistakes. But the point is that aligned AIs don't take creative actions to disempower humans in ways that humans didn't intend, which is separate from humans intending good things.

I agree. But the point is, in order to do the thing that the CEO actually wants, the AI needs to understand goodness at least as well as the CEO. And this isn't, like, maximal goodness for sure. But to hold up under superintelligent optimization levels, it needs a pretty significantly nuanced understanding of goodness.

I think there is some disagreement between AI camps about how difficult it is to get to the level of goodness the CEO's judgment represents, when implemented in an AI system powerful enough to automate scientific research.

I think the "alignment == goodness" confusion is sort of a natural consequence of having the belief that "if you've built a powerful enough optimizer to automate scientific progress, your AI has to understand your conception of goodness to avoid having catastrophic consequences, and this requires making deep advances such that you're already 90% of the way to 'build an actual benevolent sovereign."

(I assume Mark and Ajeya have heard these arguments before, and we're having this somewhat confused conversation because of a frame difference somewhere about how useful it is to think about goodness and what AI is for, or something)

"if you've built a powerful enough optimizer to automate scientific progress, your AI has to understand your conception of goodness to avoid having catastrophic consequences, and this requires making deep advances such that you're already 90% of the way to 'build an actual benevolent sovereign."

I think this is just not true? Consider an average human, who understands goodness enough to do science without catastrophic consequences, but is not a benevolent sovereign. One reason why they're not a sovereign is because they have high uncertainty about e.g. what they think is good, and avoid taking actions that violate deontological constraints or virtue ethics constraints or other "common sense morality." AIs could just act similarly? Current AIs already seem like they basically know what types of things humans would think are bad or good, at least enough to know that when humans ask for coffee, they don't mean "steal the coffee" or "do some complicated scheme that results in coffee".

Separately, it seems like in order for your AI to act competently in the world, it does have to have a pretty good understanding of "goodness", e.g. to be able to understand why Google doesn't do more spying on competitors, or more insider trading, or other unethical but profitable things, etc. (Separately, the AI will also be able to write philosophy books that are better than current ethical philosophy books, etc.)

My general claim is that if the AI takes creative catastrophic actions to disempower humans, it's going to know that the humans don't like this, are going to resist in the ways that they can, etc. This is a fairly large part of "understanding goodness", and enough (it seems to me) to avoid catastrophic outcomes, as long as the AI tries to do [its best guess at what the humans wanted it to do] and not [just optimize for the thing the humans said to do, which it knows is not what the humans wanted it to do].

Consider an average human, who understands goodness enough to do science without catastrophic consequences, but is not a benevolent sovereign.

If "science" includes "building and testing AGIs" or "building and testing nukes" or "building and testing nanotech", then I think the "average human" "doing science" is unaligned.

I have occasionally heard people debate whether "humans are aligned". I find it a bit odd to think of it as a yes/no answer. I think humans are good at modeling some environments and not others. High-pressure environments with superstimuli are harder than others (e.g. the historical example of "you can get intense amounts of status and money if you lead an army into war and succeed", or the recent example of "you can get intense amounts of status and money if you lead a company to build superintelligent AGI"). Environments where lots of abstract conceptual knowledge is required to understand what's happening (e.g. modern economies, science, etc.) can easily leave the human not understanding the situation and making terrible choices. I don't think this is a minor issue, even for people with strong common-sense morality, and it applies to lots of hypothetical situations where humans could self-modify.

Also, relatedly in my head, I feel like I see this intuition relied on a bunch (emphasis mine):

Consider an average human, who understands goodness enough to do science without catastrophic consequences, but is not a benevolent sovereign. One reason why they're not a sovereign is because they have high uncertainty about e.g. what they think is good, and avoid taking actions that violate deontological constraints or virtue ethics constraints or other "common sense morality."

There's something pretty odd to me here about the way that it's sometimes assumed that average people have uncertainty over their morality. If I think of people who feel very "morally righteous", I don't actually think of them as acting with much humility. The Christian father who beats his son for his minor sins, the Muslim father who murders his daughter for being raped, are people who I think have strong moral stances. I'm not saying these are the central examples of people acting based on their moral stances, but I sometimes get the sense that some folks think that all agents that have opinions about morality are 'reflective', whereas I think many/most humans historically simply haven't been, and thought they understood morality quite well. To give a concrete example of how it relates to alignment discussion, I can imagine that an ML system trained to "act morally" may start acting very aggressively in-line with its morals and see those attempting to give it feedback as attempting to trick it into not acting morally.

I think I would trust these sorts of sentences to be locally valid if they said "consider an extremely reflective human of at least 130 IQ who often thinks through consequentialist, deontological, and virtue-ethics lenses on their life" rather than "consider an average human".

One reason why they're not a sovereign is because they have high uncertainty about e.g. what they think is good, and avoid taking actions that violate deontological constraints or virtue ethics constraints or other "common sense morality."

I mean, intrinsically caring about "common sense morality" sounds to me like you basically already have an AI you can approximately let run wild.

A superintelligence plus a moral foundation that reliably covers all of common-sense morality seems really very close to an AI that would just go off and implement CEV. CEV looks really good according to most forms of common-sense morality I can think of. My current guess is that almost all the complexity of our values lives in the "common sense" part of morality.

Awesome! Huge fan of these posts, I just slurped them all down in one sitting.

Also, Kelsey, I didn't know you had a kid, congratulations!