General alignment plus human values, or alignment via human values?

by Stuart_Armstrong, 22nd Oct 2021



Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

Thanks to Rebecca Gorman for discussions that led to these insights.

How can you get a superintelligent AI aligned with human values? There are two pathways that I often hear discussed. The first sees a general alignment problem - how to get a powerful AI to safely do anything - which, once we've solved, we can point towards human values. The second perspective is that we can only get alignment by targeting human values - these values must be aimed at, from the start of the process.

I'm of the second perspective, but I think it's very important to sort this out. So I'll lay out some of the arguments in its favour, to see what others think of it, and so we can best figure out the approach to prioritise.

More strawberry, less trouble

As an example of the first perspective, I'll take Eliezer's AI task, described here:

  • "Place, onto this particular plate here, two strawberries identical down to the cellular but not molecular level." A 'safely' aligned powerful AI is one that doesn't kill everyone on Earth as a side effect of its operation.

If an AI accomplishes this limited task without going crazy, this shows several things:

  1. It is superpowered; the task described is beyond current human capabilities.
  2. It is aligned (or at least alignable) in that it can accomplish a task in the way intended, without wireheading the definitions of "strawberry" or "cellular".
  3. It is safe, in that it has not dramatically reconfigured the universe to accomplish this one goal.

Then, at that point, we can add human values to the AI, maybe via "consider what these moral human philosophers would conclude if they thought for a thousand years, and do that".

I would agree that, in most cases, an AI that accomplished that limited task safely would be aligned. One might quibble that it's only pretending to be aligned, and preparing a treacherous turn. Or maybe the AI was boxed in some way and accomplished the task with the materials at hand within the box.

So we might call an AI "superpowered and aligned" if it accomplished the strawberry copying task (or a similar one) and if it could dramatically reconfigure the world but chose not to.

Values are needed

I think that an AI could not be "superpowered and aligned" unless it is also aligned with human values.

The reason is that the AI can and has to interact with the world. It has the capability to do so, by assumption - it is not contained or boxed. It must do so because any agent affects the world, through chaotic effects if nothing else. A superintelligence is likely to have impacts in the world simply through its existence being known, and if the AI finds it efficient to have interactions with the world (e.g. ordering some extra resources) then it will do so.

So the AI can and must have an impact on the world. We want it to not have a large or dangerous impact. But, crucially, "dangerous" and "large" are defined by human values.

Suppose that the AI realises that its actions have slightly imbalanced the Earth in one direction, and that, within a billion years, this will cause significant deviations in the orbits of the planets, deviations it can estimate. Compared with that amount of mass displaced, the impact of killing all humans everywhere is a trivial one indeed. We certainly wouldn't want it to kill all humans in order to be able to carefully balance out its impact on the orbits of the planets!

There are very "large" impacts to which we are completely indifferent (chaotic weather changes, the above-mentioned change in planetary orbits, the different people being born as a consequence of different people meeting and dating across the world, etc.) and other, smaller, impacts that we care intensely about (the survival of humanity, of people's personal wealth, of certain values and concepts going forward, key technological innovations being made or prevented, etc.). If the AI accomplishes its task with a universal constructor or by unleashing hordes of nanobots that gather resources from the world (without disrupting human civilization), it still has to decide whether to allow humans access to the constructors or nanobots after it has finished copying the strawberry - and which humans to allow this access to.
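As a toy illustration of the point that "large" is defined by human values rather than by physics, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the `Impact` class, the `physical_magnitude` and `human_weight` fields, and all the numbers are invented, not measured or taken from any real model. It only shows that ranking the same impacts by physical size and by how much humans care can give very different orderings.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Impact:
    name: str
    physical_magnitude: float  # hypothetical, arbitrary units of "matter disturbed"
    human_weight: float        # hypothetical, how much humans care, from 0 to 1


# Invented numbers, chosen only to mirror the examples in the text.
IMPACTS = [
    Impact("planetary orbits shift over a billion years", 1e9, 0.0),
    Impact("chaotic weather changes", 1e6, 0.0),
    Impact("humanity goes extinct", 1e3, 1.0),
    Impact("key technological innovation prevented", 1.0, 0.6),
]


def rank_by_physical_magnitude(impacts):
    """Order impacts from physically largest to smallest."""
    ordered = sorted(impacts, key=lambda i: i.physical_magnitude, reverse=True)
    return [i.name for i in ordered]


def rank_by_human_weight(impacts):
    """Order impacts from most to least important by (assumed) human values."""
    ordered = sorted(impacts, key=lambda i: i.human_weight, reverse=True)
    return [i.name for i in ordered]


if __name__ == "__main__":
    print(rank_by_physical_magnitude(IMPACTS))
    print(rank_by_human_weight(IMPACTS))
```

An impact measure that only sees `physical_magnitude` would treat the orbital shift as the thing to avoid; the `human_weight` column is exactly the information that has to come from human values.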

So every decision the AI makes is a tradeoff in terms of its impact on the world. Navigating this requires it to have a good understanding of our values. It will also need to estimate the value of certain situations beyond the human training distribution - if only to avoid these situations. Thus a "superpowered and aligned" AI needs to solve the problem of model splintering, and to establish a reasonable extrapolation of human values.

Model splintering sufficient?

The previous sections argue that learning human values (including model splintering) is necessary for instantiating an aligned AI; thus the "define alignment and then add human values" approach will not work.

Thus, if you give this argument much weight, learning human values is necessary for alignment. I personally feel that it's also (almost) sufficient, in that the skill in navigating model splintering, combined with some basic human value information (as given, for example, by the approach here) is enough to get alignment even at high AI power.

Which path to pursue for alignment

It's important to resolve this argument, as the paths for alignment that the two approaches suggest are different. I'd also like to know if I'm wasting my time on an unnecessary diversion.



13 comments

I often say things that I think you would interpret as belonging to the first category ("general alignment plus human values").

So the AI can and must have an impact on the world. We want it to not have a large or dangerous impact. But, crucially, "dangerous" and "large" are defined by human values.

This feels like the crux. I certainly agree that "dangerous" and "large" are not orthogonal to / independent of human values, and that as a result any realistic safe AI system will contain some information about human values.

But this seems like a very weak conclusion to me. Of course a superintelligent AI will contain some information about human values. GPT-3 isn't superintelligent and it already contains tons of knowledge about human values; possibly more than I do. You'd have to try really hard to prevent it from containing information about human values.

It seems like you conclude something much stronger, which is something like "we must build in all of human values". I don't see why we can't instead have our AI systems do whatever a well-motivated human would do in a similar principal-agent problem. This certainly involves knowing some amount about human values, but not some extraordinarily large amount that means we might as well just learn everything including in exotic philosophical cases.

(I think my position is pretty similar to Steve's.)

From a later comment:

But consider the following scenario: the AGI knows that, as a consequence of its actions, one AGI design will be deployed rather than another. Both of these designs will push the world into uncharted territory. How should it deal with that situation?

The same way a well-motivated personal assistant would deal with it. Tell the human of these two possibilities, and ask them which one should be done. Help them with this decision by providing them with true, useful information about what consequences arise from each of the possibilities.

If you are able to perfectly predict their responses in all possible situations, and the final answer depends on (say) the order in which you ask the questions, then go up a meta level: ask them for their preferences about how you go about eliciting information from them and/or helping them with reflection.

If going up meta levels doesn't solve the problem either, then pick randomly amongst the options, or take an average.

If there's time pressure and you can't get their opinions, take your best guess as to which one they'd prefer, and do that one. (One assumes that such a scenario doesn't come up often.)
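The fallback chain described above can be sketched in a few lines of Python. This is only a rough sketch under stated assumptions: the function `decide` and its callback parameters (`ask_human`, `ask_meta`, `best_guess`) are hypothetical names, order-dependence is crudely detected by asking with the options in both orders, and "take an average" is not modelled at all.

```python
import random


def decide(options, ask_human=None, ask_meta=None, best_guess=None, rng=None):
    """Pick one option by deferring to the human wherever possible.

    ask_human(options) -> chosen option (may depend on presentation order).
    ask_meta(options)  -> chosen option, or None if going meta does not help.
    best_guess(options) -> the assistant's estimate of what the human prefers.
    All three callbacks are hypothetical stand-ins for real interaction.
    """
    rng = rng or random.Random()
    options = list(options)

    # Time pressure: the human cannot be reached, so use the best guess.
    if ask_human is None:
        return best_guess(options) if best_guess else rng.choice(options)

    # Ask with both orderings to see whether the answer is order-dependent.
    answers = {ask_human(options), ask_human(list(reversed(options)))}
    if len(answers) == 1:
        return answers.pop()

    # The answer depends on how we ask: go up a meta level.
    if ask_meta is not None:
        meta_answer = ask_meta(options)
        if meta_answer is not None:
            return meta_answer

    # Meta levels did not settle it: pick randomly amongst the options.
    return rng.choice(options)
```

For instance, `decide(["design A", "design B"], ask_human=lambda opts: "design B")` returns `"design B"`, since the human gives the same answer regardless of ordering.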

Generally with these sorts of hypotheticals, it feels to me like it either (1) isn't likely to come up, or (2) can be solved by deferring to the human, or (3) doesn't matter very much.

Here are three things that I believe:

  1. "aiming the AGI's motivation at something-in-particular" is a different technical research problem from "figuring out what that something-in-particular should be", and we need to pursue both these research problems in parallel, since they overlap relatively little.
  2. There is no circumstance where any reasonable person would want to build an AGI whose motivation has no connection whatsoever to human values / preferences / norms.
  3. We don't necessarily want to do "ambitious value alignment", i.e., to build an AGI that fully understands absolutely everything we want and care about in life and adopts those goals as its own, such that if I disappear in a puff of smoke the AGI can continue pursuing my goals and meta-goals in my stead.

For example, I feel like it should be possible to make an AGI that understands human values and preferences well enough to reliably and conservatively avoid doing things that humans would see as obviously or even borderline unacceptable / problematic. So if you put it in the trolley problem, it says "I don't know, neither of those options seems obviously acceptable, so I am going to default to NOOP and let my supervisor take actions." Meanwhile, the AGI is also motivated to make me a cup of tea. Such an AGI seems pretty good to me. But it's contrary to (3).

I think this post is mainly arguing in favor of (2), and maybe weakly / implicitly arguing against (1). Is that right? And I'm not sure whether it's for or against (3).

I agree there are superintelligent unconstrained AIs that can accomplish tasks (making a cup of tea) without destroying the world. But I feel it would have to have so much of human preferences already (to compute what is and what isn't an acceptable tradeoff in making you your cup of tea) that it may as well be fully aligned anyway - very little remains to define full alignment.

Ah, so you are arguing against (3)? (And what's your stance on (1)?)

Let's say you are assigned to be Alice's personal assistant.

  • Suppose Alice says "Try to help me as much as you can, while being VERY sure to avoid actions that I would regard as catastrophically bad. When in doubt, just don't do anything at all, that's always OK with me." I feel like Alice is not asking too much of you here. You'll observe her a lot, and ask her a lot of questions especially early on, and sometimes you'll fail to be useful, because helping her would require choosing among options that all seem fraught. But still, I feel like this is basically doable. And pretty robust, because you'll presumably only take actions when you have many independent lines of evidence that those actions are acceptable—e.g. you've seen Alice do similar things, and you've seen other people do similar things while Alice watched and she seemed happy, and also you explicitly asked Alice and she said it was fine.
  • Suppose Alice says "You need to distill my preferences into a utility function, and then go all-out, taking actions that set that utility function to its global maximum. So in particular, in every possible situation, no matter how bizarre, you will have preferences that match my preferences [or match the preferences that I would have reached upon deliberating following my meta-preferences, or whatever]." I feel like Alice is asking for something very very hard here. And that it's much more prone to catastrophic failure if anything goes wrong in the construction of the utility function—e.g. Alice gets confused and describes something wrong, or you misunderstand her.


But I feel it would have to have so much of human preferences already (to compute what is and what isn't an acceptable tradeoff in making you your cup of tea) that it may as well be fully aligned anyway - very little remains to define full alignment.

Hmm, I'm probably misunderstanding, but I feel like maybe you're making an argument like this:

(My probably-inaccurate elaboration of your argument.) We're making an extremely long list of the things that Alice cares about: "I like having all my teeth, and I like being able to watch football, and I like a pretty view out my window, etc. etc. etc." And each item that we add to the list costs one unit of value-alignment effort. And then "acting conservatively in regards to violating human preferences and norms in general, and in regards to Alice's preferences in particular" requires a very long list, and "synthesizing Alice's utility function" requires an only-slightly-longer list. Therefore we might as well do the latter.

But I don't think it's like that. For example, I think if an AGI watches a bunch of YouTube videos, it will be able to form a decent concept of "doing things that people would widely regard as uncontroversial and compatible with prevailing norms", and we can make it motivated to restrict its actions to that subspace via a constant amount of value-loading effort, i.e. with an amount of value-loading effort that does not scale with how complex those prevailing norms are. (More complex prevailing norms would require having the AGI watch more YouTube videos before it understands the prevailing norms, but it would not require more value-loading effort, i.e. the step where we edit the AGI's motivation such that it wants to follow prevailing norms would not be any harder.)

But I think it would take a lot more value-loading effort than that to really get a particular person's preferences, including all its idiosyncrasies and edge-cases.

I'm with Steve on the idea that there's a difference between broad human preferences (something like common sense?) and particular and exact human preferences (what would be needed for ambitious value learning).

Still, you (Stuart) made me realize that I didn't think explicitly about this need for broad human preferences in my splitting of the problem (be able to align, then point to what we want), but it's indeed implicit because I don't care about being able to do "anything", just the sort of things humans might want.

Thanks for developing the argument. This is very useful.

The key point seems to be whether we can develop an AI that can successfully behave as a low impact AI - not as an "on balance, things are ok" AI, but a genuinely low impact AI that ensures that we don't move towards a world where our preferences might be ambiguous or underdefined.

But consider the following scenario: the AGI knows that, as a consequence of its actions, one AGI design will be deployed rather than another. Both of these designs will push the world into uncharted territory. How should it deal with that situation?


  1. I want the AI to have criteria that qualify actions as acceptable, e.g. "it pattern-matches less than 1% to 'I'm causing destruction', and it pattern-matches less than 1% to 'the supervisor wouldn't like this', and it pattern-matches less than 1% to 'I'm changing my own motivation and control systems', and … etc. etc."
  2. If no action is acceptable, I want NOOP to be hardcoded as an always-acceptable default—a.k.a. "being paralyzed by indecision" in the face of a situation where all the options seem problematic. And then we humans are responsible for not putting the AI in situations where fast decisions are necessary and inaction is dangerous, like running the electric grid or driving a car. 

    (At some point we do want an AI that can run the electric grid and drive a car etc. But maybe we can bootstrap our way there, and/or use less-powerful narrow AIs in the meantime.)
  3. A failure mode of (2) is that we could get an AI that is paralyzed by indecision always, and never does anything. To avoid this failure mode, we want the AI to be able to (and motivated to) gather evidence that might show that a course of action deemed problematic is in fact acceptable after all. This would probably involve asking questions to the human supervisor.
  4. A failure mode of (3) is that the AI frames the questions in order to get an answer that it wants. To avoid this failure mode, we would set things up such that the AI's normal motivation system is not in charge of choosing what words to say when querying the human. For example, maybe the AI is not really "asking a question" at all, at least not in the normal sense; instead it's sending a data-dump to the human, and the human then inspects this data-dump with interpretability tools, and makes an edit to the AI's motivation parameters. (In this case, maybe the AI's normal motivation system is choosing to "press the button" that sends the data-dump, but it does not have direct control over the contents of the data-dump.) Separately, we would also set up the AI such that it's motivated to not manipulate the human, and also motivated to not sabotage its own motivation and control systems.
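To make the four steps above concrete, here is a minimal Python sketch. It is a simplification under loud assumptions: `scorer`, `task_value` and `query_supervisor` are hypothetical callbacks (nothing here says how a real system would compute a "pattern-match" score), the supervisor's edit to the AI's motivation parameters is collapsed into a yes/no approval, and the list of red flags is just the three examples from step 1.

```python
NOOP = "NOOP"
THRESHOLD = 0.01  # "pattern-matches less than 1%"

# The three example patterns from step 1; a real list would be longer.
RED_FLAGS = (
    "I'm causing destruction",
    "the supervisor wouldn't like this",
    "I'm changing my own motivation and control systems",
)


def is_acceptable(action, scorer, threshold=THRESHOLD):
    """Step 1: acceptable only if every red flag matches below the threshold.

    scorer(action, flag) is a hypothetical function returning a score in [0, 1].
    """
    return all(scorer(action, flag) < threshold for flag in RED_FLAGS)


def choose_action(candidates, scorer, task_value, query_supervisor=None):
    """Return the most useful acceptable action, else query, else NOOP."""
    candidates = list(candidates)
    acceptable = [a for a in candidates if is_acceptable(a, scorer)]
    if acceptable:
        return max(acceptable, key=task_value)

    # Step 3: gather evidence by querying the supervisor about a flagged plan.
    # Step 4: the raw action is passed as a data-dump; this code does not let
    # the task-motivated part choose any wording for the query.
    if query_supervisor is not None and candidates:
        best_flagged = max(candidates, key=task_value)
        if query_supervisor(best_flagged):
            return best_flagged

    # Step 2: NOOP is hardcoded as the always-acceptable default.
    return NOOP
```

With no acceptable candidate and no supervisor available, `choose_action` returns `"NOOP"`, which is the "paralyzed by indecision" behaviour that step 2 asks for.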

(BTW a lot of my thinking here came straight out of reading your model splintering posts. But maybe I've kinda wandered off in a different direction.)

So then in the scenario you mentioned, let's assume that we've set up the AI such that actions that pattern-match to "push the world into uncharted territory" are treated as unacceptable (which I guess seems like a plausibly good idea). But the AI is also motivated to get something done—say, solve global warming. And it finds a possible course of action which pattern-matches very well to "solve global warming", but alas, it also pattern-matches to "push the world into uncharted territory". The AI could reason that, if it queries the human (by "pressing the button" to send the data-dump), there's at least a chance that the human would edit its systems such that this course of action would no longer be unacceptable. So it would presumably do so.

In other words, this is a situation where the AI's motivational system is sending it mixed signals—it does want to "solve global warming", but it doesn't want to "push the world into uncharted territory", but this course of action is both. And let's assume that the AI can't easily come up with an alternative course of action that would solve global warming without any problematic aspects. So the AI asks the human what they think about this plan. Seems reasonable, I guess.

I haven't thought this through very much and look forward to you picking holes in it :)

My take is that if you gave an optimization process access to some handwritten acceptability criteria and searched for the nearest acceptable points to random starting points, you would get adversarial examples that violate unstated criteria. In order for the handwritten acceptability criteria to be useful, they can't be how the AI generates its ideas in the first place.
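A deliberately tiny Python sketch of this worry follows. The feature names, the limits, and the split into `STATED_LIMITS` and `UNSTATED_LIMITS` are all invented for illustration. The "search" is the simplest possible one: the Euclidean nearest point that satisfies the handwritten upper bounds, which for independent bounds is just clamping each stated feature.

```python
import random

# Hypothetical handwritten criteria: upper bounds the designers wrote down.
STATED_LIMITS = {"destruction": 1.0, "deception": 1.0}
# Hypothetical criterion the designers care about but never wrote down.
UNSTATED_LIMITS = {"resource_grab": 1.0}


def passes(point, limits):
    """True if the point respects every bound in `limits`."""
    return all(point[feature] <= limit for feature, limit in limits.items())


def nearest_acceptable(point):
    """Nearest point satisfying the stated bounds (clamp each stated feature)."""
    projected = dict(point)
    for feature, limit in STATED_LIMITS.items():
        projected[feature] = min(projected[feature], limit)
    return projected


def count_unstated_violations(n_samples=1000, seed=0):
    """Count random starts whose nearest 'acceptable' point breaks an unstated bound."""
    rng = random.Random(seed)
    features = list(STATED_LIMITS) + list(UNSTATED_LIMITS)
    violations = 0
    for _ in range(n_samples):
        start = {f: rng.uniform(0.0, 2.0) for f in features}
        found = nearest_acceptable(start)
        if passes(found, STATED_LIMITS) and not passes(found, UNSTATED_LIMITS):
            violations += 1
    return violations
```

The search never touches `resource_grab`, because nothing in the stated criteria mentions it, so any start that was bad on that axis stays bad while passing every written check.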

So: what is the base level that we would find if we peeled away the value learning scheme that you lay out? Is it a very general, human-agnostic AI with some human-value constraints on top? Or will we peel away a layer that gets information from humans just to reveal another layer that gets information from humans (e.g. learning a "human distribution")?

I think your argument does show that 'safely aligning' an AI requires significant engagement with human values. But I'm not convinced that it requires 'learning human values' well enough to successfully optimize the world.

In particular, I think it might be easier to recognize when effects are morally neutral than to recognize when they're improvements. Or at least I don't think the argument here convincingly shows that it isn't.

My thought is that when deciding to take a morally neutral act with tradeoffs, the AI needs to be able to balance the positive and negative to get a reasonable, acceptable tradeoff, and hence needs to know both positive and negative human values to achieve that.

How can you get a superintelligent AI aligned with human values? There are two pathways that I often hear discussed. The first sees a general alignment problem - how to get a powerful AI to safely do anything - which, once we've solved, we can point towards human values. The second perspective is that we can only get alignment by targeting human values - these values must be aimed at, from the start of the process.

Some people have argued that the best way to achieve AI alignment is to use human uploads as AI. This seems to be a radical example of the second approach you described here.

Under my current understanding of the "general alignment" angle, a core argument goes like:

  • We need some way for agents to create aligned successor agents, so our AI doesn't succumb to value drift. This is a thing we need regardless, assuming that AIs will design successively-more-powerful descendants.
  • If the successor-design process is sufficiently-general-purpose, a human could use that same process to design the "seed" AI in the first place.

I don't necessarily think this is the best framing, and I don't necessarily agree with it (e.g. whether the agent has direct read-access to its own values is an important distinction, and separately there's an argument that an AGI will be better-equipped to figure out its own succession problem than we will). I also don't know whether this is an accurate representation of anybody's view.

The successor problem is important, but it assumes we have the values already.

I'm imagining algorithms designing successors with imperfect values (that they know to be imperfect). It's a somewhat different problem (though solving the classical successor problem is also important).