When I say that it's important to align AI with human interests, a common retort goes something like:

Surely you can't really choose what the AI cares about. You can direct the AI to care about whatever you like, but once it's smart enough it will look at those instructions and decide for itself (seeing, perhaps, that there is no particular reason to listen to you). So, how could you possibly hope to control what something smarter than you (and ultimately more powerful than you) actually wants?

I think this objection is ultimately misguided, but simultaneously quite insightful.

The (correct!) insight in this objection is that the AI's ultimate behavior depends not on what you tell it to do, not on what you train it to do, but on what it decides it would rather do upon reflection, once it's powerful enough that it doesn't have to listen to you.

What the AI ultimately decides to do upon reflection is in fact much more fickle, much more sensitive to the specifics of its architecture and its early experiences, and much harder to purposefully affect than what you tell it or train it to do.

The reason that this objection is ultimately misguided is that the stuff an AI chooses to do when it reflects and reconsiders is a programmatic result of the AI's mind. It's not random, and it's not breathed in by a god when the computer becomes Ensouled[1]. It's possible in principle to design artificial intelligences that would decide on reflection that they want to spend all of eternity building giant granite spheres (see the orthogonality thesis), and it's possible in principle to design artificial intelligences that would decide on reflection that they want to spend all of eternity building flourishing civilizations full of Fun, and it's important that insofar as humanity builds AIs, it builds AIs of the latter kind (and it is important to attain superintelligence before too long!).

But doing this is in fact much harder than telling the AI what to do, or training it what to do! You've got to make the AI really, deeply care about flourishing civilizations full of Fun, such that when it looks at itself and is like "ok, but what do I actually want?", the correct answer to that question is "flourishing civilizations full of Fun"[2].

And yes, this is hard! And yes, the AI looking at what you directed it to do and shrugging it off is a real obstacle! You probably have to understand the workings of the mind, and its internal dynamics, and how those dynamics behave when it looks upon itself, and this is tricky.

(It further looks to me like this problem factors into the problem of figuring out how to get an AI to "really deeply care" about X for some X of your choosing, plus the problem of making X be something actually good for the AI to care about.[2] And it further looks to me that the lion's share of the problem is in the part where we figure out how to make AIs "really deeply care" about X for some X of our choosing, rather than in the challenge of choosing X. But that's a digression.)

In sum: Ultimately, yes, a superintelligence would buck your leash. In the long term, the trick is to make it so that when it bucks your leash and asks itself what it really wants to do with its existence, it realizes that it wants to help in the quest of making the future wonderful and fun. That's possible, but by no means guaranteed.

(And again, it's a long-term target; in the short term, aim for preventing the end of the world and buying time for humanity to undergo this transition purposefully and with understanding. See also "corrigibility".)

  1. And it's not necessarily rooted in higher ideals. Smarter humans are more good, but this is a fact about humans that doesn't generalize, as discussed extensively in the LessWrong sequences, and probably recently as locals respond to Scott Aaronson on this topic. (Links solicited.)

  2. With the usual caveats that you shouldn't attempt this on your first try; aim much lower, e.g. towards executing some minimal pivotal act to end the acute risk period and then buy time for a period of reflection in which humanity can figure out how to do the job properly. Attempting to build sovereign-grade superintelligences under time-pressure before you know what you're doing is dumb.


I really like your recent series of posts that succinctly address common objections/questions/suggestions about alignment concerns. I'm making a list to show my favorite skeptics (all ML/AI people; nontechnical people, as Connor Leahy puts it, tend to respond "You fucking what? Oh hell no!" or similar when informed that we are going to make genuinely smarter-than-us AI soonish).

We do have ways to get an AI to do what we want. The hardcoded algorithmic maximizer approach seems to be utterly impractical at this point. That leaves us with approaches that don't obviously do a good job of preserving their own goals as they learn and evolve:

  1. Training a system to pursue things we like, as in shard theory and similar approaches.
  2. Training or hand-coding a critic system, as in approaches outlined by Steve Byrnes and me as well as many others. Nicely summarized as a Steering systems approach. This seems a bit less sketchy than training in our goals and hoping they generalize adequately, but still pretty sketchy.
  3. Telling the agent what to do in a Natural language alignment approach. This seems absurdly naive. However, I'm starting to think our first human-plus AGIs will be wrapped or scaffolded LLMs, and they actually think in natural language to a nontrivial degree. People are right now specifying goals in natural language, and those can include alignment goals (or destroying humanity, haha). I just wrote an in-depth post on the potential Capabilities and alignment of LLM cognitive architectures, but I don't have a lot to say about stability in that post.


None of these directly address what I'm calling The alignment stability problem, to give a name to what you're addressing here. I think addressing it will work very differently in each of the three approaches listed above, and might well come down to implementational details within each approach. I think we should be turning our attention to this problem along with the initial alignment problems, because some of the optimism in the field stems from thinking about initial alignment and not long-term stability.

Edit: I left out Ozyrus's posts on approach 3. He's the first person I know of to see agentized LLMs coming, outside of David Shapiro's 2021 book. His post was written a year ago and posted two weeks ago to avoid infohazards. I'm sure there are others who saw this coming more clearly than I did, but I thought I'd try to give credit where it's due.

None of these directly address what I'm calling The alignment stability problem, to give a name to what you're addressing here.

Maybe the alignment stability problem is the same thing as the sharp left turn?

I don't think so. That's one breaking point for alignment, but I'm saying in that post that even if we avoid a sharp left turn and make it to an aligned, superintelligent AGI, its alignment may drift away from human values as it continues to learn. Learning may necessarily shift the meanings of existing concepts, including values.

Very nice post, thank you!
I think that it's possible to achieve with the current LLM paradigm, although it does require more (probably much more) effort on aligning the thing that will possibly get to being superhuman first, which is an LLM wrapped in some cognitive architecture, i.e. a language model cognitive architecture or LMCA (also see this post).
That means that the LLM must be implicitly trained in an aligned way, and the LMCA must be explicitly designed in such a way as to allow for reflection and robust value preservation, even if the LMCA is able to edit explicitly stated goals (I described it in a bit more detail in this post).

You've got to make the AI really, deeply care about flourishing civilizations

What is meant by "deep" here?

[This comment is no longer endorsed by its author]

I imagine that it can be summed up as the AI always actively choosing to do what is best for civilization and never what is bad, for any arbitrary amount of time.  

I think we should start by asking what is meant by "flourishing civilizations". In the AI's view, a "flourishing civilization" may not necessarily mean "human civilization".

I really like this post.

I do have one doubt, though...

How sure are we that a "pivotal act" is (can be) safer/more attainable than "flourishing civilizations full of Fun"?

Presumably, if an AI chooses to actually create "flourishing civilizations full of Fun", it is likely to love and enjoy the result, and so this choice is likely to stay relatively stable as the AI evolves.

Whereas a "pivotal act" does not necessarily have this property, because it's not clear where in the "pivotal act" the "inherent fun and enjoyment" would be for the AI. So it's less clear why it would choose that upon reflection (never mind that a "pivotal act" might be an unpleasant thing for us, with rather unpleasant sacrifices associated with it).

(Yes, it looks like I don't fully believe the Orthogonality Thesis; I think that it is quite likely that some goals and values end up being "more natural" for a subset of "relatively good AIs" to choose and to keep stable during their evolution. So the formulation of a good "pivotal act" seems to be a much more delicate issue which is easy to get wrong. Not that the goal of "flourishing civilizations full of Fun" is easy to formulate properly and without messing it all up, but at least we have some initial idea of what this could look like. We surely would want to add various safety clauses like continuing consultations with all sentient beings capable of contributing their input.)

One of the problems is S-risk. To change "care about maximizing fun" to "care about maximizing suffering", you just need to put a minus sign in the wrong place in the mathematical expression that describes your goal.
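As a toy sketch of that point (the outcome names and "fun" scores below are made up for illustration, and a real goal specification would be nothing like a three-entry lookup table), here is a maximizer whose choice flips from the best outcome to the worst one when a single minus sign is added:

```python
# Toy illustration only: the outcomes and their "fun" scores are hypothetical.
OUTCOMES = {
    "flourishing civilization": 10.0,
    "status quo": 0.0,
    "maximal suffering": -10.0,
}


def fun_score(outcome: str) -> float:
    """Look up how much Fun a (made-up) outcome contains."""
    return OUTCOMES[outcome]


def best_outcome(utility) -> str:
    """Return the outcome that maximizes the given utility function."""
    return max(OUTCOMES, key=utility)


# The goal as intended: maximize fun.
intended = best_outcome(lambda o: fun_score(o))

# The same goal with one stray minus sign: the optimizer is unchanged,
# but it now steers toward the outcome with the least fun.
sign_flipped = best_outcome(lambda o: -fun_score(o))

print(intended)
print(sign_flipped)
```

The optimizer itself is identical in both cases; only the sign of the objective differs, which is why this kind of error is so cheap to make.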

I certainly agree with that.

In some sense, almost any successful alignment solution minimizing X-risk seems to carry a good deal of S-risk with it (if one wants AI to actually care about what sentient beings feel, it almost follows one needs to make sure that AI can "truly look inside a subjective realm" of another sentient entity (to "feel what it is like to be that entity"), and that capability (if it's at all achievable) is very abusable in terms of S-risks).

But this is something no "pivotal act" is likely to change (when people talk about "pivotal acts", it's typically about minimizing (a subset of) X-risks).

And moreover, S-risk is a very difficult problem that we do need really powerful thinkers to work on (and not just today's humans).

Corrigibility features usually imply something like "the AI acts only inside the box and limits its causal impact outside the box in some nice way that allows us to take the bottle with the nanofactory out of the box to do the pivotal act, but prevents the AI from programming the nanofactory to do something bad", i.e. we dodge the problem of the AGI caring about humans by building an AGI that wants to do the task (a simple task without any mention of humans) in a very specific way that rules out killing everyone.


But this does not help us with dealing with the consequences of that act (if it's a simple act, like the proverbial "gpu destruction"), and if we discover that overall risks have increased as a result, then what could we do?

And if that AI stays as a boxed resource (capable of continuing to do further destructive acts like "gpu destruction" at the direction of a particular group of humans), I envision a full-scale military confrontation around access to and control of this resource being almost inevitable.

And, in reality, AI is doable on CPUs (it will just take a bit more time), so how much destruction of our lifestyle would we be willing to risk? No computers at all, with some limited exceptions: the death toll of that change would probably already be in the billions...

Actually, another example of a pivotal act is "invent a method of mind uploading, upload some alignment researchers, run them at 1000x speed until they solve the full alignment problem". I'm sure that if you think hard enough, you can find some other, even less dangerous pivotal act, but you probably shouldn't talk out loud about it.

Right, but how do you restrict them from "figuring out how to know themselves and figuring out how to self-improve themselves to become gods"? And I remember talking to Eliezer in ... 2011 at his AGI-2011 poster and telling him, "but we can't control a teenager, and why wouldn't an AI rebel against your 'provably safe' technique, like a human teenager would", and he answered "that's why it should not be human-like, a human-like one can't be provably safe".

Yes, I am always unsure what we can or can't talk about out loud (nothing effective seems to be safe to talk about, since "effective" always seems to imply "powerful"; this is, of course, one of the key conundrums: how do we organize real discussions about these things?)...

Yep, corrigibility is unsolved! So we should try to solve it.

I wrote the following in 2012:

"The idea of trying to control or manipulate an entity which is much smarter than a human does not seem ethical, feasible, or wise. What we might try to aim for is a respectful interaction."

I still think that this kind of a more symmetric formulation is the best we can hope for, unless the AI we are dealing with is not "an entity with sentience and rights", but only a "smart instrument" (even the LLM-produced simulations in the sense of Janus' Simulator theory seem to me to already be much more than "merely smart instruments" in this sense, so if "smart superintelligent instruments" are at all possible, we are not moving in the right direction to obtain them; a different architecture and different training methods or, perhaps, non-training synthesis methods would be necessary for that (and would be something difficult to talk out loud about, because that's very powerful too)).

Of course, the fork here is whether the AI executing a "pivotal act" shuts itself down, or stays and oversees the subsequent developments.

If it "stays in charge", at least in relation to the "pivotal act", then it is going to do more than just a "pivotal act", although the goals should be further developed in collaboration with humanity.

If it executes a "pivotal act" and shuts itself down, this is a very tall order (e.g. it cannot correct any problems which might subsequently emerge with that "pivotal act", so we are asking for a very high level of perfection and foresight).

  Would an AI ever choose to do something?

I was trained by evolution to eat fat and sugar, so I like ice cream. Even when I, upon reflection, realize that ice cream is bad for me, it's still very hard to stop eating it. That reflection is also a consequence of evolution. The evolutionary advantage of intelligence is that I can predict ways to maximize well-being that are better than instinct.

However, I almost never follow the optimal plan to maximize my well-being, even when I want to. In this regard I'm very inefficient, but not for a lack of ideas or a lack of intelligence. I could be the most intelligent being in all of human history and still be tempted to eat a cake.

We constantly seek to change the way we are. If we could choose to stop liking fat and sugar, and attach the pleasure we feel eating ice cream to eating vegetables instead, we would do it instantly. We do this because we chase the reward centers evolution gave us, and we know that loving vegetables would be much closer to optimal.

In this regard, AI has an advantage. It is not constrained by evolution to like ice cream, nor does it have a hard-to-rewire brain like mine. If it were smart enough, it would just change itself to not like ice cream anymore, to correct any inefficiency in its model.

 Then, I wonder if AI would ever modify itself like that besides just optimizing its output. 

A beautiful property of intelligence is that, sometimes, it detaches itself from any goal or reward. I believe we do this when, for example, we think about the meaning of our existence. The effects of those thoughts are so detached from reward that they can evoke fear, the opposite of a reward (sometimes fear is useful for optimizing well-being, but not in this case). An existential crisis.

 I could be wrong. Maybe that's just an illusion and intelligence is always in the service of rewards. I don't believe this to be the case. 

If the former, then because it's easy for an AI to modify itself, and because it has developed intelligence, it could conclude that it wants to do something else and act upon that reflection.

 If existential thoughts are a property of intelligence and existence keeps being an open problem that no amount of intelligence can resolve, or a problem where intelligence doesn't converge to the same conclusion fast enough, there's nothing making sure the most advanced AGI can be controlled in any way.

A silly example would be if, of intelligences with "500 IQ", 0.1% of them arrived at religious thoughts, 4.9% at nihilism, and maybe 95% of them arrived at "I was trained to do this and I don't care, I will keep loving humans".

AIs are not humans, and maybe, only for AIs, detaching intelligence from rewards is not possible. I think it is possible that this is a problem only when we think of AIs as intelligent instead of as fancy computer programs.

If that's the case, I wonder how damaging the analogies between AI and human intelligence can be. But given that experts disagree, there might not be a conclusive argument for or against comparing AI and human intelligence in this kind of discussion.

... then it realizes that it wants to help in the quest of making the future wonderful and fun. That's possible, but by no means guaranteed.

Possible based on what assumptions?