Author's note 1:

The following is a chapter from the story I've been writing which contains, well, it contains what I think is probably a proof that the value alignment problem is unsolvable. I know it sounds crazy, but as far as I can tell the proof seems to be correct. There are further supporting details which I can explain if anyone asks, but I didn't want to overload you guys with too much information at once, since a lot of those additional supporting details would require articles of their own to explain.

One of my friends, who I shall not name, came up with what we think is also a proof, but it's longer and more detailed and he hasn't decided whether to post it.

I haven't had time yet to extract my own less detailed version from the narrative dialogue of my story, but I thought it was really important that I share it here as soon as possible, since if I'm right, the more time wasted on AI research, the less time we have to come up with strategies and solutions that could more effectively prevent x-risk long term.

Author's note 2:

This post was originally more strongly worded, but I edited it to tone it down a little. While those who have read Inadequate Equilibria might consider that to be "epistemic humility" and therefore darkside epistemology, I'm worried not enough people on here will have read that book. Furthermore, the human brain, particularly system 1, evolved to win political arguments in the ancestral environment. I'm not sure system 1 is biologically capable of understanding the fact that epistemic humility is bad epistemology. And the contents of this post are likely to provoke strong emotional reactions, as it postulates that a particular belief is false, a belief which rationalists at large have invested a LOT of energy, resources and reputation into. I feel more certain that the contents of this post are correct than is wise to express in a context likely to trigger strong emotions. Please keep this in mind. I'm being upfront with you about exactly what I'm doing and why.

Author's note 3:

Also, HEAVY SPOILERS for the story I've been writing, Earthlings: People of the Dawn. This chapter is literally the last chapter of part 5, after which the remaining parts are basically extended epilogues. You have been warned. Also, I edited the chapter in response to the comments to make things more clear.



There were guards standing outside the entrance to the Rationality Institute. They saluted Bertie as he approached. Bertie nodded to them as he walked past. He reached the front doors and turned the handle, then pulled the door open.

He stepped inside. There was no one at the front desk. All the lights were on, but he didn’t hear anyone in the rooms he passed as he walked down the hallway, approaching the door at the end.

He finally stood before it. It was the door to Thato’s office.

Bertie knocked.

“Come in,” he heard Thato say from the other side.

Bertie turned the knob with a sweaty hand and pushed inwards. He stepped inside, hoping that whatever Thato wanted to talk to him about, it wasn’t an imminent existential threat.

“Hello Bertie,” said Thato, somberly. He looked sweaty and tired, with bags under his puffy red eyes. Had he been crying?

“Hi Thato,” said Bertie, gently shutting the door behind him. He pulled up a chair across from Thato’s desk. “What did you want to talk to me about?”

“We finished analyzing the research notes on the chip you gave us two years ago,” said Thato, dully.

“And?” asked Bertie. “What did you find?”

“It was complicated, it took us a long time to understand it,” said Thato. “But there was a proof in there that the value alignment problem is unsolvable.”

There was a pause, as Bertie’s brain tried not to process what it had just heard. Then…

“WHAT!?” Bertie shouted.

“We should have realized it earlier,” said Thato. Then in an accusatory tone, “In fact, I think you should have realized it earlier.”

“What!?” demanded Bertie. “How? Explain!”

“The research notes contained a reference to a children's story you wrote: A Tale of Four Moralities,” Thato continued, his voice rising. “It explained what you clearly already knew when you wrote it, that there are actually FOUR types of morality, each of which has a different game-theoretic function in human society: Eye for an Eye, the Golden Rule, Maximize Flourishing and Minimize Suffering.”

“Yes,” said Bertie. “And how does one go from that to ‘the Value Alignment problem is unsolvable’?”

“Do you not see it!?” Thato demanded.

Bertie shook his head.

Thato stared at Bertie, dumbfounded. Then he spoke slowly, as if to an idiot.

“Game theory describes how agents with competing goals or values interact with each other. If morality is game-theoretic by nature, that means it is inherently designed for conflict resolution and either maintaining or achieving the universal conditions which help facilitate conflict resolution for all agents. In other words, the whole purpose of morality is to make it so that agents with competing goals or values can coexist peacefully! It is somewhat more complicated than that, but that is the gist.”

“I see,” said Bertie, his brows furrowed in thought. “Which means that human values, or at least the individual non-morality-based values don’t converge, which means that you can’t design an artificial superintelligence that contains a term for all human values, just the moral values.”

Then Bertie had a sinking, horrified feeling accompanied by a frightening intuition. He didn’t want to believe it.

“Not quite,” said Thato cuttingly. “Have you still not realized? Do you need me to spell it out?”

“Hold on a moment,” said Bertie, trying to calm his racing anxiety.

What is true is already so, Bertie thought.

Owning up to it doesn’t make it worse.

Not being open about it doesn’t make it go away.

And because it’s true, it is what is there to be interacted with.

People can stand what is true, for they are already enduring it.

Bertie took a deep breath as he continued to recite in his mind…

If something is true, then I want to believe it is true.

If something is not true, then I want not to believe it is true.

Let me not become attached to beliefs I may not want.

Bertie exhaled, still overwhelmingly anxious. But he knew that putting off the revelations any longer would make it even harder to have them. He knew the thought he could not think would control him more than the thought he could. And so he turned his mind in the direction it was afraid to look.

And the epiphanies came pouring out. It was a stream of consciousness, no--a waterfall of consciousness that wouldn’t stop. Bertie went from one logical step to the next, a nearly perfect dance of rigorously trained self-honesty and common sense--imperfect only in that he had waited so long to start it, to notice.

“So you can’t program an intelligence to be compatible with all human values, only human moral values,” Bertie said in a rush. “Except even if you programmed it to only be compatible with human moral values, there are four types of morality, so you’d have four separate and competing utility functions to program into it. And if you did that, the intelligence would self-edit to resolve the inconsistencies between its goals and that would just cause it to optimize for conflict resolution, and then it would just tile the universe with tiny artificial conflicts between artificial agents for it to resolve as quickly and efficiently as possible without letting those agents do anything themselves.”

“Right in one,” said Thato with a grimace. “And as I am sure you already know, turning a human into a superintelligence would not work either. Human values are not sufficiently stable. Yuuto deduced in his research that human values are instrumental all the way down, never terminal. Some values are merely more or less instrumental than others. That is why human values are over patterns of experiences, which are four-dimensional processes, rather than over individual destinations, which are three-dimensional end states. This is a natural implication of the fact that humans are adaptation executors rather than fitness maximizers. If you program a superintelligence to protect humans from death, grievous injury or other forms of extreme suffering without infringing on their self-determination, that superintelligence would by definition have to stay out of human affairs under most circumstances, only intervening to prevent atrocities like murder, torture or rape, or to deal with the occasional existential threat and so on. If the superintelligence was a modified human it would eventually go mad with boredom and loneliness, and it would snap.”

Thato continued. “On the other hand, if a superintelligence was artificially designed it could not be programmed to do that either. Intelligences are by their very nature optimization processes. Humans typically do not realize that because we each have many optimization criteria which often conflict with each other. You cannot program a general intelligence with a fundamental drive to ‘not intervene in human affairs except when things are about to go drastically wrong otherwise, where drastically wrong is defined as either rape, torture, involuntary death, extreme debility, poverty or existential threats’ because that is not an optimization function.”

“So, to summarize,” Bertie began, slowly. “The very concept of an omnibenevolent god is a contradiction in terms. It doesn’t correspond to anything that could exist in any self-consistent universe. It is logically impossible.”

“Hindsight is twenty-twenty, is it not?” asked Thato rhetorically.


“So what now?” asked Bertie.

“What now?” repeated Thato. “Why, now I am going to spend all of my money on frivolous things, consume copious amounts of alcohol, say anything I like to anyone without regard for their feelings or even safety or common sense, and wait for the end. Eventually, likely soon, some twit is going to build a God, or blow up the world in any number of other ways. That is all. It is over. We lost.”

Bertie stared at Thato. Then in a quiet, dangerous voice he asked, “Is that all? Is that why you sent me a message saying that you urgently wanted to meet with me in private?”

“Surely you see the benefit of doing so?” asked Thato. “Now you no longer will waste any more time on this fruitless endeavor. You too may relax, drink, be merry and wait for the end.”

At this point Bertie was seething. In a deceptively mild tone he asked, “Thato?”

“Yes?” asked Thato.

“May I have permission to slap you?”

“Go ahead,” said Thato. “It does not matter anymore. Nothing does.”

Bertie leaned over the desk and slapped Thato across the face, hard.

Thato seized Bertie’s wrist and twisted it painfully.

“That bloody hurt, you git!”

“I thought you said nothing matters!?” Bertie demanded. “Yet it clearly matters to you whether you’re slapped.”

Thato released Bertie’s wrist and looked away. Bertie massaged his wrist, trying to make the lingering sting go away.

"Are you done being an idiot?" he asked.

"Define 'idiot'," said Thato scathingly, still not looking at him.

"You know perfectly well what I mean," said Bertie.

Thato ignored him.


Bertie clenched his fists.

“In the letter Yuuto gave me before he died, he told me that the knowledge contained in that chip could spell Humanity’s victory or its defeat,” he said angrily, eyes blazing with determination. “Do you get it? Yuuto thought his research could either destroy or save humankind. He wouldn’t have given it to me if he didn’t think it could help. So I suggest you and your staff get back to analyzing it. We can figure this out, and we will.”

Bertie turned around and stormed out of the office.

He did not look back.


16 comments

Ok, so a few things to say here.

First, I think the overall conclusion is right in a narrow sense: we can't solve AI alignment in that we can't develop an air tight proof that one agent's values are aligned with another's due to the epistemic gap between the experiences of the two agents where values are calculated. Put another way, values are insufficiently observable to be able to say whether two agents are aligned. See "Formally stating AI alignment" and "Outline of metarationality" for the models that lead me to draw this conclusion.

Second, as a consequence of what I've just described, I think any impossibility resulting from the irreconcilability of human preferences is irrelevant because the impossibility of perfect alignment is dominated by the above claim. I think you are right that we can't reconcile human preferences so long as they are irrational, and attempts at this are likely to result in repugnant outcomes if we hand the reconciled utility function to a maximizer. It's just less of a concern for alignment because we never get to the point where we could have failed because we couldn't solve this problem (although we could generate our own failure by trying to "solve" preference reconciliation without having solved alignment).

Third, I appreciate the attempt to put this in a fictional story that might be more widely read than technical material, but my expectation is that right now most of the people you need to convince of this point are more likely to engage with it through technical material than fiction, although I may be wrong about this so I appreciate the diversity of presentation styles.

And even if somehow you could program an intelligence to optimize for those four competing utility functions at the same time, that would just cause it to optimize for conflict resolution, and then it would just tile the universe with tiny artificial conflicts between artificial agents for it to resolve as quickly and efficiently as possible without letting those agents do anything themselves.

I don't believe an AI which simultaneously optimized multiple utility functions using a moral parliament approach would tile the universe with tiny artificial agents as described here.

"optimizing for competing utility functions" is not the same as optimizing for conflict resolution. There are various schemes for combining utility functions (some discussion on this podcast for instance). But let's wave our hands a bit and say each of my utility functions outputs a binary approve/disapprove signal for any given action, and we choose randomly among those actions which are approved of by all of my utility functions. Then if even a single utility function doesn't approve of the action "tile the universe with tiny artificial conflicts between artificial agents for it to resolve as quickly and efficiently as possible without letting those agents do anything themselves", this action will not be done.

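The approve/disapprove scheme described above can be sketched in a few lines of Python. This is only an illustration of that hand-wavy version, not a real moral parliament: the names `choose_action`, `ACTIONS`, `no_tiling` and `anything_goes` are hypothetical, and the verdicts they return are made up for the example.

```python
import random
from typing import Callable, Optional, Sequence

# One "approver" stands in for a utility function reduced to a yes/no verdict.
Approver = Callable[[str], bool]


def choose_action(
    actions: Sequence[str],
    approvers: Sequence[Approver],
    rng: Optional[random.Random] = None,
) -> Optional[str]:
    """Pick uniformly at random among actions that every approver accepts."""
    if rng is None:
        rng = random.Random()
    # A single disapproval is enough to veto an action.
    permitted = [a for a in actions if all(approve(a) for approve in approvers)]
    if not permitted:
        return None  # nothing is unanimously approved
    return rng.choice(permitted)


# Hypothetical action set and verdicts, purely for illustration.
ACTIONS = ["tile universe with tiny conflicts", "mediate a dispute", "do nothing"]


def no_tiling(action: str) -> bool:
    return "tile" not in action


def anything_goes(action: str) -> bool:
    return True


# Three permissive approvers and one that objects to tiling.
APPROVERS = [anything_goes, anything_goes, anything_goes, no_tiling]

picked = choose_action(ACTIONS, APPROVERS)
print(picked)
```

Because selection happens only among unanimously approved actions, the tiling action can never be picked here, whatever the other three approvers say. Whether real utility functions can be reduced to verdicts like this is a separate question.
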
Good point, I missed that; will fix later. More likely that effect would result from programming the AI with the overlap between those utility functions, but I'm not totally sure, so I'll have to think about it. I don't think that point is actually necessary for the crux of my argument, though. Like I said, I'll have to think about it. Right now it's almost 4am and I'm really sick.

1. I think there are a lot more than four different kinds of moral system.

2. If "value alignment" turns out to be possible in any sense stronger than "alignment of a superintelligence's values with those of one human or more-than-averagely-coherent group" it won't mean making it agree with all of humanity about everything, or even about every question of moral values. That's certainly impossible, and its impossibility is not news.

Way back, Eliezer had a (half-baked) scheme he called "coherent extrapolated volition", whose basic premise was that even though different people think and feel very differently about values it might turn out that if you imagine giving everyone more and better information, clearer thinking, and better channels of communication with one another, then their values might converge as you did so. That's probably wrong, but I'm not sure it's obviously wrong, and some other thing along similar lines might turn out to be right.

An example of the sort of thing that could be true: while "maximize flourishing" and "minimize suffering" look like quite different goals, it might be that there's a single underlying intuition that they stem from. (Perhaps which appeals more to you depends largely on whether you have noticed more dramatic examples of suffering or of missed opportunities to flourish.) Another: "an eye for an eye" isn't, and can't be, a moral system on its own -- it's dependent on having an idea of what sort of thing counts as putting out someone's eye. So what's distinctive about "an eye for an eye" is that we might want some people to flourish less, if they have been Bad. Well, it might turn out either that a strong policy of punishing defectors leads to things being better for almost everyone (in which case "Minnie" and "Maxie" might embrace that principle on pragmatic grounds, given enough evidence and understanding) or that it actually makes things worse even for the victims (in which case "Ivan" might abandon it on pragmatic grounds, given enough evidence and understanding).

3. Suppose that hope turns out to be illusory and there's no such thing as a single set of values that can reasonably claim to be in any sense the natural extrapolation of everyone's values. It might still turn out possible, e.g., to make a superintelligent entity whose values are, and remain, within the normal range of generally-respected human values. I think that would still be pretty good.

1. Ask yourself, what sorts of things do we humans typically refer to as "morality" and what things do we NOT refer to as "morality"? There are clearly things that do not go in the morality bucket, like your favorite flavor of ice cream. But okay, what other things do you think go in the morality bucket and why?

2. Because a) the same sorts of arguments can be made in reverse. Just as Minnie or Maxie might come to accept Eye for an Eye on pragmatic grounds because it makes society as a whole better/less bad, Goldie might accept Maximize Flourishing and/or Minimize Suffering on the grounds that it helps create the conditions that make cooperative exchanges possible, and Ivan might come to accept Maximize Flourishing and/or Minimize Suffering because lots of people are being forced to endure consequences that are way out of proportion to any wrongs they might have committed, and that isn't a fair system in the sense that Eye for an Eye entails.

Also because b) there are cases where the four types of morality do not overlap. For instance, Pure Maximize Flourishing would say to uplift the human species, no matter what and no matter how long it takes. Pure Minimize Suffering says you should do this too except with the caveat that if you find a way to end humanity's suffering sooner than that, in a way that is fast and painless, you should do that instead. In other words, a pure Suffering Minimizer might try to mercy kill the human species, while a pure Flourishing Maximizer would not.

Furthermore, if you had a situation where you could either uplift all of the rest of the human species at the cost of one person being tortured forever or have the entire human species go extinct, a pure Flourishing Maximizer would choose the former, while a pure Suffering Minimizer would choose the latter.

3. And then all values outside of that range are eliminated from existence because they weren't included in the AI's utility function.

1. Some varieties of moral thinking whose diversity doesn't seem to me to be captured by your eye-for-eye/golden-rule/max-flourish/min-suffer schema:

  • For some people, morality is all about results ("consequentialists"). For some, it's all about following some moral code ("deontologists"). For some, it's all about what sort of person you are ("virtue ethicists"). Your Minnie and Maxie are clearly consequentialists; perhaps Ivan is a deontologist; it's hard to be sure what Goldie is; but these different outlooks can coexist with a wide variety of object-level moral preferences and your four certainly don't cover all the bases here.
  • Your four all focus on moral issues surrounding _harming and benefiting_ people. Pretty much everyone does care about those things, but other very different things are important parts of some people's moral frameworks. For instance, some people believe in a god or gods and think _devotion to their god(s)_ more important than anything else; some people attach tremendous importance to various forms of sexual restraint (only within marriage! only between a man and a woman! only if it's done in a way that could in principle lead to babies! etc.); some people (perhaps this is part of where Ivan is coming from, but you can be quite Ivan-like by other means) have moral systems in which _honour_ is super-important and e.g. if someone insults you then you have to respond by taking them down as definitively as possible.

2. (You're answering with "Because ..." but I don't see what "why?" question I asked, either implicitly or explicitly, so at least one of us has misunderstood something here.) (a) I agree that there are lots of different ways in which convergence could happen, but I don't see why that in any way weakens the point that, one way or another, it _could_ happen. (b) It is certainly true that Maxie and Minnie, as they are now, disagree about some important things; again, that isn't news. The point I was trying to make is that it might turn out that as you give Maxie and Minnie more information, a deeper understanding of human nature, more opportunities to talk to one another, etc., they stop disagreeing, and if that happens then we might do OK to follow whatever system they end up with.

3. I'm not sure what you mean about "values being eliminated from existence"; it's ambiguous. Do you mean (i) there stop being people around who have those values or (ii) the world proceeds in a way that doesn't, whether or not anyone cares, tend to satisfy those values? Either way, note that "that range" was the _normal range of respected human values_. Right now, there are no agents around (that we know of) whose values are entirely outside the range of human values, and we're getting on OK. There are agents (e.g., some psychopaths, violent religious zealots, etc.) whose values are within the range of human values but outside the range of _respected human values_, and by and large we try to give them as little influence as possible. To be clear, I'm not proposing "world ruled by an entity whose values are similar to those of some particular human being generally regarded as decent" as a _triumphant win for humanity_, but it's not an _obvious catastrophe_ either and so far as I can tell the sort of issue you're raising presents no obstacle to that sort of outcome.

1a. Deontology/virtue ethics is a special case of consequentialism. The reason for following deontological rules is because the consequences that result from following deontological rules almost always tend to be better than the consequences of not following deontological rules. The exceptions where it is wiser to not follow deontological rules are generally rare.

1b. Those are social mores, not morals. If a human is brainwashed into shutting down the forces of empathy and caring within themselves, then they can be argued into treating any social more as a moral rule.

2. Sorry I should have started that paragraph by repeating what you said, just to make it clear what I was responding to. I don't think the four moralities converge when everyone has more information because....

I will also note that while Ivan might adopt Maximize flourishing and/or Minimize Suffering on pragmatic (aka instrumental) grounds, Ivan is a human, and humans don't really have terminal values. If instead Ivan was an AI programmed with Eye-for-an-Eye, it might temporarily adopt Maximize Flourishing and/or Minimize Suffering as an instrumental goal, and then go back to Eye-for-an-Eye later.

3a. "Suppose that hope turns out to be illusory and there's no such thing as a single set of values that can reasonably claim to be in any sense the natural extrapolation of everyone's values." Those were your exact words. Now, if there is no such thing as a single set of values that are the natural extrapolation of everyone's values, then choosing a subset of everyone's values which are in the normal range of respected human values for the AI to optimize for would mean that all the human values that are not in the normal range would be eliminated. If the AI doesn't have a term for something in its utility function, it has no reason to let that thing waste resources and space it can use for things that are actually in its utility function. And that's assuming that a value like "freedom and self-determination for humans" is something that can actually be correctly programmed into an AI, which I'm pretty sure it can't because it would mean that the AI would have to value the act of doing nothing most of the time and only activating when things are about to go drastically wrong. And that wouldn't be an optimization process.

3b. "Either way, note that "that range" was the _normal range of respected human values_. Right now, there are no agents around (that we know of) whose values are entirely outside the range of human values, and we're getting on OK."

You just switched from "outside the normal range of respected human values" to "entirely outside the range of human values". Those are not at all the same thing. Furthermore, the scenario you described as "pretty good" was one where it still turns out possible to make a superintelligence whose values are, and remain, within the normal range of generally-respected human values.

Within the normal range of generally-respected human values. NOT within the entire range of human values. If we were instead talking about a superintelligence that was programmed with the entire range of human values, rather than only a subset of them, then that would be a totally different scenario and would require an entirely different argument to support it than the one you were making.

1. Neither deontology nor virtue ethics is a special case of consequentialism. Some people really, truly do believe that sometimes the action with worse consequences is better. There are, to be sure, ways for consequentialists sometimes to justify deontologists' rules, or indeed their policy of rule-following, on consequentialist grounds -- and for that matter there are ways to construct rule-based systems that justify consequentialism. ("The one moral rule is: Do whatever leads to maximal overall flourishing!") They are still deeply different ways of thinking about morality.

You consider questions of sexual continence, honour, etc., "social mores, not morals", but I promise you there are people who think of such things as morals. You think such people have been "brainwashed", and perhaps they'd say the same about you; that's what moral divergence looks like.

2. I think that if what you wrote was intended to stand after "I think there is no convergence of moralities because ..." then it's missing a lot of steps. I should maybe repeat that I'm not asserting that there is convergence; quite likely there isn't. But I don't think anything you've said offers any strong reason to think that there isn't.

3. Once again, I think you are not being clear about the distinction between the things I labelled (i) and (ii), and I think it matters. And, more generally, it feels as if we are talking past one another: I get the impression that either you haven't understood what I'm saying, or you think I haven't understood what you're saying.

Let's be very concrete here. Pick some human being whose values you find generally admirable. Imagine that we put that person in charge of the world. We'll greatly increase their intelligence and knowledge, and fix any mental deficits that might make them screw up more than they need to, and somehow enable them to act consistently according to those admirable values (rather than, e.g., turning completely selfish once granted power, as real people too often do). Would you see that as an outcome much better than many of the nightmare misaligned-AI scenarios people worry about?

I would; while there's no human being I would altogether trust to be in charge of the universe, no matter how they might be enhanced, I think putting almost any human being in charge of the universe would (if they were also given the capacity to do the job without being overwhelmed) likely be a big improvement over (e.g.) tiling the universe with paperclips or little smiley human-looking faces, or over many scenarios where a super-powerful AI optimizes some precisely-specified-but-wrong approximation to one aspect of human values.

I would not expect such a person in that situation to eliminate people with different values from theirs, or to force everyone to live according to that person's values. I would not expect such a person in that situation to make a world in which a lot of things I find essential have been eliminated. (Would you? Would you find such behaviour generally admirable?)

And my point here is that nothing in your arguments shows any obstacle to doing essentially that. You argue that we can't align an AI's values with those of all of humanity because "all of humanity" has too many different diverging values, and that's true, but there remains the possibility that we could align them with those of some of humanity, to something like the extent that any individual's values are aligned with those of some of humanity, and even if that's the best we can hope for the difference between that and (what might be the default, if we ever make any sort of superintelligent AI) aligning its values with those of none of humanity is immense.

(Why am I bothering to point that out? Because it looks to me as if you're trying to argue that worrying about "value alignment" is likely a waste of time because there can be no such thing as value alignment; I say, on the contrary, that even though some notions of value alignment are obviously unachievable and some others may be not-so-obviously unachievable, still others are almost certainly achievable in principle and still valuable. Of course, I may have misunderstood what you're actually arguing for: that's the risk you take when you choose to speak in parables without explaining them.)

I feel I need to defend myself on one point. You say "You switched from X to Y" as if you think I either failed to notice the change or else was trying to pull some sort of sneaky bait-and-switch. Neither is the case, and I'm afraid I think you didn't understand the structure of my argument. I wanted to argue "we could do Thing One, and that would be OK". I approached this indirectly, by first of all arguing that we already have Thing Two, which is somewhat like Thing One, and is OK, and then addressing the difference between Thing One and Thing Two. But you completely ignored the bit where I addressed the difference, and just said "oi, there's a difference" as if I had no idea (or was pretending to have no idea) that there is one.

1. On the deontology/virtue ethics vs consequentialism thing, you're right. I don't know how I missed that, thanks!

1a. I'll have to think about that a bit more.

2. Well, if we were just going off of the four moralities I described, then I already named two examples where two of those moralities are unable to converge: a pure flourishing maximizer wouldn't want to mercy kill the human species, but a pure suffering minimizer would. A pure flourishing maximizer would be willing to have one person tortured forever if that was a necessary prerequisite for uplifting the rest of the human species into a transhumanist utopia. A suffering minimizer would not. Even if the four moralities I described only cover a small fraction of moral behaviors, then wouldn't that still be a hard counterexample to the idea that there is convergence?

3. I think when you said "within the normal range of generally-respected human values", I took that literally, meaning I thought it excluded values which were not in the normal range and not generally respected even if they are things like "reading Adult My Little Pony fanfiction". Not every value which isn't well respected or in the normal range would make the world a better place through its removal. I thought that would be self-evident to everyone here, and so I didn't explain it. And then it looked to me like you were trying to justify the removal of all values which aren't generally respected or within the normal range as being "okay". So when you said "Right now, there are no agents around (that we know of) whose values are entirely outside the range of human values, and we're getting on OK." I thought it was intended to be in support of the removal of all values which aren't well respected or in the normal range. But if you're trying to support the removal of niche values in particular, saying that current humans are getting along fine with their current whole range of values, which one would presume must include the niche values, does not make sense.

About to fall asleep, I'll write more of my response later.

2. Again, there are plenty of counterexamples to the idea that human values have already converged. The idea behind e.g. "coherent extrapolated volition" is that (a) they might converge given more information, clearer thinking, and more opportunities for those with different values to discuss, and (b) we might find the result of that convergence acceptable even if it doesn't quite match our values now.

3. Again, I think there's a distinction you're missing when you talk about "removal of values" etc. Let's take your example: reading adult MLP fanfiction. Suppose the world is taken over by some being that doesn't value that. (As, I think, most humans don't.) What are the consequences for those people who do value it? Not necessarily anything awful, I suggest. Not valuing reading adult MLP fanfiction doesn't imply (e.g.) an implacable war against those who do. Why should it? It suffices that the being that takes over the world cares about people getting what they want; in that case, if some people like to write adult MLP fanfiction and some people like to read it, our hypothetical superpowerful overlord will likely prefer to let those people get on with it.

But, I hear you say, aren't those fanfiction works made of -- or at least stored in -- atoms that the Master of the Universe can use for something else? Sure, they are, and if there's literally nothing in the MotU's values to stop it repurposing them then it will. But there are plenty of things that can stop the MotU repurposing those atoms other than its own fondness for adult MLP fanfiction -- such as, I claim, a preference for people to get what they want.

There might be circumstances in which the MotU does repurpose those atoms: perhaps there's something else it values vastly more that it can't get any other way. But the same is true right here in this universe, in which we're getting on OK. If your fanfiction is hosted on a server that ends up in a war zone, or a server owned by a company that gets sold to Facebook, or a server owned by an individual in the US who gets a terrible health problem and needs to sell everything to raise funds for treatment, then that server is probably toast, and if no one else has a copy then the fanfiction is gone. What makes a superintelligent AI more dangerous here, it seems to me, is that maybe no one can figure out how to give it even humanish values. But that's not a problem that has much to do with the divergence within the range of human values: again, "just copy Barack Obama's values" (feel free to substitute someone whose values you like better, of course) is a counterexample, because most likely even an omnipotent Barack Obama would not feel the need to take away your guns^H^H^H^Hfanfiction.

To reiterate the point I think you've been missing: giving supreme power to (say) a superintelligent AI doesn't remove from existence all those people who value things it happens not to care about, and if it cares about their welfare then we should not expect it to wipe them out or to wipe out the things they value.

While alignment of superintelligent AI is probably unsolvable, or at least not provably solvable, AI safety is solvable. Just prevent the creation of any advanced AI, and you will get some form of AI safety. However, preventing AGI creation requires some forms of AI: even if one wants to target AI labs with nukes, the nukes still need guidance systems.

In other words, we could suppose that there is a level of AI development that is enough to stop AGI development but not enough to create AGI-related risks. I call it "Narrow AI Nanny".

Similar ideas were expressed by Roman Yampolskiy here, where he wrote about "artificial stupidity" for a low-impact AI; by Goertzel, who wrote about an AI Nanny; and by Drexler, who wrote about comprehensive AI services as an alternative to AGI.

I think that "the value alignment problem" is not something that currently has a universally acknowledged and precise definition and a lot of the work that is currently being done is to get less confused about what is meant by this.

From what I see, in your proof you started from a particular meaning of this term and then went on to show it is impossible.

Which means that human values, or at least the individual non-morality-based values don’t converge, which means that you can’t design an artificial superintelligence that contains a term for all human values

Here you observe that if "the value alignment problem" means to construct something which has the values of all humans at the same time, it is impossible because there exist humans with contradictory values. So you propose the new definition "to construct something with all human moral values". You continue to observe that the four moral values you give are also contradictory, so this is also impossible.

And even if somehow you could program an intelligence to optimize for those four competing utility functions at the same time,

So now we are looking at the definition "to program for the four different utility functions at the same time". As has been observed in a different comment, this is somewhat underspecified and there might be different ways to interpret and implement it. For one such way you predict

that would just cause it to optimize for conflict resolution, and then it would just tile the universe with tiny artificial conflicts between artificial agents for it to resolve as quickly and efficiently as possible without letting those agents do anything themselves.

It seems to me that the scenario behind this course of events would be: we build an AI and give it the four moralities; noticing their internal contradictions, it analyzes them and finds that they serve the purpose of conflict resolution. It then makes this its new, consistent goal and builds these tiny conflict scenarios. I'm not saying that this is implausible, but I don't think it is a course of events without alternatives (and these would depend on the way the AI is built to resolve conflicting goals).

To summarize, I think out of the possible specifications of "the value alignment problem", you picked three (all human values, all human moral values, "optimizing the four moralities") and showed that the first two are impossible and the third leads to undesired consequences (under some further assumptions).

However, I think there are many things which people would consider a solution of "the value alignment problem" and which don't satisfy one of these three descriptions. Maybe there is a subset of the human values without contradiction, such that most people would be reasonably happy with the result of a superhuman AI optimizing these values. Maybe the result of an AI maximizing only the "Maximize Flourishing"-morality would lead to a decent future. I would be the first to admit that those scenarios I describe are themselves severely underspecified, just vaguely waving at a subset of the possibility space, but I imagine that these subsets could contain things we would call "a solution of the value alignment problem".

Except that for humans, life is a journey, not a destination. If you make a "maximize flourishing" optimizer you would need to rigorously define what you meant by flourishing, which requires a rigorous definition of a general human utility function, which doesn't and cannot exist. Human values are instrumental all the way down. Some values are just more instrumental than others--that is the mechanism which allows for human values to be over 4d experiences rather than 3d states. I mean, what other mechanism could result in that for a human mind? This is a natural implication of "adaptation executors not fitness maximizers".

And I will note that humans tend to care a lot about their own freedom and self-determination. Basically the only way for an intelligence to be "friendly" is for it to first solve scarcity and then be inactive most of the time, only waking up to prevent atrocities like murder, torture, or rape, or to deal with the latest existential threat. In other words, not an optimization process at all, because it would have an arbitrary stopping point where it does not itself raise human values any further.

You cannot program a general intelligence with a fundamental drive to ‘not intervene in human affairs except when things are about to go drastically wrong otherwise, where drastically wrong is defined as either rape, torture, involuntary death, extreme debility, poverty or existential threats’ because that is not an optimization function.

In the extreme limit, you could create a horribly gerrymandered utility function where you assign 0 utility to universes where those bad things are happening, 1 utility to universes where they aren't, and some reduced impact thing which means that it usually prefers to do nothing.
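For concreteness, here is a minimal sketch of what such a gerrymandered utility function might look like. Everything in it is an assumption made for illustration: the names `Outcome`, `utility`, `choose`, the `IMPACT_WEIGHT` constant, and the idea of scoring "impact" as a single number between 0 and 1 are hypothetical and do not come from the post. It only shows that a 0/1 indicator plus a small impact penalty prefers doing nothing unless an atrocity would otherwise occur.

```python
from dataclasses import dataclass

# Hypothetical names throughout; this is a toy model, not a real proposal.


@dataclass(frozen=True)
class Outcome:
    atrocity_occurs: bool  # e.g. torture, involuntary death, existential threat
    impact: float          # how much the AI changed the world, 0.0 = did nothing


# Assumed penalty weight. With impact in [0, 1] it must stay below 1,
# otherwise the agent would rather allow an atrocity than act.
IMPACT_WEIGHT = 0.1


def utility(outcome: Outcome) -> float:
    """0 for universes with an atrocity, 1 otherwise, minus an impact penalty."""
    base = 0.0 if outcome.atrocity_occurs else 1.0
    return base - IMPACT_WEIGHT * outcome.impact


def choose(options: dict[str, Outcome]) -> str:
    """Pick the action whose predicted outcome has the highest utility."""
    return max(options, key=lambda name: utility(options[name]))


# Nothing bad is about to happen: acting only costs impact.
calm_day = {
    "do_nothing": Outcome(atrocity_occurs=False, impact=0.0),
    "meddle": Outcome(atrocity_occurs=False, impact=1.0),
}

# An atrocity happens unless the agent acts.
crisis = {
    "do_nothing": Outcome(atrocity_occurs=True, impact=0.0),
    "intervene": Outcome(atrocity_occurs=False, impact=1.0),
}

print(choose(calm_day))  # do_nothing
print(choose(crisis))    # intervene
```

The sketch sidesteps the hard parts, namely how `atrocity_occurs` and `impact` would actually be defined and measured, which is where the "horribly gerrymandered" description applies.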

It seems you're very optimistic that there are only four utility functions.

If that were true, you could do something like a value handshake and optimize that.

Or find a stable set of changes that are endorsed by all utility functions at every step and would end up with all four utility functions being equivalent.

Or you could create different enclaves that allowed people to maximize their values within their own group; over time, a group that cared about other groups taking on its values would ultimately win, while the other groups got their values met in the meantime.

But that's assuming there are only four utility functions. Other options:

  1. Humans don't have consistent preferences, and therefore their preferences can't be expressed as a utility function.

  2. We do have terminal preferences that are consistent, but they are shaped by our experiences, such that there are as many utility functions as there are humans.

  3. Evolution gave us all only one terminal preference, and the seeming difference is just based on different strategies to reach that goal. Then alignment is very easy.

Point being that I think there are a lot of hidden assumptions in this post. Both in terms of the problem space and the solution space.
