Superintelligence 24: Morality models and "do what I mean"

by KatjaGrace 5y24th Feb 20155 min read47 comments


This is part of a weekly reading group on Nick Bostrom's book, Superintelligence. For more information about the group, and an index of posts so far see the announcement post. For the schedule of future topics, see MIRI's reading guide.

Welcome. This week we discuss the twenty-fourth section in the reading guideMorality models and "Do what I mean".

This post summarizes the section, and offers a few relevant notes, and ideas for further investigation. Some of my own thoughts and questions for discussion are in the comments.

There is no need to proceed in order through this post, or to look at everything. Feel free to jump straight to the discussion. Where applicable and I remember, page numbers indicate the rough part of the chapter that is most related (not necessarily that the chapter is being cited for the specific claim).

Reading: “Morality models” and “Do what I mean” from Chapter 13.


  1. Moral rightness (MR) AI: AI which seeks to do what is morally right
    1. Another form of 'indirect normativity'
    2. Requires moral realism to be true to do anything, but we could ask the AI to evaluate that and do something else if moral realism is false
    3. Avoids some complications of CEV
    4. If moral realism is true, is better than CEV (though may be terrible for us)
  2. We often want to say 'do what I mean' with respect to goals we try to specify. This is doing a lot of the work sometimes, so if we could specify that well perhaps it could also just stand alone: do what I want. This is much like CEV again.

Another view

Olle Häggström again, on Bostrom's 'Milky Way Preserve':

The idea [of a Moral Rightness AI] is that a superintelligence might be successful at the task (where we humans have so far failed) of figuring out what is objectively morally right. It should then take objective morality to heart as its own values.1,2

Bostrom sees a number of pros and cons of this idea. A major concern is that objective morality may not be in humanity's best interest. Suppose for instance (not entirely implausibly) that objective morality is a kind of hedonistic utilitarianism, where "an action is morally right (and morally permissible) if and only if, among all feasible actions, no other action would produce a greater balance of pleasure over suffering" (p 219). Some years ago I offered a thought experiment to demonstrate that such a morality is not necessarily in humanity's best interest. Bostrom reaches the same conclusion via a different thought experiment, which I'll stick with here in order to follow his line of reasoning.3 Here is his scenario:
    The AI [...] might maximize the surfeit of pleasure by converting the accessible universe into hedonium, a process that may involve building computronium and using it to perform computations that instantiate pleasurable experiences. Since simulating any existing human brain is not the most efficient way of producing pleasure, a likely consequence is that we all die.
Bostrom is reluctant to accept such a sacrifice for "a greater good", and goes on to suggest a compromise:
    The sacrifice looks even less appealing when we reflect that the superintelligence could realize a nearly-as-great good (in fractional terms) while sacrificing much less of our own potential well-being. Suppose that we agreed to allow almost the entire accessible universe to be converted into hedonium - everything except a small preserve, say the Milky Way, which would be set aside to accommodate our own needs. Then there would still be a hundred billion galaxies devoted to the maximization of pleasure. But we would have one galaxy within which to create wonderful civilizations that could last for billions of years and in which humans and nonhuman animals could survive and thrive, and have the opportunity to develop into beatific posthuman spirits.

    If one prefers this latter option (as I would be inclined to do) it implies that one does not have an unconditional lexically dominant preference for acting morally permissibly. But it is consistent with placing great weight on morality. (p 219-220)

What? Is it? Is it "consistent with placing great weight on morality"? Imagine Bostrom in a situation where he does the final bit of programming of the coming superintelligence, to decide between these two worlds, i.e., the all-hedonium one versus the all-hedonium-except-in-the-Milky-Way-preserve.4 And imagine that he goes for the latter option. The only difference it makes to the world is to what happens in the Milky Way, so what happens elsewhere is irrelevant to the moral evaluation of his decision.5 This may mean that Bostrom opts for a scenario where, say, 1024 sentient beings will thrive in the Milky Way in a way that is sustainable for trillions of years, rather than a scenarion where, say, 1045 sentient beings will be even happier for a comparable amount of time. Wouldn't that be an act of immorality that dwarfs all other immoral acts carried out on our planet, by many many orders of magnitude? How could that be "consistent with placing great weight on morality"?6



1. Do What I Mean is originally a concept from computer systems, where the (more modest) idea is to have a system correct small input errors.

2. To the extent that people care about objective morality, it seems coherent extrapolated volition (CEV) or Christiano's proposal would lead the AI to care about objective morality, and thus look into what it is. Thus I doubt it is worth considering our commitments to morality first (as Bostrom does in this chapter, and as one might do before choosing whether to use a MR AI), if general methods for implementing our desires are on the table. This is close to what Bostrom is saying when he suggests we outsource the decision about which form of indirect normativity to use, and eventually winds up back at CEV. But it seems good to be explicit.

3. I'm not optimistic that behind every vague and ambiguous command, there is something specific that a person 'really means'. It seems more likely there is something they would in fact try to mean, if they thought about it a bunch more, but this is mostly defined by further facts about their brains, rather than the sentence and what they thought or felt as they said it. It seems at least misleading to call this 'what they meant'. Thus even when '—and do what I mean' is appended to other kinds of goals than generic CEV-style ones, I would expect the execution to look much like a generic investigation of human values, such as that implicit in CEV.

4. Alexander Kruel criticizes 'Do What I Mean' being important, because every part of what an AI does is designed to be what humans really want it to be, so it seems unlikely to him that AI would do exactly what humans want with respect to instrumental behaviors (e.g. be able to understand language, and use the internet and carry out sophisticated plans), but fail on humans' ultimate goals:

Outsmarting humanity is a very small target to hit, requiring a very small margin of error. In order to succeed at making an AI that can outsmart humans, humans have to succeed at making the AI behave intelligently and rationally. Which in turn requires humans to succeed at making the AI behave as intended along a vast number of dimensions. Thus, failing to predict the AI’s behavior does in almost all cases result in the AI failing to outsmart humans.

As an example, consider an AI that was designed to fly planes. It is exceedingly unlikely for humans to succeed at designing an AI that flies planes, without crashing, but which consistently chooses destinations that it was not meant to choose. Since all of the capabilities that are necessary to fly without crashing fall into the category “Do What Humans Mean”, and choosing the correct destination is just one such capability.

I disagree that it would be surprising for an AI to be very good at flying planes in general, but very bad at going to the right places in them. However it seems instructive to think about why this is.

In-depth investigations

If you are particularly interested in these topics, and want to do further research, these are a few plausible directions, some inspired by Luke Muehlhauser's list, which contains many suggestions related to parts of Superintelligence. These projects could be attempted at various levels of depth.

  1. Are there other general forms of indirect normativity that might outsource the problem of deciding what indirect normativity to use?
  2. On common views of moral realism, is morality likely to be amenable to (efficient) algorithmic discovery?
  3. If you knew how to build an AI with a good understanding of natural language (e.g. it knows what the word 'good' means as well as your most intelligent friend), how could you use this to make a safe AI?
If you are interested in anything like this, you might want to mention it in the comments, and see whether other people have useful thoughts.

How to proceed

This has been a collection of notes on the chapter.  The most important part of the reading group though is discussion, which is in the comments section. I pose some questions for you there, and I invite you to add your own. Please remember that this group contains a variety of levels of expertise: if a line of discussion seems too basic or too incomprehensible, look around for one that suits you better!

Next week, we will talk about other abstract features of an AI's reasoning that we might want to get right ahead of time, instead of leaving to the AI to fix. We will also discuss how well an AI would need to fulfill these criteria to be 'close enough'. To prepare, read “Component list” and “Getting close enough” from Chapter 13. The discussion will go live at 6pm Pacific time next Monday 2 March. Sign up to be notified here.