This is part of a weekly reading group on Nick Bostrom's book, Superintelligence. For more information about the group, and an index of posts so far see the announcement post. For the schedule of future topics, see MIRI's reading guide.

Welcome. This week we discuss the twenty-first section in the reading guide: Value learning.

This post summarizes the section, and offers a few relevant notes, and ideas for further investigation. Some of my own thoughts and questions for discussion are in the comments.

There is no need to proceed in order through this post, or to look at everything. Feel free to jump straight to the discussion. Where applicable (and where I remember), page numbers indicate the rough part of the chapter that is most related (not necessarily that the chapter is being cited for the specific claim).

Reading: “Value learning” from Chapter 12


  1. One way an AI could come to have human values without humans having to formally specify what their values are is for the AI to learn about the desired values from experience.
  2. To implement this 'value learning' we would need to at least implicitly define a criterion for what is valuable, which we could cause the AI to care about. Some examples of criteria:
    1.  'F' where 'F' is a thing people talk about, and their words are considered to be about the concept of interest (Yudkowsky's proposal) (p197-8, box 11)
    2. Whatever is valued by another AI elsewhere in the universe (Bostrom's 'Hail Mary' proposal) (p198-9, box 12)
    3. What a specific virtual human would report to be his value function, given a large amount of computing power and the ability to create virtual copies of himself. The virtual human can be specified mathematically as the simplest system that would match some high resolution data collected about a real human (Christiano's proposal). (p200-1)
  3. The AI would try to maximize these implicit goals given its best understanding, while at the same time being motivated to learn more about its own values.
  4. A value learning agent might have a prior probability distribution over possible worlds, and also over correct sets of values conditional on possible worlds. Then it could choose its actions to maximize their expected value, given these probabilities.
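
A minimal sketch of point 4, assuming a toy setting with two possible worlds, two candidate value functions and two actions. Every name and number below is a made-up illustration, not something from the book, and the sketch ignores the fact that real actions also change which world the agent is in.

```python
from typing import Dict, Iterable

# Hypothetical prior over possible worlds (illustrative numbers only).
P_WORLD: Dict[str, float] = {"w1": 0.7, "w2": 0.3}

# Hypothetical P(value function is correct | world).
P_VALUES_GIVEN_WORLD: Dict[str, Dict[str, float]] = {
    "w1": {"v_a": 0.9, "v_b": 0.1},
    "w2": {"v_a": 0.2, "v_b": 0.8},
}

# VALUE[v][world][action]: how good an action is in a world according to v.
VALUE: Dict[str, Dict[str, Dict[str, float]]] = {
    "v_a": {"w1": {"act_x": 1.0, "act_y": 0.0},
            "w2": {"act_x": 0.5, "act_y": 0.0}},
    "v_b": {"w1": {"act_x": 0.0, "act_y": 1.0},
            "w2": {"act_x": 0.0, "act_y": 2.0}},
}


def expected_value(action: str) -> float:
    """Average an action's value over worlds and over value functions."""
    total = 0.0
    for world, p_world in P_WORLD.items():
        for v_name, p_v in P_VALUES_GIVEN_WORLD[world].items():
            total += p_world * p_v * VALUE[v_name][world][action]
    return total


def best_action(actions: Iterable[str]) -> str:
    """Pick the action with the highest expected value."""
    return max(actions, key=expected_value)
```

Here `best_action(["act_x", "act_y"])` picks whichever action scores best once uncertainty about both the world and the correct values is averaged out. New evidence would enter by updating `P_WORLD` and `P_VALUES_GIVEN_WORLD`.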

Another view

Paul Christiano describes an alternative to loading values into an AI at all:

Most thinking about “AI safety” has focused on the possibility of goal-directed machines, and asked how we might ensure that their goals are agreeable to humans. But there are other possibilities.

In this post I will flesh out one alternative to goal-directed behavior. I think this idea is particularly important from the perspective of AI safety.

Approval-directed agents

Consider a human Hugh, and an agent Arthur who uses the following procedure to choose each action:

Estimate the expected rating Hugh would give each action if he considered it at length. Take the action with the highest expected rating.

I’ll call this “approval-directed” behavior throughout this post, in contrast with goal-directed behavior. In this context I’ll call Hugh an “overseer.”

Arthur’s actions are rated more highly than those produced by any alternative procedure. That’s comforting, but it doesn’t mean that Arthur is optimal. An optimal agent may make decisions that have consequences Hugh would approve of, even if Hugh can’t anticipate those consequences himself. For example, if Arthur is playing chess he should make moves that are actually good—not moves that Hugh thinks are good.

...[However, there are many reasons Hugh would want to use the proposal]...

In most situations, I would expect approval-directed behavior to capture the benefits of goal-directed behavior, while being easier to define and more robust to errors.

If this interests you, I recommend the much longer post, in which Christiano describes and analyzes the proposal in much more depth.
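
As a rough illustration of the procedure quoted above, here is a sketch in which Arthur's estimate of Hugh's considered rating is just the mean of some sampled ratings. The action names and sample values are hypothetical, and how Arthur would actually model Hugh is the hard part that this sketch assumes away.

```python
from statistics import mean
from typing import Dict, List

# Hypothetical samples from Arthur's model of how Hugh would rate each
# action after considering it at length (illustrative numbers only).
HUGH_RATING_SAMPLES: Dict[str, List[float]] = {
    "ask_for_clarification": [0.8, 0.9, 0.7],
    "act_immediately": [0.4, 0.95, 0.3],
    "do_nothing": [0.5, 0.5, 0.5],
}


def expected_rating(action: str, samples: Dict[str, List[float]]) -> float:
    """Estimate the rating Hugh would give an action as the sample mean."""
    return mean(samples[action])


def approval_directed_choice(samples: Dict[str, List[float]]) -> str:
    """Take the action with the highest expected rating."""
    return max(samples, key=lambda action: expected_rating(action, samples))
```

Note that the choice depends only on predicted approval of the action itself, not on any estimate of the action's downstream consequences.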


1. An analogy
An AI doing value learning is in a similar situation to me if I want to help my friend but don't know what she needs. Even though I don't know explicitly what I want to do, it is defined indirectly, so I can learn more about it. I would presumably follow my best guesses, while trying to learn more about my friend's actual situation and preferences. This is also what we hope the value learning AI will do.

2. Learning what to value
If you are interested in value learning, Dewey's paper is the main thing written on it in the field of AI safety.

3. Related topics
I mentioned inverse reinforcement learning and goal inference last time, but should probably have kept them for this week, to which they are more relevant. Preference learning is another related subfield of machine learning, and learning by demonstration is generally related. Here is a quadcopter using inverse reinforcement learning to infer what its teacher wants it to do. Here is a robot using goal inference to help someone build a toy.

4. Value porosity
Bostrom has lately written about a new variation on the Hail Mary approach, in which the AI at home is motivated to trade with foreign AIs (via everyone imagining each other's responses), and has preferences that are very cheap for foreign AIs to guess at and fulfil.

5. What's the difference between value learning and reinforcement learning?
We heard about reinforcement learning last week, and Bostrom found it dangerous. Since it also relies on teaching the AI values by giving it feedback, you might wonder how exactly the proposals relate to each other.

Suppose the owner of an AI repeatedly comments that various actions are 'friendly'. A reinforcement learner would perhaps care about hearing the word 'friendly' as much as possible. A value learning AI on the other hand would take use of the word 'friendly' as a clue about a hidden thing that it cares about. This means if the value learning AI could trick the person into saying 'friendly' more, this would be no help to it—the trick would just make the person's words a less good clue. The reinforcement learner on the other hand would love to get the person to say 'friendly' whenever possible. This difference also means the value learning AI might end up doing things which it does not expect its owner to say 'friendly' about, if it thinks those actions are supported by the values that it learned from hearing 'friendly'.
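
The contrast can be sketched numerically. In this toy model (my own assumption, not a formalism from the book) the value learner treats the word 'friendly' as evidence about a hidden fact, namely whether the action was actually good, while the reinforcement learner simply counts the word as reward. The function names and probabilities are hypothetical.

```python
def posterior_good(prior: float, p_say_if_good: float, p_say_if_bad: float) -> float:
    """P(action is actually good | owner said 'friendly'), by Bayes' rule."""
    numerator = p_say_if_good * prior
    return numerator / (numerator + p_say_if_bad * (1.0 - prior))


def rl_reward(said_friendly: bool) -> float:
    """A reinforcement learner is rewarded by the word itself."""
    return 1.0 if said_friendly else 0.0


PRIOR = 0.5

# Honest owner: says 'friendly' mostly when the action really is good.
honest = posterior_good(PRIOR, p_say_if_good=0.9, p_say_if_bad=0.1)

# Tricked owner: says 'friendly' regardless, so the word carries no information.
tricked = posterior_good(PRIOR, p_say_if_good=1.0, p_say_if_bad=1.0)
```

With an honest owner the value learner's belief moves from the prior toward 'good'. With a tricked owner it stays at the prior, so the trick gains it nothing, whereas `rl_reward(True)` pays the same either way.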

In-depth investigations

If you are particularly interested in these topics, and want to do further research, these are a few plausible directions, some inspired by Luke Muehlhauser's list, which contains many suggestions related to parts of Superintelligence. These projects could be attempted at various levels of depth.

  1. Expand upon the value learning proposal. What kind of prior over what kind of value functions should a value learning AI be given? As an input to this, what evidence should be informative about the AI's values?
  2. Analyze the feasibility of Christiano’s proposal for addressing the value-loading problem. 
  3. Analyze the feasibility of Bostrom’s “Hail Mary” approach to the value-loading problem.
  4. Analyze the feasibility of Christiano's newer proposal to avoid learning values.
  5. Investigate the applicability of the related fields mentioned above to producing beneficial AI.
If you are interested in anything like this, you might want to mention it in the comments, and see whether other people have useful thoughts.

How to proceed

This has been a collection of notes on the chapter.  The most important part of the reading group though is discussion, which is in the comments section. I pose some questions for you there, and I invite you to add your own. Please remember that this group contains a variety of levels of expertise: if a line of discussion seems too basic or too incomprehensible, look around for one that suits you better!

Next week, we will talk about the two other ways to direct the values of AI. To prepare, read “Emulation modulation” through “Synopsis” from Chapter 12. The discussion will go live at 6pm Pacific time next Monday 9 February. Sign up to be notified here.


I think Christiano's idea sets us on the right path. One of the things that troubled me in reading Bostrom's book is the prospect that a superintelligence under human control could be at least as dangerous as an autonomous superintelligence. For example, if superintelligence were hatched by one of the competing military superpowers -- Russia, China, or the U.S., let's say -- as an oracle or genie, it might be used for short-sighted or selfish purposes with catastrophic consequences. With superintelligence on his side, the cartoon nightmare of a single man out to rule the world could, um, hardly be ruled out. Our knowledge of human nature gives us every reason to expect abuse. If, however, a superintelligence could be attuned to seeking approval from not just Hugh but humanity, that might have a very different and laudable result. With the proliferation of cell phones through nearly every part of the world (half of the African population is expected to have one by 2017), this opens the possibility that a superintelligence could seek the approval of a broad swath of humanity. Even better would be approval for decisions whose consequences are hidden behind a Rawlsian veil, so that people expressing an opinion cannot know whether they personally will be a winner or loser by the decision. Under such circumstances, people are, on average, remarkably fair. Still, there may be both practical and theoretical shortcomings to my suggestion. I'd welcome critiques.

The obvious problem is lack of human intelligence. Whether a majority of people on Earth will agree with some question the AI asks them by cell phone will depend on how exactly the question is framed.

Does it invoke popular applause lights? The majority will say "yes".

Is it complicated? The majority will say: "I do not understand the question." Now what? Does the majority of those who answered otherwise decide? Then, for sufficiently complicated questions the answers will be decided by people who are unable to see their own lack of understanding.

I agree these are problems with asking people questions, but they seem like much easier problems to solve than most of those involved in controlling AI.

Bostrom's Hail Mary approach involves the AI entirely gathering its information about what other AIs would want from its own mental modeling (p199/294 footnote 25). It seems strange then that it could do this if it thought there really was another AI out there, but not if it thought there were not. Why can't it just do what it would do if there were one?

I agree, the actual local existence of other AIs shouldn't make a difference, and the approach could work equally either way. As Bostrom says on page 198, no communication is required.

Nevertheless, for the process to yield a useful result, some possible civilization would have to build a non-HM AI. That civilization might be (locally speaking) hypothetical or simulated, but either way the HM-implementing AI needs to think of it to delegate values. I believe that's what footnote 25 gets at: From a superrational point of view, if every possible civilization (or every one imaginable to the AI we build) at this point in time chooses to use an HM approach to value coding, it can't work.

If all civilizations HailMary to value-code, they would all find out that the others did the same, and because the game doesn't end there, in round two they would decide to use a different approach. Possibly, just as undifferentiated blastula cells use an asymmetric element of the environment (gravity) to decide to start differentiating, AGIs could use local information to decide whether they should HailMary again in the second hypothetical round or whether they should be the ones deciding for themselves (say, information about where you are located in your Hubble volume, or how much available energy there still is in your light cone, or something).

Isn't it the civilization not the AGI that will need to decide what to do?

That depends on whether the AGI is told (and accepts) to HailMary once, or to HailMary to completion, or something in between. It also depends, I believe, on which decision theory the AGI uses to decide. There seems to be, for a large ensemble of decisions, a one-round version of the many-round decision ("No Regrets" Arntzenius2007, "TDT" Yudkowsky2010, "UDT" WeiDai 20xx).

A value learning AI on the other hand would take use of the word 'friendly' as a clue about a hidden thing that it cares about. This means if the value learning AI could trick the person into saying 'friendly' more, this would be no help to it

That's great, but do any of these approaches actually accomplish this? I still have some reading to do, but as best as I can tell, they all rely on some training data. Like a person shouting "Friendly!" and "Unfriendly!" at different things.

The AI will then just do the thing it thinks would make the person most likely to shout "Friendly!". E.g. torturing them unless they say it repeatedly.

Yudkowsky argues against a very similar idea here.

It seems to me that the only advantage of this approach is that it prevents the AI from having any kind of long-term plans. The AI only cares about how much its "next action" will please its creator. It doesn't care about anything that happens 50 steps from now.

Essentially we make the AI really really lazy. Maybe it wants to convert the Earth to paperclips, but it never feels like working on it.

This isn't an entirely bad idea. It would mean we could create an "oracle" AI which just answers questions, based on how likely we are to like the answer. We then have some guarantee that it doesn't care about manipulating the outside world or escaping from its box.

I think the difference is between writing an algorithm that detects the sound of a human saying "Friendly!" (which we can sort-of do today), and writing an algorithm that detects situations where some impartial human observer would tell you that the situation is "Friendly!" if asked about it. (I don't propose that this is the criterion that should be used, but your algorithm needs at least that level of algorithmic sophistication for value learning.) The situation you talk about will always happen with the first sort of algorithm. The second sort of algorithm could work, although lack of training data might lead to it functionally behaving in the same way as the first, or to making a similar class of mistakes.

I don't see a distinction between these things. Shouting "Friendly!" is just the mechanism being used to add to the training data.

No matter what method you use to label the data, there is no way for the machine to distinguish it from ground truth.

E.g. the machine might learn that it should convince you to press the reward button, but it might also learn to steal the button and press it itself.

Both are perfectly valid generalizations to the problem of "predict what actions are the most likely to lead to a positive example in the training set." But only one is what we really intend.

If the AI takes your saying 'friendly' to be a consequence of something being a positive example, then it doesn't think changing your words manually will change whether it is a positive example. If it thinks your actions cause something to be a positive example, then it does think changing your actions will change whether it is a positive example.

Shouting "Friendly!" isn't just correlated with positive examples, it literally causes them. Torturing the supervisor to make them say "Friendly!" is a perfectly valid generalization of the training set. Unless you include negative examples of that, and all the countless other ways it can go wrong.

It causes something to be a training example, but it doesn't cause it to be an instance of the thing the AI is meant to identify. If the AI itself has this model (in which there is something else it cares about, which is often identified by shouting), then we should not get the problem you mention.

In particular, the value learning scheme - where the AI has priors over what is valuable and its observations cause it to update these - should avoid the problem, if I understand correctly.

Imagine a simple reinforcement learner. I press a button and it gets a reward. If the reinforcement learner is smart, it will figure out that pressing the button causes the reward, and try to steal the button and press it (as opposed to indirectly pressing it by pleasing me.)

This is the exact same situation. We've just removed the reward. Instead the AI tries to predict what actions would have given it rewards. However there is no difference between predicted rewards and actual rewards. They should converge to the same function, that's the entire goal of the learning.

So if the AI is as smart as the AI in the first scenario, it will know that stealing the reward button is what it should have done the first time around, and therefore what it will do the second time.

Expecting the AI to magically learn human values and stop there is just absurdly anthropomorphically optimistic.

Do you think the Hail Mary approach could produce much value?

Perhaps. But it is a desperate move, both in terms of predictability and in terms of the likely mind crime that would result in its implementation, since the conceptually easiest and most accurate ways to model other civilizations would involve fully simulating the minds of their members.

If we had to do it, I would be much more interested in aiming it at slightly modified versions of humanity as opposed to utterly alien civilizations. If everyone in our civilization had taken AI safety more seriously, and we could have coordinated to wait a few hundred years to work out the issues before building one, what kind of AI would our civilization have produced? I suspect the major issue with this approach is formalizing "If everyone in our civilization had taken AI safety more seriously" for the purpose of aiming an HM-implementing AI at those possibilities in particular.

What did you find most interesting this week?

The Hail Mary and Christiano's proposals, simply for not having read about them before.

What do you think of Christiano's proposals?

It's interesting, but could be as hard as the problem it tries to solve.

In Bostrom's Hail Mary approach, why is it easier to get an AI to care about another AI's values than about another civilization's values? (p198)

Powerful AIs are probably much more aware of their long-term goals, and better able to formalize them, than a heterogeneous civilization is. Deriving a comprehensive morality for post-humanity is really hard, and indeed CEV is designed to avoid the need for humans to do that. Doing it for an arbitrary alien civilization would likely not be any simpler.

Whereas with powerful AIs, you can just ask them which values they would like implemented and probably get a good answer, as proposed by Bostrom.

How would you like this reading group to be different in future weeks?

No changes that I'd recommend, at all. SPECIAL NOTE: please don't interpret the drop in the number of comments over the last couple of weeks as a drop in interest by forum participants. The issues of these weeks are at the heart of the reason for existence of nearly all the rest of the Bostrom book, and many of the auxiliary papers and references we've seen have ultimately also been context for confronting and brainstorming about the issue now at hand. I myself, just as one example, have a number of actual ideas that I've been working on for two weeks, but I've been trying to write them up in white paper form, because they seem a bit longish. Also, I've talked to a couple of people off site who are busy thinking about this as well and have much to say. Perhaps taking a one-week intermission would give some of us a chance to organize our thoughts more efficiently for posting. There is a lot of untapped incubating that is coming to a head right now in the participants' minds, and we would like a chance to say something about these issues before moving on. ("Still waters run deep," as the cliche goes.) We're at the point of greatest intellectual depth now. I could speak for hours, were I commenting orally and trying to be complete -- as opposed to making a skeleton of a comment that would, without context, raise more requests for clarification than be useful. I'm sure I'm not unique. Moderation is fine, though, be assured.

Was there anything you didn't understand this week that you would like to clear up?

What do you think of Dewey's proposal?


I'm unclear on how the probability distribution over utility functions would be implemented. A complete specification of how to evaluate evidence seems hard to do right. Also, why should we expect we can produce a pool of utility functions that includes an adequate one?

Seems to me Dewey and researchers in this area could benefit from learning the literature on types of imitation in humans and other primates (procedural imitation, intentional imitation, automatic imitation, and goal-sharing).


Evolution had to solve the goal of value learning before, and she always comes up with interesting ways no one would have thought about.


I don't like Dewey's portfolio approach to utility functions. It goes something like this:

Dewey proposes a utility function maximizer that considers all possible utility functions weighted by their probabilities: "[W]e propose uncertainty over utility functions. Instead of providing an agent one utility function up front, we provide an agent with a pool of possible utility functions and a probability distribution P such that each utility function can be assigned probability P(U | yxm) given a particular interaction history [yxm]. An agent can then calculate an expected value over possible utility functions given a particular interaction history." He concludes by saying that although it solves many of the mentioned problems, this method still leaves many open questions. However, it should provide a direction for future work.

That's roughly how I think about my goals, and it's definitely not very good for sustaining long-term positive relationships with people. My outwardly professed goals can change in what appears to be a change of temperament. When the probability that I can secure one equally valuable goal slightly surpasses that of another, the limits of my attention kick in and I overinvest in the new goal relative to the effort it ought to be accorded in line with its utility function. So a more intelligent system should either be able to profess the hierarchy of preferences (values) it has with greater sophistication than I can, or have split attention, unlike me.

This is a knowledge doubt: reading your description of a value learning system, it feels like what it has above and beyond the reinforcement learner is a model not only of the other being, but of its goals.

In Dennett parlance, it has two levels of intentionality: I think that you want that the toy be built.

In psychology parlance, it has somewhat sophisticated theory of mind.

In philosophical terms it distinguishes intensions from extensions.

Are these correct inferences from being a value learner?

Are the kids in this video value learners or reinforcement learners? What about the chimps?

What Dan Dewey describes as an optimal value learner is not what either kids or chimps do: "Replacing the reinforcement learner’s sum of rewards with an expected utility over a pool of possible utility functions, we have an optimality notion for a value learning agent."

When we infer goals from others, we are not expectimaxing over the possible goals the agent could have; we are simply maxing. The kids assume only the goal with the highest likelihood.

That's probably the correct inference, if I understand you. The value learner has priors over what the world is like, and further priors over what is valuable.

The kids and the chimps both already have values, and are trying to learn how to fulfil them.

I don't follow your other points, sorry.

The kids and chimps have different priors. Kids assume the experimenter has reasons for doing the weird, seemingly non-goal-oriented things he does. Humans alone can entertain fictions. This makes us powerful but also more prone to superstitious behavior (in behaviorist terminology).

If you were expectimaxing over what an agent would do (which is what Dewey suggests a value learner does), you'd end up with behaviors that are seldom useful, because some parts of your behavior would further one goal and other parts would further others; you would not commit to all the behaviors that further the one goal you consider most likely to be valuable. Maxing means finding the most likely goal and ignoring all the others; expectimaxing is a mixed hybrid, which fails when the situation is all-or-nothing.
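
To make the distinction concrete, here is a small sketch of the two decision rules, assuming two candidate goals and three actions. The goals, actions and numbers are invented for illustration, and `expected_utility_choice` stands in for what this comment calls expectimaxing.

```python
from typing import Dict, List

# Hypothetical beliefs about which goal the observed person has.
GOAL_PROBS: Dict[str, float] = {"build_toy": 0.6, "tidy_room": 0.4}

# UTILITY[goal][action]: how much an action helps if that goal is the real one.
UTILITY: Dict[str, Dict[str, float]] = {
    "build_toy": {"hand_over_block": 1.0, "put_block_in_box": 0.0,
                  "stack_block_nearby": 0.6},
    "tidy_room": {"hand_over_block": 0.0, "put_block_in_box": 1.0,
                  "stack_block_nearby": 0.7},
}

ACTIONS: List[str] = ["hand_over_block", "put_block_in_box", "stack_block_nearby"]


def max_choice() -> str:
    """'Maxing': commit to the most likely goal and act only for it."""
    likely_goal = max(GOAL_PROBS, key=lambda goal: GOAL_PROBS[goal])
    return max(ACTIONS, key=lambda action: UTILITY[likely_goal][action])


def expected_utility_choice() -> str:
    """Weigh every candidate goal by its probability before choosing."""
    def score(action: str) -> float:
        return sum(p * UTILITY[goal][action] for goal, p in GOAL_PROBS.items())
    return max(ACTIONS, key=score)
```

With these numbers the first rule hands over the block, fully serving the most likely goal, while the second picks the compromise action that partly serves both goals. Whether that compromise is seldom useful, as argued above, depends on how the utilities are set up.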

No doubt this is not my most eloquent thread in history. Sorry, give up on this if you don't understand it.