This is part of a weekly reading group on Nick Bostrom's book, Superintelligence. For more information about the group, and an index of posts so far, see the announcement post. For the schedule of future topics, see MIRI's reading guide.

Welcome. This week we discuss the twenty-fifth section in the reading guide: Components list for acquiring values.

This post summarizes the section, and offers a few relevant notes, and ideas for further investigation. Some of my own thoughts and questions for discussion are in the comments.

There is no need to proceed in order through this post, or to look at everything. Feel free to jump straight to the discussion. Where applicable, and where I remember, page numbers indicate the rough part of the chapter that is most related (not necessarily that the chapter is being cited for the specific claim).

Reading: “Component list” and “Getting close enough” from Chapter 13


  1. Potentially important choices to make before building an AI (p222)
    • What goals does it have?
    • What decision theory does it use?
    • How do its beliefs evolve? In particular, what priors and anthropic principles does it use? (epistemology)
    • Will its plans be subject to human review? (ratification)
  2. Incentive wrapping: beyond the main pro-social goals given to an AI, add some extra value for those who helped bring about the AI, as an incentive (p222-3)
  3. Perhaps we should indirectly specify decision theory and epistemology, like we have suggested doing with goals, rather than trying to resolve these issues now. (p224-5)
  4. An AI with a poor epistemology may still be very instrumentally smart, but for instance be incapable of believing the universe could be infinite (p225)
  5. We should probably attend to avoiding catastrophe rather than maximizing value (p227) [i.e. this use of our attention is value-maximizing]
  6. If an AI has roughly the right values, decision theory, and epistemology maybe it will correct itself anyway and do what we want in the long run (p227)

Another view

Paul Christiano argues (today) that decision theory doesn't need to be sorted out before creating human-level AI. Here's a key bit, but you might need to look at the rest of the post to understand his idea well:

Really, I’d like to leave these questions up to an AI. That is, whatever work I would do in order to answer these questions, an AI should be able to do just as well or better. And it should behave sensibly in the interim, just like I would.

To this end, consider the definition of a map U' : [Possible actions] → ℝ:

U'(a) = “How good I would judge the action to be, after an idealized process of reflection.”

Now we’d just like to build an “agent” that takes the action a maximizing 𝔼[U'(a)]. Rather than defining our decision theory or our beliefs, we will have to come up with some answer during the “idealized process of reflection.” And as long as an AI is uncertain about what we’d come up with, it will behave sensibly in light of its uncertainty.

This feels like a cheat. But I think the feeling is an illusion. 
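
To make the quoted proposal a little more concrete, here is a minimal sketch of an agent that picks the action maximizing 𝔼[U'(a)]. Everything in it is an illustrative assumption rather than part of Paul's post: the names `Verdict`, `expected_judgment` and `choose_action` are hypothetical, the two actions and their scores are made up, and the outcome of "idealized reflection" is crudely modeled as one of two possible verdicts with assumed probabilities.

```python
from typing import Dict, List, Tuple

# Hypothetical setup: each "verdict" is one answer the idealized reflection
# might reach, written as a score for every available action.
Verdict = Dict[str, float]


def expected_judgment(action: str, beliefs: List[Tuple[float, Verdict]]) -> float:
    """Estimate E[U'(action)] as a probability-weighted average over verdicts."""
    return sum(prob * verdict[action] for prob, verdict in beliefs)


def choose_action(actions: List[str], beliefs: List[Tuple[float, Verdict]]) -> str:
    """Pick the action with the highest expected judgment."""
    return max(actions, key=lambda a: expected_judgment(a, beliefs))


# Assumed numbers, chosen only for illustration.
ACTIONS = ["bold", "cautious"]
BELIEFS = [
    (0.6, {"bold": 10.0, "cautious": 6.0}),   # reflection would endorse boldness
    (0.4, {"bold": -20.0, "cautious": 5.0}),  # reflection would find it disastrous
]

for action in ACTIONS:
    print(action, expected_judgment(action, BELIEFS))
print("chosen:", choose_action(ACTIONS, BELIEFS))
```

In this toy case the bold action is best under the more likely verdict, yet the agent chooses the cautious one, because the less likely verdict rates boldness so badly. That is one reading of behaving "sensibly in light of its uncertainty."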



1. MIRI's Research, and decision theory

MIRI focuses on technical problems that they believe can't be delegated well to an AI. Thus MIRI's technical research agenda describes many such problems and questions. In it, Nate Soares and Benja Fallenstein also discuss the question of why these can't be delegated:

Why can’t these tasks, too, be delegated? Why not, e.g., design a system that makes “good enough” decisions, constrain it to domains where its decisions are trusted, and then let it develop a better decision theory, perhaps using an indirect normativity approach (chap. 13) to figure out how humans would have wanted it to make decisions?

We cannot delegate these tasks because modern knowledge is not sufficient even for an indirect approach. Even if fully satisfactory theories of logical uncertainty and decision theory cannot be obtained, it is still necessary to have a sufficient theoretical grasp on the obstacles in order to justify high confidence in the system’s ability to correctly perform indirect normativity.

Furthermore, it would be risky to delegate a crucial task before attaining a solid theoretical understanding of exactly what task is being delegated. It is possible to create an intelligent system tasked with developing better and better approximations of Bayesian updating, but it would be difficult to delegate the abstract task of “find good ways to update probabilities” to an intelligent system before gaining an understanding of Bayesian reasoning. The theoretical understanding is necessary to ensure that the right questions are being asked.

If you want to learn more about the subjects of MIRI's research (which overlap substantially with the topics of the 'components list'), Nate Soares recently published a research guide. For instance here's some of it on the (pertinent this week) topic of decision theory:

Existing methods of counterfactual reasoning turn out to be unsatisfactory both in the short term (in the sense that they systematically achieve poor outcomes on some problems where good outcomes are possible) and in the long term (in the sense that self-modifying agents reasoning using bad counterfactuals would, according to those broken counterfactuals, decide that they should not fix all of their flaws). My talk “Why ain’t you rich?” briefly touches upon both these points. To learn more, I suggest the following resources:

  1. Soares & Fallenstein’s “Toward idealized decision theory” serves as a general overview, and further motivates problems of decision theory as relevant to MIRI’s research program. The paper discusses the shortcomings of two modern decision theories, and discusses a few new insights in decision theory that point toward new methods for performing counterfactual reasoning.

If “Toward idealized decision theory” moves too quickly, this series of blog posts may be a better place to start:

  1. Yudkowsky’s “The true Prisoner’s Dilemma” explains why cooperation isn’t automatically the ‘right’ or ‘good’ option.

  2. Soares’ “Causal decision theory is unsatisfactory” uses the Prisoner’s Dilemma to illustrate the importance of non-causal connections between decision algorithms.

  3. Yudkowsky’s “Newcomb’s problem and regret of rationality” argues for focusing on decision theories that ‘win,’ not just on ones that seem intuitively reasonable. Soares’ “Introduction to Newcomblike problems” covers similar ground.

  4. Soares’ “Newcomblike problems are the norm” notes that human agents probabilistically model one another’s decision criteria on a routine basis.

MIRI’s research has led to the development of “Updateless Decision Theory” (UDT), a new decision theory which addresses many of the shortcomings discussed above.

  1. Hintze’s “Problem class dominance in predictive dilemmas” summarizes UDT’s dominance over other known decision theories, including Timeless Decision Theory (TDT), another theory that dominates CDT and EDT.

  2. Fallenstein’s “A model of UDT with a concrete prior over logical statements” provides a probabilistic formalization.

However, UDT is by no means a solution, and has a number of shortcomings of its own, discussed in the following places:

  1. Slepnev’s “An example of self-fulfilling spurious proofs in UDT” explains how UDT can achieve sub-optimal results due to spurious proofs.

  2. Benson-Tilsen’s “UDT with known search order” is a somewhat unsatisfactory solution. It contains a formalization of UDT with known proof-search order and demonstrates the necessity of using a technique known as “playing chicken with the universe” in order to avoid spurious proofs.

For more on decision theory, here is Luke Muehlhauser and Crazy88's FAQ.
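
As a rough illustration of the claim that existing counterfactual reasoning can systematically achieve poor outcomes, here is a sketch of the standard Newcomb's problem. The payoffs are the usual textbook ones ($1,000,000 in the opaque box if one-boxing was predicted, $1,000 in the transparent box). The 0.99 predictor accuracy and the function names are assumptions made for this example, and the `cdt_choice` function is a simplified stand-in for causal decision theory, not a formalization taken from any of the papers listed above.

```python
MILLION = 1_000_000
THOUSAND = 1_000


def payoff(one_box: bool, box_full: bool) -> int:
    """Money received given the choice and the opaque box's contents."""
    opaque = MILLION if box_full else 0
    return opaque if one_box else opaque + THOUSAND


def cdt_choice(prob_full: float) -> str:
    """Simplified CDT: the box contents are treated as fixed, so the same
    probability of a full box is used whichever action is evaluated."""
    ev_one = prob_full * payoff(True, True) + (1 - prob_full) * payoff(True, False)
    ev_two = prob_full * payoff(False, True) + (1 - prob_full) * payoff(False, False)
    return "one-box" if ev_one > ev_two else "two-box"


def average_winnings(one_box: bool, accuracy: float) -> float:
    """Expected payoff when the predictor is right with probability `accuracy`."""
    prob_full = accuracy if one_box else 1 - accuracy
    return prob_full * payoff(one_box, True) + (1 - prob_full) * payoff(one_box, False)


ACCURACY = 0.99  # assumed predictor accuracy

print("CDT chooses:", cdt_choice(0.5))
print("one-boxer average:", average_winnings(True, ACCURACY))
print("two-boxer average:", average_winnings(False, ACCURACY))
```

Under these assumptions the simplified CDT agent two-boxes whatever probability it assigns to the box being full, while an agent that one-boxes averages about $990,000 against about $11,000 for a two-boxer. This is the sense in which a decision theory can fail to "win."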

2. Can stable self-improvement be delegated to an AI?

Paul Christiano also argues for 'yes' here:

“Stable self-improvement” seems to be a primary focus of MIRI’s work. As I understand it, the problem is “How do we build an agent which rationally pursues some goal, is willing to modify itself, and with very high probability continues to pursue the same goal after modification?”

The key difficulty is that it is impossible for an agent to formally “trust” its own reasoning, i.e. to believe that “anything that I believe is true.” Indeed, even the natural concept of “truth” is logically problematic. But without such a notion of trust, why should an agent even believe that its own continued existence is valuable?

I agree that there are open philosophical questions concerning reasoning under logical uncertainty, and that reflective reasoning highlights some of the difficulties. But I am not yet convinced that stable self-improvement is an especially important problem for AI safety; I think it would be handled correctly by a human-level reasoner as a special case of decision-making under logical uncertainty. This suggests that (1) it will probably be resolved en route to human-level AI, (2) it can probably be “safely” delegated to a human-level AI. I would prefer use energy investigating other aspects of the AI safety problem... (more)


3. On the virtues of human review

Bostrom mentions the possibility of having an 'oracle' or some such non-interfering AI tell you what your 'sovereign' will do. He suggests some benefits and costs of this—namely, it might prevent existential catastrophe, and it might reveal facts about the intended future that would make sponsors less happy to defer to the AI's mandate (coherent extrapolated volition or some such thing). Four quick thoughts:

1) The costs and benefits here seem wildly out of line with each other. In a situation where you think there's a substantial chance your superintelligent AI will destroy the world, you are not going to set aside what you think is an effective way of checking, because it might cause the people sponsoring the project to realize that it isn't exactly what they want, and demand some more pie for themselves. Deceiving sponsors into doing what you want instead of what they would want if they knew more seems much, much, much much less important than avoiding existential catastrophe.

2) If you were concerned about revealing information about the plan because it would lift a veil of ignorance, you might artificially replace some of the veil with intentional randomness.

3) It seems to me that a bigger concern with humans reviewing AI decisions is that it will be infeasible, at least if the risk from an AI is that it doesn't correctly manifest the values we want. Bostrom describes an oracle with many tools for helping to explain, so it seems plausible such an AI could give you a good taste of things to come. However, if the problem is that your values are so nuanced that you haven't managed to impart them adequately to an AI, then it seems unlikely that an AI can highlight for you the bits of the future that you are likely to disapprove of. Or at least you have to be in a fairly narrow part of the space of AI capability, where the AI doesn't know some details of your values, but for all the important details it is missing, can point to relevant parts of the world where the mismatch will manifest.

4) Human oversight only seems feasible in a world where there is much human labor available per AI. In a world where a single AI is briefly overseen by a programming team before taking over the world, human oversight might be a reasonable tool for that brief time. Substantial human oversight does not seem helpful in a world where trillions of AI agents are each smarter and faster than a human, and need some kind of ongoing control.

4. Avoiding catastrophe as the top priority

In case you haven't read it, Bostrom's Astronomical Waste is a seminal discussion of the topic.


In-depth investigations

If you are particularly interested in these topics, and want to do further research, these are a few plausible directions, some inspired by Luke Muehlhauser's list, which contains many suggestions related to parts of Superintelligence. These projects could be attempted at various levels of depth.

  1. See MIRI's research agenda
  2. For any plausible entry on the list of things that can't be well delegated to AI, think more about whether it belongs there, or how to delegate it.

If you are interested in anything like this, you might want to mention it in the comments, and see whether other people have useful thoughts.

How to proceed

This has been a collection of notes on the chapter. The most important part of the reading group, though, is discussion, which is in the comments section. I pose some questions for you there, and I invite you to add your own. Please remember that this group contains a variety of levels of expertise: if a line of discussion seems too basic or too incomprehensible, look around for one that suits you better!

Next week, we will talk about strategy in directing science and technology. To prepare, read “Science and technology strategy” from Chapter 14. The discussion will go live at 6pm Pacific time next Monday 9 March. Sign up to be notified here.


Do you think human oversight as described is likely to help with controlling AI? (p226)

First, humanity's cosmic endowment is astronomically large—there is plenty to go around even if our process involves some waste or accepts some unnecessary constraints. (p227)

This is saying that our values are diminishing enough in stuff that much of the universe doesn't matter to us. This seems true under some plausible values and not under others. In particular, if we pursue some kind of proportional aggregative consequentialism, then if each individual has diminishing returns, we should create more individuals so that there is not so much to go around.

This issue is complicated by the fact that we don't really know how much computation our physics will give us access to, or how relevant negentropy is going to be in the long run. In particular, our physics may allow access to (countably or more) infinite computational and storage resources given some superintelligent physics research.

For Expected Utility calculations, this possibility raises the usual issues of evaluating potential infinite utilities. Regardless of how exactly one decides to deal with those issues, the existence of this possibility does shift things in favor of prioritizing for safety over speed.

Infinity Shades for the win!

Seriously though. I'm highly in favor of infinity shades. This whole "let's burn the universe searching for the Omega point or perpetual machines" makes me unhappy.

Did you change your mind about anything as a result of this week's reading?

What did you dislike about this section?

This is not a criticism of your presentation, but rather the presuppositions of the debate itself. As someone who thinks that at the root of ethics are moral sentiments, I have a hard time picturing an intelligent being doing moral reasoning without feeling such sentiments. I suspect that researchers do not want to go out of their way to give AIs affective mental states, much less anything like the full range of human moral emotions, like anger, indignation, empathy, outrage, shame and disgust. The idea seems to be if the AI is programmed with certain preference values for ranges of outcomes, that's all the ethics it needs.

If that's the way it goes then I'd prefer that the AI not be able to deliberate about values at all, though that might be hard to avoid if it's superintelligent. What makes humans somewhat ethically predictable and mostly not monstrous is that our ethical decisions are grounded in a human moral psychology, which has its own reward system. Without the grounding, I worry that an AI left to its own devices could go off the rails in ways that humans find hard to imagine. Yes, many of our human moral emotions actually make it more difficult to do the right thing. If I were re-designing people, or designing AIs, I'd redo the weights of human moral emotions to strengthen sympathy, philanthropy and an urge for fairness. I'd basically be aiming to make an artificial, superintelligent Hume. An AI that I can trust with moral reasoning would have to have a good character - which cannot happen without the right mixture of moral emotions.

What did you like about this section?

Which of the design choices here do you think are important to choose well prior to building AI?

I did not understand what Paul means by:

I think it would be handled correctly by a human-level reasoner as a special case of decision-making under logical uncertainty.

I'd also be keen on him specifying more precisely what is meant by the below and why he thinks it to be key:

The key difficulty is that it is impossible for an agent to formally “trust” its own reasoning, i.e. to believe that “anything that I believe is true.”

Your fourth point about human review, that it is feasible only while humans outnumber and outpower the AI, mentions that once there are trillions of human+ AIs around, it gets hard to oversee them. This seems true not only for humans but also for the superintelligence herself. As has been pointed out, the Singleton would have to be really amazingly powerful if it can control (as it can, by definition) trillions of AIs. Better to just keep numbers low, or not?

How do its beliefs evolve? In particular, what priors and anthropic principles does it use? (epistemology)

I just want to note that there is an underlying assumption there, namely that pragmatics (how to act in the world?) and morals (what should be done?) require getting our map of the world, and our tools for this map (epistemology), right.

This is a common assumption on LessWrong because of the map-territory jargon. But keep in mind that, as seen from the inside, an algorithm can be extremely different from the world and still function perfectly well in it.

Put in a different manner: There are many surviving and thriving biological entities out there who have very epistemically poor models of the world, and still get away with it.

There are also some very good humans throughout history whose confused or wrong model of how the world works caused them to act in a moral way.