This is part of a weekly reading group on Nick Bostrom's book, Superintelligence. For more information about the group, and an index of posts so far see the announcement post. For the schedule of future topics, see MIRI's reading guide.
Welcome. This week we discuss the twenty-fifth section in the reading guide: Components list for acquiring values.
This post summarizes the section, and offers a few relevant notes, and ideas for further investigation. Some of my own thoughts and questions for discussion are in the comments.
There is no need to proceed in order through this post, or to look at everything. Feel free to jump straight to the discussion. Where applicable and I remember, page numbers indicate the rough part of the chapter that is most related (not necessarily that the chapter is being cited for the specific claim).
Reading: “Component list” and “Getting close enough” from Chapter 13
Paul Christiano argues (today) that decision theory doesn't need to be sorted out before creating human-level AI. Here's a key bit, but you might need to look at the rest of the post to understand his idea well:
Really, I’d like to leave these questions up to an AI. That is, whatever work Iwould do in order to answer these questions, an AI should be able to do just as well or better. And it should behave sensibly in the interim, just like I would.
To this end, consider the definition of a map U' : [Possible actions] → ℝ:U'(a) = “How good I would judge the action a to be, after an idealized process of reflection.”
Now we’d just like to build an “agent” that takes the action a maximizing 𝔼[U'(a)]. Rather than defining our decision theory or our beliefs, we will have to come up with some answer during the “idealized process of reflection.” And as long as an AI is uncertain about what we’d come up with, it will behave sensibly in light of its uncertainty.
This feels like a cheat. But I think the feeling is an illusion.
1. MIRI's Research, and decision theory
MIRI focuses on technical problems that they believe can't be delegated well to an AI. Thus MIRI's technical research agenda describes many such problems and questions. In it, Nate Soares and Benja Fallenstein also discuss the question of why these can't be delegated:
Why can’t these tasks, too, be delegated? Why not, e.g., design a system that makes “good enough” decisions, constrain it to domains where its decisions are trusted, and then let it develop a better decision theory, perhaps using an indirect normativity approach (chap. 13) to figure out how humans would have wanted it to make decisions?
We cannot delegate these tasks because modern knowledge is not sufficient even for an indirect approach. Even if fully satisfactory theories of logical uncertainty and decision theory cannot be obtained, it is still necessary to have a sufficient theoretical grasp on the obstacles in order to justify high confidence in the system’s ability to correctly perform indirect normativity.
Furthermore, it would be risky to delegate a crucial task before attaining a solid theoretical understanding of exactly what task is being delegated. It is possible to create an intelligent system tasked with developing better and better approximations of Bayesian updating, but it would be difficult to delegate the abstract task of “find good ways to update probabilities” to an intelligent system before gaining an understanding of Bayesian reasoning. The theoretical understanding is necessary to ensure that the right questions are being asked.
If you want to learn more about the subjects of MIRI's research (which overlap substantially with the topics of the 'components list'), Nate Soares recently published a research guide. For instance here's some of it on the (pertinent this week) topic of decision theory:
Existing methods of counterfactual reasoning turn out to be unsatisfactory both in the short term (in the sense that they systematically achieve poor outcomes on some problems where good outcomes are possible) and in the long term (in the sense that self-modifying agents reasoning using bad counterfactuals would, according to those broken counterfactuals, decide that they should not fix all of their flaws). My talk “Why ain’t you rich?” briefly touches upon both these points. To learn more, I suggest the following resources:
Soares & Fallenstein’s “Toward idealized decision theory” serves as a general overview, and further motivates problems of decision theory as relevant to MIRI’s research program. The paper discusses the shortcomings of two modern decision theories, and discusses a few new insights in decision theory that point toward new methods for performing counterfactual reasoning.
If “Toward idealized decision theory” moves too quickly, this series of blog posts may be a better place to start:
Yudkowsky’s “The true Prisoner’s Dilemma” explains why cooperation isn’t automatically the ‘right’ or ‘good’ option.
Soares’ “Causal decision theory is unsatisfactory” uses the Prisoner’s Dilemma to illustrate the importance of non-causal connections between decision algorithms.
Yudkowsky’s “Newcomb’s problem and regret of rationality” argues for focusing on decision theories that ‘win,’ not just on ones that seem intuitively reasonable. Soares’ “Introduction to Newcomblike problems” covers similar ground.
Soares’ “Newcomblike problems are the norm” notes that human agents probabilistically model one another’s decision criteria on a routine basis.
MIRI’s research has led to the development of “Updateless Decision Theory” (UDT), a new decision theory which addresses many of the shortcomings discussed above.
Hintze’s “Problem class dominance in predictive dilemmas” summarizes UDT’s dominance over other known decision theories, including Timeless Decision Theory (TDT), another theory that dominates CDT and EDT.
Fallenstein’s “A model of UDT with a concrete prior over logical statements” provides a probabilistic formalization.
However, UDT is by no means a solution, and has a number of shortcomings of its own, discussed in the following places:
Slepnev’s “An example of self-fulfilling spurious proofs in UDT” explains how UDT can achieve sub-optimal results due to spurious proofs.
Benson-Tilsen’s “UDT with known search order” is a somewhat unsatisfactory solution. It contains a formalization of UDT with known proof-search order and demonstrates the necessity of using a technique known as “playing chicken with the universe” in order to avoid spurious proofs.
For more on decision theory, here is Luke Muehlhauser and Crazy88's FAQ.
2. Can stable self-improvement be delegated to an AI?
Paul Christiano also argues for 'yes' here:
“Stable self-improvement” seems to be a primary focus of MIRI’s work. As I understand it, the problem is “How do we build an agent which rationally pursues some goal, is willing to modify itself, and with very high probability continues to pursue the same goal after modification?”
The key difficulty is that it is impossible for an agent to formally “trust” its own reasoning, i.e. to believe that “anything that I believe is true.” Indeed, even the natural concept of “truth” is logically problematic. But without such a notion of trust, why should an agent even believe that its own continued existence is valuable?
I agree that there are open philosophical questions concerning reasoning under logical uncertainty, and that reflective reasoning highlights some of the difficulties. But I am not yet convinced that stable self-improvement is an especially important problem for AI safety; I think it would be handled correctly by a human-level reasoner as a special case of decision-making under logical uncertainty. This suggests that (1) it will probably be resolved en route to human-level AI, (2) it can probably be “safely” delegated to a human-level AI. I would prefer use energy investigating other aspects of the AI safety problem... (more)
3. On the virtues of human review
Bostrom mentions the possibility of having an 'oracle' or some such non-interfering AI tell you what your 'sovereign' will do. He suggests some benefits and costs of this—namely, it might prevent existential catastrophe, and it might reveal facts about the intended future that would make sponsors less happy to defer to the AI's mandate (coherent extrapolated volition or some such thing). Four quick thoughts:
1) The costs and benefits here seem wildly out of line with each other. In a situation where you think there's a substantial chance your superintelligent AI will destroy the world, you are not going to set aside what you think is an effective way of checking, because it might cause the people sponsoring the project to realize that it isn't exactly what they want, and demand some more pie for themselves. Deceiving sponsors into doing what you want instead of what they would want if they knew more seems much, much, much much less important than avoiding existential catastrophe.
2) If you were concerned about revealing information about the plan because it would lift a veil of ignorance, you might artificially replace some of the veil with intentional randomness.
3) It seems to me that a bigger concern with humans reviewing AI decisions is that it will be infeasible. At least if the risk from an AI is that it doesn't correctly manifest the values we want. Bostrom describes an oracle with many tools for helping to explain, so it seems plausible such an AI could give you a good taste of things to come. However if the problem is that your values are so nuanced that you haven't managed to impart them adequately to an AI, then it seems unlikely that an AI can highlight for you the bits of the future that you are likely to disapprove of. Or at least you have to be in a fairly narrow part of the space of AI capability, where the AI doesn't know some details of your values, but for all the important details it is missing, can point to relevant parts of the world where the mismatch will manifest.
4) Human oversight only seems feasible in a world where there is much human labor available per AI. In a world where a single AI is briefly overseen by a programming team before taking over the world, human oversight might be a reasonable tool for that brief time. Substantial human oversight does not seem helpful in a world where trillions of AI agents are each smarter and faster than a human, and need some kind of ongoing control.
4. Avoiding catastrophe as the top priority
In case you haven't read it, Bostrom's Astronomical Waste is a seminal discussion of the topic.
If you are particularly interested in these topics, and want to do further research, these are a few plausible directions, some inspired by Luke Muehlhauser's list, which contains many suggestions related to parts of Superintelligence. These projects could be attempted at various levels of depth.
- See MIRI's research agenda
- For any plausible entry on the list of things that can't be well delegated to AI, think more about whether it belongs there, or how to delegate it.
How to proceed
This has been a collection of notes on the chapter. The most important part of the reading group though is discussion, which is in the comments section. I pose some questions for you there, and I invite you to add your own. Please remember that this group contains a variety of levels of expertise: if a line of discussion seems too basic or too incomprehensible, look around for one that suits you better!
Next week, we will talk about strategy in directing science and technology. To prepare, read “Science and technology strategy” from Chapter 14. The discussion will go live at 6pm Pacific time next Monday 9 March. Sign up to be notified here.