Recommended Reading for Friendly AI Research



Thanks for posting this. Some things to add to my reading list.

If you consider this "(potentially) useful training for making progress on Friendly AI", then do you expect that a person who has worked through this material will have a good sense of whether they are qualified to actually try to make progress on FAI (or will be evaluable for that by someone with more experience working on FAI)? I want to do as much as I can to contribute to FAI, whether directly (by working on the actual problems) or indirectly (e.g. getting rich and donating a lot to SIAI), whichever I end up being most efficient at. Right now I'm not efficient at much of anything, because of some severe issues with mental energy that I'm only now starting to possibly resolve after several years, but once I am more competent at life in general, I want to at least investigate the possibility that I could be directly useful to FAI research. (I'm not a savant or a mutant supergenius, but I am at least a normal genius.) If, at that point, I can get through all of this math successfully, will that be an indication that I should look further?

My best guess at a productive subgoal for FAI is the development of decision theory along the lines given in the last post, in order to better understand decision-making and the impossible problem in particular (how to define preference given an arbitrary agent's program; what is a notion of preference that is general enough for human preference to be an instance).

About a year ago I was still at the "rusty technical background" stage, and my attempts to think about decision theory were not quite adequate. Studying mathematics helped significantly by allowing me to think more clearly and about more complicated constructions. More recently, the study of mathematical logic allowed me to see the beautiful formalizations of decision theory I'm currently working on.

I can't tell you that studying this truckload of textbooks will get any results, but reading textbooks is something I know how to do, unlike how to make progress on FAI, so unless I find something better, it's what I'll continue doing.

Ambient decision theory, as it currently stands, requires some grasp of logic to think about, but the level of Enderton's book might be adequate. I'm going deeper in the hope of developing more mathematical muscle to allow jumping over wider inferential gaps, even if I don't know in what way. Relying on creative surprises.

Interesting... I had missed Eliezer's draft Timeless Decision Theory before. Has it been discussed already?

It might be helpful for those who have learned a significant amount of mathematics to share their insights about *how* to learn math. Though individuals may have different learning patterns, I expect that some common principles and practices would arise from a focused discussion. This could increase the rate of productive learning.

ETA: I've made a discussion post where we can, um, discuss.

Are there any parts of theoretical computer science that you expect will be useful?

If you think the subject matter of mathematical logic will be useful then it might be worthwhile to take a look at the Curry-Howard correspondence. From the link:

> In other words, the Curry–Howard correspondence is the simple observation that two at-the-time-seemingly-unrelated families of formalisms, the proof systems on one side and the models of computation on the other side, were, on the two examples considered by Curry and Howard, in fact structurally the same kind of objects.
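To make the quoted observation concrete, here is a minimal sketch in Lean 4. It is not taken from the linked article, and the names `modusPonens`, `applyFn`, `swapProof` and `swapPair` are made up for illustration. Each pair shows the same term read once as a proof and once as a program.

```lean
-- Read as logic: from a proof of A → B and a proof of A, obtain a proof of B.
theorem modusPonens (A B : Prop) (f : A → B) (a : A) : B := f a

-- Read as computation: the same term is plain function application.
def applyFn (α β : Type) (f : α → β) (x : α) : β := f x

-- Read as logic: a proof of A ∧ B yields a proof of B ∧ A.
theorem swapProof (A B : Prop) (h : A ∧ B) : B ∧ A := ⟨h.2, h.1⟩

-- Read as computation: swapping the components of a pair.
def swapPair (α β : Type) (p : α × β) : β × α := (p.2, p.1)
```

In each pair the proof term and the program have the same shape; only the reading of the types (propositions versus data types) differs.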

Another book to consider: The Princeton Companion to Mathematics

It addresses fundamental mathematical ideas, theories, and practices from an insightful conceptual perspective. Using it one could build a large connected web of mathematical concepts, which would likely be useful for identifying the mathematical structure in new problems.

Also, IMO the writing quality is *outstanding*. The editor went out of his way to find mathematicians that were also excellent writers (like Terence Tao).

It could be useful to (already) know many things, but another question is how to efficiently get to learn them, starting from what background. Once you are at graduate level, a lot more becomes accessible, so the first step is to get there.

My sequence was meant to suggest a way of reaching that level by self-study, and getting a good grasp of basic tools of logic in the process. It's probably not the best way, but if you suggest improvements, they ought to be improvements in achieving this particular goal, not just things associated with elements of the original plan, such as "more math books". Other goals could be worthwhile too, but it would be better to state the different intention before proceeding.

> My sequence was meant to suggest a way of reaching that level by self-study, and getting a good grasp of basic tools of logic in the process. [...] Other goals could be worthwhile too, but it would be better to state the different intention before proceeding.

Ah, I see. You could make that more clear in the post fairly easily.

It might be helpful to suggest two paths of self-study, targeting those below graduate level and those at it, respectively. I'm not sure if that would be best done in a separate post or not.

> if you suggest improvements, they ought to be improvements in achieving this particular goal

A suggested order might be useful. I'd at least recommend reading about algebra and topology before category theory, so that one builds up fundamental examples of category-theoretic objects.

> A suggested order might be useful.

The books are suggested in order as given.

> I'd at least recommend reading about algebra and topology before category theory, so that one builds up fundamental examples of category-theoretic objects.

I essentially followed this rule. "Conceptual mathematics" is elementary; "Sets for mathematics" deals mainly with set theory from a category-theoretic perspective. The more general treatment of category theory is given in Awodey's book, which comes after Munkres, which presents general and algebraic topology. Mac Lane's algebra comes before "Sets for mathematics" as well.

I am sure these are interesting references for studying pure mathematics but do they contribute significantly to solving AI?

In particular, it is interesting that none of your references mention any existing research on AI. Are there any practical artificial intelligence problems that these mathematical ideas have directly contributed towards solving?

E.g. Vision, control, natural language processing, automated theorem proving?

While there is a lot of focus on specific, mathematically defined problems on LessWrong (usually based on some form of gambling), there seems to be very little discussion of the actual technical problems of GAI or a practical assessment of progress towards solving them. If this site is really devoted to rationality, should we not at least define our problem and measure progress towards its solution? Otherwise we risk being merely a mathematical social club, or worse, a probability-based religion.

The main mystery in FAI, as I currently see it, is how to define its goal. The question of efficient implementation comes after that and depends on it. There is no point in learning how to efficiently solve a problem you don't want solved. Hence the study of decision theory, which in turn benefits from understanding math.

See the "rationality and FAI" section, Eliezer's paper for a quick introduction, also stuff from sequences, for example complexity of value.

Ok, I certainly agree that defining the goal is important, although I think there is a definite need for a balance between investigation of the problem and attempts at its solution (as each feeds into the other), much as academia currently functions. For example, any AI will need a model of human and social behaviour in order to make predictions. Solving how an AI might learn this would represent a huge step towards solving FAI and a huge step in understanding the problem of being friendly. I.e. whatever the solution is, it will involve some configuration of society that maintains and maximises some set of measurable properties of it.

If the system can predict how a person will feel in a given state it can solve for which utopia we will be most enthusiastic about. Eliezer's posts seem to be exploring this problem manually, without really taking a stab at a solution, or proposing a route to reaching one. This can be very entertaining but I'm not sure it's progress.

Unfortunately, if you think about it, "predicting how a person feels" isn't really helpful to anything, and doesn't contribute to the project of FAI at all (see Are wireheads happy? and The Hidden Complexity of Wishes, for example).

The same happens with other obvious ideas that you think up in the first 5 minutes of considering the problem, and which appear to argue that "research into nuts and bolts of AGI" is relevant for FAI. But on further reflection, it always turns out that these arguments don't hold any water.

The problem comes down to the question of understanding what exactly you want FAI to do, not of how you'd manage to write an actual program that does that with reasonable efficiency. The horrible truth is that we don't have the slightest technical understanding of what it is we want.

Here is a more complex variant that I can't see how to dismiss easily.

If you can build a "predict how humans feel in situation X" function, you can do some interesting things. Let's call this function *feel*(X). Now, as well as first-order happiness, you can also predict how they will feel when told about situation X, so feel("told about X").

You might be able to recover something like preference if you can calculate feel("the situation where X is suggested, and the person is told about feel(X) and about all other possible situations") for all possible situations, as long as you can rank the outputs of feel in some way.

Well, as long as the human simulator/predictor can cope with holding all possible situations in mind, and does not return "worn out" for every situation.

Anyway it is an interesting riff off the idea. Anyone see any holes that I am missing?
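For concreteness, the ranking step could be sketched like this in Python. This is only an illustration under loud assumptions: `toy_feel` is a made-up keyword scorer standing in for the (nonexistent) human-feeling predictor, `rank_by_informed_feeling` is a hypothetical name, and the "told about all other possible situations" part is left out to keep the toy small.

```python
from typing import Callable, Iterable


def rank_by_informed_feeling(
    situations: Iterable[str], feel: Callable[[str], float]
) -> list[str]:
    """Rank situations by the predicted feeling when told about each one.

    For every situation X we first compute feel(X), then score the
    description "told about X and about feel(X)" with the same predictor.
    """
    def informed_score(x: str) -> float:
        first_order = feel(x)
        description = f"told about {x!r} and that feel({x!r}) = {first_order}"
        return feel(description)

    # Highest informed score first.
    return sorted(situations, key=informed_score, reverse=True)


def toy_feel(text: str) -> float:
    """Hypothetical stand-in predictor: sums weights of keywords found in text."""
    weights = {"picnic": 1.0, "dentist": -1.0, "told about": 0.5}
    return sum(w for keyword, w in weights.items() if keyword in text)


ranking = rank_by_informed_feeling(
    ["dentist appointment", "picnic with friends"], toy_feel
)
print(ranking)
```

The sketch only shows the mechanics of "rank the outputs of feel"; it says nothing about whether the resulting ranking deserves to be called preference.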

Try to figure out what maximizes this estimate method. It won't be anything you'd want implemented, it will be a wireheading stimulus. Plus, FAI needs to valuate (and work to implement) whole possible worlds, not verbal descriptions. And questions about possible worlds involve quantities of data that a mere human can't handle.

> Try to figure out what maximizes this estimate method. It won't be anything you'd want implemented, it will be a wireheading stimulus.

I'm not sure that there is a verbal description of a possible world that is also a wirehead stimulus for me. There might be, which might be enough to discount this method.

> And questions about possible worlds involve quantities of data that a mere human can't handle.

True.

I'm not sure I understand the distinction between an answer that we would want and a wireheading solution. Are not all solutions wireheading with an elaborate process to satisfy our status concerns? I.e. is there a real difference between a world that satisfies what we want and directly altering what we want? If the wire in question happens to be an elaborate social order rather than a direct connection, why is that different? What possible goal could we want pursued other than the one which we want?

> is there a real difference between a world that satisfies what we want and directly altering what we want?

From an evolutionary point of view, those things that manage to procreate will outcompete those things that change themselves to not care about that and just wirehead.

So in non-singleton situations, alien encounters, and any form of resource competition, it matters whether you wirehead or not. Pleasure, in an evolved creature, can be seen as giving (very poor) information on the map to the territory of future influence for the patterns that make up you.

So, assuming survival is important, a solution that maximises survival plus wireheading would seem to solve that problem. Of course it may well just delay the inevitable heat death ending but if we choose to make that important, then sure, we can optimise for survival as well. I'm not sure that gets around the issue that any solution we produce (with or without optimisation for survival) is merely an elaborate way of satisfying our desires (in this case including the desire to continue to exist) and thus all FAI solutions are a form of wireheading.

When I say feel, I include:

I feel that is correct. I feel that is proved etc.

Regardless of the answer, it will ultimately involve our minds expressing a preference. We cannot escape our psychology. If our minds are deterministic computational machines within a universe without any objective value, all our goals are merely elaborate ways to make us feel content with our choices and a possibly inconsistent set of mental motivations. Attempting to model our psychology seems like the most efficient way to solve this problem. Is the idea that there is some other kind of answer? How could it be shown to be legitimate?

I suspect that the desire for another answer is preventing practical progress in creating any meaningful solution. There are many problems and goals that would be relatively uncontroversial for an AI system to attempt to address. The outcome of the work need only be better than what we currently have to be useful. We don't have to solve all problems before addressing some of them, and indeed, without attempting to address some of them, I doubt we will make significant progress on the rest.

> If our minds are deterministic computational machines within a universe without any objective value, all our goals are merely elaborate ways to make us feel content with our choices and a possibly inconsistent set of mental motivations. Attempting to model our psychology seems like the most efficient way to solve this problem.

Which problem? You need to define which action the AI should choose, in whatever problem it's solving, including the problems that are not humanly comprehensible. This is naturally done in terms of actual humans with all their psychology (as the only available source of sufficiently detailed data about what we want), but it's not at all clear in what way you'd want to use (interpret) that human data.

"Attempting to model psychology" doesn't answer any questions. Assume you have a proof-theoretic oracle and a million functioning uploads living in a virtual world however structured, so that you can run any number of experiments involving them, restart these experiments, infer the properties of whole infinite collections of such experiments and so on. You still won't know how to even approach creating a FAI.

If there is an answer to the problem of creating an FAI, it will result from a number of discussions and ideas that lead a set of people to agreeing that a particular course of action is a good one. By modelling psychology it will be possible to determine all the ways this can be done. The question then is why choose one over any of the others? As soon as one is chosen it will work and everyone will go along with it. How could we rate each one? (they would all be convincing by definition). Is it meaningful to compare them? Is the idea that there is some transcendent answer that is correct or important that doesn't boil down to what is convincing to people?

Understanding the *actual abstract reasons* for agents' decisions (such as decisions about agreeing with a given argument) seems to me a promising idea, and I'm trying to make progress on that (agents' decisions don't need to be correct or well-defined on most inputs for the reasons behind their more well-defined behaviors to lead the way to figuring out what to do in other situations or what should be done where the agents err). Note that if you postulate an algorithm that makes use of humans as its elements, you'd still have the problems of failure modes, regret for bad design decisions, and the capability to answer humanly incomprehensible questions, and these problems need to be already solved before you start the thing up.

Interesting, if I understand correctly the idea is to find a theoretically correct basis for deciding on a course of action given existing knowledge and then to make this calculation efficient and then direct towards a formally defined objective.

As distinct from a system which, potentially suboptimally, attempts solutions and tries to learn improved strategies, i.e. one in which the theoretical basis for decision making is ultimately discovered by the agent over time (e.g. as we have done with the development of probability theory). I think the perspective I'm advocating is to produce a system that is more like an advanced altruistic human (with a lot of evolutionary motivations removed) than a provably correct machine. Ideally such a system could itself propose solutions to the FAI problem that would be convincing, as a result of an increasingly sophisticated understanding of human reasoning and motivations.

I realise there is a fear that such a system could develop convincing yet manipulative solutions. However the output need only be more trustworthy than a human's response to be legitimate (for example based on an analysis of its reasoning algorithm it appears to lack a Machiavellian capability, unlike humans).

Or put another way, can a robot Vladimir (Eliezer etc.) be made that solves the problem faster than their human counterparts do? And is there any reason to think this process is less safe (particularly when AI developments will continue regardless)?

> Interesting, if I understand correctly the idea is to find a theoretically correct basis for deciding on a course of action given existing knowledge and then to make this calculation efficient and then direct towards a formally defined objective.

Yes, but there is only one top-level objective, to do the right thing, so one doesn't need to define an objective separately from the goal system itself (and improving state of knowledge is just another thing one can do to accomplish the goal, so again not a separate issue).

FAI really stands for a method of efficient production of goodness, as we would want it produced, and there are many landmines on this path, in particular humanity in its current form doesn't seem to be able to retain its optimization goal in the long run, and the same applies to most obvious hacks that don't have explicit notions of preference, such as upload societies. It's not just a question of speed, but also of ability to retain the original goal after quadrillions of incompletely understood self-modifications.

Ok, so how about this workaround.

The current approach is to have a number of human intelligences continue to explore this problem until they enter a mental state C (for convinced they have the answer to FAI). The next stage is to implement it.

We have no route to knowledge other than to use our internal sense of being convinced, i.e. no oracle to tell us if we are right or not.

So what if we formally define what this mental state C consists of and then construct a GAI which provably pursues only the objective of creating this state? The advantage is that we would then have a means of judging our progress, because we would have a formally defined, measurable criterion for success. (In fact this process is a valuable goal regardless of the use of AI, but it now makes it possible to use AI techniques to solve it.)

Related by Cousin It: http://lesswrong.com/r/discussion/lw/2sw/math_prerequisites_for_understanding_lw_stuff/

It would be helpful if the books were ordered by dependency (topological sort) and by order of difficulty.
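A dependency ordering of this kind can be sketched with Python's standard `graphlib` module. The prerequisite edges below are only my reading of the ordering described earlier in the thread (algebra and topology before category theory, Mac Lane's algebra before "Sets for mathematics"), the short titles are just labels, and `reading_order` is a hypothetical helper name. Treat it as an illustration, not an authoritative dependency graph for the list.

```python
from graphlib import TopologicalSorter

# Each book maps to the set of books assumed to come before it.
# These edges are an assumption based on the discussion above.
PREREQUISITES = {
    "Conceptual mathematics": set(),
    "Mac Lane's algebra": set(),
    "Munkres": set(),
    "Sets for mathematics": {"Mac Lane's algebra"},
    "Awodey": {"Mac Lane's algebra", "Munkres"},
}


def reading_order(prerequisites: dict[str, set[str]]) -> list[str]:
    """Return one order in which every book comes after its prerequisites."""
    # static_order() raises graphlib.CycleError if the graph has a cycle.
    return list(TopologicalSorter(prerequisites).static_order())


order = reading_order(PREREQUISITES)
print(order)
```

A topological sort only guarantees that prerequisites come first; ordering by difficulty among otherwise independent books would need a separate tie-breaking rule.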

Don't forget nickbostrom.com and Moral Machines: Teaching Robots Right from Wrong.

This post enumerates texts that I consider (potentially) useful training for making progress on Friendly AI/decision theory/metaethics.

## Rationality and Friendly AI

Eliezer Yudkowsky's sequences and this blog can provide a solid introduction to the problem statement of Friendly AI, giving concepts useful for understanding the motivation for the problem, and disarming the endless failure modes that people often fall into when trying to consider the problem.

For a shorter introduction, see

## Decision theory

The following book introduces an approach to decision theory that seems to be closer to what's needed for FAI than the traditional treatments in philosophy or game theory:

Another (more technical) treatment of decision theory from the same cluster of ideas:

The following posts on Less Wrong present ideas relevant to this development of decision theory:

- Towards a New Decision Theory
- Ingredients of Timeless Decision Theory
- AI cooperation in practice
- Notion of Preference in Ambient Control

## Mathematics

The most relevant tool for thinking about FAI seems to be mathematics, since it teaches one to work with precise ideas (mathematical logic in particular). Starting from a rusty technical background, the following reading list is one way to start:

[Edit Nov 2011: I no longer endorse scope/emphasis, gaps between entries, and some specific entries on this list.]

- S. Awodey (2006). Category Theory. Oxford Logic Guides. Oxford University Press, USA.
- P. G. Hinman (2005). Fundamentals of Mathematical Logic. A K Peters Ltd.