This is part of a weekly reading group on Nick Bostrom's book, Superintelligence. For more information about the group, and an index of posts so far see the announcement post. For the schedule of future topics, see MIRI's reading guide.

Welcome. This week we discuss the tenth section in the reading guide: Instrumentally convergent goals. This corresponds to the second part of Chapter 7.

This post summarizes the section, and offers a few relevant notes, and ideas for further investigation. Some of my own thoughts and questions for discussion are in the comments.

There is no need to proceed in order through this post, or to look at everything. Feel free to jump straight to the discussion. And if you are behind on the book, don't let it put you off discussing. Where applicable and I remember, page numbers indicate the rough part of the chapter that is most related (not necessarily that the chapter is being cited for the specific claim).

Reading

Instrumental convergence from Chapter 7 (p109-114)


  1. The instrumental convergence thesis: we can identify 'convergent instrumental values' (henceforth CIVs). That is, subgoals that are useful for a wide range of more fundamental goals, and in a wide range of situations. (p109)
  2. Even if we know nothing about an agent's goals, CIVs let us predict some of the agent's behavior (p109)
  3. Some CIVs:
    1. Self-preservation: because you are an excellent person to ensure your own goals are pursued in future.
    2. Goal-content integrity (i.e. not changing your own goals): because if you don't have your goals any more, you can't pursue them.
    3. Cognitive enhancement: because making better decisions helps with any goals.
    4. Technological perfection: because technology lets you have more useful resources.
    5. Resource acquisition: because a broad range of resources can support a broad range of goals.
  4. For each CIV, there are plausible combinations of final goals and scenarios under which an agent would not pursue that CIV. (p109-114)


1. Why do we care about CIVs?
CIVs to acquire resources and to preserve oneself and one's values play important roles in the argument for AI risk. The desired conclusions are that we can already predict that an AI would compete strongly with humans for resources, and also that an AI once turned on will go to great lengths to stay on and intact.

2. Related work
Steve Omohundro wrote the seminal paper on this topic. The LessWrong wiki links to all of the related papers I know of. Omohundro's list of CIVs (or as he calls them, 'basic AI drives') is a bit different from Bostrom's:

  1. Self-improvement
  2. Rationality
  3. Preservation of utility functions
  4. Avoiding counterfeit utility
  5. Self-protection
  6. Acquisition and efficient use of resources

3. Convergence for values and situations
It seems potentially helpful to distinguish convergence over situations and convergence over values. That is, to think of instrumental goals on two axes - one of how universally agents with different values would want the thing, and one of how large a range of situations it is useful in. A warehouse full of corn is useful for almost any goals, but only in the narrow range of situations where you are a corn-eating organism who fears an apocalypse (or you can trade it). A world of resources converted into computing hardware is extremely valuable in a wide range of scenarios, but much more so if you don't especially value preserving the natural environment. Many things that are CIVs for humans don't make it onto Bostrom's list, I presume because he expects the scenario for AI to be different enough. For instance, procuring social status is useful for all kinds of human goals. For an AI in the situation of a human, it would appear to also be useful. For an AI more powerful than the rest of the world combined, social status is less helpful.

4. What sort of things are CIVs?
Arguably all CIVs mentioned above could be clustered under 'cause your goals to control more resources'. This implies causing more agents to have your values (e.g. protecting your values in yourself), causing those agents to have resources (e.g. getting resources and transforming them into better resources) and getting the agents to control the resources effectively as well as nominally (e.g. cognitive enhancement, rationality). It also suggests convergent values we haven't mentioned. To cause more agents to have one's values, one might create or protect other agents with your values, or spread your values to existing other agents. To improve the resources held by those with one's values, a very convergent goal in human society is to trade. This leads to a convergent goal of creating or acquiring resources which are highly valued by others, even if not by you. Money and social influence are particularly widely redeemable 'resources'. Trade also causes others to act like they have your values when they don't, which is a way of spreading one's values. 

As I mentioned above, my guess is that these are left out of Superintelligence because they involve social interactions. I think Bostrom expects a powerful singleton, to whom other agents will be irrelevant. If you are not confident of the singleton scenario, these CIVs might be more interesting.

5. Another discussion
John Danaher discusses this section of Superintelligence, but not disagreeably enough to read as 'another view'. 

Another view

I don't know of any strong criticism of the instrumental convergence thesis, so I will play devil's advocate.

The concept of a sub-goal that is useful for many final goals is unobjectionable. However the instrumental convergence thesis claims more than this, and this stronger claim is important for the desired argument for AI doom. The further claims are also on less solid ground, as we shall see.

According to the instrumental convergence thesis, convergent instrumental goals not only exist, but can at least sometimes be identified by us. This is needed for arguing that we can foresee that AI will prioritize grabbing resources, and that it will be very hard to control. That we can identify convergent instrumental goals may seem clear - after all, we just did: self-preservation, intelligence enhancement and the like. However to say anything interesting, our claim must not only be that these values are better than not, but that they will be prioritized by the kinds of AI that will exist, in a substantial range of circumstances that will arise. This is far from clear, for several reasons.

Firstly, to know what the AI would prioritize we need to know something about its alternatives, and we can be much less confident that we have thought of all of the alternative instrumental values an AI might have. For instance, in the abstract intelligence enhancement may seem convergently valuable, but in practice adult humans devote little effort to it. This is because investments in intelligence are rarely competitive with other endeavors.

Secondly, we haven't said anything quantitative about how general or strong our proposed convergent instrumental values are likely to be, or how we are weighting the space of possible AI values. Without even any guesses, it is hard to know what to make of resulting predictions. The qualitativeness of the discussion also raises the concern that thinking on the problem has not been very concrete, and so may not be engaged with what is likely in practice.

Thirdly, we have arrived at these convergent instrumental goals by theoretical arguments about what we think of as default rational agents and 'normal' circumstances. These may be very different distributions of agents and scenarios from those produced by our engineering efforts. For instance, perhaps almost all conceivable sets of values - in whatever sense - would favor accruing resources ruthlessly. It would still not be that surprising if an agent somehow created noisily from human values cared about only acquiring resources by certain means or had blanket ill-feelings about greed.

In sum, it is unclear that we can identify important convergent instrumental values, and consequently unclear that such considerations can strongly help predict the behavior of real future AI agents.

In-depth investigations

If you are particularly interested in these topics, and want to do further research, these are a few plausible directions, some inspired by Luke Muehlhauser's list, which contains many suggestions related to parts of Superintelligence. These projects could be attempted at various levels of depth.


  1. Do approximately all final goals make an optimizer want to expand beyond the cosmological horizon?
  2. Can we say anything more quantitative about the strength or prevalence of these convergent instrumental values?
  3. Can we say more about values that are likely to be convergently instrumental just across AIs that are likely to be developed, and situations they are likely to find themselves in?


If you are interested in anything like this, you might want to mention it in the comments, and see whether other people have useful thoughts.

How to proceed

This has been a collection of notes on the chapter.  The most important part of the reading group though is discussion, which is in the comments section. I pose some questions for you there, and I invite you to add your own. Please remember that this group contains a variety of levels of expertise: if a line of discussion seems too basic or too incomprehensible, look around for one that suits you better!

Next week, we will talk about the treacherous turn. To prepare, read “Existential catastrophe…” and “The treacherous turn” from Chapter 8. The discussion will go live at 6pm Pacific time next Monday 24th November. Sign up to be notified here.

33 comments

I think that if SAIs will have a social part, we need to think altruistically about them.

It could be wrong (and dangerous too) to think that they will be just slaves.

We need to start thinking positively about our children. :)

See also TurnTrout's contribution in this area.

I am not that confident in the convergence properties of self-preservation as instrumental goal.

It seems that at least some goals should be pursued ballistically -- i.e., by setting an appropriate course in motion so that it doesn't need active guidance.

For example, living organisms vary widely in their commitment to self-preservation. One measure of this variety is the variety of lifespans and lifecycles. Organisms generally share the goal of reproducing, and they pursue this goal by a range of means, some of which require active guidance (like teaching your children) and some of which don't (like releasing thousands of eggs into the ocean).

If goals are allowed to range very widely, it's hard to believe that all final goals will counsel the adoption of the same CIVs as subgoals. The space of all final goals seems very large. I'm not even very sure what a goal is. But it seems at least plausible that this choice of CIVs is contaminated by our own (parochial) goals, and given the full range of weird possible goals these convergences only form small attractors.

A different convergence argument might start from competition among goals. A superintelligence might not "take off" unless it starts with sufficiently ambitious goals. Call a goal ambitious if its CIVs include coming to control significant resources. In that case, even if only a relatively small region in goal-space has the CIV of controlling significant resources, intelligences with those goals will quickly be overrepresented. Cf. this intriguing BBS paper I haven't read yet.
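The overrepresentation claim can be sketched with a toy calculation. This is only an illustration under made-up assumptions: the function name `ambitious_share`, the 1% starting fraction, and the 50% per-step growth rate are all hypothetical, and non-ambitious agents are assumed to hold their resources constant.

```python
def ambitious_share(initial_fraction, growth_rate, steps):
    """Share of all resources held by 'ambitious' agents after each step.

    Assumption: ambitious agents compound their resources each step,
    while all other agents keep exactly what they started with.
    """
    ambitious = initial_fraction        # resources held by ambitious agents
    others = 1.0 - initial_fraction     # resources held by everyone else
    shares = []
    for _ in range(steps):
        ambitious *= 1.0 + growth_rate  # ambitious agents reinvest and grow
        shares.append(ambitious / (ambitious + others))
    return shares


# Hypothetical numbers: 1% of agents are ambitious and grow 50% per step.
shares = ambitious_share(initial_fraction=0.01, growth_rate=0.5, steps=20)
print(f"after 1 step: {shares[0]:.3f}, after 20 steps: {shares[-1]:.3f}")
```

With these numbers the ambitious minority ends up holding most of the resources, even though it started as a small region of goal-space. The conclusion depends entirely on the assumed growth rates, which is exactly the part the argument would need to defend.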

Can you think of more convergent instrumental values?

"Convergent Instrumental Values" seems like a very jargony phrase for "good tactics."

- Patience
- Caution
- Deterred by credible threat
- Reputation-aware: it will extensively model what we think about it.
- Curiosity
- Reveals information when strategically appropriate
- ...

Your additional instrumental values, spreading your values and social influence, become very important if we avoid the rise of an AI with a decisive advantage.

In a society of AI-enhanced humans and other AGIs, altruism will become an important instrumental value for AGIs. The wide social network will recognize good behavior and anti-social behavior without forgetting. Facebook gives a faint glimpse of what will be possible in the future.

Stephen Omohundro said it in a nutshell: “AIs can control AIs.”

I wonder what kinds of insights would be available for a Superintelligence that might override its goal-content integrity goal. To begin with, the other instrumental goals, if they emerged, would themselves tamper with its allocation of resources.

Now consider humans: sometimes we think our goals are "travel a lot" but actually they turn out to be "be on a hammock two hours a day and swim once a week".

By trial and error, we re-map our goals based on what they feel from the inside, or rewards.

Which similar processes might an AGI do?

One reason we did not go travelling might have been a resource constraint. That could be money, but a limited ability to plan good trips, because of distraction or lack of knowledge, should also be counted as a limitation of planning resources.

That aside, people still have multiple drives which are not really goals, and we sort of compromise amongst these drives. The approach the mind takes is not always the best.

In people, it's really those mid-brain drives that run a lot of things, not intellect.

We could try to carefully program in some lower-level or more complex sets of "drives" into an AI. The "utility function" people speak of in these threads is really more like an incredibly overpowering drive for the AI.

If it is wrong, then there is no hedge, check or diversification. The AI will just pursue that drive.

As much as our minds often take us in the wrong direction with our drives, at least they are diversified and checked.

Checks and diversification of drives seem like an appealing element of mind design, even at significant cost to efficiency at achieving goals. We should explore these options in detail.

In humans, goal drift may work as a hedging mechanism.

But I don't think "utility function" in the context of this post has to mean, a numerical utility explicitly computed in the code.

It could just be, the agent behaves as-if its utilities are given by a particular numerical function, regardless of whether this is written down anywhere.

People do not behave as if we have utilities given by a particular numerical function that collapses all of their hopes and goals into one number, and machines need not do it that way, either.

Often when we act, we end up 25% short of the optimum solution, but we have been hypothesizing systems with huge amounts of computing power.

If they frequently end up 25% or even 80% short of behaving optimally, so what? In exchange for an AGI that stays under control we should be willing to make the trade-off.

In fact, if their efficiency falls by 95%, they are still wildly powerful.

Eliezer and Bostrom have discovered a variety of difficulties with AGIs which can be thought of as collapsing all of their goals into a single utility function.

Why not also think about making other kinds of systems?

An AGI could have a vast array of hedges, controls, limitations, conflicting tendencies and tropisms which frequently cancel each other out and prevent dangerous action.

The book does scratch the surface on these issues, but it is not all about fail-safe mind design and managed roll-out. We can develop a whole literature on those topics.

People do not behave as if we have utilities given by a particular numerical function that collapses all of their hopes and goals into one number, and machines need not do it that way, either.

I think this point is well said, and completely correct.


Why not also think about making other kinds of systems?

An AGI could have a vast array of hedges, controls, limitations, conflicting tendencies and tropisms which frequently cancel each other out and prevent dangerous action.

The book does scratch the surface on these issues, but it is not all about fail-safe mind design and managed roll-out. We can develop a whole literature on those topics.

I agree. I find myself continually wanting to bring up issues in the latter class of issues... so copiously so, that frequently it feels like I am trying to redesign our forum topic. So, I have deleted numerous posts-in-progress that fall into that category. I guess those of us who have ideas about fail-safe mind design that are more subtle -- or to put it more neutrally -- do not fit the running paradigm in which the universe of discourse is that of transparent, low-dimensional (low dimensional function range space, not low dimensional function domain space) utility functions, need to start writing our own white papers.

When I hear that Bostrom claims only 7 people in the world are thinking full time and productively about (in essence) fail safe mind design, or that someone at MIRI wrote only FIVE people are doing so (though in the latter case, the author of that remark did say that there might be others doing this kind of work "on the margin", whatever that means), I am shocked.

It's hard to believe, for one thing. Though, the people making those statements must have good reasons for doing so.

But maybe the derivation of such low numbers could be more understandable, if one stipulates that "work on the problem" is to be counted if and only if candidate people belong to the equivalence class of thinkers restricting their approach to this ONE, very narrow conceptual and computational vocabulary.

That kind of utility function-based discussion (remember when they were called 'heuristics' in the assigned projects, in our first AI courses?) has its value, but it's a tiny slice of the possible conceptual, logical and design pie ... about like looking at the night sky through a soda straw. If we restrict ourselves to such approaches, no wonder people think it will take 50 or 100 years to do AI of interest.

Outside of the culture of collapsing utility functions and the like, I see lots of smart (often highly mathematical, so they count as serious) papers in whole brain chaotic resonant neurodynamics; new approaches to foundations of mental health issues and disorders of subjective empathy (even some application of deviant neurodynamics to deviant cohort value theory, and defective cohort "theory of mind" -- in the neuropsychiatric and mirror neuron sense) that are grounded in, say, pathologies with transient Default Mode Network coupling... and disturbances of phase coupled equilibria across the brain.

If we run out of our own ideas to use from scratch (which I don't think is at all the case ... as your post might suggest, we have barely scratched the surface), then we can go have a look at current neurology and neurobiology, where people are not at all shy about looking for "information processing" mechanisms underlying complex personality traits, even underlying value and aesthetic judgements.

I saw a visual system neuroscientist's paper the other day offering a theory of why abstract (ie. non-representational) art is so intriguing to (not all, but some) human brains. It was a multi-layered paper, discussing some transiently coupled neurodynamical mechanisms of vision (the authors' specialties), some reward system neuromodulator concepts, and some traditional concepts expressed at a phenomenological, psychological level of description. An ambitious paper, yes!

But ambition is good. I keep saying, we can't expect to do real AI on the cheap.

A few hours or days reading such papers is good fertilizer, even if we do not seek to translate, in any direct way (like copying "algorithms" from natural brains) wetware brain research, into our goal, which presumably is to do dryware mind design --- and do it in a way where we choose our own functional limits, not have nature's 4.5 billion years of accidents choose boundary conditions on substrate platforms, for us.

Of course, not everyone is interested in doing this. I HAVE learned in this forum, that "AI" is a "big tent". Lots of uses exist for narrow AI, in thousands of industries and fields. Thousands of narrow AI systems are already in play.

But, really... aren't most of us interested in this topic because we want the more ambitious result?

Bostrom says "we will not be concerned with the metaphysics of mind..." and "...not concern ourselves whether these entities have genuine self-awareness...."

Well, I guess we won't be BUILDING real minds anytime soon, then. One can hardly expect to create, that which one won't even openly discuss. Bostrom is writing and speaking, using the language of "agency" and "goals" and "motivational sets", but he is only using those terms metaphorically.

Unless, that is, everyone else in here (other than me) actually is prepared to deny that we -- who spawned those concepts, to describe rich, conscious, intentionally entrained features of the lives of self-aware, genuine conscious creatures -- are different, i.e., that we are conscious and self-aware.

No one here needs a lesson in intellectual history. We all know that people did deny that, back in the behaviorism era. (I have studied the reasons -- philosophical and cultural -- and continue to uncover in great detail, mistaken assumptions out of which that intellectual fad grew.)

Only if we do THAT again, will we NOT be using "agent" metaphorically, when we apply that to machines with no real consciousness, because ex hypothesi WE'd possess no minds either, in the sense we all know we do possess, as conscious humans.

We'd THEN be using it ("agent", "goal", "motive" ... the whole equivalence class of related nouns and predicates) in the same sense for both classes of entities (ourselves, and machines with no "awareness", where the latter is defined as anything other than public, 3rd person observable behavior.)

Only in this case, would it not be a metaphor to use 'agent, motive', etc. in describing intelligent (but not conscious) machines, which evidently is the astringent conceptual model within which Bostrom wishes to frame HLAI --- proscribing considerations, as he does, of whether they are genuinely self-aware.

But, well, I always thought that that excessively positivistic attitude, had more than a little something to do with the "AI winter" (just like it is widely acknowledged to have been responsible for the neuroscience winter that paralleled it.)

Yet neuroscientists are not embarrassed to now say, "That was a MISTAKE, and -- fortunately -- we are over it. We wasted some good years, but are no longer wasting time denying the existence of consciousness, the very thing that makes the brain interesting and so full of fundamental scientific interest. And now, the race is on to understand how the brain creates real mental states."

NEUROSCIENCE has gotten over that problem with discussing mental states qua mental states, clearly.

And this is one of the most striking about-faces in the modern intellectual history of science.

So, back to us. What's wrong with computer science? Either AI-ers KNOW that real consciousness exists, just like neuroscientists do, and AI-ers just don't give a hoot about making machines that are actually conscious.

Or, AI-ers are afraid of tackling a problem that is a little more interesting, deeper, and harder (a challenge that gets thousands of neuroscientists and neurophilosophers up in the morning.)

I hope the latter is not true, because I think the depth and possibilities of the real thing -- AI with consciousness -- are what gives it all the attraction (and holds, in the end, for reasons I won't attempt to describe in a short post, the only possibility of making the things friendly, if not beneficent.)

Isn't that what gives AI its real interest? Otherwise, why not just write business software?

Could it be that Bostrom is throwing out the baby with the bathwater, when he stipulates that the discussion, as he frames it, can be had (and meaningful progress made), without the interlocutors (us) being concerned about whether AIs have genuine self awareness, etc?

One possible explanation for the plasticity of human goals is that the goals that change aren't really final goals.

So me-now faces the question,

Should I assign any value to final goals that I don't have now, but that me-future will have because of goal drift?

If goals are interpreted widely enough, the answer should be, No. By hypothesis, those goals of me-future make no contribution to the goals of me-now, so they have no value to me. Accordingly, I should try pretty hard to prevent goal drift and / or reduce investment in the well-being of me-future.

Humans seem to answer, Yes, though. They simultaneously allow goal drift, and care about self-preservation, even though the future self may not have goals in common with the present.

This behavior can be rationalized if we assume that it's mostly instrumental goals that drift, with final goals remaining fixed. So maybe humans have the final goal of maximizing their inclusive fitness, and consciously accessible goals are just noisy instruments for this final goal. In that case, it may be rational to embrace goal drift because 1) future instrumental goals will be better suited to implementing the final goal, under changed future circumstances, and 2) allowing goals to change produces multiple independent instruments for the final goal, which may reduce statistical noise.
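Point 2 can be made slightly more concrete with a standard averaging argument. As a sketch, assume (purely for illustration; the comment above does not specify a noise model) that each drifting instrumental goal $g_i$ tracks the fixed final goal $F$ with independent, zero-mean noise of equal variance:

```latex
g_i = F + \varepsilon_i, \qquad
\mathbb{E}[\varepsilon_i] = 0, \qquad
\operatorname{Var}(\varepsilon_i) = \sigma^2, \qquad
i = 1, \dots, n

\bar{g} = \frac{1}{n} \sum_{i=1}^{n} g_i
\quad\Longrightarrow\quad
\mathbb{E}[\bar{g}] = F, \qquad
\operatorname{Var}(\bar{g}) = \frac{\sigma^2}{n}
```

Under these assumptions, an agent that acts on $n$ successive goals has an average aim that is unbiased for $F$, with variance shrinking as $1/n$. If the errors in successive goals are correlated, the reduction is smaller, so the benefit depends on how independent the drifted goals really are.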

Should I assign any value to final goals that I don't have now, but that me-future will have because of goal drift?

The first issue is that you don't know what they will be.

What did you find most interesting this week?

I thought Bostrom's qualifications about when various alleged instrumentally convergent goals might not actually be desirable were pretty interesting.

Are you convinced that an AI will probably pursue the goals discussed in this section?

AI systems need not be architected to optimize a fully-specified, narrowly defined utility function at all.

Where does the presumption that an AGI necessarily becomes an unbounded optimizer come from if it is not architected that way? Remind me because I am confused. Tools, oracles and these neuromorphic laborers we talked about before do not seem to have this bug (although maybe they could develop something like it.)

My reading is that what Bostrom is saying is that boundless optimization is an easy bug to introduce, not that any AI has it automatically.

My reading is that what Bostrom is saying is that boundless optimization is an easy bug to introduce, not that any AI has it automatically.

I wouldn't call it a bug, generally. Depending on what you want your AI to do, it may very well be a feature; it's just that there are consequences, and you need to take those into account when deciding just what and how much you need the AI's final goals to do to get a good outcome.

I think I see what you're saying, but I am going to go out on a limb here and stick by "bug." Unflagging, unhedged optimization of a single goal seems like an error, no matter what.

Please continue to challenge me on this, and I'll try to develop this idea.

Approach #1:

I am thinking that in practical situations single-mindedness actually does not even achieve the ends of a single-minded person. It leads them in wrong directions.

Suppose the goals and values of a person or a machine are entirely single-minded (for instance, "I only eat, sleep and behave ethically so I can play Warcraft or do medical research for as many years as possible, until I die") and the rest are all "instrumental."

I am inclined to believe that if they allocated their cognitive resources in that way, such a person or machine would run into all kinds of problems very rapidly, and fail to accomplish their basic goal.

If you are constantly asking "but how does every small action I take fit into my Warcraft-playing?" then you're spending too much effort on constant re-optimization, and not enough on action.

Trying to optimize all of the time costs a lot. That's why we use rules of thumb for behavior instead.

Even if all you want is to be an optimal WarCraft player, it's better to just designate some time and resources for self-care or for learning how to live effectively with the people who can help. The optimal player would really focus on self-care or social skills during that time, and stop imagining WarCraft games for a while.

While the optimal Warcraft player is learning social skills, learning social skills effectively becomes her primary objective. For all practical purposes, she has swapped utility functions for a while.

Now let's suppose we're in the middle of a game of WarCraft. To be an optimal Warcraft player for more than one game, we also have to have a complex series of interrupts and rules (smell smoke, may screw up important relationship, may lose job and therefore not be able to buy new joystick).

If you smell smoke, the better mind architecture seems to involve swapping out the larger goal of Warcraft-playing in favor of extreme focus on dealing with the possibility that the house is burning down.

Approach #2: Perhaps finding the perfect goal is impossible, and goals must be discovered and designed over time. Goal-creation is subject to bounded rationality, so perhaps a superintelligence, like people, would incorporate a goal-revision algorithm on purpose.

Approach #3: Goals may derive from first principles which are arrived at non-rationally (I did not say irrationally, there is a difference). If a goal is non-rational, and its consequences have yet to be fully explored, then there is a non-zero probability that, at a later time, this goal may prove self-inconsistent, and have to be altered.

Under such circumstances, single-minded drives risk disaster.

Approach #4:

Suppose the system is designed in some way to be useful to people. It is very difficult to come up with unambiguous, airtightly consistent goals in this realm.

If a goal has anything to do with pleasing people, what they want changes unpredictably with time. Changing the landscape of an entire planet, for example, would not be an appropriate response for an AI that was very driven to please its master, even if the master claimed that was what they really wanted.

I am still exploring here, but I am veering toward thinking that utility function optimization, in any pure form, just plain yields flawed minds.

Approach #1: Goal-evaluation is expensive

You're talking about runtime optimizations. Those are fine. You're totally allowed to run some meta-analysis, figure out you're spending more time on goal-tree updating than the updates gain you in utility, and scale that process down in frequency, or even make it dependent on how much cputime you need for time-critical ops in a given moment. Agents with bounded computational resources will never have enough cputime to compute provably optimal actions in any case (the problem is uncomputable); so how much you spend on computation before you draw the line and act out your best guess is always a tradeoff you need to make. This doesn't mean your ideal top-level goals - the ones you're trying to implement as best you can - can't maximize.
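A minimal sketch of that tradeoff, under invented assumptions: the function `deliberate`, the halving gains, and the fixed per-round cost are all hypothetical stand-ins, and a real agent would have to estimate the expected gain of further thinking rather than read it from a list.

```python
def deliberate(marginal_gains, cost_per_round):
    """Refine the plan while the next round's expected gain exceeds its cost.

    marginal_gains: expected utility gained by each further round of
    re-planning (hypothetical; assumed known in advance here).
    Returns (rounds of deliberation used, net utility from deliberating).
    """
    rounds = 0
    net_utility = 0.0
    for gain in marginal_gains:
        if gain <= cost_per_round:
            break  # more thinking costs more than it is expected to gain
        net_utility += gain - cost_per_round
        rounds += 1
    return rounds, net_utility


# Hypothetical diminishing returns: each round helps half as much as the last.
gains = [8.0 / 2**k for k in range(10)]  # 8, 4, 2, 1, 0.5, ...
rounds, net = deliberate(gains, cost_per_round=1.5)
print(rounds, net)  # 3 9.5
```

The stopping rule changes how much computation is spent, not what is being maximized: the top-level goal is the same whether the agent stops after three rounds or thirty.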

Approach #2: May want more goals

For this to work, you'd still need to specify how exactly that algorithm works; how you can tell good new goals from bad ones. Once you do, this turns into yet another optimization problem you can install as a (or the only) final goal, and have it produce subgoals as you continue to evaluate it.

Approach #3: Derive goals?

I may not have understood this at all, but are you talking about something like CEV? In that case, the details of what should be done in the end do depend on fine details of the environment which the AI would have to read out and (possibly expensively) evaluate before going into full optimization mode. That doesn't mean you can't just encode the algorithm of how to decide what to ultimately do as the goal, though.

Approach #4: Humans are hard.

You're right; it is difficult! Especially so if you want it to avoid wireheading (the humans, not itself), and brainwashing, keep society working indefinitely, and not accidentally squash even a few important values. It's also known as the FAI content problem. That said, I think solving it is still our best bet when choosing what goals to actually give our first potentially powerful AI.

The problem is that it's easy for any utility function to become narrow, to be narrowly interpreted.

What do you think of my devil's advocacy?

Another way to look at it: Subgoals may be offset by other subgoals. This includes convergent values.

Humans don't usually let any one of their conflicting values override all others. For example, accumulation of any given resource is moderated by other humans and by diminishing marginal returns on any one resource as compared to another.
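A toy sketch of how diminishing marginal returns moderate accumulation, under assumptions of my own: the resource names, the weights, and the choice of a logarithm as the concave utility are all hypothetical. An agent spends its budget one unit at a time on whichever resource currently adds the most value.

```python
import math


def allocate(budget, weights, utility):
    """Spend `budget` units one at a time on the resource with the highest marginal gain."""
    holdings = {name: 0 for name in weights}

    def marginal_gain(name):
        x = holdings[name]
        return weights[name] * (utility(x + 1) - utility(x))

    for _ in range(budget):
        best = max(holdings, key=marginal_gain)
        holdings[best] += 1
    return holdings


def linear(x):
    # No diminishing returns: every extra unit is worth the same.
    return x


def concave(x):
    # Diminishing returns: each extra unit is worth less than the last.
    return math.log1p(x)


weights = {"food": 3.0, "tools": 2.0, "land": 1.0}
print(allocate(12, weights, linear))   # {'food': 12, 'tools': 0, 'land': 0}
print(allocate(12, weights, concave))  # spread across all three resources
```

With linear utility the whole budget goes to the single highest-weighted resource. With the concave utility the same agent spreads its budget, which is the moderating effect described above.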

On the other hand, for a superintelligence, particularly one with a simple terminal goal, these moderating factors would be less effective. For example, they might not have competitors.

The text is very general in its analysis, so some examples would be helpful. Why not start talking about some sets of goals that people who build an optimizing AI system might actually install in it, and see how the AI uses them?

To avoid going too broad, here's one: "AI genie, abolish hunger in India!"

So the first thing people will complain about with this is that the easiest thing for the system to do is to kill all the Indians in some other way. Let's add:

"Without violating Indian national laws."

Now lesswrongers will complain that the AI will:

-Set about changing Indian national law in unpredictable ways.
-Violate other countries' laws.
-Brilliantly seek loopholes in any system of laws.
-Try to become rich too fast.
-Escape being useful by finding issues with the definition of "hunger," such as whether it is OK for someone to briefly become hungry 15 minutes before dinner.
-Build an army and a police force to stop people from getting in the way of this goal.

So, we continue to debug:

"AI, ensure that everyone in India is provisioned with at least 1600 calories of healthy (insert elaborate definition of healthy HERE) food each day. Allow them to decline the food in exchange for a small sum of money. Do not break either Indian law or a complex set of other norms and standards for interacting with people and the environment (insert complex set of norms and standards HERE)."

So, we continue to simulate what the AI might do wrong, and patch the problems with more and more sets of specific rules and clauses.

It feels like people will still want to add "stop using resource X/performing behavior Y to pursue this goal if we tell you to," because people may have other plans for resource X or see problems with behavior Y which are not specified in laws and rules.

People may also want to add something like, "ask us (or this designated committee) first before you take steps that are too dramatic (insert definition of dramatic HERE)."

Then, I suppose, the AI may brilliantly anticipate, outsmart, provide legal quid-pro-quos or trick the committee. Some of these activities we would endorse (because the committee is sometimes doing the wrong thing), others we would not.

Thus, we continue to evolve a set of checks and balances on what the AI can and cannot do. Respect for the diverse goals and opinions of people seems to be at the core of this debugging process. However, this respect is not limitless, since people's goals are often contradictory and sometimes mistaken.

The AI is constantly probing these checks and balances for loopholes and alternative means, just as a set of well-meaning food security NGO workers would do. Unlike them, however, if it finds a loophole it can go through it very quickly and with tremendous power.

Note that if we substitute the word "NGO," "government" or "corporation" for "AI," we end up with the same set of issues as the AI system has. We deal with this by limiting these organizations' resource levels.

We could designate precisely what resources the AI has to meet its goal. That tool might work to a great extent, but the optimizing AI will still continue to try to find loopholes.

We could limit the amount of time the AI has to achieve its goal, or try to limit the amount of processing power or other hardware it can use.

We could develop computer simulations of what would happen if an AI were given a particular set of rules and goals, and disallow many options based on these simulations. This is a kind of advanced consequentialism.

Even if the rules and the simulations work very well, each time we give the AI a set of rules for its behavior, there is a non-zero probability of unintended consequences.

Looking on the bright side, however, as we continue this hypothetical debugging process, the probability (perhaps it's better to call it a hazard rate) seems to be falling.
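A rough way to see why a falling hazard rate matters, sketched under strong assumptions: each round of rules is treated as an independent trial, and the numbers (a 5% starting hazard, a 20% reduction per debugging round) are invented purely for illustration.

```python
def cumulative_risk(hazards):
    """Probability of at least one unintended consequence, assuming independent rounds."""
    survive = 1.0
    for h in hazards:
        survive *= 1.0 - h
    return 1.0 - survive


rounds = 50
# Hypothetical: the hazard never improves.
constant = [0.05] * rounds
# Hypothetical: each round of debugging cuts the hazard by 20%.
falling = [0.05 * 0.8 ** k for k in range(rounds)]

print(cumulative_risk(constant))
print(cumulative_risk(falling))
```

With a constant hazard the cumulative risk creeps toward certainty as rounds accumulate. If the hazard falls geometrically, the cumulative risk stays bounded well below 1, though it never reaches zero.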

Note also that we get the same problems with governments, NGOs and corporations. Perhaps what we are seeking is not perfection but some advantage over existing approaches to organizing groups of people to solve their problems.

Existing structures are themselves hazardous. The threshold for supplementing them with AI is not zero hazard. It is hazard reduction.

Note that self-preservation is really a sub-class of goal-content integrity, and is worthless without it.

This is a total nit pick, but:

Suppose your AI's goal was "preserve myself". Ignoring any philosophical issues about denotation, here self-preservation is worthwhile even if the goal changed. If the AI, by changing itself into a paperclip maximizer, could maximize its chances of survival (say because of the threat of other Clippies) then it would do so. Because self-preservation is an instrumentally convergent goal, it would probably survive for quite a long time as a paperclipper - maybe much longer than as an enemy of Clippy.

I take this to be false.

To be the same and to have the same goals are two distinct, but equally possible kinds of sameness.

Most humans seem to care much more about the former (survival) than the latter (that their goals be sustained in the universe).

Citing Woody Allen: "I don't want to achieve immortality through my work. I want to achieve it through not dying."

We do have distinct reasons to think machine intelligences would like to preserve their goals, and that for them, perhaps, identity would feel more entangled with goals. However, those reasons are far from unequivocal.

Just a little idea:

In an advertisement I saw an interesting pyramid with these levels (from top to bottom): vision -> mission -> goals -> strategy -> tactics -> daily planning.

I think that if we want to analyse cooperation between SAI and humanity, then we need interdisciplinary work (philosophy, psychology, mathematics, computer science, ...) on the (vision -> mission -> goals) part. (If humanity defines the vision and mission and the SAI derives the goals, that could be good.)

I am afraid that humanity has properly defined and analysed neither its vision nor its mission. And different groups and individuals have contradictory visions, missions and goals.

One big problem with SAI is not the SAI itself but that we will have BIG POWER and we still don't know what we really want (and what we really want to want).

Bostrom's book seems to work within a paradigm in which a goal is something at the top, rigid and stable, which could not be dynamic and flexible like a vision. It is probably true that one stupidly defined goal (paperclipper) could be unchangeable and ultimate. But we probably have more possibilities for defining an SAI's personality.