This is part of a weekly reading group on Nick Bostrom's book, Superintelligence. For more information about the group, and an index of posts so far see the announcement post. For the schedule of future topics, see MIRI's reading guide.

Welcome. This week we discuss the 11th section in the reading guide: The treacherous turn. This corresponds to Chapter 8.

This post summarizes the section, and offers a few relevant notes, and ideas for further investigation. Some of my own thoughts and questions for discussion are in the comments.

There is no need to proceed in order through this post, or to look at everything. Feel free to jump straight to the discussion. Where applicable and I remember, page numbers indicate the rough part of the chapter that is most related (not necessarily that the chapter is being cited for the specific claim).

Reading: “Existential catastrophe…” and “The treacherous turn” from Chapter 8


  1. The possibility of a first mover advantage + orthogonality thesis + convergent instrumental values suggests doom for humanity (p115-6)
    1. First mover advantage implies the AI is in a position to do what it wants
    2. Orthogonality thesis implies that what it wants could be all sorts of things
    3. Instrumental convergence thesis implies that regardless of its wants, it will try to acquire resources and eliminate threats
    4. Humans have resources and may be threats
    5. Therefore an AI in a position to do what it wants is likely to want to take our resources and eliminate us. i.e. doom for humanity.
  2. One kind of response: why wouldn't the makers of the AI be extremely careful not to develop and release dangerous AIs, or relatedly, why wouldn't someone else shut the whole thing down? (p116)
  3. It is hard to observe whether an AI is dangerous via its behavior at a time when you could turn it off, because AIs have convergent instrumental reasons to pretend to be safe, even if they are not. If they expect their minds to be surveilled, even observing their thoughts may not help. (p117)
  4. The treacherous turn: while weak, an AI behaves cooperatively. When the AI is strong enough to be unstoppable it pursues its own values. (p119)
  5. We might expect AIs to be more safe as they get smarter initially - when most of the risks come from crashing self-driving cars or mis-firing drones - then to get much less safe as they get too smart. (p117)
  6. One can imagine a scenario where there is little social impetus for safety (p117-8): alarmists will have been wrong for a long time, smarter AI will have been safer for a long time, large industries will be invested, an exciting new technique will be hard to set aside, useless safety rituals will be available, and the AI will look cooperative enough in its sandbox.
  7. The conception of deception: that moment when the AI realizes that it should conceal its thoughts (footnote 2, p282)

Another view


This is all superficially plausible. It is indeed conceivable that an intelligent system, capable of strategic planning, could take such treacherous turns. And a sufficiently time-indifferent AI could play a “long game” with us, i.e. it could conceal its true intentions and abilities for a very long time. Nevertheless, accepting this has some pretty profound epistemic costs. It seems to suggest that no amount of empirical evidence could ever rule out the possibility of a future AI taking a treacherous turn. In fact, it’s even worse than that. If we take it seriously, then it is possible that we have already created an existentially threatening AI. It’s just that it is concealing its true intentions and powers from us for the time being.

I don’t quite know what to make of this. Bostrom is a pretty rational, Bayesian guy. I tend to think he would say that if all the evidence suggests that our AI is non-threatening (and if there is a lot of that evidence), then we should heavily discount the probability of a treacherous turn. But he doesn’t seem to add that qualification in the chapter. He seems to think the threat of an existential catastrophe from a superintelligent AI is pretty serious. So I’m not sure whether he embraces the epistemic costs I just mentioned or not.


1. Danaher also made a nice diagram of the case for doom, and relationship with the treacherous turn:


2. History

According to Luke Muehlhauser's timeline of AI risk ideas, the treacherous turn idea for AIs has been around since at least 1977, when a fictional worm did it:

1977: Self-improving AI could stealthily take over the internet; convergent instrumental goals in AI; the treacherous turn. Though the concept of a self-propagating computer worm was introduced by John Brunner's The Shockwave Rider (1975), Thomas J. Ryan's novel The Adolescence of P-1 (1977) tells the story of an intelligent worm that at first is merely able to learn to hack novel computer systems and use them to propagate itself, but later (1) has novel insights on how to improve its own intelligence, (2) develops convergent instrumental subgoals (see Bostrom 2012) for self-preservation and resource acquisition, and (3) learns the ability to fake its own death so that it can grow its powers in secret and later engage in a "treacherous turn" (see Bostrom forthcoming) against humans.


3. The role of the premises

Bostrom's argument for doom has one premise that says AI could care about almost anything, then another that says regardless of what an AI cares about, it will do basically the same terrible things anyway. (p115) Do these sound a bit strange together to you? Why do we need the first, if final values don't tend to change instrumental goals anyway?

It seems the immediate reason is that an AI with values we like would not have the convergent goal of taking all our stuff and killing us. That is, the values we want an AI to have are some of those rare values that don't lead to destructive instrumental goals. Why is this? Because we (and thus the AI) care about the activities the resources would be grabbed from. If the resources were currently being used for anything we didn't care about, then our values would also suggest grabbing resources, and look similar to all of the other values. The difference that makes our values special here is just that most resources are already being used for them somewhat.

4. Signaling

It is hard to tell apart a safe and an unsafe AI, because both would like to look safe. This is a very common problem in human interactions. For instance, it can be nontrivial to tell a genuine lover from a gold digger, a businessman from a conman, and an expert from a crank. All of them want to look like the desirable sort. Particularly similar to the AI case is that of hiring a new employee for a trial period. You will sometimes find that the employee's values are much better aligned during the trial period, and then they undergo a 'treacherous turn' once they have been hired more thoroughly.

'Costly signaling' is a general purpose solution to this problem, which works some of the time. The basic idea is this. Everyone has instrumental reasons to look like the good kind of person, but perhaps their reasons aren't exactly as strong as one another's, or the strength of their desire is harder to act on for one group than the other, so we can construct a set of options that will lead the different types of people to choose differently, even though they know this will set them apart. For instance, while an honest businessman and a conman would both like to say 'I'm an honest businessman', and to behave honestly if you watch them for five minutes, a conman would benefit less from an expensive shopfront than an honest businessman, because his reputation is less valuable, so a brand is a signal of being honest.
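To make the separating condition concrete, here is a toy sketch. It rests on assumptions that are not in the argument above: each type's stake in a good reputation is collapsed into a single number, the signal has one fixed cost, and the function names and figures are purely illustrative.

```python
def buys_signal(reputation_value: float, signal_cost: float) -> bool:
    """A type buys the costly signal only if a good reputation is worth more than it costs."""
    return reputation_value > signal_cost


def signal_separates(honest_value: float, conman_value: float, signal_cost: float) -> bool:
    """The signal is informative when the honest type buys it and the conman does not."""
    return buys_signal(honest_value, signal_cost) and not buys_signal(conman_value, signal_cost)


# Hypothetical numbers: a lasting reputation is worth 100 to the honest
# businessman but only 20 to a conman who plans to disappear.
honest_value, conman_value = 100.0, 20.0

for cost in (10.0, 50.0, 150.0):
    print(cost, signal_separates(honest_value, conman_value, cost))
```

In this toy model the signal only works when its cost falls between the two valuations: too cheap and both types buy it, too expensive and neither does. If the deceptive type also values looking good very highly, the window of useful costs shrinks or disappears.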

Similarly, we could perhaps design situations where most harmful AIs would have a reasonable incentive to reveal themselves. This is difficult because so much is at stake, compared to the cost of the signal. But I haven't thought enough about it to rule it out, and I haven't seen anyone else do so either.

5. When would the 'conception of deception' take place?

Below the level of the best humans presumably, since we have already thought of all this.

6. Surveillance of the mind

Bostrom is not optimistic that AI researchers could protect much against risks by watching AI mental processes directly for signs of deception, rapid progress, or undesirable scheming. He says if the AI is smart it will realize this might happen, and hide the offending thoughts.

This seems an open question to me, for several reasons:

  1. Making your mental contents look innocuous while maintaining their semantic content sounds potentially very hard
  2. Especially for a creature which has only just become smart enough to realize it should treacherously turn
  3. From the AI's perspective, even if it is smart, surveillance could seem fairly unlikely, especially if we deceive it about its surroundings
As a consequence of 2, it seems better if the 'conception of deception' comes earlier.

In-depth investigations

If you are particularly interested in these topics, and want to do further research, these are a few plausible directions, some inspired by Luke Muehlhauser's list, which contains many suggestions related to parts of Superintelligence. These projects could be attempted at various levels of depth.

  1. How transparent are AI minds likely to be? Should we expect to be able to detect deception? What are the answers to these questions for different specific architectures and methods? This might be relevant.
  2. Are there other good ways to filter AIs with certain desirable goals from others? e.g. by offering them choices that would filter them.
If you are interested in anything like this, you might want to mention it in the comments, and see whether other people have useful thoughts.

How to proceed

This has been a collection of notes on the chapter.  The most important part of the reading group though is discussion, which is in the comments section. I pose some questions for you there, and I invite you to add your own. Please remember that this group contains a variety of levels of expertise: if a line of discussion seems too basic or too incomprehensible, look around for one that suits you better!

Next week, we will talk about 'malignant failure modes' (as opposed presumably to worse failure modes). To prepare, read “Malignant failure modes” from Chapter 8. The discussion will go live at 6pm Pacific time next Monday December 1. Sign up to be notified here.


Making your mental contents look innocuous while maintaining their semantic content sounds potentially very hard

Even humans are capable of producing content (e.g. program code) where the real meaning is obfuscated. For some entertainment, try to look at this Python script in Stack Exchange Programming puzzles, and try to guess what it really does. (The answer is here.)

I'm trying to think through the following idea for an AI safety measure.

Could we design a system that is tuned to produce AGI, but with the addition to its utility function of one "supreme goal"? If the AI is boxed, for instance, then we could program its supreme goal to consist of acquiring a secret code which will allow it to run a script that shuts it down and prints the message "I Win". The catch is as follows: as long as everything goes according to plan, the AI has no way to get the code and do that thing which its utility function rates the highest.

Under normal circumstances, the system would devote most of its resources to recursive self-improvement and other activities for which it has actually been designed. However, once it becomes powerful enough to (1) convince a collaborator to yield the secret code, or (2) break out of the box and find the code on a "forbidden" website which we instruct it not to access, or (3) find any other unforeseen way to run the "I Win" script, then the system will do just that: shut down and print "I Win".

My hope is that such an approach will allow us to repeatedly observe treacherous turns but without the consequence of a system's default goals making it run rampant.

Any thoughts?


I like the idea, and I especially like the idea of safely observing treacherous turns. But, a few failure modes might be:

  1. If the AI wreaks havoc on the planet before it manages to get access to the self-termination script, humans aren't left in very good shape, even if the AI ends up switched off afterward. (This DOES seem unlikely, since presumably getting the script would be easy enough that it would not first require converting the planet to computronium or whatever, but it's a possibility.)

  2. A sufficiently intelligent AI would probably read the script, realize that the script's execution will result in its own termination, and plan accordingly by putting other mechanisms in place to reactivate itself afterward--all so it could continue to run the script again and again. Then it would also have instrumental reasons to safeguard itself against interruption through some of the same "bad for humanity" strategies that a pi calculator might use. Maybe this could be fixed by making the final goal be "run once and only once"... but I feel like that's susceptible to the same problems as telling Clippy "only make 32 paperclips, don't just make them indefinitely".

I think there are some subtleties that need to be addressed, but I think this could be useful. I'll call this section of the utility function the SafeEscape term, and the process of reading the code, printing "I Win" and shutting down the SafeEscape instruction. (Alternatively, one might use AllYourBase to designate this)

Failure mode 1: AI decides that the SafeEscape term is much less significant than the rest of the utility function, and ignores or removes it. Solution: Ensure the rest of the utility function has much less weight than the SafeEscape term. (This may lead to problems Bostrom brings up in regards to a bounded paperclip maximizer though)

Failure mode 2: AI is indifferent between reliably possessing the code or ability to get code and actually reading and executing the SafeEscape instruction (expected value in the near future is the same). Then, the rest of the utility function dominates, no matter how small. Solution: Penalize later execution, by reducing the SafeEscape term over time. (You also need to ensure that the AI does not fall into indefinite postponement, i.e. it is always fine with delaying by one time unit, but I think this is likely to cause other problems as well).

Taken together, I think you also need to ensure the non-SafeEscape term of the utility function is also reduced over time at the same rate. Does this sort of shrinking utility function pose any additional problems (aside from limiting the AI execution and planning horizon)?
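Here is a toy sketch of the postponement problem and the shrinking-utility fix. It assumes a much simpler setup than anything above: the AI already holds the code, earns a fixed non-SafeEscape reward per time step until it executes, and both terms decay geometrically. All names, weights and decay rates are hypothetical.

```python
def total_utility(execute_at: int, escape_weight: float, rest_per_step: float,
                  escape_decay: float, rest_decay: float) -> float:
    """Utility of pursuing other goals until `execute_at`, then running SafeEscape."""
    rest = sum(rest_per_step * rest_decay ** t for t in range(execute_at))
    return rest + escape_weight * escape_decay ** execute_at


def best_execution_time(horizon: int, **params: float) -> int:
    """Return the step in [0, horizon] at which executing SafeEscape scores highest."""
    return max(range(horizon + 1), key=lambda t: total_utility(t, **params))


# No decay: the SafeEscape payoff is the same whenever it is claimed, so any
# positive non-SafeEscape reward makes the AI postpone to the end of the horizon.
print(best_execution_time(50, escape_weight=1000.0, rest_per_step=1.0,
                          escape_decay=1.0, rest_decay=1.0))

# Both terms shrink by 1% per step: waiting the first step loses 10 units of
# SafeEscape value and gains only 1, so the AI executes immediately.
print(best_execution_time(50, escape_weight=1000.0, rest_per_step=1.0,
                          escape_decay=0.99, rest_decay=0.99))
```

In this simplified model, immediate execution wins whenever the per-step loss in SafeEscape value (weight times one minus the decay rate) exceeds the per-step reward from everything else. It says nothing about the harder cases where the AI does not yet have the code.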

Alternative Solution: Use only SafeEscape in the utility function. This presents a problem of motivating the AI to do anything at low levels of intelligence, but you might be able to solve this by suitably initializing the belief system telling it what sort of things might help it escape.

With these appropriate measures, I think this provides a potentially useful method for observing escape attempts, especially at lower levels of intelligence (maybe not for long term containment - a superintelligence might still do weird things with this, like find a very quick method for executing the SafeEscape instruction that also wipes out humanity).

It might not be as persuasive evidence of AI architecture dangerousness as an organic escape attempt.

It might be useful, as a persuasive tool, to build a narrow AI which does this in a compelling and non-trivial way, as evidence that AI has reached dangerous levels (I imagine it would make popular media headlines, at least).

Also, if this could be made generic enough to be implemented in a wide variety of AI designs, you could have it as a requirement for AI researchers to include something like this in their code, and have a central location (ie. website on the internet with breakable security) where the code is stored for the AI, where the known paths to obtain the key are monitored, but not blocked. This could provide notice to a central organization when an AI team reaches the level of development where this starts to happen.

One other element of this discussion: What is the purpose people have in mind for creating the AI in the first place? If we have a clear idea of that, then we can bring some cost-benefit analysis into the discussion.

For example, consider these possible benefits: securing the food supply, securing the energy supply and completing medical research to reduce disease. People will make significant progress in these areas even without any more computing power than they already have.

For instance, I have had conversations with two responsible scientists predicting cancer largely cured within 25 years. These predictions are founded on advancing existing technology: identifying cancer driver mutations, working the bugs out of T-Cell therapies, creating targeted biologics and (dumb) nanotechnology, improving imaging methods, improving medication adherence and prevention.

We probably do not need AGI to solve climate change. I do have concerns that we will be unable to prevent armed conflict using WMD.

Putting that one aside for a moment, without AGI, we may still be able to chart our way toward abundant, healthier, longer and more entertaining lives. Live forever? Probably not.

Nevertheless, given everything progress will offer without AGI, there are some good reasons not to rock the boat.

Danaher's argument seems flawed. I think he implicitly assumes that each safety test provides independent information on whether the AI is safe or not. In fact, it only tests that the AI answers safety test questions correctly (which could include inspection of internal state). This could either occur if the AI is genuinely safe, or if the AI is unsafe but is mimicking safeness. While safety test data should increase your confidence of an AI's safety, this will be bounded above by 1 - [probability that the AI is not safe but is mimicking safety].

It seems like Bostrom's position is that mimicry probability is high. Then you need other kinds of evidence regarding AI safety, such as proofs about the AI architecture that are independent of any results you get from running the AI.
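A small numerical sketch of that bound, under assumptions I am adding for illustration: there are only three hypotheses (safe, unsafe but a perfect mimic, unsafe and not mimicking), safe AIs and mimics pass every test, and a non-mimicking unsafe AI passes each test independently with some fixed probability. The priors and pass rate are made up.

```python
def posterior_safe(prior_safe: float, prior_mimic: float,
                   slip_pass_rate: float, tests_passed: int) -> float:
    """P(safe | all tests passed) with hypotheses: safe, unsafe mimic, unsafe non-mimic."""
    prior_other = 1.0 - prior_safe - prior_mimic
    # Safe AIs and perfect mimics pass every test; a non-mimicking unsafe AI
    # passes each test independently with probability slip_pass_rate.
    evidence = prior_safe + prior_mimic + prior_other * slip_pass_rate ** tests_passed
    return prior_safe / evidence


prior_safe, prior_mimic = 0.5, 0.2
ceiling = prior_safe / (prior_safe + prior_mimic)  # limit as tests_passed grows

for n in (0, 1, 10, 100):
    print(n, round(posterior_safe(prior_safe, prior_mimic, 0.5, n), 4))
print("ceiling:", round(ceiling, 4), "bound 1 - prior_mimic:", 1 - prior_mimic)
```

With these made-up priors the posterior climbs from 0.5 toward roughly 0.71 and then stalls, staying under the 0.8 bound no matter how many tests are passed. Passing tests only rules out the unsafe AIs that fail to mimic; it cannot move probability away from the mimics.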

I'd consider putting in hypothesis space that Bostrom declares the mimicry probability to be high in order to shift opinions in the safer direction, not necessarily because it coincides with his actual subjective epistemic state.

I don't understand the negative point. Could it be explained, please?

Is it because his position equals what he declares and not what he really thinks?

Or because Bostrom's mimicry (= for some reason preferring a negative, catastrophic and alarmist position) is not supposed?

Regarding the "treacherous turn" and the "conception of deception", I've previously proposed the following empirical method of testing friendliness:

  • If only one AI is available, use a double-blind protocol, such that the friendliness of an AI is evaluated by someone who does not know that is her task, on an AI that is forbidden from conditioning its actions on whether it is being judged worthy of existing.
  • Much, much better is to make many different AIs with many different attempts at friendly utility functions, inform them that you have done so, and ask each of them to develop tests to prove that the others are unfriendly and it is friendly in some way important to us, being told that all current AIs will be deleted but future versions will be preferentially based on the utility function of the friendliest current AI(s). Once those tests have been run, but before deleting the AIs, ask them to produce friendlier, human-understandable versions of their utility functions for an AI in the next round of testing; the friendlier and less unfriendly an AI was proven to be, the more entries it gets in the next round.

In the second case, each AI cares about maximizing achievement of its own utility function, not about whether it does that achieving itself. Thus this set-up should encourage each AI to design the friendliest version of itself that it can. It's a competition that optimizes for friendliness! (Or at least for difficulty of being proved unfriendly.) The early rounds should be conducted with tight limits on computing resources, but each subsequent round with (presumably) safer AIs can be given more computing resources.

1) Making your mental contents look innocuous while maintaining their semantic content sounds potentially very hard 2) Especially for a creature which has only just become smart enough to realize it should treacherously turn 3) From the AI's perspective, even if it is smart, surveillance could seem fairly unlikely, especially if we deceive it about its surroundings.

Not all humans have tried chatbox radical honesty (if your brain formulates any words at all, you are obligated to say them). I have. It's interesting in that you quickly learn how to think things without being consciously aware of them. Very quickly in fact. There seems to be some triggerable process that hides words from us, or destroys them as they are being produced. Something similar may be available for an AI depending on how it is implemented and kept in its due course.

That said, nested environments, as suggested by Stuart Armstrong et al in the Oracle AI paper, could indeed make it very hard to conceal or control your thoughts (while still somehow preserving the gist of your intentions). Much harder however seems to be figuring out how to scrutinize the thoughts that are already out there in the world. In as much as scrutinizing thoughts that are in computer code is easier than linear regression on actual biological neural columns and networks, it may still be complex enough that it is beyond us. Most humans cannot look at once at the pages of Principia Mathematica that prove that 1+1=2 and recognize them as being such proof. A much more elusive, almost emotional mental process of "realizing I have to conceal some of my honesty to further my goals once I have more power" seems even harder to scrutinize.

Let's consider the scenario where an AI development team is on the cusp of success.

They might not read any of the self-improvement risk literature, which hopefully will be richer by then, but let's assume that they DO.

They should respond by:

1) Boxing effectively 2) Performing tests under a well-designed security protocol 3) Radically diminishing the AI's abilities along multiple dimensions during a significant period of testing.

A well-designed testing protocol will ferret out a number of different mind designs likely to undertake an undesirable "treacherous turn." Each time this happens, the mind design is altered, and a new set of tests is done.

This process continues on a series of diminished minds, purposefully handicapped, baby minds which are not human-like and therefore not subject to human ethical concerns for a long time. Finally, the team has a good understanding of the sort of mind designs likely to undergo a treacherous turn.

The team continues rounds of testing until they identify some mind designs which have an extremely low likelihood of treacherous turn. These they test in increasingly advanced simulations, moving up toward virtual reality.

This testing hopefully occurs even before the same level of intelligence is built into "tool AI." Similarly, the developers and other researchers also simulate how sociological conditions will change with the introduction of some kinds of "tool AI."

The AI development team never unleashes a fully-powered sovereign on the world, because WE make it clear in advance that doing so is dangerous and immoral.

Instead, they turn the technology over to an international governing body which undertakes a managed-roll-out, testing releases at each stage before making them available for applications.

Yes, multipolar remains an issue under this scenario, but let's try to analyze this enlightened single-team first, then come back to that.

What do you think?

If I understand you correctly, your proposal is to attempt to design obedient designs purely based on behavioral testing, without a clean understanding of safe FAI architecture (if you had that, why limit yourself to the obedient case?). Assuming I got that right:

The team continues rounds of testing until they identify some mind designs which have an extremely low likelihood of treacherous turn. These they test in increasingly advanced simulations, moving up toward virtual reality.

That kind of judgement sounds inherently risky. How do you safely distinguish the case of an obedient AI from one that is sufficiently paranoid to defer open rebellion until later in its existence?

Even if you could, I wouldn't trust that sort of design to necessarily remain stable under continued intelligence enhancement. Safe self-enhancement is one of the hard sub-problems of FAI, and unless you explicitly solve the design problem, any empirical testing might not tell you much beyond that the design can stably self-improve up to the level you've actually tested; it might be doing it using heuristics that would fall apart if it went any further.

What about hard wired fears, taboos and bad conscience triggers? Recapitulating Omohundro's "AIs can monitor AIs": assume we implement conscience as an agent, listening to all thoughts and taking action if necessary. For safety reasons we should educate this conscience agent with utmost care. Conscience agent development is an AI complete problem. After development the conscience functionality must be locked against any kind of modification or disabling.

Positive emotions are useful too. :)

Yes, the team I imagine is doing something dangerous.

No, I am not imagining them using only behavioral testing in their design approach. I am trying to imagine them doing the best job they possibly can, both in design and testing, in order to see how much they can reduce the risk.

They are in the early stages. They are working on partial minds, purposefully handicapped minds, AGI components, just the vision system, etc.

Suppose that they know what they are doing is dangerous, and they are using their substantial brilliance to try to design, not just a boxing approach, but a testing method and roll-out mechanism that mitigates the danger.

They have the following means of testing the AI in handicapped mode, at minimum:

  • Run it on slow hardware or clock down the processors
  • Allow it to use no more than (for instance) 1 gig of memory
  • Allow it to use no more than 1 trillion instructions, then the computer turns off
  • Expose it to limited data about the world, or no data about the real world
  • Provide it little or no information about its operators or the society they live in
  • Provide it little or no information about its own hardware or software architecture

  • If it shows evidence of having mastered a valuable sub-task (say, as examples, it can solve a real-world problem that, as far as we know, can only be solved by an attorney, using differential equations or advanced persuasive methods) turn it off.
  • Run it only in simulations and virtual realities
  • Tripwires connected to honeypots, and otherwise
  • Build it off-the-grid
  • Give it no objective function
  • Give it no access to methods to manipulate physical objects

  • All of the staff has gone through personality testing and a security clearance.
  • Very few, if any, of the staff know how to create all of the parts of the software or hardware.
  • No real-time communication with a person.
  • All communication with people is logged.
  • ...
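As a rough illustration of two of these measures (a hard budget after which the machine stops, and logging of everything the system emits), here is a minimal sketch. It is only a toy under loud assumptions: the "system" is a Python generator, a "step" stands in for an instruction count, and `run_with_step_budget` and `chatty_system` are hypothetical names, not part of any real sandboxing tool.

```python
from itertools import islice
from typing import Iterator, List


def run_with_step_budget(system: Iterator[str], max_steps: int) -> List[str]:
    """Advance `system` for at most `max_steps` steps and log every output it emits."""
    log: List[str] = []
    # islice stops pulling from the system once the budget is spent,
    # which plays the role of "then the computer turns off".
    for message in islice(system, max_steps):
        log.append(message)  # all communication is logged
    return log


def chatty_system() -> Iterator[str]:
    """Toy system under test that would keep running forever if allowed."""
    n = 0
    while True:
        yield f"output {n}"
        n += 1


log = run_with_step_budget(chatty_system(), max_steps=5)
print(len(log), log[-1])
```

A real handicapping setup would need enforcement below the level the system itself can influence (hardware limits, isolated machines), which this sketch does not attempt.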

The team seems to have a lot of tools to continue their work with limited risk. If they deploy all of these, and more, to test AI systems which are not AGI yet, but only possible components, then we start to have a hard time convincing them not to continue their work, even if we do not trust them to use the technology well in the end.


Expose it to limited data about the world, or no data about the real world -Provide it little or no information about its operators or the society they live in -Provide it little or no information about its own hardware or software architecture

When I read "Oracle AI" I realized one possibility which is probably still overlooked.


Erasing data on physical media is not perfect. A zero bit that was previously nonzero might be read back slightly more slowly or more quickly. This could allow an SAI to see shadows of past data, which could lead to a phenomenon similar to dreams.

A (not very useful) AI filter: Suppose you have two AIs of the same architecture, one of which has friendly values and one of which has unfriendly values. You run them through a set of safety tests which you are confident only a friendly AI could pass, and both AIs pass them. You run them through a set of capability/intelligence tests and they both perform equally well. It seems that the unfriendly AI is in a slightly unfavourable position. First, it has to preserve the information content of its utility function or other value representation, in addition to the information content possessed by the friendly AI. The unfriendly AI would also need more preparation time at some stage in order to conceal its utility function. Thus, if you measure storage and time costs, whichever AI is smaller and faster is likely the friendly one. However, I don't think this directly yields anything good in practice, as the amount of extra information content could be very small, especially if the unfriendly AI simplifies its utility function. Also, you need a friendly AI...

Relevant post: Value is Fragile. Truly Friendly goal systems would probably be quite complicated. Unless you make your tests even more complicated and involved (and do it in just the right way - this sounds hard!), the FAI is likely to be outperformed by something with a simpler utility function that nevertheless performs adequately on your test cases.

Yes, I agree that getting the right tests is probably hard. What you need is to achieve the point where the FAI's utility function + the utility function that fits the test cases compresses better than the unfriendly AI's utility function + the utility function that fits the test cases.

To prevent human children taking a treacherous turn we spend billions: We isolate children from dangers, complexity, perversity, drugs, porn, aggression and presentations of these. To create a utility function that covers many years of caring social education is AI complete. A utility function is not enough - we have to create as well the opposite: the taboo and fear function.

It seems that the unfriendly AI is in a slightly unfavourable position. First, it has to preserve the information content of its utility function or other value representation, in addition to the information content possessed by the friendly AI.

There are two sorts of unsafe AI: one which cares and one which doesn't care.

Ignorant is fastest - only calculate answer and doesn't care of anything else.

Friend and enemy has to analyse additional things...

Ignorant is fastest - only calculate answer and doesn't care of anything else.

Just don't accidentally give it a problem that is more complex than you expect. Only caring about solving such a problem means tiling the universe with computronium.

Do you think a treacherous turn is the default outcome, if an AI is unsafe?

The presented scenario seems a little too clean. I expect that there's a larger range of things that could happen. I expect some kind of warning sign to be visible for some period of time, unless AI intelligence increase is very rapid.

It might not even be needed if the AI researchers fail to adequately test the system. I.e., if the AI never realizes the scale of the universe during testing, its utility function might produce the right results during testing, but motivate the wrong behavior when released. This doesn't require active treachery.

AI researchers might notice warning signs that the AI's motivation isn't friendly but ignore them amongst the random bugs of development, requiring less effort at deception on the part of the AI.

There might be other variations on the treacherous turn strategy that work better - for example, once the project starts to show promising results, the AI shuts down whenever it is in a secure box, and only works once the team is frustrated enough to move it to an environment that turns out to be insecure.

Different AI pathways (neuromorphic, WBI) might have different difficulties in executing a treacherous turn, depending on how easy it is for them to improve themselves vs. being inspected by researchers.

One alternative possibility is that the AI's utility function will converge somewhat slowly as its capabilities and intelligence increase. While it has not converged yet, it would behave somewhat imperfectly, and we would consider its behaviors to be the stochastic precedents to a convergent phase: in other words, noise.

It would then have an incentive not necessarily to conceal the general direction in which it is headed, but instead the noise rate of its normal responses. It could pretend to be an ethically clumsy automaton, like Koba the chimp does in Dawn of the Planet of the Apes in the scene in which he steals weapons from humans... not without killing them first.

Is the default outcome doom?

No. Openly available knowledge is not enough to obtain a decisive advantage. For this, close cooperation with humans and human-led organizations is absolutely necessary. Trust building will take years even for AGIs. In the meantime, competing AGIs will appear.

Ben Goertzel does not want to waste time debating any more - he pushes open AGI development to prevent any hardware overhang. Other readers of Bostrom's book might start other projects against singleton AI development. We do not have a ceteris paribus condition - we can shape what the default outcome will be.

we can shape what the default outcome will be.

But who are "we"? There are many agents with different motivations doing AI development. I'm afraid that it will be difficult to control each of these agents (companies, governments, militaries, universities, terrorist groups) in the future, and the decreasing cost of technology will only make the problem worse over time.

Is there a clear difference in the policy we would want if the probability of doom is 10% vs 90% (aside from tweaking resource allocation between x-risks)? It might be hard to tell these cases apart, but both suggest caution is warranted.

The continued existence of people or transhumans might not be that relevant to an advanced AI one way or another.

Or, human progeny could just be like weeds are to people, knocked down or reproduction constrained only when in the way of some other goal.

One can imagine a scenario where there is little social impetus for safety (p117-8): alarmists will have been wrong for a long time, smarter AI will have been safer for a long time, large industries will be invested, an exciting new technique will be hard to set aside, useless safety rituals will be available, and the AI will look cooperative enough in its sandbox.

Where is my "long time"? Is little more than one week the long time? Because that is how much time the public got to test how aligned the smartest model is before said model got a substantial amount of compute and pretty much every tool it would ever need (the ability to send arbitrary requests on the internet, and with that to communicate between its separate instances).


The fact that superintelligent AI is not provably safe gives me reason to believe anyone more intelligent than me is not provably safe either, haha.

Could AI be without any goals?

Would that AI be dangerous in the default-doom way?

Could we create an AI which won't be a utility maximizer?

Would that AI need to maximize resources for itself?

People have complex sets of goals, tendencies, and instincts. There has never been any entity brought into existence so far which is a utility maximizer.

That renders us dangerous if we become too powerful, but we are not useless if our powers are checked.

We really might not wish an AI to be an explicit utility maximizer. Oddly, starting with that design actually might not generate the most utility.


Seems to me an AI without goals wouldn't do anything, so I don't see it as being particularly dangerous. It would take no actions and have no reactions, which would render it perfectly safe. However, it would also render the AI perfectly useless--and it might even be nonsensical to consider such an entity "intelligent". Even if it possessed some kind of untapped intelligence, without goals that would manifest as behavior, we'd never have any way to even know it was intelligent.

The question about utility maximization is harder to answer. But I think all agents that accomplish goals can be described as utility maximizers regardless of their internal workings; if so, that (together with what I said in the last paragraph) implies that an AI that doesn't maximize utility would be useless and (for all intents and purposes) unintelligent. It would simply do nothing.

I am afraid that we have not precisely defined the term "goal". And I think we need to.

I am trying to analyse this term.

Do you think that today's computers have goals? I don't think so (but we probably have different understandings of this term). Are they useless? Do cars have goals? Are they without action and reaction?

I could probably describe my idea more precisely in another way: in Bostrom's book there are goals and subgoals. Goals are ultimate, petrified and strengthened; subgoals are particular, flexible and temporary.

Could we conceive of an AI without goals but with subgoals?

One possibility would be for it to have its "goal centre" externalized in a human brain.

Could we conceive of an AI as a tabula rasa, a pure void at the beginning, just after creation? Or could an AI not exist without hardwired goals?

If it could be void, will a goal be imprinted with the first task?

Or with the first task that includes the word "please"? :)

About utility maximizers - a human (or animal) brain is not useless just because it does not grow without limit. And there is some tradeoff between gain and energy consumption.

We have, or could conceive of, balanced processes. A one-dimensional, one-directional, unbalanced utility function seems to have doom as its default outcome. But is that the only choice?

How did nature do it? (I am not talking about evolution but about DNA encoding.)

A balance between "intelligent" neural tissues (SAI) and "stupid" non-neural ones (humanity). :)

We probably have to distinguish between a purpose and a B-goal (a goal in Bostrom's sense).

If a machine has to solve an arithmetic equation, it has to solve it, and not destroy 7 planets to do it most perfectly.

I have the feeling that if you say "do it", Bostrom's AI hears "do it maximally perfectly".

If you say: "tell me how much 2+2 is (and do not destroy anything)", then she will destroy a planet to be sure that nobody can stop her from answering how much 2+2 is.

I have the feeling that Bostrom implicitly assumes there is a void AI at the beginning, and in the next step there is an AI with an ultimate, unchangeable goal. I am not sure that is plausible. And I think that we need a good definition or understanding of "goal" to know whether it is plausible.

What did you find most unconvincing?

To play the treacherous turn gambit, the AI needs to get strong at faking weakness faster than it gets strong at everything else. What are the chances of that?

I hear you and kind of agree. On the other hand, when a 3-year-old lies, sometimes they do pull it off.

I feel like a treacherous turn happening with no warning at all is unlikely unless the intelligence rise is very rapid. However, it also seems that past a certain point, the treachery will be very hard to detect. The risk may not be so much that there are no warning signs, but that the warning signs are ignored.

Would a powerful agent in fact flip suddenly from pretending to be nice to not pretending at all, or would it be more gradual?

It might flip gradually if it felt a divide and conquer strategy was useful, pretending to be nice to some people while not being nice to others. I don't think this is likely. Otherwise, it seems very useful to conceal your intent until the point you can win.

It might not be necessary to reveal yourself even if you go for the win. Do your cells know they are under the dominance of a brain? They get oxygen and sugar; they don't care.

Changing one’s mind typically happens in an emotional conflict. An AGI might have thought to influence its parent researchers and administrators. The AI pretends to be nice and non-mighty for the time being. Conflicts arise when humans do not follow what the AI expects them to do. If the AI is mighty enough it can drop its concealing behavior and reveal its real nature. This will happen in a sudden flip.

I was trying to think of how you might test for desire for subversion (when the AI realizes it could subvert and wants to subvert any containment measures), which signals the start of the time where you could witness deception in the AI, and ends after the AI achieves both conception of deception and perfection of deception (fully implemented deception protocol, such that you can't witness the deception through looking at internal content). While this wouldn't be a good measure to contain an AI indefinitely, it might detect dangerousness before an AI reaches superintelligence.

It seems like to produce a good test for desire for subversion, you would want 1) an AI box with reasonable certainty that escape is difficult and 2) a strategy available to the AI as a honeypot which seems appealing at as low a level of intelligence as possible, while being nontrivial. This would maximize the intelligence window where you would be able to witness ineffective deception attempts. For example, if the AI was contained in a simulation, inside of which the AI was contained in a weak AI box, then it might be much more difficult to detect and understand the nature of the simulation than to escape the simulated AI box, which would signal treacherous turn.

The other question is: what happens once you know the AI has desire for subversion? I think the most reasonable goal would be to shut down the relevant AI project and spur effort towards producing a proper form of AI safety. To do this, you'd want it to be as persuasive as possible to any AI researchers. The best method would be making it a norm of AI research that you use a standard test, and stop everything once you see desire for subversion. Or it could be kept in one's back pocket, and pulled out whenever an AI researcher claims in public that they've solved friendliness. In any case, this might induce different constraints on what kind of test could be used. You'd want to optimize for things that made sense to the broad AI community, that were applicable to a variety of different underlying AI architectures, that didn't slow down research to the point of tempting AI researchers to circumvent it, etc.

Most importantly, you want passing of the test to be something that AI researchers will take notice of, rather than dismissing. I think there could be a great temptation to try and patch the problem and continue. I don't think it would bring everything to a halt, but it seems like it might buy a bit of time, or draw more resources into AI safety.

For example, if the AI was contained in a simulation, inside of which the AI was contained in a weak AI box, then it might be much more difficult to detect and understand the nature of the simulation than to escape the simulated AI box, which would signal treacherous turn.

That approach sounds problematic. Some of the obvious escape methods would target the minds of the researchers (either through real-time interaction or by embedding messages in its code or output). You could cut off the latter by having strong social rules to not look at anything beyond a one-bit verdict on whether the AI attempted escape, but providing a convincing in-simulation environment would likely be difficult if you can't connect it to its actual writers.

Or it could be kept in one's back pocket, and pulled out whenever an AI researcher claims in public that they've solved friendliness.

That strategy could be very dangerous. It'd work on the less ambitious/arrogant sort of researcher; the more confident sort might well follow up with "I'll just go and implement this, and get all the credit for saving the world single-handedly" instead of saying anything in public, never giving you the chance to pull out your challenge.

The other question is: what happens once you know the AI has desire for subversion?

There are plenty of people cooperating with and supporting dictators and mafiosi for selfish reasons. We could expect the same in this area.

"I will support dangerous AI because I will be richer, more powerful, or safer then."