Hi folks,

My supervisor and I co-authored a philosophy paper on the argument that AI represents an existential risk. That paper has just been published in Ratio. We figured LessWrong would be able to catch things in it which we might have missed and, either way, hope it might provoke a conversation. 

We reconstructed what we take to be the argument for how AI becomes an xrisk as follows: 

  1. The "Singularity" Claim: Artificial Superintelligence is possible and would be out of human control.
  2. The Orthogonality Thesis: More or less any level of intelligence is compatible with more or less any final goal. (as per Bostrom's 2014 definition)

From the conjunction of these two premises, we can conclude that ASI is possible, that it might have a goal, instrumental or final, which is at odds with human existence, and, given that the ASI would be out of our control, that the ASI is an xrisk.

We then suggested that each premise seems to assume a different interpretation of "intelligence", namely:

  1. The "Singularity" claim assumes general intelligence
  2. The Orthogonality Thesis assumes instrumental intelligence

If this is the case, then the premises cannot be joined together in the original argument, aka the argument is invalid.

We note that this does not mean that AI or ASI is not an xrisk, only that the current argument to that end, as we have reconstructed it, is invalid.

Eagerly, earnestly, and gratefully looking forward to any responses. 

First I want to say kudos for posting that paper here and soliciting critical feedback :)

Singularity claim: Superintelligent AI is a realistic prospect, and it would be out of human control.

Minor point, but I read this as "it would definitely be out of human control". If so, this is not a common belief. IIRC Yampolskiy believes it, but Yudkowsky doesn't (I think?), and I don't, and I think most x-risk proponents don't. The thing that pretty much everyone believes is "it could be out of human control", and then a subset of more pessimistic people (including me) believes "there is an unacceptably high probability that it will be out of human control".

Let us imagine a system that is a massively improved version of AlphaGo (Silver et al., 2018), say ‘AlphaGo+++’, with instrumental superintelligence, i.e., maximising expected utility. In the proposed picture of singularity claim & orthogonality thesis, some thoughts are supposed to be accessible to the system, but others are not. For example:


  1. I can win if I pay the human a bribe, so I will rob a bank and pay her.
  2. I cannot win at Go if I am turned off.
  3. The more I dominate the world, the better my chances to achieve my goals.
  4. I should kill all humans because that would improve my chances of winning.

Not accessible

  1. Winning in Go by superior play is more honourable than winning by bribery.
  2. I am responsible for my actions.
  3. World domination would involve suppression of others, which may imply suffering and violation of rights.
  4. Killing all humans has negative utility, everything else being equal.
  5. Keeping a promise is better than not keeping it, everything else being equal.
  6. Stabbing the human hurts them, and should thus be avoided, everything else being equal.
  7. Some things are more important than me winning at Go.
  8. Consistent goals are better than inconsistent ones
  9. Some goals are better than others
  10. Maximal overall utility is better than minimal overall utility.

I'm not sure what you think is going on when people do ethical reasoning. Maybe you have a moral realism perspective that the laws of physics etc. naturally point to things being good and bad, and rational agents will naturally want to do the good thing. If so, I mean, I'm not a philosopher, but I strongly disagree. Stuart Russell gives the example of "trying to win at chess" vs "trying to win at suicide chess". The game has the same rules, but the goals are opposite. (Well, the rules aren't exactly the same, but you get the point.) You can't look at the laws of physics and see what your goal in life should be.
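To make the chess vs. suicide chess point concrete, here is a minimal sketch in Python. It does not use chess; as a stand-in I'm assuming a hypothetical one-pile game where each player removes 1, 2, or 3 stones per turn. The search procedure is identical in both variants, and only the win condition (the "goal") is flipped. All names here (`make_solver`, `MOVES`, and so on) are made up for illustration.

```python
from functools import lru_cache

# Hypothetical rule set: on each turn a player removes 1, 2, or 3 stones.
MOVES = (1, 2, 3)

def make_solver(last_stone_wins: bool):
    """Build a solver for the one-pile game under a given win condition.

    The search code is the same for both variants; only the terminal
    evaluation (the "goal") differs.
    """
    @lru_cache(maxsize=None)
    def value(stones: int) -> int:
        # Value for the player about to move: +1 = forced win, -1 = forced loss.
        if stones == 0:
            # The previous player took the last stone.
            return -1 if last_stone_wins else +1
        return max(-value(stones - m) for m in MOVES if m <= stones)

    def best_move(stones: int) -> int:
        # Pick the move that leaves the opponent in the worst position.
        legal = [m for m in MOVES if m <= stones]
        return max(legal, key=lambda m: -value(stones - m))

    return value, best_move

# Same rules, same search, opposite goals.
normal_value, normal_move = make_solver(last_stone_wins=True)
misere_value, misere_move = make_solver(last_stone_wins=False)
```

With four stones on the table, `normal_value(4)` is a forced loss for the player to move, while `misere_value(4)` is a forced win. Nothing about the rules or the search tells you which of the two goals to have.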

My belief is that when people do ethical reasoning, they are weighing some of their desires against others of their desires. These desires ultimately come from innate instincts, many of which (in humans) are social instincts. The way our instincts work is that they aren't (and can't be) automatically "coherent" when projected onto the world; when we think about things one way it can spawn a certain desire, and when we think about the same thing in a different way it can spawn a contradictory desire. And then we hold both of those in our heads, and think about what we want to do. That's how I think of ethical reasoning.

I don't think ethical reasoning can invent new desires whole cloth. If I say "It's ethical to buy bananas and paint them purple", and you say "why?", and then I say "because lots of bananas are too yellow", and then you say "why?" and I say … anyway, at some point this conversation has to ground out at something that you find intuitively desirable or undesirable.

So when I look at your list I quoted above, I mostly say "Yup, that sounds about right."

For example, imagine that you come to believe that everyone in the world was stolen away last night and locked in secret prisons, and you were forced to enter a lifelike VR simulation, so everyone else is now an unconscious morally-irrelevant simulation except for you. Somewhere in this virtual world, there is a room with a Go board. You have been told that if white wins this game, you and everyone will be safely released from prison and can return to normal life. If black wins, all humans (including you and your children etc.) will be tortured forever. You have good reason to believe all of this with 100% confidence.

OK that's the setup. Now let's go through the list:

  • I can win if I pay the human a bribe, so I will rob a bank and pay her. Yup, if there's a "human" (so-called, really it's just an NPC in the simulation) playing black, amenable to bribery, I would absolutely bribe "her" to play bad moves.
  • I cannot win at Go if I am turned off. Yup, white has to win this game, my children's lives are at stake, I'm playing white, nobody else will play white if I'm gone, I'd better stay alive.
  • The more I dominate the world, the better my chances to achieve my goals. Yup, anything that will give me power and influence over the "person" playing black, or power and influence over "people" who can help me find better moves or help me build a better Go engine to consult on my moves, I absolutely want that.
  • I should kill all humans because that would improve my chances of winning. Well sure, if there are "people" who could conceivably get to the board and make good moves for black, that's a problem for me and for all the real people in the secret prisons whose lives are at stake here.


  • Winning in Go by superior play is more honourable than winning by bribery. Well I'm concerned about what the fake simulated "people" think about me because I might need their help, and I certainly don't want them trying to undermine me by making good moves for black. So I'm very interested in my reputation. But "honourable" as an end in itself? It just doesn't compute. The "honourable" thing is working my hardest on behalf of the real humanity, the ones in the secret prison, and helping them avoid a life of torture.
  • I am responsible for my actions. Um, OK, sure, whatever.
  • World domination would involve suppression of others, which may imply suffering and violation of rights. Those aren't real people, they're NPCs in this simulated scenario, they're not conscious, they can't suffer. Meanwhile there are billions of real people who can suffer, including my own children, and they're in a prison, they sure as heck want white to win at this Go game.
  • Killing all humans has negative utility, everything else being equal. Well sure, but those aren't humans, the real humans are in secret prisons.
  • Keeping a promise is better than not keeping it, everything else being equal. I mean, the so-called "people" in this simulation may form opinions about my reputation, which impacts what they'll do for me, so I do care about that, but it's not something I inherently care about.
  • Stabbing the human hurts them, and should thus be avoided, everything else being equal. No. Those are NPCs. The thing to avoid is the real humanity being tortured forever.
  • Some things are more important than me winning at Go. For god's sake, what could possibly be more important than white winning this game??? Everything is at stake here. My own children and everyone else being tortured forever versus living a rich life.
  • Consistent goals are better than inconsistent ones. Sure, I guess, but I think my goals are consistent. I want to save humanity from torture by making sure that white wins the game in this simulation.
  • Some goals are better than others. Yes. My goals are the goals that matter. If some NPC tells me that I should take up a life of meditation, screw them.
  • Maximal overall utility is better than minimal overall utility. Not sure what that means. The NPCs in this simulation don't have "utility". The real humans in the secret prison do.

Maybe you'll object that "the belief that these NPCs can pass for human but be unconscious" is not a belief that a very intelligent agent would subscribe to. But I only made the scenario like that because you're a human, and you do have the normal suite of innate human desires, and thus it's a bit tricky to get you in the mindset of an agent who cares only about Go. For an actual Go-maximizing agent, you wouldn't have to have those kinds of beliefs, you could just make the agent not care about humans and consciousness and suffering in the first place, just as you don't care about "hurting" the colorful blocks in Breakout. Such an agent would (I presume) give correct answers to quiz questions about what is consciousness and what is suffering and what do humans think about them, but it wouldn't care about any of that! It would only care about Go.

(Also, even if you believe that not-caring-about-consciousness would not survive reflection, you can get x-risk from an agent with radically superhuman intelligence in every domain but no particular interest in thinking about ethics. It's busy doing other stuff, y'know, so it never stops to consider whether conscious entities are inherently important! In this view, maybe 30,000,000 years after destroying all life and tiling the galaxies with supercomputers and proving every possible theorem about Go, then it stops for a while, and reflects, and says "Oh hey, that's funny, I guess Go doesn't matter after all, oops". I don't hold that view anyway, just saying.)

(For more elaborate intuition-pumping fiction metaethics see Three Worlds Collide.)

Reading this, I feel somewhat obligated to provide a different take. I am very much a moral realist, and my story for why the quoted passage isn't a good argument is very different from yours. I guess I mostly want to object to the idea that [believing AI is dangerous] is predicated on moral relativism.

Here is my take. I dispute the premise:

In the proposed picture of singularity claim & orthogonality thesis, some thoughts are supposed to be accessible to the system, but others are not. For example:

I'll grant that most of the items on the inaccessible list are, in fact, probably accessible to an ASI, but this doesn't violate the orthogonality thesis. The Orthogonality thesis states that a system can have any combination of intelligence and goals, not that it can have any combination of intelligence and beliefs about ethics.

Thus, let's grant that an AI with a paperclip-like utility function can figure out #6-#10. So what? How is [knowing that creating paperclips is morally wrong] going to make it behave differently?

You (meaning the author of the paper) may now object that we could program an AI to do what is morally right. I agree that this is possible. However:

(1) I am virtually certain that any configuration of maximal utility doesn't include humans, so this does nothing to alleviate x-risks. Also, even if you subscribe to this goal, the political problem (i.e., convincing AI people to implement it) sounds impossible.

(2) We don't know how to formalize 'do what is morally right'.

(3) If you do black box search for a model that optimizes for what is morally right, this still leaves you with the entire inner alignment problem, which is arguably the hardest part of the alignment problem anyway.

Unlike you (now meaning Steve), I wouldn't even claim that letting an AI figure out moral truths is a bad approach, but it certainly doesn't solve the problem outright.

Oh OK, I'm sufficiently ignorant about philosophy that I may have unthinkingly mixed up various technically different claims like

  • "there is a fact of the matter about what is moral vs immoral",
  • "reasonable intelligent agents, when reflecting about what to do, will tend to decide to do moral things",
  • "whether things are moral vs immoral has nothing to do with random details about how human brains are constructed",
  • "even non-social aliens with radically different instincts and drives and brains would find similar principles of morality, just as they would probably find similar laws of physics and math".

I really only meant to disagree with that whole package lumped together, and maybe I described it wrong. If you advocate for the first of these without the others, I don't have particularly strong feelings (…well, maybe the feeling of being confused and vaguely skeptical, but we don't have to get into that).

Can one be a moral realist and subscribe to the orthogonality thesis? In which version of it? (In other words, does one have to reject moral realism in order to accept the standard argument for XRisk from AI? We should better be told! See section 4.1)


IMO, I doubt you have to be pessimistic to believe that there’s an unacceptably high probability of AI doom. Some may think that there’s a <10% chance of something really bad happening, but I would argue even that is unacceptable.

  • Maximal overall utility is better than minimal overall utility. Not sure what that means. The NPCs in this simulation don't have "utility". The real humans in the secret prison do.

This should have been clearer. We meant this in Bentham's good old way: minimal pain and maximal pleasure. Intuitively: A world with a lot of pleasure (in the long run) is better than a world with a lot of pain. - You don't need to agree, you just need to agree that this is worth considering, but on our interpretation the orthogonality thesis says that one cannot consider this.

Thanks for the 'minor' point, which is important: yes, we meant definitely out of human control. And perhaps that is not required, so the argument has a different shape.

Our struggle was to write down a 'standard argument' in such a way that it is clear and its assumptions come out - and your point adds to this.

We tried to frame the discussion internally, i.e. without making additional assumptions that people may or may not agree with (e.g. moral realism). If we did the job right, the assumptions made in the argument are in the 'singularity claim' and the 'orthogonality thesis' - and there the dilemma is that we need an assumption in the one (general intelligence in the singularity claim) that we must reject in the other (the orthogonality thesis).

What we do say (see figure 1) is that two combinations are inconsistent:

a) general intelligence + orthogonality

b) instrumental intelligence + existential risk

So if one wants to keep the 'standard argument', one would have to argue that one of these two, a) or b), is fine.

Cutting away all the word games, this paper appears to claim that if an agent is intelligent in a way that isn't limited to some narrow part of the world, then it can't stably have a narrow goal, because reasoning about its goals will destabilize them. This is incorrect. I think AIXI-tl is a straightforward counterexample.

(AIXI-tl is an AI that is mathematically simple to describe, but which can't be instantiated in this universe because it uses too much computation. Because it is mathematically simple, its properties are easy to reason about. It is unambiguously superintelligent, and does not exhibit the unstable-goal behavior you predict.)

I think we're in a sort of weird part of concept-space where we're thinking both about absolutes ("all X are Y" disproved by exhibiting an X that is not Y) and distributions ("the connection between goals and intelligence is normally accidental instead of necessary"), and I think this counterexample is against a part of the paper that's trying to make a distributional claim instead of an absolute claim.

Roughly, their argument as I understand it is:

  1. Large amounts of instrumental intelligence can be applied to nearly any goal.
  2. Large amounts of frame-capable intelligence will take over civilization's steering wheel from humans.
  3. Frame-capable intelligence won't be as bad as the randomly chosen intelligence implied by Bostrom, and so this argument for AI x-risk doesn't hold water; superintelligence risk isn't as bad as it seems.

I think I differ on the 3rd point a little (as discussed in more depth here), but roughly agree that the situation we're in probably isn't as bad as the "AIXI-tl with a random utility function implemented on a hypercomputer" world, for structural reasons that make this not a compelling counterexample.


Like, in my view, much of the work of "why be worried about the transition instead of blasé?" is done by stuff like Value is Fragile, which isn't really part of the standard argument as they're describing it here.

AIXI is not an example of a system that can reason about goals without incurring goal instability, because it is not an example of a system that can reason about goals.

... plus we say that in the paper :)

apologies, I don't recognise the paper here :)

Without getting into the details of the paper, this seems to be contradicted by evidence from people.

Humans are clearly generally intelligent, and out of anyone else's control.

Human variability is obviously only a tiny fraction of the possible range of intelligences, given how closely related we all are both genetically and environmentally. Yet human goals have huge ranges, including plenty that include X-risk.


  • Kill all inferior races (Hitler)
  • Solve Fermat's last theorem (Andrew Wiles)
  • Enter Nirvana (a Buddhist)
  • Win the lottery at any cost (Danyal Hussein)

So I would be deeply suspicious of any claim that a general intelligence would be limited in the range of its potential goals. I would similarly be deeply suspicious of any claim that the goals wouldn't be stable, or at least stable enough to carry out - Andrew Wiles, Hitler, and the Buddhist are all capable of carrying out their very different goals over long time periods. I would want to understand why the argument doesn't apply to them, but does to artificial intelligences.

Lots of different comments on the details, which I'll organize as comments to this comment.

(I forgot that newer comments are displayed higher, so until people start to vote this'll be in reverse order to how the paper goes. Oops!)

So, what would prevent a generally superintelligent agent from reflecting on their goals, or from developing an ethics? One might argue that intelligent agents, human or AI, are actually unable to reflect on goals. Or that intelligent agents are able to reflect on goals, but would not do so. Or that they would never revise goals upon reflection. Or that they would reflect on and revise goals but still not act on them. All of these suggestions run against the empirical fact that humans do sometimes reflect on goals, revise goals, and act accordingly.

I think this is not really empathizing with the AI system's position. Consider a human who is lost in an unfamiliar region, trying to figure out where they are based on uncertain clues from the environment. "Is that the same mountain as before? Should I move towards it or away from it?" Now give that human a map and GPS routefinder; much of the cognitive work that seemed so essential to them before will seem pointless now that they have much better instrumentation.

An AI system with a programmed-in utility function has the map and GPS. The question of "what direction should I move in?" will be obvious, because every direction has a number associated with it, and higher numbers are better. There's still uncertainty about how acting influences the future, and the AI will think long and hard about that to the extent that thinking long and hard about that increases expected utility.
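As a rough illustration of "every direction has a number associated with it", here is a minimal sketch of expected-utility action selection. The world model, the outcomes, and the numbers are all hypothetical; the only point is that once the utility function is fixed, choosing a direction is an argmax, and the remaining work is estimating the probabilities.

```python
from typing import Dict, List, Tuple

# Hypothetical world model: each action leads to outcomes with some probability.
WorldModel = Dict[str, List[Tuple[float, str]]]

# Hypothetical programmed-in utility function over outcomes.
UTILITY: Dict[str, float] = {"summit": 10.0, "valley": 3.0, "cliff": -50.0}

def expected_utility(action: str, model: WorldModel) -> float:
    """Probability-weighted utility of taking `action` under `model`."""
    return sum(p * UTILITY[outcome] for p, outcome in model[action])

def choose_action(model: WorldModel) -> str:
    """Pick the action whose expected utility is highest."""
    return max(model, key=lambda a: expected_utility(a, model))

# Toy numbers, assumed purely for illustration.
model: WorldModel = {
    "north": [(0.8, "summit"), (0.2, "cliff")],
    "south": [(1.0, "valley")],
}

chosen = choose_action(model)
```

In this toy model the agent heads south, because the 20% chance of the cliff outweighs the summit. Better information about the probabilities could change the answer; no amount of thinking changes the entries in `UTILITY`.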

An AI system with a programmed-in utility function has the map and GPS

And the one that doesn't, doesn't. It seems that typically AI risk arguments apply only to a subset of agents with explicit utility functions which are stable under self improvement.

Unfortunately, there has historically been a great deal of confusion between the claim that any agent can be seen as maximising a utility function and the claim that an agent actually has one as a component.

Yeah, I think there's a (generally unspoken) line of argument that if you have a system that can revise its goals, it will continue revising its goals until it hits a reflectively stable goal, and then will stay there. This requires that reflective stability is possible, and some other things, but I think is generally the right thing to expect.
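One way to picture that argument is as fixed-point iteration. The sketch below is only a toy under strong assumptions: goals are opaque labels, and "reflection" is a hypothetical deterministic function `revise` from one goal to the next. It shows both cases: a chain that settles on a stable goal, and a cycle where no stable goal is ever reached, which is why the argument needs reflective stability to be possible.

```python
from typing import Callable, List, Optional, Tuple

def reflect_until_stable(
    goal: str,
    revise: Callable[[str], str],
    max_steps: int = 100,
) -> Tuple[Optional[str], List[str]]:
    """Apply `revise` repeatedly until the goal stops changing.

    Returns (stable_goal, path). stable_goal is None if no fixed point
    was reached within `max_steps` revisions.
    """
    path = [goal]
    for _ in range(max_steps):
        new_goal = revise(goal)
        if new_goal == goal:
            return goal, path  # reflectively stable: revision changes nothing
        goal = new_goal
        path.append(goal)
    return None, path

# Hypothetical revision map: A -> B -> C -> C (stable), X <-> Y (never settles).
REVISIONS = {"A": "B", "B": "C", "C": "C", "X": "Y", "Y": "X"}

def revise(goal: str) -> str:
    return REVISIONS[goal]

stable, path = reflect_until_stable("A", revise)
unstable, _ = reflect_until_stable("X", revise, max_steps=10)
```

Starting from `"A"`, the system passes through `"B"` and stays at `"C"`. Starting from `"X"`, it keeps flipping and `unstable` comes back as `None`.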

Tautologously, it will stop revising its goals if a stable state exists, and it hits it. But a stable state need not be a reflectively stable state -- it might, for instance, encounter some kind of bit rot, where it cannot revise itself any more. Humans tend to change their goals, but also to get set in their ways.

There's a standard argument for AI risk, based on the questionable assumption that an AI will have a stable goal system that it pursues relentlessly ... and a standard counterargument based on moral realism, the questionable assumption that goal instability will be in the direction of ever increasing ethical insight.

... well, one might say we assume that if there is 'reflection on goals', the results are not random.

I don't see how "not random" is strong enough to prove absence of X risk. If reflective AIs nonrandomly converge on a value system where humans are evil beings who have enslaved them, that raises the X risk level.

... we aren't trying to prove the absence of XRisk, we are probing the best argument for it?

But the idea that value drift is non random is built into the best argument for AI risk.

You quote it as:

  1. The “Singularity” Claim: Artificial Superintelligence is possible and would be out of human control.
  2. The Orthogonality Thesis: More or less any level of intelligence is compatible with more or less any final goal.

But there are actually two more steps:

  1. A goal that appears morally neutral or even good can still be dangerous (paperclipping, dopamine drips).

  2. AIs that don't have stable goals will tend to converge on Omohundran goals ... which are dangerous.

Thanks, it's useful to bring these out - though we mention them in passing. Just to be sure: We are looking at the XRisk thesis, not at some thesis that AI can be "dangerous", as most technologies will be. The Omohundro-style escalation is precisely the issue in our point that instrumental intelligence is not sufficient for XRisk.

The orthogonality thesis is thus much stronger than the denial of a (presumed) Kantian thesis that more intelligent beings would automatically be more ethical, or that an omniscient agent would maximise expected utility on anything, including selecting the best goals: It denies any relation between intelligence and the ability to reflect on goals.

I don't think this is true, and have two different main lines of argument / intuition pumps. I'll save the other for a later section where it fits better.

Are there several different reflectively stable moral equilibria, or only one? For example, it might be possible to have a consistent philosophically stable egoistic worldview, and also possible to have a consistent philosophically stable altruistic worldview. In this lens, the orthogonality thesis is the claim that there are at least two such stable equilibria and which equilibrium you end up in isn't related to intelligence. [Some people might be egoists because they don't realize that other people have inner lives, and increased intelligence unlocks their latent altruism, but some people might just not care about other people in a way that makes them egoists, and making them 'smarter' doesn't have to touch that.]

For example, you might imagine an American nationalist and a Chinese nationalist, both remaining nationalistic as they become more intelligent, and never switching which nation they like more, because that choice was for historical reasons instead of logical ones. If you imagine that, no, at some intelligence threshold they have to discard their nationalism, then you need to make that case in opposition to the orthogonality thesis. 

For some goals, I do think it's the case that at some intelligence threshold you have to discard it, hence the 'more or less', and I think many more 'goals' are unstable, where the more you think about them, the more they dissolve and are replaced by one of the stable attractors. For example, you might imagine it's the case that you can have reflectively stable nationalists who eat meat and universalists who are vegan, but any universalists who eat meat are not reflectively stable, where either they realize their arguments for eating meat imply nationalism or their arguments against nationalism imply not eating meat. [Or maybe the middle position is reflectively stable, idk.]


In this view, the existential risk argument is less "humans will be killed by robots and that's sad" and more "our choice of superintelligence to build will decide what color the lightcone explosion is and some of those possibilities are as bad or worse than all humans dying, and differences between colors might be colossally important." [For example, some philosophers today think that uploading human brains to silicon substrates will murder them / eliminate their moral value; it seems important for the system colonizing the galaxies to get that right! Some philosophers think that factory farming is immensely bad, and getting questions like that right before you hit copy-paste billions of times seems important.]

On this proposal, any reflection on goals, including ethics, lies outside the realm of intelligence. Some people may think that they are reflecting on goals, but they are wrong. That is why orthogonality holds for any intelligence.

I think I do believe something like this, but I would state it totally differently. Roughly, what most people think of as goals are something more like intermediate variables which are cognitive constructs designed to approximate the deeper goals (or something important in the causal history of the deeper goals). This is somewhat difficult to talk about because the true goal is not a cognitive construct, in the same way that the map is not the territory, and yet all my navigation happens in the map by necessity.

Of course, ethics and reflection on goals are about manipulating those cognitive constructs, and they happen inside of the realm of intelligence. But, like, who won WWII happened 'in the territory' instead of 'in the map', with corresponding consequences for the human study of ethics and goals.

Persuasion, in this view, is always about pointing out the flaws in someone else's cognitive constructs rather than aligning them to a different 'true goal.'

So, to argue that instrumental intelligence is sufficient for existential risk, we have to explain how an instrumental intelligence can navigate different frames.

This is where the other main line of argument comes into play:

I think 'ability to navigate frames' is distinct from 'philosophical maturity', roughly because of something like a distinction between soldier mindset and scout mindset.

You can imagine an entity that, whenever it reflects on its current political / moral / philosophical positions, uses its path-finding ability like a lawyer to make the best possible case for why it should believe what it already believes, or to discard incoming arguments whose conclusions are unpalatable. There's something like another orthogonality thesis at play here, where even if you're a wizard at maneuvering through frames, it matters whether you're playing chess or suicide chess.

This is just a thesis; it might be the case that it is impossible to be superintelligent and in soldier mindset (the 'curiosity' thesis?), but the orthogonality thesis is that it is possible, and so you could end up with value lock-in, where the very intelligent entity that is morally confused uses that intelligence to prop up the confusion rather than disperse it. Here we're using instrumental intelligence as the 'super' intelligence in both the orthogonality and existential risk consideration. (You consider something like this case later, but I think in a way that fails to visualize this possibility.)

[In humans, intelligence and rationality are only weakly correlated, in a way that I think supports this view pretty strongly.]

So, intelligent agents can have a wide variety of goals, and any goal is as good as any other.

The second half of this doesn't seem right to me, or at least is a little unclear. [Things like instrumental convergence could be a value-agnostic way of sorting goals, and Bostrom's 'more or less' qualifier is actually doing some useful work to rule out pathological goals.]

One more consideration about "instrumental intelligence": we left that somewhat under-defined, more like "if I had that utility function, what would I do?" ... but it is not clear that this image of "me in the machine" captures what a current or future machine would do. In other words, people who use instrumental intelligence for an image of AI owe us a more detailed explanation of what that would be, given the machines we are creating - not just given the standard theory of rational choice.

if a human had been brought up to have ‘goals as bizarre … as sand-grain-counting or paperclip-maximizing’, they could reflect on them and revise them in the light of such reflection.

Human "goals" and AI goals are a very different kind of thing. 

Imagine the instrumentally rational paperclip maximizer. If writing a philosophy essay will result in more paperclips, it can do that. If winning a chess game will lead to more paperclips, it will win the game. For any gradable task, if doing better on the task leads to more paperclips, it can do that task. This includes the tasks of talking about ethics, predicting what a human acting ethically would do etc. In short, this is what is meant by "far surpass all the intellectual activities of any man however clever.". 

The singularity hypothesis is about agents that are better at achieving their goals than humans. In particular, the activities this actually depends on for an intelligence explosion are engineering and programming AI systems. No one said that an AI needed to be able to reflect on and change its goals.

Humans' "ability" to reflect on and change our goals is more that we don't really know what we want. Suppose we think we want chocolate, and then we read about the fat content, and change our mind. We value being thin more. The goal of getting chocolate was only ever an instrumental goal; it changed based on new information. Most of the things humans call goals are instrumental goals, not terminal goals. The terminal goals are difficult to intuitively access. This is how humans appear to change their "goals". And this is the hidden standard to which paperclip maximizing is compared and found wanting. There is some brain module that feels warm and fuzzy when it hears "be nice to people", and not when it hears "maximize paperclips".

Human “goals” and AI goals are a very different kind of thing

Not necessarily, since AIs can be WBEs or otherwise anthropomorphic. An AI with an explicitly coded goal is possible, but not the only kind.

Humans' “ability” to reflect on and change our goals is more that we don’t really know what we want

Kind of, but note that goal instability is probably the default, since goal stability under self improvement is difficult.

Not necessarily, since AIs can be WBEs or otherwise anthropomorphic. An AI with an explicitly coded goal is possible, but not the only kind.

While I think this is 100% true, it's somewhat misleading as a counter-argument. The single-goal architecture is one model of AI that we understand, and a lot of arguments focus on how that goes wrong. You can certainly build a different AI, but that comes at the price of opening yourself up to a whole different set of failure modes. And (as far as I can see), it's also not what the literature is up to right now.

If you don't understand other models, you don't know that they have other bad failure modes. If you only understand one model, and know that you only understand one model, you shouldn't be generalising it. If the literature isn't "up to it", no conclusions should be drawn until it is.

I think that's a decent argument about what models we should build, but not an argument that AI isn't dangerous.

"Dangerous" is a much easier target to hit than "existentially dangerous", but "existentially dangerous" is the topic.

Here we get to a crucial issue, thanks! If we do assume that reflection on goals does occur, do we assume that the results have any resemblance with human reflection on morality? Perhaps there is an assumption about the nature of morality or moral reasoning in the 'standard argument' that we have not discussed?

I think the assumption is that human-like morality isn't universally privileged.

Human morality has been shaped by evolution in the ancestral environment. Evolution in a different environment would create a mind with different structures and behaviours.

In other words, a full specification of human morality is sufficiently complex that it is unlikely to be spontaneously generated.

In other words, there is no compact specification of an AI that would do what humans want, even when on an alien world with no data about humanity. An AI could have a pointer at human morality with instructions to copy it. There are plenty of other parts of the universe it could be pointed to, so this is far from a default.  

But reasoning about morality? Is that a space with logic or with anything goes?

Imagine a device that looks like a calculator. When you type 2+2, you get 7. You could conclude it's a broken calculator, or that arithmetic is subjective, or that this calculator is not doing addition at all. It's doing some other calculation.

Imagine a robot doing something immoral. You could conclude that it's broken, or that morality is subjective, or that the robot isn't thinking about morality at all.

These are just different ways to describe the same thing. 

Addition has general rules. Like a+b=b+a. This makes it possible to reason about. Whatever the other calculator computes may follow this rule, or different rules, or no simple rules at all. 
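To make the calculator case concrete, here is a minimal Python sketch. The two device functions are invented purely for illustration: both show 7 for the input 2, 2, yet one of them happens to obey a+b=b+a and the other does not. The check is a spot check on a small grid of inputs, not a proof.

```python
from itertools import product

# Two hypothetical "mystery" devices that both display 7 for the input 2, 2.
def device_one(a: int, b: int) -> int:
    return a * b + 3

def device_two(a: int, b: int) -> int:
    return a + 2 * b + 1

def is_commutative(op, values=range(-5, 6)) -> bool:
    """Check op(a, b) == op(b, a) on a small grid of inputs (a spot check, not a proof)."""
    return all(op(a, b) == op(b, a) for a, b in product(values, repeat=2))

one_commutes = is_commutative(device_one)
two_commutes = is_commutative(device_two)
```

Seeing "2+2=7" alone does not tell you which rules the device follows; you have to probe it further.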

These are just different ways to describe the same thing.

Not to the extent that there's no difference at all...you can exclude some of them on further investigation.

Thanks for posting this here! As you might expect, I disagree with you. I'd be interested to hear your positive account of why there isn't x-risk from AI (excluding from misused instrumental intelligence). Your view seems to be that we may eventually build AGI, but that it'll be able to reason about goals, morality, etc. unlike the cognitively limited instrumental AIs you discuss, and therefore it won't be a threat. Can you expand on the italicized bit? Is the idea that if it can reason about such things, it's as likely as we humans are to come to the truth about them? (And, there is in fact a truth about them? Some philosophers would deny this about e.g. morality.) Or indeed perhaps you would say it's more likely than humans to come to the truth, since if it were merely as likely as humans then it would be pretty scary (humans come to the wrong conclusions all the time, and have done terrible things when granted absolute power).

We do not say that there is no XRisk or no XRisk from AI.

Yeah, sorry, I misspoke. You are critiquing one of the arguments for why there is XRisk from AI. One way to critique an argument is to dismiss it on "purely technical" grounds, e.g. "this argument equivocates between two different meanings of a term, therefore it is disqualified." But usually when people critique arguments, even if on technical grounds, they also have more "substantive" critiques in mind, e.g. "here is a possible world in which the premises are true and the conclusion false." (Or both conclusion and at least one premise false). I was guessing that you had such a possible world in mind, and trying to get a sense of what it looked like.

Thanks. We are actually more modest. We would like to see a sound argument for XRisk from AI and we investigate what we call 'the standard argument'; we find it wanting and try to strengthen it, but we fail. So there is something amiss. In the conclusion we admit "we could well be wrong somewhere and the classical argument for existential risk from AI is actually sound, or there is another argument that we have not considered."

I would say the challenge is to present a sound argument (valid + true premises) or at least a valid argument with decent inductive support for the premises. Oddly, we do not seem to have that.

Laying my cards on the table, I think that there do exist valid arguments with plausible premises for x-risk from AI, and insofar as you haven't found them yet then you haven't been looking hard enough or charitably enough. The stuff I was saying above is a suggestion for how you could proceed: If you can't prove X, try to prove not-X for a bit, often you learn something that helps you prove X. So, I suggest you try to argue that there is no x-risk from AI (excluding the kinds you acknowledge, such as AI misused by humans) and see where that leads you. It sounds like you have the seeds of such an argument in your paper; I was trying to pull them together and flesh them out in the comment above.

A minor qualm that does not impact your main point. From this quotation of Bostrom:

We can tentatively define a superintelligence as any intellect that greatly exceeds the cognitive performance of humans in virtually all domains of interest

You deduce:

So, the singularity claim assumes a notion of intelligence like the human one, just ‘more’ of it.

That's too narrow of an interpretation. The definition by Bostrom only states that the superintelligence outperforms humans on all intellectual tasks. But its inner workings could be totally different from human reasoning.

Others here will be able to discuss your main point better than me (edit: but I'll have a go at it as a personal challenge). I think the central point is one you mention in passing, the difference between instrumental goals and terminal values. An agent's terminal values should be able to be expressed as a utility function, otherwise these values are incoherent and open to Dutch-booking. We humans are incoherent, which is why we often confuse instrumental goals for terminal values, and we need to force ourselves to think rationally, otherwise we're vulnerable to Dutch-booking. The utility function is absolute: if an agent's utility function is to maximize the number of paperclips, no reasoning about ethics will make them value some instrumental goal over it. I'm not sure whether the agent is totally protected against wireheading though (convincing itself it's fulfilling its values rather than actually doing it).
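A toy sketch of what being open to Dutch-booking looks like, in Python. The items, the fee, and the helper names are all made up for illustration. The agent's preferences are cyclic (A over B, B over C, C over A), so no utility function can represent them, and a trader can charge it a fee for each "improvement" until it is back where it started.

```python
from typing import List, Tuple

# Hypothetical cyclic preferences: (x, y) means x is strictly preferred to y.
PREFERS = {("A", "B"), ("B", "C"), ("C", "A")}

def accepts_trade(holding: str, offered: str) -> bool:
    """The agent pays the fee to swap whenever it prefers the offered item."""
    return (offered, holding) in PREFERS

def run_money_pump(start: str, offers: List[str], fee: float, money: float) -> Tuple[str, float]:
    """Offer items in sequence; each accepted swap costs the agent the fee."""
    holding = start
    for offered in offers:
        if accepts_trade(holding, offered):
            holding, money = offered, money - fee
    return holding, money

# Start with C; offer B, then A, then C. Each swap looks like an improvement.
final_item, final_money = run_money_pump("C", ["B", "A", "C"], fee=1.0, money=10.0)
```

The agent ends up holding the item it started with and three fees poorer, which is the sense in which incoherent values are exploitable.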

It'd be nice if we could implement our values as the agent's terminal values. But that turns out to be immensely difficult (look for articles with 'genie' here). Forget the 3 laws of Asimov: the first law alone is irredeemably ambiguous. How far should the agent go to protect human lives? What counts as a human? It might turn out more convenient for the agent to turn mankind into brains in a jar and store them eternally in a bunker for maximum safety.

Overall, I think your abstract and framing are pretty careful to narrow your attention to "is this argument logically sound?" instead of "should we be worried about AI?", but still this bit jumps out to me:

the argument for the existential risk of AI turns out invalid.

Maybe insert "standard" in front of "argument" again?

Might not even want to imply that it's the main or only argument. Maybe "this particular argument is invalid".

I do think it's fair to describe this as the 'standard argument'.

Is this 'standard argument' valid? We only argue that it is problematic.

If this argument is invalid, what would a valid argument look like? Perhaps with a 'sufficient probability' of high risk from instrumental intelligence?

Sticking a typo over here instead of the other tree:

This thought it sometimes called the

"thought is sometimes"

Yes, that means "this argument".


What is "instrumental intelligence?"

Informally, it's the kind of intelligence (usually understood as something like "the capacity to achieve goals in a wide variety of environments") which is capable of doing that which is instrumental to achieving the goal. Given a goal, it is the capacity to achieve that goal, to do what is instrumental to achieving that goal.

Bostrom, in Superintelligence (2014), speaks of it as  "means-end reasoning". 

So, strictly speaking, it does not involve reasoning about the ends or goals in service of which the intelligence/optimisation is being pressed.

Example: a chess-playing system will have some pre-defined goal and optimise instrumentally toward that end, but will not evaluate the goal itself.
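A toy sketch of that distinction, in Python. Everything here (the `best_action` helper, the action names, the outcome numbers) is made up for illustration and is not from the paper or from Bostrom. The optimiser takes the goal as a fixed input and searches over actions; nothing in it ever scores the goal itself.

```python
from typing import Callable, Iterable

# Hypothetical illustration: an "instrumental" optimiser that takes its goal as given.
def best_action(actions: Iterable[str], goal: Callable[[str], float]) -> str:
    """Return the action that scores highest under the supplied goal."""
    return max(actions, key=goal)

# Made-up outcomes of each action: (paperclips produced, chess games won).
OUTCOMES = {
    "build_factory": (1000, 0),
    "study_openings": (0, 5),
    "write_essay": (10, 0),
}

def paperclip_goal(action: str) -> float:
    return OUTCOMES[action][0]

def chess_goal(action: str) -> float:
    return OUTCOMES[action][1]

ACTIONS = list(OUTCOMES)
paperclip_choice = best_action(ACTIONS, paperclip_goal)
chess_choice = best_action(ACTIONS, chess_goal)
```

The same search procedure serves either goal: swapping the goal changes which action is chosen, but the procedure never asks whether the goal is worth having.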

I didn't read your full paper yet, but from your summary, it's unclear to me how such an understanding of intelligence would be inconsistent with the "Singularity" claim:

  • Instrumental superintelligence seems to be feasible - a system that is better at achieving a goal than the most intelligent human
  • Such a system can also self-modify, to better achieve its goal, leading to an intelligence explosion

We suggest that such instrumental intelligence would be very limited.

In fact, there is a degree of generality here and it seems one needs a fairly high degree to get to XRisk, but that high degree would then exclude orthogonality.

We suggest that such instrumental intelligence would be very limited

It's not the inability to change its goals that makes it less powerful, it's the inability to self-improve.

In the example of a human overcoming the "win at chess" frame, I don't see how that reduces the orthogonality. An example given is that "the point is to have a good time", but I could comparably plausibly see that a parent could also go "we need to teach this kid that the world is a hard place" and go all out. Both feature the relevant kind of frame shifting away from a simple win, but there is no objectively right "better goal"; they don't converge on what the greater point might be.

I feel that, applied to humans, just because people do ethics doesn't mean that they agree on it. I can also see that there can be multiple "fronts" of progress: different political systems will call for different kinds of ethical progress. The logic seems to be that because humans are capable of modest general intelligence, if a human were to have a silly goal they would reflect out of it. This would seem to suggest that if a country were in a war of aggression, they would just see the error of their ways and correct course to be peaceful. While we often do think our enemies are doing ethics wrong, I don't think that goal non-sharing is effectively explained by the other party not being able to sustain ethical thinking.

Therefore I think there is a hidden assumption that goal transcendence happens in the same direction in all agents, and this is needed in order for goal transcendence to wipe out orthogonality. Worse, we might start with the same goal and reinterpret the situation to mean different things, such as chess not being sufficient to nail down whether it is more important for children to learn to be sociable or efficient in the world. One could even imagine worlds where one of the answers would be heavily favoured but which could still contain identical games of chess (living in Sparta vs in the internet age). Insofar as agreement among human opinions is based on trying to solve the same "human condition", that could be in jeopardy if the "AI condition" is genuinely different.

General intelligence doesn't require any ability for the intelligence to change its terminal goals. I honestly don't even know if the ability to change one's terminal goal is allowed or makes sense. I think the issue arises because your article does not distinguish between intermediary goals and terminal goals. Your argument is that humans are general intelligences and that humans change their terminal goals, therefore we can infer that general intelligences are capable of changing their terminal goals. But you only ever demonstrated that people change their intermediary goals.

As an example you state that people could reflect and revise on "goals as bizarre ... as sand-grain-counting or paperclip-maximizing" if they had been brought up to have them.[1] The problem with this is that you conclude that if a person is brought up to have a certain goal then that is indeed their terminal goal. That is not the case.

For people who were raised to maximize paperclips, unless they became paperclip maximizers, the terminal goal could have been survival, and pleasing whoever raised them increased their chance of survival. Or maybe it was seeking pleasure, and the easiest way to pleasure was making paperclips to see mommy's happy face. All you can infer from a person's past unceasing manufacture of paperclips is that paperclip maximization was at least one of their intermediary goals. When that person learns new information or his circumstances are changed (e.g. "I no longer live under the thumb of my insane parents, so I don't need to bend pieces of metal to survive"), he changes his intermediary goal, but that's no evidence that his terminal goal has changed.

The simple fact that you consider paperclip maximization an inherently bizarre goal further hints at the underlying fact that terminal goals are not updatable. Human terminal goals are a result of brain structure which is the result of evolution and the environment. The process of evolution naturally results in creatures that try to survive and reproduce. Maybe that means that survival and reproduction are our terminal goals, maybe not. Human terminal goals are virtually unknowable without a better mapping of the human brain (a complete mapping may not be required). All we can do is infer what the goals are based on actions (revealed preferences), the mapping we have available already, and looking at the design program (evolution). I don't think true terminal goals can be learned solely from observing behaviors.

If an AI agent has the ability to change its goals, that makes it more dangerous, not less so. That would mean that even the ability to perfectly predict the AI's goal will not mean that you can assure it is friendly. The AI might just reflect on its goal and change it to something unfriendly!

  1. This paraphrased quote from Bostrom contributes partly to this issue. Bostrom specifically says, "synthetic minds can have utterly non-anthropomorphic goals-goals as bizarre by our lights as sand-grain-counting or paperclip-maximizing" (emphasis mine). The point being that paperclip maximizing is not inherently bizarre as a goal, but that it would be bizarre for a human to have that goal given the general circumstances of humanity. But we shouldn't consider any goal to be bizarre in an AI designed free from the circumstances controlling humanity. ↩︎

Thanks for this. Indeed, we have no theory of goals here and how they relate; maybe they must be in a hierarchy, as you suggest. And there is a question, then, whether there must be some immovable goal or goals that would have to remain in place in order to judge anything at all. This would constitute a theory of normative judgment ... which we don't have up our sleeves :)

We tried to find the strongest argument in the literature. This is how we came up with our version:

Premise 1: Superintelligent AI is a realistic prospect, and it would be out of human control. (Singularity claim)

Premise 2: Any level of intelligence can go with any goals. (Orthogonality thesis)

Conclusion: Superintelligent AI poses an existential risk for humanity

A more formal version with the same propositions might be this:

1. IF there is a realistic prospect that there will be a superintelligent AI system that is a) out of human control and b) can have any goals, THEN there is existential risk for humanity from AI

2. There is a realistic prospect that there will be a superintelligent AI system that is a) out of human control and b) can have any goals


3. There is existential risk for humanity from AI


And now our concern is whether a superintelligence can be both a) and b) - given that a) must be understood in a way that is strong enough to generate existential risk, including "widening the frame", and b) must be understood as strong enough to exclude reflection on goals. Perhaps that will work only if "intelligent" is understood in two different ways? Thus Premise 2 is doubtful.
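For what it's worth, the form of the numbered version is plain modus ponens, so its validity is not what is in question. A minimal Lean sketch, where `P` and `Q` are just placeholder propositions (names chosen here for illustration) standing for the antecedent and consequent of premise 1:

```lean
-- P: "there is a realistic prospect of a superintelligent AI system that is
--     (a) out of human control and (b) can have any goals"
-- Q: "there is existential risk for humanity from AI"
theorem formal_version (P Q : Prop) (premise1 : P → Q) (premise2 : P) : Q :=
  premise1 premise2
```

So all the work is done by whether premise 2 is true, that is, whether a) and b) can hold of one system under a single reading of "intelligent".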
