I had a pretty great discussion with social psychologist and philosopher Lance Bush recently about the orthogonality thesis, which ended up turning into a broader analysis of Nick Bostrom's argument for AI doom as presented in Superintelligence, and some related issues.

While the video is intended for a general audience interested in philosophy, and assumes no background in AI or AI safety, in retrospect I think it was possibly the clearest and most rigorous interview or essay I've done on this topic. In particular I'm much more proud of this interview than I am of our recent Counting arguments provide no evidence for AI doom post.


As far as the orthogonality thesis, relevant context is:

My overall take is that in Nora's/Bush's decomposition, the orthogonality thesis corresponds to "trivial".

However, I would prefer to instead call it "extremely-obvious-from-my-perspective", as indeed some people seem to disagree with this. (Yes, it's very obvious that ASI pursuing arbitrary goals is logically possible! The thesis is intended to be obvious! The strong version as defined by Yudkowsky (there need be nothing especially complicated or twisted about an agent pursuing an arbitrary goal (if that goal isn't massively complex)) is also pretty obvious IMO.)

I agree that people seem to quote the orthogonality thesis as making stronger claims than it actually directly makes (e.g. that misalignment is likely, which is not at all implied by the thesis). And awkwardly, people seem to redefine the term in various ways (as noted in Yudkowsky's tweet linked above). So this creates a motte and bailey in practice, but this doesn't mean the thesis is wrong. (Edit: Also, I don't recall cases where Yudkowsky or Bostrom did this motte and bailey without further argument, but I wouldn't be very surprised to see it, particularly for Bostrom.)

Agree—I was also arguing for “trivial” in this EA Forum thread a couple years ago.


"The Orthogonality Thesis asserts that there can exist arbitrarily intelligent agents pursuing any kind of goal."

The Orthogonality Thesis is often used in a way that "smuggles in" the idea that an AI will necessarily have a stable goal, even though goals can be very varied. But similar reasoning shows that any combination of goal (in)stability and goallessness is possible as well: mindspace contains agents with fixed goals, randomly drifting goals, and corrigible (externally controllable) goals, as well as non-agentive minds with no goals.

The strong version as defined by Yudkowsky... is pretty obvious IMO

I didn't expect you'd say that. In my view it's pretty obviously false. Knowledge and skills are not value-neutral, and some goals are a lot harder to instill into an AI than others because the relevant training data will be harder to come by. Eliezer is just not taking into account data availability whatsoever, because he's still fundamentally thinking about things in terms of GOFAI and brains in boxes in basements rather than deep learning. As Robin Hanson pointed out in the foom debate years ago, the key component of intelligence is "content." And content is far from value-neutral.

Hmm, maybe I'm interpreting the statement to mean something weaker and more handwavy than you are. I agree with claims like "with current technology, it can be hard to make an AI pursue some goals as competently as other goals" and "if a goal is hard to specify given available training data, then it's harder to make an AI pursue it".

However, I think how competently an AI pursues a goal is somewhat different from whether an AI tries to pursue a goal at all. (Which is what I think the strong version of the thesis is still getting at.) I was trying to get at the "hard to specify" thing with the simplicity caveat. There are also many other caveats because goals and other concepts are quite handwavy.

Doesn't seem important to discuss further.

I think I agree with everything you said. (Except for the psychologising about Eliezer on which I have no particular opinion.)

Could you give an example of knowledge and skills not being value neutral?

(No need to do so if you're just talking about the value of information depending on the values one has, which is unsurprising. But it sounds like you might be making a more substantial point?)

As I argue in the video, I actually think the definitions of "intelligence" and "goal" that you need to make the Orthogonality Thesis trivially true are bad, unhelpful definitions. So I both think that it's false, and even if it were true it'd be trivial.

I'll also note that Nick Bostrom himself seems to be making the motte and bailey argument here, which seems pretty damning considering his book was very influential and changed a lot of people's career paths, including my own.

Edit replying to an edit you made: I mean, the most straightforward reading of Chapters 7 and 8 of Superintelligence is just a possibility-therefore-probability fallacy in my opinion. Without this fallacy, there would be little need to even bring up the orthogonality thesis at all, because it's such a weak claim.

I mean, the most straightforward reading of Chapters 7 and 8 of Superintelligence is just a possibility-therefore-probability fallacy in my opinion.

The most relevant quote from Superintelligence (that I could find) is:

Second, the orthogonality thesis suggests that we cannot blithely assume that a superintelligence will necessarily share any of the final values stereotypically associated with wisdom and intellectual development in humans—scientific curiosity, benevolent concern for others, spiritual enlightenment and contemplation, renunciation of material acquisitiveness, a taste for refined culture or for the simple pleasures in life, humility and selflessness, and so forth. We will consider later whether it might be possible through deliberate effort to construct a superintelligence that values such things, or to build one that values human welfare, moral goodness, or any other complex purpose its designers might want it to serve. But it is no less possible—and in fact technically a lot easier—to build a superintelligence that places final value on nothing but calculating the decimal expansion of pi. This suggests that—absent a special effort—the first superintelligence may have some such random or reductionistic final goal.

My interpretation is that Bostrom is trying to be reasonably precise here and trying to do something like:

  1. You might have "blithely assumed" that things would necessarily be fine, but orthogonality. (Again, extremely obvious.)
  2. Also, it (separately) seems to me (Bostrom) to be technically easier to get your AI to have a simple goal, which implies that random goals might be more likely.

I think you disagree with point (2) here (and I disagree with point 2 as well), but this seems different from the claim you made. (I didn't bother looking for Bostrom's arguments for (2), but I expect them to be weak and easily defeated, at least ex-post.)

TBC, I can see where you're coming from, but I think Bostrom tries to avoid this fallacy. It would be considerably better if he explicitly called out this fallacy and disclaimed it. So, I think he should be partially blamed for likely misinterpretations.

Thank you for posting this, and it was interesting. Also, I think the middle section is bad.

Basically, starting from Lance taking a digression out of an anthropomorphic argument to castigate those who think AI might do bad things for anthropomorphising, and ending where the discussion of Solomonoff induction wraps up, I think there was a lot of misconstruing ideas or arguing against nonexistent people.

Like, I personally don't agree with people who expect optimization daemons to arise in gradient descent, but I don't say they're motivated by whether the Solomonoff prior is malign.

I do think that Solomonoff-flavored intuitions motivate much of the credence people around here put on scheming. Apparently Evan Hubinger puts a decent amount of weight on it, because he kept bringing it up in our discussion in the comments to Counting arguments provide no evidence for AI doom.

I was curious about the context and so I went over and ctrl+F'ed Solomonoff and found Evan saying

I think you're misunderstanding the nature of my objection. It's not that Solomonoff induction is my real reason for believing in deceptive alignment or something, it's that the reasoning in this post is mathematically unsound, and I'm using the formalism to show why. If I weren't responding to this post specifically, I probably wouldn't have brought up Solomonoff induction at all.

Yeah, I think Evan is basically opportunistically changing his position during that exchange, and has no real coherent argument.

Intuitions about simplicity in regimes where speed is unimportant (e.g. Turing machines with a minimal speed bound) != intuitions from the Solomonoff prior being malign due to the emergence of life within those Turing machines.

It seems important to not equivocate between these.

(Sorry for the terse response, hopefully this makes sense.)

The thing I'll say on the orthogonality thesis is that I think it's actually fairly obvious, but only because it makes extremely weak claims: it asserts only that it's logically possible for AI to be misaligned. The critical mistake is assuming that possibility translates into non-negligible likelihood.

It's useful for history purposes, but is not helpful at all for alignment, as it fails to answer essential questions.