Philosopher John Danaher has written an explication and critique of Bostrom's "orthogonality thesis" from "The Superintelligent Will." To quote the conclusion:


Summing up, in this post I’ve considered Bostrom’s discussion of the orthogonality thesis. According to this thesis, any level of intelligence is, within certain weak constraints, compatible with any type of final goal. If true, the thesis might provide support for those who think it possible to create a benign superintelligence. But, as I have pointed out, Bostrom’s defence of the orthogonality thesis is lacking in certain respects, particularly in his somewhat opaque and cavalier dismissal of normatively thick theories of rationality.

As it happens, none of this may affect what Bostrom has to say about unfriendly superintelligences. His defence of that argument relies on the convergence thesis, not the orthogonality thesis. If the orthogonality thesis turns out to be false, then all that happens is that the kind of convergence Bostrom alludes to simply occurs at a higher level in the AI’s goal architecture. 

What might, however, be significant is whether the higher-level convergence is a convergence towards certain moral beliefs or a convergence toward nihilistic beliefs. If it is the former, then friendliness might be necessitated, not simply possible. If it is the latter, then all bets are off. A nihilistic agent could do pretty anything since, no goals would be rationally entailed.


New to LessWrong?

New Comment
12 comments, sorted by Click to highlight new comments since: Today at 1:08 AM

Hm, the Future Tuesday Indifference example is an interesting one. The reason it seems reflectively incoherent is because it violates an expected utility axiom if interpreted the typical way. If you calculate the expected utility of an option, but forget to add in the expected utility from future Tuesdays, you simply get the wrong answer.

However, interestingly, you can't self-modify to being a normal hedonist with only causal decision theory. If it's not tuesday, then changing to include tuesdays doesn't increase what you calculate as the expected utility. If it is tuesday, then it's too late unless you have a decision theory that allows you to treat a change to optimality as a good idea no matter when you do it.

The problem is that the utility isn't constant. If you, today are indifferent to what happens on future Tuesdays, then you will also think it's a bad thing that your future self cares what happens on that Tuesday. You will therefore replace your current self with a different self that is indifferent to all future Tuesdays, including the ones that it's in, thus preserving the goal that you have today.

Good point. I have to remember not to confuse expected utility with future utility.

I am not sure what 'accurate moral beliefs' means. By analogy with 'accurate scientific beliefs', it seems as if Mr Danaher is saying there are true morals out there in reality, which I had not thought to be the case, so I am probably confused. Can anyone clarify my understanding with a brief explanation of what he means?

Well, I suppose I had in mind the fact that any cognitivist metaethics holds that moral propositions have truth values, i.e. are capable of being true or false. And if cognitivism is correct, then it would be possible for one's moral beliefs to be more or less accurate (i.e. to be more or less representative of the actual truth values of sets of moral propositions).

While moral cognitivism is most at home with moral realism - the view that moral facts are observer-indepedent - it is also compatible with some versions of anti-realism, such as the constructivist views I occasionally endorse.

The majority of moral philosophers (a biased sample) are cognitivists, as are most non-moral philosophers that I speak to (pure anecdotal evidence). If one is not a moral cognitivist, then the discussion on my blog post will of course be unpersuasive. But in that case, one might incline towards moral nihilism, which could, as I pointed out, provide some support for the orthogonality thesis.

The post needs a direct link. The current version only links to Danaher's homepage and Bostrom's article.

[This comment is no longer endorsed by its author]Reply

Oops, lol. Fixed.

And here I was wondering if this was a paper from the esteemed Brazilian jiu jitsu coach (who does in fact have a Masters degree in philosophy.)

Rather than doing pretty much anything, it seems more likely to me that a genuinely nihilistic agent would default to doing nothing.

I think that's an interesting point. I suppose I was thinking that nihilism, at least in the way its typically discussed, holds not that doing nothing is rational but, rather, that no goals are rational (a subtle difference, perhaps). This, in my opinion, might equate with all goals being equally possible. But, as you point out, if all goals are equally possible the agent might default to doing nothing.

One might put it like this: the agent would be landed in the equivalent of a Buridan's Ass dilemma. As far as I recall, the possibility that a CPU would be landed in such a dilemma was a genuine problem in the early days of computer science. I believe there was some protocol introduced to sidestep the problem.