>And now the philosopher comes and presents their "thought experiment" - setting up a scenario in which, by
>stipulation, the only possible way to save five innocent lives is to murder one innocent person, and this murder is
>certain to save the five lives. "There's a train heading to run over five innocent people, who you can't possibly
>warn to jump out of the way, but you can push one innocent person into the path of the train, which will stop the
>train. These are your only options; what do you do?"

If you are looking out for yourself, it's an easy decision, at least in the United States. There is no legal requirement to save lives, but dealing with the legal consequences of putting the innocent guy in front of the train is likely to be a real pain in the ass. Therefore, do nothing.

I agree that this isn't the thought experiment that was originally proposed. If we take inventory of the questions available, we have:

* If I'm a real person with real human desires, I'd sit there and let the 5 guys get run over, as I suggest above.
* If I'm an AI that is uniformly compassionate and immune from social consequences to my actions, and there's no compelling reason to value the one above the five, then I'd probably kill one to save five.
* If I'm a person with human desires who is pretending to be perfectly compassionate, then there's a problem to solve. In this case I prefer to unask the question by stopping the pretense.

Benja --

I disagree with Tyrrell (see below), but I can give a version of Tyrrell's "trivial" formalization:

We want to show that:

Averaging over all theories T, P(T makes correct predictions | T passes 10 tests) > P(T makes correct predictions)

By Bayes' rule,

P(T makes correct predictions | T passes 10 tests) = P(T makes correct predictions) × P(T passes 10 tests | T makes correct predictions) / P(T passes 10 tests)

So our conclusion is equivalent to:

Averaging over all theories T, P(T passes 10 tests | T makes correct predictions) / P(T passes 10 tests) > 1


which is equivalent to

Averaging over all theories T, P(T passes 10 tests | T makes correct predictions) > P(T passes 10 tests)

which has to be true for any plausible definition of "makes correct predictions". The effect is only small if nearly all theories can pass the 10 tests.
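As a sanity check on the algebra, here is a minimal sketch in Python that enumerates a toy universe of theories and confirms the inequality. Everything in it is an assumption made for illustration and not part of the original argument: theories are bit strings, the prior is uniform, there are 3 tests instead of 10, and "makes correct predictions" is taken to mean "agrees with the truth on every observation, tested or not".

```python
from fractions import Fraction
from itertools import product

# Toy model (all assumptions): a theory is a tuple of 6 binary predictions.
# The first N_TESTED predictions are the ones that get tested.
N_TESTED = 3
N_FUTURE = 3
truth = (1, 0, 1, 1, 0, 0)  # hypothetical "real" observations

# Uniform prior over every possible theory.
theories = list(product((0, 1), repeat=N_TESTED + N_FUTURE))

def passes_tests(t):
    """T passes the tests: it matches the truth on the tested prefix."""
    return t[:N_TESTED] == truth[:N_TESTED]

def is_correct(t):
    """T makes correct predictions: it matches the truth everywhere."""
    return t == truth

def prob(event, given=lambda t: True):
    """Exact P(event | given) under the uniform prior."""
    pool = [t for t in theories if given(t)]
    return Fraction(sum(1 for t in pool if event(t)), len(pool))

p_correct = prob(is_correct)
p_pass = prob(passes_tests)
p_pass_given_correct = prob(passes_tests, given=is_correct)
p_correct_given_pass = prob(is_correct, given=passes_tests)

# Bayes' rule, exactly as in the derivation above.
assert p_correct_given_pass == p_correct * p_pass_given_correct / p_pass

print(p_correct, p_pass, p_pass_given_correct, p_correct_given_pass)
```

Under this definition a correct theory always passes, so P(T passes | T correct) is 1 and the size of the effect is set entirely by how rare passing is. If "correct" were instead defined only on the untested observations, the uniform prior would make passing and correctness independent and the inequality would become an equality, so the choice of definition does matter.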

I disagree with Tyrrell's conclusion. I think his fallacy is to work with the undefined concept of "the best theory", and to assume that:

  • If a theory consistent with past observations makes incorrect predictions then there was something wrong with the process by which that theory was formed. (Not true; making predictions is inherently an unreliable process.)

  • Therefore we can assume that that process produces bad theories with a fixed frequency. (Not meaningful; the observations made so far are a varying input to the process of forming theories.)

In the math above, the fallacy shows up because the set of theories that are consistent with the first 10 observations is different from the set of theories that are consistent with the first 20 observations, so the initial statement isn't really what we wanted to show. (If that fallacy is a problem with my understanding of Tyrrell's post, he should have done the "trivial" formalization himself.)

There are lots of ways to apply Bayes' Rule, and this wasn't the first one I tried, so I also disagree with Tyrrell's claim that this is trivial.

Three points in response to Eliezer's post and one of his replies:

* A limited time horizon works better than he says. If an AI wants to put its world into a state desired by humans, and it knows that the humans don't want to live in a galaxy that will explode in a year, then an AI that closes its books in 1000 years will make sure that the galaxy won't explode one year later.

* An unbounded utility works worse than he says. Recall the ^^^^ operator originally by Knuth (see that was used in the Pascal's Mugging article at

If one allows unbounded utilities, then one has allowed a utility of about 3^^^^3 that has no low-entropy representation. In other words, there isn't enough matter to represent a utility.

Humans have heads of a limited size that don't use higher math to represent their desires, so bounding the utility function doesn't limit our ability to describe human desire.
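For readers who don't have the up-arrow operator fresh in mind, here is a small Python sketch of the standard recursive definition. The function name `up_arrow` and its argument order are my own choices for illustration. It only evaluates tiny cases, since anything close to 3^^^^3 is far beyond what can be computed; the point is just to show how fast the values grow.

```python
import math

def up_arrow(a, n, b):
    """Return a followed by n up-arrows, then b (Knuth's notation).

    n == 1 is ordinary exponentiation; each extra arrow iterates
    the previous operator b times.
    """
    if n == 1:
        return a ** b
    if b == 0:
        return 1
    return up_arrow(a, n - 1, up_arrow(a, n, b - 1))

three_tet_2 = up_arrow(3, 2, 2)   # 3^^2 = 3^3
three_tet_3 = up_arrow(3, 2, 3)   # 3^^3 = 3^(3^3) = 3^27
two_pent_3 = up_arrow(2, 3, 3)    # 2^^^3 = 2^^4 = 2^16

# 3^^4 = 3^(3^^3) is already too large to build in memory, so only
# estimate how many decimal digits it would have.
digits_in_three_tet_4 = int(three_tet_3 * math.log10(3)) + 1

print(three_tet_2, three_tet_3, two_pent_3, digits_in_three_tet_4)
```

With only two arrows, 3^^4 already needs trillions of decimal digits to write down, and 3^^^^3 has four arrows.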

* Ad-hominem is a fallacy. The merit of a proposed FAI solution is a function of the solution, not who proposed it or how long it took them. An essential step toward overcoming bias is to train oneself not to commit well-known fallacies. There's a good list in "The Art of Controversy" by Schopenhauer, see

Of course, I'm bothering to say this because I have a proposed solution out. See

Hugo Mercier's citation above for "Believe it or Not" by Hasson et al. wants money to give you the article. The article is available for free from Hasson's home page at:

The direct URL is:

The premise is that a rational agent would start out convinced that this story about the alien that knows in advance what they'll decide is false.

The Kolmogorov complexity of the story about the alien is very large because we have to hypothesize some mechanism by which it can extrapolate the contents of minds. Even if I saw the alien land a million times and watched the box-picking connect with the box contents as they're supposed to, it is simpler to assume that the boxes are some stage magic trick, or even that they are an exception to the usual laws of physics.

Once we've done enough experiments that we're forced into the hypothesis that the boxes are an exception to the usual laws of physics, it's pretty clear what to do. The obvious revised laws of physics based on the new observations make it clear that one should choose just one box.

So a rational agent would do the right thing, but only because there's no way to get it to believe the backstory.