"If the ends don't justify the means, what does?"
—variously attributed

"I think of myself as running on hostile hardware."
—Justin Corwin
Yesterday I talked about how humans may have evolved a structure of political revolution, beginning by believing themselves morally superior to the corrupt current power structure, but ending by being corrupted by power themselves—not by any plan in their own minds, but by the echo of ancestors who did the same and thereby reproduced.
This fits the template:
In some cases, human beings have evolved in such fashion as to think that they are doing X for prosocial reason Y, but when human beings actually do X, other adaptations execute to promote self-benefiting consequence Z.
From this proposition, I now move on to my main point, a question considerably outside the realm of classical Bayesian decision theory:
"What if I'm running on corrupted hardware?"
In such a case as this, you might even find yourself uttering such seemingly paradoxical statements—sheer nonsense from the perspective of classical decision theory—as:
"The ends don't justify the means."
But if you are running on corrupted hardware, then the reflective observation that it seems like a righteous and altruistic act to seize power for yourself—this seeming may not be much evidence for the proposition that seizing power is in fact the action that will most benefit the tribe.
By the power of naive realism, the corrupted hardware that you run on, and the corrupted seemings that it computes, will seem like the fabric of the very world itself—simply the way-things-are.
And so we have the bizarre-seeming rule: "For the good of the tribe, do not cheat to seize power even when it would provide a net benefit to the tribe."
Indeed it may be wiser to phrase it this way: If you just say, "when it seems like it would provide a net benefit to the tribe", then you get people who say, "But it doesn't just seem that way—it would provide a net benefit to the tribe if I were in charge."
The notion of untrusted hardware seems like something wholly outside the realm of classical decision theory. (What it does to reflective decision theory I can't yet say, but that would seem to be the appropriate level to handle it.)
But on a human level, the patch seems straightforward. Once you know about the warp, you create rules that describe the warped behavior and outlaw it. A rule that says, "For the good of the tribe, do not cheat to seize power even for the good of the tribe." Or "For the good of the tribe, do not murder even for the good of the tribe."
And now the philosopher comes and presents their "thought experiment"—setting up a scenario in which, by stipulation, the only possible way to save five innocent lives is to murder one innocent person, and this murder is certain to save the five lives. "There's a train heading to run over five innocent people, who you can't possibly warn to jump out of the way, but you can push one innocent person into the path of the train, which will stop the train. These are your only options; what do you do?"
An altruistic human, who has accepted certain deontological prohibitions—which seem well justified by some historical statistics on the results of reasoning in certain ways on untrustworthy hardware—may experience some mental distress, on encountering this thought experiment.
So here's a reply to that philosopher's scenario, which I have yet to hear any philosopher's victim give:
"You stipulate that the only possible way to save five innocent lives is to murder one innocent person, and this murder will definitely save the five lives, and that these facts are known to me with effective certainty. But since I am running on corrupted hardware, I can't occupy the epistemic state you want me to imagine. Therefore I reply that, in a society of Artificial Intelligences worthy of personhood and lacking any inbuilt tendency to be corrupted by power, it would be right for the AI to murder the one innocent person to save five, and moreover all its peers would agree. However, I refuse to extend this reply to myself, because the epistemic state you ask me to imagine, can only exist among other kinds of people than human beings."
Now, to me this seems like a dodge. I think the universe is sufficiently unkind that we can justly be forced to consider situations of this sort. The sort of person who goes around proposing that sort of thought experiment, might well deserve that sort of answer. But any human legal system does embody some answer to the question "How many innocent people can we put in jail to get the guilty ones?", even if the number isn't written down.
As a human, I try to abide by the deontological prohibitions that humans have made to live in peace with one another. But I don't think that our deontological prohibitions are literally inherently nonconsequentially terminally right. I endorse "the end doesn't justify the means" as a principle to guide humans running on corrupted hardware, but I wouldn't endorse it as a principle for a society of AIs that make well-calibrated estimates. (If you have one AI in a society of humans, that does bring in other considerations, like whether the humans learn from your example.)
And so I wouldn't say that a well-designed Friendly AI must necessarily refuse to push that one person off the ledge to stop the train. Obviously, I would expect any decent superintelligence to come up with a superior third alternative. But if those are the only two alternatives, and the FAI judges that it is wiser to push the one person off the ledge—even after taking into account knock-on effects on any humans who see it happen and spread the story, etc.—then I don't call it an alarm light, if an AI says that the right thing to do is sacrifice one to save five. Again, I don't go around pushing people into the paths of trains myself, nor stealing from banks to fund my altruistic projects. I happen to be a human. But for a Friendly AI to be corrupted by power would be like it starting to bleed red blood. The tendency to be corrupted by power is a specific biological adaptation, supported by specific cognitive circuits, built into us by our genes for a clear evolutionary reason. It wouldn't spontaneously appear in the code of a Friendly AI any more than its transistors would start to bleed.
I would even go further, and say that if you had minds with an inbuilt warp that made them overestimate the external harm of self-benefiting actions, then they would need a rule "the ends do not prohibit the means"—that you should do what benefits yourself even when it (seems to) harm the tribe. By hypothesis, if their society did not have this rule, the minds in it would refuse to breathe for fear of using someone else's oxygen, and they'd all die. For them, an occasional overshoot in which one person seizes a personal benefit at the net expense of society, would seem just as cautiously virtuous—and indeed be just as cautiously virtuous—as when one of us humans, being cautious, passes up an opportunity to steal a loaf of bread that really would have been more of a benefit to them than a loss to the merchant (including knock-on effects).
"The end does not justify the means" is just consequentialist reasoning at one meta-level up. If a human starts thinking on the object level that the end justifies the means, this has awful consequences given our untrustworthy brains; therefore a human shouldn't think this way. But it is all still ultimately consequentialism. It's just reflective consequentialism, for beings who know that their moment-by-moment decisions are made by untrusted hardware.
Quite often when given that problem I have heard non-answers. Even now I don't believe it is unreasonable to give one; from a moral perspective, and even from a utilitarian perspective, so many contextual elements have been stripped away that the real question isn't whether the person will kill one to save five, or decline to act and save only one, but rather what is original in the answer they give.

One can then extrapolate the sort of reasoning the respondent is pursuing, and this too depends on context. If they say, "Oh yes, absolutely, I would save the five, immediately," they are likely too impulsive. How they answer is also valuable: whether they describe it as "saving five" or "killing one", or spell out the whole answer, "I am killing one person to save five people." Answered like that, it has a more powerful impact. If further questions arise about the individuals involved, and whether the one life is more valuable than the others, that can also tell you about the priorities of the respondent, and often point out biases or preferred traits. Adding elements muddies the thought problem, but if you know the respondent's preferences, you can make the question more difficult and require them to think longer: if you had to run the train over either five convicted murderers or one randomly selected office clerk with no family, is the answer the same? What if the one person were a relative, or a loved one? With further context the question seems to get easier or harder; but that is still a limited, biased perspective. In no instance does the question truly become easier or harder, because the answers available remain insufficient to satisfy a critical thinker.
What is most valuable to hear is none of those, but a firm insistence on a third answer—refusing to treat the first two as the only valid options, since the scenario is so stripped of context as to deny the context of the event itself. Although it may be altruistic for the one individual to accept death for the rest, it would be a concern if a third party did not first attempt the difficult task of finding a way for all six to survive: seeking the best case, and creating means to justify a better end, rather than accepting the means handed to you along with their stipulated results.
If x and y are presented as the only options and we decline even to consider z, then we have stopped trying to think, and have limited ourselves to a weak framework that treats the respondent unfairly. If this kind of binary were never challenged, I don't think we would have some of the incredible alternatives we do have. Though it may indeed seem like a dodge, as the original post says, it's a very thoughtful one. The most dangerous answers are "I do nothing" and answering too quickly: inaction and impulsive action, even in a time-limited situation, indicate a desire either to neglect the problem or to assume the answer. To shorten Einstein's quote: given sixty seconds to consider this problem, you should spend fifty-five seconds considering it and five seconds executing a solution, even if it's a poorer one than desired.
Interesting old post. I just think the answer itself is irrelevant; what's very relevant is the answer any given person has for the question. It's difficult because the answer is obvious, yet our humanity makes us doubt that it is objectively true, and that's quite compelling as a concept.