"If the ends don't justify the means, what does?"
—variously attributed

"I think of myself as running on hostile hardware."
—Justin Corwin
Yesterday I talked about how humans may have evolved a structure of political revolution, beginning by believing themselves morally superior to the corrupt current power structure, but ending by being corrupted by power themselves—not by any plan in their own minds, but by the echo of ancestors who did the same and thereby reproduced.
This fits the template:
In some cases, human beings have evolved in such fashion as to think that they are doing X for prosocial reason Y, but when human beings actually do X, other adaptations execute to promote self-benefiting consequence Z.
From this proposition, I now move on to my main point, a question considerably outside the realm of classical Bayesian decision theory:
"What if I'm running on corrupted hardware?"
In such a case as this, you might even find yourself uttering such seemingly paradoxical statements—sheer nonsense from the perspective of classical decision theory—as:
"The ends don't justify the means."
But if you are running on corrupted hardware, then the reflective observation that it seems like a righteous and altruistic act to seize power for yourself—this seeming may not be much evidence for the proposition that seizing power is in fact the action that will most benefit the tribe.
By the power of naive realism, the corrupted hardware that you run on, and the corrupted seemings that it computes, will seem like the fabric of the very world itself—simply the way-things-are.
And so we have the bizarre-seeming rule: "For the good of the tribe, do not cheat to seize power even when it would provide a net benefit to the tribe."
Indeed it may be wiser to phrase it this way: If you just say, "when it seems like it would provide a net benefit to the tribe", then you get people who say, "But it doesn't just seem that way—it would provide a net benefit to the tribe if I were in charge."
The notion of untrusted hardware seems like something wholly outside the realm of classical decision theory. (What it does to reflective decision theory I can't yet say, but that would seem to be the appropriate level to handle it.)
But on a human level, the patch seems straightforward. Once you know about the warp, you create rules that describe the warped behavior and outlaw it. A rule that says, "For the good of the tribe, do not cheat to seize power even for the good of the tribe." Or "For the good of the tribe, do not murder even for the good of the tribe."
And now the philosopher comes and presents their "thought experiment"—setting up a scenario in which, by stipulation, the only possible way to save five innocent lives is to murder one innocent person, and this murder is certain to save the five lives. "There's a train heading to run over five innocent people, who you can't possibly warn to jump out of the way, but you can push one innocent person into the path of the train, which will stop the train. These are your only options; what do you do?"
An altruistic human, who has accepted certain deontological prohibitions—which seem well justified by some historical statistics on the results of reasoning in certain ways on untrustworthy hardware—may experience some mental distress on encountering this thought experiment.
So here's a reply to that philosopher's scenario, which I have yet to hear any philosopher's victim give:
"You stipulate that the only possible way to save five innocent lives is to murder one innocent person, that this murder will definitely save the five lives, and that these facts are known to me with effective certainty. But since I am running on corrupted hardware, I can't occupy the epistemic state you want me to imagine. Therefore I reply that, in a society of Artificial Intelligences worthy of personhood and lacking any inbuilt tendency to be corrupted by power, it would be right for the AI to murder the one innocent person to save five, and moreover all its peers would agree. However, I refuse to extend this reply to myself, because the epistemic state you ask me to imagine can only exist among other kinds of people than human beings."
Now, to me this seems like a dodge. I think the universe is sufficiently unkind that we can justly be forced to consider situations of this sort. The sort of person who goes around proposing that sort of thought experiment might well deserve that sort of answer. But any human legal system does embody some answer to the question "How many innocent people can we put in jail to get the guilty ones?", even if the number isn't written down.
As a human, I try to abide by the deontological prohibitions that humans have made to live in peace with one another. But I don't think that our deontological prohibitions are literally inherently nonconsequentially terminally right. I endorse "the end doesn't justify the means" as a principle to guide humans running on corrupted hardware, but I wouldn't endorse it as a principle for a society of AIs that make well-calibrated estimates. (If you have one AI in a society of humans, that does bring in other considerations, like whether the humans learn from your example.)
And so I wouldn't say that a well-designed Friendly AI must necessarily refuse to push that one person off the ledge to stop the train. Obviously, I would expect any decent superintelligence to come up with a superior third alternative. But if those are the only two alternatives, and the FAI judges that it is wiser to push the one person off the ledge—even after taking into account knock-on effects on any humans who see it happen and spread the story, etc.—then I don't call it an alarm light, if an AI says that the right thing to do is sacrifice one to save five. Again, I don't go around pushing people into the paths of trains myself, nor stealing from banks to fund my altruistic projects. I happen to be a human. But for a Friendly AI to be corrupted by power would be like it starting to bleed red blood. The tendency to be corrupted by power is a specific biological adaptation, supported by specific cognitive circuits, built into us by our genes for a clear evolutionary reason. It wouldn't spontaneously appear in the code of a Friendly AI any more than its transistors would start to bleed.
I would even go further, and say that if you had minds with an inbuilt warp that made them overestimate the external harm of self-benefiting actions, then they would need a rule "the ends do not prohibit the means"—that you should do what benefits yourself even when it (seems to) harm the tribe. By hypothesis, if their society did not have this rule, the minds in it would refuse to breathe for fear of using someone else's oxygen, and they'd all die. For them, an occasional overshoot in which one person seizes a personal benefit at the net expense of society, would seem just as cautiously virtuous—and indeed be just as cautiously virtuous—as when one of us humans, being cautious, passes up an opportunity to steal a loaf of bread that really would have been more of a benefit to them than a loss to the merchant (including knock-on effects).
"The end does not justify the means" is just consequentialist reasoning at one meta-level up. If a human starts thinking on the object level that the end justifies the means, this has awful consequences given our untrustworthy brains; therefore a human shouldn't think this way. But it is all still ultimately consequentialism. It's just reflective consequentialism, for beings who know that their moment-by-moment decisions are made by untrusted hardware.
When I first read this article, the imagery of corrupted hardware caused a certain memory to pop into my head. The memory is of an interaction with my college roommate about computers. Due to various discourses I had been exposed to at the time, I was under the impression that computers were designed to have a life expectancy of about 5 years. I am not immersed in the world of computers, and this statement seemed feasible to me from an economic perspective of producer rationale within a capitalistic society. So I accepted it. I accepted that computers were designed to break, crash, or die within 4-5 years, if I could keep one that long. One day I got to talking to my roommate about this, and he shocked me by saying "not if you take care of them the way you should." How many people take their computers for regular checkups as they do their teeth, their cars, their children? How many people read the manuals that come with their computers to be best informed how to take care of them?
I am sure there are people that do, but I realized I was not one of them. I had assumed an intentional deficiency in the hardware, instead of grappling with the much more likely possibility that there was a deficiency in my usage/knowledge of the hardware.
I now return to your premise that "humans run on corrupted hardware." It is a new way to phrase an old idea: that humans are by nature evil. It is an idea I disagree with. I do not disagree with the beginning of your reasoning process, but I believe a lack of necessary knowledge about certain variables in the equation leads you down a faulty path. Therefore I will ignore the thought experiment that takes up the later portion of the essay, and instead focus on the variables in this statement:
-In some cases, human beings have evolved in such fashion as to think that they are doing X for prosocial reason Y, but when human beings actually do X, other adaptations execute to promote self-benefiting consequence Z.
The assumption that you make is that self-interest has to be selfish/individualistic—that variable Z (self-interest) makes individual benefit take unquestioned precedence over group benefit. The assumption being that the individual self is not only real, but the foundation of human consciousness.
I would argue (along with a long list of social scientists in the fields of sociology, anthropology, evolutionary psychology, social psychology, economics, literature, theology, philosophy, and probably several more) that humans contain a social self: the self is not individual cognition, but a networked entity constituted by a plurality of bodies, minds, and territories. Under my premise, the fact that people must be self-interested is not so fatalistic. There is, after all, a difference between self-interest and selfishness. What is needed is for people to be taught to understand their self as a network, not an individual, and be taught methods of self-extension.
I agree with you that humans cannot escape doing things out of self-interest, but surely you agree that some types of self-interest are more positive than others, and that the farther the notion of self is extended the greater the benefits for humanistic goals?
How can you say the hardware is corrupt before testing all the dispositions for action that it contains to the fullest?
The hardware is corrupted; that's not the same as evil. The corruptedness can easily lead to 'nice' or 'good' prosocial actions—'I am doing this soup kitchen work because I am a good person' (as opposed to trying to look good, or impress this potential ally, or signal nurturing characteristics to a potential mate, etc.).