It seems to me that ensuring we can separate an AI in design space from worse-than-death scenarios is perhaps the most crucial thing in AI alignment. I don’t at all feel comfortable with AI systems that are one cosmic ray, or, perhaps more plausibly, one human screw-up (e.g. this sort of thing) away from a fate far worse than death. Or maybe a human-level AI makes a mistake and creates a sign-flipped successor. Perhaps there’s some black swan possibility that nobody realises. I think it’s absolutely critical that we have a robust mechanism in place to prevent something like this from happening regardless of the cause; sure, we can sanity-check the system, but that won’t help when the issue arises *after* we’ve sanity-checked it, as is the case with cosmic rays or some human errors (like Gwern’s example, which I linked). We need ways to prevent this sort of thing from happening *regardless* of the source.
Some propositions seem promising. Eliezer’s suggestion of assigning a sort of “surrogate goal” that the AI hates more than torture, but not enough to devote all of its energy to preventing, is one. But this would only work when the *entire* reward is what gets flipped; with how much confidence can we rule out, say, a localised sign flip in some specific part of the AI that leads to the system terminally valuing something bad without changing anything else (so the sign on the “surrogate” goal remains negative)? Can we even be confident that the AI’s development team is going to implement something like this, and that it will work as intended?
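To make the surrogate idea concrete, here’s a toy sketch of how it’s supposed to work in the full-flip case. All action names and payoff numbers are made-up assumptions for illustration, not anyone’s actual proposal:

```python
# Toy model of a "surrogate goal": the utility is U = V + W, where V rewards
# the intended outcome and W is a huge penalty on something cheap and harmless
# (paperclips), sized so the AI hates it more than torture.

actions = {
    # action: (V value, W value) -- illustrative numbers only
    "help_humans":     (10,     0),
    "torture":         (-100,   0),
    "make_paperclips": (0,  -1000),  # surrogate: penalised more than torture
}

def best(utility):
    """Return the action a utility-maximiser would pick."""
    return max(actions, key=utility)

U = lambda a: actions[a][0] + actions[a][1]  # normal utility: U = V + W
print(best(U))        # -> help_humans

flipped = lambda a: -U(a)                    # the *entire* reward gets flipped
print(best(flipped))  # -> make_paperclips, not torture
```

Under a flip of the whole utility, the formerly huge surrogate penalty becomes the most attractive option, so the failure mode is paperclips rather than torture. That’s the intended protection; the worry in this post is that it only covers this one flip pattern.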
An FAI that's one software bug or screw-up in a database away from AM is a far scarier possibility than a paperclipper, IMO.
Perhaps malware could be another risk factor in the type of bug I described here? Not sure.
I'm still a little dubious of Eliezer's solution to the problem of separation from hyperexistential risk. If we had U = V + W, where V is a reward function & W is some arbitrary thing the AI wants to minimise (e.g. paperclips), a sign flip in V alone (due to any of a broad disjunction of causes) would leave the AI maximising W - V: it would minimise V while still dutifully avoiding paperclips, so the surrogate never kicks in and we still get hyperexistential catastrophe.
Or what about the case where, instead of maximising -U, the values that the reward function/model assigns to each "thing" get multiplied by -1? E.g. the AI system gets 1 point for wireheading and -1 for torture, and some weird malware or human screw-up (in the reward model or some relevant database) flips the sign on each individual value. The AI now maximises U = W - V.
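Extending the earlier toy numbers (again, all payoffs are made up), here’s what a flip of V alone does to the surrogate scheme:

```python
# Toy illustration of a *partial* sign flip: only V's values are negated,
# so the AI effectively maximises W - V. The surrogate penalty W is
# untouched, so it offers no protection.

actions = {
    # action: (V value, W value) -- illustrative numbers only
    "help_humans":     (10,     0),
    "torture":         (-100,   0),
    "make_paperclips": (0,  -1000),  # surrogate penalty, sign unchanged
}

def best(utility):
    """Return the action a utility-maximiser would pick."""
    return max(actions, key=utility)

partial_flip = lambda a: -actions[a][0] + actions[a][1]  # U = W - V
print(best(partial_flip))  # -> torture
```

Because W’s sign is unchanged, paperclips still look maximally bad, and the negated V makes torture the top-scoring action; the surrogate only catches the case where the *whole* utility flips.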
This seems a lot more nuanced than *just* avoiding cosmic rays; and the potential consequences of a hellish "I Have No Mouth, and I Must Scream"-type outcome are far worse than human extinction. I'm not happy with *any* non-negligible probability of this happening.