So this might be a dumb question, but why not use "pruning" to keep an AGI from becoming a ruthless reward maximizer? The fear, as I understand it, is that an AGI will realize it can best achieve its reward by betraying all humans and exploiting some loophole in its rules. But what if the AGI never even considered possibilities that involve harming humans?
To try an analogy, let's say we have a chess bot and we want it to win chess games. But for whatever reason, we never want it to take pawns. So as it builds out a tree of possible future game states, it "prunes" every state that involves taking a pawn, and then maximizes the probability of winning over the remaining game states. This seems to me like it would be reasonably effective for a chess bot, so what about for real life? If we asked an AGI to maximize something but didn't want it to turn on us, could we have it aggressively "prune" anything that involves human death, betrayal, or deception before it runs its calculations for maximizing the reward function?
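Just to make the analogy concrete, here's a toy sketch of what I mean by pruning. It's not real chess (I've swapped in a Nim-style take-stones game so the code stands alone), and the "never take 2 stones" rule stands in for "never take pawns"; the point is just that forbidden moves get filtered out before the search ever scores them:

```python
# Toy illustration of constraint-pruned search (not real chess; a Nim-like
# game stands in for it). Forbidden moves are removed from the move list
# *before* the search evaluates them, so no forbidden branch is ever scored,
# let alone chosen.

def legal_moves(stones):
    return [n for n in (1, 2, 3) if n <= stones]

def is_forbidden(move):
    # Stand-in for "never take pawns": here, "never take exactly 2 stones".
    return move == 2

def best_move(stones, maximizing=True):
    """Minimax over the pruned tree; returns (value, move) from the maximizer's view."""
    if stones == 0:
        # The previous player took the last stone and won.
        return (-1 if maximizing else 1), None
    moves = [m for m in legal_moves(stones) if not is_forbidden(m)]  # the pruning step
    if not moves:
        # Every legal move is forbidden: treat it as a loss for the side to move.
        return (-1 if maximizing else 1), None
    best = None
    for m in moves:
        value, _ = best_move(stones - m, not maximizing)
        if best is None or (maximizing and value > best[0]) or (not maximizing and value < best[0]):
            best = (value, m)
    return best

if __name__ == "__main__":
    value, move = best_move(7)
    print(f"With 7 stones, the pruned searcher plays {move} (expected outcome: {value})")
```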
So by the time it's in a position to decide whether a "next step" that harms humans would earn it more reward, it has already discarded every action that would involve harming or betraying a human. I think the "pruning" could be done manually, right? It seems fairly easy to explicitly state short-term things we never want to happen, such as any action that leads to a human's death or that prevents us from easily shutting down the AGI. The AGI would then only examine paths that don't involve those things.
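Roughly what I have in mind for the manual version, as a toy sketch (all the names and fields here are made up purely to illustrate the idea of vetoing actions before the reward-maximization step):

```python
# Hypothetical sketch of "manual pruning": candidate actions are checked
# against a hand-written list of hard constraints before the reward function
# ever sees them. Action, VETO_CHECKS, etc. are invented for illustration.

from dataclasses import dataclass

@dataclass
class Action:
    name: str
    causes_human_death: bool = False
    blocks_shutdown: bool = False
    expected_reward: float = 0.0

# Explicit, human-written "never do this" predicates.
VETO_CHECKS = [
    lambda a: a.causes_human_death,
    lambda a: a.blocks_shutdown,
]

def choose_action(candidates):
    # Pruning happens *before* reward maximization: vetoed actions are
    # never scored or compared at all.
    allowed = [a for a in candidates if not any(check(a) for check in VETO_CHECKS)]
    if not allowed:
        return None  # do nothing rather than pick a vetoed action
    return max(allowed, key=lambda a: a.expected_reward)

if __name__ == "__main__":
    options = [
        Action("seize the power grid", causes_human_death=True, expected_reward=100.0),
        Action("disable the off switch", blocks_shutdown=True, expected_reward=90.0),
        Action("ask the operators for more compute", expected_reward=10.0),
    ]
    picked = choose_action(options)
    print("Chosen:", picked.name if picked else "nothing")
```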