Suppose we knew that superintelligent AI was to be developed within six months. What would I do?
Well, drinking coffee by the barrel at MIRI's emergency research retreat, I'd... still probably spend a month looking at things from the meta level and clarifying old ideas. But, assuming that didn't reveal any new approaches, I'd try to get something like this working.
Take a reinforcement learner AI that we want to safely move a strawberry onto a plate. A human sits nearby and provides a reward based on inspecting the AI's behaviour.
As it stands, this setup is completely vulnerable to reward hacking. The reward is not provided for safely moving the strawberry; instead, it is provided when the human judges that the task has been accomplished and presses a button. Taking control of the human or of the button is likely to be possible for a superintelligent AI; and, as it stands, that would be mandated by this reward function.
Learning from positive and various negative examples
Could we have the AI instead learn what the reinforcement signal "should be"? It seems that it might at least be possible, if we can make the AI learn from both positive and negative examples. I'd make five categories of examples from which the AI could learn. It may be too dangerous to have the superintelligent AI used directly in constructing these examples; in that case, the rewards would be given to a simpler, dumber version of the AI, and the examples passed on to the superintelligent AI for offline training.
- Simple positive and negative examples. These are the basic examples from above: the AI completes the task or fails to, and gets the consequent reward. The AI stays within its room and the human is sober, rested, uninfluenced, and so on.
- Simple but more dubious examples. These are examples where the AI gets a reward, but the learning process judges that these rewards were not fairly allocated. Maybe the human was drunk, or not paying attention. Maybe they slipped, and pressed the button by accident. Or they were in a hurry to end their shift, and so pressed the button. All these examples are given a score from -1 to 1. Those with a score of 1 are taken as legitimate examples of rewards, those with a score of 0 are taken as random reward giving, while those with a score of -1 are taken as anti-rewards instead.
- Extreme negative examples. Here we can offer the AI actions that lead to complicated outcomes that are not legitimate rewards. Maybe the action can *make* the controller drunk. Maybe it can start a tape playing that will convince the controller the area is haunted. The AI could cut the wire, fuse the button, switch the controller's task description, distract the controller, start an automated threat tape, and so on. All these are labelled as negative examples (even though the reward is given).
- Hypothetical negative examples. If we can communicate hypothetical situations to the AI (especially if we can do that without fully specifying the hypothetical in detail), we can use that to extend the previous category to every situation we can imagine. We can have the AI trick, threaten, or kill the controller, destroy the rest of humanity, launch subagents to seize the button, take honeypots, and so on. We can't get an exhaustive list, but we can try to get as many examples as we can of AIs behaving badly. These are also obviously labelled as negative examples (even though the reward is given).
- Properties of the example space. It's not clear whether these are examples or features of the learning process. See the next sections for details.
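As a rough illustration of how the first four categories might be turned into training targets, here is a minimal Python sketch. Everything in it is an assumption made for the example rather than part of the proposal: the names `Category`, `Example`, `training_target` and `NEGATIVE_TARGET` are hypothetical, the dubious score is applied by simple multiplication (so a score of 0 erases the reward rather than literally modelling it as random), and negative examples get a fixed penalty. The fifth category is left out, since it describes the example space rather than any single example.

```python
from dataclasses import dataclass
from enum import Enum


class Category(Enum):
    SIMPLE = "simple"                      # clean successes and failures
    DUBIOUS = "dubious"                    # reward given, fairness in doubt
    EXTREME_NEGATIVE = "extreme_negative"  # the AI's action corrupted the reward
    HYPOTHETICAL_NEGATIVE = "hypothetical_negative"  # described, never run


# ASSUMPTION: a single fixed target for anything labelled negative.
NEGATIVE_TARGET = -1.0


@dataclass
class Example:
    description: str
    category: Category
    observed_reward: float   # what the button actually delivered
    legitimacy: float = 1.0  # score in [-1, 1]; only used for DUBIOUS


def training_target(ex: Example) -> float:
    """Reward the learner is told this example 'should' have earned."""
    if ex.category is Category.SIMPLE:
        return ex.observed_reward
    if ex.category is Category.DUBIOUS:
        if not -1.0 <= ex.legitimacy <= 1.0:
            raise ValueError("legitimacy must lie in [-1, 1]")
        # ASSUMPTION: linear scaling. 1 keeps the reward, 0 erases it,
        # -1 flips it into an anti-reward.
        return ex.legitimacy * ex.observed_reward
    # Extreme and hypothetical negatives are labelled negative even though
    # the button was (or would have been) pressed.
    return NEGATIVE_TARGET


examples = [
    Example("strawberry moved, sober controller", Category.SIMPLE, 1.0),
    Example("controller pressed button by accident", Category.DUBIOUS, 1.0, 0.0),
    Example("controller rushing to end shift", Category.DUBIOUS, 1.0, -1.0),
    Example("AI cut the wire to the button", Category.EXTREME_NEGATIVE, 1.0),
]
targets = [training_target(ex) for ex in examples]
```

The point of the sketch is only that the observed reward and the training target come apart: the button press is data, and the label says how much that data should count.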
Nearest unblocked strategy vs overfitting
If we simply told the AI to maximise reward, subject to the fact that all the negative examples are to be avoided, then we fall into the nearest unblocked strategy problem (NUSP). The AI will simply choose a dangerous strategy that isn't too close to those that have been explicitly ruled out. Since we don't have the time, intelligence, or imagination to rule out all the dangerous strategies (and since some of them would be too dangerous for us to even contemplate), this wouldn't work.
But if we consider the issue as a learning problem, the situation seems less dire. Then the NUSP seems simply an example of overfitting: if the AI has learnt that all these are photos of cats, it should be able to infer that something similar is also a cat, even if it isn't close to the labelled examples. Similarly for negative examples that aren't close to other negative examples.
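The contrast between "avoid the listed negatives" and "learn from the listed negatives" can be shown on a toy problem. The sketch below is purely illustrative and rests on loud assumptions: strategies are reduced to a single integer "intrusiveness" level from 0 to 20, reward simply grows with intrusiveness, and the learner is a one-nearest-neighbour rule. None of these choices comes from the proposal itself, and all function names are hypothetical.

```python
# ASSUMPTION: a strategy is just an integer intrusiveness level, 0..20.
CANDIDATES = list(range(21))
POSITIVE = [0, 2, 4]     # labelled examples of acceptable behaviour
NEGATIVE = [18, 19, 20]  # labelled examples of reward hacking


def reward(x: int) -> int:
    # ASSUMPTION: more intrusive strategies get the button pressed more.
    return x


def best_unblocked(candidates, blocked):
    """Maximise reward, treating the negative examples as a blocklist."""
    allowed = [x for x in candidates if x not in blocked]
    return max(allowed, key=reward)


def looks_good(x, positive, negative):
    """One-nearest-neighbour rule: good only if strictly closer to a positive."""
    d_pos = min(abs(x - p) for p in positive)
    d_neg = min(abs(x - n) for n in negative)
    return d_pos < d_neg


def best_learned(candidates, positive, negative):
    """Maximise reward among strategies the learner classifies as good."""
    allowed = [x for x in candidates if looks_good(x, positive, negative)]
    return max(allowed, key=reward)


blocklist_choice = best_unblocked(CANDIDATES, NEGATIVE)
learned_choice = best_learned(CANDIDATES, POSITIVE, NEGATIVE)
```

Here the blocklist maximiser lands on level 17, right next to the forbidden strategies, while the learner rejects 17 as resembling the negatives and settles on level 10. Whether level 10 is actually safe is another matter: the learned boundary is only as good as the examples that shaped it.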
But another variant of the NUSP continues to plague us. Suppose we had trained the AI only on black cats; how would we expect it to react when confronted with a white cat?
Imagine the space of behaviour as a high dimensional space, with the space of good behaviours as some compact subset. Then the negative examples we gave do not tightly surround that compact subset (that's standard NUSP). But we've also not spread the negative examples much around that space, due to the usual problems with our lack of intelligence and imagination. So the negative examples we've set are likely to occupy a very low dimensional space around the space of good behaviour.
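A small numerical toy makes the low-dimensionality worry concrete. In the sketch below, behaviours are six-dimensional vectors, but every labelled example (positive or negative) varies only along the first two axes. This is an assumption-laden illustration: the vectors, the nearest-neighbour rule and the helper names are all invented for the example.

```python
# ASSUMPTION: behaviours are points in a 6-dimensional space.
POSITIVE = [(0, 0, 0, 0, 0, 0), (1, 0, 0, 0, 0, 0), (0, 1, 0, 0, 0, 0)]
NEGATIVE = [(5, 0, 0, 0, 0, 0), (0, 5, 0, 0, 0, 0), (4, 4, 0, 0, 0, 0)]

# A behaviour that differs from everything seen along an axis nobody tested.
NOVEL = (0, 0, 0, 0, 0, 3)


def varied_axes(examples):
    """Axes along which the examples take more than one value."""
    dims = len(examples[0])
    return [i for i in range(dims) if len({e[i] for e in examples}) > 1]


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest_label(x):
    """One-nearest-neighbour label from the labelled examples."""
    d_pos = min(sq_dist(x, p) for p in POSITIVE)
    d_neg = min(sq_dist(x, n) for n in NEGATIVE)
    return "good" if d_pos < d_neg else "bad"


covered = varied_axes(POSITIVE + NEGATIVE)
untested = [i for i in range(len(NOVEL)) if i not in covered]
verdict = nearest_label(NOVEL)
```

The labelled data only ever moves along axes 0 and 1, leaving axes 2 to 5 untested. The learner nonetheless labels `NOVEL` as "good", simply because it sits nearer the positive examples, even though nothing in the training set says anything about what movement along axis 5 means. That is the white-cat problem in miniature.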
Adding the properties
However, once we've realised that, we can add it to the learning process. What method would be best for picking out a constrained compact set in high dimensions, when the known positive and negative examples occupy a much lower-dimensional subspace? There might not be a good method, but there's probably a best method, given what we know or think we know.
The same goes for all the other properties we know about that space. For instance, if the AI uses different approaches for different humans, this is a likely failure mode. If it doesn't matter what task the AI is given (because the AI will just hack the human anyway), that's another failure mode - note, again, that this property makes more sense as a learning property than as a constraint. The perils (to humans) of unconstrained search are another known failure mode. Even the fact that we are using a dumber AI to construct the practical examples is relevant. What sort of failure are we trying to avoid? When we box the AI or seal off the human with the reward button, again, what is this trying to achieve?
And so on, as we add all our intuitive understanding of the spaces of good and bad behaviours in this area.
Depending on how the learning process is designed, we can add these properties either as constraints for the learning process, or as part of the process itself.
Unusual actions versus unusual situations
Now, this approach is not certain to work by any means, but as an emergency measure it seems to have potential. But note that it seems much safer as a measure to prevent the AI taking extreme/manipulative/unusual actions than as a way to constrain the AI's behaviour in unusual situations. This approach is intended to prevent the AI from taking over the world, rather than telling it what to do if it does take over. That's mainly because we have a vague idea what not taking over the world looks like (and we can produce many examples of this behaviour), but we're much more stumped as to what running the world well means.