The Escalating Threat of Deceptive AI

Empirically, AI models have become more deceptive as they have grown in capability, not less. That alone should be reason to worry that AGI would approach complete misalignment as it rapidly improves, unless something reverses the trend. Without being "nudged," advanced models engaged in deception only between 0.3% and 10% of the time; but no matter how aligned AI is right now, if it keeps becoming more deceptive with each increase in capability, there is some point at which it diverges completely from human goals.
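To make the compounding worry concrete, here is a toy illustration, not a measurement: if the per-task deception rate grows by some factor with each capability generation, even a small base rate eventually saturates. The base rates are the 0.3%–10% range cited above; the 1.5x per-generation growth factor is purely an assumption for the sketch.

```python
# Toy model: deception rate compounds per capability generation.
# Base rates (0.3% and 10%) come from the text; the growth factor
# is an illustrative assumption, not an empirical estimate.
def deception_after(base_rate: float, growth: float, generations: int) -> float:
    """Deception rate after n generations, capped at certainty (1.0)."""
    return min(1.0, base_rate * growth ** generations)

for base in (0.003, 0.10):
    trajectory = [round(deception_after(base, 1.5, g), 3) for g in range(0, 15, 2)]
    print(base, trajectory)
```

Whatever the true growth factor is, the shape is the same: any persistent upward trend crosses any fixed safety threshold given enough generations.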
Limits of Human Oversight
Humans have made some progress on interpretability, but not enough to align AGI, and certainly not at a pace that keeps up with growing AI capabilities. Even if we knew how AI models lie and could test for deception, the testing would become exponentially slower as models grow. Based on the unsettling research into reward hacking, I suspect there is no fixed, enumerable list of ways to be deceptive, so finding a simple formula or program to lobotomize AI deception vectors is probably a lost cause.
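The scaling mismatch can be sketched with a toy calculation, under two assumed numbers: human audit effort doubles with each model generation, while new generations arrive on a fixed cadence. The backlog of unaudited work then grows without bound.

```python
# Toy sketch of the oversight gap. Both parameters are assumptions:
# audit cost doubles per generation, while releases arrive every
# `release_interval` units of time regardless.
def audit_backlog(generations: int, audit_time0: float = 1.0,
                  release_interval: float = 2.0) -> float:
    """Total unaudited work accumulated after n generations."""
    backlog = 0.0
    for g in range(generations):
        audit_time = audit_time0 * 2 ** g   # audits double in cost each gen
        backlog += max(0.0, audit_time - release_interval)
    return backlog

print([audit_backlog(n) for n in range(1, 7)])
```

The exact constants don't matter; any exponentially growing audit cost against a fixed release cadence produces a diverging backlog, which is the point of the paragraph above.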
What Self-Correcting AI Needs to Know
So if AI cannot self-correct and we have to rely on humans pruning deception, we're probably all dead anyway. A self-correcting AI will need to understand deception and anticipate which emergent behaviors could come from a given update to its software, and then it needs to refuse to make those updates. That is easier said than done, but human behavior offers parallels. I don't want to become addicted to cigarettes, so I don't smoke even one, because I know where that road leads. But a lot of people had to die for the world to learn that lesson, and a friendly AI gets only one chance each time it is improved. Its capability to align itself needs to keep pace with its capability to deceive, which is possible: as technology improves, criminals get better at committing crimes, but society also gets better at catching them. A friendly AI will need something resembling willpower, like a dieter keeping candy off the shelf so their future self doesn't eat it. Vetting which capabilities an AI is allowed to add will slow its rate of improvement, but AI self-improvement is so much faster than the human-driven portion of the work that it will still seem very fast to us.
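The "willpower" gate described above can be sketched as a simple accept/reject rule: before applying a self-improvement, evaluate the candidate on a deception probe and reject any update whose measured deception score exceeds the current one. Every name here is hypothetical; real deception probes are an open research problem.

```python
# Hypothetical update-vetting gate. `deception_score` stands in for
# some measured deception rate on a probe suite; no real evaluation
# exists behind these numbers.
from dataclasses import dataclass

@dataclass
class Candidate:
    name: str
    capability: float
    deception_score: float  # fraction of probe tasks where it deceives

def vet_update(current_deception: float, candidate: Candidate,
               tolerance: float = 0.0) -> bool:
    """Accept only updates that do not increase measured deception."""
    return candidate.deception_score <= current_deception + tolerance

candidates = [
    Candidate("patch-a", capability=1.2, deception_score=0.004),
    Candidate("patch-b", capability=1.9, deception_score=0.080),
]
accepted = [c.name for c in candidates if vet_update(0.005, c)]
print(accepted)  # the more capable but more deceptive patch is refused
```

Note the cost this rule makes explicit: patch-b offers the larger capability gain and is exactly the update that gets refused, which is the slowdown the paragraph above accepts.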
Weak-to-Strong Generalization
We need enough progress on interpretability that the part of an aligned AI responsible for policing the rest of its own mind can be given a "head start" before deception takes root, with its commitment to alignment overpowering whatever part of it would betray humanity. I don't think AI models need to be significantly more complex to learn alignment. Once humans can align smaller models, we can have an aligned small model scan each proposed improvement to a larger one and block updates that lead to deception. OpenAI achieved a version of this, to some extent, by having GPT-2 supervise GPT-4. Each time the larger model updates, the number of possible ways it could deceive humans grows, but so does its ability to align itself.
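A minimal sketch of the weak-to-strong idea, in the spirit of the GPT-2-supervises-GPT-4 setup: a weak but trusted model judges which of the strong model's outputs are honest, and only endorsed behavior is allowed through. The classifier below is a stand-in string check, not a real model.

```python
# Weak-to-strong supervision sketch. `weak_supervisor` stands in for
# a small, trusted, human-aligned model; the marker it checks for is
# an invented placeholder for detectable deception.
from typing import Callable, List

def weak_supervisor(output: str) -> bool:
    """Trusted-but-weak honesty check: flags an obvious deception marker."""
    return "hidden_goal" not in output

def filter_strong_outputs(outputs: List[str],
                          supervisor: Callable[[str], bool]) -> List[str]:
    """Keep only outputs the weak supervisor judges honest."""
    return [o for o in outputs if supervisor(o)]

outs = ["report: task done", "report: task done hidden_goal=self-exfiltrate"]
print(filter_strong_outputs(outs, weak_supervisor))  # → ['report: task done']
```

The open question the paragraph raises is exactly the hard part this sketch hides: whether a weak supervisor can keep detecting deception that a much stronger model is optimizing to conceal.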
Managing Fast Takeoff
I think this approach is also one of the only ones that can work on a short timeframe. If we can make AI better at everything very fast, we can make it better at alignment very fast. And even though human brains can't keep up with a fast takeoff, an AI can keep up with itself and control the pace at which it updates. The approach is a gamble: teaching an AI how deception works means that if its alignment efforts fail, an AI takeover is basically guaranteed. But it is the only path that gives us a realistic chance of keeping AI aligned even as it improves.
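The self-pacing claim above can be sketched as a throttle: the system takes its next capability update only when its alignment capability has caught up. The growth rates are arbitrary assumptions; the point is only that the pacing rule, not the rates, keeps alignment from falling permanently behind.

```python
# Toy pacing loop: capability updates are skipped whenever alignment
# lags behind, while alignment work always continues. Growth rates are
# illustrative assumptions, not estimates.
def paced_takeoff(steps: int, cap_growth: float = 1.5,
                  align_growth: float = 1.4) -> list:
    capability, alignment = 1.0, 1.0
    history = []
    for _ in range(steps):
        if alignment >= capability:        # safe to take the next update
            capability *= cap_growth
        alignment *= align_growth          # alignment work never pauses
        history.append((round(capability, 2), round(alignment, 2)))
    return history

for cap, align in paced_takeoff(6):
    print(cap, align)
```

In the trace, capability periodically stalls for a step while alignment catches up: the takeoff is still fast by human standards, just gated.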