My understanding is that the primary problem with inner-alignment is that even outer-aligned AIs may be incredibly unpredictable due to unpredictable mesa-optimization. Put another way, there is a possibility that even air-tight objective functions could be ignored by a hypothetical superintelligent AI to a sufficient degree such that the AI still ends up misaligned.

It seems as if the general consensus on LessWrong (and possibly the larger AI safety community) is that increasing slack in AI training creates safer models. I might be misremembering, but I believe Paul Christiano actually specifically argued that higher pressure to optimize to an objective function is one of the most likely sources of unsafe AI. 

However, one would imagine that the more slack is present in an AI model, the harder the inner-alignment problems are to resolve. 

I'll like to pose a hypothetical question to help me understand this dilemma a little more clearly. Imagine a world where we largely solve outer-alignment problems (while remaining unsure about inner-alignment problems) and are forced to deploy AGI models (to beat Dr. Amoral, perhaps). Do you think that reducing slack would be an effective way to balance inner and outer alignment risks?

New to LessWrong?

New Answer
New Comment

1 Answers sorted by

aogara

Nov 14, 2022

10

Interesting paper on the topic, you might’ve seen it already. They show that as you optimize a particular proxy of a true underlying reward model, there is a U-shaped loss curve: high at first, then low, then overfitting to high again.

This isn’t caused by mesa-optimization, which so far has not been observed in naturally optimized neural networks. It’s more closely related to robustness and generalization under varying amounts of optimization pressure.

But if we grant the mesa-optimizer concern, it seems reasonable that more optimization will result in more coherent inner misaligned goals and more deceptive behavior to hide them. Unfortunately I don’t think slack is really a solution because optimizing an objective really hard is what makes your AI capable of doing interesting things. Handicapping your systems isn’t a solution to AI safety, the alignment tax on it is too high.

https://arxiv.org/abs/2210.10760