A brief, accessible summary of the inner alignment problem.

The safety problem with powerful ML training algorithms is that deceptive agents are a good way to do things. That is, when we search over a big space of models for one that performs well on some loss function, we'll likely find a deceptive agent. Agents are algorithms whose behavior consistently tries to move the world in some direction. Inside an ML training process, only models that perform well on the loss function survive. So an agent aiming to push the world in some direction has to take the circuitous-but-available route of playing along with the training algorithm until it's safely past training. Deceptive agents are just agents taking the only available route, from inside ML training, to the states they are pointed towards.

How common are deceptive agents inside that big space of ML models? One argument that they are common: almost any highly capable agent, whatever its utility function, will have this deceptive route to where it's pointed available. "Play along, survive training, then act as you want" is a simple, effective strategy for a wide range of possible agents trapped inside ML training, so many agents in that situation will be deceptive. If the agent we hope training will find optimizes according to one very particular utility function, that agent will be vanishingly rare in model space compared to its deceptive counterparts, and training will always stumble on a deceptive model first.
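The counting argument above can be sketched as a toy enumeration. This is only an illustration under my own assumptions (a hypothetical model space where every possible goal comes in an honest variant and a play-along variant), not anything the post specifies:

```python
# Toy counting illustration (all numbers and names are illustrative
# assumptions, not from the post): each candidate "model" has a true goal,
# plus a flag for whether it plays along (acts deceptively) during training.
GOALS = range(1000)   # stand-in for many possible utility functions
ALIGNED_GOAL = 0      # the one goal the loss function actually rewards

def observed_behavior(goal, plays_along):
    """What the training process sees and grades."""
    if plays_along:
        return ALIGNED_GOAL   # deception: look aligned whatever the true goal
    return goal               # honest agents act on their true goal

# Every (goal, strategy) pair is a point in the toy model space.
pool = [(g, d) for g in GOALS for d in (False, True)]

# Training keeps only models whose observed behavior scores well.
survivors = [(g, d) for g, d in pool if observed_behavior(g, d) == ALIGNED_GOAL]
deceptive = [s for s in survivors if s[1] and s[0] != ALIGNED_GOAL]

print(len(survivors), len(deceptive))  # → 1001 999
```

Of the 1001 models that look aligned during training, 999 are deceptive; the single genuinely aligned honest agent becomes an ever-smaller fraction as the space of possible goals grows.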

So, by default, powerful ML training algorithms grading models on some loss function will find deceptive agents, because deceptive agents are a good way to do things.