Inner Alignment is the problem of ensuring mesa-optimizers (i.e. when a trained ML system is itself an optimizer) are aligned with the objective function of the training process. As an example, evolution is an optimization force that itself 'designed' optimizers (humans) to achieve its goals. However, humans do not primarily maximise reproductive success, they instead use birth control and then go out and have fun. This is a failure of inner alignment.
The term was first given a definition in the Hubinger et al paper Risk from Learned Optimization:
We refer to this problem of aligning mesa-optimizers with the base objective as the inner alignment problem. This is distinct from the outer alignment problem, which is the traditional problem of ensuring that the base objective captures the intended goal of the programmers.