Robustness, as used in ML, means that your model continues to perform well even for inputs that are off-distribution relative to the training set.
Inner alignment refers to the following problem: How can we ensure that the policy an AI agent ends up with robustly pursues the objective that we trained it on? By default, we would only expect the policy to track the objective on the training distribution.
Both a lack of robustness and inner alignment failure thus lead to an AI agent that might do unforeseen things when it encounters off-distribution inputs.
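To make the shared symptom concrete, here is a minimal toy sketch. Everything in it is a hypothetical illustration that I made up (the corridor, the coin, and the names `always_right`, `run_episode` and `success_rate`); it is not taken from any real benchmark or library. During training the coin always sits in the rightmost cell, so a policy that simply walks right scores perfectly. Once the coin moves, the same policy fails every time.

```python
from typing import Callable, List

# Hypothetical toy setup: a 1-D corridor with cells 0..SIZE-1 and one coin.
SIZE = 5
START = 2

# A policy maps (agent position, coin position) to a step of -1 or +1.
Policy = Callable[[int, int], int]


def always_right(pos: int, coin_pos: int) -> int:
    """A policy that fits the training data perfectly but ignores the coin."""
    return +1


def run_episode(policy: Policy, coin_pos: int, max_steps: int = 10) -> bool:
    """Return True if the agent reaches the coin within max_steps."""
    pos = START
    for _ in range(max_steps):
        if pos == coin_pos:
            return True
        # Take one step and stay inside the corridor.
        pos = max(0, min(SIZE - 1, pos + policy(pos, coin_pos)))
    return pos == coin_pos


def success_rate(policy: Policy, coin_positions: List[int]) -> float:
    """Fraction of episodes in which the agent reaches the coin."""
    results = [run_episode(policy, c) for c in coin_positions]
    return sum(results) / len(results)


train_coins = [SIZE - 1] * 10  # training: coin always in the rightmost cell
shifted_coins = [0] * 10       # off-distribution: coin in the leftmost cell

print(success_rate(always_right, train_coins))    # 1.0
print(success_rate(always_right, shifted_coins))  # 0.0
```

The success rates alone do not say whether this agent "failed to find the coin" or "wanted to go right all along". Both descriptions fit the same behavior.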
What’s the difference? I can (maybe) construct a difference if I assume that AI agents have distinct “competences” and “intent”.
There is some intuition that a lack of robustness relates to competence: The self-driving car really “wanted” to bring its passengers home safely. But then it started snowing, and because the car’s vision system had only been trained in sunny weather, the car didn’t spot the red traffic light and crashed. It was an honest mistake.
And there is some intuition that an inner alignment failure relates to intent: The nascent AGI never really cared about helping humans. It just played nice because it knew it would be deleted otherwise. As soon as it became powerful enough to take over the world (a situation it didn’t encounter during training), it did so.
However, the distinction between “competences” and “intent” doesn’t seem to apply to RL agents (and maybe not even to humans). RL agents just receive inputs and select actions. I wouldn’t be able to point to the “intent” of an RL agent. So what’s the difference between robustness and inner alignment then?