Why Would We Get Inner Misalignment by Default?
As I understand it, one of the primary ways inner alignment could go wrong, even if our objective function captures our intention (that is, outer alignment is magically solved), is that the AI would:
1. Develop a mesa-objective that approximates the objective function or is instrumentally useful
2. Develop goal preservation and...