Towards a mechanistic understanding of corrigibility — LessWrong