Multi-dimensional rewards for AGI interpretability and control