Reward is not the optimization target — LessWrong