“Behaviorist” RL reward functions lead to scheming — LessWrong