Understanding Gato's Supervised Reinforcement Learning — LessWrong