200 COP in MI: Interpreting Reinforcement Learning