Welp, we're doomed (/s), as soon as someone figures out how to get 100 million tries at taking over the world so we can crush the world-taking-over problem with stochastic gradient descent.
Meta-learning and transfer learning. You take over 100 million different simulated worlds, and the actual real world is a doddle.
Yeah, it's interesting that this works so well, but I think the best way to think of this is as a middle ground between full model-based RL and model-free RL. Its data efficiency isn't going to be optimal, because the learned model effectively throws away the information carried by the observations. However, by making that choice, it doesn't need to model irrelevant details, so it ends up with a very accurate and effective MCTS. As a result, I'd wager that with smaller neural networks or more experience, completely model-free RL would outperform this agent, because all the modelling power can be focused on representing the policy. Likewise, with larger networks or less experience, I would expect this to fall behind MBRL that also predicts observations, because the latter would be more data efficient.
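To make the distinction concrete, here's a rough sketch of the two training signals I have in mind. This is not the paper's implementation: the linear "networks", the dimensions, and every name below (`unroll`, `value_equivalent_loss`, `observation_predicting_loss`, the `decode` head) are hypothetical, picked only to show that one loss never looks at future observations while the other does.

```python
import numpy as np

rng = np.random.default_rng(0)
OBS_DIM, LATENT_DIM, NUM_ACTIONS = 8, 4, 3  # arbitrary toy sizes (assumption)

# Hypothetical linear stand-ins for the learned networks.
params = {
    "represent": rng.normal(size=(OBS_DIM, LATENT_DIM)) * 0.1,
    "dynamics": rng.normal(size=(LATENT_DIM + NUM_ACTIONS, LATENT_DIM)) * 0.1,
    "reward": rng.normal(size=LATENT_DIM) * 0.1,
    "value": rng.normal(size=LATENT_DIM) * 0.1,
    "policy": rng.normal(size=(LATENT_DIM, NUM_ACTIONS)) * 0.1,
    "decode": rng.normal(size=(LATENT_DIM, OBS_DIM)) * 0.1,  # only used by the observation-predicting variant
}

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def unroll(params, first_obs, actions):
    """Encode the first observation once, then step forward purely in latent space."""
    state = np.tanh(first_obs @ params["represent"])
    outputs = []
    for a in actions:
        one_hot = np.eye(NUM_ACTIONS)[a]
        state = np.tanh(np.concatenate([state, one_hot]) @ params["dynamics"])
        outputs.append({
            "reward": state @ params["reward"],
            "value": state @ params["value"],
            "policy": softmax(state @ params["policy"]),
            "obs": state @ params["decode"],
        })
    return outputs

def value_equivalent_loss(outputs, targets):
    """Reward, value and policy terms only; future observations are never consulted."""
    loss = 0.0
    for out, tgt in zip(outputs, targets):
        loss += (out["reward"] - tgt["reward"]) ** 2
        loss += (out["value"] - tgt["value"]) ** 2
        loss += -np.sum(tgt["policy"] * np.log(out["policy"]))  # cross-entropy
    return loss

def observation_predicting_loss(outputs, targets):
    """Same terms plus a reconstruction error on each future observation."""
    recon = sum(np.sum((out["obs"] - tgt["obs"]) ** 2)
                for out, tgt in zip(outputs, targets))
    return value_equivalent_loss(outputs, targets) + recon

# Toy usage with made-up targets.
actions = [0, 2, 1]
outputs = unroll(params, np.ones(OBS_DIM), actions)
targets = [{"reward": 0.0, "value": 1.0,
            "policy": np.full(NUM_ACTIONS, 1.0 / NUM_ACTIONS),
            "obs": np.zeros(OBS_DIM)} for _ in actions]
print(value_equivalent_loss(outputs, targets))
print(observation_predicting_loss(outputs, targets))
```

In this toy setup, changing the `obs` targets leaves the first loss untouched, which is the "throwing away information" part; the second loss gets extra gradient signal from every observation, at the cost of spending capacity on details that may not matter for planning.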
I would have liked it if they had done more investigation into why they were able to outperform AZ in Go. At the moment, they seem to have left it to one line of speculation.
Posted a little reaction to this paper here.