[1911.08265] Mastering Atari, Go, Chess and Shogi by Planning with a Learned Model | Arxiv

by DragonGod

1 min read, 21st Nov 2019, 4 comments

52

Tags: DeepMind, Machine Learning (ML), Academic Papers, Gaming (videogames/tabletop), AI

This is a linkpost for https://arxiv.org/abs/1911.08265

Charlie Steiner, 4y

Welp, we're doomed (/s), as soon as someone figures out how to get 100 million tries at taking over the world so we can crush the world-taking-over problem with stochastic gradient descent.

gwern, 4y

Meta-learning and transfer learning. You take over 100 million different simulated worlds, and the actual real world is a doddle.

FactorialCode, 4y

Yeah, it's interesting that this works so well, but I think the best way to think of this is as a middle ground between full model-based RL and model-free RL. Their data efficiency isn't going to be optimal, because they're effectively throwing away the information carried by the observations. However, by making that choice, they don't need to model irrelevant details, so they end up with a very accurate and effective MCTS. As a result, I'd wager that with smaller neural networks or more experience, completely model-free RL would outperform this agent, because all the modelling power can be focused on representing the policy. Likewise, with larger networks or less experience, I would expect this to fall behind MBRL that also predicts observations, because the latter would be more data efficient.

I would have liked it if they had done more investigation into why they were able to outperform AZ in Go. At the moment, they seem to have left it to one line of speculation.
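
Editorial sketch of the trade-off described in this comment: a latent model trained only on reward, value and policy targets never looks at future observations, while a variant that also predicts observations gets an extra training signal from them. This is a toy illustration and not the paper's implementation. The random linear maps, the sizes, the function name `unrolled_loss` and the `reconstruct` flag are all assumptions made for the example, and the loss on the root state is omitted for brevity.

```python
import math
import random

random.seed(0)
OBS, LATENT, ACTIONS = 8, 4, 3  # toy sizes, chosen arbitrarily


def rand_matrix(rows, cols):
    return [[random.gauss(0, 1) for _ in range(cols)] for _ in range(rows)]


def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


# Random linear maps standing in for learned networks (hypothetical names).
W_h = rand_matrix(LATENT, OBS)               # representation: observation -> latent
W_g = rand_matrix(LATENT, LATENT + ACTIONS)  # dynamics: (latent, action) -> next latent
W_r = rand_matrix(1, LATENT + ACTIONS)       # reward head of the dynamics function
W_p = rand_matrix(ACTIONS, LATENT)           # policy head of the prediction function
W_v = rand_matrix(1, LATENT)                 # value head of the prediction function
W_d = rand_matrix(OBS, LATENT)               # decoder, used only when reconstruct=True


def log_softmax(logits):
    m = max(logits)
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return [l - log_z for l in logits]


def unrolled_loss(obs_seq, actions, rewards, values, policies, reconstruct=False):
    """Loss over a K-step rollout in latent space, starting from obs_seq[0]."""
    # Only the first observation is ever encoded.
    s = [math.tanh(v) for v in matvec(W_h, obs_seq[0])]
    loss = 0.0
    for k, a in enumerate(actions):
        x = s + [1.0 if i == a else 0.0 for i in range(ACTIONS)]
        r_hat = matvec(W_r, x)[0]
        s = [math.tanh(v) for v in matvec(W_g, x)]
        v_hat = matvec(W_v, s)[0]
        logp = log_softmax(matvec(W_p, s))
        loss += (r_hat - rewards[k]) ** 2                        # reward error
        loss += (v_hat - values[k]) ** 2                         # value error
        loss -= sum(p * lp for p, lp in zip(policies[k], logp))  # policy cross-entropy
        if reconstruct:
            # Extra term for the observation-predicting variant.
            o_hat = matvec(W_d, s)
            loss += sum((p - o) ** 2 for p, o in zip(o_hat, obs_seq[k + 1])) / OBS
    return loss
```

With `reconstruct=False`, changing `obs_seq[1:]` leaves the loss unchanged, which is the sense in which the information in later observations is thrown away. With `reconstruct=True`, those observations contribute to the loss, at the cost of spending model capacity on predicting them.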
