Outperforming the human Atari benchmark

by Vaniver, 31st Mar 2020

Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

I haven't paid much attention to Atari in a long time, and so would appreciate takes from anyone who follows this more closely. My take:

A single architecture that can handle both the games that require significant exploration, and the games that require long-term credit assignment, and the 'easy' games, without tricks or expert performance, seems like an achievement to me. The main question then becomes "does it scale out of the simulator to problems we care about more than Atari?"

What does it use? It looks like a lot of 'engineering improvements' and data. That is, lots of approaches will have tons of small components that are set to some simple default. You need to explore, so you use the simplest possible method of exploration, epsilon-greedy: at every step you roll a die, and with probability epsilon you take a random action instead of the one you currently think is best. Of course you could do better if you thought about it more, but there are many places like that in the code, and 'thinking about it more' requires both developer effort and compute. Since the efficiency of the whole thing depends on how much compute it can spend, you might be spending dollars to earn cents if you use a fancy method where a simple one would do.
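To make the 'simple default' concrete, here is a minimal sketch of epsilon-greedy action selection. This is my own illustration, not code from the Agent57 paper; the function name `epsilon_greedy`, the `q_values` list of per-action value estimates, and the optional `rng` argument are all assumptions made for the example.

```python
import random
from typing import Optional, Sequence


def epsilon_greedy(
    q_values: Sequence[float],
    epsilon: float,
    rng: Optional[random.Random] = None,
) -> int:
    """Pick an action index from estimated action values (hypothetical helper).

    With probability `epsilon` a uniformly random action is returned;
    otherwise the action with the highest estimated value is returned.
    """
    # Fall back to the module-level generator when no seeded one is given.
    rng = rng or random
    # "Roll the die": random() is uniform on [0, 1).
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    # Greedy choice; ties go to the lowest action index.
    return max(range(len(q_values)), key=lambda a: q_values[a])
```

With `epsilon=0.0` this always returns the greedy action, and with `epsilon=1.0` it ignores the value estimates entirely. The 'fancier settings' are whatever replaces this handful of lines, and the point is that each such replacement costs developer effort and compute.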

That is, this doesn't look like a radical new competitor to DQN; this looks like DQN after five years of iteration and testing, where some defaults are swapped to fancier settings and others aren't, and thus is updating me a little upward on the importance of engineering and data as an input factor.

They say this isn't the end of Atari research:

"This by no means marks the end of Atari research, not only in terms of data efficiency, but also in terms of general performance. We offer two views on this: firstly, analyzing the performance among percentiles gives us new insights on how general algorithms are. While Agent57 achieves strong results on the first percentiles of the 57 games and holds better mean and median performance than NGU or R2D2, as illustrated by MuZero, it could still obtain a higher average performance. Secondly, all current algorithms are far from achieving optimal performance in some games. To that end, key improvements to use might be enhancements in the representations that Agent57 uses for exploration, planning, and credit assignment."
