DeepMind announced that their Agent57 beats the ‘human baseline’ at all 57 Atari games usually used as a benchmark. I think this is probably enough to resolve one of the predictions we had respondents make in our 2016 survey.

Our question was when it would be feasible to ‘outperform professional game testers on all Atari games using no game specific knowledge’.1 ‘Feasible’ was defined as meaning that one of the best resourced labs could do it in a year if they wanted to.

As I see it, there are four non-obvious things to resolve in determining whether this task has become feasible:

  • Did or could they outperform ‘professional game testers’?
  • Did or could they do it ‘with no game specific knowledge’?
  • Did or could they do it for ‘all Atari games’?
  • Is anything wrong with the result?

I. Did or could they outperform ‘professional game testers’?

It looks like yes, for at least 49 of the games: the ‘human baseline’ seems to have come specifically from professional game testers.2 What exactly the comparison was for the other games is less clear, but it sounds like what they mean by ‘human baseline’ is ‘professional game tester’, so probably the other games meet a similar standard.

II. Did or could they do it with ‘no game specific knowledge’?

It sounds like their system does not involve ‘game specific knowledge’, under a reasonable interpretation of that term.

III. Did or could they do it for ‘all Atari games’?

Agent57 only plays 57 Atari 2600 games, whereas there are hundreds of Atari 2600 games (and other Atari consoles with presumably even more games).

Supposing that Atari57 is a longstanding benchmark including only 57 Atari games, it seems likely that the survey participants interpreted the question as about only those games. Or at least about all Atari 2600 games, rather than every game associated with the company Atari.

Interpreting it as written though, does Agent57’s success suggest that playing all Atari games is now feasible? My guess is yes, at least for Atari 2600 games.

Fifty-five of the fifty-seven games were proposed in this paper3, which describes how they chose fifty of them:

Our testing set was constructed by choosing semi-randomly from the 381 games listed on Wikipedia [http://en.wikipedia.org/wiki/List_of_Atari_2600_games (July 12, 2012)] at the time of writing. Of these games, 123 games have their own Wikipedia page, have a single player mode, are not adult-themed or prototypes, and can be emulated in ALE. From this list, 50 games were chosen at random to form the test set.

The other five games in that paper were a ‘training set’, and I’m not sure where the other two came from, but as long as fifty of them were chosen fairly randomly, the provenance of the last seven doesn’t seem important.

My understanding is that none of the listed constraints should make the subset of games chosen particularly easy rather than random. So being able to play these games well suggests being able to play any Atari 2600 game well, without too much additional effort.
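A rough way to put numbers on this sampling argument is sketched below. It is an illustration only, not something from the paper or from DeepMind: it assumes the 50 test games were a uniform random draw from the 123 eligible games (the paper says ‘semi-randomly’), and it imagines some hypothetical number of eligible games that current methods could not beat. The function name and the example values of `hard` are made up for the illustration. It then asks how likely it is that a random draw of 50 would have missed every one of those hard games.

```python
from math import comb

ELIGIBLE = 123  # games meeting the paper's listed criteria
SAMPLED = 50    # games drawn at random to form the test set


def prob_sample_misses_all_hard(hard: int) -> float:
    """Probability that a uniform random draw of SAMPLED games from
    ELIGIBLE contains none of `hard` specific (hypothetical) games.

    This is the hypergeometric probability of drawing zero 'hard' games.
    """
    # comb returns 0 when fewer than SAMPLED non-hard games remain,
    # so the probability is 0 in that case.
    return comb(ELIGIBLE - hard, SAMPLED) / comb(ELIGIBLE, SAMPLED)


if __name__ == "__main__":
    # Hypothetical counts of unbeatable games among the 123 eligible ones.
    for hard in (1, 5, 10, 20):
        print(hard, prob_sample_misses_all_hard(hard))
```

Under these assumptions, the more hard games there were among the 123, the less likely it is that all 50 sampled games would have turned out beatable. The calculation speaks only to the 123 eligible games, and it ignores the possibility that methods were tuned to the sampled games after they were chosen.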

This might not be true if, in the time since those games were chosen (about eight years ago), the systems developed have become good at this particular set of games, while a different subset of games would have called for a different set of methods, to the extent that more than an additional year would be needed to close the gap now. My impression is that this isn’t very likely.

In sum, my guess is that respondents usually interpreted the ambiguous ‘all Atari games’ at least as narrowly as Atari 2600 games, and that a well resourced lab could now develop AI that played all Atari 2600 games within a year (e.g. plausibly DeepMind could already do that).

IV. Is there anything else wrong with it?

Not that I know of.

~

Given all this, I think it is more likely than not that this Atari task is feasible now. Which is interesting, because the median 2016 survey response put a 10% chance on it being feasible in five years, i.e. by 2021.4 They more robustly put a median 50% chance on ten years out (2026).5

It’s exciting to start resolving expert predictions about early tasks, so we know more about how to treat their later predictions about, for instance, human-level science research and the obsolescence of all human labor. But we should probably resolve a few more before reading much into it.

At a glance, some other tasks which we might be able to resolve soon:

By Katja Grace

Thanks to Rick Korzekwa, Jacob Hilton and Daniel Filan for answering many questions.

Notes
