DeepMind announced that their Agent57 beats the ‘human baseline’ at all 57 Atari games usually used as a benchmark. I think this is probably enough to resolve one of the predictions we had respondents make in our 2016 survey.

Our question was when it would be feasible to ‘outperform professional game testers on all Atari games using no game specific knowledge’.1 ‘Feasible’ was defined as meaning that one of the best resourced labs could do it in a year if they wanted to.

As I see it, there are four non-obvious things to resolve in determining whether this task has become feasible:

  • Did or could they outperform ‘professional game testers’?
  • Did or could they do it ‘with no game specific knowledge’?
  • Did or could they do it for ‘all Atari games’?
  • Is anything wrong with the result?

I. Did or could they outperform ‘professional game testers’?

It looks like yes, for at least 49 of the games: the ‘human baseline’ seems to have come specifically from professional game testers.2 What exactly the comparison was for the other games is less clear, but it sounds like what they mean by ‘human baseline’ is ‘professional game tester’, so probably the other games meet a similar standard.

II. Did or could they do it with ‘no game specific knowledge’?

It sounds like their system does not involve ‘game specific knowledge’, under a reasonable interpretation of that term.

III. Did or could they do it for ‘all Atari games’?

Agent57 only plays 57 Atari 2600 games, whereas there are hundreds of Atari 2600 games (and other Atari consoles with presumably even more games).

Supposing that Atari57 is a longstanding benchmark including only 57 Atari games, it seems likely that the survey participants interpreted the question as about only those games. Or at least about all Atari 2600 games, rather than every game associated with the company Atari.

Interpreting it as written though, does Agent57’s success suggest that playing all Atari games is now feasible? My guess is yes, at least for Atari 2600 games.

Fifty-five of the fifty-seven games were proposed in this paper3, which describes how they chose fifty of them:

Our testing set was constructed by choosing semi-randomly from the 381 games listed on Wikipedia [(July 12, 2012)] at the time of writing. Of these games, 123 games have their own Wikipedia page, have a single player mode, are not adult-themed or prototypes, and can be emulated in ALE. From this list, 50 games were chosen at random to form the test set.

The other five games in that paper were a ‘training set’, and I’m not sure where the other two came from, but as long as fifty of them were chosen fairly randomly, the provenance of the last seven doesn’t seem important.

My understanding is that none of the listed constraints should make the subset of games chosen particularly easy rather than random. So being able to play these games well suggests being able to play any Atari 2600 game well, without too much additional effort.
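For concreteness, the quoted selection procedure can be sketched roughly as follows. The game titles and the eligibility filter here are placeholders for illustration, not the actual 2012 Wikipedia data:

```python
import random

# Placeholder universe standing in for the 381 games listed on Wikipedia.
all_games = [f"game_{i:03d}" for i in range(381)]

def meets_criteria(game):
    # Stand-in for the paper's filters: has its own Wikipedia page, has a
    # single-player mode, is not adult-themed or a prototype, and can be
    # emulated in ALE. (The real filter left 123 games; this toy one
    # happens to leave 127.)
    return int(game.split("_")[1]) % 3 == 0

eligible = [g for g in all_games if meets_criteria(g)]

# Choose 50 games uniformly at random from the eligible set, as the paper
# describes. A fixed seed is used here only to make the sketch reproducible.
test_set = random.Random(0).sample(eligible, 50)
```

The point of the random draw is that nothing about the chosen 50 should be systematically easier than the rest of the eligible pool.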

This might not be true if, in the roughly eight years since those games were chosen, methods have been developed that suit this particular set of games but would not have worked for a different subset, to the extent that closing the gap now would take more than an additional year. My impression is that this isn’t very likely.

In sum, my guess is that respondents usually interpreted the ambiguous ‘all Atari games’ at least as narrowly as Atari 2600 games, and that a well resourced lab could now develop AI that played all Atari 2600 games within a year (e.g. plausibly DeepMind could already do that).

IV. Is anything wrong with the result?

Not that I know of.


Given all this, I think it is more likely than not that this Atari task is feasible now. Which is interesting, because the median 2016 survey response put a 10% chance on it being feasible in five years, i.e. by 2021.4 They more robustly put a median 50% chance on ten years out (2026).5

It’s exciting to start resolving expert predictions about early tasks so we know more about how to treat their later predictions about human-level science research and the obsolescence of all human labor for instance. But we should probably resolve a few more before reading much into it.

At a glance, some other tasks which we might be able to resolve soon:

By Katja Grace

Thanks to Rick Korzekwa, Jacob Hilton and Daniel Filan for answering many questions.



I think it shouldn't matter much which definition was used, but the World Series of Poker has one "Main Event" consisting of no-limit Texas Hold'em, and several smaller events for the different styles of poker. I would have interpreted the question as asking about the Main Event only if it didn't specify.

I’m not familiar enough with Poker to say whether any of the differences between Texas Hold’em, Omaha Hold’em and Seven Card Stud should make the latter two difficult if the first is now feasible.

I've played all of these and my sense is that Seven Card Stud would be relatively easy for computers to learn because it has fixed bet sizings just like Limit Hold'em, which was solved long before No Limit. Some of the cards are exposed in Stud, which creates a new dynamic, but I don't think it should be difficult for computers to reason about.

Omaha seems like it would be about as difficult as Texas Hold'em. It has the same sequence of actions and the same concepts. The bet sizings are more restricted (the maximum is determined by the size of the pot instead of no limit), but there are more cards.

As far as I'm aware, none of the top poker bots so far were built in a way that they could learn other variants of poker without requiring a lot of fine-tuning from humans. It's interesting to think about whether building a generalized poker bot would be easier or harder than building the generalized Atari bot. I'm not sure I know enough about Atari games to have good intuitions about that. But my guess is that if it works for Atari games, it should also work for poker. The existing poker bots already rely on self-play to become good.

I believe that the poker bots met the mark for two player games and that Omaha/7-Stud are both not much of an issue and wouldn't actually be required in any way, but actually winning the WSOP requires mostly winning at 9-10 person tables. I do realize that there are claims they've been able to handle that, but doing it in person is... trickier. Probably need another level of improvement before they have a reasonable shot at winning.

(Note that WSOP is a good structure but even so is still pretty random, so e.g. I would win it some % of the time if I entered, whereas if I played the type of match they used to test the poker bots, my chances would be about epsilon.)

Another thing is that the bots never make exploitative plays. So when there's a bad player at the table playing 95% of their hands, the bot would never try to capitalize on that, whereas any human professional would be able to make extra money off the bad player. Therefore, the bot's advantages over human professionals are highest when the competition is especially tough.