Abstract
Despite much progress in training AI systems to imitate human language, building agents that use language to communicate intentionally with humans in interactive environments remains a major challenge. We introduce Cicero, the first AI agent to achieve human-level performance in Diplomacy, a strategy game involving both cooperation and competition that emphasizes natural language negotiation and tactical coordination between seven players. Cicero integrates a language model with planning and reinforcement learning algorithms by inferring players' beliefs and intentions from its conversations and generating dialogue in pursuit of its plans. Across 40 games of an anonymous online Diplomacy league, Cicero achieved more than double the average score of the human players and ranked in the top 10% of participants who played more than one game.
Meta Fundamental AI Research Diplomacy Team (FAIR)†, Anton Bakhtin, Noam Brown, Emily Dinan, Gabriele Farina, Colin Flaherty, Daniel Fried, et al. 2022. “Human-Level Play in the Game of Diplomacy by Combining Language Models with Strategic Reasoning.” Science, November, eade9097. https://doi.org/10.1126/science.ade9097.
Cicero is not particularly unexpected to me, but my expectations here are not driven by the scaling hypothesis. The result was not achieved by adding more layers to a single AI engine; it was achieved by human designers who assembled several specialised AI engines by hand.
So I do not view this result as one that adds particularly strong evidence to the scaling hypothesis. I could equally well make the case that it adds evidence to the alternative hypothesis, put forward by people like Gary Marcus, that scaling as the sole technique has run out of steam, and that the prevailing ML research paradigm needs to shift to a more hybrid approach of combining models. (The prevailing applied AI paradigm has of course always been that you usually need to combine models.)
Another way to explain my lack of surprise would be to say that Cicero is just a super-human board-game-playing engine that has been equipped with a voice synthesizer. But I might be downplaying the achievement here.
I have not read any of the authors' or Meta's messaging around this, so I am not sure whether they make this point, but the sub-components of Cicero that somewhat competently and 'honestly' explain its currently intended moves seem to have beneficial applications too, if they were combined with a different kind of engine than a game engine that absolutely wants to win and that can change its mind about which moves to play later. This is a dual-use technology with both good and bad possible uses.
That being said, I agree that this is yet another regulatory wake-up call, if we needed one. As a group, AI researchers will not conveniently regulate themselves: they will move forward in creating more advanced dual-use technology, while openly acknowledging (see annex A.3 of the paper) that it might be used for both good and bad purposes downstream. So it is up to the rest of the world to make sure that these downstream uses are regulated.