Abstract
Despite much progress in training AI systems to imitate human language, building agents that use language to communicate intentionally with humans in interactive environments remains a major challenge. We introduce Cicero, the first AI agent to achieve human-level performance in Diplomacy, a strategy game involving both cooperation and competition that emphasizes natural language negotiation and tactical coordination between seven players. Cicero integrates a language model with planning and reinforcement learning algorithms by inferring players' beliefs and intentions from its conversations and generating dialogue in pursuit of its plans. Across 40 games of an anonymous online Diplomacy league, Cicero achieved more than double the average score of the human players and ranked in the top 10% of participants who played more than one game.
Meta Fundamental AI Research Diplomacy Team (FAIR)†, Anton Bakhtin, Noam Brown, Emily Dinan, Gabriele Farina, Colin Flaherty, Daniel Fried, et al. 2022. “Human-Level Play in the Game of Diplomacy by Combining Language Models with Strategic Reasoning.” Science, November, eade9097. https://doi.org/10.1126/science.ade9097.
Thanks, that does a lot to clarify your viewpoints. Your reply calls for some further remarks.
I'll start off by saying that I value your technology-tracking writing highly, because you are one of those blogging technology trackers who are able to look beyond the press releases and beyond the hype. But I have the same high opinion of the writings of Gary Marcus.
For the record: I am not trying to handwave the progress-via-hybrid-approaches hypothesis of Marcus into correctness. The observations I am making here are much more in the 'explains everything while predicting nothing' department.
I am pointing out that both your progress-via-scaling hypothesis and the progress-via-hybrid-approaches hypothesis of Marcus can be made to explain the underlying Cicero facts here. I do not see this case as a clear victory for either of these hypotheses. What we have here is an AI design that cleverly combines multiple components while also being impressive in the scaling department.
Technology tracking is difficult, especially about the future.
The following observation may get to the core of where I am perceiving the elephant differently. I interpret an innovation like GANs not as a triumph of scaling, but as a triumph of cleverly putting two components together. I see GANs as an innovation that directly contradicts the message of the Bitter Lesson paradigm, one that is much more in the spirit of what Marcus proposes.
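To make the "two components" point concrete, here is a minimal toy sketch of that structure (my own illustration in PyTorch; the network sizes and stand-in data are arbitrary, and nothing here is taken from the original GAN work): a generator and a discriminator are two separately parameterised networks, and the interesting part is the adversarial loop that wires them together.

```python
# Toy GAN sketch (my own illustration; sizes and data are arbitrary).
import torch
import torch.nn as nn

latent_dim, data_dim = 16, 2

# Component 1: the generator maps random noise to candidate samples.
generator = nn.Sequential(
    nn.Linear(latent_dim, 32), nn.ReLU(), nn.Linear(32, data_dim)
)
# Component 2: the discriminator scores samples as real vs. generated.
discriminator = nn.Sequential(
    nn.Linear(data_dim, 32), nn.ReLU(), nn.Linear(32, 1)
)

g_opt = torch.optim.Adam(generator.parameters(), lr=1e-3)
d_opt = torch.optim.Adam(discriminator.parameters(), lr=1e-3)
bce = nn.BCEWithLogitsLoss()

def real_batch(n=64):
    # Stand-in "real" data: points drawn from a fixed Gaussian.
    return torch.randn(n, data_dim) * 0.5 + 2.0

for step in range(200):
    # Discriminator update: learn to tell real samples from generated ones.
    real = real_batch()
    fake = generator(torch.randn(64, latent_dim)).detach()
    d_loss = bce(discriminator(real), torch.ones(64, 1)) + \
             bce(discriminator(fake), torch.zeros(64, 1))
    d_opt.zero_grad(); d_loss.backward(); d_opt.step()

    # Generator update: learn to fool the discriminator.
    fake = generator(torch.randn(64, latent_dim))
    g_loss = bce(discriminator(fake), torch.ones(64, 1))
    g_opt.zero_grad(); g_loss.backward(); g_opt.step()
```

Neither network is remarkable on its own; the contribution sits in the composition, which is why I read GANs as a point for the "cleverly combine components" camp rather than the scaling camp.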
Here is what I find particularly interesting in Marcus. In pieces like The Next Decade in AI: Four Steps Towards Robust Artificial Intelligence, Marcus is advancing the hypothesis that the academic Bitter Lesson AI field is in a technology overhang: these people could make a lot of progress on their benchmarks very quickly, faster than mere neural net scaling will allow, if they were to ignore the Bitter Lesson paradigm and embrace a hybrid approach where the toolbox is much bigger than general-purpose learning, ever-larger training sets, and more and more compute. Sounds somewhat plausible to me.
If you put a medium or high probability on this overhang hypothesis of Marcus, then you are in a world where very rapid AI progress might happen: progress much faster than what the progress curves produced by Bitter Lesson AI research would predict.
You seem to be advancing an alternative hypothesis, one where advances made by clever hybrid approaches will always be replicated a few years later by a Bitter-Lesson-style monolithic deep neural net trained on a massive dataset. This would conveniently restore the validity of extrapolating Bitter-Lesson-driven progress curves, because you could then use them as a rough upper bound on overall progress, give or take that few-year lead. We'll see.
I am currently not primarily in the business of technology tracking; I am an AI safety researcher working on safety solutions and regulation. With that hat on, I will say the following.
Bitter-Lesson-style systems consisting of a single deep neural net, especially if these systems are also model-free RL agents, have huge disadvantages in the robustness, testability, and interpretability departments. These disadvantages are endlessly talked about on this web site, of course. By contrast, systems built out of separate components with legible interfaces between them are usually much more robust, interpretable, and testable. This is much less often mentioned here.
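To illustrate what I mean by a legible interface, here is a toy example of my own invention (not taken from any real system): because the planning component below consumes a structured, human-readable object, it can be unit-tested in isolation, without running whatever perception model produces that object.

```python
# Toy illustration of a legible interface between components (invented example).
from dataclasses import dataclass

@dataclass
class ThreatAssessment:
    # Legible intermediate representation handed from perception to planning.
    obstacle_distance_m: float
    obstacle_closing: bool

def plan_speed(assessment: ThreatAssessment, current_speed: float) -> float:
    """Toy planning component: slow down when an obstacle is close and closing."""
    if assessment.obstacle_closing and assessment.obstacle_distance_m < 10.0:
        return min(current_speed, 1.0)
    return current_speed

# Safety-relevant behaviour is directly checkable at the interface:
assert plan_speed(ThreatAssessment(5.0, True), current_speed=8.0) == 1.0
assert plan_speed(ThreatAssessment(50.0, False), current_speed=8.0) == 8.0
```

You cannot write tests like those last two lines against the internals of a single end-to-end network; the legible interface is what makes them possible.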
In safety engineering for any high-risk application, I would usually prefer to work with an AI system built out of many legible sub-components, not with some deep neural net that happens to perform equally well or better on an in-training-distribution benchmark. So I would like to see more academic AI research that ignores the Bitter Lesson paradigm, and the paradigm that all AI research must be ML research. I am pleased to say that a lot of academic and applied AI researchers, at least in the part of the world where I live, never got on board with these paradigms in the first place. To find their work, you have to look beyond conferences like NeurIPS.