Abstract
Despite much progress in training AI systems to imitate human language, building agents that use language to communicate intentionally with humans in interactive environments remains a major challenge. We introduce Cicero, the first AI agent to achieve human-level performance in Diplomacy, a strategy game involving both cooperation and competition that emphasizes natural language negotiation and tactical coordination between seven players. Cicero integrates a language model with planning and reinforcement learning algorithms by inferring players' beliefs and intentions from its conversations and generating dialogue in pursuit of its plans. Across 40 games of an anonymous online Diplomacy league, Cicero achieved more than double the average score of the human players and ranked in the top 10% of participants who played more than one game.
Meta Fundamental AI Research Diplomacy Team (FAIR)†, Anton Bakhtin, Noam Brown, Emily Dinan, Gabriele Farina, Colin Flaherty, Daniel Fried, et al. 2022. “Human-Level Play in the Game of Diplomacy by Combining Language Models with Strategic Reasoning.” Science, November, eade9097. https://doi.org/10.1126/science.ade9097.
Well, first I would point out that I predicted this would happen soon, while literally writing "The Scaling Hypothesis", whereas the actual researchers involved did not, predicting it would take at least a decade. If this success is not predicted by the scaling hypothesis, but is in fact better predicted by neurosymbolic views denying scaling, it is curious that I was the one who said so, and not Gary Marcus.
Second, it is hard to see how a success using models which would have been among the largest NNs ever trained just 3 years ago, finetuned using 128-256 GPUs on a dataset (a compute-cluster & dataset which would likewise have been among the largest ever used in DL up until a few years ago), requiring 80 CPU-cores & 8 GPUs at runtime (which would likewise have been an extravagant inference budget until recently), is really a vindication for hand-engineered neurosymbolic approaches emphasizing the importance of symbolic reasoning and complicated modular designs, as opposed to approaches emphasizing the importance of increasing data & compute for solving previously completely intractable problems. Nor do the results indicate that the modules are really all that critical: I do not think CICERO is going to need even an AlphaZero-level breakthrough to remove various parts, when, e.g., the elaborate filter system representing much of the 'neurosymbolic' part of the system removes only 15.5% of all messages.
I regard this sort of argument as illegitimate because it is a fully-general counterargument which allows strategy-stealing and explains everything while predicting nothing: no matter how successful scaling is, it will always be possible (short of a system being strongly superhuman) to take a system and improve it in some way by injecting hand-engineering, or by borrowing other systems and lashing them together. This is fine for purely pragmatic purposes, but the problem comes when one then claims it as a victory for 'neurosymbolic programming' or whatever catchphrase is in vogue. Somehow, the scaled-up system is never a true Scotsman of scaling, but the modifications or hybrids are always true Scotsmen of neurosymbolics... Those systems will then be surpassed a few years later by larger (and often simpler) systems, thereby refuting the previous argument, only for the same thing to happen again, ratcheting upwards. The question is not whether some hand-engineering can temporarily leapfrog the Bitter Lesson (it is quite explicitly acknowledged by Sutton and scaling proponents that you can gain constant factors by doing things which are not scaling), but whether it will progress faster. CICERO would be a victory for neurosymbolic approaches only if it had used nothing you couldn't've done in, say, 2015.
This seems to be what you are doing here: you handwave away the use of BART and extremely CPU/GPU-intensive search as not a victory for scaling (why? it was not a foregone conclusion that a BART model would be any good, nor that the search could be made to work, without far more extensive engineering of theory of mind and symbolic reasoning), and then count only the lashing-together of a few scaled-up components as a victory for neurosymbolic approaches instead.