Disclaimer: I'm not an expert at machine learning, AI safety or Diplomacy, so there may be errors here, though hopefully no major ones. For previous discussion of CICERO on here, see the comments in this post, this rundown and this commentary.
Summary
Introduction
It's been a good few years for game-playing AIs. In 2016 and 2017 we saw AlphaGo and AlphaZero achieve superhuman performance at go, chess and shogi using reinforcement learning with neural networks. In 2019, MuZero learned to play Atari games without any built-in knowledge of the rules, and Pluribus beat top human players at six-player no-limit Texas hold 'em. Most of these systems learned to play from scratch, without any data of human play as input.
Last month, Meta AI announced a new Diplomacy-playing AI, CICERO, which shows good (though not superhuman) performance at a board game that (1) features more than two competing players and (2) requires not only strategic and tactical thinking, but also the ability to communicate using natural language. To do this, CICERO combines a strategic model with a GPT-style language model (more on which later).
Diplomacy, reportedly a favourite of Henry Kissinger and Demis Hassabis, was released at the height of the Cold War in 1959. You play as one of seven great powers in early-20th-century Europe, moving military units around with the ultimate aim of capturing half of the continent. What sets it apart from other wargames is its negotiation phase, in which players form alliances, make deals, glean information from and occasionally lie to one another, mostly in private conversations.
Though Diplomacy is normally played with communication (known as Full-press Diplomacy), it can also be played without it (No-press Diplomacy).[1] No-press Diplomacy is easier to master with machine learning models than Full-press because it doesn't involve natural language communication, and AIs have shown impressive performance at it for a while now: Bakhtin et al. (2021) achieved superhuman performance in 2-player No-press Diplomacy, and Bakhtin et al. (2022) achieved human-level performance in 7-player No-press Diplomacy. (You'll recognise some of the names on these papers from the CICERO paper.) And now Meta AI et al. (2022) have achieved human-level performance in Full-press 7-player Diplomacy (specifically the Blitz version, where players are only allowed to communicate for 5 minutes each round).
How CICERO Works
CICERO comprises two main parts (sketched in the illustrative code below):

1. A strategic reasoning module, which plans moves and selects intents -- the actions CICERO currently plans to take, and the actions it expects (or wants) other players to take -- using planning and reinforcement learning regularised towards human play.
2. A dialogue model -- a GPT-style language model fine-tuned on human Diplomacy dialogue -- which generates messages conditioned on those intents and passes them through a set of filters that discard candidates that are nonsensical, inconsistent with the intent or strategically unhelpful.
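To make the division of labour concrete, here is a minimal, purely illustrative sketch of how the two parts might fit together when CICERO composes a single message. All of the names here (`StrategicModel`, `DialogueModel`, `plan_intents`, `generate`, `passes_filters`) are my own inventions for illustration -- this is not CICERO's actual code or API.

```python
# Illustrative sketch only -- not CICERO's actual code or API.
from dataclasses import dataclass, field


@dataclass
class Intent:
    """The actions CICERO currently plans to take, and those it expects others to take."""
    own_moves: list[str]
    expected_moves: dict[str, list[str]] = field(default_factory=dict)  # power -> moves


class StrategicModel:
    def plan_intents(self, game_state, dialogue_history) -> Intent:
        """Planning/RL component: choose moves, regularised towards human play."""
        ...


class DialogueModel:
    def generate(self, intent: Intent, recipient: str, dialogue_history) -> list[str]:
        """Language-model component: draft candidate messages conditioned on the intent."""
        ...


def passes_filters(message: str, intent: Intent, game_state) -> bool:
    """Discard candidates that are nonsensical, off-intent or strategically unhelpful."""
    ...


def negotiation_step(strategic_model, dialogue_model, game_state, dialogue_history, recipient):
    intent = strategic_model.plan_intents(game_state, dialogue_history)
    for candidate in dialogue_model.generate(intent, recipient, dialogue_history):
        if passes_filters(candidate, intent, game_state):
            return candidate  # send the first candidate that survives the filters
    return None               # if nothing survives, say nothing to this player
```

The last branch, where no candidate survives the filters and CICERO simply stays silent, is the behaviour I have in mind below when discussing withholding information.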
Actual human Diplomacy involves lying and withholding information. Can CICERO also lie and withhold information?
Meta's promotional video implies that CICERO is "fundamentally honest and fundamentally collaborative".[3] Mike Lewis, one of CICERO's creators, puts it like this: "It's designed to never intentionally backstab -- all its messages correspond to actions it currently plans to take. However, sometimes it changes its mind ..."
From what I gather, CICERO can definitely withhold information, and it can kind of lie.
It can clearly withhold information and can simply stop communicating with a player. As sanxiyn points out, CICERO can just not say what it intends to do, either by not striking up a conversation with a player at all, or by conversing with a player while leaving out parts of its plan. I think the former happens when the filters attached to the dialogue model don't let any of the candidate messages through, e.g. because they aren't strategically valuable or sensible enough.
When CICERO communicates with another player, it always tries to communicate what it actually intends to do at that moment -- it doesn't intentionally try to throw other players off by stating that it will do a thing it has no intention of doing. (Or rather, it's trained not to do that. It's still a language model -- there's no guarantee that it'll always produce messages that match the intent.[4]) But it can change its intent from one moment to the next, e.g. in response to a message from another player. As a result, you can have a conversation with CICERO where it first expresses its intent, then seconds later changes its mind, and then keeps conversing with you without letting you know it has changed its mind. That may not feel like lying on the inside, but it sure looks like lying from the outside.[5]
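Reusing the made-up names from the sketch above, the point is that honesty is enforced only locally, message by message: each message is conditioned on whatever the intent happens to be at that instant, and nothing prompts CICERO to flag that an earlier message no longer matches its current plan. Again, this is a hypothetical illustration of the dynamic, not CICERO's actual behaviour or code.

```python
# Hypothetical illustration of message-by-message ("local") honesty.
def converse(strategic_model, dialogue_model, game_state, dialogue_history,
             recipient, n_exchanges=3):
    for _ in range(n_exchanges):
        # The intent is recomputed before every message and can change in response
        # to the opponent's latest messages; each message sent is honest only about
        # the intent held at that moment.
        message = negotiation_step(strategic_model, dialogue_model,
                                   game_state, dialogue_history, recipient)
        if message is not None:
            dialogue_history.append((recipient, message))
    # Nothing above checks whether earlier messages still match the latest intent,
    # so even though every individual message was "honest", the conversation as a
    # whole can end up misleading the other player.
    return dialogue_history
```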
Gwern speculates that the training process may have smuggled deception in as a kind of superstructure around the honest intent-to-message function:
If I understand the paper correctly, it's not exactly true to say that CICERO was trained end-to-end, or rather, it is true but this end-to-end training didn't involve the dialogue model as such.[6] Gwern's point is still plausible, I think, as CICERO could have picked up deceptive heuristics in the way that it chooses (and then discards) intents.
The extent to which CICERO actually did pick up deceptive heuristics is unclear to me. Adam Lerer from the CICERO team writes: "One reason that CICERO did not use deception effectively -- and why we abandoned it -- is that it wasn't very good at reasoning about the long-term cost of lying, i.e. knowing exactly how much a particular lie would hurt its ability to cooperate with the other player in the future." But I'm confused about what he means when he says they "abandoned" deception -- did they originally try to make CICERO deceptive and then reverse course only because it wasn't actually to its advantage?
(Diplomacy players sometimes seem to emphasise that deceptive and treacherous behaviour is often not conducive to success, because, while it produces a short-term gain, it also produces a loss in the long term as other players lose trust in you and are less willing to cooperate with you. Andrew Goff, who's won three Diplomacy world championships, and who was also involved in the development and publication of CICERO, writes: "Backstabbing tends to get devalued by CICERO. It has long been my thinking that backstabbing is a poor option in the game and I always feel like I fail when I have to do it, and CICERO seems to agree with me. It gets clearly better results when it is honest and collaborates with allies over the long term." I downweight this somewhat though as everyone likes to think that what they do is good and lawful, and Goff seems to be an unusually cooperative Diplomacy player. And I think even Goff would agree that deception does play an important role in Diplomacy.)
Is CICERO Dangerous?
Some people who are concerned with risks from advanced AI seem to find CICERO an unusually bad idea. E.g. Form of Plato (1.9K likes):
Or Erik Brynjolfsson (1.5K likes):
Or Gwern (36 karma):
I can think of at least three things that may concern people about CICERO:

1. It represents a marked advance in deceptive and persuasive capabilities.
2. It represents a marked advance in strategic and tactical capabilities.
3. It is evidence that Meta AI is a bad actor, and that the research project that produced it was dangerous.
I weakly think (1) and (2) are false, and am on the fence about (3).
The reason why I think (1) and (2) are false is not that CICERO isn't good at strategy, tactics, deception or persuasion. Zvi Mowshowitz seems to argue something like this:
Though he has since edited his post to add a qualification:
That seems to be a reference to Diplodocus, an earlier Diplomacy AI which was created by some of the same people as CICERO and, I believe, shares some of its code (Bakhtin et al. 2022). About Diplodocus, one high-level Diplomacy player says in a video commentary: "It was exceptionally strong tactically, [...] it could cooperate, it could signal and it could stab. Oh boy could it stab! I saw player after player get decimated by this AI after aligning themselves with it." That sounds like it's strong at both tactics and deception.
Given that we've seen AIs similar to CICERO achieve superhuman performance at 2-player No-press Diplomacy (Bakhtin et al. 2021), and strong though not superhuman performance at 7-player No-press Diplomacy (Bakhtin et al. 2022), and that expert players seem to rate CICERO's and its ancestors' tactical skills highly, I (who know little about Diplomacy) would guess that CICERO is better at tactics than ≥90% of tournament-going Diplomacy players and better at strategy than ≥70% of them. Its deceptive and persuasive capabilities are murkier since I've seen less of them.
But even if CICERO is better than previous game-playing AIs at tactics, strategy, deception and/or persuasion, I still think it doesn't present marked advances in any of those capabilities, because I can discern no novel insights that led to those improvements. Instead, I think they're mostly the product of using reinforcement learning to master a game where tactics, strategy, deception and persuasion are useful. But this looks less like a technological advance than the application of old technologies to a new (and ultimately irrelevant) domain.
(Exceptions could be the way CICERO combines a strategic model with a Transformer-based language model, or the way it grounds its strategies in models of human play in order to cooperate with and anticipate human moves -- those may be technologically novel and significant capability advancements, but I lean towards thinking they're not.)
As for (3), again I'll quote Zvi:
I do think Meta AI is a bad actor in the sense that it devotes too little attention and too few resources to existential risks from AI, but I'm not convinced that CICERO should update our opinion on that. If (1) and (2) are false, CICERO is no more dangerous than any other AI research project, and we already knew Meta was pouring money into those. (Actually, we already knew Meta had been pouring money into Diplomacy-playing AIs, too.) So it seems to me that the research project that produced CICERO isn't more dangerous than a baseline AI research project of the same scale (which is to say, not not dangerous).
Simulated Worlds
This section is speculative, and I'd be especially happy to get comments on it.
Systems like CICERO are trained through reinforcement learning with self-play. Roughly speaking, you have the AI play against copies of itself, and then improve it based on how each copy fared. This process is repeated hundreds of thousands or millions of times. It can be repeated that rapidly because these models don't play out in the real world; they play in simulated worlds (training environments).
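As a rough picture of what that loop looks like, here is a generic self-play sketch. The `env` and `policy` interfaces are placeholders of my own; the actual training setups behind AlphaZero or CICERO are considerably more sophisticated.

```python
# Generic self-play sketch -- placeholder interfaces, not any lab's actual training code.
import copy


def play_one_game(env, players):
    """Run one simulated game; return each player's (state, action, reward) trajectory."""
    trajectories = [[] for _ in players]
    states = env.reset()
    while not env.done():
        actions = [player.act(state) for player, state in zip(players, states)]
        next_states, rewards = env.step(actions)
        for i, trajectory in enumerate(trajectories):
            trajectory.append((states[i], actions[i], rewards[i]))
        states = next_states
    return trajectories


def self_play_training(env, policy, n_iterations, n_players=7):
    for _ in range(n_iterations):
        # Every seat in the simulated game is played by a copy of the same policy.
        players = [copy.deepcopy(policy) for _ in range(n_players)]
        trajectories = play_one_game(env, players)  # runs entirely inside the simulator
        policy.update(trajectories)                 # improve the policy from how each copy fared
    return policy
```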
The world of chess is really simple -- a board, 32 pieces and some meta-information like whose turn it is -- and is correspondingly simple to simulate. The real world is astoundingly complex and correspondingly hard to simulate. We can learn a lot from studying AIs in the isolated environments of board games. But the dream (or nightmare) is that we can scale up AIs like AlphaZero and CICERO, train them in more complex simulated worlds (perhaps even simulated approximations of the real world) and attain transformative AI that way.
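To see just how little state a board game needs, here is the entire state of a chess game as a toy data structure -- essentially the six fields that FEN notation encodes. Simulating a move means updating this small structure according to fixed rules, which is why millions of games can be played cheaply.

```python
# The complete state of a chess game -- essentially what FEN notation encodes.
from dataclasses import dataclass


@dataclass
class ChessState:
    board: list[list[str]]   # 8x8 grid of piece codes, e.g. "wK", "bP", or "" for empty
    white_to_move: bool
    castling_rights: str     # e.g. "KQkq"
    en_passant_square: str   # e.g. "e3", or "-" if none
    halfmove_clock: int      # half-moves since the last capture or pawn move (50-move rule)
    fullmove_number: int
```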
Is that kind of scaled-up, simulation-trained AI possible? I think this pathway is present, explicitly or implicitly, in many AGI scenarios, but you can imagine some reasons why it may not be feasible:
But Ajeya Cotra thinks the costs of simulation[7] probably won't prevent us from creating transformative AI:
(Cotra goes into more detail on each of these in the linked document.)
I think Cotra's reasons presuppose that an intelligent enough AI can preserve its capabilities out-of-distribution, i.e. that if you train a neural network with enough parameters and compute in a simulated world not much more complex than Diplomacy, such that it becomes just as intelligent and general as a human, then it would retain those capabilities in other environments too, such as the real world.
Had CICERO represented a capability advancement, I'd say this would have been it. The world of Diplomacy, because it includes communication in natural language, is substantially more complex than that of chess or go. But CICERO was not actually trained in an environment that included dialogue; the dialogue part of the game was approximated during training, apparently because the computational cost of generating dialogue made it too cumbersome (the language model was instead optimised separately). So CICERO doesn't seem like an advancement in this respect either (though it's easy to imagine the next generation of Diplomacy-playing AIs being just that).[8]
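Concretely, and speculatively (the interfaces below reuse the placeholders from the self-play sketch above and are not CICERO's actual code), the approximation amounts to this: during reinforcement learning, the other seats' actions are sampled from a policy behaviourally cloned on human games -- whose action distribution already reflects how dialogue shaped human play -- rather than from agents that actually exchange messages. The paper's own description is quoted in the footnotes.

```python
# Speculative sketch of training without simulating dialogue -- placeholder interfaces.
def rl_game_without_dialogue(env, rl_policy, bc_policy, my_seat, n_players=7):
    """Play one simulated Diplomacy game for RL training without generating any dialogue."""
    trajectory = []
    states = env.reset()
    while not env.done():
        actions = []
        for seat in range(n_players):
            if seat == my_seat:
                actions.append(rl_policy.act(states[seat]))
            else:
                # No conversations are simulated: the other players' actions come from a
                # behavioural-cloning policy trained on human games, whose action
                # distribution already reflects how dialogue shaped human play.
                actions.append(bc_policy.act(states[seat]))
        state_before = states[my_seat]
        states, rewards = env.step(actions)
        trajectory.append((state_before, actions[my_seat], rewards[my_seat]))
    return trajectory  # used to update rl_policy, as in the self-play sketch above
```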
The Future of Diplomacy AIs
The Metaculus community's median estimate for superhuman performance at Full-press Diplomacy is May 2024. My own 90% confidence interval is spring 2023 to spring 2027. I don't think you'd need any major technological breakthroughs to get there. Maybe you could tweak the current architecture, use a larger language model, throw more compute at the whole thing and end up with a superhuman AI as early as next year. (Maybe Meta AI is already doing this.)
What I'm less sure about is how much time and money Meta and its competitors are willing to invest in a Diplomacy-playing AI at this point -- whether anyone will bother improving on CICERO, or if they'll move on to other games instead. On the one hand, (1) Diplomacy is far less popular than chess or go; (2) it's unclear whether Meta AI (the most likely lab) feels that it has more to accomplish in Diplomacy after CICERO; and (3) future Diplomacy-playing AIs may need to deal with being recognised and possibly exploited or targeted by human players. On the other hand, (4) CICERO got a lot of attention for mere human-level performance; (5) a lot of the groundwork has already been done (CICERO's code is open source); and (6) compute seems to have been an important bottleneck.
References
Bakhtin, Anton, David J Wu, Adam Lerer, Jonathan Gray, Athul Paul Jacob, Gabriele Farina, Alexander H Miller, and Noam Brown. 2022. “Mastering the Game of No-Press Diplomacy via Human-Regularized Reinforcement Learning and Planning.” https://doi.org/10.48550/ARXIV.2210.05492
Bakhtin, Anton, David Wu, Adam Lerer, and Noam Brown. 2021. “No-Press Diplomacy from Scratch.” https://doi.org/10.48550/ARXIV.2110.02924
Meta AI, Anton Bakhtin, Noam Brown, Emily Dinan, Gabriele Farina, Colin Flaherty, Daniel Fried, et al. 2022. “Human-Level Play in the Game of Diplomacy by Combining Language Models with Strategic Reasoning.” Science, eade9097.
This post is written in a personal capacity and doesn't necessarily reflect the views of my employer (Rethink Priorities).
There's also a mode called public press, where players do communicate, but only publicly, never privately. There may also be other modes that I don't know of. ↩︎
Meta AI et al. (2022): "In finite [two-player zero-sum] games, certain reinforcement learning algorithms that learn by playing against themselves -- a process known as self-play -- will converge to a policy that is unbeatable in expectation in balanced games. In other words, any finite [two-player zero-sum] game can be solved through self-play with sufficient compute and model capacity. However, in games that involve cooperation, self-play without human data is no longer guaranteed to find a policy that performs well with humans, even with infinite compute and model capacity, because the self-play agent may converge to a policy that is incompatible with human norms and expectations." ↩︎
Though perhaps all that means is that Meta's promotional video is not fundamentally honest or fundamentally collaborative.
Also see Meta AI et al. (2022): "CICERO conditions its dialogue on the action that it intends to play for the current turn. This choice maximizes CICERO's honesty and its ability to coordinate, but risks leaking information that the recipient could use to exploit it (e.g., telling them which of their territories CICERO plans to attack)." ↩︎
In the validation data for CICERO's language model, about 7% of generated messages did not match the given intent (Meta AI et al. 2022). The post-processing filters likely catch some, but not all, of these. ↩︎
Watching games, it seems there are also instances where CICERO is honest about what it's going to do, but dishonest or confused about its reasons for doing so. E.g. at 1:11:00 in this game, where Turkey says it'll move its fleet into the Black Sea in order to support the human player into Romania, even though it's already supporting the human player into Romania from Bulgaria this round, as it has also verbally confirmed; presumably the real reason was to use the Black Sea fleet as a power base against the human player (the game ends soon thereafter, Turkey not having gone against the human player). ↩︎
Meta AI et al. (2022): "One challenge in doing self-play in Diplomacy is that players may adapt their actions substantially on the basis of dialogue with other players, including coordinating joint actions. Explicitly simulating conversations would be extremely expensive in [reinforcement learning]. However, a key insight is that a joint, shared [behavioral cloning] policy trained on the joint action distribution of the human data already implicitly captures the effects of dialogue on the action distribution of human players by modeling that action distribution directly." ↩︎
Cotra's report doesn't assume that an AI would be trained in simulated environments (generally unsupervised learning); it could also be trained on data (generally supervised learning). ↩︎
I do think Diplomacy is more complex than chess or go in the sense that its branching factor is many orders of magnitude larger, but I don't think it's orders of magnitude more difficult to simulate. This may be a point in favour of the "we can fairly simply simulate fairly complex worlds" thesis. ↩︎