Abstract
Despite much progress in training AI systems to imitate human language, building agents that use language to communicate intentionally with humans in interactive environments remains a major challenge. We introduce Cicero, the first AI agent to achieve human-level performance in Diplomacy, a strategy game involving both cooperation and competition that emphasizes natural language negotiation and tactical coordination between seven players. Cicero integrates a language model with planning and reinforcement learning algorithms by inferring players' beliefs and intentions from its conversations and generating dialogue in pursuit of its plans. Across 40 games of an anonymous online Diplomacy league, Cicero achieved more than double the average score of the human players and ranked in the top 10% of participants who played more than one game.
Meta Fundamental AI Research Diplomacy Team (FAIR)†, Anton Bakhtin, Noam Brown, Emily Dinan, Gabriele Farina, Colin Flaherty, Daniel Fried, et al. 2022. “Human-Level Play in the Game of Diplomacy by Combining Language Models with Strategic Reasoning.” Science, November, eade9097. https://doi.org/10.1126/science.ade9097.
That's an interesting example to pick: watching GAN research firsthand as it developed, from its debut in 2014 to its bust ~2020, played a considerable role in my thinking about the Bitter Lesson & Scaling Hypothesis & 'blessings of scale', so I regard GANs differently than you do, and have opinions on the subject.
First, GANs may not even be new. Even if you do not entirely buy Schmidhuber's claim that his predictability-minimization architecture from 30-odd years prior is identical to GANs (and better where not identical), there's also that ~2009 blog post by someone reinventing the idea. And given the distribution of multiple-discoveries, that implies there are probably another 2 or 3 reinventions out there somewhere. If what matters about GANs is the 'innovation' of the idea, and not scaling, why were all the inventions so sterile, yielding so little until so late? (And why have such big innovations been so thoroughly superseded in their stronghold of image generation by approaches which look nothing at all like GANs and appear to owe little to them, like diffusion models?)
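For concreteness about what is (or is not) being reinvented here, the standard GAN objective as stated by Goodfellow et al. 2014 is the two-player minimax game:

```latex
\min_G \max_D \; V(D, G) =
  \mathbb{E}_{x \sim p_{\text{data}}}\!\left[\log D(x)\right]
  + \mathbb{E}_{z \sim p_z}\!\left[\log\left(1 - D(G(z))\right)\right]
```

where the discriminator $D$ is trained to distinguish real data from samples produced by the generator $G$. The priority dispute is, roughly, over whether predictability minimization's setup (one network trying to make its representations unpredictable to an adversarial predictor network) amounts to the same adversarial game; I state the 2014 formulation only for reference, not to settle that question.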
Second, GANs in their successful period were clearly a triumph of compute. Historically, all the biggest GAN successes used a lot of compute: BigGAN was trained on literally a TPUv3-512 pod. None of the interesting GAN success stories look like 'train on a CPU with levels of compute available to a GOFAI researcher 20 years ago, because scaling doesn't matter'. Cases like StyleGAN, where you seem to get good results from 'just' 4 GPU-months of compute (itself a pretty big amount), turn out to scale infamously badly to more complex datasets, and to be the product of literally hundreds or thousands of full-scale runs and dead ends. (The Karras group sometimes reports 'total compute used during R&D' in their papers, and it's an impressively large multiple.) For all the efficiency of StyleGAN on faces, no one is ever going to make it work well on LAION-4b etc. (At Tensorfork, our StyleGAN2-ext worked mostly by throwing away as much of the original StyleGAN2 hand-engineering as possible, which held it back, in favor of increasing n/param/compute in tandem. The later distilled ImageNet StyleGAN approach relies on simplifying the problem drastically by throwing out hard datapoints, and still delivers only bad results on ImageNet.) And that's the most impressive 'innovation' in GAN architecture.

This is further buttressed by the critical meta-science observations in GAN research: when you compared GAN archs on a fair basis (in a single codebase, with equal hyperparameter tuning etc., fully reproducibly), you typically found... arch made little difference to the top scores, and mostly just changed the variance of runs. (This parallels other field-survey replication efforts, like in embedding research: results get better over time, which researchers claim reflects the sophistication of their architectures... and the gains disappear when you control for compute/n/param. The results were better, but just because later researchers used moar dakka, and either didn't realize that all the fruits of their hard work were actually due to harder-working GPUs, or quietly omitted that observation, in classic academic fashion. I feel like every time I get to ask a researcher, in private, how something important they did really happened, I find out that the story told in the paper is basically a fairy tale for small children, and feel like the angry math student who complained about Gauss: "he makes [his mathematics] like a fox, wiping out the traces in the sand with his tail.")
Third, to further emphasize that GANs' triumph was due to compute: their fall also appears to be due to compute. The obsolescence of GANs is, as far as I can tell, due solely to historical happenstance. BigGAN never hit a ceiling; no one calculated scaling laws on BigGAN FIDs and showed it was doomed; BigGAN doesn't get fatally less stable with scale; if anything, Brock demonstrated it gets more stable, and scales to at least n = 0.3b without a problem. What happened was simply that the tiny handful of people who could and would do serious GAN scaling happened to leave the field (eg Brock) or ran into non-fundamental problems which killed their efforts (eg Tensorfork), while the people who did continue DL scaling happened not to be GAN people, but specialized in alternatives like autoregressive models, VAEs, and then diffusion models. So they scaled all of those up successfully, while no one tried with GANs, and now the 'fact' that 'GANs don't work' is just a thing That Is Known: "It is well known that GANs are unstable and cannot be trained at scale, which is why we use diffusion models instead..." (I saw a version of that in a paper literally yesterday.) No relevant evidence is ever given for the claim, just bare assertion or irrelevant cites. I have an essay I should finish explaining why GANs should probably be tried again.
Fourth, we can observe that this is not unique to GANs. Right now, it seems like pretty much any image arch you might care to try works. NeRF? A deep VAE? An autoregressive Transformer on pixels? An autoregressive Transformer on VAE tokens? A diffusion model on pixels? A diffusion model on latents? An autoencoder? Yeah, sure, they all work, even (perhaps especially) the ones which don't work at small dataset/compute scale. When I look at Imagen samples vs Parti samples, say, I can't tell which one is the diffusion model and which is the autoregressive Transformer. Conceptually, they have about as much in common as a fungus with a cockroach, but... they both work pretty darn well anyway. What do they have in common, besides 'working'? Compute, data, and parameters. Lots and lots and lots of all three. (Similarly for video generation.) I predict that if GANs get any real scaling-up effort put into improving them and running them at similar scale, we will find that GANs work well too. (And sample a heckuva lot faster than diffusion or AR models...)
'Innovative' archs are just not that important. An emphasis on arch, much less apologetics for 'neurosymbolic' approaches, struggles to explain this history. Meanwhile, an emphasis on scaling cleanly explains why GANs succeeded only upon Goodfellow's reinvention, why they then fell out of fashion, why their rivals succeeded, and may yet predict why they get revived.
We don't need to wait and see. Just look at the past, which has already happened.
Yeah, he's wrong about that, because it'd only be true if academic markets were maximizing for long-term scaling (they aren't) rather than greedily, myopically optimizing for elaborate architectures that grind out an improvement right now. The low-hanging fruit has already been plucked, and everyone is jostling to grab a slightly higher-hanging fruit they can see, while the long-term scaling work gets ignored. This is why various efforts to take a scaling breakthrough like AlphaGo or GPT-3 and hand-engineer it further yield moderate results, but nothing mindblowing like the original.
No. You'd predict that there'd potentially be more short-term progress (as researchers pluck the fruits exposed by the latest scaling bout) but then less long-term progress (as scaling stops because everyone is off plucking, with ever-diminishing returns).
Here too I disagree. What we see with scaled-up systems is increasing interpretability of components and fewer issues with things like polysemanticity or reliance on brittle shortcuts; increasing generalizability and solving of edge cases (almost by definition); increasing capabilities allowing for meaningful tests at all; and reasons to think that things like adversarial robustness will come 'for free' with scale (see isoperimetry). Just like with the Bitter Lesson for capability, at any fixed level of scale or capability you can always bolt on more shackles and gadgets and gewgaws to get a 'safer' system, but you then run an ever-increasing risk of obsolescence and futility, because a later scaled-up system may render your work completely irrelevant, both in terms of economic deployment and in terms of what techniques you need to understand it and what safety properties you can attain.

(Any system so simple it can be interpreted in a 'symbolic' classical way may inherently be too simple to solve any real problems: if the solutions to those problems were that simple, why weren't they created symbolically before bringing DL into the picture...? 'Bias/variance' applies to safety just as much as to anything else: a system which is too small and too dumb to genuinely understand things can be a lot more dangerous than a scaled-up system which does understand them but is shaky on how much it cares. Or more pointedly: hybrid systems which do not solve the problem, which in the long run is all of them, cannot have any safety properties, since they do not work.)