Note: everything stated about 2021 and earlier is actually the case in the real world; everything stated about the post-2021 world is what I'd expect to see contingent on this scenario being true, and something I would give decently high probabilities of in general. I believe there is a fairly high chance of AGI in the next 10 years.
12 July 2031, Retrospective from a post-AGI world:
By 2021, it was blatantly obvious that AGI was imminent. The elements of general intelligence were already known: access to information about the world; the process of predicting part of the data from the rest and then updating one's model to bring it closer to the truth (note that this is precisely the scientific method, though the fact that it operates in AGI by human-illegible backpropagation rather than legible hypothesis generation and discarding seems to have obscured this from many researchers at the time); and the fact that predictive models can be converted into generative models by reversing them: running a prediction model forwards predicts the level of X in a given scenario, while running it backwards finds the scenarios that produce a given level of X. A sufficiently powerful system with relevant data, the ability to update itself to improve prediction accuracy, and the ability to be reversed so as to optimize any parameter it models is a system that can learn and operate strategically in any domain.
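A toy sketch of that forwards/backwards duality (my own illustration; the model, data, and numbers are invented for the example): train a small net to predict X from a scenario, then hold the net fixed and run gradient ascent on the scenario itself to find inputs it predicts will maximize X.

```python
# Illustrative sketch only: a tiny differentiable predictor, first run
# "forwards" (predict X from a scenario), then "backwards" (optimize the
# scenario to maximize predicted X). All details here are invented toys.
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy world: X depends on a 4-dimensional "scenario" in some unknown way.
def true_x(scenario):
    return (2 * scenario[:, 0] - scenario[:, 1] ** 2 + scenario[:, 2]).unsqueeze(1)

scenarios = torch.randn(1024, 4)
x_levels = true_x(scenarios)

# Forward use: update the model toward the data until it predicts X well.
model = nn.Sequential(nn.Linear(4, 32), nn.Tanh(), nn.Linear(32, 1))
opt = torch.optim.Adam(model.parameters(), lr=1e-2)
for _ in range(500):
    loss = nn.functional.mse_loss(model(scenarios), x_levels)
    opt.zero_grad()
    loss.backward()
    opt.step()

# Reversed use: freeze the model and optimize the *scenario* instead,
# asking "which inputs does the model predict will maximize X?"
scenario = torch.zeros(1, 4, requires_grad=True)
opt2 = torch.optim.Adam([scenario], lr=0.1)
for _ in range(200):
    neg_predicted_x = -model(scenario).sum()  # gradient ascent on predicted X
    opt2.zero_grad()
    neg_predicted_x.backward()
    opt2.step()

print("scenario the model predicts maximizes X:", scenario.detach())
```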
Data wasn't exactly scarce in 2021. The internet was packed with it, most of it publicly available, and while the use of internal world-simulations to bootstrap an AI's understanding of reality didn't become common in would-be general systems until 2023, it was already used in narrower neural nets like AlphaGo by 2016; certainly researchers at the time were familiar with the concept.
Prediction improvement by backpropagation was also well known by this point, as was the fact that this process is the backbone of human intelligence. While there was a brief time when it seemed like backpropagation might be fundamentally different from the operation of the brain (and thus less likely to scale to general intelligence), given that biological neurons only feed forwards, it was already known by 2021 that the predictive processing algorithm used by the human neocortex is mathematically isomorphic to backpropagation, albeit implemented slightly differently due to the inability of neurons to feed backwards.
The interchangeability of prediction and optimization or generation was known as well; indeed, it wasn't too uncommon to use predictive neural nets to produce images (one not-uncommon joke application was using porn filters to produce the most pornographic images possible according to the net), and the rise of OpenAI's complementary AIs DALL-E (image from text) and CLIP (text from image) showed the interchangeability in a striking way (though careful observers might note that CLIP wasn't reversed DALL-E; the twin nets merely demonstrated that the calculation can go either way; the reversed porn filter was a more rigorous demonstration of optimization from prediction).
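For concreteness, here is what that "reversed classifier" trick (usually called activation maximization) looks like against a real pretrained image classifier; this is an illustrative sketch of the general technique, not the actual reversed-filter code from the joke, and the choice of ResNet-18 and target class is arbitrary.

```python
# Activation maximization sketch: gradient-ascend an input image so that a
# fixed, pretrained classifier rates it as maximally class-like. Model and
# target class are arbitrary illustrative choices.
import torch
from torchvision.models import resnet18, ResNet18_Weights

model = resnet18(weights=ResNet18_Weights.DEFAULT).eval()
for p in model.parameters():
    p.requires_grad_(False)      # the predictor stays fixed; only the input moves

target_class = 309               # an arbitrary ImageNet class index
image = torch.zeros(1, 3, 224, 224, requires_grad=True)
opt = torch.optim.Adam([image], lr=0.05)

for _ in range(200):
    score = model(image)[0, target_class]   # prediction: how class-like is this image?
    opt.zero_grad()
    (-score).backward()                     # reversal: ascend the predicted score
    opt.step()

# `image` is now an input the classifier scores as maximally class-like.
```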
Given that all the pieces for AGI thus existed in 2021, why didn't more people realize what was coming? For that matter, given that all the pieces existed already, why did true AGI take until 2023, and AGI with a real impact on the world until 2025? The answer to the second question is scale. All animal brains operate on virtually identical principles (though there are architectural differences, e.g. striatum vs pallium), yet the difference between a human and a chimp, let alone a human and a mouse, is massive. Until the rise of neural nets, it was commonly assumed that AGI would be a matter primarily of more clever software, rather than simply scaling up relatively simple algorithms. The fact that greater performance is primarily the result of simple size, rather than brilliance on the part of the programmers, even became known as the Bitter Lesson, as it wasn't exactly easy on designers' egos. With the background assumption of progress as a function of algorithms rather than scale, it was easy to miss that AlphaGo already had nearly everything a modern superintelligence needs; it was just small.
From 2018 through 2021, neural nets were built at drastically increasing scales. GPT (2018) had 117 million parameters, GPT-2 (2019) had 1.5 billion, GPT-3 (2020) had 175 billion, and Microsoft's ZeRO-Infinity system (2021) demonstrated the infrastructure to train models of 32 trillion. By comparison with animal brains (a neural net's parameter is closely analogous to a brain's synapse), those are similar to an ant (very wide error bars on this one; for the other comparisons I was able to find synapse numbers, but for an ant I could only find the number of neurons), a bee, a mouse, and a cat respectively. Extrapolating this trend, it should not have been hard to see human-scale nets coming (100 trillion parameters, reached by 2022), nor AIs orders of magnitude more powerful than that.
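The extrapolation is simple arithmetic; the growth factor and crossing year below are my own back-of-envelope using the parameter counts quoted above.

```python
# Back-of-envelope extrapolation of the 2018-2021 parameter counts quoted above.
sizes = {2018: 117e6, 2019: 1.5e9, 2020: 175e9, 2021: 32e12}

years = sorted(sizes)
growth = (sizes[years[-1]] / sizes[years[0]]) ** (1 / (years[-1] - years[0]))
print(f"average growth: ~{growth:.0f}x per year")        # roughly 65x per year

n, year = sizes[2021], 2021
while n < 100e12:        # human brain: ~100 trillion synapses
    n *= growth
    year += 1
print(f"trend crosses 100 trillion parameters around {year}")   # 2022 on this trend
```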
Moreover, neural nets are in many ways much more powerful than their biological counterparts. Part of this is speed (computers can operate around a million times faster than biological neurons), but a more counterintuitive part is encephalization. Specifically, the requirements of operating an animal's body are sufficiently demanding that the intelligence available for other things scales not with raw brain size, but with brain size relative to what the animal's body size would predict, the encephalization quotient (this is why elephants are not smarter than humans, despite having substantially larger brains). An artificial neural net, of course, is not trying to control a body, and can devote all of its capacity to the question at hand. This allowed even a relatively small net like GPT-3 to do college-level work in law and history by 2021 (subjects that require actual understanding remained out of reach of neural nets until 2022, though the 2021 net Hua Zhibing, based on the Chinese Wu Dao 2.0 system, came very close). Given that a mouse-sized neural net could compete with college students, it should have been clear that human-sized nets would possess most elements of general intelligence, and that the nets which soon followed at ten and one hundred times the scale of the human brain would be fully capable of it.
Given the (admittedly retrospective) obviousness of all this, why wasn't it widely recognized at the time? As previously stated, much of the lag in recognition was driven by the belief that progress would come more from advances in algorithms than from scale. Given this belief, AGI would appear extraordinarily difficult, as one would try to imagine algorithms capable of general intelligence at small scales (DeepMind's AlphaOmega proved in 2030 that this is mathematically impossible: you can't have true general intelligence much below cat-scale, and it's very difficult to have it below human-scale). Even among those who understood the power of scaling, the fact that it's almost impossible to have AI do anything in the real world beyond very narrow applications like self-driving cars without reaching the general-intelligence threshold made it appear plausible that simply building larger GPT-style systems wouldn't be enough without another breakthrough. However, in 2021 DeepMind published a landmark paper entitled "Reward is Enough", arguing that reward-based reinforcement learning was in fact capable of scaling to general intelligence. This paper was the closest thing humanity ever got to a fire alarm for general AI: a fairly rigorous warning that existing models could scale up without limit, and that AGI was now only a matter of time rather than requiring any further real breakthroughs.
After that paper, 2022 brought human-scale neural nets (not quite fully generally intelligent, due to lacking human instincts and being trained only on internet data, which leaves some gaps that require substantially superhuman capacity to bridge through inference alone), and 2023 brought the first real AGI, with a quadrillion parameters, powerful enough to develop an accurate map of the world purely through a mix of internet data and internal modeling to bootstrap the quality of its predictions. After that, AI was considered to have stalled, as alignment concerns prohibited the use of such nets to optimize the real world, until 2025, when a program that trained agents to model each other's full terminal values from limited amounts of data allowed the safe real-world deployment of large-scale neural nets. Mankind is eternally grateful to those who raised the alarm about the value alignment problem, without whom DeepMind would not have observed that crucial hiatus, and without whom our entire light cone would now be paperclips (instead of just the Horsehead Nebula, which Elon Musk converted to paperclips as a joke).
Thanks! This is the sort of thing that we aimed for with Vignettes Workshop. The scenario you present here has things going way too quickly IMO; I'll be very surprised if we get to human-scale neural nets by 2022, and quadrillion parameters by 2023. It takes years to scale up, as far as I can tell. Gotta build the supercomputers and write the parallelization code and convince the budget committee to fund things. If you have counterarguments to this take I'd be interested to hear them!
(Also I think that the "progress stalled because people didn't deploy AI because of alignment concerns" is way too rosy-eyed a view of the situation, haha)
Given the timing of Jensen's remarks about expecting trillion+ models and the subsequent MoEs of Switch & Wudao (1.2t) and embedding-heavy models like DLRM (12t), with dense models still stuck at GPT-3 scale, I'm now sure that he was referring to MoEs/embeddings, so a 100t MoE/embedding is both plausible and also not terribly interesting. (I'm sure Facebook would love to scale up DLRM another 10x and have embeddings for every SKU and Internet user and URL and video and book and song in the world, that sort of thing, but it will mean relatively little for AI capabilities or risk.) After all, he never said they were dense models, and the source in question is marketing, which can be assumed to accentuate the positive.
More generally, it is well past time to drop discussion of parameters and switch to compute-only, as we can create models with more parameters than we can train (you can fit a 100t-param model with ZeRO into your cluster? great! how you gonna train it? Just leave it running for the next decade or two?) and we have no shortage of Internet data either: compute, compute, compute! It'll only get worse if some new architecture with fast weights comes into play, and we have to start counting runtime-generated parameters as 'parameters' too. (eg Schmidhuber back in like the '00s showed off archs which used... Fourier transforms? to have thousands of weights generate hundreds of thousands of weights or something. Think stuff like hypernetworks. 'Parameter' will mean even less than it does now.)
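To put a rough number on the "how you gonna train it?" point: using the standard C ≈ 6·N·D approximation for training compute (N parameters, D training tokens), even a GPT-3-sized dataset makes a 100t dense model absurd on 2021 hardware. The token count and cluster throughput below are illustrative assumptions, not figures from this thread.

```python
# Rough back-of-envelope, assuming the common C ~= 6*N*D estimate of training
# FLOPs; the dataset size and cluster throughput are illustrative assumptions.
N = 100e12            # 100-trillion-parameter dense model
D = 300e9             # ~GPT-3-scale training set, 300 billion tokens
flops_needed = 6 * N * D                      # ~1.8e26 FLOPs

cluster = 1e18        # assume a (generous for 2021) sustained exaFLOP/s
seconds = flops_needed / cluster
print(f"~{seconds / (86400 * 365):.1f} years on that cluster")   # ~5.7 years
```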